Your point #2 above suggests you are confused.
I was not talking at all about Nvidia's GPUs. I was talking about the ConnectX series, now made by Nvidia, which comes from their Mellanox acquisition.
As for your point #1: Apple faces a stark choice, because Thunderbolt simply doesn't have the bandwidth for connecting hosts as fast as you want for scaling out AI (much less scaling up). TB is great compared to 10GbE, but not to 100GbE, much less the 200GbE used in the DGX Spark or the 400/800GbE used in bigger servers.
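To make that bandwidth gap concrete, here's a back-of-envelope sketch. The link rates are nominal line rates; real-world throughput is lower once protocol overhead is accounted for, so treat this as rough intuition only:

```python
# Time to move 100 GB of model state over various interconnects.
# Nominal link rates in Gb/s; achievable throughput is lower in practice.
links_gbps = {
    "Thunderbolt 4": 40,
    "Thunderbolt 5": 80,   # 120 Gb/s only in asymmetric boost mode
    "100GbE": 100,
    "200GbE (DGX Spark)": 200,
    "400GbE": 400,
    "800GbE": 800,
}

payload_gb = 100  # gigabytes to transfer
for name, gbps in links_gbps.items():
    seconds = payload_gb * 8 / gbps   # GB -> Gb, then divide by line rate
    print(f"{name:>20}: {seconds:5.1f} s")
```

At nominal rates, that 100 GB takes 20 s over TB4 versus 4 s over the Spark's 200GbE, which is the gap the argument above turns on.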
Putting Ethernet at that speed in the Studio would be utterly impractical for the base model. But there's an easy solution: after all, they already make 10GbE optional. It would not be a stretch to also offer an OSFP or QSFP port with a really fast Ethernet chip behind it as an option. And if they're already supporting a Mellanox chip for their PCC servers, it's reasonable to use those for Studios as well, though there are several other options too. Perhaps Tim Cook has decided it's time to bury the hatchet?
Coming back to your second comment... If you think they're not connecting them with TB, then what do you think they're using? And why wouldn't they consider that tech for the studio, at least as an option?
Also, as far as I can tell, 128GB RAM per Mac is the minimum to be interesting when clustering Macs to run large models. As of right now, a 128GB Mac costs a minimum of $3500 in Mac Studio form, $4700 in MacBook Pro. These are not consumer Macs! The average Mac owner spent a fraction of those amounts on a base model Mini or Air with at most 16GB RAM. Clustering those together gets you nowhere.
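To give a sense of why 128GB reads as the floor, here's a rough sizing sketch. The model size, quantization, and overhead figures are my own illustrative assumptions, not anything established in this thread:

```python
import math

# Hypothetical: fit a 400B-parameter model quantized to 4 bits per weight
# across 128GB Macs, reserving ~25% of unified memory per machine for the
# OS, framework overhead, and KV cache. All numbers are illustrative.
params = 400e9
bytes_per_param = 0.5               # 4-bit quantization
usable_gb_per_mac = 128 * 0.75      # 96 GB usable per machine

model_gb = params * bytes_per_param / 1e9          # 200 GB of weights
macs_needed = math.ceil(model_gb / usable_gb_per_mac)
print(macs_needed)                  # 3 Macs, i.e. $10,500+ in Studios
```

Run the same numbers with 16GB machines (call it 12 GB usable) and you'd need 17 of them, with the interconnect as the bottleneck the whole way, which is why clustering base-model Minis and Airs gets you nowhere.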
Please stop scrubbing your comments. You seem very impassioned but you keep doing that and it makes following the conversation difficult. Let’s slow it down a bit and appreciate the intuition and experience of people involved here. I count myself lucky to have such people to chat with, differences of opinion and all.
Again, feel free to discuss the cool stuff RDMA over Thunderbolt enables for consumer ML, and if so inclined, read and discuss the heavily documented info Apple gave for Private Cloud Compute. I've contributed everything I could, so I'm ending my part of this thread here.
I think you potentially underestimate what hobbyists and small businesses might be inclined to spend on low-end consumer AI.
Gamers are buying 5090 GPUs just to play games on.
As we move into the AI era, I can easily see somebody spending $5-10k on home AI hardware if it is actually useful and lets them get that use without a subscription. I've personally got way more than that in Apple hardware purchased with my own money in the past 12 months, and I'm nowhere near the upper end of Apple device spending. These are not the "typical" users, but there's certainly a market that exists for this.
Mac developers / "prosumers" would certainly like to run local AI, especially if it is running Apple trained models for development.
(MacBook Pro m4 max, iPhone 16 pro max, iPad Pro m5, multiple HomePods and other things)...
- The thermal design is still aggressive. The 150mm chassis is compact and beautiful, and it still runs hot under sustained load. The February update helps (18W savings when the ConnectX-7 isn’t active), but I still keep a small USB fan pointed at mine. It’s not elegant. It works.
The one thing that really stood out to me though was this:
When Project DIGITS was first announced, a bunch of us on the other site were trying to figure out what the specs actually were. Given only the 4-bit rating, and not knowing whether it was Blackwell or Blackwell 2.0, it wasn't easy. In the end, given the size of the device, I concluded the GPU would be much smaller, closer to the ratio of FP32 to tensor cores found in server chips.

As it turns out I was wrong, and those on the other site who argued it would be bigger were right. Nvidia went with a ratio of FP32 to Tensor cores more similar to the consumer Blackwell 2.0 chips, so the GPU's FP32 compute is actually quite sizable, effectively a 5070, but with the full FP64 and TSMC N3 node of the server Blackwell, the latter lowering power draw relative to the consumer chips still on the older node. They then lowered power draw further by using LPDDR instead of GDDR.

However, it appears I was still right to question the thermals of putting what is effectively a 5070 into a chassis that size and trying to cool it, even with the better node and LPDDR memory. Apparently not only does the Spark have extreme difficulty maintaining or even reaching max performance, it is loud to boot. And even with the fixes Nvidia put in recently, its base power levels are quite high.
I wonder why Nvidia felt the need to make it so compact? A larger chassis with a bigger cooling system and more airflow would simply have been better, allowing for larger and/or more numerous, and therefore quieter, fans too. It doesn't even need to be THAT big; a Studio is still pretty compact. Does anyone know if any of the OEMs offer a bigger chassis or just more effective cooling? I know they'll advertise them as such, obviously, and a historical advantage of OEM graphics cards over "Founders" models was cooling, but has anyone seen non-Founders DGX Sparks tested against Founders models?
DGX Spark is supposed to be equivalent to a 5070, and detailed rates can be found in the whitepaper: https://images.nvidia.com/aem-dam/S...ell/nvidia-rtx-blackwell-gpu-architecture.pdf

BF16 with FP32 accumulation is expected to run at 1/16 the rate of sparse FP4 (page 55-55), so there is no discrepancy.

I didn't say there was a discrepancy between BF16 with FP32 and FP4. I was discussing the ratio of standard FP32 to Tensor core throughput, which, as I wrote, is indeed the same as the 5070's in the GB10, but which differs from server Blackwell, where tensor core throughput is higher relative to standard FP32 compute. And Nvidia didn't originally say which, server or consumer Blackwell, the GB10 GPU was based on. However, it appears my information on the GB10's FP64 throughput matching server Blackwell rather than consumer Blackwell 2.0 was out of date, based on an erroneous TechPowerUp entry that has since been corrected.
[…]
Edit: was your note because I said "similar" or "effectively" instead of "identical" when comparing the GB10 and 5070? Sometimes I hedge when I can't remember specific numbers.
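Tangentially, for anyone skimming past the whitepaper discussion: the 1/16 rate quoted above implies the following. The headline TOPS number here is made up purely for illustration:

```python
# The whitepaper gives BF16 (FP32 accumulate) as 1/16 the sparse FP4 rate.
# Plugging in a hypothetical 1000 TOPS sparse-FP4 headline figure:
sparse_fp4_tops = 1000        # assumed, for illustration only
bf16_tflops = sparse_fp4_tops / 16
print(bf16_tflops)            # 62.5
```

Which is also why headline "AI TOPS" numbers and usable training-precision throughput can differ by more than an order of magnitude on the same chip.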
Ah, got it. I was using Carmack's statement less about the ratio and more about the overall performance throttling, which was reportedly repeatable and severe, and apparently is still not great even after the updates.

Oh, sorry, I was only talking about Carmack's statement. I guess I quoted too much of your post. Sorry about that. Don't really have anything to add about the rest.
That post [by Carmack] set off a chain reaction. Tom's Hardware reported that Nvidia's developer forums were flooded with reports of crashes and unexpected shutdowns under sustained load. ServeTheHome confirmed they couldn't hit the 240W power ceiling in any workload.
Maybe just a very quick note since some reading this thread might not be aware: Nvidia recently changed their tensor cores to an outer product design (similar to what Apple does with AMX/SME), which I suppose makes it easier for them to ship different hardware implementations without changing the software model. I can imagine that the server designs use a wider engine. Again, this is entirely orthogonal to the discussion, just a bit of interesting trivia.
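For readers who haven't run into the outer-product formulation, here's a toy NumPy sketch of the idea. It illustrates the math only, not any actual tensor-core or AMX/SME hardware:

```python
import numpy as np

# A matmul C = A @ B over a shared dimension K can be accumulated as a
# sum of K rank-1 outer products: C += outer(A[:, k], B[k, :]).
# Hardware built around this primitive can widen the per-step engine
# without changing the software-visible model.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 8))
B = rng.standard_normal((8, 5))

C = np.zeros((4, 5))
for k in range(A.shape[1]):
    C += np.outer(A[:, k], B[k, :])   # one rank-1 update per k

assert np.allclose(C, A @ B)          # matches the usual matmul
```

The usual mental model is dot products (one output element at a time); the outer-product view instead updates the whole output tile on every step over K, which maps more naturally onto a wide accumulator array.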