M4 Mac Announcements

Speaking of the M4 Ultra, this is behind a paywall ( https://asia.nikkei.com/Business/Te...xconn-to-produce-servers-in-Taiwan-in-AI-push ) but, according to MR's summary ( https://forums.macrumors.com/thread...s-next-year-after-m2-ultra-this-year.2442148/ ), Apple will be replacing the M2 Ultra with M4 chips (which I assume eventually means the M4 Ultra) in its AI servers.

If true, that means their internal AI server development work will continue; the alternative would be Apple giving up on having its own AI servers and farming this out to someone like Google.

I'm wondering what the relative volume of M2 Ultra chips going into the AI servers vs. the Macs has been thus far, and how that will change going forward.

I've read reports that, while Apple is trying to develop an LLM that will enable most requests to be processed on-device (this was probably a key part of their decision to increase the base RAM to 16 GB), cloud connectivity will still be required for more demanding requests, hence the need for the AI servers.

...which leads to another interesting question: Will the decision whether to process requests locally or remotely sometimes depend on device capability? E.g., might some requests that would be sent to the cloud from a base M4 be processed locally on an M4 Ultra?
 
Speaking of the M4 Ultra, this is behind a paywall ( https://asia.nikkei.com/Business/Te...xconn-to-produce-servers-in-Taiwan-in-AI-push ) but, according to MR's summary ( https://forums.macrumors.com/thread...s-next-year-after-m2-ultra-this-year.2442148/ ), Apple will be replacing the M2 Ultra with M4 chips (which I assume eventually means the M4 Ultra) in its AI servers.

If true, that means their internal AI server development work will continue; the alternative would be Apple giving up on having its own AI servers and farming this out to someone like Google.

I'm wondering what the relative volume of M2 Ultra chips going into the AI servers vs. the Macs has been thus far, and how that will change going forward.

I've read reports that, while Apple is trying to develop an LLM that will enable most requests to be processed on-device (this was probably a key part of their decision to increase the base RAM to 16 GB), cloud connectivity will still be required for more demanding requests, hence the need for the AI servers.

...which leads to another interesting question: Will the decision whether to process requests locally or remotely sometimes depend on device capability? E.g., might some requests that would be sent to the cloud from a base M4 be processed locally on an M4 Ultra?
I assume the differentiator would be the amount of memory to hold the model, and not the processing capabilities of the chip?
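If that's the case, the check could be as simple as "does the model plus some working headroom fit in free RAM right now?" Here's a purely hypothetical sketch of what such a routing decision might look like; the thresholds, names, and logic are mine, not anything Apple has described:

```python
# Hypothetical sketch: route a request locally or to the cloud based on
# whether an on-device model (plus working headroom) fits in free RAM.
# All numbers and names are illustrative assumptions, not Apple's logic.

ON_DEVICE_MODEL_BYTES = 2 * 1024**3   # assume a ~2 GB resident model
HEADROOM_BYTES = 1 * 1024**3          # KV cache, activations, etc.

def route_request(free_ram_bytes: int, is_demanding: bool) -> str:
    """Return 'local' or 'cloud' for a single request."""
    fits = free_ram_bytes >= ON_DEVICE_MODEL_BYTES + HEADROOM_BYTES
    return "local" if fits and not is_demanding else "cloud"

# A RAM-starved machine sends even simple requests out; a large-RAM
# machine keeps them on-device.
print(route_request(free_ram_bytes=1 * 1024**3, is_demanding=False))   # cloud
print(route_request(free_ram_bytes=16 * 1024**3, is_demanding=False))  # local
```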
 
I've read reports that, while Apple is trying to develop an LLM that will enable most requests to be processed on-device (this was probably a key part of their decision to increase the base RAM to 16 GB), cloud connectivity will still be required for more demanding requests, hence the need for the AI servers.

...which leads to another interesting question: Will the decision whether to process requests locally or remotely sometimes depend on device capability? E.g., might some requests that would be sent to the cloud from a base M4 be processed locally on an M4 Ultra?

A lot of what Apple is doing is with 3B SLMs, which should keep things manageable for the on-device scenarios. While I don’t have much insight on how much RAM these use during inference, I would not be surprised if it is close to 2GB. That depends on how much they can shrink the model using adapters for the different tasks. You don’t really want a feature that needs 25% of your RAM every time you want to summarize a notification or re-tone an email (maybe on iOS you can get away with this), and you likely want the model resident in memory to handle requests on the fly whenever there’s enough memory. That’s more where the RAM bump comes from I think.
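For rough sizing, parameter count times bytes per weight gives the resident footprint, so a 3B model quantized to around 4 bits lands right in that ~2 GB neighborhood before you add the KV cache and adapters. Quick back-of-the-envelope (my arithmetic, not Apple's figures):

```python
# Back-of-the-envelope RAM footprint for a 3B-parameter model at a few
# quantization levels; adapters and the KV cache add on top of this.

def model_footprint_gb(params_billion: float, bits_per_weight: int) -> float:
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1024**3

for bits in (16, 8, 4):
    print(f"3B params @ {bits}-bit: {model_footprint_gb(3, bits):.2f} GB")
# 16-bit ≈ 5.59 GB, 8-bit ≈ 2.79 GB, 4-bit ≈ 1.40 GB
```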

So no, I don’t think the M4 Ultra will handle more requests locally. That complicates the engineering in ways which seem very unlikely.
 
No need for AGP these days; the GPU is now integrated into the ASi chip...
Hey, the upgradability was nice. I had one upgraded from a 400 MHz PowerPC G4 7400 + Nvidia GeForce 2MX to a 1.8 GHz G4 7447A + Nvidia GeForce 6200.

I would think an all-new Mac Pro Cube would basically be a taller variant of the Mac Studio, to allow for the larger cooling subsystem an Mn Extreme chip would require...
I don't think we'll ever get anything as close to the Cube spiritually as the trashcan Mac Pro, unfortunately.
 
I assume the differentiator would be the amount of memory to hold the model, and not the processing capabilities of the chip?
Yeah, I suspected that as well when I was thinking about the base M4 vs. the Ultra.
A lot of what Apple is doing is with 3B SLMs, which should keep things manageable for the on-device scenarios. While I don’t have much insight on how much RAM these use during inference, I would not be surprised if it is close to 2GB. That depends on how much they can shrink the model using adapters for the different tasks. You don’t really want a feature that needs 25% of your RAM every time you want to summarize a notification or re-tone an email (maybe on iOS you can get away with this), and you likely want the model resident in memory to handle requests on the fly whenever there’s enough memory. That’s more where the RAM bump comes from I think.
In Dec 2023, Apple engineers published a paper proposing a more efficient way to split LLM parameter storage between DRAM and SSD so they could run larger (14 GB) LLMs on-device on hardware with limited RAM [1]. So while Apple's production on-device LLMs may be smaller, 2 GB could be an underestimate. I.e., Apple's way to avoid using too much RAM may be SSD caching rather than simply limiting the model size. Of course, Apple publishes a lot of stuff they don't implement, so it's possible they will not do this.

But if they do, then the difference in LLM operation between a large-RAM and a small-RAM device may not be on-device processing vs. sending to the cloud, but rather being able to keep the model resident in RAM vs. having to split it between RAM and the SSD.

"Currently, the standard approach is to load the entire model into DRAM (Dynamic Random Access Memory) for inference (Rajbhandari et al., 2021; Aminabadi et al., 2022). However, this severely limits the maximum model size that can be run. For example, a 7 billion parameter model requires over 14GB of memory just to load the parameters in half-precision floating point format, exceeding the capabilities of most personal devices such as smartphones."

[1] Alizadeh K, Mirzadeh I, Belenko D, Khatamifard K, Cho M, Del Mundo CC, Rastegari M, Farajtabar M. LLM in a flash: Efficient large language model inference with limited memory. arXiv preprint arXiv:2312.11514. 2023 Dec 12.

Link: https://arxiv.org/pdf/2312.11514
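To make the DRAM/flash split in [1] concrete, here's a toy sketch of the general idea (not the paper's actual windowing or row-column bundling techniques): keep the full weight set memory-mapped on SSD and only copy the handful of layers currently in use into a small RAM cache. Dimensions are deliberately tiny so it actually runs:

```python
# Toy illustration of the DRAM/SSD split: weights live in a memory-mapped
# file on disk, and only a few "hot" layers are copied into RAM at a time.
# This shows the concept only, not the techniques in arXiv:2312.11514.
import numpy as np

HIDDEN, N_LAYERS = 256, 16            # toy sizes, not a real model
MAX_RESIDENT_LAYERS = 4               # DRAM budget, in layers

# Pretend this file holds all layer weights, written once at install time.
weights = np.memmap("model_weights.bin", dtype=np.float32,
                    mode="w+", shape=(N_LAYERS, HIDDEN, HIDDEN))

ram_cache: dict[int, np.ndarray] = {}

def get_layer(i: int) -> np.ndarray:
    """Return layer i, pulling it from SSD into RAM only on demand."""
    if i not in ram_cache:
        if len(ram_cache) >= MAX_RESIDENT_LAYERS:
            ram_cache.pop(next(iter(ram_cache)))   # evict the oldest layer
        ram_cache[i] = np.array(weights[i])        # copy flash -> DRAM
    return ram_cache[i]

x = np.ones(HIDDEN, dtype=np.float32)
for layer in range(N_LAYERS):
    x = get_layer(layer) @ x          # only a few layers resident at once
```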
 
In Dec 2023, Apple engineers published a paper proposing a more efficient way to split LLM parameter storage between DRAM and SSD so they could run larger (14 GB) LLMs on-device on hardware with limited RAM [1]. So while Apple's production on-device LLMs may be smaller, 2 GB could be an underestimate. I.e., Apple's way to avoid using too much RAM may be SSD caching rather than simply limiting the model size. Of course, Apple publishes a lot of stuff they don't implement, so it's possible they will not do this.

Even if we don't see this at the user level, it could be useful in the datacenter, and that may be where Apple is thinking of deploying it (if they haven't already). Think of an Mn Ultra with a handful of these secure VMs running on it. Minimizing RAM usage there means you can host more in parallel on a node and reduce costs.

That said, I'd probably need to see a use case where it makes sense to spend that much RAM and disk space on an LLM on-device for the end user. The SLMs in iOS 18/macOS 15 are larger in terms of parameters than many of the GPT-3 models as it is (outside of the larger curie and davinci models), and the adapters on top should make them more capable than their parameter count alone suggests.
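By "adapters" I mean small low-rank weight deltas layered on a frozen base model, LoRA-style; whether Apple's adapters work exactly this way is an assumption on my part, but the rough shape of the idea looks like this toy sketch:

```python
# Toy LoRA-style adapter: a frozen, shared base weight matrix plus a tiny
# per-task low-rank correction. Many task adapters can share one base model
# because each adapter adds only 2*d*r parameters instead of d*d.
import numpy as np

d, r = 1024, 16                       # hidden size, adapter rank (toy values)
rng = np.random.default_rng(0)

W_base = rng.standard_normal((d, d)).astype(np.float32)    # frozen, shared
A = rng.standard_normal((d, r)).astype(np.float32) * 0.01  # per-task, trained
B = np.zeros((r, d), dtype=np.float32)                     # per-task, trained

def adapted_forward(x: np.ndarray) -> np.ndarray:
    # Base path plus the low-rank correction; only A and B differ per task.
    return x @ W_base + (x @ A) @ B

x = rng.standard_normal((1, d)).astype(np.float32)
print(adapted_forward(x).shape)                            # (1, 1024)
print(f"adapter params: {A.size + B.size:,} vs base: {W_base.size:,}")
```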

I'm actually trying to get some folks on my end to look at SLMs for cost reasons in their thinking about features. Right now a lot of it is "Natural language? Throw an LLM at it," which makes certain features a lot more expensive than they need to be. I'm also working with folks to see if we can reduce how much work the model actually has to do before we pass things off to a more classical algorithm, to improve accuracy in some scenarios.
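As an illustration of what I mean, here's a hypothetical sketch of that "classical algorithm first, model only as a fallback" pattern; call_slm is just a stand-in for whatever model endpoint you'd actually use:

```python
# Hypothetical sketch: handle the easy, well-structured cases with a cheap
# deterministic parser and only send the leftovers to a (more expensive)
# language model. `call_slm` is a stand-in, not a real API.
import re

DATE_PATTERN = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

def call_slm(text: str) -> str:
    raise NotImplementedError("stand-in for a real small-language-model call")

def extract_date(text: str) -> str:
    match = DATE_PATTERN.search(text)
    if match:                        # cheap, deterministic, easy to test
        return match.group(0)
    return call_slm(text)            # model only sees the messy inputs

print(extract_date("Meeting moved to 2024-11-05 at 3pm"))   # classical path
```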
 