Apple to use its own server chips

I’m just another tech nerd speculator here, no special or professional insight:

1. The bottleneck for large language model token generation is memory bandwidth. For this part of inference, both Ultras are about the same speed and nearly double the speed of each Max. The M1 generation, at least, appears not to reach its theoretical maximum bandwidth; later generations come closer.
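A rough back-of-the-envelope sketch of why that is (illustrative numbers only; it assumes every FP16 weight is streamed from memory once per generated token):

```python
# Hedged back-of-the-envelope: single-stream decode speed is roughly capped by
# how fast the model's weights can be streamed from memory for each token.
def max_tokens_per_sec(bandwidth_gb_s, params_billion, bytes_per_param=2.0):
    model_bytes = params_billion * 1e9 * bytes_per_param   # FP16 weights
    return bandwidth_gb_s * 1e9 / model_bytes

# Illustrative spec-sheet figures: ~800 GB/s (M2 Ultra) vs ~400 GB/s (M2 Max),
# running a hypothetical 70B-parameter model in FP16.
for name, bw in [("M2 Ultra", 800), ("M2 Max", 400)]:
    print(f"{name}: ~{max_tokens_per_sec(bw, 70):.1f} tokens/s ceiling")
```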

The other important factor for running the largest most useful models is memory capacity, where M2 Ultras also crush all Max variations.

I would hope the datacenter team would arrange to have the M2 Ultras fully populated with faster RAM than the consumer Ultras.

2. If these rumors are true, it would make sense for the M4 Max and Ultra to become available to consumers later. Serving their user base with datacenter AI is going to require many hundreds of thousands of Ultras. They might need two to three quarters of M4 Ultra production before they can spare any to sell.

According to Semianalysis:

“The other indication that Cupertino is serious about their AI hardware and infrastructure strategy is they made a number of major hires a few months ago. This includes Sumit Gupta who joined to lead cloud infrastructure at Apple in March. He’s an impressive hire. He was at Nvidia from 2007 to 2015, and involved in the beginning of Nvidia's foray into accelerated computing. After working on AI at IBM, he then joined Google’s AI infrastructure team in 2021 and eventually was the product manager for all Google infrastructure including the Google TPU and Arm based datacenter CPUs.

He’s been heavily involved in AI hardware at Nvidia and Google who are both the best in the business and are the only companies that are deploying AI infrastructure at scale today. This is the perfect hire.”

Hopefully he is working with the chip team on optimizing Apple Silicon's capabilities in this arena.
Welcome!

I noted Sumit Gupta in another thread here that you may find interesting. I’m also hoping for significantly increased memory bandwidth as well as a TPU-like accelerator for Mac Pros. Either way, WWDC should be exciting.
 
I’m not sure I understand what API vs local is in this context. Or rather I get local, but am unsure what API is here. Is there another meaning beyond Application Programming Interface or is there some way that applies here that I’m missing?
It's the same meaning. All of the major cloud large language model providers are making money by letting businesses create private apps with direct API interfaces into their LLMs.

It's true that hooking into cloud compute will provide far more capability than what can be done locally on device. I think Apple's angle in using their own servers is to provide some type of secure link, possible only Apple Silicon to Apple Silicon, that delivers cloud power and resources with the same control users have over their own devices.

It's a genius move: it will be nearly impossible for anyone else to match that combination, even if they were honest about trying. Non-Apple consumers will have to make a tough choice between weak on-device AI (40 TOPS will still only cover small, simple, basic models) and exposing their data to others.
 
Welcome!

I noted Sumit Gupta in another thread here that you may find interesting. I’m also hoping for significantly increased memory bandwidth as well as a TPU-like accelerator for Mac Pros. Either way, WWDC should be exciting.
Thanks! I saw that; I've been lurking here for months. I had to join the discussion as it's about to get super juicy in the next few weeks. 🥹
 
I’m just another tech nerd speculator here, no special or professional insight:

1. The bottleneck for large language model token generation is memory bandwidth. For this part of inference, both Ultras are about the same speed and nearly double the speed of each Max. The M1 generation, at least, appears not to reach its theoretical maximum bandwidth; later generations come closer.

That's true, but only for doing a single inference at a time, i.e. a single user on a local machine. Batched inference with multiple users, like on these proposed servers, can be compute-limited depending on the batch size, and that is where the huge number of tensor cores in an Nvidia GPU can shine (and obviously for training).
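A hedged sketch of where that crossover sits (all figures are illustrative placeholders; it assumes a dense FP16 model whose weights are streamed once per decode step):

```python
# Why batching shifts decode from memory-bound to compute-bound: the weights
# streamed from RAM are reused for every request in the batch, so arithmetic
# grows with batch size while memory traffic stays roughly flat.
def decode_regime(params_billion, batch, bandwidth_gb_s, peak_tflops):
    weight_bytes = params_billion * 1e9 * 2          # FP16 weights, read once per step
    flops = 2 * params_billion * 1e9 * batch         # one multiply-add per weight per request
    t_mem = weight_bytes / (bandwidth_gb_s * 1e9)    # time to stream the weights
    t_cmp = flops / (peak_tflops * 1e12)             # time to do the math
    return "memory-bound" if t_mem > t_cmp else "compute-bound"

# Illustrative accelerator: ~800 GB/s of bandwidth, ~27 TFLOPS of FP16 matrix math.
for b in (1, 8, 64):
    print(f"batch {b}: {decode_regime(70, b, 800, 27)}")
```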

The other important factor for running the largest most useful models is memory capacity, where M2 Ultras also crush all Max variations.

I would hope the datacenter team would arrange to have the M2 Ultras fully populated with faster RAM than the consumer Ultras.

2. If these rumors are true, it would make sense for the M4 Max and Ultra to become available to consumers later. Serving their user base with datacenter AI is going to require many hundreds of thousands of Ultras. They might need two to three quarters of M4 Ultra production before they can spare any to sell.

According to Semianalysis:

“The other indication that Cupertino is serious about their AI hardware and infrastructure strategy is they made a number of major hires a few months ago. This includes Sumit Gupta who joined to lead cloud infrastructure at Apple in March. He’s an impressive hire. He was at Nvidia from 2007 to 2015, and involved in the beginning of Nvidia's foray into accelerated computing. After working on AI at IBM, he then joined Google’s AI infrastructure team in 2021 and eventually was the product manager for all Google infrastructure including the Google TPU and Arm based datacenter CPUs.

He’s been heavily involved in AI hardware at Nvidia and Google who are both the best in the business and are the only companies that are deploying AI infrastructure at scale today. This is the perfect hire.”

Hopefully he is working with the chip team on optimizing Apple Silicon's capabilities in this arena.
Indeed.
 
That's true, but only for doing a single inference at a time, i.e. a single user on a local machine. Batched inference with multiple users, like on these proposed servers, can be compute-limited depending on the batch size, and that is where the huge number of tensor cores in an Nvidia GPU can shine (and obviously for training).


Indeed.
That's why they need the deals with OpenAI and/or Google. For one, people are saying Apple will only offer a subset of what current models can do. But even the limited feature set will need some heavy lifting from big, compute-hungry models.

Those requests will probably be pre-processed and anonymized by the Ultras, which is the likely sticking point for a Google deal.
 
Apple doing their own servers is actually pretty smart at their scale. I was skeptical of this kind of thing for a while, but with Apple services blowing up, and AI specifically, it probably crossed some ROI threshold for Apple to just take M2 Ultras and such and throw them in their own DC.

And yeah, I think the M2 Ultras are still plenty powerful, particularly on the CPU end, where really nothing else beats them for perf/W, or even raw perf, for servers (most servers are power constrained, especially under all-core load).

On the GPU side, though, I could buy that some architectural improvements over M2 would be worthwhile.
 
Only two q’s on my mind with regard to AI.

1: Is Apple going to be able to license a model from OpenAI or Google to run on their own hardware?

If they can, will they? Is that the plan, or is this purely for their own models?

2: Model class and compute bottleneck. It depends on what they're using it for, and also on whether they have some spooky interconnect.
 
Looks like Apple has trained both 7B and 70B LLMs, called Fuji, on 15T tokens. The architecture looks pretty standard, i.e. like Llama-3. I need to have a closer look for multimodality, but nothing's jumped out at me so far.


Edit: I should clarify for this thread that they appear to be currently using TPUs and H100s/A100s.
 
As a bit of an AI hater, all I can think of when hearing about so many huge models being trained is “what a waste of energy”. So many watt-hours to build a large fleet of different, huge, 70B-scale, incoherent misinformation-generation machines. While there are cases where the LLMs we have now can be of mild use, they are few and far between, I think. The vast majority of the time, a regular Google search and a good human source of info gets you further just as easily.
 
As a bit of an AI hater, all I can think of when hearing about so many huge models being trained is “what a waste of energy”. So many watt-hours to build a large fleet of different, huge, 70B-scale, incoherent misinformation-generation machines. While there are cases where the LLMs we have now can be of mild use, they are few and far between, I think. The vast majority of the time, a regular Google search and a good human source of info gets you further just as easily.
As a student of ML, I totally understand the sentiment. However, there are many ways to train these models. For example, distillation of other models takes much less energy (which I think is where the Apple OpenAI/Google deals come in), and grounding and attribution help a model cite references and reduce confabulation. Dataset curation is also absolutely essential, i.e. garbage in, garbage out, and Google's recent deployment using bullshit Reddit posts advising people to eat rocks and such points to the main issue: companies are deploying half-baked products, causing massive concern and ruining the field's reputation.
 
As a student of ML, I totally understand the sentiment. However, there are many ways to train these models. For example, distillation of other models takes much less energy (which I think is where the Apple OpenAI/Google deals come in), and grounding and attribution help a model cite references and reduce confabulation. Dataset curation is also absolutely essential, i.e. garbage in, garbage out, and Google's recent deployment using bullshit Reddit posts advising people to eat rocks and such points to the main issue: companies are deploying half-baked products, causing massive concern and ruining the field's reputation.
So just to be clear, I have no issues with AI and ML, especially in the academic fields, and they also have various perfectly good and valid uses, for which I have no issues with the cost of either training or inference. But I am sick and tired of LLMs and their overuse for things they aren't even that good at.
 
People keep referring to current LLMs and AI as if they are finished technologies. This is the Wright brothers' airplane stage. This is the NCSA Mosaic web browser stage. This is the Intel 4004 stage. Not immediately useful, but a new, society-changing paradigm.

I believe Jensen’s Industrial Revolution comparison is not hyperbole.
 
Can they really provide enough compute for these really demanding tasks using M2 Ultras? These are nearly beaten by an M3 Max now, and big iron from Nvidia will certainly destroy them. Can they really support a user base of ~1 billion users?

I’d say Yes.

Nvidia might have the outright chip performance but at what power draw?

In a datacenter you’re limited by power and cooling not just ultimate performance in an unconstrained power and thermal envelope.
 
I’d say Yes.

Nvidia might have the outright chip performance but at what power draw?

In a datacenter you’re limited by power and cooling not just ultimate performance in an unconstrained power and thermal envelope.
Having read the technical information Apple put up on its security blog about how they intend to provide privacy and security guarantees, and reading between the lines a bit, I believe Apple's goal is that third party devs will be able to distribute models that can run either on-device or in the cloud. The decision of where to run is made at runtime, on-device, based on job size and device capabilities. If the job gets shipped off to cloud servers, I think those don't even need the app to package and upload the model; just the data is enough. If the app's distributed through the App Store, Apple can pull the model binary from their own servers.

So, the cloud compute needs to be the same hardware and software architecture as the device, just with more resources. Most notably, lots more RAM than an iPhone, but IIRC there was a second unused neural engine hanging out on both M1 and M2 Max die, so maybe they're lighting that up to get more compute throughput out of each node.

If all this is true, power and cooling considerations are secondary, because Apple's architected a system where there isn't any realistic option other than their own silicon.
 
So, the cloud compute needs to be the same hardware and software architecture as the device, just with more resources.

Is that a given? I thought CoreML enabled models that could migrate between CPU, GPU, and NPU and would be distilled down to the target that would be running inference.

I realize it would limit what models would benefit from this, but it's not exactly unlike Apple to do this sort of "Do this, and you don't have to worry about that" engineering. It's practically their bread and butter. Also that they'd have to port CoreML to the server architecture, but that seems more unlikely than infeasible.
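
A quick sketch of what I mean, using the coremltools Python API ("SomeModel.mlpackage" is just a hypothetical file name): the same converted model can be loaded with different allowed compute units, and layers the chosen unit can't run fall back rather than fail.

```python
# Sketch of Core ML's compute-unit steering via coremltools (Python).
# The model package name is hypothetical; any converted .mlpackage would do.
import coremltools as ct

cpu_only = ct.models.MLModel("SomeModel.mlpackage",
                             compute_units=ct.ComputeUnit.CPU_ONLY)
cpu_gpu  = ct.models.MLModel("SomeModel.mlpackage",
                             compute_units=ct.ComputeUnit.CPU_AND_GPU)
cpu_ane  = ct.models.MLModel("SomeModel.mlpackage",
                             compute_units=ct.ComputeUnit.CPU_AND_NE)

# The model artifact and its input/output description are identical in all three
# cases; the runtime decides per layer where to execute, falling back to the CPU
# for ops the selected unit can't handle.
print(cpu_ane.get_spec().description)
```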
 
Is that a given? I thought CoreML enabled models that could migrate between CPU, GPU, and NPU and would be distilled down to the target that would be running inference.

I realize it would limit what models would benefit from this, but it's not exactly unlike Apple to do this sort of "Do this, and you don't have to worry about that" engineering. It's practically their bread and butter. Also that they'd have to port CoreML to the server architecture, but that seems more unlikely than infeasible.
Good point - I'm not sure how exactly they migrate those models between execution resources. How much compilation is done at runtime to enable model execution? Does a deployable CoreML model have something inside akin to GPU shader code, or is it just the network weights and a precompiled fat binary? (Do they even need to compile code, or can a static, generic system library just ingest network weight & layer structure data? Not familiar enough with ML to guess.)

They're taking pretty extreme preventive measures in the compute cloud - the nodes don't even have a shell binary installed, for example. So I was thinking that there's no chance they'd have a compiler installed, the nodes would only be capable of executing trusted signed binaries. But I am definitely handwaving a lot here.

I do think they wouldn't want to port CoreML to run on top of Linux on x86, or even Arm Linux on one of the Arm server CPU options. As you say, feasible but unlikely.
 
Good point - I'm not sure how exactly they migrate those models between execution resources. How much compilation is done at runtime to enable model execution? Does a deployable CoreML model have something inside akin to GPU shader code, or is it just the network weights and a precompiled fat binary? (Do they even need to compile code, or can a static, generic system library just ingest network weight & layer structure data? Not familiar enough with ML to guess.)

They're taking pretty extreme preventive measures in the compute cloud - the nodes don't even have a shell binary installed, for example. So I was thinking that there's no chance they'd have a compiler installed, the nodes would only be capable of executing trusted signed binaries. But I am definitely handwaving a lot here.

My understanding is that the model itself exists as a type of IR. I don't know if actual code is involved this early rather than something that simply represents the network graph, but there's a few steps of optimizing the neural network itself before code generation should be a concern. But the fact that you can programmatically tell Apple what devices to run the model on, and that not all layer types can leverage the ANE (you'll get fallback behaviors) tells me that it's pretty reasonable to expect a well defined model to be platform agnostic.

I do think they wouldn't want to port CoreML to run on top of Linux on x86, or even Arm Linux on one of the Arm server CPU options. As you say, feasible but unlikely.

Yeah, I'm a little more mixed because Apple has a lot more Linux expertise these days, and has used it exclusively for server stuff up to this point. Going to Apple Silicon and a macOS variant of some kind just for compute resources feels "off", rather than providing some sort of execution engine for CoreML models that works on Linux (it doesn't necessarily need to be all of CoreML, just enough to load and execute the models for compatibility). Being on Linux would generally make it easier to support additional model types if they chose to, although I'd be surprised by that.
 
Yeah, I'm a little more mixed because Apple has a lot more Linux expertise these days, and has used it exclusively for server stuff up to this point. Going to Apple Silicon and a macOS variant of some kind just for compute resources feels "off", rather than providing some sort of execution engine for CoreML models that works on Linux (it doesn't necessarily need to be all of CoreML, just enough to load and execute the models for compatibility). Being on Linux would generally make it easier to support additional model types if they chose to, although I'd be surprised by that.
Are you suggesting Linux on Apple Silicon? To me, it's something of a given that once they committed to AS hardware, they'd use a Darwin-based OS. Linux on AS is too rough to use at scale like this. They'd have a lot of extra work to do on fully integrating key AS features and security hardening.

So it comes down to that choice: why AS?

I do wonder what the primary compute resource would be if it wasn't AS. Existing Arm and x86 datacenter CPUs are mostly about providing a big sea of arm64 or amd64 CPU cores - not the most efficient or performant option for inference. All three ways to run inference on AS chips might well be better. (Even the CPU. M4's SME is the formalization and public exposure of AMX, an accelerator that's been around since before M1, so if Apple needs to do inference work on the CPU clusters in these machines, they've got a much higher performance option than plain old NEON code.)

I don't doubt that in a technical sense, NVidia server GPUs might be the superior choice, but there's two barriers there. One is that Apple has this well known beef with NVidia going way back, and probably wouldn't want to touch CUDA. The other is that right now, I'd say it's a safe bet that NV GPUs cost a lot more (than custom Apple hardware!!!) and are more difficult to acquire. NVidia's server GPUs are astonishingly expensive and they're still selling product faster than they can make it. They are making so much money off this AI bubble.
 
So, @mr_roboto and @Nycturne, I was going to try to reply to each of your points, but the thread got away from me so maybe it'd be better if I give a quick rundown of large language model architecture and the current state of Apple's deployed machine learning.

But first, a disclaimer about the field: A consequence of ML being extremely inclusive of other fields is that terminology is an absolute mess (more so than all interdisciplinary fields I've studied combined). Everything is overloaded and poorly defined, and the irony of people working in natural language causing a communication nightmare is not lost on me.

Anyway, there are a few things that make up a "model" in this thread's context:
  • Model Parameters - Often referred to as "weights", these are the billions of usually FP16 numbers that represent the static state of a pre-trained neural network, typically comprised of layers of the same size.
  • Model Software - This usually includes code for the input tokenizer, position encoder, layer forward (inference) pass, layer backward (training) pass, and so on. Each of those is actually pretty simple and there are GitHub repos that only have one file with the entire implementation; where things get complicated is writing code (particularly the forward and backward passes) that targets a particular accelerator and fuses all of the simple operations so caches don't get thrashed. Actually, one of the reasons the `mlx` library is so well regarded is that it takes care of the complex optimization for you and the API is sensible and elegant.
  • Model Metadata - Just a few json files that describe the model and thus what's necessary to run the model. I suppose this isn't really necessary if the software is bespoke to the model, but thankfully the field has grown conventions which allows software like `llama.cpp`, `mlx`, and such to easily support lots of model configurations.
If you want to run a model on device and in the cloud, you need the model parameters and software on both. Model parameters/metadata can be trivially converted to work with other software. However, the hardware architecture and OS can be anything, and indeed you can fairly straightforwardly execute a model with pen and paper, albeit very slowly. There's actually a guy who does exactly that as a learning resource; I can dig up some links to resources if anyone’s interested.
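To make the "pretty simple" claim concrete, here's a toy, unoptimized forward pass of the core attention operation in plain numpy (no mask, no batching, nothing fused; real implementations optimize all of this heavily):

```python
# Toy single-head attention forward pass: just matrix multiplies and a softmax
# over the stored parameters. Unoptimized and unmasked, purely for illustration.
import numpy as np

def attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])          # token-to-token similarity
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # softmax over positions
    return w @ v                                     # weighted mix of values

d = 8                                    # toy embedding size
x = np.random.randn(4, d)                # 4 tokens
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
print(attention(x, Wq, Wk, Wv).shape)    # -> (4, 8)
```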

Apple in particular targets execution on Apple Silicon GPUs and CPUs via `mlx`, on the ANE and CPUs via `coreml`, and on Google TPUs (and other Google compute resources) via `axlearn`. Since the inception of `mlx`, models have always been convertible from various open source projects (e.g. huggingface.co), but a cool new thing (that I haven't yet had the time to try out myself) is that apparently model params/metadata from `mlx` can be converted to `coreml`. In theory this should mean that pretty much any model out there should be convertible to be run on the ANE. Edit: Also I understand that Apple intends to allow such custom models to be used with their Apple Intelligence platform on-device, which has enormous potential, but also `mlx` has distributed computation that allows one to run big models right on your very own local network.
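
If anyone wants to poke at this themselves, here's a minimal sketch using the `mlx-lm` package (the repository name is just one example of a community-converted checkpoint on huggingface.co, not anything Apple ships):

```python
# Minimal sketch with the mlx-lm package (pip install mlx-lm): download an
# already-converted 4-bit model from the Hugging Face hub and generate locally.
# The repository name below is just an example of a community conversion.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.2-4bit")
print(generate(model, tokenizer,
               prompt="In one sentence, why does memory bandwidth limit LLM decoding?",
               max_tokens=80))
```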

Sorry if this post is a bit rambly! I tried to make it a quick rundown... I intentionally skipped over tokenization, embeddings, dynamic state/memory, low rank adaptation (LoRA), context-extension, quantization, distillation, representation engineering, multimodality, other architectures like recurrent neural networks, graph neural networks, encoder-decoder models, etc. I also didn't address ASi vs Nvidia vs Google TPUs, since you all probably know the hardware capabilities better than I. Anyway, if anyone would like me to elaborate on anything, just ask 🙂
 