Advantages of 128 GB unified memory on LLaMA 2

tomO2013

This is a nice comparison between the M1 Pro, M3 Max, and a desktop 16-core AMD + RTX 4090 running LLaMA 2 with 13-billion and 70-billion parameter models.



TL;DR: it highlights the huge advantage that Apple silicon's unified memory architecture gives large locally run inference models compared to a 4090, etc. It also shows that the M3 Max on smaller models (e.g. 7 billion parameters) is not far behind a 4090, at a fraction of the power consumption.

Not really anything new for the folks I see regularly posting here, who will know this stuff already - but it's still nice to see tangible, real-world, side-by-side benchmarks!

Enjoy.
 
Now imagine if Apple had as many tensor cores for training purposes… who knows? Maybe with the next big die shrink…
 
It's possible. With all the rumours of Apple investing more heavily in Siri, and their preference for on-device processing, I wouldn't be surprised to see a bigger piece of die real estate dedicated to the Neural Engine in the M4.

In any case, this test really left me with a few takeaways:
1. For larger models that simply can't fit within the 4090's memory, the M3 Max runs rings around the 4090 (rough numbers in the sketch below).
2. For smaller models of 7 billion and 13 billion parameters, the M3 Max was not far off the performance of a 4090 running optimized CUDA code. That's not too shabby when you factor in the wattage disparity when both are running under full load.

I'd also be interested to know how well the Apple Silicon code path is optimized relative to the x86 + 4090 code path, even for the smaller 7-billion-parameter model that fits within the 24 GB of VRAM on the 4090.
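
To put rough numbers on point 1, here's a back-of-the-envelope sketch (my own assumed figures, not from the video): at the 4-bit quantization ollama typically uses, the weights alone come to roughly half a byte per parameter, which is what pushes the 70B model past the 4090's 24 GB while leaving plenty of headroom in 128 GB of unified memory.

# Back-of-the-envelope weight footprint (assumed bytes/param; ignores KV cache and runtime overhead)
BYTES_PER_PARAM = {"fp16": 2.0, "q8_0": 1.0, "q4_0": 0.5}

def weight_gb(params_billion, quant="q4_0"):
    # billions of params x bytes per param gives an approximate size in GB
    return params_billion * BYTES_PER_PARAM[quant]

for n in (7, 13, 70):
    gb = weight_gb(n)
    print(f"{n}B @ 4-bit ~ {gb:.1f} GB | fits 24 GB 4090: {gb < 24} | fits 128 GB unified: {gb < 128}")

That's why the 70B run spills out of the 4090's VRAM and falls back to CPU/system RAM, while the 7B and 13B runs stay GPU-resident on both machines.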
 
It should be noted that this is single-stream inference, which is heavily memory bound. If you run multiple simultaneous inferences, the extra compute power of the 4090 comes more into play (as long as the model fits into memory!). And of course training is a different beast.
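
Right - a crude first-order model of that (my assumptions, not measured numbers): for single-stream decoding, every generated token has to stream the whole weight set through memory, so the throughput ceiling is roughly bandwidth divided by model size. Using published peak bandwidth figures (~400 GB/s for the top M3 Max, ~1 TB/s for the 4090) and the ~35 GB 4-bit 70B footprint from the sketch above:

# Single-stream decode ceiling ~ memory bandwidth / bytes of weights read per token
def tokens_per_sec_ceiling(model_gb, bandwidth_gb_s):
    return bandwidth_gb_s / model_gb

MODEL_GB = 35.0  # ~70B at 4-bit quantization (assumed)
print(f"M3 Max   (~400 GB/s):  ~{tokens_per_sec_ceiling(MODEL_GB, 400):.0f} tok/s ceiling")
print(f"RTX 4090 (~1008 GB/s): ~{tokens_per_sec_ceiling(MODEL_GB, 1008):.0f} tok/s ceiling, if the model fit in 24 GB")

Batching several requests reuses each streamed weight across all of them, which is where the 4090's much larger compute budget starts to pay off.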
 
For comparison, an M1 Ultra (48-core GPU, 64 GB) takes 1m 16s to run that query on the 70B-parameter model.
GPU power reported by powermetrics tops out around 60 W.
Command: time ollama run llama2:70b "write an essay for about the usa revolution."
 

Attachments

  • Screenshot 2023-11-27 at 16.18.32.png

Very interesting. A little alarmed to hear the M3 Pro panicked! Performance looks great and I’m very interested to see how the M3 Ultra does.

Saw this video earlier covering some other AI programs. IIRC Stable Diffusion, an LLM, and some ray tracing in Blender compared to his PC with a 3080. Nice results.

 