Really great post and very important questions!
I think it is important to understand why researchers would see CUDA as "easy to use". On the surface, this is surprising: CUDA is C++ after all, an extremely complex programming language with many footguns, further complicated by parallel execution, a memory model that is genuinely hard to understand, and various idiosyncratic limitations. It's not something a non-programmer should find "easy" or "comfortable". But one thing CUDA does very well is erode the difference between CPU and GPU code. They give you a C++ compiler, a unified codebase (where functions can run either on the CPU or the GPU), their own version of the C++ standard library, and you can just code away. I suppose this is indeed something a technically minded researcher will find conceptually simpler, because all of these things make sense to them, unlike API glue, which they are likely to dismiss as irrelevant, inconvenient noise.
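To make that concrete, here is a minimal CUDA C++ sketch of the unified-codebase model (the function and kernel names are made up for illustration): one source file, one compiler, and a single function callable from both the CPU and the GPU.

```cpp
// single_source.cu - a minimal sketch of CUDA's unified-codebase model.
#include <cstdio>

// One function, callable from both CPU and GPU code.
__host__ __device__ float scale(float x) { return 2.0f * x; }

// GPU kernel that reuses the shared function.
__global__ void scale_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scale(data[i]);
}

int main() {
    const int n = 1024;
    float* data = nullptr;
    // Unified (managed) memory: the same pointer is valid on CPU and GPU.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = float(i);

    // GPU call...
    scale_kernel<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    // ...and CPU call of the same function, on the same data.
    printf("%f %f\n", data[10], scale(data[10]));
    cudaFree(data);
    return 0;
}
```

The point is not the kernel itself, it's that nothing in this file forces the researcher to think about two toolchains or a marshalling layer.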
But there are also disadvantages to the CUDA model. Unified codebases are convenient, but you are forced to use a non-standard compiler, and your programs end up written in a way that's difficult to disentangle. I do understand, however, that this is of little concern to the average researcher, who targets specific compute infrastructure and doesn't care much about software quality or cross-platform compatibility.
Let's focus on this particular aspect (the convenience of mixing CPU and GPU code) and ask ourselves whether Apple can do better. I believe they can. From the start, Metal has pursued a conceptually simple model of CPU/GPU data interfacing, which pretty much peaked with Metal 3. All data is placed in a structured buffer using the usual C/C++ layout rules, and that buffer can contain pointers to other buffers or resources. The layout is identical between the CPU and the GPU, the only caveat being that pointers have a different representation on each device (e.g. the CPU sees an abstract device address as a 64-bit integer, while the GPU sees an actual data pointer). But essentially this is a shared-memory programming model, exactly how you would use it in CPU-side RPC/IPC, and it's a model intimately familiar to any C++ programmer. So it's already a good foundation. The only thing missing is unified virtual memory, so that the CPU and GPU share the same address space and marshalling pointers and resource types becomes unnecessary.
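In practice this usually takes the form of a single header shared between the host code and the Metal shaders. A minimal sketch, with made-up struct and field names (the pattern itself, the `__METAL_VERSION__` check and writing MTLBuffer.gpuAddress from the CPU side, is the standard Metal 3 approach):

```cpp
// shared_types.h - included by both the host C++ code and the .metal shaders.

#ifdef __METAL_VERSION__
// Compiled as MSL: the field is a real device pointer the GPU can dereference.
typedef device float* FloatDevicePtr;
#else
#include <cstdint>
// Compiled as host C++: the same 8 bytes hold the other buffer's GPU address,
// obtained on the CPU side from MTLBuffer.gpuAddress (Metal 3).
typedef uint64_t FloatDevicePtr;
#endif

// Identical byte layout on both sides, plain C/C++ struct rules.
struct ParticleSystem {
    FloatDevicePtr positions;    // points into another buffer
    FloatDevicePtr velocities;   // ditto
    uint32_t       count;
};
```

The CPU fills `positions` and `velocities` with the corresponding buffers' GPU addresses, and the GPU simply dereferences them; there is no per-field encoding API in between.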
On the topic of mixing CPU and GPU code without API glue, I think Apple is in a unique position, as they control both the hardware and the software, including the standard developer tools. They could go a step further than Nvidia and actually allow linking of CPU and GPU code. Imagine if you could combine binary CPU and GPU objects in the same executable, with GPU functions being exposed on the CPU side (and vice versa?).
What I am envisioning here is a programming model where there is still a separation between CPU and GPU code (because once you remove it, you are merging domains and losing fine-grained control), but interfacing becomes much simpler at the tooling level. Data structures are shared between CPU and GPU, and API glue is no longer necessary. This would be a more principled approach than CUDA, allowing more complex software patterns and better program architecture, without losing convenience. As for the rest (extending MSL to support modern C++ features and the standard library), that's just a matter of effort and much less of a big deal.
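For contrast, here is roughly what the glue for a single compute dispatch looks like today. This is a rough metal-cpp sketch, not production code (error handling, object releases and the usual *_PRIVATE_IMPLEMENTATION defines are omitted, and the kernel name is made up); it's exactly this ceremony that CPU/GPU linking could collapse into something closer to an ordinary function call.

```cpp
// What calling one GPU function costs in host-side glue today (metal-cpp sketch).
#include <Metal/Metal.hpp>

void run_scale_kernel(MTL::Device* device, MTL::Buffer* data, uint32_t count) {
    NS::Error* error = nullptr;

    // Look up the GPU function by string name and build a pipeline for it.
    MTL::Library* library = device->newDefaultLibrary();
    MTL::Function* function = library->newFunction(
        NS::String::string("scale_kernel", NS::UTF8StringEncoding));
    MTL::ComputePipelineState* pipeline =
        device->newComputePipelineState(function, &error);

    // Encode the call by hand: queue, command buffer, encoder, bindings.
    MTL::CommandQueue* queue = device->newCommandQueue();
    MTL::CommandBuffer* commands = queue->commandBuffer();
    MTL::ComputeCommandEncoder* encoder = commands->computeCommandEncoder();

    encoder->setComputePipelineState(pipeline);
    encoder->setBuffer(data, /*offset*/ 0, /*index*/ 0);
    encoder->dispatchThreads(MTL::Size(count, 1, 1), MTL::Size(64, 1, 1));
    encoder->endEncoding();

    commands->commit();
    commands->waitUntilCompleted();
}
```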
So yes, I do think Apple can do a lot here, if they want. The big question is whether they do. Without Apple-branded supercomputers, targeting this kind of market probably doesn't make much sense, and application developers don't really need these "convenience features", as they are fine with some API glue. It's really up to Apple leadership to decide the direction. And of course, everything that @dada_dave said in their post above applies as well.
P.S. Apple, if you are reading this, call me.
I have more good ideas!