Hmm I can’t make the CPU monkey numbers work for the Immortalis GPU. 2400/(1x12x2) = 100 FP32 units per core, which doesn’t make any sense.

I’m not sure how reputable CPU Monkey is. It’s possible they have the wrong information.

I miscalculated initially, it’s 2.3 TFLOPs: 2300/(1x12x2) = 96 FP32 units per core, which is possible (6x16 warps) but still requires them to cut the cores down to something smaller than either the 128 or 192 listed … bah!

Hmmm, that is a discrepancy.
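As a quick sanity check, here is that back-calculation as a small Python sketch (my assumptions: the 1 in the denominator is a 1 GHz clock, and each FMA counts as 2 FLOPs):

```python
def fp32_lanes_per_core(gflops: float, clock_ghz: float, cores: int) -> float:
    """Back out the implied FP32 lanes per core from a quoted GFLOPs figure.

    Assumes peak GFLOPs = 2 (FMA) * lanes * cores * clock_ghz.
    """
    return gflops / (2 * clock_ghz * cores)

print(fp32_lanes_per_core(2400, 1.0, 12))  # 100.0 -- matches no plausible config
print(fp32_lanes_per_core(2300, 1.0, 12))  # ~95.8 -- i.e. 96, possible as 6x16 warps
```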
However, this may be “peak” TFLOPs, when in reality the clocks on a phone never sustain that for very long? It’s actually operating well below that for most of the time.
In fact, it is really not even peak, it is theoretical performance capacity. The math shows that capability based on configuration, but it probably cannot be sustained for even half a dozen clock cycles. It is marketspeak.
So I was able to find some information on the Immortalis GPU 720 MC12: 12 cores, as the name implies, at 1.3 GHz, which is definitely substantial. There are apparently 192 execution units per core, which gives 2x192x12x1.3GHz = 5.99 TFLOPs. The math for the Immortalis GPU 720 MC11 works out similarly: 2x192x11x0.85GHz = 3.6 TFLOPs.
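The same math as a reusable Python sketch (it just assumes peak FLOPs = 2 x units x cores x clock, i.e. one FMA per unit per clock, which is the convention used in the calculations above):

```python
def peak_tflops(units_per_core: int, cores: int, clock_ghz: float) -> float:
    """Theoretical peak FP32 TFLOPs, counting each FMA as 2 FLOPs."""
    return 2 * units_per_core * cores * clock_ghz / 1000

print(peak_tflops(192, 12, 1.30))  # ~5.99 -- Immortalis GPU 720 MC12
print(peak_tflops(192, 11, 0.85))  # ~3.59 -- Immortalis GPU 720 MC11
```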
Thanks. From what I can tell (and @leman can confirm or refute) the Apple A17 GPU cores are 1.4 GHz, 128 units, warp size 32. With 6 cores that’s 2.15 TFLOPs FP32. Like the ARM, it claims double the FP16 TFLOPs; it’s unclear whether the ARM cores have separate FP16 pipes like Apple does or run 2x FP16 through the FP32 pipes.
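For what it’s worth, the Apple figures drop straight into the same peak_tflops sketch from above:

```python
print(peak_tflops(128, 6, 1.40))  # ~2.15 -- Apple A17 GPU, matching the figure quoted
```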
There are at least three 16-bit formats. I believe the one most commonly used is BFloat16, which is merely FP32 with a truncated mantissa. That would work well through a FP32 pipe, since the exponent is the same. IEEE754 standard FP16 would call for different logic structures – would it be worth the trouble?
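To illustrate the “FP32 with a truncated mantissa” point, a minimal Python round trip (truncation only; real hardware may also round):

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """bfloat16 by truncation: keep FP32's sign, 8-bit exponent, top 7 mantissa bits."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

def bf16_bits_to_fp32(bits: int) -> float:
    """Widen back to FP32 by padding the low 16 bits with zeros."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

x = 3.14159265
print(bf16_bits_to_fp32(fp32_to_bf16_bits(x)))  # 3.140625 -- same exponent, coarser mantissa
```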
I believe Nvidia runs two FP16 through their FP32 pipes
M3 doesn’t have 2x FP16, it can execute FP16 and FP32 simultaneously.
GPUs support IEEE FP16, bfloat is a relatively new thing as it’s used primarily in ML (limited precision makes it less useful for lighting calculation where FP16 is traditionally used). On M1 at least bfloat is implemented by padding a 16-bit value with zeros and using it as FP32.
Nvidia used to have two FP16 pipes but that was a long time ago. They haven’t had dual FP16 in a while. AMD does though, via packed SIMD within SIMT.

Ah yes, I got confused: they do issue FP16 through the FP32 pipe, and they also have two potential FP32 pipes, though one of those also serves as an Int32 pipe. I confused that with doing two FP16 packed through the FP32 pipe - or I must’ve confused that with AMD.
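To picture “packed SIMD within SIMT”: two FP16 values share one 32-bit register lane, so each FP32-wide ALU pass can retire two FP16 operations. A toy NumPy illustration of just the data layout (real hardware executes the packed math natively):

```python
import numpy as np

a = np.array([1.5, -2.25, 0.5, 8.0], dtype=np.float16)
b = np.array([0.5, 1.25, 2.0, -1.0], dtype=np.float16)

# Four FP16 values occupy only two 32-bit lanes
print([hex(v) for v in a.view(np.uint32)])

# A packed add retires two FP16 results per 32-bit lane per pass
print(a + b)  # [2.0, -1.0, 2.5, 7.0]
```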
I asked the creator of the Twitter thread why the figures quoted are so much lower than benchmarks show. They were kind enough to answer.
I'm sure the answer is true, but I’m also unconvinced it explains the discrepancy.
I have difficulty understanding that ARM document. They say four units and warp width of 16, that would mean 64 operations per clock, where does 256 come from? Or does each thread do packed math, i.e. 4-wide SIMD per thread x16 per unit? If the latter is true I’m not surprised it performs poorly in real world, that’s a design big players abandoned long time ago. Again, I’m just trying to guess what might be happening here.
The Valhall programmable Execution Core (EC) consists of one or more Processing Engines (PEs). The Mali-G57 and Mali-G77 have two PEs, and several shared data processing units, all of which are linked by a messaging fabric.
A Valhall core can perform 32 FP32 FMAs, read four bilinear filtered texture samples, blend two fragments, and write two pixels per clock.
The processing engines
Each PE executes the programmable shader instructions.
Each PE includes three arithmetic processing pipelines:
- FMA pipeline, which is used for complex maths operations
- CVT pipeline, which is used for simple maths operations
- SFU pipeline, which is used for special functions

The FMA and CVT pipelines are 16-wide; the SFU pipeline is 4-wide and runs at one quarter of the throughput of the other two.
Arithmetic fused multiply accumulate unit (FMA)
The FMA pipelines are the main arithmetic pipelines, implementing the floating-point multipliers that are widely used in shader code. Each FMA pipeline implements a 16-wide warp, and can issue a single 32-bit operation or two 16-bit operations per thread and per clock cycle.
Most programs that are arithmetic-limited are limited by the performance of the FMA pipeline.
Arithmetic convert unit (CVT)
The CVT pipelines implement simple operations, such as format conversion and integer addition. Each CVT pipeline implements a 16-wide warp, and can issue a single 32-bit operation or two 16-bit operations per thread and per clock cycle.
Arithmetic special functions unit (SFU)
The SFU pipelines implement a special functions unit for computation of complex functions such as reciprocals and transcendental functions. Each SFU pipeline implements a 4-wide issue path, executing a 16-wide warp over 4 clock cycles.
Valhall maintains native support for int8, int16, and fp16 data types. These data types can be packed using SIMD instructions to fill each 32-bit data processing lane. This arrangement maintains the power efficiency and performance that is provided by the types that are narrower than 32-bits.
A single 16-wide warp maths unit can therefore perform 32x fp16/int16 operations per clock cycle, or 64x int8 operations per clock cycle.
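Putting the quoted numbers together, a back-of-the-envelope sketch (the per-core totals assume the two PEs per core stated above for the G57/G77; the final line is only my guess at where a 256 ops/clock figure could come from):

```python
WARP_WIDTH = 16      # lanes in a 16-wide FMA/CVT pipeline (quoted above)
PES_PER_CORE = 2     # Mali-G57 / Mali-G77 (quoted above)

# Per 16-wide unit, per clock:
fp32_ops = WARP_WIDTH        # 16 single 32-bit operations
fp16_ops = WARP_WIDTH * 2    # 32, two 16-bit values packed per 32-bit lane
int8_ops = WARP_WIDTH * 4    # 64, four int8 values packed per lane

print(PES_PER_CORE * fp32_ops)  # 32 FP32 FMAs per core per clock, matching the quote
print(PES_PER_CORE * fp16_ops)  # 64 fp16 ops per core per clock

# Guess: four 16-wide units with full int8 packing gives 4 * 64 = 256,
# which would explain the "256 ops/clock" in the earlier question
# (an assumption -- the quoted text doesn't state this for a specific GPU)
print(4 * int8_ops)             # 256
```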
Whether you get more than 256 ops/cy because of the CVT and SFU units depends on the operation and GPU generation. Some can issue in parallel, some can't, as we continuously rebalance the shader core design to get the right balance of operations for industry content trends and to optimize for energy efficiency.
From one very limited standpoint, perhaps.
So should I believe drivers could double performance for compute?
Honestly? I don’t know. It’s not like the Vulkan compute scores are great, but again … I’d be very curious if anyone can replicate the supposed >5 TFLOPs of either chip under any circumstance, especially of the Immortalis, which I calculate shouldn’t actually have that. Maybe I’ve made a mistake, maybe there’s something I’m missing.
But from what I can tell it should only have a 4 TFLOP GPU at best. Which, btw, if it actually hits that, is still an incredibly good GPU. But it doesn’t seem to get even 2x the performance of an Apple A17 Pro GPU in its best benchmarks, which makes me think even that number of 4 TF, never mind 6, is suspect as a practical matter.
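My guess is that the 4 TFLOP ceiling assumes Apple-style 128-unit cores rather than the 192 listed earlier (an assumption on my part, not stated in the post); using the peak_tflops sketch from before:

```python
print(peak_tflops(128, 12, 1.30))  # ~3.99 -- the "4 TFLOP GPU at best"
```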
I don’t think anyone could blame you! The situation is clearly a mess.
Yes, I don’t know what to think, but it does make it clear to me why I prefer iOS to Android. What a mess.
Could you link my Immortalis calculations where I go through each discrepancy to them? Maybe they can find an error in my explanations or calculations. Or maybe not, and it’s really the reported numbers that are off.

Post number 70?
Yes