Not a bad article, but you need to be clearer that Apple Silicon is not ARM in the strictest sense. It uses an ISA that has the ARM ISA in it, but its microarchitecture differs radically from Cortex and from all the other ARM processors.
Sorry for the confusion. It’s not written by me; I just reposted it. The article was written by Joel Hruska for ExtremeTech.
I am not sure I would say radically. It does the same stuff and is capable of running the same object code, perhaps with an extra feature or two; it just does it significantly more efficiently than anyone else's implementation. In a way, it is unfortunate that the license does not have a sort of GPL-like clause so that everyone would have to share their design principles with other license holders. That would really make Intel sweat.
Hi everybody! I'm wondering where we will get our good and detailed insights into the new M2 chips, since AndreiF doesn't work for Anandtech anymore. Any ideas?
I’m sure someone else will step up. If not, we can put together information from multiple sources. But pretty sure M2 single core will look a lot like A15.
How wide do you think they will be able to go?
What about transient tracking? Having a peculiar fondness for the 6502, I cribbed up what a 64-bit version would look like, and the issue of transient values jumped right out at me.
Imagine you get a value into, say, r17, add it to r8 with the result going into r16, and then do nothing else with r17 until another value goes into it – do you ever ultimately commit the original value to r17 if it never gets used before being replaced by some other value? In other words, is there a more efficient way to use/discard rename registers (kind of an op-fusion scheme, as it were), or do they already do that?
I'm not entirely sure the question makes sense. Depends on what you mean by "commit". Let me pseudocode and label the instructions to make it easier to talk about...
It has no clue that the value will never be useful again. Can't, it's not psychic!
As I understand it, the reorder buffer on a Firestorm has something like 630 ops in flight, which suggests that the dispatcher has a pretty panoramic view of what is downstream. I could imagine that an op in the buffer could easily be tagged with a provisional writeback-bypass flag that would allow it to go directly to the retire stage, barring an exception. Compiling code to do most of its work in a small range of scratch registers could optimize this kind of behavior, the same way compilers have become smart enough to turn verbose source into compact object code.
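If it helps, here's a toy Python sketch of the renaming mechanism being discussed. The names (RenameTable, rename_write, retire_overwrite) are invented, and Apple doesn't publish Firestorm's internals, so this is only the generic merged-physical-register-file scheme commonly described for modern OoO cores: the transient value just sits in a physical register until a younger write to the same architectural register retires, at which point that physical register is recycled. There is no separate step that "commits" the dead value anywhere else.

```python
# Toy sketch (not Apple's actual design) of register renaming with a merged
# physical register file. Architectural registers are just names mapped onto
# a larger pool of physical registers.
class RenameTable:
    def __init__(self, num_arch=32, num_phys=128):
        # committed architectural-to-physical mapping
        self.map = {f"r{i}": f"p{i}" for i in range(num_arch)}
        # physical registers not currently backing any architectural register
        self.free = [f"p{i}" for i in range(num_arch, num_phys)]

    def rename_write(self, arch_reg):
        """Allocate a fresh physical register for a write to arch_reg.
        Returns (new_phys, prev_phys); prev_phys may be recycled once the
        overwriting instruction retires."""
        prev = self.map[arch_reg]
        new = self.free.pop(0)
        self.map[arch_reg] = new
        return new, prev

    def retire_overwrite(self, prev_phys):
        # Only at retirement of the overwriting instruction is it certain the
        # old value can never be read again, so its register is recycled.
        self.free.append(prev_phys)


rt = RenameTable()
p17a, _ = rt.rename_write("r17")      # ldr r17, [addr]   -> value lands in p17a
p16a, _ = rt.rename_write("r16")      # add r16, r17, r8  -> reads p17a, writes p16a
# ... r17 is never read again ...
p17b, stale = rt.rename_write("r17")  # mov r17, #42      -> r17 now maps elsewhere
rt.retire_overwrite(stale)            # p17a (== stale) quietly returns to the free list
print(rt.map["r17"], "freed:", stale) # e.g. p34 freed: p32
```

In other words, under this (assumed) scheme the dead value is never written back to a separate architectural register file at all; reclaiming its physical register is the whole "commit" story.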
Not sure if this has been discussed [when I try using the search function on this thread I always get "No Results"], but this article [https://www.microcontrollertips.com/risc-vs-cisc-architectures-one-better/] says:
"The RISC ISA emphasizes software over hardware. The RISC instruction set requires one to write more efficient software (e.g., compilers or code) with fewer instructions. CISC ISAs use more transistors in the hardware to implement more instructions and more complex instructions as well."
I take that to mean software optimization is more critical for ARM (RISC) than x86 (CISC) in order to achieve optimum performance.
They mention that this needed optimization applies both to conventional program code (i.e., what most developers write) and to the assembly code generated by the compiler.
Thus it seems this also could refer to optimization of low-level libraries—like, for instance, the ARM equivalent of x86's Intel Math Kernel Library.
I note this because it appears that Mathematica still isn't optimized for Apple Silicon. I've seen several WolframMark benchmarks posted for the M1, and they're never over 3.2. By contrast, my 2014 MBP gets 3.0 (my 2019 i9 iMac gets 4.5, but it's hard to tell how many cores the benchmark is using; at least I know core count doesn't put the 4+4-core M1 at a disadvantage relative to my 4-core MBP). Some have opined this is partly because no one has yet written an ARM version of MKL that is as highly optimized as Intel's, which in turn is typically explained by the substantial time and expertise Intel has devoted to MKL. But could part of this also be (here I'm purely speculating) that it's harder to achieve high optimization of low-level libraries with RISC than CISC, because RISC performance is more sensitive to software inefficiencies? (A rough NumPy sketch of this library dependence follows after this post.)
Mathematica aside, I'm wondering whether the original quote means there's a lot more optimization still to be had in programs written for ARM—or if the needed software optimization they're referring to has, for the most part, already been done.
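To put the library point in concrete terms, here's a rough sketch, plain NumPy rather than anything Mathematica-specific, and the absolute numbers are meaningless: the same one-line matrix multiply inherits whatever optimization the BLAS library NumPy happens to be linked against (MKL, OpenBLAS, Apple's Accelerate, a reference BLAS) provides, which is the same situation Mathematica is in with MKL.

```python
# Time a matrix multiply and report which BLAS/LAPACK backend NumPy was built
# against. The point is that the "user" code is identical either way; the
# speed is mostly the underlying library's doing.
import time
import numpy as np

np.show_config()                       # prints the BLAS/LAPACK backend in use

n = 2000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.perf_counter()
c = a @ b                              # one line of high-level code ...
elapsed = time.perf_counter() - start  # ... whose speed is the library's work
print(f"{n}x{n} matmul took {elapsed:.3f} s")
```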
It’s just as easy to optimize RISC code as CISC code. In fact, it’s probably easier. Think of it as building a house using Legos: CISC gives you big bricks with lots of complex shapes, while RISC gives you tiny 1x1 bricks from which you can build anything you want.
CISC code is being broken up into micro-ops by the processor anyway, at the instruction decoder stage. I'd rather have a compiler, with lots of resources and the ability to understand the entirety of the code and the developer's intent, figure out how to optimize things, rather than an instruction decoder that sees only a window of maybe 100 instructions.
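To make the "broken up into micro-ops" idea concrete, here's a purely illustrative toy; the crack() helper and the tuple instruction format are invented, and a real decoder is nothing like this simple. It shows the general pattern: a memory-destination CISC-style add gets cracked into RISC-like load / ALU / store micro-ops, while a register-register op passes through as-is.

```python
def crack(instr):
    """Split a memory-destination 'add' into load / add / store micro-ops;
    plain register-register instructions pass through untouched."""
    op, dst, src = instr
    if op == "add" and dst.startswith("["):
        return [("load",  "tmp", dst),   # tmp   <- [mem]
                ("add",   "tmp", src),   # tmp   <- tmp + src
                ("store", dst,  "tmp")]  # [mem] <- tmp
    return [instr]

program = [("add", "[rbx]", "rax"),      # CISC-style memory-destination add
           ("add", "rcx",   "rdx")]      # ordinary register-register add
for instr in program:
    print(instr, "->", crack(instr))
```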
The issue with Mathematica seems to simply be that people haven’t yet optimized MKL for ARM. And since my understanding is that MKL comes from Intel, that is unlikely to happen any time soon unless someone comes up with their own version.
There is a version for ARM they can use, and I believe they are using it. It's just (probably) not as good. It seems it's challenging to write a fast math library. E.g., AMD produced its own version (which is now EOL), called ACML (AMD Core Math Library), and it was significantly slower than Intel's MKL, even when run on an AMD system:
Intel MKL vs. AMD Math Core Library (stackoverflow.com): "Does anybody have experience programming for both the Intel Math Kernel Library and the AMD Math Core Library? I'm building a personal computer for high performance statistical computations and am…"
Got it. But is the article right that (even though optimizing code for RISC is not an issue) code optimization is more critical for RISC than CISC?
RISC is more forgiving - you have fewer registers, and since each instruction is simple and since memory accesses are limited to a very small subset of the instruction set, you don’t have to work as hard to avoid things like memory bubbles, traps, etc.
I assume you meant something else with the bolded bit? You describe both x86 and RISC as having fewer registers.
LOL, right. RISC has more registers, CISC has fewer (as a rule of thumb).
I agree it’s the wrong premise: with OoO execution, micro-ops and other techniques, the CPU has a lot of control no matter the ISA. Microarchitecture seems more important to the final result than the ISA. The ISA does place some restrictions on the microarchitecture, but that has become less relevant over the years. And when people write higher-level code, not assembler, the ISA itself is an implementation detail left to the compiler, though you could very well make optimizations based on the microarchitecture’s behaviors if you really need to wring out every drop of performance.
The article is written from the perspective of microcontrollers, which are usually years if not decades behind desktop/laptop chips, and even smartphone chips. When PPC/Pentium was the latest thing in the early 2000s, the microcontrollers I worked with were similar to the Z80. These days, microcontrollers are starting to adopt ARM, but may be running on simpler cores and be reliant on Thumb. I’m not even sure OoO execution is supported on some of these newer microcontrollers.