May 7 “Let Loose” Event - new iPads


Insights about M4

Sadly, vague insights. They usually have a lot of depth in their free content, vs. what's now just an advertisement ☹️
Aye, though to be fair, the fact that the M4 uses a different, high-performance cell library for the P-cores, as opposed to the high-density library used previously and for the rest of the current SoC, is quite interesting. If I remember right from the MR discussion between him and @theorist9, I think @leman was already ruminating on repeating his power-vs-frequency tests for the M4; those should be extra insightful into the design of these cores if there is anything to be seen in those graphs versus the others.
 

I may be running up against the limits of my understanding here, so this may be obvious to others, but... how can they know that it is as they say - HP for CPUs, HD for everything else? Isn't the whole point of FinFlex that you have a lot more flexibility in the details? I would expect it to be *approximately* as they say, but with plenty of variation.
 

Yeah, I have questions here. My experience was different, because we always designed our own cell libraries. And, for a given gate, we had many variations, with different power levels and, sometimes, alternate layouts.

So, for example, we had a NAND2x1, NAND2x2, NAND2x4, etc. We might even have a NAND2x4a or whatever, if, for some reason, somebody needed an alternate version of NAND2x4 with a different layout.

The “x” number was the relative width-to-length ratio of the cell as compared to x1. So NAND2x2 had double the strength of NAND2x1. Of course, with FinFETs, you are looking at the number of fins (or whatever terminology different companies use) rather than the W/L ratio.

What we would do is size the gates as optimally as we could, choosing x4’s or x8’s when necessary to make timing on some important path, and using x1’s where we could get away with it.
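To make that concrete, here's a toy sketch of that sizing decision (the NAND2 delays, load, and timing budgets are invented for illustration, not from any real library):

#include <stdio.h>

/* Hypothetical NAND2 variants: higher drive strength means lower delay
 * into a given load, at the cost of area and input capacitance. */
typedef struct {
    const char *name;
    double drive;    /* relative strength vs. x1 */
    double delay_ps; /* made-up delay driving a fixed 10 fF load */
} Cell;

static const Cell nand2[] = {
    { "NAND2x1", 1.0, 120.0 },
    { "NAND2x2", 2.0,  70.0 },
    { "NAND2x4", 4.0,  45.0 },
    { "NAND2x8", 8.0,  32.0 },
};

/* Pick the smallest drive strength that still meets the timing budget. */
static const Cell *size_gate(double budget_ps)
{
    for (size_t i = 0; i < sizeof nand2 / sizeof nand2[0]; i++)
        if (nand2[i].delay_ps <= budget_ps)
            return &nand2[i];
    return NULL; /* nothing meets timing; the path needs restructuring */
}

int main(void)
{
    printf("critical path (50 ps budget): %s\n", size_gate(50.0)->name);  /* NAND2x4 */
    printf("relaxed path (200 ps budget): %s\n", size_gate(200.0)->name); /* NAND2x1 */
    return 0;
}

In a real flow, of course, the library and the tools make this trade across millions of instances, balancing area and leakage against timing.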

All of which is to say, we could mix and match high-density and high-performance gates.

What I think is going on with TSMC’s libraries, though, is something different. (I have no access to their cell libraries, so my guess could be wrong.) What I think is going on is that the HD and HP cells use different “standard cell architectures,” which means you can’t mix and match them. By “standard cell architecture” I refer to the cell height and the location of the power rails. The cell is a rectangle, with a horizontal metal layer on the top and another on the bottom, one for power and one for ground. These line up with neighboring cells, so you distribute power and ground just by abutting them. But if an HD inverter has a different cell height than an HP inverter, then you can’t do that.

We did something like that at Exponential. Our “datapath” cells were actually designed to have constant width but variable height, so that each bit of a datapath formed a uniform column. Our “control” cells were constant height and variable width, which is what everyone in CMOS does.
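If it helps, here's a minimal sketch of the abutment constraint I described above, with made-up cell heights standing in for whatever TSMC actually does:

#include <stdbool.h>
#include <stdio.h>

/* In row-based placement, power runs along one edge of the cell and
 * ground along the other, so two cells can only abut in a row if their
 * heights (and thus rail positions) match. Heights here are invented. */
typedef struct {
    const char *name;
    double height_um; /* cell height = row pitch */
} StdCell;

static bool rails_align(const StdCell *a, const StdCell *b)
{
    return a->height_um == b->height_um;
}

int main(void)
{
    StdCell hd_inv  = { "HD_INVx1",   0.21 }; /* hypothetical HD height */
    StdCell hd_nand = { "HD_NAND2x2", 0.21 };
    StdCell hp_inv  = { "HP_INVx1",   0.27 }; /* hypothetical HP height */

    printf("HD next to HD: %s\n", rails_align(&hd_inv, &hd_nand) ? "ok" : "mismatch");
    printf("HD next to HP: %s\n", rails_align(&hd_inv, &hp_inv)  ? "ok" : "mismatch");
    return 0;
}

Different heights can still coexist on the same die, presumably just in separate regions (as the die shot suggests), rather than interleaved gate by gate.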
 
Those are good points. We’d probably have to pay for their full analysis to find out!
 
My experience was different, because we always designed our own cell libraries. And, for a given gate, we had many variations, with different power levels and, sometimes, alternate layouts.

That was the same for the small (6-12 person) chip company I worked for in the past. It was a huge advantage with respect to performance, power consumption, and die size compared to large, well-known companies (National/Harris/TI/AD) that used sophisticated/expensive design tools.
 
At Exponential Technology we had a pretty cool setup. We did our design using C, but the C provided structure, not just logic. I can’t remember the syntax exactly, but it was this sort of thing:

z = (a | b) & !c // !1600 (100, 1500)

So you could do a logical simulation, but when run through the tools, the comments would tell you the gate size (current) and placement location. (We eventually moved placement into another file so that changing placement didn’t “touch” the logic file and trigger a re-verification.) The cool part is that since we were designing using ECL/CML, pretty much any logic function could be done with a “single” gate (the max number of variables depending on whether you are doing ECL or CML). So if you used a function nobody had used before, it automatically created the cell (it would create a .lef, fire off SPICE to characterize it, and even do an automatic sticks-based layout). We had a couple of folks who would then hand-optimize cells that got used the most, or that were particularly critical for some reason. But pretty much you could just write the logic in C, and didn’t have to figure out the cell library ahead of time.
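As a guess at the flavor of it (names and numbers invented), each line was legal C, so it simulated directly, while the trailing comment carried the annotations for the physical tools:

#include <stdio.h>

/* The logic is ordinary C, so the same file runs as a logic simulation.
 * The comment after it -- gate current, then (x, y) placement -- is
 * invisible to the compiler but read by the layout tools. */
static int a = 1, b = 0, c = 0;

int main(void)
{
    int z = (a | b) & !c;  /* !1600 (100, 1500) */
    printf("z = %d\n", z); /* logical simulation: prints z = 1 */
    return 0;
}

(The real tools obviously did far more; this just shows the dual-use trick of logic-as-C plus annotations-as-comments.)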
 

Being a very small startup company (Graychip Inc, in Palo Alto), we designed our own tools and chip library: schematic entry, our own simulator, and our own library of gates, counters, adders, buffers/drivers of different strengths, a multiplier generator, etc., with chips being laid out by hand. That made for very compact (and relatively low-power) designs.

The small number of employees we had (I was employee #4) came from ESL Inc in Sunnyvale (a defense aerospace systems contractor) - all systems/hardware design engineers with a lot of experience in aerospace programs/projects, which carried over into the chips we designed. Those mostly ended up initially in government programs, and later in commercial cellular basestation systems as that industry took off. Commercial cellular telecom was a market Texas Instruments was really interested in, which led to them acquiring us; there we designed digital up/down converter chips (i.e., digital radios using digital signal processing techniques) used in cellular basestations. Needless to say, Dallas was quite a change in company culture from Palo Alto. Fortunately, I didn't have to travel there very often. :)
 
Aye, though to be fair, the fact that the M4 uses a different, high-performance cell library for the P-cores, as opposed to the high-density library used previously and for the rest of the current SoC, is quite interesting. If I remember right from the MR discussion between him and @theorist9, I think @leman was already ruminating on repeating his power-vs-frequency tests for the M4; those should be extra insightful into the design of these cores if there is anything to be seen in those graphs versus the others.
Ah! When skimming the blog post I’d missed the full article with the floor plan and other interesting details 🤦‍♂️

Anyway, one thing that jumps out at me in the M4 die shot is that the number of Thunderbolt blocks has doubled to 4, now matching the M3 Pro/Max. That should make a bunch of prospective MacBook Air / Mac Mini / iMac owners happy.
 
Interesting! Those are in the lower right corner, yes?

I suspect that Apple's feeling a little bit of competitive pressure on the base MBP - they want to be able to support all the ports the higher-end Pros support. It would also give that model more differentiation from the Air. (Much less likely, they could give the Air more ports too; that would be a different justification for this change, but a good one.)
 
Yep, lower right side!

There’s also speculation about TB5, though I’m leaning hard toward TB4 here. However, that doesn’t mean the larger variants of the M4 family won’t get TB5. Certainly TB5 will be more expensive, and thus not really suitable for tablets and entry-level machines.
 

Seems the extra bandwidth provided by a single TB5 port could allow tablets & entry-level laptops to be used with a TB5 docking station when at a desk, providing more ports and whatnot than the tablet or entry-level laptop has alone...?
 

I think the question there is: Are the use cases enough to want >40Gbps?
I know I have somewhat different needs than other professionals using their Macs, but for me TB3/4 is for: Ethernet, USB input devices and audio output, a monitor, and an iPhone/Android device being debugged. I'm not missing out on much in my use cases if I don't get TB5 at the moment, except for the ability to daisy-chain hi-res displays, which I don't do. I go the "single large display for focus" approach.

My understanding, though, is that generally it's display and NVMe bandwidth that are keeping up with the bus speed increases these days.
 
I think one of the main use cases will be high resolution with high refresh rate together. Right now you can't get a 5K 120 Hz display over TB4 without compression; TB5 solves this problem.
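Back-of-the-envelope math (active pixels only, ignoring blanking overhead) shows why:

#include <stdio.h>

int main(void)
{
    /* Uncompressed bandwidth = width x height x refresh x bits per pixel. */
    double gbps = 5120.0 * 2880.0 * 120.0 * 30.0 / 1e9; /* 10 bpc RGB */
    printf("5K @ 120 Hz, 10 bpc: %.1f Gb/s uncompressed\n", gbps); /* ~53.1 */

    /* A DisplayPort HBR3 stream, as tunneled over TB3/4:
     * 4 lanes x 8.1 Gb/s x 8b/10b coding = 25.92 Gb/s of payload. */
    printf("DP HBR3 payload: %.2f Gb/s\n", 4 * 8.1 * 0.8);
    return 0;
}

Even two tunneled HBR3 streams (2 x 25.92 = 51.84 Gb/s) fall short of the active-pixel rate, before blanking is even counted, so it's DSC or a faster link.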
 

While I've been on the other side of the fence, I've come to agree with those saying that the compression isn't that big a deal. I say that as someone who currently has a 32" 4K OLED as my primary monitor running at 240Hz on my gaming and work machines (144Hz on my M1 Max).

Although I did point out displays are one of the few things keeping up with bandwidth improvements.
 
I think one of the main use cases will be high resolution with high refresh rate together. Right now you can't get a 5K 120 Hz display over TB4 without compression; TB5 solves this problem.
Is this why Apple didn't add ProMotion support to the Studio Display?
 
There are a couple of aspects to this question:

1) Getting the display to 120Hz
2) Variable Refresh Rate

Getting the display up to the higher refresh rate on DisplayPort 1.4 (and Thunderbolt 3/4) does require DSC, but is otherwise doable. It will exclude older Macs that don't support DSC, though at this point those are ~6 years old. These are the same machines that require multiple DisplayPort streams to achieve 6K at 60Hz. The XDR does support DSC for newer machines, so I'm not really sure why Apple couldn't do DSC in the Studio.

The other aspect of faster refresh is how quickly the pixels can be updated. Apple's 5K/6K displays aren't exactly known for their quick response times at 60Hz. 120Hz may just not be a great experience at the moment with the panels Apple is using, which would negate much of the benefit.

In terms of VRR, I don't see why they couldn't support it today. Being able to drop down to ~48Hz for proper video cadence would still benefit some folks even if 120Hz wasn't possible. But being able to hit 120 is cleaner for getting smooth system animations and matching film cadence.
 
Apple already runs the XDR with compression, by default, when it's connected to a DSC-capable GPU, which includes those on all AS Macs. According to a discussion I had with user joevt on MR, who is the author of the AllRez display utility ( https://github.com/joevt/AllRez ), Apple runs DSC at 12 bpp. As joevt explains, since the framebuffer is 12 bpc = 36 bpp, that's 3:1 compression. That ratio is sufficient to allow them to drive the XDR (6k@60) over TB4 using single-tile HBR2, and it's obviously a level of compression Apple deems acceptable. [ For GPU's without DSC, the XDR can also be driven over TB4 without compression, using dual-tile HBR3. ]

So I suspect the issue with driving high-res (>= 5k) displays at >= 120 Hz over TB4 isn't that compression per se is needed (since Apple is already fine with compression); instead, it's a question of how much compression Apple would need to use to drive those with a single TB4 cable, and whether Apple would deem that degree of compression to be visually lossless.

See joevt's posts starting here. Hopefully my own confusion about this is not too distracting!
And for an explanation of how uncompressed dual-tile HBR3 works, see this post by joevt.
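The arithmetic checks out, too (active pixels only, ignoring blanking overhead):

#include <stdio.h>

int main(void)
{
    /* XDR: 6016 x 3384 @ 60 Hz. */
    double px_rate = 6016.0 * 3384.0 * 60.0;                        /* ~1.22 Gpx/s */
    printf("uncompressed 36 bpp: %.1f Gb/s\n", px_rate * 36 / 1e9); /* ~44.0 */
    printf("DSC at 12 bpp:       %.1f Gb/s\n", px_rate * 12 / 1e9); /* ~14.7 */

    /* Single-tile HBR2: 4 lanes x 5.4 Gb/s x 8b/10b coding. */
    printf("HBR2 payload:        %.2f Gb/s\n", 4 * 5.4 * 0.8);      /* 17.28 */
    return 0;
}

So the 12 bpp DSC stream fits a single HBR2 tile with room to spare, which matches what joevt describes.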
 
I think the question there is: Are the use cases enough to want >40Gbps?
I know I have somewhat different needs than other professionals using their Macs, but for me TB3/4 is for: Ethernet, USB input devices and audio output, a monitor, and an iPhone/Android device being debugged. I'm not missing out on much in my use cases if I don't get TB5 at the moment, except for the ability to daisy-chain hi-res displays, which I don't do. I go the "single large display for focus" approach.

My understanding, though, is that generally it's display and NVMe bandwidth that are keeping up with the bus speed increases these days.

I could see a TB5 docking station being useful for using Logic on an iPad Pro: external monitor, external RAID, audio I/O interface...?
 

There’s a lot of vagueness in this hypothetical, but this is something TB3 is already pretty decent at today, and the setup on your RAID will matter a lot here, along with the specific monitor you intend to connect as well.

That said, this still comes back to my original question: are the use cases enough to want this much bandwidth? Is someone actually going to be trying to replace their desktop in a studio with an iPad Pro? Can it actually do the job in its current state (I’d argue it can’t, but that’s another topic)? In the scenario you give, that’s a lot of equipment to carry around, so it makes more sense inside a studio. But is someone going to be trying to use a single device for everything, or are they more likely to have a Mac Mini or Studio with the rest of the hardware?

I ask in part because how people will actually work drives what will actually be needed.

TB5 would be great, but I think we’ve reached the point where the use cases start getting more niche, at least in the short term. And so those niche cases seem less likely to be a concern for the base models of AS.
 
I agree about this particular scenario - you could push TB3 too far doing audio, but it would take a pretty big job if you're not using a giant (6k) screen.

However, TB5 isn't that niche - or it won't be, if the cost isn't extreme (it probably will be, sigh, even the cables). Nearly all PCIe 4/5 x4 NVMe SSDs exceed the bandwidth of TB3/4 by 2-4x. Nobody likes leaving that kind of performance on the table, and more and more workloads today actually demand it.
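Rough numbers (per-lane PCIe payload rates after encoding, and the ~22 Gb/s PCIe-tunneling cap TB3/4 sees in practice):

#include <stdio.h>

int main(void)
{
    /* PCIe payload per lane: gen4 ~1.97 GB/s, gen5 ~3.94 GB/s.
     * TB3/4 tunnels PCIe at roughly 22 Gb/s (~2.8 GB/s) of its 40 Gb/s. */
    double gen4_x4 = 4 * 1.97, gen5_x4 = 4 * 3.94, tb = 2.8;
    printf("PCIe 4.0 x4: %.1f GB/s (%.1fx TB3/4)\n", gen4_x4, gen4_x4 / tb);
    printf("PCIe 5.0 x4: %.1f GB/s (%.1fx TB3/4)\n", gen5_x4, gen5_x4 / tb);
    return 0;
}

Real drives don't always saturate the link, hence the 2-4x in practice rather than the theoretical ceiling.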
 