WWDC 2024


dada_dave

Elite Member
Posts
2,440
Reaction score
2,469
I liked the WWDC overall. Their approach to ML sounds very reasonable: specialize, optimize, and don't overdo it. It's very Apple-like, and I can see it delivering practical value. I am particularly impressed by the cloud architecture they have presented, which seems like a great way forward for private computing.

Some more info from Apple:



I also liked image generation. It's very basic yet practical for the use cases they presented it for. They can achieve good performance and acceptable quality by limiting the diffusion model. I can definitely see using it for my presentations and in chats. Overall, I really like their ML design with semantic index and app-provided info. It is much more scalable than Microsoft's "let's record the video of the screen and do a search on that" stuff.

The software updates were mostly meh. For me, the winners are the new Notes capabilities and the Passwords app. New ML-enabled Safari functionality also sounds interesting, but I'd like to play with it first.

I also expected large updates to Metal this year, which did not arrive. There are some quality-of-life improvements for resource management that fix some friction points with other APIs. They also have this new device memory coherency feature, but I am not quite sure yet how it works.
“Device memory coherency” sounds interesting. What is it, generally speaking?
 

dada_dave

Unfortunately, he is far from an idiot.
Megalomaniac Racist A**hole - definitely.
Perhaps fool is a better description than idiot though ... ehhh ... idiot still kinda works. Maybe not in all things, but overall? He's definitely a fool.
 

leman

Site Champ
Posts
722
Reaction score
1,374
“Device memory coherency” sounds interesting. What is it, generally speaking?

Until now, the Metal memory model only offered work coordination for threads within a single threadgroup (that means up to 1024 threads). If you needed to synchronize access to data between different threadgroups you were pretty much out of luck. There was no way to ensure that threads from a second threadgroup would see writes to the data in the correct order, or at all. The only way to synchronize was to launch separate kernels with a memory barrier in between.

If I understand the new features correctly, they give you a global memory fence and a mechanism to synchronize reads and writes to memory. This should potentially enable a producer/consumer relationship between threadgroups and allow more advanced algorithms. This has been a point of criticism for a while now, so it's great that they are addressing it. I am not 100% certain how it works in detail, though.
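
To make the producer/consumer idea concrete, here is a minimal sketch in plain C++ rather than Metal. It is a CPU analogy under stated assumptions: two `std::thread`s stand in for two threadgroups, and a release/acquire pair on an atomic flag plays the role of the global fence. The function name `produce_then_consume` is hypothetical, and nothing here is the actual Metal API.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// CPU analogy only: two std::threads stand in for two threadgroups.
// `payload` plays the role of ordinary device memory, `ready` is the flag
// that both sides agree on. All names are hypothetical.
int produce_then_consume(int value) {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // publication flag
    int seen = -1;

    std::thread producer([&] {
        payload = value;                               // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish; the write above is ordered before the flag
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {  // 3. wait for the flag
            std::this_thread::yield();
        }
        seen = payload;  // 4. the acquire load makes the producer's write visible here
    });

    producer.join();
    consumer.join();
    assert(seen == value);  // holds because of the release/acquire pairing
    return seen;
}
```

Without the release/acquire ordering (or an equivalent fence), the consumer could observe the flag without observing the data, which is the "in correct order or at all" problem described above.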
 

dada_dave

Until now, the Metal memory model only offered work coordination for threads within a single threadgroup (that means up to 1024 threads). If you needed to synchronize access to data between different threadgroups you were pretty much out of luck. There was no way to ensure that threads from a second threadgroup would see writes to the data in the correct order, or at all. The only way to synchronize was to launch separate kernels with a memory barrier in between.

If I understand the new features correctly, they give you a global memory fence and a mechanism to synchronize reads and writes to memory. This should potentially enable a producer/consumer relationship between threadgroups and allow more advanced algorithms. This has been a point of criticism for a while now, so it's great that they are addressing it. I am not 100% certain how it works in detail, though.
Nice, okay, so that sounds like it enables an analog of the cooperative kernels in Nvidia's CUDA. (Apologies, I know I'm always bringing up CUDA, but that's my relevant frame of reference for this sort of thing.)


Unfortunately it only briefly mentions the kernel case (allowing synchronization across blocks - what is called thread groups here) at the very end. They talk a little bit more about it here:


Is this an M3/M4 thing or are they enabling across all their GPUs?
 

Citysnaps

Elite Member
Staff Member
Site Donor
Posts
3,895
Reaction score
9,514
Main Camera
iPhone
This is interesting and pretty neat... Today Blackmagic Design announced a new camera with a dual-lens system that supports Apple Immersive Video, for creating 3D movies for Apple's AVP headset. They also announced a new version of DaVinci Resolve to support editing video from the camera.

That should help kickstart the creation of some really outstanding 3D videos. As soon as someone creates an immersive 3D video of the prehistoric cave paintings in Lascaux, France, letting me walk through the caves and see paintings on cave walls that go back 15,000 years, or lets me run a San Francisco Marathon with 20,000 other people, I'll be ready for an AVP. :)

Check it out:
 

Jimmyjames

Site Champ
Posts
867
Reaction score
999
Nice, okay, so that sounds like it enables an analog of the cooperative kernels in Nvidia's CUDA. (Apologies, I know I'm always bringing up CUDA, but that's my relevant frame of reference for this sort of thing.)


Unfortunately it only briefly mentions the kernel case (allowing synchronization across blocks - what is called thread groups here) at the very end. They talk a little bit more about it here:


Is this an M3/M4 thing or are they enabling across all their GPUs?
I suppose there is no sign of thread forward progress guarantees?
 

dada_dave

I suppose there is no sign of thread forward progress guarantees?
I haven't been looking, but I'd be a bit surprised. That would require big enough changes to the GPU hardware that I feel like Apple would've made a big deal of it during the M4 iPad reveal, rather than just saying that the M4 GPU is based on the M3. You have to change the way threads are scheduled.
 

dada_dave

That is what I am wondering as well. Algorithms like decoupled look-back require both global synchronization and kernel-level parallel forward progress.
That can be added without forward progress guarantees at the thread level - for instance Pascal had that, but forward progress guarantees at the thread level weren’t added until Volta.
 

leman

That can be added without forward progress guarantees at the thread level - for instance Pascal had that, but forward progress guarantees at the thread level weren’t added until Volta.

Lack of parallel forward progress at the thread level is usually not a problem, more of a slight annoyance or source of bugs for some algorithms. What’s more important is parallel progress at the kernel level, that is, a guarantee that every threadgroup that started execution will eventually continue it. Without this, it is not possible to establish more complex relationships between the threadgroups, as you could have a threadgroup infinitely blocked by another one.
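
To illustrate why that guarantee matters, here is a minimal CPU-side sketch in C++ of a chained single-pass scan, a simpler relative of decoupled look-back. It is not Metal code: each `std::thread` stands in for one threadgroup owning one tile, and the names `chained_scan`, `tile_prefix` and `done` are hypothetical. The spin loop in step 2 only terminates because the operating system keeps scheduling the thread that owns the previous tile, which is the CPU counterpart of kernel-level forward progress.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// CPU analogy only: each std::thread plays the role of one threadgroup
// that owns one tile of the input. All names here are hypothetical.
std::vector<int> chained_scan(const std::vector<int>& in, std::size_t tile) {
    assert(tile > 0);
    const std::size_t n = in.size();
    const std::size_t tiles = (n + tile - 1) / tile;

    std::vector<int> out(n);
    std::vector<int> tile_prefix(tiles, 0);      // running total through tile t
    std::vector<std::atomic<bool>> done(tiles);  // "tile t has published" flags
    for (auto& flag : done) flag.store(false);

    auto worker = [&](std::size_t t) {
        const std::size_t begin = t * tile;
        const std::size_t end = std::min(begin + tile, n);

        // 1. Scan the tile locally; no communication needed.
        int running = 0;
        for (std::size_t i = begin; i < end; ++i) {
            running += in[i];
            out[i] = running;
        }

        // 2. Wait for the previous tile's total. This loop only ends if the
        //    "threadgroup" owning tile t-1 keeps making progress.
        int offset = 0;
        if (t > 0) {
            while (!done[t - 1].load(std::memory_order_acquire)) {
                std::this_thread::yield();
            }
            offset = tile_prefix[t - 1];
        }

        // 3. Publish this tile's total for the next tile, then fix up outputs.
        tile_prefix[t] = offset + running;
        done[t].store(true, std::memory_order_release);
        for (std::size_t i = begin; i < end; ++i) out[i] += offset;
    };

    // Launch later tiles first so that some waiting is likely to occur.
    std::vector<std::thread> pool;
    for (std::size_t t = tiles; t-- > 0;) pool.emplace_back(worker, t);
    for (auto& th : pool) th.join();
    return out;
}
```

If the scheduler were allowed to park the owner of tile `t-1` forever while tile `t` spins, the scan would hang; that is the "threadgroup infinitely blocked by another one" scenario.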

Edit: Metal guarantees concurrent forward progress for simdgroups in the same threadgroup. One can build quite a lot of concurrent algorithms using this property, it just doesn't scale beyond a single GPU core.
 

dada_dave

Lack of parallel forward progress at the thread level is usually not a problem, more of a slight annoyance or source of bugs for some algorithms. What’s more important is parallel progress at the kernel level, that is, a guarantee that every threadgroup that started execution will eventually continue it. Without this, it is not possible to establish more complex relationships between the threadgroups, as you could have a threadgroup infinitely blocked by another one.
I disagree. Without forward progress at the thread level you cannot have fine-grained control over the progress of your algorithm. You are limited to the SIMD group as the unit of parallelism, but much of the GPU's speedup comes from the parallel execution of the threads within a SIMD group - that’s a factor of 32 for both Nvidia and Apple GPUs! Without it you have to dedicate an entire SIMD group to a single control flow, which is great for avoiding divergence but not so great for overall performance, and it makes a huge class of algorithms simply out of reach or hellishly complicated. That’s what Bryce’s and Olivier’s talks focused on (Olivier is now at Apple).

In fact I think you have it reversed: being able to coordinate among thread groups/blocks is nice to have. It enables things like single-pass scans, which are faster than the standard two-pass scans. It means that you sometimes don’t have to launch multiple kernels and can reduce your overhead - that’s the example given by Nvidia. Certain atomics become available and, perhaps more germane to Nvidia than Apple, multiple GPUs can be coordinated more easily. So there’s good reason why developers should care, I agree. But fundamentally these are the same class of algorithms as before. Without forward progress at the thread level, though, huge classes of algorithms, basically anything requiring complex synchronization through mutexes and locks, like hash tables and linked lists, are simply out of practical reach on the GPU (other use cases as well). They are either incredibly cumbersome to program or simply not performant enough to warrant their use, as you’re forced into lock-free or SIMD-group-aware versions. I’ll see if I can track down Bryce’s talk, and there was another guy as well who goes over the “why you should care”. But if memory serves, Olivier’s talk linked to by @Jimmyjames does a really decent job of explaining it as well, emphasizing just how difficult an engineering challenge it was to achieve and why they bothered to do it.

Now I’ll admit that for my own work I don’t make use of this mutex capability (yet), but I almost certainly rely (through a library) on the single-pass scan technique. But the CPU version of one of the algorithms I’d love to work on is filled with these kinds of control flows, mostly in its use of hash tables. Now I believe people had worked on hash tables prior to this development, but either Bryce’s or Olivier’s talk (or both) explains why this new system is so very much nicer for that in particular.
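
For readers less familiar with the pattern, the kind of lock-based structure meant here can be sketched on the CPU in C++ as follows. This is only an illustration under stated assumptions: `SpinLock`, `LockedHashSet` and `count_unique_inserts` are hypothetical names, the bucket count is arbitrary, and none of this is GPU code. Each thread takes a per-bucket spin lock before touching the bucket.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Minimal spin lock; usable with std::lock_guard because it has lock()/unlock().
class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // Spin until we are the thread that flips the flag from false to true.
        while (locked.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};

// Hash set with one lock per bucket (hypothetical, fixed at 8 buckets).
class LockedHashSet {
    struct Bucket {
        SpinLock lock;
        std::vector<int> keys;
    };
    std::array<Bucket, 8> buckets;
public:
    // Returns true if `key` was newly inserted, false if it was already present.
    bool insert(int key) {
        Bucket& b = buckets[static_cast<std::size_t>(key) % buckets.size()];
        std::lock_guard<SpinLock> guard(b.lock);  // critical section starts here
        for (int k : b.keys) {
            if (k == key) return false;
        }
        b.keys.push_back(key);
        return true;
    }
};

// Every thread tries to insert the same keys; each key must win exactly once.
std::size_t count_unique_inserts(int threads, int keys_per_thread) {
    assert(threads > 0 && keys_per_thread >= 0);
    LockedHashSet set;
    std::atomic<std::size_t> inserted{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int k = 0; k < keys_per_thread; ++k) {
                if (set.insert(k)) inserted.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();
    return inserted.load();
}
```

On a CPU this works because the thread holding the lock is always allowed to run and release it. If the spinning threads and the lock holder were lanes of the same SIMD group without thread-level forward progress, the spinners could be executed indefinitely while the holder never gets to call `unlock()`, which is why such code is impractical on GPUs lacking that guarantee.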

Edit: Metal guarantees concurrent forward progress for simdgroups in the same threadgroup. One can build quite a lot of concurrent algorithms using this property, it just doesn't scale beyond a single GPU core.
I’m a little confused. That should’ve been the case before? Or is that the case now? Maybe I have my terminology mixed up: SIMD group = warp, thread group = block, multiple thread groups/blocks in a kernel. Yes? If I have my terminology correct forward progress amongst SIMD groups in a thread group should have always been the case. I’m sorry if I’m not following, woke up at 5 in the morning not feeling great.
 

dada_dave

iPhone on the Mac is an interesting way to get around a developer that refuses to bring their app to the Mac. 🙃
Speaking of … I saw this about the “real reason more iOS apps are unavailable on macOS” but not being an actual developer I wasn’t sure what they were talking about:


@Nycturne @Andropov can you guys explain?
 

Jimmyjames

Speaking of … I saw this about the “real reason more iOS apps are unavailable on macOS” but not being an actual developer I wasn’t sure what they were talking about:


@Nycturne @Andropov can you guys explain?
Not either of those two obviously, but I’d guess they are saying some automated tests are being run that do a very stupid check for an app’s vulnerability to piracy or running on a jailbroken iPhone or something similar. The test flags their apps as susceptible and they refuse to publish the app on the Mac App Store as a result, despite the fact that their app isn’t actually threatened by the existence of /bin/bash etc.
 

dada_dave

Not either of those two obviously, but I’d guess they are saying some automated tests are being run that do a very stupid check for an app’s vulnerability to piracy or running on a jailbroken iPhone or something similar. The test flags their apps as susceptible and they refuse to publish the app on the Mac App Store as a result, despite the fact that their app isn’t actually threatened by the existence of /bin/bash etc.
That’s the sense I get too, but I don’t understand why or the history of what’s going on. Like is this an Apple guideline? … why does this fail for the macOS version? … is this a particular popular CI system or any of them?
 

Jimmyjames

That’s the sense I get too, but I don’t understand why or the history of what’s going on. Like is this an Apple guideline? … why does this fail for the macOS version? … is this a particular popular CI system or any of them?
I guess there is a third party tool that does these checks?
 

leman

I disagree. Without forward progress at the thread level you cannot have fine-grained control over the progress of your algorithm. You are limited to the SIMD group as the unit of parallelism, but much of the GPU's speedup comes from the parallel execution of the threads within a SIMD group - that’s a factor of 32 for both Nvidia and Apple GPUs! Without it you have to dedicate an entire SIMD group to a single control flow, which is great for avoiding divergence but not so great for overall performance, and it makes a huge class of algorithms simply out of reach or hellishly complicated. That’s what Bryce’s and Olivier’s talks focused on (Olivier is now at Apple).

In fact I think you have it reversed: being able to coordinate among thread groups/blocks is nice to have. It enables things like single-pass scans, which are faster than the standard two-pass scans. It means that you sometimes don’t have to launch multiple kernels and can reduce your overhead - that’s the example given by Nvidia. Certain atomics become available and, perhaps more germane to Nvidia than Apple, multiple GPUs can be coordinated more easily. So there’s good reason why developers should care, I agree. But fundamentally these are the same class of algorithms as before. Without forward progress at the thread level, though, huge classes of algorithms, basically anything requiring complex synchronization through mutexes and locks, like hash tables and linked lists, are simply out of practical reach on the GPU (other use cases as well). They are either incredibly cumbersome to program or simply not performant enough to warrant their use, as you’re forced into lock-free or SIMD-group-aware versions. I’ll see if I can track down Bryce’s talk, and there was another guy as well who goes over the “why you should care”. But if memory serves, Olivier’s talk linked to by @Jimmyjames does a really decent job of explaining it as well, emphasizing just how difficult an engineering challenge it was to achieve and why they bothered to do it.

Now I’ll admit that for my own work I don’t make use of this mutex capability (yet), but I almost certainly rely (through a library) on the single-pass scan technique. But the CPU version of one of the algorithms I’d love to work on is filled with these kinds of control flows, mostly in its use of hash tables. Now I believe people had worked on hash tables prior to this development, but either Bryce’s or Olivier’s talk (or both) explains why this new system is so very much nicer for that in particular.

Ah, I see what you mean. I must admit that I don't have enough experience with advanced data structures on GPUs to have an informed opinion. The most complex thing I did was a radix sort kernel that uses SIMD-wide data synchronization to very quickly order keys across a SIMD, and that worked very well on Apple hardware without within-SIMD locking. I certainly agree that having the ability to serialize threads within a single SIMD is useful.
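
The post does not say how that radix sort was implemented, so the following is only an illustration of one common lock-free way to order keys across a SIMD group: a ballot followed by a population count. It is a CPU emulation in C++ where a loop over 32 lanes stands in for the SIMD group; `ballot`, `split_by_bit` and `simd_radix_sort` are hypothetical names. On real hardware the ballot step would correspond to something like Metal's `simd_ballot` or CUDA's `__ballot_sync`.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstdint>

constexpr int kLanes = 32;  // SIMD width on both Apple and Nvidia GPUs
using Keys = std::array<std::uint32_t, kLanes>;

// Emulated SIMD-wide ballot: bit i is set when lane i's key has `bit` set.
std::uint32_t ballot(const Keys& keys, int bit) {
    std::uint32_t mask = 0;
    for (int lane = 0; lane < kLanes; ++lane) {
        if ((keys[lane] >> bit) & 1u) mask |= (1u << lane);
    }
    return mask;
}

int popcount32(std::uint32_t x) {
    return static_cast<int>(std::bitset<32>(x).count());
}

// Stable split on one bit: keys with a 0 keep their order and go first,
// keys with a 1 keep their order and follow. Each lane computes its own
// destination from the shared ballot mask, so no locking is needed.
Keys split_by_bit(const Keys& keys, int bit) {
    const std::uint32_t ones = ballot(keys, bit);
    const std::uint32_t zeros = ~ones;
    const int total_zeros = popcount32(zeros);

    Keys out{};
    for (int lane = 0; lane < kLanes; ++lane) {
        const std::uint32_t below = (1u << lane) - 1u;  // lanes with a lower index
        const bool is_one = (ones >> lane) & 1u;
        const int dest = is_one ? total_zeros + popcount32(ones & below)
                                : popcount32(zeros & below);
        out[dest] = keys[lane];
    }
    return out;
}

// Least-significant-digit radix sort over the low `bits` bits, one bit per pass.
Keys simd_radix_sort(Keys keys, int bits) {
    assert(bits >= 0 && bits <= 32);
    for (int bit = 0; bit < bits; ++bit) keys = split_by_bit(keys, bit);
    return keys;
}
```

Because every lane derives its output slot purely from the ballot mask, the lanes never have to wait on each other, which is consistent with this approach working without within-SIMD locking.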

I’m a little confused. That should’ve been the case before? Or is that the case now? Maybe I have my terminology mixed up: SIMD group = warp, thread group = block, multiple thread groups/blocks in a kernel. Yes? If I have my terminology correct forward progress amongst SIMD groups in a thread group should have always been the case. I’m sorry if I’m not following, woke up at 5 in the morning not feeling great.

No, no, you are correct, it was always the case. I was just mentioning this for completeness.
 

dada_dave

He’s an idiot.

"Musk warns that he will ban Apple devices if OpenAI is integrated at operating system level". From what exactly... his car that already doesn't have it? Twitter, a company that he's essentially killed anyway? It's like how he caters to all the right-wing nutjobs who will never buy one of his cars. He is his own worst enemy.

Elon Musk drops lawsuit against OpenAI, Sam Altman one day after criticizing Apple for using ChatGPT


Naturally he doesn’t want his claims actually tested in court.
 
