WWDC 2024

dada_dave · Jun 11, 2024

leman said:
I liked the WWDC overall. Their approach to ML sounds very reasonable: specialize, optimize, and don't overdo it. It's very Apple-like, and I can see it delivering practical value. I am particularly impressed by the cloud architecture they have presented, which seems like a great way forward for private computing.

Some more info from Apple:

Private Cloud Compute: A new frontier for AI privacy in the cloud - Apple Security Research

Secure and private AI processing in the cloud poses a formidable new challenge. To support advanced features of Apple Intelligence with larger foundation models, we created Private Cloud Compute (PCC), a groundbreaking cloud intelligence system designed specifically for private AI processing...

security.apple.com

leman said:
I also liked image generation. It's very basic yet practical for the use cases they presented it for. They can achieve good performance and acceptable quality by limiting the diffusion model. I can definitely see using it for my presentations and in chats. Overall, I really like their ML design with semantic index and app-provided info. It is much more scalable than Microsoft's "let's record the video of the screen and do a search on that" stuff.

The software updates were mostly meh. For me, the winners are the new Notes capabilities and the Passwords app. New ML-enabled Safari functionality also sounds interesting, but I'd like to play with it first.

I also expected large updates to Metal this year, which did not arrive. There are some quality-of-life improvements for resource management that fix some friction points with other APIs. They also have this new device memory coherency feature, but I am not quite sure yet how it works.

“Device memory coherency” sounds interesting. What is it, generally?

cbum · Jun 11, 2024

Cmaier said:
he’s an idiot.

Unfortunately, he is far from an idiot.
Megalomaniac Racist A**hole - definitely.

Citysnaps · Jun 11, 2024

AAPL is up 6% and at a new high.

dada_dave · Jun 11, 2024

cbum said:
Unfortunately, he is far from an idiot.
Megalomaniac Racist A**hole - definitely.

Perhaps fool is a better description than idiot though ... ehhh ... idiot still kinda works. Maybe not in all things, but overall? He's definitely a fool.

leman · Jun 11, 2024

dada_dave said:
“Device memory coherency” sounds interesting. What is it, generally?

Until now, the Metal memory model only offered work coordination for threads within a single threadgroup (that means up to 1024 threads). If you needed to synchronize access to data between different threadgroups, you were pretty much out of luck. There was no way to ensure that threads from a second threadgroup would see writes to the data in the correct order, or at all. The only way to synchronize was to launch separate kernels with a memory barrier in between.

If I understand the new features correctly, they give you a global memory fence and a mechanism to synchronize reads and writes to memory. This should potentially enable a producer/consumer relationship between threadgroups and allow more advanced algorithms. This has been a point of criticism for a while now, so it's great that they are addressing it. Although I am not 100% certain how it works in detail.
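
To make the producer/consumer idea concrete, here is a minimal CPU-side sketch in C++. It is an analogy only, not the Metal API: two `std::thread`s stand in for two threadgroups, `payload` stands in for device memory, and the release/acquire pair on `ready` plays the role of the global fence. The function name `produce_then_consume` is made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// CPU analogy only: each std::thread stands in for one GPU threadgroup.
// Returns the value the consumer observed in the shared payload.
int produce_then_consume(int value) {
    int payload = 0;                 // plain, non-atomic shared data
    std::atomic<bool> ready{false};  // flag the consumer polls
    int observed = -1;

    std::thread producer([&] {
        payload = value;                               // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish it; the write above is ordered before the flag
    });

    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {  // 3. wait for the flag
            std::this_thread::yield();
        }
        observed = payload;  // 4. the acquire load makes the producer's write visible here
    });

    producer.join();
    consumer.join();
    assert(ready.load());  // both threads finished, so the flag must be set
    return observed;
}
```

Without the release/acquire ordering, the consumer could see the flag before the data, which is the "in correct order or at all" problem described above.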

dada_dave · Jun 11, 2024

leman said:
Until now, the Metal memory model only offered work coordination for threads within a single threadgroup (that means up to 1024 threads). If you needed to synchronize access to data between different threadgroups, you were pretty much out of luck. There was no way to ensure that threads from a second threadgroup would see writes to the data in the correct order, or at all. The only way to synchronize was to launch separate kernels with a memory barrier in between.

If I understand the new features correctly, they give you a global memory fence and a mechanism to synchronize reads and writes to memory. This should potentially enable a producer/consumer relationship between threadgroups and allow more advanced algorithms. This has been a point of criticism for a while now, so it's great that they are addressing it. Although I am not 100% certain how it works in detail.

Nice, okay, so that sounds like it enables an analog of the cooperative kernels in Nvidia's CUDA. (Apologies, I know I'm always bringing up CUDA, but that's my relevant frame of reference for this sort of thing.)

Cooperative Groups: Flexible CUDA Thread Programming | NVIDIA Technical Blog

In efficient parallel algorithms, threads cooperate and share data to perform collective computations. To share data, the threads must synchronize. The granularity of sharing varies from algorithm to…

developer.nvidia.com

Unfortunately it only briefly mentions the kernel case (allowing synchronization across blocks - what is called thread groups here) at the very end. They talk a little bit more about it here:

Recent posts for: “CUDA”

News and tutorials for developers, scientists, and IT admins

developer.nvidia.com

Is this an M3/M4 thing or are they enabling it across all their GPUs?

Citysnaps · Jun 11, 2024

This is interesting and pretty neat... Today Blackmagic Design announced a new camera with a dual lens system that supports Apple Immersive Video for creating 3D movies for Apple's AVP headset. They also announced a new version of DaVinci Resolve to support editing video from the camera.

That should help kickstart the creation of some really outstanding 3D videos. As soon as someone creates an immersive 3D video of the prehistoric cave paintings in Lascaux, France, letting me walk through the caves and see paintings on cave walls that go back 15,000 years, or lets me run a San Francisco Marathon with 20,000 other people, I'll be ready for an AVP.

Check it out:

Media | Blackmagic Design

Jimmyjames · Jun 11, 2024

dada_dave said:
Nice, okay, so that sounds like it enables an analog of the cooperative kernels in Nvidia's CUDA. (Apologies, I know I'm always bringing up CUDA, but that's my relevant frame of reference for this sort of thing.)

Cooperative Groups: Flexible CUDA Thread Programming | NVIDIA Technical Blog

In efficient parallel algorithms, threads cooperate and share data to perform collective computations. To share data, the threads must synchronize. The granularity of sharing varies from algorithm to…

developer.nvidia.com

Unfortunately it only briefly mentions the kernel case (allowing synchronization across blocks - what is called thread groups here) at the very end. They talk a little bit more about it here:

Recent posts for: “CUDA”

News and tutorials for developers, scientists, and IT admins

developer.nvidia.com

Is this an M3/M4 thing or are they enabling it across all their GPUs?

I suppose there is no sign of thread forward progress guarantees?

dada_dave · Jun 11, 2024

Jimmyjames said:
I suppose there is no sign of thread forward progress guarantees?

I haven't been looking but I'd be a bit surprised. That would require big enough changes to the GPU hardware that I feel like that would've been made a big deal out of during the M4 iPad reveal - that they would've said something more than the M4 GPU is based on the M3. You have to change the way threads are scheduled.

leman · Jun 11, 2024

Jimmyjames said:
I suppose there is no sign of thread forward progress guarantees?

That is what I am wondering as well. Algorithms like decoupled look-back require both global synchronization and kernel-level parallel forward progress.

dada_dave · Jun 11, 2024

leman said:
That is what I am wondering as well. Algorithms like decoupled look-back require both global synchronization and kernel-level parallel forward progress.

That can be added without forward progress guarantees at the thread level. For instance, Pascal had that, but forward progress guarantees at the thread level weren’t added until Volta.

leman · Jun 12, 2024

dada_dave said:
That can be added without forward progress guarantees at the thread level. For instance, Pascal had that, but forward progress guarantees at the thread level weren’t added until Volta.

Lack of parallel forward progress at the thread level is usually not a problem, more of a slight annoyance or source of bugs for some algorithms. What’s more important is parallel progress at the kernel level, that is, a guarantee that every threadgroup that started execution will eventually continue it. Without this, it is not possible to establish more complex relationships between the threadgroups, as you could have a threadgroup infinitely blocked by another one.

Edit: Metal guarantees concurrent forward progress for simdgroups in the same threadgroup. One can build quite a lot of concurrent algorithms using this property, it just doesn't scale beyond a single GPU core.
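
To illustrate why kernel-level forward progress matters, here is a simplified chained scan in C++. It is a CPU analogy and a deliberately reduced version of the idea, not the full decoupled look-back algorithm and not Metal code: each `std::thread` stands in for one threadgroup, and each tile spin-waits on its predecessor's published prefix. The names `chained_scan`, `tile_prefix`, and `done` are made up for this sketch.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Inclusive prefix sum computed in one pass; one std::thread per tile.
std::vector<long> chained_scan(const std::vector<long>& in, std::size_t tile_size) {
    assert(tile_size > 0);
    const std::size_t n = in.size();
    const std::size_t tiles = (n + tile_size - 1) / tile_size;
    std::vector<long> out(n);
    std::vector<long> tile_prefix(tiles, 0);     // inclusive prefix at the end of each tile
    std::vector<std::atomic<bool>> done(tiles);  // "prefix published" flags
    for (auto& d : done) d.store(false, std::memory_order_relaxed);

    auto worker = [&](std::size_t t) {
        const std::size_t begin = t * tile_size;
        const std::size_t end = std::min(begin + tile_size, n);

        // Phase 1: local reduction, independent of every other tile.
        const long local = std::accumulate(in.begin() + begin, in.begin() + end, 0L);

        // Phase 2: look back at the predecessor. This spin only terminates
        // if the predecessor is guaranteed to keep making progress.
        long carry = 0;
        if (t > 0) {
            while (!done[t - 1].load(std::memory_order_acquire)) {
                std::this_thread::yield();
            }
            carry = tile_prefix[t - 1];
        }
        tile_prefix[t] = carry + local;
        done[t].store(true, std::memory_order_release);  // publish for the successor

        // Phase 3: write this tile's final results.
        long running = carry;
        for (std::size_t i = begin; i < end; ++i) {
            running += in[i];
            out[i] = running;
        }
    };

    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < tiles; ++t) pool.emplace_back(worker, t);
    for (auto& th : pool) th.join();
    return out;
}
```

On a CPU the operating system keeps every started thread moving, so the spin in phase 2 always ends. If a tile could be started and then starved while its successor spins, the same loop would hang, which is the "infinitely blocked" case described above.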

dada_dave · Jun 12, 2024

leman said:
Lack of parallel forward progress at the thread level is usually not a problem, more of a slight annoyance or source of bugs for some algorithms. What’s more important is parallel progress at the kernel level, that is, a guarantee that every threadgroup that started execution will eventually continue it. Without this, it is not possible to establish more complex relationships between the threadgroups, as you could have a threadgroup infinitely blocked by another one.

I disagree. Without forward progress at the thread level you cannot have fine-grained control over the progress of your algorithm. You are limited to the SIMD group as the unit of parallelism, but much of the GPU’s speedup comes from running the threads within a SIMD group in parallel - that’s a factor of 32 for both Nvidia and Apple GPUs! Without that, you have to dedicate an entire SIMD group to a single control flow, which is great for divergence but not so great for overall performance, and it leaves a huge class of algorithms simply out of reach or hellishly complicated. That’s what Bryce’s and Olivier’s talks focused on (Olivier is now at Apple).

In fact I think you have it reversed: being able to coordinate amongst threadgroups is nice to have. It enables things like single-pass scans, which are faster than the standard two-pass scans. It means that you sometimes don’t have to launch multiple kernels and can reduce your overhead - that’s the example given by Nvidia. Certain atomics become available and, perhaps more germane to Nvidia than Apple, multiple GPUs can be coordinated more easily. There’s good reason why developers should care, I agree. But fundamentally they are the same class of algorithms as before. Without forward progress at the thread level, though, huge classes of algorithms, basically anything requiring complex synchronization through mutexes and locks, like hash tables and linked lists, are simply out of practical reach on the GPU (other use cases as well). They are either incredibly cumbersome to program or simply not performant enough to warrant their use, as you’re forced into lock-free or SIMD-group-aware versions. I’ll see if I can track down Bryce’s talk, and there was another guy as well who goes over the “why you should care”. But if memory serves, Olivier’s talk linked to by @Jimmyjames does a really decent job of explaining it as well, and of emphasizing just how difficult an engineering challenge it was to achieve and why they bothered to do it.

Now I’ll admit that for my own work I don’t make use of this mutex capability (yet), but I almost certainly rely (through a library) on the single-pass scan technique. But the CPU version of one of the algorithms I’d love to work on is filled with these kinds of control flows, mostly in the use of hash tables. Now I believe people had worked on hash tables prior to this development, but it’s either Bryce’s or Olivier’s talk, or both, that explains why this new system is so very much nicer for that in particular.
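
For readers who want to see the kind of control flow meant here, this is a minimal CPU-side sketch in C++ of a hash table with one spin lock per bucket. It is an illustration under assumptions, not code from any of the talks and not GPU code; `SpinLock`, `LockedTable`, and `fill_concurrently` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Minimal spin lock: every waiting thread loops until the holder releases.
class SpinLock {
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Chained hash table with one lock per bucket.
struct LockedTable {
    explicit LockedTable(std::size_t buckets) : locks(buckets), chains(buckets) {
        assert(buckets > 0);
    }
    void insert(unsigned key) {
        const std::size_t b = key % chains.size();
        locks[b].lock();  // only one thread may touch this bucket at a time
        chains[b].push_back(key);
        locks[b].unlock();
    }
    std::size_t size() const {
        std::size_t total = 0;
        for (const auto& c : chains) total += c.size();
        return total;
    }
    std::vector<SpinLock> locks;
    std::vector<std::vector<unsigned>> chains;
};

// Every thread inserts its own range of keys; returns the number stored.
std::size_t fill_concurrently(unsigned threads, unsigned keys_per_thread,
                              std::size_t buckets) {
    LockedTable table(buckets);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&table, t, keys_per_thread] {
            for (unsigned i = 0; i < keys_per_thread; ++i) {
                table.insert(t * keys_per_thread + i);
            }
        });
    }
    for (auto& th : pool) th.join();
    return table.size();
}
```

On a CPU each thread that holds a lock keeps running, so the waiters eventually get in. The concern on a GPU without thread-level forward progress is that threads in the same SIMD group run in lockstep, so waiters spinning in `lock()` can prevent the holder in the same group from ever reaching `unlock()`.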

leman said:
Edit: Metal guarantees concurrent forward progress for simdgroups in the same threadgroup. One can build quite a lot of concurrent algorithms using this property, it just doesn't scale beyond a single GPU core.

I’m a little confused. That should’ve been the case before? Or is that the case now? Maybe I have my terminology mixed up: SIMD group = warp, thread group = block, multiple thread groups/blocks in a kernel. Yes? If I have my terminology correct forward progress amongst SIMD groups in a thread group should have always been the case. I’m sorry if I’m not following, woke up at 5 in the morning not feeling great.

dada_dave · Jun 12, 2024

dada_dave said:
iPhone on the Mac is an interesting way to get around a developer that refuses to bring their app to the Mac.

Speaking of … I saw this about the “real reason more iOS apps are unavailable on macOS” but not being an actual developer I wasn’t sure what they were talking about:

Siguza (@siguza@infosec.space)

The real reason iOS apps don't run on macOS is because some dumb CI/testing suites flag it as an error if the app "works on jailbroken devices" so everyone and their mom check for /Applications/blackra1n.app and /bin/bash and crash the process if those exist, because some blog post from 2009...

infosec.space

@Nycturne @Andropov can you guys explain?

Jimmyjames · Jun 12, 2024

dada_dave said:
Speaking of … I saw this about the “real reason more iOS apps are unavailable on macOS” but not being an actual developer I wasn’t sure what they were talking about:

Siguza (@siguza@infosec.space)

The real reason iOS apps don't run on macOS is because some dumb CI/testing suites flag it as an error if the app "works on jailbroken devices" so everyone and their mom check for /Applications/blackra1n.app and /bin/bash and crash the process if those exist, because some blog post from 2009...

infosec.space

@Nycturne @Andropov can you guys explain?

Not either of those two obviously, but I’d guess they are saying some automated tests are being run that do a very stupid check for an app’s vulnerability to piracy or running on a jailbroken iPhone or something similar. The test flags their apps as susceptible and they refuse to publish the app on the Mac App Store as a result, despite the fact that their app isn’t actually threatened by the existence of /bin/bash etc.
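
A rough sketch of what such a check might look like, written here in C++ with `std::filesystem` purely for illustration. The two paths come from the quoted post; the function names are made up, and real apps or third-party tools may do this differently (the quoted post says they crash the process, while this sketch only returns a flag).

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>
#include <vector>

// Returns true if any of the given paths exists on this machine.
bool any_path_exists(const std::vector<std::string>& suspicious_paths) {
    for (const auto& p : suspicious_paths) {
        assert(!p.empty());  // an empty path is a caller bug, not a finding
        std::error_code ec;  // non-throwing overload: errors count as "not found"
        if (std::filesystem::exists(p, ec)) {
            return true;
        }
    }
    return false;
}

// Naive "jailbreak" heuristic using the paths named in the quoted post.
bool looks_jailbroken() {
    return any_path_exists({"/Applications/blackra1n.app", "/bin/bash"});
}
```

Since a stock macOS install ships `/bin/bash`, a check like this would report every Mac as "jailbroken", which would explain an iOS app bailing out when run on macOS.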

dada_dave · Jun 12, 2024

Jimmyjames said:
Not either of those two obviously, but I’d guess they are saying some automated tests are being run that do a very stupid check for an app’s vulnerability to piracy or running on a jailbroken iPhone or something similar. The test flags their apps as susceptible and they refuse to publish the app on the Mac App Store as a result, despite the fact that their app isn’t actually threatened by the existence of /bin/bash etc.

That’s the sense I get too, but I don’t understand why or the history of what’s going on. Like is this an Apple guideline? … why does this fail for the macOS version? … is this a particular popular CI system or any of them?

Jimmyjames · Jun 12, 2024

dada_dave said:
That’s the sense I get too, but I don’t understand why or the history of what’s going on. Like is this an Apple guideline? … why does this fail for the macOS version? … is this a particular popular CI system or any of them?

I guess there is a third party tool that does these checks?

leman · Jun 12, 2024

dada_dave said:
I disagree. Without forward progress at the thread level you cannot have fine-grained control over the progress of your algorithm. You are limited to the SIMD group as the unit of parallelism, but much of the GPU’s speedup comes from running the threads within a SIMD group in parallel - that’s a factor of 32 for both Nvidia and Apple GPUs! Without that, you have to dedicate an entire SIMD group to a single control flow, which is great for divergence but not so great for overall performance, and it leaves a huge class of algorithms simply out of reach or hellishly complicated. That’s what Bryce’s and Olivier’s talks focused on (Olivier is now at Apple).

In fact I think you have it reversed: being able to coordinate amongst threadgroups is nice to have. It enables things like single-pass scans, which are faster than the standard two-pass scans. It means that you sometimes don’t have to launch multiple kernels and can reduce your overhead - that’s the example given by Nvidia. Certain atomics become available and, perhaps more germane to Nvidia than Apple, multiple GPUs can be coordinated more easily. There’s good reason why developers should care, I agree. But fundamentally they are the same class of algorithms as before. Without forward progress at the thread level, though, huge classes of algorithms, basically anything requiring complex synchronization through mutexes and locks, like hash tables and linked lists, are simply out of practical reach on the GPU (other use cases as well). They are either incredibly cumbersome to program or simply not performant enough to warrant their use, as you’re forced into lock-free or SIMD-group-aware versions. I’ll see if I can track down Bryce’s talk, and there was another guy as well who goes over the “why you should care”. But if memory serves, Olivier’s talk linked to by @Jimmyjames does a really decent job of explaining it as well, and of emphasizing just how difficult an engineering challenge it was to achieve and why they bothered to do it.

Now I’ll admit that for my own work I don’t make use of this mutex capability (yet), but I almost certainly rely (through a library) on the single-pass scan technique. But the CPU version of one of the algorithms I’d love to work on is filled with these kinds of control flows, mostly in the use of hash tables. Now I believe people had worked on hash tables prior to this development, but it’s either Bryce’s or Olivier’s talk, or both, that explains why this new system is so very much nicer for that in particular.

Ah, I see what you mean. I must admit that I don't have enough experience with advanced data structures on GPUs to have an informed opinion. The most complex thing I did was a radix sort kernel that uses SIMD-wide data synchronization to very quickly order keys across a SIMD, and that worked very well on Apple hardware without within-SIMD locking. I certainly agree that having the ability to serialize threads within a single SIMD is useful.

dada_dave said:
I’m a little confused. That should’ve been the case before? Or is that the case now? Maybe I have my terminology mixed up: SIMD group = warp, thread group = block, multiple thread groups/blocks in a kernel. Yes? If I have my terminology correct forward progress amongst SIMD groups in a thread group should have always been the case. I’m sorry if I’m not following, woke up at 5 in the morning not feeling great.

No, no, you are correct, it was always the case. I was just mentioning this for completeness.

dada_dave · Jun 12, 2024

Citysnaps said:
Seems Musk is a little steamed about Apple's AI plans:

https://www.reuters.com/technology/elon-musk-says-he-will-ban-apple-devices-if-it-integrates-os-with-openai-2024-06-10/

Cmaier said:
he’s an idiot.

Eric said:
"Musk warns that he will ban Apple devices if OpenAI is integrated at operating system level". From what exactly... his car that already doesn't have it? Twitter, a company that he's essentially killed anyway? It's like how he caters to all the right wing nutjobs who will never buy one of his cars. He is his own worst enemy.

Elon Musk drops lawsuit against OpenAI, Sam Altman one day after criticizing Apple for using ChatGPT

The court dismissed the case without prejudice.

www.tomshardware.com

Naturally he doesn’t want his claims actually tested in court.

Citysnaps · Jun 12, 2024

dada_dave said:
Elon Musk drops lawsuit against OpenAI, Sam Altman one day after criticizing Apple for using ChatGPT

The court dismissed the case without prejudice.

www.tomshardware.com

Naturally he doesn’t want his claims actually tested in court.

I heard on the news yesterday he wants to ban Apple devices (iPhones and Macs) at his companies. Childish. Meanwhile AAPL is up 12% since yesterday morning.
