CrowdStrike screwup destroys commerce

The (world) thing was hyperbole but I have to say trying to be a developer in the modern world without internet is quite painful. I don't know what part of the IT infrastructure here uses CrowdStrike (I'm a contractor with my own equipment) but no one had internet. If I had known, I could have brought my iPad and tethered to that but my iPhone tethering is severely throttled.
To my knowledge, no networking infrastructure was hit at all around here (around here meaning all of Denmark).
 

Microsoft seems to blame the EU for the outage. Their argument is they can’t do what Apple did and push this junk out of kernel-space, because the EU requires that third parties have the same access to the kernel that Microsoft does (for Defender).

Of course, they don’t explain why they can’t spin out their own security software into a separate company, or why they can’t build in SDKs that everyone, including Defender, can use.

But, in any case, I always like blaming the EU.
 
Of course, they don’t explain why they can’t spin out their own security software into a separate company, or why they can’t build in SDKs that everyone, including Defender, can use.

That would require work, and therefore money, and there would be no clear return on that investment for Microsoft. The lack of a black eye cannot be measured in revenue beforehand. Therefore it is impossible to do such a thing, or even to think in those terms.

(I have literally had arguments with managers on topics like this. It’s the whole "nothing going wrong doesn’t get noticed like a new feature" issue)
 

Microsoft seems to blame the EU for the outage. Their argument is they can’t do what Apple did and push this junk out of kernel-space, because the EU requires that third parties have the same access to the kernel that Microsoft does (for Defender).

Of course, they don’t explain why they can’t spin out their own security software into a separate company, or why they can’t build in SDKs that everyone, including Defender, can use.

But, in any case, I always like blaming the EU.
There is an API already in use on Linux, eBPF (it used to stand for extended Berkeley Packet Filter, but that no longer fits, so it's not an acronym any more), that is coming to Windows and will allow software like CrowdStrike to do its job without needing to be a kernel extension. It will still be optional, but reputable companies will all use it to avoid problems like this one. Unfortunately, it isn't quite ready for Windows yet.
 
There is an API already in use on Linux, eBPF (it used to stand for extended Berkeley Packet Filter, but that no longer fits, so it's not an acronym any more), that is coming to Windows and will allow software like CrowdStrike to do its job without needing to be a kernel extension. It will still be optional, but reputable companies will all use it to avoid problems like this one. Unfortunately, it isn't quite ready for Windows yet.
eBPF is coming to Windows? I haven't heard this and almost don't see how. I assume it won't be plug and play with existing eBPF programs. (I have a friend who's looked into formal verification strategies for eBPF programs.)
 
Apparently CrowdStrike has been causing more limited issues on Linux kernels for a while despite eBPF:

Even with the full protection of user space, misbehaving processes can cause issues: excessive file writes, for example. Or, much to my annoyance, crashpad_handler, which often starts eating way too much CPU until it is killed, after which it relaxes for a while. So I guess another moral of this whole story is to try and avoid bad software :P
 
One of the surprising outcomes of this disaster is that, in my circle of non-devs, this has sadly been classified as "the Microsoft issue". The nuance that this was an issue caused by a security software vendor was lost. I don't think anyone could accuse me of defending Microsoft, but this is 100% on CrowdStrike.
 
One of the surprising outcomes of this disaster is that, in my circle of non-devs, this has sadly been classified as "the Microsoft issue". The nuance that this was an issue caused by a security software vendor was lost. I don't think anyone could accuse me of defending Microsoft, but this is 100% on CrowdStrike.

100%? IMHO, Microsoft has some culpability here because it still hasn’t provided a migration path for these sorts of apps outside of kernel space.
 
Even with the full protection of user space, misbehaving processes can cause issues: excessive file writes, for example. Or, much to my annoyance, crashpad_handler, which often starts eating way too much CPU until it is killed, after which it relaxes for a while. So I guess another moral of this whole story is to try and avoid bad software :p

No real disagreement, although I’d say that limiting the damage bad code can do is a worthy engineering goal in and of itself. We generally don’t go ‘oh well’ when earthquakes take down buildings or wind storms take down bridges; we put improved mitigations into action, backed by regulation.

Honestly, I’m getting a bit concerned that we don’t have some sort of software engineering ’code’ similar to building code yet. Clearly software infrastructure is becoming critical to people’s lives and the economy, but we’re still acting like it’s some infant industry that will fall over if we were to dare to put some guardrails in.
 
No real disagreement, although I’d say that limiting the damage bad code can do is a worthy engineering goal in and of itself. We generally don’t go ‘oh well’ when earthquakes take down buildings or wind storms take down bridges; we put improved mitigations into action, backed by regulation.

Honestly, I’m getting a bit concerned that we don’t have some sort of software engineering ’code’ similar to building code yet. Clearly software infrastructure is becoming critical to people’s lives and the economy, but we’re still acting like it’s some infant industry that will fall over if we were to dare to put some guardrails in.
Fully agree. That’s also part of the point of address space randomization. If your program is written perfectly, randomizing its address space is pointless. And if it isn’t, there might still be a way of exploiting it with a randomized address space. But it’s harder to do so, and it may prevent some cases or just make an exploit tricky enough to be infeasible.

I am in no way arguing that we might as well run everything in kernel mode. That’d be horrible. But code running with your user privileges can still delete all the data you have access to as a user. (Barring additional permission granularity systems like on macOS with separate permissions for seeing documents, downloads and full disk access on top of user privileges)
 
Ideally everything always runs with the least required privileges. I would be in favor of expanding the current user privilege system on Unix to a lattice structure where every program effectively spawns a new pseudo user that cannot see anything made by a sibling, while the user who spawned the program is their shared upper bound, can see all their data, and can grant them per-process (pseudo-user) access to a sibling's data if need be.
I guess all we need is a semilattice. No need for a greatest lower bound operation. Just spitballing.
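To make the spitballing above a bit more concrete, here is a minimal sketch of that join-semilattice in Python. Everything in it is hypothetical: `Principal`, `spawn`, `can_see`, `join` and `grant` are names invented for illustration, not an existing Unix API. It models each program as a pseudo user below the user that spawned it, with the least upper bound of two principals being their lowest common ancestor.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass(eq=False)  # identity-based equality keeps principals hashable
class Principal:
    """A real user or a per-process pseudo user (hypothetical model)."""

    name: str
    parent: Optional[Principal] = None
    granted: set[Principal] = field(default_factory=set)  # extra viewers

    def spawn(self, name: str) -> Principal:
        """Start a 'program': a fresh pseudo user directly below this one."""
        return Principal(name, parent=self)

    def chain(self) -> Iterator[Principal]:
        """Yield this principal, then each ancestor up to the root."""
        node: Optional[Principal] = self
        while node is not None:
            yield node
            node = node.parent


def is_upper_bound(upper: Principal, lower: Principal) -> bool:
    """True if `upper` is `lower` itself or one of its ancestors."""
    return any(node is upper for node in lower.chain())


def can_see(viewer: Principal, owner: Principal) -> bool:
    """Ancestors see everything below them; siblings need an explicit grant."""
    return is_upper_bound(viewer, owner) or viewer in owner.granted


def join(a: Principal, b: Principal) -> Optional[Principal]:
    """Least upper bound: the lowest common ancestor of a and b.

    Returns None for principals in unrelated trees, so this is only a true
    semilattice when everything hangs off a single root.
    """
    above_a = {id(node) for node in a.chain()}
    for node in b.chain():
        if id(node) in above_a:
            return node
    return None


def grant(granter: Principal, viewer: Principal, owner: Principal) -> None:
    """Let `viewer` see `owner`'s data; only a shared upper bound may do so."""
    if not (is_upper_bound(granter, viewer) and is_upper_bound(granter, owner)):
        raise PermissionError(f"{granter.name} is not above both principals")
    owner.granted.add(viewer)


alice = Principal("alice")
browser = alice.spawn("browser")
editor = alice.spawn("editor")

print(can_see(alice, editor))          # True: alice is the upper bound
print(can_see(browser, editor))        # False: siblings are isolated
print(join(browser, editor) is alice)  # True
grant(alice, browser, editor)
print(can_see(browser, editor))        # True: explicit sibling grant
```

Note that there is no meet (greatest lower bound) operation anywhere, which matches the point that a semilattice is enough for this scheme.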
 
I would be in favor of expanding the current user privilege system on Unix to a lattice structure where every program effectively spawns a new pseudo user that cannot see anything made by a sibling, while the user who spawned the program is their shared upper bound, can see all their data, and can grant them per-process (pseudo-user) access to a sibling's data if need be.

Depends on your definition of "see". I tend to take a strict view on visibility: any given process should be allowed to decide what data it owns is visible to whichever other processes it wants to expose it to. A parent should not have full access to its children's data without approval, in order to make possible attack surfaces as small as we can.
 
100%? IMHO, Microsoft has some culpability here because it still hasn’t provided a migration path for these sorts of apps outside of kernel space.
I mean, I think Apple's approach is far superior. Things like System Integrity Protection and the entitlement system greatly minimize the damage a process with root privileges can do, and I believe it's the right choice for a mainstream OS. But it does come with limitations. One may have legitimate reasons to want to do something that is now (on macOS & iOS) gated behind an entitlement that Apple won't issue to you. Is this extra freedom worth significantly downgrading the security of the system? In my opinion, for an OS used by non-tech people, the answer is no.

The reason I felt this was 100% on CrowdStrike is that, while in this particular instance other operating systems did a better job, in the end even a buggy user space program can cause significant issues (significant enough to make the system inoperative). In this case it happened to be a crash, which on Windows took the entire system down because the code was running in kernel space, but imagine if instead of a crash the faulty CrowdStrike process had started writing bytes to disk nonstop (until the disk was full). That would have caused major disruptions as well, and as far as I know no OS can do much to prevent that kind of issue. Or if it entered an infinite loop on several threads, pinning the CPU utilization to 100%, effectively DoSing the system. Or a million other possibilities.

I can't imagine what happened behind the scenes to end up pushing untested changes to millions of devices at once, with no phased release (on a Friday, no less!). At that scale, you must phase the releases and monitor them. Getting such a critical thing wrong is miles worse than having a bad privilege system in your OS.
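For anyone wondering what "phase the releases and monitor them" could look like in the simplest possible form, here is a rough Python sketch. It is purely illustrative: `phased_rollout`, the `deploy` and `healthy` callbacks, the stage percentages and the failure threshold are all assumptions made up for this example and say nothing about how CrowdStrike's actual pipeline works.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def phased_rollout(
    devices: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    stages: Sequence[int] = (1, 10, 50, 100),  # cumulative percent of fleet
    max_failure_pct: float = 1.0,
) -> Tuple[List[str], Optional[int]]:
    """Deploy to growing cohorts, halting at the first unhealthy stage.

    Returns (devices updated so far, stage percent where we halted or None).
    """
    updated: List[str] = []
    done = 0
    for percent in stages:
        # Integer arithmetic avoids floating-point surprises in cohort sizes.
        target = max(done, len(devices) * percent // 100)
        cohort = devices[done:target]
        for device in cohort:
            deploy(device)
        updated.extend(cohort)
        done = target

        # Monitor the cohort before widening the blast radius.
        failures = sum(1 for device in cohort if not healthy(device))
        if cohort and 100.0 * failures / len(cohort) > max_failure_pct:
            return updated, percent
    return updated, None


fleet = [f"host-{i}" for i in range(1000)]

# A bad update that breaks every machine it touches stops after the 1% stage.
updated, halted_at = phased_rollout(
    fleet, deploy=lambda d: None, healthy=lambda d: False
)
print(len(updated), halted_at)  # 10 1
```

In a real system the health check would need to wait for telemetry (crash reports, heartbeats) before moving on, but even this toy version limits a broken update to the first cohort instead of the whole fleet.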
 