13900K and 14900K failures

mr_roboto

Don't think I've seen any posts about this here... Intel has been having some field reliability problems with both 13900K and 14900K. At first people suspected too-aggressive PL1 and PL2 limits on gamer motherboards, but over time it's become clear there are also problems on motherboards and systems which come with much more conservative (and Intel-approved) settings, such as those from Dell.

 
It will be interesting to see the follow-up that was teased in this video.
But this could really be a serious problem for Intel, if they pushed the CPUs so hard that they fry themselves.
 
The pie chart the Warframe devs generated is interesting. The 13700/14700 are represented, but are a significantly smaller share of the total crashes than 13900/14900 variants. 13700/14700 are the second-best versions of 13900/14900 - probably the same die, but detuned to slightly lower clocks with fewer efficiency cores enabled. Since Intel probably sells a lot more *700 chips than *900, seeing the *900 variants top the chart like that suggests the frequency of problems with the top models is extremely high.

(This problem is bad even on the *700 chips - Intel shouldn't be selling products which are factory overclocked to the point of unreliability.)
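
To make the share-of-crashes argument concrete, here's a rough sketch of the arithmetic. The crash and sales shares below are made-up placeholders, not numbers from the Warframe chart or from Intel; the only point is that a model with a smaller sales share and a larger crash share must have a much higher per-unit failure rate.

```python
def relative_failure_rate(crash_share, sales_share):
    """Crash share divided by sales share.

    A value above 1 means the model shows up in crash reports
    more often than its share of units sold would predict.
    """
    if sales_share <= 0:
        raise ValueError("sales_share must be positive")
    return crash_share / sales_share


# HYPOTHETICAL numbers, for illustration only (not real data).
crash_share = {"x900": 0.70, "x700": 0.30}
sales_share = {"x900": 0.40, "x700": 0.60}

rates = {
    model: relative_failure_rate(crash_share[model], sales_share[model])
    for model in crash_share
}

# How many times more often a *900 unit crashes than a *700 unit,
# under the assumed shares above.
ratio = rates["x900"] / rates["x700"]
print(rates, ratio)
```

With those placeholder shares, a *900 chip would be crashing several times as often per unit sold as a *700 chip, which is the shape of the argument above.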
 
There's more but it's not that interesting imo - just speculation about the latching system, substrate warping, uneven pressure, etc. It doesn't make a lot of sense as a theory to me because if it was that, problems should occur at about the same rate on lower bin models.
 
It still occurs, but at a much lower rate. It’s almost certainly electromigration and degradation.
 
There’s now a claim that mobile 13th and 14th gen processors may be affected too.


Intel acknowledges that there is a problem but says it’s a different one. Their explanation was rather vague and unsatisfactory, though. They almost seem to be saying these are “normal” computer crashes.

Intel is aware of a small number of instability reports on Intel Core 13th/14th Gen mobile processors.

Based on our in-depth analysis of the reported Intel Core 13/14 Gen desktop processor instability issues, Intel has determined that mobile products are not exposed to the same issue. The symptoms being reported on 13/14 Gen mobile systems – including system hangs and crashes – are common symptoms stemming from a broad range of potential software and hardware issues.
Cassells responded to Intel's statement in a Reddit thread:

"The laptops crash in the exact same way as the desktop parts including workloads under Unreal Engine, decompression, ycruncher or similar. Laptop chips we have seen failing include but not limited to 13900HX etc.," Cassells said.
Continuing quote:
"Intel seems to be down playing the issues here most likely due to the expensive costs related to BGA rework and possible harm to OEMs and Partners," he continued. "We have seen these crashes on Razer, MSI, Asus Laptops and similar used by developers in our studio to work on the game. The crash reporting data for my game shows a huge amount of laptops that could be having issues."

🤷‍♂️
 
Intel claims to have a fix:


However, the microcode update will not repair impacted processors.
Intel's advisory says an erroneous CPU microcode is the root cause of the incessant instability issues. The microcode caused the CPU to request elevated voltage levels, resulting in the processor operating outside its safe boundaries.
The bug causes irreversible degradation of the impacted processors. We're told that the microcode patch will not repair processors already experiencing crashes, but it is expected to prevent issues on processors that aren't currently impacted by the issue. For now, it is unclear if CPUs exposed to excessive voltage have suffered from invisible degradation or damage that hasn't resulted in crashes yet but could lead to errors or crashes in the future.
 
Yeesh. Probably the EU’s fault, yeah? Microcode regulations and so forth.

Jokes aside, any idea on the number of irreparable units? I wonder if a shitstorm of litigation is heading Intel’s way. Consumers are going to want their broken stuff replaced, OEMs will want their reputation fixed, and lawyers… well, they’ll continue being the helpful bunch they are 🙂
 
Wow. They're releasing a patch *mid-august* with no solution in the meantime. Plus (as the article states) I wouldn't be comfortable owning one of the affected processors even if I hadn't experienced issues yet. Who knows how many CPUs are "invisibly" affected.
 
Zen 5 reviews will be interesting to watch/read. I wonder what reviewers will do with Raptor Lake.
 
It sounds like what’s going on here is hot carrier degradation. If so, this is the worst case of it I’ve ever heard of. I definitely wouldn’t want to own one of these; even if you aren’t experiencing crashes yet, you will, and your processor is already on the way to being toast.
 
I guess I should elaborate slightly. What seems to be going on is that Intel is sometimes causing the on-chip voltage for at least one voltage domain to be too high. The higher the voltage, the higher the electric field. A high enough electric field can cause charge carriers to be injected into the dielectric at the transistor gates. These carriers get trapped in the dielectric, and never come out. But a dielectric is supposed to be an electrical insulator. If you have a bunch of trapped charge carriers in it, it becomes a conductor. This causes all sorts of problems - for example, you may not be able to shut the transistor off.

This is a permanent problem. Once the carriers are in the dielectric, they ain’t coming out.

Note that the voltage must be WAY too high. Just running at the chip’s max voltage shouldn’t cause any problems for at least 10 years. To cause problems so quickly, it’s gotta be way higher than that. An alternative is that this is always affecting the same transistor, in some analog circuit, and it’s a very poorly designed transistor with some weird gate shape that has a geometric corner or something that concentrates the electric field lines (very unlikely nowadays - FinFETs aren’t drawn like that, unlike older planar MOSFETs). Of course I have very little information, but I’d have to guess that they are somehow setting the on-chip voltage in some voltage domain to several volts.
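
For a feel of how the field scales with voltage, here's a back-of-the-envelope sketch using the parallel-plate approximation E = V / t. The 1.5 nm dielectric thickness and the voltages are assumed placeholder values, not Intel process numbers, and a real FinFET gate stack is not a flat plate, so treat this as a scaling illustration only.

```python
def gate_field_mv_per_cm(voltage_v, dielectric_thickness_nm):
    """First-order gate dielectric field in MV/cm.

    Uses the parallel-plate approximation E = V / t, which ignores
    real gate geometry and material details.
    """
    if dielectric_thickness_nm <= 0:
        raise ValueError("dielectric_thickness_nm must be positive")
    thickness_cm = dielectric_thickness_nm * 1e-7  # 1 nm = 1e-7 cm
    return voltage_v / thickness_cm / 1e6  # V/cm -> MV/cm


# ASSUMED values for illustration only (not real process data).
ASSUMED_THICKNESS_NM = 1.5

for volts in (1.2, 1.5, 3.0):
    field = gate_field_mv_per_cm(volts, ASSUMED_THICKNESS_NM)
    print(f"{volts:.1f} V -> {field:.1f} MV/cm")
```

In this simple model the field is linear in voltage, so doubling the voltage in a domain doubles the field across the dielectric.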
 