I take apart the good, the bad, and the ugly of using software to heal hardware.
I just had to retire a 50-month-old 13-inch MacBook Pro. It was still running Snow Leopard due to app incompatibilities, and the SMC (System Management Controller) throws all sorts of errors when I run Apple Hardware Test. Sometimes it’s the RAM, sometimes the thermal sensor, yadda, yadda.
Bottom line, the Main Logic Board, or Motherboard in PC terminology, is broken. Because memory failures are in the picture, it’s only usable at this point for web surfing, which it still does quite well.
For kicks, I updated it to OS X Mavericks, and a funny thing happened: the machine stopped throwing kernel panics left and right. Now, I know the machine is broken; the EFI/firmware diagnostics (Apple Hardware Test, or AHT) confirmed it. And yet, it still works.
A search online suggests this MacBook is far from alone in having this condition – page after page of Google results show the identical machine throwing the identical trouble codes in AHT.
I quickly realized what Apple had done – they trained OS X to work around the problem. There’s no way to fix the broken Main Logic Boards without recalling them, a costly thing that Apple has no obligation to do… the SMC appears to fry itself long after the warranties have expired.
What Apple appears to be doing is telling the kernel to ignore SMC glitches and wait a few moments for the SMC to reset itself before throwing a panic. For thermal error codes, the most common glitch, this is a really cool solution.
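To make the idea concrete, here is a minimal sketch of that "ignore the glitch, wait, then retry before panicking" pattern. It is not Apple's code, and I have no visibility into what the kernel really does; `read_with_grace`, `SensorFault`, and the convention that a glitch shows up as `None` are all made up for illustration.

```python
import time
from typing import Callable, Optional


class SensorFault(Exception):
    """Raised when the fault persists after every retry (the 'panic' case)."""


def read_with_grace(
    read_sensor: Callable[[], Optional[float]],
    retries: int = 3,
    settle_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Return a sensor reading, tolerating transient glitches.

    Assumption: read_sensor returns None when the controller glitches.
    """
    for attempt in range(retries + 1):
        value = read_sensor()
        if value is not None:
            return value  # good reading, carry on as if nothing happened
        if attempt < retries:
            sleep(settle_seconds)  # give the controller time to reset itself
    # The glitch never cleared, so give up loudly instead of hiding it.
    raise SensorFault(f"sensor still failing after {retries} retries")


if __name__ == "__main__":
    # Simulated flaky thermal sensor: two glitches, then a sane value.
    readings = iter([None, None, 41.5])
    print(read_with_grace(lambda: next(readings), settle_seconds=0.01))
```

Notice what the sketch does not do: it never tells anyone that two reads failed. That silence is exactly the trade-off I get into below.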
Now, I doubt this is a total fix. If AHT is showing memory test errors, the RAM pool is still getting corrupted, which presents a real and present risk of data loss. So, in my opinion, the machine still had to be retired.
Apple’s handling of this does give me pause. The SMC can cause memory corruption and is still broken, yet many users could think their machines are fine. If there is indeed memory corruption, this is playing with fire a bit. You run the risk of data loss if the contents of corrupted memory are written to the drive, for example.
Memory corruption can easily spread to the file system, wiping out an entire machine’s drive contents.
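A toy example shows how little it takes. The sketch below flips a single bit in an in-memory buffer and uses a CRC32 checksum, taken while the data was still good, to notice the damage before it gets written out. The buffer contents and helper names are invented for illustration, and I am not claiming OS X performs a check like this.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted, simulating bad RAM."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def is_intact(data: bytes, expected_crc: int) -> bool:
    """Compare the buffer against a checksum taken while it was still good."""
    return zlib.crc32(data) == expected_crc


# Hypothetical buffer standing in for file system metadata held in RAM.
original = b"directory entry: /Users/me/Documents"
crc = zlib.crc32(original)

# One flipped bit: same length, nearly identical bytes, but no longer valid.
corrupted = flip_bit(original, 5)

if __name__ == "__main__":
    print(is_intact(original, crc))
    print(is_intact(corrupted, crc))
```

Without a check like this, the corrupted buffer goes to disk looking just as legitimate as the original. And a checksum only helps if it was computed before the memory went bad.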
In the end, this shows how far you can go to deliver a superior experience when you control both the hardware and the software. It shows why the Mac is often more stable than Windows. Microsoft can’t code workarounds for failing controllers on hundreds of thousands of different hardware combinations. A driver developer could, but that assumes the driver can easily and regularly reach end users.
But, on the flip side, is a driver fix likely to alert users that the fix itself is just a band-aid? Unlikely. It’s much more likely that a user will just go on their merry way, and eventually discover that their filesystem is gone.
As cool as this fix is, I’d prefer my machine to kernel panic, and tell me that it’s time to buy a new computer or replace the motherboard. I now have a really nice web surfing machine, but I’m thanking my lucky stars that I don’t have a total data loss on my hands… and I fear for the next user that this will happen to.