If this looks far too TL;DR then here's the summary - FUCKING BASTARD PCs.
So then, last night my PC turned on me.
I was happily working away on formulas in Excel to improve my online slots stats-crunching automation, when my PC froze, but in a really weird sort of way. Excel was completely hung, but Firefox and all its tabs remained responsive, and the online slot I had running in Chrome carried on just fine. (I keep my browsing and slotting separate, as the Flash plugin has a habit of crashing after slots have been running for a long time, so by having that in a separate browser it doesn't affect the Flash plugin for Firefox.)
Then after about 30 seconds Firefox and all its tabs stopped responding, but the slot carried on running in Chrome.
I tried to bring up Task Manager but that caused an Explorer hang, and it then dropped to the Windows backdrop screen with the egg-timer icon (or rather, the modern equivalent thereof) with a 'Preparing Security Policy' message, but I could still hear my slot running behind it.
Then the slot went quiet and the PC just sat there with its 'Preparing Security Policy' message and egg-timer icon.
I gave it another minute and nothing seemed to be happening, it was completely unresponsive to any keyboard or mouse inputs, so in the end I just powered it off and turned it back on again.
Thereafter all appeared normal, it POSTed, stepped through the BIOS checks, booted into Windows, all OK. But I was already perturbed, as Windows 7 doesn't do shit like that, so I was thinking I had a hardware problem of some sort.
I carried on working away for a bit, and then tried to do a search on my data drive, at which point it did a variation of the lock I'd seen earlier, but did carry on to let me perform the search after a hang of around 30 seconds, and seemed to be working normally again. Then a few minutes later it did a proper hang, and again it was 'one thing at a time' that stopped working, rather than an outright crash or total hang.
It finally locked completely, and after a last dramatic pause, I got a BSOD (which I quickly grabbed a photo of with my phone whilst it was doing the minidump), before it rebooted itself.
It then sat on the BIOS screen for about 90 seconds, attempting to auto-detect what was attached to the SATA ports, it failed to find my SSD, carried on forward and then attempted to boot from my data drive, which of course failed and left me sat at a nice old-fashioned DOS style boot disk error screen.
CTRL+ALT+DELETE reset it, and I hit DEL on the POST screen to get into the BIOS.
And, erm... My SSD had disappeared. It wasn't listed as a hard drive, and had therefore vanished from the boot device priority list.
I physically powered the PC off, then back on and.... it worked normally, (I had to go into the BIOS to put the SSD back to the top boot device), detected all the drives, booted up, into Windows, Windows complained about the BSOD and gave me the error and minidump details, but other than that it was fine.
Now there was no doubt something was amiss, so I set about doing some diagnostics, I shut the PC down and physically unplugged the data drive to remove that from the equation, then turned it back on.
1) Scheduled a checkdisk for next reboot, it rebooted, ran a checkdisk, no errors.
2) F10ed into the recovery console at boot, ran the Windows Memory Diagnostic, no errors.
3) Windows event logs were no help, there was absolutely nothing to indicate it was unhappy about anything prior to the crashes, and all it logged on reboot was stuff like 'the last reboot was unexpected'.
4) The BSOD STOP messages weren't much help, Googling around revealed it to be a real sort of 'Windows has just completely lost its shit and has no idea what's wrong' type of STOP, with the list of potential culprits being potentially fucking everything.
5) Device Manager was happy, there were no recent Windows Updates that were known to cause problems, I did a full MSE scan just in case, that was happy.
6) Checked all temperatures and fans, all fine, nothing was overheating. And it would run a game just fine, it's not like it was falling over when it got stressed, it would play BF3 for five minutes just fine, and then crash opening my Gmail.
Basically, there was ABSOLUTELY NOTHING to indicate that the PC or Windows were upset about anything, and yet, here it was routinely hanging/crashing/blue screening after a short period of use.
(Whilst I was doing the above stuff it was still crashing and hanging, and I got a couple more blue screens.)
The only real lead I had at this point was the SSD disappearing from the BIOS, plus I had noticed some consistent behaviour.
1) If the PC was simply REBOOTED after a crash or blue screen, it would fail to find the SSD at POST, and thus fail to boot. Going into the BIOS at this point would show the SSD as completely missing.
2) If the PC was POWERED OFF and then turned back on, it worked completely normally (after putting the SSD back to the top of the boot priority list). But after some random amount of time, 10-45 minutes, it would do some sort of crash/hang and finally blue screen or stop responding completely.
I kind of figured my SSD was fucked, although I wasn't quite sure how, as when it worked it was 100% fine, data throughput was as fast as ever and so on, I always thought an SSD would either work or it wouldn't, since it's not a mechanical device.
Anyway, as my backups were up to date and in order, and I've been thinking about doing a rebuild for a while anyway as it's an old install of Windows 7 (not that it had been giving me any trouble prior to yesterday evening) - I figured I'd just go for a rebuild. If it was still fucked after the rebuild, I'd know for sure I had a hardware issue, most likely the SSD. (Although it could have been the SATA controller on the motherboard, or something like that, of course.)
So off I went with the rebuild, and since it's a pretty slow process with all the Windows Updates that are required, I did more investigation into the errors I'd been seeing and based on the assumption it was something to do with the SSD I refined my search terms (controller going west? dodgy NAND cell?). I used Mrs AE's laptop for this as my own PC was out of action of course, and she was out for the evening anyway so she wouldn't need it.
Eventually I found this, and I was like OH FOR FUCK'S SAKE, REALLY?
Fourth post down -
http://forum.crucial.com/t5/Solid-State ... td-p/96549
A firmware update for the Crucial M4 that addressed this bizarre issue (who would have seen this one coming?):
Quote:
"Correct a condition where an incorrect response to a SMART counter will cause the m4 drive to become unresponsive after 5184 hours of Power-on time. The drive will recover after a power cycle, however, this failure will repeat once per hour after reaching this point. The condition will allow the end user to successfully update firmware, and poses no risk to user or system data stored on the drive."
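To put that number in perspective, here is a rough sketch of the arithmetic in Python. The 5184-hour threshold comes straight from the quoted changelog; the function names are made up for illustration, and reading the drive's real power-on hours (SMART attribute 9) out of a SMART tool is not shown here.

```python
# Threshold taken from the quoted Crucial changelog; everything else is illustrative.
BUG_THRESHOLD_HOURS = 5184


def threshold_in_days() -> float:
    """Express the power-on threshold in days of continuous running."""
    return BUG_THRESHOLD_HOURS / 24


def hours_until_bug(power_on_hours: int) -> int:
    """Hours of power-on time left before the threshold (0 if already past it)."""
    return max(0, BUG_THRESHOLD_HOURS - power_on_hours)


if __name__ == "__main__":
    print(threshold_in_days())    # 216.0
    print(hours_until_bug(5000))  # 184
    print(hours_until_bug(6000))  # 0
```

So the drive needs 216 days' worth of powered-on time before it starts misbehaving, and the "repeat once per hour" part of the changelog is at least consistent with the crashes turning up within an hour of each power cycle.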
I toddled off to Crucial's website and found the firmware (there's actually been another release since the one that fixed the clanger listed above), but it runs as an installer from within Windows itself and applies the upgrade on the next reboot, and of course I'd just blown away my installation of Windows.
So as soon as my new install of Windows was in a usable state AND I'D INSTALLED .NET4 AS THE CRUCIAL UPDATER NEEDS THAT (FFS), and now horribly aware that my PC could commit hara-kiri at a moment's notice because the issue was fundamentally to do with the firmware on the SSD rather than a Windows/hardware issue, I ran the Crucial firmware updater, it rebooted the PC, applied the firmware upgrade and carried on booting into Windows.
That was last night (like, ALL of last night). I stayed up until about 2am getting all the Windows Updates on, and I'm still in the process of getting everything else installed/patched/configured. Office, Steam, Origin, backup software, Afterburner, FTP client, browsers and so on - fortunately my backup regime is very good so it's all fairly straightforward (I regularly back up even stuff like my Firefox bookmarks, and I keep all my usernames and passwords in an encrypted document, for example) - so nothing has been lost, it's just a lengthy process.
PC has been behaving itself fine so it's definitely fixed the problem, although annoyingly there was no need to rebuild it after all.
But seriously, what the fuck?
"Correct a condition where an incorrect response to a SMART counter will cause the m4 drive to become unresponsive after 5184 hours of Power-on time"
I mean, really.