May 25, 2025

Hi community,

I've got a serious problem and I don't know what my next step should be. Attempting to boot my Unraid server leads to a repeating error:

EFI stub: WARNING: Decompression failed: unexpected EOF

This happens on boot, right after the blue Lime Technology screen, and occurs no matter which option is chosen. The error repeats endlessly and immediately, so it's tough to catch what precedes it, but I think it's these 2 lines:

Loading /bzimage...ok
Loading /bzroot...ok

This occurred once or twice in the past week or two, infrequently. And in the immortal troubleshooting strategy of "Well, let's just try booting it again," it would resolve and I'd get the system to boot into Unraid. However, I was also getting odd hangs where the system would become unresponsive after a bit of web interface lag. I tried to get diagnostics, but the system pretty much fully locked up, requiring a hard restart. After the last one of these, and this same "decompression failed" error, I suspected I had a bad or corrupted USB drive.

I tried restoring the drive from the Unraid Connect backup taken less than 8 hours earlier, but the error persisted. I then replaced the drive with a new one and the error occurred again. After another repeated set of reboots, I think I've finally got it booted up on the new drive, but I am now VERY worried about this system.

What should my next steps be to troubleshoot this? Is there perhaps a file on my drive and in my backup that's corrupted? I'll see if I can get diagnostics posted if the system fully boots (still starting up now). When this happened in the past and the system succeeded in booting, full parity checks always came back good, and SMART status still shows good on all the drives. There are no other major errors or notifications.

For reference, the motherboard I'm using is the ASUS Pro WS W680-ACE IPMI. The original flash drive was a 64 GB SanDisk Cruzer Blade (USB 3.0). The new one is a SanDisk Ultra 32GB (USB 3.0).

I believe I was on the latest Unraid version, though this was also occurring on 7.0.1.

Edited May 25, 2025 by FirbyKirby: Added drive info.
May 25, 2025 Author

I was able to pull diagnostics, but of course this is from a fresh boot, after the last hang and the repeated "decompression failed" errors when attempting to boot, so I don't think they're ideal. Also, since I've swapped the drive, I will need to invalidate the old one and register the new one to get the array started. I'm holding off on that since I'm not at all convinced the drive is the issue, and I'd love for someone to help me out with some new troubleshooting steps before I chuck the old drive (the new one can be a spare, which I should have had anyway).

wondermutt-diagnostics-20250525-1457.zip

Edited May 25, 2025 by FirbyKirby: Clarity and added info.
May 26, 2025 Community Expert

The errors indicate problems reading some of the bz* files from the flash drive. These are the compressed archives that are read to load Unraid into RAM before it starts running. This is normally a problem with the flash drive, but if it also occurs with a new one then it could be something else, like bad RAM and/or motherboard/CPU issues corrupting the files as they are loaded into RAM. It is definitely worth running memtest from the Unraid boot menu, as that is an easy test to do and bad RAM can cause all sorts of unexpected issues.

If the problem is some dodgy sectors on the flash drive, it can sometimes help to download the zip file for the release you are running and extract all the bz* files from it, overwriting the ones on the flash.
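Before overwriting anything, it can be useful to check whether the bz* files on the flash actually differ from the ones in the release zip. The following is a minimal Python sketch of that comparison, not an official Unraid tool. The function names (`sha256_of`, `compare_bz_files`) are made up for illustration, and both directory paths are assumptions: the first should point at wherever the flash drive is mounted, the second at a folder where the matching release zip has been extracted.

```python
import hashlib
import sys
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_bz_files(flash_dir, release_dir):
    """Compare every bz* file in release_dir with its copy in flash_dir.

    Returns a dict mapping file name to "ok", "mismatch", or "missing".
    """
    results = {}
    for reference in sorted(Path(release_dir).glob("bz*")):
        if not reference.is_file():
            continue
        candidate = Path(flash_dir) / reference.name
        if not candidate.is_file():
            results[reference.name] = "missing"
        elif sha256_of(candidate) == sha256_of(reference):
            results[reference.name] = "ok"
        else:
            results[reference.name] = "mismatch"
    return results


# Hypothetical usage: python compare_bz.py <flash mount> <extracted release zip>
if __name__ == "__main__" and len(sys.argv) == 3:
    for name, status in compare_bz_files(sys.argv[1], sys.argv[2]).items():
        print(f"{name}: {status}")
```

A "mismatch" only says the two copies differ when read on this machine. If repeated runs over the same files give different results, that would point away from a permanently corrupted file and toward the read path or memory, which fits the suggestion to run memtest.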
June 2, 2025 Author

Thanks for the advice. I followed it to the letter and, for documentation purposes, I'll post my results.

First, I updated my motherboard firmware to the latest stable release in case there were any issues in the specific BIOS build. Then I ran Memtest86+ for 24 hours (the one on the Unraid drive wouldn't run, so I grabbed the latest from their website and ran that).

I heard that SanDisk drives might be suspect and, despite the original running for more than a year, both drives I tried are SanDisk, so I switched to a PNY USB 2.0 32GB drive. (I bought a 5 pack of these for spares and, as a pro tip posted elsewhere in the forum, these are marked as 16GB for sale at Best Buy, but they booted as 32GB drives.) I also replaced all the bz* files as recommended. So far I have not had this error again (it's been about 3 days).

However, I am still getting complete lockups of the Unraid machine, anywhere from every few days to as often as once a day. These lockups are infuriating, as my IPMI syslog shows no hardware errors and, from a software standpoint, the server is completely unresponsive: the web GUI doesn't update, I can't SSH into the machine, and the IPMI display output doesn't change. A hard power cycle is the only recovery, so pulling diagnostics before the power cycle is impossible.

I suspect this might have caused the bz* file issue in the first place, as these lockups have been happening infrequently (but becoming more frequent) for the past month or two. Parity checks are now frequent as well, but have always come up with zero errors.

While this is no longer related to the initial error of this forum thread, I'll post my last 2 diagnostics, taken AFTER the hard power cycle following a lockup. Maybe someone can see an issue or pattern in them that would give me some new troubleshooting steps. For now, I don't think it's a hardware issue, since Memtest86+ came back clean, the IPMI and BIOS syslogs are both empty, and parity checks are always clean.

I think it's something in software. I have a dummy plug on a passthrough GPU that may be preventing me from seeing the IPMI output, so I may take that off to see if I can get any additional data before the lockup occurs. But that's the only other idea I have.

wondermutt-diagnostics-20250602-1012.zip
wondermutt-diagnostics-20250531-0902.zip

Edited June 2, 2025 by FirbyKirby
June 2, 2025 Community Expert

Since you have a 14700K, it could also be related to the Intel 13th/14th gen issue. There are lots of confirmed cases in the forum of those CPUs being the problem, mostly the 13700K, 14700K, 13900K and 14900K. A BIOS update may help if that's the issue and the CPU is not too far gone.
June 6, 2025 Author

Hmm. That's concerning, @JorgeB. But I really appreciate you bringing this to my attention. I sort of assumed I had dodged that particular bullet, since I built this machine in December 2023 and hadn't had any issues in the first year (or so). I'd mostly forgotten about this reported issue. But based on your note and some frantic research, it sounds like the Vmin shift issues are cumulative and irreparably damaging. So maybe that would explain the ever-increasing frequency of crashes in Unraid, and the thread's initial symptom of USB bz* file corruption (I've decided that was probably a symptom of the crashes, not their cause). I have been updating my BIOS regularly on this machine, so I got the updated microcode within about 6 months of its release, but you may be right.

So, here's what I did to try to confirm it was a processor issue. I built a Windows 11 install on a USB drive and booted into that rather than Unraid. I installed the Intel Processor Diagnostic Tool. (For anyone arriving here later and wanting to replicate this: I tried installing that tool on Hiren's BootCD PE, but it wouldn't run, so I was forced to use Rufus to make a Windows 11 bootable USB drive.) I ran the 3 hour CPU burn-in test from this tool on a loop for 24 hours (closer to 30 hours if you count some interruptions to update drivers before the straight 24 hour run). I was hoping it would definitively lock up on Windows during this burn-in (100% processor usage with intermittent frequency and function checks) but sadly, it was rock solid and never crashed. As proof, here's a screenshot of the test running just past 24 hours.

So, that's frustrating. Absolutely no errors and rock solid operation on Windows for a full 24 hours of CPU thrashing.

I shut that Windows USB drive down and returned to Unraid. And sure enough, just under 24 hours after booting up, it crashed again. But this time, I got a bit more data.

In the past, I've never been able to see what was output on the console because of the GPU passthrough/dummy plug. The machine is headless other than the IPMI interface, and that was always blank after booting into Unraid with the dummy plug. But this time, I left the plug off and hooked up a monitor to the iGPU output. So this time, I captured what looks to me like a CPU crash dump? I'm not sure exactly what I'm looking at here, but it seems to be processor register contents and stack contents from the CPU. Here's a literal photo (pardon the quality, as this is an actual cellphone photo, since the IPMI interface couldn't catch it).

This log repeats every 2-3 minutes and the machine is completely unresponsive otherwise. Does anyone have familiarity with this type of error? Is this my 13th/14th gen Intel processor Vmin shift smoking gun? Why would I experience this crash so repeatably on Unraid but not on Windows 11 after a much more strenuous CPU test? Inquiring minds (and frustrated admins) want to know.