Parity check disabling disks?


augot


I've been using Unraid for a year now, and - after getting over each new learning curve as a total Linux newbie - I've been loving it. However, I've been having some issues over the last month with disks becoming disabled, and, while I'm currently thinking it might be a hardware rather than disk issue (probably my LSI card), I was hoping for some advice before I continue troubleshooting. Hardware:

 

Processor: AMD Ryzen 5 1600AF
Motherboard: ASUS Prime X570-P
Memory: Corsair Vengeance LPX 2x8GB 2666MHz DDR4
Graphics Card: Gigabyte GeForce NVIDIA GTX 1660 OC (passed through to Plex)
Case: Uh well it's an Ikea Alex (with added intake fans), with a PC workbench + drive caddy inside (made more sense than a case inside a cupboard for cooling!)

PSU: 500W Silverstone SX500-LG
Hard Drives: (Cache) 2 x 1TB Samsung 860 PRO + 1 x 2TB WD Blue M.2, (Array) 7 x 8TB WD (mixture of shucked EFAZ, EFAX, EFZX, EMAZ) + 1 x 8TB shucked Seagate ST8000AS0002
HBA: HP H221 660087-001 LSI SAS 9207-8e (flashed to IT mode, array is connected in two groups of 4 disks)

 

The first time the issue appeared was when I rebooted the server around five weeks ago. When it came back online, it had detected an unclean shutdown (strangely, since it was a normal reboot sequence), initiated a parity check, and immediately disabled one of the disks. I went into maintenance mode, ran a long SMART test, checked cabling, etc. - all the usual steps. It was a fairly new disk (<6 months old) and everything came back fine, so I figured I'd take the chance it was a one-off and rebuild.

Worked fine for a week... until my scheduled parity check started on the first of the month. I was immediately hit with an alert that a disk had been disabled - but this time it was a different disk to the last one. Checked everything again over a couple of days, the disk itself seemed fine, so I rebuilt again and continued on for another week or so until I had to reboot the server again. When it came back up... again, it had tried to start a parity check on boot, but now there were two disks disabled, and they were completely different to either of the first two. Checked the disks again, again nothing seemed amiss, so I rebuilt and everything seemed fine... until it happened again on the first of this month, when the parity check starting knocked out one of my array disks yet again.

 

So at this point I know for sure that something is very wrong somewhere, and I can't keep rebuilding my array like this if I want my drives to last. (Also worth noting: the disabled disks have been spread across both of the cables coming from the H221, not just one.) Every time this has happened, it has happened immediately on starting a parity check. Since I have everything on my array backed up and can recover later, I felt I could test things a little more, so once the last round of parity rebuilding finished in maintenance mode I initiated a reboot to try to manually recreate the issue. No disks disabled! OK, but that's maintenance mode - what about normally? Started the array, waited for all the Dockers to finish starting up, hit reboot, and... yep, another random disk disabled after the parity check was attempted on boot.

 

Now, because I'm an idiot, I forgot that Unraid logs don't persist between reboots and didn't save them before the last reboot and rebuild. (It's been the same error each time: some kind of read error about whichever disk goes down, with the number "2048" in there too.) Since I feel I can now recreate the issue if I need to - but also want to save my drives unnecessary stress from more rebuilding if possible - I was hoping for some help figuring out what the likely issue might be, and what other troubleshooting steps I could take, before I deliberately cause another disabled disk (or two) and generate new logs with the error. My hunch is the H221 is failing, but any ideas or advice would be greatly appreciated.
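For next time, one way to keep a copy of the log before rebooting is just to copy it onto the flash drive, which does persist. A rough sketch (the paths are assumptions based on a standard Unraid layout, where /var/log lives in RAM and /boot is the USB flash drive):

```shell
# Hypothetical helper: copy a log file somewhere that survives a reboot.
save_syslog() {
    src="$1"   # e.g. /var/log/syslog (held in RAM on Unraid, lost on reboot)
    dest="$2"  # e.g. /boot/logs (the flash drive, which persists)
    mkdir -p "$dest"
    cp "$src" "$dest/syslog-$(date +%Y%m%d-%H%M%S).txt"
}

# On the server this would be run before the reboot as:
#   save_syslog /var/log/syslog /boot/logs
```

(There's also a built-in syslog server option under Settings in newer Unraid releases, if you'd rather have it mirrored automatically.)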


Yeah, I figured that'd be required. Ah well. Bit the bullet and started a parity check just now - immediately, Disk 2 was disabled this time (after 2048 errors, which I remember being identical each time this has happened, and the read/write errors printed in the log look the same as well). Cancelled the check and generated a diagnostics zip, which hopefully will shed more light on what's happening...

tower-diagnostics-20201008-2218.zip


Problems with multiple devices:

Oct  2 09:35:46 Tower kernel: sd 9:0:5:0: Power-on or device reset occurred
Oct  2 09:35:46 Tower kernel: sd 9:0:7:0: Power-on or device reset occurred
Oct  2 09:35:46 Tower kernel: sd 9:0:1:0: Power-on or device reset occurred
Oct  2 09:35:46 Tower kernel: sd 9:0:2:0: Power-on or device reset occurred
Oct  2 09:35:46 Tower kernel: sd 9:0:4:0: Power-on or device reset occurred

This usually suggests a connection/power problem.
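If it helps narrow things down, a quick way to see whether the resets cluster on particular drives (i.e. one HBA port or power cable) is to group them by SCSI address in a saved syslog. A rough sketch, assuming the message format shown above:

```shell
# Count "Power-on or device reset" events per SCSI target (e.g. 9:0:5:0)
# in a saved syslog, to see if resets cluster on particular drives.
count_resets() {
    grep "Power-on or device reset" "$1" \
        | sed 's/.*sd \([0-9:]*\):.*/\1/' \
        | sort | uniq -c | sort -rn
}

# Usage on a saved copy of the log:
#   count_resets /boot/logs/syslog-20201008.txt
```

If all five targets show roughly equal counts at the same timestamps, as in the excerpt above, that points more toward a shared cause (power or the HBA itself) than a single bad cable.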

 

Also make sure to update the LSI firmware:

mpt2sas_cm0: LSISAS2308: FWVersion(20.00.06.00)

All P20 releases except 20.00.07.00 have known issues.


Thanks for taking a look! Hmm - I think I have a spare SATA power cable somewhere. I'll swap it in and check, but the disks that have been going down have been on separate cables before now. I wonder if it's more likely that the PSU isn't handling the power surge when all eight array disks spin up simultaneously? It handles all the disks being spun up and in use fine, but that's not the same as maximum load all at once. And it's a six-year-old PSU at this point (salvaged from a previous desktop), so this could be the first sign of it dying.
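For what it's worth, a back-of-envelope estimate of that spin-up surge (the per-drive figures are assumed typical values for 3.5" drives, not measured numbers for these specific WD/Seagate models):

```shell
# Rough spin-up surge estimate, all values are assumptions:
#   ~2.0 A on the 12 V rail per drive while the spindle spins up,
#   ~0.7 A on the 5 V rail per drive for the electronics.
DRIVES=8
awk -v n="$DRIVES" 'BEGIN {
    surge = n * (12 * 2.0 + 5 * 0.7)   # watts per drive * drive count
    printf "Estimated simultaneous spin-up surge: %.0f W\n", surge
}'
```

That works out to roughly 220 W for the drives alone, before the CPU, GPU, and board are counted - which could get uncomfortably close to a six-year-old 500 W unit's real capacity, so the PSU theory doesn't seem far-fetched. Enabling staggered spin-up (if the controller supports it) would reduce that peak.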

 

It's been a year since I flashed the LSI card, but I remember there being some issue that meant 20.00.06.00 was the most recent firmware it would accept... I'll look into that though, thanks for flagging, I hadn't even thought that might be problematic since I hadn't had any kinds of disk problems since I first flashed it. Can't hurt to try and use the most stable release.

11 minutes ago, augot said:

But I wonder if more likely is that it's the PSU isn't handling the power surge when all eight array disks spin up under heavy load simultaneously?

It's very possible; as mentioned, it's most likely a connection or power problem. The firmware may or may not have an impact, but it's good to upgrade if possible.

 

 

