jam Posted April 3, 2020
Something terrible seems to have happened to my server while I was performing some upgrades 😨
I wanted to replace the existing 10TB parity drive with a 14TB one so that I could add larger capacity drives to the array. I ran a parity check beforehand, which reported 0 errors.
After replacing the drive, Unraid began a parity rebuild, but didn't get very far before all drives in the array began spewing errors into the system log. I've tried the process a couple more times and the same thing keeps happening. After stopping the parity rebuild and attempting a recursive ls on each disk, some files/directories can be accessed while others fail with "Input/output error". I can't view the SMART status of any of the drives either; it just says that "a mandatory command failed".
What the hell has happened?? How could all 7 disks in the array simultaneously go bad, when a full parity check beforehand didn't report any errors?
I didn't touch any of the cables to the array drives during the upgrade... the new drive is connected to a totally different SATA controller/PSU power cable. In the interest of full disclosure, I also added more RAM to the system recently, but this has been in place for a week without issues, and was there during the parity check prior to adding the new drive.
holt-diagnostics-20200403-1326.zip

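For anyone wanting to script the recursive listing check described in the post above, here is a minimal Python sketch. It is not what jam actually ran (that was a plain recursive ls); it only approximates it by listing every directory and calling lstat on every file, collecting the paths that fail with "Input/output error" (EIO). The function name `find_io_errors` is made up for illustration, and the demo runs against a throwaway temporary directory rather than a real array disk.

```python
import errno
import os
import tempfile


def find_io_errors(root):
    """Walk root like a recursive ls and return paths that fail with EIO."""
    failed = []

    def on_walk_error(err):
        # os.walk calls this when a directory cannot be listed.
        if err.errno == errno.EIO:
            failed.append(err.filename)

    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                os.lstat(path)  # metadata lookup only, no file contents are read
            except OSError as err:
                if err.errno == errno.EIO:
                    failed.append(path)
    return failed


# Demo on a healthy scratch directory: nothing should be reported.
with tempfile.TemporaryDirectory() as scratch:
    with open(os.path.join(scratch, "example.txt"), "w") as handle:
        handle.write("ok")
    clean = find_io_errors(scratch)

print(clean)
```

To check a real disk, you would pass its mount point instead of the scratch directory, for example something like /mnt/disk1 (assumed here; adjust to wherever your disks are mounted). Because only metadata is touched, a clean result does not prove the file contents are readable.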
JorgeB Posted April 3, 2020
Problem with the SATA controller; this is unfortunately rather common with Ryzen boards:
Apr 3 13:13:07 Holt kernel: ahci 0000:01:00.1: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x0800000000000000 flags=0x0010]
Disabling IOMMU and/or a BIOS update might help.

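If you want to check your own syslog for the same symptom, the following is a rough Python sketch that picks out AHCI IO_PAGE_FAULT events. The regular expression is modelled only on the single log line quoted above, so treat the pattern as an assumption; other kernels or controllers may format the message differently. The helper name `find_page_faults` is hypothetical.

```python
import re

# Pattern based on the one log line quoted in the thread (an assumption, not a spec).
PAGE_FAULT = re.compile(
    r"ahci (?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]): "
    r"Event logged \[IO_PAGE_FAULT .*?address=(?P<address>0x[0-9a-f]+)"
)


def find_page_faults(lines):
    """Return (pci_device, address) pairs for every matching log line."""
    hits = []
    for line in lines:
        match = PAGE_FAULT.search(line)
        if match:
            hits.append((match.group("device"), match.group("address")))
    return hits


sample = [
    "Apr 3 13:13:07 Holt kernel: ahci 0000:01:00.1: Event logged "
    "[IO_PAGE_FAULT domain=0x0000 address=0x0800000000000000 flags=0x0010]",
    "Apr 3 13:13:08 Holt kernel: some unrelated message",
]
faults = find_page_faults(sample)
print(faults)
```

Since `find_page_faults` accepts any iterable of lines, an open file object for your syslog can be passed in directly (the log's location depends on your system). A hit only tells you the controller logged the event; it does not by itself say whether IOMMU, the BIOS, or the kernel is at fault.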
jam Posted April 3, 2020
Thanks for the suggestions. I updated to the latest BIOS and tried again with IOMMU both enabled and disabled, but I'm still experiencing the problem.
I'm guessing that the motherboard is screwed then?

JorgeB Posted April 3, 2020
jam said: I'm guessing that the motherboard is screwed then?
I wouldn't say the motherboard is bad, since this happens with multiple Ryzen models, so it's possibly a kernel/compatibility issue. You can also try v6.9-beta1, which uses a much newer kernel. If it's still the same, then a different model board might help, if you're lucky.

jam Posted April 3, 2020
Thanks again, I'll update now and give it a try. If it was a kernel/compatibility issue though, wouldn't I have seen this before? I've had the system for just over 7 months and have never seen errors like this until now.

JorgeB Posted April 3, 2020
It might happen mostly under more load, or even a specific type of load.

jam Posted April 3, 2020
Okay, I'm quietly confident that the 6.9 beta is working properly 🤞 The parity rebuild is at 10% now without errors; it never reached 5% before.
Thanks again for the fast assistance!

dcoulson Posted April 4, 2020
jam said: Okay, I'm quietly confident that the 6.9 beta is working properly 🤞 The parity rebuild is at 10% now without errors; it never reached 5% before.
I'm having similar issues on an X399 TR board. Did the rebuild complete with the 6.9-beta release?

jam Posted April 4, 2020
dcoulson said: I'm having similar issues on a x399 TR board - Did the rebuild complete with the 6.9-beta release?
It's still going but it's reached 75% without issues. I'm using an X399 TR board too (ASRock Taichi) so I'd recommend giving the beta a try.

dcoulson Posted April 5, 2020
jam said: It's still going but it's reached 75% without issues. I'm using an X399 TR board too (ASRock Taichi) so I'd recommend giving the beta a try.
Did it complete successfully? I tried the beta and had the same issue. Trying with IOMMU disabled in the BIOS now...

jam Posted April 5, 2020
Yeah, mine finished successfully with IOMMU still enabled. I'm preclearing the new drives now, also without errors.

jam Posted April 8, 2020
Well, after a couple of days without issue, I'm getting I/O errors again... this time with the cache pool 😒 Is there anything else I can try?
holt-diagnostics-20200408-0924.zip

JorgeB Posted April 8, 2020
jam said: I'm getting I/O errors again...
This is unrelated to the previous issues: the cache pool is completely full. Note that the GUI can show wrong values when the pool uses devices of different sizes.

jam Posted April 8, 2020
Oh damn it, sorry! I was running the mover when the errors first happened, so I didn't think to check how full the cache was. It must've been filling up faster than it was emptying.