autumnwalker Posted August 22, 2019

Hello, I woke up this morning to a warning on my dashboard that my parity drive was in a failed state. I also noticed that seven of my other drives were showing errors. Screenshot below. I am running 6.6.5, as I am afraid to upgrade to the 6.7.x series due to the ongoing SQLite corruption bug. Parity and drives 1-7 are on a Dell PERC H310; drives 8-10 are on the onboard SATA of my motherboard. I'm running a few things in Docker and I have one VM. Otherwise this is just a straight NAS. Thoughts on where I go from here? Did I lose my data across the first seven drives?
JorgeB Posted August 22, 2019

Based on the screenshot it looks like a controller problem, the one connected to the eight drives with errors. Make sure it's well seated and cooled, and if possible try a different slot; if it keeps happening, it could be going bad.
autumnwalker (Author) Posted August 22, 2019

53 minutes ago, johnnie.black said: Based on the screenshot looks like a controller problem, the one connected to the 8 drives with errors, make sure it's well seated and cooled, if possible try a different slot, if it keeps happening could be going bad.

That was my first thought as well (re: controller problem). The server has not been physically touched in nearly six months, which leads me to suspect a cooling issue or outright failure rather than the card being unseated. I know there was high I/O last night. The server is currently off. Are you suggesting I reboot it after it has had time to cool down and see where I am?
JorgeB Posted August 22, 2019

After rebooting all drives should come online, unless the HBA is really dead; either way, parity will need to be re-synced.
autumnwalker (Author) Posted August 22, 2019

6 minutes ago, johnnie.black said: After rebooting all drives should come online, unless the HBA is really dead, either way parity will need to be re-synched.

Ok - once I'm back at the server I'll try powering it back on. Should I force maintenance mode? I wasn't able to check the box when powering down; can I force it with startArray="no" in disk.cfg? I'm concerned about all of the errors on the seven data drives. Are those "false" errors that will clear on power-up (assuming the controller isn't dead), or am I looking at data loss? I have diagnostics and syslogs; not sure if those are helpful at this point.
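[Editor's note: a minimal sketch of the disk.cfg edit being asked about here. On a stock Unraid flash drive the file usually lives at /boot/config/disk.cfg (an assumption to verify on your own system); the demo below works on a local stand-in copy so it is safe to run anywhere.]

```shell
# Demo on a stand-in copy; on the real server the file is assumed to be
# /boot/config/disk.cfg on the flash drive.
CFG=./disk.cfg
printf 'startArray="yes"\n' > "$CFG"              # stand-in for the real config
sed -i 's/^startArray="yes"/startArray="no"/' "$CFG"
grep '^startArray' "$CFG"                         # now shows startArray="no"
```

With startArray="no" set, the array should stay stopped after boot, leaving it to be started manually (e.g. in maintenance mode) from the GUI.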
JorgeB Posted August 22, 2019

Those errors are normal if the HBA stopped working, but Unraid only disables one disk at most with single parity (two with dual parity), and luckily parity was the first one to error, so it was the one that got disabled. The errors will be reset after a power cycle; data on all disks should be fine, and only parity will need re-syncing.
autumnwalker (Author) Posted August 22, 2019 (edited)

Ok! So just power it back on and see what happens? (Afraid of data loss ... and of dealing with restoring 18 TB from CrashPlan.)

Edited August 22, 2019 by autumnwalker
JorgeB Posted August 22, 2019

4 hours ago, johnnie.black said: make sure it's well seated and cooled, if possible try a different slot
Frank1940 Posted August 22, 2019

4 hours ago, autumnwalker said: The server has not been physically touched in nearly six months so it's leading me to think cooling issue or failure versus being unseated. I know there was high I/O last night.

And clean the inside of the case if it is dirty. Check that all fans are working and that the cooling fins on heat sinks, air intakes, and exhaust ports are not blocked.
autumnwalker (Author) Posted August 22, 2019

52 minutes ago, Frank1940 said: And clean the inside of the case if it is dirty. Check that all fans are working and that cooling fins on heat sinks, air intakes, and exhaust ports are not blocked.

Agree - I'll take the opportunity to do that now that it's offline.
autumnwalker (Author) Posted August 22, 2019

Ok, I opened it up to take a look. Surprisingly little dust, nothing clogged or blocked, and no obvious damage (blown capacitors, scorch marks, etc.). I powered it back up in safe mode. Parity is still disabled and two disks are missing: disk 5 and disk 6. Thoughts?
Frank1940 Posted August 22, 2019

Get the Diagnostics file (Tools >>> Diagnostics) and upload it in a new post.
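[Editor's note: once the diagnostics are in hand, the bundled syslog is where controller and disk errors show up. A rough sketch of the kind of filter used; the sample lines below are illustrative stand-ins, not output from this server.]

```shell
# Build a small stand-in syslog; a real one is bundled inside the
# diagnostics zip (logs/syslog.txt in the usual layout). Sample lines
# below are made up for illustration.
cat > sample-syslog.txt <<'EOF'
Aug 22 03:14:01 nas01 kernel: mpt2sas_cm0: SAS host is non-operational
Aug 22 03:14:02 nas01 kernel: sd 1:0:0:0: [sdb] tag#0 I/O error
Aug 22 03:14:03 nas01 kernel: md: disk0 write error, sector=12345
Aug 22 03:15:00 nas01 sshd[999]: session opened for user root
EOF
# Keep only the lines that point at a dying HBA or a disabled disk.
grep -Ei 'i/o error|write error|non-operational' sample-syslog.txt
```

On this kind of failure, a burst of I/O errors across every drive on one HBA at the same timestamp is what distinguishes a controller drop from a single failing disk.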
autumnwalker (Author) Posted August 22, 2019 (edited)

Before seeing your post I powered it down, removed and reseated all of the SATA cables, and powered it back on. Now all disks are showing; parity is still disabled. Dare I take this out of safe mode and let it attempt to rebuild parity? @Frank1940 I assume the diagnostics are no good now that all drives are visible?

Edited August 22, 2019 by autumnwalker
Frank1940 Posted August 22, 2019

To the best of my knowledge, you can rebuild parity while in Safe Mode. Having said that, I would still grab the Diagnostics file and post it up. Then folks can have a look at the SMART data on that parity drive.
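[Editor's note: a sketch of the SMART check being suggested here. The diagnostics zip bundles a smartctl report per drive; the failure-predicting attributes should all have a raw value of zero. The report excerpt below is an illustrative stand-in, not this server's actual data, and /dev/sdX is a placeholder device name.]

```shell
# Stand-in excerpt of a smartctl attribute table; on a live system this
# would come from `smartctl -A /dev/sdX` (placeholder device name).
cat > parity-smart.txt <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
EOF
# Column 10 is the raw value; any nonzero count here is a red flag
# before trusting the drive with a parity rebuild.
awk '$10 != 0 { bad = 1; print "suspect:", $2 }
     END { if (!bad) print "no reallocated/pending/uncorrectable sectors" }' parity-smart.txt
```

A disk disabled because its HBA dropped out, as suspected in this thread, will typically show clean attributes like these; reallocated or pending sectors would instead point at the drive itself.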
autumnwalker (Author) Posted August 22, 2019

Current diagnostics attached. System still in safe mode, array offline, parity disk disabled. nas01-diagnostics-20190822-2050.zip
Frank1940 Posted August 23, 2019

The parity disk seems to be fine. Go ahead and see if you can rebuild parity. See this how-to: http://lime-technology.com/wiki/index.php?title=FAQ#How_do_I_re-enable_a_failed_disk.3F
autumnwalker (Author) Posted August 23, 2019

Thank you @Frank1940 and @johnnie.black. Parity rebuild just finished: 12h 2m, 92.3 MB/s, 0 errors. Safe to pull this out of safe mode now?
autumnwalker (Author) Posted August 25, 2019

All was fine for about a day, and then the same eight drives (controller) went bad again. This time disk 4 was disabled. I suspect the controller is bad (or the PCIe slot). Should I do anything different this time, given that disk 4 was disabled rather than parity? Diagnostics attached. nas01-diagnostics-20190825-1409.zip
itimpi Posted August 25, 2019

It might be worth checking that the card is properly seated in the PCIe slot. I once had similar problems that I eventually tracked down to the card not being properly seated; it would every so often drop all drives for no apparent reason.
autumnwalker (Author) Posted August 25, 2019

I'll remove and reseat the card to be sure. When powering back on, will I need to rebuild the array? Can I "trust" it?
JorgeB Posted August 26, 2019

Yes; if you're sure no writes were done to the disabled disk you can trust it, but you'll still need to run a parity check.
autumnwalker (Author) Posted August 26, 2019

4 hours ago, johnnie.black said: Yes, if you're sure no writes were done to the disabled disk, you'll still need to run a parity check.

I'm not sure. Parity rebuild then? What happens if it goes bad again in the middle of a rebuild?
JorgeB Posted August 26, 2019

If you're not sure, best to rebuild the disk, to the same drive or a new one, but make sure it's mounting correctly before overwriting the old one. If it goes bad again you'll need to restart the rebuild from the beginning.
autumnwalker (Author) Posted August 26, 2019

3 minutes ago, johnnie.black said: If you're not sure best to rebuild the disk, to the same or a new one, but make sure it's mounting correctly before overwriting the old one. If it goes bad again you'll need to re-start the rebuild from the beginning.

I'll try re-seating the LSI card, power it back on, and see if everything mounts ok. The last time it ran for just over a day before going bad again, so my fear is that whatever this is is now intermittent: it initially powers on fine but dies after a few hours. If it powers on ok (disk looks mounted properly) and starts the rebuild, but dies mid-rebuild, is my data trashed, or am I in exactly the same spot I am now (one "bad" drive)?