whitewlf

Unclean shutdown, one drive errors, other disabled




I have/had a 5-drive system (4x8TB including parity, 1x SSD cache). It is on a UPS, but it shares that UPS with my PC, and the PC had the UPS's USB line connected to it rather than the server. (I have swapped/fixed that now.)

 

This morning we had a 2-hour power outage; I am unsure if there was any flicker. All drives and both the PC and the Unraid box are on the UPS.

 

When Unraid came back up, disk 2 had about 1000 read errors and disk 3 was disabled.

 

I had -just- bought a new drive on Friday to add space to the array but hadn't put it in yet.

 

I ran out and bought 2 more drives today, and all three are in the unit now, 2 pre-clearing, 1 waiting to be a second parity once that finishes. (Can't do it all at once).

 

Disk 2 is now throwing tons of read errors, and took SMB offline. It passed a btrfs check and a SMART test before I added the new drives and, until this morning's unclean shutdown, had not shown any errors at all.

 

Disk 3 passed long and short smart tests, and btrfs.  The data on it appears to be intact as well, but, due to what is likely a tiny parity mismatch, it's not in the array. 

 

With Disk 2 acting this squirrelly, though not yet kicked from the array, I doubt I will get very far salvaging the data from it, and its read errors will also break Disk 3's emulated data.

 

While I am not sure all the data on Disk 3 is 100%, I really don't care if a little is lost. I can likely replace it, but, replacing 7TB..+ 7TB.. (they were both nearly full) is entirely another pain in the ass. These are media files, most are huge, and, can survive a little damage, possibly. 

 

What is the best way to proceed once the 2 new data drives enter the array? I'm holding off on the second parity for now, and wondering if I should instead use that drive to directly copy the Disk 3 data to if things go any more pear-shaped.

 

Is there any way to force Disk 3 back to active, accepting that a bit of the parity might be bogus? This would allow me to remove Disk 2 and rebuild onto the new drives, given that real read errors are worse than a likely small parity mismatch.

 

Note, all the data drives are Seagate Expansion drives, identical models inside and out and, afaict, a shingled magnetic (SMR) design. They are currently all individual external units on USB 3, connected through 2x low-profile 4-port PCIx USB3 cards (until a new server is built and the drives are shucked). They are all under a year old. I plan on adding a second cache drive as soon as this data is fixed, as well (2 parity, 5-6 data, 2 cache SSD, depending on Seagate's warranty for the erroring drive).

 

I would have loved to use IronWolfs, but they are 2.5x the cost, and I don't have a tower server ready for that atm. This server is (hilariously, and impressively) running on a low-profile i5 ex-office PC. The USB connections are obviously less than ideal, but they have been running for nearly a year with only minor slowness (heavy use of sabnzb, plex, samba, parity, etc. likely causes high IO load, and large file deletes are slow, but it's been adequate for simple plex/sab/sick/CP).

 

As a side question: is there any way, in the future, to pre-empty an array data drive, i.e. have it spread its contents to the other devices before being removed/replaced for old age or if it starts tossing errors (versus the cold-turkey, pull-it-and-rebuild method)? I'm assuming this isn't a common method, perhaps because it involves a similar amount of reads/writes to the eventual rebuild anyway, but I thought I should ask.

 

 

selene-diagnostics-20180611-0009.zip

Edited by whitewlf


Disk2 is failing and needs to be replaced, you can re-enable disk3 to rebuild disk2 but there could be some corruption depending on how much out of sync it is with parity.
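
To illustrate why an out-of-sync disk3 can corrupt a rebuilt disk2, here is a minimal sketch of single-parity reconstruction. It assumes plain byte-wise XOR parity and a toy 4-byte "disk"; the disk contents and the helper name `xor_blocks` are made up for the example and are not Unraid code.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Toy array: three data disks of 4 bytes each plus one XOR parity disk.
disk1 = bytes([0x11, 0x22, 0x33, 0x44])
disk2 = bytes([0xA0, 0xB0, 0xC0, 0xD0])
disk3 = bytes([0x01, 0x02, 0x03, 0x04])
parity = xor_blocks([disk1, disk2, disk3])

# In sync: parity XOR the surviving disks reproduces disk2 exactly.
rebuilt_ok = xor_blocks([parity, disk1, disk3])

# Out of sync: one byte on disk3 no longer matches what parity was built from.
disk3_stale = bytes([0x01, 0x02, 0xFF, 0x04])
rebuilt_bad = xor_blocks([parity, disk1, disk3_stale])

# Positions where the rebuilt disk2 differs from the real disk2.
damaged = [i for i, (a, b) in enumerate(zip(disk2, rebuilt_bad)) if a != b]
print(rebuilt_ok == disk2, damaged)
```

In this toy model the damage on the rebuilt disk is limited to the positions where disk3 and parity disagree, which is why the amount of corruption depends on how far out of sync the disk is.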

7 hours ago, johnnie.black said:

Disk2 is failing and needs to be replaced, you can re-enable disk3 to rebuild disk2 but there could be some corruption depending on how much out of sync it is with parity.

 

I could not see a direct way to re-enable disk 3. The two new devices are still clearing, and it is taking an awfully long time: it has been about 6-7 hours and it is only 40% done.

 

I found a mention about enabling an out of parity disk for Unraid 6, stating to drop the parity disk from the array, add it back, and click "Accept parity".. would this be the process to use to re-enable Disk 3? 

 

Also, the pre-clearing, I am not sure what that process does. Should I wait for it to complete, or, just stop the preclear now, stop array, drop the parity, re-add it, etc?

 

I am just trying to not make a mis-step that makes this more difficult or loses more data. 

 

On that note, anyone have an idea why Disk 2 would go from seemingly fine to spewing errors after only an unclean shut down?  My only guess is, the 2 hour down time cooled the drive and exposed a flaw by the thermal fluctuation. It was the first time any of the drives had been shut down longer than a couple minutes since install, and even then only a couple times. 

 


On second glance, those drives preclearing have spun down, and stopped doing anything. They were writing at ~98-110MB/sec last night. The counter is still stuck at 39% progress. I am thinking something hung up. 

5 hours ago, whitewlf said:

On that note, anyone have an idea why Disk 2 would go from seemingly fine to spewing errors after only an unclean shut down?  My only guess is, the 2 hour down time cooled the drive and exposed a flaw by the thermal fluctuation.

 

It is not related to thermal fluctuation or temperature. If the disk had any pending or queued writes when power was suddenly cut, and it could not recover from that after power-on, then errors would occur.

 

17 hours ago, whitewlf said:

I would have loved to do IronWolf's but, they are 2.5x the cost, and, I don't have a tower server ready for that atm.

 

Do you mean Seagate IronWolf? It seems you really love Seagate and like all your disks to be the same brand. Anyway, the WD My Book 8TB is also a good choice and does not cost too much.

 

6 hours ago, whitewlf said:

On that note, anyone have an idea why Disk 2 would go from seemingly fine to spewing errors after only an unclean shut down?  My only guess is, the 2 hour down time cooled the drive and exposed a flaw by the thermal fluctuation. It was the first time any of the drives had been shut down longer than a couple minutes since install, and even then only a couple times. 

 

Many years ago I had a hard time with Seagate / WD hard disks; they always gave me trouble. So I turned a retired ARM QNAP into a box for handling that kind of disk work. I seldom perform such tasks on a live system, because working on 1 or 2 disks while a lot of unrelated disks sit in the same boat, and are put at risk, is not what I prefer.

Anyway, I haven't used that QNAP for a long time, because I haven't had issues for quite a long time. The fact is that I have been lucky, and I take extra care to select reliable disks.

Edited by Benson

15 hours ago, johnnie.black said:

Disk2 is failing and needs to be replaced, you can re-enable disk3 to rebuild disk2 but there could be some corruption depending on how much out of sync it is with parity.

As this array is simply a personal, cost-effective mass storage solution, I was choosing the least expensive drives that are readily available. My local Fry's has them for 159 or less on sale. They do not seem to stock the WD in 8TB often.

 

I am still hoping someone who knows more about the force enabling of the out of sync offline drive could advise on the above.


Just leave all disks at their original assignments. Stop the array and, in the Tools section, run New Config, retaining all disk settings. Then start the array with "parity is already valid" ticked.

Then check what is abnormal and plan the next action.

Or, since you have only 1 parity, if you assume disk 2 is bad, then:

ensure all disks are at their original assignments

stop the array

run New Config, retaining all disk settings

start in maintenance mode with "parity is already valid" ticked

stop the array

set disk2 to none

start the array

Disk2 will then be emulated. Check whether every disk's file system mounts successfully and plan the next action.

Edited by Benson


You can also use the invalid slot command; for this you need to use a new disk2:

 

-Tools -> New Config -> Retain current configuration: All -> Apply
-Assign any missing disk(s) including new disk2
-Important - After checking the assignments leave the browser on that page, the "Main" page.

-Open an SSH session/use the console and type:

mdcmd set invalidslot 2 29

-Back in the GUI, without refreshing the page, just start the array. Do not check the "parity is already valid" box. Disk2 will start rebuilding. The disk should mount immediately, but if it's unmountable, don't format it; wait for the rebuild to finish and then run a filesystem check.


Already did the new config with the "parity is valid" approach, then checked disks 2 & 3 with btrfs --repair, ran short SMART tests on all disks, stopped the array and added two more drives. It's clearing them now overnight. I will add the second parity when/if that finishes. I won't know if the data is damaged really until I run across it. No errors reported yet.

selene-diagnostics-20180611-2344.zip




  I won't know if the data is damaged really until I run across it.


Since you're using btrfs you can run a scrub on those disks to confirm there's no data corruption.


If you have ticked the ‘parity is valid’ box then you probably also want to run a correcting parity check.   The number of errors corrected will give an indication of the likely level of corruption (if any).  A small number (single figures) tends to be normal and nothing to worry about.
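
As a rough picture of what a correcting check counts, here is a minimal sketch, again assuming byte-wise XOR single parity on toy data. The function name `correcting_parity_check` and the sample bytes are hypothetical and only for illustration; this is not how Unraid itself is implemented.

```python
def correcting_parity_check(data_disks, parity):
    """Recompute XOR parity from the data disks, fix mismatches, count them.

    Returns (corrected_parity, number_of_positions_corrected).
    """
    corrected = bytearray(parity)
    errors = 0
    for pos, column in enumerate(zip(*data_disks)):
        expected = 0
        for byte in column:
            expected ^= byte
        if corrected[pos] != expected:
            corrected[pos] = expected  # parity is rewritten to match the data
            errors += 1
    return bytes(corrected), errors

# Toy data: two data disks and a parity disk with one stale byte (position 2).
data = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8])]
stale_parity = bytes([4, 4, 0, 12])

fixed_parity, corrected_count = correcting_parity_check(data, stale_parity)
print(corrected_count, list(fixed_parity))
```

Note that in this model the check only makes parity agree with whatever is on the data disks; it does not repair the data itself, so the count is an indicator of how far parity had drifted rather than a fix for damaged files.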


Letting it add these two new drives first; I will do more checks when it finishes, and run a parity check. Best to add the second drive first? Otherwise it will likely be 3 days + 3 days to do it twice. I am still fairly wary of Disk 2. If it shows more than a small amount of errors it's likely best to warranty it out; it's only 9-10 months old.

 

Thanks for all the help, I'll let you know how it ends up. Or blows up.


Quick update, everything does seem ok, no errors thus far. The array rebuilt, the new drives cleared and added in, and now it's syncing up the second parity, which will take quite a while. Since it needs to read 100% of all the drives, if there are any exposable problems, it should trip them. 27.8% done, 40TB usable, 17.7TB free.


Sounds good, as long as there are no big problems.

