[Solved] Power Failure - Unclean Shutdown - Many Each Sync Errors Corrected



First things first, I acknowledge that I am dumb and needed a UPS. It is now on the way.

 

My subdivision had a power outage today, courtesy of a... particularly bad driver meeting a power line and/or transformer box, and the outage wreaked havoc on my unRAID system.

Immediately upon power restoration I began a Parity Check, writing corrections to Parity, and am now 4 hours in. As expected, I have more than a day to go, since I have an 8TB drive for parity. What is not expected, however, is that I am currently sitting at 80334694 Sync Errors Corrected.

 

Observations:

  1. Everything appears to be running normally. All drives (parity, array, cache, and boot) appear to be functioning normally, and the dockers appear to be fine as well.
  2. There are no errors being reported in the SMART reports or on the Main GUI page for any of the drives.
  3. The filesystem is XFS across the board, with the exceptions of Parity and cache (the latter is btrfs).
  4. From what I can tell, all files directly accessed through SAMBA appear to be working as expected as well.
  5. Last Parity Operation was done last week with the installation of a new 8TB data drive to the array and there were no issues reported.

 

Questions:

  1. How screwed am I?
  2. Do I allow it to continue, or stop it for further troubleshooting?
  3. Why didn't I buy a UPS sooner?
  4. What is the next step in the process to recovery, should there be an error found?
    • What should I be looking for in an error, or in the logs? (I want to learn to be less ignorant, not just be told to forget about it.)

 

Attached is the Diagnostics archive.

I appreciate any and all replies and thank you for your time. 

 

~Omni

tower-diagnostics-20180605-2132.zip

Edited by omninewb
18 hours ago, johnnie.black said:

That's a lot of sync errors for an unclean shutdown, even if it was in the middle of writing data to the array when the power went off, but your only option now is to correct them.

Thank you for the reply. It's sitting at just under 500,000,000 now at 40% complete... Craziness. 

 

It was most definitely writing data to the array, but I have no idea how this many errors were generated. I certainly feel the absolute need for a UPS at this point, however. In that vein, what unit is it actually measuring/correcting: bits? Bytes?

 

You mentioned my "only option now". What would the options have been before, and before what, for that matter?

 

Edit: The first round of an error check is now complete.

 

Result: Last check completed on Wed 06 Jun 2018 06:34:20 PM MDT (today), finding 533551368 errors. 

 

Should I go ahead and call this a day, or submit another diagnostics archive and continue with another step? The UPS is on its way, scheduled for delivery next week, btw.

Edited by omninewb

That's an impossibly large number so either something is really wrong now, or something was very wrong earlier.


I think the number represents the count of 4 kB blocks with incorrect data, in which case 500 million blocks would be 2 TB of incorrect data after 40% processed.
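To put the counts in perspective, here is the arithmetic behind that estimate as a small Python sketch. It relies on the same unconfirmed assumption stated above, namely that each sync error corresponds to one 4 KiB (4096-byte) block; the function names are illustrative only.

```python
# Rough conversion of a sync-error count into a data volume.
# ASSUMPTION (not confirmed by unRAID docs here): one sync error
# corresponds to one mismatched 4 KiB block.
BLOCK_SIZE = 4096  # bytes per block (assumed)


def errors_to_bytes(sync_errors: int, block_size: int = BLOCK_SIZE) -> int:
    """Bytes covered by `sync_errors` mismatched blocks."""
    return sync_errors * block_size


def to_tb(n_bytes: int) -> float:
    """Decimal terabytes (10**12 bytes), the unit drive vendors use."""
    return n_bytes / 10**12


if __name__ == "__main__":
    for label, count in [("at 40%", 500_000_000), ("final", 533_551_368)]:
        print(f"{label}: {to_tb(errors_to_bytes(count)):.2f} TB")
    # at 40%: 2.05 TB
    # final: 2.19 TB
```

Under that assumption, the final count of 533551368 works out to a little over 2 TB.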

 

Your figures are so large that they basically indicate that the parity drive did not contain any valid parity data at all, or that one of your data drives is right now just sending out garbage.
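A simplified model of a correcting parity check shows why a count this size is so telling. This Python sketch assumes single XOR ("P") parity, treats every disk as a list of equal-sized blocks, and uses a toy 16-byte block; it is not unRAID's actual implementation, and `correcting_check` is a hypothetical name. If the stored parity were essentially random, nearly every block would mismatch and be rewritten.

```python
import os
from functools import reduce
from operator import xor

BLOCK = 16  # toy block size in bytes (assumption for the demo)


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(xor, col) for col in zip(*blocks))


def correcting_check(data_disks, parity):
    """Recompute parity for every block index, overwrite any stored parity
    block that disagrees, and return the number of corrections made.
    Assumes all disks hold the same number of equal-sized blocks."""
    corrected = 0
    for i, stored in enumerate(parity):
        expected = xor_blocks([disk[i] for disk in data_disks])
        if stored != expected:
            parity[i] = expected  # "write corrections to parity"
            corrected += 1
    return corrected


if __name__ == "__main__":
    n = 1000
    disks = [[os.urandom(BLOCK) for _ in range(n)] for _ in range(3)]
    good = [xor_blocks(blocks) for blocks in zip(*disks)]
    print(correcting_check(disks, good))     # 0: parity already valid
    garbage = [os.urandom(BLOCK) for _ in range(n)]
    print(correcting_check(disks, garbage))  # 1000, barring a 2**-128 fluke
    print(correcting_check(disks, garbage))  # 0: fixed on the first pass
```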

 

You should at the very least try to read the contents of one file from every data disk, to make sure all the data disks read back valid content.
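With many disks, a small script can handle the "read one file from every disk" step. This sketch assumes the data disks are mounted at `/mnt/disk1`, `/mnt/disk2`, and so on (the usual unRAID layout, but verify it on your own system); `first_file`, `read_fully`, and `check_disks` are hypothetical helper names. It only confirms that each disk returns a file end to end without an I/O error. It does not prove the content is correct, which is why actually opening or playing the files is still worthwhile.

```python
import glob
import os


def first_file(root):
    """Return the path of the first regular file found under `root`, or None."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                return path
    return None


def read_fully(path, chunk_size=1 << 20):
    """Read `path` end to end in 1 MiB chunks; return the number of bytes read."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += len(chunk)
    return total


def check_disks(disk_roots):
    """Map each disk root to (path, bytes_read), to None if it holds no files,
    or to the OSError raised while reading."""
    report = {}
    for root in disk_roots:
        path = first_file(root)
        if path is None:
            report[root] = None
            continue
        try:
            report[root] = (path, read_fully(path))
        except OSError as exc:
            report[root] = exc
    return report


if __name__ == "__main__":
    # ASSUMPTION: data disks are mounted at /mnt/disk1, /mnt/disk2, ...
    for root, result in check_disks(sorted(glob.glob("/mnt/disk[0-9]*"))).items():
        print(root, result)
```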

 


So, by the end, a little over 2TB worth of data wasn't correctly reflected on the parity drive. That's no fun...

 

I did as you suggested and played some media off each drive except one (which is empty), and everything appears to be in order. I assume that another full parity check after I get the UPS in would be a good idea, then? Just to make sure that Parity is actually up to date and contains proper data?

 

Thanks for the reply @pwm.

4 hours ago, omninewb said:

replaced a 4TB drive with a fresh 3TB (after moving the files via unBALANCE to the other drives off the emulated one).

Wait, what? How did you manage to get unraid to let you put in a smaller drive? Did you possibly do a new config and check that parity was already valid? If so, that explains it. Parity would definitely NOT be valid in that instance.


Negative, in the 4TB > 3TB instance I moved all the files off the emulated disk. Stopped the array, removed the old disk, started, stopped, popped the new drive into its spot and started again. It did a proper parity check following the preclear and formatting and continued on its merry way. 

 

For the 8TB, however, I did check the "parity is valid" box once it finished its preclear and formatting routine. Maybe that's where the errors came from? Odd that it would only have been bad for 2TB worth of it, though.


If you add an 8TB disk and first have unRAID clear it before it's added, then unRAID doesn't need to recompute the parity for a single-parity system, because the addition of a zeroed disk doesn't affect the "P"-parity equation. So it was OK to check the "parity is valid" box.
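A cleared disk can be added without recomputing parity because single parity is a bytewise XOR across the data disks, and XOR with zero leaves a value unchanged. This toy Python illustration assumes plain XOR parity and uses tiny byte strings to stand in for disks; `p_parity` is a hypothetical helper.

```python
from functools import reduce
from operator import xor


def p_parity(disks):
    """Single "P" parity: bytewise XOR across all data disks.
    Disks shorter than the longest are treated as zero-padded."""
    width = max(len(d) for d in disks)
    padded = [d + bytes(width - len(d)) for d in disks]
    return bytes(reduce(xor, col) for col in zip(*padded))


disk1 = bytes([0x12, 0x34, 0x56, 0x78])
disk2 = bytes([0xAB, 0xCD, 0xEF, 0x01])
before = p_parity([disk1, disk2])

# Adding a fully zeroed disk: x ^ 0 == x, so parity is unchanged.
assert p_parity([disk1, disk2, bytes(4)]) == before

# Adding a disk that still holds any non-zero data changes parity.
assert p_parity([disk1, disk2, bytes([1, 0, 0, 0])]) != before
```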

 

But if you replaced the 4TB disk with a 3TB disk, then the parity would have had to be rebuilt. If it wasn't, then you had invalid parity even before you added the 8TB disk.

 

The only way you could get away with replacing the 4TB disk with a 3TB disk would have been if you first wrote zero to the full emulated disk (to cancel out the parity contribution) and then inserted a 3TB disk that was zeroed (and so hadn't any parity contribution).
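The two swap paths can be compared with the same toy XOR model. Assumptions: disks are short byte strings with made-up sizes, a smaller disk contributes zeros past its end (the "virtual zero" padding), and zeroing a disk through the array keeps parity in step with each write, which is modeled here by simply recomputing parity. The helper `p_parity` is hypothetical.

```python
from functools import reduce
from operator import xor


def p_parity(disks, width):
    """Bytewise XOR of all data disks; each disk is zero-padded to `width`
    bytes, mimicking how a smaller disk contributes zeros past its end."""
    padded = [d + bytes(width - len(d)) for d in disks]
    return bytes(reduce(xor, col) for col in zip(*padded))


PARITY_WIDTH = 10                   # toy stand-in for the 8TB parity disk
other = bytes(range(1, 11))         # another data disk, full width
old_4tb = bytes([7] * 8)            # files moved off, but sectors not zeroed
new_3tb = bytes(6)                  # cleared (all-zero) replacement disk

stored = p_parity([other, old_4tb], PARITY_WIDTH)

# (a) Swap without rebuilding: stored parity still includes old_4tb's bytes.
print(stored == p_parity([other, new_3tb], PARITY_WIDTH))  # False

# (b) Zero the old disk through the array first, then swap.
stored_after_zeroing = p_parity([other, bytes(len(old_4tb))], PARITY_WIDTH)
print(stored_after_zeroing == p_parity([other, new_3tb], PARITY_WIDTH))  # True
```

Moving files off a disk only removes them from the filesystem; the underlying sectors keep their old bytes, which is why path (a) leaves parity out of step.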

 

So in the end, you can consider yourself lucky that you got a power failure that forced you to repair the parity to a correct state.


Interesting... So even though I dropped the old drive from the config and added the new one with the regular drive replacement procedure, the old drive wasn't zeroed out, and that left a bad parity state?

 

I followed the procedure listed here: https://lime-technology.com/wiki/Replacing_a_Data_Drive and it all seemed to work out as planned, despite it stating that the replacement drive could not be smaller.

 

An interesting issue, but one I am glad to be finished with. Thank you all for your help, I certainly appreciate it! For any future drive replacements I will be sure to use a same size or larger disk, and triple-check that the rebuild parity is checked.

  • cmon_google_wtf changed the title to [Solved] Power Failure - Unclean Shutdown - Many Each Sync Errors Corrected

Something is wrong. It's impossible to rebuild 4TB onto 3TB. It won't fit.

 

You had to have done a new config somewhere in there to get unraid to accept the smaller drive into the larger drive's slot.

 

Parity doesn't hold files, it holds the entire drive, full, empty, corrupted, whatever. An empty 4TB drive still contains 4TB worth of 1's and 0's. You can't rebuild 4TB worth of stuff onto 3TB. The replacement drive must be equal or larger for a successful rebuild.
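A rebuild reconstructs every byte position of the missing disk from parity and the surviving disks, whether or not a file ever used that position, so the target must be at least as large. Here is a toy Python sketch under the same XOR-parity assumption, with hypothetical `rebuild_missing` and `write_rebuild` helpers and made-up byte-sized "disks".

```python
from functools import reduce
from operator import xor


def xor_all(chunks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(xor, col) for col in zip(*chunks))


def rebuild_missing(parity, surviving, missing_size):
    """Reconstruct a missing disk of `missing_size` bytes as
    parity XOR (all surviving disks, zero-padded to the parity width)."""
    width = len(parity)
    padded = [d + bytes(width - len(d)) for d in surviving]
    return xor_all([parity, *padded])[:missing_size]


def write_rebuild(image, target_size):
    """'Write' the reconstructed image to a target disk of `target_size` bytes."""
    if len(image) > target_size:
        raise ValueError(f"rebuild needs {len(image)} bytes, target has {target_size}")
    return image + bytes(target_size - len(image))


other = bytes(range(1, 9))
old = bytes([0, 0, 5, 0, 0, 9, 0, 3])  # "empty" disk that still holds leftover bytes
parity = xor_all([other, old])

image = rebuild_missing(parity, [other], len(old))
print(image == old)  # True: every position comes back, used or not
try:
    write_rebuild(image, 6)  # smaller replacement disk
except ValueError as exc:
    print(exc)  # rebuild needs 8 bytes, target has 6
```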


The procedure you linked to does not handle replacing a drive with a smaller one.    It only handles replacing a drive with one of the same size or larger.

 

The only standard way to get a smaller drive into the array is to go via the New Config route, and since this invalidates parity you must NOT tick the “parity is valid” checkbox when you start the array after the New Config.


Oh, something definitely went wrong, and I have no idea where it did. I did not need to go in to build a new config (through the button) or anything other than the steps I listed up above.

 

The solution again seems to be to NOT use a smaller drive (which I never will again) and to make sure that the rebuild parity is checked. As for the cause? Unknown. I did not keep any sort of documentation on the procedure other than memory, but if I need to do any further replacements in the future, I will take greater care in what I am doing.

1 hour ago, itimpi said:

The procedure you linked to does not handle replacing a drive with a smaller one.    It only handles replacing a drive with one of the same size or larger.

 

The only standard way to get a smaller drive into the array is to go via the New Config route, and since this invalidates parity you must NOT tick the “parity is valid” checkbox when you start the array after the New Config.

 

Exactly. The only way to replace a drive with a smaller drive without rebuilding the parity would be the route I wrote earlier, where the old drive is zeroed before it's removed and a new zeroed drive is then inserted in its place. unRAID will then automatically "fill up" the missing size of the new, empty drive with "virtual" zero values, which is exactly what the removed drive contained before removal.

 

If it's possible to get unRAID to switch from larger to smaller without selecting "New Config", then I think a bug hunt is needed.

5 minutes ago, trurl said:

 

I think the most likely explanation is the user doesn't know what he did.

I think this is more likely than an actual bug. I've never managed to find one that I could reproduce. Ever. Unless it was in my own code...

 

I've outlined the steps I took above as to what I did, but as far as an actual issue I don't believe there really is one. I fully think that I was in the wrong in some way or another, and was saved by the fact that there are systems in place to correct stupid like that.

