[SOLVED] Disk failure at start of parity check, wrote to array after failure


I had a data disk failure (today) at the start of the monthly parity check cycle, then another user wrote data to a disk/array early this morning before I saw the failure notification.

 

I have a new, blank, precleared disk I can insert into the array. However, the parity disk is indicating array parity is invalid. The user still has the original data (written to the disk/array) on a local PC.

 

I'm not sure what the best next step would be. Should I attempt to recover data from the failed disk, since parity is invalid (the data disk failed just as the monthly parity check started)?

 

Update:  The monthly parity check was set to NOCORRECT, so regardless of whether a data disk failed at the start of or during the parity check process, parity would not be overwritten, correct?  If so, then parity is still valid and I can insert a new disk in place of the failed data disk and rebuild the failed data disk, correct?  Can someone please confirm?  Please check the syslog and look at the timestamps at midnight on 5/1.  I'm curious whether the syslog indicates the parity check was aborted before it ever started.

 

Thanks for any advice.

 

(syslog in my next post)

Link to comment

Follow-up question.  See the two attached screenshots and the attached syslog.  I have unMenu and Dynamix installed.  unMenu indicates parity is valid and Dynamix indicates parity is invalid. Which is correct?

 

From the syslog it appears disk 4 failed at the start of the parity check process.  I'm not sure if the parity check was aborted before overwriting any previous valid parity.  In other words can I simply replace disk 4 with a new disk then rebuild disk 4 contents from the parity data I have?

 

With respect to the user data written to the array after disk 4 failed, would that corrupt the parity needed to rebuild disk 4?  If so, am I better off attempting to copy the contents of the bad disk 4 to a new disk and rebuilding parity with the bad disk removed from the array?

ParityVaild_unMenu.png.725386a7a0babed443a5c1ea2058b1a3.png

ParityInvalid_Dynamix.png.2c92dcc999f4bf820ce0fe6bb1ab4225.png

syslog-20150502-074403.zip

Link to comment

That screenshot doesn't say parity is invalid, it says data is invalid. A redball means unRAID has disabled the disk due to a write failure. Any further access to that disk would use the emulated disk, which means any data read from it would actually be read by calculating what that data would be based on reading all the other disks plus parity. Any data written to that disk would actually be a write to parity, calculated based on the other disks' data.
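To illustrate the emulation described above, here is a minimal Python sketch. It assumes single parity computed as a byte-wise XOR across the data disks; the disk contents and the helper names `xor_blocks` and `emulated_read` are made up for the example and are not unRAID code.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Toy array: four data "disks", each holding one 4-byte block (hypothetical values).
disks = {
    1: b"\x10\x20\x30\x40",
    2: b"\x01\x02\x03\x04",
    3: b"\xaa\xbb\xcc\xdd",
    4: b"\x0f\x0e\x0d\x0c",
}

# Parity is the XOR of every data disk.
parity = xor_blocks(list(disks.values()))

def emulated_read(failed, disks, parity):
    """Compute the failed disk's block from the surviving disks plus parity."""
    survivors = [data for number, data in disks.items() if number != failed]
    return xor_blocks(survivors + [parity])

# Reading the disabled disk 4 never touches disk 4 itself.
assert emulated_read(4, disks, parity) == disks[4]
```

The same calculation is what a rebuild onto a replacement disk relies on, which is why the state of parity matters more here than the state of the failed disk.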

 

When exactly did the disk redball? Was it during the parity check, or after? If during, did the parity check abort, did you stop it, or did it complete?

Link to comment

Sorry for the NOOB questions.  :)

 

From the syslog, it appears the data disk redballed at 12:01 am on 5/1/15, the very beginning of the monthly parity check (starts at 12am).  I did not manually stop or abort the parity check process.  Can the syslog tell me if the parity check completed or was automatically aborted when the data disk redballed?

 

The other question I have is what about user data that was written to another good data disk (~ 7am on 5/1) before I was alerted to the redball disk (~12pm on 5/1)?  What happens to the parity disk when a user writes to a good data disk in the array when there is already a disabled data disk?  I'm guessing the parity disk gets updated with each write to the array (even with a disabled disk).

 

(FYI - we still have another copy of the user data that was written to the array after the disk redballed.)

 

Basically this is the sequence:

 

1.  12am on 5/1 - monthly parity check (NOCORRECT) starts.

2.  ~ 12:01am - disk 4 shows errors and is subsequently disabled.  (as best I can tell from the syslog)

3.  7am - I see my wife is writing data to the array when I leave for work.

4.  12pm - I check my unRAID server alerts from work and see I have a disabled data disk.  I call wife and tell her not to write to array and confirm she still has a local copy of everything she wrote to array at 7am.

Link to comment

If they're NOOB questions then I must be a NOOB, since I am uncertain what happens if a redball occurs during a parity check. Hope someone else will chime in.

 

As for writing to a disk while another disk is redballed, that should not be a problem. A normal write with parity reads the old data from the disk being written, reads the old parity, writes the new data, calculates the parity change, and writes the new parity. It is not necessary to know anything about the other disks to determine the change to parity.
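A small Python sketch of that read-modify-write update, assuming XOR parity. The two-byte disk contents and the helper names `xor` and `rmw_write` are hypothetical and only there to show that the parity change depends on the old and new data of the one disk being written.

```python
def xor(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def rmw_write(old_data, new_data, old_parity):
    """Return the new parity after overwriting one data block.

    Only the old data, the new data, and the old parity are needed;
    no other disk in the array is read.
    """
    delta = xor(old_data, new_data)   # bits that change on the written disk
    return xor(old_parity, delta)     # flip the same bits in parity

# Three data disks; disk 4 is about to be disabled (hypothetical contents).
disk1, disk2, disk4 = b"\x10\x20", b"\x01\x02", b"\x0f\x0e"
parity = xor(xor(disk1, disk2), disk4)

# Disk 4 is now redballed. A user writes new data to the healthy disk 1.
new_disk1 = b"\x55\x66"
parity = rmw_write(disk1, new_disk1, parity)
disk1 = new_disk1

# Rebuilding disk 4 from the survivors plus the updated parity still works.
rebuilt_disk4 = xor(xor(disk1, disk2), parity)
assert rebuilt_disk4 == disk4
```

Under these assumptions, the write to disk 1 leaves the rebuilt contents of disk 4 unchanged, because parity was updated by exactly the bits that changed on disk 1.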

Link to comment

Thank you trurl.

 

The monthly parity check was set to NOCORRECT, so regardless of whether a data disk failed at the start of or during the parity check process, parity would not be overwritten, correct?  If so, then parity is still valid and I can insert a new disk in place of the failed data disk and rebuild the failed data disk, correct?

 

If someone more knowledgeable than me can check the syslog and look at the timestamps starting at midnight on 5/1, I'm curious if the syslog indicates the parity check was aborted before it ever started.

Link to comment

Another NOOB here. :)

 

If a disk were to red ball during a parity check, I believe that unRAID would immediately stop the parity check, since a parity check has nothing to check against if the array is one disk down.

 

Either that or it would slog along recomputing the missing disk's sectors and then using that to perform the parity check (which would always be valid because parity was just used to recompute the failed sector). This would be dumb behavior, but since I have never encountered this, it's a possibility.
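A quick Python sketch of why such a check could never fail, assuming XOR parity and one byte per disk. The values and the helper name `parity_of` are made up; the point is that the stored parity can be anything and the check still passes.

```python
from functools import reduce
from operator import xor

def parity_of(values):
    """XOR a list of byte values together."""
    return reduce(xor, values, 0)

good = [0x10, 0x01, 0xAA]   # surviving data disks, one byte each (hypothetical)
stored_parity = 0x5C        # arbitrary; it may or may not be "correct"

# Emulate the missing disk from the survivors plus whatever parity is stored.
emulated = parity_of(good) ^ stored_parity

# "Check" parity using the emulated disk: it reproduces the stored parity
# no matter what that value was, so the check reports no errors.
recomputed = parity_of(good + [emulated])
assert recomputed == stored_parity
```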

 

There is some evidence that when a disk red balls, it can corrupt the file that was being copied at that moment. It is good to have MD5s to validate data.
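For anyone wanting to try this, here is a minimal Python sketch of building and re-checking an MD5 manifest using only the standard library. The function names (`md5_of`, `build_manifest`, `changed_files`) are my own for illustration, and the share path in the usage note is an assumption about where the data lives.

```python
import hashlib
from pathlib import Path

def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each file's path (relative to root) to its MD5."""
    root = Path(root)
    return {
        str(path.relative_to(root)): md5_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }

def changed_files(root, manifest):
    """List files from an earlier manifest that are now missing or differ."""
    current = build_manifest(root)
    return sorted(name for name in manifest if current.get(name) != manifest[name])
```

A possible workflow: call `build_manifest("/mnt/user/SomeShare")` (a hypothetical path) while the array is healthy, store the result somewhere off the array, and after a rebuild run `changed_files` against the same path. Files that were edited on purpose will also show up, so the list needs a manual look.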

 

It is quite likely that the "failed" disk failed because of loose cabling, and that it is actually just fine. You'd need a parity check to tell.

Link to comment

Lots of read/write errors on disk4 at 00:02:20, then

May  1 00:02:20 Artoo-Detoo kernel: md: md_do_sync: got signal, exit...
May  1 00:02:20 Artoo-Detoo kernel: md: recovery thread sync completion status: -4

which I think is the parity check aborting.
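If you want to pull those lines out of a long syslog yourself, a minimal Python sketch follows. It only matches the two message texts quoted above; reading them as "the parity check aborted" is my interpretation, not something the script verifies, and the function names are made up for the example.

```python
import re

# Message texts copied from the two syslog lines quoted above.
SIGNAL_RE = re.compile(r"md: md_do_sync: got signal, exit")
STATUS_RE = re.compile(r"md: recovery thread sync completion status: (-?\d+)")

def saw_sync_signal(lines):
    """True if any line reports md_do_sync exiting on a signal."""
    return any(SIGNAL_RE.search(line) for line in lines)

def completion_statuses(lines):
    """Return every sync completion status found, as integers, in log order."""
    statuses = []
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            statuses.append(int(match.group(1)))
    return statuses

sample = [
    "May  1 00:02:20 Artoo-Detoo kernel: md: md_do_sync: got signal, exit...",
    "May  1 00:02:20 Artoo-Detoo kernel: md: recovery thread sync completion status: -4",
]

print(saw_sync_signal(sample), completion_statuses(sample))
```

To run it against a real log, pass an open file instead of `sample`, for example `completion_statuses(open("syslog"))`, where the file name is whatever your extracted syslog is called.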

 

You should be OK to rebuild.

Link to comment

Thanks trurl and bjp999!

 

I was able to rebuild the disk on a new drive.  I did not have any data loss.  Fortunately no one was writing to the disk when it became disabled.  I'll look at creating MD5s to validate integrity going forward.

Link to comment

Archived

This topic is now archived and is closed to further replies.
