
Unclean Powerdown, disks spun down, Parity Check detected 1 Error (correct disabled), canceled, reran with correct, no error found


Solved by JorgeB

Hey there everyone,


I'm just wondering how / why this would occur.  8 disk array, dual parity, no cache.  There was an unclean shutdown due to power issues (I have a UPS, but don't have the automation set up to shut down yet).  There were no writes going on and the disks were spun down. 


The next day, I ran a parity check with correct disabled and it found 1 error about 60% of the way through.  I read these forums and decided to cancel and restart the parity check with correct enabled. 

 

The parity check ran with no errors detected. 


What could have caused this?  Should I be concerned?

 

Here are the related lines from the log:  (suspect line Jun 13 07:06:23 Tower kernel: md: recovery thread: PQ incorrect, sector=807021208)

Jun 13 06:05:27 Tower kernel: mdcmd (42): check 
Jun 13 06:05:27 Tower kernel: md: recovery thread: check P Q ...
Jun 13 06:05:50 Tower kernel: mdcmd (43): nocheck Cancel
Jun 13 06:05:50 Tower kernel: md: recovery thread: exit status: -4
Jun 13 06:05:55 Tower kernel: mdcmd (44): check nocorrect
Jun 13 06:05:55 Tower kernel: md: recovery thread: check P Q ...
Jun 13 06:08:57 Tower nmbd[6513]: [2022/06/13 06:08:57.865107,  0] ../../source3/nmbd/nmbd_become_lmb.c:397(become_local_master_stage2)
Jun 13 06:08:57 Tower nmbd[6513]:   *****
Jun 13 06:08:57 Tower nmbd[6513]:   
Jun 13 06:08:57 Tower nmbd[6513]:   Samba name server TOWER is now a local master browser for workgroup HOME on subnet 192.168.122.1
Jun 13 06:08:57 Tower nmbd[6513]:   
Jun 13 06:08:57 Tower nmbd[6513]:   *****
Jun 13 06:08:57 Tower nmbd[6513]: [2022/06/13 06:08:57.865228,  0] ../../source3/nmbd/nmbd_become_lmb.c:397(become_local_master_stage2)
Jun 13 06:08:57 Tower nmbd[6513]:   *****
Jun 13 06:08:57 Tower nmbd[6513]:   
Jun 13 06:08:57 Tower nmbd[6513]:   Samba name server TOWER is now a local master browser for workgroup HOME on subnet 172.17.0.1
Jun 13 06:08:57 Tower nmbd[6513]:   
Jun 13 06:08:57 Tower nmbd[6513]:   *****
Jun 13 07:06:23 Tower kernel: md: recovery thread: PQ incorrect, sector=807021208
Jun 13 13:03:31 Tower kernel: mdcmd (45): nocheck Cancel
Jun 13 13:03:31 Tower kernel: md: recovery thread: exit status: -4
Jun 13 13:03:40 Tower kernel: mdcmd (46): check 
Jun 13 13:03:40 Tower kernel: md: recovery thread: check P Q ...
Jun 13 13:03:58 Tower kernel: mdcmd (47): nocheck Cancel
Jun 13 13:03:58 Tower kernel: md: recovery thread: exit status: -4
Jun 13 13:04:02 Tower kernel: mdcmd (48): check nocorrect
Jun 13 13:04:02 Tower kernel: md: recovery thread: check P Q ...
Jun 13 13:04:10 Tower kernel: mdcmd (49): nocheck Cancel
Jun 13 13:04:10 Tower kernel: md: recovery thread: exit status: -4
Jun 13 13:04:14 Tower kernel: mdcmd (50): check 
Jun 13 13:04:14 Tower kernel: md: recovery thread: check P Q ...
Jun 13 13:05:29 Tower kernel: mdcmd (51): nocheck Cancel
Jun 13 13:05:29 Tower kernel: md: recovery thread: exit status: -4
Jun 13 13:12:29 Tower kernel: mdcmd (52): check 
Jun 13 13:12:29 Tower kernel: md: recovery thread: check P Q ...

 

Any thoughts much appreciated. 


Thanks

Link to comment

The log snippet doesn't show the full correcting check. Assuming it did run until the end without finding any errors, that suggests the previous error was unrelated to the unclean shutdown and was possibly something like a RAM bit flip. Errors related to an unclean shutdown, when they exist, are mostly at the beginning of the disks, which is where the metadata is stored, assuming an XFS filesystem for the array.

Link to comment
1 hour ago, JorgeB said:

The log snippet doesn't show the full correcting check. Assuming it did run until the end without finding any errors, that suggests the previous error was unrelated to the unclean shutdown and was possibly something like a RAM bit flip. Errors related to an unclean shutdown, when they exist, are mostly at the beginning of the disks, which is where the metadata is stored, assuming an XFS filesystem for the array.

Thanks so much Jorge, let me get the whole log.  All disks are XFS.

 

I attached the whole diagnostic stack.  I appreciate your reply and help. 

 

Let me know if I can provide any more information.  I also provided a screenshot showing the times of the run with 1 sync error vs. the clean run. 

 

 

Screenshot 2022-06-15 153533.png

tower-diagnostics-20220615-1543.zip

Link to comment
3 hours ago, JorgeB said:

If it was a bit flip it might be hard to catch any issues with memtest, still worth running a couple of passes, if no errors are found see how the next parity checks go.

Alright, will give it a go, thanks.  I didn't see anything related to full parity check with correction in the syslog... did I need to have debug turned on or something?


Thanks

Link to comment
6 hours ago, JorgeB said:

Because there wasn't one, all checks were non-correcting.

Hey Jorge,

 

Thanks.  The last full check shown below (16hr, 20 min, 57 sec) did have correcting enabled.  I was expecting something to show up in the log like "parity check started" and "parity check completed."  You are saying there is basically no identification for this, and things in the log would only show up if errors were found and corrected (or not corrected as the previous run shows in the log)?

 

Thanks,

Chris

Screenshot 2022-06-15 153533.png

Link to comment

There is, and you're right, it was a correcting check. Do you remember if it was started manually or scheduled? They were logged differently, though it's been some time since I checked, so that could have changed.

 

Jun 13 13:12:29 Tower kernel: mdcmd (52): check
Jun 13 13:12:29 Tower kernel: md: recovery thread: check P Q ...
...
Jun 14 05:33:26 Tower kernel: md: sync done. time=58857sec
Jun 14 05:33:26 Tower kernel: md: recovery thread: exit status: 0

 

Start, duration and end (exit status 0 means it completed successfully, independent of finding sync errors or not) are always logged.
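
As a rough illustration of those three markers, here is a minimal Python sketch that scans syslog lines for the start of a check, any `PQ incorrect` sectors, the duration and the exit status. The regular expressions are based only on the lines quoted in this thread, and the names `summarize_checks` and `format_duration` are made up for the example. Other Unraid versions may log these events differently.

```python
import re

# Patterns taken from the kernel lines quoted in this thread (assumed formats).
START = re.compile(r"mdcmd \(\d+\): check\s*(nocorrect)?")
ERROR = re.compile(r"PQ incorrect, sector=(\d+)")
DONE = re.compile(r"md: sync done\. time=(\d+)sec")
EXIT = re.compile(r"recovery thread: exit status: (-?\d+)")


def format_duration(seconds):
    """Turn a duration in seconds into an 'Xh Ym Zs' string."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes}m {secs}s"


def summarize_checks(lines):
    """Return one dict per parity check found in the given syslog lines."""
    checks, current = [], None
    for line in lines:
        start = START.search(line)
        if start:
            # 'check' alone is a correcting check, 'check nocorrect' is not.
            current = {
                "correcting": start.group(1) is None,
                "error_sectors": [],
                "seconds": None,
                "exit_status": None,
            }
            checks.append(current)
        elif current is not None:
            error = ERROR.search(line)
            done = DONE.search(line)
            status = EXIT.search(line)
            if error:
                current["error_sectors"].append(int(error.group(1)))
            elif done:
                current["seconds"] = int(done.group(1))
            elif status:
                # The exit status line closes the current check.
                current["exit_status"] = int(status.group(1))
                current = None
    return checks
```

Feeding it the lines above gives one correcting check with exit status 0 and no error sectors, and `format_duration(58857)` works out to 16h 20m 57s, which matches the duration of the last full check mentioned earlier in the thread.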

Link to comment
  • 2 years later...
Posted (edited)

Warming this back up as I have some clarification questions:

 

This happened again.  I had a PSU failure and a few unclean shutdowns (the PSU would cause a reboot whenever the disks spun up or the system did anything).

 

Once stability was achieved, I ran a non-correcting parity check and it had 1 error. 

 

I ran it again (non-correcting) and it found no errors. 

 

This is all in maintenance mode BTW.  Here is the log:

 

Jun 14 14:24:41 Tower kernel: mdcmd (36): check nocorrect
Jun 14 14:24:41 Tower kernel: md: recovery thread: check P Q ...
Jun 14 14:41:00 Tower kernel: md: recovery thread: PQ incorrect, sector=217794056
Jun 15 06:37:38 Tower kernel: md: sync done. time=58377sec
Jun 15 06:37:38 Tower kernel: md: recovery thread: exit status: 0
Jun 15 09:07:11 Tower kernel: mdcmd (37): check nocorrect
Jun 15 09:07:11 Tower kernel: md: recovery thread: check P Q ...
Jun 16 01:20:27 Tower kernel: md: sync done. time=58396sec
Jun 16 01:20:27 Tower kernel: md: recovery thread: exit status: 0

 

You can see I ran these one after the other, pretty much without a reboot. 
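
For anyone wanting to see roughly where on the disks a reported sector sits, here is a small Python sketch that converts the sector number from the log into a byte offset. It assumes the log counts 512-byte sectors, which is an assumption on my part and not something confirmed in this thread. The helper names are made up for the example.

```python
# Assumption: the md driver reports sectors in 512-byte units.
SECTOR_BYTES = 512


def sector_to_offset(sector, sector_bytes=SECTOR_BYTES):
    """Byte offset from the start of the disk for a given sector number."""
    return sector * sector_bytes


def describe(sector):
    """Human-readable position, using decimal gigabytes."""
    offset = sector_to_offset(sector)
    return f"sector {sector} is about {offset / 1e9:.1f} GB into the disk"


# The two sectors reported in this thread.
for reported in (807021208, 217794056):
    print(describe(reported))
```

Under that assumption the two reported sectors land roughly 413 GB and 112 GB into the disks. Without the disk sizes, that can't be turned into a percentage of the array.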

 

My question is more of a general computing question.  In this post and others, I see mention of a "flipped bit" and the potential for that to show up here.  With non-ECC memory which I believe is still far more common, wouldn't this be more of a problem? 

 

What does memory have to do with this process in the first place, what does it load into memory that it needs to compare?

 

In addition, without ECC memory, is my array always at risk for data corruption?  For example, if I'm copying 30GiB of information over my TCP/IP Network to the Unraid array, we've got memory (non-ECC in this case) of the host computer, memory (non-ECC) of the unraid array, and the TCP/IP protocol as well.  If there is a "bit flip" anywhere along the process does that corrupt whatever I'm copying?  Wouldn't this be more of an issue? 

 

Do I really need to run memtest?  I didn't run it before, but I guess I'm willing to run it now since I've had the array in maintenance mode and unavailable for some time at this point. 

 

I appreciate the insight, I am just trying to determine the best way forward as I have some data that is 25 years old and I don't want bit rot.

 

Should I run a 3rd parity check (I haven't rebooted, it's still in maintenance mode) or should I just ignore this? 


Thanks

Edited by cpetro45
Link to comment
  • Solution
1 hour ago, cpetro45 said:

I see mention of a "flipped bit" and the potential for that to show up here

That's the most likely reason.

 

1 hour ago, cpetro45 said:

What does memory have to do with this process in the first place, what does it load into memory that it needs to compare?

Everything goes through memory.
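
To make that a bit more concrete, here is a toy Python sketch of why a flipped bit in RAM can show up as a one-off parity error. It uses plain byte-wise XOR as a stand-in for the first parity disk only. This is not Unraid's actual code and it ignores the second (Q) parity. The "disks" are just short byte strings invented for the example.

```python
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR across equally sized blocks (simplified single parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def flip_bit(data, byte_index, bit_index):
    """Return a copy of data with one bit inverted, like a RAM bit flip."""
    corrupted = bytearray(data)
    corrupted[byte_index] ^= 1 << bit_index
    return bytes(corrupted)


def parity_matches(blocks, stored_parity):
    """A check reads every block into memory and recomputes the parity."""
    return xor_parity(blocks) == stored_parity


# Hypothetical data as stored on three data disks, plus the parity on disk.
disks = [b"disk one data...", b"disk two data...", b"disk three data."]
stored_parity = xor_parity(disks)

# First check: the in-memory copy of disk two picks up a single flipped bit.
in_memory = [disks[0], flip_bit(disks[1], 5, 3), disks[2]]
first_check_ok = parity_matches(in_memory, stored_parity)

# Second check: the same on-disk data is read again, this time without a flip.
second_check_ok = parity_matches(disks, stored_parity)

print(first_check_ok, second_check_ok)
```

The first check reports a mismatch even though nothing on the disks is wrong, and the second check comes back clean, which is the same pattern as the one error followed by a clean run.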

 

1 hour ago, cpetro45 said:

In addition, without ECC memory, is my array always at risk for data corruption?

It's definitely more at risk, but other things can cause data corruption too, though RAM is probably the biggest one.
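
One general way to notice corruption after a copy, whatever caused it, is to compare checksums of the source and the destination file. Here is a minimal Python sketch using SHA-256 from the standard library. The function names and the idea of running it on both ends are just an illustration, not an Unraid feature.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source_path, destination_path):
    """True when both files hash to the same SHA-256 value."""
    return sha256_of_file(source_path) == sha256_of_file(destination_path)
```

A mismatch only says that the two copies differ. It doesn't say which side is wrong or where along the way the corruption happened.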

 

1 hour ago, cpetro45 said:

Do I really need to run memtest?

Most likely it won't find anything.

 

1 hour ago, cpetro45 said:

Should I run a 3rd parity check (I haven't rebooted, it's still in maintenance mode) or should I just ignore this? 

I would just wait for the next scheduled one.

Link to comment
25 minutes ago, JorgeB said:

That's the most likely reason.

 

Everything goes through memory.

 

It's definitely more at risk, but other things can cause data corruption too, though RAM is probably the biggest one.

 

Most likely it won't find anything.

 

I would just wait for the next scheduled one.

Thanks Jorge!
