Unclean Powerdown, disks spun down, Parity Check detected 1 Error (correct disabled), canceled, reran with correct, no error found

cpetro45 · June 15, 2022

Hey there everyone,

I'm just wondering how / why this would occur. 8-disk array, dual parity, no cache. There was an unclean shutdown due to power issues (I have a UPS, but don't have the automation set up to shut down yet). There were no writes going on and the disks were spun down.

The next day, I ran a parity check with correct disabled and it found 1 error about 60% of the way through. I read these forums and decided to cancel and restart the parity check with correct enabled.

The parity check ran with no errors detected.

What could be the reason this happened? Should I be concerned?

Here are the related lines from the log (suspect line: Jun 13 07:06:23 Tower kernel: md: recovery thread: PQ incorrect, sector=807021208):

Jun 13 06:05:27 Tower kernel: mdcmd (42): check 
Jun 13 06:05:27 Tower kernel: md: recovery thread: check P Q ...
Jun 13 06:05:50 Tower kernel: mdcmd (43): nocheck Cancel
Jun 13 06:05:50 Tower kernel: md: recovery thread: exit status: -4
Jun 13 06:05:55 Tower kernel: mdcmd (44): check nocorrect
Jun 13 06:05:55 Tower kernel: md: recovery thread: check P Q ...
Jun 13 06:08:57 Tower nmbd[6513]: [2022/06/13 06:08:57.865107,  0] ../../source3/nmbd/nmbd_become_lmb.c:397(become_local_master_stage2)
Jun 13 06:08:57 Tower nmbd[6513]:   *****
Jun 13 06:08:57 Tower nmbd[6513]:   
Jun 13 06:08:57 Tower nmbd[6513]:   Samba name server TOWER is now a local master browser for workgroup HOME on subnet 192.168.122.1
Jun 13 06:08:57 Tower nmbd[6513]:   
Jun 13 06:08:57 Tower nmbd[6513]:   *****
Jun 13 06:08:57 Tower nmbd[6513]: [2022/06/13 06:08:57.865228,  0] ../../source3/nmbd/nmbd_become_lmb.c:397(become_local_master_stage2)
Jun 13 06:08:57 Tower nmbd[6513]:   *****
Jun 13 06:08:57 Tower nmbd[6513]:   
Jun 13 06:08:57 Tower nmbd[6513]:   Samba name server TOWER is now a local master browser for workgroup HOME on subnet 172.17.0.1
Jun 13 06:08:57 Tower nmbd[6513]:   
Jun 13 06:08:57 Tower nmbd[6513]:   *****
Jun 13 07:06:23 Tower kernel: md: recovery thread: PQ incorrect, sector=807021208
Jun 13 13:03:31 Tower kernel: mdcmd (45): nocheck Cancel
Jun 13 13:03:31 Tower kernel: md: recovery thread: exit status: -4
Jun 13 13:03:40 Tower kernel: mdcmd (46): check 
Jun 13 13:03:40 Tower kernel: md: recovery thread: check P Q ...
Jun 13 13:03:58 Tower kernel: mdcmd (47): nocheck Cancel
Jun 13 13:03:58 Tower kernel: md: recovery thread: exit status: -4
Jun 13 13:04:02 Tower kernel: mdcmd (48): check nocorrect
Jun 13 13:04:02 Tower kernel: md: recovery thread: check P Q ...
Jun 13 13:04:10 Tower kernel: mdcmd (49): nocheck Cancel
Jun 13 13:04:10 Tower kernel: md: recovery thread: exit status: -4
Jun 13 13:04:14 Tower kernel: mdcmd (50): check 
Jun 13 13:04:14 Tower kernel: md: recovery thread: check P Q ...
Jun 13 13:05:29 Tower kernel: mdcmd (51): nocheck Cancel
Jun 13 13:05:29 Tower kernel: md: recovery thread: exit status: -4
Jun 13 13:12:29 Tower kernel: mdcmd (52): check 
Jun 13 13:12:29 Tower kernel: md: recovery thread: check P Q ...

Any thoughts much appreciated.

Thanks

JorgeB · June 15, 2022

The log snippet doesn't show the full correcting check. Assuming it ran to the end without finding any errors, that suggests the earlier error was unrelated to the unclean shutdown and was possibly something like a RAM bit flip. Errors related to an unclean shutdown, when they exist, are mostly at the beginning of the disks, which is where the metadata is stored, assuming an XFS filesystem for the array.

cpetro45 · June 15, 2022

1 hour ago, JorgeB said:

The log snippet doesn't show the full correcting check. Assuming it ran to the end without finding any errors, that suggests the earlier error was unrelated to the unclean shutdown and was possibly something like a RAM bit flip. Errors related to an unclean shutdown, when they exist, are mostly at the beginning of the disks, which is where the metadata is stored, assuming an XFS filesystem for the array.

Thanks so much Jorge, let me get the whole log. All disks are XFS.

I attached the whole diagnostic stack. I appreciate your reply and help.

Let me know if I can provide any more information. I also provided a screenshot showing the times of the run with 1 sync error vs. the clean run.

tower-diagnostics-20220615-1543.zip

trurl · June 15, 2022

Have you done memtest?

cpetro45 · June 15, 2022

2 minutes ago, trurl said:

Have you done memtest?

I have not

JorgeB · June 15, 2022

If it was a bit flip, it might be hard to catch any issues with memtest. It's still worth running a couple of passes; if no errors are found, see how the next parity checks go.

cpetro45 · June 15, 2022

3 hours ago, JorgeB said:

If it was a bit flip, it might be hard to catch any issues with memtest. It's still worth running a couple of passes; if no errors are found, see how the next parity checks go.

Alright, will give it a go, thanks. I didn't see anything related to full parity check with correction in the syslog... did I need to have debug turned on or something?

Thanks

JorgeB · June 16, 2022

9 hours ago, cpetro45 said:

I didn't see anything related to full parity check with correction in the syslog...

Because there wasn't one; all checks were non-correcting.

cpetro45 · June 16, 2022

6 hours ago, JorgeB said:

Because there wasn't one; all checks were non-correcting.

Hey Jorge,

Thanks. The last full check shown below (16hr, 20 min, 57 sec) did have correcting enabled. I was expecting something to show up in the log like "parity check started" and "parity check completed." You are saying there is basically no identification for this, and things in the log would only show up if errors were found and corrected (or not corrected as the previous run shows in the log)?

Thanks,

Chris

JorgeB · June 16, 2022

There is, and you're right, it was a correcting check. Do you remember if it was manually started or scheduled? They were logged differently, though it's been some time since I checked, so that could have changed.

Jun 13 13:12:29 Tower kernel: mdcmd (52): check
Jun 13 13:12:29 Tower kernel: md: recovery thread: check P Q ...
...
Jun 14 05:33:26 Tower kernel: md: sync done. time=58857sec
Jun 14 05:33:26 Tower kernel: md: recovery thread: exit status: 0

Start, duration and end (exit status 0 means it completed successfully, independent of finding sync errors or not) are always logged.
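As a rough illustration of reading those entries, here is a minimal Python sketch that groups syslog lines into one summary per check. It only knows the line formats quoted in this thread; the function name `summarize_checks` and the `SAMPLE` text are made up for the example, and it assumes (as concluded above) that a bare `check` is a correcting run while `check nocorrect` is not. Other Unraid versions, scheduled checks, or plugins may log differently.

```python
import re
from datetime import timedelta

# Patterns based only on the syslog lines quoted in this thread.
START = re.compile(r"mdcmd \(\d+\): check\s*(nocorrect)?\s*$", re.IGNORECASE)
ERROR = re.compile(r"recovery thread: PQ incorrect, sector=(\d+)")
DONE = re.compile(r"md: sync done\. time=(\d+)sec")
EXIT = re.compile(r"recovery thread: exit status: (-?\d+)")


def summarize_checks(lines):
    """Return one dict per parity check found in the given syslog lines."""
    runs, current = [], None
    for line in lines:
        if m := START.search(line):
            current = {
                "started": line[:15],              # e.g. "Jun 13 13:12:29"
                "correcting": m.group(1) is None,  # assumption: bare "check" corrects
                "error_sectors": [],
                "duration": None,                  # only logged when a check finishes
                "exit_status": None,
            }
            runs.append(current)
        elif current is None:
            continue  # line is outside any check we are tracking
        elif m := ERROR.search(line):
            current["error_sectors"].append(int(m.group(1)))
        elif m := DONE.search(line):
            current["duration"] = timedelta(seconds=int(m.group(1)))
        elif m := EXIT.search(line):
            current["exit_status"] = int(m.group(1))
            current = None  # the exit status line closes the run
    return runs


SAMPLE = """\
Jun 13 06:05:55 Tower kernel: mdcmd (44): check nocorrect
Jun 13 06:05:55 Tower kernel: md: recovery thread: check P Q ...
Jun 13 07:06:23 Tower kernel: md: recovery thread: PQ incorrect, sector=807021208
Jun 13 13:03:31 Tower kernel: mdcmd (45): nocheck Cancel
Jun 13 13:03:31 Tower kernel: md: recovery thread: exit status: -4
Jun 13 13:12:29 Tower kernel: mdcmd (52): check
Jun 13 13:12:29 Tower kernel: md: recovery thread: check P Q ...
Jun 14 05:33:26 Tower kernel: md: sync done. time=58857sec
Jun 14 05:33:26 Tower kernel: md: recovery thread: exit status: 0
"""

if __name__ == "__main__":
    for run in summarize_checks(SAMPLE.splitlines()):
        print(run)
```

On the sample, the first run comes out as non-correcting with one error sector and exit status -4 (canceled), and the second as correcting with no errors and exit status 0. Its duration of 58857 seconds is 16:20:57, which matches the run time reported earlier in the thread.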

itimpi · June 16, 2022

Just a hint: if you install the Parity Check Tuning plugin, even if you use none of its facilities, you will find that the log entries get enhanced so that you can tell how any parity check was started and whether it was correcting or non-correcting.

cpetro45 · June 17, 2022

Thanks to you both, OK, I'll take a look at the plugin and also run memtest.

Jorge, the parity check was started manually, pretty much directly after the non-correcting check was canceled.

cpetro45 · June 17

Warming this back up as I have some clarification questions:

This happened again. I had a PSU failure and a few unclean shutdowns (the PSU would cause a reboot when the disks spun up or did anything).

Once stability was achieved, I ran a non-correcting parity check, and it found 1 error.

I ran it again (non-correcting) and it found no errors.

This is all in maintenance mode BTW. Here is the log:

Jun 14 14:24:41 Tower kernel: mdcmd (36): check nocorrect
Jun 14 14:24:41 Tower kernel: md: recovery thread: check P Q ...
Jun 14 14:41:00 Tower kernel: md: recovery thread: PQ incorrect, sector=217794056
Jun 15 06:37:38 Tower kernel: md: sync done. time=58377sec
Jun 15 06:37:38 Tower kernel: md: recovery thread: exit status: 0
Jun 15 09:07:11 Tower kernel: mdcmd (37): check nocorrect
Jun 15 09:07:11 Tower kernel: md: recovery thread: check P Q ...
Jun 16 01:20:27 Tower kernel: md: sync done. time=58396sec
Jun 16 01:20:27 Tower kernel: md: recovery thread: exit status: 0

You can see I ran these one after the other, pretty much back to back, without a reboot.

My question is more of a general computing question. In this post and others, I see mention of a "flipped bit" and the potential for that to show up here. With non-ECC memory which I believe is still far more common, wouldn't this be more of a problem?

What does memory have to do with this process in the first place, what does it load into memory that it needs to compare?

In addition, without ECC memory, is my array always at risk of data corruption? For example, if I'm copying 30GiB of information over my TCP/IP network to the Unraid array, we've got the memory (non-ECC in this case) of the host computer, the memory (non-ECC) of the Unraid server, and the TCP/IP protocol as well. If there is a "bit flip" anywhere along the process, does that corrupt whatever I'm copying? Wouldn't this be more of an issue?

Do I really need to run memtest? I didn't run it before, but I guess I'm willing to run it now since I've had the array in maintenance mode and unavailable for some time at this point.

I appreciate the insight, I am just trying to determine the best way forward as I have some data that is 25 years old and I don't want bit rot.

Should I run a 3rd parity check (I haven't rebooted, it's still in maintenance mode) or should I just ignore this?

Thanks

Edited June 17 by cpetro45

JorgeB · June 17

1 hour ago, cpetro45 said:

I see mention of a "flipped bit" and the potential for that to show up here

That's the most likely reason.

1 hour ago, cpetro45 said:

What does memory have to do with this process in the first place, what does it load into memory that it needs to compare?

Everything goes through memory.

1 hour ago, cpetro45 said:

In addition, without ECC memory, is my array always at risk for data corruption?

Definitely it's more at risk, but other things can cause data corruption, though RAM is probably the biggest one.

1 hour ago, cpetro45 said:

Do I really need to run memtest?

Most likely it won't find anything.

1 hour ago, cpetro45 said:

Should I run a 3rd parity check (I haven't rebooted, it's still in maintenance mode) or should I just ignore this?

I would just wait for the next scheduled one.
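To make "everything goes through memory" a bit more concrete, here is a toy Python sketch of why a one-off flipped bit in RAM could show up as a single sync error that does not reproduce on the next non-correcting check. It models only simple XOR ("P"-style) parity; the second ("Q") parity calculation is more involved and is not modeled, and nothing here is Unraid's actual code. The block contents and helper names are invented for the illustration.

```python
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR across equal-length blocks (single-parity style)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def flip_bit(block, byte_index, bit):
    """Return a copy of block with one bit inverted, simulating a RAM error."""
    out = bytearray(block)
    out[byte_index] ^= 1 << bit
    return bytes(out)


def parity_matches(blocks_in_memory, stored_parity):
    """A check reads every block into memory and recomputes parity there."""
    return xor_parity(blocks_in_memory) == stored_parity


# Hypothetical on-disk contents of the same stripe on three data disks.
on_disk = [b"disk1 data", b"disk2 data", b"disk3 data"]
stored_parity = xor_parity(on_disk)

# First check: one bit flips in the RAM copy of disk 2's block.
first_pass = [on_disk[0], flip_bit(on_disk[1], 3, 0), on_disk[2]]
first_check_ok = parity_matches(first_pass, stored_parity)

# Second check: the blocks are read from disk again and nothing flips.
second_check_ok = parity_matches(on_disk, stored_parity)

if __name__ == "__main__":
    print("first check ok: ", first_check_ok)
    print("second check ok:", second_check_ok)
```

In this toy model the first check reports a mismatch and the second one is clean, because a non-correcting check never writes anything and the data on disk was never wrong. The more worrying case is a flip that happens while data is being written, since the bad value then ends up on disk.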

cpetro45 · June 17

25 minutes ago, JorgeB said:

That's the most likely reason.

Everything goes through memory.

Definitely it's more at risk, but other things can cause data corruption, though RAM is probably the biggest one.

Most likely it won't find anything.

I would just wait for the next scheduled one.

Thanks Jorge!
