Parity Check Errors and Checksum Errors



Hello Unraid forum.

From time to time I get errors after a parity check (not always, but sometimes, in the range of 0-10 errors per check). The check is scheduled to run every 30 days.

1* This is not normal, right? I should expect 0 errors all the time, right?

2* With 0-10 errors, are we talking about blocks on the disk, files, or some other unit? What unit defines 1 error?

 

I also have the checksum plugin added to my Unraid, and that throws me a few errors from time to time too.

3* I assume these are related to the same errors the parity check gives me?

 

FYI: The server is running on an Intel S1200V3RPS with ECC memory, so hopefully no problem there. But if anyone knows of some magical switch in the BIOS that has to be enabled in order for the ECC memory to do its magic, please tell me. It might not be enabled by default.
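
For what it's worth, one thing that could be tried from the Unraid terminal to see whether ECC is actually reported as active (a rough check only, and it assumes dmidecode is available on the system):

dmidecode --type memory | grep -i "error correction"    # should report an ECC correction type rather than "None"
dmesg | grep -i edac                                    # the EDAC driver loading is another hint that ECC is in use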

 

I am grateful for all the help I can get.


Mar 14 17:14:13 Nano emhttpd: unclean shutdown detected

Sync errors are expected after an unclean shutdown.

 

PS: syslog is filled with this:

 

Mar 14 17:17:38 Nano shfs: share cache full
Mar 14 17:17:39 Nano shfs: share cache full
Mar 14 17:17:40 Nano shfs: share cache full

Some share has a minimum free space setting higher than it should be.

 

 

6 minutes ago, tillo said:

I have: 316785 errors during the sync. Is that amount expected after an unclean shutdown?

That's a lot. I stopped reading the syslog after the unclean shutdown, but you already removed the diags...

 

7 minutes ago, tillo said:

Also, the minimum free space setting, how would that affect the share cache?

There's a share configured to use the cache but with a minimum free space setting higher than what is currently available on the cache, so data written to that share is going directly to the array, and that warning is constantly logged, flooding the syslog.


This error repeats several times; it's usually a hardware problem:

 

Mar 14 18:39:27 Nano kernel: Uhhuh. NMI received for unknown reason 21 on CPU 0.
Mar 14 18:39:27 Nano kernel: Do you have a strange power saving mode enabled?
Mar 14 18:39:27 Nano kernel: Dazed and confused, but trying to continue

One of those times it happened just before a lot of sync errors were detected, so it's possibly related:

 

Mar 15 01:23:35 Nano kernel: Uhhuh. NMI received for unknown reason 31 on CPU 0.
Mar 15 01:23:35 Nano kernel: Do you have a strange power saving mode enabled?
Mar 15 01:23:35 Nano kernel: Dazed and confused, but trying to continue
Mar 15 01:23:35 Nano kernel: md: recovery thread: PQ incorrect, sector=4570300664
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570310904
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570321144
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570331384
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570341624
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570351864
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570362104
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570372344
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570382584
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570392824
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570403064
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570413304
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570423544
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570433784
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570444024
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570454264
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570464504
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570474744
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570484984
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570495224
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570505464
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570515704
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570525944
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570536184
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570546424
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570556664
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570566904
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570577144
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570587384
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570597624
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570607864
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570618104
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570628344
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570638584
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570648824
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570659064
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570669304
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570679544
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570689784
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570700024
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570710264
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570720504
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570730744
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570740984
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570751224
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570761464
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570771704
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570781944
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570792184
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570802424
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570812664
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570822904
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570833144
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570843384
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570853624
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570863864
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570874104
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570884344
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570894584
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570904824
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570915064
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570925304
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570935544
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570945784
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570956024
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570966264
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570976504
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570986744
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570996984
Mar 15 01:23:39 Nano kernel: md: recovery thread: stopped logging

There's also filesystem corruption on disk1:

 

Mar 15 04:40:14 Nano kernel: XFS (md1): Metadata corruption detected at xfs_inode_buf_verify+0xae/0xbe [xfs], xfs_inode block 0x15dd76708
Mar 15 04:40:14 Nano kernel: XFS (md1): Unmount and run xfs_repair

 


Okay, could the first one be related to the new RAM modules I installed?
 

Quote

Mar 14 18:39:27 Nano kernel: Uhhuh. NMI received for unknown reason 21 on CPU 0.
Mar 14 18:39:27 Nano kernel: Do you have a strange power saving mode enabled?
Mar 14 18:39:27 Nano kernel: Dazed and confused, but trying to continue



The other problem, corrupt data on Disk 1: what should I do with that, unmount and run xfs_repair? Should I replace the disk? Replace cables?

Quote

Mar 15 04:40:14 Nano kernel: XFS (md1): Metadata corruption detected at xfs_inode_buf_verify+0xae/0xbe [xfs], xfs_inode block 0x15dd76708
Mar 15 04:40:14 Nano kernel: XFS (md1): Unmount and run xfs_repair

 


Okay, I have done a complete teardown of the server.

1* I ran "xfs_repair -nv" on Disk1; it did not show any faults or errors.

2* I ran memtest on the RAM (memtest86 version 7.5); it showed no errors except on Test 10 (Fade test). Reading about that test in different forum threads, a few errors during that test seemed to be normal depending on firmware and motherboard.

3* I checked the cables; they are all good (original Super Micro cables with a built-in lock feature).

4* I am totally at a loss currently. What would be the next step?

2 hours ago, tillo said:

1* I ran "xfs_repair -nv"  Disk1, it did not show any faults or errors.

 

It's frequently difficult to tell whether there are filesystem issues just by looking at the output of xfs_repair. You either need to check the exit code after running with -n (1 = corruption detected, 0 = no corruption detected), or just run it without the -n (no modify) flag so any issues found are repaired.
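
For example, a quick way to see the exit code from the console (just a sketch, assuming the array is started in Maintenance mode and disk1 is the /dev/md1 device):

xfs_repair -n /dev/md1    # read-only check, makes no changes
echo $?                   # 0 = no corruption detected, 1 = corruption detected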

 

2 hours ago, tillo said:

I am totally at a loss currently. What would be the next step?

 

The NMI errors can be caused by the PSU or the board, but that's hard to confirm unless you have spares, the board especially. Maybe you can test with a PSU from another system. Also, if you haven't yet, check the system event log; there might be some clues there.

  9 hours ago, tillo said:

1* I ran "xfs_repair -nv"  Disk1, it did not show any faults or errors.

 

It's frequently difficult to tell whether there are filesystem issues just by looking at the output of xfs_repair. You either need to check the exit code after running with -n (1 = corruption detected, 0 = no corruption detected), or just run it without the -n (no modify) flag so any issues found are repaired.

 

So should I run "xfs_repair -n" or "xfs_repair"? Just to confirm: neither of them, with the -n flag or with no flag at all, will change the files on my system ("non-correcting/altering checks")?
Is this correct?

Could I run the check from the Unraid web terminal?

I have currently started the array and I am running a parity sync (without writing any corrections, just to get an idea of how extensive the errors are). I will upload the diagnostics file when it is done.

 

  9 hours ago, tillo said:

I am totally at a loss currently. What would be the next step?

 

The NMI errors can be caused by the PSU or the board, but that's hard to confirm unless you have spares, the board especially. Maybe you can test with a PSU from another system. Also, if you haven't yet, check the system event log; there might be some clues there.


I've got a spare board and a spare PSU. The PSU is a Seasonic PRIME ULTRA TITANIUM.

I'd rather not pull the board or PSU out at this point. I will let it run for a few days and then check the log file.

Quote

root@Nano:~# sudo xfs_repair -n /dev/sde1 && echo ${PIPESTATUS[0]}
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 0
        - agno = 2
        - agno = 3
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.
0

 

This is the output that I get.

Quote

No modify flag set, skipping filesystem flush and exiting.
0

I could run it without -n, but in any case it showed an exit code of 0, and that would indicate no errors? 0 = no corruption detected?

Would it be different if I ran it without the -n flag?

34 minutes ago, tillo said:

@trurl Okay, when running sudo xfs_repair -n /dev/md1 I get an error that the disk is mounted and not accessible.

How should I proceed in order to do an xfs_repair correctly?

Stop the array; restart it in Maintenance mode; click on the drive on the Main tab; select the option to check the filesystem; if the check indicates an error, rerun the check without the -n flag (and if prompted, add the -L flag) to actually do the repair. When finished, stop the array and restart in normal mode.

 

Doing it this way maintains parity, as all writes to ‘md’ type devices update parity appropriately.
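
Roughly the same thing from a console, shown only as a sketch (it assumes disk1 maps to /dev/md1; the GUI procedure above does this for you):

xfs_repair -n /dev/md1    # read-only check, no changes made
xfs_repair /dev/md1       # actual repair, only if the check reports problems
xfs_repair -L /dev/md1    # add -L only if xfs_repair itself asks to zero the log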


@johnnie.black My bad, I tried, but since I received an error that I could not run it on that device, I googled and it suggested that I do it with "sde1". So I just assumed that that was the correct way to do it :-p

I will put Unraid in Maintenance mode and give it a try within an hour, and we will see if that gives another result.
 

Just to confirm: have I in any way damaged the filesystem by running xfs_repair on sde1? (Note: sde1 is Data Disk 1.)

