Parity Check Errors and Checksum Errors



Hello Unraid forum.

From time to time I get errors after a parity check (not always, but sometimes, in the range of 0-10 errors per check). The check is scheduled to run every 30 days.

1* This is not normal, right? I should expect 0 errors all the time, right?

2* With 0-10 errors, are we talking about blocks on the disk, files, or some other unit? What unit defines 1 error?

 

I also have the checksum plugin added to my Unraid, and that throws me a few errors from time to time too.

3* I assume these are related to the same errors the parity check gives me?

 

FYI: The server is running on an Intel S1200V3RPS with ECC memory, so hopefully no problem there. But if anyone knows of some magical switch in the BIOS that has to be enabled in order for the ECC memory to do its magic, please tell me. It might not be enabled by default.
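
For what it's worth, one thing that could be tried from the Unraid terminal to see whether ECC is actually reported as active (a rough check only, and it assumes dmidecode is available on the system):

dmidecode --type memory | grep -i "error correction"    # should report an ECC correction type rather than "None"
dmesg | grep -i edac                                    # the EDAC driver loading is another hint that ECC is in use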

 

I am grateful for all the help I can get.


Mar 14 17:14:13 Nano emhttpd: unclean shutdown detected

Sync errors are expected after an unclean shutdown.

 

PS: syslog is filled with this:

 

Mar 14 17:17:38 Nano shfs: share cache full
Mar 14 17:17:39 Nano shfs: share cache full
Mar 14 17:17:40 Nano shfs: share cache full

Some share has a minimum free space setting higher than it should be.

 

 

6 minutes ago, tillo said:

I have: 316785 errors during the sync. Is that amount expected after an unclean shutdown?

That's a lot. I stopped reading the syslog after the unclean shutdown, but you already removed the diags...

 

7 minutes ago, tillo said:

Also, the minimum free space setting, how would that affect the share cache?

There's a share configured to use the cache but with a minimum free space setting higher than what is currently available on the cache, so data written to that share is going directly to the array, and that warning is constantly logged, flooding the syslog.


This error repeats several times; it's usually a hardware problem:

 

Mar 14 18:39:27 Nano kernel: Uhhuh. NMI received for unknown reason 21 on CPU 0.
Mar 14 18:39:27 Nano kernel: Do you have a strange power saving mode enabled?
Mar 14 18:39:27 Nano kernel: Dazed and confused, but trying to continue

One of those times it happened just before a lot of sync errors were detected, so it's possibly related:

 

Mar 15 01:23:35 Nano kernel: Uhhuh. NMI received for unknown reason 31 on CPU 0.
Mar 15 01:23:35 Nano kernel: Do you have a strange power saving mode enabled?
Mar 15 01:23:35 Nano kernel: Dazed and confused, but trying to continue
Mar 15 01:23:35 Nano kernel: md: recovery thread: PQ incorrect, sector=4570300664
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570310904
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570321144
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570331384
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570341624
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570351864
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570362104
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570372344
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570382584
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570392824
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570403064
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570413304
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570423544
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570433784
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570444024
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570454264
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570464504
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570474744
Mar 15 01:23:36 Nano kernel: md: recovery thread: PQ incorrect, sector=4570484984
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570495224
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570505464
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570515704
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570525944
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570536184
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570546424
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570556664
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570566904
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570577144
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570587384
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570597624
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570607864
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570618104
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570628344
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570638584
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570648824
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570659064
Mar 15 01:23:37 Nano kernel: md: recovery thread: PQ incorrect, sector=4570669304
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570679544
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570689784
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570700024
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570710264
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570720504
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570730744
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570740984
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570751224
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570761464
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570771704
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570781944
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570792184
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570802424
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570812664
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570822904
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570833144
Mar 15 01:23:38 Nano kernel: md: recovery thread: PQ incorrect, sector=4570843384
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570853624
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570863864
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570874104
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570884344
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570894584
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570904824
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570915064
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570925304
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570935544
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570945784
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570956024
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570966264
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570976504
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570986744
Mar 15 01:23:39 Nano kernel: md: recovery thread: PQ incorrect, sector=4570996984
Mar 15 01:23:39 Nano kernel: md: recovery thread: stopped logging

There's also filesystem corruption on disk1:

 

Mar 15 04:40:14 Nano kernel: XFS (md1): Metadata corruption detected at xfs_inode_buf_verify+0xae/0xbe [xfs], xfs_inode block 0x15dd76708
Mar 15 04:40:14 Nano kernel: XFS (md1): Unmount and run xfs_repair

 


Okay, could the first one be related to the new RAM modules I installed?
 

Quote

Mar 14 18:39:27 Nano kernel: Uhhuh. NMI received for unknown reason 21 on CPU 0.
Mar 14 18:39:27 Nano kernel: Do you have a strange power saving mode enabled?
Mar 14 18:39:27 Nano kernel: Dazed and confused, but trying to continue



The other problem, corrupt data on Disk 1: what should I do with that, unmount and run xfs_repair? Should I replace the disk? Replace cables?

Quote

Mar 15 04:40:14 Nano kernel: XFS (md1): Metadata corruption detected at xfs_inode_buf_verify+0xae/0xbe [xfs], xfs_inode block 0x15dd76708
Mar 15 04:40:14 Nano kernel: XFS (md1): Unmount and run xfs_repair

 


Okay, I have done a complete teardown of the server.

1* I ran "xfs_repair -nv" on Disk1; it did not show any faults or errors.

2* I ran memtest on the RAM (memtest86 version 7.5); it showed no errors except on Test 10 (Fade test). Reading about that test in different forum threads, a few errors during that test seemed to be normal depending on firmware and motherboard.

3* I checked the cables; they are all good (original Super Micro cables with a built-in lock feature).

4* I am totally at a loss currently. What would be the next step?

2 hours ago, tillo said:

1* I ran "xfs_repair -nv"  Disk1, it did not show any faults or errors.

 

It's frequently difficult to tell whether there are filesystem issues just by looking at the output of xfs_repair. You either need to check the exit code after running with -n (1 = corruption detected, 0 = no corruption detected), or just run it without the -n (no modify) flag so any issues found are repaired.
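
For example, a quick way to see the exit code from the console (just a sketch, assuming the array is started in Maintenance mode and disk1 is the /dev/md1 device):

xfs_repair -n /dev/md1    # read-only check, makes no changes
echo $?                   # 0 = no corruption detected, 1 = corruption detected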

 

2 hours ago, tillo said:

I am totally at a loss currently. What would be the next step?

 

The NMI errors can be caused by the PSU or the board, but that's hard to confirm unless you have spares, the board especially. Maybe you can test with a PSU from another system. Also, if you haven't yet, check the system event log; there might be some clues there.

  9 hours ago, tillo said:

1* I ran "xfs_repair -nv"  Disk1, it did not show any faults or errors.

 

It's frequently difficult to tell whether there are filesystem issues just by looking at the output of xfs_repair. You either need to check the exit code after running with -n (1 = corruption detected, 0 = no corruption detected), or just run it without the -n (no modify) flag so any issues found are repaired.

 

So should I run "xfs_repair -n" or "xfs_repair"? Just to confirm: neither of them, with the -n flag or with no flag at all, will change the files on my system ("non-correcting/altering checks")?
Is this correct?

Could I run the check from the Unraid web terminal?

I have currently started the array and I am running a parity sync (without writing any corrections, just to get an idea of how extensive the errors are). I will upload the diagnostics file when it is done.

 

  9 hours ago, tillo said:

I am totally at a loss currently. What would be the next step?

 

The NMI errors can be caused by the PSU or the board, but that's hard to confirm unless you have spares, the board especially. Maybe you can test with a PSU from another system. Also, if you haven't yet, check the system event log; there might be some clues there.


I've got a spare board and a spare PSU. The PSU is a Seasonic PRIME ULTRA TITANIUM.

I'd rather not pull the board or PSU out at this point. I will let it run for a few days and then check the log file.

Quote

root@Nano:~# sudo xfs_repair -n /dev/sde1 && echo ${PIPESTATUS[0]}
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 0
        - agno = 2
        - agno = 3
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.
0

 

This is the output that I get.

Quote

No modify flag set, skipping filesystem flush and exiting.
0

I could run it without -n, but in any case it showed an exit code of 0, and that would indicate no errors? 0 = no corruption detected?

Would it be different if I ran it without the -n flag?

34 minutes ago, tillo said:

@trurl Okay, when running sudo xfs_repair -n /dev/md1 I get an error that the disk is mounted and not accessible.

How should I proceed in order to do an xfs_repair correctly?

Stop the array; restart it in Maintenance mode; click on the drive on the Main tab; select the option to check the filesystem; if the check indicates an error, rerun the check without the -n flag (and if prompted, add the -L flag) to actually do the repair. When finished, stop the array and restart in normal mode.

 

Doing it this way maintains parity, as all writes to ‘md’ type devices update parity appropriately.
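
Roughly the same thing from a console, shown only as a sketch (it assumes disk1 maps to /dev/md1; the GUI procedure above does this for you):

xfs_repair -n /dev/md1    # read-only check, no changes made
xfs_repair /dev/md1       # actual repair, only if the check reports problems
xfs_repair -L /dev/md1    # add -L only if xfs_repair itself asks to zero the log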


@johnnie.black My bad, I tried, but since I received an error that I could not run it on that device, I googled and it suggested that I do it with "sde1". So I just assumed that that was the correct way to do it :-p

I will put Unraid in Maintenance mode and give it a try within an hour, and we will see if that gives another result.
 

Just to confirm: have I in any way damaged the filesystem by running xfs_repair on sde1? (Note: sde1 is Data Disk 1.)

