Parity check - writing corrections to data disk



I turned off my unraid server via the GUI a couple of times this week, and when I restarted it yesterday it came back up with two unmountable disks showing 'Corruption warning: Metadata has LSN (1:83814) ahead of current LSN (1:80338).'

 

I restarted the array in maintenance mode and ran xfs_repair -v on both devices, which indicated -L was needed. I reran it with -L and the output looked good:

 

Phase 1 - find and verify superblock...
        - block cache size set to 2292464 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 451270 tail block 451266
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (1:452526) is ahead of log (1:2).
Format log to cycle 4.

        XFS_REPAIR Summary    Sat Dec  8 08:49:37 2018

Phase           Start           End             Duration
Phase 1:        12/08 08:44:23  12/08 08:44:23
Phase 2:        12/08 08:44:23  12/08 08:45:44  1 minute, 21 seconds
Phase 3:        12/08 08:45:44  12/08 08:45:45  1 second
Phase 4:        12/08 08:45:45  12/08 08:45:45
Phase 5:        12/08 08:45:45  12/08 08:45:45
Phase 6:        12/08 08:45:45  12/08 08:45:45
Phase 7:        12/08 08:45:45  12/08 08:45:45

Total run time: 1 minute, 22 seconds
done


xfs_repair -v -L /dev/md15
Phase 1 - find and verify superblock...
        - block cache size set to 2292464 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 80338 tail block 80334
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (1:83814) is ahead of log (1:2).
Format log to cycle 4.

        XFS_REPAIR Summary    Sat Dec  8 08:50:28 2018

Phase           Start           End             Duration
Phase 1:        12/08 08:45:19  12/08 08:45:19
Phase 2:        12/08 08:45:19  12/08 08:47:15  1 minute, 56 seconds
Phase 3:        12/08 08:47:15  12/08 08:47:15
Phase 4:        12/08 08:47:15  12/08 08:47:15
Phase 5:        12/08 08:47:15  12/08 08:47:15
Phase 6:        12/08 08:47:15  12/08 08:47:15
Phase 7:        12/08 08:47:15  12/08 08:47:15

Total run time: 1 minute, 56 seconds
done
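For completeness, what I ran on each device (from the console, with the array started in maintenance mode; the md number matches the disk slot, so /dev/md15 here is just one of mine) boils down to the two commands below. xfs_repair -n is the read-only variant if you want to see what it would do before changing anything.

xfs_repair -v /dev/md15       # attempt the repair; in my case it bailed out and said -L was needed
xfs_repair -v -L /dev/md15    # rerun with -L: zeroes the dirty log, accepting possible loss of recent metadata changes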

 

I restarted the array and it detected the disks normally and everything 'looks' okay. Now I need to run a consistency check, but I'd like the check to consider the parity authoritative rather than the data disks in case there are differences. How can I do this?

diagnostics-20181208-0926.zip

Link to comment
2 minutes ago, DarkKnight said:

Now I need to run a consistency check, but I'd like the check to consider the parity authoritative rather than the data disks in case there are differences. How can I do this?

You can't. All you can do is a non-correcting check to see whether parity is consistent. I understand what you are getting at, but if you think it through, it's not possible. The parity disk by itself has no way of recording which member of the parity set holds the wrong bit, only that ONE of the several data disks is inconsistent. Theoretically you could examine each non-matching address offset and flip the bit on each drive one by one to see which of the solutions made the most logical sense, but you would have to determine which file was affected on each drive and check it for corruption with external validation, and if the address fell in unused space you wouldn't be able to tell what was correct or incorrect at all.

 

The best you can do is a non-correcting check; if there are errors, you would have to do a byte-level comparison with backups, or use checksums, to verify which files, if any, were affected.

 

tl;dr: Parity is a sum of all the disks, so if the array contains more than one data disk there is no way to tell which data disk is wrong.
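If it helps to see why, here is a toy sketch in bash arithmetic (purely illustrative, nothing to do with how unraid actually lays out parity blocks): three one-byte "disks" with a single XOR parity byte, and one flipped bit.

d1=0xA5; d2=0x3C; d3=0x0F
parity=$(( d1 ^ d2 ^ d3 ))                        # parity computed while everything was good

d2=$(( d2 ^ 0x01 ))                               # flip one bit on "disk 2" to simulate corruption

echo $(( (d1 ^ d2 ^ d3) != parity ))              # 1: the check only knows "mismatch at this offset"

# Flipping that same bit on ANY single disk makes parity consistent again:
echo $(( ((d1 ^ 0x01) ^ d2 ^ d3) == parity ))     # 1
echo $(( (d1 ^ (d2 ^ 0x01) ^ d3) == parity ))     # 1
echo $(( (d1 ^ d2 ^ (d3 ^ 0x01)) == parity ))     # 1

All three "corrections" satisfy parity, but only one of them restores the original data, and nothing in the parity itself tells you which one that is.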

Link to comment
1 minute ago, Sven88 said:

If you want to just rebuild a disk, simply stop the array > unassign the troubled disk > start the array > stop the array > reassign the disk > start the array, and the rebuild will begin.

I was down two disks. I did not want to take the chance of a problem occurring during the rebuild that would lose all of that data, and I don't have 4TB of space available outside the array to back up the emulated contents either.

Link to comment
1 hour ago, DarkKnight said:

The server is at about 30 of 50TB used. There's no other backup.

 

Unraid is capable of emulating missing disks using parity, provided enough other disks are available. If it can do that, why can't we choose to have the data corrected rather than the parity?

If you have single parity, then emulating a single missing disk requires ALL of the other disks. With dual parity you would be able to rebuild both of those data disks, but that would be extremely unlikely to have fixed anything; filesystem corruption needs to be repaired the way you already did it.
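A rough illustration of why the emulation needs every other disk (same toy bash model as above, one byte per "disk", single XOR parity; the second parity disk in a dual-parity setup uses different math, but the dependency on all the surviving disks is the same):

disks=(0xA5 0x3C 0x0F 0x77)                       # four one-byte data "disks"
parity=0
for b in "${disks[@]}"; do parity=$(( parity ^ b )); done

missing=2                                          # pretend disk 2 has failed
rebuilt=$parity                                    # emulate it: parity XOR every surviving disk
for i in "${!disks[@]}"; do
    [ "$i" -ne "$missing" ] && rebuilt=$(( rebuilt ^ disks[i] ))
done
printf 'original=%#x rebuilt=%#x\n' $(( disks[missing] )) "$rebuilt"

The rebuilt byte only comes out right because every surviving disk read back correctly; a bad read anywhere and the emulated byte is silently wrong, which is why parity can rebuild a disk but cannot arbitrate which disk is lying.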

 

And you really must have backups. You don't have to back up everything, but you need a plan: another copy of anything important and irreplaceable on another system. Parity will not save you.

Link to comment

I have dual parity. My concern was the warning message that data corruption could get worse due to using -L in the repair. If that is not the case in this instance, then I have nothing to worry about. I'm running a non-correcting parity check now. I also noticed that after 18 consecutive months of error-free checks, I got 394 errors on my last monthly check. No new SMART warnings, but I did have to shut unraid off a couple of times in the past month while I was doing work on my servers, so I suppose I could have had an unclean shutdown then.

 

In terms of backups of *really* important data like photos, I do have those on multiple machines. I don't have an off-site backup configured for older photos, but it's on the list; B2 is looking pretty cheap for that. Newer photos are covered by iCloud. I have a few TB of project files for old VHS home movies that I'd be pretty pissed to lose, but uncompressed they are around 30GB per hour and I have something like 100-200 hours of footage, though not all of it has been digitized yet. I can't imagine what my ISP would do if I tried to push 10+TB of upload in a month on top of my already high usage to back that up plus my existing irreproducible data. It would raise a red flag I don't want, and besides, I don't want the monthly sub. It would make more sense to me to get a 10-14TB drive, back it up locally, and store that off-site. Just not in the budget. 😒

Link to comment
Just now, DarkKnight said:

I got 394 errors on my last monthly check. No new SMART warnings, but I did have to shut unraid off a couple of times in the past month while I was doing work on my servers, so I suppose I could have had an unclean shutdown then.

Most likely

 

Consider creating checksums for your files; they're very handy for situations like these.
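Even something as simple as this does the job if you'd rather not install anything (paths are only examples; run it per share and keep the manifest somewhere other than the disks it describes, e.g. the flash drive or another machine):

# Build a manifest of checksums for everything under a share:
cd /mnt/user/photos
find . -type f -exec sha256sum {} + > /boot/checksums-photos.sha256

# Later, re-hash the files and list anything that no longer matches:
cd /mnt/user/photos
sha256sum --quiet -c /boot/checksums-photos.sha256

With a manifest made before any trouble, a non-correcting parity check plus a re-verify tells you exactly which files, if any, were actually damaged.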

Link to comment
  • 4 years later...
On 12/8/2018 at 11:50 AM, JorgeB said:

Most likely

 

Consider creating checksums for your files; they're very handy for situations like these.

I realize this is about 5 years old, but considering the drive issues I am dealing with (you are helping me with them currently, actually), I thought something like this might be a good idea. I was wondering if there is a specific tool for unraid that you recommend for creating and reconciling the checksums. TIA

Link to comment
