DarkKnight Posted December 8, 2018

I shut down my Unraid server via the GUI a couple of times this week, and when restarting it yesterday it came back up with two unmountable disks showing 'Corruption warning: Metadata has LSN (1:83814) ahead of current LSN (1:80338).' I restarted the array in maintenance mode and ran xfs_repair -v for both devices, which indicated -L was needed. I reran it with -L and the output looked good:

Phase 1 - find and verify superblock...
        - block cache size set to 2292464 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 451270 tail block 451266
ALERT: The filesystem has valuable metadata changes in a log which is
being destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (1:452526) is ahead of log (1:2).
Format log to cycle 4.
XFS_REPAIR Summary    Sat Dec  8 08:49:37 2018

Phase       Start           End             Duration
Phase 1:    12/08 08:44:23  12/08 08:44:23
Phase 2:    12/08 08:44:23  12/08 08:45:44  1 minute, 21 seconds
Phase 3:    12/08 08:45:44  12/08 08:45:45  1 second
Phase 4:    12/08 08:45:45  12/08 08:45:45
Phase 5:    12/08 08:45:45  12/08 08:45:45
Phase 6:    12/08 08:45:45  12/08 08:45:45
Phase 7:    12/08 08:45:45  12/08 08:45:45

Total run time: 1 minute, 22 seconds
done

xfs_repair -v -L /dev/md15

Phase 1 - find and verify superblock...
        - block cache size set to 2292464 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 80338 tail block 80334
ALERT: The filesystem has valuable metadata changes in a log which is
being destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (1:83814) is ahead of log (1:2).
Format log to cycle 4.
XFS_REPAIR Summary    Sat Dec  8 08:50:28 2018

Phase       Start           End             Duration
Phase 1:    12/08 08:45:19  12/08 08:45:19
Phase 2:    12/08 08:45:19  12/08 08:47:15  1 minute, 56 seconds
Phase 3:    12/08 08:47:15  12/08 08:47:15
Phase 4:    12/08 08:47:15  12/08 08:47:15
Phase 5:    12/08 08:47:15  12/08 08:47:15
Phase 6:    12/08 08:47:15  12/08 08:47:15
Phase 7:    12/08 08:47:15  12/08 08:47:15

Total run time: 1 minute, 56 seconds
done

I restarted the array and it detected the disks normally, and everything 'looks' okay. Now I need to run a consistency check, but I'd like the check to treat the parity as authoritative rather than the data disks, in case there are differences. How can I do this?

diagnostics-20181208-0926.zip
JonathanM Posted December 8, 2018

2 minutes ago, DarkKnight said:
"Now I need to run a consistency check, but I'd like the check to treat the parity as authoritative rather than the data disks, in case there are differences. How can I do this?"

You can't. All you can do is a non-correcting check and see whether parity is consistent. I understand what you're getting at, but if you think it through, it's not possible. The parity disk by itself has no way of recording which member of the parity set holds the wrong bit, only that ONE of the several data disks is inconsistent. Theoretically you could examine each mismatched address offset, flip the corresponding bit on each drive one by one, and see which of the resulting solutions made the most logical sense, but you would have to determine which file was affected on each drive and check for corruption with external validation, or the address might fall in unused space, in which case you couldn't tell what was correct or incorrect at all. The best you can do is a non-correcting check; if there are errors, you would have to do a byte-level comparison with backups, or use checksums, to verify which files (if any) were affected.

tl;dr: Parity is a sum of all disks, so if the array consists of more than one data disk there is no way to tell which data disk is wrong.
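The ambiguity JonathanM describes can be sketched in a few lines of Python. This is a toy model (one byte per "disk", single XOR parity), not Unraid's actual implementation: a parity check detects the mismatch, but flipping the mismatched bits on any one data disk restores consistency, so parity alone cannot localize the error.

```python
from functools import reduce

# Toy model: one byte per "disk"; parity = XOR of all data disks.
data = [0b10110100, 0b01011100, 0b00100001]   # three data disks
parity = reduce(lambda a, b: a ^ b, data)

# Silently corrupt one bit on disk 1 (pretend we don't know which disk).
corrupted = data.copy()
corrupted[1] ^= 0b00000100

# A parity check detects the inconsistency (non-zero mismatch)...
mismatch = reduce(lambda a, b: a ^ b, corrupted) ^ parity
assert mismatch != 0

# ...but flipping those bits on ANY single disk makes the array
# parity-consistent again, so every disk is an equally valid suspect.
for i in range(len(corrupted)):
    candidate = corrupted.copy()
    candidate[i] ^= mismatch
    assert reduce(lambda a, b: a ^ b, candidate) == parity
```

This is exactly why a correcting check can only rewrite parity: there is one equation and several unknowns.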
DarkKnight Posted December 8, 2018

The server is at about 30TB used out of 50TB. There's no other backup. Unraid is capable of emulating missing disks using parity, provided enough other disks are available. If it can do that, why can't we choose to have the data corrected rather than the parity?
SkippyAlpha Posted December 8, 2018

If you just want to rebuild a disk: stop the array > unassign the trouble disk > start the array > stop the array > reassign the disk > start the array and let it rebuild.
DarkKnight Posted December 8, 2018

1 minute ago, SkippyAlpha said:
"If you want to just rebuild a disk, simply stop the array > unassign the trouble disk > start array > stop array > reassign disk > start array and start rebuilding."

I was down two disks. I did not want to take the chance of a problem occurring during the rebuild and losing all of that data. I don't have 4TB of space available outside the array to back up the emulated contents, either.
JonathanM Posted December 8, 2018

1 hour ago, DarkKnight said:
"Unraid is capable of emulating missing disks using parity, provided enough other disks are available. If it can do that, why can't we choose to have the data corrected rather than the parity?"

Which disk do you want it to correct?
DarkKnight Posted December 8, 2018 (edited)

md4 & md15 both had log errors.

Edit: I believe it was related to an unclean shutdown caused by the default shutdown timeout being too short for the disks. I set it to 7 minutes today, per the recommendation.

Edited December 8, 2018 by DarkKnight
JonathanM Posted December 8, 2018

1 minute ago, DarkKnight said:
"md4 & md15 both had log errors."

So which one is in error? My point is, parity can't tell which one is wrong, only that one (or more) of the array members is inconsistent with what is currently on the parity disk at that address.
trurl Posted December 8, 2018

1 hour ago, DarkKnight said:
"The server is at about 30TB used out of 50TB. There's no other backup. Unraid is capable of emulating missing disks using parity, provided enough other disks are available. If it can do that, why can't we choose to have the data corrected rather than the parity?"

If you have single parity, then the number of disks required to emulate a single missing disk is ALL of the other disks. If you had dual parity you would be able to rebuild both of those data disks, but that would be extremely unlikely to fix anything: filesystem corruption needs to be repaired the way you already did it. And you really must have backups. You don't have to back up everything, but you need a plan. Keep another copy of anything important and irreplaceable on another system. Parity will not save you.
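trurl's point that emulating one missing disk requires every other disk falls straight out of the parity equation. A minimal XOR sketch (again a toy model, not Unraid's real P+Q dual-parity math): the missing byte is recovered as the XOR of parity with all surviving data disks, and dropping any survivor breaks the reconstruction.

```python
from functools import reduce

def xor_all(vals):
    """XOR together a list of byte values."""
    return reduce(lambda a, b: a ^ b, vals, 0)

data = [0x4A, 0x9F, 0x03, 0xE1]        # four data disks, one byte each
parity = xor_all(data)

# "Lose" disk 2: its contents can be emulated, but only by reading
# parity AND every remaining data disk.
missing = 2
survivors = [d for i, d in enumerate(data) if i != missing]
emulated = xor_all(survivors) ^ parity
assert emulated == data[missing]

# Drop any one survivor from the read and the XOR no longer
# resolves to the missing byte.
assert xor_all(survivors[1:]) ^ parity != data[missing]
```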
DarkKnight Posted December 8, 2018

I have dual parity. My concern was the warning message that data corruption could get worse due to using -L in the repair. If that's not the case here, then I have nothing to worry about. I'm running a non-correcting parity check now.

I also noticed that after 18 consecutive months of error-free checks, I got 394 errors on my last monthly check. No new SMART warnings, but I did have to shut Unraid off a couple of times in the past month while I was doing work on my servers. I suppose I could have had an unclean shutdown then.

In terms of backups of *really* important data like photos, I do have those on multiple machines. I don't have an off-site backup configured for older photos, but it's on the list; B2 is looking pretty cheap for that. Newer photos are covered by iCloud. I have a few TB of project files for old VHS home movies that I'd be pretty pissed to lose, but uncompressed they run about 30GB/hr and I have something like 100-200 hours of footage, though not all of it has been digitized yet. I can't imagine what my ISP would do if I tried to push 10+TB of uploads in a month on top of my already high usage to back that up along with my existing irreplaceable data. That would raise a red flag I don't want, and besides, I don't want the monthly sub. It would make more sense to get a 10-14TB drive, back everything up locally, and store that drive off-site. Just not in the budget. 😒
JorgeB Posted December 8, 2018

Just now, DarkKnight said:
"I got 394 errors on my last monthly check. No new SMART warnings, but I did have to shut Unraid off a couple of times in the past month while I was doing work on my servers. I suppose I could have had an unclean shutdown then."

Most likely. Consider creating checksums for your files; they're very handy for situations like these.
Garbonzo Posted September 26, 2023

On 12/8/2018 at 11:50 AM, JorgeB said:
"Most likely. Consider creating checksums for your files; they're very handy for situations like these."

I realize this thread is about 5 years old, but considering the drive issues I'm dealing with (which you're actually helping me with right now), I thought something like this might be a good idea. Is there a specific tool for Unraid that you recommend for creating and reconciling the checksums? TIA
JorgeB Posted September 26, 2023

Personally I would use btrfs or zfs, since they automatically checksum all data. For xfs you can use the File Integrity plugin, or an external tool like corz.
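For xfs shares, a per-file hash manifest is also easy to roll by hand. Here's a minimal sketch using Python's hashlib with BLAKE2b (the File Integrity plugin and corz use their own formats; this is just the underlying idea): build a manifest once, then after a questionable parity check, verification pinpoints exactly which files changed.

```python
import hashlib
from pathlib import Path

def hash_file(path, chunk=1 << 20):
    """Hash a file in 1MiB chunks so large media files don't load into RAM."""
    h = hashlib.blake2b(digest_size=32)
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def build_manifest(root):
    """Map every file under root to its current hash."""
    return {str(p): hash_file(p)
            for p in Path(root).rglob("*") if p.is_file()}

def verify(root, manifest):
    """Return the files whose hash no longer matches (or that are missing)."""
    return [p for p, digest in manifest.items()
            if not Path(p).is_file() or hash_file(p) != digest]
```

A manifest like this is the "external validation" mentioned earlier in the thread: if a non-correcting parity check reports errors, it tells you whether any actual file content changed or whether the mismatch was in unused space.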