
Questions regarding drive upgrades (XFS and parity errors)



I'm in the process of upgrading my Unraid server by replacing some drives in the array with bigger ones. Here's what I've done so far:

 

1. I precleared all new drives before doing anything. Meanwhile, I ran a Parity-Check on the existing array just in case, which went fine:

 

Parity-Check 2024-04-14, 04:33:07 18 TB 1 day, 10 hr, 25 min, 34 sec 145.2 MB/s OK 0

 

2. Since I had an 18TB parity drive and the new drives were all 20TB, I removed the 18TB parity drive and installed a 20TB drive; the Parity-Sync went fine:

 

Parity-Sync 2024-04-16, 05:27:27 20 TB 1 day, 15 hr, 24 min, 56 sec 141.0 MB/s OK 0

 

3. I removed one of the drives from the array (12TB) and installed a 20TB drive; the Data-Rebuild went fine:

 

Data-Rebuild 2024-04-18, 08:02:15 20 TB 1 day, 14 hr, 40 min, 55 sec 143.6 MB/s OK 0

 

4. Next, I removed another drive from the array (12TB) and installed a 20TB drive. Again, the Data-Rebuild went (apparently) fine:

 

Data-Rebuild 2024-04-24, 12:09:44 20 TB 1 day, 11 hr, 7 min 158.2 MB/s OK 0

 

However, when I restarted the array, the last drive that was rebuilt couldn't be mounted (the previous one, from step 3, was fine) and the logs showed an XFS error ("Corruption warning: Metadata has LSN ahead of current LSN") along with the instruction to "Please unmount and run xfs_repair". I searched the forum for information on this error and ran xfs_repair as suggested in many discussions:


 

root@unraid:~# xfs_repair -vn /dev/sdg1
Phase 1 - find and verify superblock...
        - block cache size set to 2985856 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1541477 tail block 1541477
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 6
        - agno = 10
        - agno = 7
        - agno = 3
        - agno = 8
        - agno = 5
        - agno = 4
        - agno = 9
        - agno = 1
        - agno = 2
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Wed Apr 24 13:53:29 2024

Phase           Start           End             Duration
Phase 1:        04/24 13:53:25  04/24 13:53:26  1 second
Phase 2:        04/24 13:53:26  04/24 13:53:26
Phase 3:        04/24 13:53:26  04/24 13:53:28  2 seconds
Phase 4:        04/24 13:53:28  04/24 13:53:28
Phase 5:        Skipped
Phase 6:        04/24 13:53:28  04/24 13:53:29  1 second
Phase 7:        04/24 13:53:29  04/24 13:53:29

Total run time: 4 seconds

root@unraid:~# xfs_repair -v /dev/sdg1
Phase 1 - find and verify superblock...
        - block cache size set to 2985856 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1541477 tail block 1541477
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 4
        - agno = 10
        - agno = 6
        - agno = 9
        - agno = 5
        - agno = 8
        - agno = 2
        - agno = 7
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (14:1541489) is ahead of log (14:1541477).
Format log to cycle 17.

        XFS_REPAIR Summary    Wed Apr 24 13:55:05 2024

Phase           Start           End             Duration
Phase 1:        04/24 13:54:42  04/24 13:54:42
Phase 2:        04/24 13:54:42  04/24 13:54:42
Phase 3:        04/24 13:54:42  04/24 13:54:44  2 seconds
Phase 4:        04/24 13:54:44  04/24 13:54:44
Phase 5:        04/24 13:54:44  04/24 13:54:44
Phase 6:        04/24 13:54:44  04/24 13:54:46  2 seconds
Phase 7:        04/24 13:54:46  04/24 13:54:46

Total run time: 4 seconds
done

 

 

This seemed to fix the issue, and after starting the array all drives mounted again.
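For reference, before starting the array I could also have confirmed that the repair took by re-running the read-only check. This is just a sketch using the same /dev/sdg1 device from the output above; the device letter will almost certainly differ on other systems:

root@unraid:~# xfs_repair -n /dev/sdg1    # read-only check; should now complete without the LSN warning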

 

I'm now running a Parity-Check. Early in the process it showed 31 sync errors corrected and stayed that way for a while, but now (~29%) it has jumped to 521770 sync errors corrected. I had never had a single parity check error before (and the Unraid server has been running fine since September 2022).

 

Should I be worried? Should I have done anything differently? Should I be doing something else now?

 

I'm guessing I should let the Parity-Check process finish and then run it again (in non-correcting mode this time). If it doesn't show any errors I should be good, right?

 

To be honest, I'm not sure why the second drive rebuild "failed", leaving the drive in an unmountable state (though xfs_repair seemed to fix it easily and didn't find any major problems with it, AFAIK), and I also don't know why the Parity-Check is finding errors now (unless xfs_repair changed something in the drive data that requires the parity to be adjusted).

 

Any hints/suggestions are appreciated.

4 hours ago, JorgeB said:

Certainly not normal, if you have the diags from the rebuilds post them.

These are the diagnostics from the rebuild of the second data drive (after rebuilding the parity drive and one data drive without errors).

 

Old disk was:

WDC_WD120EDAZ-11F3RA0_8CKXAD2F

 

Replacement disk was:

WDC_WD200EDGZ-11B9PA0_2HG0PK2N (sdf)

 

While the rebuild was running, I precleared the first old drive, WDC_WD120EDAZ-11F3RA0_5PG9BXWC (sdg).

 

I will post the diagnostics again when the Parity-Check task ends.

 

I did make checksums of the contents of both data drives I planned on replacing; I think I will verify them once the Parity-Check ends, to be sure that all the files (the rebuilt data) are correct. I already checked the first 20TB drive I had replaced (with no mismatches in the checksums, which is why I precleared the old 12TB drive), so I will do the second 20TB disk next. The good thing is I still have the second 12TB drive untouched, so if there is any mismatch I should be able to recover the original files from it.
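In case it helps anyone, this is roughly how I build and verify those checksum files. It's only a sketch: the /mnt/disk2 mount point and the /boot/checksums-disk2.md5 file name are placeholders for whatever disk and location you actually use.

# Build a checksum file from the original drive before swapping it out.
# Paths are placeholders; adjust the disk number and the output location.
cd /mnt/disk2
find . -type f -print0 | xargs -0 md5sum > /boot/checksums-disk2.md5

# After the rebuild, verify the rebuilt drive against that file.
# --quiet makes md5sum print only the files that do NOT match.
cd /mnt/disk2
md5sum -c --quiet /boot/checksums-disk2.md5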

 

 

 

boxy-diagnostics-20240424-1231.zip

4 minutes ago, JonathanM said:

It's a good idea to do a non-correcting parity check after doing a disk rebuild. Rebuilds don't "check their work" by reading what was written to the rebuilt drive, it's assumed if a write completes without error, it wrote correctly.

Ah, thanks. I'm always extra careful with these things and I usually do extra steps to ensure everything is OK. Since I was replacing three "big" drives this time, a parity check after each replacement would have meant an extra day "lost" per drive, and since there was (apparently) nothing wrong with the parity or data rebuilds, I thought I would be safe enough with a final parity check after all drives had been swapped and rebuilt. My fault then 😔

 

I will take note of your suggestion and from now on I will play it even safer, doing extra checks after each important step. Maybe it would be nice to add some kind of warning after a rebuild is done to let the user know that a non-correcting parity check is recommended?

 

The parity check is at 54% right now with 521794 sync errors corrected. Once it ends I will test the rebuilt data drives' contents against my checksum files to verify that all files match, and I will let you know the result. If the checksums match, I will do a non-correcting parity check that will hopefully return zero errors. If there's a problem with the file checksums I will also come back running to ask for more advice (although I'm assuming the answer would be "plug the old data drive and copy the good files over"). I said it once before and I will say it again: that's one of the beauties of Unraid. If something goes wrong you can always go to each individual drive and try to recover as much data as possible, and it's even possible and easy to go back to a previous state by just re-adding an old drive to the array (if the rebuild was done in maintenance mode and you are sure that nothing was changed on the array).
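For the "plug the old data drive and copy the good files over" scenario, my rough plan would look something like the sketch below. It assumes the old drive is mounted read-only outside the array (a mount point like /mnt/disks/old12tb is what the Unassigned Devices plugin would give you), and the file path is just a placeholder:

# Copy a known-good file from the old drive back over the corrupted copy.
# Paths are placeholders; --checksum makes rsync compare file contents rather than size/mtime.
rsync -av --checksum "/mnt/disks/old12tb/path/to/file.mkv" "/mnt/disk2/path/to/"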

2 hours ago, itimpi said:

As long as you have the original drives intact after the rebuild then you can always check the data integrity at a later date.

After I swapped the first data drive (12TB) for the new one (20TB) and the rebuild was done, I checked the data on it against my checksum file and found no discrepancies. I have not checked the second data drive that was rebuilt, because just after the rebuild it started with the XFS errors. I'll check it against the original checksum file from the old drive as soon as the parity check ends.

 

49 minutes ago, JorgeB said:

The diags posted only have a rebuild, and there weren't any disk errors, but if it's for example RAM problem nothing would be logged anyway.

Coincidentally, one of the things I did before I started the server upgrade was to retest the memory using Memtest86 Pro, and no errors were found. I'll do another memory test anyway, leaving it running for as long as possible, just in case.


Parity check ended with 521800 sync errors corrected. Diagnostics attached.

 

I checked the rebuilt data drives against my original checksum files: the first drive rebuilt was totally fine (0 mismatches), while the second drive rebuilt had ONE single file where the checksum did not match (a 6.6G mkv file). I will recover that particular file from the old data drive. That seems consistent with the sync errors corrected; I'm not sure how much data one sync error represents, but if it's one bit or even one byte it makes sense that it only affected so little data (one single file). I'm also not sure if it's normal that only one file is affected by corruption, instead of having more scattered errors around.
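As a back-of-the-envelope estimate (and this is purely an assumption on my part, since I don't know what unit Unraid actually counts a sync error in): if each reported sync error corresponds to one 4096-byte block, the 521800 corrected errors would cover roughly 2 GiB, which could still fall entirely within a single 6.6G mkv file:

# Assumption: one sync error = one 4096-byte block (not confirmed, just a rough estimate).
echo "$(( 521800 * 4096 / 1024 / 1024 )) MiB"    # prints "2038 MiB", i.e. about 2 GiB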

 

I still have one more data drive to swap (the 14TB Seagate drive for the old 18TB Western Digital I was using for parity), but right now I'm a bit worried 😅 The next logical steps before doing that drive replacement and data rebuild are a new non-correcting parity check, as @JonathanM recommended, followed by a long Memtest86 Pro run (as long as possible) to rule out a possible problem with the RAM.

 

Since the first thing I did was replace the parity drive, and the first rebuilt 20TB data drive was totally fine (all files matched the originals), I have to assume the problem appeared when I rebuilt the second data drive, even though I didn't do a non-correcting parity check after every step. At that point I had a corrupted file on that drive, and running a correcting Parity-Check just made things worse: sync errors were found and "corrected" when, in fact, the parity was fine. Had I done a non-correcting parity check after the second data drive was rebuilt and caught the corrupted file, it might have been solved by just rebuilding the data drive again (although the XFS error was already telling me that something was not right).

 

Lesson learned. Besides comparing checksums for all files after a data drive rebuild, I will also perform a non-correcting parity check after every rebuild from now on.

 

I'll report the results of the non-correcting parity check and the Memtest86 Pro tests.

boxy-diagnostics-20240427-1103.zip

4 hours ago, Devotee said:

At that point I had a corrupted file on that drive, and running a correcting Parity-Check just made things worse: sync errors were found and "corrected" when, in fact, the parity was fine. Had I done a non-correcting parity check after the second data drive was rebuilt and caught the corrupted file, it might have been solved by just rebuilding the data drive again

This is why automatic parity checks should always be non-correcting. If errors are found, there should be an investigation and a probable cause found before action is taken.

5 hours ago, JonathanM said:

This is why automatic parity checks should always be non-correcting. If errors are found, there should be an investigation and a probable cause found before action is taken.

I'm running the non-correcting parity check right now; it's at 35% with zero errors found. Yay!

 

I still have a few things to do (test the RAM and upgrade the 14TB drive to 18TB using the old spare parity drive) but once I'm done I think I will suggest a few changes to the Unraid team:

 

1. After a rebuild is done, suggest to the user that, even if no errors were found during the rebuild process, a non-correcting parity check is recommended.

2. Since the "apply corrections" option is enabled by default on the main page, maybe it would be a good idea to disable it by default and, if the user enables it, show a warning suggesting that a non-correcting parity check is recommended before writing any corrections/changes to the array.

3. When scheduling a parity check, do not set "Write corrections to parity disk" to yes by default, and warn the user that (and I'm quoting you) "automatic parity checks should always be non-correcting. If errors are found, there should be an investigation and a probable cause found before action is taken".

 

I always ran the parity checks with the default "apply corrections" setting enabled because I never gave it a second thought. If I had read the recommendations here in the forum (which usually doesn't happen until you have a problem), I would have always done a non-correcting parity check first, as you suggest. Even the "schedule parity check" option is set to "Write corrections to parity disk" by default which, after what you've told me, is not the best thing to do (hence the third suggestion I would make to the Unraid developers).


This is a contentious subject, as in normal operation the parity drive is the last to flush writes, meaning it's way more likely for the parity disk to be the one that needs to be changed if there is a sync error. The defaults have been switched back and forth a few times.

 

Non-correcting is obviously my preference, but the longer a sync error stays around, the greater the chance that a drive fails before it's corrected, which would result in corruption on the rebuilt disk. It's not completely black and white.
