Repeated parity checks showing errors. also, data loss

xSmick · June 1, 2022

I know data loss is a scary term to use here but in my case it happened. I lost 3 VM images completely. I had parity errors before and continue to have them now which is what I'm trying to fix. Here's the back story:

I logged in yesterday to discover the parity check that had run the day before had found and corrected 23819 errors. This is the 1st time in almost 2.5 years that I've had a discrepancy between my data disk and parity. I ran another parity check which generated 4118 errors. I didn't think it was a big deal but figured I'd check some of my VMs as a precaution. Out of my 8 VMs, 3.5 were total losses (the vdisks no longer existed). The .5 was functional enough that I was able to SSH in and grab a settings backup. This indicated to me that I had corruption as I should not be able to have several vdisks evaporate overnight. I ran a filesystem check (-nv) which discovered issues related to metadata then attempted the corrections gracefully (-v). This failed and I since I had already lost data I used the -L flag. This worked to correct the filesystem check issues. I ran a 3rd parity check, discovering 4231 errors. At this point, I installed some backup plugins. After, I gracefully shutdown and initiated a memtest. There were no errors after 1 pass (I know this can be deceiving, I will be running a longer test today). I decided to rebuild several of the corrupted VMs to get some of my network functionality back online. Finally, I ran another parity check overnight, generating 1763 errors. I'm not sure what could be causing the repeated parity errors and I'd like to get back to 0. Any help or insight would be greatly appreciated.

tl;dr

Parity check 1: 23819 errors

Parity check 2: 4118

Parity check 3: 4231

Parity check 4: 1763

FS check: issues discovered and corrected (-L)

memtest: 0 issues

data loss somewhere in the mix

I have attached my server diagnostics. I probably should have generated one when all this went down but hindsight is 20/20. I can guarantee the server logs in the diag don't go back far enough as I had significant server activity yesterday.

server-diagnostics-20220601-0917.zip

JorgeB · June 1, 2022

RAM would still be my #1 suspect, run memtest for a few more passes.

xSmick · June 3, 2022

Several additional memtest passes came back clean. When I fired back up this morning I was met with corrupted metadata warnings. Had 3 VMs get nuked. Had to -L again. At this point I'm suspecting a failing drive. Any other ideas?

FS check generated:

Phase 1 - find and verify superblock...
        - block cache size set to 1523432 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 406849 tail block 406849
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
Metadata CRC error detected at 0x44ea1d, xfs_bmbt block 0x8edaf38/0x1000
btree block 0/18724327 is suspect, error -74
bad magic # 0x241a9c92 in inode 171 (data fork) bmbt block 18724327
bad data fork in inode 171
would have cleared inode 171
        - agno = 1
        - agno = 2
Metadata CRC error detected at 0x44ea1d, xfs_bmbt block 0xb3c75ff8/0x1000
Metadata CRC error detected at 0x44ea1d, xfs_bmbt block 0x7c3a1d88/0x1000
btree block 2/16333609 is suspect, error -74
bad magic # 0x241a9c92 in inode 2149391518 (data fork) bmbt block 284769065
bad data fork in inode 2149391518
would have cleared inode 2149391518
        - agno = 3
Metadata CRC error detected at 0x44ea1d, xfs_bmbt block 0xb3c75ff8/0x1000
btree block 3/10739507 is suspect, error -74
bad magic # 0x241a9c92 in inode 3306848649 (data fork) bmbt block 413392691
bad data fork in inode 3306848649
would have cleared inode 3306848649
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 3
        - agno = 2
entry "vdisk1.img" in shortform directory 170 references free inode 171
would have junked entry "vdisk1.img" in directory inode 170
bad magic # 0x241a9c92 in inode 171 (data fork) bmbt block 18724327
bad data fork in inode 171
would have cleared inode 171
entry "vdisk1.img" in shortform directory 2149391517 references free inode 2149391518
would have junked entry "vdisk1.img" in directory inode 2149391517
bad magic # 0x241a9c92 in inode 2149391518 (data fork) bmbt block 284769065
bad data fork in inode 2149391518
would have cleared inode 2149391518
entry "vdisk1.img" in shortform directory 3306848648 references free inode 3306848649
would have junked entry "vdisk1.img" in directory inode 3306848648
bad magic # 0x241a9c92 in inode 3306848649 (data fork) bmbt block 413392691
bad data fork in inode 3306848649
would have cleared inode 3306848649
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
entry "vdisk1.img" in shortform directory inode 170 points to free inode 171
would junk entry
        - agno = 1
        - agno = 2
entry "vdisk1.img" in shortform directory inode 2149391517 points to free inode 2149391518
would junk entry
        - agno = 3
entry "vdisk1.img" in shortform directory inode 3306848648 points to free inode 3306848649
would junk entry
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
Maximum metadata LSN (1984442613:-916355275) is ahead of log (48:406849).
Would format log to cycle 1984442616.
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Fri Jun  3 08:13:11 2022

Phase		Start		End		Duration
Phase 1:	06/03 08:13:03	06/03 08:13:03
Phase 2:	06/03 08:13:03	06/03 08:13:05	2 seconds
Phase 3:	06/03 08:13:05	06/03 08:13:10	5 seconds
Phase 4:	06/03 08:13:10	06/03 08:13:10
Phase 5:	Skipped
Phase 6:	06/03 08:13:10	06/03 08:13:11	1 second
Phase 7:	06/03 08:13:11	06/03 08:13:11

Total run time: 8 seconds

trurl · June 3, 2022

1 minute ago, xSmick said:

corrupted metadata warnings.

attach diagnostics to your NEXT post in this thread

xSmick · June 3, 2022

5 minutes ago, trurl said:

attach diagnostics to your NEXT post in this thread

server-diagnostics-20220603-0843.zip

trurl · June 3, 2022

Disk1 SMART attributes look fine. You will have to disable spindown on the disk if you want it to complete an extended SMART test.

JorgeB · June 3, 2022

53 minutes ago, xSmick said:

At this point I'm suspecting a failing drive.

Could be, did you run schedule parity checks? Any errors there?

xSmick · June 3, 2022

2 hours ago, JorgeB said:

Could be, did you run schedule parity checks? Any errors there?

This all started with errors on parity checks. See my 1st post. I'm an hour into another with 8792 errors so far.

JorgeB · June 3, 2022

If you have a spare replace disk1 and then run a couple of parity checks to see if there are no more sync errors.

JorgeB · June 3, 2022

18 minutes ago, xSmick said:

This all started with errors on parity checks.

Yes, and sorry, I reply to so many threads that sometimes get lost.

Disk corrupting data is uncommon, but it wouldn't be the first time, worth testing if you have a spare you could use, even a smaller one, though that would require a new config.

xSmick · June 3, 2022

40 minutes ago, JorgeB said:

If you have a spare replace disk1 and then run a couple of parity checks to see if there are no more sync errors.

31 minutes ago, JorgeB said:

Yes, and sorry, I reply to so many threads that sometimes get lost.

Disk corrupting data is uncommon, but it wouldn't be the first time, worth testing if you have a spare you could use, even a smaller one, though that would require a new config.

No worries, kind of figured that was the case and I appreciate all the input. I do have a spare although it's smaller. At this point, I'm going to wait until tomorrow. I have 2 new drives arriving to help troubleshoot/replace. Worst case, I have spare drives.

When the new drives arrive, would you recommend rebuilding parity or rebuilding disk1? I would imagine I should start with disk1.

JorgeB · June 3, 2022

If the new disk is the same size rebuild from parity, then run a couple of parity checks.

xSmick · June 6, 2022

So this was most likely what I would assume to be early stages of drive failure. disk1 has been swapped with a new drive, rebuilt from parity, ~30 hours of operation and 3 parity checks with no issues. Thanks for all the input @JorgeB🍻

Edited June 6, 2022 by xSmick

Repeated parity checks showing errors. also, data loss

Recommended Posts

xSmick

Link to comment

JorgeB

Link to comment

xSmick

Link to comment

trurl

Link to comment

xSmick

Link to comment

trurl

Link to comment

JorgeB

Link to comment

xSmick

Link to comment

JorgeB

Link to comment

JorgeB

Link to comment

xSmick

Link to comment

JorgeB

Link to comment

xSmick

Link to comment

Join the conversation