
Repeated parity checks showing errors; also, data loss



I know data loss is a scary term to use here, but in my case it happened: I lost 3 VM images completely. I had parity errors before and continue to have them now, which is what I'm trying to fix. Here's the backstory:

 

I logged in yesterday to discover that the parity check that had run the day before had found and corrected 23819 errors. This is the first time in almost 2.5 years that I've had a discrepancy between my data disk and parity. I ran another parity check, which generated 4118 errors. I didn't think it was a big deal, but figured I'd check some of my VMs as a precaution. Out of my 8 VMs, 3.5 were total losses (the vdisks no longer existed); the ".5" was functional just long enough for me to SSH in and grab a settings backup. This indicated corruption to me, as several vdisks should not be able to evaporate overnight.

I ran a filesystem check (-nv), which discovered metadata issues, then attempted the corrections gracefully (-v). That failed, and since I had already lost data, I used the -L flag, which cleared the filesystem errors. I ran a 3rd parity check, which discovered 4231 errors. At this point, I installed some backup plugins, then gracefully shut down and initiated a memtest. There were no errors after 1 pass (I know a single pass can be deceiving; I'll be running a longer test today). I decided to rebuild several of the corrupted VMs to get some of my network functionality back online. Finally, I ran another parity check overnight, generating 1763 errors.

I'm not sure what could be causing the repeated parity errors and I'd like to get back to 0. Any help or insight would be greatly appreciated. The filesystem check commands I used are roughly sketched below.
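
For reference, the check/repair sequence looked roughly like this. It was run with the array in maintenance mode, and /dev/md1 is just an assumption for how disk1 shows up on my setup; adjust the device for your own system and double-check before running the destructive options:

# Read-only check, makes no changes (this is what I ran first)
xfs_repair -nv /dev/md1

# Attempt a normal repair
xfs_repair -v /dev/md1

# Last resort: -L zeroes the metadata log and can throw away recent
# transactions, so only use it when the normal repair refuses to run
xfs_repair -vL /dev/md1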

 

tl;dr

Parity check 1: 23819 errors

Parity check 2: 4118

Parity check 3: 4231

Parity check 4: 1763

FS check: issues discovered and corrected (-L)

memtest: 0 issues

data loss somewhere in the mix

 

I have attached my server diagnostics. I probably should have generated them when all this went down, but hindsight is 20/20. I can guarantee the server logs in the diag don't go back far enough, as I had significant server activity yesterday.

server-diagnostics-20220601-0917.zip
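
(For anyone else reading: the zip was generated from the Unraid terminal. As far as I know the command below is equivalent to Tools → Diagnostics in the GUI and writes the archive to the flash drive, but treat the output path as an assumption for my setup.)

# Generate a diagnostics zip from the console (saved under /boot/logs, I believe)
diagnostics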


Several additional memtest passes came back clean. When I fired the server back up this morning I was met with corrupted metadata warnings: 3 VMs got nuked, and I had to repair with -L again. At this point I'm suspecting a failing drive. Any other ideas?
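
Before the FS check output below, for reference, this is roughly how I've been looking at drive health from the console. The device name is only an example; it will differ on your system:

# Dump SMART attributes and past self-test results (example device)
smartctl -a /dev/sdb

# Kick off an extended self-test, then review the results once it completes
smartctl -t long /dev/sdb
smartctl -l selftest /dev/sdb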

 

FS check generated:

Phase 1 - find and verify superblock...
        - block cache size set to 1523432 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 406849 tail block 406849
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
Metadata CRC error detected at 0x44ea1d, xfs_bmbt block 0x8edaf38/0x1000
btree block 0/18724327 is suspect, error -74
bad magic # 0x241a9c92 in inode 171 (data fork) bmbt block 18724327
bad data fork in inode 171
would have cleared inode 171
        - agno = 1
        - agno = 2
Metadata CRC error detected at 0x44ea1d, xfs_bmbt block 0xb3c75ff8/0x1000
Metadata CRC error detected at 0x44ea1d, xfs_bmbt block 0x7c3a1d88/0x1000
btree block 2/16333609 is suspect, error -74
bad magic # 0x241a9c92 in inode 2149391518 (data fork) bmbt block 284769065
bad data fork in inode 2149391518
would have cleared inode 2149391518
        - agno = 3
Metadata CRC error detected at 0x44ea1d, xfs_bmbt block 0xb3c75ff8/0x1000
btree block 3/10739507 is suspect, error -74
bad magic # 0x241a9c92 in inode 3306848649 (data fork) bmbt block 413392691
bad data fork in inode 3306848649
would have cleared inode 3306848649
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 3
        - agno = 2
entry "vdisk1.img" in shortform directory 170 references free inode 171
would have junked entry "vdisk1.img" in directory inode 170
bad magic # 0x241a9c92 in inode 171 (data fork) bmbt block 18724327
bad data fork in inode 171
would have cleared inode 171
entry "vdisk1.img" in shortform directory 2149391517 references free inode 2149391518
would have junked entry "vdisk1.img" in directory inode 2149391517
bad magic # 0x241a9c92 in inode 2149391518 (data fork) bmbt block 284769065
bad data fork in inode 2149391518
would have cleared inode 2149391518
entry "vdisk1.img" in shortform directory 3306848648 references free inode 3306848649
would have junked entry "vdisk1.img" in directory inode 3306848648
bad magic # 0x241a9c92 in inode 3306848649 (data fork) bmbt block 413392691
bad data fork in inode 3306848649
would have cleared inode 3306848649
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
entry "vdisk1.img" in shortform directory inode 170 points to free inode 171
would junk entry
        - agno = 1
        - agno = 2
entry "vdisk1.img" in shortform directory inode 2149391517 points to free inode 2149391518
would junk entry
        - agno = 3
entry "vdisk1.img" in shortform directory inode 3306848648 points to free inode 3306848649
would junk entry
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
Maximum metadata LSN (1984442613:-916355275) is ahead of log (48:406849).
Would format log to cycle 1984442616.
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Fri Jun  3 08:13:11 2022

Phase		Start		End		Duration
Phase 1:	06/03 08:13:03	06/03 08:13:03
Phase 2:	06/03 08:13:03	06/03 08:13:05	2 seconds
Phase 3:	06/03 08:13:05	06/03 08:13:10	5 seconds
Phase 4:	06/03 08:13:10	06/03 08:13:10
Phase 5:	Skipped
Phase 6:	06/03 08:13:10	06/03 08:13:11	1 second
Phase 7:	06/03 08:13:11	06/03 08:13:11

Total run time: 8 seconds
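
In case it helps anyone reading later: the inode numbers reported above can be mapped back to paths with find if you want to confirm which files are affected. The mount point is just an example for disk1:

# Map a reported inode number (e.g. directory inode 170) back to a path;
# the disk must be mounted for this to work
find /mnt/disk1 -inum 170 2>/dev/null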

 

18 minutes ago, xSmick said:

This all started with errors on parity checks.

Yes, and sorry, I reply to so many threads that I sometimes get lost.

 

A disk corrupting data is uncommon, but it wouldn't be the first time. It's worth testing if you have a spare you could use, even a smaller one, though that would require a new config.

40 minutes ago, JorgeB said:

If you have a spare, replace disk1 and then run a couple of parity checks to confirm there are no more sync errors.

 

31 minutes ago, JorgeB said:

Yes, and sorry, I reply to so many threads that I sometimes get lost.

 

A disk corrupting data is uncommon, but it wouldn't be the first time. It's worth testing if you have a spare you could use, even a smaller one, though that would require a new config.

No worries, I kind of figured that was the case, and I appreciate all the input. I do have a spare, although it's smaller. At this point, I'm going to wait until tomorrow; I have 2 new drives arriving to help troubleshoot/replace. Worst case, I have spare drives.

 

When the new drives arrive, would you recommend rebuilding parity or rebuilding disk1? I would imagine I should start with disk1.


So this was most likely the early stages of drive failure. disk1 has been swapped with a new drive and rebuilt from parity, and after ~30 hours of operation and 3 parity checks there have been no issues. Thanks for all the input @JorgeB 🍻
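
(Side note for anyone searching later: I believe the follow-up checks can also be started and monitored from the terminal with mdcmd, but the syntax below is from memory and may differ between Unraid versions, so treat it as a rough sketch and prefer the Check button on the Main page if in doubt.)

# Start a non-correcting parity check (syntax is an assumption; verify for your version)
mdcmd check NOCORRECT

# Watch progress via the md status variables
mdcmd status | grep -i resync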

