xSmick Posted June 1, 2022

I know data loss is a scary term to use here, but in my case it happened: I lost 3 VM images completely. I had parity errors before and continue to have them now, which is what I'm trying to fix. Here's the back story:

I logged in yesterday to discover that the parity check that had run the day before had found and corrected 23819 errors. This is the first time in almost 2.5 years that I've had a discrepancy between my data disk and parity. I ran another parity check, which generated 4118 errors. I didn't think it was a big deal, but figured I'd check some of my VMs as a precaution. Out of my 8 VMs, 3.5 were total losses (the vdisks no longer existed). The .5 was functional enough that I was able to SSH in and grab a settings backup. This indicated corruption to me, since several vdisks should not be able to evaporate overnight.

I ran a filesystem check (-nv), which discovered issues related to metadata, then attempted the corrections gracefully (-v). That failed, and since I had already lost data I used the -L flag, which corrected the filesystem check issues. I ran a 3rd parity check, discovering 4231 errors. At this point, I installed some backup plugins. Afterwards, I gracefully shut down and initiated a memtest. There were no errors after 1 pass (I know this can be deceiving; I will be running a longer test today). I decided to rebuild several of the corrupted VMs to get some of my network functionality back online. Finally, I ran another parity check overnight, generating 1763 errors.

I'm not sure what could be causing the repeated parity errors and I'd like to get back to 0. Any help or insight would be greatly appreciated.

tl;dr
Parity check 1: 23819 errors
Parity check 2: 4118
Parity check 3: 4231
Parity check 4: 1763
FS check: issues discovered and corrected (-L)
memtest: 0 issues
data loss somewhere in the mix

I have attached my server diagnostics. I probably should have generated one when all this went down, but hindsight is 20/20. I can guarantee the server logs in the diag don't go back far enough, as I had significant server activity yesterday.

server-diagnostics-20220601-0917.zip
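For reference, the filesystem check sequence above was roughly the following (a minimal sketch, not my exact terminal history; it assumes disk1 maps to /dev/md1 and that the array was started in maintenance mode, so adjust the device path for your own setup):

# dry run: report problems but modify nothing
xfs_repair -nv /dev/md1

# attempt the repair; this refuses to run if the log cannot be replayed
xfs_repair -v /dev/md1

# last resort: zero the log first (can discard the most recent metadata changes)
xfs_repair -L /dev/md1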
JorgeB Posted June 1, 2022

RAM would still be my #1 suspect; run memtest for a few more passes.
xSmick Posted June 3, 2022

Several additional memtest passes came back clean. When I fired back up this morning I was met with corrupted metadata warnings. Had 3 VMs get nuked. Had to -L again. At this point I'm suspecting a failing drive. Any other ideas?

FS check generated:

Phase 1 - find and verify superblock...
        - block cache size set to 1523432 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 406849 tail block 406849
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
Metadata CRC error detected at 0x44ea1d, xfs_bmbt block 0x8edaf38/0x1000
btree block 0/18724327 is suspect, error -74
bad magic # 0x241a9c92 in inode 171 (data fork) bmbt block 18724327
bad data fork in inode 171
would have cleared inode 171
        - agno = 1
        - agno = 2
Metadata CRC error detected at 0x44ea1d, xfs_bmbt block 0xb3c75ff8/0x1000
Metadata CRC error detected at 0x44ea1d, xfs_bmbt block 0x7c3a1d88/0x1000
btree block 2/16333609 is suspect, error -74
bad magic # 0x241a9c92 in inode 2149391518 (data fork) bmbt block 284769065
bad data fork in inode 2149391518
would have cleared inode 2149391518
        - agno = 3
Metadata CRC error detected at 0x44ea1d, xfs_bmbt block 0xb3c75ff8/0x1000
btree block 3/10739507 is suspect, error -74
bad magic # 0x241a9c92 in inode 3306848649 (data fork) bmbt block 413392691
bad data fork in inode 3306848649
would have cleared inode 3306848649
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 3
        - agno = 2
entry "vdisk1.img" in shortform directory 170 references free inode 171
would have junked entry "vdisk1.img" in directory inode 170
bad magic # 0x241a9c92 in inode 171 (data fork) bmbt block 18724327
bad data fork in inode 171
would have cleared inode 171
entry "vdisk1.img" in shortform directory 2149391517 references free inode 2149391518
would have junked entry "vdisk1.img" in directory inode 2149391517
bad magic # 0x241a9c92 in inode 2149391518 (data fork) bmbt block 284769065
bad data fork in inode 2149391518
would have cleared inode 2149391518
entry "vdisk1.img" in shortform directory 3306848648 references free inode 3306848649
would have junked entry "vdisk1.img" in directory inode 3306848648
bad magic # 0x241a9c92 in inode 3306848649 (data fork) bmbt block 413392691
bad data fork in inode 3306848649
would have cleared inode 3306848649
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
entry "vdisk1.img" in shortform directory inode 170 points to free inode 171
would junk entry
        - agno = 1
        - agno = 2
entry "vdisk1.img" in shortform directory inode 2149391517 points to free inode 2149391518
would junk entry
        - agno = 3
entry "vdisk1.img" in shortform directory inode 3306848648 points to free inode 3306848649
would junk entry
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
Maximum metadata LSN (1984442613:-916355275) is ahead of log (48:406849).
Would format log to cycle 1984442616.
No modify flag set, skipping filesystem flush and exiting.
XFS_REPAIR Summary    Fri Jun 3 08:13:11 2022

Phase       Start           End             Duration
Phase 1:    06/03 08:13:03  06/03 08:13:03
Phase 2:    06/03 08:13:03  06/03 08:13:05  2 seconds
Phase 3:    06/03 08:13:05  06/03 08:13:10  5 seconds
Phase 4:    06/03 08:13:10  06/03 08:13:10
Phase 5:    Skipped
Phase 6:    06/03 08:13:10  06/03 08:13:11  1 second
Phase 7:    06/03 08:13:11  06/03 08:13:11

Total run time: 8 seconds
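Side note for anyone reading this later: the inode numbers in that output can be mapped back to file paths with find while the disk is mounted. A minimal sketch, assuming disk1 is mounted at /mnt/disk1 and using inode 171 from above as the example (paths are assumptions, adjust for your layout):

# print the path of whatever file owns inode 171 on disk1, without crossing into other filesystems
find /mnt/disk1 -xdev -inum 171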
trurl Posted June 3, 2022

1 minute ago, xSmick said:
corrupted metadata warnings.

Attach diagnostics to your NEXT post in this thread.
xSmick Posted June 3, 2022

5 minutes ago, trurl said:
Attach diagnostics to your NEXT post in this thread.

server-diagnostics-20220603-0843.zip
trurl Posted June 3, 2022

Disk1 SMART attributes look fine. You will have to disable spindown on the disk if you want it to complete an extended SMART test.
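If you prefer the command line, something like this should work once spindown is disabled (a rough sketch; /dev/sdX is a placeholder, use whatever device disk1 shows up as in your diagnostics):

# start the extended (long) self-test; it typically takes several hours
smartctl -t long /dev/sdX

# check overall SMART status and test progress while it runs
smartctl -a /dev/sdX

# view the self-test log once it finishes
smartctl -l selftest /dev/sdX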
JorgeB Posted June 3, 2022

53 minutes ago, xSmick said:
At this point I'm suspecting a failing drive.

Could be. Do you run scheduled parity checks? Any errors there?
xSmick Posted June 3, 2022

2 hours ago, JorgeB said:
Could be. Do you run scheduled parity checks? Any errors there?

This all started with errors on parity checks. See my 1st post. I'm an hour into another check with 8792 errors so far.
JorgeB Posted June 3, 2022

If you have a spare, replace disk1 and then run a couple of parity checks to confirm there are no more sync errors.
JorgeB Posted June 3, 2022

18 minutes ago, xSmick said:
This all started with errors on parity checks.

Yes, and sorry, I reply to so many threads that things sometimes get lost. A disk corrupting data is uncommon, but it wouldn't be the first time. It's worth testing if you have a spare you could use, even a smaller one, though that would require a new config.
xSmick Posted June 3, 2022

40 minutes ago, JorgeB said:
If you have a spare, replace disk1 and then run a couple of parity checks to confirm there are no more sync errors.

31 minutes ago, JorgeB said:
It's worth testing if you have a spare you could use, even a smaller one, though that would require a new config.

No worries, I kind of figured that was the case and I appreciate all the input. I do have a spare, although it's smaller. At this point, I'm going to wait until tomorrow; I have 2 new drives arriving to help troubleshoot/replace. Worst case, I have spare drives. When the new drives arrive, would you recommend rebuilding parity or rebuilding disk1? I would imagine I should start with disk1.
JorgeB Posted June 3, 2022

If the new disk is the same size, rebuild from parity, then run a couple of parity checks.
xSmick Posted June 6, 2022

So this was most likely the early stages of drive failure. disk1 has been swapped with a new drive and rebuilt from parity; after ~30 hours of operation and 3 parity checks there have been no issues. Thanks for all the input @JorgeB 🍻

Edited June 6, 2022 by xSmick