XFS: Metadata corruption detected

February 9, 2023

Hi fine folks,

I need some advice on how best to repair a corrupted filesystem on my Disk 2 (md2/sdc) drive, and on what caused the issue in the first place. It's a new hard drive that I installed less than 2 weeks ago, because the old disk was developing more and more bad sectors. The new disk was initialized by doing a rebuild of the array.

Everything seemed to work fine until today, when suddenly all shares were in read-only mode. This surprised me since the web dashboard didn't show any alerts or errors as far as I could tell, and it was only when I looked at the logs that I noticed something was going on.

I then ran a scan-only of the main drives in my system (minus the parity drive) by using:

xfs_repair -nv /dev/mdY   (where Y = [1..3])

No issues were found on md1 and md3, but md2 had a scarily long list of issues. See the attached file md2-xfs-scan.zip for the output.
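
For anyone repeating the scan described above, the loop can be sketched in shell. This is a minimal sketch, not an Unraid-provided tool: the function name `check_md_devices` and the `DRY_RUN` variable are made up for illustration, and by default it only prints the commands. As in the command above, `-n` keeps `xfs_repair` in no-modify mode and `-v` makes it verbose.

```shell
#!/bin/sh
# Hypothetical helper: run (or just print) a no-modify XFS check
# for each md device number given as an argument.
# DRY_RUN defaults to 1, so nothing is executed unless DRY_RUN=0 is set.
check_md_devices() {
    for y in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run: show the command that would be executed.
            echo "xfs_repair -nv /dev/md${y}"
        else
            # -n: scan only, never write to the filesystem; -v: verbose.
            xfs_repair -nv "/dev/md${y}"
        fi
    done
}
```

Calling `check_md_devices 1 2 3` prints the three scan commands. `DRY_RUN=0 check_md_devices 2` would actually scan md2, which assumes the md devices exist (for example with the array started in maintenance mode).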

Is my best option to try to rebuild disk 2 again, or should I try xfs_repair -v /dev/md2? My understanding is that by using /dev/md2 parity will be preserved, and that's not the case if I use /dev/sdc. According to the SMART data (fresh report attached), the physical drive itself seems to be fine. I ran a parity check 11 days ago, and 0 errors were found then. So I'm hoping that my parity data is still intact. I currently have the array in maintenance mode to avoid further damage. I have not had a thorough look at the shares to see if anything is missing, but I'm pretty sure I'm missing some files in the appdata share.

Hardware:

PC: Dell Optiplex 9020 w/ Intel Core i7-4770 CPU and 24 GiB DDR3 RAM
Parity drive: Seagate IronWolf 10TB
Disk 1: Western Digital 10TB WD Red Plus NAS
Disk 2: Western Digital 10TB WD Red Plus NAS
Disk 3: Seagate Exos x10 10TB Enterprise
Unassigned drive: Seagate Exos x10 10TB Enterprise
I/O Crest 4 Port SATA III PCI-e 2.0 x1 Controller Card Marvell 9215 Non-Raid with Low Profile Bracket SI-PEX40064 (not 100% sure about this one, but it's Marvell based at least).

Unraid version 6.9.2 (Plus license). Been using Unraid since late 2015, and love it! 😊

As for why this happened, I have a hunch that it's due to my cheap SATA controller. Initially I only used it for the unassigned drives (I had 2 before), but when my drives started dying one by one and I replaced them with newer drives, some of them did not show up when I used the SATA ports on the motherboard. So for the array I have two drives hooked up to the motherboard, and two to the PCI Express SATA controller card. I recall reading on this forum that cards based on this chip were not recommended by the professionals, so I guess I was asking for trouble by doing so anyway 😣 Things seemed to work fine with that setup for a long time, though.

tower2-diagnostics-20230209-1405.zip md2-xfs-scan.zip tower2-smart-20230209-1709.zip

February 10, 2023

Community Expert

Marvell controllers are not recommended as you noted.

Looks like it was having connection problems with disk3 while trying to emulate disk2.

Jan 29 01:57:34 Tower2 kernel: md: recovery thread: recon D2 ...
Jan 29 02:10:00 Tower2 kernel: ata8.00: exception Emask 0x10 SAct 0x0 SErr 0x190002 action 0xe frozen
Jan 29 02:10:00 Tower2 kernel: ata8.00: irq_stat 0x80400000, PHY RDY changed
Jan 29 02:10:00 Tower2 kernel: ata8: SError: { RecovComm PHYRdyChg 10B8B Dispar }
Jan 29 02:10:00 Tower2 kernel: ata8.00: failed command: READ DMA EXT
Jan 29 02:10:00 Tower2 kernel: ata8.00: cmd 25/00:40:18:66:53/00:05:68:00:00/e0 tag 6 dma 688128 in
Jan 29 02:10:00 Tower2 kernel:         res 50/00:00:17:66:53/00:00:68:00:00/e0 Emask 0x10 (ATA bus error)
Jan 29 02:10:00 Tower2 kernel: ata8.00: status: { DRDY }
Jan 29 02:10:00 Tower2 kernel: ata8: hard resetting link
Jan 29 02:10:10 Tower2 kernel: ata8: softreset failed (1st FIS failed)
Jan 29 02:10:10 Tower2 kernel: ata8: hard resetting link

[8:0:0:0]    disk    ATA      ST10000NM0086-2A SN05  /dev/sdd   /dev/sg3 
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/8:0:0:0  [/sys/devices/pci0000:00/0000:00:1c.2/0000:04:00.0/ata8/host8/target8:0:0/8:0:0:0]
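
A quick way to see which ATA ports logged link resets is to count them per port in the syslog. The following is a rough sketch, assuming the log lines have the same format as the excerpt above; the function name `count_link_resets` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print
# "<port> <count>" for every ATA port that logged a hard link reset.
count_link_resets() {
    grep -o 'ata[0-9]*: hard resetting link' |
        cut -d: -f1 |          # keep only the port name, e.g. "ata8"
        sort | uniq -c |       # count occurrences per port
        awk '{ print $2, $1 }' # reorder to "port count"
}
```

Fed the excerpt above on stdin, this prints `ata8 2`, since the excerpt contains two hard resets on ata8. The port number can then be matched against the device listing, where host 8 corresponds to /dev/sdd.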

Not sure another rebuild would work since it was over a week ago, unless nothing had been written to the array in all that time.

Do you still have original disk2? You could check filesystem on it as an Unassigned Device to see if it has similar results.

Do you have backups of anything important and irreplaceable?

February 10, 2023

Author

Hi @trurl, and thank you for your reply! Ah, I missed that part about disk 3. Darn, this seems to be worse than I thought then. I have a backup of the most important data, and very little data has been written to the array in the past 11 days. I did move some files around on it last night though. Unfortunately I don't have the original disk 2 any more 😕

Can I start the array in read-only mode so I can run another backup and inspect the files that are still there?

February 10, 2023

Community Expert

3 minutes ago, equinox said:

very little data has been written to the array in the past 11 days. I did move some files around on it last night though

Moves are writes which include parity updates. If moves were between disks then both source and destination disk were written including parity updates.
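
To illustrate why any write, including a move, also touches parity, here is a toy sketch. It assumes single parity computed as a bitwise XOR across the data disks at each position, which is the usual description of Unraid's first parity disk. The single byte values and the function names `xor_all` and `update_parity` are made up for illustration.

```shell
#!/bin/sh
# Toy model: each "disk" is a single byte value (0-255).

# XOR all integer arguments together and print the result.
xor_all() {
    acc=0
    for v in "$@"; do
        acc=$(( acc ^ $v ))
    done
    echo "$acc"
}

# Parity after one data disk changes from old to new:
# new_parity = old_parity XOR old_data XOR new_data
update_parity() {
    xor_all "$1" "$2" "$3"
}

d1=90; d2=60; d3=240
parity=$(xor_all "$d1" "$d2" "$d3")

# A missing disk is emulated from parity plus all other data disks.
emulated_d2=$(xor_all "$parity" "$d1" "$d3")
echo "parity=$parity emulated_d2=$emulated_d2"

# Writing new data to disk 3 forces a parity update as well.
new_d3=15
parity=$(update_parity "$parity" "$d3" "$new_d3")
echo "parity after write=$parity"
```

In this model the emulated value only matches the original when every other disk is read back correctly, which is why read errors on another disk during a rebuild matter.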

I'm not sure there is a way to start the array read-only. Maintenance mode doesn't mount any disks, so they can't be read either.

Disable Docker and VM Manager in Settings, and don't let anything else write to the server. Copy anything you want to backup somewhere off the server.

February 10, 2023

Author

OK, thanks, I will do that. When the backup is complete (won't be until tomorrow at the earliest -- lots of data), is it worthwhile to try to run xfs_repair -v /dev/md2 to repair disk 2, or do you think the file system is beyond repair, and as such I'm better off wiping the entire array, and starting over?

And yes, I also plan to replace the SATA controller with something from the recommended list.

Edited February 10, 2023 by equinox

February 10, 2023

Community Expert
Solution

6 minutes ago, equinox said:

better off wiping the entire array, and starting over?

I would never suggest wiping the entire array since only one disk is corrupt. Each data disk in the array is an independent filesystem that can be read all by itself on any Linux system. Disk2 corruption has nothing to do with anything on the other disks.
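
As a rough sketch of what reading one data disk by itself could look like on another Linux machine, the helper below only prints a read-only mount command rather than running it. The device `/dev/sdX1` and mount point `/mnt/disk2_ro` are placeholders, the function name `print_ro_mount` is hypothetical, and it assumes the XFS filesystem sits on the first partition. `ro` and `norecovery` are standard XFS mount options; `norecovery` skips log replay so that nothing is written to the disk.

```shell
#!/bin/sh
# Hypothetical helper: print a read-only XFS mount command for a
# given partition and mount point. Nothing is mounted by this function.
print_ro_mount() {
    dev="$1"
    dir="$2"
    # ro: mount read-only; norecovery: do not replay the XFS log.
    echo "mount -t xfs -o ro,norecovery ${dev} ${dir}"
}
```

For example, `print_ro_mount /dev/sdX1 /mnt/disk2_ro` prints the command to review before running it by hand. A badly corrupted filesystem may still refuse to mount this way.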

8 minutes ago, equinox said:

When the backup is complete (won't be until tomorrow at the earliest -- lots of data)

Are you backing up only disk2? That seems to be the only affected disk. Of course, if you don't have backups of anything important and irreplaceable, you should take care of that whatever disk they are on.

11 minutes ago, equinox said:

try to run xfs_repair -v /dev/md2 to repair disk 2

After you get the backups done, certainly worth trying to repair disk2 just to see what happens. Then you can consider what to do with the results.

February 10, 2023

Author

Ah, I was not sure how independent each array drive was. Good to know 👍 I'll still make a backup of as much as possible, to be on the safe side 😊

Thanks for your help!

February 10, 2023

Community Expert

4 minutes ago, equinox said:

was not sure how independent each array drive was

This is how Unraid allows different sized disks in the array (no striping).

This is also why the Unraid array is slower than RAID (no striping).

Solved by trurl
