XFS corruption, recoverable?

DanW · January 23, 2023

Hey everyone, I started getting I/O errors on one of my drives.
Noticed a load of my files suddenly disappear in my shares and went straight to the system log to see what was going on.

I've ran the check on all 13 drives in maintenance mode and it's just the one playing up (disk 7) from what I can see.
Any recommendations? Just run the check without -nv and see if it recovers the drive?
I have two parity drives and I have a new drive spare that I could drop in to replace it.

Some advice from someone who has experience in this area would be greatly appreciated, thank you

check-nv.txt dansunraidnas-diagnostics-20230123-2257.zip

Edited January 23, 2023 by DanW

DanW · January 24, 2023

I think I'm going to attempt to recover the file system then replace the drive later today.

DanW · January 24, 2023

I got this when running the xfs_repair -v command.

image.png.d8553a28c2fa08c9085ddd56c5ecabfb.png

I didn't see anything about this in the instructions so i have no idea what to do next.
Im going to just remove the drive and drop the new one in.

JorgeB · January 24, 2023

Use -L

DanW · January 24, 2023

I didn't attempt the recovery, I put a new drive in to replace this drive.
Shortly after starting recovery, disk 8 reported an I/O error too and has been disabled.
These drives are old and have a lot of uptime but seems to be a strange coincidence that they would both die together so soon.

To rule out heat issues I've pointed fans at my SAS devices.
I've also ordered some higher quality SAS cables.
Going to be keeping an eye on the SAS controller & HBA, it has been fine for months and my drives are old, so could just be a coincidence.

I'm currently using the following SAS devices:
IBM SAS HBA M1015 IT Mode 6Gbps PCI-e 2.0 x8 LSI 9220-8i

image.jpeg.21cb42b42355cd3e7dd5b69a72382228.jpeg
Intel 24 port 6 Gb/s SATA SAS RAID Expander Card PBA E91267-203 RES2SV240

image.jpeg.946c406c69cec4987a5d275b245bca7d.jpeg

Edited January 24, 2023 by DanW
Added more info

JorgeB · January 25, 2023

Replacing the disk won't help with the filesystem problem.

DanW · January 25, 2023

8 hours ago, JorgeB said:

Replacing the disk won't help with the filesystem problem.

Really? I've replaced the disk (disk 7) and the disk has been rebuilt without any issues.

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

I've got to replace disk 8 now as it failed during the rebuild of disk 7, lucky I had two parity drives.

Edited January 25, 2023 by DanW

Ronan C · January 25, 2023

6 minutes ago, DanW said:

Really? I've replaced the disk (disk 7) and the disk has been rebuilt without any issues.

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

I've got to replace disk 8 now as it failed during the rebuild of disk 7, lucky I had two parity drives.

Hello!

Did you replaced the drive 8? whats happen after?

Thanks

DanW · January 25, 2023

1 minute ago, Ronan C said:

Hello!

Did you replaced the drive 8? whats happen after?

Thanks

I haven't replaced disk 8 yet (I replaced disk 7 first as it was the initial problem disk with filesystem corruption), I'm going to change some SAS cables and insert a new drive to replace disk 8 then start the rebuild soon. I will provide updates.

JorgeB · January 25, 2023

1 hour ago, DanW said:

I've replaced the disk (disk 7) and the disk has been rebuilt without any issues.

This suggests parity wasn't 100% in sync, or else it would be rebuilt exactly as it was, including the filesytem issues, parity is just bits.

DanW · January 25, 2023

3 minutes ago, JorgeB said:

This suggests parity wasn't 100% in sync, or else it would be rebuilt exactly as it was, including the filesytem issues, parity is just bits.

Interesting, is that an issue?

JorgeB · January 25, 2023

Since the disk was rebuilt according to current parity a parity check now should not find any errors, but the data is OK so I wouldn't worry much about if for now.

DanW · January 25, 2023

1 minute ago, JorgeB said:

Since the disk was rebuilt according to current parity a parity check now should not find any errors, but the data is OK so I wouldn't worry much about if for now.

Thank you for your help 🙂

I really appreciate your knowledge and suggestions.
I'm going to go ahead and replace disk 8 now and rebuild it hopefully without any more issues 🤞

DanW · January 25, 2023

3 hours ago, Ronan C said:

Hello!

Did you replaced the drive 8? whats happen after?

Thanks

Disk 8 is rebuilding, no other issues so far.

Edited January 25, 2023 by DanW

trurl · January 25, 2023

Can't really see if the data is OK or not, no new diagnostics have been posted and all screenshots are clipped on the right so can't tell if the disk is unmountable.

DanW · January 25, 2023

1 hour ago, trurl said:

Can't really see if the data is OK or not, no new diagnostics have been posted and all screenshots are clipped on the right so can't tell if the disk is unmountable.

Apologies, please see attached.
Rebuild of disk 8, the second disk to fail, is still underway.

The array is live, the data that originally disappeared (when the first error with disk 7 occurred) is back, which is really positive. Disk 8 was emulated immediately when it failed so I didn't notice any data loss the second time.

dansunraidnas-diagnostics-20230125-2349.zip

Edited January 26, 2023 by DanW

DanW · January 26, 2023

Not long now.
image.png.8be1f1b77d6efbdc15c449487347e179.png

Ronan C · January 26, 2023

Nice news! glad you get there!

DanW · January 27, 2023

So everything is back to normal 🎉

Thank you everyone for the help and support.

dansunraidnas-diagnostics-20230127-1851.zip

trurl · January 27, 2023

Since these diagnostics are without the array started, can't tell anything about filesystems or shares.

Start the array and post new diagnostics.

DanW · January 27, 2023

1 hour ago, trurl said:

Since these diagnostics are without the array started, can't tell anything about filesystems or shares.

Start the array and post new diagnostics.

Oops, here's the correct diagnostics 🙂

dansunraidnas-diagnostics-20230127-2050.zip

trurl · January 27, 2023

Not related to your original problems, but your appdata, domains, system shares have files on the array. In fact, domains and system shares are set to be moved to the array.

Ideally, these shares would be all on fast pool (cache) so Docker/VM performance isn't impacted by slower parity, and so array disks can spin down since these files are always open.

You have some unassigned SSDs mounted. How are you using these? Might be better as additional pools instead of unassigned.

DanW · January 27, 2023

6 minutes ago, trurl said:

Not related to your original problems, but your appdata, domains, system shares have files on the array. In fact, domains and system shares are set to be moved to the array.

Ideally, these shares would be all on fast pool (cache) so Docker/VM performance isn't impacted by slower parity, and so array disks can spin down since these files are always open.

You have some unassigned SSDs mounted. How are you using these? Might be better as additional pools instead of unassigned.

Really good suggestions, thank you!
My appdata is set to use cache only, not sure why I have some "bytes" in the array?

Domains is just a backup of the VM vdisks I have running on the unassigned NVME drives so I dont have it in cache "only".
Unfortunately I am making use of one the unassigned SSDs right now and have plans for the other unassigned SSD.
I don't know why I had not set system to cache only, I have done this now but probably need to move the files back to cache.

DanW · January 27, 2023

I've noticed something weird.
image.png.3c7bb0b4c1f7293cf3f45015b8081d9c.png
Seems to be stuck like this.

Edited January 27, 2023 by DanW
Ignore, was due to the changes I made to the shares I think.

DanW · January 27, 2023

2 hours ago, trurl said:

Not related to your original problems, but your appdata, domains, system shares have files on the array. In fact, domains and system shares are set to be moved to the array.

Ideally, these shares would be all on fast pool (cache) so Docker/VM performance isn't impacted by slower parity, and so array disks can spin down since these files are always open.

You have some unassigned SSDs mounted. How are you using these? Might be better as additional pools instead of unassigned.

Fixed 👍 thank you again

XFS corruption, recoverable?

Recommended Posts

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Join the conversation