XFS corruption, recoverable?


DanW

Hey everyone, I started getting I/O errors on one of my drives.
I noticed a load of my files had suddenly disappeared from my shares and went straight to the system log to see what was going on.

 

[screenshot of the system log]

 

I've run the check on all 13 drives in maintenance mode and it's just the one playing up (disk 7) from what I can see.
Any recommendations? Should I just run the check again without -nv and see if it repairs the drive?
I have two parity drives and a spare drive that I could drop in to replace it.
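
For reference, this is roughly how the two options look from the console, assuming disk 7 maps to /dev/md7 on this Unraid version (run in maintenance mode, against the md device, so parity stays in sync):

# Read-only check, the same as the -nv run from the GUI:
xfs_repair -nv /dev/md7

# Actual repair: drop -n so xfs_repair is allowed to modify the filesystem:
xfs_repair -v /dev/md7

# Only if xfs_repair complains about a dirty log and mounting the disk doesn't
# clear it; -L zeroes the log and can lose the most recently written metadata:
xfs_repair -vL /dev/md7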


Some advice from someone who has experience in this area would be greatly appreciated, thank you :)

Attachments: check-nv.txt, dansunraidnas-diagnostics-20230123-2257.zip

  • Solution

I didn't attempt the recovery; I put a new drive in to replace this one.
Shortly after starting the recovery, disk 8 reported an I/O error too and has been disabled.
These drives are old and have a lot of uptime, but it seems a strange coincidence that they would both die together so soon.

 

To rule out heat issues I've pointed fans at my SAS devices, and I've also ordered some higher-quality SAS cables.
I'm going to keep an eye on the SAS controller and HBA; they've been fine for months and my drives are old, so it could just be a coincidence.
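
In the meantime I'm sanity-checking the suspect drives themselves with something like the following (sdX is a placeholder for the real device name):

# SMART health, attributes and error log for a suspect drive:
smartctl -a /dev/sdX

# Watch the kernel log for fresh I/O or link errors:
dmesg | grep -i error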

 

I'm currently using the following SAS devices:
IBM SAS HBA M1015 IT Mode 6Gbps PCI-e 2.0 x8 LSI 9220-8i

Intel 24 port 6 Gb/s SATA SAS RAID Expander Card PBA E91267-203 RES2SV240

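
For anyone with similar hardware, this is roughly how I confirm the HBA and the drives behind the expander are being detected (exact driver name and output will vary with kernel version):

# The 9220-8i (SAS2008 chip) should show up as an LSI/Broadcom SAS controller:
lspci | grep -i lsi

# The card is handled by the mpt2sas/mpt3sas driver depending on kernel version:
dmesg | grep -iE "mpt[23]sas"

# List the disks the kernel currently sees:
ls -l /dev/disk/by-id/ | grep -v part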

8 hours ago, JorgeB said:

Replacing the disk won't help with the filesystem problem.

Really? I've replaced the disk (disk 7) and it has been rebuilt without any issues.

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

 

[screenshot attached]

 

I've got to replace disk 8 now as it failed during the rebuild of disk 7; luckily I had two parity drives.

6 minutes ago, DanW said:

Really? I've replaced the disk (disk 7) and it has been rebuilt without any issues. [...]

Hello!

Did you replace drive 8? What happened after?

 

Thanks

1 minute ago, Ronan C said:

Did you replace drive 8? What happened after?

I haven't replaced disk 8 yet (I replaced disk 7 first as it was the initial problem disk with the filesystem corruption). I'm going to change some SAS cables, put in a new drive to replace disk 8, and then start the rebuild. I'll post updates.

1 minute ago, JorgeB said:

Since the disk was rebuilt according to current parity, a parity check now should not find any errors, but the data is OK so I wouldn't worry much about it for now.

Thank you for your help 🙂

I really appreciate your knowledge and suggestions.
I'm going to go ahead and replace disk 8 now and rebuild it hopefully without any more issues 🤞

1 hour ago, trurl said:

Can't really see if the data is OK or not: no new diagnostics have been posted, and all the screenshots are clipped on the right so I can't tell if the disk is unmountable.

Apologies, please see attached.
Rebuild of disk 8, the second disk to fail, is still underway. 
[screenshot attached]

 

The array is live and the data that originally disappeared (when the first error with disk 7 occurred) is back, which is really positive. Disk 8 was emulated immediately when it failed, so I didn't notice any data loss the second time.

dansunraidnas-diagnostics-20230125-2349.zip


Not related to your original problems, but your appdata, domains, and system shares have files on the array. In fact, the domains and system shares are set to be moved to the array.

 

Ideally, these shares would all be on the fast pool (cache) so Docker/VM performance isn't impacted by slower parity writes, and so the array disks can spin down, since these files are always open.

 

You have some unassigned SSDs mounted. How are you using these? They might be better as additional pools instead of unassigned devices.

6 minutes ago, trurl said:

Not related to your original problems, but your appdata, domains, and system shares have files on the array. [...]

Really good suggestions, thank you!
My appdata is set to use cache only; I'm not sure why I have some bytes on the array?
[screenshot attached]

Domains is just a backup of the VM vdisks I have running on the unassigned NVMe drives, so I don't have it set to cache "only".
Unfortunately I am making use of one of the unassigned SSDs right now and have plans for the other one.
I don't know why I hadn't set system to cache only; I have done this now but will probably need to move the files back to cache.
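
From what I've read, the usual procedure is: stop the Docker and VM services (so the image files aren't held open), set the appdata/domains/system shares to "Prefer" cache, run the mover, and then switch them back to cache "Only" once nothing is left on the array. Something like this from the console, assuming a recent Unraid where the mover script takes a start argument:

# With Docker and VM services stopped and the shares set to "Prefer" cache,
# kick off the mover so files migrate from the array back to the cache pool:
mover start

Once the shares show nothing left on the array, they can be set back to cache "Only" so new writes never land on the parity-protected disks.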

2 hours ago, trurl said:

Not related to your original problems, but your appdata, domains, and system shares have files on the array. [...]

[screenshot attached]

Fixed 👍 thank you again :D
