DanW Posted January 23, 2023

Hey everyone, I've started getting I/O errors on one of my drives. A load of my files suddenly disappeared from my shares, so I went straight to the system log to see what was going on. I've run the check on all 13 drives in maintenance mode and, from what I can see, it's just the one playing up (disk 7). Any recommendations? Should I just run the check again without -nv and see if it repairs the filesystem? I have two parity drives, and I have a new spare drive that I could drop in as a replacement. Advice from someone with experience in this area would be greatly appreciated, thank you.

check-nv.txt
dansunraidnas-diagnostics-20230123-2257.zip
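For anyone finding this later, the usual maintenance-mode flow looks roughly like the sketch below. The device path is an assumption: disk 7 typically maps to `/dev/md7` (on recent Unraid releases it may be `/dev/md7p1`), and you should run this against the md device, not the raw disk, so parity stays in sync.

```shell
# Sketch of the check/repair flow, assuming disk 7 maps to /dev/md7.

# Dry run: -n means "no modify" (nothing is written), -v is just verbose.
xfs_repair -nv /dev/md7

# If the dry run reports problems, repeat without -n to actually repair:
xfs_repair -v /dev/md7

# If xfs_repair refuses to run because of a dirty log and mounting the disk
# to replay it isn't possible, -L zeroes the log as a last resort
# (this can lose the most recent metadata updates):
xfs_repair -vL /dev/md7
```

These commands operate on live array devices, so treat this as a reference of the flag semantics rather than something to paste blindly.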
DanW Posted January 24, 2023

I think I'm going to attempt to recover the filesystem, then replace the drive later today.
DanW Posted January 24, 2023

I got this when running the xfs_repair -v command. I didn't see anything about it in the instructions, so I have no idea what to do next. I'm going to just remove the drive and drop the new one in.
DanW Posted January 24, 2023 (Solution)

I didn't attempt the recovery; I put a new drive in to replace this one. Shortly after starting the rebuild, disk 8 reported an I/O error too and has been disabled. These drives are old with a lot of uptime, but it seems a strange coincidence that they would both die together so soon. To rule out heat issues I've pointed fans at my SAS devices, and I've also ordered some higher-quality SAS cables. I'm going to keep an eye on the SAS HBA and expander; they have been fine for months and my drives are old, so it could just be a coincidence. I'm currently using the following SAS devices:

- IBM M1015 SAS HBA in IT mode (LSI 9220-8i), 6 Gb/s, PCIe 2.0 x8
- Intel RES2SV240 24-port 6 Gb/s SAS/SATA expander (PBA E91267-203)
JorgeB Posted January 25, 2023

Replacing the disk won't help with the filesystem problem.
DanW Posted January 25, 2023

8 hours ago, JorgeB said: "Replacing the disk won't help with the filesystem problem."

Really? I've replaced the disk (disk 7) and it has been rebuilt without any issues:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

I've got to replace disk 8 now, as it failed during the rebuild of disk 7. Lucky I had two parity drives.
Ronan C Posted January 25, 2023

6 minutes ago, DanW said: "Really? I've replaced the disk (disk 7) and it has been rebuilt without any issues. [...]"

Hello! Did you replace drive 8? What happened after? Thanks
DanW Posted January 25, 2023

1 minute ago, Ronan C said: "Did you replace drive 8? What happened after?"

I haven't replaced disk 8 yet (I replaced disk 7 first, as it was the initial problem disk with the filesystem corruption). I'm going to change some SAS cables, insert a new drive to replace disk 8, and then start the rebuild soon. I will provide updates.
JorgeB Posted January 25, 2023

1 hour ago, DanW said: "I've replaced the disk (disk 7) and it has been rebuilt without any issues."

This suggests parity wasn't 100% in sync; otherwise the disk would have been rebuilt exactly as it was, including the filesystem issues. Parity is just bits.
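The "parity is just bits" point can be illustrated with a few bytes of shell arithmetic: single parity stores the XOR across all data disks at each position, so a rebuild reproduces whatever bits were on the failed disk, filesystem corruption included. The byte values below are made up for the demo.

```shell
# One byte at the same offset on three data disks (values are arbitrary):
d1=$((0x3C)); d2=$((0xA5)); d3=$((0x0F))

# What the parity disk stores for that offset: the XOR of all data disks.
parity=$(( d1 ^ d2 ^ d3 ))

# "Disk 2" fails; rebuild its byte from the survivors plus parity.
# XOR is its own inverse, so this recovers d2 exactly, bit for bit.
rebuilt=$(( d1 ^ d3 ^ parity ))

echo "rebuilt=$rebuilt expected=$d2"
```

Because the rebuild works purely on bits, it has no notion of files or filesystems, which is why a corrupt filesystem is normally rebuilt corrupt.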
DanW Posted January 25, 2023

3 minutes ago, JorgeB said: "This suggests parity wasn't 100% in sync..."

Interesting, is that an issue?
JorgeB Posted January 25, 2023

Since the disk was rebuilt according to current parity, a parity check now should not find any errors. The data is OK, so I wouldn't worry much about it for now.
DanW Posted January 25, 2023

1 minute ago, JorgeB said: "Since the disk was rebuilt according to current parity, a parity check now should not find any errors..."

Thank you for your help 🙂 I really appreciate your knowledge and suggestions. I'm going to go ahead and replace disk 8 now and rebuild it, hopefully without any more issues 🤞
DanW Posted January 25, 2023

3 hours ago, Ronan C said: "Did you replace drive 8? What happened after?"

Disk 8 is rebuilding, no other issues so far.
trurl Posted January 25, 2023

Can't really tell if the data is OK or not: no new diagnostics have been posted, and all the screenshots are clipped on the right, so I can't tell whether the disk is unmountable.
DanW Posted January 25, 2023

1 hour ago, trurl said: "Can't really tell if the data is OK or not..."

Apologies, please see attached. The rebuild of disk 8, the second disk to fail, is still underway. The array is live, and the data that originally disappeared (when the first error on disk 7 occurred) is back, which is really positive. Disk 8 was emulated immediately when it failed, so I didn't notice any data loss the second time.

dansunraidnas-diagnostics-20230125-2349.zip
DanW Posted January 26, 2023

Not long now.
Ronan C Posted January 26, 2023

Nice news! Glad you got there!
DanW Posted January 27, 2023

So everything is back to normal 🎉 Thank you everyone for the help and support.

dansunraidnas-diagnostics-20230127-1851.zip
trurl Posted January 27, 2023

Since these diagnostics were taken without the array started, they can't tell us anything about filesystems or shares. Start the array and post new diagnostics.
DanW Posted January 27, 2023

1 hour ago, trurl said: "Start the array and post new diagnostics."

Oops, here are the correct diagnostics 🙂

dansunraidnas-diagnostics-20230127-2050.zip
trurl Posted January 27, 2023

Not related to your original problems, but your appdata, domains, and system shares have files on the array. In fact, the domains and system shares are set to be moved to the array. Ideally, these shares would live entirely on the fast pool (cache), so Docker/VM performance isn't impacted by the slower parity array, and so the array disks can spin down (these files are always open, which keeps their disks spinning).

You also have some unassigned SSDs mounted. How are you using them? They might be better as additional pools instead of unassigned devices.
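For context, this advice maps to the per-share "Use cache pool" setting. The file locations and key name below are assumptions based on the standard Unraid 6.x layout; verify against Shares → [share] → Use cache pool in your own GUI.

```shell
# Assumed Unraid 6.x share config files under /boot/config/shares/.
# The GUI's "Use cache pool" choice is stored as shareUseCache, e.g.:
#
#   appdata.cfg : shareUseCache="prefer"
#   system.cfg  : shareUseCache="prefer"
#   domains.cfg : shareUseCache="prefer"
#
# "prefer": files live on cache, and the mover moves stray array copies
#           back to the cache pool.
# "only":   new files go to cache, but the mover will NOT move existing
#           array copies back.
#
# So to clean up shares that already have files on the array: set "prefer",
# stop the Docker/VM services, run the mover, then switch to "only" if desired.
```

This is a sketch of the usual cleanup sequence, not an official procedure; stopping Docker/VM services first matters because the mover skips open files.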
DanW Posted January 27, 2023

6 minutes ago, trurl said: "Ideally, these shares would live entirely on the fast pool (cache)..."

Really good suggestions, thank you! My appdata is set to use the cache only; I'm not sure why there are a few bytes on the array. Domains is just a backup of the VM vdisks I have running on the unassigned NVMe drives, so I don't have it set to cache "only". Unfortunately I am making use of one of the unassigned SSDs right now and have plans for the other. I don't know why I hadn't set system to cache only; I've done that now, but I probably need to move the files back to cache.
DanW Posted January 27, 2023

I've noticed something weird. Seems to be stuck like this.

Edit: ignore this, I think it was due to the changes I made to the shares.
DanW Posted January 27, 2023

2 hours ago, trurl said: "Ideally, these shares would live entirely on the fast pool (cache)..."

Fixed 👍 thank you again