randomusername Posted April 1, 2021 (edited)

Hi all,

In the last week or so I've had two disks become disabled (thankfully I have dual parity). For the past year I'd also had regular parity errors, but being a covid healthcare worker things have been too busy to give it any energy. I put the parity errors down to PCIe SATA cards that had vibrated themselves out of their sockets (initially this caused trillions of read errors). Then I read that SATA cards are not recommended, so I got an LSI 9200-8i card and replaced all the SATA cables, hoping this would fix things. But some errors continued.

I recently moved the unRAID computer and afterwards a disk became disabled, so I thought maybe the move had jiggled a cable free or something. I woke up this morning and a second disk was disabled, so I shut it down, took the computer apart and put it back together, reseating the RAM just in case.

I'm not sure what to do from here: how I can tell if the disks are still okay to use, and if so how to re-connect them, and hopefully avoid parity errors in the future. I guess I might just need to burn the server to the ground. Help much appreciated.

Edited April 4, 2021 by randomusername
randomusername Posted April 2, 2021 (Author)

Bump - any ideas?
JorgeB Posted April 5, 2021

Please post the diagnostics: Tools -> Diagnostics
trurl Posted April 5, 2021

On 4/1/2021 at 1:14 PM, randomusername said: "read that SATA cards are not recommended"

You must have misunderstood. RAID cards are not recommended, and so are certain models of SATA controllers, but not SATA in general.

4 hours ago, JorgeB said: "Please post the diagnostics: Tools -> Diagnostics"

If you had done this with your first post, you might not have needed to bump and wait several days for a response.
randomusername Posted April 5, 2021 (Author)

8 hours ago, JorgeB said: "Please post the diagnostics: Tools -> Diagnostics"

Thanks for your help, diagnostics attached. Since the first post, the best advice I could find was to re-connect the drives one by one, so I did that. I am now running a parity check and one of the disks that disconnected (disk8) has 1710 errors. The parity check is also showing 849 sync errors.

xeus-diagnostics-20210405-1843.zip
JorgeB Posted April 5, 2021

The problem with disk8 looks more like a power/connection issue; replace/swap both cables and try again.
randomusername Posted April 14, 2021 (Author)

On 4/5/2021 at 7:03 PM, JorgeB said: "Problem with disk8 looks more like a power/connection issue, replace/swap both cables and try again."

Thanks, it took a while for the cables to arrive, but they have, and I've replaced the power and data cables for both the parity disk and disk8.

Before I turned off the system to await the new cables, the parity disk disconnected again and the log filled up to 100%. A few of my docker containers went a bit funny when that happened: Plex didn't update and Krusader showed my /user directory as empty. I downloaded the diagnostics file and shut the server down (labelled "diagnostics before new cables").

Today the new cables arrived, I replaced them, turned on the machine and reconnected the drive, so it is currently rebuilding the parity disk. I then got a warning that the log has already filled up to 63%. I stopped all the containers and got another diagnostics file (labelled 14 APR 2021), but since the rebuild has started I didn't think it would be a good idea to reboot the server (which is what the "Fix Common Problems" plugin suggests).

I'm really grateful for your help with this, I'm completely lost.

xeus diagnostics before new cables.zip
xeus-diagnostics-20210414-2121.zip
randomusername Posted April 14, 2021 (Author)

On 4/5/2021 at 3:39 PM, trurl said: "You must have misunderstood. RAID cards are not recommended. Also certain models of SATA controllers but not SATA in general."

This is what I was using, until multiple people told me that they were not recommended. Maybe I misunderstood, but they seemed pretty certain about it. Either way, I replaced them with an LSI 9200-8i card, so hopefully that's okay and not part of the problem I'm having.
trurl Posted April 15, 2021

Yes, Marvell is:

On 4/5/2021 at 10:39 AM, trurl said: "not recommended...certain models of SATA controllers"
JorgeB Posted April 15, 2021

10 hours ago, randomusername said: "I then got a warning that the log has filled up to 63% already."

Check filesystem on disk8.
randomusername Posted April 16, 2021 (Author)

On 4/15/2021 at 8:01 AM, JorgeB said: "Check filesystem on disk8."

xfs_repair -nv showed many errors, so I ran xfs_repair -v. I then ran xfs_repair -nv again, which did not appear to show any errors. I then did a parity check, which showed 961 errors (not sure if this is to be expected). Diagnostics attached.

I think the correct thing to do would be another parity check and hope that there are zero errors - does that sound right or am I completely wrong?

xeus-diagnostics-20210416-1400.zip
itimpi Posted April 16, 2021

Did you run the xfs_repair from the GUI or the command line? If the command line, exactly what device name did you use?

Was the parity check you ran correcting or non-correcting? If correcting, then the next one should show 0 errors. If it was non-correcting, then you need to run a correcting check to get rid of the errors - and be aware that the correcting check will then show the same number of errors, as unRaid misleadingly reports each correction as if it were an error in the summary (but the syslog shows them being corrected).
randomusername Posted April 16, 2021 (Author)

Just now, itimpi said: "Did you run the xfs_repair from the GUI or the command line? If the command line exactly what device name did you use? Was the parity check you ran correcting or Non-correcting?"

xfs_repair -nv was from the GUI. I then went into the terminal to run "xfs_repair -v /dev/sdl" - I wasn't sure if this was possible in the GUI. The parity check was correcting. Does that all sound correct?
itimpi Posted April 16, 2021

Using /dev/sdl would be wrong - it should be /dev/sdl1 (the partition number is required for /dev/sdX type devices), and doing it that way does not update parity as corrections are made. I'm not sure what the effect of leaving off the '1' is - I would have thought it might fail. Using the sdX type device would certainly mean you should expect errors when doing the parity check.

You could instead have used a device of the form '/dev/mdX', where X is the disk number, as that does not need the partition specified and maintains parity. You can run a check/repair from the GUI as described in the online documentation, accessible via the 'Manual' link at the bottom of the GUI.

If you ran a correcting check then you should expect the next one to report 0 errors.
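A minimal sketch of the device-name distinction described above (the sdl letter and disk number 8 are taken from this thread; on another system they will differ):

```shell
# Names below match this thread: disk8 showed up as /dev/sdl.
disk=sdl
disknum=8

# Whole-disk device: xfs_repair can't find a filesystem superblock here.
#   xfs_repair -v /dev/$disk
# Partition device: works, but corrections bypass parity
# (expect sync errors on the next parity check):
#   xfs_repair -v /dev/${disk}1
# unRAID md device: no partition suffix needed, and parity stays in sync:
#   xfs_repair -v /dev/md$disknum

# The partition and md device names the commands above would use:
part="/dev/${disk}1"
mddev="/dev/md${disknum}"
echo "$part"
echo "$mddev"
```

The repair commands themselves are commented out since they write to real devices; only the name-building lines are meant to be run as-is.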
randomusername Posted April 16, 2021 (Author)

3 minutes ago, itimpi said: "Using /dev/sdl would be wrong - it should be /dev/sdl1 (partition number required for /dev/sdX type devices), and doing it that way does not update parity as corrections are made. You could instead have used a device of the form '/dev/mdX' where X is the disk number as that does not need the partition specified and maintains parity."

Does this mean I should run it again using "xfs_repair -v /dev/sdl1"? Or not bother, since it appears to have worked?

When I ran the first xfs_repair, my laptop went to sleep and on waking the terminal window did not refresh. Since I didn't know how long the xfs_repair would take, I left the server for a few hours before rebooting it, checking the errors with xfs_repair -nv and then starting the parity check. So while I say it seems to have worked, this is based on the next xfs_repair -nv and not on any message of success in the terminal after running "xfs_repair -v /dev/sdl".
itimpi Posted April 16, 2021

1 hour ago, randomusername said: "Does this mean I should run it again using "xfs_repair -v /dev/sdl1"? Or not bother since it appears to have worked?"

I'm afraid I have no idea whether what you did was OK given that you omitted the partition number, or whether it damaged the file system. Whenever I made that mistake it failed because it could not find the superblock.

Unless there is serious corruption, an xfs_repair is very fast (seconds/minutes), so if your laptop went to sleep this suggests something else was happening. @JorgeB might have a suggestion on the best action to take at this point.
randomusername Posted April 16, 2021 (Author)

24 minutes ago, itimpi said: "Unless there is serious corruption a xfs_repair is very fast (seconds/minutes) so if your laptop went to sleep this suggests something else was happening."

I set it to run and walked away from the laptop as I assumed it would take a while (like a parity check), though now you mention it, I believe the first message on the screen was something about not being able to find a superblock.
JorgeB Posted April 16, 2021

2 hours ago, randomusername said: "xfs_repair -v /dev/sdl"

I don't see how this could work; just run another check using the GUI to make sure all is well.
randomusername Posted April 16, 2021 (Author)

15 minutes ago, JorgeB said: "I don't see how this could work, just run another check using the GUI to make sure all is well."

Well now I feel foolish. A check completed in the GUI using parameters -nv got the following result:

Phase 1 - find and verify superblock...
        - block cache size set to 1473264 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 182074 tail block 182074
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 5
        - agno = 2
        - agno = 4
        - agno = 3
        - agno = 6
        - agno = 7
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Fri Apr 16 16:46:26 2021

Phase           Start           End             Duration
Phase 1:        04/16 16:46:24  04/16 16:46:24
Phase 2:        04/16 16:46:24  04/16 16:46:25  1 second
Phase 3:        04/16 16:46:25  04/16 16:46:26  1 second
Phase 4:        04/16 16:46:26  04/16 16:46:26
Phase 5:        Skipped
Phase 6:        04/16 16:46:26  04/16 16:46:26
Phase 7:        04/16 16:46:26  04/16 16:46:26

Total run time: 2 seconds
itimpi Posted April 16, 2021

I would now rerun it removing the -n (no modify) flag to see if that has helped. You might also want to check whether a lost+found folder has been created on the drive from files whose names could not be resolved.
randomusername Posted April 16, 2021 (Author)

4 minutes ago, itimpi said: "I would now rerun it removing the -n (no modify) flag to see if that has helped"

Running with -v gives the following output:

Phase 1 - find and verify superblock...
        - block cache size set to 1473264 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 182074 tail block 182074
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 5
        - agno = 7
        - agno = 1
        - agno = 6
        - agno = 4
        - agno = 2
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...

        XFS_REPAIR Summary    Fri Apr 16 16:57:46 2021

Phase           Start           End             Duration
Phase 1:        04/16 16:57:45  04/16 16:57:45
Phase 2:        04/16 16:57:45  04/16 16:57:45
Phase 3:        04/16 16:57:45  04/16 16:57:45
Phase 4:        04/16 16:57:45  04/16 16:57:45
Phase 5:        04/16 16:57:45  04/16 16:57:46  1 second
Phase 6:        04/16 16:57:46  04/16 16:57:46
Phase 7:        04/16 16:57:46  04/16 16:57:46

Total run time: 1 second
done
itimpi Posted April 16, 2021

Now restart the array in normal mode to look at the drive contents.
randomusername Posted April 16, 2021 (Author)

Okay, I can view files on the disk through the GUI, and in Krusader the /user directory now shows the folders that used to be there, along with a new "lost+found" folder that I assume I need to go through, placing its contents in their correct folders.

Would the correct thing now be a parity check? And assuming no parity errors, consider this solved?
itimpi Posted April 16, 2021

27 minutes ago, randomusername said: "Okay, I can view files on the disk through the GUI, and in Krusader the /user directory now shows the folders that used to be there, along with a new "lost+found" folder that I assume I need to go through, placing its contents in their correct folders."

Yes. It is likely that you would not have ended up with as much in the lost+found folder if the correct device had been used for the first xfs_repair (not that that is much consolation at this point). Sorting out the lost+found folder can unfortunately be a lot of work when file names are lost. You can use the Linux 'file' command to at least get the file type of files with cryptic names.

34 minutes ago, randomusername said: "Would the correct thing now be a parity check? And assuming no parity errors, consider this solved?"

Yes. I would not expect there to be any parity errors at this point.
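A quick illustration of the 'file' tip (the lost+found path below is hypothetical; entries there typically have bare inode numbers as names):

```shell
# On the array this would be something like:
#   file /mnt/disk8/lost+found/*
# Demonstration on a scratch file with a meaningless numeric name -
# 'file' identifies the content type regardless of the name:
printf 'hello world\n' > /tmp/12345
file /tmp/12345   # prints something like: /tmp/12345: ASCII text
```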
randomusername Posted April 17, 2021 (Author)

23 hours ago, itimpi said: "It is likely that you would not have ended up with as much in the lost+found folder if the correct device had been used for the first xfs_repair ... You can use the Linux 'file' command to at least get the file type of files with cryptic names. ... I would not expect there to be any parity errors at this point."

The parity check shows 384 errors, diagnostics attached. Is there anything to suggest what the cause might be? Thanks for the lost+found tip, I'll make sure to use the correct procedure for xfs_repair in the future.

xeus-diagnostics-20210417-1657.zip