Parity errors + 2 discs disabled



Hi all,

 

In the last week or so I've had two discs become disabled (thankfully I have dual parity). In the year or so before that I'd been getting regular parity errors, but as a covid healthcare worker I've been too busy to give them any energy. I put the parity errors down to PCIe SATA cards that had vibrated out of their sockets (initially this caused trillions of read errors). Then I read that SATA cards are not recommended, so I got an LSI 9200-8i card and replaced all the SATA cables, hoping this would fix things. But some errors continued. I recently moved the unRAID computer and then a disc became disabled, so I thought maybe the move had jiggled a cable loose. I woke up this morning to find a second disc disabled, so I shut the server down, took it apart and put it back together, reseating the RAM just in case.

 

I'm not sure what to do from here: how I can check whether the discs are still okay to use, and if so, how to re-connect them. And hopefully avoid parity errors in the future.

 

I guess I might just need to burn the server to the ground.

 

Help much appreciated. 

 

Link to comment
On 4/1/2021 at 1:14 PM, randomusername said:

read that SATA cards are not recommended

You must have misunderstood. RAID cards are not recommended, and neither are certain models of SATA controllers, but SATA controllers in general are fine.
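
If you want to confirm exactly which controller your drives are actually sitting on, something like the following should work from the unRAID terminal - this is just a generic Linux sketch, and the output will depend entirely on your hardware:

lspci | grep -iE 'sata|sas|raid'     # list the SATA/SAS/RAID controllers the kernel sees
ls -l /sys/block/sd*                 # shows the PCI path each sdX device hangs off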

 

4 hours ago, JorgeB said:

Please post the diagnostics: Tools -> Diagnostics

If you had included the diagnostics with your first post, you might not have needed to bump the thread and wait several days for a response.
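
For what it's worth, if the GUI is playing up, the same diagnostics zip can - as far as I recall - also be generated from the terminal; the file should end up in the logs folder on the flash drive:

diagnostics     # writes a dated diagnostics zip, normally under /boot/logs on the flash drive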

Link to comment
8 hours ago, JorgeB said:

Please post the diagnostics: Tools -> Diagnostics

Thanks for your help, diagnostics attached. Since the first post, the best advice I could find was to re-connect the drives one by one, so I did that. I am now running a parity check and one of the disks that disconnected (Disk 8) has 1710 errors. The parity check is also showing 849 sync errors.

xeus-diagnostics-20210405-1843.zip

Link to comment
  • 2 weeks later...
On 4/5/2021 at 7:03 PM, JorgeB said:

Problem with disk8 looks more like a power/connection issue; replace/swap both cables and try again.

Thanks - it took a while for the cables to arrive, but they have, and I've replaced the power and data cables for both the parity disk and disk8.

 

Before I turned off the system to wait for the new cables, the parity disk disconnected again and the log filled up to 100%. A few of my docker containers went a bit funny when that happened: Plex didn't update and Krusader showed my /user directory as empty. I downloaded the diagnostics file and shut the server down (diagnostics file labelled "diagnostics before new cables").

 

So today the new cables arrived. I replaced them, turned on the machine and reconnected the drive, so it is currently rebuilding the parity disk. I then got a warning that the log has already filled up to 63%.
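
As an aside, when the log warning appears you can see what is actually filling it from the terminal. On unRAID /var/log is a small RAM-backed filesystem, so a quick sketch (standard Linux commands, nothing unRAID-specific) would be:

df -h /var/log                  # how full the log filesystem is
du -sh /var/log/* | sort -h     # which log files are taking the space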

 

I stopped all the containers and got another diagnostics file (labelled 14 APR 2021). However, since the rebuild has started, I didn't think it would be a good idea to reboot the server (which is what the "Fix Common Problems" plugin suggests).

 

I'm really grateful for your help with this; I'm completely lost.

xeus diagnostics before new cables.zip xeus-diagnostics-20210414-2121.zip

Link to comment
On 4/5/2021 at 3:39 PM, trurl said:

You must have misunderstood. RAID cards are not recommended, and neither are certain models of SATA controllers, but SATA controllers in general are fine.

 

Those are the cards I was using, until multiple people told me they were not recommended. Maybe I misunderstood, but they seemed pretty certain about it. Either way, I replaced them with an LSI 9200-8i card, so hopefully that's okay and not part of the problem I'm having.

Link to comment
On 4/15/2021 at 8:01 AM, JorgeB said:

Check filesystem on disk8.

xfs_repair -nv showed many errors.

 

I then ran xfs_repair -v.

 

I ran xfs_repair -nv again and it did not appear to show any errors.

 

I then did a parity check, which showed 961 errors (not sure if this is to be expected). Diagnostics attached.

 

I think the correct thing to do would be another parity check and hope that there are zero errors - does that sound right or am I completely wrong?

xeus-diagnostics-20210416-1400.zip

Link to comment

Did you run the xfs_repair from the GUI or the command line? If from the command line, exactly what device name did you use?

 

Was the parity check you ran correcting or non-correcting? If correcting, then the next one should show 0 errors. If it was non-correcting, then you need to run a correcting check to get rid of the errors - and be aware that the correcting check will then show the same number of errors, as unRAID misleadingly reports each correction as if it were an error in the summary (but the syslog shows them being corrected).
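
One way to see the corrections actually being made (rather than just counted in the summary) is to search the syslog during or after the correcting check. A rough sketch - the exact wording of the messages can vary between unRAID versions, so treat the search pattern as an assumption:

grep -i correct /var/log/syslog | tail -n 20     # recent parity correction messages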

Link to comment
Just now, itimpi said:

Did you run the xfs_repair from the GUI or the command line? If from the command line, exactly what device name did you use?

 

Was the parity check you ran correcting or non-correcting? If correcting, then the next one should show 0 errors. If it was non-correcting, then you need to run a correcting check to get rid of the errors - and be aware that the correcting check will then show the same number of errors, as unRAID misleadingly reports each correction as if it were an error in the summary (but the syslog shows them being corrected).

The xfs_repair -nv was run from the GUI.

 

I then went into the terminal to run "xfs_repair -v /dev/sdl" - I wasn't sure if this was possible in the GUI.

 

The parity check was correcting.

 

Does that all sound correct?

Link to comment

Using /dev/sdl would be wrong - it should be /dev/sdl1 (the partition number is required for /dev/sdX type devices), and doing it that way does not update parity as corrections are made. I'm not sure what the effect of leaving off the '1' is - I would have thought it might fail. Using the sdX type device would certainly mean you should expect errors when doing the parity check. You could instead have used a device of the form '/dev/mdX', where X is the disk number, as that does not need the partition specified and maintains parity.
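
To make the two forms concrete - assuming the drive in question is /dev/sdl and is assigned as disk 8 in the array (adjust both names for your own system), and with the array started in maintenance mode so the disk is not mounted:

xfs_repair -nv /dev/sdl1     # check only (no modify), run against the partition, not the whole disk
xfs_repair -v /dev/md8       # repair via the md device for disk 8 - this keeps parity in sync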

 

You can run a check/repair from the GUI as described here in the online documentation accessible via the ‘Manual’ link at the bottom of the GUI.

 

If you ran a correcting check then you should expect the next one to report 0 errors.

Link to comment
3 minutes ago, itimpi said:

Using /dev/sdl would be wrong - it should be /dev/sdl1 (the partition number is required for /dev/sdX type devices), and doing it that way does not update parity as corrections are made. I'm not sure what the effect of leaving off the '1' is - I would have thought it might fail. Using the sdX type device would certainly mean you should expect errors when doing the parity check. You could instead have used a device of the form '/dev/mdX', where X is the disk number, as that does not need the partition specified and maintains parity.

Does this mean I should run it again using "xfs_repair -v /dev/sdl1"? Or not bother since it appears to have worked?

 

When I ran the first xfs_repair, my laptop went to sleep and on waking the terminal window did not refresh. Since I didn't know how long the xfs_repair would take, I left the server for a few hours before rebooting it, checking the errors with xfs_repair -nv and then starting the parity check.

 

So while I say it seems to have worked, this is based on the next xfs_repair -nv and not on any message of success in the terminal after running "xfs_repair -v /dev/sdl".

Link to comment
1 hour ago, randomusername said:

Does this mean I should run it again using "xfs_repair -v /dev/sdl1"? Or not bother since it appears to have worked?

 

When I ran the first xfs_repair, my laptop went to sleep and on waking the terminal window did not refresh. Since I didn't know how long the xfs_repair would take, I left the server for a few hours before rebooting it, checking the errors with xfs_repair -nv and then starting the parity check.

 

So while I say it seems to have worked, this is based on the next xfs_repair -nv and not on any message of success in the terminal after running "xfs_repair -v /dev/sdl".


I am afraid I have no idea whether what you did was OK when you omitted the partition number, or whether it might have damaged the file system. Whenever I made that mistake it failed because it could not find the superblock.

 

Unless there is serious corruption, an xfs_repair is very fast (seconds to minutes), so if your laptop had time to go to sleep this suggests something else was happening.

 

@JorgeB might have a suggestion on best action to take at this point.

Link to comment
24 minutes ago, itimpi said:

Unless there is serious corruption, an xfs_repair is very fast (seconds to minutes), so if your laptop had time to go to sleep this suggests something else was happening.

I set it to run and walked away from the laptop as I assumed it would take a while (like a parity check), though now that you mention it, I believe the first message on the screen was something about not being able to find a superblock.

Link to comment
15 minutes ago, JorgeB said:

I don't see how this could work; just run another check using the GUI to make sure all is well.

Well now I feel foolish.

 

The check run in the GUI with the -nv parameters gave the following result:

 

Phase 1 - find and verify superblock...
        - block cache size set to 1473264 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 182074 tail block 182074
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 5
        - agno = 2
        - agno = 4
        - agno = 3
        - agno = 6
        - agno = 7
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Fri Apr 16 16:46:26 2021

Phase		Start		End		Duration
Phase 1:	04/16 16:46:24	04/16 16:46:24
Phase 2:	04/16 16:46:24	04/16 16:46:25	1 second
Phase 3:	04/16 16:46:25	04/16 16:46:26	1 second
Phase 4:	04/16 16:46:26	04/16 16:46:26
Phase 5:	Skipped
Phase 6:	04/16 16:46:26	04/16 16:46:26
Phase 7:	04/16 16:46:26	04/16 16:46:26

Total run time: 2 seconds

 

Link to comment
4 minutes ago, itimpi said:

I would now rerun it removing the -n (no modify) flag to see if that has helped.

Running with -v gives the following output:

Phase 1 - find and verify superblock...
        - block cache size set to 1473264 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 182074 tail block 182074
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 5
        - agno = 7
        - agno = 1
        - agno = 6
        - agno = 4
        - agno = 2
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...

        XFS_REPAIR Summary    Fri Apr 16 16:57:46 2021

Phase		Start		End		Duration
Phase 1:	04/16 16:57:45	04/16 16:57:45
Phase 2:	04/16 16:57:45	04/16 16:57:45
Phase 3:	04/16 16:57:45	04/16 16:57:45
Phase 4:	04/16 16:57:45	04/16 16:57:45
Phase 5:	04/16 16:57:45	04/16 16:57:46	1 second
Phase 6:	04/16 16:57:46	04/16 16:57:46
Phase 7:	04/16 16:57:46	04/16 16:57:46

Total run time: 1 second
done

 

Link to comment

Okay, I can view files on the disk through the GUI, and in Krusader the /user directory now shows the folders that used to be there, along with a new "lost+found" folder whose contents I assume I need to go through and move back into their correct locations.

 

Would the correct thing now be a parity check? And assuming no parity errors, consider this solved?

Link to comment
27 minutes ago, randomusername said:

Okay, I can view files on the disk through the GUI, and in Krusader the /user directory now shows the folders that used to be there, along with a new "lost+found" folder whose contents I assume I need to go through and move back into their correct locations.

Yes. It is likely that you would not have ended up with as much in the lost+found folder if the correct device had been used for the first xfs_repair (not that that is much consolation at this point :( ).

 

Sorting out the lost+found folder can unfortunately be a lot of work when file names are lost. You can use the Linux 'file' command to at least get the file type of files with cryptic names.
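
A rough sketch of what going through lost+found with 'file' might look like, assuming the disk is mounted at /mnt/disk8 (adjust the path to suit your system):

file /mnt/disk8/lost+found/*                           # report the detected type of each recovered item
find /mnt/disk8/lost+found -type f -exec file {} \;    # or recurse through any recovered directories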

 

34 minutes ago, randomusername said:

Would the correct thing now be a parity check? And assuming no parity errors, consider this solved?

Yes.  I would not expect there to be any parity errors at this point.

 

Link to comment
23 hours ago, itimpi said:

Yes. It is likely that you would not have ended up with as much in the lost+found folder if the correct device had been used for the first xfs_repair (not that that is much consolation at this point :( ).

 

Sorting out the lost+found folder can unfortunately be a lot of work when file names are lost. You can use the Linux 'file' command to at least get the file type of files with cryptic names.

 

Yes.  I would not expect there to be any parity errors at this point.

 

Parity check shows 384 errors, diagnostics attached. Is there anything to suggest what the cause might be?

 

Thanks for the lost+found tip; I'll make sure to use the correct procedure for xfs_repair in the future.

xeus-diagnostics-20210417-1657.zip

Link to comment
