
[Unraid 6.9.2] Slow parity check and stability issues + poorly timed drive replacement


Solved by itimpi


Hi there,

 

 

Trying to get to the bottom of this + learn a little bit more about RAID5 and Unraid...

 

Everything was fine with this server until we started running out of space. Unfortunately I didn't have many safeguards in place to stop every drive from filling up, and it got to the point where Unraid was unstable and I had to manually delete some files over the network using Windows File Explorer.

 

The upgrade drive finally came in the mail about a week ago: a 16TB Toshiba to replace one of our 4TB WDs. I unassigned the old drive, pulled it out and replaced it with the 16TB drive. I booted, re-assigned the drive, then started the parity-sync.

 

A few weeks later, we started having some issues with our NVMe cache drive, so we replaced it with a 4TB NVMe. I decided to move from btrfs to XFS for our cache drives. The replacement procedure was: copy the data from the cache into a folder I created called "cache_backup", power down, replace the drive, start up, format and re-assign the new drive, then transfer the data from the array back to the cache. I restarted the dockers (which rely on the cache) and everything (mostly) worked fine.
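(For anyone doing the same cache swap by hand: the two copy steps can also be done from a console session. A minimal sketch, assuming the backup folder is the "cache_backup" share mentioned above and using /mnt/user0, the array-only view of user shares, so the copy can't land back on the cache; stop the dockers/VMs first so nothing is writing to the cache while it copies.)

rsync -avh --progress /mnt/cache/ /mnt/user0/cache_backup/    # copy everything off the old cache to the array
# ...power down, swap the NVMe, format the new drive as XFS...
rsync -avh --progress /mnt/user0/cache_backup/ /mnt/cache/    # copy it all back onto the new cache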

 

This server is stored and operated off-site from me. I instructed the person hosting it to power down the server, replace the NVMe and let me know when it's done. What I didn't notice was that a parity check was ongoing and it had found errors... I have no idea if 'write corrections to disk' was selected or not.

 

About a week ago, we started experiencing some issues again. The main issue was that parity checks were ridiculously slow (around 400 KB/s instead of the usual 150 MB/s) and a few of the backups we store on the server weren't loading properly. At this point I knew about the parity errors from the last parity check, but by now we had already cycled at least 8TB of data on and off the server. I assumed we had a dying drive, but after investigating SMART, they all seemed fine. I finally concluded that it was a hardware issue: the SATA3_2 and SATA3_5 ports on my motherboard had died. I verified this by plugging those two drives into my SAS card (LSI 9211-8i), and the problem was solved, with the parity check now performing at ~150 MB/s.

 

At least, for about the first 30 minutes.
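(Side note for anyone doing similar triage: the same checks can be done from a console session. A rough sketch, with device names as placeholders only; a flaky SATA port or cable typically shows up as UDMA CRC errors rather than a failing overall SMART health status.)

smartctl -a /dev/sdX | grep -iE 'reallocated|pending|uncorrect|crc'    # key SMART attributes for a dying drive or a bad link
hdparm -t /dev/sdX                                                     # rough sequential read speed of a single drive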

 

I started to notice that there was about 30 MB/s of writes happening on both my Parity drive and Disk 2 (the one that was upgraded, I believe), and proportionally higher reads on both disks.

 

[screenshot: disk read/write activity showing the writes described above]

 

If only there were some way to see what the parity check was doing...

 

[screenshot: the parity check status shown in the GUI]

 

Hmm... maybe iotop?

TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                                
24031 be/4 root       24.65 M/s   24.75 M/s  0.00 % 98.19 % shfs /mnt/user -disks 2047 -o noatime,allow_other -o remember=0
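(For reference, iotop is a bit easier to read with a few standard flags; nothing Unraid-specific here.)

iotop -oPa -d 5    # -o: only show processes doing I/O, -P: whole processes instead of threads, -a: accumulated totals, -d 5: refresh every 5 seconds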

 

So, what is shfs doing with all of the disks? Who knows....

 

Here's a little timeline of what happened over the last few weeks:

[image: timeline of events over the last few weeks]

 

 

 

 

tl;dr: A parity check is running and I'm seeing writes on two drives, but no sync errors are being reported. Concerned about what is going on; Unraid won't explicitly tell me anything about an ongoing parity check.

 

 

serverus-diagnostics-20231006-2358.zip

  • Solution

There is no way to tell what the parity check is doing in terms of any files that might be affected (parity does not understand the concept of files).  It just knows how far through the check it has got and it tells you that in the GUI.

 

Chances are that since disk2 is a new large drive all new files are going to it thus explaining the write behaviour.

 

Probably the best thing to do at this point is to run a check filesystem on all array drives to check there is no corruption at that level, and assuming it all looks good run a parity check with the option to correct errors set.
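For reference, the filesystem check can be run from the GUI (start the array in Maintenance mode, click each disk and run the check) or from a console session. A minimal sketch, assuming XFS-formatted data disks and the /dev/mdX device names used on 6.9.x (checking the mdX device rather than the raw disk keeps parity in sync if you later run an actual repair):

xfs_repair -n /dev/md1    # -n = check only, report problems but change nothing
xfs_repair -n /dev/md2    # repeat for each data disk in the array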

 

19 hours ago, itimpi said:

There is no way to tell what the parity check is doing in terms of any files that might be affected (parity does not understand the concept of files).  It just knows how far through the check it has got and it tells you that in the GUI.

 

Chances are that since disk2 is a new large drive all new files are going to it thus explaining the write behaviour.

 

Probably the best thing to do at this point is to run a check filesystem on all array drives to check there is no corruption at that level, and assuming it all looks good run a parity check with the option to correct errors set.

 

 

Is there a way to check if the current parity check task is set to 'write corrections to disk'? Also, do those corrections get written as the check moves along? Currently says 17000+ errors detected :(

 

Wondering if I should just cancel this parity check and investigate further using the check filesystem you mentioned above.

 

 

 

 

2 hours ago, kyle_d said:

 

Is there a way to check if the current parity check task is set to 'write corrections to disk'? Also, do those corrections get written as the check moves along? Currently says 17000+ errors detected :(

 

Wondering if I should just cancel this parity check and investigate further using the check filesystem you mentioned above.

 

 

 

 


Corrections get written as the check proceeds and the ‘error’ count increments as it happens.

 

The easiest way to know what type of check is running is to install my Parity Check Tuning plugin. Even if you do not use its features such as running checks in increments, you will find that it adds the following, which you may find useful:

  • Entries in the parity history (going forward) will have the type of check as additional information
  • You can then use the ‘parity.check status’ command from a console session, which will also show this information for a running check.
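For example, once the plugin is installed:

parity.check status    # shows the type (correcting or not) of the currently running check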


 

 


For anyone who stumbles across this in the future:

 

- I ran a non-correcting parity check and found 28,000+ errors.

- I then checked the filesystem on each drive; all of them passed.

- I ran a correcting parity check and it corrected all of the errors.

 

All data appears to be intact and everything seems to be running fine.

