Excessively Slow Parity Check


ssb201

Recommended Posts

A few days ago I had what appeared to be a system lockup, with the host unresponsive to ping. The system came back up and seemed to be working fine. The parity check triggered by the unclean shutdown seemed to be proceeding apace, but it had not completed when I checked on it the next day. The web console has become somewhat unresponsive, as have all Docker containers and shares.

 

The parity check started at 124 MB/sec and is now running at 84 KB/sec.

 


 

I reviewed each of the drives using smartctl but see no errors. Nothing apparent in dmesg or in the system log. The SSH connection hangs if I run ps, after it gets partway through the process list.
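
For the record, this is roughly how I went through the drives (a rough sketch; the /dev/sd? glob assumes all of my array disks show up as plain SATA devices):

# Overall health verdict and the drive's own error log, per device
for d in /dev/sd?; do
  echo "=== $d ==="
  smartctl -H "$d"
  smartctl -l error "$d"
done

# Any attribute currently past its failure threshold (WHEN_FAILED is column 9)
for d in /dev/sd?; do
  smartctl -A "$d" | awk -v dev="$d" '$9 ~ /FAILING_NOW/ {print dev ": " $0}'
done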

 

I kicked off diagnostics collection from the command line, but it had not completed after 5 minutes. If it eventually completes I will update this post.

 

root@Tower:/dev# /root/mdcmd status | egrep "mdResync|mdState|sbSync"
sbSynced=1513116254
sbSyncErrs=0
sbSynced2=0
sbSyncExit=0
mdState=STARTED
mdResyncAction=check P
mdResyncSize=7814026532
mdResyncCorr=1
mdResync=7814026532
mdResyncPos=7595181832
mdResyncDt=115
mdResyncDb=4608
root@Tower:/dev# diagnostics
Starting diagnostics collection...
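
Back-of-the-envelope math on those counters, assuming mdResync/mdResyncPos are in 1 KiB blocks and mdResyncDb/mdResyncDt describe the last sampling window (my reading of the output, not anything documented):

# 4608 KiB over 115 s is roughly 40 KB/sec, with about 209 GiB still to go
awk 'BEGIN { db=4608; dt=115; size=7814026532; pos=7595181832;
  speed=db/dt; left=size-pos;
  printf "%.1f KB/s, %d KiB left, ~%.0f days remaining\n", speed, left, left/speed/86400 }'

So at the current rate the remaining couple of hundred GiB is still many weeks away.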


 

tower-diagnostics-20171213-1626.zip

Edited by ssb201
Update with diagnostics
Link to comment

Don't know why this is (assuming you're running AD), but every minute on the minute, the system is changing the permissions on your downloads share.

 

It appears that you started the check @ 2:04pm on the 12th.

It got through the first 4 TB by 12:39 am.

 

At that point, CA Auto Turbo Mode kicked in and started to enable/disable turbo mode depending upon whether your drives were spun up.

 

At 8:38 am on the 13th the parity drive decided to drop offline, but it recovered and came back with no problems right away.

 

The problem with this is that the permissions being changed every minute *may* cause a drive to spin up so that unRaid can find the /mnt/user/Downloads share.  I'm surmising this because auto turbo keeps enabling/disabling turbo mode, which means that a drive is spinning up/down.

 

It's possible the spin-ups are what's giving you the problems, as very often with controllers all transfers to/from the other attached drives will pause until the drive being spun up is up to speed (which takes a couple of seconds).  Turbo mode being enabled/disabled shouldn't account for any significant slowdown itself.
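
If you want to confirm the spin up/down behaviour, something like this will show it (just a sketch; sdb and sdc are placeholders for whatever your array devices actually are):

# Poll drive power state every 30 seconds; "standby" = spun down, "active/idle" = spun up
while true; do
  date
  hdparm -C /dev/sdb /dev/sdc
  sleep 30
done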

 

 

 

Link to comment

Also, while not directly related to your problem, you do have this:

184 End-to-End_Error        0x0032   097   097   099    Old_age   Always   FAILING_NOW 3

on ST4000VN000-1H4168_Z300T4WM-20171213-1626 - This is a failure of the electronics (cache) on the drive itself

 

If you don't already have it, enable notifications in Settings.  Drives with attributes in Failing State should never be ignored.

Link to comment
39 minutes ago, Squid said:

Don't know why this is (assuming you're running AD), but every minute on the minute, the system is changing the permissions on your downloads share.

It appears that you started the check @ 2:04pm on the 12th.

It got through the first 4 TB by 12:39 am.

At that point, CA Auto Turbo Mode kicked in and started to enable/disable turbo mode depending upon whether your drives were spun up.

At 8:38 am on the 13th the parity drive decided to drop offline, but it recovered and came back with no problems right away.

The problem with this is that the permissions being changed every minute *may* cause a drive to spin up so that unRaid can find the /mnt/user/Downloads share.  I'm surmising this because auto turbo keeps enabling/disabling turbo mode, which means that a drive is spinning up/down.

It's possible the spin-ups are what's giving you the problems, as very often with controllers all transfers to/from the other attached drives will pause until the drive being spun up is up to speed (which takes a couple of seconds).  Turbo mode being enabled/disabled shouldn't account for any significant slowdown itself.

 

 

Yes, I am running AD. I am not sure why the permissions are changing. I have not explicitly set anything to do that.
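
To try to catch whatever is doing it, I may watch the share for attribute changes (a sketch, assuming inotify-tools is available, which it may not be on a stock unRAID install):

# Log a timestamped line every time permissions/ownership change on the share
inotifywait -m -e attrib --timefmt '%F %T' --format '%T %w %e' /mnt/user/Downloads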

Link to comment
21 hours ago, Squid said:

Also, while not directly related to your problem, you do have this:

 


184 End-to-End_Error        0x0032   097   097   099    Old_age   Always   FAILING_NOW 3

 

on ST4000VN000-1H4168_Z300T4WM-20171213-1626 - This is a failure of the electronics (cache) on the drive itself

 

If you don't already have it, enable notifications in Settings.  Drives with attributes in Failing State should never be ignored.

Yes, I noticed the SMART issue and will have to address it, but it does not explain the slow speed of the parity check.

 

It took me 10 minutes, but I was able to finally disable Turbo Mode. That did not seem to improve anything.

 

The system finally went from 97.2% to 97.3%! I cannot wait 30-50 more days for it to finish. I am probably going to try to reboot tonight unless someone can offer another thing to try.

 

I am scratching my head as to why it is running so slowly, without any other indicator in the logs to signify a problem.

 

Link to comment

I had difficulty shutting the system down. I finally managed to do so after force-killing Docker ('killall docker'). After rebooting and starting in safe mode, bringing up the array and starting a parity sync was still incredibly slow. <diagnostics attached>.

 

I then tried shutting down, removing the drive with the SMART error, and starting up again (safe mode). I initiated a parity rebuild with a new 8TB drive that had previously been precleared. The console is estimating 8130 days to complete. I still see no relevant hardware errors registering in dmesg or the system log.

 

A couple of things I did note:

  • hdparm -I /dev/sdg returns a Sense I/O error
  • The ownership on each of the disks under /mnt is 1069023732:1069023745, and the same with some, but not all, shares under /mnt/user. I never noticed this before and do not know if this is proper (see the check below).
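
The ownership check mentioned above was along these lines (just a sketch of how I looked, nothing unRAID-specific):

# Named and numeric owner/group plus mode for each array disk and user share
stat -c '%U:%G (%u:%g) %a %n' /mnt/disk* /mnt/user/*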

System operations like going from one web page to another in the web portal take 5-10 minutes each. Most operations on the command line seem to run just as snappily as ever.

tower-diagnostics-20171214-1949.zip

Link to comment

Looks like I am well and truly screwed. The parity drive is now showing errors on the array page. It looks like it was the problem, despite the absence of any SMART errors.

 

Not sure how I can dig myself out of this mess now. Do I let the rebuild continue and hope that there are only a limited number of errors from a failing parity drive? It is still saying 2000 days, so that will take forever.

Link to comment
39 minutes ago, johnnie.black said:

Grab and post new diags after the errors.

 

Attached new diagnostics, though I gave up on the prior status quo.

 

The parity rebuild was going nowhere. Just tons of errors on the parity drive. I ended up rebooting and leaving the array in a downed state.

 

In the meantime, I reformatted the new drive (the one that had been precleared and added to the array to replace the SMART-error drive); since the parity rebuild had barely begun, the file system was entirely corrupt.

 

I kicked off a btrfs restore from the drive I had replaced to the new drive to preserve a copy of as much of my data as possible. Since the parity drive seemed to actually be the problem, I am hoping that despite some btrfs corruption (caused by the less than graceful initial shutdown?), the vast majority of my data will be recovered.
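
For anyone curious, the restore was essentially this (a sketch; /dev/sdX1 and the mount point are placeholders for my actual old 4TB partition and wherever the new 8TB disk is mounted):

# Read-only salvage: pull whatever btrfs can still find on the old drive onto the new one
btrfs restore -v /dev/sdX1 /mnt/disks/new8tb/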

 

Between the three good data drives and the new one I restored to, I should at least have all or almost all of my data, without any dependency on the f--g parity drive. 

tower-diagnostics-20171215-0012.zip

Link to comment

Yeah, parity drive looks bad, no pending sectors but these are bad signs:
 

  1 Raw_Read_Error_Rate     0x000b   053   053   ---    Pre-fail  Always       -       922755730
 22 Helium_Level            0x0023   076   076   ---    Pre-fail  Always       -       76

 

Raw read error rate should be zero on a healthy drive, and it's leaking helium; I've never seen a helium drive where this value wasn't 100.

Link to comment

You can also see the difference from the previous diags:

 

Quote

  1 Raw_Read_Error_Rate     0x000b   080   080   016    Pre-fail  Always       -       715 -> early signs of trouble
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100

 

Link to comment
14 hours ago, johnnie.black said:

You can also see the difference form the previous diags:

 

 

Yeah. The drive went south quickly, even without a SMART notification warning. So I managed to complete the BTRFS restore. Ten or twenty files had warnings about loops and may not have restored properly; the rest hopefully are good. I will need to double-check their contents once I have things up and running. Thankfully most are just Blu-ray rips that I can redo.

 

Now that I have four drives with all my data, any pointers on recreating the array? Currently the new drive shows up as Disk 3 (drive contents emulated). I am guessing I will need to create a brand new array and slowly copy my data over drive by drive before importing the next drive.

Link to comment

I just discovered New Config. It looks like I should be able to just do New Config->Retain Data and Cache slots. Then apply.

 

Will that work when the array thinks that disk 3 is emulated, even if disk 3 does have all the data it should? Do I have to remove that drive and then add it in separately? I assume once the array is good I would then be able to add a new parity drive and build parity.

Edited by ssb201
Link to comment

 

 

6 hours ago, ssb201 said:

Now that I have four drives with all my data. Any pointers on recreating the array? Currently the new drive shows up as Disk 3 (drive contents emulated)

New config will lose any emulated disk, but didn't you un-assign parity? How do you have an emulated disk? You can also post current diags so I can see.

Link to comment
12 hours ago, johnnie.black said:

 

 

New config will lose any emulated disk, but didn't you un-assign parity? How do you have an emulated disk? You can also post current diags so I can see.

Attached. (Please note I have removed the failed parity drive and the other drive that was reporting a SMART end-to-end error, the one I did a BTRFS restore from.)

 

Steps to get here:

1) Parity check slowed to a crawl at 97.3%; VM running fine, Dockers unresponsive.

2) Web UI unresponsive; an attempt to reboot from SSH hangs.

3) Power cycle into safe mode. System seems fine, web UI responsive and no errors showing in the log. Array down.

4) Decide to replace the 4TB drive showing the End-to-End SMART error. Pull the drive, put the new 8TB drive in. Start the array and select rebuild from parity.

5) Rebuild slows to a crawl. Read errors start to appear on the parity drive.

6) Cancel rebuild, reboot, confirm the parity drive is bad. (Freak out a bit.)

7) Put the 4TB drive back in; unable to mount it due to FS corruption.

8) Clean BTRFS format of the 8TB replacement disk (since the rebuild had started writing to it, a superblock and some filesystem structure were there but no data); see the sketch after this list.

9) BTRFS restore of almost all files from the 4TB drive to the 8TB drive.

10) Remove the 4TB drive. As it stands now I have 3 original data drives and a new copy (not a clone) of the 4th data drive. No parity drive, but an array configuration expecting a parity rebuild of drive 3 (hence the emulation) and a missing parity drive.
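
Step 8 boiled down to something like this (the device name is a placeholder, not my actual one):

# Wipe the leftovers from the aborted rebuild and lay down a fresh btrfs filesystem
wipefs -a /dev/sdY1
mkfs.btrfs -f /dev/sdY1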

 

 

tower-diagnostics-20171216-1221.zip

Link to comment

OK, I think I understand now: disk3 is invalid (yellow), not emulated (red). The array is in an unusual state, but as long as the data on disk3 looks correct if you start the array as is, without parity, you can do a new config, retain all, and start the array. You can later (or now) add a new parity drive, but keep the old disk3 intact in case anything goes wrong.

Link to comment
2 minutes ago, johnnie.black said:

OK, I think I understand now: disk3 is invalid (yellow), not emulated (red). The array is in an unusual state, but as long as the data on disk3 looks correct if you start the array as is, without parity, you can do a new config, retain all, and start the array. You can later (or now) add a new parity drive, but keep the old disk3 intact in case anything goes wrong.

Correct. Apologies for any confusion,  the tooltip for the yellow says emulated:

[screenshot: disk status tooltip showing "emulated"]

Link to comment
1 hour ago, johnnie.black said:

You're right, I meant the disk status is invalid, not disabled, but without parity the disk can't be emulated, so it's kind of a bug; but like I said, the array is in an unusual state.

 

The array was definitely in an unusual state.

 

New config worked well. The Plex docker was missing and I had to rejoin AD, but everything else looked fine. A few movies are corrupt and will need to be re-ripped, but nothing major. I have a new parity drive building now. So glad I pulled the trigger on an extra drive for Black Friday. Unfortunately, the drive that failed, while under warranty, is flagged as needing to be returned to the system vendor rather than Hitachi, and I have no idea who that is.

 

Thanks everyone for the assistance.

Link to comment
  • 3 years later...

I just want to add my experience, as I have been struggling for a week to get my unRAID system up and running. My setup is a mini tower with an 8-bay backplane; everything I purchased was new. My 16TB Seagate IronWolf was acting as parity, with the other disks as data storage.

 

Everything seemed to be working fine; I could not see any particular entries in the log files. However, my system was behaving erratically. As soon as I tried to build parity, the throughput decreased to 30-40 kbps, which would have meant 2-3 years to complete. Other trouble also occurred, like drives not showing up in the BIOS, not being able to connect anymore, formatting issues, dockers unable to download, etc. I was scratching my head and started to take out components of the system one by one. Since the disk was new and causing issues, I was focusing too much on it, as well as on SATA addressing topics.

 

It took me a week to identify the root cause, which also left me puzzled. After I disconnected my backplane and hooked up the disks directly with SATA cables, everything seemed to be working fine. The backplane is connected with a 4x SATA to SFF-8087 cable, and its power comes from a SATA power cable going to a Molex splitter feeding the backplane's two Molex power connectors.

 

Changing the SFF-8087 cable also didn't solve the issue, so I checked the Molex power supply and noticed that the two red wires from the Y-splitter had come off. After fixing this, the system was up and running as it should be.

 

So the 5V power supply to the backplane was dead, but since the disks still spun up I never expected this to be the trouble.

 

I'm now a happy unraid user and wanted to share this issue, in case somebody else experiences issues like that. 

Link to comment
