[Solved] Extremely Slow Parity Checks and High CPU usage



So I have been battling with this for a week now and am just about at my wits' end. I have had unRaid set up on my server for a few months, and everything was mostly fine until the most recent parity check on Aug 1st. When I awoke that morning, the parity check was only 5% complete and had run for over 8 hours. Normally, this check completes in about 20 to 24 hours with my 8TB drives. I have tried a lot of troubleshooting and have even gone so far as to roll everything back to a single 8TB storage drive and an 8TB parity drive, yet the issue persists. The issue was first observed while running the first parity check after adding an 8TB drive. I am not sure if I ran a parity check after upgrading to 6.5.3. The issue also persists with the VM manager and Docker services stopped, so zero disk activity outside of the parity check should be occurring.

 

System specifications:

HP Proliant ML350e G8 v2

2x Xeon E5-2440 v2 @ 1.90GHz (8 physical cores each, for a total of 16 cores / 32 threads)

96GB ECC Memory (12 x 8GB)

HP H220 HBA card in IT mode (currently at firmware 15, but may be able to upgrade it to 20)

 

A few things I have yet to try are:

Revert back to an older version of unRaid

Restart the server in no plugins mode

 

I have uploaded diagnostics below, though the log is a bit small as I canceled the parity check after only a few minutes.

prefect-diagnostics-20180808-1319.zip

Edited by SMLLR
Link to comment

There have been a couple of similar issues that were helped by changing the tunables to default or lower-than-default values. You're using very strange values:

 

Aug  8 12:46:53 Prefect kernel: mdcmd (31): set md_num_stripes 512
Aug  8 12:46:53 Prefect kernel: mdcmd (32): set md_sync_window 144
Aug  8 12:46:53 Prefect kernel: mdcmd (33): set md_sync_thresh 192

md_sync_thresh should be lower than md_sync_window; either change both back to default, or try, for example, 100 for md_sync_thresh.
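For reference, the suggested change can be applied on the fly with the same mdcmd tool that appears in the syslog lines above. This is just a sketch: the /usr/local/sbin path is the stock location on an unRaid install (an assumption here), and a value set this way does not survive a reboot, so the permanent change still belongs in Disk Settings.

```shell
# Lower md_sync_thresh below md_sync_window (144 in the log above), as suggested.
# mdcmd path is assumed to be the stock unRaid location; this does not persist
# across reboots - use Disk Settings for a permanent change.
/usr/local/sbin/mdcmd set md_sync_thresh 100
```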

Edited by johnnie.black
Link to comment

Those settings were still at their defaults when I first started experiencing this issue. I believe I then started playing around with the values and even put a config in place to boot up with those options. I changed them in Disk Settings before running the most recent parity check, as shown below:

[Screenshot: Disk Settings showing the modified tunable values]

 

I can kill that config and reboot the server to see if that helps at all.

Link to comment

I took the opportunity to reinstall the OS after backing up the existing configuration. Right now, my parity rebuild is running at around 120MB/s and is 25% complete (it was running upwards of 150MB/s at the beginning). The rebuild is running with all four disks in place (the three previously existing ones and the new one). I fully believe all hardware is working as expected, but I won't know whether a parity sync works as expected until the rebuild is done tomorrow. If it does, it may be worth digging into the configs to compare my old config with the near-stock config to see what may have caused the issue. I believe reinstalling the OS should return parity sync runtimes to normal, as parity sync was working without issue until about two weeks ago.

 

At this rate, the rebuild will probably complete around noon EST tomorrow. I will hopefully have a positive update then.

Link to comment

Finished the rebuild, which averaged 130MB/s; however, the parity check still ran at 10MB/s with the CPU pegged at ~80%. I had to reduce the tunable settings to about a quarter of their original values to get back to where I was before the most recent parity check. It just seems odd to me that the rebuild is so fast without changing any settings, yet the parity check is so slow.

Link to comment

I'm curious what hdparm -I <drive> says about all the drives.

 

Having a disk drop down to PIO mode instead of UDMA would give exactly these silly slow speeds at extreme CPU load, since PIO mode means there is no hardware acceleration of the data transfers. With DMA, the transfers just consume memory bandwidth, and at the end of a transfer the OS gets an interrupt informing it that the transfer is done.

 

Below is partial output of hdparm -I, where the star before "udma6" shows which mode the drive is currently using.

Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, with device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 0
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
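If you want to check all the drives at once, the active mode (the starred entry) can be pulled out of that line with a quick grep. This is just a sketch: it is fed the sample line above, while on a live system you would substitute the real output of hdparm -I for each /dev/sdX.

```shell
# Extract the active transfer mode (the '*'-marked entry) from hdparm -I output.
# Sample line taken from the output above; on a real system you would use:
#   hdparm -I /dev/sdX | grep 'DMA:'
sample='        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6'
mode=$(printf '%s\n' "$sample" | grep -o '\*[a-z0-9]*' | tr -d '*')
echo "active mode: $mode"
```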

 

Link to comment

All drives show udma6 as the mode currently in use. The only difference between them is that the older 4TB drive does not have the "Advanced power management level" line.

Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, no device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 0
        Advanced power management level: 164
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
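Since the newer drives report an "Advanced power management level", it may be worth checking whether an aggressive APM setting is involved; hdparm can read and set it. A sketch, assuming the drive supports APM (the device name is hypothetical, and the setting may not persist across power cycles on all drives):

```shell
# Read the current APM level of a drive (hypothetical device name):
hdparm -B /dev/sdb

# Set the least-aggressive APM level (254) to rule out power management:
hdparm -B 254 /dev/sdb
```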

 

Link to comment
  • 2 weeks later...

So I believe this may be resolved, finally. I did make a large number of changes, but I believe a BIOS update and switching the server's power management to OS-controlled made the biggest impact. I am even using the default tunable settings, which I am now going to work on tweaking to improve performance further. I just find it odd that it was working perfectly fine until earlier this month...

Link to comment
