klamath Posted January 28, 2018

Howdy,

I have been battling this for a while. Over the holidays I accidentally fubared my super.dat while rolling back from 6.3 to 6.2 (don't computer before coffee). I was in the process of upgrading parity and a few data disks to new HE8 drives. After getting everything resettled and working, I started to notice weird link speeds with the HP SAS expander: it would randomly show link speeds of 1.5 on some drives. After replacing the SAS expander with a new Intel one, all drive speeds now report correctly. This last week I added another Intel SAS expander and replaced all SAS cables and the 9211-8i card, and my speeds doubled.

So far so good. When running parity checks I am averaging 81MB/s with 28 drives, and everything works. However, I'm seeing the process unraidd using 100% of CPU0, seemingly all in system wait. Things all appear to be working fine, but when the parity check crosses over to the 8TB drives, NFS and SMB stop responding, and I see errors emitted on NFS clients reporting timeouts waiting for the server to respond. I have been trying to figure out why the 8TBs are causing issues.

- All interrupts for the 9211 are on CPU0. I changed the affinity of that IRQ to allow scheduling on all cores, but it doesn't seem to help according to /proc/interrupts.
- IRQ issues? From lspci -v, my USB controller, the 9211, and the 10Gb NIC all share IRQ 10, however /proc/interrupts shows them on different interrupts.
- Same issues seen in the 6.2 and 6.4 releases.
- nr_requests is set at the default of 128, however increasing it to 512 seems to give a speed increase. The max queue depth for the SAS2008 is 3200; should those match per port?

Drives all check out: no pending sectors or other indicators of an issue, and nothing shows up in the syslog during these slowdowns. I'm running out of ideas at this point as to why unraidd is taking up so much CPU during a parity check, and whether that has anything to do with NFS going unresponsive. Once the parity check is canceled, all file sharing services start responding as normal. Any help would be appreciated.

Tim

orion-diagnostics-20180128-1331.zip
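(For reference, the affinity and queue-depth changes described above correspond to standard Linux knobs; a rough sketch, where the IRQ number 24 and the device sdb are placeholders for whatever your own system shows:)

```bash
# Which IRQ(s) does the mpt2sas driver (the 9211-8i) use, and how are
# its interrupts spread across CPUs?
grep -i mpt /proc/interrupts

# Current CPU affinity for that IRQ (24 is a placeholder; use the
# number from the output above)
cat /proc/irq/24/smp_affinity_list

# Allow the IRQ to be serviced by all cores instead of only CPU0
echo 0-7 > /proc/irq/24/smp_affinity_list

# nr_requests for one array member (sdb is a placeholder; repeat per disk)
cat /sys/block/sdb/queue/nr_requests
echo 512 > /sys/block/sdb/queue/nr_requests
```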
JorgeB Posted January 28, 2018

High CPU usage with dual parity on larger arrays is normal, much more so than with single parity, and you're using an older CPU with low single-thread performance. I did some tests a while ago, and for the CPU not to be a bottleneck on such a large array you need a CPU with a single-thread rating of around 2000 PassMarks. Whether that is the reason for NFS going unresponsive I'm not sure, but I suspect it is.
klamath Posted January 28, 2018

Using a Xeon 5560, not sure of the PassMark score but Googling shows around 5400. Is there a reason why, when the check is running across all 28 drives, everything is responsive, but when checking the 8TB drives things become unresponsive?
JorgeB Posted January 28, 2018

2 minutes ago, klamath said:
"Using a Xeon 5560, not sure of the PassMark score but Googling shows around 5400"

1357 single-threaded.

2 minutes ago, klamath said:
"Is there a reason why, when the check is running across all 28 drives, everything is responsive, but when checking the 8TB drives things become unresponsive?"

None I can think of, but with dual parity the CPU usage should remain about the same all the way through, since parity2 is still calculated as if all the disks are there; it's just zeros for those.
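(For context on why the work per stripe stays constant: dual parity is the standard RAID-6 style P+Q scheme. A conceptual sketch, where the generator g = 2 over GF(2^8) is the conventional RAID-6 choice, assumed here rather than confirmed in this thread:)

```latex
% P parity: plain XOR of the n data blocks in a stripe
P = d_0 \oplus d_1 \oplus \cdots \oplus d_{n-1}

% Q parity (parity2): Reed-Solomon syndrome over GF(2^8), generator g = 2
Q = g^0 \cdot d_0 \oplus g^1 \cdot d_1 \oplus \cdots \oplus g^{n-1} \cdot d_{n-1}
```

Past the end of a smaller disk, its d_i is simply zero, so n, and therefore the per-stripe CPU cost, doesn't shrink as the check moves into the 8TB-only region.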
klamath Posted January 28, 2018

You think reducing the array by 4 drives might get me out of having to buy a new head?
JorgeB Posted January 28, 2018

It will help, though I don't know if by enough. Ironically, changing your expanders to improve your parity speed would make the problem worse: for the same array, the faster the parity check speed, the higher the CPU usage, until it's at 100%.
klamath Posted January 28, 2018

If I can evacuate all 4 drives at once, can I use the "Clear Drive Then Remove Drive" method, or do I need to remove each drive one by one?

Tim
JorgeB Posted January 28, 2018

You can remove them all at once after all are cleared, but clear them one by one.
klamath Posted January 28, 2018

So, thinking out loud: if the CPU were the issue with this number of drives, NFS should stop responding when checking all 28 drives, not just the 5 HE8 drives. So I'm thinking maybe the tunables testing script is not factoring the entire run into its recommendation. Speeds jump to 100+MB/s once the system is only checking the HE8 drives. Think returning the values to default will help?

The only system that looks like it would work is the Dell T130; it fits the price point well and gives good CPU numbers.

Tim
JorgeB Posted January 28, 2018

Just now, klamath said:
"Think returning the values to default will help?"

It might help, and it won't hurt to try.
klamath Posted January 30, 2018

Still no dice; the server goes unresponsive once I hit the 4TB barrier. A T130 with a Xeon 1270 has been ordered.
klamath Posted February 7, 2018

Like night and day so far: under a parity check the CPU is at 40% utilization at 600GB checked, steady so far.
Drewster727 Posted February 9, 2018

@klamath @johnnie.black I'm also running into this issue. I've got 22 drives in my array, most are 4-6TB, plus 1x8TB, and dual 8TBs in parity. Your last message is confusing: did you make a change that made a difference? My NFS file access is basically *useless* when running a parity check. I seem to be capped at 75MB/s throughout the whole check as well, so it's slowwwww. My CPU never goes above 40% either. Also, this is on 6.4.1, so the UI is still responsive during checks; just the NFS access is atrocious.

Thanks
klamath Posted February 9, 2018

I converted my Norco case into a JBOD; I had a Supermicro motherboard with an X5570 CPU inside the Norco case to begin with. I just bought a Dell T130 with a Xeon 1270 v5, and I went from 65MB/s parity check speeds to 100-150MB/s with the default tune, no modifications to the config at all. The system did a parity check start to finish without any hiccups on the network with NFS; no client reported any timeouts at all, and Plex (a major subscriber to NFS) didn't register any issues during the parity check.

@Drewster727 In my case I installed nmon and saw the unraidd process logged most of its time in system wait during a parity check. Another interesting thing I noticed between the systems is that my SAS card's interrupts are now spread across all cores, versus the old system having all interrupts assigned to core 0.

Hope this helps a little bit!

Tim
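(If anyone wants to compare their own box, this is roughly what that looks like, assuming the stock mpt2sas driver for the 9211-8i:)

```bash
# Per-CPU interrupt counts for the SAS HBA; on the old box every column
# except CPU0's stayed near zero, on the T130 the counts grow on all cores
grep -i mpt /proc/interrupts

# Watch iowait (the "wa" column) while a parity check runs; unraidd's
# time in system wait shows up here
vmstat 5
```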
JorgeB Posted February 9, 2018

5 hours ago, Drewster727 said:
"@klamath @johnnie.black I'm also running into this issue. ..."

We'd need your diagnostics to have any opinion on the problem, but likely it's a not-powerful-enough CPU.