Slow parity check question for the nth time [Solved: Loose HBA]



With nothing changed to my knowledge in the last 30 days, my parity check speed dropped to 150 kB/sec, with a projected completion time of over a year.  I'm currently running Unraid 6.7.2 on a dual-Xeon Dell T610 with 32GB of ECC memory.  My array is two 10TB parity drives, with two 10TB, a 5TB and a 6TB drive for data.  All devices, drives and cache SSDs show green with no SMART errors.

I tried stopping the parity check and rebooting, but it settles back into the 150 kB/sec range.  I tried shutting down all VMs and Dockers and taking shares offline.  htop shows a CPU load of 10, but no single task seems to be using much in the way of resources; CPU % is at 2% or less.  Moving a large file from one disk to another runs at about 175 MB/sec.  The array is working fine: I can saturate a 10GbE link copying a file from my Proxmox server to the Unraid server, and iperf shows 9.7 Gb/sec in both directions.

 

I'm at a loss to figure out what my next troubleshooting steps should be.  With CPU % at 2% but CPU load at 10, are there tasks being blocked?  I'm not able to identify them.  Everything seems to be running, but the parity check is dog slow.  I'm not worried about my data at this time, but I may be naive.
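In case it helps point the troubleshooting: a per-disk sequential read test like the one below would rule out a single slow drive, since a parity check only runs as fast as the slowest member (a sketch; the sdX names are examples, not my actual assignments).

# Spot-check raw sequential read speed on each array member. Device
# names are examples; substitute the actual sdX assignments from Main.
# A healthy modern drive should report well over 100 MB/sec here.
for dev in /dev/sd[b-g]; do
    hdparm -t "$dev"
done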

 

Any guidance would be greatly appreciated.  Thank you.

 

Edit 2:  The slow parity check appeared again.  This time I remembered my PCIe HBA did not have a faceplate (long story).  Re-seating the HBA sped the parity check up from 140 kB/sec to over 1 Gb/sec.

 

Edit:  Rebooting in safe mode did not solve the slow issue.  I did nothing other than view every BIOS, RAID controller and boot setting; no changes were made, and I did not shut down the server.  When Unraid came back up, the parity check was running at full speed.

Link to comment

Any idea why these are hitting your syslog every 10 seconds?

Nov 26 08:02:57 t610 rpcbind[28748]: connect from 192.168.1.210 to getport/addr(mountd)
Nov 26 08:03:07 t610 rpcbind[29233]: connect from 192.168.1.210 to getport/addr(mountd)
Nov 26 08:03:17 t610 rpcbind[29629]: connect from 192.168.1.210 to getport/addr(mountd)
Nov 26 08:03:28 t610 rpcbind[29973]: connect from 192.168.1.210 to getport/addr(mountd)
...
Nov 26 18:38:37 t610 rpcbind[28345]: connect from 192.168.1.210 to getport/addr(mountd)
Nov 26 18:38:46 t610 rpcbind[28731]: connect from 192.168.1.210 to getport/addr(mountd)
Nov 26 18:38:56 t610 rpcbind[29359]: connect from 192.168.1.210 to getport/addr(mountd)

Something to do with this?

11 hours ago, [email protected] said:

I can saturate a 10GbE link copying a file from my Proxmox server to the Unraid server.

Are you trying to write a lot to your server during the parity check?
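Writes compete with the check for the same spindles.  An easy way to watch per-disk activity while the check runs (a sketch, assuming iostat from the sysstat package is available, e.g. via NerdTools):

# Print extended per-device stats every 5 seconds; heavy writes during
# a parity check show up as w/s competing with the check's r/s on the
# same array members.
iostat -x 5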

Link to comment

I don't know why my Proxmox server is trying to connect so much.  Even when I shut its port down on the distribution switch, the parity check is still slow.
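To quantify the polling, this confirms the ~10-second interval and the source (a rough sketch; /var/log/syslog is the default Unraid location):

# Count mountd getport probes from the Proxmox host and show the most
# recent ones with timestamps.
grep 'getport/addr(mountd)' /var/log/syslog | grep -c '192.168.1.210'
grep 'getport/addr(mountd)' /var/log/syslog | tail -5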

 

I may have found the problem, though it may just be me misinterpreting the SMART attributes, since the drive shows green.

Main -> Drive 2 -> Attributes show Raw read error rate = 105312344 and Seek error rate = 52672152943

But why does it say FAILED Never?

Is this the cause of the slow parity check?

[Attached screenshot: drive2.jpg, showing Drive 2's SMART attribute table]
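From what I've read since, on many drives (Seagate in particular) the raw values of those two attributes are encoded operation counters rather than literal error counts, which would explain the huge numbers sitting next to FAILED Never: the FAILED column only flags an attribute whose normalized value has dropped below its threshold.  The same table can be pulled from the shell (a sketch; /dev/sdb is a placeholder for Drive 2's device):

# Dump the SMART attribute table; smartctl ships with Unraid. Compare
# the normalized VALUE against THRESH rather than reading RAW_VALUE.
smartctl -A /dev/sdb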

Link to comment

I rebooted in safe mode and it's been running for about 25 minutes.  The parity check speed ramped up to 9 MB/sec after a couple of minutes, then dropped down to 150 kB/sec.  After 10 minutes it settled in at 3 MB/sec.  Faster, but still much slower than expected.

CPU average is at 0.9%, but load is at 10.

 

A lot of googling told me CPU % != load, and that the processes below, stuck in D state, are the most likely culprits slowing things down.  I don't know what these processes do, though (other than nfsd), or how to optimize them.

 

root@t610:/proc# top -b -n 1 | awk '{if (NR <=7) print; else if ($8 == "D") {print; count++} } END {print "Total status D (I/O wait probably): "count}'
top - 13:58:57 up  1:53,  1 user,  load average: 12.99, 12.73, 13.34
Tasks: 444 total,   1 running, 443 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.7 us,  0.7 sy,  0.0 ni, 92.1 id,  6.5 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  32223.3 total,  18945.6 free,   2305.0 used,  10972.7 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  28300.9 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  709 root      20   0       0      0      0 D   0.0   0.0   0:00.94 kworker/11:1+xfs-cil/md2
 2078 root      20   0       0      0      0 D   0.0   0.0   0:11.48 mdrecoveryd
 2153 root      20   0       0      0      0 D   0.0   0.0   0:13.58 nfsd
 2155 root      20   0       0      0      0 D   0.0   0.0   0:16.81 nfsd
 2160 root      20   0       0      0      0 D   0.0   0.0   0:14.26 nfsd
14153 root      20   0       0      0      0 D   0.0   0.0   0:02.36 kworker/u34:2+flush-9:2
16072 root      20   0       0      0      0 D   0.0   0.0   0:00.25 kworker/11:2+xfs-sync/md2
18871 root      20   0       0      0      0 D   0.0   0.0   0:01.01 kworker/1:7+md
Total status D (I/O wait probably): 8
root@t610:/proc# 
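A follow-up that digs one level deeper: asking the kernel what each D-state task is actually blocked on (a sketch; needs root, and the wchan symbol names vary by kernel build):

# For every task in uninterruptible sleep, print its kernel wait
# channel; that usually names what it is stuck on (md, xfs, nfsd, a
# SCSI command).
for pid in $(ps -eo pid,stat | awk '$2 ~ /^D/ {print $1}'); do
    ps -o pid,wchan:32,comm -p "$pid"
done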

Link to comment

Did NOTHING and it is FIXED!

I went through every BIOS, HBA controller, system and boot setting.  I just viewed everything; it was all set correctly.

I then changed back to normal GUI mode, and now the parity check is running between 50 MB/sec and 175 MB/sec!  Load is at 2.5 and CPU % is at 0.5%.

 

I wish I knew what changed; I'd like to learn.  The 10TB parity check will now take 17 hours instead of 1.5 YEARS!

Link to comment
