
Lockups when Parity is enabled


THF13


Unraid version: 6.9.2

Hardware:

  • Older dual-socket board with two Intel Xeon E5-2660 v2 CPUs (AES-NI supported)
  • 128GB ECC DDR3 RAM
  • Motherboard has a built-in LSI SAS2008. Array/Parity drives connected to an HP SAS expander; cache drives connected to SATA ports on the motherboard.
  • Unencrypted cache drive (XFS) for new files
  • 2nd encrypted cache drive (XFS-Encrypted) in a different pool for appdata and metadata
  • Mix of 8-16TB encrypted (XFS-Encrypted) array drives, 15 drives total
  • 18TB Parity drive

 

What Happens

  • Usually, when Parity is enabled (either building or valid) and data is written to the array (with my current setup this only happens when the mover runs), many CPU threads (but not all) go to 100% and parts of the system become unresponsive. Which parts become unresponsive seems random.
  • Sometimes the system is able to recover and returns to normal. Sometimes more and more pieces become unresponsive and the system can't restart on its own, requiring a hard power reset.
  • If Parity is not present the system is completely stable (~30 days with no issues).


What is affected:

  • This part is weirdly inconsistent. Most commonly it's network access to the shares over SMB.
  • Some docker containers become inaccessible, but not necessarily all of them and not always the same ones. One time the machine seemed totally unresponsive and I couldn't load the webUI or most of the docker containers I tried, but Emby was perfectly fine: I could browse between pages and play back media from the array without issue.
  • Parts of the Unraid webUI itself. Sometimes it becomes totally inaccessible and won't load at all. Sometimes certain pages (like the docker tab or the syslog) won't load but other pages will.
  • Ability to shutdown/restart: when the system is locking up it can't even shut itself down. If I can access the terminal or webUI and trigger a shutdown, it just hangs and never finishes the process.

 

What I've tried:

  • Checked for LSI controller firmware updates
  • Disabled all VMs
  • Unmounted additional unassigned SSDs attached via a PCIe card
  • Changed any BTRFS disks to XFS
  • Lowered the priority of the mover process
  • Mirrored syslog to flash
  • Ran drive benchmarks
  • Disabled Parity to test stability

 

Other Details:

When the issue is happening and I catch it early I can usually get the system to recover with "mover stop", but it doesn't go back to normal for anywhere between 10-30 minutes and CPU will still spike. Stopping docker altogether similarly does not immediately resolve the issue, nor does doing both at the same time.
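
For reference, this is roughly what catching it early from the terminal looks like. I'm assuming the stock mover script and that it accepts a status argument; the exact arguments may differ between Unraid versions:

    # Check whether mover is currently running (argument assumed; may vary by Unraid version)
    mover status

    # Ask mover to stop; it finishes its current file, so recovery isn't instant
    mover stop

    # Watch CPU while it winds down; the "wa" value in the header is iowait
    top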

 

Attached are Unraid diagnostics and the syslog from before the mover process started until I had to hard reboot the server a few hours later. In this example I could access the Unraid webUI but not the docker tab. Some docker containers were still working (Emby, for one) but others were not (tdarr server node). Also attached are the recent drive benchmark results. Parity was building during this lockup.

 

I thought the lockups only happened when mover was running, but in the attached syslog the mover starts running at 4:40 and nothing related to the lockup happens until 8:15; there shouldn't have been enough data on the cache for the move to take that long.

benchmark-speeds.png

unraid01-diagnostics-20211014-0933.zip
syslog.txt

2 weeks later...

I believe I have this solved and the system has been running stably for the past week.  

iowait was actually the problem; it just wasn't showing up in top as anything major. Watching the netdata docker container while the issue was happening made it a lot clearer.
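
If you don't have netdata running, a quick way to see the same thing from the Unraid terminal (assuming top and vmstat are present, which they should be on a stock install; iostat needs the sysstat package and may not be):

    # "wa" in the %Cpu(s) line is iowait; it was high even when per-process CPU looked tame
    top -b -n 1 | head -5

    # vmstat's "wa" column, sampled every 5 seconds, shows the spikes while the mover runs
    vmstat 5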

I followed the advice in the thread below and used the Tips and Tweaks plugin to reduce "Disk Cache 'vm.dirty_background_ratio' (%):" from 10% to 1%, and "Disk Cache 'vm.dirty_ratio' (%):" from 20% to 2%. The effect was night and day when I tested the mover after this change: no pegged CPU cores, the system stayed responsive while the mover was running, and there have been no lockups or crashes for the past week.
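
For anyone without the Tips and Tweaks plugin, as far as I can tell the plugin is just setting the standard kernel sysctls, so something like this from the terminal should be equivalent (the values don't survive a reboot unless you re-apply them, e.g. from the go file):

    # Check the current values
    sysctl vm.dirty_background_ratio vm.dirty_ratio

    # Start background writeback at 1% of memory instead of 10%,
    # and block writers at 2% instead of 20%
    sysctl -w vm.dirty_background_ratio=1
    sysctl -w vm.dirty_ratio=2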

I don't know why this only happened to me with Parity enabled, but I'm glad it's fixed now. The fix has more of an impact the more RAM you have, since these settings are a percentage of memory. My system has 128GB of RAM so the effect was quite extreme, but this change might be worth testing if you have 16GB or more and your system runs worse than expected when the mover triggers.
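
Rough numbers for why the amount of RAM matters, assuming the thresholds work out to roughly a percentage of total RAM (the kernel actually uses available memory, so the real figures are a bit lower):

    # MemTotal in /proc/meminfo is in kB; print what the old 20% and new 2% dirty_ratio allow
    awk '/MemTotal/ {gb=$2/1024/1024;
         printf "dirty_ratio 20%% ~ %.1f GB of dirty pages, 2%% ~ %.1f GB\n", gb*0.20, gb*0.02}' /proc/meminfo

On a 128GB box that's the difference between letting ~25GB of writes pile up in RAM before the kernel forces them out versus about 2.5GB.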

