
Lockups when Parity is enabled


THF13


Unraid version: 6.9.2

Hardware:

  • Older dual-socket board with two Intel Xeon E5-2660 v2 CPUs (AES-NI supported)
  • 128GB ECC DDR3 RAM
  • Motherboard has a built-in LSI SAS2008. Array/Parity drives connected to an HP SAS expander; cache drives connected to SATA ports on the motherboard.
  • Unencrypted cache drive (XFS) for new files
  • 2nd encrypted cache drive (XFS-Encrypted) in a different pool for appdata and metadata
  • Mix of 8-16TB encrypted (XFS-Encrypted) array drives, 15 drives total
  • 18TB Parity drive

 

What Happens

  • Usually, when Parity is enabled (either building or valid) and data is written to the array (with my current setup this only happens when the mover runs), many CPU threads (but not all) go to 100% and parts of the system become unresponsive. Which parts become unresponsive seems random.
  • Sometimes the system is able to recover and returns to normal. Sometimes more and more pieces become unresponsive and the system can't restart on its own, requiring a hard power reset.
  • If Parity is not present the system is completely stable (~30 days with no issues).


What is affected:

  • This part is weirdly inconsistent. Most commonly it's network access to the shares over SMB.
  • Some docker containers become inaccessible, but not necessarily all of them and not always the same ones. One time the machine seemed totally unresponsive and I couldn't load the webUI or most of the docker containers I tried, but Emby was perfectly fine: I could browse between pages and play back media from the array without issue.
  • Parts of the Unraid webUI itself. Sometimes it becomes totally inaccessible and won't load at all. Sometimes certain pages (like the docker tab or the syslog) won't load but other pages will.
  • Ability to shutdown/restart: when the system is locking up it can't even shut itself down. If I can access the terminal or webUI and trigger a shutdown, it just hangs and never finishes the process.

 

What I've tried:

  • Checked for LSI controller firmware updates
  • Disabled all VMs
  • Unmounted additional unassigned SSDs attached via a PCIe card
  • Changed any BTRFS disks to XFS
  • Lowered the priority of the mover process
  • Mirrored syslog to flash
  • Ran drive benchmarks
  • Disabled Parity to test stability

 

Other Details:

When the issue is happening and I catch it early I can usually get the system to recover with "mover stop", but it doesn't go back to normal for anywhere between 10-30 minutes and CPU will still spike. Stopping docker altogether similarly does not immediately resolve the issue, nor does doing both at the same time.
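
For reference, this is roughly what catching it early from the terminal looks like. I'm assuming the stock mover script and that it accepts a status argument; the exact arguments may differ between Unraid versions:

    # Check whether mover is currently running (argument assumed; may vary by Unraid version)
    mover status

    # Ask mover to stop; it finishes its current file, so recovery isn't instant
    mover stop

    # Watch CPU while it winds down; the "wa" value in the header is iowait
    top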

 

Attached are Unraid diagnostics and the syslog from before the mover process started until I had to hard reboot the server a few hours later. In this example I could access the Unraid webUI but not the docker tab. Some docker containers were still working (Emby, for one) but others were not (tdarr server node). Also attached are the recent drive benchmark results. Parity was building during this lockup.

 

I thought the lockups only happened when mover was running, but in the attached syslog the mover starts running at 4:40 and nothing related to the lockup happens until 8:15; there shouldn't have been enough data on the cache for the move to take that long.

benchmark-speeds.png

unraid01-diagnostics-20211014-0933.zip
syslog.txt

2 weeks later...

I believe I have this solved and the system has been running stably for the past week.  

iowait was actually the problem; it just wasn't showing up in top as anything major. Watching the netdata docker container while the issue was happening made it a lot clearer.
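
If you don't have netdata running, a quick way to see the same thing from the Unraid terminal (assuming top and vmstat are present, which they should be on a stock install; iostat needs the sysstat package and may not be):

    # "wa" in the %Cpu(s) line is iowait; it was high even when per-process CPU looked tame
    top -b -n 1 | head -5

    # vmstat's "wa" column, sampled every 5 seconds, shows the spikes while the mover runs
    vmstat 5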

I followed the advice in the thread below and used the Tips and Tweaks plugin to reduce "Disk Cache 'vm.dirty_background_ratio' (%):" from 10% to 1%, and "Disk Cache 'vm.dirty_ratio' (%):" from 20% to 2%. The effect was night and day when I tested the mover after this change: no pegged CPU cores, the system stayed responsive while the mover was running, and there have been no lockups or crashes for the past week.
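
For anyone without the Tips and Tweaks plugin, as far as I can tell the plugin is just setting the standard kernel sysctls, so something like this from the terminal should be equivalent (the values don't survive a reboot unless you re-apply them, e.g. from the go file):

    # Check the current values
    sysctl vm.dirty_background_ratio vm.dirty_ratio

    # Start background writeback at 1% of memory instead of 10%,
    # and block writers at 2% instead of 20%
    sysctl -w vm.dirty_background_ratio=1
    sysctl -w vm.dirty_ratio=2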

I don't know why this only happened to me with Parity enabled, but I'm glad it's fixed now. The fix has more of an impact the more RAM you have, since these settings are a percentage of memory. My system has 128GB of RAM so the effect was quite extreme, but this change might be worth testing if you have 16GB or more and your system runs worse than expected when the mover triggers.
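
Rough numbers for why the amount of RAM matters, assuming the thresholds work out to roughly a percentage of total RAM (the kernel actually uses available memory, so the real figures are a bit lower):

    # MemTotal in /proc/meminfo is in kB; print what the old 20% and new 2% dirty_ratio allow
    awk '/MemTotal/ {gb=$2/1024/1024;
         printf "dirty_ratio 20%% ~ %.1f GB of dirty pages, 2%% ~ %.1f GB\n", gb*0.20, gb*0.02}' /proc/meminfo

On a 128GB box that's the difference between letting ~25GB of writes pile up in RAM before the kernel forces them out versus about 2.5GB.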

