Hey all, I was just wandering around and wanted to share my experience with IOWait.
Background: I have a 220 TiB server with 128 GiB of RAM and a 32-core CPU. Nothing serious running beyond what others in here have been running. IOWait was regularly at ~50% and, critically, the system went unresponsive every minute or so.
Root cause: FUSE getting overwhelmed. Dirty (not yet written) pages filled the cache all the way up to the vm.dirty_ratio limit, causing the system to stall until everything was flushed to disk.
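If you want to check whether you are seeing the same thing, here is a rough Python sketch that reads the Dirty and Writeback counters from /proc/meminfo and shows them as a share of total RAM. The function names are my own, and comparing against MemTotal is only an approximation, since the kernel measures the ratio against "dirtyable" memory, which is somewhat smaller.

```python
import re


def parse_meminfo(text):
    """Return {field: value_in_kB} for lines such as 'Dirty:   1234 kB'."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^(\w+):\s+(\d+)\s+kB$", line.strip())
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def dirty_percent(fields):
    """Dirty + Writeback as a percentage of MemTotal (an approximation)."""
    pending = fields.get("Dirty", 0) + fields.get("Writeback", 0)
    return 100.0 * pending / fields["MemTotal"]


def report(path="/proc/meminfo"):
    """Print a one-line summary of pending writes from a meminfo-style file."""
    with open(path) as handle:
        info = parse_meminfo(handle.read())
    print(
        f"Dirty: {info.get('Dirty', 0)} kB, "
        f"Writeback: {info.get('Writeback', 0)} kB "
        f"({dirty_percent(info):.2f}% of MemTotal)"
    )
```

Calling report() every few seconds during a heavy write should show the percentage climbing toward your vm.dirty_ratio right before a stall, if your problem is the same as mine was.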
What helped:
- Install the Tips and Tweaks plugin to allow you to easily set these cache values.
- Reduce the time it takes to start writing to disk by setting vm.dirty_background_ratio to 0 or 1-2 (% of your RAM). This shortens the lag between data landing in RAM and the kernel starting to write it out, so your RAM is less likely to fill up with pending disk IO.
- Reduce the maximum cache size by lowering vm.dirty_ratio. I found 5 (% of RAM) worked reasonably well. Remember that if you hit this limit, everything that was writing to RAM (instead of disk) now has to BOTH wait for the flush and switch to blocking IO. Your disks are very slow compared to memory, so this overall cache needs to be small or zero to eliminate the instability at flush.
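To put numbers on the two settings above, here is a small Python sketch of the arithmetic for a 128 GiB machine. Assumptions on my part: the percentages are applied to total RAM (the kernel actually uses dirtyable memory, which is smaller), 10/20 are the mainline kernel defaults (your distribution may differ), and the 100 MiB/s sustained write rate is a made-up figure purely for illustration.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def dirty_thresholds(ram_bytes, background_ratio, dirty_ratio):
    """Approximate byte thresholds implied by the two percentages.

    Assumption: percentages are taken of total RAM, which overestimates
    slightly compared with what the kernel really uses.
    """
    background = ram_bytes * background_ratio // 100
    hard_limit = ram_bytes * dirty_ratio // 100
    return background, hard_limit


def seconds_to_flush(dirty_bytes, disk_bytes_per_second):
    """Time to drain a full cache at a given sustained write rate."""
    return dirty_bytes / disk_bytes_per_second


ram = 128 * GIB
assumed_write_rate = 100 * MIB  # hypothetical sustained throughput

for label, bg, hard in (("mainline defaults", 10, 20), ("tuned", 1, 5)):
    bg_bytes, hard_bytes = dirty_thresholds(ram, bg, hard)
    stall = seconds_to_flush(hard_bytes, assumed_write_rate)
    print(
        f"{label}: background flush starts at {bg_bytes / GIB:.1f} GiB, "
        f"writers block at {hard_bytes / GIB:.1f} GiB, "
        f"worst-case drain about {stall:.0f} s"
    )
```

With 128 GiB, a dirty_ratio of 20 allows roughly 25.6 GiB of unwritten data to pile up, while 5 caps it at about 6.4 GiB, which is why the smaller value makes each flush so much shorter.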
What eliminated the issue:
- Switching my main array to ZFS on Unraid. I have been really happy with the performance: network throughput is 3x, IOWait is gone, and I am getting almost 10 Gbit/s with a 7 x 20 TiB * 3 vdev configuration. (A word of warning: I almost lost all my data in a phased transfer to ZFS, so back up your data or consider a new JBOD if you are new to ZFS.)
References:
- https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
- Unraid Support