Exilepc Posted May 3, 2020

I need some help... I have been having high IOwait for a few months now. It's driving me up the wall...

Specs:
Unraid system: Unraid Server Pro, version 6.8.3
Model: Custom
Motherboard: Supermicro X8DT6
Processor: Intel Xeon CPU E5620 @ 2.40GHz
HVM: Enabled
IOMMU: Disabled
Cache: L1 = 256 kB, L2 = 1024 kB, L3 = 12288 kB
Memory: 48 GB (max. installable capacity 384 GB)
Network: bond0: fault-tolerance (active-backup), mtu 1500; eth0: 1000Mb/s, full duplex, mtu 1500
Kernel: Linux 4.19.107-Unraid x86_64
OpenSSL: 1.1.1d
P + Q algorithm: 5892 MB/s + 8187 MB/s

Steps tried so far. I have been having a high amount of IOwait, and I have tried everything that has fixed it for others:
- Added a second SSD to cache
- Moved Deluge to its own SSD via unmounted drives
- Switched from Plex to Emby
- Added an Nvidia 1660 Ti for Emby
- Disabled and removed Dynamix Cache Directories, waited a week and re-added it
- Ran the DiskSpeed docker on all drives
- Changed the parity drive to a Seagate IronWolf Pro (second one still in shipping; put in an HP Enterprise drive as temporary parity)
- Replaced all drives with errors, even ones with a single error (3 drives)
- Disabled the Spectre and Meltdown protections
- Replaced both CPU coolers with dual Noctua fans
- Added memory fans to the system to solve memory heat issues
- Changed to a new case (Fractal Design Define 7 XL) from a 24-bay Supermicro
- Added 8 case fans
- Using 18 3.5" drives and 5 2.5" drives
- Removed all 1 TB drives
- Watched NetData, htop and Glances like a hawk for any sign of what is causing the issue

I am still getting the IOwait messages, and the system is very slow.

tower2-diagnostics-20200502-2038.zip
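As an aside for anyone following along: the iowait figure that NetData, htop and Glances graph comes straight from /proc/stat, so it can be checked without installing anything. A minimal sketch, assuming a standard Linux /proc layout (field 6 of the "cpu" line is cumulative iowait ticks, per proc(5)):

```shell
# Print cumulative iowait ticks vs. total CPU ticks since boot.
# Taking two samples a second apart and differencing them gives the
# iowait percentage that top/Glances report.
awk '/^cpu /{total=0; for (i=2; i<=NF; i++) total+=$i;
             printf "iowait ticks: %d of %d total\n", $6, total; exit}' /proc/stat
```

A steadily climbing iowait counter while throughput stays low is the signature described throughout this thread.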
JorgeB Posted May 3, 2020 Docker image is on the array, there are similar reports when this is the case, try moving it to cache.
Exilepc Posted May 3, 2020 I will look into that now.
Exilepc Posted May 3, 2020 Moved the docker/VM share to a cache-only share; still getting high IOwait...
JorgeB Posted May 3, 2020 And no difference accessing/using the dockers during those i/o waits?
Exilepc Posted May 3, 2020 Most of the dockers have not been slow; only Plex, SMB and the webGUI. Parity checks and the like have been at 90-160 MB/s.
IKWeb Posted March 5, 2021 @Exilepc - Did you ever find a fix for this? I am getting the same issues, and while copying large amounts of data to the array it kills the WebUIs for the server itself and for all the docker containers.
Exilepc Posted November 1, 2021 Nope. The issue had not been so pronounced until tonight... So I googled it tonight and this is one of the top posts... It's kinda sad...
dclive Posted November 28, 2021 Having similar issues here. Sometimes my webGUI just doesn't respond (meaning Unraid's; Plex's etc. are fine), and IO performance takes a negative hit. I just installed Glances (thanks!) and will monitor iowait now.
Fffrank Posted November 9, 2022 Has anyone ever figured this out? My system appears to be getting worse. I download and process on my cache drives (4x 240GB SSD), then the data is moved to my media share (which also uses the cache). During these operations my system comes to a standstill: anything streaming off the array chokes, my dockers become unresponsive, VMs time out, and I even get SSH shells dropped. It's a dual Xeon 5660 with 96GB of RAM. htop doesn't show anything out of the ordinary, but the Unraid dashboard shows CPU use through the roof. I finally stumbled upon this thread and it seems I have the same iowait issues. I've got my dockers all pinned to limit what they can use. I have my docker image on the cache, as well as my appdata share. Very frustrating!
Gruffydd Posted January 1, 2023 Same problem here. My system is strong enough, yet I get IOWAIT up in the 50-60% range.
brendan399 Posted January 9, 2023 I have been dealing with this as well for a couple of weeks. I can't seem to figure out what the exact cause is. When I check the logs, it seems one or the other of my SSD cache drives resets (failed scmd), and I also get I/O errors: sector xxxx op 0x0: (read) flags 0x80700 phys_seg 1/2/3/4 prio class 0
Andiroo2 Posted January 17, 2023 I'm having this issue when trying to export data from the array to another server for backup. Speeds start around 30MB/s and drop to around 5-7MB/s, with IOWAIT sitting around 33.3 according to Glances. No other activity on the server at the same time. What's interesting is that the array shows ~100MB/s of reads, but there is only a trickle going over the wire. It's like the system is spinning its wheels trying to get the data ready to send, but can only send really slowly. For reference, I am "pulling" data from Unraid to macOS: I am running the rsync commands on macOS, connected to Unraid over the network. I have tried rsync over SSH and via SMB, but no real difference.
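When the array shows heavy reads but only a trickle reaches the wire, it can help to see which device the kernel is actually waiting on. A rough sketch, assuming the standard /proc/diskstats layout from the kernel's iostats documentation (field 12 is I/Os currently in flight, field 13 is total milliseconds spent doing I/O):

```shell
# Snapshot per-device busyness from /proc/diskstats, skipping loop/ram
# pseudo-devices. Sample this twice: a disk whose io_ms grows by
# roughly 1000 per wall-clock second is saturated, and is the one
# everything else is queued behind.
awk '$3 !~ /^(loop|ram)/ {printf "%-10s in_flight=%-4s io_ms=%s\n", $3, $12, $13}' \
    /proc/diskstats
```

On an Unraid box during one of these stalls, the saturated device is typically either a single array disk or the parity drive(s).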
Exilepc Posted January 17, 2023 Sadly I have not been able to track down the issue… I wish I had an answer
dankulo Posted January 28, 2023 It doesn't seem like there is a solution to the problem at all. I've been dealing with this for months and it only gets worse.
lonnie776 Posted January 31, 2023

I have a large Unraid server with 17 array drives, 2 parity drives and a RAID1 SSD cache pool. On that I run 3 VMs and up to 12 dockers, some of which are I/O intensive. I often see high iowait %; however, in my case I know it's because I am simply demanding too much from my disks, which would be fine if it didn't cripple every other part of the system.

Years back I found a way around iowait consuming the whole CPU. Linux allows you to isolate CPU cores from the system so you can dedicate them to other tasks (VM/Docker). This way, when the system is crippled by iowait, your VMs and Docker containers can continue to function happily on the isolated cores, although I/O may still suffer if they access the array/pool causing the iowait. As I understand it, my situation is different than yours, but hopefully this trick will still help you work around some of the headaches.

In order to isolate the cores, you have to go onto your flash drive and edit /syslinux/syslinux.cfg. Here is my default boot mode, which I have edited to include "append isolcpus=4-9,14-19". This option forces the system to run on cores 0-3 and 10-13, leaving the isolated cores idle:

label Unraid OS
  menu default
  kernel /bzimage
  append isolcpus=4-9,14-19 initrd=/bzroot

I have an old hyperthreaded 10-core Xeon, so I have 20 virtual cores, 0-19. I chose to keep 4 cores for my system, as plugins still run on the system, and I have isolated 6 cores for VMs and Docker containers. For this to work properly you must pin each VM and Docker container to the isolated cores of your choosing. Now, when you are plagued by iowait, your dockers and VMs will still have processing power. I hope this helps.

Edit: After looking into this a bit further, I found that this has been implemented in the GUI. Now you simply go to Settings -> CPU Pinning.
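As a quick sanity check on the isolcpus approach above, the kernel exposes the isolated set in sysfs on reasonably modern kernels, and util-linux's taskset (present on Unraid) shows which cores a process may actually use. A sketch, with the example core ranges taken from the post above:

```shell
# Cores the kernel booted with isolcpus= (prints an empty line if none
# are isolated, or if the kernel predates this sysfs file):
cat /sys/devices/system/cpu/isolated 2>/dev/null
# Affinity of the current shell. On a box isolated as described above,
# this should list only the non-isolated cores, e.g. 0-3,10-13:
command -v taskset >/dev/null && taskset -cp $$ || true
```

If the shell's affinity still includes the cores you meant to isolate, the append line on the flash drive did not take effect.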
Andiroo2 Posted January 31, 2023 Pinning CPUs makes sense in your case, where the performance of other things isn't acceptable when the issue occurs. My experience is different though… I get the high IOWait but the rest of the system doesn't hang.
JPAchilles Posted March 2, 2023 Bumping this thread. My server's been rendered unusable even without a parity drive and with docker disabled. Tried all the steps in the OP, and they helped, but not enough. nas-diagnostics-20230301-1643.zip
maust Posted July 26, 2023

More or less have followed the same steps as the OP with similar "results" but no permanent fix. The issue became much more apparent after upgrading to 6.11.5 from 6.9. Consistently sitting between 5-10% IOWAIT. Any time qbittorrent or any other container does large-scale file operations, it shoots the IOWAIT up to 30-50%, sometimes sitting there for hours at a time. This causes all network traffic to grind to a halt.

Some of what I have tried:
- Swapped cache drives
- Tried adding more cache drives
- Tried splitting the workloads between cache drives
- Switched all cache drives from BTRFS to XFS (greatly improved the baseline IOWait, but the issues persist)
- Switched docker.img from BTRFS to XFS (again, improved the IOWait issues but they persist)
- Rebuilt docker.img from 150GB -> 50GB after fixing naughty containers (no performance change)
- Ensured docker containers were not writing to docker.img after build (no performance change)
- Switched docker.img from XFS to a directory (no change)
- Tried adding better, faster pool drives (no perceived difference)
- Replaced both CPUs with E5-2650v2 (from E5-2650)

What I am working to try:
- Replacing all RAM with higher-capacity sticks (128GB -> 384GB)

Things that really trigger the IOWait:
- qbittorrent cache flushing (I get better IO and system performance if all qbittorrent caching is disabled)
- Mover (with or without nice)
- Radarr/Sonarr (file analysis)
- Sonarr (every 30 seconds on Finished Download Check; typically causes 5-6% IOWait every 30 seconds for ~10 seconds)
- Sabnzbd (no longer an issue once nice was adjusted)
- Unzip/unrar (any kind; I have to be incredibly harsh on the nice values to get it to not choke the server)
- NFSv3 (full stop: any remote NFSv3 action causes massive IOWait, talking upwards of 40-50% IOWait on READ ONLY)
- BTRFS (literally anything BTRFS causes issues on my R720XD; I do not experience this on my other servers)

Specs:
- R720XD
- E5-2650v2
- 128GB DDR3-1600 MHz
- Parity: 2 drives (16TB WD Red, 18TB WD Gold)
- Array (not including parity): 16 drives, 236TB usable, all tested with DiskSpeed, monitored with Seagate: 16TB Exos x7, WD x2, 14TB x4, 12TB x3
- Cache pools: Team 1TB (weekly appdata backups), P31 1TB (appdata), 1TB WD Black NVMe (blank), 4TB Samsung 870 EVO (download caching)
- Dell Compellent SC200
- Dell 165T0 Broadcom 57800S quad-port SFP+
- Dell H200 6Gbps HBA
- LSI 9211

Working hypothesis: Monitoring with NetData, I notice the IOWait jumps typically correlate with memory writeback, specifically dirty memory writeback. All my research comes back to either bad/lacking RAM (which I will be swapping out for 384GB) or tunables needing further adjustment.
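For anyone wanting to check the dirty-writeback correlation on their own box: the numbers NetData graphs come from /proc/meminfo, and the kernel's writeback thresholds are plain sysctls. A minimal sketch, assuming a stock Linux kernel (values in /proc/meminfo are always reported in kB):

```shell
# Pages dirtied in RAM but not yet flushed, and pages being flushed now:
grep -E '^(Dirty|Writeback):' /proc/meminfo
# Kernel thresholds: background flushing starts at dirty_background_ratio
# percent of reclaimable RAM; at dirty_ratio percent, writers are blocked
# outright, which shows up as exactly the iowait stalls described above.
echo "dirty_background_ratio=$(cat /proc/sys/vm/dirty_background_ratio)%"
echo "dirty_ratio=$(cat /proc/sys/vm/dirty_ratio)%"
```

Watching Dirty: climb toward the ratio limits during a qbittorrent or Mover run, right as IOWait spikes, would support the hypothesis; note that adding more RAM raises the absolute thresholds those percentages translate to.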
DanielPT Posted August 2, 2023

On 7/26/2023 at 5:37 PM, maust said: ...

I think I'm having the same issue on 6.11.5, but I don't know how to see the IOwaits. I think I'm going to try Netdata. I also use my cache drive for qbittorrent, and when that is slamming the Samsung SSD pool I get unresponsive dockers.
Mbeco Posted August 17, 2023

On 7/26/2023 at 5:37 PM, maust said: ... Working hypothesis: Monitoring with NetData, I notice the IOWait jumps typically correlate with memory writeback, specifically dirty memory writeback. ...

As I am just another one on this long list: could you kindly guide me to how you monitored dirty memory writeback? That is a bit out of my depth, but I have tried almost everything else suggested in this thread and in many others I have read over the past months. So if nothing else, maybe I could at least support your working hypothesis.
DanielPT Posted November 8, 2023 So nobody has solved this? When qbittorrent is doing a "little" work, all my dockers get unresponsive. I even enabled "exclusive shares" for appdata on my 2x Samsung SSDs.
JorgeB Posted November 8, 2023 And the torrents/downloads are also going to exclusive shares?