Tracking down IOwait cause



I need some help... I have been having high IOwait for a few months now. It's driving me up the wall...

 

Specs:

Unraid system: Unraid server Pro, version 6.8.3

Model: Custom

Motherboard: Supermicro - X8DT6

Processor: Intel® Xeon® CPU E5620 @ 2.40GHz

HVM: Enabled

IOMMU: Disabled

Cache: L1-Cache = 256 kB (max. capacity 256 kB)

L2-Cache = 1024 kB (max. capacity 1024 kB)

L3-Cache = 12288 kB (max. capacity 12288 kB)

Memory: 48 GB (max. installable capacity 384 GB)

Network: bond0: fault-tolerance (active-backup), mtu 1500

eth0: 1000Mb/s, full duplex, mtu 1500

Kernel: Linux 4.19.107-Unraid x86_64

OpenSSL: 1.1.1d

P + Q algorithm: 5892 MB/s + 8187 MB/s

 

Steps taken to try to fix the issue:

I have been having a high amount of IOwait. I have tried everything that has turned out to be the culprit for others:

  • Added a second SSD to the cache
  • Moved Deluge to its own SSD via an unmounted drive
  • Switched from Plex to Emby
  • Added an Nvidia GTX 1660 Ti for Emby
  • Disabled and removed Dynamix Cache Directories, waited a week and re-added it
  • Ran the DiskSpeed docker on all drives
  • Changed the parity drive to a Seagate IronWolf Pro (the second one is still in shipping, so an HP Enterprise drive is in as temporary parity)
  • Replaced all drives with errors (even the 3 drives that had only 1 error each)
  • Disabled the Spectre and Meltdown protections
  • Replaced both CPU coolers with dual Noctua fans
  • Added memory fans to the system to solve memory heat issues
  • Changed to a new case (Fractal Design Define 7 XL) from the 24-bay Supermicro
    • Added 8 case fans
    • Using 18 × 3.5" drives and 5 × 2.5" drives
  • Removed all 1 TB drives
  • Watched NetData, htop and Glances like a hawk for any sign of what is causing the issue

 

I am still getting the IOwait error messages, and the system is very slow.
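For anyone trying to narrow this down from a shell, a rough sketch of the per-device view (assumes iostat from the sysstat package is installed, e.g. via the NerdPack/NerdTools plugin; sdb is just an example device):

  # Per-device utilisation, refreshed every 5 seconds.
  # High %util and a large await on one device points at the disk behind the iowait.
  iostat -dxm 5
  # Kernel block-layer counters for a single device (no extra packages needed; sdb is an example)
  cat /sys/block/sdb/stat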

Screen Shot 2020-05-02 at 8.27.14 PM.png

tower2-diagnostics-20200502-2038.zip


Has anyone ever figured this out? My system appears to be getting worse. I download and process on my cache drives (4x 240GB SSDs), then the data is moved to my media share (which also uses the cache).

 

During these operations my system comes to a standstill. Anything streaming off the array chokes, my Dockers become unresponsive, VMs time out, and I even get SSH sessions dropped.

 

It's a dual Xeon 5660 w/ 96GB of RAM.

 

htop doesn't show anything out of the ordinary, but the unRAID Dashboard shows CPU use through the roof. I finally stumbled upon this thread and it seems I have the same iowait issues.

 

I've got my dockers all pinned to limit what they can use. I have my docker image on the cache as well as my appdata share. 

 

Very frustrating!


I have been dealing with this as well for a couple of weeks and can't seem to figure out the exact cause. When I check the logs, it seems one or the other of my SSD cache drives resets (failed scmd), and I also get I/O errors along the lines of: sector xxxx op 0x0:(READ) flags 0x80700 phys_seg 1 (or 2, 3, or 4) prio class 0
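A short sketch of how to pull those errors together and check the drive itself (smartctl ships with Unraid; /dev/sdX is a placeholder for the affected cache device):

  # Kernel messages about resets and IO errors since boot
  dmesg -T | grep -iE 'reset|i/o error|failed command'
  # SMART health and error log for the suspect SSD (replace sdX with the real device)
  smartctl -a /dev/sdX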


I'm having this issue when trying to export data from the array to another server for backup. Speeds start around 30MB/s and drop to around 5-7MB/s. IOwait sits around 33.3 according to Glances. There is no other activity on the server at the same time.

 

What's interesting is that the array shows ~100MB/s of reads, but there is only a trickle going over the wire.

 


 

It's like the system is spinning its wheels trying to get the data ready to send, but can only send really slowly. For reference, I am "pulling" data from Unraid to macOS: I am running the rsync commands on macOS, connected to Unraid over the network. I have tried rsync over SSH and also via SMB, but there is no real difference.
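For completeness, the pull looks roughly like this (a sketch; the host, share name, and destination path are placeholders, not my actual paths):

  # Pull a share from Unraid to the Mac over SSH, preserving attributes and showing progress
  rsync -avh --progress root@tower.local:/mnt/user/backup/ /Volumes/Backup/
  # Same idea over SMB: mount the share in Finder first, then copy locally
  rsync -avh --progress /Volumes/unraid-backup/ /Volumes/Backup/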

 


 


I have a large Unraid server with 17 array drives, 2 parity drives, and a RAID1 SSD cache pool. On it I run 3 VMs and up to 12 Dockers, some of which are I/O intensive. I often see high iowait %; however, in my case I know it is because I am simply demanding too much from my disks, which would be fine if it didn't cripple every other part of the system.

 

Years back I found a way around iowait consuming the whole CPU. Linux allows you to isolate CPU cores from the system so you can dedicate them to other tasks (VMs/Docker). This way, when the system is crippled by iowait, your VMs and Docker containers can continue to function happily on the isolated cores, although IO may still suffer if they access the array/pool that is causing the iowait.

 

As I understand it, my situation is different than yours, but hopefully this trick will still help you work around some of the headaches.

 

In order to isolate the cores, you have to go into your flash drive and edit /syslinux/syslinux.cfg

 

Here is my default boot mode, which I have edited to include "append isolcpus=4-9,14-19". This option forces the system to run on cores 0-3 and 10-13, leaving the isolated cores idle.

 

label Unraid OS
  menu default
  kernel /bzimage
  append isolcpus=4-9,14-19 initrd=/bzroot

 

I have an old hyperthreaded 10-core Xeon, so I have 20 virtual cores, 0-19. I chose to keep 4 cores (8 threads) for the system, as plugins still run there, and I have isolated 6 cores (12 threads) for VMs and Docker containers. For this to work properly you must pin each VM and Docker container to the isolated cores of your choosing.
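A quick sanity check after a reboot, to confirm the isolation and pinning actually took effect (a sketch; the container name and PID are placeholders):

  # Cores the kernel was told to isolate at boot
  cat /sys/devices/system/cpu/isolated
  # Confirm a given container is pinned to those cores (replace the container name)
  docker inspect -f '{{.HostConfig.CpusetCpus}}' my-container
  # Or check which CPUs a running process is allowed on (replace <PID>)
  taskset -cp <PID>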

 

Now when you are plagued by iowait, your Dockers and VM's will still have processing power.

 

I hope this helps.

 

 

Edit: After looking into this a bit further, I found that this has been implemented in the GUI. Now you simply go to Settings->CPU Pinning.

Edited by lonnie776

I have more or less followed the same things as the OP, with similar "results" but no permanent fix. The issue became much more apparent after upgrading from 6.9 to 6.11.5.

 

I am consistently sitting between 5-10% IOwait. Anytime qBittorrent or any other container does large-scale file operations, it shoots the IOwait up to 30-50%, sometimes sitting there for hours at a time. This causes all network traffic to grind to a halt.

 

Some of what I have tried:

 

  • Swapped cache drives
  • Tried adding more cache drives
  • Tried splitting the workloads between cache drives
  • Switched all cache drives from BTRFS to XFS (greatly improved the baseline IOwait, but issues persist)
  • Switched Docker.img from BTRFS to XFS (again, improved the IOwait but issues persist)
  • Rebuilt Docker.img from 150GB -> 50GB after fixing naughty containers (no performance change)
  • Ensured docker containers were not writing to Docker.img after build (no performance change; one way to check this is sketched after this list)
  • Switched Docker.img from XFS to a directory (no change)
  • Tried adding better, faster pool drives (no perceived difference)
  • Replaced both CPUs, going from E5-2650 to E5-2650 v2
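The check referenced above is just the size of each container's writable layer, which is what ends up inside Docker.img (or the docker directory); a minimal sketch:

  # SIZE column = data written inside the container's writable layer.
  # A container that keeps growing here is writing into Docker.img instead of a mapped volume.
  docker ps --size --format 'table {{.Names}}\t{{.Size}}'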

 

What I am working to try:

  • Replacing all RAM with higher capacity sticks (128GB -> 384GB)

 

Things that really trigger the IOWait:

  • qBittorrent cache flushing (I get better IO and overall system performance if all qBittorrent caching is disabled)
  • Mover (with or without Nice)
  • Radarr/Sonarr (file analysis)
  • Sonarr (every 30 seconds on the Finished Download check; typically causes 5-6% IOwait for ~10 seconds each time)
  • Sabnzbd (no longer an issue once Nice was adjusted)
  • Unzip/unrar (any kind; I have to be incredibly harsh on the nice values to keep it from choking the server; one way is sketched after this list)
  • NFSv3 (full stop, any remote NFSv3 action causes massive IOwait, upwards of 40-50% on READ ONLY)
  • BTRFS (literally anything BTRFS causes issues on my R720XD; I do not see this on my other servers)
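For the unrar case, "harsh on the nice values" means something along these lines (a sketch; the archive and destination paths are placeholders):

  # Lowest CPU priority plus idle-class IO priority, so extraction only uses spare IO bandwidth
  nice -n 19 ionice -c 3 unrar x /mnt/user/downloads/example.rar /mnt/user/downloads/extracted/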

 

Specs:

  • R720XD
  • E5-2650V2
  • 128GB DDR3-1600 MHz
  • Parity - 2 Drives
    • 16TB WD Red
    • 18TB WD Gold
  • Array (not including Parity) - 16 Drives - 236TB Usable, all tested with DiskSpeed, monitored with  
    • Seagate 16TB Exos x7
    • WD x2
    • 14TB x4
    • 12TB x3
  • Cache Pools
    • Team 1TB (Weekly Appdata Backups)
    • P31 1TB (Appdata)
    • 1TB - WD Black NVME (Blank)
    • 4TB - Samsung 870 EVO (for download caching)
  • Dell Compellent SC200
  • Dell 165T0 BROADCOM 57800S QUAD PORT SFP+
  • Dell H200 6Gbps HBA LSI 9211

 

Working Hypothesis:

Monitoring with NetData, I notice that the IOwait jumps typically correlate with memory writeback, specifically dirty memory writeback. All my research points to either bad or insufficient RAM (which I will be swapping out, going to 384GB) or tunables that need further adjustment.
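For reference, the same numbers and the related tunables can be pulled from a shell like this (a sketch; the lowered values at the end are only an example of an adjustment, not a recommendation):

  # Live dirty / writeback page cache state, in kB
  grep -E '^(Dirty|Writeback):' /proc/meminfo
  # Current writeback tunables
  sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_expire_centisecs
  # Example adjustment only: start background writeback earlier and cap dirty memory lower
  sysctl -w vm.dirty_background_ratio=2
  sysctl -w vm.dirty_ratio=5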

Edited by maust
On 7/26/2023 at 5:37 PM, maust said:

...


This behavior also occurs anytime Mover runs.

Additionally, I tried swapping cache drives, adding more cache drives, and splitting the load between multiple cache drives. I also switched from BTRFS to XFS, which seemed to help throughput and lowered the baseline IOwait, but the problem persists.

 

Things that really trigger the IOwait:

  • qBittorrent cache (somehow I get better IO and system performance if all qBittorrent caching is disabled)
  • Mover (with or without Nice)
  • Radarr (file analysis)
  • Sabnzbd (no longer an issue once Nice was adjusted)

Interestingly, I only seem to have this issue on my main PowerEdge R720XD server on Unraid 6.11.5; none of my other servers, which are still running 6.10, are experiencing it.
 

I think I'm having the same issue on 6.11.5, but I don't know how to see the IOwait (see the commands sketched just below). I think I'm going to try NetData.

I also use my cache drive for qBittorrent, and when it is slamming the Samsung SSD pool I get unresponsive Dockers.
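A minimal sketch of checking iowait from a shell without extra tools (top ships with Unraid; vmstat should be there too via procps, though that is an assumption):

  # The "wa" value on the %Cpu(s) line is the iowait percentage
  top -bn1 | grep '%Cpu'
  # Watch the "wa" column over time, sampling every 5 seconds
  vmstat 5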

On 7/26/2023 at 5:37 PM, maust said:

...

 

Working Hypothesis:

Monitoring with NetData, I notice that the IOwait jumps typically correlate with memory writeback, specifically dirty memory writeback. All my research points to either bad or insufficient RAM (which I will be swapping out, going to 384GB) or tunables that need further adjustment.

As I am just another one on this long list: could you kindly guide me on how you monitored dirty memory writeback? That is a bit out of my depth, but I have tried almost everything else suggested in this thread, and in many others I have read over the past months. So if nothing else, maybe I could at least support your working hypothesis.

