
BTRFS / High IO Wait Issue


Solved by chris smashe


Latest information at the bottom.

 

 

 

System

Mother Board: 
ASUSTeK COMPUTER INC. B150M-C, Rev X.0x

CPU: Intel® Core™ i7-6700 CPU @ 3.40GHz

Memory:  32GB DDR4

GPU: NVIDIA GeForce GTX 1070

 

Hard Drives

Cache: Crucial/Micron Client SSD, 480 GB, SATA 3.3

Parity: WD-Plus 12 TB, 7200 RPM, SATA 3.1

Array Disk 1: WD-Plus 8 TB, 5640 RPM, SATA 3.1

Array Disk 2: Seagate Barracuda 1.5 TB, 7200 RPM, SATA 2.6 (3.0 Gb/s) (only holds music)

Array Disk 3: WD-Plus 8 TB, 5640 RPM, SATA 3.1

 

 

I'm having an issue with iowait locking up the system with high CPU usage. I turned everything off one by one and figured out it's Deluge. Once I stopped it, the iowait went to 0 and stayed there.

 

While this was going on, things were downloading but nothing finished, so there was no unpacking being done (that I am aware of).

 

Deluge is set up to download to the downloads share, which is set to cache pool = yes.

Once a file is done, the *arrs are set to move it to the array. That was not happening during this test.

 

 

My questions are:

How do I tell which disk is causing the iowait? At first I thought there was a problem with my cache drive (or Deluge), but while typing this out I also realized that most of the files Deluge is downloading have been going long enough that at least part of each download has already been moved off the cache drive to the array by the mover. Is this part of my problem? Do I need to change the downloads share's cache setting to "only" so nothing is moved to the array until one of the *arrs does it?

io wait 3.png
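For anyone narrowing this down, a minimal sketch of how to see which disk the waits are landing on, assuming iostat from sysstat is available (it is used later in this thread), is to watch per-device stats while the problem is happening:

# Extended per-device stats, refreshed every second (Ctrl+C to stop).
# A device sitting near 100 in %util with high await values is the one
# the system is waiting on.
iostat -dx 1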

 

io wait 1.png

io wait 2.png

smashenas-diagnostics-20221021-0909.zip


Additional note 2

I made Deluge download only items that are entirely on the cache drive (I moved any items that had been partly moved to the array to the bottom of the queue), and there is no change. I still have an iowait issue.

 

I changed the number of active torrents from 8 down to 1 and added one back every 2 minutes until I started to get iowait issues at 4. The first 3 had slow download speeds, though, while the 4th was doing up to 10-15 MiB/s. Not fast enough that it should matter for the SSD, that's for sure.

 

I stopped all other downloads except for that one and I still have an iowait issue.

This download is faster than the rest

and has more downloaded than the rest (22 GB total with 66.73% downloaded, where the others are less than 15%).

 

I moved another file to the top and ran it by itself; it is 46.8 GB total with 43% downloaded. This one is downloading slowly and there is no iowait issue.

I moved another file to the top that had nothing downloaded yet and has a higher download speed, and it did not have an iowait issue until it got to about 8% downloaded; then I started to get them again.

 

 

 


Additional note 3

OK, it's not Deluge-specific. I copied a 23 GB file from my computer to a share that saves to the cache drive. It was fine for a bit, but then had an ongoing iowait for the last 3/4 (ish) of the file. The max transfer rate was 113 MB/s; 13% complete was where it started to have iowait issues. It would drop to a transfer rate of 7 MB/s. Closing the VM had no impact on this.
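A hedged way to reproduce this without SMB in the picture, assuming the pool is mounted at /mnt/cache and has enough free space, is to write a large file directly on the server and watch whether the same stall shows up (if compression is enabled on the pool, zeroes compress, so treat the numbers as a rough indication only):

# Write a 10 GiB test file straight to the cache pool with direct I/O so the
# page cache is not hiding the writes, then clean up.
dd if=/dev/zero of=/mnt/cache/ddtest.bin bs=1M count=10240 oflag=direct status=progress
rm /mnt/cache/ddtest.bin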

 

 

I ran DiskSpeed against this, and while it was scanning it would have iowait issues. It also had a strange drop-off and then a rise. I ran it again and it was even stranger.

 

I ran DiskSpeed against one of the 8 TB spinning drives: max iowait of 12.3%.
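For a quick cross-check outside of the DiskSpeed container, a minimal sketch is hdparm's buffered read timing; the device name here is an example and should be replaced with the one shown on the Main page:

# Sequential buffered read test, roughly 3 seconds per run; repeat a couple of
# times and compare the MB/s figures.
hdparm -t /dev/sdX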

 

These are all of the tests that I can think of to run to give you data.

 

Thanks for any help you can give me.

 


 


I changed the share for the test so the 24 GB file goes straight to the array instead of the cache drive. It stayed at 113 MB/s longer before it dropped, and when it did drop, it went as low as 53 MB/s. Not as bad as going to the cache drive, but still not great.

 

The server is not doing anything else but this.

iowait is hovering around 16% in this configuration, which is lower than when writing to the cache drive but still not great.

 

Nothing should be saturating the array, and definitely not the SSD. I'm not 100% sure what causes iowaits other than the disk, so if anyone can help me figure out why this is happening, that would be greatly appreciated.
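As a rough rule, iowait is CPU time spent idle while at least one task is blocked waiting on a device, so a stalling disk shows up as the wa column climbing while throughput drops. A minimal sketch for watching it during a transfer, assuming the usual procps vmstat:

# One line per second: wa is the iowait percentage, and b counts tasks blocked
# in uninterruptible (D) sleep.
vmstat 1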

 

 

 

 


I changed out the SATA cable. No change.
I stopped Docker and ran the file transfer above, and when it hung I ran ps auxf; the only thing with a STAT of D is [kworker/u16:5+btrfs-worker].

 

I don't know what this means, but hopefully it helps.

 

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         2  0.0  0.0      0     0 ?        S    15:57   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [rcu_gp]
root         4  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [rcu_par_gp]
root         5  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [slub_flushwq]
root         6  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [netns]
root         8  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [kworker/0:0H-events_highpri]
root        10  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [kworker/0:1H-kblockd]
root        11  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [mm_percpu_wq]
root        12  0.0  0.0      0     0 ?        I    15:57   0:00  \_ [rcu_tasks_kthread]
root        13  0.0  0.0      0     0 ?        I    15:57   0:00  \_ [rcu_tasks_rude_kthread]
root        14  0.0  0.0      0     0 ?        I    15:57   0:00  \_ [rcu_tasks_trace_kthread]
root        15  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [ksoftirqd/0]
root        16  0.1  0.0      0     0 ?        I    15:57   0:02  \_ [rcu_preempt]
root        17  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [migration/0]
root        18  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [cpuhp/0]
root        19  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [cpuhp/1]
root        20  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [migration/1]
root        21  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [ksoftirqd/1]
root        23  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [kworker/1:0H-events_highpri]
root        24  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [cpuhp/2]
root        25  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [migration/2]
root        26  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [ksoftirqd/2]
root        28  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [kworker/2:0H-events_highpri]
root        29  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [cpuhp/3]
root        30  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [migration/3]
root        31  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [ksoftirqd/3]
root        33  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [kworker/3:0H-events_highpri]
root        34  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [cpuhp/4]
root        35  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [migration/4]
root        36  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [ksoftirqd/4]
root        37  0.0  0.0      0     0 ?        I    15:57   0:00  \_ [kworker/4:0-mm_percpu_wq]
root        38  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [kworker/4:0H-events_highpri]
root        39  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [cpuhp/5]
root        40  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [migration/5]
root        41  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [ksoftirqd/5]
root        43  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [kworker/5:0H-events_highpri]
root        44  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [cpuhp/6]
root        45  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [migration/6]
root        46  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [ksoftirqd/6]
root        48  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [kworker/6:0H-kblockd]
root        49  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [cpuhp/7]
root        50  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [migration/7]
root        51  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [ksoftirqd/7]
root        53  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [kworker/7:0H-kblockd]
root        54  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [kdevtmpfs]
root        55  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [inet_frag_wq]
root        57  0.0  0.0      0     0 ?        I    15:57   0:00  \_ [kworker/1:1-events]
root        59  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [oom_reaper]
root        61  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [writeback]
root        62  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [kcompactd0]
root        63  0.0  0.0      0     0 ?        SN   15:57   0:00  \_ [ksmd]
root        64  0.0  0.0      0     0 ?        SN   15:57   0:00  \_ [khugepaged]
root        65  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [kintegrityd]
root        66  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [kblockd]
root        67  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [blkcg_punt_bio]
root        68  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [ata_sff]
root        69  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [devfreq_wq]
root        71  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [kworker/1:1H-kblockd]
root        94  0.2  0.0      0     0 ?        I    15:57   0:07  \_ [kworker/u16:4-btrfs-endio-write]
root       116  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [kswapd0]
root       120  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [kthrotld]
root       125  0.0  0.0      0     0 ?        I    15:57   0:00  \_ [kworker/7:2-rcu_gp]
root       171  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [xenbus_probe]
root       241  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [target_completi]
root       242  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [target_submissi]
root       243  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [xcopy_wq]
root       250  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [vfio-irqfd-clea]
root       268  0.2  0.0      0     0 ?        I    15:57   0:05  \_ [kworker/u16:6-btrfs-endio-write]
root       274  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [kworker/2:1H-kblockd]
root       335  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [kstrp]
root       380  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [kworker/4:1H-kblockd]
root       381  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [kworker/3:1H-kblockd]
root       385  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [scsi_eh_0]
root       386  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [scsi_tmf_0]
root       388  0.0  0.0      0     0 ?        S    15:57   0:00  \_ [usb-storage]
root       422  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [kworker/7:1H-events_highpri]
root       423  0.0  0.0      0     0 ?        I<   15:57   0:00  \_ [kworker/6:1H-events_highpri]
root       764  0.0  0.0      0     0 ?        I<   15:59   0:00  \_ [acpi_thermal_pm]
root       774  0.0  0.0      0     0 ?        S    15:59   0:00  \_ [scsi_eh_1]
root       775  0.0  0.0      0     0 ?        I<   15:59   0:00  \_ [kworker/5:1H-kblockd]
root       776  0.0  0.0      0     0 ?        I<   15:59   0:00  \_ [scsi_tmf_1]
root       777  0.0  0.0      0     0 ?        S    15:59   0:00  \_ [scsi_eh_2]
root       778  0.0  0.0      0     0 ?        I<   15:59   0:00  \_ [scsi_tmf_2]
root       779  0.0  0.0      0     0 ?        S    15:59   0:00  \_ [scsi_eh_3]
root       780  0.0  0.0      0     0 ?        I<   15:59   0:00  \_ [scsi_tmf_3]
root       781  0.0  0.0      0     0 ?        S    15:59   0:00  \_ [scsi_eh_4]
root       782  0.0  0.0      0     0 ?        I<   15:59   0:00  \_ [scsi_tmf_4]
root       784  0.0  0.0      0     0 ?        S    15:59   0:00  \_ [scsi_eh_5]
root       785  0.0  0.0      0     0 ?        I<   15:59   0:00  \_ [scsi_tmf_5]
root       786  0.0  0.0      0     0 ?        S    15:59   0:00  \_ [scsi_eh_6]
root       787  0.0  0.0      0     0 ?        I<   15:59   0:00  \_ [scsi_tmf_6]
root       794  0.0  0.0      0     0 ?        I<   15:59   0:00  \_ [cryptd]
root       829  0.0  0.0      0     0 ?        I    15:59   0:00  \_ [kworker/4:2-rcu_gp]
root       836  0.0  0.0      0     0 ?        S    15:59   0:00  \_ [nv_queue]
root       837  0.0  0.0      0     0 ?        S    15:59   0:00  \_ [nv_queue]
root       844  0.0  0.0      0     0 ?        S    15:59   0:00  \_ [nvidia-modeset/kthread_q]
root       845  0.0  0.0      0     0 ?        S    15:59   0:00  \_ [nvidia-modeset/deferred_close_kthread_q]
root       987  0.0  0.0      0     0 ?        I<   15:59   0:00  \_ [mld]
root       988  0.0  0.0      0     0 ?        I<   15:59   0:00  \_ [ipv6_addrconf]
root       999  0.0  0.0      0     0 ?        I<   15:59   0:00  \_ [bond0]
root      1024  0.0  0.0      0     0 ?        I    15:59   0:00  \_ [kworker/3:3-events]
root      1136  0.0  0.0      0     0 ?        I<   15:59   0:00  \_ [wg-crypt-wg0]
root      4581  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [md]
root      4582  0.0  0.0      0     0 ?        S    16:00   0:00  \_ [mdrecoveryd]
root      5163  0.0  0.0      0     0 ?        S    16:00   0:00  \_ [unraidd0]
root      5164  0.0  0.0      0     0 ?        S    16:00   0:00  \_ [unraidd1]
root      5165  0.0  0.0      0     0 ?        S    16:00   0:00  \_ [unraidd2]
root      5166  0.0  0.0      0     0 ?        S    16:00   0:00  \_ [unraidd3]
root      5199  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfsalloc]
root      5200  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs_mru_cache]
root      5202  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs-buf/md1]
root      5203  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs-conv/md1]
root      5204  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs-reclaim/md1]
root      5205  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs-blockgc/md1]
root      5206  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs-inodegc/md1]
root      5207  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs-log/md1]
root      5208  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs-cil/md1]
root      5209  0.0  0.0      0     0 ?        S    16:00   0:00  \_ [xfsaild/md1]
root      5220  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs-buf/md2]
root      5221  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs-conv/md2]
root      5222  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs-reclaim/md2]
root      5223  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs-blockgc/md2]
root      5224  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs-inodegc/md2]
root      5225  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs-log/md2]
root      5226  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs-cil/md2]
root      5227  0.0  0.0      0     0 ?        S    16:00   0:00  \_ [xfsaild/md2]
root      5238  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs-buf/md3]
root      5239  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs-conv/md3]
root      5240  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs-reclaim/md3]
root      5241  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs-blockgc/md3]
root      5242  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs-inodegc/md3]
root      5243  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs-log/md3]
root      5244  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [xfs-cil/md3]
root      5245  0.0  0.0      0     0 ?        S    16:00   0:00  \_ [xfsaild/md3]
root      5257  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [btrfs-worker]
root      5259  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [btrfs-worker-hi]
root      5260  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [btrfs-delalloc]
root      5261  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [btrfs-flush_del]
root      5262  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [btrfs-cache]
root      5263  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [btrfs-fixup]
root      5264  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [btrfs-endio]
root      5265  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [btrfs-endio-met]
root      5266  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [btrfs-endio-met]
root      5267  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [btrfs-endio-rai]
root      5268  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [btrfs-rmw]
root      5269  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [btrfs-endio-wri]
root      5270  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [btrfs-freespace]
root      5271  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [btrfs-delayed-m]
root      5272  0.0  0.0      0     0 ?        I<   16:00   0:00  \_ [btrfs-qgroup-re]
root      5273  0.5  0.0      0     0 ?        S    16:00   0:12  \_ [btrfs-cleaner]
root      5274  0.0  0.0      0     0 ?        S    16:00   0:01  \_ [btrfs-transaction]
root      7859  0.0  0.0      0     0 ?        S    16:01   0:00  \_ [irq/128-nvidia]
root      7860  0.0  0.0      0     0 ?        S    16:01   0:00  \_ [nvidia]
root      7861  0.0  0.0      0     0 ?        S    16:01   0:00  \_ [nv_queue]
root      8129  0.0  0.0      0     0 ?        I<   16:01   0:00  \_ [dio/sdd1]
root      9607  0.0  0.0      0     0 ?        I    16:02   0:00  \_ [kworker/0:3-mm_percpu_wq]
root      9720  0.0  0.0      0     0 ?        I<   16:02   0:00  \_ [btrfs-worker]
root      9722  0.0  0.0      0     0 ?        I<   16:02   0:00  \_ [btrfs-worker-hi]
root      9723  0.0  0.0      0     0 ?        I<   16:02   0:00  \_ [btrfs-delalloc]
root      9724  0.0  0.0      0     0 ?        I<   16:02   0:00  \_ [btrfs-flush_del]
root      9725  0.0  0.0      0     0 ?        I<   16:02   0:00  \_ [btrfs-cache]
root      9726  0.0  0.0      0     0 ?        I<   16:02   0:00  \_ [btrfs-fixup]
root      9727  0.0  0.0      0     0 ?        I<   16:02   0:00  \_ [btrfs-endio]
root      9728  0.0  0.0      0     0 ?        I<   16:02   0:00  \_ [btrfs-endio-met]
root      9729  0.0  0.0      0     0 ?        I<   16:02   0:00  \_ [btrfs-endio-met]
root      9730  0.0  0.0      0     0 ?        I<   16:02   0:00  \_ [btrfs-endio-rai]
root      9731  0.0  0.0      0     0 ?        I<   16:02   0:00  \_ [btrfs-rmw]
root      9732  0.0  0.0      0     0 ?        I<   16:02   0:00  \_ [btrfs-endio-wri]
root      9733  0.0  0.0      0     0 ?        I<   16:02   0:00  \_ [btrfs-freespace]
root      9734  0.0  0.0      0     0 ?        I<   16:02   0:00  \_ [btrfs-delayed-m]
root      9735  0.0  0.0      0     0 ?        I<   16:02   0:00  \_ [btrfs-qgroup-re]
root      9767  0.0  0.0      0     0 ?        S    16:02   0:00  \_ [btrfs-cleaner]
root      9768  0.0  0.0      0     0 ?        S    16:02   0:00  \_ [btrfs-transaction]
root     12543  0.0  0.0      0     0 ?        S    16:02   0:00  \_ [UVM global queue]
root     12544  0.0  0.0      0     0 ?        S    16:02   0:00  \_ [UVM deferred release queue]
root     12546  0.0  0.0      0     0 ?        S    16:02   0:00  \_ [UVM Tools Event Queue]
root     16555  0.0  0.0      0     0 ?        I    16:03   0:00  \_ [kworker/2:4-events]
root     16557  0.0  0.0      0     0 ?        I    16:03   0:00  \_ [kworker/2:6-rcu_gp]
root     27361  0.0  0.0      0     0 ?        I    16:09   0:00  \_ [kworker/7:0-events]
root     31055  0.0  0.0      0     0 ?        I    16:11   0:00  \_ [kworker/0:0-inet_frag_wq]
root     31982  0.0  0.0      0     0 ?        I    16:11   0:00  \_ [kworker/1:0-events]
root      1258  0.0  0.0      0     0 ?        I    16:12   0:00  \_ [kworker/3:1-mm_percpu_wq]
root      1949  0.0  0.0      0     0 ?        I    16:13   0:00  \_ [kworker/6:11-cgroup_destroy]
root      1974  0.0  0.0      0     0 ?        I    16:13   0:00  \_ [kworker/5:12-cgroup_destroy]
root      1975  0.0  0.0      0     0 ?        I    16:13   0:00  \_ [kworker/5:13-events]
root     29577  0.0  0.0      0     0 ?        I    16:27   0:00  \_ [kworker/6:0-mm_percpu_wq]
root     29912  0.2  0.0      0     0 ?        I    16:27   0:01  \_ [kworker/u16:3-btrfs-endio-write]
root     29958  0.2  0.0      0     0 ?        I    16:27   0:01  \_ [kworker/u16:7-btrfs-endio-write]
root     30289  0.0  0.0      0     0 ?        I<   16:28   0:00  \_ [kworker/u17:1-btrfs-worker-high]
root     30290  0.0  0.0      0     0 ?        I<   16:28   0:00  \_ [kworker/u17:2-btrfs-worker-high]
root     30293  0.0  0.0      0     0 ?        I<   16:28   0:00  \_ [kworker/u17:5-btrfs-worker-high]
root     30331  0.3  0.0      0     0 ?        I    16:28   0:02  \_ [kworker/u16:13-events_unbound]
root      6970  0.3  0.0      0     0 ?        I    16:33   0:01  \_ [kworker/u16:0-bond0]
root      6971  0.3  0.0      0     0 ?        I    16:33   0:01  \_ [kworker/u16:1-events_unbound]
root      8726  0.1  0.0      0     0 ?        I    16:34   0:00  \_ [kworker/u16:2-btrfs-worker]
root     12431  0.0  0.0      0     0 ?        I    16:36   0:00  \_ [kworker/6:1-cgroup_pidlist_destroy]
root     12573  0.0  0.0      0     0 ?        I    16:36   0:00  \_ [kworker/5:0-events]
root     12586  0.0  0.0      0     0 ?        I    16:36   0:00  \_ [kworker/5:1-mm_percpu_wq]
root     12696  0.0  0.0      0     0 ?        I    16:36   0:00  \_ [kworker/1:2-mm_percpu_wq]
root     12730  0.0  0.0      0     0 ?        I    16:36   0:00  \_ [kworker/7:1]
root     12739  0.0  0.0      0     0 ?        I    16:36   0:00  \_ [kworker/2:0-mm_percpu_wq]
root     13085  0.0  0.0      0     0 ?        I    16:36   0:00  \_ [kworker/0:1-events]
root     13094  0.0  0.0      0     0 ?        I    16:36   0:00  \_ [kworker/4:1-rcu_gp]
root     13306  0.0  0.0      0     0 ?        D    16:36   0:00  \_ [kworker/u16:5+btrfs-worker]
root     13860  0.0  0.0      0     0 ?        I<   16:37   0:00  \_ [kworker/u17:0-btrfs-worker-high]
root     13861  0.0  0.0      0     0 ?        I<   16:37   0:00  \_ [kworker/u17:3]
root     13862  0.0  0.0      0     0 ?        I<   16:37   0:00  \_ [kworker/u17:4]
root         1  0.0  0.0   2592   920 ?        Ss   15:57   0:00 init
root       734  0.0  0.0  18016  3540 ?        Ss   15:59   0:00 /sbin/udevd --daemon
root       926  0.0  0.0 211388  4168 ?        Ssl  15:59   0:00 /usr/sbin/rsyslogd -i /var/run/rsyslogd.pid
dhcpcd    1034  0.0  0.0   3032  1976 ?        S    15:59   0:00 dhcpcd: br0 [ip4]
root      1035  0.0  0.0   3056  2140 ?        S    15:59   0:00  \_ dhcpcd: [privileged proxy] br0 [ip4]
dhcpcd    1047  0.0  0.0   3044   288 ?        S    15:59   0:00  |   \_ dhcpcd: [BPF ARP] br0 192.168.1.104
dhcpcd    1087  0.0  0.0   3056   292 ?        S    15:59   0:00  |   \_ dhcpcd: [network proxy] 192.168.1.104
dhcpcd    1036  0.0  0.0   3032   272 ?        S    15:59   0:00  \_ dhcpcd: [control proxy] br0 [ip4]
root      1178  0.0  0.0   3184   164 ?        Ss   15:59   0:00 /usr/sbin/mcelog --daemon
message+  1188  0.0  0.0   5208  2080 ?        Ss   15:59   0:00 /usr/bin/dbus-daemon --system
root      1201  0.0  0.0   4272  2416 ?        S    15:59   0:00 elogind-daemon
ntp       1234  0.0  0.0  74552  4512 ?        Ssl  15:59   0:00 /usr/sbin/ntpd -g -u ntp:ntp
root      1241  0.0  0.0   2600   104 ?        Ss   15:59   0:00 /usr/sbin/acpid
root      1255  0.0  0.0   2576  1788 ?        Ss   15:59   0:00 /usr/sbin/crond
daemon    1259  0.0  0.0   2656  1540 ?        Ss   15:59   0:00 /usr/sbin/atd -b 15 -l 1
daemon    2483  0.0  0.0   5148  3080 ?        S    16:00   0:00  \_ /usr/sbin/atd -b 15 -l 1
root      2486  0.0  0.0   3980  2992 ?        S    16:00   0:00      \_ sh
root      2487  0.0  0.0   4344  2756 ?        S    16:00   0:00          \_ sh
root      2488  0.0  0.0   2772   860 ?        S    16:00   0:00              \_ inotifywait -q /boot/changes.txt -e move_self,delete_self,modify
root      4537  0.0  0.0   2608   916 tty1     Ss+  16:00   0:00 /sbin/agetty --noclear 38400 tty1 linux
root      4538  0.0  0.0   2608   932 tty2     Ss+  16:00   0:00 /sbin/agetty 38400 tty2 linux
root      4539  0.0  0.0   2608   916 tty3     Ss+  16:00   0:00 /sbin/agetty 38400 tty3 linux
root      4540  0.0  0.0   2608   888 tty4     Ss+  16:00   0:00 /sbin/agetty 38400 tty4 linux
root      4541  0.0  0.0   2608   908 tty5     Ss+  16:00   0:00 /sbin/agetty 38400 tty5 linux
root      4542  0.0  0.0   2608   912 tty6     Ss+  16:00   0:00 /sbin/agetty 38400 tty6 linux
root      4574  0.0  0.0   4344   240 ?        Ss   16:00   0:00 /usr/sbin/inetd
root      4575  0.3  0.0 412916  6640 ?        Ssl  16:00   0:06 /usr/local/sbin/emhttpd
root      4603  0.0  0.0 142600  2176 ?        Ssl  16:00   0:00 /sbin/apcupsd
root      4706  0.0  0.0  88096  9984 ?        Ss   16:00   0:00 php-fpm: master process (/etc/php-fpm/php-fpm.conf)
root     10164  0.1  0.0  91736 14100 ?        S    16:35   0:00  \_ php-fpm: pool www
root     11758  0.1  0.0  91864 14608 ?        S    16:35   0:00  \_ php-fpm: pool www
root     12837  0.1  0.0  91672 13516 ?        S    16:36   0:00  \_ php-fpm: pool www
root     13632  0.1  0.0  91672 13704 ?        S    16:37   0:00  \_ php-fpm: pool www
root     14014  0.1  0.0  91672 13504 ?        S    16:37   0:00  \_ php-fpm: pool www
root     14518  0.1  0.0  91672 13516 ?        S    16:37   0:00  \_ php-fpm: pool www
root     14609  0.1  0.0  91672 13516 ?        S    16:37   0:00  \_ php-fpm: pool www
root      4919  0.0  0.0 147708  4388 ?        Ss   16:00   0:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
root      4920  0.5  0.0 149504 10508 ?        S    16:00   0:12  \_ nginx: worker process
root      4951  0.9  0.4 11417656 157852 ?     SLsl 16:00   0:21 /usr/local/bin/unraid-api/unraid-api /snapshot/api/dist/unraid-api.cjs start
root      4972  0.0  0.0   5172  3632 ?        S    16:00   0:00 /bin/bash /etc/rc.d/rc.flash_backup watch
root     14638  0.0  0.0   2576   872 ?        S    16:37   0:00  \_ sleep 60
root      5371  0.0  0.0  91220 29040 ?        SL   16:00   0:00 /usr/bin/php -q /usr/local/emhttp/webGui/nchan/notify_poller
root      5373  0.0  0.0  91220 29016 ?        SL   16:00   0:00 /usr/bin/php -q /usr/local/emhttp/webGui/nchan/session_check
root      5375  0.0  0.0  91284 29728 ?        SL   16:00   0:00 /usr/bin/php -q /usr/local/emhttp/plugins/dynamix.system.temp/nchan/system_temp
root      5377  0.1  0.0  91456 29240 ?        SL   16:00   0:03 /usr/bin/php -q /usr/local/emhttp/webGui/nchan/device_list
root      5379  0.0  0.0   4016  3036 ?        S    16:00   0:01 /bin/bash /usr/local/emhttp/webGui/nchan/disk_load
root     16043  0.0  0.0   2576   932 ?        S    16:38   0:00  \_ sleep 2
root      5381  0.1  0.0  91220 29188 ?        SL   16:00   0:02 /usr/bin/php -q /usr/local/emhttp/webGui/nchan/parity_list
root      5775  0.0  0.0  91284 29540 ?        SL   16:01   0:01 /usr/bin/php -q /usr/local/emhttp/webGui/nchan/wg_poller
root      5777  0.0  0.0  91220 29216 ?        SL   16:01   0:00 /usr/bin/php -q /usr/local/emhttp/webGui/nchan/update_1
root      5779  0.0  0.0  91416 29616 ?        SL   16:01   0:01 /usr/bin/php -q /usr/local/emhttp/webGui/nchan/update_2
root      5781  0.1  0.0  91220 29264 ?        SL   16:01   0:03 /usr/bin/php -q /usr/local/emhttp/webGui/nchan/update_3
root      7727  0.0  0.0 140236   304 ?        Ssl  16:01   0:00 /usr/local/sbin/shfs /mnt/user0 -disks 14 -o default_permissions,allow_other,noat
root      7739  3.1  0.0 693092 21496 ?        Ssl  16:01   1:08 /usr/local/sbin/shfs /mnt/user -disks 15 -o default_permissions,allow_other,noati
root      9785  0.0  0.0  31864  9560 ?        S    16:02   0:00 /usr/sbin/virtlockd -d -f /etc/libvirt/virtlockd.conf -p /var/run/libvirt/virtloc
root      9790  0.0  0.0  31936 10460 ?        S    16:02   0:00 /usr/sbin/virtlogd -d -f /etc/libvirt/virtlogd.conf -p /var/run/libvirt/virtlogd.
root      9809  0.0  0.0 1391124 21664 ?       Sl   16:02   0:00 /usr/sbin/libvirtd -d -l -f /etc/libvirt/libvirtd.conf -p /var/run/libvirt/libvir
nobody    9918  0.0  0.0   7484  3096 ?        S    16:02   0:00 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-r
root      9919  0.0  0.0   7352   332 ?        S    16:02   0:00  \_ /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefi
root      4564  0.0  0.0  12324  5000 ?        Sl   16:32   0:00 /usr/bin/ttyd -d0 -t disableLeaveAlert true -t theme {'background':'black'} -t fo
root      4569  0.0  0.0   7604  4308 pts/0    Ss   16:32   0:00  \_ bash --login
root     16077  0.0  0.0   5172  3004 pts/0    R+   16:38   0:00  |   \_ ps auxf
root     14384  0.0  0.0   7604  4240 pts/1    Ss   16:37   0:00  \_ bash --login
root     14460  0.1  0.0   5580  3608 pts/1    S+   16:37   0:00      \_ top
root     12001  0.0  0.0  79008 13516 ?        Ss   16:36   0:00 /usr/sbin/smbd -D
root     12006  0.0  0.0  77216  9116 ?        S    16:36   0:00  \_ /usr/sbin/smbd -D
root     12007  0.0  0.0  77208  4352 ?        S    16:36   0:00  \_ /usr/sbin/smbd -D
csmashe  14228 40.1  0.0  80940 17064 ?        S    16:37   0:30  \_ /usr/sbin/smbd -D
root     12014  0.0  0.0   2588  2004 ?        Ss   16:36   0:00 /usr/sbin/wsdd2 -d
root     12017  0.0  0.0  75032 11940 ?        Ss   16:36   0:00 /usr/sbin/winbindd -D
root     12020  0.0  0.0  75232 12792 ?        S    16:36   0:00  \_ /usr/sbin/winbindd -D
root     13492  0.0  0.0  75860 10448 ?        S    16:37   0:00  \_ /usr/sbin/winbindd -D
avahi    12049  0.0  0.0   5640  3404 ?        S    16:36   0:00 avahi-daemon: running [SmasheNas.local]
avahi    12052  0.0  0.0   5344   268 ?        S    16:36   0:00  \_ avahi-daemon: chroot helper
root     12060  0.0  0.0   2676   120 ?        S    16:36   0:00 /usr/sbin/avahi-dnsconfd -D
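A hedged way to keep an eye on tasks stuck in uninterruptible sleep (state D) like that kworker, without scanning the full ps output each time, is to filter on the state column; re-run it (or wrap it in watch) while the transfer is stalling:

# List only processes currently in D state, with the kernel function they are
# waiting in (wchan).
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'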

 


So it is definitely an issue with the cache drive or btrfs.

I ran iotop, and here is a video of it running while I do a transfer and Deluge is running. A good portion of the time the btrfs worker is at 99% IO but has no reads or writes listed. Then when you get to 31 seconds it is the only thing showing (I used the -o flag to only show processes that have IO), since it looks like it's blocking everything else from running and causing the wait.

I have also seen nothing listed at all for a couple of seconds in my tests, but that is not shown in this video.

Also attached is an iostat image
 



My next steps are to swap the drive for another drive and, if that does not work, change the drive from btrfs to XFS and see if that helps. That will be later, so if anyone has any ideas before I start, that would be helpful.
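Before reformatting, a couple of hedged checks against the existing btrfs pool can be worth running; the /mnt/cache path is an assumption and should match where the pool is mounted:

# Per-device error counters tracked by btrfs (read, write, flush, corruption,
# generation errors).
btrfs device stats /mnt/cache

# Full scrub of the pool; -B keeps it in the foreground and prints a summary.
btrfs scrub start -B /mnt/cache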

 

iostat.png

  • Solution

I have fixed the problem, I think. So far, so good. I wanted to try everything before I started updating the hardware, so I found and applied the latest available BIOS update for my motherboard. I still have some iowaits that go up into the 50s in certain situations, but so far nothing I have thrown at it has locked up the server while it waits for IO to complete.

 

Most situations that I have tested still allowed me to transfer my test file at full speed with no degradation, even if there was an iowait reported. The 2 situations in which I was able to make the speed slow down slightly were if Tdarr was transcoding multiple files (3 at a time, plus one copy job could be running) or if Plex was detecting intros. Both of those situations made my transfer speed drop about 20% at most, but it no longer drops to 0 and locks up the server until the IO catches up. I would like to have little to no iowait, but I will take the server always responding without issue as a win for now.

 

While trying to figure this out the past 2 days, I found multiple posts about people having the same issues, but none of the posts have solutions, so I want to post what helped me here.

First, update your BIOS if there is an update. Something about the BIOS my server was using (which was old) and btrfs did not work well together. This fixed my issue, so save yourself 2 days and try this first.

 

If that does not fix your issue, here are a couple of things you can do to try to diagnose what is causing it.

 

First, install Glances from Apps. It's an easy way to view iowait and see the tasks that are using your CPU, and you can watch/log iowaits while you try different scenarios. You can get the same information with the top command, but Glances makes it easier to view, and it logs the last 9 warning/critical issues.
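For anyone without Glances, a minimal equivalent with plain top is to watch the wa field on the %Cpu(s) line, or grab it non-interactively:

# One batch sample of top; the "wa" value on the %Cpu(s) line is the iowait
# percentage at that moment.
top -b -n 1 | head -n 5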

 

Next, run iostat with the following command: iostat -hymx 1 4

This will give you output like the example below. Look to see whether any disk is at 100% usage, as mine was below. If so, that is the disk that is having issues.

iostat.png
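For reference, a sketch of what those switches do, based on the standard sysstat options: -h gives human-readable output, -y skips the since-boot summary, -m reports in MB/s, -x adds extended per-device statistics, and "1 4" samples every second, four times.

# %util shows how busy each device is; the await columns show average I/O
# latency in milliseconds.
iostat -hymx 1 4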

 

Next, install DiskSpeed from Apps and run the benchmark on that drive. In my case, the drive looked fine when I ran the benchmark without other things running, but it helped me rule out the disk (somewhat, since I still thought we could have a disk issue).

 

Lastly, install Nerd Tools, and then inside Nerd Tools install iotop. After you install iotop, run it using the command: iotop -o

This will list all the tasks that are using IO in real time. Keep it open and do things on your server that cause a high iowait percentage to see which task is causing the problem. In my case, it was the btrfs worker (shown in the YouTube video linked in this thread), so if my BIOS update had not worked, I was going to swap out the cache drive and change it from btrfs to XFS as a last resort.
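If you want a log to read back after a stall instead of watching live, a hedged variant is iotop's batch mode:

# -o only tasks doing I/O, -b non-interactive batch output, -n 30 samples;
# the output file path is just an example.
iotop -o -b -n 30 > /tmp/iotop.log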

 

If iotop is not showing the percentage and shows unavailable, run the following command and then run iotop again. It might still show unavailable for a minute, but after that it started showing for me.

echo 1 > /proc/sys/kernel/task_delayacct
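That toggle resets on reboot. A hedged alternative that does the same thing through sysctl, on kernels that expose task_delayacct as a sysctl (which this one does, since the /proc path above exists), is:

# Enable per-task delay accounting so iotop can report IO> percentages.
sysctl -w kernel.task_delayacct=1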

 

 

Hopefully this helps anyone else having this issue. If it starts happening again, I will open this thread back up.

 

 

  • 1 month later...
On 10/23/2022 at 4:57 AM, chris smashe said:

I have fixed the problem, I think. So far, so good. ...

Hi Chris, how do you run iotop and iostat on Unraid? I can't use yum to install these commands.

Thanks

  • 1 year later...

Thanks for all the work you put into this, Chris. I recently converted my old AMD FX-4300 Black based Unraid server to an i7-6700K based platform with 32 GB RAM. I added a Samsung SATA SSD, a Crucial NVMe drive, and an LSI HBA to attach my stack of spinning rust.

 

I had been struggling with iowait issues ever since first booting the new platform and had trouble diagnosing them. After reading your post I moved all my appdata and VMs off my NVMe drive, converted it from BTRFS to XFS, then moved everything back. Now it seems my iowait issues are solved. I occasionally see 5-6% iowait in top, but only if I'm doing something file-intensive like a Plex library scan.
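For anyone repeating this, a hedged outline of the move; the paths are examples, and Docker and the VM service should be stopped first so nothing holds the files open:

# Copy appdata off the pool (preserving permissions and sparse files), reformat
# the pool to XFS in the GUI, then copy everything back.
rsync -avh --sparse /mnt/cache/appdata/ /mnt/disk1/appdata-backup/
rsync -avh --sparse /mnt/disk1/appdata-backup/ /mnt/cache/appdata/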

 

Thanks for pointing me in the right direction.

