Need help tracking down an I/O bottleneck/issue.


This issue has been plaguing me for a while now and I finally want to get to the bottom of it.
Sometimes when doing a file transfer, the transfer will eventually drop down to 0kbps for a while (10+ seconds) before resuming, building up speed again and then dropping back to 0.
It then bounces around like this until the copy finishes.

It doesn't seem to matter how I'm doing the transfer or what the source/target drive is. It's always the same.

I can recreate the issue 100% of the time when I'm copying a file in a VM, but it also happens when copying a file over the network to a share, or if I'm moving from an unassigned drive to a disk (not user share).
ANY sustained disk activity seems to exhibit the problem.
It seems to happen the most when copying larger files, however lots of smaller copies in a short amount of time can also cause the issue.
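
One way to capture the behavior described above outside of a VM or file manager is a small chunked copy that timestamps every chunk. The following is only a minimal sketch: the function names, the 4 MiB chunk size, and the 5-second stall threshold are assumptions chosen for illustration, not anything taken from Unraid or its tools.

```python
import os
import time


def timed_copy(src, dst, chunk_size=4 * 1024 * 1024, sync=True):
    """Copy src to dst in chunks; return a list of (elapsed_seconds, total_bytes)."""
    samples = []
    total = 0
    start = time.monotonic()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            if sync:
                # Push the chunk to the device so the page cache does not hide a stall.
                fout.flush()
                os.fsync(fout.fileno())
            total += len(chunk)
            samples.append((time.monotonic() - start, total))
    return samples


def find_stalls(samples, stall_seconds=5.0):
    """Return (start, end) times of chunks that took at least stall_seconds."""
    stalls = []
    previous = 0.0
    for elapsed, _total in samples:
        if elapsed - previous >= stall_seconds:
            stalls.append((previous, elapsed))
        previous = elapsed
    return stalls
```

Calling `find_stalls(timed_copy(src, dst))` with a large source file and a destination on the drive under test would list each period where a single chunk took 5 seconds or longer. The timing covers both the read and the write of each chunk, so a slow source drive shows up as well.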

I have just tried booting UnRaid using a brand new trial version of 6.10.3, which means everything is default settings and 'fresh'.
There was a single SSD in the array and NO parity.

While copying the Win11 ISO over to the "ISO" share the issue happened again. I didn't even need to create a VM to recreate the issue this time.
This tells me it's hardware related.

After I encountered the issue, I then installed "Glances" and "DiskSpeed" dockers.

System Specs are:

Unraid: Unraid server Pro, version 6.10.3
Motherboard: ASUSTeK COMPUTER INC. RAMPAGE IV BLACK EDITION Version Rev 1.xx
CPU: Intel® Core™ i7-4930K CPU @ 3.40GHz

HVM: Enabled

IOMMU: Enabled

Cache:

  • L1-Cache = 32 kB (max. capacity 32 kB)
  • L2-Cache = 256 kB (max. capacity 256 kB)
  • L3-Cache = 12 MB (max. capacity 12 MB)

Memory: 24 GB (max. installable capacity 96 GB)

  • ChannelA_Dimm1 = 4 GB, 1600 MT/s
  • ChannelB_Dimm1 = 4 GB, 1600 MT/s
  • ChannelB_Dimm2 = 4 GB, 1600 MT/s
  • ChannelC_Dimm1 = 4 GB, 1600 MT/s
  • ChannelD_Dimm1 = 4 GB, 1600 MT/s
  • ChannelD_Dimm2 = 4 GB, 1600 MT/s

1000W PSU

I have run MemTest on all sticks for 48 hours and had no issues.
I have also removed all 6 sticks and tested them 1 at a time, and the issue is present regardless of which stick is used, so I don't believe it's a memory problem.


The Motherboard has 10 SATA ports but no HDDs are currently connected to them (for troubleshooting reasons). I also have a "LSI 9201-16i 6Gbps 16P SAS HBA" and a "NEC LSI 9207-8i 6Gbs HBA".
The 16i was purchased for troubleshooting to remove the motherboard and its controllers from the equation, but the issue remains regardless of what controller the drives are connected to.

All HDDs are in Icy-Docks.
The 2.5" drives are in an "ExpressCage" and the 3.5" drives are in some old "FatCages" (I think that's the model. It's the 3to4 and 3to5 ones).

I have run "DiskSpeed" on all of my drives and all of them (apart from one) are showing the expected speeds.
The 15 second controller/all disk test also shows expected speeds (although I would say 15 seconds might not be enough time for the issue to show itself).

One of my SSDs is showing really slow speeds (20Mbps), which is a new issue, so I'm going to remove it for now, although it's not part of the array or pool so it shouldn't really be affecting anything.

I'm going to stick with the trial version and single drive setup for now and see if I can solve the issue on here before going back to my proper USB.

I have attached the diagnostics but I'm not sure how much help they will actually be.

My next step is to remove the backplane from the equation (connect the test drive directly to the card) and see if that improves anything.

Any help figuring this out would be greatly appreciated as I'm on my last legs.
I have considered just getting a whole new machine and moving the array over, but I'm worried the issue will move with it.

Cheers.

tower-diagnostics-20220716-2052.zip

Link to comment
28 minutes ago, trurl said:

Do all these transfers involve the network?


No.
Like I said in the original post, it's basically any sustained disk activity.
Copying from an Unassigned Drive to a User share using Krusader does it.
It's noticeable when installing an OS in a new VM (install times of 2+ hours), or even when using the VM. The VM will report 100% disk activity when doing simple things like browsing the web.
Even "Mover" can cause it to happen.

Side note: I haven't noticed it happen when using the Unbalance Plugin or when doing a file copy via the terminal, although I don't do that very often so it might just be a coincidence that I haven't noticed it.

When I upgraded my cache drive I had to move all the files to the array using Unbalance because Mover was exhibiting the issue and was going to take 10+ hours. Unbalance got it done in about 40 mins, I think.

Link to comment
4 hours ago, DarkMain said:

One of my SSDs is showing really slow speeds (20Mbps), which is a new issue, so I'm going to remove it for now, although it's not part of the array or pool so it shouldn't really be affecting anything.

I'm going to stick with the trial version and single drive setup for now and see if I can solve the issue on here before going back to my proper USB.


My next step is to remove the backplane from the equation (connect the test drive directly to the card) and see if that improves anything.


Slow SSD removed from the system and test drive directly connected to the MBoard (not in the Icy-Dock any more) and the issue is still present.
I was able to copy the ISO over this time with no problems, but I'm currently installing the VM and I can see that the write speeds are all over the place (doing the expected 100MB/s+ and then dropping down to 0B/s).

I'm pretty much out of ideas now.

It's possible it's a BIOS issue, but I have no idea what setting would cause this, and I have reset the BIOS multiple times while trying to figure this out in the past.
Maybe it just doesn't like the Motherboard? Or Firmware issue?

Link to comment

Update:

I have dusted off an old i5-2500 I had sitting around and did a test using that hardware.
The ONLY things that were the same were the test SSD and the test USB. Everything else was different hardware.

Booted into UnRaid, started the VM and did a file copy... Same issue.

Now I'm completely stumped and have NO idea where to go now with the testing.

Link to comment
  • 3 weeks later...

It's dropping to below 10MB/s at times and Glances is giving me a "CPU_IOWAIT" error.
I know that SMR drives are slower but they shouldn't be dropping that slow.
(I'm actually in the process of swapping the parity to one that's not, but got an error, hence the rebuild before swapping the drives)

It's not just the array though. It's ALL the drives that are exhibiting this behavior, even the SSDs in cache pools or unassigned devices.

I'm kind of all out of ideas.
It's getting bad enough that I've considered just getting a new server and starting again, but considering the issue followed me to completely different hardware on a brand new install of UnRaid, I don't want to spend a heap of money on a new computer just to find it still has problems.

Link to comment

I personally know the SV300, and they are pretty slow. The other 120GB SSD also doesn't look like a performance model, and most 120GB SSDs are much slower than larger-capacity ones, because larger SSDs write in parallel to a larger number of flash modules. Not all SSDs can sustain their advertised write speeds (read: most can't).

Link to comment

We're not talking about slow performance though.
We are talking about a complete freeze in I/O operations.

If I try copying a single large file, it will copy, let's say, 2 or 3GB, and then the performance will literally drop to nothing for 10+ seconds, and I get a whole bunch of CPU_IOWAIT errors. (That makes sense, as my understanding is that the error means the processor is waiting on the drives.) Then the speed will jump up again, then back to nothing, yo-yoing up and down until the transfer is finished.
It doesn't matter how the drives are used.
Cache pool, unassigned drive, pass through to a VM... it's always the same.
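
For anyone wanting to log the I/O wait figure directly rather than relying on the Glances alert, here is a rough sketch that derives it from the aggregate `cpu` line of `/proc/stat` on Linux. The function names and the 1-second interval are assumptions for illustration, and this is not how Glances itself computes the value.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line from /proc/stat into a list of tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two parsed samples."""
    # Counters are: user nice system idle iowait irq softirq steal ...
    # Only the first eight are summed; guest time is already counted in user.
    delta = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(delta)
    if total <= 0:
        return 0.0
    return 100.0 * delta[4] / total


def sample_iowait(interval=1.0, path="/proc/stat"):
    """Measure iowait over `interval` seconds (Linux only)."""
    def read():
        with open(path) as f:
            return parse_cpu_line(f.readline())
    before = read()
    time.sleep(interval)
    return iowait_percent(before, read())
```

Running `sample_iowait()` in a loop during a copy and printing the result with a timestamp would show whether the iowait spikes line up with the periods where the transfer drops to nothing.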

It's a bad analogy, but think of it like a CPU that's throttling because of poor cooling.
It gets too hot so the CPU throttles and the performance drops... Because the performance has dropped the CPU cools down, because the CPU has cooled down the performance goes up again, but then it gets too hot and throttles...
It's kinda like that but much more aggressive.

It doesn't seem to matter how the copy is initiated.
Network transfer, krusader in docker... even the mover script has exhibited this behavior.

The SSDs were previously used as Windows boot drives and they were fine then, so even if they aren't performance drives they should NOT be acting this way. It's not normal and I have never seen this behavior in a drive before.

It's really driving me insane, and the fact that a brand new install of UnRaid exhibits the same behavior on two completely different (albeit old) systems makes it even harder for me to try and narrow down what's causing the problem.

Link to comment
9 minutes ago, JorgeB said:

I see you have a 10TB Ironwolf unassigned, suggest assigning that disk to a pool and see if you see the same issues while transferring data to it.


I'll give that a shot and let you know how it goes. It's eventually going to be the new parity (and I have a 2nd one to replace the older drives in the array).

I'm actually dealing with another issue right now... In the last few days a lot of my drives have started giving me UDMA CRC error counts.
It's actually been happening since I put the new LSI 9201-16i 6Gbps 16P SAS HBA card in, so I'm going to have to remove that and put all the drives back onto the MBoard.

I had to stop the parity rebuild as drive 5 was getting new error counts every few mins. :S
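
To track whether a CRC count is still climbing, the raw value can be pulled from the output of `smartctl -A`. This is a sketch under a few assumptions: the drive reports the counter as SMART attribute ID 199 (commonly named `UDMA_CRC_Error_Count` on SATA drives), the output uses the usual ten-column attribute table, and the helper names are made up for this example.

```python
import subprocess


def crc_error_count(smartctl_output):
    """Return the raw value of SMART attribute 199, or None if it is absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; the raw value is the tenth column.
        if len(fields) >= 10 and fields[0] == "199":
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


def read_crc_errors(device):
    """Run `smartctl -A` on a device (needs root and smartmontools installed)."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return crc_error_count(result.stdout)
```

The counter is cumulative, so a single reading says little. Taking two readings a few minutes apart during a rebuild, for example `read_crc_errors("/dev/sdX")` with the real device path substituted, shows whether new errors are still being logged after a cable or controller change.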

Not having much luck with the system lately, I guess after, I dunno, 10+ years? (When was UnRaid 3 released?) of no issues they were bound to catch up to me.

Link to comment
On 8/3/2022 at 2:07 AM, JorgeB said:

I see you have a 10TB Ironwolf unassigned, suggest assigning that disk to a pool and see if you see the same issues while transferring data to it.

 

K, so I have tested it with the IronWolf and I was unable to recreate the issue with my usual go-to method of making a VM and then copying a file to the running VM.

This method has been pretty much a guarantee to reproduce the problem, so it's good news that it's not happening.
I also tested some older VMs running on the IronWolf and they ran much better than on the SSDs.

 

That got me thinking... All of the problems have been when using SSDs (I assumed even a bad SSD would be better than a mechanical HDD so have never bothered testing with them).

 

Right now, I'm making the IronWolf into the parity drive, and once that has been done (in about 18 hours) I am going to use the old 8TB Seagate Barracuda (which, as you pointed out, is an SMR drive) and run the tests again.
If I can recreate the issue then I can chalk it up to poor performing drives. However, if the issue still isn't present with the 8TB, I can probably say it's an SSD-only issue...
If that is the case, my next step will be to go and buy a "high performance" SSD and test that.

Can I get a recommendation from you for what SSD SHOULD work well in UnRaid?
I keep seeing the 870 EVO and MX500 popping up as recommended drives but want to double check.

Cheers.

Link to comment

Got the VM set up on the 8TB drive and I'm not able to recreate the issue using this drive either.
This time it's set up as an unassigned drive (rather than a cache pool).


It's formatted as btrfs, however I'm pretty sure some of the SSDs I tested were also btrfs, so I don't think it's a file system thing (although I can't be 100% sure on that).

Looks like I'm off to the store tomorrow to pick up an 870 EVO and see how that goes.

Link to comment

Update:
The 870 EVO seems to be working fine as well (network speed inside the VM is lower than expected but that's another issue to figure out).

Strange thing is, I have taken one of the older SSDs that wasn't working properly in UnRaid and put it into a Windows machine, and I cannot, for the life of me, recreate the issue.

So the problem has been 'solved' but I still don't quite understand why it was happening in the first place.
