Extremely slow VM disk performance on SSD (cache)

crescentwire · July 24, 2023

Hey everyone,

The full story is over on Reddit, but I'll recap here:

I have a Dell R520 running unRAID 6.11.5 with:

4 x 4 TB array SAS disks
2 x 4 TB parity SAS disks
2 x 1 TB SSD (SATA) cache disks
2 x Xeon E5-2450 v2 CPUs (16 cores / 32 threads across two physical sockets)
160 GB DDR3 memory

In my mind, this setup should be plenty fast for running multiple VMs, Docker containers, and so on. This is all part of a home lab setup, so I have a Linux VM, along with a few Windows Server/Windows 10 VMs.

Recently, I upgraded my cache drives from 200 GB Intel SSDs to 1 TB Micron M600 SSDs. I was previously running several VMs on my array and, while performance wasn't great, it was moderately usable. I was eager to move all my VMs onto my cache drives, alongside my Docker containers, for increased speed and room to add even more VMs.

Since I moved the VMs to my new cache drives, read/write speeds have been unusably slow. I'm talking 2-3 MB/s (bytes, not bits). On the Windows VMs, Task Manager reports 100% disk active time almost all the time, with response times often in the hundreds to thousands of milliseconds. That's really, really bad. If I run one Linux VM and try booting three Windows VMs alongside it, the Linux VM slows to a crawl and almost completely stops responding.

After witnessing a Red Hat-based VM take 9 mins 44 secs to boot from a powered-off state, I opened the logs window just for the heck of it... and saw this:

I had already run a BTRFS filesystem check when the array was started in maintenance mode, but didn't see any issues. I also don't see any errors listed next to the cache pool drives:

I'm in the process of moving my domains, appdata, and system shares to the array to see if performance is any better there. If it is, then I suppose I'll be replacing these SSD cache drives.

Would you (like me) suspect a hardware issue (bad SSD) at this point?

ih-nas01-diagnostics-20230724-1048.zip

Edited July 24, 2023 by crescentwire
Added diagnostics file

crescentwire · July 25, 2023

Bump

Vr2Io · July 26, 2023

If the errors come only from sdc and not also from sdh (the other M600 in the cache pool), then sdc is likely the bad drive.
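
One way to check which pool member is accumulating errors is to look at the per-device counters that `btrfs device stats` prints. Below is a rough Python sketch that parses that output and lists only the devices with non-zero counters. The function name `devices_with_errors` and the `SAMPLE` text are made up for illustration; the sample numbers are not taken from the diagnostics in this thread.

```python
import re
from collections import defaultdict

# Matches lines in the form printed by `btrfs device stats`,
# e.g. "[/dev/sdc1].write_io_errs    0"
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def devices_with_errors(stats_text):
    """Return {device: {counter: value}} for every non-zero error counter."""
    bad = defaultdict(dict)
    for line in stats_text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match and int(match.group("value")) > 0:
            bad[match.group("dev")][match.group("counter")] = int(match.group("value"))
    return dict(bad)


# Hypothetical sample output, for illustration only.
SAMPLE = """\
[/dev/sdc1].write_io_errs    12
[/dev/sdc1].read_io_errs     3
[/dev/sdc1].flush_io_errs    0
[/dev/sdc1].corruption_errs  0
[/dev/sdc1].generation_errs  0
[/dev/sdh1].write_io_errs    0
[/dev/sdh1].read_io_errs     0
[/dev/sdh1].flush_io_errs    0
[/dev/sdh1].corruption_errs  0
[/dev/sdh1].generation_errs  0
"""
```

On a live system the text could come from something like `subprocess.run(["btrfs", "device", "stats", "/mnt/cache"], capture_output=True, text=True).stdout`, assuming the pool is mounted at `/mnt/cache`. If only one device shows up in the result, that device is the likely suspect.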

crescentwire · July 26, 2023

Thank you, that seems to match with what I'm seeing.

I stopped the array, unassigned sdc (the cache pool drive showing errors), but kept sdh. After starting the array, speeds are now in the hundreds of MB/s, which is exactly what I would expect to see.

So, sdc is definitely a bad drive. Thank you for the help and confirmation!
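
For anyone who wants to sanity-check a pool the same way, here is a minimal Python sketch of a sequential write test. It is only an approximation, not a proper benchmark: it writes a temporary file, forces it to disk with `fsync`, and reports MB/s. The function name `write_throughput_mb_s` and its default sizes are my own choices, and the directory you pass in (for example `/mnt/cache`) is an assumption about where the pool is mounted.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, size_mib=64, block_kib=1024):
    """Write size_mib MiB into a temp file in directory and return MB/s (10^6 bytes/s)."""
    block = os.urandom(block_kib * 1024)       # one block of random data, reused for each write
    blocks = (size_mib * 1024) // block_kib    # number of blocks to write
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())               # make sure the data actually reaches the device
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)                        # clean up the temporary file
    return (blocks * len(block)) / elapsed / 1e6
```

Calling `write_throughput_mb_s("/mnt/cache", size_mib=1024)` on a healthy SATA SSD pool should land far above the 2-3 MB/s described in the first post. It only measures sequential writes from the host, so it will not capture everything a VM sees.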
