6.11.5 - Transfers slow down to a halt

LAS · November 29, 2022

Been looking to replace a 13-year-old Ubuntu server running about 40 Docker containers, Plex being the most resource-hungry.

Unraid seemed like the perfect choice, as I'd love to be able to spin up a VM to play some games now and then.

Getting the VM working was almost painless: passthrough of a 1TB NVMe disk, a 3060 Ti, a USB PCIe card, and 4 cores (8 threads) of an i5-12600K.

Set up my array with a 10TB Ironwolf as parity, 3x8TB Barracuda for storage, and a 1.2TB Intel S3710 SSD as pool "cache".

A separate 1TB NVMe serves as pool "services" for Docker and whatever other VMs need down the road.

Spun up cloudflared, Portainer and Tailscale, everything working perfectly.

VM stopped, containers running.

Started transferring movies off my old server, mounting it as an SMB share and using rclone sync.

Started getting btrfs corruption errors, did some googling, reformatted both pool drives to xfs, and tried again.

Transfer speeds were about 35MB/s with both computers wired, but after some 15-20 minutes they would slow down drastically. I stopped the transfers and restarted, and the same thing happened: they slowed down.

Figured I'd move the stuff off the cache, so started the mover. Same thing here!

Speeds started at 50MB/s on parity + disk1, before slowing down to sub 1MB/s speeds.

Stopped the mover manually by running mover stop, and noticed 2-3 cores under high load, usually two at 100% while a third bounced up and down.

Rebooted the server, everything fine and idling at about nothing.

Restarted the mover, same thing! Stopped it again, and 2-3 cores kept working, with the cache disk, parity and disk1 showing blips of sub-1MB/s transfers even though the mover was stopped and nothing else was accessing them. About 10-15 minutes later, everything suddenly went back to idling.

Tried skipping the cache and moving the files directly to the array (cache: no instead of cache: yes). Initial speeds were higher, at about 45MB/s, but dropped a lot faster, and most of my cores were at full usage.

Seems like something is not working as intended in my config, or I'm doing something wrong, but I have no ideas left on where to look.

hyper-diagnostics-20221129-1934.zip

Edited November 29, 2022 by LAS

JorgeB · November 29, 2022

14 minutes ago, LAS said:

3x8TB Barracuda for storage

These disks are SMR, so lower write performance is expected, especially with medium/small files, transfer to cache should be much faster, not familiar with rclone but it might not be the fastest option, can you do a large file transfer from a wired desktop over SMB direct to cache?

17 minutes ago, LAS said:

Started getting btrfs corruption errors

Also, getting btrfs corruption errors immediately suggests a hardware problem

LAS · November 29, 2022

45 minutes ago, JorgeB said:

These disks are SMR, so lower write performance is expected, especially with medium/small files, transfer to cache should be much faster, not familiar with rclone but it might not be the fastest option, can you do a large file transfer from a wired desktop over SMB direct to cache?

They're slow, but even on the slower areas of the platter they're still about 70 times faster than the speeds I'm seeing.

Rclone and rsync (if you're more familiar with that) have about the same performance, I believe.

45 minutes ago, JorgeB said:

Also, getting btrfs corruption errors immediately suggests a hardware problem

The Intel SSD was fine a few days ago, it's a datacenter drive worth more than the rest of my drives combined, with 4 months of use.

Reading more from the thread I skimmed through when I swapped btrfs for xfs, I suppose I could look into my RAM settings.

I'll do a test setting the NVMe (services drive) as cache for the media share as well, though as I'm seeing the same behavior when writing directly to the array, I have my doubts.

LAS · November 29, 2022

Did the same rclone sync from the mounted SMB share using my NVMe as cache: speeds at about 65MB/s, same high CPU usage.

Transferred some 40GB of data (2-5GB files); it slowed down somewhat. Mover worked as it should.

Powered down the server and swapped the SATA cable (and the SATA port, as it's shared with nvme3, even though I don't have one inserted).

Then did rsync over ssh instead, and lo and behold! Speeds stable at 105MB/s, CPU cores are mostly calm and almost idle!

Did a 120GB transfer, everything working perfectly. Transfer to array getting speeds between 45-50MB/s; don't know what's expected on these drives (preclear started just above 200MB/s, ended at about 70MB/s).

LAS · November 30, 2022

Update:

Let it run overnight, transferring about 250GB to the cache. Mover was started on schedule, while still copying files from the older server.
Noticed when I woke up that all speeds had dropped to almost a halt. Stopped the transfer from the old server, mover still at very low speeds.

JorgeB · November 30, 2022

10 hours ago, LAS said:

Did a 120GB transfer, everything working perfectly. Transfer to array getting speeds between 45-50MB/s; don't know what's expected on these drives (preclear started just above 200MB/s, ended at about 70MB/s).

50MB/s is about normal for the default write mode, unless SMR limitations kick in; I see <5MB/s every few minutes with some of my SMR drives. If you enable turbo write you'll get faster writes at the expense of all drives spinning up.

LAS · December 1, 2022

Found some recommendations to disable the parity drive for the initial data transfer. Combined with setting Direct IO to Yes, I was getting consistent 110MB/s transfer speeds.

...Until I hit the 3.2TB mark on my 8TB BarraCuda Disk1, where transfers stalled.

Direct transfer to drive instead of share gave the same results, stalled.

Attempts to read/download from Disk1 stall as well.

Excluded Disk1 from the share, Disk2 spun up, back at full speed.

Rebooted the server, I'm now able to both read and write to Disk1 at good speeds. Tested writing 63GB of data to ensure it wasn't all RAM cache (32GB total RAM), 110MB/s consistent.

Would there be anything I've overlooked that could cause this behaviour, or do I simply have a drive that's starting to fail?

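SMART shows the following:

1. Raw read error rate - 2907048 (hex 2C 5BA8)

5. Reallocated sector count - 0

7. Seek error rate - 669490849 (hex 27E7 9EA1)

187. Reported uncorrectable errors - 0

188. Command timeout - 0 0 0

197. Current pending sector count - 0

198. Offline uncorrectable sector count - 0

199. UDMA CRC error count - 0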

From what I can gather from this thread, this equals no errors
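
For anyone wanting to check the two large raw values themselves: on Seagate drives, attributes 1 and 7 are commonly reported by the community (this is not taken from an official Seagate specification) to be 48-bit numbers where the upper 16 bits count errors and the lower 32 bits count operations. A minimal Python sketch under that assumption, with a hypothetical helper name split_seagate_raw:

```python
def split_seagate_raw(raw: int) -> tuple[int, int]:
    """Split a 48-bit SMART raw value into (errors, operations).

    Assumption (community-reported, not an official spec):
    upper 16 bits = error count, lower 32 bits = operation count.
    """
    if not 0 <= raw < 1 << 48:
        raise ValueError("expected a 48-bit raw value")
    errors = raw >> 32              # upper 16 bits
    operations = raw & 0xFFFFFFFF   # lower 32 bits
    return errors, operations


# Raw values copied from the SMART report above.
for name, raw in [("Raw read error rate", 2907048),
                  ("Seek error rate", 669490849)]:
    errors, operations = split_seagate_raw(raw)
    print(f"{name}: errors={errors}, operations={operations}")
```

Both values are below 2^32, so under this interpretation the error part is 0 for each, which matches the "no errors" reading.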

Edit: After another 25GB, it yet again stalls. sigh

Edited December 1, 2022 by LAS

LAS · December 1, 2022

Restarting the array seems to fix the issue.

Transferring in batches of at most 80GB, with small breaks to let the disk settle, I've now transferred another 1TB without any more issues.
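
The capped-transfer approach with pauses can be scripted. Below is a rough Python sketch, not what was actually run here: plan_batches and transfer_in_batches are hypothetical names, copy_batch is a placeholder for whatever does the real copying (for example a wrapper around rsync), and the 10-minute pause is an assumed value since the post only says "small breaks".

```python
import time
from typing import Callable, Iterable, Iterator

GB = 1000 ** 3


def plan_batches(files: Iterable[tuple[str, int]],
                 max_bytes: int = 80 * GB) -> Iterator[list[str]]:
    """Group (path, size_in_bytes) pairs into batches of at most max_bytes.

    A single file larger than max_bytes ends up in a batch of its own.
    """
    batch: list[str] = []
    total = 0
    for path, size in files:
        # Close the current batch if adding this file would exceed the cap.
        if batch and total + size > max_bytes:
            yield batch
            batch, total = [], 0
        batch.append(path)
        total += size
    if batch:
        yield batch


def transfer_in_batches(files: Iterable[tuple[str, int]],
                        copy_batch: Callable[[list[str]], None],
                        pause_seconds: float = 600.0,  # assumed break length
                        max_bytes: int = 80 * GB) -> None:
    """Copy each batch via copy_batch, pausing between batches."""
    batches = list(plan_batches(files, max_bytes))
    for index, batch in enumerate(batches):
        copy_batch(batch)
        if index < len(batches) - 1:
            time.sleep(pause_seconds)  # give the disk time to settle


# Example plan with made-up file names and sizes (nothing is copied here).
example = [("movie1.mkv", 50 * GB), ("movie2.mkv", 40 * GB), ("movie3.mkv", 30 * GB)]
for number, batch in enumerate(plan_batches(example), start=1):
    print(number, batch)
```

How long the pause needs to be depends on the drive; the sketch only fixes the structure, not the numbers.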

JorgeB · December 2, 2022

That model disk, besides being SMR as mentioned, has been found to have inconsistent performance, sometimes one disk can be slower than others of the same model, so possibly just a disk issue.

LAS · December 2, 2022

8 hours ago, JorgeB said:

That model disk, besides being SMR as mentioned, has been found to have inconsistent performance, sometimes one disk can be slower than others of the same model, so possibly just a disk issue.

It would seem that is the issue, yes. Everything has been stable now that I'm not pushing the disks too hard.

Think I'll be replacing the disks with Ironwolf, the whole Ironwolf-series seems to be CMR.

What actually happens is explained quite nicely on https://superuser.com/a/1691665

Quote

The Seagate ST4000DM004 uses SMR to write data to the disk surface. This means, that in order to write a single byte, it might have to rewrite multiple gigabytes.

In "normal usage patterns" (as designated so by HDD vendors, not by users!) this creates not much of a problem - the data is written to a CMR cache on the outer rim of the disk. Later, when disk usage goes down, the firmware will move the data to its final place in an SMR band.

When writing larger quantities of data at a time, this CMR cache is exhausted and the process of I/O to SMR bands has to take over - this is slower by orders of magnitude.

Nota bene: This is not a RAM cache - it is a small part of the disk surface, that is written in CMR (i.e., without overlapping tracks) to make the SMR horror less visible to users.
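
The behaviour described in the quote can be illustrated with a toy model in Python. This is only a sketch: the class name ToySmrDrive and every number in it (cache size, fast rate, slow rate, drain rate) are made-up assumptions for illustration, not measurements or firmware details of any real drive.

```python
from dataclasses import dataclass


@dataclass
class ToySmrDrive:
    """Toy model of an SMR drive with an on-disk CMR write cache."""
    cache_capacity_mb: float = 80_000.0  # assumed CMR cache size
    fast_rate: float = 110.0             # MB/s while the cache has room (assumed)
    slow_rate: float = 1.0               # MB/s when writing straight to SMR bands (assumed)
    drain_rate: float = 30.0             # MB/s the cache empties while idle (assumed)
    cache_used_mb: float = 0.0

    def write(self, size_mb: float) -> float:
        """Write size_mb and return how many seconds it takes in this model."""
        free_mb = self.cache_capacity_mb - self.cache_used_mb
        fast_mb = min(size_mb, free_mb)   # part that still fits in the CMR cache
        slow_mb = size_mb - fast_mb       # remainder goes to SMR bands directly
        self.cache_used_mb += fast_mb
        return fast_mb / self.fast_rate + slow_mb / self.slow_rate

    def idle(self, seconds: float) -> None:
        """Let the firmware move cached data to SMR bands for a while."""
        drained = seconds * self.drain_rate
        self.cache_used_mb = max(0.0, self.cache_used_mb - drained)


drive = ToySmrDrive()
print("first 60 GB:", round(drive.write(60_000)), "s")
print("next 60 GB without a break:", round(drive.write(60_000)), "s")
drive.idle(3600)  # one hour of rest
print("60 GB after a break:", round(drive.write(60_000)), "s")
```

In this model the second write takes far longer than the first because most of it misses the cache, while the same write after an idle period is fast again. That is the pattern seen in this thread, although the real firmware is certainly more complicated.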

Thank you @JorgeB for being patient with a complete newbie in this field of computing.

It has been an interesting few days of learning how transfers and caching work.

Edited December 2, 2022 by LAS
