CowboyRedBeard (Author) - Posted November 8, 2021
Tried that (rebooted the server, even)... it doesn't seem to make a difference.
JorgeB - Posted November 8, 2021
Quote (CowboyRedBeard): "doesn't seem to make a difference"
Shame, it was worth a try.
JorgeB - Posted November 10, 2021
FYI, this is the post where a user reported this helped, and the symptoms looked similar to yours: https://forums.unraid.net/topic/114827-lockups-when-parity-is-enabled/?do=findComment&comment=1047923
CowboyRedBeard (Author) - Posted February 4, 2022
So at this point I've suffered through this for nearly two years by scheduling tasks in the wee hours. But I've begun using the server for more workloads than just Plex and infrastructure Dockers again (I had originally built this box to help with VMs for crypto projects plus Plex). At this point, the high IO wait is actually bad enough to raise the fans and CPU / PCM temps for sustained periods. It's time to add more storage and change a few other things on top of it.
So here's my question: I started unRAID on this box back in 2019, and a few hardware changes and however many versions later, I still have this problem. Would rebuilding it from scratch potentially help alleviate this problem? And if it's worth a shot, what is a good general strategy? What other things should I consider? @Squid @trurl
CowboyRedBeard (Author) - Posted February 10, 2022
Guys, any help would be much appreciated. I'm at a loss with this. Today I did a test:
- Stopped and disabled VMs / Dockers
- Moved everything off the cache
- Reformatted that drive from XFS to BTRFS
- Moved everything back
Whamo... when doing a big download with Sabnzbd I got decent performance. Then later today, doing the same sort of workload, I'm stuck at roughly 80MiB/s on writes again. How is this? It can't be a hardware thing??
Squid - Posted February 10, 2022
Quote (CowboyRedBeard): "But at this point, the high IO wait is actually bad enough to raise the fans and CPU / PCM temps for sustained periods"
That seems very strange, because during IO wait the processor is basically idle, waiting on the IO to complete.
CowboyRedBeard (Author) - Posted February 11, 2022
Yeah, but if you watch the threads, those are hung at 100%, so maybe that's why? And I track fans / CPUs via IPMI. You can see it here (mover kicks off at 5:00):
CowboyRedBeard (Author) - Posted February 11, 2022
I have a different, new SSD that I'm thinking of trying this weekend. However, it seems to me it has to be some sort of OS or OS/hardware configuration thing, since I was able to get that freshly formatted drive to perform at 300MiB/s, and then on the next workload and every one beyond it I can't get any higher than 80MiB/s. I'm down to do tests, I just don't know what to do...
CowboyRedBeard (Author) - Posted February 11, 2022
Like here is me copying an 85GB file across the LAN via SMB to a share which uses the cache drive. As you can see, the server is receiving at nearly a full 1Gbps on the network interface, and it's flushing the file to the cache drive in bursts of around 400MiB/s... until about halfway through the copy, which was about 5 minutes in. Then the system falls back to a steady stream of 80MiB/s write speed... ¯\_(ツ)_/¯
So it doesn't seem to me to be a hardware problem. I've also set the "dirty ratio" in Tips & Tweaks to 1% & 2%... at the default this is way worse.
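For anyone following along: the "dirty ratio" knobs that Tips & Tweaks exposes are the kernel's vm.dirty_* settings, and they can be inspected directly from the Unraid shell. A minimal sketch, assuming a stock Linux /proc layout:

```shell
# Read total RAM and the current writeback thresholds (what Tips & Tweaks adjusts)
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
bg_ratio=$(cat /proc/sys/vm/dirty_background_ratio)
hard_ratio=$(cat /proc/sys/vm/dirty_ratio)

# Bytes of dirty page cache allowed before background flushing to disk begins
bg_bytes=$(( mem_kb * 1024 * bg_ratio / 100 ))
echo "background writeback starts at ${bg_bytes} bytes (${bg_ratio}% of RAM, hard limit ${hard_ratio}%)"
```

With a lot of RAM, even a 1% background ratio is a sizeable buffer, which is consistent with the fast initial burst followed by a drop to sustained disk speed described above.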
Squid - Posted February 11, 2022
Quote (CowboyRedBeard): "those are hung at 100% so maybe that's why"
The dashboard takes I/O wait into consideration on its graphs. Depending on how you look at it, it's either right or wrong. It's wrong because the core isn't actually running at 100% (rather, it's idle waiting for the data transfer, and the processes can't continue without it), or right because the core isn't able to do anything else while it's waiting for the transfer to happen (so it's effectively at 100% from the user's point of view).
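The iowait figure the dashboard folds into its graphs can also be read raw from the kernel's counters. A rough sketch, Linux only; the values are cumulative jiffies since boot, so this gives a since-boot average rather than an instantaneous reading:

```shell
# Aggregate "cpu" line of /proc/stat: label user nice system idle iowait irq softirq ...
read -r _label user nice system idle iowait _rest < /proc/stat
total=$(( user + nice + system + idle + iowait ))
echo "iowait: ${iowait} jiffies, about $(( 100 * iowait / total ))% of CPU time since boot"
```

Comparing this percentage before and after a big transfer shows how much time the cores spent parked on I/O rather than doing work, which is exactly the "idle but unusable" state described above.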
Squid - Posted February 11, 2022
Quote (CowboyRedBeard): "(mover kicks off at 5:00)"
How long is mover taking to run? In one of your old diagnostics, the drives were doing a trim every day at 5:30. During that trim operation, all transfers to and from the SSDs are effectively paused due to I/O wait.
CowboyRedBeard (Author) - Posted February 11, 2022
I recently moved trim to 08:30 and have mover set to start at 05:05. Mover takes a different amount of time each day, based on how much SAB downloaded. Having recently upgraded netdata and other things, I don't have any recent history examples. I'll try to load up some files for mover tonight and post those results tomorrow.
CowboyRedBeard (Author) - Posted February 12, 2022
So this is last night; downloads start at 1:30. You can see it goes up to 430-ish MiB/s, then IO wait goes as high as 22% and write speeds level off at 80MiB/s. Then this is when the mover starts:
Squid - Posted February 12, 2022
Pause the downloads while you are post-processing / unpacking? Also, rearrange your mappings so that when the system moves a finished download to the cache-enabled share, it can do a simple rename instead of a copy/delete operation.
CowboyRedBeard (Author) - Posted February 13, 2022
I have Sab using a cache-enabled share. I guess pausing during post-processing may help, but that's really just another way to mask the issue without fixing it. I'd love to figure out why this is happening.
dlandon - Posted February 17, 2022
I'm a little late to this party and wasn't able to determine your current Unraid version. What Unraid version are you running?
CowboyRedBeard (Author) - Posted February 17, 2022
Hi, welcome to the club! Haha. I'm currently on 6.9.2 but have had this issue since 6.7.
dlandon - Posted February 18, 2022
And your network and disk controller hardware?
CowboyRedBeard (Author) - Posted February 18, 2022
Supermicro X9DRi-LN4+, using the onboard controller. The cache drive is connected to a SATA 3 port on the motherboard. The other SSD that's not assigned to the array (VMs live on this) is also on a SATA 3 port; I've done tests to/from those. And I've even tried a PCIe SATA controller that I have in the machine. The network is the motherboard's onboard Ethernet, which doesn't seem to be a bottleneck at all; I can copy at a full 1Gbps on it.
dlandon - Posted February 18, 2022
First, I apologize if I walk you back through things you've already done, but I'd like to get an assessment of where you are so we can do some troubleshooting. Start by looking at your SSD disks and be sure they are formatted like this: for best operation, they should show 1 MiB-aligned. Then give me a screenshot of the Tips & Tweaks page so I can see what you've 'tweaked'.
CowboyRedBeard (Author) - Posted February 18, 2022
No apologies needed, I appreciate the help!
dlandon - Posted February 18, 2022
Ok, I like all your settings. Now review the issues with me. Let's start with non-spinners and work our way up from there. As I understand it, you are not seeing good performance with an SSD device? Give me some details.
CowboyRedBeard (Author) - Posted February 18, 2022
The posts on this page are probably an accurate depiction of the problem as it currently appears. Essentially, with any file-write process to cache I end up with high I/O wait times. The cache drive will write at around 300MiB/s for just a minute or two, and after that it will only give around 80MiB/s. This shows up in netdata and on the Unraid dashboard, as in the posts above. In the second one you can even see the CPU temps rise, which, as was mentioned here, was thought to be odd since it's just "waiting"... but I monitor CPU temp / fan speed with IPMI and send that data to InfluxDB where I can trend it (which is that graph in the second post).
Happy to conduct any tests you think are meaningful and post the results here. But I primarily see this with cache drives (spinning disks don't reach the same sort of speeds, so I guess the system can keep up with them). I also see it whether it's Sab downloading / unpacking a file, or a transfer to or from a non-array / non-cache SSD. Earlier in this thread I had an Intel Optane NVMe drive in the box in a PCIe slot and was able to get crazy sustained write speeds to it without this issue occurring. I've since pulled it out, but could put it back in for testing if needed.
dlandon - Posted February 18, 2022
Quote (CowboyRedBeard): "I will see the cache drive able to write at around 300MiB/s for just a minute or two and then after that it will only give around 80MiB/s after."
What you are seeing is Linux disk caching. It will initially fill the RAM disk cache and then start writing to disk when the cache reaches a set threshold. These are the Disk Cache settings in Tips & Tweaks. If I recall, you have 128GB of RAM; at 1% dirty background, that's 1.28GB. A file transfer will fill that RAM cache before committing to disk, which is why you see high speeds after a reboot: the disk cache is empty. Your theoretical max is about 100MB/s over your 1Gb network, so honestly, the 80MB/s is reasonable.
I want you to do some adjustments to the network in Tips & Tweaks, though, and let's see if it helps:
- Disable NIC Flow Control.
- Disable NIC Offload.
- Set Ethernet NIC Rx Buffer to 1024.
- Set Ethernet NIC Tx Buffer to 1024.
These settings help on Intel NICs at times. See if that improves your speed. As for the high CPU use and temps, I have no idea what's happening there. I'm not an expert on iowaits, but could they also be from NIC I/O?
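For reference, the four Tips & Tweaks toggles above correspond roughly to plain ethtool calls. This is a sketch, not a verified mapping: the interface name eth0 is an assumption (check yours with `ip link`), and it defaults to printing the commands instead of applying them, since applying requires root:

```shell
IFACE=${IFACE:-eth0}   # assumed NIC name; verify with `ip link`
DRYRUN=${DRYRUN:-1}    # set DRYRUN=0 to actually apply the settings (requires root)

# Print the command in dry-run mode, otherwise execute it
run() { if [ "$DRYRUN" = "1" ]; then echo "$@"; else "$@"; fi; }

run ethtool -A "$IFACE" rx off tx off              # disable pause-frame flow control
run ethtool -K "$IFACE" tso off gso off gro off    # disable segmentation/receive offloads
run ethtool -G "$IFACE" rx 1024 tx 1024            # set rx/tx ring buffers to 1024 descriptors
```

Running it as-is prints the three ethtool invocations so they can be reviewed before being applied for real.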