January 14
Hi guys,
So recently I’ve noticed that my Plex has been buffering a lot and some of my hosted sites are running really slow, sometimes they even show as "down" even though the server is definitely up.
I checked and saw that my server is hitting ping spikes of over 500 ms when pinging the gateway. At first I thought it was just a bad network cable, so I changed it out, but the same thing is still happening.
I was looking into it and saw that shfs is using somewhere around 200% CPU. I really don't know what it is and I haven't modified any settings lately. My only recent changes were adding a parity disk and a new M.2 SSD (which is where my appdata is hosted now). Those changes were a few months ago, but this problem just started showing up at the end of 2025.
I’ve been trying to fix it by myself since then with no luck and I still can't figure out what's causing it. I've attached my diagnostics and a screenshot of the spikes.
Any help would be great, thanks!
tower-diagnostics-20260113-2216.zip
January 14 Community Expert
18 minutes ago, VozDeOuro said: where my appdata is hosted now
Actually your appdata is on multiple pools, as is system
appdata shareUseCache="only" # Share exists on nvme-red, cache-sata-w, cache-sata, cache
system shareUseCache="prefer" # Share exists on cache, cache-sata
You should do extended self-test on cache-sata
Jan 12 03:52:22 Tower kernel: ata2.00: configured for UDMA/133
Jan 12 03:52:22 Tower kernel: sd 1:0:0:0: [sdd] tag#18 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=0s
Jan 12 03:52:22 Tower kernel: sd 1:0:0:0: [sdd] tag#18 Sense Key : 0x3 [current]
Jan 12 03:52:22 Tower kernel: sd 1:0:0:0: [sdd] tag#18 ASC=0x11 ASCQ=0x4
Jan 12 03:52:22 Tower kernel: sd 1:0:0:0: [sdd] tag#18 CDB: opcode=0x28 28 00 17 11 04 90 00 00 08 00
Jan 12 03:52:22 Tower kernel: I/O error, dev sdd, sector 386991252 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
Jan 12 03:52:22 Tower kernel: ata2: EH complete
Jan 12 03:52:23 Tower kernel: ata2.00: exception Emask 0x0 SAct 0x80004 SErr 0x0 action 0x0
Jan 12 03:52:23 Tower kernel: ata2.00: irq_stat 0x40000008
Jan 12 03:52:23 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED
Jan 12 03:52:23 Tower kernel: ata2.00: cmd 60/08:98:90:04:11/00:00:17:00:00/40 tag 19 ncq dma 4096 in
Jan 12 03:52:23 Tower kernel: res 41/40:08:94:04:11/00:00:17:00:00/40 Emask 0x409 (media error) <F>
Jan 12 03:52:23 Tower kernel: ata2.00: status: { DRDY ERR }
Jan 12 03:52:23 Tower kernel: ata2.00: error: { UNC }
187 Reported_Uncorrect -O--CK 100 100 --- - 2541
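For anyone wanting to check their own syslog for the same pattern, here is a minimal sketch that counts `I/O error` lines per device. The regular expression is based only on the line format shown in the excerpt above, and the function name `summarize_io_errors` is made up for illustration. Reading from `/var/log/syslog` and having Python 3 available on the server are assumptions.

```python
import re
from collections import Counter

# Matches kernel lines shaped like the one quoted above:
#   "I/O error, dev sdd, sector 386991252 op 0x0:(READ) ..."
IO_ERROR = re.compile(r"I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)")


def summarize_io_errors(lines):
    """Return (error count per device, set of failing sectors per device)."""
    counts = Counter()
    sectors = {}
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            dev = match.group("dev")
            counts[dev] += 1
            sectors.setdefault(dev, set()).add(int(match.group("sector")))
    return counts, sectors


if __name__ == "__main__":
    # Assumed log location; adjust to wherever your syslog actually lives.
    with open("/var/log/syslog", errors="replace") as log:
        counts, sectors = summarize_io_errors(log)
    for dev, count in counts.most_common():
        print(f"{dev}: {count} I/O errors, {len(sectors[dev])} distinct sectors")
```

A device that keeps showing up here, together with a rising Reported_Uncorrect count in SMART, points at the same drive the log excerpt does.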
January 14 Author
23 minutes ago, trurl said: Actually your appdata is on multiple pools, as is system
It's just some leftovers, I will finish moving everything to the new nvme-red.
23 minutes ago, trurl said: You should do extended self-test on cache-sata
I ran the SMART test on the failing drive as well. Do you think this is related to the high ping values?
tower-smart-20260114-0131.zip
January 14 Community Expert
4 minutes ago, VozDeOuro said: I ran the SMART test on the failing drive as well
# 1 Extended offline Completed: read failure 00% 41902 401213152
Replace ASAP
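To make the self-test row above easier to read: in smartctl's self-test log the trailing columns are the percentage remaining, the drive's power-on hours at the time of the test, and the LBA of the first error. The sketch below splits a row on that assumption; `parse_selftest_row` is a hypothetical helper, not part of smartmontools.

```python
import re

# One self-test log row:
#   "# <num> <description and status> <remaining>% <hours> <LBA or ->"
ROW = re.compile(r"^#\s*(\d+)\s+(.*?)\s+(\d+)%\s+(\d+)\s+(\S+)$")


def parse_selftest_row(line):
    """Split one self-test log row into named fields, or return None."""
    match = ROW.match(line.strip())
    if match is None:
        return None
    num, text, remaining, hours, lba = match.groups()
    return {
        "num": int(num),
        "text": text,  # test description plus status
        "remaining_pct": int(remaining),
        "power_on_hours": int(hours),
        # smartctl prints "-" when the test found no error
        "first_error_lba": None if lba == "-" else int(lba),
        "failed": "failure" in text.lower(),
    }


row = parse_selftest_row(
    "# 1 Extended offline Completed: read failure 00% 41902 401213152"
)
print(row)
```

Read this way, the row says the extended test hit a read failure at LBA 401213152 with 41902 power-on hours on the drive.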
January 14 Author
10 minutes ago, trurl said: Replace ASAP
thx, did you see anything that could be causing the ping spike?
January 14 Community Expert
15 minutes ago, VozDeOuro said: thx, did you see anything that could be causing the ping spike?
He told you the reason already. Your faulty drive is causing delays because it resets all the time, and all other processes are halted during the reset.
January 14 Author
8 minutes ago, MAM59 said: He told you the reason already. Your faulty drive is causing delays because it resets all the time, and all other processes are halted during the reset.
Ohh ok, is there a way to disable the drive for now by software? I don't have the money to replace it ATM.
January 14 Community Expert
That would not help, the resets block the bus (and therefore the whole system) whether the drive is addressed or not.
You need to take it out physically, but first try to move the data away from it (could take a Looooooooooooooooooong time...).
January 14 Community Expert
9 hours ago, VozDeOuro said: disable the drive for now by software
9 hours ago, MAM59 said: try to move the data away from it
Looks like cache-sata-w pool has plenty of space. You could just move it all there and get rid of cache-sata pool completely.
January 14 Author
41 minutes ago, trurl said: Looks like cache-sata-w pool has plenty of space. You could just move it all there and get rid of cache-sata pool completely.
The problem is the cache-SATA-W is a normal SSD, it's not a server-grade one, is that a big problem like I think it is?
January 14 Community Expert
5 minutes ago, VozDeOuro said: The problem is the cache-SATA-W is a normal SSD, it's not a server-grade one, is that a big problem like I think it is?
Why do you think it is?
Looks like it is nvme, unlike your cache-sata and nvme-red which are M.2 SATA
January 14 Community Expert
11 minutes ago, VozDeOuro said: The problem is the cache-SATA-W is a normal SSD, it's not a server-grade one, is that a big problem like I think it is?
No, you don't need server-grade hardware to run Unraid. I run WD Blue SATA SSDs for my cache pool. However, I have them running in mirrored pairs for redundancy if/when one of them fails.
Ideally you want components rated for your expected workloads, but the only thing that may happen is that you wear the drive out sooner rather than later.
Move the data ASAP or you risk losing it entirely. That drive is about to die. Your last concern should be whether or not the other disk is server grade.
Edited January 14 by MowMdown
January 15 Author
I removed the faulty drive, and the high ping issue is still happening.
What else could it be?
Edited January 15 by VozDeOuro
January 15 Author
Here are the newest diagnostics, if it helps.
tower-diagnostics-20260114-2133.zip
January 15 Community Expert
It can be anything, because you are using a WireGuard tunnel (dunno why?). This can also be a result of the tunnel server on the other end having hiccups, your ISP, or failing DNS queries...
January 15 Author
1 minute ago, MAM59 said: It can be anything, because you are using a WireGuard tunnel (dunno why?). This can also be a result of the tunnel server on the other end having hiccups, your ISP, or failing DNS queries...
The ping test runs on the server against the local gateway. The VPN is just for me to reach the LAN so I can access the server and test what is happening.
So the high ping is local: from server 192.168.0.10 to gateway 192.168.0.1.
January 15 Author
3 minutes ago, MAM59 said: sorry, cannot see any local reason in your diagnostics anymore.
Do you have any idea what type of logs or processes I can look at to help diagnose this?
January 15 Community Expert
No clue. You are running a lot of interpreter stuff (Java, Python), they are invisible to the system. Python takes the largest amount of CPU time, but this does not automatically prove it is the culprit.
It must be something that is able to block the whole system for a certain time (for instance big writes that overwhelm the internal buffers or something).
But because they are only visible for a split second, the logs do not catch such events.
Sticking to networking, I would suspect missing/wrong Flow Control, but you are running at 1G speed only, Flow Control is not necessary there.
To be safe I would disable the WiFi chip in the BIOS so UNRAID does not see it and maybe try to initialize it (which also costs time and causes hiccups). But this is a very far-fetched guess. I don't think this could be the reason. (But then, just try and see. It does not harm anything.)
January 15 Author
1 minute ago, MAM59 said: No clue. You are running a lot of interpreter stuff (Java, Python), they are invisible to the system. Python takes the largest amount of CPU time, but this does not automatically prove it is the culprit. It must be something that is able to block the whole system for a certain time (for instance big writes that overwhelm the internal buffers or something). But because they are only visible for a split second, the logs do not catch such events. Sticking to networking, I would suspect missing/wrong Flow Control, but you are running at 1G speed only, Flow Control is not necessary there. To be safe I would disable the WiFi chip in the BIOS so UNRAID does not see it and maybe try to initialize it (which also costs time and causes hiccups). But this is a very far-fetched guess. I don't think this could be the reason. (But then, just try and see. It does not harm anything.)
I can try running a diagnostic when the pings are high too, but it's so random. It happens a lot within a few minutes, and then nothing for an hour.
I've always run those Java and Python apps; they never caused me any problems. The server has been running for 6 years and this issue is pretty recent, so I don't believe it is those.
I will try to collect more relevant evidence.
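Since the spikes are this random, one way to collect evidence is to leave a small watcher running that records a timestamp and the load average whenever a ping to the gateway is slow or lost. The following is only a sketch under stated assumptions: Python 3 is available, the system `ping` is the Linux iputils one (`-c` for count, `-W` for timeout in seconds), and the 100 ms threshold and `ping_spikes.log` file name are arbitrary choices. The function names are made up for illustration.

```python
import re
import subprocess
import time
from datetime import datetime

# Round-trip time as printed by ping, e.g. "time=0.412 ms"
RTT = re.compile(r"time[=<]([\d.]+)\s*ms")


def parse_rtt_ms(ping_output):
    """Extract the round-trip time in ms from ping output, or None."""
    match = RTT.search(ping_output)
    return float(match.group(1)) if match else None


def probe(host, timeout_s=2):
    """Send one echo request with the system ping; return RTT in ms or None."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            capture_output=True, text=True, timeout=timeout_s + 1,
        )
    except subprocess.TimeoutExpired:
        return None
    return parse_rtt_ms(result.stdout)


def watch(host, threshold_ms=100.0, interval_s=1.0, logfile="ping_spikes.log"):
    """Append a line to logfile whenever a probe is slow or lost."""
    while True:
        rtt = probe(host)
        if rtt is None or rtt >= threshold_ms:
            with open("/proc/loadavg") as f:  # Linux only
                load = f.read().strip()
            stamp = datetime.now().isoformat(timespec="seconds")
            with open(logfile, "a") as out:
                out.write(f"{stamp} rtt_ms={rtt} loadavg={load}\n")
        time.sleep(interval_s)


if __name__ == "__main__":
    watch("192.168.0.1")  # gateway address mentioned earlier in the thread
```

The timestamps in the resulting file can then be lined up against the syslog and against whatever containers or scheduled jobs were active at those moments.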
January 15 Community Expert
1 minute ago, VozDeOuro said: I can try running a diagnostic when the pings are high too, but it's so random. It happens a lot within a few minutes, and then nothing for an hour.
Yeah, I was afraid that it happens that way.
I did not say Java or Python are bad, it is just that these languages allow constructs that may cause a really heavy demand for memory, which results in a "system shock" to free up enough memory to satisfy the demand. And a second later, the memory is released again. Very simple statements like "a = b" may trigger this because b can be a gigantic array with subarrays and so on. This is programmer stuff, but often even programmers have no clue what they are doing :-)
Anyway, this is nothing you can do anything about.
You could only turn off these apps/dockers one by one and watch whether the pings are still bad or become stable.
Once you have found the bad one, there may be a chance to fix it. Dunno.
Anyway, could be a loooong and depressing search, sorry.
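A side note on the "a = b" example, as far as Python is concerned: plain assignment only binds a second name to the same object and copies nothing, so on its own it does not cause a large allocation. It is explicit copies (or code that builds a new large structure) that do. A minimal illustration, with sizes chosen arbitrarily:

```python
import copy

# A nested list with about a million elements
b = [[0] * 1000 for _ in range(1000)]

a = b                     # binds a second name to the same list; no copy
same_object = a is b      # True: both names refer to one object

c = copy.deepcopy(b)      # this is what actually duplicates all the data
is_copy = (c is not b) and (c == b)

print(same_object, is_copy)
```

Other languages and array libraries can behave differently, so this does not rule out the general point about sudden memory demand; it only narrows down where to look in Python code.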
January 15 Author
2 minutes ago, MAM59 said: Yeah, I was afraid that it happens that way. I did not say Java or Python are bad, it is just that these languages allow constructs that may cause a really heavy demand for memory, which results in a "system shock" to free up enough memory to satisfy the demand. And a second later, the memory is released again. Very simple statements like "a = b" may trigger this because b can be a gigantic array with subarrays and so on. This is programmer stuff, but often even programmers have no clue what they are doing :-) Anyway, this is nothing you can do anything about. You could only turn off these apps/dockers one by one and watch whether the pings are still bad or become stable. Once you have found the bad one, there may be a chance to fix it. Dunno. Anyway, could be a loooong and depressing search, sorry.
I was able to get two diagnostics just now lol. I just had two massive spikes, so I started the diagnostic as soon as the ping started spiking.
As you can see, the ping goes very high, making a lot of stuff think it's down.
tower-diagnostics-20260115-0141.zip ping.txt tower-diagnostics-20260115-0140.zip
January 15 Community Expert
Yeah, I am aware that it happens, but I cannot see anything in UNRAID that may cause it.