relink Posted April 28, 2021 (edited)

I have been having this particular issue for a couple of days now. My server becomes mostly unresponsive. Sometimes it recovers, and sometimes a hard reboot is the only option. I'll do my best to lay out everything I know, but I'm beginning to feel overwhelmed.

What I mean by mostly unresponsive is that some of my Docker containers still work, just slowly (such as NGINX, Wallabag, PiHole, etc.), while others become unresponsive, most notably Plex. Sometimes the Unraid UI becomes inaccessible. Sometimes it can be accessed, but I can't get diagnostics, I can't start or stop any containers or the array, and I can't reboot or shut down. It's like the UI is visible, but I can't actually DO anything. When this happens I can usually still SSH in, but I can't control anything. I tried to reboot over SSH by running "shutdown -r now" and it came back with the usual "system is going down" message, but it never actually did anything.

I do have a separate syslog server, so I have been monitoring the errors, and the only thing that catches my attention is kernel errors pertaining to "aacraid", which is the driver for my Adaptec 71605 HBA. I did find a page discussing solutions for my card, but it's written for Ubuntu, so I don't know how, or even whether, I should apply that to Unraid. The page is here. But I'm not even sure if this is the issue or just a coincidence.

So far what I have done:
-Ran Memtest (passed)
-Checked SATA cables
-Ran chkdsk on flash (passed)
-Updated mobo to latest BIOS
-Updated RAID card to latest BIOS
-Tried RAID card in another PCIe slot
-Unraid is fully up to date
-All plugins are fully up to date
-Disabled write caching on HBA disks (this was mostly to protect me from all the hard shutdowns I'm doing)

System Specs:
Ryzen 5 2600
ROG Strix B450-F Gaming
32GB Corsair Dominator
Adaptec 71605 RAID Controller

I attached:
-Full syslog
-Syslog showing only kernel errors
-Diagnostics, although I'm unsure if it will really show much.

EDIT: Attached a copy of the logs as .txt files in a .zip archive.

All_2021-4-28-10_7_11.csv
Kernel-Errors-Only_All_2021-4-28-10_8_4.csv
serverus-diagnostics-20210428-1052.zip
Logs.zip

Edited April 28, 2021 by relink
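For anyone who wants to pull the "aacraid" kernel messages out of a plain-text syslog themselves, here is a minimal sketch. It assumes kernel messages carry a "kernel:" tag somewhere in the line, which is common for syslog output but not guaranteed for every syslog server export. The function name and the sample lines are made up for illustration and are not taken from the attached logs.

```python
from typing import Iterable, List


def kernel_lines_mentioning(lines: Iterable[str], driver: str = "aacraid") -> List[str]:
    """Return the lines that look like kernel messages and mention `driver`.

    Assumption: a kernel message contains the tag "kernel:" somewhere in
    the line. The match is case-insensitive.
    """
    wanted = driver.lower()
    matches = []
    for line in lines:
        lowered = line.lower()
        if "kernel:" in lowered and wanted in lowered:
            matches.append(line.rstrip("\n"))
    return matches


if __name__ == "__main__":
    # Placeholder lines only; they do not come from the attached logs.
    sample = [
        "Apr 28 10:00:00 host kernel: aacraid: placeholder message\n",
        "Apr 28 10:00:01 host nginx: placeholder message\n",
        "Apr 28 10:00:02 host kernel: unrelated placeholder message\n",
    ]
    for hit in kernel_lines_mentioning(sample):
        print(hit)
```

To run it against a real file, pass an open file object instead of the sample list, since a file object also yields one line at a time.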
trurl Posted April 28, 2021
Instead of .csv, please attach syslogs as plain text files or, even better, zipped plain text files.
relink (Author) Posted April 28, 2021
10 minutes ago, trurl said: Instead of .csv, please attach syslogs as plain text files or, even better, zipped plain text files.
Sorry about that, I just went with what my syslog server gives me. A new copy is attached. I'll add it to the original post too.
Logs.zip
trurl Posted April 28, 2021
Those are maybe even more difficult to make sense of. They are in descending timestamp order just like the .csv, but can't be easily sorted since they are text files. Maybe the .csv would be better, but I don't have a licensed copy of Excel on this computer. Sorry, I am just more familiar with looking at syslogs in the usual way they appear in Diagnostics. Maybe someone else will take a look. Or you could try to get Diagnostics from the command line when the problem occurs.
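If the only problem with a text export is that it is newest-first, flipping the line order may be enough. A minimal sketch, assuming every log entry occupies exactly one line (multi-line entries would end up scrambled); the function name is made up for illustration:

```python
def oldest_first(text: str) -> str:
    """Reverse the line order of a newest-first log so it reads oldest-first.

    Assumption: one log entry per line.
    """
    lines = text.splitlines()
    lines.reverse()
    # Keep a trailing newline only when there is at least one line.
    return "\n".join(lines) + ("\n" if lines else "")


if __name__ == "__main__":
    # Placeholder entries, newest first.
    newest_first = "third entry\nsecond entry\nfirst entry\n"
    print(oldest_first(newest_first), end="")
```

This only reverses the existing order; it does not parse timestamps, so it relies on the export already being strictly sorted.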
relink (Author) Posted April 28, 2021
31 minutes ago, trurl said: Those are maybe even more difficult to make sense of.
Is the attached file any better?
31 minutes ago, trurl said: Or, you could try to get Diagnostics from the command line when the problem occurs.
Unfortunately I can't; it just freezes when I try to get diagnostics.
All_2021-4-28-12_39_59.html
rodan5150 Posted April 28, 2021 (edited)
I looked it over. I'm no expert, but you do have quite a few errors and warnings. I'm not sure which of them are critical or would cause hangups/crashes. Anyway, I took your CSV file, sorted it in descending order by date and time stamp, then exported it as a tab-delimited txt file. Maybe this will help others interpret it better.
All_2021-4-28-10_7_11_tab delimited.zip
Edited April 28, 2021 by rodan5150
relink (Author) Posted April 28, 2021
41 minutes ago, rodan5150 said: I took your CSV file, sorted it in descending order by date and time stamp, then exported it as a tab-delimited txt file.
Thanks, I was trying to figure out a quick way to do that but couldn't think of anything.
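The sort-and-export step described above can also be done without a spreadsheet. Below is a minimal sketch using only the Python standard library. The column names ("Date", "Time") and the timestamp format are assumptions, since the actual layout of the syslog server's CSV export is not shown in this thread; adjust them to match the real header row. The sort direction is selectable.

```python
import csv
import io
from datetime import datetime


def csv_to_sorted_tsv(csv_text, date_col="Date", time_col="Time",
                      fmt="%Y-%m-%d %H:%M:%S", newest_first=False):
    """Sort CSV rows by timestamp and return them as tab-delimited text.

    date_col, time_col and fmt are hypothetical defaults, not the real
    layout of any particular syslog server export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fieldnames = reader.fieldnames or []

    def stamp(row):
        # Combine the two columns and parse them into one datetime.
        return datetime.strptime(f"{row[date_col]} {row[time_col]}", fmt)

    rows.sort(key=stamp, reverse=newest_first)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    # Placeholder rows, not taken from the attached logs.
    sample = (
        "Date,Time,Message\n"
        "2021-04-28,10:07:11,second placeholder\n"
        "2021-04-28,10:07:10,first placeholder\n"
    )
    print(csv_to_sorted_tsv(sample), end="")
```

Parsing the timestamps rather than sorting the raw strings avoids surprises with formats where text order and time order differ, for example dates without zero padding.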
Vr2Io Posted April 28, 2021 (edited)
Please run the UEFI version of MemTest86 and set it to test all CPUs in parallel.
Edited April 28, 2021 by Vr2Io
relink (Author) Posted April 28, 2021
37 minutes ago, Vr2Io said: Please run the UEFI version of MemTest86 and set it to test all CPUs in parallel.
Wow, I just read that entire thread. I really hope it's not a RAM or CPU issue; neither is exactly affordable right now. But I'll take your advice and run the test as soon as I get home from work in about an hour.
Vr2Io Posted April 28, 2021 (edited)
11 minutes ago, relink said: Wow, I just read that entire thread. I really hope it's not a RAM or CPU issue; neither is exactly affordable right now. But I'll take your advice and run the test as soon as I get home from work in about an hour.
Great that you have read the entire thread. From personal experience, RAM issues are quite common and easy to fix. The CPU is very seldom the cause; at least I have never faced that. I'll just say that I wouldn't test with the legacy memtest86.
Edited April 28, 2021 by Vr2Io
relink (Author) Posted April 28, 2021
So far so good. I'm going to let it run overnight and I'll check it again in the morning.
trurl Posted April 29, 2021
10 hours ago, relink said: Ryzen 5 2600
Don't know if there is anything here for you or not:
relink (Author) Posted April 29, 2021
12 hours ago, trurl said: Don't know if there is anything here for you or not
I read over that thread numerous times. It really only seems to pertain to 1st-gen Ryzen, but I tried it anyway, and it's made no difference.
relink (Author) Posted April 29, 2021
@Vr2Io Alright, the memtest ran overnight and passed.
MemTest86-Report-20210428-211336.html
Vr2Io Posted April 29, 2021 (edited)
6 minutes ago, relink said: @Vr2Io Alright, the memtest ran overnight and passed.
Then the main hardware should be healthy. Next I would look at the HBA. Are the containers / Docker image stored on the array or on a separate SSD?
Edited April 29, 2021 by Vr2Io
relink (Author) Posted April 29, 2021
2 minutes ago, Vr2Io said: Then the main hardware should be healthy. Next I would look at the HBA. Are the containers / Docker image stored on the array or on a separate SSD?
Appdata, System, and Domains are all on an NVMe cache pool. Also, I have the maxView utility installed as a container. I don't know if you're familiar with it, but it's a tool to monitor and manage Adaptec cards. The screenshot below is my HBA's status, which all looks good to me.
Vr2Io Posted April 29, 2021
I don't have an ASR71605. Next I would suggest unplugging the HBA and all the array disks, but keeping the disks powered on. Then do a New Config and add a dummy disk connected to onboard SATA, so you can start the array, containers and Docker and check whether the system crashes again. Please back up the whole boot flash and write down all the disk assignments.
relink (Author) Posted April 29, 2021
56 minutes ago, Vr2Io said: I don't have an ASR71605. Next I would suggest unplugging the HBA and all the array disks, but keeping the disks powered on. Then do a New Config and add a dummy disk connected to onboard SATA, so you can start the array, containers and Docker and check whether the system crashes again. Please back up the whole boot flash and write down all the disk assignments.
What would I need to do to ensure that I will be able to restore my current config when I'm done? Also, I'm not sure anything would happen; the troubleshooting I have done so far seems to point to issues when there is a high number of writes going to the array. I kept my server online all day yesterday until I ran the memtest, and it ran fine. What I did differently was disable things like Radarr and Sabnzbd, anything that would do heavy writes. In fact, I'm about to run a test and see if I can make it crash by doing a large transfer from my desktop to Unraid.
Vr2Io Posted April 29, 2021 (edited)
9 minutes ago, relink said: What I did differently was disable things like Radarr and Sabnzbd, anything that would do heavy writes.
I always do heavy reads / writes on the array, but mainly sequential I/O. I haven't run Radarr or Sabnzbd.
9 minutes ago, relink said: What would I need to do to ensure that I will be able to restore my current config
You can use Unraid's built-in flash backup feature. Was the same system running stable before? With which version of Unraid, or did the problem start after some change? I found a similar case, see below. Please hold off on removing the HBA and array disks, and keep track of whether the crashes stop when there is no heavy random I/O.
Edited April 29, 2021 by Vr2Io
Vr2Io Posted April 29, 2021
Will a parity check (no correction) also crash?
relink (Author) Posted April 29, 2021
Just now, Vr2Io said: Will a parity check (no correction) also crash?
No, parity checks run just fine, over 100MB/s the entire time. I'm also currently running a simple test: I'm transferring several hundred GB of data from one of my computers to a no-cache share on my Unraid server... and so far I'm transferring at about 36MB/s, with not a single error in my log, and everything is still working... I don't get it...
Vr2Io Posted April 29, 2021
Then I assume it is not a hardware-related issue. As I have almost no random I/O on my Unraid, I really don't have many suggestions or ideas for this case.
relink (Author) Posted April 29, 2021
Just to spice things up, since just writing data doesn't seem to be enough, I decided to start a non-correcting parity check while it's still transferring. I'll probably start a large read transfer now too.
relink (Author) Posted April 29, 2021
Ok, so I have several hundred GB going into a no-cache share, several hundred GB being read from the array, and I'm running a parity check. The writes have slowed to about 1MB/s or less, the reads are between 34-40MB/s, and the parity check is running at about 45MB/s. Also, CPU usage is around 50% on average, and RAM is around 50%.
Ok, things changed before I even posted this. The parity check is now running at around 80MB/s and the writes have practically stalled.
Vr2Io Posted April 29, 2021 (edited)
Slowness is expected, but a crash is abnormal. Just run more tests to narrow down the symptom.
Edited April 29, 2021 by Vr2Io