Please help. My unRAID is constantly locking up, and has become unusable.


I have been having this particular issue for a couple of days now. My server will become mostly unresponsive. Sometimes it recovers, and sometimes a hard reboot is the only option. I'll do my best to lay out everything I know, but I'm beginning to feel overwhelmed.

 

Now, what I mean by mostly unresponsive is that some of my docker containers will still work, just slowly (such as NGINX, Wallabag, PiHole, etc.), while others become unresponsive, most notably Plex. Sometimes the unraid UI becomes inaccessible. Sometimes it can be accessed, but I can't get diagnostics, I can't start or stop any containers or the array, and I can't reboot or shut down. It's like the UI is visible, but I can't actually DO anything. When this happens I can usually still SSH in, but I can't control anything. I tried to reboot over SSH by running "shutdown -r now" and it came back with the usual "system is going down" message, but the server never actually went down.

 

I do have a separate Syslog server, so I have been monitoring the errors, and the only thing that catches my attention is Kernel errors pertaining to "aacraid", which is the driver for my Adaptec 71605 HBA. I did find a page discussing solutions for my card, but it's written for Ubuntu, so I don't know how, or even whether, I should apply that to unRAID. The Page is here. But I'm not even sure if this is the issue, or just a coincidence.
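For anyone who wants to pull just the aacraid lines out of a plain-text syslog, here is a minimal sketch. It assumes Python 3 is available on whatever machine holds the log and that the log has been saved as ordinary text with one message per line. The function names are made up for illustration and are not part of any unRAID tool.

```python
from collections import Counter


def filter_driver_lines(lines, driver="aacraid"):
    """Return the lines that mention the driver name (case-insensitive)."""
    needle = driver.lower()
    return [line.rstrip("\n") for line in lines if needle in line.lower()]


def count_by_prefix(lines, width=6):
    """Count matching lines per leading prefix of each line.

    With a classic syslog timestamp ("Apr 28 10:07:11 ...") the first
    six characters are the month and day, so this gives a per-day count.
    The width is an assumption about the log layout; adjust it as needed.
    """
    return Counter(line[:width] for line in lines)
```

Usage would be something like `hits = filter_driver_lines(open("syslog.txt", errors="replace"))`, where `syslog.txt` is a placeholder file name, followed by `count_by_prefix(hits)` to see whether the errors cluster around the times the server locked up.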

 

So far what I have done:

-Ran Memtest (passed)

-Checked SATA cables

-Ran chkdsk on flash (passed)

-Updated mobo to latest BIOS

-Updated RAID card to latest BIOS

-Tried RAID card in other PCIe slot

-unraid is fully up to date

-All plugins are fully up to date

-Disabled write caching on HBA disks (this was mostly to protect me from all the hard shutdowns I'm doing)

 

System Specs:

Ryzen 5 2600

ROG Strix B450-F Gaming

32GB Corsair Dominator 

Adaptec 71605 RAID Controller

 

I attached:

-Full Syslog

-Syslog showing only Kernel errors 

-Diagnostics, although I'm unsure if it will really show much.

 

EDIT: Attached a copy of the logs as .txt files in a .zip archive.

 

All_2021-4-28-10_7_11.csv Kernel-Errors-Only_All_2021-4-28-10_8_4.csv serverus-diagnostics-20210428-1052.zip

Logs.zip

Edited by relink
Link to post
10 minutes ago, trurl said:

Instead of .csv, please attach syslogs as plain text files or even better, zipped plain text files.

Sorry about that, just went with what my Syslog server gives me.

 

New copy is attached. I'll also add it to the original post too. 

Logs.zip

Link to post

Those are maybe even more difficult to make sense of. They are in descending timestamp order just like the .csv, but can't be easily sorted since they are text files. Maybe the .csv would be better but I don't have a licensed copy of Excel on this computer.

 

Sorry, I am just more familiar with looking at syslogs in the usual way they appear in Diagnostics. Maybe someone else will take a look.

 

Or, you could try to get Diagnostics from the command line when the problem occurs.

Link to post
31 minutes ago, trurl said:

Those are maybe even more difficult to make sense of.

Is the attached file any better? 

 

31 minutes ago, trurl said:

Or, you could try to get Diagnostics from the command line when the problem occurs.

Unfortunately I can't; it just freezes when I try to get diagnostics.

All_2021-4-28-12_39_59.html

Link to post

I looked over it. I'm no expert, but you do have quite a few errors and warnings. I'm not sure which of them are critical or would cause hangups/crashes.

 

Anyway, I took your CSV file, sorted it in descending order by date and time stamp, then exported it as a tab-delimited txt file. Maybe this will help others to interpret it better.

All_2021-4-28-10_7_11_tab delimited.zip
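If a spreadsheet isn't handy, the same sort-and-export step can be sketched in a few lines of Python. This is only a rough sketch: the column names `Date` and `Time` and the timestamp formats are assumptions, since the actual headers depend on what the syslog server exports, so adjust them to match the real file.

```python
import csv
import io
from datetime import datetime


def sort_syslog_csv(csv_text, date_col="Date", time_col="Time",
                    date_fmt="%Y-%m-%d", time_fmt="%H:%M:%S",
                    newest_first=False):
    """Sort CSV rows by timestamp and return tab-delimited text.

    date_col, time_col and the two formats are hypothetical defaults;
    change them to match the export being processed.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)

    def timestamp(row):
        # Combine the two columns into one sortable datetime.
        return datetime.strptime(
            f"{row[date_col]} {row[time_col]}", f"{date_fmt} {time_fmt}"
        )

    rows.sort(key=timestamp, reverse=newest_first)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

By default this puts the oldest entry first, which is the order syslogs normally appear in within Diagnostics; pass `newest_first=True` to flip it.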

Edited by rodan5150
Link to post
41 minutes ago, rodan5150 said:

I took your CSV file, sorted it in descending order by date and time stamp, then exported it as a tab-delimited txt file.

Thanks, I was trying to figure out a quick way to do that but couldn't think of anything. 

Link to post
37 minutes ago, Vr2Io said:

Please run the UEFI version of memtest86 and set it to test all CPUs in parallel.

Wow, I just read that entire thread. I really hope it's not a RAM or CPU issue; neither is exactly affordable right now. But I'll take your advice and run the test as soon as I get home from work in about an hour.

Link to post
11 minutes ago, relink said:

Wow, I just read that entire thread. I really hope it's not a RAM or CPU issue; neither is exactly affordable right now. But I'll take your advice and run the test as soon as I get home from work in about an hour.

Great that you have read the entire thread. Personally, I can confirm that RAM issues are quite common and easy to fix. The CPU is very seldom the cause; at least, I have never run into that.

I'll just say that I wouldn't test with the legacy memtest86.

Edited by Vr2Io
Link to post
12 hours ago, trurl said:

Don't know if there is anything here for you or not

I read over that thread numerous times. It really only seems to pertain to 1st gen Ryzen, but I tried it anyway, and it's made no difference.

Link to post
6 minutes ago, relink said:

@Vr2Io Alright, the memtest ran overnight and passed.

 

Then the main hardware should be healthy. Next I would look at the HBA. Are the containers / Docker image stored on the array or on a separate SSD?

Edited by Vr2Io
Link to post
2 minutes ago, Vr2Io said:

 

Then the main hardware should be healthy. Next I would look at the HBA. Are the containers / Docker image stored on the array or on a separate SSD?

Appdata, System, and Domains are all on an NVMe cache pool.

 

Also, I have the Maxview utility installed as a container. I don't know if you're familiar with it, but it's a tool to monitor and manage Adaptec cards. The screenshot below is my HBA's status, which all looks good to me.

Screen Shot 2021-04-29 at 10.06.44 AM.png

Link to post

I don't have an ASR71605. Next, I would suggest unplugging the HBA and all of the array disks, but keeping the disks powered on.

Then do a New Config and add a dummy disk connected to onboard SATA, so you can start the array, Docker, and the containers and check whether the system crashes again.

Please back up the whole boot flash and write down all disk assignments.

Link to post
56 minutes ago, Vr2Io said:

I don't have an ASR71605. Next, I would suggest unplugging the HBA and all of the array disks, but keeping the disks powered on.

Then do a New Config and add a dummy disk connected to onboard SATA, so you can start the array, Docker, and the containers and check whether the system crashes again.

Please back up the whole boot flash and write down all disk assignments.

What would I need to do to ensure that I will be able to restore my current config when I'm done?

 

Also, I'm not sure anything would happen. The troubleshooting I have done so far seems to point to issues when there is a high number of writes going to the array. I kept my server online all day yesterday until I ran the memtest, and it ran fine. What I did differently was disable things like Radarr and Sabnzbd, anything that would do heavy writes.

 

In fact, I'm about to run a test and see if I can make it crash by doing a large transfer from my desktop to unraid.

Link to post
9 minutes ago, relink said:

What I did differently was disable things like Radarr and Sabnzbd, anything that would do heavy writes.

I always do heavy reads / writes on the array, but mainly sequential I/O. I don't run Radarr or Sabnzbd.

 

9 minutes ago, relink said:

I need to do to ensure that I will be able to restore my current config

You can use Unraid's built-in flash backup feature.

 

Was the same system running stably before? With which version of Unraid? Or did the problem start after some change?

 

Please see the similar case below. Please hold off on removing the HBA and array disks for now, and keep track of whether the crashes stop when there is no heavy random I/O.

 

Edited by Vr2Io
Link to post
Just now, Vr2Io said:

Will a parity check (no correction) also crash?

No, parity checks run just fine, over 100MB/s the entire time.

 

I'm also currently running a simple test: I'm transferring several hundred GB of data from one of my computers to a no-cache share on my unraid server... and so far I'm transferring at about 36MB/s, there's not a single error in my log, and everything is still working... I don't get it...
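To put timestamps on a write test like this, so that a stall can be lined up against the syslog, something along these lines could work. It's a minimal sketch that assumes Python 3 on the machine doing the writing and a target path on the share being tested. The function names, the chunk size, and the 5 MB/s threshold are all arbitrary choices for illustration.

```python
import os
import time


def timed_write(path, total_bytes, chunk_bytes=1024 * 1024):
    """Write total_bytes of random data to path, timing each chunk.

    Returns a list of (wall_clock_start, bytes_written, seconds) tuples.
    """
    chunk = os.urandom(chunk_bytes)  # generated once and reused for every chunk
    samples = []
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            wall = time.time()
            start = time.monotonic()
            f.write(chunk[:n])
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to storage
            samples.append((wall, n, time.monotonic() - start))
            written += n
    return samples


def slow_chunks(samples, min_mb_per_s=5.0):
    """Return the samples whose throughput fell below min_mb_per_s."""
    slow = []
    for wall, n, seconds in samples:
        if seconds > 0 and (n / 1e6) / seconds < min_mb_per_s:
            slow.append((wall, n, seconds))
    return slow
```

The wall-clock value in each slow sample can be converted with `time.ctime()` and compared against the time of any aacraid messages in the log. Syncing after every chunk makes the test slower than a normal copy, so the absolute numbers won't match a regular transfer; the point is only to see when the writes stall.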

Link to post

Just to spice things up, since just writing data doesn't seem to be enough, I decided to start a non-correcting parity check while it's still transferring. I'll probably start a large read transfer now too.

Link to post

Ok, so I have several hundred GB going into a no-cache share, I have several hundred GB being read from the array, and I'm running a parity check.

The writes have slowed to about 1MB/s or less, the reads are between 34-40MB/s, and the parity check is running at about 45MB/s. Also, CPU usage is around 50% on average, and RAM is around 50%.

 

Ok, things changed before I even posted this. The parity check is now running at around 80MB/s and the writes have practically stalled.

Link to post
