UnRAID Hard Lock-up



I am running the latest version of UnRAID (6.1.9) on a new server built from all-new hardware less than six months ago. I ran hardware tests on all the gear while I was waiting for drives to arrive and everything passed with flying colors, so I don't think (and also hope) that the issue is hardware related. It is a Core i5-4460 with 16GB of DDR3 RAM.

 

Tonight I got home and noticed that the parity check status for the array had been stuck on 88.2% for a while, so I tried to refresh the WebUI and it hung on "waiting for NODE". I ran a ping and the server responded, which seemed like a good sign, so I tried to SSH into the server. No dice: it would not connect, and the session just hung there for hours without ever timing out. As a last resort before driving down the street to the office I tried to telnet into the machine, and again got no response.

 

I drove over to the office and plugged in a monitor and keyboard and this was displayed on the screen:

2016_07_05_23_18_18.jpg

 

The keyboard was unresponsive and would not light up at all when plugged in, so I couldn't even perform a safe power down in order to capture the system log. This picture is all I have to go on.

I googled the last line as it seemed the most likely to produce results and it turned up this, specifically:

"If the relevant grace-period kthread has been unable to run prior to the stall warning, the following additional line is printed:

rcu_preempt kthread starved for 2023 jiffies!"

So it sounds like my CPU stalled while processing a job. Does anyone have any suggestions for looking into this further?

 

You will find my diagnostics dump (for what it is worth without relevant syslog) attached to this post. I have turned on logging in putty and started tail -f in the hopes of catching something more useful if it happens again.

 

I noticed this in the system log after turning the server back on. I doubt it is related, but I am curious as to its meaning. Does it just mean my system hasn't been synchronized with an NTP server lately?

Jul 5 23:25:02 Node ntpd[1529]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized

node-diagnostics-20160705-2332.zip

Link to comment

There's not enough in the pic to help; we need what was just above it. If it occurs again, try using the Shift-PgUp key combination to scroll back. It usually works.

 

The TIME_ERROR message is common. I see it often and my system appears to sync fine, so it appears to be a harmless message.

Link to comment

So I wanted to set up a script to copy the syslog to my cache drive, so that if this does happen again I can hopefully get useful logs from it. The script is working and runs fine on its own, but I am having trouble getting the cron entry to be read and added by unraid. I have followed the advice here: https://lime-technology.com/forum/index.php?topic=44172.0 and I know the methods in that thread work because I have used them on my other unraid machine.

 

My syslog.cron file is stored in /boot/config/plugins/mycron/syslog.cron and contains:

# runs syslogCopy every 30 minutes
*/30 * * * * /boot/config/plugins/mycron/syslogCopy.sh 2> /mnt/cache/syslog_errors.txt
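The post does not show syslogCopy.sh itself. A minimal sketch of what such a script might contain is below; the function name copy_syslog and the destination directory are assumptions for illustration, not the poster's actual script. Errors go to stderr, so they would land in the syslog_errors.txt file named in the cron entry above.

```shell
#!/bin/bash
# Hypothetical sketch of a syslog copy helper; not the poster's actual script.

# copy_syslog SRC DEST_DIR
# Copies SRC into DEST_DIR under a timestamped name, so each run leaves
# its own snapshot instead of overwriting the previous one.
copy_syslog() {
    local src="$1"
    local dest_dir="$2"
    local stamp

    # Fail early, reporting to stderr, if the source log is unreadable.
    if [ ! -r "$src" ]; then
        echo "copy_syslog: cannot read $src" >&2
        return 1
    fi

    mkdir -p "$dest_dir" || return 1
    stamp="$(date +%Y%m%d-%H%M%S)"
    cp "$src" "$dest_dir/syslog-$stamp.txt"
}

# Example call (the destination path is an assumption):
# copy_syslog /var/log/syslog /mnt/cache/syslog_backup
```

Timestamped copies use more space than a single overwritten file, but they avoid losing the last good snapshot if the server hangs in the middle of a copy.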

 

I have run update_cron, rebooted the server, and even tried placing the cron file under the dynamix plugins folder with other .cron files and still no luck. My file permissions are:

-rwxrwxrwx  1 root root   124 Jul  9 08:05 syslog.cron
-rwxrwxrwx  1 root root   132 Jul  7 14:25 syslogCopy.sh

 

All of the above match what I have on my other unraid machine and it reads the cron file and does everything as expected. It is probably something obvious but I can't figure out what I am doing wrong.
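Two things that commonly make cron skip an otherwise correct entry are Windows (CRLF) line endings and a missing newline at the end of the file, both easy to introduce when a file is edited on another machine. The sketch below checks for them; check_cron_file is a hypothetical helper name, and nothing in the post confirms that either problem is the cause here.

```shell
#!/bin/bash
# Hypothetical helper: flags two common formatting problems in a cron file.

# check_cron_file FILE
# Prints a message for each problem found; returns 0 if none, 1 otherwise.
check_cron_file() {
    local file="$1"
    local status=0
    local cr

    # A literal carriage return, built portably with printf.
    cr="$(printf '\r')"
    if grep -q "$cr" "$file"; then
        echo "CRLF line endings found"
        status=1
    fi

    # tail -c 1 yields the last byte. Command substitution strips a trailing
    # newline, so a non-empty result means the file does not end in one.
    if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
        echo "missing trailing newline"
        status=1
    fi

    return "$status"
}

# Example call (path as given in the post):
# check_cron_file /boot/config/plugins/mycron/syslog.cron
```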

Link to comment

Far easier to use the user scripts plugin (although then the lowest you can go is hourly).

 

But the best way is actually far simpler. Log in at the local keyboard / monitor and type this:

 

tail -f /var/log/syslog > /boot/syslog.txt

 

That way the syslog copied to the flash will be current up to the point of the crash, no matter what.
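A variation on the command above, as a sketch: running the tail in the background under nohup lets it keep going after the login session ends, -n +1 includes the lines already in the log, and -F keeps following the file by name if it is replaced. The function name start_syslog_mirror is hypothetical; the flags are standard tail options.

```shell
#!/bin/bash
# Sketch only: the function name is an assumption, not from the thread.

# start_syslog_mirror SRC DEST
# Starts a background tail that writes SRC from its first line onward into
# DEST and keeps appending new lines as they arrive. Prints the tail's PID.
start_syslog_mirror() {
    local src="$1"
    local dest="$2"

    # -n +1 : start from the first line, so the existing log is included
    # -F    : keep following by name, even if the log file is replaced
    # nohup : keep running after the login shell exits
    nohup tail -n +1 -F "$src" > "$dest" 2>/dev/null &
    echo "$!"
}

# Example call (paths as in the post above):
# start_syslog_mirror /var/log/syslog /boot/syslog.txt
```

The printed PID can be passed to kill later to stop the copy.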

Link to comment

Well, when this crash happened the keyboard was non-functional, so I couldn't have copied the syslog anyway. That is why I wanted to set up a cron job, in the hopes of catching the cause of the issue. I will check out the user scripts plugin.

Link to comment

You type the command before the crash. It'll continually copy the syslog to the flash drive as it changes. That is better than a cron job, because the entries you need to look at may be logged between executions of the cron job.

Oh, I get what you are suggesting! Good idea, I will have to set that up on Monday when I go back in to work.

Link to comment

Of course it's Monday, so I am slammed and hadn't had a chance to make it back to the server room before I noticed the thing is hung and completely unresponsive again. I tried RobJ's suggestion of using PageUp to view more of the error but the server is locked up hard and completely unresponsive to any input besides a hard power off. I didn't have any disk activity so I think my parity is fine, but I will let the check run anyway.

 

Squid, I have taken your advice and started a tail -f of the syslog to the flash drive, so next time this happens I should have some useful log information to go off of.

Link to comment
I tried RobJ's suggestion of using PageUp to view more of the error but the server is locked up hard and completely unresponsive to any input besides a hard power off.
Typically the only reason for a machine to lock up that hard is a hardware fault.

 

Has this machine passed 24+ hours solid of memtest with no errors?

Link to comment

Yes, I mentioned in the OP that I ran extended hardware tests on all of the components, including a memtest for over 24 hours. The server ran for months sitting at my house while I got all my drives in and prepared to phase out my old server with this new one, and I never once had an issue like this until I brought it into the office and started running a bunch of docker containers and putting load on the machine. I suppose if hardware was going to fail on me it would be in the first year, so I can run another battery of tests on it and see what comes back once the parity check is finished.

 

EDIT: I do remember having a similar issue with my old server, where the kernel would panic and completely lock up due to insufficient RAM (I was running 4GB for the longest time but v6 needed at least 8GB). I don't think that is the issue here. Even with UnRAID, Deluge, CP, Sonarr, Headphones, NZBGet, & Plex running, 16GB of RAM should be enough. I watch those dockers like a hawk and I have never seen them use over 8GB combined, which in theory should leave plenty for UnRAID.
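One way to put numbers behind that observation is to record available memory at intervals, so the last reading before a lock-up survives. A minimal sketch, assuming a kernel that exposes the MemAvailable field in /proc/meminfo; the function name mem_available_mib and the log path in the example are hypothetical.

```shell
#!/bin/bash
# Sketch only: helper name and example log path are assumptions.

# mem_available_mib [FILE]
# Prints MemAvailable from FILE (default /proc/meminfo) in whole MiB.
mem_available_mib() {
    local file="${1:-/proc/meminfo}"
    # /proc/meminfo reports values in kB; divide by 1024 to get MiB.
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' "$file"
}

# Example: append a timestamped reading to a log file.
# echo "$(date '+%F %T') $(mem_available_mib) MiB available" >> /boot/mem.txt
```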

Link to comment

Yes I mentioned in the OP I ran extended hardware tests on all of the components including a memtest for over 24 hours.

That was before the issue started, correct? Have you run a memtest after the lockups started?

No, I had a bunch of work to do off the machine, and it ran fine all last week after the first crash while I finished it. I plan to run one after the parity check finishes.

Link to comment

I tried RobJ's suggestion of using PageUp to view more of the error but the server is locked up hard and completely unresponsive to any input besides a hard power off.

 

Just a clarification, it's Linux, with its own way of doing things - PgUp by itself won't work, you have to use the Shift key with it, Shift-PgUp.  Not that it would have made any difference in your situation!

Link to comment

Oh shoot, I missed the shift part when I read your post! Thanks for clarifying, I will keep it in mind for the future.
Link to comment

Update: I have run memtest for over 24 hours and have found no issues with the memory. I also ran a hardware test disk we use at work to verify that the rest of the hardware (motherboard, CPU, etc.) is testing out fine as well.

Just for funsies I bought another 16GB of RAM and ran memtest on them for 24 hours with no errors. I installed them in the server for a total of 32GB of RAM.

I have a tail -f running at the terminal so if it crashes again I will have a system log to post this time around.

 

Edit: It has been almost a week and so far no more hard lock-ups. Hopefully adding more RAM was the solution to the issue.

Link to comment
