First unraid server crash. What do I do?

glenner · August 9, 2017

Hey! So I set up my new unraid server a month or so ago. It's been running well, and server uptime was at 20 days... until today, when I had my first crash. It looked like this:

Around 2pm this afternoon, I noticed my Logitech Media Center docker seemed to be down, as the Squeezebox displays around my house were offline (they require unraid to be online).
The unraid dashboard was offline.
I could ping the unraid server IP, but ssh/telnet to the same IP did not work.
My unraid server is in the basement and headless. I connect remotely via ssh. I did have an ssh window open overnight, and noticed the messages shown below in the console. I had dozens of these messages in the console starting around 4:37AM.
Since I could not access the server via the UI or ssh, I did a hard reset on the box. Not sure what other options I have in this case. It has since powered up and I've started the array, and it forced me to start a parity check.
At the time, I had only 7 dockers installed and all were running (delugevpn, crashplan, logictechmediaserver, plex, noip, sickrage, sagetv), plus 1 Windows 10 VM (which doesn't do much... I set it up last week to try).
On reboot, I got a warning that my cache is 72% used (likely not an issue?). My mover is set to run every 8 hours.

So, any ideas on how I can debug this or what I should look at? What should I do next time this happens? I'm really hoping this is a one-off, as I hate to have the array go down hard like that and force a parity check... Is there a better way than the hard power cycle I did?

Thanks!

Message from syslogd@unraid at Aug  9 00:04:37 ...
 kernel:page:ffffea000e823b80 count:0 mapcount:0 mapping:          (null) index:                                                 0x1
Message from syslogd@unraid at Aug  9 00:04:37 ...
 kernel:flags: 0x200000000000014(referenced|dirty)
Message from syslogd@unraid at Aug  9 00:04:37 ...
 kernel:page:ffffea0015220000 count:0 mapcount:0 mapping:          (null) index:                                                 0x1
Message from syslogd@unraid at Aug  9 00:04:37 ...
 kernel:page:ffffea0015220040 count:0 mapcount:0 mapping:          (null) index:                                                 0x1
Message from syslogd@unraid at Aug  9 00:04:37 ...
 kernel:flags: 0x200000000000014(referenced|dirty)
Message from syslogd@unraid at Aug  9 00:04:37 ...
 kernel:page:ffffea0015220080 count:0 mapcount:0 mapping:          (null) index:                                                 0x1
Message from syslogd@unraid at Aug  9 00:04:37 ...
 kernel:flags: 0x200000000000014(referenced|dirty)
Message from syslogd@unraid at Aug  9 00:04:37 ...
 kernel:page:ffffea00152200c0 count:0 mapcount:0 mapping:          (null) index:                                                 0x1
Message from syslogd@unraid at Aug  9 00:04:37 ...
 kernel:flags: 0x200000000000014(referenced|dirty)
Message from syslogd@unraid at Aug  9 00:04:37 ...
 kernel:page:ffffea0015220100 count:0 mapcount:0 mapping:          (null) index:                                                 0x1
Message from syslogd@unraid at Aug  9 00:04:37 ...
 kernel:flags: 0x200000000000014(referenced|dirty)
Message from syslogd@unraid at Aug  9 00:04:37 ...
 kernel:page:ffffea0015220140 count:0 mapcount:0 mapping:          (null) index:                                                 0x1

Frank1940 · August 9, 2017

You can try a quick push (~ 1 sec) of the power button to initiate a proper shutdown. It may not work, but it should be your first thing to try when the GUI, ssh and console are unresponsive. As you probably know a long push (> 5 sec) will force a hard shutdown.

Do you have any entries in that ssh session that preceded the start of these repeating entries? (Oftentimes, it is what was happening just prior to the repeaters that gives a clue as to the cause.)
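For reference, a minimal sketch of how a pasted console capture like the one above could be reduced to the first occurrence of each repeating message, which makes it easier to see whether anything different came before the repeaters. Python is used only for illustration (it is not assumed to be installed on the server), and the function names are hypothetical.

```python
import re

# Header line that syslogd broadcasts to open terminals before each message.
HEADER = re.compile(r"^Message from syslogd@(?P<host>\S+) at (?P<when>.+?) \.\.\.$")

def pair_console_messages(text):
    """Return (timestamp, message) pairs from a pasted console capture."""
    pairs = []
    when = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = HEADER.match(line)
        if match:
            when = match.group("when")
        elif when is not None:
            # Collapse runs of spaces left over from terminal padding.
            pairs.append((when, " ".join(line.split())))
            when = None
    return pairs

def first_of_each_kind(pairs):
    """Keep the first occurrence of each message kind, in order of appearance."""
    seen = {}
    for when, msg in pairs:
        # "kernel:page:..." and "kernel:flags: ..." are treated as separate kinds.
        kind = msg.split(":", 2)[1] if msg.startswith("kernel:") else msg
        seen.setdefault(kind, (when, msg))
    return list(seen.values())
```

Calling `first_of_each_kind(pair_console_messages(text))` on the saved console text returns one entry per message kind with the timestamp at which it first appeared.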

You can also install the "Fix Common Problems" plugin and turn on the "Troubleshooting mode". That will continually write the latest log entries to a file in the logs folder on your flash drive.

You seem to have some tuners installed in your server. Make sure that you have the latest drivers for those tuners.

glenner · August 10, 2017

Thanks for the info, Frank. I will try the quick power button press next time and see if I can get the system to shut down gracefully. That would be really helpful.

There were no other entries in the ssh window. I pasted the first messages at 4:37am above. So nothing else useful in the ssh window....

I had previously installed the FCP plugin, but had not enabled the troubleshooting mode. I'll try that next time if the problem persists. That said, I am surprised the system logs are not persisted across a reboot? There are no historical logs kept? I'm not clear why that would be... since it's specifically when the system crashes hard that it would be really nice to see some logs.

I have 2 HDHomerun OTA/ATSC tuners on my network. These are external devices connected to my LAN, and are not directly attached to the unraid server. The tuners work with the SageTV docker to record OTA TV shows. Firmware on these tuners is up to date... Generally SageTV + HDHR tuners are pretty stable...

Actually, it does look like my SageTV docker was recording a 1-hour show and did not stop recording. It recorded for 14 hours and generated a 114GB file on the cache drive. I've never seen that happen before, and I've opened another thread on the SageTV forum to see if anyone has seen anything like it, as it's highly abnormal.

My SageTV thread is here: https://forums.sagetv.com/forums/showthread.php?p=609274

Other than the fact that clearly SageTV should not have allowed a 114GB mpg file to be written to the cache... My 250GB cache was only 75% full at the time of the crash... Does this sound like something that could make the whole unraid server unresponsive and effectively crash the server?
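A runaway recording like this could be spotted earlier by periodically scanning the cache for unusually large files. The following is only a rough sketch: the function name is hypothetical, the threshold is arbitrary, and Python is used for illustration rather than because it ships with the server.

```python
import os

def find_large_files(root, min_bytes):
    """Yield (size, path) for files under root that are at least min_bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # The file vanished or cannot be read; skip it.
                continue
            if size >= min_bytes:
                yield size, path
```

For example, `sorted(find_large_files("/mnt/cache", 20 * 1024**3), reverse=True)` would list files of 20 GiB or more, largest first, assuming the cache drive is mounted at `/mnt/cache` (an assumption; adjust the path for your system).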

Frank1940 · August 11, 2017

Was the mover trying to run at the time when this file was being saved to the cache drive? (I am asking this question in the hope that someone who is better acquainted with the mover will know whether this could be an issue...)

glenner · August 11, 2017

That's a good question. It could have been... My mover was set to run every 8 hours, though I'm not sure what the start time would be. As the syslog is cleared after the reboot, I'm not clear how I can tell on a historical basis when the mover was running. Can I? The only log I've seen is here: http://unraid/log/syslog, and that restarts on each boot. Is there another log?
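If a copy of the syslog from before the crash were available, the mover activity could be pulled out with something like the sketch below. It assumes that mover entries contain the word "mover" and that lines start with the traditional 15-character syslog timestamp ("Aug  9 00:04:37"); both are assumptions about the log format, and the function name is hypothetical.

```python
def entries_mentioning(log_text, keyword="mover"):
    """Return (timestamp, rest_of_line) for log lines containing keyword."""
    needle = keyword.lower()
    hits = []
    for line in log_text.splitlines():
        if needle in line.lower():
            # Traditional syslog lines begin with a 15-character
            # "Mmm dd hh:mm:ss" stamp (assumed format).
            hits.append((line[:15], line[15:].strip()))
    return hits
```

Comparing the returned timestamps with the time of the first kernel message would show whether the mover was active when the trouble started.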

Again, I'm not clear why historical logs are not persisted. I'll need to poke around some more and confirm that's for real. Isn't that kind of like saying the flight data recorder is only preserved if the plane lands OK? If the plane crashes and burns, the recorder is lost? :-)

glenner · August 12, 2017

Actually, I did find Squid's solution to archive the logs... Looks like that issue has been solved using a short script and the User Scripts plugin.
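Squid's script itself is not reproduced in this thread, so the following is not that script, just a minimal sketch of the general idea: copy the live syslog to persistent storage under a timestamped name and prune old copies. The function name and the `keep` parameter are hypothetical, and Python is used for illustration only.

```python
import shutil
import time
from pathlib import Path

def archive_log(source, dest_dir, keep=10):
    """Copy source into dest_dir under a timestamped name; prune old copies."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    source = Path(source)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = dest_dir / f"{source.name}-{stamp}.txt"
    shutil.copy2(source, target)
    # Names sort chronologically because the stamp is zero-padded.
    copies = sorted(dest_dir.glob(f"{source.name}-*.txt"))
    for old in copies[:-keep]:
        old.unlink()
    return target
```

A scheduled call such as `archive_log("/var/log/syslog", "/boot/logs/archive")` assumes the live log is at `/var/log/syslog` and the flash drive is mounted at `/boot`; both paths are assumptions to check on your own system. Note that a scheduled copy only captures what was logged up to its last run, so entries written just before a hard crash can still be lost.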
