[SOLVED] Unraid 6.8.2 Server Keeps Crashing


Hi Everyone, 

 

Unraid newb here, trying to figure out why my server keeps crashing. When I first set up the box it ran for over a week with no problems, but now it crashes all the time. Usually it stays up long enough to finish the parity check (~20hrs), but not always. I enabled syslog mirroring to the flash drive, and this is the file that I got. I have no idea what to look for. When the server crashes I have no access through the webGUI or SSH, and I can't even ping the server.

 

Hardware:

i7 4790k

16GB HyperX DDR3-1866 RAM

Z97 MSI Gaming 5 Motherboard

 

What is going on and how do I start to track down the issue?


19:04:42 is when I rebooted the server after yet another unclean shutdown.

 

Thanks!

 

edited: added hardware, off to try a memtest now.

 

syslog.zip

Edited by Greeno237
Link to comment

Ok, I was getting an error at first trying to launch memtest86 and had to make sure I was booting a non-UEFI instance of the USB drive. Now memtest is running. Just start it up and let it run for 24 hours/until I get an error?

Edited by Greeno237
clarification
Link to comment
17 minutes ago, Greeno237 said:

Just start it up and let it run for 24 hours/until I get an error?

yes

Quote

Mar  8 19:04:42 Server237 kernel: xhci_hcd 0000:00:14.0: new USB bus registered, assigned bus number 4
Mar  8 19:04:42 Server237 kernel: xhci_hcd 0000:00:14.0: Host supports USB 3.0 SuperSpeed


Mar  8 19:04:42 Server237 kernel: usb 3-14: new full-speed USB device number 3 using xhci_hcd
Mar  8 19:04:42 Server237 kernel: scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0 PMAP PQ: 0 ANSI: 6

Honestly I can't tell for certain but it looks like your flash may be using the xhci_hcd driver. Once memtest is complete, make sure your USB is not in the 3.0 port, use of USB 2 is preferred. Someone better at reading the log output above may be able to confirm or deny....

 

18 hours ago, Greeno237 said:

I rebooted the server after yet another unclean shutdown

If this is happening all of the time, you should familiarize yourself with the unclean shutdown thread as well as the FAQ, specifically the maintenance and troubleshooting sections, as you may well end up with file system corruption after repeated unclean shutdowns. If you get a disabled/emulated disk, make sure you come here and get help before rebooting.

 

Of course your first post should always include the diagnostics zip file by going to the Tools tab, clicking on the Diagnostics icon, and then clicking on the Download button and uploading the zip here. If you can't access the GUI then see here for instructions on using the command line.

Link to comment
27 minutes ago, Dissones4U said:

 

Honestly I can't tell for certain but it looks like your flash may be using the xhci_hcd driver. Once memtest is complete, make sure your USB is not in the 3.0 port, use of USB 2 is preferred. Someone better at reading the log output above may be able to confirm or deny....

 

I was definitely using it in a 3.0 slot. Thanks for letting me know that 2.0 is better, I obviously missed that. I've swapped it over now and restarted the memtest.

 

Is this diagnostics file the one that is better to upload? Not the syslog? Or both? Here's the 'diagnostics' one, I felt it was less relevant since all the events are from a few weeks ago. This file and the one from my original post are the two zip files which I found on the flash drive after enabling mirror of the syslog to the flash drive to try to see what is going on.  When this box locks up I have no access through GUI or SSH, and I can't even ping it.

server237-diagnostics-20200223-2209.zip

Link to comment
1 hour ago, Greeno237 said:

Is it possible that using the boot drive in the USB 3.0 slot was the source of my crashes?

definitely

1 hour ago, Greeno237 said:

Is this diagnostics file the one that is better to upload?

yes, it has way more info such as SMART data etc (this also contains syslog)

1 hour ago, Greeno237 said:

When this box locks up I have no access through GUI or SSH, and I can't even ping it.

I'm no expert, but the post I linked to above suggests that even when the GUI freezes, a direct connection with monitor and keyboard (mine boots in non-GUI mode) may provide a response that is unattainable otherwise.

Link to comment

Thanks for the answers, very helpful! In previous server crashes I was also unable to do anything from a direct connection with monitor and a USB keyboard. 

 

I ran memtest for about 6 hours, got through 4 complete cycles (pass) with 0 errors, so I started up the server again yesterday. It ran for about 23 hours with no problems and then just rebooted all by itself while I was sitting here. This is not the same behaviour as the previous crashes, as before it would hard lock and I wouldn't be able to do anything. This time it rebooted itself and started unraid at the command line. (Eventually I would like to run headless, but since I've been troubleshooting, I have a monitor/keyboard plugged in. I'm using a wireless USB keyboard mouse combo.)

 

At this time, I've restarted the memtest, choosing the multi-core option this time instead of the default. Should I be using multi-core? Seems like that is closer to a real world test? I'm going to let this one run for at least 24 hours, but the 6 hrs of clean memtest had me leaning towards the boot drive being plugged into USB3 being the source of the problem. Maybe it is/was? Maybe I have found a new problem? Tough to tell with a slightly different symptom.

 

Uploading the current syslog from the flash drive, here it is: 


Greeno237 syslog.zip
 

This unplanned restart happened at Mar 10 18:48:xx in the log.
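
For anyone following along, one way to locate restarts like this in a mirrored syslog is to look for the kernel's boot banner and note the last timestamp logged before it. A long quiet stretch before the banner suggests a hard lock followed by a manual reboot, while a short one suggests the box restarted on its own. This is only a minimal Python sketch: it assumes the mirrored file contains the kernel's "Linux version" line at each boot, the year is an assumption because classic syslog stamps carry none, and the function names and sample lines are made up for illustration.

```python
from datetime import datetime

# Assumed marker: the first message the Linux kernel logs at boot.
BOOT_MARKER = "kernel: Linux version"


def parse_ts(line, year=2020):
    """Parse the leading 'Mar  8 19:04:42' stamp of a syslog line.

    The year is supplied by the caller because the stamp has none.
    Returns None for lines that do not start with a timestamp.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_boots(lines, year=2020):
    """Return (last_stamp_before_boot, boot_stamp) for every boot banner found."""
    boots = []
    prev = None
    for line in lines:
        ts = parse_ts(line, year)
        if ts is None:
            continue
        if BOOT_MARKER in line:
            boots.append((prev, ts))
        prev = ts
    return boots


# Made-up sample lines in the same format as the log excerpts in this thread.
sample = [
    "Mar 10 18:30:00 Server237 example: last message before the restart",
    "Mar 10 18:48:05 Server237 kernel: Linux version (placeholder banner)",
    "Mar 10 18:48:05 Server237 kernel: later boot message",
]

for last_seen, booted in find_boots(sample):
    quiet = booted - last_seen if last_seen else None
    print(f"boot at {booted}, last entry before it at {last_seen}, quiet for {quiet}")
```

To run it against a real file, pass the open file object instead of `sample`. A quiet stretch on its own does not prove a crash, since an idle server can also go a while without logging anything.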

 

Just for clarity, the boot USB and the keyboard/mouse dongle are plugged into USB 2.0 slots now. No other USB devices are plugged in. If there is anything else I should be looking for beyond errors in the memtest, please let me know.

Link to comment
On 3/9/2020 at 4:43 PM, Greeno237 said:

Is this diagnostics file the one that is better to upload? Not the syslog? Or both?

Diagnostics zip file contains syslog as noted, but it only includes syslog since the last reboot. Since your problems are causing or requiring a reboot, those problems are not in the syslog included in diagnostics.

On 3/8/2020 at 9:33 PM, Greeno237 said:

I enabled the syslog mirroring to the flash drive

So, the syslog you get like that from the flash drive can also be useful in this situation.

 

You say it is rebooting itself now. Are you sure you don't have a power or cooling problem?

Link to comment
11 minutes ago, trurl said:

Diagnostics zip file contains syslog as noted, but it only includes syslog since the last reboot. Since your problems are causing or requiring a reboot, those problems are not in the syslog included in diagnostics.

Ok, I think I get the structure of those two files, as I can see that the syslog is updated constantly while the diagnostics hasn't changed since Feb 23. I will continue to upload the syslog to pass along the most recent info as we progress with troubleshooting.

 

13 minutes ago, trurl said:

You say it is rebooting itself now. Are you sure you don't have a power or cooling problem?

I'm fairly sure that cooling and power are reliable. This hardware used to be my main rig, I was running the same hardware with different SSD/HDDs plus a GTX1080. I was slightly overclocking the CPU to 4.7GHz from 4.4, on air using a BeQuiet Dark Rock 3 which has a 250W TDP rating. Under stress test I was hitting high 80s or maybe 90 degrees but it was stable when running Win 10.

 

The PSU is a CM600X Corsair from 2014. It is 80+ Bronze certified, though it certainly isn't top of the line, and it is 5-6 years old now. It performed well while overclocking the old rig, and I'm not asking nearly as much from it in the current setup. I have 3 fans moving air in the current Full ATX case, and it was recently dusted. CPU temp in memtest is currently bouncing between 55-60, I have no frame of reference, but that seems normal for a light load. I think it was low 50s when running the single core test. I don't remember noticing high temps when stressing the server (running a W10 VM and 1 Plex stream is probably the hardest I pushed it since setup), but I can't say for certain.

 

That's all the relevant info I can think of re: power and cooling, but I'll definitely keep a closer eye on temperature data moving forward.

Link to comment
9 hours ago, Greeno237 said:

Ok, I think I get the structure of those two files, as I can see that the syslog is updated constantly while the diagnostics hasn't changed since Feb 23.

If you get a new diagnostic, it should include the latest syslog information (assuming you haven't filled your log space). What it won't include is any syslog from before the latest reboot.

 

Possibly you are referring to an old diagnostics that you got off the flash drive. You can get a new diagnostic

On 3/9/2020 at 4:11 PM, Dissones4U said:

by going to the Tools tab, clicking on the Diagnostics icon, and then clicking on the Download button

 

For many situations, we don't want or need the separate syslog. Only when the needed syslog information was lost due to reboot do we ask for it, and Syslog Server is the way to get it saved.

Link to comment
13 hours ago, trurl said:

Diagnostics zip file contains syslog as noted, but it only includes syslog since the last reboot

I guess I should have realized that, but since I live in the land of hopes and dreams I assumed that the diagnostics pulled the log from the syslog server folder once it was setup. Good to know....

Link to comment
1 minute ago, Dissones4U said:

I guess I should have realized that, but since I live in the land of hopes and dreams I assumed that the diagnostics pulled the log from the syslog server folder once it was setup. Good to know....

It's probably less trouble in general not to include the syslog from Syslog Server, since there is no telling how large someone might let it become, and it isn't needed nearly as often as the diagnostics are.

Link to comment
6 hours ago, trurl said:

Possibly you are referring to an old diagnostics that you got off the flash drive. You can get a new diagnostic

Oh I must be dense. Sorry, of course I can make a new one, I was just grabbing the one off the flash drive, as you said.

 

server237-diagnostics-20200311-1335.zip

 

I realized power issues could include loose cables, so I double checked the power cords and all of them seem snug. I could try replacing the cord that plugs the PSU into the surge protector, I believe I have some spares around. 

 

10 Full cycles of memtest over 18.5 hours and no errors.

Link to comment

@Greeno237 At first blush I've compared your first diagnostic SMART with the second and there is something going on with parity Serial #ZCT1Y56S, the following attributes have gone up significantly in only a few weeks (346 hours). I'm not sure if this could cause your reboots or not. I'll keep looking...

  • 195 Hardware_ECC_Recovered went from 0 ==> 69,487,100
  • 7 Seek_Error_Rate went from 6,068,679 ==> 40,377,770

 

I just looked at your latest diagnostic, specifically the lsscsi.text, maybe someone can correct me but it looks like you're still on USB3, hopefully I'm misunderstanding that though (maybe this just means the drive is USB 3 capable?)

 

Quote

[0:0:0:0]    disk    Kingston DataTraveler 3.0 PMAP  /dev/sda   /dev/sg0 
  state=running queue_depth=1 scsi_level=7 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/0:0:0:0  [/sys/devices/pci0000:00/0000:00:14.0/usb3/3-8/3-8:1.0/host0/target0:0:0/0:0:0:0]

 

I'm not seeing anything obvious; maybe someone with more experience will find something. Generally when I troubleshoot, the first thing I do is return the server to its most basic state (no VMs, no Dockers, etc.), especially if you made changes right before the problems started. Then, if the trouble is gone after running for some time, re-enable one Docker or VM at a time.

 

You can also see jonp's post here and try tailing the log as described, in short he says that this method may be useful because

Quote

The syslog to flash method is not valid for capturing log events related to major crashes like you're experiencing.  The problem is that the hang can occur before the write to the flash can occur.

 

Edited by Dissones4U
Added Info
Link to comment
  • 3 weeks later...

Sorry for the radio silence, it's been a busy few weeks.

 

Took me some digging but I think I found the problem. My socket had bent pins!

I replaced the motherboard today and got the server up and running. I will report back on the stability of the system moving forward. I'm not sure what was causing the errors that Dissones found, so I'll take a look at the numbers now.

Link to comment
On 3/12/2020 at 9:50 AM, Dissones4U said:

@Greeno237 At first blush I've compared your first diagnostic SMART with the second and there is something going on with parity Serial #ZCT1Y56S, the following attributes have gone up significantly in only a few weeks (346 hours). I'm not sure if this could cause your reboots or not. I'll keep looking...

  • 195 Hardware_ECC_Recovered went from 0 ==> 69,487,100
  • 7 Seek_Error_Rate went from 6,068,679 ==> 40,377,770

Ok, so I did some digging on this, and apparently the large numbers come from the way Seagate encodes the data; large raw values are normal. Source: http://www.users.on.net/~fzabkar/HDD/Seagate_SER_RRER_HEC.html
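
For anyone who wants to check the numbers: as I read the linked page, the raw value of these Seagate attributes is a 48-bit number whose upper 16 bits hold the error count and whose lower 32 bits hold the operation count (seeks, in the case of attribute 7). Here is a minimal Python sketch under that assumption. The layout comes from that third-party page rather than from Seagate documentation, applying it to attribute 195 is part of the same assumption, and `split_seagate_raw` is a made-up helper name.

```python
def split_seagate_raw(raw):
    """Split a 48-bit raw value into (error_count, operation_count).

    Assumed layout (from the linked page, not from Seagate):
    upper 16 bits = errors, lower 32 bits = operations.
    """
    errors = (raw >> 32) & 0xFFFF
    operations = raw & 0xFFFFFFFF
    return errors, operations


# Raw values quoted in this thread.
for name, raw in [
    ("Seek_Error_Rate (first diagnostics)", 6_068_679),
    ("Seek_Error_Rate (second diagnostics)", 40_377_770),
    ("Hardware_ECC_Recovered (second diagnostics)", 69_487_100),
]:
    errors, operations = split_seagate_raw(raw)
    print(f"{name}: errors={errors}, operations={operations}")
```

All three values are below 2^32, so under this reading the error field is zero in each case and the growth is just the operation counter advancing.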
 

On 3/12/2020 at 9:50 AM, Dissones4U said:

I just looked at your latest diagnostic, specifically the lsscsi.text, maybe someone can correct me but it looks like you're still on USB3, hopefully I'm misunderstanding that though (maybe this just means the drive is USB 3 capable?)

[0:0:0:0]    disk    Kingston DataTraveler 3.0 PMAP  /dev/sda   /dev/sg0 
  state=running queue_depth=1 scsi_level=7 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/0:0:0:0  [/sys/devices/pci0000:00/0000:00:14.0/usb3/3-8/3-8:1.0/host0/target0:0:0/0:0:0:0]

The ".../usb3/3-8/..." part there seems to be some type of naming scheme for the motherboard. If I switch the boot device back to the 3.0 slot, it actually calls it "usb4/4-x" and if I switch it to the other 2.0 port on the front I/O Panel it changes the second digit of 3-8.
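
For reference, this matches how Linux names USB devices in sysfs rather than anything board-specific: `usbN` is the root hub of bus N, and an entry such as `3-8` means bus 3, port 8. The log in the first reply appears to show the xHCI controller registering bus 4 as its SuperSpeed bus, which would fit the drive showing up under `usb4` when it sits in a 3.0 slot. Below is a minimal Python sketch that pulls the bus and port chain out of a path like the one in the lsscsi output; `usb_location` is a made-up helper name.

```python
import re


def usb_location(sysfs_path):
    """Return (bus, port_chain) for a sysfs device path, or None if it is not on USB.

    'usb3' gives the bus number; '3-8' (or '3-8.2' behind a hub) gives the ports.
    Interface entries such as '3-8:1.0' are skipped because of the colon.
    """
    bus = None
    ports = None
    for part in sysfs_path.strip("[]").split("/"):
        m = re.fullmatch(r"usb(\d+)", part)
        if m:
            bus = int(m.group(1))
            continue
        m = re.fullmatch(r"(\d+)-([\d.]+)", part)
        if m and bus is not None:
            ports = m.group(2)  # keep the deepest device entry
    if bus is None or ports is None:
        return None
    return bus, ports


# Path copied from the lsscsi output quoted above.
path = "/sys/devices/pci0000:00/0000:00:14.0/usb3/3-8/3-8:1.0/host0/target0:0:0/0:0:0:0"
print(usb_location(path))
```

The bus number alone does not say whether a bus is 2.0 or 3.0. On a running system, the `speed` file under `/sys/bus/usb/devices/usbN/` reports the bus speed in Mbit/s, which should settle it.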

I'm hoping at this point that all my troubles were from the damaged socket. Like I said previously, I will keep this updated and note it as solved if everything runs smoothly now.

Link to comment
  • 5 weeks later...
On 4/1/2020 at 7:28 PM, Greeno237 said:

I'm hoping at this point that all my troubles were from the damaged socket. Like I said previously, I will keep this updated and note it as solved if everything runs smoothly now.

30 days stable. I think it's safe to say my problems were a result of my damaged hardware.

Thank you so very much to my troubleshooters trurl and Dissones! I never would've figured this out on my own, I had no idea where to start. I appreciate you taking the time to help.
