Unraid crashes every 24-48 hours.



Hi guys,

 

I've been having issues for the last few months with the server staying up. When it hangs, the server stays powered on with all fans running, but the shares aren't accessible, SSH doesn't connect and the IPMI KVM shows no output. A reboot brings it all back up swiftly.

 

My build is a Norco-esque 24-bay server running the following components:

 

Supermicro X9SCM-F (on most recent BIOS)

Xeon E3-1240v2

EVGA 1300G2 PSU (1300w)

32 GiB DDR3 Single-bit ECC

 

3 x Dell H200 (flashed to LSI 9211-8i, FWVersion 20.00.07.00)

HBA PCIe Benchmarks:

HBA1 = 5 GT/s width x8 (4GB/s Max)

HBA2 = 5 GT/s width x4 (4GB/s Max) <- Bottlenecked

HBA3 = 5 GT/s width x2 (4GB/s Max) <- Bottlenecked
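
For anyone sanity-checking the "bottlenecked" labels: a rough sketch of the theoretical per-link bandwidth, assuming these links really are negotiating PCIe 2.0 (5 GT/s per lane, 8b/10b encoding). The function name is made up for illustration. Under those assumptions x8 works out to about 4 GB/s, which matches the figure quoted for HBA1, while x4 and x2 come to roughly half and a quarter of that.

```python
# PCIe 2.0 signals at 5 GT/s per lane and uses 8b/10b encoding,
# so only 8 of every 10 transferred bits carry payload data.
GT_PER_S = 5.0
ENCODING_EFFICIENCY = 8 / 10


def pcie2_bandwidth_gb_s(lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s for a PCIe 2.0 link."""
    gbit_per_lane = GT_PER_S * ENCODING_EFFICIENCY  # usable Gbit/s per lane
    return lanes * gbit_per_lane / 8  # bits -> bytes


# Lane widths taken from the benchmark lines above.
for name, lanes in [("HBA1", 8), ("HBA2", 4), ("HBA3", 2)]:
    print(f"{name}: x{lanes} -> {pcie2_bandwidth_gb_s(lanes):.1f} GB/s")
```

These are ceilings only. Real throughput will be lower, and whether the narrower links matter depends on how many disks behind each HBA are busy at once.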

 

1 x Intel X540-AT2 10GbE NIC

NIC PCIe Benchmark: 5 GT/s, 32GB/s

 

22 x 2TB Disks

1 x Samsung 1TB SSD

1 x Intel 2TB SSD

1 x Micron 500GB

 

Things I've checked:

1. RAM is new from Kingston

2. PSU swapped out from corsair 750w to EVGA 1300w

3. All docker appdata references moved to cache from user

4. VM manager disabled

5. All temps are fine for everything.

6. No SMART disk failures on any of the array disks

7. Diagnostics don't seem to turn up anything.

8. Syslog server mirrored to flash and to array.

9. Checked container sizes. All look fine, but I have increased the size of docker.img as it was getting a little full.

10. CA Fix Common Problems finds no errors and just a few warnings about updates for containers that I'm keeping on a certain label.

 

Things I believe it could be:

1. Complications of running 3 x HBAs on this board. As you can see in the attached pic, 2 controllers are bottlenecked, but I'm more concerned that the different bandwidth between identical controllers might be an issue.

2. Hardware failure, although every component is working and responds, so there's nothing to find.

3. I have SMART failures on both the cache drive and the backup drive, but they look like early-sign, wear-related issues. I'm undecided whether they're causing problems.

4. An Unraid nuance that I haven't found yet. Syslog turns up nothing that I can see.

5. sleep/power state related????

 

 

I've included my diagnostics below and a screenshot of my HBAs benchmarked through the handy DiskSpeed container. If anyone can sanity check the logs I'd be greatly appreciative.

 

Does anyone have a solid working installation on the X9SCM board right now that they'd be happy to share the BIOS options for? I understand detailing it all out may be painful, but if there's any chance of saving your BIOS config to a file so I can give your settings a go, that would be awesome. It would help me rule out BIOS options for the X9SCM.

 

If the syslog doesn't contain anything too confidential and isn't included in the diagnostics.zip, I'm happy to upload it separately.

 

 

 

 

lsi_benchmarks.png

mediaserver-diagnostics-20220225-1800.zip

Edited by thestraycat
53 minutes ago, JorgeB said:

Should be fine. If you have mover logging enabled you can remove those, or it will show the file names.

@JorgeB - Cheers. I'm currently waiting for a new crash, so here's the older log from a few days back. There should be at least 2-3 crashes in there. The server is left on 24/7, so if there are any large jumps (2hrs+) between the date/time stamps, it'll be because it had become unresponsive and was rebooted.

 

 

syslog-192.168.1.125.log
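
The "large jump between timestamps" check described above can be automated. Below is a minimal sketch, not a tested tool: it assumes each line starts with a classic syslog prefix such as `Feb 25 18:00:01` (no year, so the year is passed in and 2022 is only a guess based on the diagnostics date), and `find_gaps` is a name made up here. It reports the last line logged before each jump, which makes it easy to compare what the server was doing just before every hang.

```python
from datetime import datetime, timedelta


def find_gaps(lines, min_gap=timedelta(hours=2), year=2022):
    """Return (last_line_before, first_line_after, gap) for each jump >= min_gap."""
    gaps = []
    prev_time = prev_line = None
    for raw in lines:
        line = raw.rstrip("\n")
        try:
            # Assumed prefix: "Feb 25 18:00:01" (15 characters, no year in the log).
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line, skip it
        if prev_time is not None and stamp - prev_time >= min_gap:
            gaps.append((prev_line, line, stamp - prev_time))
        prev_time, prev_line = stamp, line
    return gaps


# Made-up sample lines, only to show the shape of the output.
sample = [
    "Feb 23 01:00:00 MediaServer emhttpd: spinning down /dev/sdf",
    "Feb 23 09:15:00 MediaServer kernel: hypothetical first line after reboot",
]
for before, after, gap in find_gaps(sample):
    print(f"gap of {gap}: last entry was {before!r}")

# Against the real file it would be something like:
# with open("syslog-192.168.1.125.log", errors="replace") as fh:
#     for before, after, gap in find_gaps(fh):
#         print(gap, before)
```

A log that spans a year boundary would need extra handling, since the timestamps carry no year of their own.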


Unfortunately there's nothing relevant logged, which usually suggests a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.


@JorgeB - I'm leaning the same way myself. I have actually been running it with VM Manager off for the last few days and had a few crashes. Yesterday I turned off Docker overnight to see if it stayed up. It was off in the morning.

 

I decided to run a memtest just for the hell of it, which ran for 8hrs and passed every test.

 

One thing I did note from the log, though, is that it seems to crash quite often at:

 

 MediaServer emhttpd: spinning down /dev/sdf

 

sdf is my parity disk 2.

 

What do you think? Coincidence? I was thinking of doing the following:

 

1. Replace disk 'sdf', let the array rebuild and then see how it runs (I have a spare).

2. If that doesn't work, move disk 'sdf' to a different drive bay controlled by a different HBA. That disk has reported 4 errors on 'UDMA CRC Error count', so it might be the connectivity?

3. If that doesn't work, replace the HBA sdf was originally connected to (I have a spare).

4. Last resort: replace the motherboard.
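
On the CRC count in step 2: that SMART counter is cumulative, so the useful question is whether it keeps rising after the disk is moved, not whether it is non-zero. Here is a hedged sketch for pulling the raw value out of `smartctl -A` text so it can be noted before and after each step. The helper name and the sample rows are made up for illustration (only the raw value of 4 comes from the post), and it assumes the usual ten-column attribute table with the CRC counter at ID 199.

```python
def udma_crc_count(smartctl_output: str):
    """Return the raw value of SMART attribute 199 (UDMA CRC errors), or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; RAW_VALUE is the 10th column.
        if len(fields) >= 10 and fields[0] == "199":
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


# Illustrative table in the standard layout; values other than the 4 are invented.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       12345
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       4
"""
print(udma_crc_count(SAMPLE))

# On the server itself the text could come from smartctl, for example:
# import subprocess
# out = subprocess.run(["smartctl", "-A", "/dev/sdf"],
#                      capture_output=True, text=True).stdout
# print(udma_crc_count(out))
```

If the number stays put after moving the disk to another bay and HBA, that points at the original cabling or backplane path more than at the disk.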

 

What do you think?


@JorgeB I agree it looks unlikely. However, if you look at where the crashes happen in the log, they all happen at the same point.

The last tracked log entry before each crash is communication with an HBA or with a disk attached to that particular HBA, so something seems to go wrong right after that.

Or do you think that's just coincidence?


As JorgeB said, start with the basics. Leave it running spun down. If it doesn't crash, spin the array up, but keep Docker disabled. If it still doesn't crash, enable Docker. If it crashes then, start with eliminating some of the containers.

 

You've got a lot of different things connected and from my own experience, in such a situation there can always be one piece of hardware that doesn't properly display temperatures etc. I had one device on my own machine that reported no temp in the UI but when I inspected it myself physically I burned the s**** out of my finger touching it for even a second. After removing it I didn't have any issues.

