Unraid crashes every 24-48 hours.

thestraycat · February 25, 2022

Hi guys,

I've been having issues for quite a while now with the server staying up. It's been happening for the last few months. The server stays on with all fans running, but shares are not accessible, SSH doesn't connect and the IPMI KVM doesn't output anything. A reboot brings it all back up swiftly.

My build is a Norco-esque 24 bay server running the following components:

Supermicro X9SCM-F (on most recent BIOS)

Xeon E3-1240v2

EVGA 1300G2 PSU (1300w)

32 GiB DDR3 Single-bit ECC

3 x Dell H200 (flashed to LSI 9211i, FWVersion 20.00.07.00)

HBA PCIe Benchmarks:

HBA1 = 5 GT/s, width x8 (4GB/s Max)

HBA2 = 5 GT/s, width x4 (4GB/s Max) <- Bottlenecked

HBA3 = 5 GT/s, width x2 (4GB/s Max) <- Bottlenecked

1 x Intel X540-AT2 10GB NIC

NIC PCIe Benchmark: 5 GT/s, 32GB/s

22 x 2TB Disks

1 x Samsung 1TB SSD

1 x Intel 2TB SSD

1 x Micron 500GB
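The HBA link widths listed above can be turned into rough bandwidth ceilings. This is a minimal sketch, assuming the 5 GT/s figure is the PCIe 2.0 per-lane rate with 8b/10b encoding (8 data bits for every 10 bits sent). The function name is illustrative, and real throughput will be lower once protocol overhead is counted.

```python
# Rough PCIe 2.0 payload ceiling per link width.
# Assumptions: 5 GT/s per lane and 8b/10b line coding (PCIe 1.x/2.x).
GT_PER_S = 5.0       # raw transfers per lane, in gigatransfers/s
ENCODING = 8 / 10    # 8 data bits carried per 10 bits on the wire

def link_bandwidth_gb_s(lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s for a given lane count."""
    gbit_per_s = GT_PER_S * ENCODING * lanes
    return gbit_per_s / 8  # bits -> bytes

for name, lanes in [("HBA1", 8), ("HBA2", 4), ("HBA3", 2)]:
    print(f"{name}: x{lanes} -> {link_bandwidth_gb_s(lanes):.1f} GB/s")
```

By this arithmetic only the x8 link reaches the 4GB/s shown as the maximum. The x4 and x2 links top out at roughly 2 GB/s and 1 GB/s, which matches the two controllers flagged as bottlenecked.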

Things I've checked:

1. RAM is new from Kingston

2. PSU swapped out from Corsair 750w to EVGA 1300w

3. All docker appdata references moved to cache from user

4. VM manager disabled

5. All temps are fine for everything.

6. No SMART disk failures on any of the array disks

7. Diagnostics don't seem to turn up anything.

8. Syslog server mirrored to flash and to array.

9. Checked container sizes. All looks fine, but I have increased docker.img as it was getting a little full.

10. CA Fix Common Problems finds no errors, just a few warnings about updates for containers that I'm keeping on a certain label.

Things I believe it could be:

1. Complications of running 3 x HBAs on this board. As you can see in the attached pic, 2 controllers are bottlenecked, but I'm more concerned that the different bandwidth between identical controllers could be an issue.

2. Hardware failure - Although every component is working and responds so there's nothing to find.

3. I have SMART failures on both the cache drive and the backup drive, but they look like early signs of wear, and I'm undecided whether they're causing issues.

4. Unraid nuance that I haven't yet found. Syslog turns up nothing that I can see.

5. Sleep/power state related?

I've included my diagnostics below and a screenshot of my HBAs benchmarked through the handy speeddisk container. If anyone can sanity check the logs I'd be greatly appreciative.

Does anyone have a solid working installation on the X9SCM board right now that they'd be happy to share the BIOS options for? I understand detailing it all out may be painful, but if there's any chance of saving your BIOS config to file so I can give your settings a go, that would be awesome. It just helps me rule out BIOS options for the X9SCM...

If the syslog doesn't contain anything too confidential and isn't included in the diagnostics.zip, I'm happy to upload it separately?

mediaserver-diagnostics-20220225-1800.zip

Edited February 25, 2022 by thestraycat
adding

JorgeB · February 25, 2022

Enable the syslog server and post that after a crash, it might catch something.

thestraycat · February 25, 2022

@JorgeB - Yeah, I have. I have some older ones from a few days ago and captured the before and after state for 2 or 3 crashes. Is there anything that I need to anonymize from the syslog output?

JorgeB · February 25, 2022

Should be fine. If you have mover logging enabled you can remove those entries, otherwise it will show the file names.

thestraycat · February 25, 2022

@JorgeB - Cheers. I'm currently waiting for a new crash, so here's the older one from a few days back. There should be at least 2-3 crashes in there. The server is left on 24/7, so if there are any large jumps (2hrs+) between the date/time stamps, it'll be because it had become unresponsive and was rebooted.

syslog-192.168.1.125.log
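For anyone reading the log, the jumps between timestamps can be located mechanically. A minimal sketch, assuming the standard syslog line prefix `Mon DD HH:MM:SS`; the year is not in the log, so `ASSUMED_YEAR` is a guess, and the function names are illustrative. The two sample lines are the entries quoted in this thread on either side of the Feb 25-26 crash.

```python
from datetime import datetime, timedelta

ASSUMED_YEAR = 2022  # syslog stamps carry no year; this value is an assumption

def parse_stamp(line):
    """Return the leading 'Mon DD HH:MM:SS' stamp as a datetime, or None."""
    try:
        return datetime.strptime(f"{ASSUMED_YEAR} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None

def find_gaps(lines, min_gap=timedelta(hours=2)):
    """Return (last line before, first line after, gap) for each silent stretch."""
    gaps = []
    prev_ts = prev_line = None
    for line in lines:
        ts = parse_stamp(line)
        if ts is None:
            continue  # skip lines without a parsable timestamp
        if prev_ts is not None and ts - prev_ts >= min_gap:
            gaps.append((prev_line.rstrip(), line.rstrip(), ts - prev_ts))
        prev_ts, prev_line = ts, line
    return gaps

# The two entries quoted in this thread, either side of one crash.
sample = [
    "Feb 25 23:21:51 MediaServer autofan: Highest disk temp is 21C, adjusting fan speed from: 84 (32% @ 1062rpm) to: 54 (21% @ 704rpm)",
    "Feb 26 01:01:23 MediaServer root: Delaying execution of fix common problems scan for 10 minutes",
]

for before, after, gap in find_gaps(sample, min_gap=timedelta(hours=1)):
    print(f"silent for {gap}\n  last:  {before}\n  first: {after}")
```

To run it on a real capture, pass an open file instead of `sample`. Note that the Feb 25-26 crash shows up as a gap of 1:39:32, so the 2-hour default would miss it, while a lower threshold may also flag ordinary quiet periods. A log that crosses a year boundary would need extra handling.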

thestraycat · February 26, 2022

@JorgeB - Another crash. Happened between:

Feb 25 23:21:51 MediaServer autofan: Highest disk temp is 21C, adjusting fan speed from: 84 (32% @ 1062rpm) to: 54 (21% @ 704rpm)

and

Feb 26 01:01:23 MediaServer root: Delaying execution of fix common problems scan for 10 minutes

syslog attached.

syslog-192.168.1.125.log

JorgeB · February 26, 2022

Unfortunately there's nothing relevant logged, which usually suggests a hardware issue. One thing you can try is to boot the server in safe mode with all Docker/VMs disabled and let it run as a basic NAS for a few days. If it still crashes it's likely a hardware problem; if it doesn't, start turning on the other services one by one.

thestraycat · February 26, 2022

@JorgeB - I'm leaning the same way myself. I have actually been running it with VM Manager off for the last few days and had a few crashes. Yesterday I turned off Docker overnight to see if it stayed up. It was off in the morning.

I decided to run a memtest just for the hell of it which ran for 8hrs and passed every test.

One thing I did note from the log, though, was that it seems to crash quite often at:

MediaServer emhttpd: spinning down /dev/sdf

sdf is my parity disk 2.

What do you think? Coincidence? I was thinking of doing the following:

1. Replace disk 'sdf', let the array rebuild and then see how it runs (I have a spare).

2. If that doesn't work, then move disk 'sdf' onto a different drive bay controlled by a different HBA. That disk has reported 4 errors on 'UDMA CRC Error count', so it might be the connectivity?

3. If that doesn't work, then replace the HBA that sdf was originally connected to (I have a spare).

4. Last resort. Replacing the motherboard.

What do you think?

JorgeB · February 27, 2022

IMHO very unlikely that crashing is related to spinning down a disk.

thestraycat · February 27, 2022

@JorgeB - I agree it looks unlikely. However, if you look at where the crashes happen in the log, they all happen at the same point.

So the last tracked log entry before each crash is communication with an HBA, or with a disk attached to that particular HBA, and something goes wrong after it.

Or do you think that's just coincidence?
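One way to check whether the crashes really share a last log entry is to tally the message logged immediately before each long silence. A minimal sketch: the sample lines below are synthetic, with invented timestamps and placeholder messages, and are not taken from the attached syslog. Only the `spinning down /dev/sdf` text comes from this thread. The year and the 1-hour threshold are assumptions too.

```python
from collections import Counter
from datetime import datetime, timedelta

def stamp(line, year=2022):
    # Syslog lines start with 'Mon DD HH:MM:SS'; the year is an assumption.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def last_before_gaps(lines, min_gap=timedelta(hours=1)):
    """Count the 'tag: message' text of the last entry before each long silence.

    Assumes every line starts with a well-formed timestamp and hostname.
    """
    tally = Counter()
    for prev, cur in zip(lines, lines[1:]):
        if stamp(cur) - stamp(prev) >= min_gap:
            # Drop the 15-char timestamp, the space and the hostname.
            tally[prev[16:].split(" ", 1)[1]] += 1
    return tally

# SYNTHETIC sample: timestamps and the 'root:' messages are made up.
sample = [
    "Mar 11 02:10:00 MediaServer emhttpd: spinning down /dev/sdf",
    "Mar 11 09:30:00 MediaServer root: placeholder entry after reboot",
    "Mar 11 10:00:00 MediaServer emhttpd: spinning down /dev/sdf",
    "Mar 11 10:20:00 MediaServer root: placeholder routine entry",
    "Mar 11 10:40:00 MediaServer emhttpd: spinning down /dev/sdf",
    "Mar 11 15:00:00 MediaServer root: placeholder entry after reboot",
]

for message, count in last_before_gaps(sample).most_common():
    print(f"{count} x {message}")
```

A message that is logged frequently anyway will often be the last one before a crash purely by chance, so the tally is only suggestive when compared against how often that message appears overall.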

Edited February 27, 2022 by thestraycat

JorgeB · February 28, 2022

It's easy to verify, just disable spin down and see if the server still crashes.

plantsandbinary · February 28, 2022

As JorgeB said, start with the basics. Leave it running spun down. If it doesn't crash, spin the array up, but keep Docker disabled. If it still doesn't crash, enable Docker. If it crashes then, start with eliminating some of the containers.

You've got a lot of different things connected and from my own experience, in such a situation there can always be one piece of hardware that doesn't properly display temperatures etc. I had one device on my own machine that reported no temp in the UI but when I inspected it myself physically I burned the s**** out of my finger touching it for even a second. After removing it I didn't have any issues.
