thestraycat Posted February 25, 2022 (edited)

Hi guys,

I've been having issues with the server staying up for the last few months. When it fails, the server stays powered on with all fans running, but the shares aren't accessible, SSH doesn't connect, and the IPMI KVM shows no output. A reboot brings it all back up swiftly.

My build is a Norco-esque 24-bay server running the following components:

Supermicro X9SCM-F (on the most recent BIOS)
Xeon E3-1240v2
EVGA 1300G2 PSU (1300 W)
32 GiB DDR3 single-bit ECC
3 x Dell H200 (flashed to LSI 9211-8i, FWVersion 20.00.07.00)
HBA PCIe benchmarks:
HBA1 = 5 GT/s, width x8 (4GB/s max)
HBA2 = 5 GT/s, width x4 (4GB/s max) <- bottlenecked
HBA3 = 5 GT/s, width x2 (4GB/s max) <- bottlenecked
1 x Intel X540-AT2 10GbE NIC
NIC PCIe benchmark: 5 GT/s, 32GB/s
22 x 2TB disks
1 x Samsung 1TB SSD
1 x Intel 2TB SSD
1 x Micron 500GB

Things I've checked:
1. RAM is new from Kingston.
2. PSU swapped from a Corsair 750 W to the EVGA 1300 W.
3. All Docker appdata references moved from user to cache.
4. VM Manager disabled.
5. All temps are fine for everything.
6. No SMART disk failures on any of the array disks.
7. Diagnostics don't seem to turn up anything.
8. Syslog server mirrored to flash and to the array.
9. Checked container sizes. All look fine, but I have increased docker.img as it was getting a little full.
10. CA Fix Common Problems finds no errors, just a few warnings about updates for containers that I'm keeping on a certain label.

Things I believe it could be:
1. Complications from running 3 x HBAs on this board. As you can see in the attached pic, 2 controllers are bottlenecked, but I'm more concerned that the different bandwidth between identical controllers could be an issue.
2. Hardware failure, although every component is working and responds, so there's nothing to find.
3. I have SMART failures on both the cache drive and the backup drive, but they look like early signs of wear. I'm undecided whether they're causing issues.
4. An Unraid nuance that I haven't found yet. Syslog turns up nothing that I can see.
5. Something sleep/power-state related?

I've included my diagnostics below and a screenshot of my HBAs benchmarked through the handy speeddisk container. If anyone can sanity check the logs I'd be very grateful.

Does anyone have a solid working installation on the X9SCM board right now and would be happy to share their BIOS options? I understand detailing it all out may be painful, but if there's any chance of saving your BIOS config to a file so I can give your settings a go, that would be awesome. It would help me rule out BIOS options for the X9SCM.

If the syslog doesn't contain anything too confidential and isn't included in the diagnostics.zip, I'm happy to upload it separately.

mediaserver-diagnostics-20220225-1800.zip
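For anyone reading the benchmark figures above, here is a minimal sketch of the arithmetic behind the "bottlenecked" labels. It assumes the 5 GT/s links are PCIe 2.0 with 8b/10b encoding, which gives roughly 500 MB/s of usable bandwidth per lane in each direction. The figure of 8 disks per HBA is a hypothetical even split of the 24 bays and is not stated in the post; real throughput will also be somewhat lower because of protocol overhead.

```python
# Rough PCIe link bandwidth for the three HBAs described in the post.
# Assumption: 5 GT/s means PCIe 2.0, which uses 8b/10b encoding
# (8 data bits carried per 10 bits transferred).

def link_bandwidth_gb_s(lanes, gt_per_s=5.0, efficiency=8 / 10):
    """Return usable one-direction bandwidth in GB/s for a PCIe link."""
    usable_gbit_per_lane = gt_per_s * efficiency  # 4 Gbit/s per lane
    return lanes * usable_gbit_per_lane / 8       # bits -> bytes

# Negotiated link widths as reported in the post.
HBAS = {"HBA1": 8, "HBA2": 4, "HBA3": 2}

# Hypothetical: 8 disks per HBA (24 bays / 3 controllers). The post
# does not say how the disks are actually distributed.
DISKS_PER_HBA = 8

def report():
    """Build one summary line per HBA."""
    lines = []
    for name, lanes in HBAS.items():
        total = link_bandwidth_gb_s(lanes)
        per_disk_mb = total * 1000 / DISKS_PER_HBA
        lines.append(
            f"{name}: x{lanes} ~ {total:.1f} GB/s total, "
            f"~{per_disk_mb:.0f} MB/s per disk with all disks busy"
        )
    return lines

if __name__ == "__main__":
    print("\n".join(report()))
```

Under these assumptions only the x8 link reaches the 4 GB/s maximum; the x4 and x2 links top out at about 2 GB/s and 1 GB/s, which would mainly show up when every disk on the controller is busy at once, for example during a parity check.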
JorgeB Posted February 25, 2022

Enable the syslog server and post the log after a crash; it might catch something.
thestraycat Posted February 25, 2022

@JorgeB - Yeah, I have. I've got some older logs from a few days ago that captured the before and after state for 2 or 3 crashes. Is there anything that I need to anonymize in the syslog output?
JorgeB Posted February 25, 2022

Should be fine. If you have mover logging enabled you can remove those entries, otherwise the log will show the file names.
thestraycat Posted February 25, 2022

53 minutes ago, JorgeB said: Should be fine. If you have mover logging enabled you can remove those entries, otherwise the log will show the file names.

@JorgeB - Cheers. I'm currently waiting for a new crash, so here's the older log from a few days back. There should be at least 2-3 crashes in there. The server is left on 24/7, so any large jumps (2hrs+) between the date/time stamps will be because it had become unresponsive and was rebooted.

syslog-192.168.1.125.log
thestraycat Posted February 26, 2022

@JorgeB - Another crash. It happened between:

Feb 25 23:21:51 MediaServer autofan: Highest disk temp is 21C, adjusting fan speed from: 84 (32% @ 1062rpm) to: 54 (21% @ 704rpm)

and

Feb 26 01:01:23 MediaServer root: Delaying execution of fix common problems scan for 10 minutes

Syslog attached.

syslog-192.168.1.125.log
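Spotting these timestamp jumps by eye in a long log is tedious, so here is a minimal sketch that scans a syslog file for gaps and prints the last line before each one. It assumes the classic `Mon DD HH:MM:SS` prefix shown in the two lines above and an English locale for month names; syslog lines carry no year, so `year=2022` is an assumption, as is the one-hour default threshold. The function names are hypothetical and not part of Unraid.

```python
from datetime import datetime, timedelta

def parse_stamp(line, year=2022):
    """Parse the leading 'Mon DD HH:MM:SS' of a classic syslog line."""
    # The first 15 characters hold the timestamp; the year is assumed.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def find_gaps(lines, threshold=timedelta(hours=1), year=2022):
    """Return (gap, last_line_before, first_line_after) for each long gap."""
    gaps = []
    prev_time, prev_line = None, None
    for line in lines:
        line = line.rstrip("\n")
        try:
            now = parse_stamp(line, year)
        except ValueError:
            continue  # skip lines without a parseable timestamp
        if prev_time is not None and now - prev_time >= threshold:
            gaps.append((now - prev_time, prev_line, line))
        prev_time, prev_line = now, line
    return gaps

# The two lines quoted in the post above, used as sample input.
SAMPLE = [
    "Feb 25 23:21:51 MediaServer autofan: Highest disk temp is 21C, "
    "adjusting fan speed from: 84 (32% @ 1062rpm) to: 54 (21% @ 704rpm)",
    "Feb 26 01:01:23 MediaServer root: Delaying execution of fix common "
    "problems scan for 10 minutes",
]

if __name__ == "__main__":
    for gap, before, after in find_gaps(SAMPLE):
        print(f"gap of {gap}")
        print(f"  last line before: {before}")
        print(f"  first line after: {after}")
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `find_gaps(open("syslog-192.168.1.125.log"))`. Note that the gap in this particular crash is about 1 hour 40 minutes, so a strict 2-hour threshold would miss it.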
JorgeB Posted February 26, 2022

Unfortunately there's nothing relevant logged, which usually suggests a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.
thestraycat Posted February 26, 2022

@JorgeB - I'm leaning the same way myself. I have actually been running it with VM Manager off for the last few days and had a few crashes. Yesterday I turned off Docker overnight to see if it stayed up. It was off in the morning. I decided to run a memtest just for the hell of it, which ran for 8hrs and passed every test.

One thing I did note from the log, though, is that it quite often seems to crash at:

MediaServer emhttpd: spinning down /dev/sdf

sdf is my parity disk 2. What do you think? Coincidence?

I was thinking of doing the following:
1. Replace disk 'sdf', let the array rebuild, and then see how it runs (I have a spare).
2. If that doesn't work, move disk 'sdf' to a different drive bay controlled by a different HBA. That disk has reported 4 errors on 'UDMA CRC Error count'. Might it be the connectivity?
3. If that doesn't work, replace the HBA that sdf was originally connected to (I have a spare).
4. Last resort: replace the motherboard.

What do you think?
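On the 'UDMA CRC Error count' point: this counter (SMART attribute 199) generally reflects errors on the link between the disk and the controller, such as cabling or backplane contact, and it never resets, so what matters is whether it keeps climbing after a change. Below is a minimal sketch for pulling the raw value out of `smartctl -A` text so two snapshots can be compared. The sample table is illustrative only; it mirrors the count of 4 reported above and is not taken from the poster's diagnostics. The function names are hypothetical.

```python
def smart_raw_value(smartctl_output, attr_id):
    """Return the RAW_VALUE of one attribute from `smartctl -A` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0] == str(attr_id):
            return int(fields[9])
    return None

def crc_errors_increased(before_text, after_text, attr_id=199):
    """True if the CRC error counter grew between two snapshots."""
    before = smart_raw_value(before_text, attr_id)
    after = smart_raw_value(after_text, attr_id)
    if before is None or after is None:
        raise ValueError(f"attribute {attr_id} not found in one of the snapshots")
    return after > before

# Illustrative sample in the usual `smartctl -A` table layout (made-up values).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   055   055   000    Old_age   Always       -       33012
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       4
"""

if __name__ == "__main__":
    print("UDMA CRC errors:", smart_raw_value(SAMPLE, 199))
```

In practice the text would come from running `smartctl -A /dev/sdf` before and after moving the disk to another bay. If the count stays at 4, the old errors are probably historical; if it keeps rising, the cable, backplane, or HBA port becomes a more likely suspect. The parser assumes the raw value is a plain integer, which is normally the case for attribute 199 but not for every attribute.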
JorgeB Posted February 27, 2022

IMHO it's very unlikely that the crashing is related to spinning down a disk.
thestraycat Posted February 27, 2022 (edited)

@JorgeB - I agree it looks unlikely. However, if you look at where the crashes happen in the log, they all happen at the same point. The last tracked log entry before each crash is a communication with an HBA, or with a disk attached to that particular HBA. Or do you think that's just coincidence?
JorgeB Posted February 28, 2022

It's easy to verify: just disable spin down and see if the server still crashes.
plantsandbinary Posted February 28, 2022

As JorgeB said, start with the basics. Leave it running spun down. If it doesn't crash, spin the array up but keep Docker disabled. If it still doesn't crash, enable Docker. If it crashes then, start eliminating containers one at a time.

You've got a lot of different things connected, and from my own experience, in such a situation there can always be one piece of hardware that doesn't properly report temperatures etc. I had one device in my own machine that reported no temp in the UI, but when I inspected it physically I burned the s**** out of my finger touching it for even a second. After removing it I didn't have any issues.