ADDAndy Posted June 3, 2018

Hi everyone,

I'm trying to configure an Unraid server on a new NAS build that I purchased a few weeks ago, and I'm running into a pretty bad lockup problem.

Problem: Approximately once per week, the Unraid box becomes completely unresponsive (web UI goes down, SSH doesn't work, ping doesn't work, monitor and keyboard unresponsive), requiring a hard reset to fix.

Steps to reproduce: Run Unraid for ~1 week, wake up to find the server has crashed.

Server config:
2x E5 2660 v2
32GB ECC RAM
Supermicro dual LGA2011 motherboard
4x HGST HDD (3 drives, 1 parity)
2x 500 GB SSD (1x XFS cache drive, 1x UD XFS scratch drive)

Things I've already tried:
Memtest: ran for 24 hours, no failures
Changed from a BTRFS cache pool (2 drives) to 1x XFS drive for the cache pool (and 1 drive for 'other')

I've attached the most recent diagnostics and the syslog tail from FCP. Any advice would be appreciated.

odin-diagnostics-20180602-0837.zip FCPsyslog_tail.txt
trurl Posted June 3, 2018

Are you sure it has power when it is unresponsive?
ADDAndy Posted June 3, 2018

2 hours ago, trurl said: Are you sure it has power when it is unresponsive?

Yes, the peripherals (monitor/keyboard) still had power and the fans were still spinning.
PeteB Posted June 4, 2018

I'm having similar problems, but infrequently. Unraid locks up: pings to the server fail and the console is unresponsive. If it helps, I have posted diags and syslog before (I have troubleshooting mode enabled). You should be able to find them by looking through my posts, but I can post them again if needed. Still no closer to any resolution.
ADDAndy Posted June 6, 2018

This happened again overnight. Does anyone have any ideas? I've been trying to debug this for almost my whole trial period.
ADDAndy Posted June 7, 2018

Update: I pulled some event logs from my IPMI; the CPLD CATERR seems troublesome.

https://docs.google.com/spreadsheets/d/14wbsCsWK8PPBrdXaqzzT5MTLUPMFX03-oO-IvRcUGeQ/edit?usp=sharing
PeteB Posted June 7, 2018

Hi. I've got similar symptoms. My latest lockup left an MCE message on the screen. It might be worthwhile installing mcelog in case you are getting hardware errors. If you want to do it, first install the Nerd Tools plugin, then open it and select mcelog. I haven't had a recurrence yet, but I'm hoping that I might capture something with it.
ADDAndy Posted June 7, 2018

OK, I've installed mcelog. Does it automatically add to my syslog, or do I need to run another command?
Squid Posted June 7, 2018

If you've got Fix Common Problems installed, then the next time a scan runs (usually daily), if a Machine Check Error is present, it will automatically log the output of mcelog.

Other than that, to do it manually, run:

mcelog
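As a rough illustration of checking a syslog by hand for machine-check messages, here is a minimal Python sketch. It is not part of Fix Common Problems or mcelog; the function name `find_mce_lines`, the search pattern, and the two sample lines are hypothetical and only show the idea of filtering log lines for MCE-related text.

```python
import re

# Assumption: machine-check reports contain one of these phrases.
# The pattern is an illustrative guess, not an exhaustive list.
MCE_PATTERN = re.compile(r"machine check|hardware error|\bmce\b", re.IGNORECASE)


def find_mce_lines(lines):
    """Return the log lines that look like machine-check reports."""
    return [line.rstrip("\n") for line in lines if MCE_PATTERN.search(line)]


# Hypothetical sample lines for illustration only; they are not taken
# from the diagnostics attached in this thread.
sample = [
    "Jun  8 00:55:01 odin kernel: mce: [Hardware Error]: Machine check events logged\n",
    "Jun  8 00:55:02 odin kernel: eth0: link up\n",
]

hits = find_mce_lines(sample)
for hit in hits:
    print(hit)
```

To run it against a real log, pass an open file instead of `sample`, for example `find_mce_lines(open(path))`, where `path` points at the syslog tail you saved.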
ADDAndy Posted June 7, 2018

@Squid I tried running mcelog on my machine, but I keep getting an error message that /proc/mcelog doesn't exist. Ideas?
trurl Posted June 7, 2018

4 hours ago, ADDAndy said: @Squid i tried running mcelog on my machine, but i keep getting an error message that /proc/mcelog doesn't exist. Ideas?

Did you do this?

8 hours ago, PeteB said: It might be worthwhile installing mcelog in case you are getting hardware errors. If you want to do it, you first install the nerd tools plugin and then open it and select mcelog.
PeteB Posted June 8, 2018

Sorry, I don't get that message at all. *Maybe* I'm not getting it yet because I haven't had an MCE to capture? It might be worthwhile persisting with getting mcelog working correctly, in case it's trying to capture a problem. Hopefully someone else here can help with this error.
taxydrivar Posted June 8, 2018

Does it look the same as this beforehand? Because this is what mine is doing. It doesn't matter what tab you click on, nothing loads. I can access the terminal, though.
trurl Posted June 8, 2018

5 hours ago, taxydrivar said: Does look the same as this before hand? Because this is what mine is doing.... Doesn't matter what tab you click on, nothing loads. I can access terminal though.

If you have an ad blocker, whitelist your server.
ADDAndy Posted June 8, 2018

Update: I wanted to make sure I was on the newest firmware and BIOS, and it looks like I am, so no possible solution there.

I did have another crash last night. It occurred around 1 am local time, based on the syslog_tail and diagnostics info (attached). This corresponds to a CATERR event in my BIOS (my BIOS clock is wrong: it's 8:30 am, and my BIOS clock is showing 14:30):

8 2018/06/08 07:04:16 OEM CPLD CATERR - Asserted

So I'm now working under the assumption that this CATERR is the root cause of the instability. What could cause a CATERR? What's the debug path?

odin-diagnostics-20180608-0055.zip FCPsyslog_tail.txt
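The clock correction described in the post above can be worked through mechanically. This is a minimal Python sketch, assuming the offset between the BIOS clock and local time is constant and that both "now" readings (14:30 on the BIOS clock, 8:30 am local) were taken at the same moment on 2018/06/08; the function name `event_log_to_local` is hypothetical.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M:%S"


def event_log_to_local(event_time, bios_now, local_now):
    """Shift an event-log timestamp by the observed BIOS-vs-local clock offset."""
    # Offset = how far ahead the BIOS clock runs compared to local time.
    offset = datetime.strptime(bios_now, FMT) - datetime.strptime(local_now, FMT)
    return datetime.strptime(event_time, FMT) - offset


# Values from the post: the CATERR was logged at 07:04:16 on the BIOS clock,
# which read 14:30 when the local time was 8:30 am.
caterr_local = event_log_to_local(
    "2018/06/08 07:04:16",
    "2018/06/08 14:30:00",
    "2018/06/08 08:30:00",
)
print(caterr_local)  # 2018-06-08 01:04:16
```

With a six-hour offset, the CATERR lands at 01:04:16 local time, which is consistent with the crash at around 1 am.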
raidserver Posted August 5, 2018

I have had this issue ever since I've owned a Supermicro X11SAE-M. It only happens when starting or shutting down a LibreELEC VM. The motherboard CATERR_LED glows orange and the machine needs to be reset, which triggers a parity check once it has rebooted. I have never been able to capture any logs, as it's an extreme system halt. FWIW, the system has been up for 1 hour before the error; I never leave the server on for more than a day.

Edited August 5, 2018 by raidserver