Random Halt - How to diagnose?



I've run into an issue lately where my server randomly halts fairly often. It's headless, so I can't see the screen output, but I'd imagine there must be some sort of dump available? The machine stays powered on, but it stops responding to network traffic or anything else. I'm attaching the latest diagnostics, taken after restarting the system, in hopes there might be something in them. Any information on how I could best diagnose this?

atlantis-diagnostics-20200222-1250.zip


Ok, so I finally got a chance to dig through the logs after it happened again recently while I was away. I believe I've pinpointed the time, and I grabbed the relevant logs from just before I manually restarted the system through the IPMI interface. Attaching the log here in hopes someone can tell me whether this is a known issue or whether something in it looks like a cause. Nothing jumped out at me other than some error messages that I don't fully understand. I've never been great on the hardware side.

unraid-failure.log
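
For anyone digging through a similar log, here is a minimal sketch of how the lines worth a closer look could be filtered out. The keyword list and the function name `find_suspect_lines` are my own assumptions for illustration, not anything Unraid provides, so adjust the patterns for your own hardware.

```python
import re

# Hypothetical keyword list (an assumption, not an official Unraid list).
SUSPECT_PATTERNS = [
    r"kernel panic",
    r"call trace",
    r"machine check",
    r"i/o error",
    r"nvme",
]


def find_suspect_lines(lines, patterns=SUSPECT_PATTERNS):
    """Return (line_number, text) pairs for lines matching any pattern.

    Matching is case-insensitive; line numbers start at 1.
    """
    combined = re.compile("|".join(patterns), re.IGNORECASE)
    hits = []
    for number, line in enumerate(lines, start=1):
        if combined.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

It takes any iterable of lines, so an open file handle for the saved log can be passed in directly. A match only flags a line for reading in context; it doesn't identify the cause by itself.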


Here is a screenshot as well that I grabbed from the KVM interface. Unfortunately I'm running into issues trying to run the damn IPMI software so I can actually see the screen, but the screenshot at least shows some of the details.

Screen Shot 2020-02-26 at 5.24.10 PM.png

 

I was actually just on the system now via ssh and it happened again. Was just sitting there not even doing anything in particular, then it kicked out the kernel panic message. 

user@atlantis:/mnt/user/downloads$ 
Message from syslogd@atlantis at Mar  1 15:34:54 ...
 kernel:Kernel panic - not syncing: Fatal exception in interrupt
packet_write_wait: Connection to UNKNOWN port 65535: Broken pipe

Also including the latest diagnostics output from right now after it came back up.

 atlantis-diagnostics-20200301-1556.zip

Edited by 1activegeek

So I had posted this issue a few weeks ago: 

 - and now, it seems it's happened again. Was there something in those logs, or in the new logs attached here, that points toward a specific hardware issue? This is now the second time this has happened to me, though this time it says the file system is unmountable rather than mounted read-only. I'm thinking something hardware-related is broken or not working as intended, but I'm not the best at diagnosing hardware problems like this.

atlantis-diagnostics-20200306-1852.zip

 


Ok, thanks for the indication. In the meantime, I might actually swap out the NVMe for the 2x 500GB SSDs I was using previously. If these prove to work fine, then I'll assume the NVMe is the problem. There have just been too many random halts and 2 corruptions of the file system lately for me to ignore this as just pure coincidence.

 

Appreciate the guidance - crossing my fingers it's just a bum NVMe drive and I can get it swapped by Intel under warranty. 

