Mysterious intermittent crashing


Ometoch


A somewhat non-descriptive post title for a problem that's been baffling me. For the past several months, I've had an issue where my Unraid server becomes unresponsive, and when I go to check on it there's what appears to be kernel panic output on the monitor. The server runs fine for several weeks (I've seen it happen after two weeks of uptime and after over a month of uptime) and then this happens with no obvious trigger beforehand.

 

I've seen this happen on every version of Unraid between 6.4.1 (if I recall correctly) and 6.6.0 (what I was running when it happened again about half an hour ago). I ran a memory check for a solid day or so with no problems. I recently installed an HP H220 HBA (flashed with the LSI firmware), but it has crashed both with that card and with the combination of onboard SATA and the cheap SYBA SATA card I was using before.

 

I'm attaching a photo of the monitor, just in case someone can make any sense of the junk onscreen (although I doubt it's useful), as well as the server diagnostics bundle. Hopefully someone will be able to find something I'm not seeing.

IMG_1254.JPG

imogen-diagnostics-20181012-0344.zip


Not trying to hijack this thread, but I've been having this problem too, and maybe it's related. Mine started after I updated to 6.6.x. The system runs fine for several days and then becomes unresponsive. After power cycling it I have to endure a two-day parity check. Over the last two months I'd guess this has happened four times. The console shows a kernel panic (sorry for the poor image).

 

ScreenShot2018-10-12at6_46_04AM.png

asok-diagnostics-20181012-0708.zip

42 minutes ago, subagon said:

Not trying to hijack this thread, but I've been having this problem too, and maybe it's related. Mine started after I updated to 6.6.x. The system runs fine for several days and then becomes unresponsive. After power cycling it I have to endure a two-day parity check. Over the last two months I'd guess this has happened four times. The console shows a kernel panic (sorry for the poor image).

 


Open a new thread.  A kernel panic is a very non-specific thing.

5 hours ago, CHBMB said:

Can't look at your diagnostics as I'm on mobile, but my gut feeling is it's something hardware-related. Memory and PSU would be the obvious culprits.

I'd wait for someone to review your diagnostics, though; in the meantime you could consider running a memtest.

 

 

I ran memtest from the boot menu for close to 24 hours a couple months ago and it passed. Is there a more thorough memory test I can do, or is that enough to probably rule it out?
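One thing I've been wondering about is whether a userspace test run while the server is under its normal load would tell me anything more than the boot-menu memtest did. Just a rough sketch of what I mean, and I'm assuming memtester would have to be installed separately since it isn't part of stock Unraid:

# test 2 GB of RAM for 3 passes while the system is running normally
memtester 2048M 3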


I went into the room the server is in and saw something I didn't expect to see: a bunch of "BUG: Bad page state in process php" errors. I've only ever seen the previous error screens (like the one I attached in the first post) after the system had already crashed and I'd gone to check on it, but this time the server is still running, with the web UI still working and SMB shares still accessible; nothing obviously wrong right now except what's on this screen.
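In case it's useful, this is roughly how I've been checking whether more of them are accumulating (I'm assuming the standard Unraid syslog location here):

# count the bad page state messages in the current syslog, then show the most recent few
grep -ci "bad page state" /var/log/syslog
grep -i "bad page state" /var/log/syslog | tail -n 5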

 

The server is currently running a parity check after the crash that prompted the first post in this thread, if that's relevant to this.

 

I'm attaching a photo of this new screen plus a fresh copy of the diagnostics. What I posted in my first post was the screen after the system had become unresponsive, plus a diagnostics file from immediately after a reboot. What's attached to this post is from after the system has been running for about 12 hours and is still responsive.

IMG_1264.JPG

imogen-diagnostics-20181012-1544.zip


Reboot, then try a hail mary and see if by chance this makes a difference:

virsh node-memory-tune --shm-merge-across-nodes 0

To be honest, it's beyond me what this actually does, but there's one weird entry in your syslog:

Oct 12 08:27:58 imogen kernel: swap_info_get: Bad swap file entry 3ffffff7fffff

You don't have a swap file defined, and googling that entry is what suggested the command above as a possibility. (That, and it wouldn't hurt to upgrade to 6.6.1 if you're not using NFS.)
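If you want to sanity-check things before or after running it (a rough sketch, and I'm not certain either of these proves much), you could confirm there really is no swap configured and look at what the KSM parameters are currently set to; running node-memory-tune with no arguments just prints the current values:

# confirm no swap device or file is active
cat /proc/swaps

# show the current KSM parameters, including shm-merge-across-nodes
virsh node-memory-tune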


I found that as well, and I also found an account of someone reseating the memory in their computer to fix "bad page state" errors that were tainting the kernel. I figured I'd try that before running that command (since neither of us really knows what it does; as near as I can tell it's related to NUMA, which should be irrelevant to my machine since it's just a four-core i5 and not a multi-CPU or Threadripper/Epyc system).

 

I've been watching the syslog like a hawk lately, and so far it hasn't thrown any new errors, but it hasn't been running long enough for me to feel confident that that was the fix (I'm not entirely sure how long is long enough, since the crashing was so sporadic). If it does crash again, I'll try that node-memory-tune thing and see what happens.
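For what it's worth, "watching it like a hawk" mostly amounts to following the log live and occasionally copying it to the flash drive so there's at least something left to read if it dies again. A rough sketch, assuming the usual Unraid paths with the flash mounted at /boot:

# follow the syslog live from a console or SSH session
tail -f /var/log/syslog

# snapshot the current syslog to the flash drive with a timestamped name
cp /var/log/syslog /boot/syslog-$(date +%Y%m%d-%H%M).txt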

 

As for 6.6.1, I do use NFS to access a share from one of my VMs. I looked through the 6.6.1 release thread and saw people having trouble with NFS, but I don't know how risky it would be to update in my situation. I'll probably hold off until 6.6.2. My problem has been happening since before the 6.6.x series anyway.

4 weeks later...

Crashed again about thirty minutes ago, after running quietly since October 13 (looking back at my last post), which goes to show what a pain this is to troubleshoot. I guess reseating the memory didn't help after all. I've now updated Unraid to 6.6.4 and tried running that virsh command, and absent other ideas I'll hope that does something and I don't have to look at replacing hardware (probably the memory?).

 

Between my last post and now I was running the Fix Common Problems troubleshooting mode, which writes the diagnostics bundle to the flash drive every half hour, and I was also looking through the syslog every day or so; I don't see any errors between then and the crash. I'll attach the last diagnostics bundle saved before the crash anyway, in case I'm missing something. But as far as I can tell, the main thing I have to go on is once again a cryptic error screen.
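For anyone curious how I picked which bundle to attach, I just sorted what troubleshooting mode had written to the flash by time and took the newest one from before the crash. A sketch only; I'm assuming the /boot/logs/ path here, so adjust to wherever FCP actually puts them on your flash drive:

# list the saved diagnostics bundles, newest first
ls -lt /boot/logs/*diagnostics*.zip | head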

IMG_1272.JPG

imogen-diagnostics-20181108-0437.zip


I forgot about the FCPsyslog_tail.txt that Fix Common Problems tells you to upload to the forums along with the diagnostics... but looking at it now, I notice it stopped being updated on the flash drive on November 2 for some reason, even though the system kept running and FCP's troubleshooting mode stayed active. I have no idea what that means.
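To see when it actually stopped updating, I just checked the file's modification time on the flash. The path below is an assumption; use whatever the find command turns up if yours lives somewhere else:

# locate the file on the flash drive and check when it was last written
find /boot -name FCPsyslog_tail.txt
stat /boot/logs/FCPsyslog_tail.txt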

FCPsyslog_tail.txt

