[6.9.2] Server crash (unresponsive, fans spin)



Hello Everyone!

I put this system together not long ago. Most of it is running, and I was able to troubleshoot the problems I ran into so far. But sometimes it crashes, and I can't figure out why. If anyone has an idea or can point me in the right direction, I would be grateful. 🙂

Symptoms: the webGUI doesn't load, shares are inaccessible, fans are running.
Since Pihole runs on it and my desktop and phone use it as their DNS server, those machines also lose their internet connection. If I leave it as it is, it doesn't recover; after a cold reboot everything starts up fine (array, parity check, Docker containers). The timing looks random. It usually happened in the early morning hours, but it has also happened at 11:40.

Uptime is also random, sometimes a week, sometimes it doesn't even reach 24h.

 

System:
Acer Q67H2-AM mobo with i3-2120

6GB RAM: 2GB Kingmax + 4GB Crucial
PSU: Cooler Master Elite Power 500W

HDDs: 4x Samsung HD204UI 2TB

 

What I ruled out:
- no cache -> no mover
- no VM
- RAM: memtest86 found no errors
 

Syslog: doesn't show anything. (The crash happened on the 28th, between 8:03 and 12:32.)

I tried capturing it with a local syslog server and by mirroring the syslog to the flash drive, but there is nothing suspicious.
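
For anyone reading along: since a hard hang logs nothing, about the only thing a syslog can show is where the entries stop and where they resume after the reboot. Here is a minimal Python sketch of that idea; it is not taken from the attached files. It assumes every line starts with the classic `Mon DD HH:MM:SS` stamp and that you supply the year yourself. `parse_syslog_time` and `largest_gap` are hypothetical helper names, and the path in the usage comment is only an example.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year=2021):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a classic syslog line.

    Returns None if the line does not start with such a stamp.
    Assumption: the year is not in the stamp, so the caller supplies it.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def largest_gap(lines, year=2021):
    """Return (gap, last_before, first_after) for the longest silence in the log.

    Lines without a parseable stamp are skipped. With fewer than two
    stamped lines the result is (timedelta(0), None, None).
    """
    stamps = [t for t in (parse_syslog_time(l, year) for l in lines) if t is not None]
    best = (timedelta(0), None, None)
    for prev, cur in zip(stamps, stamps[1:]):
        if cur - prev > best[0]:
            best = (cur - prev, prev, cur)
    return best


# Usage (example path, adjust to wherever your mirrored syslog lives):
#   with open("syslog-127.0.0.1.log", errors="replace") as f:
#       gap, before, after = largest_gap(f)
#   print(f"longest silence: {gap}, from {before} to {after}")
```

The longest silence only brackets the hang. The last entry before it is worth reading closely, but an empty gap by itself says nothing about the cause.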

 

Once, the stats tab was open in the browser on my desktop PC when the crash happened. I saw very little activity on everything: CPU, network and disk usage were barely visible on the graphs. The RAM showed the usual values: around 3 GB used, around 200-300 MB free, and the rest cached, which should be normal on a Linux-based system afaik.
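
On the RAM point: Linux reports this split in `/proc/meminfo`, where a low `MemFree` next to a large `Cached` value is expected, and `MemAvailable` is the kernel's estimate of what can still be handed to applications. A small Python sketch for reading those fields follows. It assumes the usual `Field:   12345 kB` layout; `parse_meminfo` and `summarize` are hypothetical helper names, not part of any Unraid tool.

```python
def parse_meminfo(text):
    """Turn the text of /proc/meminfo into a dict of {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a 'Field: value' line
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def summarize(info):
    """Return total, free, cache (Buffers + Cached) and available, in MiB."""
    def to_mib(kib):
        return kib // 1024

    return {
        "total": to_mib(info["MemTotal"]),
        "free": to_mib(info["MemFree"]),
        "cache": to_mib(info.get("Buffers", 0) + info.get("Cached", 0)),
        "available": to_mib(info.get("MemAvailable", 0)),
    }


# Usage on the server itself:
#   with open("/proc/meminfo") as f:
#       print(summarize(parse_meminfo(f.read())))
```

If "available" stays comfortably above zero while "free" is small, the picture matches normal page-cache behaviour rather than memory exhaustion.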

 

From the Docker log files: I couldn't find all the logs, but I can post them, as I copied my whole appdata folder to my desktop after this crash.
I checked the following containers: Jellyfin, Jackett, Lidarr, NginxProxyManager, Ombi, Qbittorrent, Radarr, Readarr, Sonarr.
I found that the crash was between 8:03 and 12:30. The Ombi log file had the last timestamp. Nothing suspicious was found, only normal scheduled tasks running successfully.
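
The "which container logged last" check can be done mechanically instead of by opening each file. This is a rough Python sketch under one big assumption: each log line carries a `YYYY-MM-DD HH:MM:SS` (or `T`-separated) stamp somewhere in it. Real container logs differ in format, so the pattern may need adjusting per app. `last_timestamp` and `last_seen` are hypothetical helper names.

```python
import re
from datetime import datetime

# Assumed stamp layout; adjust for containers that log differently.
STAMP = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")


def last_timestamp(lines):
    """Return the newest timestamp found in the given lines, or None."""
    latest = None
    for line in lines:
        match = STAMP.search(line)
        if not match:
            continue
        try:
            stamp = datetime.strptime(match.group(0).replace("T", " "),
                                      "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # looked like a stamp but was not a valid date
        if latest is None or stamp > latest:
            latest = stamp
    return latest


def last_seen(logs):
    """logs maps a name to an iterable of lines.

    Returns [(name, newest_stamp), ...] sorted newest first; logs without
    any parseable stamp are left out.
    """
    seen = ((name, last_timestamp(lines)) for name, lines in logs.items())
    return sorted((p for p in seen if p[1] is not None),
                  key=lambda p: p[1], reverse=True)


# Usage (example file names only):
#   logs = {name: open(path, errors="replace") for name, path in
#           {"ombi": "ombi.log", "sonarr": "sonarr.txt"}.items()}
#   for name, stamp in last_seen(logs):
#       print(stamp, name)
```

The first entry in the result is the container that was still writing closest to the hang, which gives a lower bound on the crash time.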


I suspect a PSU fault, but I would like to rule out software-side causes first. Currently I don't have a spare PSU to swap in and test.

Any ideas are welcome. 🙂

thebrain-diagnostics-20210828-1246.zip syslog-127.0.0.1.log


I've ruled out one more thing since my last post. 

I had 2 containers on the br0 custom network (Pihole and Unbound). Yesterday I stopped them; today my server crashed again.

I will stop some more after the next restart to narrow down whether they are the problem, but I doubt I can find the root cause this way.

 

If anyone has the slightest idea, don't hesitate to share. :)

On 9/3/2021 at 4:51 PM, Gibbo592 said:

Seems to be similar. I will try going back to 6.8.3 and see what happens.

 

Keep me updated. I'm not home until tomorrow night to try it. Are you using the My Servers plugin? I'm curious whether it started around the time that was released. They did have issues with the API at first.


Still no luck. I have tried all the available versions as a trial, from 6.8.3 to 6.10. All of them hang randomly: the cursor keeps flashing but the system won't respond, only a reboot works, and there are no errors in the syslog.

I installed several different OSes (Windows 10, Ubuntu Server, TrueNAS and currently Slackware 14.2) and all run fine, so I don't believe it is a hardware problem; it all points to Unraid itself. I'm trying to find an up-to-date guide on building a custom kernel, so I can remove all the AMD stuff and keep it generic Intel and Nvidia.

On 9/5/2021 at 4:59 PM, Gibbo592 said:
Still no luck. I have tried all the available versions as a trial, from 6.8.3 to 6.10. All of them hang randomly: the cursor keeps flashing but the system won't respond, only a reboot works, and there are no errors in the syslog.

I installed several different OSes (Windows 10, Ubuntu Server, TrueNAS and currently Slackware 14.2) and all run fine, so I don't believe it is a hardware problem; it all points to Unraid itself. I'm trying to find an up-to-date guide on building a custom kernel, so I can remove all the AMD stuff and keep it generic Intel and Nvidia.

Have you run a memtest for a number of hours yet? That's one thing I still need to try. I'm also running out of options.

 

Edit: Sorry just realized that you did do a memtest.
 

Edited by mkono87
On 9/8/2021 at 4:42 PM, Gibbo592 said:

@mkono87 I added BOOT_IMAGE=/bzimage initrd=/bzroot acpi=off. It has been running for 10 hours so far. Still lots of time errors, but at least it is alive.

 

I realized that my BIOS was way out of date. I updated it, and it's been running for 3 days so far. I'm not in the clear yet, but it's a good sign. My appdata drive appears to have an XFS corruption error, so I need to repair that at some point too.
