
UnRAID Crashing


Go to solution Solved by Marbles_00,


Hey all,

 

Need some help.  Normally I can troubleshoot this stuff on my own, but for this, I really need some advice.  I've been running UnRAID on some older hardware, 24/7, for 10+ years.  Up until the last few months the system has been rock solid (other than times when we have lost actual power).  The hardware is the following:

Motherboard: Asus P5B-VM DO

CPU: Intel E4700 Core2Duo (the most recent upgrade)

Memory: 4 sticks, 6 GB total, 2x 2GB Mushkin Silverline, 2x 1GB Mushkin Silverline

Power Supply: Corsair CX600

11 hard drives, mostly WD20EFRX/EFZX with a couple of Seagate Barracuda 2TB drives, and 2x Hitachi 160GB cache drives.

LAN Card: TP-Link gigabit adapter (the motherboard LAN died)

Promise SATA TX4

 

As mentioned, this machine was very reliable up until I upgraded to UnRAID 6.10.  Then the system would just stop responding, and I had to do hard reboots to get it up again.  It was rather random in occurrence, and unfortunately on every reboot the syslog would clear, so I couldn't capture the event happening.  I dropped back down to 6.9 thinking the upgrade was the issue, and though the system didn't crash as often, it still did periodically.

 

I've blown the dust out several times (thinking overheating), even though the temps in UnRAID indicated they were all good.  I replaced a fan that was faulting and cycling the power supply at one point (I thought that might have been the culprit).  Things seemed to run well after that, and the system stayed up without a crash for several weeks.  So I updated to UnRAID 6.11.  The system ran for over 19 days straight, but then real crashes occurred over the last day, where the system just stops responding as soon as UnRAID boots up.

 

I've connected a monitor now, and I've noticed this hardware error several times on boot:

image.thumb.jpeg.83cc949035f4589ab68e167a8bfd97b0.jpeg

 

The most recent screen has been this after a successful reboot, and left at the login screen:

image.thumb.jpeg.362b24cadf54629773ff1ff7b8712d9f.jpeg

 

Now I've re-seated the RAM, and the Promise card seemed to have popped up over time, so that was corrected.  I've run through several MEMTests and all have passed, so I don't feel that the RAM is an issue.  I'm thinking it could be one of 3 things, and this is where someone who has seen this issue may be able to help.  I think it is either:

a) the power supply is faulting (yet BIOS voltages are all good)

b) the CPU

c) the motherboard

d) a combination of a), b), c)

I am in an area where we get some pretty severe weather, so I wouldn't doubt at some point a brown out or power outage may have taken its toll on something, but I just can't correlate when the crashing issue occurred in relation to a severe power outage.

One other thing I've noticed in the syslog is a network communication error, which I captured in the following picture.  But I figured that happens during bootup, before the LAN adapter drivers are loaded:

image.thumb.jpeg.22ab0bf182f06f286740d4017af4e27e.jpeg

Once the system is booted up (and the system wasn't crashing), I've never noticed not having a network connection.

 

Apologies for the long post.  Hope someone can shed some light on what may be going on.

 

Cheers,

 


I did have a remote syslog server set up on another system (an OMV server called Cronus), but it never reported anything hardware-related.  It just reported that it lost communication with my UnRAID server, starting with losing NUT communication to the UPS.  Here is an example (192.168.0.14 is my UnRAID server):

 

Jan 12 00:00:04 Cronus rsyslogd:  [origin software="rsyslogd" swVersion="8.1901.0" x-pid="497" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
Jan 12 00:00:04 Cronus rsyslogd:  [origin software="rsyslogd" swVersion="8.1901.0" x-pid="497" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
Jan 12 00:00:04 Cronus systemd[1]: logrotate.service: Succeeded.
Jan 12 00:00:04 Cronus systemd[1]: Started Rotate log files.
Jan 12 00:02:49 Cronus nmbd[581]: [2023/01/12 00:02:49.178410,  0] ../source3/nmbd/nmbd_namequery.c:109(query_name_response)
Jan 12 00:02:49 Cronus nmbd[581]:   query_name_response: Multiple (2) responses received for a query on subnet 192.168.0.15 for name WORKGROUP<1d>.
Jan 12 00:02:49 Cronus nmbd[581]:   This response was from IP 192.168.0.14, reporting an IP address of 192.168.0.14.
Jan 12 00:07:48 Cronus nmbd[581]: [2023/01/12 00:07:48.942806,  0] ../source3/nmbd/nmbd_namequery.c:109(query_name_response)
Jan 12 00:07:48 Cronus nmbd[581]:   query_name_response: Multiple (2) responses received for a query on subnet 192.168.0.15 for name WORKGROUP<1d>.
Jan 12 00:07:48 Cronus nmbd[581]:   This response was from IP 192.168.0.14, reporting an IP address of 192.168.0.14.
Jan 12 00:09:01 Cronus CRON[11294]: (root) CMD (  [ -x /usr/lib/php/sessionclean ] && if [ ! -d /run/systemd/system ]; then /usr/lib/php/sessionclean; fi)
Jan 12 00:09:03 Cronus systemd[1]: Starting Clean php session files...
Jan 12 00:09:04 Cronus systemd[1]: phpsessionclean.service: Succeeded.
Jan 12 00:09:04 Cronus systemd[1]: Started Clean php session files.
Jan 12 00:11:18 Cronus upsmon[597]: Poll UPS [[email protected]] failed - Server disconnected
Jan 12 00:11:18 Cronus upsmon[597]: Communications with UPS [email protected] lost
Jan 12 00:11:18 Cronus upssched[11641]: Executing command: notify
Jan 12 00:11:20 Cronus collectd[745]: nut plugin: nut_read: upscli_list_start (0002) failed: Server disconnected
Jan 12 00:11:20 Cronus collectd[745]: read-function of plugin `nut/[email protected]' failed. Will suspend it for 20.000 seconds.
Jan 12 00:11:38 Cronus collectd[745]: nut plugin: nut_connect: upscli_connect (192.168.0.14, 3493) failed: Connection failure: No route to host
Jan 12 00:11:38 Cronus collectd[745]: read-function of plugin `nut/[email protected]' failed. Will suspend it for 40.000 seconds.
Jan 12 00:11:41 Cronus upsmon[597]: UPS [[email protected]]: connect failed: Connection failure: No route to host
Jan 12 00:11:47 Cronus upsmon[597]: UPS [[email protected]]: connect failed: Connection failure: No route to host
Jan 12 00:11:53 Cronus upsmon[597]: UPS [[email protected]]: connect failed: Connection failure: No route to host
Jan 12 00:11:59 Cronus upsmon[597]: UPS [[email protected]]: connect failed: Connection failure: No route to host
Jan 12 00:12:05 Cronus upsmon[597]: UPS [[email protected]]: connect failed: Connection failure: No route to host
Jan 12 00:12:11 Cronus upsmon[597]: UPS [[email protected]]: connect failed: Connection failure: No route to host
Jan 12 00:12:18 Cronus upsmon[597]: UPS [[email protected]]: connect failed: Connection failure: No route to host
Jan 12 00:12:18 Cronus collectd[745]: nut plugin: nut_connect: upscli_connect (192.168.0.14, 3493) failed: Connection failure: No route to host
Jan 12 00:12:18 Cronus collectd[745]: read-function of plugin `nut/[email protected]' failed. Will suspend it for 80.000 seconds.
Jan 12 00:12:24 Cronus upsmon[597]: UPS [[email protected]]: connect failed: Connection failure: No route to host

 

Maybe I didn't have it set up correctly.
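Even without a hardware message, the remote log excerpt above can narrow down when the server dropped off the network. The following is only a rough sketch, not anything UnRAID or NUT provides: `crash_window` and `SAMPLE` are made-up names, and treating "any message that mentions the server's IP" as proof the server was still alive is an assumption. It returns the last time the server was seen and the time upsmon reported communications lost.

```python
import re

# Classic syslog layout: "Jan 12 00:11:18 host prog[pid]: message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"(?P<prog>[\w.-]+)(?:\[\d+\])?:\s+(?P<msg>.*)$"
)


def crash_window(lines, server_ip):
    """Return (last_seen, lost): the timestamp of the last message that
    mentions server_ip before upsmon reports lost communications, and the
    timestamp of that upsmon report (None if it never appears)."""
    last_seen = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that are not in syslog format
        msg = m["msg"]
        if (m["prog"] == "upsmon"
                and msg.startswith("Communications with UPS")
                and msg.endswith("lost")):
            return last_seen, m["ts"]
        if server_ip in msg:
            # Assumption: any mention of the IP means the server was alive.
            last_seen = m["ts"]
    return last_seen, None


# A few lines copied from the excerpt above.
SAMPLE = [
    "Jan 12 00:07:48 Cronus nmbd[581]:   This response was from IP 192.168.0.14, reporting an IP address of 192.168.0.14.",
    "Jan 12 00:09:04 Cronus systemd[1]: Started Clean php session files.",
    "Jan 12 00:11:18 Cronus upsmon[597]: Communications with UPS [email protected] lost",
    "Jan 12 00:11:38 Cronus collectd[745]: nut plugin: nut_connect: upscli_connect (192.168.0.14, 3493) failed: Connection failure: No route to host",
]

if __name__ == "__main__":
    print(crash_window(SAMPLE, "192.168.0.14"))
```

On the lines above this puts the hang somewhere between 00:07:48 and 00:11:18 on Jan 12. That says when it happened, not why, so it only helps for lining the crash up against other events such as a power blip.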

syslog (1)


Just pulled this diagnostic from the server: zeus-diagnostics-20230113-1545.zip

 

And no sooner had I done that (and logged in via command prompt) than this came up on the screen:

image.thumb.jpeg.49deda80eb09157db0a786cae47d18f9.jpeg

 

I have a gut feeling that the motherboard is on its way out.  The combination of issues, starting with the onboard LAN dying a while back, is pointing at that.  I do have another LGA775 board on hand where I could try out the CPU, a couple of sticks of memory and the PSU.  I just didn't want to use it as a replacement for the current board, as it lacks SATA ports for all my drives.  I could run a trial UnRAID on it just to see if the other hardware is still good.


To give an update.  Swapped the CPU and memory over to another motherboard, and have been running a trial version of UnRAID on it.  Where my server has been crashing just after boot the last few days, this temporary system has been humming along.  At this point it doesn't appear that the CPU or memory are the issue.

 

I'm going to jerry-rig the PSU over onto the temp system, and load it as much as possible to rule that out.


PSU is now running the temp setup.  CPU, PSU, and memory are all looking good right now.  Booted and running without issues.  I'll give it a couple of days, but early conclusions are that the P5B-VM DO has finally bit the biscuit.  So, I'm looking at a Supermicro X10SL7-F, Xeon E3-1285L, with 16GB of RAM combo.  A little dated, but so was the previous system when I put it all together (and it lasted 10+ years).  Regardless, it's way more powerful than what I was running, and what I was running worked exceptionally well for my requirements.

  • 2 weeks later...
  • Solution

This is just a follow-up.  I have now replaced the motherboard with a Supermicro X10SL7-F (along with the CPU and memory).  The system is back up and running fine.  The old CPU and part of the memory from the previous UnRAID setup are now being used in another media machine, and it too is running fine.

 

Solution: bad motherboard - note to others that may be having similar crashing issues.  I was fortunate enough to have enough spare parts gathered over the years to be able to troubleshoot hardware in other setups to determine if any of them were the culprit.
