Marbles_00 Posted January 13, 2023 (edited)

Hey all,

Need some help. Normally I can troubleshoot this stuff on my own, but for this I really need some advice.

I've been running UnRAID on some older hardware, 24/7, for the better part of 10+ years. Up until the last few months the system has been rock solid (other than times when we have lost actual power). The hardware is the following:

Motherboard: Asus P5B-VM DO
CPU: Intel E4700 Core 2 Duo (the most recent upgrade)
Memory: 4 sticks, 6GB total; 2x 2GB Mushkin Silverline, 2x 1GB Mushkin Silverline
Power supply: Corsair CX600
Hard drives: 11 total, mostly WD20EFRX/EFZX with a couple of Seagate Barracuda 2TB drives, and 2x Hitachi 160GB cache drives
LAN card: TP-Link gigabit adapter (the motherboard LAN died)
Controller: Promise SATA TX4

As mentioned, this machine was very reliable, up until I upgraded to UnRAID 6.10. Then the system would just stop responding, and I had to do hard reboots to get it up again. The crashes were rather random in occurrence, and unfortunately the syslog cleared on every reboot, so I couldn't capture the event happening. I dropped back down to 6.9 thinking that was the issue, and though the system didn't crash as much, it still did periodically. I've blown the dust out several times (thinking overheating), even though the temps in UnRAID indicated they were all good. I replaced a fan that was faulting, and cycled the power supply at one point (I thought that might have been the culprit). Things seemed to run well after that, and the system stayed up without a crash for several weeks. So I updated to UnRAID 6.11. The system ran for over 19 days straight, but then the crashes returned over the last day, where the system just stops responding as soon as UnRAID boots up.
I've connected a monitor now, and I've noticed this hardware error several times on boot:

The most recent screen has been this, after a successful reboot and left at the login screen:

Now I've re-seated the RAM, and the Promise card seemed to have popped up out of its slot over time, so that was corrected. I've run through several passes of MemTest and all have passed, so I don't feel that the RAM is an issue. I'm thinking it could be one of three things, and this is where someone who has seen this issue may be able to help. I think it is either:

a) the power supply is faulting (yet BIOS voltages are all good)
b) the CPU
c) the motherboard
d) a combination of a), b), and c)

I am in an area where we get some pretty severe weather, so I wouldn't doubt that at some point a brownout or power outage has taken its toll on something, but I just can't correlate the start of the crashing with any particular severe power outage.

One other thing I've noticed in the syslog, captured in the following picture, is a network communication error. But I figured that happens during bootup, prior to the LAN adapter drivers being installed. Once the system was booted up (back when it wasn't crashing), I never noticed a missing network connection.

Apologies for the long post. Hope someone can shed some light on what may be going on.

Cheers, Edited January 13, 2023 by Marbles_00
trurl Posted January 13, 2023

Attach diagnostics to your NEXT post in this thread.

32 minutes ago, Marbles_00 said: "unfortunately on every reboot, the syslog would clear, so I couldn't capture the event happening"

Set up a syslog server.
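(For reference: persisting the log to another machine generally means pointing UnRAID's Settings → Syslog Server page at a box that accepts remote syslog. A minimal rsyslog receiver config on that remote Linux machine might look like the sketch below; the port, protocol, and file path are assumptions to adjust for your setup.)

```
# /etc/rsyslog.d/10-remote.conf on the receiving machine (a sketch)
module(load="imudp")            # accept syslog messages over UDP
input(type="imudp" port="514")  # standard syslog port

# write each remote host's messages to its own file
$template RemoteLogs,"/var/log/remote/%HOSTNAME%.log"
*.* ?RemoteLogs
```

Restart rsyslog after adding the file, and the UnRAID server's messages should survive its reboots on the remote box.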
Marbles_00 Posted January 13, 2023 Author (edited)

I did have a remote syslog server set up on another system (an OMV server called Cronus), but it never reported any hardware issue; it just reported that it lost communication with my UnRAID server, starting with losing NUT communication to the UPS. Here is an example (192.168.0.14 is my UnRAID server):

Jan 12 00:00:04 Cronus rsyslogd: [origin software="rsyslogd" swVersion="8.1901.0" x-pid="497" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
Jan 12 00:00:04 Cronus rsyslogd: [origin software="rsyslogd" swVersion="8.1901.0" x-pid="497" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
Jan 12 00:00:04 Cronus systemd[1]: logrotate.service: Succeeded.
Jan 12 00:00:04 Cronus systemd[1]: Started Rotate log files.
Jan 12 00:02:49 Cronus nmbd[581]: [2023/01/12 00:02:49.178410, 0] ../source3/nmbd/nmbd_namequery.c:109(query_name_response)
Jan 12 00:02:49 Cronus nmbd[581]: query_name_response: Multiple (2) responses received for a query on subnet 192.168.0.15 for name WORKGROUP<1d>.
Jan 12 00:02:49 Cronus nmbd[581]: This response was from IP 192.168.0.14, reporting an IP address of 192.168.0.14.
Jan 12 00:07:48 Cronus nmbd[581]: [2023/01/12 00:07:48.942806, 0] ../source3/nmbd/nmbd_namequery.c:109(query_name_response)
Jan 12 00:07:48 Cronus nmbd[581]: query_name_response: Multiple (2) responses received for a query on subnet 192.168.0.15 for name WORKGROUP<1d>.
Jan 12 00:07:48 Cronus nmbd[581]: This response was from IP 192.168.0.14, reporting an IP address of 192.168.0.14.
Jan 12 00:09:01 Cronus CRON[11294]: (root) CMD ( [ -x /usr/lib/php/sessionclean ] && if [ ! -d /run/systemd/system ]; then /usr/lib/php/sessionclean; fi)
Jan 12 00:09:03 Cronus systemd[1]: Starting Clean php session files...
Jan 12 00:09:04 Cronus systemd[1]: phpsessionclean.service: Succeeded.
Jan 12 00:09:04 Cronus systemd[1]: Started Clean php session files.
Jan 12 00:11:18 Cronus upsmon[597]: Poll UPS [ups@192.168.0.14] failed - Server disconnected
Jan 12 00:11:18 Cronus upsmon[597]: Communications with UPS ups@192.168.0.14 lost
Jan 12 00:11:18 Cronus upssched[11641]: Executing command: notify
Jan 12 00:11:20 Cronus collectd[745]: nut plugin: nut_read: upscli_list_start (0002) failed: Server disconnected
Jan 12 00:11:20 Cronus collectd[745]: read-function of plugin `nut/ups@192.168.0.14' failed. Will suspend it for 20.000 seconds.
Jan 12 00:11:38 Cronus collectd[745]: nut plugin: nut_connect: upscli_connect (192.168.0.14, 3493) failed: Connection failure: No route to host
Jan 12 00:11:38 Cronus collectd[745]: read-function of plugin `nut/ups@192.168.0.14' failed. Will suspend it for 40.000 seconds.
Jan 12 00:11:41 Cronus upsmon[597]: UPS [ups@192.168.0.14]: connect failed: Connection failure: No route to host
Jan 12 00:11:47 Cronus upsmon[597]: UPS [ups@192.168.0.14]: connect failed: Connection failure: No route to host
Jan 12 00:11:53 Cronus upsmon[597]: UPS [ups@192.168.0.14]: connect failed: Connection failure: No route to host
Jan 12 00:11:59 Cronus upsmon[597]: UPS [ups@192.168.0.14]: connect failed: Connection failure: No route to host
Jan 12 00:12:05 Cronus upsmon[597]: UPS [ups@192.168.0.14]: connect failed: Connection failure: No route to host
Jan 12 00:12:11 Cronus upsmon[597]: UPS [ups@192.168.0.14]: connect failed: Connection failure: No route to host
Jan 12 00:12:18 Cronus upsmon[597]: UPS [ups@192.168.0.14]: connect failed: Connection failure: No route to host
Jan 12 00:12:18 Cronus collectd[745]: nut plugin: nut_connect: upscli_connect (192.168.0.14, 3493) failed: Connection failure: No route to host
Jan 12 00:12:18 Cronus collectd[745]: read-function of plugin `nut/ups@192.168.0.14' failed. Will suspend it for 80.000 seconds.
Jan 12 00:12:24 Cronus upsmon[597]: UPS [ups@192.168.0.14]: connect failed: Connection failure: No route to host

Maybe I didn't have it set up correctly.
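(For reference: the client side of that NUT link is a single MONITOR line in `upsmon.conf` on the monitoring box. A minimal sketch is below; the UPS name `ups`, the credentials `monuser`/`secret`, and the power value are assumptions and must match what the NUT server on the UnRAID machine exports.)

```
# /etc/nut/upsmon.conf on the monitoring client (a sketch)
# MONITOR <upsname>@<host> <powervalue> <username> <password> <type>
MONITOR ups@192.168.0.14 1 monuser secret slave
```

Newer NUT releases use `secondary` in place of `slave` for the type field. The "No route to host" errors above mean the client couldn't reach the server at all, which is consistent with the UnRAID box having hung rather than a NUT misconfiguration.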
syslog (1) Edited January 13, 2023 by Marbles_00
Marbles_00 Posted January 13, 2023 Author (edited)

I do have a diagnostics file from when I started noticing the crashing (attached). I will try to grab a new one, if I can keep the system up long enough to run diagnostics. zeus-diagnostics-20221130-1002.zip Edited January 13, 2023 by Marbles_00
Marbles_00 Posted January 13, 2023 Author (edited)

Just pulled this diagnostic from the server: zeus-diagnostics-20230113-1545.zip

And no sooner had I done that (and logged in via command prompt) than this came up on the screen:

I have a gut feeling that the motherboard is on its way out. The combination of issues, starting with the onboard LAN pooping out a while back, is just pointing at that. I do have another LGA775 board on hand where I could try out the CPU, a couple of sticks of memory, and the PSU; I just didn't want to use it as a replacement for the current board, as it lacks SATA ports for all my drives. I could run a trial UnRAID on it just to see if the other hardware is still good. Edited January 13, 2023 by Marbles_00
Marbles_00 Posted January 14, 2023 Author

To give an update: I swapped the CPU and memory over to another motherboard, and have been running a trial version of UnRAID on it. Where my server had been crashing just after boot the last few days, this temporary system has been humming along. At this point it doesn't appear that the CPU or memory are the issue. I'm going to jury-rig the PSU over onto the temp system and load it as much as possible to rule that out.
Marbles_00 Posted January 14, 2023 Author (edited)

The PSU is now running the temp setup. CPU, PSU, and memory are all looking good right now; it booted and is running without issues. I'll give it a couple of days, but the early conclusion is that the P5B-VM DO has finally bit the biscuit. So I'm looking at a Supermicro X10SL7-F and Xeon E3-1285L combo with 16GB of RAM. A little dated, but so was the previous system when I actually put it all together (and it lasted 10+ years). Regardless, it's way more powerful than what I was running with, and what I was running with worked exceptionally well for my requirements. Edited January 15, 2023 by Marbles_00
Solution Marbles_00 Posted January 27, 2023 Author

This is just a follow-up. I have now replaced the motherboard with a Supermicro X10SL7-F (along with the CPU and memory). The system is back up and running fine. The old CPU and some of the memory from the previous UnRAID setup are now being used in another media machine, and it too is running fine.

Solution: bad motherboard. A note to others who may be having similar crashing issues: I was fortunate enough to have gathered enough spare parts over the years to be able to test the hardware in other setups and determine whether any of it was the culprit.