May 23, 2024
Hey all, complete Unix newbie here. I have what I think is a basic Unraid setup with no VMs or Docker containers. I do have notifications set up via email, and I get two per day when things are good. The current problem is that the server goes offline and becomes unreachable, and I do not get a notification that something is wrong when this happens. I've run memtest for hours and all seems fine with that.
I have two 8TB parity drives and three 4TB drives in the array, all connected to the mobo SATA ports. I believe this started when I added the two new drives for parity. They are running hotter than the other drives and I would occasionally get a warning when the temp hit 46 °C. Last night I spread the drives out in the case to give them more room (the server is currently located in a cool bedroom for testing), and it went offline again overnight. I had to hard power off the system, though I did not try a short press of the power switch; I will do that next time to see if it shuts down gracefully. It has been running a parity check this morning for 2.5 hours and just hit 45 °C on the two new 8TB drives.
I have included the system diagnostics. If there are known troubleshooting steps to try, I'm all ears.
Edit: I also included the syslog from just before midnight. It shows a UPS check every 10 minutes and the "all good" email sent at 12:20 am; the UPS checks continue every 10 minutes and then just stop at 6 am. Then nothing until I hard booted at 7 am.
Attachments: server-diagnostics-20240523-0732.zip, syslog-192.168.1.21.log
Edited May 23, 2024 by dmcknight123: Included syslog
May 23, 2024 Community Expert
54 minutes ago, dmcknight123 said: They are running hotter than the other drives and I would occasionally get a warning when the temp hit 46 °C.
That should not make the server crash. Unfortunately, there's nothing relevant logged, so this could be a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.
May 23, 2024 Author
2 hours ago, JorgeB said: That should not make the server crash. Unfortunately, there's nothing relevant logged, so this could be a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.
I just did that, and after 1.5 hours the server was unresponsive again. I've attached the syslog and diags. Could the flash drive be an issue? I'll start by removing one of the 8TB parity drives, since that was the last hardware change to the system. I'll also look into any motherboard updates that might be available.
Attachments: server-diagnostics-20240523-1208.zip, syslog-192.168.1.21 (3).log
May 23, 2024 Community Expert
6 minutes ago, dmcknight123 said: Could the flash drive be an issue?
It's not a common symptom of flash drive issues, but it's a possibility.
May 24, 2024 Author
I updated the ASUS motherboard BIOS. Interestingly, the version it shipped with is not listed on their website as a stable downloadable version. It was a couple of years and 3 versions behind, so updating should be a good thing. I also removed one of the new 8TB parity drives and will see where this leads. There is another firmware download on the ASUS site for Intel Management Engine, but I gather that is only for Windows. Is that correct?
Edited May 24, 2024 by dmcknight123: clarity
May 24, 2024 Author
1 hour ago, JorgeB said: That should not affect server stability.
You are correct, because it just happened again. When I last booted into Unraid I had a monitor and keyboard connected and left it at the login screen. I connected remotely and monitored it for 30 minutes, then started a parity check, and looked in on it every few minutes while WFH. I just checked and it's unresponsive again. The cursor at the login screen isn't even blinking, and there's no response to the keyboard or mouse.
Before I hard shut down yet again (and it tells me on bootup that an unclean shutdown was detected), is there anything I can try? What else should I look at hardware-wise? I will swap the other 8TB drive in as parity to see if that makes a difference. This hardware was purchased new about a year ago, maybe a little longer.
May 24, 2024 Community Expert
If you have multiple RAM sticks, try with just one; if the problem persists, try with a different one. That will basically rule out bad RAM.
May 24, 2024 Author
Just now, JorgeB said: If you have multiple RAM sticks, try with just one; if the problem persists, try with a different one. That will basically rule out bad RAM.
I've run memtest for hours and it always passes, but at this point I'll try most anything. Will let you know. Any suggestions for RAM? I'm currently using one 8GB stick plugged into the correct slot.
May 24, 2024 Community Expert
Memtest is only definitive if it finds errors, and if you only have one stick, there's no easy way to rule it out.
June 9, 2024 Author
Earlier I marked the "replace RAM" post as the solution because I did replace the 8GB stick with 32GB, but unfortunately it is not the solution. Here's the current timeline:
Replaced RAM; ran perfectly for about a week.
June 7: rebooted. Received status emails at noon and midnight.
June 8: last email received at noon.
I'll be home later today (June 9) and will confirm, but I think it went offline sometime after noon on June 8. This is very frustrating, in that I don't know what else to check other than to replace the ASUS motherboard.
Edited June 9, 2024 by dmcknight123
June 9, 2024 Community Expert (Solution)
If the server is rebooting on its own and it's not the RAM, then the PSU, board, or CPU would be the next suspects.
June 9, 2024 Author
2 hours ago, JorgeB said: If the server is rebooting on its own and it's not the RAM, then the PSU, board, or CPU would be the next suspects.
I apologize, the reboot was intentional. We were gone for a weekend and I wanted a fresh restart of the system. Prior to that it ran perfectly for about a week.
June 10, 2024 Author
3 hours ago, JorgeB said: Post the persistent syslog in case there's something there.
Attached. Nothing that I can tell. It's plugged into my test area, so there's no UPS.
June 7 8:35: I rebooted
June 7 12:20: status email sent
June 8 00:20: status email sent
June 8 20:49: last UPS communication check logged (normally every 10 minutes)
June 9 19:40: hard power cycle
BTW, as a side note, with this most recent hard power cycle the parity check has found 172 errors. First time that's ever happened. Grrrrrrrrrrrr.
Attachment: syslog-192.168.1.21.log
Edited June 10, 2024 by dmcknight123: additional info
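For anyone working through a similar freeze, the "last entry before the log goes silent" in a timeline like the one above can be pulled out of a saved syslog with a short script rather than by eye. This is only a rough sketch, not an Unraid tool: it assumes each line starts with a classic `Mon DD HH:MM:SS` timestamp (which carries no year, so the year is passed in), and the names `parse_ts` and `find_gaps`, the 15-minute threshold, and the sample lines are all made up for illustration.

```python
from datetime import datetime, timedelta


def parse_ts(line, year=2024):
    """Parse a leading 'Mon DD HH:MM:SS' timestamp; return None if the line has none.

    Assumption: classic syslog timestamps without a year, so the year is supplied.
    """
    parts = line.split(None, 3)
    if len(parts) < 3:
        return None
    try:
        stamp = f"{year} {parts[0]} {parts[1]} {parts[2]}"
        return datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, max_gap=timedelta(minutes=15)):
    """Return (last_before, first_after) pairs where the log was silent longer than max_gap."""
    gaps = []
    prev = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip lines without a recognizable timestamp
        if prev is not None and ts - prev > max_gap:
            gaps.append((prev, ts))
        prev = ts
    return gaps


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the attached log.
    sample = [
        "Jun  8 20:39:01 server example[1]: periodic check",
        "Jun  8 20:49:01 server example[1]: periodic check",
        "Jun  9 19:40:12 server kernel: booting",
    ]
    for last_seen, resumed in find_gaps(sample):
        print(f"silent from {last_seen} until {resumed}")
```

With a regular entry every 10 minutes, a threshold slightly above that interval flags the point where logging stopped, which narrows down when the freeze happened even when nothing relevant was logged.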
October 3, 2024 Author
I've marked the solution above, thanks so much to @JorgeB. Though I was skeptical, I replaced the motherboard and CPU (I already had new RAM) and haven't had a freeze since. I also took the opportunity to put it all in a new Fractal case, and they are amazing. Thank you again.
Edited October 3, 2024 by dmcknight123: clarity