ssh Posted June 8, 2021 (edited June 10, 2021 by ssh: spelling, grammar and wording)

Hi everyone,

Today I noticed that my server had become unresponsive. I was able to log in and see that disk activity was 0, CPU usage was 1% and RAM usage was 11%. Temperatures were all in check (disks under 40, CPU under 45). The syslog (which I could still open) showed a lot of red error entries, but I was no longer able to download the diagnostics because the system had become too unresponsive.

I did manage to connect via SSH. The top command showed nothing to be concerned about, but it crashed after 5 seconds with a "Segmentation fault" error (see screenshot attached).

That's when I decided to reboot the server with the Unraid "powerdown -r" command, which again resulted in a Segmentation fault error. Trying again did show the "going down" message, but after waiting another 10 minutes the server still hadn't powered down. Even the terminal on a directly connected keyboard and display was unresponsive: I could still type, but the commands didn't actually do anything. I eventually restarted the server by holding down the power button and then starting it again.

The boot process proceeded as it normally would, except that it started a parity check immediately, which I think is expected after an unclean shutdown.

Does anyone have any suggestions as to what could have caused this, or any recommendations for next steps? I was thinking of maybe doing a memory test, but this memory kit has been running fine 24/7 since I bought it (4 months ago) and is running at stock settings (non-XMP).

Attachment: syslog-manual.rtf
ssh (Author) Posted June 10, 2021 (edited June 10, 2021 by ssh)

Parity check finished with 0 errors found/corrected. No weird behaviour since the reboot. Still no clue what happened, though.
ssh (Author) Posted June 10, 2021 (edited June 10, 2021 by ssh)

Today I had this problem again. I have attached the syslog again as a manual copy, because it was not possible to download the diagnostics while the server was in this state. I have also recorded a video of what I saw on screen; not sure if it's helpful.

Attachment: syslog.txt
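For anyone going through a manually copied syslog like the one attached above, a small filter can help surface the entries worth a closer look. The following is only a minimal sketch: the keyword list is an assumption (generic Linux kernel fault markers, not taken from the attached log), and the function name `find_suspect_lines` is hypothetical.

```python
import re
from typing import Iterable, List

# Assumed keywords: generic kernel/userspace fault markers.
# They are NOT taken from the syslog attached to this thread.
SUSPECT_PATTERNS = [
    r"segfault",
    r"general protection fault",
    r"kernel BUG",
    r"Oops",
    r"Machine Check",
    r"Call Trace",
]

# One case-insensitive regex that matches any of the patterns above.
SUSPECT_RE = re.compile("|".join(SUSPECT_PATTERNS), re.IGNORECASE)


def find_suspect_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines (without trailing newline) that match any suspect pattern."""
    return [line.rstrip("\n") for line in lines if SUSPECT_RE.search(line)]
```

With a saved copy of the log, it could be used as `find_suspect_lines(open("syslog.txt", errors="replace"))`. A match only points at a line to read more closely; it does not identify the cause on its own.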
JorgeB Posted June 10, 2021

I can't see the reason based on the syslog. One thing you can try is to boot the server in safe mode with all Docker containers and VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.
ssh (Author) Posted June 10, 2021 (edited June 11, 2021 by ssh)

I'm not sure if this error could be related to RAM, but I decided to do a RAM test anyway. At around 36% of the first pass, errors started appearing. So I am now starting to suspect my memory modules, motherboard or CPU (or can this kind of error only come from the memory itself?).

I am using 2 sticks of 16 GB DDR4-2400 memory (CMK32GX4M2A2400C16) in dual-channel mode with XMP disabled, using the "Auto" frequency setting in the BIOS.

What I am doing now is testing each stick individually to see if the errors remain (I will post the results here).
ssh (Author) Posted June 12, 2021 (edited June 12, 2021 by ssh)

Turns out one of my modules is bad. I tested them separately. Module A seems to be fine: no errors after 5 hours (3 passes). Module B started spitting out errors during the 2nd pass (after about 1 hour). This was repeatable on another system with a different motherboard and CPU, so it's definitely the module itself.

I've replaced both sticks with a single 8 GB stick that I had lying around and completed 8 passes on it without any errors. I've started up the server again and run another parity check (luckily still 0 sync errors, so it seems the memory was faulty, but not faulty enough to cause any data loss on the array yet).

Hopefully replacing the bad RAM has fixed the issue. I've requested an RMA for the faulty memory kit.