June 8, 2021

Hi everyone,

Today I noticed that my server had become unresponsive. I was able to log in and see that disk activity was 0, CPU usage was 1% and RAM usage was 11%. Temperatures were all in check (disks under 40, CPU under 45). The syslog (which I could still open) showed a lot of red error entries, but I was no longer able to download the diagnostics because the system had become too unresponsive.

I did manage to connect via SSH. The top command showed nothing to be concerned about, but it crashed after 5 seconds with a "Segmentation fault" error (see screenshot attached). That's when I decided to reboot the server entirely with the Unraid "powerdown -r" command, which again resulted in a Segmentation fault error. Trying again did show the "going down" message, but after waiting another 10 minutes the server still hadn't powered down. Even the local terminal, with keyboard and display connected directly to the server, was unresponsive: I could still type, but the commands didn't actually do anything.

I eventually restarted the server by holding down the power button and then starting it again. The boot process proceeded as it normally would, except that a parity check started immediately, which I think is expected after an unclean shutdown.

Does anyone have any suggestions as to what could have caused this, or any recommendations for next steps? I was thinking of maybe doing a memory test, but this memory kit has been running fine 24/7 since I bought it (4 months ago) and is running at stock settings (non-XMP).

syslog-manual.rtf

Edited June 10, 2021 by ssh: spelling, grammar and wording
June 10, 2021 (Author)

Parity check finished with 0 errors found/corrected. No weird behaviour since the reboot. Still no clue what happened though.

Edited June 10, 2021 by ssh
June 10, 2021 (Author)

Today I had this problem again. I've attached the syslog again as a manual copy, because it was not possible to download diagnostics while the server was in this state. I have also recorded a video of what I saw on screen; not sure if it's helpful:

syslog.txt

Edited June 10, 2021 by ssh
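When diagnostics can't be downloaded and only a manually copied syslog is available, a quick scan for kernel messages that often accompany memory or other hardware faults can help narrow things down. The following is a minimal sketch, not an Unraid tool: the function name `scan_syslog` and the pattern list are assumptions chosen for illustration, and the patterns are neither exhaustive nor proof of a hardware problem on their own.

```python
import re
from collections import Counter

# Illustrative patterns only (an assumption, not an official list):
# kernel messages that are often worth a second look after a hang.
PATTERNS = {
    "segfault": re.compile(r"segfault|segmentation fault", re.I),
    "general_protection": re.compile(r"general protection fault", re.I),
    "machine_check": re.compile(r"machine check|mce:", re.I),
    "kernel_bug": re.compile(r"kernel BUG|Oops|Call Trace", re.I),
    "bad_page": re.compile(r"bad page (state|map)", re.I),
}


def scan_syslog(lines):
    """Return (counts per label, list of (line number, label, text)).

    `lines` is any iterable of strings, e.g. an open file object.
    Each line is counted at most once, under the first matching label.
    """
    counts = Counter()
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
                hits.append((lineno, label, line.rstrip()))
                break  # first matching label wins for this line
    return counts, hits
```

To try it on a saved copy, open the file with `open("syslog.txt", errors="replace")` and pass the file object to `scan_syslog`. An empty result does not rule out hardware trouble; a memory test is still the more direct check.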
June 10, 2021

Can't see the reason based on the syslog. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.
June 10, 2021 (Author)

Not sure if this error could be related to RAM, but I decided to do a RAM test anyway. At around 36% of the first pass, errors started appearing. So I am now starting to suspect my memory modules, motherboard or CPU (or can this kind of error only come from the memory itself?).

I am using 2 sticks of 16 GB DDR4-2400 memory (CMK32GX4M2A2400C16) in dual-channel mode with XMP disabled, using the "Auto" frequency setting in the BIOS.

What I am doing now is testing each stick individually to see if the errors remain (will post results here).

Edited June 11, 2021 by ssh
June 12, 2021 (Author)

Turns out one of my modules is bad. I tested them separately: module A seems to be fine, with no errors after 5 hours (3 passes), while module B started spitting out errors during the 2nd pass (after about 1 hour). This was repeatable on another system with a different motherboard and CPU, so it's definitely the module itself.

I've replaced both sticks with a single 8 GB stick that I had lying around and completed 8 passes on that without any errors. I've started up the server again and run another parity check. Luckily there were still 0 sync errors, so it seems the memory was faulty, but not faulty enough to have caused any data loss on the array yet.

Hopefully replacing the bad RAM fixed the issue. I've requested an RMA on the faulty memory kit.

Edited June 12, 2021 by ssh