
Server keeps crashing, cannot access GUI or log into SSH, requires hard reset.



Hey,

 

First time posting here; I've been having a problem with my server recently. Last night it crashed and I could not access it through the GUI or over SSH. I eventually gave up and hard reset the server (I don't like doing this), and about 13 hours later it crashed again with the same symptoms.
 

I've got the syslog server set up to back up the log contents to a cache-enabled share, so if it happens again I'll have the logs. Beyond that, I also removed the only Docker container I've added to my system since this started happening (I doubt it was the problem, and while I enjoyed it, it's hardly a necessity; if it turns out it wasn't causing the problems, I'd happily reinstall it).
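Once the syslog share has captured the next crash, a quick pass like this should surface the interesting lines (the filename below is just a placeholder; use whatever path your syslog share actually writes to):

grep -iE 'error|warn|btrfs|nvme|oom|call trace' /mnt/user/syslog/syslog.log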


In the meantime, while I wait for the logs to show something, I was wondering if more experienced people could offer me any advice on troubleshooting. Any common problems that should be checked for, tools available to test specific hardware, etc. would be appreciated.

I'll keep doing my own digging on the forums and documentation, but any help is appreciated. I'd like to diagnose the problem as soon as possible so the family won't have to ask me why Plex isn't working.

 

EDIT: Attached Diagnostics

 

mother-ship-diagnostics-20220101-1459.zip

Edited by Outer_Zevin
Link to comment

Okay, I had another problem this morning. The server didn't hang completely; I was able to log into the GUI and managed to shut the server down from there. All Docker containers showed as stopped, with generic icons and no container names. I've attached the logs from today below; it seems there was an error with my cache drive, or possibly my memory? I'm no expert at reading these things yet, but I did find this thread while googling.

 


If anyone has any input, it is greatly appreciated. If I have to reformat my NVMe drive, I'm fine with that; the only things on there I'd be broken up about losing are backed up on the array and elsewhere.
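For anyone following along, these are the kinds of checks I understand can show whether the cache filesystem itself is reporting trouble (this assumes the pool is BTRFS and mounted at /mnt/cache, the Unraid default; adjust to your setup):

# Per-device error counters for the BTRFS pool
btrfs device stats /mnt/cache

# Recent kernel messages mentioning the filesystem or the NVMe device
dmesg | grep -iE 'btrfs|nvme'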

 

 

syslogs01022022.txt

Edited by Outer_Zevin
Link to comment
12 hours ago, JorgeB said:

There's filesystem corruption in the NVMe device, which I assume is your cache; best bet now is to back it up and reformat it.

 

Yeah, after reading around on the forums I determined that was the best bet. I regularly back up the stuff on the cache so I didn't lose anything important. I reformatted yesterday and have been up for 34 hours with no problems so far.

Link to comment

So while I'm waiting for the results, I'm wondering something: I was running into problems every 12-36 hours, so would it be safe to assume Memtest should run for at least that long before it might encounter errors? I know most people in other threads here say to run it for 24 to 48 hours, so I assume making it through a few passes without problems isn't always an indicator that there aren't any.

Link to comment
48 minutes ago, Outer_Zevin said:

so I assume making it through a few passes without problems isn't always an indicator that there aren't any.

Correct. If the problem is serious enough it's easily caught, but even 48 hours without errors is no guarantee; still, it's as close as you can get to ruling that out.

Link to comment
On 1/2/2022 at 6:38 AM, Outer_Zevin said:

Okay, I had another problem this morning. The server didn't hang completely; I was able to log into the GUI and managed to shut the server down from there. All Docker containers showed as stopped, with generic icons and no container names. I've attached the logs from today below; it seems there was an error with my cache drive, or possibly my memory? I'm no expert at reading these things yet, but I did find this thread while googling.

 


If anyone has any input, it is greatly appreciated. If I have to reformat my NVMe drive, I'm fine with that; the only things on there I'd be broken up about losing are backed up on the array and elsewhere.

 

 

syslogs01022022.txt 56.53 kB · 3 downloads

 

I'm the OP in that thread. There was another thread after that, in which I determined the lockups were occurring when the temperature of the NVMe SSD's onboard controller exceeded about 60 degrees Celsius. The SSD was a Crucial P1. I threw it in the garbage and replaced it with a Samsung 970 Evo, and I haven't had a single problem since. I'm also using XFS on the cache drive now, not BTRFS.
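If you want to see where your own drive's thermal limits sit, something like this should pull the warning and critical composite temperature thresholds out of the identify-controller data (nvme-cli reports them in Kelvin, so subtract 273 for Celsius):

nvme id-ctrl /dev/nvme0 | grep -iE 'wctemp|cctemp'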

Link to comment
2 hours ago, cyberspectre said:

 

I'm the OP in that thread. There was another thread after that, in which I determined the lockups were occurring when the temperature of the NVMe SSD's onboard controller exceeded about 60 degrees Celsius. The SSD was a Crucial P1. I threw it in the garbage and replaced it with a Samsung 970 Evo, and I haven't had a single problem since. I'm also using XFS on the cache drive now, not BTRFS.

 

Thank you for reaching out! My cache drive is a Samsung 970 EVO, and when I reformatted the drive earlier in the thread I kept it as XFS just to see how things went. It definitely lasted longer between problems. How did you go about diagnosing that problem? If memtest gets me nowhere, I will definitely be interested in following up on that.

 

My current RAM is two 16 GB sticks of Crucial DDR4 unbuffered ECC, though. Same brand as your junked SSD. No idea if that means anything.

Link to comment
1 hour ago, Outer_Zevin said:

 

Thank you for reaching out! My cache drive is a Samsung 970 EVO, and when I reformatted the drive earlier in the thread I kept it as XFS just to see how things went. It definitely lasted longer between problems. How did you go about diagnosing that problem? If memtest gets me nowhere, I will definitely be interested in following up on that.

 

My current RAM is two 16 GB sticks of Crucial DDR4 unbuffered ECC, though. Same brand as your junked SSD. No idea if that means anything.

 

You've got a 970 Evo formatted XFS and you're still getting that issue? I don't have much to add, then. Controller thermals could possibly still be part of the issue. You can keep an eye on it in your terminal with:

 

watch -n 0.1 nvme smart-log /dev/nvme0

 

The reading for the NAND itself is different from the controller's, which is where the drive gets hottest. Look at temperature sensor 2. Mine hits 65-70 °C pretty regularly but still doesn't cause problems; it usually doesn't even throttle.
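If the server locks up before you get a chance to look, a rough loop like this (run as root; the log path is just an example of somewhere persistent) would record a timestamped sensor 2 reading once a minute so you can review it after the next crash:

# Append a timestamped "Temperature Sensor 2" reading every 60 seconds
while true; do
    echo "$(date '+%F %T') $(nvme smart-log /dev/nvme0 | grep -i 'sensor 2')" >> /mnt/user/appdata/nvme-temps.log
    sleep 60
done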

 

With regard to your memory, I doubt the brand makes a difference. But if memtest picks up a fault, start pulling one DIMM at a time and re-running memtest until no faults appear.

Edited by cyberspectre
Link to comment

Okay, I've been running memtest for over 48 hours now and it's found no errors. (A friend swears that reseating the RAM before running the test may have fixed the problems I was having; I have no idea.)

Anyone have advice for further troubleshooting from here if the problem persists? 

 

 

Edited by Outer_Zevin
Link to comment
