
Server keeps crashing, cannot access GUI or log into SSH, requires hard reset.



Hey,

 

First time posting here; I've been having a problem with my server recently. Last night it crashed and I could not access it through the GUI or over SSH. I eventually gave up and hard reset the server (I don't like doing this), and about 13 hours later it crashed again with the same symptoms.
 

I've got the syslog server set up to back up the log contents to a cache-enabled share, so if it happens again I'll have the logs. Beyond that, I also removed the only Docker container I've added to my system since this started happening (I doubt it was the problem, and while I enjoyed it, it's hardly a necessity; if it turns out it wasn't causing the problems, I'd happily reinstall it).
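Once the syslog share has captured the next crash, a quick pass like this should surface the interesting lines (the filename below is just a placeholder; use whatever path your syslog share actually writes to):

grep -iE 'error|warn|btrfs|nvme|oom|call trace' /mnt/user/syslog/syslog.log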


In the meantime, while I wait for the logs to show something, I was wondering if more experienced people could offer me any advice on troubleshooting. Any common problems that should be checked for, tools available to test specific hardware, etc. would be appreciated.

I'll keep doing my own digging on the forums and documentation, but any help is appreciated. I'd like to diagnose the problem as soon as possible so the family won't have to ask me why Plex isn't working.

 

EDIT: Attached Diagnostics

 

mother-ship-diagnostics-20220101-1459.zip

Edited by Outer_Zevin
Link to comment

Okay, I had another problem this morning. The server didn't hang completely; I was able to log into the GUI and managed to shut the server down from there. All Docker containers showed as stopped, with generic icons and no container names. I've attached the logs from today below; it seems there was an error with my cache drive, or possibly my memory? I'm no expert at reading these things yet, but I did find this thread while googling.

 


If anyone has any input, it is greatly appreciated. If I have to reformat my NVMe drive, I'm fine with that; the only things on there I'd be broken up about losing are backed up on the array and elsewhere.
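For anyone following along, these are the kinds of checks I understand can show whether the cache filesystem itself is reporting trouble (this assumes the pool is BTRFS and mounted at /mnt/cache, the Unraid default; adjust to your setup):

# Per-device error counters for the BTRFS pool
btrfs device stats /mnt/cache

# Recent kernel messages mentioning the filesystem or the NVMe device
dmesg | grep -iE 'btrfs|nvme'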

 

 

syslogs01022022.txt

Edited by Outer_Zevin
Link to comment
12 hours ago, JorgeB said:

There's filesystem corruption in the NVMe device, which I assume is your cache; best bet now is to back it up and reformat it.

 

Yeah, after reading around on the forums I determined that was the best bet. I regularly back up the stuff on the cache so I didn't lose anything important. I reformatted yesterday and have been up for 34 hours with no problems so far.

Link to comment

So while I'm waiting for the results, I'm wondering something: I was running into problems every 12-36 hours, so would it be safe to assume Memtest should run for at least that long before it might encounter errors? I know most people in other threads here say to run it for 24 to 48 hours, so I assume making it through a few passes without problems isn't always an indicator that there aren't any.

Link to comment
48 minutes ago, Outer_Zevin said:

so I assume making it through a few passes without problems isn't always an indicator that there aren't any.

Correct. If the problem is serious enough it's easily caught, but even 48 hours without errors is no guarantee; still, it's as close as you can get to ruling that out.

Link to comment
On 1/2/2022 at 6:38 AM, Outer_Zevin said:

Okay, I had another problem this morning. The server didn't hang completely; I was able to log into the GUI and managed to shut the server down from there. All Docker containers showed as stopped, with generic icons and no container names. I've attached the logs from today below; it seems there was an error with my cache drive, or possibly my memory? I'm no expert at reading these things yet, but I did find this thread while googling.

 


If anyone has any input, it is greatly appreciated. If I have to reformat my NVMe drive, I'm fine with that; the only things on there I'd be broken up about losing are backed up on the array and elsewhere.

 

 

syslogs01022022.txt 56.53 kB · 3 downloads

 

I'm the OP in that thread. There was another thread after that, in which I determined the lockups were occurring when the temperature of the NVMe SSD's onboard controller exceeded about 60 degrees Celsius. The SSD was a Crucial P1. I threw it in the garbage and replaced it with a Samsung 970 Evo, and I haven't had a single problem since. I'm also using XFS on the cache drive now, not BTRFS.
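If you want to see where your own drive's thermal limits sit, something like this should pull the warning and critical composite temperature thresholds out of the identify-controller data (nvme-cli reports them in Kelvin, so subtract 273 for Celsius):

nvme id-ctrl /dev/nvme0 | grep -iE 'wctemp|cctemp'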

Link to comment
2 hours ago, cyberspectre said:

 

I'm the OP in that thread. There was another thread after that, in which I determined the lockups were occurring when the temperature of the NVMe SSD's onboard controller exceeded about 60 degrees Celsius. The SSD was a Crucial P1. I threw it in the garbage and replaced it with a Samsung 970 Evo, and I haven't had a single problem since. I'm also using XFS on the cache drive now, not BTRFS.

 

Thank you for reaching out! My cache drive is a Samsung 970 EVO, and when I reformatted the drive earlier in the thread I kept it as XFS just to see how things went. It definitely lasted longer between problems. How did you go about diagnosing that problem? If memtest gets me nowhere, I will definitely be interested in following up on that.

 

My current RAM is two 16 GB sticks of Crucial DDR4 unbuffered ECC, though. Same brand as your junked SSD. No idea if that means anything.

Link to comment
1 hour ago, Outer_Zevin said:

 

Thank you for reaching out! My cache drive is a Samsung 970 EVO, and when I reformatted the drive earlier in the thread I kept it as XFS just to see how things went. It definitely lasted longer between problems. How did you go about diagnosing that problem? If memtest gets me nowhere, I will definitely be interested in following up on that.

 

My current RAM is two 16 GB sticks of Crucial DDR4 unbuffered ECC, though. Same brand as your junked SSD. No idea if that means anything.

 

You've got a 970 Evo formatted XFS and you're still getting that issue? I don't have much to add, then. Controller thermals could possibly still be part of the issue. You can keep an eye on it in your terminal with:

 

watch -n 0.1 nvme smart-log /dev/nvme0

 

The reading for the NAND itself is different from the controller's, which is where the drive gets hottest. Look at temperature sensor 2. Mine hits 65-70 °C pretty regularly but still doesn't cause problems; it usually doesn't even throttle.
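If the server locks up before you get a chance to look, a rough loop like this (run as root; the log path is just an example of somewhere persistent) would record a timestamped sensor 2 reading once a minute so you can review it after the next crash:

# Append a timestamped "Temperature Sensor 2" reading every 60 seconds
while true; do
    echo "$(date '+%F %T') $(nvme smart-log /dev/nvme0 | grep -i 'sensor 2')" >> /mnt/user/appdata/nvme-temps.log
    sleep 60
done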

 

With regard to your memory, I doubt the brand makes a difference. But if memtest picks up a fault, start pulling one DIMM at a time and re-running memtest until no faults appear.

Edited by cyberspectre
Link to comment

Okay, I've been running memtest for over 48 hours now and it's found no errors. (A friend swears that reseating the RAM before running the test may have fixed the problems I was having; I have no idea.)

Anyone have advice for further troubleshooting from here if the problem persists? 

 

 

Edited by Outer_Zevin
Link to comment
