Outer_Zevin Posted January 1, 2022 (edited)
Hey,
First time posting here. I've been having a problem with my server recently. Last night it crashed and I could not access the server from the GUI or SSH. I eventually gave up and hard reset the server (I don't like doing this), and about 13 hours later it crashed again with the same symptoms. I've set up the syslog server to back the log contents up to a cache-enabled share, so if it happens again I'll have the logs. Beyond that, I also removed the only Docker container I've added to my system since this started happening. (I doubt it was the problem and I enjoyed it, but it's hardly a necessity; if it wasn't causing problems I'd happily reinstall it.)
In the meantime, while I wait for the logs to show something, I was wondering if people with more experience could offer me any advice on troubleshooting. Pointers to common problems that should be checked for are appreciated, as are tools for testing specific hardware. I'll keep digging through the forums and documentation myself, but any help is appreciated. I'd like to diagnose the problem as soon as possible so the family won't have to ask me why Plex isn't working.
EDIT: Attached diagnostics: mother-ship-diagnostics-20220101-1459.zip
Hoopster Posted January 1, 2022
33 minutes ago, Outer_Zevin said:
"Last night it crashed and I could not access the server from the GUI or SSH"
It's good that you have a syslog server set up. In the meantime, you should post your diagnostics in a new post (Tools-->Diagnostics in the GUI, or diagnostics from the command line).
Outer_Zevin Posted January 1, 2022 (edited)
24 minutes ago, Hoopster said:
"It's good that you have a syslog server setup. In the meantime, you should post your diagnostics in a new post. (Tools-->Diagnostics in the GUI or diagnostics from the command line)."
Here they are: mother-ship-diagnostics-20220101-1459.zip
Outer_Zevin Posted January 2, 2022 (edited)
Okay, I had another problem this morning. The server didn't hang this time; I was able to log into the GUI and managed to shut down the server from there. All Docker containers showed as stopped, with generic icons and no container names. I attached the logs from today below. It seems there was an error with my cache drive, or possibly my memory? I'm no expert at reading these things yet, but I did find this thread while googling. If anyone has any input, it is greatly appreciated. If I have to reformat my NVMe drive, that's fine; the only things on there I'd be broken up about losing are backed up on the array and elsewhere.
syslogs01022022.txt
JorgeB Posted January 3, 2022
There's filesystem corruption on the NVMe device, which I assume is your cache. Your best bet now is to back it up and reformat it.
Outer_Zevin Posted January 3, 2022
12 hours ago, JorgeB said:
"There's filesystem corruption on the NVMe device, which I assume is your cache. Your best bet now is to back it up and reformat it."
Yeah, after reading around on the forums I determined that was the best bet. I regularly back up the stuff on the cache, so I didn't lose anything important. I reformatted yesterday and the server has been up for 34 hours with no problems so far.
Outer_Zevin Posted January 4, 2022 (edited)
The server hung again and required a hard reset. I attached the logs for today; if anyone sees anything in them that sticks out, let me know. I've done some more reading around the forums for people facing similar issues, and I currently have the server running memtest86 to see if the RAM shows any errors.
syslog-10.10.69.6.txt
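For a first pass over a saved syslog file like the ones attached in this thread, a small keyword filter can narrow down which lines to read closely. This is only a rough sketch: the function name and the default keyword list are illustrative guesses, not anything prescribed by Unraid, and a match is a prompt to look at that line, not a diagnosis.

```python
def find_suspect_lines(log_text, keywords=("error", "corrupt", "nvme", "call trace")):
    """Return (line_number, line) pairs whose text contains any keyword.

    The default keywords are assumptions chosen for illustration; adjust them
    to whatever the kernel actually prints on the affected system.
    """
    lowered = tuple(keyword.lower() for keyword in keywords)
    hits = []
    # Number lines from 1 so they match what a text editor shows.
    for number, line in enumerate(log_text.splitlines(), start=1):
        text = line.lower()
        if any(keyword in text for keyword in lowered):
            hits.append((number, line))
    return hits
```

To use it, read the saved log into a string (for example with `open(path, errors="replace").read()`, where `path` is wherever the syslog share stores the file) and print the returned pairs.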
JorgeB Posted January 4, 2022
5 hours ago, Outer_Zevin said:
"have the server running memtest86 to see if RAM shows any errors."
That's a good idea; this looks more like a hardware issue.
Outer_Zevin Posted January 4, 2022
6 hours ago, JorgeB said:
"That's a good idea, looks more like a hardware issue."
This is my first time using memtest. I just need to fire it up and let it run, right? No need to configure anything?
JorgeB Posted January 4, 2022
Yep, but the server needs to boot in CSM/legacy mode; the memtest that comes with Unraid won't work with UEFI boot.
itimpi Posted January 4, 2022
1 hour ago, JorgeB said:
"Yep, just boot needs to be in CSM/legacy mode, memtest that comes with Unraid won't work with UEFI boot."
Probably worth pointing out that a version that DOES work with UEFI boot can be downloaded from memtest86.com
Outer_Zevin Posted January 4, 2022
While I'm waiting for the results, I'm wondering something: I was running into problems every 12-36 hours, so would it be safe to assume memtest should run for at least that long before it may encounter errors? Most people in other threads here say to run it for 24 to 48 hours, so I assume making it through a few passes without problems isn't always an indicator that there aren't any.
JorgeB Posted January 4, 2022
48 minutes ago, Outer_Zevin said:
"so I assume it making it through a few passes without problem isn't always an indicator that there aren't any problems."
Correct. If the problem is serious enough it's easily caught, but even 48 hours without errors is no guarantee. It's still as close as you can get to ruling that out.
cyberspectre Posted January 5, 2022
On 1/2/2022 at 6:38 AM, Outer_Zevin said:
"Okay, I had another problem this morning, the server didn't hang up, I was able to log into the gui and managed to shut down the server from there. All Docker containers showed as stopped with generic icons and no container names. I attached the logs from today below, it seems there was an error with my cache drive or possibly my memory? I'm no expert at reading these things yet, but I did find this thread while googling. If anyone has any input, it is greatly appreciated. If I have to reformat my NVME drive, I'm fine, the only things on there I'd be broken up about losing is backed up on the array and elsewhere. syslogs01022022.txt"
I'm the OP in that thread. There was another thread after that, in which I determined the lockups were occurring when the temperature of the NVMe SSD's onboard controller exceeded about 60 degrees Celsius. The SSD was a Crucial P1. I threw it in the garbage and replaced it with a Samsung 970 Evo, and I haven't had a single problem since. I'm also using XFS on the cache drive now, not BTRFS.
Outer_Zevin Posted January 5, 2022
2 hours ago, cyberspectre said:
"I'm the OP in that thread. There was another thread after that, in which I determined the lockups were occurring when the temperature of the NVME SSD's onboard controller exceeded about 60 degrees celcius. The SSD was a Crucial P1. I threw it in the garbage and replaced it with a Samsung 970 Evo, haven't had a single problem since. I'm also using XFS on the cache drive now, not BTRFS."
Thank you for reaching out! My cache drive is a Samsung 970 EVO, and when I reformatted the drive earlier in the thread I kept it as XFS just to see how things went. It definitely lasted longer between problems. How did you go about diagnosing that problem? If memtest gets me nowhere, I will definitely be interested in following up on that. My current RAM is 2x 16 GB sticks of Crucial DDR4 unbuffered ECC RAM, though. Same brand as your junked SSD. No idea if that means anything.
cyberspectre Posted January 5, 2022 (edited)
1 hour ago, Outer_Zevin said:
"Thank you for reaching out! My cache drive is a Samsung 970 EVO and when I reformatted the drive earlier in the thread I kept it as XFS just to see what how things went. It definitely lasted longer between problems. How did you go about diagnosing that problem? If memtest gets me no where I will definitely be interested in following up with that. My current RAM is 2x sticks of 16gb Crucial DDR4 unbuffered ECC RAM though. Same brand as your junked SSD. No idea if that means anything."
You've got a 970 Evo formatted XFS and you're still getting that issue? I don't have much to add, then. Controller thermals could still be part of the issue, possibly. You can keep an eye on it in your terminal with:
    watch -n 0.1 nvme smart-log /dev/nvme0
The reading for the NAND itself is different from the controller's, and the controller is where the drive gets hottest. Look at temperature sensor 2. Mine goes to 65-70 C pretty regularly, but still doesn't cause problems. It usually doesn't even throttle.
With regard to your memory, I doubt the brand makes a difference. But if memtest picks up a fault, start pulling one DIMM at a time and re-running memtest until no faults appear.
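If watching the terminal by hand isn't practical, the sensor readings from `nvme smart-log` can also be pulled out with a short script. The sketch below is a rough illustration, not a tested monitoring tool: the function names are made up for this example, it assumes the output contains lines like `Temperature Sensor 2 : 45 C` (the exact layout varies between nvme-cli versions), and the 60 C default simply echoes the figure mentioned earlier in this thread.

```python
import re
import subprocess

# Assumed line shape: "Temperature Sensor 2 : 45 C" or "... : 45°C (318 Kelvin)".
SENSOR_RE = re.compile(
    r"^\s*Temperature Sensor (\d+)\s*:\s*(-?\d+)\s*°?\s*C", re.IGNORECASE
)


def parse_sensor_temps(smart_log_text):
    """Return {sensor_number: degrees_celsius} found in smart-log text."""
    temps = {}
    for line in smart_log_text.splitlines():
        match = SENSOR_RE.match(line)
        if match:
            temps[int(match.group(1))] = int(match.group(2))
    return temps


def over_threshold(temps, limit_c=60):
    """Return only the sensors reading at or above limit_c."""
    return {sensor: temp for sensor, temp in temps.items() if temp >= limit_c}


def read_smart_log(device="/dev/nvme0"):
    """Run `nvme smart-log` on the device and return its text output.

    Needs nvme-cli installed and usually root privileges.
    """
    result = subprocess.run(
        ["nvme", "smart-log", device],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

A typical use would be `over_threshold(parse_sensor_temps(read_smart_log()))`, run periodically and appended to a file on the array, so the last readings before a lockup survive the reset.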
Outer_Zevin Posted January 6, 2022 (edited)
Okay, I've been running memtest for over 48 hours now and it has found no errors. (A friend swears that reseating the RAM before running the test may have solved the problems I was having; I have no idea.) Does anyone have advice for further troubleshooting from here if the problem persists?