rocky_mtn Posted December 31, 2023 (edited)

This all started at the end of November, about a month ago, when I went from Crucial SSDs for my cache drives (I kept getting 197 errors on them) to Samsung 980 NVMe drives. I installed two PCIe adapters to hold the two 2TB NVMe drives. Since then the server has been crashing consistently, anywhere from 2 to 4 days apart, and I can't find anything in the logs that points to the cause at the time of a crash.

This morning (12/31/23) it crashed at roughly 10:11 am, and the only thing I saw in the syslog was from well before that, at 9:02 am. With everything I've messed with over the past month, I'm sure there are things misconfigured, since more than one thing is wrong with the server/disks at the moment.

After rebooting from this morning's crash, the docker containers are ALL gone. The appdata and folders remain on the cache, and I do have them backed up, but it wouldn't let me restore. The docker image now seems to be corrupted, but I can't access or delete it, either from the docker settings on the server or in the directory itself. I can navigate to /mnt/usr/system/docker/, but then I can't open the folder containing the docker image; it says it's because of file permissions.

Also, one of my 12TB parity drives started giving me errors about a week ago, so I replaced the parity drives and upgraded them to 14TB drives. The two Crucial drives are still in the system, but they aren't mounted. I also changed the cache file system from btrfs to ZFS when I upgraded to the NVMe drives.

Would someone mind poking around my diagnostics to see if you can point me in the right direction? I'd surely appreciate it! I'm also currently running memtest, but the server has 256GB, so it might be tomorrow before I see those results.

trailheadmedia-diagnostics-20231231-1212.zip
rocky_mtn Posted January 1 (edited)

Here are the results of the first pass of memtest after about 6 hours. I'm going to let it run for about 24 hours total.
JorgeB Posted January 1

Unfortunately there's nothing relevant logged, which usually points to a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.
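
To double-check how long the log was silent before a crash, something like the following might help. It is only a rough sketch: it assumes the syslog uses the traditional "Dec 31 09:02:14 host ..." timestamp prefix (which carries no year, so the year has to be supplied), the function name `last_entries_before` is made up for illustration, and the sample lines are placeholders, not real log output from this server.

```python
from datetime import datetime


def last_entries_before(lines, crash_time, year, count=5):
    """Return up to `count` (timestamp, line) pairs logged at or before crash_time."""
    stamped = []
    for line in lines:
        # Assumed format: the first 15 characters look like "Dec 31 09:02:14".
        try:
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a parseable timestamp
        if ts <= crash_time:
            stamped.append((ts, line.rstrip("\n")))
    return stamped[-count:]


# Placeholder lines for illustration only; replace with the real syslog contents.
sample = [
    "Dec 31 08:15:00 host demo: placeholder entry A",
    "Dec 31 09:02:00 host demo: placeholder entry B",
    "not a syslog line",
    "Dec 31 10:20:00 host demo: placeholder entry after reboot",
]
crash = datetime(2023, 12, 31, 10, 11)  # approximate crash time from the first post
entries = last_entries_before(sample, crash, year=2023)
for ts, text in entries:
    print(text)
if entries:
    print("Silent for", crash - entries[-1][0], "before the crash")
```

A long silent gap like this fits the "nothing relevant logged" picture. On a real server the lines would come from the saved syslog file rather than a hard-coded list.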