Heffe Posted November 27, 2023
Hey all, I recently moved my server into a Dell Precision 5820. The Unraid version is 6.11.5. The server works very well except that it fails after about 24h of being on. Sometimes this looks like an unresponsive system; other times I can access the GUI but there are no shares listed. If I remove the USB and put it in a new port, the server boots fine and then fails about 24h later. If I simply try a reboot, the system will hang and not boot. I've gone through the BIOS and disabled all C-states, but I have no idea what is causing this issue or where to start. Any help would be appreciated. Thanks!
jarvis-diagnostics-20231126-2140.zip
JorgeB Posted November 27, 2023
Enable the syslog server and post that after a crash.
Heffe Posted November 27, 2023
31 minutes ago, JorgeB said: "Enable the syslog server and post that after a crash."
Will do. Thank you. Is there a preferred location for the logs? Is saving them to the array the best option?
JorgeB Posted November 27, 2023
Since the flash drive could be a problem, save it to an array share.
Heffe Posted November 29, 2023
Well, it failed again, but I misconfigured my syslog server so I'll have to wait another 24 to 48h until it crashes again.
Heffe Posted November 30, 2023
On 11/27/2023 at 7:30 AM, JorgeB said: "Enable the syslog server and post that after a crash."
Hey @JorgeB, here is the syslog. The server stopped working on Nov 30th @ 03:00 after completing an appdata backup operation. From reading the log, maybe a bad memory stick? Thanks!
syslog-192.168.0.10.log
JorgeB Posted November 30, 2023
Could be, but I would expect that PC to use ECC RAM. Still, there could be some related issue.
Heffe Posted November 30, 2023
14 minutes ago, JorgeB said: "Could be, but I would expect that PC to use ECC RAM. Still, there could be some related issue."
Yes, the memory is ECC memory. I just ran the MemTest86 utility with no errors. Do you have any other suggestions?
JorgeB Posted November 30, 2023
There are various crashes, but it's not really clear to me what's causing them. One thing you can try is to boot the server in safe mode with all docker/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one.
Heffe Posted November 30, 2023
Good idea. I'll give that a try. Thank you for your suggestions.
Heffe Posted December 3, 2023
So this weekend I updated to the latest Unraid version, updated all plugins and stopped all services on the server. I then let the server run until faults started appearing. Here's the log file. It shows that CPU 0 and CPU 4 are causing some errors. Does that mean I have a bad CPU? Can it be fixed?
jarvis-syslog-20231203-1208.zip
Heffe Posted December 3, 2023
Here's the non-zipped log file.
syslog-192.168.0.10 (2).log
JorgeB Posted December 3, 2023
Don't know if the CPU is the problem, but it does show multiple apps segfaulting, so most likely there is some hardware issue. I would try with just one stick of RAM; if it's the same, try a different one. That will basically rule out RAM issues.
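
For anyone working through a similar log, one rough way to check whether the segfaults cluster on a single process or a single CPU core is to tally them. The sketch below is not from the thread's participants. The regular expression is an assumption modeled on the two kernel segfault lines that Heffe quotes later in this thread (used here as sample input), and the function name `tally_segfaults` is hypothetical.

```python
import re
from collections import Counter

# Pattern modeled on the kernel segfault lines quoted in this thread (an
# assumption, not an official format spec). The "in <object>[...]" and
# "likely on CPU N" parts are optional because not every line carries them.
SEGFAULT_RE = re.compile(
    r"kernel: (?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>[^\[\s]+)\[[^\]]*\])?"
    r"(?: likely on CPU (?P<cpu>\d+))?"
)


def tally_segfaults(lines):
    """Count segfault lines per process name and per reported CPU."""
    by_proc, by_cpu = Counter(), Counter()
    for line in lines:
        m = SEGFAULT_RE.search(line)
        if not m:
            continue  # not a segfault line
        by_proc[m.group("proc")] += 1
        if m.group("cpu") is not None:
            by_cpu[int(m.group("cpu"))] += 1
    return by_proc, by_cpu


# Sample input: the two segfault lines posted later in this thread.
SAMPLE = [
    "Dec 7 09:01:45 Jarvis kernel: smartctl_type[17085]: segfault at 0 "
    "ip 0000000000000000 sp 00007ffcc6c52d68 error 14 in php[400000+3b000] "
    "likely on CPU 5",
    "kernel: php[8525]: segfault at ffffffffffffffff ip 000055871823b0b2 "
    "sp 00007ffd6737dd40 error 7 in php82[558718200000+2b4000] "
    "likely on CPU 5 (core 1, socket 0)",
]

by_proc, by_cpu = tally_segfaults(SAMPLE)
print("per process:", dict(by_proc))
print("per CPU:", dict(by_cpu))
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `tally_segfaults(open("syslog-192.168.0.10.log"))`. Many unrelated processes faulting across many cores would fit JorgeB's hardware theory, while faults confined to one program point more toward software. Either way a tally is only a hint, not a diagnosis.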
Heffe Posted December 3, 2023
I'll give that a try. Thank you.
Heffe Posted December 4, 2023
I took one RAM stick out and ran the server all night. It still errored. This morning I swapped the RAM stick with another and it errored again. I suppose it could be the RAM, but I doubt both sticks would be bad. I do have an NVMe PCIe card that my cache drive is on. I'm going to try removing that and see if it makes a difference. I've attached the latest log.
syslog-192.168.0.10.log
JorgeB Posted December 4, 2023
Unlikely the RAM is the problem; next suspects for me would be the board or CPU.
Heffe Posted December 6, 2023
The errors increase in frequency if I have the docker containers running. I have no way of troubleshooting whether the CPU or the motherboard is the issue. The server was a cold spare from a working environment, so I'm doubtful that it's a hardware issue. I'll continue to update the software and hope that the updates fix it. Thanks for your help @JorgeB.
Heffe Posted December 7, 2023 (edited)
Just as an update, I'll keep this thread going in case someone else has similar issues. In order to troubleshoot the mobo and/or CPU issues, I ran the Dell diagnostic tool built into the BIOS on the motherboard. The tool ran all night (~12 hours) and it passed all tests (including RAM tests). After the tests were done, I booted the server in safe mode (no GUI) and let it run. It's been on for about 5 hours and it has one "segfault". Here's the error:
Dec 7 09:01:45 Jarvis kernel: smartctl_type[17085]: segfault at 0 ip 0000000000000000 sp 00007ffcc6c52d68 error 14 in php[400000+3b000] likely on CPU 5
I'm really frustrated with this issue and I'm lost as to a solution. I'm 99% sure there's no hardware issue, but yet here we are.
jarvis-syslog-20231207-1446.zip
Edited December 7, 2023 by Heffe: Formatting error.
JorgeB Posted December 7, 2023
If it's just smartctl segfaulting, it could be one of the devices that is making it crash, but you'd need to basically remove them one at a time to test.
Heffe Posted December 7, 2023
2 minutes ago, JorgeB said: "If it's just smartctl segfaulting, it could be one of the devices that is making it crash, but you'd need to basically remove them one at a time to test."
Would running a SMART disk scan provide any useful info?
Heffe Posted December 10, 2023
Would it be possible that this issue is being caused by a BIOS update? I've looked across the forum and it seems others are using the same server but on another BIOS version.
JorgeB Posted December 11, 2023
I would guess unlikely, but not impossible.
Heffe Posted December 11, 2023
Well, here are the changes that I've tried since:
1. I changed the boot mode to UEFI instead of Legacy.
2. I downgraded the BIOS to v2.0.2, since there was a post on here suggesting that this version was working for them.
3. I found some corrupted docker files and repaired them.
The server ran for about 10 hours and then crashed again. I'm going to run MemTest86 for an extended period of time and see if that finds anything. Lol, I'm so frustrated.
syslog-192.168.0.10_111223.log
Heffe Posted December 18, 2023 (edited)
So I gave it a rest and returned to the problem after a few days. I had a couple of "loop2" BTRFS errors that, from what I've read on the forum, are related to a corrupt docker image. I've since deleted my docker image and recreated it. Secondly, after recreating my docker image, I pinned each docker container to a specific CPU so I could find out whether there was an offending container. I think I found one. If I restart my nextcloud container while my mariadb container is running, I get:
kernel: php[8525]: segfault at ffffffffffffffff ip 000055871823b0b2 sp 00007ffd6737dd40 error 7 in php82[558718200000+2b4000] likely on CPU 5 (core 1, socket 0)
I get this error reliably when starting my nextcloud container. Is there anything I can do to fix this? Is my nextcloud appdata corrupt? Thanks!
syslog-192.168.0.10_171223.log
Edited December 18, 2023 by Heffe: Uploaded log file.
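
As a reading aid for the two kernel lines quoted in this thread: the `error` field of a segfault line is the x86 page-fault error code, which the kernel prints in hexadecimal, so "error 14" is 0x14, not decimal 14. The sketch below decodes the low five bits of that code. The table and the helper name `decode_pf_error` are illustrative assumptions written for this note, not part of any Unraid tool.

```python
# Low bits of the x86 page-fault error code.
# Each entry: (bit mask, meaning when set, meaning when clear or None).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write", "read or instruction fetch"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error(hex_code):
    """Decode the 'error N' field of a kernel segfault line (N is hex)."""
    code = int(hex_code, 16)
    return [
        set_msg if code & bit else clear_msg
        for bit, set_msg, clear_msg in PF_BITS
        if (code & bit) or clear_msg is not None
    ]


# The two error codes that appear in this thread.
for err in ("14", "7"):
    print(f"error {err}:", ", ".join(decode_pf_error(err)))
```

Read this way, "error 14" with both the fault address and `ip` at 0 looks like a user-mode attempt to execute code at address 0, and "error 7" is a user-mode write that hit a protection violation. Neither decoding says whether the root cause is hardware or software. It only describes what the faulting process was doing at the moment it crashed.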