bonzi Posted October 18, 2023
Hi, I am experiencing these every couple of days. Previously the system was rock solid. Nothing in the diagnostics (attached) stands out to me; the problem is that I need to capture this as it happens. Does anyone have any ideas? I will run a memory test, since memory seems like the most likely culprit, and report back.
tower-diagnostics-20231018-1840.zip
JorgeB Posted October 19, 2023
Enable the syslog server and post that after a crash.
bonzi Posted October 19, 2023 Author
Done, saw you recommend that in another thread. Good idea, that should get us more information. It passed an overnight run of memtest so all good there. Thanks!
only-university6482 Posted October 19, 2023
8 hours ago, bonzi said: Done, saw you recommend that in another thread. Good idea, that should get us more information. It passed an overnight run of memtest so all good there. Thanks!
Did you get to the bottom of the kernel panic error? I am having the same on mine, so I turned on syslog yesterday. I see yours passed the memtest, so is the issue somewhere else? Thanks
bonzi Posted October 19, 2023 Author
3 hours ago, only-university6482 said: Did you get to the bottom of the kernel panic error? Having the same on mine so turned on syslog yesterday. I see it passed the memtest so an issue somewhere else? Thanks
I am waiting for it to crash again; I think I will need that data to understand why this is happening. The memtest was fine: it ran overnight with no errors.
only-university6482 Posted October 20, 2023
15 hours ago, bonzi said: I am waiting for it to crash again, I think I will need that data to understand why this is happening. The memtest was fine, ran o/n no errors.
Mine just crashed again 10 minutes ago! I'm at a loss; I will do a memory test. I had the syslog server enabled and just looked at the file, but there is nothing in there. It crashed just before the 11:05 timestamp:
Oct 20 10:28:56 tower500 proxmox-backup-proxy[57]: starting rrd data sync
Oct 20 10:28:57 tower500 proxmox-backup-proxy[57]: rrd journal successfully committed (18 files in 0.050 seconds)
Oct 20 11:05:31 tower500 proxmox-backup-api[9]: service is ready
Oct 20 11:05:31 tower500 proxmox-backup-proxy[10]: service is ready
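When a crash leaves nothing in the log, the crash window can still be narrowed down by looking for the largest gap between consecutive timestamps. The following is only a minimal Python sketch of that idea, assuming classic syslog lines that begin with a `Mon DD HH:MM:SS` stamp as in the excerpt above. The function name `largest_gap` and the `year` parameter are hypothetical (syslog stamps carry no year, so one has to be assumed).

```python
from datetime import datetime


def largest_gap(lines, year=2023):
    """Return (gap_seconds, line_before, line_after) for the biggest
    timestamp gap, or None if fewer than two lines could be parsed.

    Assumes each line starts with a 15-character 'Mon DD HH:MM:SS' stamp.
    The year is an assumption because syslog stamps do not include one.
    """
    stamped = []
    for line in lines:
        try:
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a parsable timestamp
        stamped.append((ts, line))

    best = None
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, l0, l1)
    return best
```

Run against a saved syslog with something like `largest_gap(open("syslog.log").read().splitlines())` (the file name is a placeholder); the two returned lines bracket the period in which nothing was logged, which is where the crash and reboot most likely happened.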
bonzi Posted October 26, 2023 Author
@JorgeB My server crashed again today. Managed to capture the syslog. Not sure what it all means. I am attaching a snip of it. I have done some googling but am not sure exactly what to make of it. Let me know what you think. Thanks!
syslog-crash.rtf
JorgeB Posted October 26, 2023
Those look more hardware related to me, make sure this has been taken care of, also run memtest.
bonzi Posted October 26, 2023 Author
14 hours ago, JorgeB said: Those look more hardware related to me, make sure this has been taken care of, also run memtest.
OK, I disabled C-states in the BIOS. I have done an overnight memtest, which passed without any issues at all. Let's see if disabling C-states fixes the problem.
bonzi Posted October 29, 2023 Author
On 10/26/2023 at 6:03 PM, bonzi said: Ok, I disabled c-states in the bios. I have done a overnight memtest, that passed without any issues at all. Let's see if the c-states fixes the problem.
I had another crash last night, with C-states disabled. Frustratingly, there seems to be little of interest in the syslog. The only activity I see is this, which I believe is my Coral TPU:
Oct 28 23:11:22 Tower kernel: usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
Oct 28 23:11:22 Tower kernel: usb 2-2: LPM exit latency is zeroed, disabling LPM.
At this point I really do not know what to do, except hope to get more information before the crash.
EDIT: I have also posted a screenshot of what is printed on the screen at the time of the crash.
Edited October 29, 2023 by bonzi
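To judge whether the USB resets are a one-off or a recurring pattern before each crash, it can help to count them per port across the whole syslog. This is a rough Python sketch, assuming the kernel messages follow the same wording as the two lines quoted above; the regular expression and the name `count_usb_resets` are illustrative assumptions, not anything provided by Unraid.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... kernel: usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd"
# The pattern is an assumption based on the log excerpt in this thread.
USB_RESET = re.compile(r"kernel: usb (\S+): reset .*USB device")


def count_usb_resets(lines):
    """Count USB reset messages per bus-port identifier (e.g. '2-2')."""
    counts = Counter()
    for line in lines:
        match = USB_RESET.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

A single reset on one port is not necessarily a sign of trouble; many resets on the same port shortly before each crash would make that port or the device on it a more plausible suspect.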
bonzi Posted October 29, 2023 Author
It crashed again, and this time I think we might have something more useful. Could it be that the USB ports are failing, or not supplying enough power for devices that need a lot of it, and that this is causing the crashes? I am thinking especially of the USB port that the Coral TPU is plugged into. Thoughts? That is what it looks like to me from the log. I attached the full log from the crash and reboot. @JorgeB, could you have a look at the first few lines and let me know if you think this is what is going on? Thanks!
syslog-crash-2.rtf
Solution JorgeB Posted October 29, 2023
A lot of call traces, and they look more hardware related. Try with just one RAM stick; if the crashes continue, try a different one. That will basically rule out RAM.
bonzi Posted October 30, 2023 Author
On 10/29/2023 at 3:32 PM, JorgeB said: A lot of call traces, they look more hardware related, try with just RAM stick, if the same try a different one, that will basically rule out RAM.
OK, I will give that a try. Crashes are becoming more frequent now, often after a few hours and always within a day. Memtest does pass, but it does seem like a hardware issue.
bonzi Posted November 4, 2023 Author
I have been running for 4 days without a crash with just a single stick of memory. I am going to let it go for at least a few more days, then I will try putting the other sticks back in to see which one is causing problems. It's strange that memtest did not pick this up at all, but I think I will be able to figure this out now.
itimpi Posted November 4, 2023
You can get cases where all sticks work fine individually but when you have them all plugged in simultaneously you start getting errors. I assume this is related to the load on the memory controller on the motherboard.