TechTitus (posted August 12, 2021):
Hello All, It seems I can't get away from having issues. Recently my server has started to lock up randomly. When it happens, I can't unmount the drives, can't load the Docker page, and can't export my diagnostics. It gets to my cache drive, then just sits there and never moves past it. I'm running 6.9.2 and I've attached my diagnostics from my last reboot. Let me know if there's something else that's needed.
Attachment: tower-diagnostics-20210812-0026.zip
JorgeB (posted August 12, 2021):
Enable syslog mirror to flash, then post the syslog after a lockup.
TechTitus (posted August 13, 2021):
Here's the syslog file as requested.
Attachment: syslog
JorgeB (posted August 13, 2021):
There were issues with the NVMe device just before the crash, though it's unclear whether the crash was related:
Aug 13 02:26:53 Tower kernel: nvme nvme0: frozen state error detected, reset controller
Aug 13 02:26:53 Tower kernel: blk_update_request: I/O error, dev nvme0n1, sector 1595697920 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
Aug 13 02:26:54 Tower kernel: pcieport 0000:00:01.0: AER: Root Port link has been reset
Aug 13 02:26:54 Tower kernel: pcieport 0000:00:01.0: AER: device recovery successful
Leave the syslog server enabled and see whether the same errors appear again before the next crash.
TechTitus (posted August 13, 2021):
Here it is again.
Attachment: syslog
JorgeB (posted August 14, 2021):
Same thing, though this time the crash took a little longer to happen after the errors:
Aug 13 17:54:39 Tower kernel: nvme nvme0: frozen state error detected, reset controller
Aug 13 17:54:39 Tower kernel: blk_update_request: I/O error, dev nvme0n1, sector 1528760504 op 0x0:(READ) flags 0x80700 phys_seg 17 prio class 0
Aug 13 17:54:39 Tower kernel: blk_update_request: I/O error, dev nvme0n1, sector 1528760760 op 0x0:(READ) flags 0x80700 phys_seg 18 prio class 0
Aug 13 17:54:40 Tower kernel: pcieport 0000:00:01.0: AER: Root Port link has been reset
Aug 13 17:54:40 Tower kernel: pcieport 0000:00:01.0: AER: device recovery successful
Aug 13 18:23:29 Tower kernel: microcode: microcode updated early to revision 0x71a, date = 2020-03-24
Still, this suggests it could be related. Try removing the NVMe device, using a different one if available, or moving it to a different slot.
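For anyone who wants to check a mirrored syslog for the same pattern, here is a minimal Python sketch. It is not an Unraid tool, and the function name `minutes_from_errors_to_reboot` is made up for illustration. It looks for the two NVMe error messages quoted in this thread and reports how many minutes passed between the last such error and the next boot. It assumes that the `microcode: microcode updated early` line marks the first boot after a crash, and that the year has to be supplied because syslog timestamps omit it. The result is only an upper bound on the time to the lockup, since the reboot happens some time after the server hangs.

```python
import re
from datetime import datetime

# Patterns copied from the log lines quoted in this thread.
NVME_ERROR = re.compile(
    r"nvme nvme\d+: frozen state error detected"
    r"|blk_update_request: I/O error, dev nvme\d+n\d+"
)
# Assumption: this early-boot kernel message marks the start of a new boot.
BOOT_MARKER = re.compile(r"microcode: microcode updated early")


def parse_time(line, year=2021):
    """Parse the 15-character syslog timestamp, e.g. 'Aug 13 17:54:39'."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def minutes_from_errors_to_reboot(lines, year=2021):
    """For each boot that follows NVMe errors, return the gap in minutes
    between the last NVMe error and the boot marker."""
    gaps = []
    last_error = None
    for line in lines:
        if NVME_ERROR.search(line):
            last_error = parse_time(line, year)
        elif BOOT_MARKER.search(line) and last_error is not None:
            delta = parse_time(line, year) - last_error
            gaps.append(delta.total_seconds() / 60)
            last_error = None  # only count errors seen since the previous boot
    return gaps


if __name__ == "__main__":
    import sys

    # Usage: python nvme_gap.py /path/to/syslog
    with open(sys.argv[1], errors="replace") as handle:
        for gap in minutes_from_errors_to_reboot(handle):
            print(f"NVMe errors were followed by a reboot {gap:.1f} minutes later")
```

If every reboot in the log is preceded by NVMe errors, that supports the suspicion above, but it does not prove the drive or slot is the cause.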