TechTitus Posted October 6, 2021 (edited October 22, 2021)
Hello all, I previously had an issue where UnRaid would randomly lock up, but I was able to solve that. Now I'm having an issue where UnRaid locks up because of Docker containers. I've narrowed it down to SABnzbd and Binhex-Delugevpn. I suspect SABnzbd, because whenever I download files for a while, the entire system seems to lock up. I've attached my diagnostics to the thread.
tower-diagnostics-20211005-2344.zip
ChatNoir Posted October 6, 2021
Your diagnostics are lacking the syslog. Can you try to create them again?
TechTitus (Author) Posted October 6, 2021
27 minutes ago, ChatNoir said: Your diagnostics are lacking the syslog. Can you try to create it again ?
Here you go.
tower-diagnostics-20211006-1252.zip
trurl Posted October 7, 2021
Are you running the syslog server?
TechTitus (Author) Posted October 8, 2021
15 hours ago, trurl said: Are you running syslog server?
How can I find out?
tjb_altf4 Posted October 8, 2021 (edited October 8, 2021)
33 minutes ago, TechTitus said: How can I find out?
Settings > Syslog Server
You should know if you are running the syslog server, but it sounds like you may have enabled it accidentally, as @trurl was hinting.
I can see in your logs that rsyslog is starting:
Oct 6 02:22:43 Tower rsyslogd: [origin software="rsyslogd" swVersion="8.2002.0" x-pid="1899" x-info="https://www.rsyslog.com"] start
TechTitus (Author) Posted October 8, 2021
9 hours ago, tjb_altf4 said: Settings > Syslog Server. You should know if you are running syslog server, but sounds like you may have accidentally enabled syslog as @trurl was leading. I can see in your logs that rsyslog is starting:
Oct 6 02:22:43 Tower rsyslogd: [origin software="rsyslogd" swVersion="8.2002.0" x-pid="1899" x-info="https://www.rsyslog.com"] start
tower-syslog-20211008-1531.zip
TechTitus (Author) Posted October 8, 2021
I've uploaded the syslog.
TechTitus (Author) Posted October 8, 2021
Also seeing this:
Oct 8 14:25:00 Tower rsyslogd: file '/boot/logs/syslog'[6] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: File too large [v8.2002.0 try https://www.rsyslog.com/e/2027 ]
Oct 8 14:25:00 Tower rsyslogd: action 'action-1-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.2002.0 try https://www.rsyslog.com/e/2027 ]
Oct 8 14:25:00 Tower rsyslogd: file '/boot/logs/syslog'[6] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: File too large [v8.2002.0 try https://www.rsyslog.com/e/2027 ]
Oct 8 14:25:00 Tower rsyslogd: action 'action-1-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.2002.0 try https://www.rsyslog.com/e/2027 ]
Oct 8 14:25:00 Tower rsyslogd: file '/boot/logs/syslog'[6] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: File too large [v8.2002.0 try https://www.rsyslog.com/e/2027 ]
Oct 8 14:25:00 Tower rsyslogd: action 'action-1-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.2002.0 try https://www.rsyslog.com/e/2027 ]
tjb_altf4 Posted October 9, 2021
You should know if you are running the syslog server. Since you didn't know, it has probably been turned on accidentally, so you should disable it in Settings.
If you did mean to have syslog running, then enable log rotation, as the log file has reached the maximum supported size.
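The "File too large" write error quoted earlier in the thread is consistent with the log file hitting a per-file size cap. The following is a minimal Python sketch, assuming the flash drive mounted at /boot is formatted as FAT32 (whose largest possible file is 4 GiB minus one byte). The helper names and the 90% warning threshold are illustrative and are not part of Unraid or rsyslog.

```python
import os

# Assumption: /boot is FAT32, which caps a single file at 2**32 - 1 bytes.
FAT32_MAX_FILE_BYTES = 2**32 - 1


def fraction_used(size_bytes, limit=FAT32_MAX_FILE_BYTES):
    """Return how much of the per-file size limit a file of this size uses."""
    return size_bytes / limit


def near_limit(size_bytes, threshold=0.9, limit=FAT32_MAX_FILE_BYTES):
    """True when the file has used at least `threshold` of the limit."""
    return fraction_used(size_bytes, limit) >= threshold


def check_log(path="/boot/logs/syslog"):
    """Return (size_in_bytes, near_limit_flag) for the log file at `path`."""
    size = os.path.getsize(path)
    return size, near_limit(size)
```

Calling `check_log()` on the server reports whether the mirrored log is close to the cap; enabling log rotation, as suggested above, keeps the file from growing that large in the first place.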
TechTitus (Author) Posted October 19, 2021
I've posted updated diagnostics and syslog.
tower-diagnostics-20211019-1804.zip
tower-syslog-20211019-2303.zip
TechTitus (Author) Posted October 21, 2021 (edited October 21, 2021)
Update on the issue: it's not just the Docker containers causing the problem. Whenever I transfer a file to the cache (for example, copying a remux file to the cache or transferring large files from my desktop to the cache), the server locks up. I'm not sure if you'll be able to see anything in the files provided.
TechTitus (Author) Posted October 22, 2021
Bump. Can someone please help me out? I have to reboot the server every 10 minutes at this point.
tjb_altf4 Posted October 23, 2021
Your local syslog server is still enabled.
On 10/9/2021 at 12:11 PM, tjb_altf4 said: You should know if you are running syslog server, as you didn't know, it's probably been turned on accidentally, so you should disable it in settings. If you did mean to have syslog running, then enable log rotation as the log file has reached the maximum supported size.
RGauld Posted October 23, 2021
I'm also getting random server lockups. I have no idea why, as this has only recently started happening. I'm including a diagnostics report.
rg-server-diagnostics-20211023-0833.zip
JorgeB Posted October 24, 2021
18 hours ago, RGauld said: Also getting random server lockups...
Please don't post in multiple threads about the same thing; do what I asked in the other one.
TechTitus (Author) Posted October 26, 2021 (edited October 26, 2021)
I was able to set the syslog to mirror to USB and get a copy of the file. When the server locks up, I can't export the diagnostics, so I'm not sure what to do there. @JorgeB @tjb_altf4 @trurl
syslog
JorgeB Posted October 26, 2021
Diags you already posted; in this case they're mostly to see the hardware used.
Just before the crash there are some issues with the NVMe device, though I'm not sure these caused it:
Oct 19 13:50:55 Tower kernel: nvme nvme0: frozen state error detected, reset controller
Oct 19 13:50:56 Tower kernel: pcieport 0000:00:03.0: AER: Root Port link has been reset
Oct 19 13:50:56 Tower kernel: pcieport 0000:00:03.0: AER: device recovery successful
Looks like the board doesn't have an M.2 slot, so try moving the NVMe adapter to a different PCIe slot, to see if it stops generating errors like these (and the above):
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Transmitter ID)
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: device [8086:3c08] error status/mask=00003101/00002000
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: [ 0] RxErr
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: [ 8] Rollover
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: [12] Timeout
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: can't find device of ID0018
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: can't find device of ID0018
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: can't find device of ID0018
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: can't find device of ID0018
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Fatal) error received: 0000:00:03.0
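For a long syslog it can help to tally these messages per device instead of reading them line by line. Below is a minimal Python sketch that counts AER "error received" lines and NVMe controller resets. The regular expressions are assumptions derived only from the message formats quoted in this thread, and the function name `summarize` is hypothetical.

```python
import re
from collections import Counter

# Patterns are assumptions based on the log lines quoted in this thread.
AER_RE = re.compile(
    r"pcieport (?P<dev>[0-9a-f:.]+): AER: (?:Multiple )?"
    r"(?P<sev>Corrected|Uncorrected \((?:Fatal|Non-Fatal)\)) error received"
)
NVME_RESET_RE = re.compile(r"nvme (?P<ctrl>nvme\d+): .*reset controller")


def summarize(lines):
    """Count AER error reports and NVMe controller resets per device."""
    counts = Counter()
    for line in lines:
        match = AER_RE.search(line)
        if match:
            counts[(match.group("dev"), match.group("sev"))] += 1
            continue
        match = NVME_RESET_RE.search(line)
        if match:
            counts[(match.group("ctrl"), "controller reset")] += 1
    return counts


# A few lines copied from the excerpt in this thread.
SAMPLE = [
    "Oct 19 13:50:55 Tower kernel: nvme nvme0: frozen state error detected, reset controller",
    "Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0",
    "Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: can't find device of ID0018",
    "Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0",
    "Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Fatal) error received: 0000:00:03.0",
]

if __name__ == "__main__":
    for (device, kind), count in sorted(summarize(SAMPLE).items()):
        print(f"{device}: {kind} x{count}")
```

To scan a real file, pass an open file object instead of `SAMPLE`, for example `summarize(open("syslog"))`. If the counts for a root port drop to zero after moving the adapter, that suggests the slot change helped.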
TechTitus (Author) Posted October 26, 2021 (edited October 26, 2021; reason: Added possible related information.)
4 hours ago, JorgeB said: Just before the crash there are some issues with the NVMe device, though not sure these caused it [...] Looks like the board doesn't have an M.2 slot, so try changing the NVMe adapter to a different PCIe slot, to see if it doesn't generate errors like these (and the above) [...]
Correct, I have an M.2 drive connected through an adapter to a PCIe slot. I'll try swapping slots, and if that doesn't work, I'll try a new adapter... and if that doesn't work, I'll burn it to the ground (buy a new SSD).
Wait, if I'm seeing UDMA errors and they're connected through HBA cards, could these all be related? If so, what could be the cause?
JorgeB Posted October 26, 2021
1 hour ago, TechTitus said: if I'm seeing UDMA errors and they're connected through HBA cards, could these all be related?
Unrelated, but it's a known issue with LSI HBAs and the firmware you're using:
FWVersion(20.00.00.00)
Update to latest (20.00.07.00)
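To check several HBAs or several log files against the recommended firmware, the version string can be pulled out and compared numerically. This is a minimal Python sketch; the only thing taken from the thread is the `FWVersion(...)` text and the two version numbers, while the function names and the surrounding log line format are assumptions.

```python
import re

# Version recommended in this thread; treat it as an input, not a constant truth.
RECOMMENDED_FW = "20.00.07.00"

FW_RE = re.compile(r"FWVersion\((?P<version>\d+(?:\.\d+)*)\)")


def parse_fw(version):
    """Turn '20.00.07.00' into (20, 0, 7, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def extract_fw(line):
    """Return the firmware version found in a log line, or None if absent."""
    match = FW_RE.search(line)
    return match.group("version") if match else None


def needs_update(current, recommended=RECOMMENDED_FW):
    """True when the current firmware is older than the recommended one."""
    return parse_fw(current) < parse_fw(recommended)
```

For example, `needs_update(extract_fw("FWVersion(20.00.00.00)"))` evaluates to `True`, matching the advice above to update.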
TechTitus (Author) Posted October 28, 2021
On 10/26/2021 at 4:44 AM, JorgeB said: Looks like the board doesn't have an M.2 slot, so try changing the NVMe adapter to a different PCIe slot, to see if it doesn't generate errors like these (and the above) [...]
I moved the drive to another PCIe slot and it appears to have fixed the issue. I also moved my 10Gb network card to another slot and it got fried, so I'm not sure what's going on with the slots. Thanks for your help!
TechTitus (Author) Posted October 28, 2021
On 10/26/2021 at 11:04 AM, JorgeB said: Unrelated, but it's a known issue with LSI HBAs and the firmware you're using: FWVersion(20.00.00.00) Update to latest (20.00.07.00)
Great! Thank you!
TechTitus (Author) Posted April 24, 2022
On 10/26/2021 at 11:04 AM, JorgeB said: Unrelated, but it's a known issue with LSI HBAs and the firmware you're using: FWVersion(20.00.00.00) Update to latest (20.00.07.00)
This fixed all of my disk issues: CRC UDMA errors, disks becoming unmountable, etc.