Server Lockup


Hello all,

 

I previously had an issue where UnRaid would randomly lock up, but I was able to solve that. Now UnRaid is locking up because of Docker containers. I've narrowed it down to SABnzbd and Binhex-Delugevpn. I suspect SABnzbd, because whenever I've been downloading files for a while, the entire system locks up.

 

I've attached my diagnostics to the thread.

tower-diagnostics-20211005-2344.zip

Edited by TechTitus
33 minutes ago, TechTitus said:

How can I find out?

Settings > Syslog Server

You should know if you are running a syslog server, but it sounds like you may have accidentally enabled it, as @trurl was hinting.

 

I can see in your logs that rsyslog is starting

Oct  6 02:22:43 Tower rsyslogd: [origin software="rsyslogd" swVersion="8.2002.0" x-pid="1899" x-info="https://www.rsyslog.com"] start
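For anyone wanting to look for the same line in their own syslog, here is a minimal Python sketch. The regular expression is modelled only on the single line quoted above, and the function name `find_rsyslog_starts` is made up for illustration. A match only shows that rsyslogd started; it says nothing about which Syslog Server options are enabled.

```python
import re

# Pattern modelled on the rsyslogd startup line quoted above (an assumption:
# other rsyslog versions may format this banner differently).
RSYSLOG_START = re.compile(
    r'^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+rsyslogd:\s+'
    r'\[origin.*swVersion="(?P<version>[^"]+)".*\]\s+start\s*$'
)

def find_rsyslog_starts(lines):
    """Return (timestamp, host, version) for every rsyslogd start line."""
    hits = []
    for line in lines:
        match = RSYSLOG_START.match(line)
        if match:
            hits.append((match.group("stamp"), match.group("host"), match.group("version")))
    return hits

sample = [
    'Oct  6 02:22:43 Tower rsyslogd: [origin software="rsyslogd" swVersion="8.2002.0" '
    'x-pid="1899" x-info="https://www.rsyslog.com"] start',
    'Oct  6 02:22:44 Tower kernel: an unrelated message',
]
print(find_rsyslog_starts(sample))
```

To run it against a real file, pass `open(path)` instead of `sample`.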

 

Edited by tjb_altf4
Link to comment
9 hours ago, tjb_altf4 said:

Settings > Syslog Server


tower-syslog-20211008-1531.zip


Also seeing this.

 

Oct 8 14:25:00 Tower rsyslogd: file '/boot/logs/syslog'[6] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: File too large [v8.2002.0 try https://www.rsyslog.com/e/2027 ]
Oct 8 14:25:00 Tower rsyslogd: action 'action-1-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.2002.0 try https://www.rsyslog.com/e/2027 ]
Oct 8 14:25:00 Tower rsyslogd: file '/boot/logs/syslog'[6] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: File too large [v8.2002.0 try https://www.rsyslog.com/e/2027 ]
Oct 8 14:25:00 Tower rsyslogd: action 'action-1-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.2002.0 try https://www.rsyslog.com/e/2027 ]
Oct 8 14:25:00 Tower rsyslogd: file '/boot/logs/syslog'[6] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: File too large [v8.2002.0 try https://www.rsyslog.com/e/2027 ]
Oct 8 14:25:00 Tower rsyslogd: action 'action-1-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.2002.0 try https://www.rsyslog.com/e/2027 ]
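One possible reading of "File too large", not confirmed anywhere in this thread, is that /boot/logs/syslog has hit a per-file size limit of the filesystem it lives on. FAT32, for example, caps a single file at 4 GiB minus one byte. The Python sketch below assumes /boot is FAT32-formatted and simply reports how much room a file has left under that cap; `fat32_headroom` and `check_log` are illustrative names.

```python
import os

# FAT32 stores file sizes in a 32-bit field, so one file tops out at 4 GiB - 1 byte.
FAT32_MAX_FILE_SIZE = 2**32 - 1

def fat32_headroom(size_bytes):
    """Bytes left before a file of this size reaches the FAT32 per-file limit."""
    return max(FAT32_MAX_FILE_SIZE - size_bytes, 0)

def check_log(path):
    """Return (current size, remaining headroom) for the file at path."""
    size = os.path.getsize(path)
    return size, fat32_headroom(size)

# Path taken from the log lines above; the check is skipped on machines without it.
LOG_PATH = "/boot/logs/syslog"
if os.path.exists(LOG_PATH):
    size, room = check_log(LOG_PATH)
    print(f"{LOG_PATH}: {size} bytes used, {room} bytes of headroom")
```

If the headroom comes out at or near zero, the log has probably reached the cap. If not, the cause is something else.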

  • 2 weeks later...

Update to the issue. It's not just the dockers causing the problem.

 

Whenever I transfer a file to the cache (e.g., copying a remux file to the cache or transferring large files from my desktop to the cache), the server locks up. I'm not sure if you'll be able to see anything in the files provided.

Edited by TechTitus
  • TechTitus changed the title to Server Lockup

You already posted diags; in this case they're mostly useful for seeing the hardware used.

 

Just before the crash there are some issues with the NVMe device, though not sure these caused it:

 

Oct 19 13:50:55 Tower kernel: nvme nvme0: frozen state error detected, reset controller
Oct 19 13:50:56 Tower kernel: pcieport 0000:00:03.0: AER: Root Port link has been reset
Oct 19 13:50:56 Tower kernel: pcieport 0000:00:03.0: AER: device recovery successful

 

Looks like the board doesn't have an M.2 slot, so try moving the NVMe adapter to a different PCIe slot to see whether it stops generating errors like these (and the ones above):

 

Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Transmitter ID)
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0:   device [8086:3c08] error status/mask=00003101/00002000
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0:    [ 0] RxErr                 
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0:    [ 8] Rollover              
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0:    [12] Timeout               
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: can't find device of ID0018
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: can't find device of ID0018
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: can't find device of ID0018
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: can't find device of ID0018
Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Fatal) error received: 0000:00:03.0
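As a rough aid for reading logs like these, the Python sketch below decodes the `error status/mask` pair and tallies the AER severity lines. The bit positions follow the AER Correctable Error Status register layout in the PCIe specification, and the short names are chosen to match the ones in the log. The function names are made up for illustration, and the sample is a subset of the lines above.

```python
import re
from collections import Counter

# Bit positions in the AER Correctable Error Status register, labelled with
# short names in the style of the log lines above.
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}

STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")
SEVERITY_RE = re.compile(r"AER: Multiple (Corrected|Uncorrected \(Fatal\)) error received")

def decode_correctable(status_hex, mask_hex):
    """Return (names of set status bits, names of set bits that are also masked)."""
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)
    set_bits = [n for b, n in CORRECTABLE_BITS.items() if status & (1 << b)]
    masked = [n for b, n in CORRECTABLE_BITS.items() if status & mask & (1 << b)]
    return set_bits, masked

def tally_severity(lines):
    """Count 'Multiple ... error received' lines by severity."""
    counts = Counter()
    for line in lines:
        match = SEVERITY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

sample = [
    "Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0",
    "Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0:   device [8086:3c08] error status/mask=00003101/00002000",
    "Oct 19 13:50:55 Tower kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Fatal) error received: 0000:00:03.0",
]

for line in sample:
    found = STATUS_RE.search(line)
    if found:
        print(decode_correctable(found.group(1), found.group(2)))
print(tally_severity(sample))
```

For status 00003101 with mask 00002000, this reports RxErr, Rollover, Timeout and NonFatalErr as set, with NonFatalErr masked. That would be consistent with only the first three names appearing in the log.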

 


Correct, I have an M.2 drive connected through an adapter to a PCIe slot. I'll try swapping slots, and if that doesn't work, I'll try a new adapter... and if that doesn't work, I'll burn it to the ground (buy a new SSD).

 

Wait, if I'm seeing UDMA errors and the affected drives are connected through HBA cards, could these all be related? If so, what could be the cause?

Edited by TechTitus
Added possible related information.

I moved the drive to another PCI-E slot and it appears to have fixed the issue.

 

I also moved my 10Gb network card to another slot and it got fried, so I'm not sure what's going on with the slots.

 

Thanks for your help!
