Server Crashes Frequently - Kernel Panic - Persistent across HW migration (Ryzen > Intel)


Thank you very much for taking the time to look into it.

Would you mind elaborating on what re-test would mean?

 

Crashes used to be 2-4 weeks apart, but 3-4 have happened today alone, so I'm not sure how I would go about testing it.
I would have to disable my NVR, 3D printer monitoring and many automations for weeks or months to confirm anything... hmm, what to do...

 

In the meantime, I've found the following thread, which seems very close to what is going on for me.

Would you agree?

  • 1 month later...

I have this exact same problem, and I am also running Frigate. Now that you mention it, it could have started when I moved my Frigate instance to this machine (from another Unraid server which, as far as I remember, did not crash like this). @secretstorage any updates from your tests?

 

For reference, my setup is a 5800X CPU with a 3080 GPU used in VM passthrough. I have 32GB of RAM, of which I pass 16GB to VMs. I also think this is likely a RAM issue. I previously had issues with Frigate gobbling up RAM on my other system with 64GB of RAM, but no crashes. I fixed this with a RAM limiter, but I just noticed the limiter didn't carry over when I moved between machines. I just added the 5GB limit to Frigate ("--memory=5G" in extra parameters for those following along at home). I'll report back with my results. If you don't hear back, assume this fixed my problem.
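
For anyone wanting to confirm that a limit like this actually survived a container move, here is a minimal Python sketch. It converts a Docker-style size such as the `5G` above into bytes, so it can be compared with the byte count that `docker inspect` reports under `HostConfig.Memory` (0 means no limit). The function names are made up for illustration, and only the single-letter suffixes b, k, m and g are handled.

```python
import re

# Binary multipliers for the single-letter suffixes handled in this sketch.
_UNITS = {"": 1, "b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_docker_memory(value: str) -> int:
    """Convert a size such as '5G' or '512m' into a number of bytes."""
    match = re.fullmatch(r"(\d+)([bkmg]?)", value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised memory size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def limit_is_applied(inspect_bytes: int, expected: str) -> bool:
    """Compare the byte count reported by `docker inspect` with the intended limit."""
    return inspect_bytes == parse_docker_memory(expected)


if __name__ == "__main__":
    # Prints the byte value that a 5G limit should show up as.
    print(parse_docker_memory("5G"))
```

The value passed as `inspect_bytes` would come from something like `docker inspect --format '{{.HostConfig.Memory}}' frigate`, where `frigate` is an assumed container name.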

 

Cheers!

 

 

Edit: Here are the last lines from my syslog. Make anything of them? I'm in the process of researching them myself.

 

Jun  1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
Jun  1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: WARN Successful completion on short TX
Jun  1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
Jun  1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: Looking for event-dma 00000001042888d0 trb-start 00000001042888e0 trb-end 00000001042888e0 seg-start 0000000104288000 seg-end 0000000104288ff0
Edited by huquad
On 6/2/2024 at 2:13 AM, huquad said:

Edit: Here are the last lines from my syslog. Make anything of them? I'm in the process of researching them myself.

 

Jun  1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
Jun  1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: WARN Successful completion on short TX
Jun  1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
Jun  1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: Looking for event-dma 00000001042888d0 trb-start 00000001042888e0 trb-end 00000001042888e0 seg-start 0000000104288000 seg-end 0000000104288ff0

Almost identical to my logs from a crash minutes ago - a pattern?? 
```
Jun  6 18:17:10 Server dhcpcd-run-hooks[29674]: br0: Invalid domain name: .local
Jun  6 18:47:10 Server dhcpcd-run-hooks[31956]: br0: Invalid domain name: .local
Jun  6 19:17:10 Server dhcpcd-run-hooks[2210]: br0: Invalid domain name: .local
Jun  6 19:47:10 Server dhcpcd-run-hooks[4298]: br0: Invalid domain name: .local
Jun  6 20:17:10 Server dhcpcd-run-hooks[5781]: br0: Invalid domain name: .local
Jun  6 20:43:36 Server kernel: usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
Jun  6 20:43:36 Server kernel: usb 2-2: LPM exit latency is zeroed, disabling LPM.
Jun  6 20:47:10 Server dhcpcd-run-hooks[7601]: br0: Invalid domain name: .local
Jun  6 21:17:10 Server dhcpcd-run-hooks[10055]: br0: Invalid domain name: .local
Jun  6 21:47:10 Server dhcpcd-run-hooks[12527]: br0: Invalid domain name: .local
Jun  6 22:17:10 Server dhcpcd-run-hooks[16036]: br0: Invalid domain name: .local
Jun  6 22:47:10 Server dhcpcd-run-hooks[19282]: br0: Invalid domain name: .local
Jun  6 23:17:10 Server dhcpcd-run-hooks[21925]: br0: Invalid domain name: .local
Jun  6 23:47:10 Server dhcpcd-run-hooks[24961]: br0: Invalid domain name: .local
Jun  7 00:17:10 Server dhcpcd-run-hooks[27826]: br0: Invalid domain name: .local
Jun  7 00:26:23 Server kernel: xhci_hcd 0000:03:00.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
Jun  7 00:26:24 Server kernel: xhci_hcd 0000:03:00.0: WARN Successful completion on short TX
Jun  7 00:26:24 Server kernel: xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
Jun  7 00:26:24 Server kernel: xhci_hcd 0000:03:00.0: Looking for event-dma 0000000103b91a90 trb-start 0000000103b91aa0 trb-end 0000000103b91aa0 seg-start 0000000103b91000 seg-end 0000000103b91ff0

Jun  7 02:13:45 Server unassigned.devices: Mounting 'Auto Mount' Devices...
Jun  7 02:13:45 Server unassigned.devices: Error: Device '/dev/sdb1' mount point 'spare' - name is reserved, used in the array or a pool, or by an unassigned device.
Jun  7 02:13:45 Server unassigned.devices: Disk with serial 'ST4000DX001-1CE168_Z30195RK', mountpoint 'spare' cannot be mounted.
Jun  7 02:13:46 Server emhttpd: Starting services...
```
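
To check whether this really is a pattern rather than a coincidence, the syslog can be scanned for the same xhci_hcd messages. Below is a rough Python sketch that counts WARN and ERROR lines per PCI address. The regular expression is only fitted to the lines quoted in this thread, and the function name is hypothetical.

```python
import re
from collections import Counter

# Fitted to the xhci_hcd lines quoted in this thread; other kernels may word them differently.
XHCI_RE = re.compile(
    r"kernel: xhci_hcd (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"(?P<level>WARN|ERROR) (?P<msg>.*)"
)


def summarize_xhci(lines):
    """Count xhci_hcd WARN/ERROR lines, keyed by (PCI address, level)."""
    counts = Counter()
    for line in lines:
        match = XHCI_RE.search(line)
        if match:
            counts[(match.group("pci"), match.group("level"))] += 1
    return counts


# Lines copied from the log above.
SAMPLE = [
    "Jun  6 20:43:36 Server kernel: usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd",
    "Jun  7 00:26:23 Server kernel: xhci_hcd 0000:03:00.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.",
    "Jun  7 00:26:24 Server kernel: xhci_hcd 0000:03:00.0: WARN Successful completion on short TX",
    "Jun  7 00:26:24 Server kernel: xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1",
]

if __name__ == "__main__":
    for (pci, level), count in sorted(summarize_xhci(SAMPLE).items()):
        print(pci, level, count)
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `summarize_xhci(open("/var/log/syslog"))`, assuming that is where the syslog is kept on your system.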
@JorgeB I've now disabled in the config all the cameras that were previously rebooting and causing errors in Frigate, so they no longer appear in the logs.

So could this be strictly related to the USB resets?

https://github.com/google-coral/edgetpu/issues/166

 

 

Edited by secretstorage

Just to clarify, "different USB controller" would be a physical USB hub?

 

I am using 5 different ports on my B450 Tomahawk board, so all but one are occupied: the Unraid USB, 3D printer, SDR adapter, webcam and UPS.

Should I buy a hub and move the devices onto it?

 

3 hours ago, JorgeB said:

The board may have more than one USB controller, using a HUB with just one port may help, you may need to try more than one, but it could be worth trying.

How would you read this:
image.thumb.png.4c30efe85e9d9e3262ee38fb4f03bcb2.png
Looks like there are two and I am currently using both.
I couldn't use only the back sockets because devices would fight for allocations and oscillate back and forth ON/OFF.

image.thumb.png.bb2d14918bc6afce6b66812cc20f6514.png

Edited by secretstorage

There appear to be two:
 

03:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset USB 3.1 xHCI Compliant Host Controller [1022:43d5] (rev 01)
    Subsystem: ASMedia Technology Inc. 400 Series Chipset USB 3.1 XHCI Controller [1b21:1142]
    Kernel driver in use: xhci_hcd

28:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller [1022:149c]
    Subsystem: Micro-Star International Co., Ltd. [MSI] Matisse USB 3.0 Host Controller [1462:7c02]
    Kernel driver in use: xhci_hcd

 

So you could try using only one or the other.
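
As a rough aid for working out which controller the errors come from, here is a small Python sketch that pulls the USB controllers out of output like the listing above and matches them against the PCI address in an xhci_hcd syslog line. Note that the crash log posted earlier in the thread names `0000:03:00.0`, which corresponds to the first controller listed here. The function names are hypothetical, and the pattern assumes output in the style of `lspci -nnk`.

```python
import re

# Matches lines such as "03:00.0 USB controller [0c03]: <description>".
USB_RE = re.compile(
    r"^(?P<addr>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) USB controller \[0c03\]: (?P<name>.+)$"
)


def usb_controllers(lspci_text):
    """Return (address, description) for every USB controller line."""
    found = []
    for line in lspci_text.splitlines():
        match = USB_RE.match(line.strip())
        if match:
            found.append((match.group("addr"), match.group("name")))
    return found


def controller_for(syslog_pci, controllers):
    """Match a syslog address like '0000:03:00.0' to an lspci address like '03:00.0'."""
    short = syslog_pci.split(":", 1)[1]  # drop the PCI domain prefix
    for addr, name in controllers:
        if addr == short:
            return addr, name
    return None


# Listing copied from the post above.
SAMPLE = """\
03:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset USB 3.1 xHCI Compliant Host Controller [1022:43d5] (rev 01)
    Subsystem: ASMedia Technology Inc. 400 Series Chipset USB 3.1 XHCI Controller [1b21:1142]
    Kernel driver in use: xhci_hcd
28:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller [1022:149c]
    Subsystem: Micro-Star International Co., Ltd. [MSI] Matisse USB 3.0 Host Controller [1462:7c02]
    Kernel driver in use: xhci_hcd
"""

if __name__ == "__main__":
    print(controller_for("0000:03:00.0", usb_controllers(SAMPLE)))
```

If the failing address keeps pointing at the same controller, moving the devices to ports on the other one is the cheaper test before buying a hub.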


Since I re-implemented my memory limiter in Frigate, I haven't seen any more crashes (fingers crossed). It's been two weeks, which is much better than I was seeing before. @secretstorage did you ever implement this limiter in your Docker container? Or are you still chasing the USB angle?


Bad news bears. I just had my first crash since June 1st. I will say my trick seems to have delayed the crash. I'm going to stop passing through one of my USB controllers from my MOBO to my VMs and see where that gets me.


I stopped passing through the USB controller and instead passed through specific devices. Now my host has crashed on me while I was actively using it (playing a game on a VM). I noticed my CPU fan doing some weird stuff according to Netdata. It's possible I'm experiencing a thermal crash. I'm going to reinstall my CPU cooler with fresh paste and see where that gets me. I'm also going to make sure I don't have any wires or junk getting caught in the fans that could be stopping them. @secretstorage did you ever find a solution?
