Server Crashes Frequently - Kernel Panic - Persistent across HW migration (Ryzen > Intel)

secretstorage · April 7

And it died again this morning, just 30-45 min ago.

From what I can see there are a lot of GPU errors as well, which starts to make sense given the way the GPU was behaving, but would this cause a kernel panic?

syslog-10.10.10.99.log

secretstorage · April 11

@JorgeB would you mind taking a look?
It failed 3 times just today, and while there are still USB errors attributable to my UPS, which for some reason seems to disconnect all the time, there are other Docker-related errors (blocked mode).

Not sure where to look for next steps....

syslog-10.10.10.99.log

JorgeB · April 11

There are constant crashes that look to me like they are being caused by Frigate/ffmpeg; try leaving that container disabled and re-test.

secretstorage · April 11

Thank you very much for taking the time to look into it.

Would you mind elaborating on what re-test would mean?

Since it was often 2-4 weeks between crashes before, and there were 3-4 today, I wonder how I would go about testing it.
I would have to disable my NVR, 3D printer monitoring and many automations for weeks/months to confirm anything... hmm what to do....

In the meantime, I've found the following thread, which seems very close to what is going on for me.

Would you agree?

JorgeB · April 11

19 minutes ago, secretstorage said:

Would you mind elaborating on what re-test would mean?

Disable Frigate and run the server until it crashes or you are satisfied it's no longer crashing.

huquad · June 2

I have this same exact problem, and am also running frigate. Now that you mention it, it could have started when I moved my frigate instance to this machine (from another unraid server which did not show the same crashing to my memory). @secretstorage any updates from your tests?

For reference, my setup is a 5800x CPU with 3080 GPU used in VM passthrough. I have 32GB of ram, of which I pass 16GB to VMs. I also think this is likely a ram issue. I previously had issues with frigate gobbling up ram on my other system with 64GB of ram, but no crashes. I fixed this with a ram limiter, but just noticed the ram limiter didn't make the transfer between my machines. I just added the 5GB limit to frigate ("--memory=5G" in extra parameters for those following along at home). I'll report back with my results. If you don't hear back, assume this fixed my problem.
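For anyone wanting to confirm that the limit actually applied after a move like this, here is a minimal Python sketch. It assumes the `docker` CLI is available on the host and that the container is named `frigate` (an assumption; substitute your own container name). `docker inspect` reports `HostConfig.Memory` in bytes, with `0` meaning no limit is set.

```python
import subprocess


def parse_memory_limit(inspect_output: str) -> str:
    """Turn the byte count printed by `docker inspect` into a readable string."""
    limit_bytes = int(inspect_output.strip())
    if limit_bytes == 0:
        return "no limit"  # Docker reports 0 when no --memory flag was applied
    return f"{limit_bytes / 1024**3:.1f} GiB"


def container_memory_limit(name: str) -> str:
    """Ask Docker for a container's memory limit (requires the docker CLI)."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.HostConfig.Memory}}", name],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_limit(result.stdout)
```

With `--memory=5G` in place, `container_memory_limit("frigate")` should return `5.0 GiB`; a result of `no limit` would suggest the extra parameter did not carry over.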

Cheers!

Edit: Here are the last lines from my syslog. Make anything of them? I'm in the process of researching them myself.

Jun  1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
Jun  1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: WARN Successful completion on short TX
Jun  1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
Jun  1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: Looking for event-dma 00000001042888d0 trb-start 00000001042888e0 trb-end 00000001042888e0 seg-start 0000000104288000 seg-end 0000000104288ff0

Edited June 2 by huquad

secretstorage · June 7

On 6/2/2024 at 2:13 AM, huquad said:

Edit: Here are the last lines from my syslog. Make anything of them? I'm in the process of researching them myself.

Jun  1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
Jun  1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: WARN Successful completion on short TX
Jun  1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
Jun  1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: Looking for event-dma 00000001042888d0 trb-start 00000001042888e0 trb-end 00000001042888e0 seg-start 0000000104288000 seg-end 0000000104288ff0

Almost identical to my logs from a crash minutes ago - a pattern??
```
Jun 6 18:17:10 Server dhcpcd-run-hooks[29674]: br0: Invalid domain name: .local
Jun 6 18:47:10 Server dhcpcd-run-hooks[31956]: br0: Invalid domain name: .local
Jun 6 19:17:10 Server dhcpcd-run-hooks[2210]: br0: Invalid domain name: .local
Jun 6 19:47:10 Server dhcpcd-run-hooks[4298]: br0: Invalid domain name: .local
Jun 6 20:17:10 Server dhcpcd-run-hooks[5781]: br0: Invalid domain name: .local
Jun 6 20:43:36 Server kernel: usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
Jun 6 20:43:36 Server kernel: usb 2-2: LPM exit latency is zeroed, disabling LPM.
Jun 6 20:47:10 Server dhcpcd-run-hooks[7601]: br0: Invalid domain name: .local
Jun 6 21:17:10 Server dhcpcd-run-hooks[10055]: br0: Invalid domain name: .local
Jun 6 21:47:10 Server dhcpcd-run-hooks[12527]: br0: Invalid domain name: .local
Jun 6 22:17:10 Server dhcpcd-run-hooks[16036]: br0: Invalid domain name: .local
Jun 6 22:47:10 Server dhcpcd-run-hooks[19282]: br0: Invalid domain name: .local
Jun 6 23:17:10 Server dhcpcd-run-hooks[21925]: br0: Invalid domain name: .local
Jun 6 23:47:10 Server dhcpcd-run-hooks[24961]: br0: Invalid domain name: .local
Jun 7 00:17:10 Server dhcpcd-run-hooks[27826]: br0: Invalid domain name: .local
Jun 7 00:26:23 Server kernel: xhci_hcd 0000:03:00.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
Jun 7 00:26:24 Server kernel: xhci_hcd 0000:03:00.0: WARN Successful completion on short TX
Jun 7 00:26:24 Server kernel: xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
Jun 7 00:26:24 Server kernel: xhci_hcd 0000:03:00.0: Looking for event-dma 0000000103b91a90 trb-start 0000000103b91aa0 trb-end 0000000103b91aa0 seg-start 0000000103b91000 seg-end 0000000103b91ff0
Jun 7 02:13:45 Server unassigned.devices: Mounting 'Auto Mount' Devices...
Jun 7 02:13:45 Server unassigned.devices: Error: Device '/dev/sdb1' mount point 'spare' - name is reserved, used in the array or a pool, or by an unassigned device.
Jun 7 02:13:45 Server unassigned.devices: Disk with serial 'ST4000DX001-1CE168_Z30195RK', mountpoint 'spare' cannot be mounted.
Jun 7 02:13:46 Server emhttpd: Starting services...
```
@JorgeB I've currently switched off in config all the cameras that previously were rebooting and causing errors in Frigate, so they are no longer present in logs.

This may thus be strictly related to USB resetting?

https://github.com/google-coral/edgetpu/issues/166
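To check whether two syslogs really share the xhci_hcd pattern, a small Python sketch like the one below could count the WARN/ERROR lines per controller. The regular expression is written against the log lines quoted in this thread only, so treat it as an assumption rather than a complete parser for every xhci_hcd message.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   "Jun  1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: ERROR Transfer event ..."
XHCI_LINE = re.compile(
    r"kernel: xhci_hcd (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"(?P<level>WARN|ERROR) (?P<msg>.*)"
)


def count_xhci_events(lines):
    """Count xhci_hcd WARN/ERROR lines per (PCI address, level)."""
    counts = Counter()
    for line in lines:
        match = XHCI_LINE.search(line)
        if match:
            counts[(match["pci"], match["level"])] += 1
    return counts
```

Passing an open syslog file to `count_xhci_events` yields a counter keyed by PCI address and level, which makes it easier to see which controller the errors are logged against on each machine.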

Edited June 7 by secretstorage

JorgeB · June 7

7 hours ago, secretstorage said:

This may thus be strictly related to USB resetting?

It's possible.

secretstorage · June 7

Looks to be almost exactly the same errors for two users with equipment of completely different vintage and type.

How should we proceed with seeking the Unraid team's help on this?

Thank you!

JorgeB · June 7

Based on the link posted, it appears to be a kernel issue, so LT likely won't be able to do much. My suggestion would be to try a different USB controller, or report it to the Linux maintainers as a kernel bug.

secretstorage · June 7

Just to clarify, "different USB controller" would be a physical USB hub?

I am using 5 different ports on my B450 Tomahawk board, with all but one being allocated with the Unraid USB, 3D printer, SDR-adapter, Webcam and UPS.

Should I buy a hub and get it transferred to it?

JorgeB · June 7

The board may have more than one USB controller, using a HUB with just one port may help, you may need to try more than one, but it could be worth trying.
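To see which physical controller each USB bus hangs off, one option is to resolve the root-hub links under sysfs. The Python sketch below assumes the usual Linux layout, where `/sys/bus/usb/devices/usbN` resolves to a path containing the controller's PCI address; the function names are illustrative, not part of any tool.

```python
import glob
import os
import re

# A PCI address as it appears in sysfs paths, e.g. "0000:03:00.0"
PCI_ADDR = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")


def controller_of(sysfs_path):
    """Return the last PCI address found in a resolved sysfs path, or None."""
    found = PCI_ADDR.findall(sysfs_path)
    return found[-1] if found else None


def usb_bus_controllers(pattern="/sys/bus/usb/devices/usb*"):
    """Map each USB root hub (usb1, usb2, ...) to its controller's PCI address."""
    return {
        os.path.basename(link): controller_of(os.path.realpath(link))
        for link in sorted(glob.glob(pattern))
    }
```

Running `usb_bus_controllers()` on the server should return one entry per USB bus; buses that share a PCI address sit behind the same controller, which helps when deciding which ports to move devices to.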

secretstorage · June 7

3 hours ago, JorgeB said:

The board may have more than one USB controller, using a HUB with just one port may help, you may need to try more than one, but it could be worth trying.

How would you read this:

Looks like there are two and I am currently using both.
I couldn't use only the back sockets because devices would fight for allocations and oscillate back and forth ON/OFF.

Edited June 7 by secretstorage

JorgeB · June 8

There appear to be two:

03:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset USB 3.1 xHCI Compliant Host Controller [1022:43d5] (rev 01)
    Subsystem: ASMedia Technology Inc. 400 Series Chipset USB 3.1 XHCI Controller [1b21:1142]
    Kernel driver in use: xhci_hcd

28:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller [1022:149c]
    Subsystem: Micro-Star International Co., Ltd. [MSI] Matisse USB 3.0 Host Controller [1462:7c02]
    Kernel driver in use: xhci_hcd

So you could try using only one or the other.

huquad · June 14

Since I re-implemented my memory limiter in frigate, I haven't seen any more crashes (fingers crossed). It's been two weeks which is much better than I was seeing before. @secretstorage did you ever implement this limiter in your docker container? Or are you still chasing the USB angle?

huquad · June 22

Bad news bears. I just had my first crash since June 1st. I will say my trick seems to have delayed the crash. I'm going to stop passing through one of my USB controllers from my MOBO to my VMs and see where that gets me.

huquad · June 23

Stopped passing through the USB controller and instead passed through specific devices. Now my host crashed on me while actively using it (playing a game on VM). I noticed my CPU fan doing some weird stuff according to Netdata. It's possible I'm experiencing a thermal crash. I'm going to reinstall my CPU cooler with fresh paste and see where that gets me. I'm also going to make sure I don't have any wires/junk getting caught in the fans that could be stopping them. @secretstorage did you ever find a solution?

bigbangus · June 27

Yeah I don't know what the issue is, I just know that if I move my google coral that frigate uses to my host mobo USB controller it crashes. It got me twice years apart lol. I just run it on a separate PCI-e USB card and it's been very stable. Mysteries of the universe.

huquad · June 28

@bigbangus that's very interesting. I'm desperate so I'll try anything! haha

Adding that to the list. I'm gonna try a few things and report back if I have success with any of them.

huquad · August 4

I'm still experiencing crashes. Did anyone ever find a solution to this? In the meantime, I'm moving my frigate instance to another unraid machine on different hardware.

huquad · August 10

The saga continues. My other unraid machine just crashed as a result of frigate/coral usb. I tried moving frigate to a home assistant VM, but wasn't able to get the network drive permissions figured out. Moving it back to my original machine as it's less mission critical. @secretstorage have you had any luck?

secretstorage · September 14

On 8/10/2024 at 2:22 PM, huquad said:

The saga continues. My other unraid machine just crashed as a result of frigate/coral usb. I tried moving frigate to a home assistant VM, but wasn't able to get the network drive permissions figured out. Moving it back to my original machine as it's less mission critical. @secretstorage have you had any luck?

Nope, it's just hit and miss and crashes from time to time when USB devices fall out.

For the moment mine has been stable for about a month, as I've not used the 3D printer much, nor had much to do with Frigate, but I am sure that a day will come.

At the moment I am looking at upgrading my rig with an X870E motherboard, so will have a lot more USBs under a single controller. Hopefully should be able to minimize the probability of these crashes, but not entirely sure if it will help.

huquad · September 18

Damn. Oh well. Mine's been more stable recently too, but that doesn't really mean anything considering it's done this before. For now, I have a way to hard reboot the system externally (unifi PDU). Not the most elegant, but it works. Long term I think I'll switch to an intel system to get some quick sync benefits. I hear that's a solid alternative to the coral.
