secretstorage Posted April 7
It died again this morning, just 30-45 minutes ago. From what I can see there are a lot of GPU errors as well, which starts to make sense given the way the GPU was behaving, but would this cause a kernel panic?
syslog-10.10.10.99.log
secretstorage Posted April 11
@JorgeB would you mind taking a look? It failed 3 times just today, and while there are still USB errors attributable to my UPS, which for some reason seems to disconnect all the time, there are other Docker-related errors (blocked mode). Not sure where to look for next steps.
syslog-10.10.10.99.log
JorgeB Posted April 11
There are constant crashes that, to me, look like they are being caused by Frigate/ffmpeg. Try leaving that container disabled and re-test.
secretstorage Posted April 11
Thank you very much for taking the time to look into it. Would you mind elaborating on what re-test would mean? Since it was often 2-4 weeks between crashes before, and 3-4 happened today, I wonder how I would go about testing it. I would have to disable my NVR, 3D printer monitoring and many automations for weeks or months to confirm anything. Hmm, what to do. In the meantime, I've found the following thread, which seems very close to what is going on for me. Would you agree?
JorgeB Posted April 11
19 minutes ago, secretstorage said: Would you mind elaborating on what re-test would mean?
Disable Frigate and run the server until it crashes or you are satisfied it's no longer crashing.
huquad Posted June 2
I have this same exact problem, and am also running Frigate. Now that you mention it, it could have started when I moved my Frigate instance to this machine (from another Unraid server which, as far as I remember, did not show the same crashing). @secretstorage any updates from your tests?
For reference, my setup is a 5800X CPU with a 3080 GPU used in VM passthrough. I have 32GB of RAM, of which I pass 16GB to VMs. I also think this is likely a RAM issue. I previously had issues with Frigate gobbling up RAM on my other system with 64GB of RAM, but no crashes. I fixed this with a RAM limiter, but just noticed the RAM limiter didn't make the transfer between my machines. I just added the 5GB limit to Frigate ("--memory=5G" in extra parameters for those following along at home). I'll report back with my results. If you don't hear back, assume this fixed my problem. Cheers!
Edit: Here are the last lines from my syslog. Make anything of them? I'm in the process of researching them myself.
Jun 1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
Jun 1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: WARN Successful completion on short TX
Jun 1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
Jun 1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: Looking for event-dma 00000001042888d0 trb-start 00000001042888e0 trb-end 00000001042888e0 seg-start 0000000104288000 seg-end 0000000104288ff0
Edited June 2 by huquad
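For anyone wanting to confirm that a "--memory" limit actually made it onto a container, here is a minimal Python sketch. It assumes the JSON comes from `docker inspect <container>`, where `HostConfig.Memory` holds the limit in bytes and is 0 when no limit is set. The container name `frigate` in the usage note and the helper names are assumptions for illustration, not something taken from the posts above.

```python
import json


def memory_limit_bytes(inspect_output):
    """Return the container memory limit in bytes from `docker inspect` JSON.

    Docker reports HostConfig.Memory as 0 when no limit is set.
    """
    containers = json.loads(inspect_output)
    return containers[0]["HostConfig"]["Memory"]


def describe_limit(limit_bytes):
    """Render the limit in GiB, or flag a container that has no limit."""
    if limit_bytes == 0:
        return "no memory limit set"
    return f"{limit_bytes / 1024**3:.1f} GiB limit"


# Trimmed, hand-written sample of what `docker inspect` returns for a
# container started with --memory=5G (the real output has many more fields).
SAMPLE = '[{"HostConfig": {"Memory": 5368709120}}]'
print(describe_limit(memory_limit_bytes(SAMPLE)))
```

To try it against a real container, the output of something like `docker inspect frigate` could be passed to `memory_limit_bytes` in place of `SAMPLE`.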
secretstorage Posted June 7
On 6/2/2024 at 2:13 AM, huquad said: Edit: Here are the last lines from my syslog. Make anything of them? I'm in the process of researching them myself.
Jun 1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
Jun 1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: WARN Successful completion on short TX
Jun 1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
Jun 1 18:58:13 Falcon kernel: xhci_hcd 0000:02:00.0: Looking for event-dma 00000001042888d0 trb-start 00000001042888e0 trb-end 00000001042888e0 seg-start 0000000104288000 seg-end 0000000104288ff0
Almost identical to my logs from a crash minutes ago. A pattern?
```
Jun 6 18:17:10 Server dhcpcd-run-hooks[29674]: br0: Invalid domain name: .local
Jun 6 18:47:10 Server dhcpcd-run-hooks[31956]: br0: Invalid domain name: .local
Jun 6 19:17:10 Server dhcpcd-run-hooks[2210]: br0: Invalid domain name: .local
Jun 6 19:47:10 Server dhcpcd-run-hooks[4298]: br0: Invalid domain name: .local
Jun 6 20:17:10 Server dhcpcd-run-hooks[5781]: br0: Invalid domain name: .local
Jun 6 20:43:36 Server kernel: usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
Jun 6 20:43:36 Server kernel: usb 2-2: LPM exit latency is zeroed, disabling LPM.
Jun 6 20:47:10 Server dhcpcd-run-hooks[7601]: br0: Invalid domain name: .local
Jun 6 21:17:10 Server dhcpcd-run-hooks[10055]: br0: Invalid domain name: .local
Jun 6 21:47:10 Server dhcpcd-run-hooks[12527]: br0: Invalid domain name: .local
Jun 6 22:17:10 Server dhcpcd-run-hooks[16036]: br0: Invalid domain name: .local
Jun 6 22:47:10 Server dhcpcd-run-hooks[19282]: br0: Invalid domain name: .local
Jun 6 23:17:10 Server dhcpcd-run-hooks[21925]: br0: Invalid domain name: .local
Jun 6 23:47:10 Server dhcpcd-run-hooks[24961]: br0: Invalid domain name: .local
Jun 7 00:17:10 Server dhcpcd-run-hooks[27826]: br0: Invalid domain name: .local
Jun 7 00:26:23 Server kernel: xhci_hcd 0000:03:00.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
Jun 7 00:26:24 Server kernel: xhci_hcd 0000:03:00.0: WARN Successful completion on short TX
Jun 7 00:26:24 Server kernel: xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
Jun 7 00:26:24 Server kernel: xhci_hcd 0000:03:00.0: Looking for event-dma 0000000103b91a90 trb-start 0000000103b91aa0 trb-end 0000000103b91aa0 seg-start 0000000103b91000 seg-end 0000000103b91ff0
Jun 7 02:13:45 Server unassigned.devices: Mounting 'Auto Mount' Devices...
Jun 7 02:13:45 Server unassigned.devices: Error: Device '/dev/sdb1' mount point 'spare' - name is reserved, used in the array or a pool, or by an unassigned device.
Jun 7 02:13:45 Server unassigned.devices: Disk with serial 'ST4000DX001-1CE168_Z30195RK', mountpoint 'spare' cannot be mounted.
Jun 7 02:13:46 Server emhttpd: Starting services...
```
@JorgeB I've now disabled in the config all the cameras that were previously rebooting and causing errors in Frigate, so they no longer appear in the logs. So this may be strictly related to USB resetting? https://github.com/google-coral/edgetpu/issues/166
Edited June 7 by secretstorage
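Since both sets of logs show the same xhci_hcd sequence right before a crash, a small script can make the pattern easier to spot across a long syslog. The following is only a sketch in Python: it counts WARN and ERROR lines from the xhci_hcd driver per PCI address, using a regular expression modelled on the lines quoted in this thread. The function name and the sample list are made up for illustration.

```python
import re
from collections import Counter

# Matches kernel lines such as:
# "Jun 7 00:26:24 Server kernel: xhci_hcd 0000:03:00.0: ERROR Transfer event ..."
XHCI_LINE = re.compile(
    r"kernel: xhci_hcd (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"(?P<level>WARN|ERROR)\b"
)


def count_xhci_events(lines):
    """Count WARN/ERROR xhci_hcd messages per (PCI address, level)."""
    counts = Counter()
    for line in lines:
        match = XHCI_LINE.search(line)
        if match:
            counts[(match["pci"], match["level"])] += 1
    return counts


# Sample lines copied from the syslog excerpt in this thread.
SAMPLE = [
    "Jun 7 00:17:10 Server dhcpcd-run-hooks[27826]: br0: Invalid domain name: .local",
    "Jun 7 00:26:23 Server kernel: xhci_hcd 0000:03:00.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.",
    "Jun 7 00:26:24 Server kernel: xhci_hcd 0000:03:00.0: WARN Successful completion on short TX",
    "Jun 7 00:26:24 Server kernel: xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1",
]

for (pci, level), n in sorted(count_xhci_events(SAMPLE).items()):
    print(pci, level, n)
```

Any iterable of lines works, so an open syslog file can be passed to `count_xhci_events` directly. The PCI address in the result (0000:03:00.0 here) is what identifies which USB controller logged the errors.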
JorgeB Posted June 7
7 hours ago, secretstorage said: This may thus be strictly related to USB resetting?
It's possible.
secretstorage Posted June 7
These look like almost exactly the same errors for two users with completely different vintages and types of equipment. How should we proceed with seeking the Unraid team's help on this? Thank you!
JorgeB Posted June 7
Based on the link posted, it appears to be a kernel issue, so LT likely won't be able to do much. My suggestion would be to try a different USB controller, or report it to the Linux maintainers as a kernel bug.
secretstorage Posted June 7
Just to clarify, would a "different USB controller" be a physical USB hub? I am using 5 different ports on my B450 Tomahawk board, with all but one allocated to the Unraid USB, 3D printer, SDR adapter, webcam and UPS. Should I buy a hub and move the devices to it?
JorgeB Posted June 7
The board may have more than one USB controller. Using a hub with just one port may help; you may need to try more than one, but it could be worth trying.
secretstorage Posted June 7
3 hours ago, JorgeB said: The board may have more than one USB controller. Using a hub with just one port may help; you may need to try more than one, but it could be worth trying.
How would you read this: Looks like there are two and I am currently using both. I couldn't use only the back sockets because devices would fight for allocations and oscillate back and forth ON/OFF.
Edited June 7 by secretstorage
JorgeB Posted June 8
There appear to be two:
03:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset USB 3.1 xHCI Compliant Host Controller [1022:43d5] (rev 01)
Subsystem: ASMedia Technology Inc. 400 Series Chipset USB 3.1 XHCI Controller [1b21:1142]
Kernel driver in use: xhci_hcd
28:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller [1022:149c]
Subsystem: Micro-Star International Co., Ltd. [MSI] Matisse USB 3.0 Host Controller [1462:7c02]
Kernel driver in use: xhci_hcd
So you could try using only one or the other.
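To work out which physical port hangs off which of the two controllers, one option is to look at where each entry under /sys/bus/usb/devices resolves to: the resolved sysfs path contains the PCI address of the owning controller just before the usbN root hub. The Python sketch below shows the idea under that assumption. The helper names are invented for illustration, and the example paths in the comments and tests are hypothetical, not taken from either poster's machine.

```python
import os
import re

PCI_ADDR = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")
ROOT_HUB = re.compile(r"^usb\d+$")


def controller_for(sysfs_path):
    """Return the PCI address of the controller that owns a USB device path.

    Expects a resolved sysfs path, for example (hypothetical):
    /sys/devices/pci0000:00/0000:00:01.2/0000:03:00.0/usb2/2-2
    Returns None if the path contains no PCI address.
    """
    owner = None
    for part in sysfs_path.split("/"):
        if PCI_ADDR.match(part):
            owner = part  # keep the deepest PCI function seen so far
        elif ROOT_HUB.match(part):
            break  # everything after the root hub is USB topology
    return owner


def list_usb_devices(base="/sys/bus/usb/devices"):
    """Yield (device name, controller PCI address) for each USB device."""
    for name in sorted(os.listdir(base)):
        if ":" in name:
            continue  # skip interface entries such as 2-2:1.0
        resolved = os.path.realpath(os.path.join(base, name))
        yield name, controller_for(resolved)
```

Calling `list_usb_devices()` on the server and printing the pairs should show, for each bus-port name such as 2-2, whether it sits behind 0000:03:00.0 or 0000:28:00.3. Plugging a device into a port and rerunning it is a simple way to label the sockets.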
huquad Posted June 14
Since I re-implemented my memory limiter in Frigate, I haven't seen any more crashes (fingers crossed). It's been two weeks, which is much better than I was seeing before. @secretstorage did you ever implement this limiter in your Docker container? Or are you still chasing the USB angle?
huquad Posted June 22
Bad news bears. I just had my first crash since June 1st. I will say my trick seems to have delayed the crash. I'm going to stop passing through one of my motherboard's USB controllers to my VMs and see where that gets me.
huquad Posted June 23
I stopped passing through the USB controller and instead passed through specific devices. Now my host crashed on me while I was actively using it (playing a game on a VM). I noticed my CPU fan doing some weird stuff according to Netdata. It's possible I'm experiencing a thermal crash. I'm going to reinstall my CPU cooler with fresh paste and see where that gets me. I'm also going to make sure I don't have any wires or junk getting caught in the fans that could be stopping them. @secretstorage did you ever find a solution?
bigbangus Posted June 27
Yeah, I don't know what the issue is. I just know that if I move the Google Coral that Frigate uses to my host motherboard's USB controller, it crashes. It got me twice, years apart, lol. I just run it on a separate PCIe USB card and it's been very stable. Mysteries of the universe.
huquad Posted June 28
@bigbangus that's very interesting. I'm desperate so I'll try anything! Haha. Adding that to the list. I'm gonna try a few things and report back if I have success with any of them.
huquad Posted August 4
I'm still experiencing crashes. Did anyone ever find a solution to this? In the meantime, I'm moving my Frigate instance to another Unraid machine on different hardware.
huquad Posted August 10
The saga continues. My other Unraid machine just crashed as a result of Frigate/Coral USB. I tried moving Frigate to a Home Assistant VM, but wasn't able to get the network drive permissions figured out. Moving it back to my original machine as it's less mission critical. @secretstorage have you had any luck?
secretstorage Posted September 14
On 8/10/2024 at 2:22 PM, huquad said: The saga continues. My other Unraid machine just crashed as a result of Frigate/Coral USB. I tried moving Frigate to a Home Assistant VM, but wasn't able to get the network drive permissions figured out. Moving it back to my original machine as it's less mission critical. @secretstorage have you had any luck?
Nope, it's just hit and miss, and it crashes from time to time when USB devices drop out. For the moment mine has been stable for about a month, as I've not used the 3D printer much, nor had much to do with Frigate, but I am sure the day will come. At the moment I am looking at upgrading my rig with an X870E motherboard, so I will have a lot more USB ports under a single controller. Hopefully that will reduce the probability of these crashes, but I'm not entirely sure it will help.
huquad Posted September 18
Damn. Oh well. Mine's been more stable recently too, but that doesn't really mean anything considering it's done this before. For now, I have a way to hard reboot the system externally (UniFi PDU). Not the most elegant, but it works. Long term I think I'll switch to an Intel system to get some Quick Sync benefits. I hear that's a solid alternative to the Coral.