mwasserman

Members
  • Posts

    11

Posts posted by mwasserman

  1. I've tried a few different changes, so far still getting random crashes every 3-6 days.

     

    Here is what I have done and some new information. Can anyone help me make sense of the errors I was able to see on the monitor?

    1. Attached monitor and keyboard so I can see the terminal after crash
    2. Ran Memtest86+ v6.20. Passed 1 round
      1. MemTest.thumb.jpg.94c9b899c890a28efb281a9d3ffba411.jpg
    3. Changed out power supply
    4. Upgraded to 6.12.3
    5. Server ran for 6 days before a complete lockup. Nothing on the monitor or keyboard; Num Lock didn't even work.
    6. Read in other posts that this can be caused by the Duplicati docker. Shut down the Duplicati docker.
    7. Crashed after 2 days, but this time the terminal still worked. Screenshot of the errors:
      1. 588196923_Erroraftercrash2023-09-04.thumb.jpg.7a5061b043c312eafb7eaab098f4d74f.jpg
      2. OCR of errors to make this searchable
        1. Tower login: crond [1420]: exit status 126 from user root /usr/bin/run-parts /etc/cron.hourly 1> /dev/null 
          crond [11850]: unable to exec /usr/sbin/sendmail: cron output for user root /usr/bin/run-parts /etc/cron.hourly 1> /dev/null to /dev/null 
          crond [1420]: exit status 135 from user root /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null 
          crond [1420]: exit status 135 from user root /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null 
          crond [1420]: exit status 135 from user root /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null 
          crond [1420]: exit status 135 from user root /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null 
          crond [1420]: exit status 135 from user root /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null 
          crond [1420]: exit status 135 from user root /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null
          crond [1420]: exit status 135 from user root /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null 
          crond [1420]: exit status 135 from user root /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null
          Hint: Num Lock on
          Tower login: crond [1420]: exit status 135 from user root /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null 
          crond [1420]: exit status 135 from user root /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null
    8. I tried to run "diagnostics" from the command line to collect diagnostics but received "command not found".
    9. I just upgraded to 6.12.4; let's see if that makes any difference.

    Any other suggestions for things to try? My next step may be to roll back to a pre-6.12 version, as everything seems to have gone downhill as of 6.12.x.
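
    For anyone trying to read the crond lines above: this is a minimal sketch of how I would decode the exit statuses, assuming the usual shell convention that a status above 128 means "killed by signal (status - 128)" and assuming Linux x86 signal numbering. The helper name and the small signal table are my own, not anything from Unraid.

```python
# Assumption: Linux x86 signal numbers; only a few common ones are listed.
LINUX_SIGNALS = {6: "SIGABRT", 7: "SIGBUS", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit_status(status: int) -> str:
    """Interpret an exit status using the common shell convention."""
    if status == 0:
        return "success"
    if status == 126:
        return "command found but could not be executed"
    if status == 127:
        return "command not found"
    if status > 128:
        signum = status - 128
        name = LINUX_SIGNALS.get(signum, f"signal {signum}")
        return f"terminated by {name}"
    return f"command-specific error code {status}"


# The two statuses that show up in the OCR'd console output.
for status in (126, 135):
    print(status, "->", describe_exit_status(status))
```

    Under those assumptions, 135 reads as SIGBUS for the monitor script and 126 as run-parts being unable to execute something. That only narrows down what happened, not why.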

     

  2. Hi everyone,

    I've been running Unraid on this Lenovo ThinkServer TS140 for about 6 years without a single issue. As of about 2-3 months ago I've been getting random lockups roughly every 2-3 weeks.

     

    • Unraid 6.12.2
    • Processor: Intel Xeon E3-1246 v3
    • Memory: 32GB ECC
    • Running many dockers and VMs; nothing new was added between the stable period and the random crashes.

     

    Syslog to USB stick was enabled during the last crash. Diagnostics dump and syslog attached.

     

    The last crash occurred sometime between these 2 lines. 
    Aug  3 02:00:38 Tower root: /mnt/cache: 188.6 GiB (202545577984 bytes) trimmed on /dev/sdg1
    Aug  3 18:19:16 Tower kernel: microcode: microcode updated early to revision 0x28, date = 2019-11-12
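
    To put a number on that gap, here is a small sketch that parses the two timestamps. Syslog lines carry no year, so the year is an assumption taken from the diagnostics filename (20230803); the helper name is mine.

```python
from datetime import datetime, timedelta

# Assumption: syslog lines omit the year; 2023 comes from the diagnostics filename.
ASSUMED_YEAR = 2023


def syslog_time(line: str, year: int = ASSUMED_YEAR) -> datetime:
    """Parse the fixed-width 15-character timestamp at the start of a syslog line."""
    stamp = line[:15]
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


last_before = "Aug  3 02:00:38 Tower root: /mnt/cache: 188.6 GiB (202545577984 bytes) trimmed on /dev/sdg1"
first_after = "Aug  3 18:19:16 Tower kernel: microcode: microcode updated early to revision 0x28, date = 2019-11-12"

# Length of the window in which the crash must have happened.
window = syslog_time(first_after) - syslog_time(last_before)
print(window)
```

    The second line is already from the next boot, so this is only an upper bound on when the lockup happened, not the time of the crash itself.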

     

    I'm in the process of running Memtest86+ v6.20 now to see if anything comes up.

     

    Any help to figure out what is going on here is much appreciated.

    tower-diagnostics-20230803-1828.zip syslog

  3. 15 hours ago, digiblur said:

    You don't need to pass in the device to the container.  It really is as simple as loading the plugin to get the drivers going and then telling Frigate to use it.  

     

    detectors:
      coral_pci:
        type: edgetpu
        device: pci

    After reading this I really had high hopes this was everything I was doing wrong... No luck 😞

     

    Still getting this error in the Unraid System Log

    Mar 28 14:37:37 Tower kernel: eth0: renamed from veth8c00d88
    Mar 28 14:37:51 Tower kernel: apex 0000:03:00.0: RAM did not enable within timeout (12000 ms)
    Mar 28 14:37:51 Tower kernel: apex 0000:03:00.0: Error in device open cb: -110
    Mar 28 14:38:59 Tower kernel: veth8c00d88: renamed from eth0

    The eth0: renamed error is really strange and new. Not sure if this is related to trying to use the PCIe Coral at all. Not giving up yet but putting in my order for a USB Coral (they look to be backordered for 2+ months). 
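
    For reference, the "-110" in the apex line can be decoded with a sketch like this. It assumes the kernel convention of reporting a negated errno and Linux errno numbering; the regex, the table, and the function name are my own.

```python
import re

# Assumption: Linux errno values; only a few that seem relevant are listed.
LINUX_ERRNO = {5: "EIO", 16: "EBUSY", 19: "ENODEV", 110: "ETIMEDOUT"}

APEX_ERROR = re.compile(
    r"apex (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"Error in device open cb: -(?P<code>\d+)"
)


def decode_apex_error(line: str):
    """Return (pci_address, errno_name) for an apex open error, or None if no match."""
    m = APEX_ERROR.search(line)
    if m is None:
        return None
    code = int(m.group("code"))
    return m.group("pci"), LINUX_ERRNO.get(code, f"errno {code}")


line = "Mar 28 14:37:51 Tower kernel: apex 0000:03:00.0: Error in device open cb: -110"
print(decode_apex_error(line))
```

    So the open failure looks like a timeout, which matches the "RAM did not enable within timeout" line just before it.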

     

    EDIT:

    Adding insult to injury, I pulled the PCIe to m-PCIe adapter and Coral card out of my Unraid box and put it into my Windows box. Tried the example on https://coral.ai/docs/m2/get-started/#4-run-a-model-on-the-edge-tpu and it worked perfectly. Now at least I know the Coral m-PCIe card and adapter are working correctly. It's just a matter of figuring out why Unraid won't handle the card correctly. 

     

    I tried to pass the card through to a VM on my Unraid box, but Unraid refuses to list the card as available for pass-through. I bound the IOMMU group (just the one Coral card) to VFIO, but the card will not show up under "Other PCI Devices".

  4. 13 hours ago, ich777 said:

    I think from what I see something is preventing your Unraid box on boot from connecting to the internet itself, have you set a custom DNS or something like that in Unraid itself?

    This got me into an interesting debug path. I had Unraid set up on a bonded network (802.3ad), and it appears that this kind of network either comes up after the plugins try to load, or ends up in a race condition with plugin loading, so the network isn't up in time. I've removed the bonded network and now the plugin loads after every reboot. Thank you for pointing me in the correct direction.
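
    A generic way around this kind of race is to poll until the network is reachable before doing anything that needs it. The sketch below only illustrates the idea; it is not how Unraid or the plugin actually handles startup, and the function names, the example.com host, and the timeouts are all assumptions of mine.

```python
import socket
import time
from typing import Callable


def network_is_up(host: str = "example.com", port: int = 443,
                  connect_timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (placeholder host)."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout_s):
            return True
    except OSError:
        return False


def wait_until(probe: Callable[[], bool], timeout_s: float = 60.0,
               interval_s: float = 2.0,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Hypothetical usage: only continue once the bonded link is actually passing traffic.
# if wait_until(network_is_up):
#     ...  # start whatever needs the internet
```

    Passing the probe and sleep functions in as arguments keeps the waiting logic easy to try out without a real network.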

     

    I now get a new error that looks to be purely a Google Coral issue:

    Quote

    Mar 24 12:52:10 Tower kernel: x86/PAT: frigate.detecto:29004 map pfn RAM range req uncached-minus for [mem 0x6f1c4c000-0x6f1c4ffff], got write-back
    Mar 24 12:53:13 Tower kernel: apex 0000:06:00.0: RAM did not enable within timeout (12000 ms)
    Mar 24 12:53:25 Tower kernel: apex 0000:06:00.0: RAM did not enable within timeout (12000 ms)
    Mar 24 12:53:25 Tower kernel: apex 0000:06:00.0: Error in device open cb: -110

    I'm going to do some searching and ask in the Google Coral forums to see if this is a known problem.

  5. 7 hours ago, ich777 said:

    I already built it for Unraid 6.9.1 Kernel v5.10.21 otherwise the Plugin won't work on this Unraid version.

    Thank you for correcting me. There goes that idea of why it isn't working.

     

    7 hours ago, ich777 said:

    EDIT: I think I got what is wrong here, you passed over a path to the container but it's a device:

    I double checked my mapping and it is a device. Removed and recreated it just to be 100% sure. Same results as before, "No EdgeTPU detected". I was really hoping I had missed this and you were right. The search goes on.

    650146000_CoralAccelerationModule.jpg.c8877c3616455a01e229c755ae5640a2.jpg

     

    7 hours ago, ich777 said:

    This shouldn't happen, have you a active internet connection on boot or better speaking have you anything like PiHole or a VM that is your Firewall on your Unraid box?

    I do run pfBlockerNG on pfSense (not in a VM). Checked the logs on pfBlockerNG and didn't see it blocking anything from my Unraid box.

     

    7 hours ago, ich777 said:

    The Diagnostics (Tools -> Diagnostics -> Download -> drop the downloaded file here in the textbox) from a reboot with previously installed Coral Accelerator Module Drivers would be very helpful to troubleshoot why it disappear.

    Have you looked into the Plugins tab in Unraid if there is a Plugin in Error state and removed that in the first place and installed it afterwards?

    No plugin is in an error state. This is my Plugins page just before a reboot; I had just installed the Coral module driver.

    503089511_Pluginsbeforereboot.thumb.jpg.987f7bc9601aa28651d76901fca77655.jpg

     

    Now after a reboot (I collected the Diagnostics Logs at this point)

    435186043_Pluginsafterreboot.thumb.jpg.dd254d005a0bb4fad6cce972e03a065e.jpg

    tower-diagnostics-20210324-0917.zip

  6. After seeing the great progress that was being made to get the Mini PCIe Coral working, I bought one with an adapter to try my luck. It's not going as smoothly as I had hoped. Maybe someone here can point me in the correct direction or suggest next steps to help debug.

    • Unraid 6.9.1
    • Adapter I am using: Ableconn PEX-MP117 Mini PCI-E to PCI-E Adapter Card
    • Card correctly shows up in Unraid
      • 2036078921_IOMMUgroup.jpg.1c2de417055a338304d4ce226ba6d918.jpg
    • I have installed "Coral Accelerator Module Drivers"
    • From the terminal, running the command below returns output suggesting the card is correctly installed
    root@Tower:~# ls /dev/apex_0
    /dev/apex_0
    • I have also checked lsmod and can see apex and gasket loaded
    root@Tower:~/apex/packages# lsmod
    Module                  Size  Used by
    apex                   16384  0
    gasket                 90112  1 apex

     

    • I have the card passed through to the Frigate container
      • tpu_mapping.jpg.583621319abb37f9f07ecffa32226817.jpg
    • My Frigate container works great with CPU processing, so I believe my configuration is good, but things break when I switch to:
    detectors:
      coral:
        type: edgetpu
        device: pci
    • After an Unraid system restart, Frigate starts once, reports that it found the EdgeTPU, and then soon crashes. After that, every time I start the container I get the following errors:
    * Starting nginx nginx
    ...done.
    Starting migrations
    peewee_migrate INFO : Starting migrations
    There is nothing to migrate
    peewee_migrate INFO : There is nothing to migrate
    detector.coral INFO : Starting detection process: 41
    frigate.app INFO : Camera processor started for living_room: 44
    frigate.edgetpu INFO : Attempting to load TPU as pci
    frigate.app INFO : Camera processor started for kitchen: 46
    frigate.edgetpu INFO : No EdgeTPU detected.
    Process detector:coral:
    frigate.app INFO : Camera processor started for garage: 47
    frigate.app INFO : Camera processor started for backyard: 49
    frigate.app INFO : Capture process started for living_room: 50
    frigate.app INFO : Capture process started for kitchen: 52
    frigate.app INFO : Capture process started for garage: 57
    frigate.app INFO : Capture process started for backyard: 59
    frigate.mqtt INFO : MQTT connected
    Traceback (most recent call last):
    File "/usr/local/lib/python3.8/dist-packages/tflite_runtime/interpreter.py", line 152, in load_delegate
    delegate = Delegate(library, options)
    File "/usr/local/lib/python3.8/dist-packages/tflite_runtime/interpreter.py", line 111, in __init__
    raise ValueError(capture.message)
    ValueError
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
    File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
    File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    File "/opt/frigate/frigate/edgetpu.py", line 124, in run_detector
    object_detector = LocalObjectDetector(tf_device=tf_device, num_threads=num_threads)
    File "/opt/frigate/frigate/edgetpu.py", line 63, in __init__
    edge_tpu_delegate = load_delegate('libedgetpu.so.1.0', device_config)
    File "/usr/local/lib/python3.8/dist-packages/tflite_runtime/interpreter.py", line 154, in load_delegate
    raise ValueError('Failed to load delegate from {}\n{}'.format(
    
    ValueError: Failed to load delegate from libedgetpu.so.1.0
    
    
    frigate.watchdog INFO : Detection appears to have stopped. Exiting frigate...
    frigate.app INFO : Stopping...
    frigate.record INFO : Exiting recording maintenance...
    frigate.object_processing INFO : Exiting object processor...
    frigate.events INFO : Exiting event processor...
    frigate.events INFO : Exiting event cleanup...
    frigate.watchdog INFO : Exiting watchdog...
    frigate.stats INFO : Exiting watchdog...
    peewee.sqliteq INFO : writer received shutdown request, exiting.
    root INFO : Waiting for detection process to exit gracefully...
    watchdog.backyard INFO : Terminating the existing ffmpeg process...
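
    The host-side check from earlier in this post (are apex and gasket loaded?) can be scripted. This is a minimal sketch that parses lsmod text, using the output pasted above as sample input; the function name is mine.

```python
def loaded_modules(lsmod_output: str) -> dict:
    """Map module name -> list of modules that use it, parsed from `lsmod` text."""
    modules = {}
    for line in lsmod_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        # The fourth column, when present, is a comma-separated list of users.
        used_by = fields[3].split(",") if len(fields) > 3 else []
        modules[fields[0]] = [u for u in used_by if u]
    return modules


# Sample input copied from the lsmod output in this post.
SAMPLE = """\
Module                  Size  Used by
apex                   16384  0
gasket                 90112  1 apex
"""

mods = loaded_modules(SAMPLE)
missing = [m for m in ("apex", "gasket") if m not in mods]
print("missing:", missing)
print("gasket used by:", mods["gasket"])
```

    Both modules being present only shows that the host side loaded the driver. It does not say whether the container can actually open /dev/apex_0, which is where the traceback above fails.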

     

    Final questions

    • Why does Unraid look to be seeing the EdgeTPU but the container can't talk to it?
    • Is there a way to keep the "Coral Accelerator Module Drivers" between reboots? It looks to go away after every Unraid reboot.

     

    After many more hours of this I think it just comes down to the driver being for the wrong kernel.

     

    @ich777 Any chance you can build the Coral PCI driver for Unraid 6.9.1 (kernel 5.10.21)? Thank you in advance!

  7. Just want to point out 2 issues I ran into and how I solved them after updating to 6.9.1

     

    My br0 network is an 802.3ad bonded pair with bridging enabled. After the first reboot, any docker container that was using br0 stopped working. To solve this I ran the following 2 lines from the terminal console:

    rm /var/lib/docker/network/files/local-kv.db
    /etc/rc.d/rc.docker restart
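
    If you would rather keep a copy of the file before deleting it, something like this sketch could stand in for the rm line. The function name and the timestamped .bak naming are my own choices, and Docker still needs the restart from the second line afterwards.

```python
import shutil
import time
from pathlib import Path


def backup_then_remove(db_path: Path) -> Path:
    """Copy db_path next to itself with a timestamped .bak name, then delete the original."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = db_path.with_name(f"{db_path.name}.{stamp}.bak")
    shutil.copy2(db_path, backup)  # copy2 also preserves file metadata
    db_path.unlink()
    return backup


# Hypothetical usage on the file named in this post:
# backup_then_remove(Path("/var/lib/docker/network/files/local-kv.db"))
```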

     

    Virtual Machine "VNC Remote" from within the web browser stopped working with a "SyntaxError: The requested module '../core/util/browser.js" error

     

    Clearing Chrome "Cached images and files" fixed this

  8. Hi all, I've been using Unraid for about a year now without any major issues. I looked into my log the other day and started to notice many of the following warnings.

    Mar 31 23:52:09 Tower kernel: BTRFS error (device sdg1): parent transid verify failed on 620778586112 wanted 16531233 found 15373503
    Mar 31 23:52:09 Tower kernel: BTRFS error (device sdg1): parent transid verify failed on 620778586112 wanted 16531233 found 15373503
    Mar 31 23:52:09 Tower kernel: BTRFS error (device sdg1): parent transid verify failed on 620778586112 wanted 16531233 found 15373503
    Mar 31 23:52:09 Tower kernel: BTRFS error (device sdg1): parent transid verify failed on 620778586112 wanted 16531233 found 15373503
    Mar 31 23:52:10 Tower kernel: BTRFS error (device sdg1): parent transid verify failed on 620778586112 wanted 16531233 found 15373503
    Mar 31 23:52:10 Tower kernel: BTRFS error (device sdg1): parent transid verify failed on 620778586112 wanted 16531233 found 15373503
    Mar 31 23:52:10 Tower kernel: BTRFS error (device sdg1): parent transid verify failed on 620778586112 wanted 16531233 found 15373503
    Mar 31 23:52:10 Tower kernel: BTRFS error (device sdg1): parent transid verify failed on 620778586112 wanted 16531233 found 15373503

    In my case sdg is an SSD cache drive in an array of 2 cache drives. I'm assuming sdg1 and sdg refer to the same drive. Is this correct?

    I've been googling around and found many posts pointing to bad SATA cables and suggesting running a scrub. I've run the scrub operation and it reports "no errors found". What are my next steps in trying to fix this warning?
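
    To make the log lines above easier to compare, here is a small sketch that pulls out the device, block address, and the wanted/found transaction IDs, and splits a name like sdg1 into disk and partition number. It assumes the usual Linux sd-style naming, where sdg1 is partition 1 on disk sdg. The regex and function names are mine.

```python
import re
from collections import Counter

TRANSID = re.compile(
    r"BTRFS error \(device (?P<dev>\w+)\): parent transid verify failed on "
    r"(?P<bytenr>\d+) wanted (?P<wanted>\d+) found (?P<found>\d+)"
)


def split_partition(dev: str):
    """Split an sd-style name like 'sdg1' into ('sdg', 1); partition is None for a whole disk."""
    m = re.fullmatch(r"(sd[a-z]+)(\d*)", dev)
    if m is None:
        return dev, None
    return m.group(1), int(m.group(2)) if m.group(2) else None


def summarize(lines):
    """Count identical (device, block, wanted, found) errors, ignoring timestamps."""
    counts = Counter()
    for line in lines:
        m = TRANSID.search(line)
        if m:
            counts[(m["dev"], int(m["bytenr"]), int(m["wanted"]), int(m["found"]))] += 1
    return counts


# Eight copies stand in for the eight log lines above, which differ only in timestamp.
log = ["Mar 31 23:52:09 Tower kernel: BTRFS error (device sdg1): parent transid verify "
       "failed on 620778586112 wanted 16531233 found 15373503"] * 8

for (dev, bytenr, wanted, found), n in summarize(log).items():
    print(split_partition(dev), bytenr, "gap:", wanted - found, "seen:", n)
```

    All eight lines turn out to be the same block on the same partition, so this looks like one stale metadata block being hit repeatedly rather than many different failures.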

     
