Report Comments posted by samsausages

  1. On 7/1/2023 at 8:33 AM, DrSpaldo said:

    I was hoping @samsausages may have had some luck tracking down the issue? 


    I have not found the root cause. I do know it's docker, but I haven't spent the time tracking it down because I just used a temporary fix and was hoping it gets worked out.  I figured it's related to the docker changes in Unraid and will be patched eventually.

    I found these temporary fixes in another thread:

     

    Solution 1

    If you have the scripts plugin installed, you can use this command, adapted for the log files:

     

    find /run/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/ -maxdepth 999999 -noleaf -type f -name "log.json" -exec rm -v "{}" \;

     

    This runs on my server every 24 hours and I have yet to see the issue since. The Docker daemon will recreate the log to make sure it keeps logging the health status of your applications - this applies to any docker application that has a health status showing.

     

    Solution 2

    The other option is to remove the health check of the docker image that is running by adding this to the container settings:

     

    Extra Parameters: --no-healthcheck
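
    For context, Unraid appends whatever you put in Extra Parameters to the docker run command it builds, so the plain-docker equivalent would look roughly like this (the image and container names below are just placeholders):

    # run the container with its built-in HEALTHCHECK disabled, so it stops writing health results to log.json
    docker run -d --name my-container --no-healthcheck my-image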

     

    Solution 3

    The other option is to increase the size of your tmpfs /run folder with the command below, but at some point that will still fill up. This command sets it to 85MB from the default 32MB:

     

    mount -t tmpfs tmpfs /run -o remount,size=85M
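
    A quick way to confirm the remount took, and to see how full /run is at any point, is plain df (nothing Unraid-specific here):

    # show the size and current usage of the tmpfs mounted on /run
    df -h /run

    Note the remount does not survive a reboot, so it would have to be re-applied (e.g. from the go file or a user script) if you rely on it.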

  2. On 7/5/2023 at 12:03 PM, dopeytree said:

    Did we figure out the cause?


    I stopped using ZFS Master, as I rarely use it.  The developer seems to think it's related to the ZFS Master plugin refreshing info when you go to the tower/Main page.
    I have not been able to figure out if it's exclusive to the 9300 chipset, but I do find a lot of people who are using that chipset and reporting this issue with ZFS Master.

  3. 36 minutes ago, csimmons said:

    Unwriteable array


    Check to see if your tmpfs is filling up.  Use df -h and look for tmpfs.  Keep track of the available space over a few hours.  If it is shrinking, then something is filling it up, most likely docker.  I have another post in this thread about it.
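
    If you don't want to babysit a terminal, a rough sketch like the loop below (run from a screen session or the User Scripts plugin) logs a snapshot every 10 minutes; the log path is just an example, point it anywhere persistent:

    # append a timestamped snapshot of /run usage to a log file every 10 minutes
    while true; do
      echo "$(date) $(df -h /run | tail -1)" >> /mnt/user/appdata/run-usage.log
      sleep 600
    done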

  4. On 5/22/2023 at 2:20 AM, casperse said:

    No But I do have a VM with Chrome having open tabs to Unraid?
    I ended up having to cut the power (Not even SSH terminal could get access) so I rolled back to the stable version.
    Everything works fine now? (Same open window on my VM.)

     

    I had this issue; something was filling up my tmpfs folder (it only exists in memory and is limited to something like 20MB - once it filled up I got the error and the crash).

    Most likely there is a docker container, VM or other service filling your logs and causing the crash once the log fills up.  It could also be a health check filling the log, or some other type of temp file being saved by a docker/VM.

    You can do a df -h and look for tmpfs; you'll probably find that the available space goes to 0 when you start having the issues.

    You can run this command to manually clean the logs:

    find /run/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/ -maxdepth 999999 -noleaf -type f -name "log.json" -exec rm -v "{}" \;
    


    Or you can increase the tmpfs size:
     

    mount -t tmpfs tmpfs /run -o remount,size=85M



    But this doesn't fix the problem, it's a Band-Aid.  The real solution is finding what is filling it up and fixing that.
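
    To see exactly what is eating the space, a du pass over /run, sorted largest first, usually points straight at the culprit (standard tools, nothing Unraid-specific):

    # list the 20 largest files/directories under /run
    du -ah /run 2>/dev/null | sort -rh | head -20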

    I made a thread over here and started figuring out what it was towards the end of the thread:

     

  5. So I have a workaround for this issue, at least for public browsing.  I still can't access password-protected shares.

     

    1. Open Windows Credential Manager and create a credential for user "root" for your server, with the path, i.e. //tower
    2. Try to browse //tower; you'll be told "Access Denied" and asked to put in a password
    3. Choose to log in with a different user
    4. Use the username "Anonymous" and leave the password blank.

     

    You can probably also create the credential for "anonymous", but this is how I managed to get the login dialog to pop up, so that's the path I took.

  6. 18 minutes ago, bonienl said:

    If you get a login prompt it means the destination is reachable but credentials are not accepted.

    Try clearing your windows credentials.

     

    Yeah, I already tried that.  I also added credentials to see if that would work.
    I don't get the login prompt; in Windows I get an "Unspecified error" prompt, and sometimes I get a "this user can't sign in because this account is currently disabled" prompt.
    [screenshot of the Windows error]



    From Ubuntu I get the login prompt.

    I also tried enabling guest login on Windows.

    I tried from both my laptop and workstation, with the IP address and with the hostname.

  7. So does the crash happen because docker/VM/system services are generating log entries that eventually fill up the tmpfs directory mounted in memory?

    That's one of the scenarios I have encountered where the system works fine for hours or days, but then crashes when tmpfs gets filled up.

    FYI for people who are trying to troubleshoot similar crashes: keep an eye on tmpfs storage space by doing a df -h and looking for tmpfs.  If it fills up on you, then you know something is throwing a lot of errors and you need to find out what it is.
    The most common culprits are system logs, docker containers, web server data and KVM temporary files.
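
    If docker is the suspect, you can size up the health-check logs first (same path as the cleanup command earlier in this thread) before deleting anything:

    # show how much space each container's health-check log.json is using
    find /run/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/ -type f -name "log.json" -exec du -h {} + 2>/dev/null | sort -rh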

  8. I think I have traced this to something filling up the tmpfs mounted on /run.
    I haven't figured out what is filling it yet, but it looks like the issues begin when tmpfs fills up.

     

    Going to try to trace down the exact cause of it filling up this weekend.

  9. Kernel command-line parameters are typically controlled by the bootloader, so you would have to add it to syslinux.cfg on the USB.
    I don't usually have to mess with that and would consider it unusual.  But the steps you would take are:

     

    Backup your USB drive!!

    Edit syslinux.cfg, located at the root of the USB drive that Unraid boots from (the file is called syslinux.cfg, or syslinux.cfg-).

    Locate the append line, which typically starts with append initrd=/bzroot.  This is where you add the extra kernel parameter: after the existing parameters on the append line, add a space and then systemd.unified_cgroup_hierarchy=0

    It would then look like this:

    label Unraid OS
      menu default
      kernel /bzimage
      append initrd=/bzroot systemd.unified_cgroup_hierarchy=0



    This is coloring outside of the lines for me and probably not best practice, so I can't tell you what would actually happen, or whether it is supported by Limetech, as changing the cgroup hierarchy can affect all containers running on the system.

  10. It is happening again right now, so I can provide more info & logs.  Boy, I hope it helps narrow it down, as it's requiring a reboot every few days.
    Check point #7 below, when I stop the array.
    If I had to guess, it's related to the nchan webgui device list process; it looks like there are a lot of them and they keep building up.

    I went several days and it was working fine.  I ran extensive tests on my individual devices & the pool, in my workstation and in the server, writing varying types of data 10x to each individual drive & then another 10x when configured as a ZFS pool, each time covering the full size of the disks & verifying checksums.  I feel pretty good about my hardware being solid.  I suspect the SMART warning was logged because Unraid became unresponsive and the drive ran into what it saw as a driver error.


    While I haven't been able to recreate the issue on demand, I can expand on the symptoms and provide logs while it is happening:
    1. CPU usage keeps increasing until it hits 100% and freezes the system.  It seems like more threads get utilized as time goes on, but I'm not sure what triggers it.  It can sit at 40% usage for some time, then suddenly go to 100%.
    2. The "Main" page causes 100% browser CPU usage on the client.  The MB/s speed info on the page updates several times per second.
    3. On the "Main" page you cannot click on any of the disks, browser links etc. to bring up Attributes or SMART info.  (Other pages are fine, and I can still "stop" the array from the Main page.  Putting the URL for the disk attributes directly into the browser works; it's just the links on the Main page that are broken.)
    4. No errors in the log specific enough for me to know what's happening.

    5. Usually there is no single process that accounts for the 40-100% CPU usage, but frequently I do see 'lsof' make the list, sometimes more than once.  It's reflected in the htop process list below.  But there are lots of nchan device list processes.
    6. Stopping Docker has no effect.
    7. Stopping the array does!  CPU usage goes down, but is still high at about 10%.  Upon stopping the array, the log is flooded with the errors below:

    May 13 16:18:09 bertha nginx: 2023/05/13 16:18:09 [crit] 253614#253614: ngx_slab_alloc() failed: no memory
    May 13 16:18:09 bertha nginx: 2023/05/13 16:18:09 [error] 253614#253614: shpool alloc failed
    May 13 16:18:09 bertha nginx: 2023/05/13 16:18:09 [error] 253614#253614: nchan: Out of shared memory while allocating message of size 26778. Increase nchan_max_reserved_memory.
    May 13 16:18:09 bertha nginx: 2023/05/13 16:18:09 [error] 253614#253614: *10550928 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/devices?buffer_length=1 HTTP/1.1", host: "localhost"
    May 13 16:18:09 bertha nginx: 2023/05/13 16:18:09 [error] 253614#253614: MEMSTORE:00: can't create shared message for channel /devices


    8. Restarting the array brings the issue back immediately, requiring a reboot.  I guess I could try restarting specific services like nginx, but I haven't tried that (see the sketch just below this list).

    9. No new SMART warning logged; I believe this is because I rebooted before the CPU went to 100% and the system became totally unresponsive.
    10. I'm uploading 2 sets of logs.  #1 is from the server running with high CPU usage but no errors flooding the log.  #2 is with the array stopped and the log getting flooded.
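
    For anyone who wants to try bouncing nginx instead of rebooting: Unraid is Slackware-based and ships rc scripts, so something like the line below should restart the webgui's nginx, though I haven't verified that it clears the nchan shared-memory state:

    # restart the Unraid webgui's nginx without rebooting the server
    /etc/rc.d/rc.nginx restart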

    Example of the top 10 processes by CPU usage:
    [screenshot of top output]


    Example of htop:
    [screenshot of htop output]
     

    babel-diagnostics-20230506-1546.zip bertha-diagnostics-20230513-1621.zip

  11. I have seen it fail to format when an existing partition exists on the disk.
    1. Remove the disks from the pool.
    2. Use the "Unassigned Devices" plugin to delete the partition on the drive.
    3. Re-add the disks to the pool and format.
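
    If you'd rather do it from the command line instead of the plugin, wipefs will clear the old filesystem/partition signatures.  Triple-check the device name first - this is destructive, and /dev/sdX below is only a placeholder:

    # erase all filesystem and partition-table signatures from the disk
    wipefs -a /dev/sdX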

  12. 1 hour ago, JorgeB said:

    All filesystems mounted by Unraid, including zfs, are mounted with the noatime option, so it should always be disabled:

     

    May 11 09:27:10 Tower7 emhttpd: shcmd (77): /usr/sbin/zfs mount -o noatime cache

     


    I just created a test share (dataset) on a pool using the Unraid "Shares" GUI, set to use the zfs pool exclusively.  This is the result of "zfs get atime" for the newly created dataset:

    [screenshot of zfs get atime showing atime=on for the new dataset]

    When I check the log to see how the pool is mounted at array start, I can confirm that the pool is mounted with 'noatime', as follows:
     

    Quote

    May 11 06:23:08 bertha emhttpd: shcmd (7997749): /usr/sbin/zfs mount -o noatime workpool


    But the datasets I create using Unraid do show atime=on for the dataset itself.
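
    For anyone wanting to check or correct this on their own pool, the standard zfs commands apply; workpool is my pool name and "test" is just the example dataset from above:

    # show the atime setting for the pool and every dataset under it
    zfs get -r atime workpool
    # turn atime off for a specific dataset
    zfs set atime=off workpool/test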

  13. I would say 71C is a bit on the high side, but it looks like you have a 13900K, so that isn't bad for that chip and is well within spec.
    It's a good sign that it didn't crash, but that probably doesn't make your search for answers much easier.  My gut says it's still memory related, be it the memory itself, the motherboard or the memory controller on the CPU.
    This may be a long shot as well, but I had an older X99 system that didn't like my memory settings on Auto; I had to go in and set the timings and voltage manually in the BIOS for it to work properly.  And it was tough to troubleshoot because the errors were rare and intermittent.
     

  14. How are your CPU temperatures looking while testing?  You can also install the corefreq plugin, run corefreq-cli, go to Tools > Atomic Burn and do a CPU burn test, similar to Prime95.

    You've got a tough one there to track down, but the advice so far about what to look at first is good.
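
    If you want to watch temperatures while the burn test runs, and the sensors command is available on your system (it may come with one of the temperature plugins), something like this in a second terminal works:

    # refresh temperature readings every 5 seconds during the burn test
    watch -n 5 sensors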

  15. Are you using a Dockerfile to build your image and docker-compose.yml to get the container up?
    Or do you have an image and are using the Unraid GUI?

    Verify that your container's entry point and/or command is still correctly configured.


    Is there a startup script (such as an init script, or a script specified in the CMD or ENTRYPOINT)?  If yes, verify that the script is in the same location, that it is executable, and that it starts correctly.  It could also be defined in the Dockerfile build.

    You mentioned that there is nothing in the Docker log for this container.  However, it's still a good idea to double-check by running docker logs <container_name_or_id> to see if there are any error messages or useful information.
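
    A quick way to see the entry point and command the container was actually created with (the name in angle brackets is a placeholder):

    # print the configured ENTRYPOINT and CMD for the container
    docker inspect --format '{{json .Config.Entrypoint}} {{json .Config.Cmd}}' <container_name_or_id>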

  16. On 5/6/2023 at 5:43 AM, husqnz said:

    Will do. Just waiting to see if it recurs after I did a bit of a clean-up of unused plugins. If it does, I'll try safe mode and see if I can get it to recur.

    Are you still having this problem?  I also got this error, reported at the link below.  But in my case I also had 100% CPU usage and a SMART error on an NVMe drive.

    I don't know if the two are related or if it's a coincidence.
    I'm not getting the error right now. 
     

     

  17. 1 hour ago, bbrodka said:

    Thanks for posting solution, I too had the same trouble and uninstalled “ZFS Master" Plugin and It  fixed spindowns too.

    Glad it helped you find which plugin is causing it!
    I have been trying to find out what is actually causing this, because it is unexpected behavior.  I'm wondering if it's hardware or a BIOS setting.
    Are you by chance using an LSI HBA card?  Mind sharing some of your hardware specs?  I'm trying to find common denominators.

    Thanks!

  18. @JorgeB So I have been doing lots of testing... I had 2 SMART errors pop up, 1 each time this happened & Unraid went to 100% CPU usage (requiring a reboot).

    SMART error is as follows:
     

    Quote

    Entry 0:

    ErrCount (Error Count): 2

    SQId (Submission Queue Identifier): 4

    CmdId (Command Identifier): not available

    Status: 0xc00c

    PELoc (Parameter Error Location): not available

    LBA (Logical Block Address): 0

    NSID (Namespace Identifier): not available

    VS (Vendor Specific): not available
     

    Entry 1:

    ErrCount (Error Count): 1

    SQId (Submission Queue Identifier): 8

    CmdId (Command Identifier): not available

    Status: 0xc00c

    PELoc (Parameter Error Location): not available

    LBA (Logical Block Address): 0

    NSID (Namespace Identifier): not available

    VS (Vendor Specific): not available

     

    Deep diving into the error, status code 0xc00c indicates a "Completion Queue Invalid" error.  Reading up on it, it sounds like this error typically occurs when there's an issue with the completion queue head pointer or doorbell.  It might be a firmware or driver issue, or even a hardware problem with the NVMe drive.
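
    For reference, these entries look like the drive's error-information log; anyone wanting to pull the same data can use either of the commands below, assuming smartctl and/or the nvme-cli package are available and nvme0 is the right device:

    # dump full SMART data, including the error log, for an NVMe drive
    smartctl -a /dev/nvme0
    # or read just the error log entries
    nvme error-log /dev/nvme0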

    I removed the drive from the system and ran several full SMART checks on a Windows system, with no issues (2x).
    I then used the Intel utility to perform several full diagnostic read/write tests on the drive (3x).

    I then re-installed the drive in my Unraid server.  I wrote a script that copies files to the drive and runs a battery of read/write tests.
    So far I have not been able to recreate the error/issue, and I have completed 4 full drive writes & checksum tests on the written data.  I have not seen the error pop back up.

    So I'm still a bit unsure what caused it; I'm going to tweak my script a bit to throw more variety at the drive.  The one thing I'm doing differently than before is that I had the "docker" folder on this drive and set docker to use the ZFS file system (not an image).  I haven't tested whether the ZFS docker file system integration has anything to do with it.

    But I wanted to give an update and see if you had anything to add to it.

  19. I just had one of my cache pools disconnect as a failed drive, but a reboot brought it right back.  So I'm going to test my drives thoroughly, as a few of them are new and I have only had them in there for a few days with little utilization.  I'm going to take them out and run some tests on my desktop to eliminate them as a variable.
