Nuuki

Everything posted by Nuuki

  1. In the end I switched to a Docker image - it doesn't necessarily have anything to do with the root cause, but it's resolved things for me for now.
  2. I noticed that a couple of containers were Stopped. When I try to start them I get the following: docker: Error response from daemon: Conflict. The container name "/Rallly" is already in use by container "fc8e9affcd6dc1be9853b7aec51893d1af5a5fd8f6dcdd8fd4f557d71c5e2144". You have to remove (or rename) that container to be able to reuse that name. I tried deleting and recreating the image but no luck. A quick search on the forums indicated a possible corrupted image, however in my case I'm using a docker directory (zfs). Is there a way I should be removing just this image? (There's a command sketch for clearing the stale container at the end of these posts.) server-diagnostics-20231031-1400.zip
  3. Ah right - I switched to a (btrfs) folder about a year ago so as not to worry about sizing, but if performance takes a hit it's not a major issue to switch back to an image and just increase the size if needed.
  4. Understood. I'm using a ZFS pool, and for Docker I'm using a folder with ZFS. It's running OK now - the real test will be to re-enable auto-start and bounce docker. I may try that later, but I'll see if it runs stably for a day or so before I try that. Thanks for your help.
  5. Well, I've been able to bring all the containers online. However, whenever I started one the CPU would spike much more than I would expect - over 60% for around 30 seconds - before dropping back down. If I started 2 or 3 at once it would go much higher. Maybe it always does that, but given that there'd usually be 50 or 60 starting in quick succession, it stood out a bit. I also noted that the ZFS resource meter was usually at 85%+. I'd note that I did migrate my cache pool from btrfs to zfs at the weekend, though it went fine and I'd had no issues until today - thought it was worth noting anyway. So right now it's all back up and none of the containers seemed individually problematic, but if I turned container auto-start back on and bounced docker, I suspect I might well see the same symptoms again.
  6. OK I've done that and the CPU is looking normal. Shall I enable containers one by one and see if/when it goes crazy? I did change some container volume mounts earlier, but all seemed fine at the time. It seems feasible that that's caused a problem somewhere I guess.
  7. @JorgeB I rebooted in the end as I couldn't stop the array. After it came back up I was able to stop all running containers, but dockerd is still pinned at 100% CPU.
  8. I previously tried stopping the array (which is a USB drive) and though CPU dropped, it looks like it's been stuck trying to actually stop the array for some time. When I check the logs I see a bunch of these log blocks (a command sketch for this is at the end of these posts):
     Oct 18 17:46:19 Server emhttpd: Unmounting disks...
     Oct 18 17:46:19 Server emhttpd: shcmd (2345): /usr/sbin/zpool export cache
     Oct 18 17:46:19 Server root: cannot unmount '/var/lib/docker/zfs/graph/f9d3a0840bae10b5cb2ff414bab9a4d047e4acd9101f69ba1566b21406728928-init': unmount failed
     Oct 18 17:46:19 Server emhttpd: shcmd (2345): exit status: 1
     Oct 18 17:46:19 Server emhttpd: Retry unmounting disk share(s)...
  9. Good question, I could shut each container down in turn I guess. Let me try to do that and see what it does.
  10. I was doing some server admin earlier when my CPU spiked to 100% and sat there. It looks like dockerd is the cause, but when I run docker stats nothing stands out - I did kill any containers using a substantial level of CPU, but no change. I've rebooted, but as soon as docker fires up it pins the CPU at 100% again. It has dropped a few times, but the GUI isn't responsive enough for me to shut docker down, get to "Fix Common Problems", logs etc. I was able to generate a diag file though, eventually. (There's a container-by-container sketch for this at the end of these posts.) server-diagnostics-20231018-1654.zip
  11. For sure. I realise I'm pushing things a bit, but there's no data on the drive itself, and as I understand it it's the only viable way to run Unraid as a container / VM server. Maybe another hypervisor solution would be better, but I really like Unraid. So if it gives me the occasional headache it's not a huge issue, so long as the core is stable and reliable (which it is).
  12. "Interestingly" I was unable to Stop the array. Even after I've removed the array USB drive, it's still showing as Online in the GUI, Docker is still running etc. I'm slightly nervous about rebooting, but I'm guessing whatever state it's gotten into is the cause of the error. I'll probably leave it as is for now, and once I have a new USB 2.0 drive I'll reboot and see what happens.
  13. I'm using a Dell Optiplex 7060 Micro, which sadly doesn't seem to have any USB 2.0 ports. I can try a different flash drive though - should I be using a USB 2.0 one?
  14. I've started getting the error "Unable to write to disk1" with the suggested fix "Drive mounted read-only or completely full." Disk 1 is a USB drive that I only have connected in order to run an array - I don't store anything on it, but it's needed to run the Docker and VM engines, which is what I use the server for. Anyway it's only about 5% full, so it's hard to see why that would be the issue. I saw some comments in the linked FAQ that it's often linked to Docker. In my case I'm using a Docker directory rather than an image, and it's on the cache pool (which again has plenty of space). So I'm slightly unsure what to check for next. I've attached diags - any help much appreciated. server-diagnostics-20230911-1451.zip
  15. I've swapped it out so I'll see if that makes any difference. As you say, even if it's not the root cause it's clearly a problem.
  16. Ah right - so should I just pick up a new drive? Easy enough fix if that's the issue, and I clearly want to address it before it gets any worse.
  17. @JorgeB No significant progress on this issue - I did cut back on plug-ins, but today I've had back-to-back system instability, even after a reboot. Checking the logs at the time, these errors stood out as potentially problematic, and not something I'd seen before (hence the post):
     May 15 01:02:44 Server kernel: traps: lsof[6947] general protection fault ip:1486821704ee sp:a468894e97de6910 error:0 in libc-2.36.so[148682158000+16b000]
     May 15 11:39:30 Server kernel: nvme0: Admin Cmd(0x2), I/O Error (sct 0x0 / sc 0x13)
     May 15 11:39:30 Server kernel: nvme0: Admin Cmd(0x2), I/O Error (sct 0x0 / sc 0x13)
     After the reboot I see these errors:
     May 15 12:12:17 Server kernel: serial 0000:00:16.3: Couldn't register serial port 30a0, irq 17, type 0, error -28
     May 15 12:13:20 Server kernel: I/O error, dev sda, sector 4251598 op 0x0:(READ) flags 0x84700 phys_seg 2 prio class 0
     May 15 12:15:32 Server root: error: /webGui/include/InitCharts.php: wrong csrf_token
     I've since enabled a remote syslog server which will hopefully help. Both cache drives are new, but is it possible they're causing the problem? server-diagnostics-20230515-1249.zip
  18. Thanks. I run nginx proxy manager as a docker container - do the errors relate to that, or something else running natively on the host? I have a bunch of plug-ins running, though nothing that isn't pretty popular. What are the implications of running in Safe Mode? Happy to give it a go, though as it's quite an intermittent problem I'd likely need to leave it for a week or so to have a sense of whether it helped. So whether that's viable mainly depends on what's disabled in Safe Mode.
  19. I'm continuing to get this issue intermittently. Yesterday I rebooted due to the server going into this partial failure state, and this morning I'm getting the same issue. Although the server is stable other than this, I'm clearly anxious that there's an underlying fault which will only get worse. Given that I was having similar issues last year, my suspicion is that maybe the USB drive running as the array is failing - the two cache drives have both been replaced recently along with the server, so the USB drives used to boot Unraid and run the array are the only elements that remain. Any ideas on what I can try? server-diagnostics-20230502-1621.zip
  20. I've been having some stability issues with my Unraid setup lately. I actually upgraded the server recently, which I assumed would probably resolve it, but it hasn't, so I'm keen to get to the bottom of it, as I replaced pretty much everything in the process. I use the server purely to run containers - all my storage is on an external Synology, accessed via NFS using Unassigned Devices. I run a pair of SSDs (replaced during the migration) in a RAID1 cache pool using UD, and I have a USB drive running an array, purely to allow things like Docker to actually run. The obvious symptoms I get are: the Dashboard stops showing certain data - notably Processor and Memory info is blank; the Main screen is blank; and I'm unable to launch the command line terminal - when I press the button a window appears but remains blank. Other things accessed via the UI don't necessarily work either. For instance, whilst I can generate a Diagnostic file, it won't download within the UI. Despite this, containers continue to run and I'm able to SSH in. Generally I reboot it via SSH and all is good for a while. I'm getting this intermittently - maybe once a week? So it's tricky to pin down. I'm wondering if this could relate to the USB drive running the array, as that and the boot USB are the only items that carried over from the previous server. I've attached diags so hoping that may provide some clues. I won't reboot the server in case there's anything I can try, as once I reboot it'll probably be fine for a while. server-diagnostics-20230224-1626.zip
  21. I'm also seeing N/A running the latest version. Running intel_gpu_top from the command line I am seeing expected results. gpustat.txt intel_gpu_top.txt gpustat.cfg
  22. OK - the new drive showed up at my door at the perfect moment, and I was able to swap it into the pool without issues. It's just finished balancing so I'm now back in business with two brand new drives in the pool. I had some errors related to the new pool having a different name, but it was easy enough to delete the original pool and rename the new one to match, which seems to have resolved that. So I'm now done (properly this time). Thank you so much for handholding me through this - I had all the important data backed up, but starting from a blank cache and copying it seemed like it was likely to throw up some unexpected issues, so let me know where I can go to send you something to say thanks.
  23. OK I'm running the scrub now. So I can add the Lexar to the pool - do I need to do anything to enable redundancy on the pool, as that's what I'm after rather than pooled capacity? Also a replacement SATA drive should be with me in the next couple of hours. I assume that once the Lexar is in the pool and mirrored, I can then simply remove the 850 from the pool, swap the drives and add the new drive? (There's a sketch of the underlying btrfs steps at the end of these posts.) EDIT: OK I've added the drive, and though it initially showed a 1TB pool that has now dropped to 500GB, so I assume it is indeed RAID1. The balance is running, so I was planning to let that finish, and then remove the 850 from the pool, freeing up space for the new SATA. If that's not correct let me know, but it feels like I'm almost there...
  24. Done. I created a new pool with the 850, so the old pool is still listed (though empty). I did get a SMART warning when I added the 850, so fingers crossed it's usable enough to get the data onto the new Lexar. EDIT: My Docker containers have all fired up, so it seems to have read the pool data without issue... server-diagnostics-20230203-0924.zip
  25. warning, device 1 is missing
      using SB copy 1, bytenr 67108864
      Label: none  uuid: 6ce62ad6-66d5-4b0e-b758-9dcf26483095
          Total devices 1 FS bytes used 1.29GiB
          devid 1 size 28.64GiB used 2.52GiB path /dev/md1
      warning, device 1 is missing
      Label: none  uuid: b5041f3c-3118-401e-8673-8b11b6547171
          Total devices 2 FS bytes used 179.86GiB
          devid 2 size 465.76GiB used 238.03GiB path /dev/sdc1
      *** Some devices missing
      To confirm, the 850 and (new) Lexar are both installed, but neither is added to the pool - both show up under Unassigned Devices. (There's a degraded-mount sketch for the missing-device pool at the end of these posts.)
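
A few command sketches referenced in the posts above follow - rough outlines rather than exact steps taken.

For the name conflict in post 2, a minimal sketch of clearing the stale container. The container ID and the /Rallly name are taken from the error message quoted in that post:

    # List all containers (including stopped ones) whose name matches Rallly
    docker ps -a --filter "name=Rallly"

    # Remove the stale container by ID, then recreate it from its template
    docker rm fc8e9affcd6dc1be9853b7aec51893d1af5a5fd8f6dcdd8fd4f557d71c5e2144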
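
For the stuck unmount in post 8, a sketch of what could be checked before retrying the export. It assumes Unraid's stock rc.docker script and the pool name cache from the log:

    # Stop the Docker service so its zfs graph datasets can be unmounted
    /etc/rc.d/rc.docker stop

    # Show which datasets under the cache pool are still mounted
    zfs list -r -o name,mounted cache

    # Retry the export once nothing under /var/lib/docker is still mounted
    zpool export cache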
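
For the dockerd CPU issue in posts 9 and 10, a sketch of stopping containers one at a time while watching usage - the 10-second pause is an arbitrary choice:

    # One-off snapshot of per-container CPU/memory, instead of the streaming view
    docker stats --no-stream

    # Stop running containers one at a time, checking dockerd usage between each
    for c in $(docker ps -q); do
      docker stop "$c"
      sleep 10
      top -b -n 1 | grep dockerd
    done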
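
For the redundancy question in post 23, a hedged sketch of the underlying btrfs steps (Unraid's GUI normally handles this; /dev/sdX1, /dev/sdY1 and /mnt/cache are placeholders, not values from my system):

    # Add the new device to the mounted pool
    btrfs device add /dev/sdX1 /mnt/cache

    # Convert data and metadata to RAID1 so both devices hold a full copy
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache

    # After the balance completes, remove the old device from the pool
    btrfs device remove /dev/sdY1 /mnt/cache

    # Check the resulting layout
    btrfs filesystem usage /mnt/cache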
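
And for the "Some devices missing" output in post 25, a cautious sketch of reading data off the surviving member; /mnt/recovery is just an example mountpoint:

    # Mount the remaining device read-only in degraded mode so data can be copied off
    mkdir -p /mnt/recovery
    mount -o degraded,ro /dev/sdc1 /mnt/recovery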