dapiedude Posted September 2, 2020 (edited)

Hi all,

I've been running UnRAID for about a year now. My current uptime is 25 days. Last night around 10 pm MT (90 minutes after my parity check started), all of my Docker containers suddenly stopped working. Here is everything odd going on with the server, which I'm guessing is due to the cache drive not being accessible (or something similarly wrong with the cache drive).

-- I checked Fix Common Problems and have these two issues:
Error Found: Unable to write to cache. Suggested Fix: Drive mounted read-only or completely full
Error Found: Unable to write to Docker Image. Suggested Fix: Docker Image either full or corrupted.

-- On my dashboard, I see my cache as available, SMART shows the healthy thumbs up, and utilization is 44% [453 GB / 1 TB] (which is normal, as I have Plex writing thumbnails to the cache). So nothing seems wrong there.

-- When I click on my cache drive and scroll to the Self-Test section, I can download the SMART report (attached), but the SMART self-test history, error log, short self-test, and extended self-test buttons are grayed out. "Last SMART test result" shows a SPIN UP button with the note "Unavailable - disk must be spun up." Clicking it takes me back to the top of the page and nothing else seems to happen; the SPIN UP button is still there.

-- When I go into the web terminal and type ls /mnt I see:
/bin/ls: cannot access '/mnt/user': Input/output error
/bin/ls: cannot access '/mnt/cache': Input/output error
My other disks are listed as normal.

-- In Settings --> Docker, I have just disabled Docker, so the status is stopped. I see the following:
Docker vDisk location: /mnt/user/system/docker/docker.img (orange triangle with a white exclamation mark) Path does not exist
Default appdata storage location: /mnt/user/appdata/ (orange triangle with a white exclamation mark) Path does not exist
When I try to re-enable Docker, the Docker tab reappears in the top bar, but the message reads "Docker Service failed to start."

-- When I click the Shares tab at the top, I see:
There are no exportable user shares
There are no exportable disk shares

-- When I click Main and view the contents of each individual array device, all of my array disks still have their folders and data. However, the cache disk shows 0 objects: 0 directories, 0 files.

-- The parity check is still running and is about 92% done. It has found 0 errors so far.

-- Other information:
The cache drive is a 1 TB M.2 NVMe.
The file system on the cache drive is XFS.

I believe that's just about everything I've looked at. I haven't rebooted the server, since it is in the middle of the parity check. I haven't deleted / recreated my Docker image, since the problem seems to be related to the cache drive (even though the Docker problems are what tipped me off). I've attached the diagnostics as well as the SMART report for the cache drive.

Thanks in advance. I'm not really sure where to start with this, and any help / advice / pointers are appreciated!

tower-diagnostics-20200902-1017.zip
tower-smart-20200902-1033.zip

Edited September 2, 2020 by dapiedude
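The `ls /mnt` symptom above can also be checked with a short script. This is a minimal Python sketch, assuming it runs on the server itself; the function name is hypothetical, and "EIO" is simply the errno behind the "Input/output error" message that ls printed. It stats every entry under /mnt and, for directories, tries to list them.

```python
import errno
import os
import stat


def check_mounts(root="/mnt"):
    """Return {entry: status} for every entry directly under root.

    status is "ok" if the entry can be stat()ed (and, for directories, listed),
    "EIO" for an Input/output error like the one `ls /mnt` reported, or the
    symbolic errno name for any other OSError.
    """
    results = {}
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        try:
            st = os.stat(path)
            if stat.S_ISDIR(st.st_mode):
                os.listdir(path)  # a dropped device may only fail on read
            results[name] = "ok"
        except OSError as exc:
            results[name] = errno.errorcode.get(exc.errno, "unknown")
    return results
```

On a server in the state described above, one would expect "user" and "cache" to come back as "EIO" while the diskN entries report "ok", but that is an expectation, not something verified against this system.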
trurl Posted September 2, 2020

Cache was not giving SMART data in those diagnostics, which suggests it was disconnected. There are also some things about your Docker configuration that should probably be reconsidered, but I can't tell the full extent of that, since user shares are broken without cache. Since the parity check is nearly complete, let it finish, then shut down, reconnect cache, reboot, and post new diagnostics.
JorgeB Posted September 2, 2020

NVMe dropped offline:
Sep 1 22:15:36 Tower kernel: nvme nvme0: I/O 92 QID 1 timeout, aborting
Sep 1 22:15:36 Tower kernel: nvme nvme0: I/O 224 QID 2 timeout, aborting
Sep 1 22:15:36 Tower kernel: nvme nvme0: I/O 116 QID 3 timeout, aborting
Sep 1 22:15:36 Tower kernel: nvme nvme0: I/O 90 QID 6 timeout, aborting
Sep 1 22:16:06 Tower kernel: nvme nvme0: I/O 92 QID 1 timeout, reset controller
Sep 1 22:16:37 Tower kernel: nvme nvme0: I/O 1 QID 0 timeout, reset controller
Sep 1 22:17:39 Tower kernel: nvme nvme0: Device not ready; aborting reset

There are some reports that the newer kernel in the latest Unraid beta helps with this on Ryzen boards.
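To spot this pattern in a syslog, here is a minimal Python sketch that assumes log lines follow the exact layout quoted above. The regex and function names are hypothetical, not an Unraid tool. It counts I/O timeouts and controller resets, and treats "aborting reset" as a heuristic sign that the driver gave up on the device.

```python
import re

# Assumed syslog layout, taken from the lines quoted above:
# "Sep 1 22:15:36 Tower kernel: nvme nvme0: I/O 92 QID 1 timeout, aborting"
NVME_LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"nvme\s+(?P<dev>nvme\d+):\s+(?P<msg>.*)$"
)


def nvme_events(lines):
    """Yield (timestamp, device, message) for each NVMe driver line."""
    for line in lines:
        m = NVME_LINE.match(line.strip())
        if m:
            yield m.group("ts"), m.group("dev"), m.group("msg")


def summarize(lines):
    """Count timeouts and controller resets, and flag an aborted reset."""
    counts = {"timeouts": 0, "resets": 0, "reset_aborted": False}
    for _ts, _dev, msg in nvme_events(lines):
        if "timeout" in msg:
            counts["timeouts"] += 1
        if "reset controller" in msg:
            counts["resets"] += 1
        if "aborting reset" in msg:
            # Heuristic: the driver gave up, so the device is likely offline.
            counts["reset_aborted"] = True
    return counts
```

Assuming the live log is at /var/log/syslog, it could be fed in with `with open("/var/log/syslog") as f: print(summarize(f))`. Note that this only recognizes the messages shown in this thread; other NVMe failure modes may log differently.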
dapiedude Posted September 2, 2020 (edited)

@trurl - I'll post new diagnostics when it's finished; there's about an hour left. Thanks!!

@JorgeB - does this suggest that it dropped offline because of a hardware disconnect, or is this a software drop? Also, can I reconnect the dropped NVMe from the Tower GUI, is it something a reboot should fix, or do I need a full shutdown and then a disconnect + reconnect of the NVMe? Thanks a lot

Edited September 2, 2020 by dapiedude (accidentally pressed publish)
dapiedude Posted September 2, 2020

@trurl - I just reconnected the cache to the same M.2 port, and my new diagnostics are attached. It is showing up as an unassigned drive right now. Is it possible to make it the cache drive again without formatting / wiping the contents?

tower-diagnostics-20200902-1455.zip
dapiedude Posted September 2, 2020

Answered my own question! I remounted the NVMe drive, stopped the array, assigned the NVMe drive as the cache drive, and now we're good to go! Thank you very much @trurl and @JorgeB for your help.
trurl Posted September 3, 2020

Did your dockers/VMs come back? In the diagnostics you posted, I noticed your domains and system shares were on disk4, where I assume they were recreated while you didn't have cache. You may have some additional things to clean up.
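One way to see which physical disks ended up holding a copy of a share is to look for a top-level folder with that name on each disk mount. This is a minimal Python sketch, assuming the usual layout where array disks are mounted at /mnt/diskN and the cache at /mnt/cache. The function name is hypothetical, and /mnt/user is skipped because it is the merged view of all of them.

```python
from pathlib import Path


def share_locations(share, mnt="/mnt"):
    """Return the disk/cache mount points that hold a top-level folder for share.

    Assumes array disks at <mnt>/diskN and the cache pool at <mnt>/cache.
    The pattern "disk[0-9]*" deliberately excludes <mnt>/disks, which is
    where Unassigned Devices mounts drives.
    """
    root = Path(mnt)
    candidates = sorted(root.glob("disk[0-9]*")) + [root / "cache"]
    return [str(c) for c in candidates if (c / share).is_dir()]
```

For example, share_locations("system") or share_locations("domains") showing anything besides /mnt/cache would point to leftovers like the ones on disk4. Check what those folders contain before deleting anything.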
dapiedude Posted September 3, 2020

Everything came back without issue. I checked all of my dockers and VMs, and everything is running fine. I removed all the Docker stuff I could find that wasn't on my cache!
trurl Posted September 3, 2020

I noticed in earlier diagnostics that you had allocated 80G for docker.img. Have you had problems filling it? 20G should be more than enough, and making it larger won't fix anything; it will just make it take longer to fill. I am running 17 dockers and they are using less than half of 20G. The usual reason for filling docker.img is an application writing to a path that doesn't correspond to a container path in the mappings. Common mistakes are not using the same upper/lower case as in the mappings, or using a relative path.
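If you want to watch how full docker.img is from a script instead of waiting for the notification, here is a minimal Python sketch. It assumes docker.img is loop-mounted at /var/lib/docker while the Docker service is running, so the usage of that filesystem is the usage of the image. The mount point is an assumption and the function name is hypothetical.

```python
import shutil


def docker_img_usage(mountpoint="/var/lib/docker"):
    """Return (used_gib, total_gib, percent_used) for the filesystem at mountpoint.

    Assumes docker.img is loop-mounted at mountpoint, so these figures
    describe the image file's filesystem rather than the cache drive.
    """
    usage = shutil.disk_usage(mountpoint)
    gib = 1024 ** 3
    return usage.used / gib, usage.total / gib, 100 * usage.used / usage.total
```

Run this periodically. Steady growth while no new containers are added is the kind of bloat described above, which usually comes from an application writing outside its mapped paths.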
dapiedude Posted September 3, 2020

I hadn't filled up the 20G, but I was at around 90% of it and kept getting notifications that docker.img was close to full. Honestly, rather than find that notification and turn it off, I lazily just upped it to 80G. Currently I'm using 23.4G with 30 dockers. I just checked the mappings for each of them, and they all seemed correct. I've turned the allocation down to 30G, though, so that I can watch for bloat! Is there a good way to see how much space each of my dockers is using within docker.img? I have cAdvisor, but that isn't helpful when looking at the overall Docker allocation. Thanks for helping me with all of this and for thinking ahead too!
trurl Posted September 4, 2020

On 9/3/2020 at 9:31 AM, dapiedude said: "I just checked the mappings for each of them"

It isn't so much about the mappings. It is about whether the application is using the container paths in the mappings or some other paths instead. For example, suppose you have a mapping with a container path of "/download", but the application is using the path "/Download". Since Linux is case-sensitive, these are different paths. Or the application might be using "download" instead of "/download". Since "download" is a relative path, it will be relative to something inside docker.img.

On 9/3/2020 at 9:31 AM, dapiedude said: "see how much space each of my dockers are using"

Use the Container Size button at the bottom of the Docker page.
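The two mistakes described above can be illustrated with a small Python sketch. It checks whether a path an application writes to falls under one of the container paths in its mappings. The function name and the cwd parameter (the application's working directory) are hypothetical, and this is a simplified model, not how Docker itself resolves mounts; for example, it ignores symlinks.

```python
import posixpath


def lands_in_mapping(app_path, container_paths, cwd="/"):
    """Return the mapped container path that app_path falls under, or None.

    None means the write stays in the container's own filesystem, which on
    Unraid lives inside docker.img. Comparison is case-sensitive, as on Linux.
    Relative paths are resolved against cwd, the app's working directory.
    """
    full = posixpath.normpath(posixpath.join(cwd, app_path))
    for mapped in container_paths:
        mapped = posixpath.normpath(mapped)
        # Match the mapping itself or anything below it, but not "/downloads"
        # when the mapping is "/download".
        if full == mapped or full.startswith(mapped.rstrip("/") + "/"):
            return mapped
    return None
```

With mappings ["/download", "/config"], "/download/file.iso" resolves to the mapping, "/Download/file.iso" does not, and "download/file.iso" only lands in the mapping if the application's working directory happens to be "/". With any other working directory, such as "/app", the write goes into docker.img.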