Dockers Stopped. Unable to write to Docker Image Error - Diagnostics Attached

Canes · May 19, 2022

Hi.

So this is the second time this has come up for me with a docker image error - first time was a few months back I believe and UNRAID was unresponsive for a few days and then magically came back and gave up figuring out why it happened since it was working fine again.

Attached is my diagnostic file.

Fix Common Problems is throwing up 2 errors -

Unable to write to cache Drive mounted read-only or completely full.

Unable to write to Docker Image Docker Image either full or corrupted.

Screenshot below showing error when trying to start a docker and available space on the drives. Wanted to reach out as I am not an UNRAID expert and didn't want to just delete docker image turning dockers on and off unless that was the best first step. Didn't seem to work last time for me when this issue cropped up few months ago and I tried doing that after seeing it in video.

Any help is appreciated.

jeffunraid-diagnostics-20220519-1821.zip

Squid · May 19, 2022

This is the root of all your problems:

Apr  8 18:23:04 JEFFUNRAID emhttpd: shcmd (187): mkdir -p /mnt/cache
Apr  8 18:23:05 JEFFUNRAID emhttpd: shcmd (188): mount -t btrfs -o noatime,space_cache=v2 /dev/nvme1n1p1 /mnt/cache
Apr  8 18:23:05 JEFFUNRAID kernel: BTRFS info (device nvme1n1p1): using free space tree
Apr  8 18:23:05 JEFFUNRAID kernel: BTRFS info (device nvme1n1p1): has skinny extents
Apr  8 18:23:05 JEFFUNRAID kernel: BTRFS info (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 3760, gen 0
Apr  8 18:23:05 JEFFUNRAID kernel: BTRFS info (device nvme1n1p1): enabling ssd optimizations
Apr  8 18:23:05 JEFFUNRAID kernel: BTRFS info (device nvme1n1p1): start tree-log replay
Apr  8 18:23:05 JEFFUNRAID emhttpd: shcmd (189): mkdir -p /mnt/plexvms
Apr  8 18:23:05 JEFFUNRAID emhttpd: shcmd (190): mount -t btrfs -o noatime,space_cache=v2 /dev/nvme0n1p1 /mnt/plexvms
Apr  8 18:23:05 JEFFUNRAID kernel: BTRFS info (device nvme0n1p1): using free space tree
Apr  8 18:23:05 JEFFUNRAID kernel: BTRFS info (device nvme0n1p1): has skinny extents
Apr  8 18:23:05 JEFFUNRAID kernel: BTRFS info (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 195, gen 0
Apr  8 18:23:06 JEFFUNRAID kernel: BTRFS info (device nvme0n1p1): enabling ssd optimizations

Each of the cache pools has corruption.

In the case of BTRFS, this can usually be traced back to memory issues. In your case (this is not necessarily your problem though), you are running an significant overclock on your memory by running it at XMP speeds. G-Skill (along with most manufacturers) advertises and sells their memory kits running at XMP speeds (3600), but they are actually selling you 2133 memory. Disable XMP profile in the BIOS and instead run it at SPD speeds. All overclocks introduce instability.

Completely up to you, but I'm also not a fan of using BTRFS for cache pools if you have no intension of making those multi-device pools (you have 2 pools with a single device only). XFS is far more forgiving of a filesystem and causes less problems if the system is less than 100%. This requires you however to reform that pools

@JorgeB will be able to help you to recover the information on the pools properly

Canes · May 20, 2022

Thanks @Squid for the prompt reply here. I didn't even realize it was "overclocking" the memory here. Thought I was running just standard speeds since I didn't adjust anything in BIOS for memory speeds when I first put this together - goes to show what I thought I knew so appreciate the education on that point. Doing a reboot and adjusting those settings for XMP should allow me to get it back up enough to try and copy data off the cache and plexvms drives I have?

I also didn't even catch the differences in formats on the drives with btrfs - if I wanted to reformat those two drives to xfs like the data drives how complicated of a process is that and what are chances I can't recover data or lose data? Can I plug in an external USB hard drive and back up files through a Krusader docker assuming I can get dockers back up. The cache and plexvms drives are two separate devices btw - unless I'm misunderstanding your comment about 2 pools with a single device only. Each is a 2TB nvme drive (one a 970 EVO Plus and the other a 980 Pro).

Are there any good guides/youtube videos I could watch to help walk me through the process?

EDIT: Is it weird that I did a shut down command from the main webUI, it reports that the system is powered down but the server never actually powered off like it usually would. Do I need to just wait or do I need to do a hard shutdown of holding power button? I did the power down command right after posting my original reply (7 or so minutes ago). Seems like it's thinking too... Normal CPU idle temps are <40-42* C and it's hovering between 51-52* C now per the display on the mobo.

Edited May 20, 2022 by Canes
update on situation

Canes · May 20, 2022

I waited about two hours - never shut down on its own. Had to down a hard shutdown which I didn't want to do.

Booted back up and dockers are running again (plex playing movies) after disabling XMP in BIOS. Parity check is running now.

62 Errors came up with one of the disks in my array now within 5 minutes of starting array.

From Fix Common Problems:

disk5 (ST12000NE0008-2PK103_ZS805EQV) has read errors If the disk has not been disabled, then Unraid has successfully rewritten the contents of the offending sectors back to the hard drive. It would be a good idea to look at the S.M.A.R.T. Attributes for the drive in question

Will need to see what happens or what it finds (if more) once parity check completes. Thankfully I have the two parity drives and warranty with data recovery with Ironwolfs if I did need to return it for RMA.

Attached latest Diagnostic file if additional help.

jeffunraid-diagnostics-20220520-0004.zip

JorgeB · May 20, 2022

Disk5 read errors are logged as a disk problem, you should run an extended SMART test.

Also goo idea to run memtest on the RAM.

Canes · May 20, 2022

Thanks @JorgeB

Do I need to wait for parity check to finish to run SMART test?

If so, I will run extended SMART test on Disk 5 soon as parity finishes and then when that completes I will do memtest on RAM.

I will record results and post here. Parity check has 26 hours left it says (28.0% done now).

Edited May 20, 2022 by Canes

JorgeB · May 20, 2022

26 minutes ago, Canes said:

Do I need to wait for parity check to finish to run SMART test?

Yes, also disable spin down for that disk before doing it.

Canes · May 21, 2022

Parity check finished and found 0 errors it said. Ran extended SMART report and it reported I think the report says 1 error. Attached is latest Disagnostic file and SMART report.

Will run memtest next but have to run out for a while so may not be uploaded until later tonight.

jeffunraid-diagnostics-20220521-1441.zip jeffunraid-smart-20220521-1438.zip

trurl · May 21, 2022

# 1  Extended offline    Completed: read failure       90%      4888         3228944

replace

itimpi · May 21, 2022

Also

  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    3712
197 Current_Pending_Sector  -O--C-   100   100   000    -    8
198 Offline_Uncorrectable   ----C-   100   100   000    -    8

are not good signs as ideally these should all be 0.

As @trurl said replace

Canes · May 21, 2022

I ordered a new replacement drive on Amazon. Will be here tomorrow while I then deal with RMA process.

Found this guide for replacing the data drive: https://wiki.unraid.net/Replacing_a_Data_Drive Good one to follow?

Is disk5 failing why I'm having issues with dockers stopping (they are stopped again now) and replacing disk5 should solve that?

trurl · May 21, 2022

Unrelated, but your appdata share has files on both pools. Not sure if that is what you intend or not. Mover ignores any pool not designated for the share, and nothing can move open files.

trurl · May 21, 2022

You have found a very old wiki.

https://wiki.unraid.net/Manual/Storage_Management#Normal_replacement

trurl · May 21, 2022

4 minutes ago, Canes said:

Is disk5 failing why I'm having issues with dockers stopping (they are stopped again now) and replacing disk5 should solve that?

Looks like corrupt docker.img, nothing to do with disk5.

https://wiki.unraid.net/Manual/Docker_Management#Re-Create_the_Docker_image_file

https://wiki.unraid.net/Manual/Docker_Management#Re-Installing_Docker_Applications

Canes · May 21, 2022

I don't think I did intend for any appdata to be on the data drives. Isn't appdata supposed to live on cache drive (2TB 970)? I only wanted Plex data files to be on the 2nd nvme (2TB 980 named plexvms) since I wasn't sure initially how big it would get I added more data drives/content and I don't currently have any VMs installed.

trurl · May 21, 2022

Your appdata is configured to prefer plexvms. It has files on both plexvms and cache.

Canes · May 23, 2022

Hi - sorry for delayed reply. Birthday weekend so busy with festivities to remember I'm getting older.

Memtest is running now - already throwing a bunch of errors. Screenshot attached. When this happens that means you need new RAM? Any recommendations to go along with my board and CPU combo? I can also try to RMA this G. Skill stuff if that is better vs buying new stuff but don't know if 1 or all sticks are bad to RMA? Hate to be down because of RAM of all things lol.

Have the replacement 12tb ironwolf pro for disk5 ready to go in (delivered yesterday) - reading the link you sent me I don't need to have both drives connected for the swap? Just remove failing disk5 and put new one in it's place and unraid will rebuild it once it starts? All my data and parity drives are 12TB Ironwolf Pros so this should be normal replacement?

Quote

This is a normal case of replacing a failed drive where the replacement drive is not larger than your current parity drive(s).

It is worth emphasizing that Unraid must be able to reliably read every bit of parity PLUS every bit of ALL other disks in order to reliably rebuild a missing or disabled disk. This is one reason why you want to fix any disk-related issues with your Unraid server as soon as possible.

To replace a failed disk or disks:

Stop the array.

Power down the unit.

Replace the failed disk(s) with a new one(s).

Power up the unit.

Assign the replacement disk(s) using the Unraid webGui.

Click the checkbox that says Yes I want to do this and then click Start.

When you start the array in normal mode after replacing a failed disk or disks, the system will reconstruct the contents onto the new disk(s) and, if the new disk(s) is/are bigger, expand the file system. If you start the array in Maintenance mode you will need to press the Sync button to trigger the rebuild.

Edited May 23, 2022 by Canes

JonathanM · May 24, 2022

5 minutes ago, Canes said:

Memtest is running now - already throwing a bunch of errors. Screenshot attached.

Have the replacement 12tb ironwolf pro for disk5 ready to go in

Do not allow writes to any drives until your memory test is totally clean. Bad RAM will cause data corruption.

Canes · May 24, 2022

7 minutes ago, JonathanM said:

Do not allow writes to any drives until your memory test is totally clean. Bad RAM will cause data corruption.

thanks for quick reply. So don’t try to install replacement drive until I replace RAM? Buy new RAM being quickest option since sounds like I need to RMA all 4 sticks I have in there now?

JonathanM · May 24, 2022

1 minute ago, Canes said:

don’t try to install replacement drive until I replace RAM

This.

Everything a computer does uses RAM. Bad RAM means unpredictable results, with only a small chance of things working correctly.

3 minutes ago, Canes said:

Buy new RAM being quickest option since sounds like I need to RMA all 4 sticks I have in there now?

Since you have 64GB, I'd remove 32 and run memtest again, it's possible the full load of 4 sticks is causing issues. Typically if you have a bad stick of RAM, it's just one stick that needs to be replaced. Run memtest with pairs until you narrow down what's happening with each stick. It's also possible none of the sticks are bad, but your motherboard needs tweaks to run stable with all 4 populated.

Regardless of how you do it, don't continue using the system until you get an extended run of memtest with 0 errors. Preferably 12+ hours.

If your time is more valuable than the computer parts, start firing the parts cannon and get stuff on order, starting with RAM and possibly more. I deal with clients on both ends of the spectrum in my business, some want to spend as little money as possible and are OK with extended downtime, others just want the issue fixed now, even if it means replacing everything in the box.

Canes · May 24, 2022

I think I found my bad stick. Got lucky it was narrowed down quickly.

Removed DIMMs 2 & 4 leaving in DIMMs 1 & 3. Ran memtest again and errors within 5 mins. Powered down, replaced DIMM 1 with DIMM 2 and errors within 5 mins. Powered down, swapped out DIMM 3 for DIMM 1 and running DIMM 1 & 2 now and no errors so far after 10 minutes. Hopefully that easy but will let it run 12+ hours as suggested to be safe before I try and swap out disk5.

Can also now say it is possible to remove a memory stick from DIMM 1 slot around a Noctua NH-D15S without removing it from the CPU. Barely enough room! Luckily didn't mess up any of the aluminum fins either. Felt like a game of operation when I was a kid but instead of a buzzer sound it's a cut knuckle and dollar signs.

I think I will take this downtime after memtest completes to install the custom wiring I ordered from cablemod a while back as a present to myself. May be a server but doesn't mean it can't look good at the same time...

UPDATE: 30 minutes in and no errors so far with combination of DIMMS 1 & 2. DIMM 3 seems to be the culprit.

Edited May 24, 2022 by Canes
update on situation

JonathanM · May 24, 2022

21 minutes ago, Canes said:

I think I found my bad stick. Got lucky it was narrowed down quickly.

Great!

21 minutes ago, Canes said:

May be a server but doesn't mean it can't look good at the same time...

Just be careful not to stress the SATA connections. What I mean by this, is that if you magically removed the drive, the cable should naturally stay in the same spot it was when it was attached to the drive. Any force in any direction on the SATA connector is likely to cause issues eventually as vibration and other stress causes the connector to move around. If you unplug the SATA connector it should try to plug itself back in when you remove your hand.

Canes · May 24, 2022

Crap... just looked at the screen before going to bed and now DIMMs 1 & 2 have an error after 3.5 hours - too bad since hoped it was just the one stick but now appears to be two.

Testing now with DIMMs 2 & 4. DIMMs 1 & 3 both reported errors in combination with another stick (DIMM 2) and were from the same package. DIMMs 2 & 4 were from the second set of 2x16GB sticks I purchased and so far 7 minutes in no errors. Will let this continue to run over night and update folks tomorrow after I'm home from work.

Hoped it would've been that easy lol.

JonathanM · May 24, 2022

The motherboard could also be a suspect, or memory tuning specs like timing and voltage.

The only for sure thing is you have to get this resolved before moving on.

Canes · May 24, 2022

Memtest had another error on screen when I woke up. So combination of DIMMs 2 & 4 also resulted in an error (just one but error nonetheless). I didn't have time to swap in a different combo of sticks or try sticks in slots 2/4 vs slots 1/3 but my gut is telling me motherboard based on others bringing it up and fact no combo of RAM tried so far has given me a clean pass.

Ideas? Order new RAM to test before trying to RMA motherboard or something? I don't have any spare kits I could swap in for testing if RAM is culprit and don't have spare motherboard either so both would mean ordering or waiting for RMA process to complete before it's back up.

Dockers Stopped. Unable to write to Docker Image Error - Diagnostics Attached

Recommended Posts

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Join the conversation