Docker Service failed to start, Unable to write to nvme_cache



Hello!

 

I could really use some help. I recently discovered that my Docker containers were offline and went to look into why. When I pull up the Docker tab in Unraid, I get the error "Docker Service failed to start." Digging further, I also found the error "Unable to write to nvme_cache" reported by Fix Common Problems. I have tried to fix this in the ways I know, but I'm not sure how to proceed without potentially causing more harm than good.

 

Things I have tried so far:

  • Deleting the Docker vDisk file (did not work)
  • Running a BTRFS scrub on the nvme_cache pool (it gets aborted immediately; the rough commands are sketched below)
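
For what it's worth, the scrub attempt boils down to roughly the following commands (the mount point is based on my pool name, so treat the exact path as my assumption):

# start a scrub on the NVMe pool
btrfs scrub start /mnt/nvme_cache
# check progress / final status; this is where it immediately shows "aborted"
btrfs scrub status /mnt/nvme_cache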

 

 

Notable recent occurrences:

  • This occurred a few days before I had to move house, so I had to shut the server down for the move and have only just come back to it.

 

I have attached the diagnostics file from my server (pulled just now) to hopefully provide better details than I can.

 

Any ideas on how to fix this and get my docker services running properly again?

 

Thanks!

anton-diagnostics-20220916-2213.zip

Sep 16 21:39:46 Anton kernel: BTRFS info (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 80548, rd 0, flush 79403, corrupt 25, gen 0

 

This shows that the nvme0n1 device dropped offline in the past. Start with a scrub of that pool and post the output, and also run a scrub on the other pool, since corruption was found there as well; see here for better pool monitoring.
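
The check in that post comes down to reading the per-device error counters, something like this for your two pools (mount points assumed from the pool names):

# per-device write/read/flush/corruption/generation error counters
btrfs dev stats /mnt/cache
btrfs dev stats /mnt/nvme_cache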

 

 


Thank you for the ideas!

 

I have two cache pools: one is 4x 1TB SATA SSDs and the other is 2x 2TB NVMe SSDs.

 

For the SATA pool (named cache), the output of the scrub is:

UUID:             0ad59d90-fcd8-4af3-a622-ade321c10ea0
Scrub started:    Sat Sep 17 10:07:52 2022
Status:           finished
Duration:         0:07:15
Total to scrub:   416.91GiB
Rate:             981.42MiB/s
Error summary:    no errors found

 

Running the command from the post you linked, here is the output:

 

root@Anton:~# btrfs dev stats /mnt/cache
[/dev/sdc1].write_io_errs    0
[/dev/sdc1].read_io_errs     0
[/dev/sdc1].flush_io_errs    0
[/dev/sdc1].corruption_errs  317
[/dev/sdc1].generation_errs  0
[/dev/sdb1].write_io_errs    0
[/dev/sdb1].read_io_errs     0
[/dev/sdb1].flush_io_errs    0
[/dev/sdb1].corruption_errs  901
[/dev/sdb1].generation_errs  0
[/dev/sde1].write_io_errs    0
[/dev/sde1].read_io_errs     0
[/dev/sde1].flush_io_errs    0
[/dev/sde1].corruption_errs  992
[/dev/sde1].generation_errs  0
[/dev/sdaf1].write_io_errs    0
[/dev/sdaf1].read_io_errs     0
[/dev/sdaf1].flush_io_errs    0
[/dev/sdaf1].corruption_errs  886
[/dev/sdaf1].generation_errs  0
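
If I'm reading the monitoring post right, these counters are cumulative rather than current, so once a scrub comes back clean they can apparently be printed and reset in one go so that any new errors stand out, something like:

# print the counters for the SATA pool and reset them to zero afterwards
btrfs dev stats -z /mnt/cache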

 

For the NVMe pool (named Nvme_cache), the output of the scrub is:

UUID:             94c08dd5-7765-4b75-8d62-7c23c4b37b3f
Scrub started:    Sat Sep 17 10:08:20 2022
Status:           aborted
Duration:         0:00:00
Total to scrub:   2.44TiB
Rate:             0.00B/s
Error summary:    no errors found

 

Running the command from the post you linked, here is the output:

 

root@Anton:~# btrfs dev stats /mnt/nvme_cache
[/dev/nvme0n1p1].write_io_errs    80548
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme0n1p1].flush_io_errs    79438
[/dev/nvme0n1p1].corruption_errs  276
[/dev/nvme0n1p1].generation_errs  0
[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     0
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  0
[/dev/nvme1n1p1].generation_errs  0

 

 

It appears I cannot run a scrub on the Nvme_cache pool at all; it immediately reports a status of "aborted". Any ideas on how to correct this, or is this the point where I'm looking at hardware failure and a drive replacement?
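
In case it helps answer that, I can also pull SMART/health data for the NVMe drive; I believe something along these lines would do it (the /dev/nvme0 device path is my assumption for the first NVMe drive):

# overall SMART attributes for the suspect NVMe device
smartctl -a /dev/nvme0
# or, if the nvme-cli tool is installed, the NVMe-specific health log
nvme smart-log /dev/nvme0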

 

Thanks!

 


@JorgeB I was afraid that was going to be the answer, but thank you for confirming.

 

As prep for re-formatting, I am trying to move what data I can off of the NVMe pool. After reading other threads, the best method seemed to be setting the shares that use the NVMe pool from "Prefer" to "Yes" and then running the mover, which took a full day. However, the pool is still showing the same used space (1.34TB). If I'm understanding this correctly, that is because the NVMe pool is read-only. So how can I verify whether the shares were actually written to the array? Should I be using another method of backing up?
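
For example, would a manual copy straight to an array disk be a safer way to get the data off? Something like the following is what I had in mind (the share name and target disk are just placeholders from my setup):

# copy one share's contents from the read-only NVMe pool onto an array disk
rsync -avh --progress /mnt/nvme_cache/appdata/ /mnt/disk1/appdata/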

 

Thanks!

