Docker Service failed to start, Unable to write to nvme_cache



Hello!

 

I could really use some help. I recently discovered that my Docker containers were offline and went to look into why. When I pull up the Docker tab in Unraid, I get the error "Docker Service failed to start." Digging further, I also found the error "Unable to write to nvme_cache" reported by Fix Common Problems. I have tried to fix this in the ways I know, but I'm not sure how to proceed without potentially causing more harm than good.

 

Things I have tried so far:

  • Deleting the Docker vDisk file (did not work)
  • Running a BTRFS scrub on the nvme_cache pool (it gets aborted immediately; the rough commands are sketched below)
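
For what it's worth, the scrub attempt boils down to roughly the following commands (the mount point is based on my pool name, so treat the exact path as my assumption):

# start a scrub on the NVMe pool
btrfs scrub start /mnt/nvme_cache
# check progress / final status; this is where it immediately shows "aborted"
btrfs scrub status /mnt/nvme_cache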

 

 

Notable recent occurrences:

  • This occurred a few days before I had to move house, so I had to shut the server down for the move and have only just come back to it.

 

I have attached the diagnostics file from my server (pulled just now) to hopefully provide better details than I can.

 

Any ideas on how to fix this and get my docker services running properly again?

 

Thanks!

anton-diagnostics-20220916-2213.zip

Sep 16 21:39:46 Anton kernel: BTRFS info (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 80548, rd 0, flush 79403, corrupt 25, gen 0

 

This shows that the nvme0n1 device dropped offline in the past. Start with a scrub of that pool and post the output, and also run a scrub on the other pool, since corruption was found there as well; see here for better pool monitoring.
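
The check in that post comes down to reading the per-device error counters, something like this for your two pools (mount points assumed from the pool names):

# per-device write/read/flush/corruption/generation error counters
btrfs dev stats /mnt/cache
btrfs dev stats /mnt/nvme_cache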

 

 


Thank you for the ideas!

 

I have two cache pools: one is 4x 1TB SATA SSDs and the other is 2x 2TB NVMe SSDs.

 

For the SATA pool (named cache), the output of the scrub is:

UUID:             0ad59d90-fcd8-4af3-a622-ade321c10ea0
Scrub started:    Sat Sep 17 10:07:52 2022
Status:           finished
Duration:         0:07:15
Total to scrub:   416.91GiB
Rate:             981.42MiB/s
Error summary:    no errors found

 

Running the command from the post you linked, here is the output:

 

root@Anton:~# btrfs dev stats /mnt/cache
[/dev/sdc1].write_io_errs    0
[/dev/sdc1].read_io_errs     0
[/dev/sdc1].flush_io_errs    0
[/dev/sdc1].corruption_errs  317
[/dev/sdc1].generation_errs  0
[/dev/sdb1].write_io_errs    0
[/dev/sdb1].read_io_errs     0
[/dev/sdb1].flush_io_errs    0
[/dev/sdb1].corruption_errs  901
[/dev/sdb1].generation_errs  0
[/dev/sde1].write_io_errs    0
[/dev/sde1].read_io_errs     0
[/dev/sde1].flush_io_errs    0
[/dev/sde1].corruption_errs  992
[/dev/sde1].generation_errs  0
[/dev/sdaf1].write_io_errs    0
[/dev/sdaf1].read_io_errs     0
[/dev/sdaf1].flush_io_errs    0
[/dev/sdaf1].corruption_errs  886
[/dev/sdaf1].generation_errs  0
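
If I'm reading the monitoring post right, these counters are cumulative rather than current, so once a scrub comes back clean they can apparently be printed and reset in one go so that any new errors stand out, something like:

# print the counters for the SATA pool and reset them to zero afterwards
btrfs dev stats -z /mnt/cache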

 

For the NVMe pool (named Nvme_cache), the output of the scrub is:

UUID:             94c08dd5-7765-4b75-8d62-7c23c4b37b3f
Scrub started:    Sat Sep 17 10:08:20 2022
Status:           aborted
Duration:         0:00:00
Total to scrub:   2.44TiB
Rate:             0.00B/s
Error summary:    no errors found

 

Running the command from the post you linked, here is the output:

 

root@Anton:~# btrfs dev stats /mnt/nvme_cache
[/dev/nvme0n1p1].write_io_errs    80548
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme0n1p1].flush_io_errs    79438
[/dev/nvme0n1p1].corruption_errs  276
[/dev/nvme0n1p1].generation_errs  0
[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     0
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  0
[/dev/nvme1n1p1].generation_errs  0

 

 

It appears I cannot run a scrub on the Nvme_cache pool at all; it immediately reports a status of "aborted". Any ideas on how to correct this, or is this the point where I'm looking at hardware failure and a drive replacement?
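
In case it helps answer that, I can also pull SMART/health data for the NVMe drive; I believe something along these lines would do it (the /dev/nvme0 device path is my assumption for the first NVMe drive):

# overall SMART attributes for the suspect NVMe device
smartctl -a /dev/nvme0
# or, if the nvme-cli tool is installed, the NVMe-specific health log
nvme smart-log /dev/nvme0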

 

Thanks!

 


@JorgeB I was afraid that was going to be the answer, but thank you for confirming.

 

As prep for re-formatting, I am trying to move what data I can off of the NVMe pool. After reading other threads, the best method seemed to be setting the shares that use the NVMe pool from "Prefer" to "Yes" and then running the mover, which took a full day. However, the pool is still showing the same used space (1.34TB). If I'm understanding this correctly, that is because the NVMe pool is read-only. So how can I verify whether the shares were actually written to the array? Should I be using another method of backing up?
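
For example, would a manual copy straight to an array disk be a safer way to get the data off? Something like the following is what I had in mind (the share name and target disk are just placeholders from my setup):

# copy one share's contents from the read-only NVMe pool onto an array disk
rsync -avh --progress /mnt/nvme_cache/appdata/ /mnt/disk1/appdata/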

 

Thanks!

