MikaelTarquin Posted April 16, 2021

Last night all my Docker containers stopped, and the Docker tab reported that the Docker service had failed to start, along with some warnings. Today, those same warnings appeared again:

Warning: stream_socket_client(): unable to connect to unix:///var/run/docker.sock (Connection refused) in /usr/local/emhttp/plugins/dynamix.docker.manager/include/DockerClient.php on line 682
Couldn't create socket: [111] Connection refused
Warning: Invalid argument supplied for foreach() in /usr/local/emhttp/plugins/dynamix.docker.manager/include/DockerClient.php on line 866
Warning: stream_socket_client(): unable to connect to unix:///var/run/docker.sock (Connection refused) in /usr/local/emhttp/plugins/dynamix.docker.manager/include/DockerClient.php on line 682
Couldn't create socket: [111] Connection refused
Warning: Invalid argument supplied for foreach() in /usr/local/emhttp/plugins/dynamix.docker.manager/include/DockerClient.php on line 932

When the Docker service then failed to stop, I had to send a 'powerdown' command. After a restart, I was able to disable Docker, delete the docker image, re-enable Docker, and re-install my containers from the Previous Apps menu, and I thought I was good to go. This morning, I found it had corrupted again, with the same message as before. Attached are my diagnostics from this morning when I found it this way. Any help would be greatly appreciated. Thank you!

nnc-diagnostics-20210416-1126.zip
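For anyone landing here with the same warnings: "Connection refused" on /var/run/docker.sock usually just means the Docker daemon is not running at all. A minimal first check (a sketch, not from this thread; the socket path is the standard Docker default):

```shell
# Check whether dockerd's unix socket exists before digging deeper.
sock="/var/run/docker.sock"
if [ -S "$sock" ]; then
    sock_state="present"    # daemon started and created its socket
else
    sock_state="missing"    # dockerd never started, or crashed
fi
echo "docker socket: $sock_state"
```

If the socket is missing, the daemon's startup failure (here, the corrupt docker.img) is the thing to chase, not the PHP warnings themselves.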
Squid Posted April 16, 2021

The cause is corruption on the cache drive. Wait for the BTRFS god (@JorgeB) to pipe in.
MikaelTarquin (Author) Posted April 16, 2021

Thank you! In case it is helpful, here is my SMART report of the cache drive.

nnc-smart-20210416-1149.zip
rav007 Posted April 16, 2021

I chased my tail around for a while with BTRFS cache corruption; a very long story short, it was faulty memory. Personally, any and all problems I've had with Unraid have been memory related. I'd start by ruling that out with 24 hours of a memory test.
MikaelTarquin (Author) Posted April 16, 2021

1 minute ago, rav007 said:
"I chased my tail around for a while with btrfs cache corruption, a very long story short it was faulty memory. Personally any and all problems I've had with unraid have been memory related. I'd start by ruling that out with 24 hours of a mem test"

Did something change with your memory? These are the same sticks of ECC RAM that came with the server (a Dell T630), and they have been fine for the year-plus I've owned it.
JonathanM Posted April 16, 2021

2 hours ago, MikaelTarquin said:
"have been fine for the last year+"

Things can fail at any time.
MikaelTarquin (Author) Posted April 16, 2021

5 minutes ago, jonathanm said:
"Things can fail at any time."

True! It just seems exceptionally unlikely when this all started a few short days after upgrading to 6.9.1 from an older version that I had been on for many months. I had to do a lot in that upgrade, since I was using Nvidia Unraid and had to redo how that was set up. It would be an extreme coincidence for my physical RAM to begin failing at the same time I do my first OS update in a long while.

I started MemTest a bit over an hour ago, but then I read elsewhere that MemTest wouldn't show any errors with ECC RAM. Is that true? If so, I'm unsure how to proceed. In the meantime, I'm following tips I found here and here by @JorgeB to reformat the cache drive, in case that solves the corruption problem.

Edited April 16, 2021 by MikaelTarquin
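On the ECC question: a memory tester can still stress ECC DIMMs, but single-bit errors the hardware silently corrects may never surface as test failures. Under a running Linux kernel with EDAC support, the corrected/uncorrected counters are visible in sysfs. A hedged sketch (standard EDAC sysfs layout; not something posted in this thread, and not every kernel/board exposes it):

```shell
# Read ECC error counters from the Linux EDAC subsystem, if present.
# ce_count = corrected errors, ue_count = uncorrected errors.
if [ -d /sys/devices/system/edac/mc ]; then
    edac_state="present"
    grep -H . /sys/devices/system/edac/mc/mc*/ce_count \
              /sys/devices/system/edac/mc/mc*/ue_count 2>/dev/null
else
    edac_state="absent"
    echo "EDAC not available on this kernel/hardware"
fi
```

A nonzero ce_count on a server that has "been fine for a year" is exactly the kind of silent failure ECC can mask.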
ChatNoir Posted April 17, 2021

A newer version could also detect errors not previously detected.
MikaelTarquin (Author) Posted April 17, 2021

18 minutes ago, ChatNoir said:
"A newer version could also detect errors not previously detected."

It's not just that it's detecting an error, though, right? I would expect a warning or an error notification in that case. Instead, everything is working fine, and then hours later I'm surprised to find that nothing is working; only when I click the Docker tab do I discover the entire Docker service has crashed. This didn't happen before my upgrade to 6.9.1 this week, and I tend not to believe in coincidences with computer stuff. It seems like really strange timing unless something went awry during said update.

I'm currently waiting on mover to finish so I can reformat the cache drive, but wow, it is slow. It's taking hours just to get through a few gigabytes.
JorgeB Posted April 17, 2021

This was a hardware problem with the cache device: it dropped offline, which is why the SMART report is empty. Before that, it was showing some data corruption, which could indicate a device problem. Post a new SMART report and run a scrub.
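For reference, the commands behind "run a scrub" look roughly like this (a sketch, not Unraid-specific instructions; /mnt/cache is assumed as the usual Unraid cache mount point, so adjust for your pool and run as root):

```shell
# Assemble the btrfs maintenance commands for a cache pool check.
mnt="/mnt/cache"                       # assumption: default Unraid cache mount
scrub_cmd="btrfs scrub start -B $mnt"  # -B: run in foreground, print summary at end
stats_cmd="btrfs dev stats $mnt"       # per-device wr/rd/flush/corrupt/gen counters
echo "$scrub_cmd"
echo "$stats_cmd"
```

A scrub re-reads every data and metadata block and verifies its checksum; if started without `-B`, `btrfs scrub status /mnt/cache` reports progress.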
MikaelTarquin (Author) Posted April 17, 2021

4 minutes ago, JorgeB said:
"post a new SMART report and run a scrub."

Here you go! I am seeing "reallocated sector ct is 1" and "reported uncorrect is 7" when I remount the drive. The results of the scrub are:

UUID: f46ce41b-b890-45d5-b887-08e28182975b
Scrub started: Sat Apr 17 00:23:54 2021
Status: finished
Duration: 0:00:01
Total to scrub: 263.50MiB
Rate: 263.50MiB/s
Error summary: no errors found

nnc-smart-20210417-0021.zip

Edited April 17, 2021 by MikaelTarquin
SpaceInvaderOne Posted April 17, 2021

I don't think a reallocated sector on an SSD is as bad as on a mechanical drive, however; I believe it is a block that failed to be erased and was then replaced by one from the reserve, of which there are many (but @JorgeB will know better than me). Even so, if it were me, I would probably replace the cache drive because of this reallocated sector and the fact that it is quite old anyway. Power-on hours are 27627, so it is about 3 years old, and it has written a lot of data: 383850 gigs, or 363 TB.
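Figures like that come straight from the SMART attributes (`smartctl -a` on the device). For drives that report a Total_LBAs_Written attribute, the lifetime-writes figure is just sector count times sector size. A sketch with a made-up raw value (the value is hypothetical, not from this drive; 512-byte logical sectors are assumed, which `smartctl -i` will confirm for a given disk):

```shell
# Convert a hypothetical Total_LBAs_Written raw value into TiB written.
lbas=781422768128                          # hypothetical raw value, not from this thread
bytes=$((lbas * 512))                      # assumed 512-byte logical sectors
tib=$((bytes / 1024 / 1024 / 1024 / 1024)) # bytes -> KiB -> MiB -> GiB -> TiB
echo "${tib} TiB written"                  # prints: 363 TiB written
```

Note that some vendors report this attribute in larger units instead of LBAs, so always check the drive's documentation before trusting the conversion.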
MikaelTarquin (Author) Posted April 17, 2021

8 minutes ago, SpaceInvaderOne said:
"Even so probably if it were me i would replace the cache drive because of this reallocated sector and the fact it is quite old anyway."

Thanks @SpaceInvaderOne, I will do that, as I didn't realize it was already that old. And thanks for your videos! You have saved my butt a number of times over the last couple of years!
JorgeB Posted April 17, 2021

46 minutes ago, MikaelTarquin said:
""reported uncorrect is 7""

The only device I've ever had corrupting data was a Sandisk SSD, after an error like that. If the scrub didn't find any errors, it means the corrupt files have since been deleted or replaced, but since you're using ECC RAM, I suspect the SSD was the reason for the data corruption, so I would recommend replacing it.
MikaelTarquin (Author) Posted April 17, 2021

9 minutes ago, JorgeB said:
"I suspect the SSD was the reason for the data corruption, so I would recommend replacing it."

That's two votes to replace! I should get the new SSD in a couple of days; I'll report back once I have it running to confirm whether that solves the issue. Thanks, everyone, for your help with this!
MikaelTarquin (Author) Posted April 23, 2021

So, an update on this. Over the weekend I reformatted the cache drive (as XFS this time) and it seemed to fix the problem. Once I got everything back the way it was, things seemed to be running pretty smoothly!

I haven't replaced the cache drive yet, but today I encountered a strange issue and want to know if it's related. My dashboard and Docker containers are unresponsive, but I was able to access the system log. Every few minutes it throws a call trace error. I've attached the system logs, but I was unable to download the diagnostics. Do we think this is due to the cache drive?

nnc-syslog-20210423-0711.zip
JorgeB Posted April 23, 2021

Macvlan call traces are usually the result of having dockers with a custom IP address; more info below.

https://forums.unraid.net/topic/70529-650-call-traces-when-assigning-ip-address-to-docker-containers/

See also here:

https://forums.unraid.net/bug-reports/stable-releases/690691-kernel-panic-due-to-netfilter-nf_nat_setup_info-docker-static-ip-macvlan-r1356/
MikaelTarquin (Author) Posted April 23, 2021

Thanks, I'll take a look! Though all of my dockers are either host, bridge, or Custom:proxynet, and I don't have anything with fixed IP addresses. I manually restarted just now and things seem fine (my swag problem even went away and Ombi works again). I guess I'll just be on the lookout for this problem, too.
MikaelTarquin (Author) Posted May 5, 2021

Well, after about a week and a half post-reformat, the cache drive died again. I've replaced it with a new SSD, and so far things seem to be resolved. I'm getting a "database disk image is malformed" error on my Radarr dockers, but everything else seems fine.
MikaelTarquin (Author) Posted May 23, 2021

Well, the problems just keep coming. I have a brand new Samsung 870 EVO 2TB SSD for cache, and after 2 weeks of things running smoothly, I woke up today to half my dockers stopped and a ton of kernel device loop2 errors in my syslog. I'm attempting to stop the Docker service and delete the docker image now (again), but this is really getting ridiculous. I must be missing something, but I have no idea what.

May 23 08:59:51 NNC kernel: BTRFS info (device loop2): no csum found for inode 917 start 45056
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): parent transid verify failed on 8946532352 wanted 53608 found 35546
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): parent transid verify failed on 8946532352 wanted 53608 found 35546
May 23 08:59:51 NNC kernel: BTRFS info (device loop2): no csum found for inode 917 start 45056
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): parent transid verify failed on 8946532352 wanted 53608 found 35546
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): parent transid verify failed on 8946532352 wanted 53608 found 35546
May 23 08:59:51 NNC kernel: BTRFS info (device loop2): no csum found for inode 277 start 0
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): parent transid verify failed on 8946532352 wanted 53608 found 35546
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): parent transid verify failed on 8946532352 wanted 53608 found 35546
May 23 08:59:51 NNC kernel: BTRFS info (device loop2): no csum found for inode 308 start 0
May 23 08:59:51 NNC kernel: BTRFS info (device loop2): no csum found for inode 289 start 0
May 23 08:59:51 NNC kernel: BTRFS info (device loop2): no csum found for inode 289 start 0
May 23 08:59:51 NNC kernel: BTRFS info (device loop2): no csum found for inode 289 start 0
May 23 08:59:51 NNC kernel: BTRFS info (device loop2): no csum found for inode 277 start 0
May 23 08:59:51 NNC kernel: BTRFS info (device loop2): no csum found for inode 917 start 45056
May 23 08:59:51 NNC kernel: btrfs_print_data_csum_error: 10612 callbacks suppressed
May 23 08:59:51 NNC kernel: BTRFS warning (device loop2): csum failed root 362 ino 277 off 0 csum 0x95712ae9 expected csum 0x00000000 mirror 1
May 23 08:59:51 NNC kernel: btrfs_dev_stat_print_on_error: 10612 callbacks suppressed
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419302, gen 0
May 23 08:59:51 NNC kernel: BTRFS warning (device loop2): csum failed root 362 ino 937 off 0 csum 0xfaf4642c expected csum 0x00000000 mirror 1
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419303, gen 0
May 23 08:59:51 NNC kernel: BTRFS warning (device loop2): csum failed root 362 ino 917 off 45056 csum 0x0f4eaf29 expected csum 0x00000000 mirror 1
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419304, gen 0
May 23 08:59:51 NNC kernel: BTRFS warning (device loop2): csum failed root 362 ino 917 off 45056 csum 0x0f4eaf29 expected csum 0x00000000 mirror 1
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419305, gen 0
May 23 08:59:51 NNC kernel: BTRFS warning (device loop2): csum failed root 362 ino 277 off 0 csum 0x95712ae9 expected csum 0x00000000 mirror 1
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419306, gen 0
May 23 08:59:51 NNC kernel: BTRFS warning (device loop2): csum failed root 362 ino 308 off 0 csum 0x06b46f82 expected csum 0x00000000 mirror 1
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419307, gen 0
May 23 08:59:51 NNC kernel: BTRFS warning (device loop2): csum failed root 362 ino 289 off 0 csum 0x7e1c42e9 expected csum 0x00000000 mirror 1
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419308, gen 0
May 23 08:59:51 NNC kernel: BTRFS warning (device loop2): csum failed root 362 ino 289 off 0 csum 0x7e1c42e9 expected csum 0x00000000 mirror 1
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419309, gen 0
May 23 08:59:51 NNC kernel: BTRFS warning (device loop2): csum failed root 362 ino 289 off 0 csum 0x7e1c42e9 expected csum 0x00000000 mirror 1
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419310, gen 0
May 23 08:59:51 NNC kernel: BTRFS warning (device loop2): csum failed root 362 ino 277 off 0 csum 0x95712ae9 expected csum 0x00000000 mirror 1
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419311, gen 0
May 23 08:59:56 NNC kernel: verify_parent_transid: 21336 callbacks suppressed
May 23 08:59:56 NNC kernel: BTRFS error (device loop2): parent transid verify failed on 8946532352 wanted 53608 found 35546

Edit: I've updated Unraid to 6.9.2, removed my old docker image, and changed to a Docker directory instead of an image to see if that helps.

nnc-diagnostics-20210523-0958.zip

Edited May 23, 2021 by MikaelTarquin
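One way to read those kernel lines: the "wr 149, rd 0, flush 0, corrupt 37419311, gen 0" fields are the cumulative per-device error counters for the filesystem inside docker.img (the same counters `btrfs dev stats` reports). A small sketch that pulls the corruption counter out of one of the lines pasted above:

```shell
# Extract the cumulative "corrupt" counter from a BTRFS kernel log line.
line="May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419311, gen 0"
corrupt=$(printf '%s\n' "$line" | sed -n 's/.*corrupt \([0-9]*\).*/\1/p')
echo "cumulative corruption events: $corrupt"   # prints: cumulative corruption events: 37419311
```

Since "device loop2" is the loopback device backing docker.img, these errors implicate the image's filesystem; `losetup --list` on the server would show which file each loop device maps to.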
JorgeB Posted May 24, 2021

May 23 04:03:31 NNC kernel: BTRFS error (device sde1): block=112534290432 write time tree block corruption detected
May 23 04:03:31 NNC kernel: BTRFS: error (device sde1) in btrfs_commit_transaction:2377: errno=-5 IO failure (Error while writing out transaction)
May 23 04:03:31 NNC kernel: BTRFS info (device sde1): forced readonly

The cache filesystem is corrupt; the best bet is to back up and re-format cache.
MikaelTarquin (Author) Posted May 24, 2021

That would be the 3rd or 4th time I've reformatted cache, including the format of the new replacement drive. There has to be a root cause to this corruption; I can't just keep reformatting every couple of weeks. I just don't know how to find the cause.
JorgeB Posted May 24, 2021

7 minutes ago, MikaelTarquin said:
"There's got to be a root cause to this corruption"

Yes, most likely a hardware issue, though it might not be easy to identify the actual culprit. I assume you're using ECC RAM, so the next suspects would be a board/CPU or controller issue.