Unraid 6.9.1 - Recently upgraded, Docker Service keeps getting corrupted



Last night all my dockers stopped, and the Docker tab read that the Docker Service had failed to start, along with some warnings. Today, those same warnings happened again:

 

Warning: stream_socket_client(): unable to connect to unix:///var/run/docker.sock (Connection refused) in /usr/local/emhttp/plugins/dynamix.docker.manager/include/DockerClient.php on line 682
Couldn't create socket: [111] Connection refused
Warning: Invalid argument supplied for foreach() in /usr/local/emhttp/plugins/dynamix.docker.manager/include/DockerClient.php on line 866

Warning: stream_socket_client(): unable to connect to unix:///var/run/docker.sock (Connection refused) in /usr/local/emhttp/plugins/dynamix.docker.manager/include/DockerClient.php on line 682
Couldn't create socket: [111] Connection refused
Warning: Invalid argument supplied for foreach() in /usr/local/emhttp/plugins/dynamix.docker.manager/include/DockerClient.php on line 932
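
For anyone reading these warnings: "Connection refused" on a Unix socket generally means the socket file exists but no process is accepting connections on it, which fits a Docker daemon that has stopped. The `foreach()` warnings are most likely just fallout from the failed connection. A minimal sketch of the same check, assuming Python 3 is available on the host (it may not be on a stock install); `docker_socket_status` is a hypothetical helper name, not part of Unraid or Docker:

```python
import errno
import socket


def docker_socket_status(path="/var/run/docker.sock"):
    """Return a short status string for a Unix socket path (hypothetical helper)."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(2)
    try:
        sock.connect(path)
        return "listening"
    except FileNotFoundError:
        return "socket file missing"
    except ConnectionRefusedError:
        # errno 111 on Linux, the same code shown in the warnings above
        return "connection refused"
    except OSError as exc:
        return "error: " + errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        sock.close()


if __name__ == "__main__":
    print(docker_socket_status())
```

A result of "connection refused" would match the warnings; "listening" would suggest the daemon is back up and the web UI only needs a refresh.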

 

After failing to stop the Docker service, I had to send a 'powerdown' command. After a restart, I was able to disable Docker, delete the Docker image, re-enable Docker, and re-install my containers from the Previous Apps menu, and I thought I was good to go. This morning, I found it had corrupted again, with the same message as before.

 

Attached are my diagnostics from this morning when I found it this way. Any help would be greatly appreciated. Thank you!

nnc-diagnostics-20210416-1126.zip

Link to comment
1 minute ago, rav007 said:

I chased my tail around for a while with btrfs cache corruption. Very long story short, it was faulty memory. Personally, any and all problems I've had with Unraid have been memory related. I'd start by ruling that out with 24 hours of MemTest.

 

Did something change with your memory? These are the same sticks of ECC RAM that came with the server (a Dell T630) and have been fine for the last year+ I've owned it.

Link to comment
5 minutes ago, jonathanm said:

Things can fail at any time.

True! It just seems exceptionally unlikely when this all started a few short days after upgrading to 6.9.1 from an older version that I was on for many months. I had to do a lot in that upgrade, since I was using Nvidia Unraid and had to redo how that was set up. It would be extremely coincidental for my physical RAM to begin failing at the same time I do the first OS update in a long while.

 

I started MemTest a bit over an hour ago, but then I read elsewhere that MemTest wouldn't show any errors with ECC RAM. Is that true? If so, I'm unsure how to proceed.
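
On the ECC question: whether MemTest reports errors on ECC RAM depends on the platform, since corrected single-bit errors can be fixed by the hardware before a pattern test ever sees them. On Linux, corrected and uncorrected error counts are exposed through the EDAC sysfs files `ce_count` and `ue_count` when an EDAC driver is loaded. The sketch below reads them; it assumes such a driver is loaded on this server and that Python 3 is available on the host, neither of which is confirmed in this thread, and `read_edac_counts` is a hypothetical helper name.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int, "ue": int}} from EDAC sysfs counters."""
    counts = {}
    for mc in sorted(Path(base).glob("mc[0-9]*")):
        entry = {}
        for key, name in (("ce", "ce_count"), ("ue", "ue_count")):
            counter = mc / name
            if counter.is_file():
                entry[key] = int(counter.read_text().strip())
        if entry:
            counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("no EDAC memory controllers found (driver may not be loaded)")
    for name, entry in result.items():
        print(name, "corrected:", entry.get("ce"), "uncorrected:", entry.get("ue"))
```

Non-zero counts would point toward memory; all zeros would make the RAM a less likely suspect without ruling it out.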

 

In the meantime, I'm following tips I found here and here by @JorgeB to reformat the cache drive, in case that solves the corruption problem.

Edited by MikaelTarquin
Link to comment
18 minutes ago, ChatNoir said:

A newer version could also detect errors not previously detected.

It's not just that it's detecting an error though, right? I would then expect a warning or an error notification. Instead, everything is working fine and then hours later I'm surprised to find that nothing is working, and only when I click the Docker tab do I find the entire Docker Service has crashed. This didn't happen before my upgrade to 6.9.1 this week, and I tend to not believe in coincidences with computer stuff. Seems like really strange timing unless something went awry during said update.

 

I'm currently waiting on mover to finish so I can reformat the cache drive, but wow that is slow. It's taking hours just to get through a few gigabytes.

Link to comment
4 minutes ago, JorgeB said:

This was a hardware problem with the cache device: it dropped offline, so the SMART report is empty. Before that it was showing some data corruption, which could indicate a device problem. Post a new SMART report and run a scrub.

Here you go! I am seeing a "reallocated sector ct is 1" and "reported uncorrect is 7" when I remount the drive.

 

The results of the scrub are:

UUID:             f46ce41b-b890-45d5-b887-08e28182975b
Scrub started:    Sat Apr 17 00:23:54 2021
Status:           finished
Duration:         0:00:01
Total to scrub:   263.50MiB
Rate:             263.50MiB/s
Error summary:    no errors found

nnc-smart-20210417-0021.zip

Edited by MikaelTarquin
Link to comment

 

I don't think a reallocated sector on an SSD is as bad as on a mechanical drive, however. I believe it is a block that failed to erase and was then replaced by one from the reserve, of which there are many (but @JorgeB will know better than me). Even so, if it were me I would probably replace the cache drive because of this reallocated sector and the fact that it is quite old anyway. Power-on hours are 27627, so about 3 years, and it has written a lot of data: 383850 gigs, or 363 TB.
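
As a quick check of the age and write-volume arithmetic, here is a small sketch using only the two figures from this post. The unit of the 383850 figure is an assumption (decimal GB or binary GiB), so both conversions are shown.

```python
HOURS_PER_YEAR = 24 * 365

power_on_hours = 27627   # power-on hours quoted in the post
written_gb = 383850      # data written as quoted; unit assumed to be GB or GiB

years_on = power_on_hours / HOURS_PER_YEAR
written_tb = written_gb / 1000    # if the figure is decimal gigabytes
written_tib = written_gb / 1024   # if the figure is binary gibibytes

print(f"powered on: {years_on:.2f} years")
print(f"written: {written_tb:.1f} TB (decimal) or {written_tib:.1f} TiB (binary)")
```

That works out to a little over 3 years of power-on time and somewhere around 375 to 384 TB written depending on the unit, so the 363 TB figure may come from a different counter or conversion. Either way the drive has seen heavy use.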

Link to comment
8 minutes ago, SpaceInvaderOne said:

 

I don't think a reallocated sector on an SSD is as bad as on a mechanical drive, however. I believe it is a block that failed to erase and was then replaced by one from the reserve, of which there are many (but @JorgeB will know better than me). Even so, if it were me I would probably replace the cache drive because of this reallocated sector and the fact that it is quite old anyway. Power-on hours are 27627, so about 3 years, and it has written a lot of data: 383850 gigs, or 363 TB.

Thanks @SpaceInvaderOne, I will do that as I didn't realize it was already that old. And thanks for your videos! You saved my butt a number of times the last couple years!

Link to comment
46 minutes ago, MikaelTarquin said:

"reported uncorrect is 7"

I've only had one device corrupting data, and it was a SanDisk SSD after an error like that. If the scrub didn't find any errors, it means the corrupt files have since been deleted or replaced. Since you're using ECC RAM, I suspect the SSD was the reason for the data corruption, so I would recommend replacing it.

Link to comment
9 minutes ago, JorgeB said:

I've only had one device corrupting data, and it was a SanDisk SSD after an error like that. If the scrub didn't find any errors, it means the corrupt files have since been deleted or replaced. Since you're using ECC RAM, I suspect the SSD was the reason for the data corruption, so I would recommend replacing it.

That's 2 votes to replace! I should get the new SSD in a couple of days. I'll report back when I have it running to confirm whether that solves the issue. Thanks everyone for your help with this!

Link to comment

So, an update on this. Over the weekend I reformatted the cache drive (as xfs this time) and it seemed to fix the problem. Once I got everything back the way it was, things seemed to be running pretty smoothly! I haven't replaced the cache drive yet, but today I did encounter a strange issue and want to know if it's related. My dashboard and dockers are unresponsive, but I was able to access the System Log. Every few minutes it's throwing a Call Trace error. I've attached the system logs, but was unable to download the diagnostics.

 

Do we think this is due to the cache drive?

nnc-syslog-20210423-0711.zip

Link to comment
  • 2 weeks later...
  • 3 weeks later...

Well, the problems just keep going. I have a brand new Samsung 870 EVO 2TB SSD for cache, and after 2 weeks of things running smoothly, I woke up today to half my dockers stopped and a ton of kernel device loop2 errors in my syslog. I'm attempting to stop the Docker service and delete the Docker image now (again), but this is really getting ridiculous. I must be missing something, but I have no idea what.

 

May 23 08:59:51 NNC kernel: BTRFS info (device loop2): no csum found for inode 917 start 45056
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): parent transid verify failed on 8946532352 wanted 53608 found 35546
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): parent transid verify failed on 8946532352 wanted 53608 found 35546
May 23 08:59:51 NNC kernel: BTRFS info (device loop2): no csum found for inode 917 start 45056
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): parent transid verify failed on 8946532352 wanted 53608 found 35546
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): parent transid verify failed on 8946532352 wanted 53608 found 35546
May 23 08:59:51 NNC kernel: BTRFS info (device loop2): no csum found for inode 277 start 0
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): parent transid verify failed on 8946532352 wanted 53608 found 35546
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): parent transid verify failed on 8946532352 wanted 53608 found 35546
May 23 08:59:51 NNC kernel: BTRFS info (device loop2): no csum found for inode 308 start 0
May 23 08:59:51 NNC kernel: BTRFS info (device loop2): no csum found for inode 289 start 0
May 23 08:59:51 NNC kernel: BTRFS info (device loop2): no csum found for inode 289 start 0
May 23 08:59:51 NNC kernel: BTRFS info (device loop2): no csum found for inode 289 start 0
May 23 08:59:51 NNC kernel: BTRFS info (device loop2): no csum found for inode 277 start 0
May 23 08:59:51 NNC kernel: BTRFS info (device loop2): no csum found for inode 917 start 45056
May 23 08:59:51 NNC kernel: btrfs_print_data_csum_error: 10612 callbacks suppressed
May 23 08:59:51 NNC kernel: BTRFS warning (device loop2): csum failed root 362 ino 277 off 0 csum 0x95712ae9 expected csum 0x00000000 mirror 1
May 23 08:59:51 NNC kernel: btrfs_dev_stat_print_on_error: 10612 callbacks suppressed
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419302, gen 0
May 23 08:59:51 NNC kernel: BTRFS warning (device loop2): csum failed root 362 ino 937 off 0 csum 0xfaf4642c expected csum 0x00000000 mirror 1
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419303, gen 0
May 23 08:59:51 NNC kernel: BTRFS warning (device loop2): csum failed root 362 ino 917 off 45056 csum 0x0f4eaf29 expected csum 0x00000000 mirror 1
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419304, gen 0
May 23 08:59:51 NNC kernel: BTRFS warning (device loop2): csum failed root 362 ino 917 off 45056 csum 0x0f4eaf29 expected csum 0x00000000 mirror 1
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419305, gen 0
May 23 08:59:51 NNC kernel: BTRFS warning (device loop2): csum failed root 362 ino 277 off 0 csum 0x95712ae9 expected csum 0x00000000 mirror 1
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419306, gen 0
May 23 08:59:51 NNC kernel: BTRFS warning (device loop2): csum failed root 362 ino 308 off 0 csum 0x06b46f82 expected csum 0x00000000 mirror 1
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419307, gen 0
May 23 08:59:51 NNC kernel: BTRFS warning (device loop2): csum failed root 362 ino 289 off 0 csum 0x7e1c42e9 expected csum 0x00000000 mirror 1
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419308, gen 0
May 23 08:59:51 NNC kernel: BTRFS warning (device loop2): csum failed root 362 ino 289 off 0 csum 0x7e1c42e9 expected csum 0x00000000 mirror 1
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419309, gen 0
May 23 08:59:51 NNC kernel: BTRFS warning (device loop2): csum failed root 362 ino 289 off 0 csum 0x7e1c42e9 expected csum 0x00000000 mirror 1
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419310, gen 0
May 23 08:59:51 NNC kernel: BTRFS warning (device loop2): csum failed root 362 ino 277 off 0 csum 0x95712ae9 expected csum 0x00000000 mirror 1
May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419311, gen 0
May 23 08:59:56 NNC kernel: verify_parent_transid: 21336 callbacks suppressed
May 23 08:59:56 NNC kernel: BTRFS error (device loop2): parent transid verify failed on 8946532352 wanted 53608 found 35546
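
The `errs:` lines above carry the btrfs per-device error counters (write, read, flush, corruption, generation), and `loop2` is presumably the loop device backing the Docker image. To pull the latest counters out of a long syslog instead of reading thousands of repeats, something like the following sketch could be used. The regular expression is fitted to the lines quoted here, and `latest_btrfs_counters` is a hypothetical helper name.

```python
import re

BDEV_RE = re.compile(
    r"BTRFS error \(device (?P<dev>\S+)\): bdev \S+ errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def latest_btrfs_counters(lines):
    """Return the last reported error counters per device."""
    latest = {}
    for line in lines:
        match = BDEV_RE.search(line)
        if match:
            fields = match.groupdict()
            dev = fields.pop("dev")
            latest[dev] = {name: int(value) for name, value in fields.items()}
    return latest


# Two of the lines quoted above, first and last of the run
sample = [
    "May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419302, gen 0",
    "May 23 08:59:51 NNC kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 149, rd 0, flush 0, corrupt 37419311, gen 0",
]
print(latest_btrfs_counters(sample))
```

In this excerpt the write-error count stays at 149 while the corruption count climbs with every read, which suggests the image was damaged earlier and each access is now hitting bad checksums.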

 

Edit: I've updated Unraid to 6.9.2, removed my old docker image, and changed to Docker Directory instead of an image to see if that helps.

nnc-diagnostics-20210523-0958.zip

Edited by MikaelTarquin
Link to comment
May 23 04:03:31 NNC kernel: BTRFS error (device sde1): block=112534290432 write time tree block corruption detected
May 23 04:03:31 NNC kernel: BTRFS: error (device sde1) in btrfs_commit_transaction:2377: errno=-5 IO failure (Error while writing out transaction)
May 23 04:03:31 NNC kernel: BTRFS info (device sde1): forced readonly

 

The cache filesystem is corrupt. Your best bet is to back up and re-format the cache.
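
Note the timestamps: the `sde1` errors are logged at 04:03, about five hours before the `loop2` errors at 08:59, which fits the cache filesystem failing first and the Docker image on it following. A sketch for finding the first BTRFS error per device in a syslog, assuming the lines are in chronological order and look like the ones quoted in this thread (`first_error_per_device` is a hypothetical helper name):

```python
import re

ERR_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"BTRFS:? error \(device (?P<dev>[^)]+)\)"
)


def first_error_per_device(lines):
    """Map each device to the timestamp of its first BTRFS error line."""
    first = {}
    for line in lines:
        match = ERR_RE.match(line)
        if match and match["dev"] not in first:
            first[match["dev"]] = match["stamp"]
    return first


# Lines taken from the syslog excerpts quoted in this thread
sample_log = [
    "May 23 04:03:31 NNC kernel: BTRFS error (device sde1): block=112534290432 write time tree block corruption detected",
    "May 23 04:03:31 NNC kernel: BTRFS: error (device sde1) in btrfs_commit_transaction:2377: errno=-5 IO failure (Error while writing out transaction)",
    "May 23 04:03:31 NNC kernel: BTRFS info (device sde1): forced readonly",
    "May 23 08:59:51 NNC kernel: BTRFS error (device loop2): parent transid verify failed on 8946532352 wanted 53608 found 35546",
]
print(first_error_per_device(sample_log))
```

Whichever device shows up first is usually the better place to start looking; here that is the cache device rather than the Docker image.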

Link to comment
