Multiple docker databases failing simultaneously after cache drive replacement

July 4, 2024

Hello, I have been running Unraid for 2 years using many docker services such as Plex, Sonarr/Radarr and Nextcloud.

3 weeks ago my main cache drive failed and I had to get it replaced. Over a few days I restored all of the services' appdata from a backup taken a week before the cache drive failed. For Nextcloud I ended up doing a complete reinstall using a different container and a different database service (MariaDB instead of Postgres), as I was having problems getting the restored backup working. Not a problem, as the data I care about is safe on the main array and I'm the only Nextcloud user.

But things took an awful turn for the worse. My Plex installation started behaving strangely: some movies weren't loading at all and playback would grind to a halt every couple of minutes. I did another reinstall from the backup and everything worked perfectly for about 12 hours, when suddenly the entire database became corrupted. Not only that, but within the last few days many other docker containers' databases became corrupted too, including Sonarr, Radarr, Prowlarr, MariaDB (used for Nextcloud) and even Krusader! Each of these was reported as corrupted in its own docker log.

This has been a very stressful time because I spent so much time getting it all working again (some containers multiple times) but the databases keep corrupting. And it's not just the containers that I restored from backup, but new containers such as the MariaDB one.

One error message I couldn't figure out was this when attempting a reinstall of binhex-sonarr:

Error: failed to register layer: read /var/lib/docker/tmp/GetImageBlob1735051611: input/output error

Other error log messages sometimes mentioned input/output errors, but all reported database corruption of some kind. This confuses me because the errors occurred basically overnight after everything had been working perfectly fine. I've attached my diagnostics, and I'm happy to share any other error messages I am getting.

Any help is appreciated. Thank you

nazana-diagnostics-20240704-1921.zip

July 4, 2024

Community Expert

There are ATA errors logged for multiple devices. They are not necessarily related to the current issues, but they should still be resolved: check/replace the cables for parity and disk1.

July 6, 2024

Author

Does anyone else have any advice please? My server is rendered useless now since every docker will basically corrupt eventually.

July 6, 2024

Community Expert

Did you resolve the ATA errors?

ZFS is also detecting data corruption; please post the output of:

zpool status -v

Because of that, and if you haven't yet, it's also a good idea to run memtest.

July 8, 2024

Author

Thanks for the advice and recognising the ATA errors.

I ran a memtest for 35 hours and got zero errors. However, zpool status showed 4361 data errors on my brand new NVMe SSD. It also showed 12 errors on a ZFS HDD pool, but the data on that is of very little concern to me.

Here is the summary from zpool status:

  pool: cache_nvme
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 208K in 00:05:47 with 2764 errors on Wed Jul  3 20:37:28 2024
config:

        NAME         STATE     READ WRITE CKSUM
        cache_nvme   ONLINE       0     0     0
          nvme0n1p1  ONLINE       0     0   442

errors: 4361 data errors, use '-v' for a list
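
For anyone reading along, here is a minimal Python sketch of how the config table in that summary can be read: it picks out any device row with a non-zero READ, WRITE or CKSUM counter, plus the data error total. The function names are made up for illustration, and it assumes the counters are printed as plain integers (zpool may abbreviate larger counts, which this does not handle).

```python
import re

# Sample copied from the `zpool status` summary quoted above.
SAMPLE = """\
config:

        NAME         STATE     READ WRITE CKSUM
        cache_nvme   ONLINE       0     0     0
          nvme0n1p1  ONLINE       0     0   442

errors: 4361 data errors, use '-v' for a list
"""

# A device row is: indentation, name, state, then three integer counters.
ROW = re.compile(r"^\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with any non-zero counter."""
    flagged = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, blank line, or non-table text
        name, _state, read, write, cksum = match.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            flagged[name] = counts
    return flagged


def data_error_count(status_text):
    """Return the number from the 'errors: N data errors' line, or 0 if absent."""
    match = re.search(r"^errors: (\d+) data errors", status_text, re.MULTILINE)
    return int(match.group(1)) if match else 0


print(devices_with_errors(SAMPLE))
print(data_error_count(SAMPLE))
```

On the sample this flags only nvme0n1p1, with its 442 checksum errors, which matches what the table shows: the pool-level row is clean and the errors are counted against the single NVMe partition.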

And here is a random snapshot from the -v:

/mnt/cache_nvme/appdata/plex/Library/Application Support/Plex Media Server/Media/localhost/f/63c610b00f061594e768f953580d43f43648d1d.bundle/Contents/Indexes/index-sd.bif
        /mnt/cache_nvme/appdata/plex/Library/Application Support/Plex Media Server/Media/localhost/a/b2002353744d57340eb60c8afd4677377c83bff.bundle/Contents/Indexes/index-sd.bif
        /mnt/cache_nvme/appdata/plex/Library/Application Support/Plex Media Server/Media/localhost/a/0502b8d27d443f2b9cb05c72147afa7e6d9189b.bundle/Contents/Indexes/index-sd.bif
        /mnt/cache_nvme/appdata/plex/Library/Application Support/Plex Media Server/Media/localhost/4/167d7b43f9fb7de3e54cd803b1270cbdbf5a786.bundle/Contents/Indexes/index-sd.bif
        /mnt/cache_nvme/appdata/plex/Library/Application Support/Plex Media Server/Media/localhost/3/c3263d7f98af79991b47b34d773116998e20aa9.bundle/Contents/Indexes/index-sd.bif
        /mnt/cache_nvme/appdata/plex_3-7-24/Library/Application Support/Plex Media Server/Media/localhost/f/b035977911750feb64bd46ea647eb895bc26335.bundle/Contents/Indexes/index-sd.bif
        /mnt/cache_nvme/appdata/plex_3-7-24/Library/Application Support/Plex Media Server/Media/localhost/1/f98ec84ae597692ac9324f34ee0028c310d5dbf.bundle/Contents/Indexes/index-sd.bif
        /mnt/cache_nvme/appdata/plex/Library/Application Support/Plex Media Server/Media/localhost/7/03964bf3d60806635ec8709d3c42feb71dd492f.bundle/Contents/Chapters/chapter4.jpg
        /mnt/cache_nvme/appdata/plex/Library/Application Support/Plex Media Server/Media/localhost/7/f2166e77ee9c84c48b520736285117bdffdad45.bundle/Contents/Chapters/chapter7.jpg
        /mnt/cache_nvme/appdata/plex_3-7-24/Library/Application Support/Plex Media Server/Media/localhost/6/6d99b313050130a93ea0f70f60300bfdbb15dcb.bundle/Contents/Indexes/index-sd.bif
        /mnt/cache_nvme/appdata/plex/Library/Application Support/Plex Media Server/Media/localhost/7/4188f6abb0214f7ecf6176591bf6d032d5a6164.bundle/Contents/Indexes/index-sd.bif
        cache_nvme/ddbcf1320376b06d5dd77d3251170565639df9912b50746dc905e17fc075db4a:/usr/lib/libstdc++.so.6.0.32
        cache_nvme/eb44da049daa41a92a5116c6767bb9eb5a73f3604e4ef75d72faa642267f7ff7@497829946:/app/code-server/lib/vscode/extensions/ms-vscode.js-debug/resources/readme/link-debugging.gif
        cache_nvme/d9dba1f4ea67195242d98700815b1b1dc407aa6e647c201326bee408fc35acdd@783287484:/usr/sbin/grpck
        cache_nvme/d9dba1f4ea67195242d98700815b1b1dc407aa6e647c201326bee408fc35acdd@783287484:/usr/lib/locale/locale-archive
        cache_nvme/d9dba1f4ea67195242d98700815b1b1dc407aa6e647c201326bee408fc35acdd@783287484:/usr/share/zoneinfo/Asia/Krasnoyarsk
        <0x4a9d>:<0xae58>

Note that plex_3-7-24 was a renamed folder so I could attempt to restore a backup without losing an updated database (hopefully).
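
A rough Python sketch of how a list like this can be grouped to see where the damage sits. It sorts each entry into one of three kinds: regular files under /mnt (grouped by appdata folder), `dataset:/path` entries (these look like Docker image layer datasets and snapshots, but that is an assumption on my part), and bare `<0x...>:<0x...>` object references where ZFS could not resolve a file name. The helper names are hypothetical and only a few of the entries above are used as sample input.

```python
from collections import Counter

# A handful of entries copied from the `zpool status -v` list above.
SAMPLE = [
    "/mnt/cache_nvme/appdata/plex/Library/Application Support/Plex Media Server/Media/localhost/7/03964bf3d60806635ec8709d3c42feb71dd492f.bundle/Contents/Chapters/chapter4.jpg",
    "/mnt/cache_nvme/appdata/plex/Library/Application Support/Plex Media Server/Media/localhost/7/f2166e77ee9c84c48b520736285117bdffdad45.bundle/Contents/Chapters/chapter7.jpg",
    "/mnt/cache_nvme/appdata/plex_3-7-24/Library/Application Support/Plex Media Server/Media/localhost/6/6d99b313050130a93ea0f70f60300bfdbb15dcb.bundle/Contents/Indexes/index-sd.bif",
    "cache_nvme/d9dba1f4ea67195242d98700815b1b1dc407aa6e647c201326bee408fc35acdd@783287484:/usr/sbin/grpck",
    "cache_nvme/ddbcf1320376b06d5dd77d3251170565639df9912b50746dc905e17fc075db4a:/usr/lib/libstdc++.so.6.0.32",
    "<0x4a9d>:<0xae58>",
]


def classify(entry):
    """Return a coarse bucket name for one line of the error list."""
    entry = entry.strip()
    if entry.startswith("<0x"):
        return "unresolved object"
    if entry.startswith("/mnt/"):
        # "/mnt/<pool>/<share>/<folder>/..." -> "<share>/<folder>"
        parts = entry.split("/")
        return "/".join(parts[3:5]) if len(parts) > 4 else "other file"
    if ":" in entry:
        return "dataset or snapshot"
    return "other"


def summarize(entries):
    """Count entries per bucket."""
    return dict(Counter(classify(e) for e in entries))


for bucket, count in sorted(summarize(SAMPLE).items()):
    print(f"{count:3d}  {bucket}")
```

If the errors were confined to one folder, a bad restore would be a plausible explanation. Here they are spread across appdata, the dataset entries and unnamed objects, which points more towards something underneath the filesystem.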

Edited July 8, 2024 by bonefiend

July 8, 2024

Community Expert

Note that memtest is only definitive if it finds errors. 9 out of 10 times, data corruption is RAM related. If you have multiple RAM sticks, you can try using the server with just one, fix the pools and see if more corruption is found. If yes, try a different stick; that will basically rule out the RAM.

September 27, 2024

Author

Hello, I know it's been a while, but I've been trying out different fixes and still have not come to a solution.

I have changed between my 4 sticks of RAM and I get a corrupted database on all pairs. I reformatted the new cache drive entirely and replaced the cables for the HDDs causing ATA errors.

Using 2 sticks, I started a brand new Plex server with no old files and didn't add any libraries to it, and within 2 days it was corrupted again. I then ran a memtest on that RAM for 7 days and had zero errors.

If anyone has any more ideas for diagnosing this issue they would be greatly appreciated.

New diagnostics here:

nazana-diagnostics-20240927-1642.zip

September 27, 2024

Community Expert

There's filesystem corruption on cache_nvme, which may also be a symptom of your issues, so check the filesystem. If RAM is not the issue, it can be a bad CPU, for example; a lot of Intel 13th and 14th gen CPUs are known to have stability issues.

September 30, 2024

Author

Running a filesystem check shows this:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
bad CRC for inode 726229
UUID mismatch on inode 726229
bad CRC for inode 726229, would rewrite
UUID mismatch on inode 726229
would have cleared inode 726229
bad CRC for inode 1204819
bad CRC for inode 1204819, would rewrite
would have cleared inode 1204819
        - agno = 1
Metadata corruption detected at 0x438a03, xfs_inode block 0x3a4c3aa0/0x4000
bad CRC for inode 538459881
bad CRC for inode 538459881, would rewrite
would have cleared inode 538459881
        - agno = 2
bad CRC for inode 1075070919
bad next_unlinked 0x63519b43 on inode 1075070919
bad CRC for inode 1075070919, would rewrite
bad next_unlinked 0x63519b43 on inode 1075070919, would reset next_unlinked
would have cleared inode 1075070919
bad CRC for inode 1075271901
UUID mismatch on inode 1075271901
bad CRC for inode 1075271901, would rewrite
UUID mismatch on inode 1075271901
would have cleared inode 1075271901
bad CRC for inode 1075544118
bad CRC for inode 1075544118, would rewrite
would have cleared inode 1075544118
        - agno = 3
bad CRC for inode 1610870839
bad CRC for inode 1610870839, would rewrite
would have cleared inode 1610870839
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 3
        - agno = 1
entry "abacbf297c10ba7690086657bde84d40afb9432.bundle" at block 1 offset 2752 in directory inode 536880939 references free inode 726229
	would clear inode number in entry at offset 2752...
bad CRC for inode 726229, would rewrite
UUID mismatch on inode 726229
would have cleared inode 726229
bad CRC for inode 1610870839, would rewrite
would have cleared inode 1610870839
bad CRC for inode 1075070919, would rewrite
bad next_unlinked 0x63519b43 on inode 1075070919, would reset next_unlinked
Would clear next_unlinked in inode 1075070919
would have cleared inode 1075070919
bad CRC for inode 1204819, would rewrite
would have cleared inode 1204819
entry "thumb1.jpg" in shortform directory 1075271900 references free inode 1075271901
would have junked entry "thumb1.jpg" in directory inode 1075271900
bad CRC for inode 1075271901, would rewrite
UUID mismatch on inode 1075271901
would have cleared inode 1075271901
bad CRC for inode 538459881, would rewrite
would have cleared inode 538459881
bad CRC for inode 1075544118, would rewrite
would have cleared inode 1075544118
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
Metadata corruption detected at 0x46f860, inode 0x126253 dinode
couldn't map inode 1204819, err = 117
entry "abacbf297c10ba7690086657bde84d40afb9432.bundle" in directory inode 536880939 points to free inode 726229, would junk entry
would rebuild directory inode 536880939
Metadata corruption detected at 0x438a03, xfs_inode block 0x3a4c3aa0/0x4000
Metadata corruption detected at 0x46f860, inode 0x20183ee9 dinode
couldn't map inode 538459881, err = 117
entry "thumb1.jpg" in shortform directory inode 1075271900 points to free inode 1075271901
would junk entry
Metadata corruption detected at 0x46f860, inode 0x401b8036 dinode
couldn't map inode 1075544118, err = 117
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected dir inode 537162312, would move to lost+found
disconnected dir inode 537817763, would move to lost+found
disconnected dir inode 1611411134, would move to lost+found
Phase 7 - verify link counts...
would have reset inode 536880939 nlinks from 1829 to 1828
Metadata corruption detected at 0x46f860, inode 0x126253 dinode
couldn't map inode 1204819, err = 117, can't compare link counts
Metadata corruption detected at 0x46f860, inode 0x20183ee9 dinode
couldn't map inode 538459881, err = 117, can't compare link counts
Metadata corruption detected at 0x46f860, inode 0x401b8036 dinode
couldn't map inode 1075544118, err = 117, can't compare link counts
No modify flag set, skipping filesystem flush and exiting.
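
A small Python sketch for boiling a dry-run report like this down to the distinct inodes involved, since the same inode shows up in several phases. Two things I'm fairly sure of: err = 117 is the Linux errno EUCLEAN ("Structure needs cleaning"), which XFS uses to signal on-disk corruption, and the hex inode in the "Metadata corruption detected" lines is the same inode as the decimal one in the following "couldn't map" line. The function names are made up for illustration and only a few lines of the output are used as sample input.

```python
import re

# A few lines copied from the xfs_repair -n output above.
SAMPLE = """\
bad CRC for inode 726229
UUID mismatch on inode 726229
bad CRC for inode 726229, would rewrite
would have cleared inode 726229
bad CRC for inode 1204819
bad CRC for inode 1204819, would rewrite
would have cleared inode 1204819
Metadata corruption detected at 0x46f860, inode 0x126253 dinode
couldn't map inode 1204819, err = 117
"""


def inodes_matching(pattern, text, base=10):
    """Return the sorted, de-duplicated inode numbers captured by `pattern`."""
    return sorted({int(n, base) for n in re.findall(pattern, text)})


def summarize_dry_run(text):
    """Group the inode numbers in an xfs_repair -n report by complaint type."""
    return {
        "bad_crc": inodes_matching(r"bad CRC for inode (\d+)", text),
        "uuid_mismatch": inodes_matching(r"UUID mismatch on inode (\d+)", text),
        "unmappable": inodes_matching(r"couldn't map inode (\d+)", text),
        # Hex inode numbers from the "Metadata corruption detected" lines.
        "hex_reported": inodes_matching(
            r"inode 0x([0-9a-f]+) dinode", text, base=16
        ),
    }


for kind, inodes in summarize_dry_run(SAMPLE).items():
    print(kind, inodes)
```

On the sample, inode 0x126253 converts to 1204819, so the "Metadata corruption detected" line and the "couldn't map inode" line are two reports about one inode rather than two separate problems.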

The UUID mismatch seems concerning, how could that have happened?
Then when I try xfs_repair I get this error: Sorry, could not find valid secondary superblock

When I run a SMART extended self test it says it's progressing then at the end says "No self-tests logged on this disk", which is what was happening on my old NVME drive just before total failure (starting to think some other part killed it now). Again, this is a less than 3 month old Crucial NVME drive.

My CPU is the i5 14500, which isn't on the list of affected CPUs as far as I'm aware, but I will do a BIOS update and then a cache drive reset. If that's still the cause then should I try and return it? I have no idea what else to do. Troubleshooting shouldn't be this difficult, right?

September 30, 2024

Community Expert

Run it again without -n or nothing will be done.

September 30, 2024

Author

Thanks, I've done that now.

I really appreciate you trying to help, but do you have any other advice for me or where I can go from here?

September 30, 2024

Community Expert

If it's a hardware issue you will basically need to swap some parts to try and find the culprit.
