July 4, 2024
Hello, I have been running Unraid for 2 years using many Docker services such as Plex, Sonarr/Radarr and Nextcloud. Three weeks ago my main cache drive failed and I had to get it replaced. Over a few days I restored all of the services' appdata from a backup taken a week before the cache drive failed. For Nextcloud I ended up doing a complete reinstall using a different container and a different database service (MariaDB instead of Postgres), as I was having problems getting the restored backup working. Not a problem, as the data I care about is safe on the main array and I'm the only Nextcloud user.

But things took an awful turn for the worse. My Plex installation started behaving strangely: some movies weren't loading at all and playback would grind to a halt every couple of minutes. I did another reinstall from the backup and everything worked perfectly for about 12 hours, when suddenly the entire database became corrupted. Not only that, but within the last few days many other Docker containers' databases have become corrupted too, including Sonarr, Radarr, Prowlarr, MariaDB (used for Nextcloud) and even Krusader! Each of these was individually reported as corrupted in its Docker logs.

This has been a very stressful time because I spent so much time getting it all working again (some containers multiple times), but the databases keep corrupting. And it's not just the containers that I restored from backup, but also new containers such as the MariaDB one.

One error message I couldn't figure out was this one, when attempting a reinstall of binhex-sonarr:

Error: failed to register layer: read /var/lib/docker/tmp/GetImageBlob1735051611: input/output error

Other error log messages sometimes mentioned input/output errors, but all reported database corruption of some kind. This confuses me because the errors occurred basically overnight, after everything had been working perfectly fine.

I've attached my diagnostics, and I'm happy to share any other error messages I am getting.
Any help is appreciated. Thank you nazana-diagnostics-20240704-1921.zip
July 4, 2024 Community Expert
There are ATA errors logged for multiple devices. They are not necessarily related to the current issues, but they should still be resolved: check or replace the cables for parity and disk1.
July 6, 2024 Author
Does anyone else have any advice, please? My server is useless now, since every Docker container basically ends up corrupted eventually.
July 6, 2024 Community Expert
Did you resolve the ATA errors? ZFS is also detecting data corruption; post the output of zpool status -v. Because of that, and if you haven't yet, it's also a good idea to run memtest.
July 8, 2024 Author
Thanks for the advice and for recognising the ATA errors. I ran a memtest for 35 hours and got zero errors; however, zpool status showed 4361 data errors on my brand new NVMe SSD. It also showed 12 errors on a ZFS HDD pool, but the data on that is of very little concern to me.

Here is the summary from zpool status:

  pool: cache_nvme
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 208K in 00:05:47 with 2764 errors on Wed Jul 3 20:37:28 2024
config:

        NAME         STATE     READ WRITE CKSUM
        cache_nvme   ONLINE       0     0     0
          nvme0n1p1  ONLINE       0     0   442

errors: 4361 data errors, use '-v' for a list

And here is a random snapshot from the -v output:

/mnt/cache_nvme/appdata/plex/Library/Application Support/Plex Media Server/Media/localhost/f/63c610b00f061594e768f953580d43f43648d1d.bundle/Contents/Indexes/index-sd.bif
/mnt/cache_nvme/appdata/plex/Library/Application Support/Plex Media Server/Media/localhost/a/b2002353744d57340eb60c8afd4677377c83bff.bundle/Contents/Indexes/index-sd.bif
/mnt/cache_nvme/appdata/plex/Library/Application Support/Plex Media Server/Media/localhost/a/0502b8d27d443f2b9cb05c72147afa7e6d9189b.bundle/Contents/Indexes/index-sd.bif
/mnt/cache_nvme/appdata/plex/Library/Application Support/Plex Media Server/Media/localhost/4/167d7b43f9fb7de3e54cd803b1270cbdbf5a786.bundle/Contents/Indexes/index-sd.bif
/mnt/cache_nvme/appdata/plex/Library/Application Support/Plex Media Server/Media/localhost/3/c3263d7f98af79991b47b34d773116998e20aa9.bundle/Contents/Indexes/index-sd.bif
/mnt/cache_nvme/appdata/plex_3-7-24/Library/Application Support/Plex Media Server/Media/localhost/f/b035977911750feb64bd46ea647eb895bc26335.bundle/Contents/Indexes/index-sd.bif
/mnt/cache_nvme/appdata/plex_3-7-24/Library/Application Support/Plex Media Server/Media/localhost/1/f98ec84ae597692ac9324f34ee0028c310d5dbf.bundle/Contents/Indexes/index-sd.bif
/mnt/cache_nvme/appdata/plex/Library/Application Support/Plex Media Server/Media/localhost/7/03964bf3d60806635ec8709d3c42feb71dd492f.bundle/Contents/Chapters/chapter4.jpg
/mnt/cache_nvme/appdata/plex/Library/Application Support/Plex Media Server/Media/localhost/7/f2166e77ee9c84c48b520736285117bdffdad45.bundle/Contents/Chapters/chapter7.jpg
/mnt/cache_nvme/appdata/plex_3-7-24/Library/Application Support/Plex Media Server/Media/localhost/6/6d99b313050130a93ea0f70f60300bfdbb15dcb.bundle/Contents/Indexes/index-sd.bif
/mnt/cache_nvme/appdata/plex/Library/Application Support/Plex Media Server/Media/localhost/7/4188f6abb0214f7ecf6176591bf6d032d5a6164.bundle/Contents/Indexes/index-sd.bif
cache_nvme/ddbcf1320376b06d5dd77d3251170565639df9912b50746dc905e17fc075db4a:/usr/lib/libstdc++.so.6.0.32
cache_nvme/eb44da049daa41a92a5116c6767bb9eb5a73f3604e4ef75d72faa642267f7ff7@497829946:/app/code-server/lib/vscode/extensions/ms-vscode.js-debug/resources/readme/link-debugging.gif
cache_nvme/d9dba1f4ea67195242d98700815b1b1dc407aa6e647c201326bee408fc35acdd@783287484:/usr/sbin/grpck
cache_nvme/d9dba1f4ea67195242d98700815b1b1dc407aa6e647c201326bee408fc35acdd@783287484:/usr/lib/locale/locale-archive
cache_nvme/d9dba1f4ea67195242d98700815b1b1dc407aa6e647c201326bee408fc35acdd@783287484:/usr/share/zoneinfo/Asia/Krasnoyarsk
<0x4a9d>:<0xae58>

Note that plex_3-7-24 is a folder I renamed so I could attempt to restore a backup without (hopefully) losing an updated database.

Edited July 8, 2024 by bonefiend
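For anyone following along who wants to track these counters between scrubs, here is a minimal sketch in Python that pulls the per-device READ/WRITE/CKSUM numbers out of saved zpool status text. The function name `device_errors` and the sample string are illustrative only, and the sketch assumes the counters are printed as plain integers (zpool may abbreviate very large counts, which this does not handle).

```python
import re

# Matches config rows such as "nvme0n1p1  ONLINE  0  0  442".
# Assumption: counters are plain integers; abbreviated counts such as
# "1.2K" are not handled by this sketch.
ROW = re.compile(
    r"^[ \t]*(?P<name>\S+)[ \t]+(?P<state>[A-Z]+)"
    r"[ \t]+(?P<read>\d+)[ \t]+(?P<write>\d+)[ \t]+(?P<cksum>\d+)[ \t]*$",
    re.MULTILINE,
)


def device_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with a nonzero counter."""
    found = {}
    for m in ROW.finditer(status_text):
        counts = (int(m["read"]), int(m["write"]), int(m["cksum"]))
        if any(counts):
            found[m["name"]] = counts
    return found


# Sample text modelled on the config section posted above.
SAMPLE = """\
config:
        NAME         STATE     READ WRITE CKSUM
        cache_nvme   ONLINE       0     0     0
          nvme0n1p1  ONLINE       0     0   442
"""

if __name__ == "__main__":
    print(device_errors(SAMPLE))
```

Comparing the result from one scrub to the next would show whether checksum errors keep accumulating on the same device after a hardware change.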
July 8, 2024 Community Expert
Note that memtest is only definitive if it finds errors. 9 out of 10 times, data corruption is RAM related. If you have multiple RAM sticks, you can try running the server with just one: fix the pools and see if more corruption is found. If it is, try a different stick. That will basically rule out the RAM.
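One way to notice new corruption sooner than waiting for an application database to break is a "canary" file: write known data, record its hash, and re-check it periodically while testing each RAM stick. The following is only a sketch under assumptions: the function names and file path are hypothetical, and a re-read may be served from the page cache rather than the drive, so a mismatch says that something corrupted the data, not which component did it.

```python
import hashlib
import os
import tempfile


def write_canary(path, size_bytes=1024 * 1024):
    """Write random bytes to `path` and return their SHA-256 hex digest."""
    data = os.urandom(size_bytes)
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device
    return hashlib.sha256(data).hexdigest()


def verify_canary(path, expected_digest):
    """Re-read `path` in chunks and compare its SHA-256 to the recorded digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_digest


if __name__ == "__main__":
    # Demo in a temporary directory; on a real server the path would be
    # somewhere on the pool under test (hypothetical choice, not from the thread).
    with tempfile.TemporaryDirectory() as d:
        canary = os.path.join(d, "canary.bin")
        digest = write_canary(canary, 4096)
        print("intact:", verify_canary(canary, digest))
```

Running the verify step on a schedule and logging the first failure time would give a rough time-to-corruption for each hardware configuration being compared.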
September 27, 2024 Author
Hello, I know it's been a while, but I've been trying out a few different fixes and still have not found a solution. I have swapped between my 4 sticks of RAM and I get a corrupted database with every pair. I reformatted the new cache drive entirely and replaced the cables for the HDDs that were causing ATA errors. Using 2 sticks, I started a brand new Plex server with no old files, didn't add any libraries to it, and within 2 days it was corrupted again. I then ran a memtest on that RAM for 7 days and had zero errors.

If anyone has any more ideas for diagnosing this issue, they would be greatly appreciated. New diagnostics here: nazana-diagnostics-20240927-1642.zip
September 27, 2024 Community Expert
There's filesystem corruption on cache_nvme, which may also be a symptom of your issues; check the filesystem. If RAM is not the issue, it could be a bad CPU, for example: a lot of Intel 13th and 14th gen CPUs are known to have stability issues.
September 30, 2024 Author
Running a filesystem check shows this:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
bad CRC for inode 726229
UUID mismatch on inode 726229
bad CRC for inode 726229, would rewrite
UUID mismatch on inode 726229
would have cleared inode 726229
bad CRC for inode 1204819
bad CRC for inode 1204819, would rewrite
would have cleared inode 1204819
        - agno = 1
Metadata corruption detected at 0x438a03, xfs_inode block 0x3a4c3aa0/0x4000
bad CRC for inode 538459881
bad CRC for inode 538459881, would rewrite
would have cleared inode 538459881
        - agno = 2
bad CRC for inode 1075070919
bad next_unlinked 0x63519b43 on inode 1075070919
bad CRC for inode 1075070919, would rewrite
bad next_unlinked 0x63519b43 on inode 1075070919, would reset next_unlinked
would have cleared inode 1075070919
bad CRC for inode 1075271901
UUID mismatch on inode 1075271901
bad CRC for inode 1075271901, would rewrite
UUID mismatch on inode 1075271901
would have cleared inode 1075271901
bad CRC for inode 1075544118
bad CRC for inode 1075544118, would rewrite
would have cleared inode 1075544118
        - agno = 3
bad CRC for inode 1610870839
bad CRC for inode 1610870839, would rewrite
would have cleared inode 1610870839
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 3
        - agno = 1
entry "abacbf297c10ba7690086657bde84d40afb9432.bundle" at block 1 offset 2752 in directory inode 536880939 references free inode 726229
would clear inode number in entry at offset 2752...
bad CRC for inode 726229, would rewrite
UUID mismatch on inode 726229
would have cleared inode 726229
bad CRC for inode 1610870839, would rewrite
would have cleared inode 1610870839
bad CRC for inode 1075070919, would rewrite
bad next_unlinked 0x63519b43 on inode 1075070919, would reset next_unlinked
Would clear next_unlinked in inode 1075070919
would have cleared inode 1075070919
bad CRC for inode 1204819, would rewrite
would have cleared inode 1204819
entry "thumb1.jpg" in shortform directory 1075271900 references free inode 1075271901
would have junked entry "thumb1.jpg" in directory inode 1075271900
bad CRC for inode 1075271901, would rewrite
UUID mismatch on inode 1075271901
would have cleared inode 1075271901
bad CRC for inode 538459881, would rewrite
would have cleared inode 538459881
bad CRC for inode 1075544118, would rewrite
would have cleared inode 1075544118
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
Metadata corruption detected at 0x46f860, inode 0x126253 dinode
couldn't map inode 1204819, err = 117
entry "abacbf297c10ba7690086657bde84d40afb9432.bundle" in directory inode 536880939 points to free inode 726229, would junk entry
would rebuild directory inode 536880939
Metadata corruption detected at 0x438a03, xfs_inode block 0x3a4c3aa0/0x4000
Metadata corruption detected at 0x46f860, inode 0x20183ee9 dinode
couldn't map inode 538459881, err = 117
entry "thumb1.jpg" in shortform directory inode 1075271900 points to free inode 1075271901 would junk entry
Metadata corruption detected at 0x46f860, inode 0x401b8036 dinode
couldn't map inode 1075544118, err = 117
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected dir inode 537162312, would move to lost+found
disconnected dir inode 537817763, would move to lost+found
disconnected dir inode 1611411134, would move to lost+found
Phase 7 - verify link counts...
would have reset inode 536880939 nlinks from 1829 to 1828
Metadata corruption detected at 0x46f860, inode 0x126253 dinode
couldn't map inode 1204819, err = 117, can't compare link counts
Metadata corruption detected at 0x46f860, inode 0x20183ee9 dinode
couldn't map inode 538459881, err = 117, can't compare link counts
Metadata corruption detected at 0x46f860, inode 0x401b8036 dinode
couldn't map inode 1075544118, err = 117, can't compare link counts
No modify flag set, skipping filesystem flush and exiting.

The UUID mismatch seems concerning; how could that have happened? Then when I try xfs_repair I get this error:

Sorry, could not find valid secondary superblock

When I run a SMART extended self-test, it says it's progressing, then at the end says "No self-tests logged on this disk", which is what was happening on my old NVMe drive just before total failure (I'm starting to think some other part killed it now). Again, this is a less than 3 month old Crucial NVMe drive.

My CPU is the i5 14500, which isn't on the list of affected CPUs as far as I'm aware, but I will do a BIOS update and then a cache drive reset. If that's still the cause, should I try to return it? I have no idea what else to do. Troubleshooting shouldn't be this difficult, right?
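The check output above repeats the same few inodes many times, some in decimal and some in hexadecimal. As a rough aid for reading it, this Python sketch (the function name `summarize` and the excerpt are illustrative) collects the distinct inode numbers and converts the hex "dinode" values so they can be matched against the "couldn't map inode" lines. As a side note, errno 117 on Linux is EUCLEAN ("Structure needs cleaning").

```python
import re


def summarize(report):
    """Collect distinct inode numbers flagged in the filesystem check text."""
    # Decimal inode numbers from "bad CRC for inode N" lines.
    bad_crc = {int(n) for n in re.findall(r"bad CRC for inode (\d+)", report)}
    # "inode 0x126253 dinode" lines give the inode number in hexadecimal.
    hex_inodes = {
        int(h, 16) for h in re.findall(r"inode (0x[0-9a-fA-F]+) dinode", report)
    }
    # Decimal inode numbers from "couldn't map inode N" lines.
    unmapped = {int(n) for n in re.findall(r"couldn't map inode (\d+)", report)}
    return bad_crc, hex_inodes, unmapped


# Short excerpt copied from the output above.
EXCERPT = """\
bad CRC for inode 1204819
bad CRC for inode 1204819, would rewrite
Metadata corruption detected at 0x46f860, inode 0x126253 dinode
couldn't map inode 1204819, err = 117
Metadata corruption detected at 0x46f860, inode 0x20183ee9 dinode
couldn't map inode 538459881, err = 117
"""

if __name__ == "__main__":
    bad_crc, hex_inodes, unmapped = summarize(EXCERPT)
    print("bad CRC:", sorted(bad_crc))
    print("hex dinode values match unmapped inodes:", hex_inodes == unmapped)
```

On this excerpt the hex values 0x126253 and 0x20183ee9 decode to 1204819 and 538459881, the same inodes the following lines report as unmappable.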
September 30, 2024 Author
Thanks, I've done that now. I really appreciate you trying to help, but do you have any other advice for me or where I can go from here?
September 30, 2024 Community Expert
If it's a hardware issue, you will basically need to swap some parts to try and find the culprit.