Nammertat Posted September 28, 2021

Hi all, I'm ~5 days into Unraid as a recovering Windows addict, and I appear to have a serious situation on my hands. After battling my lack of Linux knowledge, I was able to get my Plex instance migrated to Docker and running great yesterday. I added a few 3TB drives and let them complete the format/mount overnight last night. I was having an issue getting the libvirt service to start, so this morning I rebooted my Unraid box. On reboot, all three of my main data drives went to unmountable, leaving only the newly added 3TB drive in the array. I'm terrified I've lost all of my progress and a ton of production data. Any help would be immensely appreciated!

I'm seeing filesystem corruption in the logs, and something has gone wonky with the UUIDs:

Sep 28 09:30:40 Tower kernel: XFS (md1): Corruption detected. Unmount and run xfs_repair
Sep 28 09:30:40 Tower kernel: XFS (md1): log has mismatched uuid - can't recover
Sep 28 09:30:40 Tower kernel: XFS (md1): failed to find log head
Sep 28 09:30:40 Tower kernel: XFS (md1): log mount/recovery failed: error -117
Sep 28 09:30:40 Tower kernel: XFS (md1): log mount failed
Sep 28 09:30:40 Tower root: mount: /mnt/disk1: mount(2) system call failed: Structure needs cleaning.
Sep 28 09:30:40 Tower emhttpd: shcmd (5272): exit status: 32
Sep 28 09:30:40 Tower emhttpd: /mnt/disk1 mount error: not mounted
Sep 28 09:30:40 Tower emhttpd: shcmd (5273): umount /mnt/disk1
Sep 28 09:30:40 Tower root: umount: /mnt/disk1: not mounted.
Sep 28 09:30:40 Tower emhttpd: shcmd (5273): exit status: 32
Sep 28 09:30:40 Tower emhttpd: shcmd (5274): rmdir /mnt/disk1
Sep 28 09:30:40 Tower emhttpd: shcmd (5275): mkdir -p /mnt/disk2
Sep 28 09:30:40 Tower emhttpd: shcmd (5276): mount -t xfs -o noatime /dev/md2 /mnt/disk2
Sep 28 09:30:40 Tower kernel: XFS (md2): Mounting V5 Filesystem
Sep 28 09:30:40 Tower kernel: XFS (md2): Internal error !uuid_equal(&mp->m_sb.sb_uuid, &head->h_fs_uuid) at line 259 of file

From the attached xfs_repair output:

Metadata corruption detected at 0x439218, xfs_agf block 0x27fffffd9/0x200
Metadata corruption detected at 0x439218, xfs_agf block 0x2ffffffd1/0x200
Metadata corruption detected at 0x439218, xfs_agf block 0xfffffff1/0x200
Metadata corruption detected at 0x439218, xfs_agf block 0x1/0x200
Metadata corruption detected at 0x464290, xfs_agi block 0x2ffffffd2/0x200
Metadata corruption detected at 0x464290, xfs_agi block 0x27fffffda/0x200
Metadata corruption detected at 0x439218, xfs_agf block 0x17fffffe9/0x200
bad uuid b89f9a81-1193-47c1-9df7-29bef286d1ca for agf 5
bad uuid b89f9a81-1193-47c1-9df7-29bef286d1ca for agf 6
bad uuid b89f9a81-1193-47c1-9df7-29bef286d1ca for agi 5
bad uuid b89f9a81-1193-47c1-9df7-29bef286d1ca for agi 6

Attachments: tower-diagnostics-20210928-1006.zip, MD1 MD2 MD3 xfs_repair output.txt
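For reference, the -117 in the kernel log lines above is a negated Linux errno: 117 is EUCLEAN, whose message is exactly the "Structure needs cleaning" text in the failed mount. A quick way to confirm this mapping from Python on a Linux box:

```python
import errno
import os

# XFS reports negated errno values in the kernel log:
# "log mount/recovery failed: error -117"
code = 117

# On Linux, errno 117 is EUCLEAN, and its message matches
# the mount(2) failure Unraid logged.
print(errno.errorcode[code])  # EUCLEAN
print(os.strerror(code))      # Structure needs cleaning
```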
JorgeB Posted September 28, 2021

3 disks getting corruption at the same time is not normal and could indicate an underlying hardware problem; I would start by running memtest before attempting any filesystem repairs.
Nammertat Posted September 28, 2021 (Author)

12 minutes ago, JorgeB said: 3 disks getting corruption at the same time is not normal and could indicate an underlying hardware problem, I would start by running memtest before attempting any fs repairs.

Running memtest now. An additional detail that could impact it: I added my NVMe drive as cache last night, and mover did its thing overnight (before the reboot I confirmed it wasn't still going).
Nammertat Posted September 28, 2021 (Author)

29 minutes ago, Nammertat said: Running memtest now. An additional detail that could impact it: I added my NVMe drive as cache last night, and mover did its thing overnight (before the reboot I confirmed it wasn't still going).

Memtest86 completed without errors.
Nammertat Posted September 28, 2021 (Author)

With Docker down, I've got ~80TB of inaccessible library. Does anyone have thoughts on troubleshooting steps here?
JorgeB Posted September 29, 2021

12 hours ago, Nammertat said: Memtest86 completed without errors.

It's recommended to run memtest for a few hours, up to 24, though a problem that would cause 3 disks to go unmountable should be easy to find if it was bad RAM. If there are no apparent RAM issues, the next step is to check the filesystem on the affected disks.
Nammertat Posted September 29, 2021 (Author)

8 hours ago, JorgeB said: It's recommended to run memtest for a few hours, up to 24, though a problem that would cause 3 disks to go unmountable should be easily found if it was bad RAM, if there are no apparent RAM issues next step is to check filesystem on the affected disks.

Thanks so much for the reply - I really appreciate it. I should have been clearer above: I ran memtest for ~13 hours (overnight the night before). I tried the check-filesystem steps before posting, and attached the logs/output in the initial post.
JorgeB Posted September 29, 2021

xfs_repair is aborting due to metadata corruption. You can try upgrading to v6.10-rc1 and running it again, since it includes a newer xfsprogs, but I suspect there's an underlying hardware issue and trying to repair might not be very successful.
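The "bad uuid ... for agf/agi N" lines in the attached xfs_repair output can be tallied to see how many allocation groups disagree with the superblock UUID. A minimal sketch, using the lines quoted earlier in the thread as sample input (the parsing itself is illustrative, not part of any Unraid or xfsprogs tooling):

```python
import re

# Sample lines copied from the xfs_repair output posted above.
repair_output = """\
bad uuid b89f9a81-1193-47c1-9df7-29bef286d1ca for agf 5
bad uuid b89f9a81-1193-47c1-9df7-29bef286d1ca for agf 6
bad uuid b89f9a81-1193-47c1-9df7-29bef286d1ca for agi 5
bad uuid b89f9a81-1193-47c1-9df7-29bef286d1ca for agi 6
"""

# Capture the reported UUID, the structure type (AGF/AGI), and the AG number.
pattern = re.compile(r"bad uuid ([0-9a-f-]+) for (agf|agi) (\d+)")
hits = pattern.findall(repair_output)

distinct_uuids = {uuid for uuid, _, _ in hits}
affected_ags = sorted({int(ag) for _, _, ag in hits})

# A single distinct UUID across all hits suggests the AG headers agree with
# each other but not with the superblock.
print(distinct_uuids)
print(affected_ags)  # [5, 6]
```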
Nammertat Posted September 29, 2021 (Author)

21 minutes ago, JorgeB said: xfs_repair is aborting due to metadata corruption, you can try upgrading to v6.10-rc1 and running it again since it includes newer xfs-progs, but I suspect there's an underlying hardware issue and trying to repair might not be very successful.

I'll try the new version today. Assuming that fails as well, do you have any recommendations for identifying the underlying hardware issue? It's worth noting that this hardware ran a Windows environment for the prior 6 months without a problem, although I suppose it's possible Windows was compensating for something that is killing XFS.
JorgeB Posted September 29, 2021

4 minutes ago, Nammertat said: do you have any recommendations for identifying the underlying hardware issue?

If it's not RAM, my next suspect would be the board, which is also where the controller is. Issues with the onboard SATA controller on some Ryzen boards under Linux are quite common; in most cases the controller drops and you get read errors on all the disks, but the filesystem usually survives or is easily fixed.
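One common signature of a dropping SATA controller is repeated link resets in the kernel log. The log lines below are hypothetical examples of what a failing port often emits; on a live system you would feed in real `dmesg` output instead:

```python
# Hypothetical dmesg excerpt; on a real box, capture with: dmesg | grep ata
sample_log = """\
ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4050002 action 0xe frozen
ata3: hard resetting link
ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata5: hard resetting link
"""

def count_link_resets(log_text):
    """Count 'hard resetting link' events per ATA port, a common
    symptom when an onboard controller drops under load."""
    counts = {}
    for line in log_text.splitlines():
        if "hard resetting link" in line:
            port = line.split(":", 1)[0]
            counts[port] = counts.get(port, 0) + 1
    return counts

print(count_link_resets(sample_log))  # {'ata3': 1, 'ata5': 1}
```

A port that racks up many resets across multiple disks points at the controller rather than any single drive.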
Nammertat Posted September 29, 2021 (Author)

3 hours ago, JorgeB said: If it's not RAM my next suspect would be the board, also where the controller is, and issues with the onboard SATA controller on some Ryzen boards and Linux are quite common, most cases the controller drops and you get read errors on all the disks, but the filesystem usually survives or it's easily fixed.

The SATA controller is totally plausible. It's the one thing that wasn't utilized before (these disks were in an external USB 3.2 enclosure and my OS was on the NVMe). Any idea how to go about testing that controller?
JonathanM Posted September 29, 2021

1 minute ago, Nammertat said: Any idea how to go about testing that controller?

Get a controller from this list that can handle all your drives and see if the issue is solved.
Nammertat Posted September 29, 2021 (Author) (edited)

2 hours ago, JonathanM said: Get a controller from this list that can handle all your drives and see if the issue is solved.

Will do. I think you're right that my controller is now the most likely candidate. I ran extended SMART tests on all drives without error, and found lots of examples of my controller failing in other builds. Going to replace it with an ASM1064 controller and not look back.

Now, for the shocking news: I upgraded to the newest (not yet stable) Unraid version, and the updated xfs_repair was able to successfully bring my array back online. This means I can copy off all of my Docker config (saving me ~15-20 hours), my img-converted VMs (saving me another ~24 hours of process/wait), and my updated Plex metadata and paths. Not to mention I'll be able to leave the server online until the new controller gets here this weekend.

@JorgeB - when I get the new controller in, would you recommend I start fresh with a new config/array and copy back all of my Docker/VM files, given that I've had corruption/fragments on this array?

Edited September 29, 2021 by Nammertat
JorgeB Posted September 30, 2021

8 hours ago, Nammertat said: when I get the new controller in, would you recommend I start fresh with a new config/array and copy back all of my docker/vm files given that I've had corruption/fragments on this array?

If the repairs were successful it should be fine to continue with the current array. Also check for lost+found folders on the disks; there might be lost/partial files there.
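The lost+found check can be scripted. A minimal sketch, assuming the standard Unraid layout where each array disk is mounted under /mnt/diskN (the mount root is a parameter so it can be pointed anywhere for testing):

```python
import os

def find_lost_and_found(mount_root="/mnt"):
    """Return any lost+found directories xfs_repair left behind under the
    array mounts, mapped to the number of entries each contains."""
    results = {}
    if not os.path.isdir(mount_root):
        return results
    for disk in sorted(os.listdir(mount_root)):
        candidate = os.path.join(mount_root, disk, "lost+found")
        if os.path.isdir(candidate):
            results[candidate] = len(os.listdir(candidate))
    return results

# Example usage: print each lost+found directory and its entry count.
for path, count in find_lost_and_found().items():
    print(f"{path}: {count} recovered entries")
```

Entries in lost+found are named by inode number, so anything found there usually needs to be identified by hand before copying it back.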