zfp
Posted December 29, 2023

Huge fan of Unraid, been using it for 10+ years. I've always found the forum to be a great source of info, and now I'd appreciate some insight. I've run into an issue with disks being available but showing a red X.

System specs
Unraid 6.12.6
Asus W680, i5-13600K, 2x 32GB ECC DDR5
16x array HDDs (XFS, dual parity), 3x NVMe SSDs (BTRFS: a 2-drive RAID 1 pool and a single-SSD pool)
SAS 2308 (IT mode) -> Intel RES2SV240 -> 16x HDDs
Radeon PRO WX 2100, 1x USB 3.0 card
Corsair VX450W PSU (450W)

Incident background
I recently upgraded my system from a C246 Xeon setup without any issues. The system has been running stable, except for a botched VM migration to a large SSD array (100% user error on my part!). Recently I bought 2x 20TB drives to replace my 2x 18TB parity drives. After a clean shutdown, I added the 2x 20TB drives to be precleared before adding them to the array. The preclear started without issue. Then I started a torrent Docker container that writes to the array, and suddenly I had a ton of errors in the array. I performed a clean shutdown and removed the 20TB drives. I also checked the cable connections; I've had issues in the past and loose cabling was often the culprit. Everything seemed fine. Then on reboot, one of my parity drives had a red X, as did disk 1. In addition, disk 1 shows as unmountable and the UI says I need to format a file system on all unmountable disks.

Current status
Right now I've started the array and kicked off a read-check. I haven't done anything else to the red X-ed parity drive or disk 1. I do have dual parity, so in theory I didn't lose anything, but *shrug* who knows? I suspect these errors were caused by an insufficient PSU. I don't do any gaming or what I'd call "high-powered" computing, but it is an ancient PSU, even though it's served me without issue for many years. In the meantime, what's the best way to resolve the red X issue?
My first impulse is to rebuild parity and disk 1, but to be blunt, my first impulse usually causes more problems later. I've attached diagnostics; thank you for any help you can provide.

diagnostics-20231229-0132.zip
itimpi
Posted December 29, 2023

Handling of unmountable disks is covered here in the online documentation, accessible via the Manual link at the bottom of the Unraid GUI. In addition, every forum page has a DOCS link at the top and a Documentation link at the bottom.
zfp
Posted December 29, 2023 (Author)

Thanks so much! I checked your link to the online doc and restarted the array in maintenance mode. I tried disk 1 first and here's what I got...

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
would write modified primary superblock
Primary superblock would have been modified. Cannot proceed further in no_modify mode. Exiting now.

From the help files, it seems like I should replace the "-n" option with nothing. So I gave that a shot and got this...

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
Phase 2 - using internal log
- zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.

Regretfully, I have no idea where to go from here. Sorry to ask, but what should my next step be? And would these next steps apply to the red X parity drive?
itimpi
Posted December 29, 2023

You now need to run without -n, adding the -L option, to get the file system repaired. This is mentioned in the section on repairing the file system. After that, restart the array in normal mode and the drive should mount OK.

For any disk that is disabled (has a red 'x' against it), the procedures to get it back online are here.
zfp
Posted December 29, 2023 (Author)

I ran the check w/ the -L option, and I'm not sure if it worked. Here's what came out at the end...

Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
reinitializing root directory
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
Metadata corruption detected at 0x46f8c0, inode 0x80 dinode
fatal error -- couldn't map inode 128, err = 117

The whole "fatal error" thing doesn't look good. I stopped the array, restarted in maintenance mode, and I still have the red X. Should I assume the repair didn't work?
itimpi
Posted December 29, 2023

4 minutes ago, zfp said:
"I ran the check w/ the -L and at the end, I'm not sure if it worked. [...] Should I assume the repair didn't work?"

That is not good - it suggests the repair did not work. Note that you are not trying to clear the red 'x' at this point, but the 'unmountable' status when running the array in normal mode. If the drive has a red 'x' as well, then the repair was running against the emulated drive and not the physical drive (the check/repair does not clear the red 'x' status, as that requires a rebuild). It might be possible to work with the physical drive instead.
zfp
Posted December 29, 2023 (Author)

Oh boy. My interpretation is that I should stop the array and use the command line to run "xfs_repair -L /dev/sdq" (sdq is what I see on the drive). If that works, then I guess I'd be OK to start the array as usual. If not, do I run a rebuild? I do have dual parity and only one parity disk has a red X.

Also, how do I handle the red X parity drive? There's no file system on there, from my understanding.

I really do appreciate all your help and apologize if I'm being difficult in any way. It's definitely not my intention; I'm just hoping I'm not digging a deeper hole than I'm already in!
itimpi
Posted December 29, 2023

That is not quite the right command, as one would need to include the partition number. Also, that command invalidates parity, so it has consequences. These can be handled, but one needs to proceed cautiously.

I looked at the diagnostics posted earlier and could not see the 'sdq' device you mention. Perhaps you should post a new set of diagnostics so we can be certain they are current; a screenshot of the Main tab would be useful as well.

The Parity2 disk is definitely going to require a rebuild. However, since it has a red 'x' against it, it is currently not being used, so fixing it can wait until the issue with the data drive is resolved.
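itimpi's warning is worth automating: repairing through the md device keeps parity in sync, while repairing the raw /dev/sdX device invalidates it. A minimal sketch of a guard along those lines (the check_target helper is hypothetical, not an Unraid tool, and the device names are the ones from this thread):

```shell
#!/bin/sh
# Hypothetical guard: only accept Unraid md array devices with an explicit
# partition number (e.g. /dev/md1p1) as xfs_repair targets. Raw /dev/sdX
# devices bypass the parity layer, so writing to them invalidates parity.
check_target() {
  case "$1" in
    /dev/md[0-9]*p1) echo "ok: $1 targets the array (parity stays valid)" ;;
    /dev/sd*)        echo "refusing: $1 is a raw device (would invalidate parity)" ;;
    *)               echo "refusing: $1 is not a recognised device name" ;;
  esac
}

check_target /dev/md1p1   # the emulated disk 1 device, safe to repair
check_target /dev/sdq     # raw device, and missing the partition number
```

Only after the guard passes would you actually run xfs_repair against the target; the sketch itself never touches a disk.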
zfp
Posted December 29, 2023 (Author)

Yikes, looks like someone else has the same issue I'm having: XFS_REPAIR: FATAL ERROR -- COULDN'T MAP INODE <>, ERR = 117

It looks like the only option is to pull the drive, attach it to another system, and use UFS Explorer to pull the data onto another drive. Then take the drive, preclear it, and re-add it to the array. Now it'll be an empty drive, and I'd need to copy the files from UFS Explorer back onto it. Which will then update parity with the recovered files?
itimpi
Posted December 29, 2023

Just now, zfp said:
"It looks like the only option is to pull the drive out, attach it to another system and use UFS to pull the data onto another drive."

That is a last resort and may not be needed.
zfp
Posted December 29, 2023 (Author)

Here's an updated diagnostics. Disk 1 is a SAS drive I bought by accident; fortunately I use an HBA w/ a Supermicro 5-in-3, so it worked without issue. Well, at least until now. The specific txt file is "35000c500d78b50f3-20231229-0325 disk1 (sdq) - DISK_DSBL.txt".

diagnostics-20231229-0325.zip
zfp
Posted December 29, 2023 (Author)

5 minutes ago, itimpi said:
"That is a last resort and may not be needed."

Fingers crossed I don't have to resort to this. Didn't seem too bad in the 4TB days, but 16TB...eek!
zfp
Posted December 29, 2023 (Author)

Based on itimpi's helpful advice and carefully reading a few other forum posts, I attempted the following on the red X disk 1 (aka sdq):

1) Ran a short SMART test with the array stopped. No issues reported.

2) Started the array in maintenance mode and used the webUI to check the file system with the -nv option. The output is as follows...

XFS_REPAIR Summary Fri Dec 29 12:16:07 2023
Phase Start End Duration
Phase 1: 12/29 12:14:40 12/29 12:14:40
Phase 2: 12/29 12:14:40 12/29 12:14:41 1 second
Phase 3: 12/29 12:14:41 12/29 12:16:07 1 minute, 26 seconds
Phase 4: 12/29 12:16:07 12/29 12:16:07
Phase 5: Skipped
Phase 6: Skipped
Phase 7: Skipped

3) SSH-ed into the server w/ the array in maintenance mode and ran "xfs_repair -Lv /dev/md1p1". Lots of issues; it seems unrepairable.

Phase 5 - rebuild AG headers and trees...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- reset superblock...
Phase 6 - check inode connectivity...
reinitializing root directory
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- agno = 0
Metadata corruption detected at 0x46f8c0, inode 0x80 dinode
fatal error -- couldn't map inode 128, err = 117

4) Stopped the array and tried to run xfs_repair directly on disk 1 (sdq) with "xfs_repair -nv /dev/sdq1". No luck...

Phase 1 - find and verify superblock...
xfs_repair: read failed: Invalid argument
xfs_repair: data size check failed
xfs_repair: cannot repair this filesystem. Sorry.

At this point, it seems the file system is unrepairable. I saw some posts where people changed disk 1 to "no disk" and then mounted the former disk 1 using the Unassigned Devices plugin. Others have shut down the array, pulled the drive, and attempted xfs_repair on another system. There's also pulling the drive and trying to copy the files to a new disk, then copying back into the array.

Are there any other options I'm missing? If not, which of the above scenarios is the safest choice?
trurl
Posted December 29, 2023

If you've already attempted to repair the emulated disk (/dev/md1p1) and also attempted to repair the physical disk (/dev/sdq1), then it seems you have done everything xfs_repair can help you with.
zfp
Posted December 29, 2023 (Author)

2 hours ago, trurl said:
"...it seems you have done everything xfs_repair can help you with."

Thanks so much for the confirmation, even though it's not great news. I removed disk 1 from the array list and tried to mount it via Unassigned Devices. The mount failed, so I checked the logs...

Dec 29 12:57:33 biollante unassigned.devices: Mounting partition 'sdq1' at mountpoint '/mnt/disks/SEAGATE_ST16000NM007G'...
Dec 29 12:57:33 biollante unassigned.devices: Mount cmd: /sbin/mount -t 'xfs' -o rw,relatime '/dev/sdq1' '/mnt/disks/SEAGATE_ST16000NM007G'
Dec 29 12:57:33 biollante kernel: XFS (sdq1): device supports 4096 byte sectors (not 512)
Dec 29 12:57:35 biollante unassigned.devices: Mount of 'sdq1' failed: 'mount: /mnt/disks/SEAGATE_ST16000NM007G: mount(2) system call failed: Function not implemented. dmesg(1) may have more information after failed mount system call.'
Dec 29 12:57:35 biollante unassigned.devices: Partition 'SEAGATE ST16000NM007G' cannot be mounted.

I'm waving the white flag on this one. I've pulled the unreadable disk 1 and will try UFS Explorer to salvage whatever data I can from the drive. Of course I didn't have backups; I kept putting off a backup solution...lesson learned the hard way.

Currently I have an empty disk 1 slot, and I'm running a parity-sync since the other disk that went red X was a parity drive. Though Unraid seems unhappy about a missing disk 1: it says it's unmountable and shows the format option. Does Unraid just dislike having array disks out of sequence?

ADDENDUM: Actually, since the other array drives seem OK, should I put in a new config and then run a parity-sync? It seems like running a parity-sync now would just perpetuate the unmountable disk 1 error.
JorgeB
Posted December 30, 2023

Since disk1 is 4Kn, xfs_repair won't work on the device directly without it being assigned as an array data device, so it may be worth booting an Unraid trial flash drive, assigning that disk as disk1, and running xfs_repair again using the GUI or /dev/mdXp1.
zfp
Posted January 1 (Author)

On 12/30/2023 at 2:59 AM, JorgeB said:
"...it may be worth booting an Unraid trial flash drive, assign that disk as disk1 and run xfs_repair again using the GUI or /dev/mdXp1"

Interesting idea! I took your advice and used a fresh Unraid USB on another system. I put in the faulty drive as disk1 (which it was in my main Unraid machine). I first tried to mount the disk using UD; the mount failed as before. Then I started the array in maintenance mode. I didn't see the usual GUI xfs_repair option, so I went into the shell and ran "xfs_repair -L /dev/md1p1". This is what I got...

Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (3:2007914) is ahead of log (1:2).
Format log to cycle 6.
done

It didn't spit out errors like before, but the disk is still unmountable. I tried the disk directly with "xfs_repair -n /dev/sdc1":

root@Tower:~# xfs_repair -n /dev/sdc1
Phase 1 - find and verify superblock...
xfs_repair: read failed: Invalid argument
xfs_repair: data size check failed
xfs_repair: cannot repair this filesystem. Sorry.

It seems like xfs_repair isn't going to bail me out on this. Of course, I'm open to any other ideas.

On a side note, I gave UFS Explorer a shot and the directory structure and files were identified. I had some files w/ md5 checksums saved, covering only a small part of the disk (~2TB or so). A couple of files failed the checksum, so it seems like there is data corruption *sob*
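The checksum spot-check described above can be scripted using md5sum's own manifest format. A minimal, self-contained sketch (the file names here are made up for the demo; on the real system the manifest would be the checksums saved before the incident):

```shell
#!/bin/sh
# Sketch: verify recovered files against a previously saved md5 manifest.
# The manifest format is what `md5sum` itself emits: "<hash>  <path>".
demo=$(mktemp -d)
cd "$demo"
printf 'known-good contents\n' > recovered.bin
md5sum recovered.bin > saved.md5   # stands in for the checksums saved earlier
md5sum -c saved.md5                # prints "recovered.bin: OK" if intact
```

A corrupted file would instead report "recovered.bin: FAILED", and `md5sum -c` exits nonzero, which makes it easy to script over a whole recovery tree.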
JorgeB
Posted January 1

1 hour ago, zfp said:
"Maximum metadata LSN (3:2007914) is ahead of log (1:2). Format log to cycle 6. done"

This appears to have been successful.

1 hour ago, zfp said:
"I tried the disk directly using 'xfs_repair -n /dev/sdc1'."

This will still fail; you must only use the md devices for 4Kn drives. Post new diags after starting the array in normal mode with the disk assigned as disk1.
zfp
Posted January 1 (Author)

6 hours ago, JorgeB said:
"...Post new diags after array start in normal mode with the disk assigned as disk1."

I assigned the disk to disk1, started the array normally, and the files DO show up. However, a spot check shows some files are still corrupted. Looking at other posts, the xfs_repair output when the file system was repaired looks different, with messages about each phase, so I suspect xfs_repair did not repair the file system. My reasoning is that in the original array it was red X-ed because it knew there was an error; putting it in a new system doesn't "know" there's an error, so it mounts normally. Also, I believe xfs_repair puts corrupted files it identifies into a lost+found dir, which I don't see.

Diags are posted; thanks again for all the help.

recovery-diagnostics-20240101-1039.zip
JorgeB
Posted January 2

14 hours ago, zfp said:
"I suspect xfs_repair did not repair the file system."

I don't see any indication of that; the xfs_repair output looks perfectly normal. You can run it again and check the exit code: if it is 0, no more corruption was detected.

14 hours ago, zfp said:
"My reasoning is that in the original array, it was red x-ed because it knew there was an error."

Disabled-disk and filesystem issues are not necessarily related. In any case, copying the data directly from the mounted disk (or just reassigning it to the other array) is IMHO the best bet to recover the data.
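The exit-code check JorgeB describes looks like this in practice. In the sketch below, run_repair is a harmless stand-in that only echoes and returns 0, so nothing touches a disk; on a real array you would invoke xfs_repair against the md device (e.g. /dev/md1p1) directly:

```shell
#!/bin/sh
# Pattern: branch on xfs_repair's exit status (0 means no corruption found).
run_repair() {
  # Stand-in for: xfs_repair "$1" — echoed here so the sketch is side-effect free.
  echo "would run: xfs_repair $1"
  return 0
}

if run_repair /dev/md1p1; then
  echo "exit code 0: no more corruption detected"
else
  echo "nonzero exit code: corruption remains, repeat the repair"
fi
```

The same pattern works interactively: run the repair, then `echo $?` immediately afterwards to read the exit status before any other command overwrites it.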
zfp
Posted January 3 (Author)

On 1/2/2024 at 1:15 AM, JorgeB said:
"...copying the data directly from the mounted disk (or just reassigning it to the other array) is IMHO the best bet to recover the data."

The lack of error codes baffles me. When I ran xfs_repair in the original array, it couldn't repair the file system:

Phase 6 - check inode connectivity...
reinitializing root directory
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- agno = 0
Metadata corruption detected at 0x46f8c0, inode 0x80 dinode
fatal error -- couldn't map inode 128, err = 117

Next I moved the drive to another system and ran UFS Explorer to recover files. For some files I had md5 checksums saved, and a couple of files failed the check. I then took your suggestion to try a fresh Unraid system and see what xfs_repair would do; this time it reported no errors. However, when I checked the md5s on the files that had failed before, they still failed. As a sanity check, I had some backups in the cloud and those md5s matched up fine. Comparing the checksums of the UFS-recovered files and the files mounted on the fresh system, they matched each other.

To sum it up: xfs_repair on the original array showed unrecoverable errors. The fresh Unraid system showed no errors, BUT the mounted files still have mismatched md5s. It seems the errors are unidentifiable at this point, and we've hit a dead end. Files can be recovered, but there are corrupted files. Not many from what I can tell, but they exist.

Here is the current situation in my main array: disk 1 is the drive that showed errors. Parity 1 also had a red X, but I rebuilt it, which was probably unnecessary since the disk 1 red X is still there. This was all triggered when I added two new drives to be precleared. I had a 450W PSU, which was probably woefully inadequate; I've replaced it with a 650W unit, and I have 2x 20TB drives ready to go in as new parity drives.

It seems like the best way to remove the red X on drive 1 and avoid further data corruption is the following:

1) Power down, add the 2x 20TB drives to replace the parity drives, plus a precleared empty 14TB to replace disk 1.
2) Run the New Config utility. Keep all array drive assignments the same EXCEPT the new 14TB as disk 1. Unassign the old 2x 18TB parity drives.
3) Start the array, which will now be unprotected.
4) Stop the array and assign the 2x 20TB drives as the new parity drives.
5) Restart the array and run the parity sync.

My logic is that this approach won't lose any data from the array (besides the corrupted disk) and parity will be valid. Then I can copy the recovered data back to the array using the UD plugin. Am I missing anything?

PS: The corrupted disk 1 was 16TB. The 14TB is a "placeholder"; I just want it empty. I'm going to replace it with one of my old 18TB parity drives.
JorgeB
Posted January 4

Note that XFS doesn't have any data corruption detection, unlike btrfs or ZFS, and the data corruption can be unrelated to the filesystem issues.
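Given that XFS won't catch silent corruption on its own, one practical hedge is keeping your own checksum manifest alongside the data. A sketch with throwaway demo paths (on a real server you'd point the `find` at /mnt/disk1 or a user share instead, and re-run the `-c` verification periodically):

```shell
#!/bin/sh
# Build an md5 manifest for a tree of files, then verify it.
# Demo uses a temp directory so the sketch is self-contained and harmless.
demo=$(mktemp -d)
mkdir -p "$demo/data"
printf 'payload\n' > "$demo/data/a.txt"
cd "$demo"
find data -type f -exec md5sum {} + > manifest.md5
md5sum -c manifest.md5   # prints "data/a.txt: OK" while the file is intact
```

Tools like the File Integrity plugin automate the same idea within Unraid, but a plain manifest has the advantage of being checkable on any Linux box, including a recovery system.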