
Sudden parity/data disk unavailability (red X)


zfp


Huge fan of Unraid, been using it for 10+ years. I've always found the forum to be a great source of info, and now I'd appreciate some insight. I've run into an issue with disks that are physically present but showing a red X.

 

System specs

Unraid 6.12.6

Asus w680, i5-13600k, 2x32gb ECC DDR5

16x array HDs (XFS, dual parity), 3x NVMe SSDs (BTRFS: a 2-drive RAID 1 pool and a single-SSD pool)

SAS 2308 (IT mode) -> Intel RES2SV240 -> 16x HDs

Radeon PRO WX2100, 1x USB3 card

Corsair VX450W PSU (450w)

 

Incident background

I recently upgraded my system from a C246 Xeon setup without any issues. The system has been running stable except for a poor VM migration to a large SSD array (100% user error on my part!). Recently I bought 2x 20TB drives to upgrade my 2x 18TB parity drives. I did a clean shutdown and added the 2x 20TB drives to be precleared before adding them to the array.

 

I started the preclear without issue. Then I started a torrent Docker container that writes to the array, and suddenly I had a ton of errors in the array. I performed a clean shutdown and removed the 20TB drives. I also checked the cable connections; I've had issues in the past where loose cabling was the culprit. Everything seemed fine. Then on reboot, one of my parity drives had a red X, as did disk 1. In addition, disk 1 shows as unmountable and the UI says I need to format all unmountable disks.

 

Current status

Right now I've started the array and kicked off a read-check. I haven't done anything else to the red X-ed parity drive or disk 1. I do have dual parity, so in theory I haven't lost anything, but *shrug* who knows?

 

I suspect these errors were caused by an insufficient PSU. I don't do any gaming or what I'd call "high-powered" computing, but it is an ancient PSU, even though it's served me without issue for many years. In the meantime, what's the best way to resolve the red X issue? My first impulse is to rebuild parity and disk 1, but to be blunt, my first impulse usually causes more problems later.

 

I've attached diagnostics, thank you for any help you can provide.

diagnostics-20231229-0132.zip


Thanks so much! I checked your link to the online doc and restarted the array in maintenance mode. I tried disk 1 first and here's what I got...

 


Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!

attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
would write modified primary superblock
Primary superblock would have been modified.
Cannot proceed further in no_modify mode.
Exiting now.

 

From the help files, it seems like I should remove the "-n" option (i.e., replace it with nothing). So I gave that a shot and got this...

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!

attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
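
(For my own notes, my understanding is that the webUI check/repair options map roughly to these console commands, with disk 1 being the md1 device in maintenance mode - please correct me if that's wrong:)

xfs_repair -n /dev/md1p1    # read-only check (the "-n" run above)
xfs_repair /dev/md1p1       # actual repair, with -n removed
xfs_repair -L /dev/md1p1    # repair after zeroing the log, the last resort the output mentions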

 

Regretfully, I have no idea where to go now. Sorry to ask, what should be my next step? And would these next steps apply to the red X parity drive?


I ran the repair with the -L option, and I'm not sure if it worked. Here's what came out at the end...

Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
reinitializing root directory
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
Metadata corruption detected at 0x46f8c0, inode 0x80 dinode

fatal error -- couldn't map inode 128, err = 117

 

The whole "fatal error" thing doesn't look good. I stopped the array, restarted in maintenance mode and I still have the red X. Should I assume the repair didn't work?

4 minutes ago, zfp said:

I ran the repair with the -L option, and I'm not sure if it worked. Here's what came out at the end...

Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
reinitializing root directory
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
Metadata corruption detected at 0x46f8c0, inode 0x80 dinode

fatal error -- couldn't map inode 128, err = 117

 

The whole "fatal error" thing doesn't look good. I stopped the array, restarted in maintenance mode and I still have the red X. Should I assume the repair didn't work?

 

That is not good - it suggests the repair did not work. Note that you are not trying to clear the red 'x' at this point, but the 'unmountable' status when running the array in normal mode.

 

If the drive has a red 'x' as well, then the repair was running against the emulated drive and not the physical drive (the check/repair does not clear the red 'x' status, as that requires a rebuild). It might be possible to work with the physical drive instead.


Oh boy. My interpretation is that I should stop the array and use the command line to run "xfs_repair -L /dev/sdq" (sdq is what I see listed for the drive). If that works, then I guess I'd be OK to start the array as usual. If not, do I run a rebuild? I do have dual parity and only 1 parity disk has a red X. Also, how do I handle the red X parity drive? There's no file system on it, from my understanding.

 

I really do appreciate all your help and apologize if I'm being difficult in any way. It's def not my intention, I'm just hoping I'm not digging a deeper hole than I'm already in! 


That is not quite the right command, as one would need to include the partition number. Also, running against the physical device invalidates parity, so it has consequences. These can be handled, but one needs to proceed cautiously.
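
To illustrate the difference (the 'sdX1' name below is just a placeholder - substitute the actual device and partition for your system):

xfs_repair -L /dev/md1p1   # emulated/md device for disk 1: parity is kept up to date
xfs_repair -L /dev/sdX1    # physical device, partition 1: bypasses parity, so parity becomes invalid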

 

I looked at the diagnostics posted earlier and could not see the 'sdq' device you mention. Perhaps you should post a new set of diagnostics so we can be certain they are current; a screenshot of the Main tab would be useful as well.

 

The Parity2 disk is definitely going to require a rebuild. However, since it has a red 'x' against it, it is currently not being used, so fixing it can be left until the issue with the data drive is resolved.


Yikes, looks like someone else has the same issue I'm having:

XFS_REPAIR: FATAL ERROR -- COULDN'T MAP INODE <>, ERR = 117

 

It looks like the only option is to pull the drive, attach it to another system, and use UFS Explorer to pull the data onto another drive. Then take the original drive, preclear it, and re-add it to the array. At that point it'll be an empty drive, and I'd need to copy the files recovered by UFS Explorer back onto it. That would then update parity with the recovered files, correct?


Based on itimpi's helpful advice and carefully reading a few other forum posts, I attempted the following on the red X disk 1 (aka sdq):

 

1) Ran a short SMART test with the array stopped. No issues reported.

 

2) Started the array in maintenance mode, used the webUI to check the file system using the -nv option. The output is as follows...

        XFS_REPAIR Summary    Fri Dec 29 12:16:07 2023

Phase           Start           End             Duration
Phase 1:        12/29 12:14:40  12/29 12:14:40
Phase 2:        12/29 12:14:40  12/29 12:14:41  1 second
Phase 3:        12/29 12:14:41  12/29 12:16:07  1 minute, 26 seconds
Phase 4:        12/29 12:16:07  12/29 12:16:07
Phase 5:        Skipped
Phase 6:        Skipped
Phase 7:        Skipped

 

3) SSH-ed into the server with the array in maintenance mode. Ran xfs_repair -Lv /dev/md1p1; lots of issues, and it seems unrepairable.

Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - reset superblock...
Phase 6 - check inode connectivity...
reinitializing root directory
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
Metadata corruption detected at 0x46f8c0, inode 0x80 dinode

fatal error -- couldn't map inode 128, err = 117

 

4) Stopped the array and tried to run xfs_repair directly on disk 1 (aka sdq) with "xfs_repair -nv /dev/sdq1". No luck...

Phase 1 - find and verify superblock...
xfs_repair: read failed: Invalid argument
xfs_repair: data size check failed
xfs_repair: cannot repair this filesystem.  Sorry.

 

At this point, it seems the file system is unrepairable. I saw some posts where people changed disk 1 to "no disk" and then mounted the former disk 1 using the Unassigned Devices plugin. Others have shut down the array, pulled the drive, and attempted xfs_repair on another system. There's also pulling the drive, copying the files to a new disk, and then copying them back into the array.

 

Are there any other options I'm missing? If not, which of the above scenarios is the safest choice? 

2 hours ago, trurl said:

If you've already attempted to repair the emulated disk (/dev/md1p1) and also attempted to repair the physical disk (/dev/sdq1) then it seems you have done everything xfs_repair can help you with.

 

Thanks so much for the confirmation, even though it's not great news. I removed disk 1 from the array assignments and tried to mount it via Unassigned Devices. The mount failed, and I checked the logs...

 

Dec 29 12:57:33 biollante unassigned.devices: Mounting partition 'sdq1' at mountpoint '/mnt/disks/SEAGATE_ST16000NM007G'...
Dec 29 12:57:33 biollante unassigned.devices: Mount cmd: /sbin/mount -t 'xfs' -o rw,relatime '/dev/sdq1' '/mnt/disks/SEAGATE_ST16000NM007G'
Dec 29 12:57:33 biollante kernel: XFS (sdq1): device supports 4096 byte sectors (not 512)
Dec 29 12:57:35 biollante unassigned.devices: Mount of 'sdq1' failed: 'mount: /mnt/disks/SEAGATE_ST16000NM007G: mount(2) system call failed: Function not implemented.        dmesg(1) may have more information after failed mount system call. '
Dec 29 12:57:35 biollante unassigned.devices: Partition 'SEAGATE ST16000NM007G' cannot be mounted.

 

I'm waving the white flag on this one. I've pulled the unreadable disk 1 and will try UFS Explorer to salvage whatever data I can from the drive. Of course I didn't have backups; I kept putting off a backup solution... lesson learned the hard way.

 

Currently I have an empty disk 1 slot, and I'm running a parity-sync since the other disk that went red X was a parity drive. unRAID seems unhappy about the missing disk 1, though. It says it's unmountable and offers the format option. Does unRAID just dislike having array disks out of sequence?

 

 

Screenshot 2023-12-29 3.27.03 PM.png

 

ADDENDUM: Actually, since the other array drives seem OK, should I do a New Config and then run a parity-sync? It seems like running a parity-sync now would just perpetuate the unmountable disk 1 error.

 

 

On 12/30/2023 at 2:59 AM, JorgeB said:

Since disk1 is 4Kn, xfs_repair won't work on the device directly without it being assigned as an array data device, so it may be worth booting an Unraid trial flash drive, assigning that disk as disk1, and running xfs_repair again using the GUI or /dev/mdXp1

 

Interesting idea! I took your advice and used a fresh unRAID USB stick on another system. I assigned the faulty drive as disk1 (which is what it was in my main unRAID machine). I first tried to mount the disk using Unassigned Devices; the mount failed as before. Then I started the array in maintenance mode. I didn't see the usual GUI xfs_repair option, so I went into the shell and ran "xfs_repair -L /dev/md1p1". This is what I got...

Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (3:2007914) is ahead of log (1:2).
Format log to cycle 6.
done

 

It didn't spit out errors like before but the disk is still unmountable. I tried the disk directly using "xfs_repair -n /dev/sdc1". 

root@Tower:~# xfs_repair -n /dev/sdc1
Phase 1 - find and verify superblock...
xfs_repair: read failed: Invalid argument
xfs_repair: data size check failed
xfs_repair: cannot repair this filesystem.  Sorry.

 

It seems like xfs_repair isn't going to bail me out on this. Of course I'm open to any other ideas.

 

On a side note, I gave UFS Explorer a shot, and the directory structure and files were identified. I had MD5 checksums saved for some files, though they only cover a small part of the disk (~2TB or so). A couple of files failed the checksum, so it seems like there is some data corruption *sob*
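
(The checksum spot check was nothing fancy, just something along these lines against the checksum list I had saved - the file name here is made up:)

cd /path/to/recovered/files
md5sum -c my_checksums.md5 | grep -v ': OK$'    # list only the files that fail verification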

 

 

1 hour ago, zfp said:
Maximum metadata LSN (3:2007914) is ahead of log (1:2).
Format log to cycle 6.
done

This appears to have been successful.

 

1 hour ago, zfp said:

I tried the disk directly using "xfs_repair -n /dev/sdc1". 

This will still fail, you must only use the md devices for 4Kn drives.

 

Post new diags after array start in normal mode with the disk assigned as disk1.

6 hours ago, JorgeB said:

This appears to have been successful.

 

This will still fail, you must only use the md devices for 4Kn drives.

 

Post new diags after array start in normal mode with the disk assigned as disk1.

 

I assigned the disk as disk1, started the array normally, and the files DO show up. However, a spot check shows some files are still corrupted. Looking at other posts, the xfs_repair output from a successful repair looks different, with messages about each phase.

 

I suspect xfs_repair did not repair the file system. My reasoning is that in the original array, the disk was red X-ed because Unraid knew there was an error. The new system doesn't "know" there's an error, so it mounts normally. Also, I believe xfs_repair puts corrupted files it identifies into a lost+found dir, which I don't see.
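
(I looked for it with something like the following, assuming the disk mounts at /mnt/disk1 on the recovery system - that path is my assumption:)

ls -la /mnt/disk1/lost+found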

 

Diags are posted; thanks again for all the help.

 

 

 

recovery-diagnostics-20240101-1039.zip

14 hours ago, zfp said:

I suspect xfs_repair did not repair the file system.

I don't see any indication of that; the xfs_repair output looks perfectly normal. You can run it again and check the exit code: if it's 0, no more corruption was detected.
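
For example, something like this (still using the md device, with the array in maintenance mode):

xfs_repair -n /dev/md1p1    # read-only check
echo $?                     # 0 means no corruption was detected, non-zero means problems were found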

 

14 hours ago, zfp said:

My reasoning is that in the original array, it was red x-ed because it knew there was an error.

Disabled disk and filesystem issues are not necessarily related; in any case, copying the data directly from the mounted disk (or just reassigning it to the other array) is IMHO the best bet to recover the data.

On 1/2/2024 at 1:15 AM, JorgeB said:

I don't see any indication of that; the xfs_repair output looks perfectly normal. You can run it again and check the exit code: if it's 0, no more corruption was detected.

 

Disabled disk and filesystem issues are not necessarily related; in any case, copying the data directly from the mounted disk (or just reassigning it to the other array) is IMHO the best bet to recover the data.

 

The lack of reported errors baffles me. When I ran xfs_repair in the original array, it couldn't repair the filesystem, as shown here...

 

Phase 6 - check inode connectivity...
reinitializing root directory
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
Metadata corruption detected at 0x46f8c0, inode 0x80 dinode

fatal error -- couldn't map inode 128, err = 117

 

Next, I moved the drive to another system and ran UFS Explorer to recover files. For some files I had MD5 checksums, so I ran a check; a couple of files failed.

 

I took your suggestion to try a fresh unRAID system and see what xfs_repair would do. This is when it reported no errors. However, when I checked the MD5s on the files that had failed, they still failed. As a sanity check, I had some backups in the cloud, and their MD5s matched up fine. Comparing the checksums of the UFS-recovered files and the files mounted on the new system, they matched each other.

 

To sum it up, xfs_repair on the original array showed unrecoverable errors. The fresh unRAID showed no errors, BUT the mounted files still have mismatched MD5s. It seems like the remaining errors can't be identified at this point.

 

It seems like we've hit a dead end. Files can be recovered, but some are corrupted; not many from what I can tell, but they exist. Here is the current situation in my main array...

image.thumb.png.9aca17f25e8a3626a8e8928b79b5c4ed.png

 

Disk 1 is the drive that showed errors. Parity 1 also had a red X, but I rebuilt it, which was probably unnecessary since the disk 1 red X is still there. This was all triggered when I added two new drives to be precleared. I had a 450W PSU, which was probably woefully inadequate; I've replaced it with a 650W unit, and I have 2x 20TB drives ready to go in as new parity drives.

 

It seems like the best way to remove the red X on drive 1 and avoid further data corruption is to do the following...

  1. Power down and add the 2x 20tb drives to replace the parity drives and a precleared empty 14tb to replace disk1.
  2. Run the new config utility. Keep all array drive assignments the same EXCEPT the new 14tb as disk1. Unassign the old 2x 18tb parity drives.
  3. Start the array which will now be unprotected.
  4. Stop the array, assign the 2x 20tb as the new parity drives.
  5. Restart the array and run the parity sync.

My logic is that this approach won't lose any data from the array (besides what was on the corrupted disk), and parity will be valid. Then I can copy the recovered data back to the array using the UD plugin. Am I missing anything?
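
(For the copy-back step I'm picturing something like this, with the recovered data on a drive mounted via Unassigned Devices - the mount points here are just examples:)

rsync -avn /mnt/disks/recovered/ /mnt/disk1/    # dry run first, to review what would be copied
rsync -av /mnt/disks/recovered/ /mnt/disk1/     # actual copy once the dry run looks right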

 

PS: The corrupted disk 1 was 16TB. The 14TB is a "placeholder"; I just want it empty. I'm going to replace it later with one of my old 18TB parity drives.

