Two disks are unmountable after a single drive failure (Unmountable disks present)

lankanmon · March 19, 2019

Note to Mods: I posted this here first: https://ipstest.lime-technology.com/forums/topic/70755-two-disks-are-unmountable-after-a-single-drive-failure-unmountable-disks-present/ (is this a test forum?)

Hi all, I've got a bit of a problem...

One of my disks (Disk 5) failed last week, and after reseating the SATA cable, trying a new one and using a different power connection, I decided to use a backup drive that I had already precleared (and rebuild from parity). The rebuild seemed to be going well, but I may have knocked the SATA cable of Disk 3 and it became unavailable (and also emulated). I have two parity drives. I noticed that the parity rebuild sped up by much when this happened. I still let it finish. With Disk 5 rebuilt and Disk 3's SATA cable reseated correctly, I also rebuilt Disk 3.

Now, my parity is valid (all green).

I now have an issue where when I mount the array, I have a message that says unmountable disks present.

I mounted it in maintenance mode and ran the "Check Filesystem Status" for each disk -- here are the results:

Disk 3

xfs_repair status:
Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!

attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
would write modified primary superblock
Primary superblock would have been modified.
Cannot proceed further in no_modify mode.
Exiting now.

Disk 5


Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!

attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
would write modified primary superblock
Primary superblock would have been modified.
Cannot proceed further in no_modify mode.
Exiting now.

I am not entirely sure what this means or what is going on.

I am really concerned about my data right now...

Is there any way to fix this without the loss of data?

Any help will be much appreciated!

JorgeB · March 19, 2019

Do you have the diagnostics from the rebuild?

This part makes me suspect it didn't complete successfully:

7 hours ago, lankanmon said:

I noticed that the parity rebuild sped up by much when this happened.

In any case you'll need to run xfs_repair without -n, but superblock corruption is not a very good sign.

lankanmon · March 19, 2019

I have attached the diagnostics.

I did find that strange too.

Is there any way to determine if data is corrupted?

And how well does xfs repair work?

lknserver-diagnostics-20190319-0451.zip

JorgeB · March 19, 2019

32 minutes ago, lankanmon said:

I have attached the diagnostics.

Those are after rebooting, so not much help.

32 minutes ago, lankanmon said:

Is there any way to determine if data is corrupted?

Not easily without the diags from the rebuild and/or checksums from all files.

33 minutes ago, lankanmon said:

And how well does xfs repair work?

It usually works well, but it can't do miracles. If the rebuilds are incomplete there will be data loss.

lankanmon · March 19, 2019

Okay, I will run the xfs_repair without -n for each drive and report back.

Thanks!

lankanmon · March 19, 2019

Update: when I try to run it on Disk 3, it gives this error:


Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!

attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 97
resetting superblock realtime bitmap ino pointer to 97
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 98
resetting superblock realtime summary ino pointer to 98
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

So, do I run it with the -L option?

Is there any other way to mount the filesystem?

JorgeB · March 19, 2019

41 minutes ago, lankanmon said:

So, do I run it with the -L option?

Yep.
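For reference, the escalation followed in this thread (read-only check, then a real repair, then `-L` only once the log cannot be replayed by mounting) can be sketched as a small helper that builds the command lines. This is only a sketch: the device path `/dev/md3` is an assumption about how Disk 3 appears while the array is in maintenance mode, and the helper name is made up for illustration. It prints the commands and does not run anything.

```python
import shlex


def xfs_repair_cmd(device, no_modify=False, zero_log=False):
    """Build an xfs_repair argument list (hypothetical helper; it runs nothing)."""
    if no_modify and zero_log:
        # This sketch refuses the combination: a check-only run should not touch the log.
        raise ValueError("-n and -L are not combined in this sketch")
    cmd = ["xfs_repair"]
    if no_modify:
        cmd.append("-n")  # check only, report what would be changed
    if zero_log:
        cmd.append("-L")  # zero the log; pending metadata updates are lost
    cmd.append(device)
    return cmd


# ASSUMPTION: Disk 3 is exposed as /dev/md3 in maintenance mode; adjust as needed.
DEVICE = "/dev/md3"

steps = [
    xfs_repair_cmd(DEVICE, no_modify=True),  # 1. read-only check (no_modify mode)
    xfs_repair_cmd(DEVICE),                  # 2. actual repair, without -n
    xfs_repair_cmd(DEVICE, zero_log=True),   # 3. last resort if the log cannot be replayed
]
for step in steps:
    print(shlex.join(step))
```

Step 3 discards whatever metadata changes were still in the log, which is why xfs_repair asks for a mount attempt first.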

lankanmon · March 19, 2019

Okay, I have finished running the xfs_repair on both drives:

Disk 3:


Phase 1 - find and verify superblock...
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 97
resetting superblock realtime bitmap ino pointer to 97
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 98
resetting superblock realtime summary ino pointer to 98
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
sb_icount 0, counted 210816
sb_ifree 0, counted 6594
sb_fdblocks 976277683, counted 182759326
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 2
        - agno = 1
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (34:652511) is ahead of log (1:2).
Format log to cycle 37.
done

Disk 5 (took much longer):


Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!

attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 97
resetting superblock realtime bitmap ino pointer to 97
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 98
resetting superblock realtime summary ino pointer to 98
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
sb_icount 0, counted 55660224
sb_ifree 0, counted 142
sb_fdblocks 976277683, counted 343482943
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Note - stripe unit (0) and width (0) were copied from a backup superblock.
Please reset with mount -o sunit=,swidth= if necessary
Maximum metadata LSN (59:518163) is ahead of log (1:2).
Format log to cycle 62.
done
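The `sb_icount`, `sb_ifree` and `sb_fdblocks` lines in both logs pair the value stored in the superblock with the value xfs_repair counted on disk. A minimal Python sketch for pulling those pairs out of a saved log follows; the 4 KiB block size used for the free-space estimate is an assumption (the usual XFS default), not something stated in the logs, and the function name is made up for this example.

```python
import re

# Matches lines such as "sb_fdblocks 976277683, counted 182759326".
COUNTER_RE = re.compile(r"^\s*(sb_\w+) (\d+), counted (\d+)\s*$")

# ASSUMPTION: 4 KiB filesystem blocks (the usual XFS default); not stated in the thread.
BLOCK_SIZE = 4096


def counter_fixes(log_text):
    """Return {counter: (stored_value, counted_value)} from xfs_repair output."""
    fixes = {}
    for line in log_text.splitlines():
        match = COUNTER_RE.match(line)
        if match:
            name, stored, counted = match.groups()
            fixes[name] = (int(stored), int(counted))
    return fixes


# The three counter lines from the Disk 3 log above.
disk3_log = """\
sb_icount 0, counted 210816
sb_ifree 0, counted 6594
sb_fdblocks 976277683, counted 182759326
"""

for name, (stored, counted) in counter_fixes(disk3_log).items():
    print(f"{name}: superblock said {stored}, repair counted {counted}")

free_blocks = counter_fixes(disk3_log)["sb_fdblocks"][1]
print(f"free space if blocks are 4 KiB: {free_blocks * BLOCK_SIZE / 1e12:.2f} TB")
```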

The drives still show "Unmountable: No file system"

Do I need to restart or unmount & remount now to get the drives to mount?

Also, there is a note at the bottom of the Disk 5 log, do I need to run any of those commands?

JorgeB · March 19, 2019

4 minutes ago, lankanmon said:

The drives still show "Unmountable: No file system"

That's weird, post new diags after starting the array.

lankanmon · March 19, 2019

I did not restart before. I have restarted now and mounted the array (not in maintenance mode) and they now appear to show up as normal (xfs).

I would also like to know where I go from here.

Is there any way to determine what files may have been corrupted? I noticed a mention of "lost+found" in the logs above... Is that something that I can actually access?

Also, would my parity be valid right now? Should I run a parity check (and if so, should I write corrections to parity)?

Please let me know...

I really appreciate your help!

Thank you so much!

JorgeB · March 19, 2019

19 minutes ago, lankanmon said:

Is there any way to determine what files may have been corrupted?

As mentioned, only if you have checksums for all files, or backups to compare against.
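Where a backup does exist, one way to use it is to hash every backed-up file and compare it with the file at the same relative path on the rebuilt disk. A minimal Python sketch, assuming the backup mirrors the disk's folder layout; the function names are made up for this example:

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_with_backup(disk_root, backup_root):
    """Check every file under backup_root against the same relative path on disk.

    Returns (missing, different) as two lists of relative paths.
    """
    disk_root, backup_root = Path(disk_root), Path(backup_root)
    missing, different = [], []
    for backup_file in sorted(backup_root.rglob("*")):
        if not backup_file.is_file():
            continue  # skip directories
        relative = backup_file.relative_to(backup_root)
        disk_file = disk_root / relative
        if not disk_file.is_file():
            missing.append(relative)
        elif sha256_of(disk_file) != sha256_of(backup_file):
            different.append(relative)
    return missing, different
```

Calling `compare_with_backup("/mnt/disk5/Media", "/mnt/backup/Media")` (both paths are placeholders) returns the files that are missing on the disk and the files whose contents differ. A differing file is not necessarily corrupt; it may simply have been changed after the backup was made.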

19 minutes ago, lankanmon said:

I noticed a mention of "lost+found" in the logs above... Is that something that I can actually access?

Check if that folder exists on both disks, and for any data there.
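A quick way to do that check from a script is sketched below in Python. It assumes the repaired disks are mounted at `/mnt/disk3` and `/mnt/disk5`; the function name is made up for this example. Files that xfs_repair moves to lost+found are normally named after their inode number, so the original names are not recoverable from this listing alone.

```python
from pathlib import Path


def lost_and_found_entries(disk_mount):
    """Return (path, size_in_bytes) for every file under <disk_mount>/lost+found."""
    folder = Path(disk_mount) / "lost+found"
    if not folder.is_dir():
        return []  # folder absent: nothing to report for this disk
    return [(p, p.stat().st_size) for p in sorted(folder.rglob("*")) if p.is_file()]


# ASSUMPTION: the repaired disks are mounted at /mnt/disk3 and /mnt/disk5.
for mount in ("/mnt/disk3", "/mnt/disk5"):
    entries = lost_and_found_entries(mount)
    print(f"{mount}/lost+found: {len(entries)} file(s)")
    for path, size in entries:
        print(f"  {size:>12}  {path}")
```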

20 minutes ago, lankanmon said:

Should I run a parity check (and if so, should I write corrections to parity)?

Since there are no diags from the rebuild it's a good idea.

lankanmon · March 20, 2019

I did a parity check and it did complete successfully with 0 errors. So I hope everything is well.

I do have some backups (although not of the entire server), and will see if I can verify data integrity.

Thanks for all of your help!
