[Solved] Unmountable disk in array. Structure needs cleaning.



tl;dr: ran xfs_repair -L and the disk came back online.

 

Hi unRaiders,

 

I'm hoping someone with more knowledge can help me with an issue that happened this morning.

 

As you can see in the screenshot, a disk was kicked out of the array as unmountable (see attached disk log). What's concerning is that it appears parity did not emulate the contents of the drive; to confirm this, I opened a few folders and found them to be empty. So I did a safe shutdown, and the machine is off for now. In hindsight I should have grabbed the logs; I hope they're not needed at this time.

 

 

Can someone please advise me on next steps and answer the following questions, to minimize any data loss:

 

  • Should I put the disk back into the array after running filesystem checks/repair, or is it better to keep it outside the array in hopes the data is intact so I can move it to an empty replacement drive?

 

  • Why didn't parity emulate the contents?

 

  • If I pull the drive out of the array, will parity then emulate the drive contents, or will it have already lost everything it would normally emulate for that drive?

 

  • For the future, can unRaid handle a hot or cold spare on standby for moments like this (i.e. if a drive is kicked out of the array, a designated spare is automatically used as the rebuild target)? I'd like to add this as a feature request if it can't.

 

 

Thanks in advance for any advice or assistance that anyone can offer!!

Edited by Joseph
marked as solved
Link to comment

Handling of unmountable drives is covered here in the online documentation, accessible via the ‘Manual’ link at the bottom of the Unraid GUI. This has nothing to do with parity, as parity is only responsible for handling ‘failed’ drives (which would be marked with a red ‘x’ in the GUI), not file system level corruption.

 

 

Link to comment
48 minutes ago, itimpi said:

Handling of unmountable drives is covered here in the online documentation, accessible via the ‘Manual’ link at the bottom of the Unraid GUI. This has nothing to do with parity, as parity is only responsible for handling ‘failed’ drives (which would be marked with a red ‘x’ in the GUI), not file system level corruption.

 

 

Thanks, but if a drive isn’t mountable and the array starts (or it gets knocked offline), shouldn’t parity emulation kick in? (Note: I haven’t read the link yet, maybe the question is addressed there)

Link to comment
6 hours ago, Joseph said:

Thanks, but if a drive isn’t mountable and the array starts (or it gets knocked offline), shouldn’t parity emulation kick in? (Note: I haven’t read the link yet, maybe the question is addressed there)

No - parity only kicks in if a read or write to a disk fails.   In this case that has not happened and the problem is at the file system level where bad data has been written to the drive (and parity updated to reflect this).  Parity has no concept of file systems as it works at the raw sector level so can never recover individual files - just the whole of a failed disk.
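
To make that concrete, here is a hypothetical one-byte sketch of how single (XOR) parity behaves - nothing Unraid-specific, just the arithmetic. Parity can rebuild a whole missing disk, but if garbage is written to a data disk, parity is immediately updated to match that garbage:

# Hypothetical one-byte illustration of single (XOR) parity across three data disks
d1=0xA5; d2=0x3C; d3=0x0F
parity=$(( d1 ^ d2 ^ d3 ))            # what the parity disk stores for this position
# If disk 2 physically fails, its byte can be rebuilt from the survivors:
rebuilt=$(( parity ^ d1 ^ d3 ))
printf 'original d2=0x%02X  rebuilt=0x%02X\n' $(( d2 )) $(( rebuilt ))
# But if corrupt data is written to disk 2, parity is updated to match it,
# so a rebuild just reproduces the corruption - parity never knows about files.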

Link to comment
4 hours ago, itimpi said:

No - parity only kicks in if a read or write to a disk fails.   In this case that has not happened and the problem is at the file system level where bad data has been written to the drive (and parity updated to reflect this).

 

So, I should be able to physically remove or replace the drive that hiccuped, and once the array starts, parity would emulate the contents of the old disk and the data should be intact... is this correct?

 

Link to comment
43 minutes ago, Joseph said:

 

So, I should be able to physically remove or replace the drive that hiccuped, and once the array starts, parity would emulate the contents of the old disk and the data should be intact... is this correct?

 

Probably not :) - if there have been no write failures, then Unraid parity also reflects the file system level corruption. Have you tried the process given in the link earlier to repair a damaged file system?

Link to comment
6 minutes ago, itimpi said:

Probably not :) - if there have been no write failures, then Unraid parity also reflects the file system level corruption. Have you tried the process given in the link earlier to repair a damaged file system?

:(

 

I haven't tried it yet; I'm trying to get an understanding of next steps and weighing my options for the most convenient way for a non-Linux guy like me to recover as much data intact as possible. I lost an entire disk a while back, and any attempt at recovery from corruption at the file system level was marginal. A quick scan of this link indicates it's hit and miss:

 

https://wiki.unraid.net/Check_Disk_Filesystems#After_running_xfs_repair

 

"If the xfs_repair command fails, and we're hearing numerous reports of this(!), then you will have no recourse but to redo the drive. Use the instructions in the Redoing a drive formatted with XFS section below.  We're sorry, we hope there will be better XFS repair tools some day!"

 

Any other thoughts are greatly appreciated!!

Link to comment

Just thought I would clear up some terminology. The drive was NOT "kicked out" (disabled) because it is still in sync. As noted, parity can't help because it is in sync and rebuild won't change anything. A disabled disk is indicated by a red X at the left, which is not seen in your screenshot.

 

A drive can be disabled, it can be unmountable, it can be both, it can be neither. Disabled and unmountable are independent conditions that require different solutions.

 

You need to repair the filesystem because the disk is unmountable.

 

In any case, you must always have another copy of anything important and irreplaceable. Parity is not a substitute for backups.

Link to comment
3 hours ago, Joseph said:

:(

 

I haven't tried it yet; I'm trying to get an understanding of next steps and weighing my options for the most convenient way for a non-Linux guy like me to recover as much data intact as possible. I lost an entire disk a while back, and any attempt at recovery from corruption at the file system level was marginal. A quick scan of this link indicates it's hit and miss:

 

https://wiki.unraid.net/Check_Disk_Filesystems#After_running_xfs_repair

 

"If the xfs_repair command fails, and we're hearing numerous reports of this(!), then you will have no recourse but to redo the drive. Use the instructions in the Redoing a drive formatted with XFS section below.  We're sorry, we hope there will be better XFS repair tools some day!"

 

Any other thoughts are greatly appreciated!!

 

Most of the time the file system repair works fine and there is little if any data loss. At the very least you should run the check, as that will give you an idea of whether any damage to the file system will be recovered without issue.
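
If you would rather use a terminal than the GUI button, the check is just xfs_repair in no-modify mode against the md device while the array is started in Maintenance mode. A minimal sketch, assuming the affected drive is disk 1 - substitute your own disk number, and note that newer Unraid releases may name the device md1p1 instead:

# Array started in Maintenance mode, so the filesystem is not mounted.
# -n = no modify: report problems only, write nothing to the disk.
xfs_repair -n /dev/md1

Using the /dev/mdX device (rather than the raw /dev/sdX partition) keeps parity in sync with any changes a later repair makes.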

Link to comment
5 hours ago, trurl said:

You need to repair the filesystem because the disk is unmountable.

ok, I will do that as soon as I have ample time...

 

2 hours ago, itimpi said:

 

Most of the time the file system repair works fine and there is little if any data loss. At the very least you should run the check, as that will give you an idea of whether any damage to the file system will be recovered without issue.

...might not happen until Monday. 

 

Also, the link provided suggests leaving the disk in question in the array and starting in maintenance mode to do the check/repair. Any idea why moving the disk out of the array to run the check/repair and then copying the recovered data back into the array isn't mentioned?

Edited by Joseph
Link to comment
5 hours ago, trurl said:

Just thought I would clear up some terminology. The drive was NOT "kicked out" (disabled) because it is still in sync. ...

...Disabled and unmountable are independent conditions that require different solutions.

It would be cool if unRaid actually kicked a drive offline should the file system hiccup -- before the condition is written to parity -- so parity could be used as a recovery option in that scenario.

Link to comment
1 hour ago, Joseph said:

Any idea why moving the disk out of the array to run the check/repair and then copying the recovered data back into the array isn't mentioned?

Because this would run exactly the same check, but invalidate parity so no real point.

1 hour ago, Joseph said:

It would be cool if unRaid actually kicked a drive offline should the file system hiccup -- before the condition is written to parity -- so parity could be used as a recovery option in that scenario.

The problem is that there is no indication of a problem at the file system level until it is detected later, when trying to read (or write) the disk after the corruption has occurred - and by that point parity has already been updated with whatever invalid sectors caused the corruption.

Link to comment
20 hours ago, Joseph said:

Any idea why moving the disk out of the array to run the check/repair and then copying the recovered data back into the array isn't mentioned?

Because that would be a much more complicated procedure requiring multiple steps to get your array back in sync, and a lot more time and effort. There are lots of ways to get all that wrong, and it would take quite a bit to explain how to accomplish it, with different possibilities along the way depending on other things. So, no good reason to go through all that.

Link to comment
On 7/24/2021 at 2:39 PM, itimpi said:

Because this would run exactly the same check, but invalidate parity so no real point.

understood.

 

21 hours ago, trurl said:

Because that would be a much more complicated procedure requiring multiple steps to get your array back in sync, and a lot more time and effort...

makes sense.

 

 

Good morning guys, thanks for all the feedback. I'm in uncharted waters here, so I appreciate your input thus far, and I thought I'd give you an update before I proceed.

 

I ran the "check" part of the test a few minutes ago and it didn't take long to give me these results:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
sb_fdblocks 200013438, counted 200994233
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 7
        - agno = 3
        - agno = 6
        - agno = 5
        - agno = 2
        - agno = 1
        - agno = 4
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

 

Hitting refresh on the Main page indicates there are still some operations going on. But I was curious whether someone could kindly look at the initial findings and see if there's anything I should be concerned about thus far.

 

Thanks again Guys and the unRaid community!!

Edited by Joseph
Link to comment
19 minutes ago, itimpi said:

That is a good sign - no mention of corruption of a sort that might lead to data loss. If you run again without the -n option and with the -L option, I would expect that when you restart in normal mode all the content of the drive will be accessible.

 

Whew!
 

I have to step out but if all goes well, I’ll try your next steps when I get back later today. Cheers!!

Link to comment

 

On 7/26/2021 at 7:24 AM, itimpi said:

If you run again without the -n option and with the -L option, I would expect that when you restart in normal mode all the content of the drive will be accessible.

 

So I've left xfs_repair untouched for about 6 hours while I was out. The command output box under Check Filesystem Status hasn't changed. The last detail still reads: "Phase 7 - verify link counts... No modify flag set, skipping filesystem flush and exiting."

 

Also, there are still reads happening across all drives, with the disk in question 2777 reads ahead of most others in the array... AND I forgot to mention that when the array was started, the unmountable message wasn't there (see attached).

 

Do you think it's ok to run it again with the -L option now or should I wait for something else in the command output box?

 

 

Edited by Joseph
Link to comment
3 minutes ago, Joseph said:

 

So I've left xfs_repair untouched for about 6 hours while I was out. The command output box under Check Filesystem Status hasn't changed. The last detail still reads: "Phase 7 - verify link counts... No modify flag set, skipping filesystem flush and exiting."

 

Also, there are still reads happening across all drives, with the disk in question 2777 reads ahead of most others in the array... AND I forgot to mention that when the array was started, the unmountable message wasn't there (see attached).

 

Do you think it's ok to run it again with the -L option now or should I wait for something else in the command output box?

 

If the drive is no longer showing as unmountable, can you see any contents? Just guessing, as you seem to have missed the attachment you mention.

Link to comment

 

On 7/26/2021 at 1:25 PM, itimpi said:

Just guessing, as you seem to have missed the attachment you mention.

I hit submit too hastily and forgot to include it on the original comment. I added it later, but I guess it didn't go through... here it is again so you can see.

 

The screen capture is in maintenance mode, and that's where it remains for now.

Edited by Joseph
Link to comment

[UPDATE]

So I started the array normally and the disk in question returned to unmountable. Also, I read this on a RedHat site and it doesn't bode well :(

 

If the mount failed with the Structure needs cleaning error, the log is corrupted and cannot be replayed. Use the -L option (force log zeroing) to clear the log: This command causes all metadata updates in progress at the time of the crash to be lost, which might cause significant file system damage and data loss. This should be used only as a last resort if the log cannot be replayed.

 

I went back into maintenance mode and the disk remained unmountable... debating next steps.

Edited by Joseph
Link to comment

Did you actually run xfs_repair without -n and with -L?   All the examples of the xfs_repair output you gave earlier indicated that -n had been used so no changes were being made.

 

xfs_repair is normally very fast - just a few minutes. Despite the ominous-sounding warning about the -L option, it virtually never causes data loss, and in the rare cases where it does, it only affects the last file being written.

Link to comment
11 hours ago, itimpi said:

Did you actually run xfs_repair without -n and with -L?   All the examples of the xfs_repair output you gave earlier indicated that -n had been used so no changes were being made.

 

xfs_repair is normally very fast - just a few minutes. Despite the ominous-sounding warning about the -L option, it virtually never causes data loss, and in the rare cases where it does, it only affects the last file being written.

 

Good morning mate, thanks for the info. I'm only hesitant because the RedHat warning contradicts the unRaid wiki link you provided... Nevertheless, I'll give it a shot in a bit and report back. Thanks!!

Link to comment

The warning about -L says you should mount the disk first if possible so the log can be replayed. Unraid has already determined the disk is unmountable, so replaying the log is not possible. Therefore, you must use -L. And you must NOT use -n (no-modify) or no repair will be done.
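
In other words, the repair run - from Maintenance mode, against the same md device as the earlier check, disk number assumed - would look something like:

# -L zeroes the unreplayable metadata log and then repairs the filesystem.
# Do NOT include -n here, or nothing will actually be written.
xfs_repair -L /dev/md1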

Link to comment
8 hours ago, trurl said:

The warning about -L says you should mount the disk first if possible so the log can be replayed. Unraid has already determined the disk is unmountable, so replaying the log is not possible. Therefore, you must use -L. And you must NOT use -n (no-modify) or no repair will be done.

OK, I just tried the -L and the drive is back online... but I have no idea what data was lost (if any).

 

In the output window of xfs_repair in unRAID it said "- moving disconnected inodes to lost+found ..." but I don't see that folder on that disk. Any idea where I might be able to find it? I'd like to see if there's anything that will clue me in on what data might be lost.

 

Thanks again for everything.
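
For anyone landing on this thread later: when xfs_repair actually recovers orphaned files, it places them in a lost+found directory at the top level of that filesystem, which on Unraid is the root of the individual disk (e.g. /mnt/disk1/lost+found - disk number assumed). A -n check only reports that phase without moving anything, so if the directory doesn't exist after the real repair, nothing was orphaned. A quick way to look:

# Look for a lost+found directory at the root of the repaired disk
ls -la /mnt/disk1/lost+found 2>/dev/null || echo "no lost+found - nothing was orphaned"
# If it does exist, list what ended up there; recovered files are often renamed
# to their inode numbers, so sizes and timestamps are the main clues
find /mnt/disk1/lost+found -ls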

Link to comment