Two drives failed, I have two parity disks, yet the contents are not emulated



Recently I had two drives get marked as disabled, but luckily I have two parity drives. Unraid says that both of these drives are emulated, however I am seeing various missing files in my array. Why is this happening and how can I fix this? A parity check was done a week ago, and many of these files would be included in that. I've attached a screenshot of my drive setup in Unraid below.

 

Another problem is that these two drives seem to get marked as disabled every few weeks. I don't know why; when I run SMART long tests on them, they come back as working. From there, it seems the only way to get them up and running again is to wipe the array and create a new one, after which these drives are marked as working. On this, I have two questions. First, why are these drives getting marked as disabled in the first place? It seems to happen exclusively during parity checks. Second, is there a way of knowing that these drives are actually fine besides SMART tests, or is that the best option? The process of clearing the array and creating a new one works, but I'm very afraid that at some point it will not, and that wiping the array will clear my parity drives, so I will lose all of the data on these drives.

Screenshot from 2020-01-06 13-56-41.png

Link to comment

The drives are emulated, but unfortunately the emulated drives are unmountable as shown in the screenshot.

 

Go to Tools - Diagnostics and attach the complete diagnostics zip file to your NEXT post.

 

Then wait on further advice. You really should have asked for advice before this since you say this has happened before.

Link to comment

I've attached it here! I definitely should have gotten back earlier, but I had also accidentally unplugged a hard drive right before that happened so I had assumed that they were related. Unfortunately they do not appear to be...

tower-diagnostics-20200106-1909.zip

Link to comment

That looks good. The green "thumbs up" in the SMART column would be a yellow "thumbs down" if you had SMART warnings.

 

Go to Main - Array Operation and Stop the array.

Go to Main - Array Devices

Click on Disk4 to get to its page and Check Filesystem. Be sure to capture the output so you can post it. Do the same for Disk6.

Link to comment

Just did this, I got the same output for both drives.

Phase 1 - find and verify superblock... bad primary superblock - bad CRC in superblock !!! 
attempting to find secondary superblock... 
.found candidate secondary superblock... 
verified secondary superblock... 
would write modified primary superblock 
Primary superblock would have been modified. 
Cannot proceed further in no_modify mode. 
Exiting now.

I'm not quite sure what this means or what to do with this information.
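
Read line by line, the output above says three things: the primary superblock failed its CRC check, a secondary superblock was found and verified, and the check then stopped because it was running in no_modify (read-only) mode, so nothing on the emulated disk was changed. A minimal sketch of that reading follows. The function name `diagnose_xfs_check` is hypothetical, and the strings it matches are taken only from the output posted above, not from any documented list of messages.

```python
def diagnose_xfs_check(output: str) -> str:
    """Classify the text printed by a read-only filesystem check.

    Hypothetical helper: it only recognizes the phrases seen in this thread.
    """
    text = output.lower()
    bad_primary = "bad primary superblock" in text
    secondary_ok = "verified secondary superblock" in text
    stopped_read_only = "cannot proceed further in no_modify mode" in text

    if bad_primary and secondary_ok and stopped_read_only:
        return ("primary superblock damaged; valid secondary found; "
                "check stopped because it was read-only")
    if bad_primary and not secondary_ok:
        return "primary superblock damaged; no verified secondary superblock reported"
    return "no superblock problem reported in this output"


# The output posted for both Disk4 and Disk6
sample = """Phase 1 - find and verify superblock... bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
would write modified primary superblock
Primary superblock would have been modified.
Cannot proceed further in no_modify mode.
Exiting now."""

print(diagnose_xfs_check(sample))
```

The point of the sketch is only that this output reports a problem without fixing it; whether and how to run an actual repair is a separate decision.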

Link to comment

As mentioned, the disks ARE emulated, they are just unmountable. Maybe not emulated well due to invalid parity and other issues.

 

Realized I left out a step in my instructions earlier

9 hours ago, trurl said:

Go to Main - Array Operation and Stop the array.

Check the box for Maintenance Mode and Start the array

Go to Main - Array Devices

Click on Disk4 to get to its page and Check Filesystem. Be sure to capture the output so you can post it. Do the same for Disk6.

 

I'm not sure whether you figured out that missing step yourself or not. I assume it would not actually do the check without the array started in Maintenance Mode.

 

5 hours ago, johnnie.black said:

This is not very good:

 

image.png.0dfb6b3207fe77bd41f8e12cc37f1b5e.png

 

It implies parity is not valid, even if that was a correcting check, since it didn't complete.

 

I missed that.

 

So how to proceed?

 

Repair emulated disks, then rebuild to new disks, keeping the originals. Then if needed mount the originals Unassigned, repair if necessary, and try to recover any files the rebuild didn't get.

 

Or some other approach involving dd_whatever?

Link to comment

The parity check detected sync errors and was canceled, so it didn't finish correcting them, and that is assuming it was a correcting check. If it was, for example, the automatic parity check after an unclean shutdown, it would be non-correcting. So there's a big chance parity isn't 100% in sync, which is possibly what is causing the emulated disk problems.

 

We can't see why the disks got disabled, since the diags were taken after rebooting. But since both disks look fine, and it's very unlikely that both would fail at the same time, we can reasonably assume it was a controller/connection/power issue. The best option here is likely to re-enable the disks and re-sync parity. Still, to leave as many options open as possible, it would be safer to first see if the filesystems on the emulated disks are repairable. If yes, rebuild to new disks and then compare the data with the old ones, or re-assign the old ones to the array and re-sync parity.

 

Depending on the importance of the data and the backup situation (i.e., whether there are any backups), I would either go directly to re-syncing parity or play it safer by rebuilding to new disks first.

 

Link to comment

I didn't have any backups (something I've been meaning to get around to for a while but haven't), so I tried to solve this problem the way I had solved a similar problem in the past. This involves making a new config. I figured doing a data rebuild wouldn't get me back everything, so I might as well try it, and if it didn't work I may have lost that data anyway. Here's the post I made on reddit about that. Strangely, this has worked! The two drives now show all data on them, and all of my files have returned.

However, this has made me more confused than anything else. Why did this work? And how can I prevent this from happening in the future? When I made that reddit post, I had one hard drive that I had accidentally unplugged, and two more drives that had the same issue of being marked as disabled, and this fix had fixed all of those drives. So this problem has happened twice now. Any idea as to why?

Link to comment

Just to clarify, since the reddit post seems to be about a previous occurrence.

 

Are you saying you did New Config again just now?

 

If that is what you are saying then you basically skipped ahead and did this:

25 minutes ago, johnnie.black said:

go directly to re-sync parity

with no safety net. Not really surprising that it worked out though. Clearly invalid parity was at least part of the problem.

 

Unraid disables a disk when any write to it fails. After disabling, Unraid will not use the disk again for either read or write, since its contents are presumed to be invalid (the failed write) and out-of-sync with parity (which was updated as if the write succeeded). Until the disk is rebuilt, Unraid will emulate the disk using the parity calculation to get the data for the disk by reading parity plus all the other disks. It does this whether reading or writing the emulated disk. For writing the emulated disk, it continues to update parity, so those writes can also be recovered with rebuild.
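
The emulation described above can be sketched with plain XOR parity. This is a simplified illustration only: it assumes a single parity disk computed as the byte-wise XOR of all data disks, it ignores the second parity disk entirely, and the disk names and 4-byte contents are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def emulate(missing, disks, parity):
    """Compute the content of a disabled disk from parity plus all other disks."""
    others = [data for name, data in disks.items() if name != missing]
    return xor_blocks(others + [parity])


# Three hypothetical data disks, each shrunk to 4 bytes
disks = {
    "disk1": b"\x10\x20\x30\x40",
    "disk2": b"\x01\x02\x03\x04",
    "disk3": b"\xaa\xbb\xcc\xdd",
}
parity = xor_blocks(disks.values())

# disk2 gets disabled: reads are served from parity plus the other disks
emulated = emulate("disk2", disks, parity)

# A write to the emulated disk never touches the physical disk2.
# Only parity is updated, as if the write had landed on disk2.
new_data = b"\xff\xee\xdd\xcc"
others = [data for name, data in disks.items() if name != "disk2"]
parity = xor_blocks(others + [new_data])

# A later rebuild recovers the new content from parity
rebuilt = emulate("disk2", disks, parity)

print(emulated == disks["disk2"])  # emulated read matches the old physical content
print(rebuilt == new_data)         # rebuild returns what was written while emulated
```

In this model the physical `disk2` still holds its old bytes after the emulated write, which is why a rebuild is needed to bring it back in line with parity.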

 

In your case, since parity was invalid, the emulated disks were unmountable, so those emulated disks simply couldn't be used at all. And the contents of the actual disks were just as they were when they got disabled.
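
A toy model of why invalid parity hurts only the emulated view: under the same simplifying assumption of single XOR parity (the values and the stale offset are invented), one stale parity byte corrupts the matching byte of the emulated disk while the physical disk is untouched. On a real disk, a wrong byte could land in filesystem metadata, which may be enough to make the emulated disk unmountable.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def count_sync_errors(disks, parity):
    """Count positions where stored parity disagrees with parity computed from the disks."""
    expected = xor_blocks(disks)
    return sum(1 for e, p in zip(expected, parity) if e != p)


# Three hypothetical data disks, each shrunk to 4 bytes
disks = [b"\x10\x20\x30\x40", b"\x01\x02\x03\x04", b"\xaa\xbb\xcc\xdd"]
valid_parity = xor_blocks(disks)

# Simulate invalid parity: the byte at offset 2 was never brought back in sync
stale = bytearray(valid_parity)
stale[2] ^= 0x5A
stale_parity = bytes(stale)

sync_errors = count_sync_errors(disks, stale_parity)

# The second disk gets disabled and is emulated from the stale parity
physical = disks[1]
emulated = xor_blocks([disks[0], disks[2], stale_parity])

# Offsets where the emulated content no longer matches the physical disk
differing = [i for i, (a, b) in enumerate(zip(physical, emulated)) if a != b]

print(sync_errors, differing)
```

Each parity sync error maps to exactly one wrong byte in the emulated disk here, and the bytes on the physical disk stay as they were when it was disabled.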

 

Just wanted to get this out there. I will have more to say about what you should be doing in the future.

Link to comment

Unraid will disable a drive any time a write to it fails. More often than not this is due to an external factor (e.g. cabling, power, etc.) and is not a fault with the actual drive itself. Once the drive is disabled, Unraid stops writing to it and instead 'emulates' it using the combination of the other drives plus parity.

 

When you did a New Config you effectively told Unraid to revert to using what it finds on the physical drive. If the drive is not actually faulty, this will be the contents at the point the drive was disabled, and it will not include any files that were written to the 'emulated' drive after the physical drive was disabled. In many cases this means the vast majority of files (if not all) are available again. You just need to be aware that any recent updates have been lost.

Link to comment

Make a backup plan. You don't have to back up everything, but you absolutely must have another copy of anything important and irreplaceable.

 

Set up Notifications to alert you immediately by email or other agent when any problems are detected. Don't allow a single problem to develop into multiple problems.

 

Go to the Apps page and install Fix Common Problems. Pay attention to any warnings it gives and correct them.

 

As mentioned:

30 minutes ago, trurl said:

The only acceptable number of parity errors is exactly zero.

 

And as also mentioned:

31 minutes ago, johnnie.black said:

We can't see why the disks got disabled, since diags are after rebooting

Get diagnostics before rebooting if possible. Better yet, also set up Syslog Server so syslogs will be saved instead of being lost on reboot:

 

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=781601

 

If we had the saved syslogs, we might be able to give a more definitive answer to what happened to cause all this.

 

And seek advice on the forum before doing anything.

Link to comment
