
(SOLVED) Data loss after drive failure during parity check


tazman


I have just lost about 3TB of data to a single disk failure during a parity check/rebuild on a dual-parity system.

 

The other day, I ran a parity check. During the parity check Drive 12 (4TB) was disabled (red cross). The parity check reported more than 2000 errors. 

 

I checked the drive, ran an extended SMART test and then precleared the disabled drive once. No problem at all.
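For reference, an extended SMART test of that kind looks roughly like this (a sketch only; /dev/sdX stands in for the actual device, and some controllers need an extra -d option):

    smartctl -t long /dev/sdX       # start the extended (long) self-test in the background
    smartctl -l selftest /dev/sdX   # check the self-test log once the test has had time to finish
    smartctl -a /dev/sdX            # full SMART report, including pending/reallocated sector counts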

 

So I opened the case, checked for any obvious problem, and suspected a cable that was bent at 90 degrees. 

 

I then reinstalled the disk in the same slot and unRAID accepted it as a new drive. So I started the rebuild. 

 

The rebuild took about 12 hours and finished without any errors. But then I noticed that the drive was listed as having nearly all of its space free:

 

[Screenshot: Main page showing Disk 12 with nearly all of its space free]

 

I am using the high-water allocation method and know for sure that the drive had at least 3TB on it.

 

I checked the drive and, yes, the files on it were only from a recent backup job.

 

So I fear that this data is lost, despite dual parity and despite this being a normal scenario, a single disk failure, which has happened several times before.

 

It sounds similar to the issue reported here: https://forums.lime-technology.com/topic/60963-disk-3-error-disk-3-rebuilt-from-parity-disk-3-missing-almost-everything/.

 

The array was running normally during the rebuild. 

 

Here is what I found out myself that may point to other problems or even causes:

  • My first place to go was the syslog. I could access the tail of it via Log on the Main page without problems. But when I wanted to check the full syslog on Tools/System Log, I only got a "page not responsive" message from the browser. 
  • I was able to get a larger chunk from the syslog with tail -f /var/log/syslog > /boot/syslog.txt (see the commands sketched after this list for other ways to capture it). 
  • The parts that I could get to were from after the rebuild and didn't show any problems. 
  • I rebooted, but disk 12 was still empty. The syslog page worked again.
  • I suspected memory problems. A 48-hour memtest didn't produce any errors. 
  • The BIOS reports several "Smbios 0x01 DIMM_2B (Single Bit ECC Memory Error)" events, many while the memory test was running and also before, but not one during the time when the drive failed. It seems a single-bit ECC error is normal if it occurs only a few times a day. The BIOS is set to stop the machine when a double-bit ECC error occurs; that has never happened.
  • Fix Common Problems reports “It appears that your server has a Marvel based hard drive controller installed within it. Some users with Marvel based controllers exhibit random drives dropping offline, recurring parity errors during checks etc. This tends to be exacerbated if VT-D / IOMMU is enabled in the BIOS. Generally, LSI based controllers would be preferred over Marvel based controllers because of these issues. Note that these issues are out of Limetech's hands. Depending upon the exact combination of hardware present in your server, you may not have any problems whatsoever. If you have no problems, then this warning can be safely ignored, but future versions of unRaid (and later Kernel versions) may (or may not) present you with the previously mentioned issues.” 
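Since the syslog on unRAID lives in RAM and is lost on reboot, a couple of simple ways to capture the whole thing to the flash drive (paths as on a stock unRAID 6.x install; adjust to taste) would be:

    cp /var/log/syslog /boot/syslog-full.txt                # copy the entire current syslog to flash
    tail -n 5000 /var/log/syslog > /boot/syslog-last.txt    # or just the last few thousand lines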

 

On the Marvel Issue (based on https://forums.lime-technology.com/topic/59091-marvel-issues-starting-point-for-investigation/ and others)

  • I do have three Marvel-based AOC-SAS2LP-MV8 SATA cards installed (great!!!). The failure occurred on one of them. 
  • I don't use any VMs.
  • IOMMU is not supported
  • VT-D is enabled
  • The drive is a HGST HDN724040ALE640 with ATA8-ACS T13/1699-D revision 4

 

 So from this, I conclude/wonder about the following and welcome your recommendations and any suggestions on how to analyze or remediate this situation further:

- Is there any way to get the data back? I fear the answer is no. 

- It seems that the most likely cause is the Marvel controller, not the bent cable:

- I have seen no indication on the boards that a fix is on the horizon. So replacing it seems to be warranted. Which card(s) would you go for?

- While the Marvel controller is still in place it seems prudent to:

     - disable VT-D / IOMMU

     - not to run any parity checks.

- Maybe replacing DIMM_2B, which gives the ECC errors, is also in order.

- I am not at all sure what to do about the unresponsive syslog page.

 

I am attaching the syslog part I was able to salvage after the incident and the most recent one.

 

Thanks!

syslog incident.txt

syslog last.txt

 

_________________

unRAID 6.3.5 - Board:  Supermicro X9SCM-F - CPU: Intel i3 2120T - RAM: 4GB, 2x 2GB Kingston KVR1333D3S8E9S/2GEC SDRAM DIMM with up to 32GB DDR3, unbuffered, ECC 1333/1066 - PSU: Corsair Professional Series Gold AX850 - SATA Card: 3x AOC-SAS2LP-MV8 - Backplane: 6x icyDock MB455SPF 5in3 - Case: Lian Li PC-343B 

16 minutes ago, tazman said:

checked the drive, ran an extended SMART test and then precleared the disabled drive once.

16 minutes ago, tazman said:

I then reinstalled the disk in the same slot and unRAID accepted it as a new drive. So I started the rebuild.

 

I have a feeling that after the drive got redballed (due to a failed write, which could be caused by the cable, controller, or drive), and you did the above, unRaid also said that the drive was unmountable and presented you with a "format option" (or this was presented to you earlier). At that point, just as with any other OS, you wound up erasing the contents of the drive (or the emulated drive while it wasn't present) when you selected the check box and clicked the format button.

 

Single parity (or dual parity) on any system of redundant drives, be it unRaid or a true RAID system, is designed around rebuilding a failed drive, and does not recover a corrupted filesystem.

 

 

On 11/3/2017 at 10:07 AM, Squid said:

Single parity (or dual parity) on any system of redundant drives, be it unRaid or a true RAID system, is designed around rebuilding a failed drive, and does not recover a corrupted filesystem.

 

Correct, but the rebuild should reconstruct the data on that failed drive. Why did this not happen in this case?


I never touched the failed drive while it was still part of the array. The preclear was only done after the disk was removed from the array. While the disk was being precleared, it was emulated by the array. After the preclear unRAID accepted the drive as a new one and started the rebuild. I don't remember that there was a format. These were the steps:

 

  1. Drive 12 redballed during parity check
  2. Let parity check finish
  3. Stop array, remove and check drive - everything was ok
  4. Reinsert drive to check if anything has changed. unRAID didn't accept the drive as good and still showed it as redballed. 
  5. Stop the array, set Drive 12 to "no device", start the array
  6. At this point disk 12 showed up under unassigned devices and I could preclear it
  7. After the preclear finished I stopped the array
  8. Assign the drive to disk 12 and start array
  9. Start rebuild

Is there a mistake in that sequence that could have caused the data loss?

 

I cannot say for sure that the data loss occurred during the rebuild as I did not check the disk 12 contents while it was being emulated. 

 

Thanks!


The actual drive 12 being red-balled is to be expected at this point as it has not been rebuilt.

 

If unRaid also reported disk 12 as unmountable at this point when you tried to start the array, it indicates file system corruption on the emulated disk, and as you don't mention taking any action to correct this, the only way you could have started the array is to check the box telling it to format the (emulated) drive. If this is what you did, this is when your files were deleted.
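In that situation the right move is to check and repair the filesystem on the emulated disk rather than format it. A rough sketch, assuming XFS and the usual unRAID array device naming for disk 12 (/dev/md12 on 6.x; start the array in maintenance mode first and double-check the device name for your slot and version):

    xfs_repair -n /dev/md12    # -n = no-modify, only report what it would fix
    xfs_repair /dev/md12       # the actual repair, only after reviewing the -n output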

11 minutes ago, remotevisitor said:

If this is what you did, this is when your files were deleted.

 

I did not format the drive. I don't think unRAID even offered it. I stopped the array, set Drive 12 to 'no device', then precleared the old drive and reassigned it. At this point in time - this is my assumption! - it was accepted as a new drive and rebuilt.

 

 


Couple points.

 

- If looking to replace the controllers, consider the LSI SAS9201. It comes in several variants, including 8 and 16 internal port versions. It requires no flashing and is often found at a good price point on eBay.

 

- Although guidance is often to rebuild on the kicked drive, this is actually risky. Although the rebuild typically works properly, if it doesn't, as in this case, you are stuck with no options. If you had pulled the failed disk and set it to the side, and done your rebuild on a different disk and noticed problems, there would be recovery options open to you with the failed disk.

 

- When a disk is kicked, unRaid will automatically emulate the disk. Its disk share (if disk shares are enabled) and the user shares will continue to operate. It is appropriate to look at the emulated disk and confirm it is being properly emulated, because that is exactly what the rebuild is going to recreate.

 

If you are going to rebuild onto the same physical disk, it is especially important to verify the emulated disk.
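A quick sanity check of an emulated disk (using disk 12 as the example; data disks are mounted at /mnt/diskN while the array is started) could be as simple as:

    df -h /mnt/disk12                   # does the used space look plausible?
    du -sh /mnt/disk12/*                # rough per-folder totals
    find /mnt/disk12 -type f | wc -l    # file count to compare against what you expect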

1 hour ago, tazman said:

Thanks, SSD, very good points. 

 

I am still unclear about why the data loss has occurred. 

 

Wish I had an explanation. UnRaid's rebuild logic is tried and true. I think it likely that if you had looked at the emulated disk, it would have been faulty / incomplete. (You could pull the disk and let unRaid emulate it again, and see if the emulated disk is complete. My guess is it will be the same as what you are seeing on the physical disk.) This would all point to parity itself being corrupted at the time of the failure.

 

The forum recommends monthly parity checks to verify parity is properly maintained. If parity were to get out of whack, you would otherwise never know. Parity is totally unused except for being maintained; only when a disk is kicked is it called upon, and then it has to be 100% right to be useful.

 

On a hard shutdown, parity can fall out of sync with the data, so a parity check is kicked off. But a user can cancel it. And that check is non-correcting, so if it finds a problem, a second, correcting check is needed; unRaid doesn't force you to do so. Marvell controller issues could also cause subtle parity issues. You can also define an array and tell unRaid to trust parity. Again, there are no obvious ill effects of parity being off, even horribly off, until a disk is kicked. And the owner has a lot of ways to allow it to happen.

 

I, and most here, like the power that unRaid affords: to reconstruct arrays and tell unRaid to trust parity, to cancel parity checks, etc. But the user needs to understand how it all works and take precautions.

 

I'm not saying any of these were the problem here. Just pointing out some possibilities and unRaid design principles. Getting burned is not fun, but does sometimes teach valuable lessons to avoid similar problems in the future. And I'm just trying to help so it doesn't happen again to you or others reading this post.

  • 1 month later...

To close this one out I would like to report that I have replaced the three Marvel controllers with an LSI SAS9201-16i and everything is working fine again. Replacing the controller was straightforward: just swap the controller and start unRAID. 
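For anyone doing the same swap, a quick way to confirm the new HBA and all drives are detected (plain Linux commands, nothing unRAID-specific):

    lspci | grep -i sas        # the SAS9201-16i should show up as an LSI SAS controller
    ls -l /dev/disk/by-id/     # all drives should reappear under their serial-number-based IDs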

 

Before the replacement, two more drives failed, so I was down to no redundancy even with my second parity drive. 

 

Like the first failed drive, the two additional drives that failed reported write errors in the log despite the fact that nothing was written on them when they failed. 

 

I left the failed drives untouched and replaced them with new ones after the controller was replaced. 

 

I then compared their contents with the rebuilt drives. Mounting the old drives failed at first because each drive had the same filesystem UUID as its rebuilt replacement, so I had to generate new UUIDs for them. This also failed because both drives had unreplayed xfs journal entries. I suspected that discarding the journal entries could lead to data loss, but deleted them anyway so I could mount the drives. One drive had a current pending sector.
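For reference, the commands involved (a sketch only, assuming the old drive showed up as /dev/sdX with its XFS partition at /dev/sdX1; double-check the device name before running anything destructive):

    xfs_admin -U generate /dev/sdX1   # try to give the filesystem a new UUID; refuses while the log is dirty
    xfs_repair -L /dev/sdX1           # -L zeroes (discards) the unreplayed log entries
    xfs_admin -U generate /dev/sdX1   # now the new UUID can be written
    mkdir -p /mnt/olddisk && mount -o ro /dev/sdX1 /mnt/olddisk   # mount read-only for comparison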

 

Both old drives (without the journal entries replayed) did not show any data loss. 

 

Looking back I am still unclear why I had a data loss on my first drive. 

 

When I precleared the old drives, both showed changing current pending sector counts during the several preclear cycles, but at the end they settled at zero with an overall 100% clean SMART report. I suspect/hope that the drives are now OK again and that the pending sectors were a consequence of the controller problems. 
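The attributes worth watching can be pulled directly with smartctl (replace /dev/sdX with the drive in question):

    smartctl -A /dev/sdX | grep -Ei 'current_pending|reallocated|offline_uncorrectable'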

46 minutes ago, tazman said:

Like the first failed drive, the two additional drives that failed reported write errors in the log despite the fact that nothing was written on them when they failed. 

 

That is consistent with how unraid handles errors. When a read request fails, unraid spins up all the drives, calculates what data should be there from all the other drives, and writes that data back to the correct spot. If that write succeeds, then unraid increments the error column for that drive and continues on. If the write fails, then unraid red balls the drive and no further requests are made to that physical device. All further activity to that drive slot is emulated using the rest of the data drives + parity. So, once a drive is red balled, even if the drive itself is physically ok, any data written to it will only be on the rebuilt or emulated drive. Re-enabling the physical drive guarantees data loss going forward from the instant it was red balled.

