
Failed Array - Determine Shares or File names on failed drives


Airless


Hi All,

 

My array is in a bad place. I'll admit up front, this is 100% my fault and I should have had offsite backups of parts of the array but I didn't. Full story below but the short of it is:

  • I fried 6 of 11 drives, 2 of which were parity drives
  • I'm in the process of having a recovery service recover the drives (should be possible)
  • To recover even just 4 of the drives and reconstruct the full array will be prohibitively expensive

 

I can afford to recover some of the drives but not all. So I'd like to get the best bang for my buck and focus on the drives with the more important/irreplaceable data.

 

Is there a way to generate a list of files or shares that occupied each drive? I want to generate a manifest of files and their disk locations so I can decide which specific disks to recover instead of rolling the dice and picking one at random!

(END OF SUMMARY)

 

 

---- The long story (Lesson: Don't get lazy no matter how deep into debugging you are) ----

This all started when I was having some stability issues with my server. Every day or two my system would become unresponsive even when logged into through IPMI. After the 3rd incident I decided to disconnect the drives and start testing components. This seemed like a hardware problem.

 

My gut feeling was that the issue was either with the RAM or Motherboard. It felt like a RAM issue so I started running MemTest on the system to see if anything came up. There were no errors reported during the tests but the system would hang after 20 - 180 mins of testing. I was starting to think this wasn't a RAM issue.

 

I have a love-hate relationship with my Supermicro motherboard, so it was next on my hit list. It's been a pain to work with. Had to buy a license from Supermicro to flash the BIOS so it would actually boot with my Kaby Lake processor, couldn't get it to work with my original PCI SATA card, etc...

 

Unfortunately I don't have any extra server grade parts lying around to test with so swapping the RAM, CPU, or Mobo for a good variant was going to be a lot of trouble. So, I decided to try hooking up a different PSU.

 

Initially I hooked up just the Motherboard connections to the PSU and ran MemTest. To my surprise the tests ran for 22+ hours without issue. Cool! I thought, replacing a PSU is really the cheapest part to break, I lucked out.

 

Here's where things go south...

 

Without thinking I decide to hook up the power and data cables to the 6 HDDs in the main case to double check everything still runs fine before putting the server back in its place and hooking it up to my other drive enclosure. I was hoping to have the server back up and running while I RMA'd the defective PSU.

 

I did a nice job running cables in the server case and rather than undo all of that nice work and re-run cable I decided to hook up the HDD power cables from my dead PSU to the working replacement. They're both modular, the plugs fit. No problem right?...

 

Wrong!

 

I hook everything up, hit power, the fans spin maybe 3/4 of a rotation and the system immediately powers down. I press the power button again. Same issue. Scratching my head I try a few different things (reset CMOS, reseat RAM, etc.). Weird, maybe the ports on this modular power supply aren't working. To be fair, this PSU has only ever been used in a gaming PC and hooked up to maybe 2 HDDs at a time (not 6). So I try different modular ports on the PSU but get the same result. Hmmm...

 

I unplug the HDD power cables, try again, and the system posts. Great! But wait, why didn't the system start up with the HDDs powered? Are modular power cables with standardized ends really different between manufacturers?

 

YEP...

 

A wave of dread washes over me. Did I really just kill 6 drives at once?!

 

I find the right cables, hook them up, power on the system and don't hear anything too out of the ordinary. I do hear a bit of clicking from one drive but for the most part things seem fine. I get into unRAID and sure enough 6 of my 11 drives, all 6 that are in the main case, are missing...

 

Luckily 2 of those failed drives were parity drives so I really only need to fully recover 4 drives so long as I can get 100% of the data off the parity drives (if they're the ones that get recovered).

 

However, after getting a quote to fully recover 4 drives it's more than I can justify spending. So now, as my initial question states, I'm trying to figure out which 1 or 2 drives I get recovered. Any ideas?

 

A Final Note:

Just to be clear, this isn't a sob story. This is 100% my own damn fault. I just hope someone else reads this as a cautionary tale and stops before haphazardly working deep into the night to try and get their server fixed sooner rather than taking their time and being disciplined when dealing with their hoard of data :)

And for the thousandth time.... Keep an offsite backup!

 

Any suggestions are greatly appreciated!

Link to comment

I should add that I have Dynamix File Integrity, Dynamix Cache Directories and a number of other plugins installed. So if there's a log file that would help indicate what paths were on a drive, that could be helpful too.

 

Anything to figure out what was on those drives!

Link to comment
8 hours ago, Airless said:

However, after getting a quote to fully recover 4 drives it's more than I can justify spending.

Who did you get a quote from? And is it a full recovery house, or one that specializes in circuit board replacement?

 

Typically drives that are killed in the manner you describe do NOT require clean room service, and are not all that expensive to repair. Worst case is a rare circuit board that requires a rework station to swap some chips.

 

 

Link to comment

I got my quote from a local Data Recovery company. One that does recovery of all sorts (HDDs, SD Cards, Phones, etc...).

 

According to them the electronics on all of the drives are fried (not surprising) and a firmware reconstruction is required (need to look into what that means...I'd figure the software was fairly standardized within a drive model).

 

Half of the drives also have damaged read/write head assemblies which need to be replaced. Normally I'd be skeptical but this does somewhat make sense as I only hooked up half the drives to a proper power source after initially using the wrong cables on all 6 of them. It makes sense that half of the drives would have a different type of damage.

 

I'm essentially looking at $700 CAD per drive for the less damaged drives and $1,300 CAD per drive for the more damaged ones. At that price you can see why I'm not going to try and recover 4 drives! On the plus side they do only charge if they're able to recover the data so I am guaranteed to get data back for any money I spend.

 

Link to comment

Thanks, I'll see what they quote. Hopefully they do work within Canada. I don't really want to deal with sending these drives internationally.

 

Any chance there's a way to figure out what files and/or shares were on each of the dead drives? I figure unRaid must have a config file or something that maps drives to shares at least.

Link to comment
2 hours ago, Airless said:

I figure unRaid must have a config file or something that maps drives to shares at least.

If you were including/excluding disks for any share, that can be seen in the share's cfg file, but if it was automatic there won't be any info on which drives were being used.

Link to comment

While waiting on some competing quotes for drive recovery I spent some more time poring over logs and a copy of my unRaid boot key to see if I could find any hints of what the file structure might look like on the damaged drives.

 

File Integrity

I struck gold with the dynamix.file.integrity plugin!

As I had suspected there is a complete list of files stored with their hash on the key. For future reference, in case this helps anyone else you can find the files at:

{yourKey}/config/plugins/dynamix.file.integrity/export/disk{n}.export.hash.

 

Unless you've been regularly exporting your hashes these files probably won't be up to date but even if they're a bit old they can give you a good indication of what might be on the drives (depending on share drive fill strategy).
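
For anyone wanting to turn those export files into a per-disk manifest, here is a rough Python sketch. It assumes (unverified) that each line of an export file looks like `<hash> *<path>` with paths of the form `/mnt/disk3/ShareName/...`; the function names are made up for illustration, so adjust the parsing to whatever your files actually contain.

```python
from collections import Counter
from pathlib import Path, PurePosixPath


def share_of(path: str) -> str:
    """Return the top-level share for a path such as /mnt/disk3/Media/a.mkv."""
    parts = PurePosixPath(path).parts
    # Assumed layout: ('/', 'mnt', 'disk3', 'Media', 'a.mkv')
    if len(parts) > 4 and parts[1] == "mnt":
        return parts[3]
    return "(unknown)"


def parse_export_line(line: str):
    """Split one '<hash> *<path>' line (assumed format) into (hash, path)."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    digest, _, rest = line.partition(" ")
    path = rest.lstrip(" *")
    return (digest, path) if digest and path else None


def shares_per_disk(export_lines):
    """Count files per share for the lines of one disk's export file."""
    counts = Counter()
    for line in export_lines:
        parsed = parse_export_line(line)
        if parsed:
            counts[share_of(parsed[1])] += 1
    return counts


def summarize_export_dir(export_dir):
    """Map each disk*.export.hash file name to its per-share file counts."""
    summary = {}
    for export_file in sorted(Path(export_dir).glob("disk*.export.hash")):
        with open(export_file, encoding="utf-8", errors="replace") as fh:
            summary[export_file.name] = shares_per_disk(fh)
    return summary
```

Calling `summarize_export_dir(...)` on the export folder from the key gives a quick view of which shares lived on which disk, as of the last export.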

 

Diagnostics Exports

If, like me, you were having trouble with your server before destroying your drives you may have turned on regular diagnostics exports for your system. In my case I have diagnostics exports every 30 mins for the last few days of operation. I haven't pored over these yet but there may be some file path hints in the logs here. Will report back if I find anything worth watching out for if someone else is in this situation.

 

I did notice that in each export in {export}/system/vars.txt is a list of disks and what appears to be their state. Of note is the [fsFree] field. Seems like this is the amount of free space on the drive. I have yet to verify this but if this is the actual free space on the drive at the time of the export it could help me decide which drives to pay for recovery on.
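
A quick way to pull those values out of an export is sketched below in Python. It assumes (I haven't confirmed this) that vars.txt is a PHP `print_r`-style dump where each disk block has a `[name] => diskN` line somewhere before its `[fsFree] => number` line. The units of the number are also unverified, so treat the output as relative sizes only.

```python
import re

# Assumed line shapes inside vars.txt, e.g. "[name] => disk1"
NAME_RE = re.compile(r"\[name\]\s*=>\s*(\S+)")
FREE_RE = re.compile(r"\[fsFree\]\s*=>\s*(\d+)")


def free_space_by_disk(text: str) -> dict:
    """Return {disk name: fsFree value} from the text of a vars.txt file."""
    result = {}
    current = None
    for line in text.splitlines():
        name_match = NAME_RE.search(line)
        if name_match:
            current = name_match.group(1)
            continue
        free_match = FREE_RE.search(line)
        if free_match and current is not None:
            result[current] = int(free_match.group(1))
    return result
```

Comparing the result across several exports would also show whether the numbers move the way free space should.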

 

 

The adventure continues :)

Link to comment
On 7/6/2018 at 4:49 AM, Airless said:

They're both modular, the plugs fit. No problem right?


This is one of the huge dangers with modular PSU.

Kills disks every day all over the globe.

And no way to spread the message to be "real careful", since there isn't a place where all potential users will see the message. :(

 

On 7/6/2018 at 1:11 PM, jonathanm said:

Typically drives that are killed in the manner you describe do NOT require clean room service, and are not all that expensive to repair. 

 

This is really important to remember - broken electronics are way different from broken mechanics. Multiple companies work with donor electronics to recover data. Then they don't lock up their clean room equipment during the file copy process and can directly focus on the broken electronics of the next disk.

 

On 7/6/2018 at 1:11 PM, jonathanm said:

Worst case is a rare circuit board that requires a rework station to swap some chips.

 

It's more common to use a donor card and - depending on model - optionally move a configuration chip from the original board.

 

On 7/7/2018 at 4:31 AM, Airless said:

I got my quote from a local Data Recovery company.

 

Unless you live in "huge city", your local data recovery company would normally be a quite small company. Maybe not even very experienced.

 

On 7/7/2018 at 4:31 AM, Airless said:

a firmware reconstruction is required (need to look into what that means..

 

Firmware is normally the software running in embedded equipment. But I guess they didn't mean firmware but actually the disk-specific configuration. All disks are unique - because of the small mechanical tolerances they can't produce millions of perfectly identical disks. So the disk has a table with information telling it lots of parameters about location of tracks, number of sectors/track and at what track number the disk changes to a different number of sectors/track for the different surfaces. Some disks store most of this information on sectors on the disk, while some disks store most of this information in EEPROM or flash memory on the PCB.

 

On 7/7/2018 at 4:31 AM, Airless said:

Half of the drives also have damaged read/write head assemblies which need to be replaced.

 

I wouldn't expect damaged mechanics. But possibly the read amplifiers, since the signal from the heads is very weak.

 

On 7/8/2018 at 4:48 AM, Airless said:

I struck gold with the dynamix.file.integrity plugin!

 

I think everyone who cares even the slightest about data integrity should use some software to create a table of file hashes. When the shit hits the fan, a hash table really is worth plenty.

 

Please keep us informed about the outcome. I'm holding my thumbs.

Link to comment
On 7/9/2018 at 8:19 AM, pwm said:

Unless you live in "huge city", your local data recovery company would normally be a quite small company. Maybe not even very experienced.

 

I wouldn't say I live in a huge city but it's a combination government and tech town. Big telecom/chip companies here and lots of government buildings. The capabilities are here but their government and enterprise customers tend to drive up prices.

 

On 7/9/2018 at 8:19 AM, pwm said:

 

Firmware is normally the software running in embedded equipment. But I guess they didn't mean firmware but actually the disk-specific configuration. All disks are unique - because of the small mechanical tolerances they can't produce millions of perfectly identical disks. So the disk has a table with information telling it lots of parameters about location of tracks, number of sectors/track and at what track number the disk changes to a different number of sectors/track for the different surfaces. Some disks store most of this information on sectors on the disk, while some disks store most of this information in EEPROM or flash memory on the PCB.

That's what I figured. I send them the SMART data, they see the Firmware version, they download and flash it. Everything works. Your explanation makes sense though. Per unit config parameters to account for manufacturing variations seems reasonable.

 

On 7/9/2018 at 8:19 AM, pwm said:

I wouldn't expect damaged mechanics. But possibly the read amplifiers, since the signal from the heads is very weak.

Hmm, ok. I'll definitely get a second opinion from another shop.

 

On 7/9/2018 at 8:19 AM, pwm said:

I think everyone who cares even the slightest about data integrity should use some software to create a table of file hashes. When the shit hits the fan, a hash table really is worth plenty.

Yea, live and learn. My data storage protocol will be changing quite a bit once I get through this recovery :)

 

Thanks for the input. I'll make sure to update this thread as I make progress!

Link to comment
2 hours ago, Airless said:

Big telecom/chip companies here and lots of government buildings. The capabilities are here but their government and enterprise customers tend to drive up prices.

 

I would have expected the above companies/organizations to not need to be big data recovery customers - shouldn't they have well-paid employees who know how to build robust systems with working backup and redundancy?

Link to comment
9 hours ago, pwm said:

I would have expected the above companies/organizations to not need to be big data recovery customers - shouldn't they have well-paid employees who know how to build robust systems with working backup and redundancy?

 

Re: Government - Their IT teams are not exactly known for their speedy work. I'm sure there are parts of the government that handle very sensitive data and have their own internal recovery teams but in the spirit of keeping things lean there's been a big movement around here since the 90's to privatize services where possible.  The private services can also handle overflow work when there is an unexpected spike in recovery required (Ex: building fire).

 

Re: Big Telecom Companies - Yea I don't know. Maybe at this point so many services have been moved off site that there isn't enough work to staff a proper recovery team. It's cheaper to hire a service to recover a few RAIDs a year than to staff/maintain a team. I don't really know though. I've only ever worked at small, haphazard tech companies in the region that treat things like backups and version control as "optional"...

 

 

Update

Still waiting on Donor Drives to get back to me. I'm going to have to pull the trigger on a company soon though.

Through the file manifests I've been able to cobble together I have a plan. I'm going to get the service to recover one drive at a time and update me based on the result since the completeness of a drive's recovery will somewhat dictate my strategy.

 

1. Recover two drives. One of the expensive/more damaged drives that I KNOW has most of my irreplaceable data, and the other that is cheaper to recover but has enough useful/pain-in-the-butt data on it to be worth paying for the recovery.

 

Here's where things split

Option 1

2a. If the previous 2 drives were recovered 100% then I'll recover another one of the cheaper data disks.

3a. If that next drive is recovered 100% then I'll have the less damaged/cheaper parity drive recovered. At that point I'll have good confidence that I'll get a 100% recovery of that last drive and it's the last drive that can be recovered at the cheaper price point. So long as that one recovers 100% I should have all of my data back!

 

Option 2

2b. If the previous 2 drives were not recovered 100% then I can stop here and know that it's not worth spending any more money on.

 

Option 3

2c. If one of the previous 2 drives wasn't 100% recovered then I know it's not worth even looking at the Parity drives. In theory I could try to recover 3 more drives and hope for the best but I really don't want to take that gamble. I'll just recover the 2 remaining cheaper data drives and call it a day.

 

Regardless of what happens I'll make sure to get the original HDDs back. Who knows, maybe data recovery will get super cheap in the future or I'll win the lottery :) Worst case I can harvest a bunch of real strong fridge magnets out of them!

Link to comment
3 hours ago, Airless said:

I'm sure there are parts of the government that handle very sensitive data and have their own internal recovery teams

 

Something has already failed when recovery is needed. I can understand some important files needing to be recovered from a laptop or two after accidents and before the laptop has been able to synchronize with the backup servers.

 

On the other hand - laptop disks should always be encrypted because of the danger of losing the computer while traveling. And encryption should make it hard or impossible for outside companies to be able to recover actual files, while the IT department should have a copy of each individual disk encryption key to make sure they can recover files when the laptop owner has forgotten the password to unlock the disk.

 

3 hours ago, Airless said:

I've only ever worked at small, haphazard tech companies in the region that treat things like backups and version control as "optional"...

 

This is a not too uncommon situation. But the larger companies really should have collected enough experience many years ago and invested in multi-redundant storage pools so they can burn down a building and still keep all important information except what people produced locally on their personal machines and that hadn't been backed up yet. Clustered storage isn't exactly rocket science anymore.

 

3 hours ago, Airless said:

If one of the previous 2 drives wasn't 100% recovered then I know it's not worth even looking at the Parity drives.

 

In your situation, parity drives shouldn't normally be on the list of disks to recover. They are only meaningful (to rebuild other disks) if every single disk required for the rebuild is 100% valid. In a system where you have lost 4 data and 2 parity disks, you need to recover 4 disks in total. And none of the four should be a parity disk unless you already know that one or two of the data disks are totally impossible to recover. If you do data recovery on four data disks and get 80% recovery on each disk, then you have 80% recovery of the lost data. If you do data recovery on the parity disks and they manage 80% recovery, then you'll have a total mess figuring out what is valid data and what is broken data on disks you let unRAID rebuild.

 

Obviously, you have a huge advantage here if you have access to a database of file hashes, so you can weed out broken files.
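
As a minimal sketch of that weeding-out step, the Python below re-hashes recovered files and compares them with a stored manifest. The manifest shape (relative path to hex digest) and the function names are assumptions for illustration, and the hash algorithm has to match whatever was used when the hashes were created; `md5` is only a placeholder default.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large media files don't need to fit in RAM."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def check_recovered(manifest, root, algorithm="md5"):
    """Sort manifest entries into (ok, bad, missing) lists.

    manifest: dict of relative path -> expected hex digest (assumed shape).
    root: directory holding the recovered files.
    """
    ok, bad, missing = [], [], []
    for rel_path, expected in sorted(manifest.items()):
        candidate = Path(root) / rel_path
        if not candidate.is_file():
            missing.append(rel_path)
        elif file_digest(candidate, algorithm) == expected.lower():
            ok.append(rel_path)
        else:
            bad.append(rel_path)
    return ok, bad, missing
```

Files in the `bad` list exist but no longer match their recorded hash, which is exactly the set you would not want to trust after a partial recovery or a rebuild.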

Link to comment

Parity disks are only useful with their data in the raw sectors intact. There is no file system to do a partial recovery. It's going to be all or nothing. So a data recovery attempt is going to be next to useless unless they recover all the sectors of the parity disk. And if I was spending money on a data recovery of a disk, I'd rather spend it on an actual data disk.
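
To illustrate why a partly recovered parity disk is so unhelpful, here is a toy Python sketch of a rebuild from simple XOR parity. It models only a single XOR parity over tiny byte strings standing in for disks; a second parity disk uses a different calculation that is not shown here, and the function names are invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, offset by offset."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_data, parity):
    """Rebuild the one missing data disk from the survivors plus parity."""
    return xor_blocks(list(surviving_data) + [parity])


def differing_offsets(a, b):
    """List the offsets at which two byte strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


disk1 = b"\x01\x02\x03\x04"
disk2 = b"\x10\x20\x30\x40"
disk3 = b"\xaa\xbb\xcc\xdd"  # pretend this disk is lost
parity = xor_blocks([disk1, disk2, disk3])

# With an intact parity image the lost disk comes back exactly.
assert rebuild_missing([disk1, disk2], parity) == disk3

# Damage one parity byte: the rebuilt disk is silently wrong at that offset.
damaged = bytearray(parity)
damaged[2] ^= 0xFF
rebuilt = rebuild_missing([disk1, disk2], bytes(damaged))
print(differing_offsets(rebuilt, disk3))
```

The rebuild has no idea which file, if any, sits at the damaged offset, which is why only a complete sector-level image of a parity disk is worth paying for.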

Link to comment

There is one additional issue when considering recovery of parity drives to let unRAID rebuild one or more data disks.

 

It isn't just the parity drives that must be recovered as full-disk images. Any recovered data disk must also have been recovered as a full-disk image. This is needed because unRAID does the disk emulation (and disk rebuilds) based on each disk being a huge image file. The emulation/rebuild treats all the disks as raw devices - not as collections of files.

 

So the data recovery company can't just recover the individual files of the data disks - they must recover the actual full disk image of every recovered disk, and hand them over on suitable disks that will mount and behave as the original disks.

 

When ignoring the parity disks and just focusing on the data disks, it's ok if the recovery company recovers the files and hands them over on any other media that you can then import into your unRAID system and then rebuild parity.

 

Anyway - with your kind of failure, I would find it most likely that the recovery company either recovers 0% (a total failure to get new electronics to interact with your platters and locate the actual data) or near 100% (everything recovered with the potential exception of one or more uncorrectable sectors that you weren't aware of or that might be marginal and just about readable in the original disk but outside the margin after the work the recovery company had to do).

Link to comment
On 7/11/2018 at 2:27 AM, pwm said:

In your situation, parity drives shouldn't normally be on the list of disks to recover. They are only meaningful (to rebuild other disks) if every single disk required for the rebuild is 100% valid. In a system where you have lost 4 data and 2 parity disks, you need to recover 4 disks in total. And none of the four should be a parity disk unless you already know that one or two of the data disks are totally impossible to recover. If you do data recovery on four data disks and get 80% recovery on each disk, then you have 80% recovery of the lost data. If you do data recovery on the parity disks and they manage 80% recovery, then you'll have a total mess figuring out what is valid data and what is broken data on disks you let unRAID rebuild.

Right, I think you missed the details of Option 1. The only option where I even consider recovering a parity. The parity is purposely the last drive to get recovered and that only happens IF all previous drives have been recovered 100% as an image. Going this route will save me a good chunk of change and the company has already agreed that I'll only be charged for that drive if they get a perfect image.

 

On 7/11/2018 at 2:49 AM, pwm said:

Anyway - with your kind of failure, I would find it most likely that the recovery company either recovers 0% (a total failure to get new electronics to interact with your platters and locate the actual data) or near 100% (everything recovered with the potential exception of one or more uncorrectable sectors that you weren't aware of or that might be marginal and just about readable in the original disk but outside the margin after the work the recovery company had to do)

My thought exactly. For the cheaper to recover drives it's purely an electronics replacement so it'll either work or not. The more expensive recoveries are the higher risk of failure or partial data recovery. That's why I'm recovering the highest risk (also highest value) drive first and choosing my course of action based on the result of that drive. If it's 100% then I move forward with recovering 2 more data drives which should only have electronic damage. If both of those recover 100% then I have good certainty the parity drive with just electronic damage will recover to 100% as well.

 

Keep in mind the evaluated condition of my drives:
2 Data, 1 Parity - PCB damage only

2 Data, 1 Parity - PCB + Head damage
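
Putting numbers on that, here is a small Python sketch of the arithmetic using the per-drive prices from the first quote earlier in the thread ($700 CAD for PCB-only damage, $1,300 CAD for PCB + head damage). It assumes those prices hold for every drive in each category, and it only covers the first two drives and Option 1.

```python
PCB_ONLY_CAD = 700       # quoted price per drive, PCB damage only
PCB_AND_HEAD_CAD = 1300  # quoted price per drive, PCB + head damage


def plan_cost(pcb_only_drives, pcb_and_head_drives):
    """Total cost in CAD for a given mix of drives sent for recovery."""
    return (pcb_only_drives * PCB_ONLY_CAD
            + pcb_and_head_drives * PCB_AND_HEAD_CAD)


# Step 1: one PCB + head data drive and one PCB-only data drive.
first_two = plan_cost(1, 1)

# Option 1: step 1, plus the second PCB-only data drive and the
# PCB-only parity drive (the last data drive is then rebuilt from parity).
option_1 = plan_cost(3, 1)

# For comparison: paying to recover all four data drives directly.
all_four_data = plan_cost(2, 2)

print(first_two, option_1, all_four_data, all_four_data - option_1)
```

Under these assumptions, the parity route comes out $600 CAD cheaper than recovering all four data drives, at the price of needing perfect images of every drive involved.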

 

On 7/11/2018 at 2:33 AM, ken-ji said:

Parity disks are only useful with their data in the raw sectors intact. There is no file system to do a partial recovery. It's going to be all or nothing. So a data recovery attempt is going to be next to useless unless they recover all the sectors of the parity disk. And if I was spending money on a data recovery of a disk, I'd rather spend it on an actual data disk.

Yes, they're willing to recover an image and have agreed that it's no charge unless a complete image is retrieved from that drive. That comes with the condition that they were successful with the 3 previous drive recoveries and we attempt recovery of the less damaged parity. With that agreement there's no reason I shouldn't try. Either I get all my data back for less money or we try and fail for free and I decide if I want to try and recover that last data disk.

Link to comment

Update

I did finally hear back from Donor Drive, or rather their sister company that does the repair/recovery, and their prices were quite a bit better. PCB replacement + data recovery (no head damage) was about 30% cheaper. Unfortunately, they only have locations in the US. Shipping all of these drives across the border is not something I'm all that comfortable with. Ironically, the data I'm least concerned with recovering (source backups of client projects) is the data that makes me least comfortable sending cross border.

 

So, I'm going to bite the bullet tomorrow and start the recovery of the first two data drives. Wish me luck!

 

Side note, I RMA'd the PSU that was the start of this whole fiasco. Replacement PSU exhibits the same issue on multiple systems! Started another RMA. Hopefully 3rd time is the charm.

Link to comment

A question just came to mind. Looking at HDDs these days it seems it's not really worth going for 2TB drives anymore. 3TB seem to be the sweet spot (not complaining!).

 

If I successfully recover that one parity so I have 3 valid data + 1 valid parity in a 6 disk total array is there any issue with the 2TB drives being replaced with 3TB ones?

 

From my understanding of how the parity works, so long as I pre-clear the 3TB drives before loading them up with the recovered data I SHOULD be fine. This is because the new space on the data drives will just return a 0 and not affect the end result of the bit had the space not existed at all.
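
The reasoning can be checked with a toy Python sketch. It models only simple XOR parity on tiny byte strings (a second parity disk uses a different calculation not shown here), and it assumes the extra space on the larger drives really does read back as all zeros.

```python
def xor_parity(disks):
    """XOR parity across disks; a shorter disk contributes nothing past its end."""
    size = max(len(d) for d in disks)
    parity = bytearray(size)
    for disk in disks:
        for offset, byte in enumerate(disk):
            parity[offset] ^= byte
    return bytes(parity)


# Two small "2TB" disks and their parity.
small_a = b"\x11\x22\x33\x44"
small_b = b"\xa0\xb0\xc0\xd0"
parity_small = xor_parity([small_a, small_b])

# The same data on larger "3TB" disks whose extra space is zeroed.
big_a = small_a + b"\x00\x00"
big_b = small_b + b"\x00\x00"
parity_big = xor_parity([big_a, big_b])

# Parity over the original region is unchanged, and the new region is zero.
assert parity_big[:len(parity_small)] == parity_small
assert parity_big[len(parity_small):] == b"\x00\x00"
```

The catch is the second assert: the matching region of the parity disk has to be zero as well, so this only holds if every byte beyond the old disk size is cleared before the array relies on it.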

Link to comment

When putting a value on drives, keep in mind that each slot used has a cost, in power budget, SATA connections, and physical space.

Personally I think the 8TB drives are where the value is right now, particularly if you catch a sale on WD externals that are easily removable from their cases and used internally.

 

7 hours ago, Airless said:

From my understanding of how the parity works, so long as I pre-clear the 3TB drives before loading them up with the recovered data I SHOULD be fine. This is because the new space on the data drives will just return a 0 and not affect the end result of the bit had the space not existed at all.

 

Theoretically that would work. However... I'm not sure I'd risk it, especially if you are going to try to rebuild a drive with a recovered image. It would be much better to set a HPA on the drive to set it to the exact number of sectors needed for the image, then back up the data to a different drive, then once all your data is safely backed up, remove the HPA and rebuild one at a time.

 

Link to comment
On 7/13/2018 at 5:58 PM, jonathanm said:

When putting a value on drives, keep in mind that each slot used has a cost, in power budget, SATA connections, and physical space.

Personally I think the 8TB drives are where the value is right now, particularly if you catch a sale on WD externals that are easily removable from their cases and used internally.

I see what you're saying but even the cheapest 8TB drives available to me are ~$300. I need to replace 10TB of storage + 2 parities. That's $1,200 for cheap 8TBs vs $780 for 6x 3TB Reds. Having 14 free HDD slots available to me at the moment (including the 6 from the fried drives) leaves me enough room to grow for my needs. By the time I start outgrowing the number of available slots 8TBs should come down in price and I can start replacing 2TB drives.

 

HPA is a smart idea if I end up going that route. Thanks! I didn't think of that but that is a much safer option. At the moment the recovery service is working through the first 2 drives. That will influence my next steps. Kinda hoping I don't have to try and recover the parity but at this point extra work on my part will outweigh the cost. If it comes to backing up the entire array to another system and attempting the recovery, that's what I'll do. I have access to plenty of temporary storage to try something like that without risk.

 

Thanks for all the advice!

 

Link to comment

Archived

This topic is now archived and is closed to further replies.
