
Best way to verify drives are ok after water damage while maintaining as much array integrity as possible



Hello all,

 

I lost my main server (23 disks in the array) to a broken water pipe. About 5 of my drives had water stream right past them; the remaining ones were dry. I have ordered a new server chassis, but my question is this: what is the best way to verify each drive is still functional and, if I got lucky, to do so in a way that doesn't invalidate the array? And for each drive that is functional, is there any kind of stress testing I can or should do?

 

However, I am expecting some loss of data here; I'm not generally that lucky. I'm just not sure of the best way to try to salvage and rebuild here.

 

The drives in question were 3 data drives and both my parity drives.

 

Any insights are greatly appreciated.

 


Yeah, the whole server was up and running. I have no idea (yet) what is salvageable... if anything.

 

Server was mounted face up on a vertical bracket. Water streamed down past about 5 drives and right into the PSU. I was doing stuff with the server when it suddenly stopped responding. All the other water in the chassis was from my fans throwing droplets everywhere. My plan is to go over every inch of the mobo and my HBAs and scrub any hard water deposits I see with 99% iso, and then test. It's all just been sitting with a fan on the individual components for a few days.

 

But the drives.. I'm pretty sure the remaining 18 or so didn't get any water. So unless something electrical fried them, I have some hope there.


I would remove them and tag each with its physical location in the case. Then put them in a warm, dry environment and thoroughly dry them out. If you want, you could then pack them in a desiccant for a couple of days to remove any remaining moisture.

 

Then I would install each drive in a USB housing and use the manufacturer's testing program to run a long read test, and I would use another computer to do this. If a drive can be read, copy the data off.
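If the manufacturer's tool is awkward to run over USB, smartmontools' `smartctl` can kick off the same built-in long self-test from any Linux box. A minimal sketch, assuming smartmontools is installed and with `/dev/sdX` as a placeholder for your USB-attached drive (some USB bridges also need `-d sat`):

```python
# Sketch: start a drive's extended (long) SMART self-test via smartctl.
# /dev/sdX is a placeholder; smartmontools must be installed on the machine.
import subprocess

def long_test_cmd(device: str) -> list[str]:
    """smartctl invocation that starts the extended (long) self-test."""
    return ["smartctl", "-t", "long", device]

def disk_failing(exit_status: int) -> bool:
    """smartctl's exit status is a bitmask; bit 3 (value 8) is set when
    SMART reports the disk is failing."""
    return bool(exit_status & 8)

# Usage on a real machine (commented out so the sketch is side-effect free):
#   result = subprocess.run(long_test_cmd("/dev/sdX"))
#   if disk_failing(result.returncode):
#       print("drive reports FAILING")
```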

 

After you recover all the data you can, you can decide what to do with any drives that tested good and gave up all their data. (Any drive with data loss, I would toss at that point.) You could subject the good drives to an extended write-read process (e.g., three cycles of Unraid Preclear, plugin or Docker) to see if they pass that.
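A preclear-style pass boils down to writing a known pattern across the whole device and reading it back to verify it. A toy sketch of one such cycle, demonstrated on an ordinary file; pointed at a real device node this would be destructive, and it is not the actual Preclear plugin logic:

```python
# Toy sketch of one preclear-style write+verify pass. Safe on a scratch file;
# DESTRUCTIVE if pointed at a real device node. Illustrative only.
CHUNK = b"\x00" * 4096  # preclear leaves the drive zeroed

def write_verify_pass(path: str, size: int) -> bool:
    """Write zeros across `size` bytes of `path`, then read them back."""
    with open(path, "r+b") as f:          # r+ keeps the file, starts at 0
        off = 0
        while off < size:
            n = min(len(CHUNK), size - off)
            f.write(CHUNK[:n])
            off += n
    with open(path, "rb") as f:
        off = 0
        while off < size:
            n = min(len(CHUNK), size - off)
            if f.read(n) != CHUNK[:n]:    # mismatch means a bad read-back
                return False
            off += n
    return True
```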

 

PS: Pitch the power supply. You don't want to take the chance that a defective PSU damages something that survived the initial problem.

3 hours ago, spall said:

What is the best way to verify each drive is still functional and, if I got lucky, doing this in a way that doesn't invalidate the array?

Physically, drives are almost sealed against water ingress; if they are helium drives, they are sealed. The problem is, if the drives were hot and the deluge of water was cold, they could have sucked water in through their pressure-equalization ports, if so equipped.

I think I'd put them in a tightly temperature-regulated environment at around 50C for several hours if possible, as well as doing the alcohol wipe-down.

 

As long as each drive is only mounted read-only, the array will still be valid.

 

If you have a spare PC, I think my approach would be to first try to boot the spare PC with ONLY the Unraid USB, NO OTHER DRIVES ATTACHED, and see if it boots, preferably in GUI mode. If it does, change the array to not autostart (not like it would anyway with no drives) and power off from the GUI. Then attach the first suspect drive, boot Unraid, and see if it shows the drive in the appropriate slot. If it does, you could run a SMART test from there. Try each drive in turn, one at a time. Take inventory and see where you are at. Depending on the results we can formulate a game plan from there.
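The per-drive SMART check in the sequence above can also be scripted. A sketch using smartmontools' `smartctl -H` (assumed to be installed; the device path is a placeholder, and the parsing targets smartctl's usual ATA overall-health line):

```python
# Sketch: ask a drive for its overall SMART health verdict and parse it.
# Assumes smartmontools is installed; the device path is a placeholder.
import re
import subprocess

def smart_health_passed(smartctl_output: str) -> bool:
    """True if `smartctl -H` output contains an overall-health PASSED verdict."""
    return re.search(r"test result:\s*PASSED", smartctl_output) is not None

def check_drive(device: str) -> bool:
    """Run `smartctl -H` against a device and parse the verdict."""
    out = subprocess.run(["smartctl", "-H", device],
                         capture_output=True, text=True)
    return smart_health_passed(out.stdout)
```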

 

(Actually the first thing I would do is attempt to copy the config folder from the Unraid boot drive to a safe place)
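Copying that config folder off takes only a few lines; a sketch assuming the stick is mounted at the usual Unraid location `/boot` (substitute the actual mount point if you read the stick on another machine, and pick your own destination path):

```python
# Sketch: back up the boot stick's config folder before touching anything.
# /boot is where a running Unraid mounts its USB stick; both paths here are
# placeholders for wherever the stick and your safe destination actually live.
import shutil
from pathlib import Path

def backup_config(usb_mount: str, dest: str) -> Path:
    """Copy <usb_mount>/config to <dest>/config-backup and return that path."""
    src = Path(usb_mount) / "config"
    target = Path(dest) / "config-backup"
    shutil.copytree(src, target, dirs_exist_ok=True)
    return target

# Usage on a real system (commented out so the sketch is side-effect free):
#   backup_config("/boot", "/mnt/some/safe/place")
```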

 

I think using the Unraid boot USB is one of the safest ways of evaluating each individual drive.


 

@Frank1940 @JonathanM

Thanks, guys... this gives me some stuff to think about.

 

I'm not sure how I'd regulate and control an environment at 50C. Dipping them in a sous vide bath seems counter-productive. Heh. 

 

This all happened last Wednesday, May 11. I had the drives sitting out in my breezeway in around 80F temps for 48 hours with a fan blowing across everything. Unfortunately some high humidity rolled in, so they've been back in my basement since then. The ambient down there is probably only 68F, but still dry and with air moving across them. To be completely accurate, none of the drives were submerged. It's just those 5 or 6 that were wet, with visible water droplets; one of them was fairly wet, and I believe that was Parity 2. The SATA and power connectors were facing down and didn't get hit with water, so it was a cascade across the top and sides. The drive sleds were fairly tight on the circuit-board side; I think the path of least resistance was down the label side, which got the wettest.

 

I do have enough spare equipment to piece together a system to test with. I had actually just upgraded the internals of this server a few months ago, so I still have my old server board lying around. I'll pop it in a chassis, do the voodoo with the unRAID boot drive, and see where I'm at. I have two 14TB and three 4TB spare drives lying around, so copying off some data or replacing whole drives outright is fairly achievable up to that limit without much more expenditure. I have a replacement chassis on the way. I still have some hope that my motherboard might be salvageable... but.

 

Ok. So, I'll back up the USB drive, then boot Unraid with autostart turned off, then start testing each drive one at a time. If any drive seems iffy, I'll try to copy off any data I can, and then report back with where I am.

 

Anyway, I really appreciate the suggestions. I will start working on checking the drives first, and worry about the motherboard and HBAs later.

 

@Frank1940 That PSU went in the trash first thing, but thanks for the suggestion. It was soaked.

 

2 weeks later...

Took me a while; things have been crazy over here...

 

1) Assembled a test system from spare parts.

2) Backed up my USB unRAID drive.

3) Disabled autostart on the array.

4) Inserted one drive at a time and ran a short SMART test. With the exception of one drive, they all passed without errors. The odd drive out is an SSD that I use in a cache pool; it seems to start the SMART test, then stops after 10% and reports completed without error.
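For the odd SSD, the drive's own self-test log may show what it actually recorded. A sketch that parses `smartctl -l selftest` output into (test, status, percent-remaining) entries, with the line format assumed from smartctl's usual ATA log layout:

```python
# Sketch: parse `smartctl -l selftest` log lines, which look like
#   # 1  Short offline       Completed without error       00%     12345   -
# Column layout assumed from smartctl's typical ATA output.
import re

LOG_LINE = re.compile(r"^#\s*\d+\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%")

def parse_selftest_lines(log_text: str) -> list[tuple[str, str, int]]:
    """Return (test_type, status, percent_remaining) for each log entry."""
    entries = []
    for line in log_text.splitlines():
        m = LOG_LINE.match(line)
        if m:
            entries.append((m.group(1), m.group(2), int(m.group(3))))
    return entries
```

An entry that reads "Interrupted" or "Aborted" with a non-zero remaining percentage would explain a test that quietly stops partway.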

 

For a moment I thought I had two drives that were failing to power up, but it turns out those were white-label WDs shucked from EasyStores, and it was the 3.3V pin issue. A Molex-to-SATA adapter spun them up.

 

Did I get extremely lucky here? At least from a data perspective? Should I be running extended SMART tests on these?


The question you pose should be changed to "What is your aversion to risk?" A quick example of what I mean: "Would you go sky diving?" The answer might be "YES!" Or is your response to that suggestion "Why would I jump out of a perfectly good airplane?" The extreme reaction would be "Why would I even get in an airplane?"

 

I tend toward the second response so I would definitely run the non-correcting parity check.  If it passes, I would make sure that any irreplaceable data on those disks is copied off into an archive.  (Running the parity check is not an extreme risk.  If there are bad disks, they are going to fail in any type of use.  If the parity check passes, you have a bit of breathing time to make a decision.  Plus, you know that you have single disk failure protection.)
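For context on the single-disk protection mentioned above: Unraid's first parity drive is a plain XOR across the data drives, so any one lost drive can be rebuilt from the survivors plus parity. A toy illustration with short byte strings standing in for drives:

```python
# Toy illustration of XOR single parity (the scheme Unraid's first parity
# drive uses). Byte strings stand in for whole drives.
from functools import reduce

def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks; this is how parity is computed,
    and also how a missing block is reconstructed."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Three "data drives" and their parity:
data = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]
parity = xor_blocks(data)

# Lose any one drive; XOR of the survivors plus parity rebuilds it.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```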

 

Since you have a couple of extra 14TB drives, you might want to consider pulling those 'wet' drives, shrinking the array, adding the two 14TB drives to it, and then copying the data off the wet drives to the array using the Unassigned Devices plugin.

Quote

I would definitely run the non-correcting parity check

 

I will do this when my new chassis arrives. My test system cannot hold the entire array.

 

Quote

Since you have a couple of extra 14TB drives, you might want to consider pulling those 'wet' drives

 

Well, actually, the two wettest drives were my two 14TB parity drives. Does that change this suggestion? Does it make sense to pull them and, instead of running a non-correcting parity check, replace them and build new parity?  <-  I realize this doesn't help me. A classic case of speaking before thinking.   :)

 

Thanks!

Edited by spall