Drive crash during ReiserFS to XFS move - help on recovering

t33j4y · July 15, 2015

After a very long time without any "real" problems with my unRAID server, it seems karma came back and sank its teeth into my behind.

I am running 6.0-rc5 and am in the process of converting from ReiserFS to XFS, drive by drive. While moving files back to one of the drives, the target drive seems to have failed. I have all the data on another disk, but am a bit unsure about how to proceed.

There are 4 data drives in all (4+4+4+3 TB - see attached screenshot). disk2 seems to be having trouble (I've included a screenshot of the SMART test which - to me - doesn't indicate any errors). Disk5 is the drive that I move data off to while I reformat the array drive, and then copy the data back from (yup, a bit tedious, but I find I have more control that way than with some of the rsync approaches that have been posted).

I have successfully migrated disk1 to XFS. The faulty disk is disk2, which comes up with a red x (emulated) no matter what I do. I have tried formatting it to XFS again and moving it to another controller, but to no avail. It would seem that the drive is not recognized by unRAID anymore.

Now, I have a spare 4TB on hand, that I can pop in. The problem is that parity is invalid (as it has not been recalculated after migrating disk1 to XFS). As I said, I have the data safe and sound on disk5. But if I replace disk2 with a new one, unRAID will want to rebuild when I start the array (I dry-ran this and saw that was what unRAID wanted). And it would do so based on parity that is invalid (and one of the sources, disk1, has changed).

I have pondered disabling the parity drive (I am running in unprotected mode anyway) to prevent unRAID from trying to rebuild, and then

a. Replace disk2

b. Copy data back to disk2 from disk5

c. Verify data is copied correctly

d. Enable parity

e. Do "New config" and get back parity, to create a safe baseline to proceed from
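For step c, one way to check the copy is to compare checksums of every file on the backup against the restored disk. This is only a minimal sketch: the helper names `sha256_of` and `find_mismatches` are made up for illustration, and it assumes both locations are reachable as ordinary directories.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large media files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(source_root, target_root):
    """Return relative paths that are missing or differ under target_root."""
    source_root, target_root = Path(source_root), Path(target_root)
    mismatches = []
    for source_file in sorted(source_root.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_root)
        target_file = target_root / relative
        # A file counts as a mismatch if it is absent or its hash differs.
        if not target_file.is_file() or sha256_of(source_file) != sha256_of(target_file):
            mismatches.append(str(relative))
    return mismatches
```

Calling something like `find_mismatches("/mnt/disk5/t", "/mnt/disk2")` would then list anything that did not make it across intact; those two paths are assumptions about where the backup folder and the restored disk are mounted.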

But since there is a lot of data hanging in the balance, I'd really appreciate hearing from others before I go ahead.

Cheers,

Thomas

syslog-2015-07-15.zip

trurl · July 15, 2015

Why do you think parity is invalid?

I don't see anything in your description of events that would make me think parity is invalid, and the screenshot doesn't seem to say it is either. unRAID maintains parity real-time, and that includes formatting of disks. Formatting is just like any other write operation as far as parity is concerned.
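To illustrate why a format does not invalidate parity, here is a toy sketch. It assumes single parity computed as a bytewise XOR across the data disks and models each disk as a few bytes; the function names and the sample contents are invented for the example and say nothing about how unRAID is actually implemented.

```python
from functools import reduce


def xor_parity(disks):
    """Bytewise XOR across equally sized disk images."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))


def write_block(disks, parity, disk_index, offset, new_data):
    """Write to one disk and fold the change into parity at the same time."""
    old = disks[disk_index][offset:offset + len(new_data)]
    for i, (old_byte, new_byte) in enumerate(zip(old, new_data)):
        parity[offset + i] ^= old_byte ^ new_byte
    disks[disk_index][offset:offset + len(new_data)] = new_data


# Three toy "disks" of 8 bytes each (contents are placeholders).
disks = [bytearray(b"reiserfs"), bytearray(b"DATA-ONE"), bytearray(b"DATA-TWO")]
parity = bytearray(xor_parity(disks))

# "Formatting" disk 0 is just another write that goes through parity.
write_block(disks, parity, 0, 0, b"XFS-new!")
parity_still_valid = bytes(parity) == xor_parity(disks)

# If disk 0 then drops out, parity plus the other disks reproduce it.
rebuilt_disk0 = xor_parity([parity, disks[1], disks[2]])
```

In this model the rebuilt disk holds whatever was last written through parity, including the new filesystem.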

t33j4y · July 15, 2015

Hi trurl,

thanks for replying.

What you are saying sounds very reasonable - I am just being very cautious and initially thought that parity would have become invalid because of the move from one filesystem to another.

So if I let the rebuild run, that would mean that it would return the new disk2 drive to an XFS formatted disk with the share of the copied files and directories that actually made it through, before the disk gave in, right?

trurl · July 15, 2015

Well, I'm not sure what to think now. The screenshot indicates disk2 is reiserfs so it seems like a rebuild would result in a reiserfs disk2. Your syslog is a little confusing though.

It shows it trying and failing to mount disk2 as reiserfs, then later formatting disk2 as reiserfs and successfully mounting it.

Then, with no indication that it was formatted as xfs, it later tries and fails to mount disk2 as xfs.

Then later it formats disk2 as xfs and successfully mounts it.

Nothing in the syslog about a write error which would have resulted in the redball, and no errors in the screenshot. I suspect the syslog does not actually include the redball event because there was a reboot after it occurred.

Are you saying you successfully reformatted disk2 to xfs and then copied files to it before it redballed? Are you sure it didn't redball before the format? Maybe you were just copying files to the simulated disk2 after it redballed. Can you read from the simulated disk2?

t33j4y · July 15, 2015

Well, I'm not sure what to think now. The screenshot indicates disk2 is reiserfs so it seems like a rebuild would result in a reiserfs disk2. Your syslog is a little confusing though.

At the time the drive redballed, it was XFS. In a futile attempt to get the drive back, because I thought the filesystem was squashed, I reformatted to ReiserFS. As I have the files on disk5, I was not too worried about losing the contents on disk2.

It shows it trying and failing to mount disk2 as reiserfs, then later formatting disk2 as reiserfs and successfully mounting it.

Then, with no indication that it was formatted as xfs, it later tries and fails to mount disk2 as xfs.

Then later it formats disk2 as xfs and successfully mounts it.

Nothing in the syslog about a write error which would have resulted in the redball, and no errors in the screenshot. I suspect the syslog does not actually include the redball event because there was a reboot after it occurred.

The syslog shows my user generated reformat actions.

And it is correct that the syslog likely doesn't hold the actual redballing.

Are you saying you successfully reformatted disk2 to xfs and then copied files to it before it redballed? Are you sure it didn't redball before the format? Maybe you were just copying files to the simulated disk2 after it redballed. Can you read from the simulated disk2?

Absolutely correct that I reformatted the drive to XFS and then started copying files to it. I set that up last night, and when I checked it this morning, MC had stopped copying as the drive was not available anymore. I would have noticed if the drive was redballed before I started copying, so I am quite confident it wasn't. I can open the emulated disk2 from a file explorer, but it shows as empty.

My goal would be to get disk2 replaced with another disk - whether it is blank, ReiserFS or XFS doesn't matter as I would copy files back from disk5 anyway. My worry is that unRAID would start meddling with other disks while recovering disk2 to the array.

trurl · July 15, 2015

If you don't care about rebuilding you could just do a new config and let parity rebuild. It would be a good idea to try to get to the bottom of why disk2 failed so it doesn't happen again. Maybe a cabling issue?

t33j4y · July 15, 2015

So, the steps would be:

1. Stop array

2. Backup flash

3. Remove faulty drive from disk2 position and replace with working one (is spinning, just not assigned)

4. Remove disk5, which holds the backup in a t subfolder, from the array and shrink array size by 1

5. Do "New Config" and start array

6. Wait for rebuild

7. Stop array when done

8. Expand array to hold disk5 again and restart array

9. Copy back data from disk5/t to disk2

Or are steps 4, 7 and 8 unnecessary since the data is in the t subfolder?

Regarding figuring out what happened, I will monitor - the drives are on a SAS backplane with a SATA to SAS breakout cable (4 SATA ports feed into a backplane with 4 disks). I wonder if this could be the on-board controller of the motherboard acting up. Could of course also be a bad SATA connector in the backplane.

Cheers,

Thomas

t33j4y · July 15, 2015

Update:

After removing all drives from the backplane where the faulty drive was and wiring them directly to the SATA ports to eliminate any backplane issues, I took the plunge and replaced the (apparently) faulty disk2. The new drive was accepted (blue ball) and the array is now in the process of Data-Rebuild. It'll be interesting to see what is actually restored to the disk.

The old disk2 will be put through its paces with preclear as that is the most intensive disk test I have on hand, once the array is restored. Will put off more migration to XFS until I have absolute trust that my hardware is not acting up.

Maybe this is a hint that I should get going doing that custom build enclosure I've been meaning to for a long time and ditch the Norco :-)

trurl · July 16, 2015

I know you have already moved on from this and didn't use this approach, but there is something in your steps that I thought needed clearing up in case you or someone else takes it as good advice.

8. Expand array to hold disk5 again and restart array

There are only 2 ways this can go. One is you do another new config and rebuild parity. The other is you don't do a new config and unRAID clears disk5 so it can be added as a new disk that conforms to existing parity. Obviously the second option is not what you want, and the first option is pointless since you could have just included disk5 in the new config before, skipping step 4 which removed disk5.
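A toy sketch of why a disk has to be cleared before it can join an array with existing parity. As before, this assumes single XOR parity and uses made-up 8-byte "disks"; it is an illustration, not unRAID's code.

```python
from functools import reduce


def xor_parity(disks):
    """Bytewise XOR across equally sized disk images."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))


existing = [b"disk-one", b"disk-two"]
parity = xor_parity(existing)

# An all-zero disk contributes nothing to XOR, so existing parity still fits.
cleared = bytes(8)
parity_with_cleared = xor_parity(existing + [cleared])

# A disk that still holds data changes the result, so parity would have to
# be recalculated (which is what a new config followed by a parity sync does).
with_data = b"backup-t"
parity_with_data = xor_parity(existing + [with_data])
```

Here `parity_with_cleared` equals the original parity while `parity_with_data` does not, which is the choice described above: either the disk is zeroed, or parity is rebuilt around it.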
