Emulated disk but unable to complete rebuild

Phoenix26 · November 26, 2023

I've managed to end up with my server in a weird state and I'm not sure how to resolve it. I had a failing drive (12), so I replaced it and began the rebuild process, but I can't get it to complete before my server crashes due to an issue with another drive (5).

When the rebuild process gets close to finishing (90%+), the server completely locks up and requires a hard restart. Checking the logs once it's back up shows this:

Nov 26 01:37:04 AlphaV2 kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0202): Drive ECC error:port=0.
Nov 26 01:37:04 AlphaV2 kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=0.
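For anyone wanting to pull these controller errors out of a longer syslog, here is a minimal Python sketch. The regular expression is inferred only from the two lines quoted above, not from the 3w-9xxx driver documentation, so other message layouts may not match. The function name `parse_3ware_errors` is made up for this example.

```python
import re

# Pattern inferred from the two log lines quoted in this post only.
# It is an assumption that other 3w-9xxx messages follow the same layout.
LINE_RE = re.compile(
    r"3w-9xxx: (?P<host>scsi\d+): (?:AEN: )?ERROR:? "
    r"\((?P<code>0x[0-9a-fA-F]+:0x[0-9a-fA-F]+)\): "
    r"(?P<message>.+?):port=(?P<port>\d+)"
)


def parse_3ware_errors(lines):
    """Return one dict per matching line: host, code, message, port."""
    found = []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a 3w-9xxx error line in the expected layout
        entry = match.groupdict()
        entry["port"] = int(entry["port"])
        found.append(entry)
    return found


if __name__ == "__main__":
    sample = [
        "Nov 26 01:37:04 AlphaV2 kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0202): Drive ECC error:port=0.",
        "Nov 26 01:37:04 AlphaV2 kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=0.",
    ]
    for entry in parse_3ware_errors(sample):
        print(entry["host"], entry["port"], entry["code"], entry["message"])
```

Grouping the results by port makes it easier to see whether the errors follow one drive or one controller port.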

This drive (5) is connected to a 3ware RAID controller, which is the one throwing the error. I have double-checked all connections and even moved the drive to a different 3ware port, and the error came with it. So I'm pretty sure the issue is with the drive, even though the SMART data looks pretty good.

So how do I resolve this? I considered moving all the data off drive 5 so I could remove it from the array, but I'm not sure that's possible without losing data on drive 12, as it's currently emulated.

All important data is backed up, so data loss isn't the end of the world, but I would like to avoid/minimise it if possible.

Thanks in advance!

alphav2-diagnostics-20231126-0839.zip

JorgeB · November 27, 2023

Disk5 shows a pending sector, but that should not make the server crash, unless that controller has some peculiar behavior. I assume that since it's a RAID controller you cannot connect the disk to a different controller? A possible option would be to clone disk 5 with ddrescue, then use the clone to rebuild the other disk.

Phoenix26 · November 27, 2023

That's right, this disk has to stay on its controller.

I do have a spare disk, so ddrescue could work. Once I've cloned disk 5 to the spare, how would I switch it out without impacting emulated disk 12?

Edited November 27, 2023 by Phoenix26

JorgeB · November 27, 2023

You'd need to do a new config and manually disable the other disk. Note that the cloned disk must be the same size as the current one. I can post complete instructions if interested.

Phoenix26 · November 27, 2023

Complete instructions would be great, thanks!

JorgeB · November 27, 2023

See here for ddrescue instructions. Also note that since disk5 has a pending sector, any read errors on it will be skipped. This in turn means that the rebuilt disk can also have some corruption if the read errors coincide with data, but if it's a single pending sector it should be minimal, or none with some luck.

-Tools -> New Config -> Retain current configuration: All -> Apply
-Check all assignments and assign any missing disk(s) if needed, including the new cloned disk5
-IMPORTANT - Check both "parity is already valid" and "maintenance mode" and start the array (note that the GUI will still show that data on parity disk(s) will be overwritten, this is normal as it doesn't account for the checkbox, but it won't be as long as it's checked)
-Stop array
-Unassign the disk you want to rebuild
-Start array (in normal mode now), ideally the emulated disk will now mount and contents look correct, if it doesn't post new diags
-If the emulated disk mounts and contents look correct stop the array
-Re-assign the disk to rebuild and start array to begin.
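
To illustrate the warning above about skipped read errors carrying over to the rebuilt disk, here is a toy Python sketch. It assumes plain XOR single parity across the data disks and uses made-up four-byte "disks". It is not how Unraid is implemented internally, only the arithmetic behind the idea.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def mismatched_offsets(expected, actual):
    """Offsets at which two equal-length byte strings differ."""
    return [i for i, (a, b) in enumerate(zip(expected, actual)) if a != b]


# Made-up contents for three data disks (illustrative values only).
disk5 = bytes([0x11, 0x22, 0x33, 0x44])
disk12 = bytes([0xAA, 0xBB, 0xCC, 0xDD])
other = bytes([0x01, 0x02, 0x03, 0x04])
parity = xor_blocks([disk5, disk12, other])

# With a perfect copy of disk5, the missing disk12 is recovered exactly.
assert xor_blocks([parity, disk5, other]) == disk12

# Suppose offset 2 of disk5 could not be read and the clone holds 0x00 there.
clone5 = bytes([0x11, 0x22, 0x00, 0x44])
rebuilt12 = xor_blocks([parity, clone5, other])

# Only the same offset on the rebuilt disk comes out wrong.
print(mismatched_offsets(disk12, rebuilt12))
```

The damage stays confined to the offsets that were unreadable on the source disk, which is why a single pending sector should affect very little.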

Phoenix26 · November 28, 2023

Unfortunately I've hit what appears to be the same issue I had when trying to rebuild initially. I was cloning the bad disk with ddrescue and the entire server locked up, presumably once it got to the bad sector on the disk.


root@AlphaV2:/tmp# ddrescue -f /dev/sdq /dev/sdc /boot/ddrescue.log
GNU ddrescue 1.23
Press Ctrl-C to interrupt
Initial status (read from mapfile)
rescued: 1150 GB, tried: 0 B, bad-sector: 0 B, bad areas: 0

Current status
     ipos:    3284 GB, non-trimmed:        0 B,  current rate:    115 MB/s
     opos:    3284 GB, non-scraped:        0 B,  average rate:    119 MB/s
non-tried:  716453 MB,  bad-sector:        0 B,    error rate:       0 B/s
  rescued:    3284 GB,   bad areas:        0,        run time:  4h 58m 19s
pct rescued:   82.09%, read errors:        0,  remaining time:      1h 46m
                              time since last successful read:          0s
Copying non-tried blocks... Pass 1 (forwards)
Network error: Software caused connection abort

JorgeB · November 28, 2023

Connect the disk to a different controller. For cloning, it doesn't matter where it is connected.

Phoenix26 · November 28, 2023

Thanks, trying again with it connected via the IT mode card.

Thinking ahead, once I've got a successful clone, should I connect it back via the 3ware card?

JorgeB · November 28, 2023

Yes, the cloned disk should go there again.

Phoenix26 · November 30, 2023

Thanks for your help. I managed to clone disk 5 and create a new config using the clone. I'm currently rebuilding disk 12, so fingers crossed it finishes.

Phoenix26 · December 1, 2023

Rebuild completed without issue. 🤩

Now I'm wondering how I can figure out exactly which files potentially have corruption. I can see in the ddrescue post that there is a way to fill the bad sectors with a string. Is that still possible now that the clone drive is part of the array and in use?

Also, I assume that any files in the same position on the rebuilt drive 12 could be corrupted too? Can the same approach be used on that drive?

JorgeB · December 2, 2023

15 hours ago, Phoenix26 said:

Is that still possible now the clone drive is part of the array and in use?

Yes, if you still have the log. Note that it will invalidate parity.
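
As a rough aid for checking what the saved log (the ddrescue mapfile) actually recorded, here is a minimal Python sketch that lists the regions marked as bad. It assumes the usual mapfile layout: `#` comment lines, one status line, then `pos size status` rows, with `-` meaning bad sector. The sample mapfile in the code is invented for illustration and is not taken from this thread.

```python
def bad_regions(mapfile_text):
    """Return (start_byte, length_bytes) for each block marked '-' (bad sector)."""
    regions = []
    status_line_seen = False
    for raw in mapfile_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if not status_line_seen:
            # The first non-comment line holds the current position and status.
            status_line_seen = True
            continue
        pos, size, status = line.split()[:3]
        if status == "-":
            # int(..., 0) accepts the 0x-prefixed values ddrescue writes.
            regions.append((int(pos, 0), int(size, 0)))
    return regions


# Invented sample; a real mapfile would be read from disk instead.
SAMPLE = """\
# Mapfile. Created by GNU ddrescue version 1.23
# current_pos  current_status  current_pass
0x02000000     +               1
#      pos        size  status
0x00000000  0x01000000  +
0x01000000  0x00000200  -
0x01000200  0x00FFFE00  +
"""

if __name__ == "__main__":
    for start, length in bad_regions(SAMPLE):
        print(f"bad region at byte {start}, {length} bytes")
```

An empty result would mean the clone finished without any sector being marked bad, in which case there is nothing to fill.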

15 hours ago, Phoenix26 said:

Also, I assume that any files in the same position on the rebuilt drive 12 could be corrupted too?

Possibly, if there was data on those sectors, but it will be more difficult to see which files are affected, unless you have preexisting checksums or are using btrfs/xfs.
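
For reference, the "preexisting checksums" idea can be sketched in Python as below: build a manifest of SHA-256 hashes while the files are known good, then compare against a fresh manifest later. The function names and the manifest layout are made up for this example. It is a minimal sketch, not a replacement for a proper file integrity tool.

```python
import hashlib
import os


def hash_file(path, chunk_size=1024 * 1024):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    manifest = {}
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            manifest[os.path.relpath(full, root)] = hash_file(full)
    return manifest


def changed_files(old_manifest, new_manifest):
    """Paths present in both manifests whose hashes no longer match."""
    return sorted(
        path
        for path, digest in old_manifest.items()
        if path in new_manifest and new_manifest[path] != digest
    )
```

A file that changes hash without having been edited in the meantime is a candidate for restoring from backup.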
