Emulated disk but unable to complete rebuild


Solved by JorgeB


I've managed to end up with my server in a weird state and I'm not sure how to resolve it. I had a failing drive (12), so I replaced it and began the rebuild process, but I can't get it to complete before my server crashes due to an issue with another drive (5).
 

When the rebuild process gets close to finishing (90%+), the server completely locks up and requires a hard restart. Checking the logs once it's back up shows this:
 

Nov 26 01:37:04 AlphaV2 kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0202): Drive ECC error:port=0.
Nov 26 01:37:04 AlphaV2 kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=0.


This drive (5) is connected to a 3ware RAID controller, which is the one throwing the error. I have double-checked all connections and even moved the drive to a different 3ware port, and the error came with it. So I'm pretty sure the issue is with the drive, even though the SMART data looks pretty good.
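To check whether the errors really follow the drive from port to port, it can help to tally the controller messages per port across the whole syslog. The following is only a minimal sketch: the regular expression is written against the two log lines quoted above and will not match other message shapes from the 3w-9xxx driver, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches the two 3w-9xxx message shapes quoted in this post:
#   "ERROR: (0x03:0x0202): Drive ECC error:port=0."
#   "AEN: ERROR (0x04:0x0009): Drive timeout detected:port=0."
# Any other message format from the driver is ignored (assumption).
LINE_RE = re.compile(
    r"3w-9xxx: (?P<host>scsi\d+): (?:AEN: )?ERROR:? ?"
    r"\((?P<code>0x[0-9A-Fa-f]+:0x[0-9A-Fa-f]+)\): "
    r"(?P<message>.+?):port=(?P<port>\d+)\."
)


def summarize_controller_errors(log_lines):
    """Count how often each (port, message) pair appears in the log lines."""
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if match:
            counts[(int(match["port"]), match["message"])] += 1
    return counts


sample = [
    "Nov 26 01:37:04 AlphaV2 kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0202): Drive ECC error:port=0.",
    "Nov 26 01:37:04 AlphaV2 kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=0.",
]

for (port, message), count in sorted(summarize_controller_errors(sample).items()):
    print(f"port {port}: {message} (seen {count} time(s))")
```

If the drive is moved and the same messages then show up under the new port number, that supports the conclusion that the drive, not the port or cable, is at fault.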

So how do I resolve this? I considered moving all the data off drive 5 so I could remove it from the array, but I'm not sure that's possible without losing data on drive 12, as it's currently emulated.

 

All important data is backed up, so data loss isn't the end of the world, but I would like to avoid/minimise it if possible.

Thanks in advance!
 

alphav2-diagnostics-20231126-0839.zip

Link to comment

Disk5 shows a pending sector, but that should not make the server crash unless that controller has some peculiar behavior. I assume that, since it's a RAID controller, you cannot connect the disk to a different controller? A possible option would be to clone disk 5 with ddrescue, then use the clone to rebuild the other disk.

Link to comment
  • Solution

See here for ddrescue instructions. Also note that since disk5 has a pending sector, any read errors on it will be skipped. This in turn means that the rebuilt disk can also have some corruption if the read errors coincide with data, but if it's a single pending sector it should be minimal, or none with some luck.

 

-Tools -> New Config -> Retain current configuration: All -> Apply
-Check all assignments and assign any missing disk(s) if needed, including the new cloned disk5
-IMPORTANT: Check both "parity is already valid" and "maintenance mode", then start the array (note that the GUI will still show that data on the parity disk(s) will be overwritten; this is normal, as the warning doesn't account for the checkbox, and nothing will be overwritten as long as it's checked)
-Stop array
-Unassign the disk you want to rebuild
-Start array (in normal mode now). Ideally the emulated disk will now mount and its contents will look correct; if it doesn't, post new diagnostics
-If the emulated disk mounts and the contents look correct, stop the array
-Re-assign the disk to rebuild and start the array to begin the rebuild.

 

 

Link to comment

Unfortunately, I've hit what appears to be the same issue I had when initially trying to rebuild. I was cloning the bad disk with ddrescue and the entire server locked up, presumably once it got to the bad sector on the disk.
 


root@AlphaV2:/tmp# ddrescue -f /dev/sdq /dev/sdc /boot/ddrescue.log
GNU ddrescue 1.23
Press Ctrl-C to interrupt
Initial status (read from mapfile)
rescued: 1150 GB, tried: 0 B, bad-sector: 0 B, bad areas: 0

Current status
     ipos:    3284 GB, non-trimmed:        0 B,  current rate:    115 MB/s
     opos:    3284 GB, non-scraped:        0 B,  average rate:    119 MB/s
non-tried:  716453 MB,  bad-sector:        0 B,    error rate:       0 B/s
  rescued:    3284 GB,   bad areas:        0,        run time:  4h 58m 19s
pct rescued:   82.09%, read errors:        0,  remaining time:      1h 46m
                              time since last successful read:          0s
Copying non-tried blocks... Pass 1 (forwards)
Network error: Software caused connection abort
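As a rough cross-check of the last status shown before the session dropped, the figures are self-consistent: the rescued and non-tried amounts add up to about 4 TB, and their ratio reproduces the reported percentage. A small sketch of that arithmetic, assuming ddrescue's default decimal units (1 GB = 1000 MB):

```python
# Values copied from the last status screen above.
rescued_gb = 3284        # "rescued: 3284 GB"
non_tried_mb = 716453    # "non-tried: 716453 MB"

# Assumption: decimal (SI) units, so 1 GB = 1000 MB.
total_gb = rescued_gb + non_tried_mb / 1000
pct_rescued = 100 * rescued_gb / total_gb

print(f"total: {total_gb:.0f} GB, rescued: {pct_rescued:.2f}%")
```

This suggests the last successful read before the lock-up was roughly 3.28 TB into the source disk, although the exact position of the problem area cannot be read off the screen output alone.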

 

Link to comment

Rebuild completed without issue. 🤩

Now I'm wondering how I can figure out exactly which files potentially have corruption. I can see on the ddrescue post there is a way to fill the bad sectors with a string. Is that still possible now the clone drive is part of the array and in use?

 

Also, I assume that any files in the same position on the rebuilt drive 12 could be corrupted too? Can the same approach be used on that drive?

Link to comment
15 hours ago, Phoenix26 said:

Is that still possible now the clone drive is part of the array and in use?

Yes, if you still have the log, note that it will invalidate parity.
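The "log" here is the ddrescue mapfile (`/boot/ddrescue.log` in the command earlier in the thread). Before filling anything, it can be useful to see which areas ddrescue actually marked as bad. The sketch below lists the blocks with status `-` (bad sector) from a mapfile. The sample mapfile content is invented for illustration, and the 512-byte sector size is an assumption that may not match the drive.

```python
def bad_areas(mapfile_text, sector_size=512):
    """Return (start_byte, size_bytes, first_sector, sector_count) per bad block.

    Only blocks whose status is '-' (bad sector) are reported.
    """
    data_lines = [
        line.split()
        for line in mapfile_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    areas = []
    # The first non-comment line is the current position/status line, not a block.
    for fields in data_lines[1:]:
        pos, size, status = int(fields[0], 0), int(fields[1], 0), fields[2]
        if status == "-":
            first_sector = pos // sector_size
            last_sector = (pos + size - 1) // sector_size
            areas.append((pos, size, first_sector, last_sector - first_sector + 1))
    return areas


# Hypothetical mapfile content, NOT taken from this server.
SAMPLE = """\
# Mapfile. Created by GNU ddrescue version 1.23
# current_pos  current_status  current_pass
0x00100000     +               1
#      pos        size  status
0x00000000  0x000F0000  +
0x000F0000  0x00001000  -
0x000F1000  0x0000F000  +
"""

for pos, size, first, count in bad_areas(SAMPLE):
    print(f"bad area at byte {pos} ({size} bytes): {count} sector(s) from sector {first}")
```

On the real system the text would come from reading the mapfile, for example `bad_areas(open("/boot/ddrescue.log").read())`. An empty result would mean ddrescue recorded no bad sectors at all.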

 

15 hours ago, Phoenix26 said:

Also, I assume that any files in the same position on the rebuilt drive 12 could be corrupted too?

Possibly, if there was data on those sectors, but it will be more difficult to see which files are affected unless you have preexisting checksums or are using btrfs/xfs.
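For the "preexisting checksums" route, the idea is to compare each file's current hash against a manifest created while the data was known to be good. The following is a minimal sketch of that comparison; the function names and the choice of SHA-256 are illustrative assumptions, not a description of any particular Unraid plugin. Since the important data is backed up, a manifest built from the backup copy could possibly serve as the reference.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large media files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(root, manifest):
    """Return relative paths that are missing or whose hash differs from the manifest."""
    root = Path(root)
    changed = []
    for relative_path, expected in manifest.items():
        path = root / relative_path
        if not path.is_file() or file_sha256(path) != expected:
            changed.append(relative_path)
    return sorted(changed)
```

A typical use would be `manifest = build_manifest(known_good_copy)` followed by `changed_files("/mnt/disk12", manifest)`, where both paths are placeholders. Files that were legitimately modified after the manifest was made will also be reported, so the result is a list of candidates rather than proof of corruption.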

 

Link to comment
