Phoenix26 Posted November 26, 2023
I've managed to end up with my server in a weird state and I'm not sure how to resolve it. I had a failing drive (12), so I replaced it and began the rebuild process, but I can't get it to complete before my server crashes due to an issue with another drive (5). When the rebuild process gets close to finishing (90%+), the server completely locks up and requires a hard restart. Checking the logs once it's back up shows this:

Nov 26 01:37:04 AlphaV2 kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x0202): Drive ECC error:port=0.
Nov 26 01:37:04 AlphaV2 kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=0.

This drive (5) is connected to a 3ware RAID controller, which is the one throwing the error. I have double-checked all connections and even moved the drive to a different 3ware port, and the error came with it. So I'm pretty sure the issue is with the drive, even though the SMART data looks pretty good.

So how do I resolve this? I considered moving all the data off drive 5 so I could remove it from the array, but I'm not sure that's possible without losing data on drive 12, as it's currently emulated. All important data is backed up, so data loss isn't the end of the world, but I'd like to avoid/minimise it if possible.

Thanks in advance!

alphav2-diagnostics-20231126-0839.zip
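To see whether the controller errors always follow the same port, the log lines quoted above can be tallied per port. This is only a rough sketch: the function name `count_3ware_errors` is made up for illustration, and the syslog path in the usage comment is an assumption about where the log lives on this server.

```shell
#!/bin/bash
# Count 3w-9xxx ERROR lines per controller port in a log file.
# count_3ware_errors is a hypothetical helper name, not part of any tool.
count_3ware_errors() {
    local logfile="$1"
    # Keep only 3ware driver error lines, pull out the "port=N" token,
    # then count how often each port appears.
    grep -E '3w-9xxx: .*ERROR' "$logfile" \
        | grep -oE 'port=[0-9]+' \
        | sort \
        | uniq -c \
        | awk '{print $2, $1}'
}

# Usage (path is an assumption):
#   count_3ware_errors /var/log/syslog
```

Each output line is a port followed by the number of error lines that mention it. Note that the port number follows the cable, so after moving the drive to another port the errors would be expected to show up under the new port number.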
JorgeB Posted November 27, 2023
Disk5 shows a pending sector, but that should not make the server crash, unless that controller has some peculiar behavior. I assume that since it's a RAID controller you cannot connect the disk to a different controller? A possible option would be to clone disk 5 with ddrescue, then use the clone to rebuild the other disk.
Phoenix26 Posted November 27, 2023 Author (edited)
That's right, this disk has to stay on its controller. I do have a spare disk, so ddrescue could work. Once I've cloned disk 5 to the spare, how would I switch it out without impacting emulated disk 12?
JorgeB Posted November 27, 2023
You'd need to do a new config and manually disable the other disk. Note that the cloned disk must be the same size as the current one. I can post complete instructions if interested.
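The same-size requirement can be checked before cloning by comparing the byte counts of the two devices. A minimal sketch follows; `same_size` is a made-up helper, and `/dev/sdX` and `/dev/sdY` in the usage comment are placeholders for the source disk and the spare.

```shell
#!/bin/bash
# Compare two sizes given in bytes and say whether the clone target matches.
# same_size is a hypothetical helper name used only for this illustration.
same_size() {
    local src_bytes="$1" dst_bytes="$2"
    if [ "$dst_bytes" -eq "$src_bytes" ]; then
        echo "same size"
    elif [ "$dst_bytes" -gt "$src_bytes" ]; then
        echo "target larger"
    else
        echo "target smaller"
    fi
}

# Usage on real devices (device names are placeholders):
#   same_size "$(blockdev --getsize64 /dev/sdX)" "$(blockdev --getsize64 /dev/sdY)"
```

`blockdev --getsize64` prints the device size in bytes, so the comparison is exact rather than based on the rounded capacity printed on the label.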
Phoenix26 Posted November 27, 2023 Author
Complete instructions would be great, thanks!
JorgeB Posted November 27, 2023 Solution
See here for ddrescue instructions. Also note that since disk5 has a pending sector, any read errors on it will be skipped. This in turn means that the rebuilt disk can also have some corruption if the read errors coincide with data, but if it's a single pending sector it should be minimal, or none with some luck.

-Tools -> New Config -> Retain current configuration: All -> Apply
-Check all assignments and assign any missing disk(s) if needed, including the new cloned disk5
-IMPORTANT - Check both "parity is already valid" and "maintenance mode" and start the array (note that the GUI will still show that data on parity disk(s) will be overwritten; this is normal as it doesn't account for the checkbox, but it won't be overwritten as long as the box is checked)
-Stop array
-Unassign the disk you want to rebuild
-Start array (in normal mode now). Ideally the emulated disk will now mount and its contents look correct; if it doesn't, post new diags
-If the emulated disk mounts and contents look correct, stop the array
-Re-assign the disk to rebuild and start the array to begin.
Phoenix26 Posted November 28, 2023 Author
Unfortunately I've hit what appears to be the same issue I got when trying to rebuild initially. I was cloning the bad disk with ddrescue and the entire server locked up, presumably once it got to the bad sector on the disk.

root@AlphaV2:/tmp# ddrescue -f /dev/sdq /dev/sdc /boot/ddrescue.log
GNU ddrescue 1.23
Press Ctrl-C to interrupt
Initial status (read from mapfile)
rescued: 1150 GB, tried: 0 B, bad-sector: 0 B, bad areas: 0

Current status
     ipos:    3284 GB, non-trimmed:        0 B,  current rate:    115 MB/s
     opos:    3284 GB, non-scraped:        0 B,  average rate:    119 MB/s
non-tried:  716453 MB,  bad-sector:        0 B,    error rate:       0 B/s
  rescued:    3284 GB,   bad areas:        0,        run time:  4h 58m 19s
pct rescued:   82.09%, read errors:        0,  remaining time:      1h 46m
                              time since last successful read:          0s
Copying non-tried blocks... Pass 1 (forwards)
Network error: Software caused connection abort
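The "Initial status (read from mapfile)" line shows that ddrescue resumed from the mapfile at /boot/ddrescue.log, which is what makes it safe to rerun the same command after a lockup. A rough sketch for totalling what a mapfile records is below. It assumes the mapfile layout described in the GNU ddrescue manual (comment lines starting with `#`, one current-status line, then `pos size status` rows with `+` for finished and `-` for bad sector); `summarize_mapfile` is a made-up name.

```shell
#!/bin/bash
# Sum the bytes recorded per status in a ddrescue mapfile.
# Assumed status characters: + finished, - bad sector, anything else
# (? * /) still pending. summarize_mapfile is a hypothetical helper.
summarize_mapfile() {
    local mapfile="$1" seen_status_line=0 pos size status
    local finished=0 bad=0 pending=0
    while read -r pos size status _; do
        # Skip comment lines and blank lines.
        case "$pos" in ''|'#'*) continue ;; esac
        # The first data line is the current position/status line, not a block.
        if [ "$seen_status_line" -eq 0 ]; then
            seen_status_line=1
            continue
        fi
        # Sizes are hexadecimal (0x...), which shell arithmetic accepts.
        case "$status" in
            '+') finished=$((finished + size)) ;;
            '-') bad=$((bad + size)) ;;
            *)   pending=$((pending + size)) ;;
        esac
    done < "$mapfile"
    echo "finished=$finished bad=$bad pending=$pending"
}

# Usage (path taken from the command in the post):
#   summarize_mapfile /boot/ddrescue.log
```

All three totals are in bytes. A non-zero `bad` value after the clone finishes indicates regions that could not be read from the source disk.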
JorgeB Posted November 28, 2023
Connect the disk to a different controller; for cloning it doesn't matter where it is connected.
Phoenix26 Posted November 28, 2023 Author
Thanks, trying again with it connected via the IT mode card. Thinking ahead, once I get a successful clone, should I connect that back via the 3ware card?
JorgeB Posted November 28, 2023
Yes, the cloned disk should go there again.
Phoenix26 Posted November 30, 2023 Author
Thanks for your help. I managed to clone disk 5 and create a new config using the clone. Currently rebuilding disk 12, so fingers crossed it finishes.
Phoenix26 Posted December 1, 2023 Author
Rebuild completed without issue. 🤩

Now I'm wondering how I can figure out exactly which files potentially have corruption. I can see in the ddrescue post that there is a way to fill the bad sectors with a string. Is that still possible now that the clone drive is part of the array and in use?

Also, I assume that any files in the same position on the rebuilt drive 12 could be corrupted too? Can the same approach be used on that drive?
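The fill-and-search approach mentioned here can be sketched roughly as follows. This is an illustration only, not the exact procedure from the linked ddrescue post: the marker text, `/dev/sdX`, `/mnt/disk5`, and the helper name `find_marked_files` are all placeholders. The `--fill-mode=-` option is the one the GNU ddrescue manual describes for overwriting blocks marked as bad sectors in the mapfile, and that step writes to the disk, so it is left as a comment.

```shell
#!/bin/bash
# List files under a directory that contain a marker string.
# -r recurse, -l print only file names, -F treat the marker as a fixed string.
# find_marked_files is a hypothetical helper name.
find_marked_files() {
    local dir="$1" marker="$2"
    grep -rlF -- "$marker" "$dir" | sort
}

# Hypothetical workflow (all names below are placeholders):
#   printf 'BADSECTORMARK ' > /tmp/fill.txt
#   # Overwrite the regions the mapfile marks as bad ("-") with the marker text.
#   # This WRITES to the clone, so double-check the device name first.
#   ddrescue -f --fill-mode=- /tmp/fill.txt /dev/sdX /boot/ddrescue.log
#   find_marked_files /mnt/disk5 BADSECTORMARK
```

Any file the search returns overlaps a region that could not be read from the original disk. If the mapfile lists no bad areas, nothing is filled and the search returns nothing.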
JorgeB Posted December 2, 2023
15 hours ago, Phoenix26 said: "Is that still possible now the clone drive is part of the array and in use?"
Yes, if you still have the log. Note that it will invalidate parity.

15 hours ago, Phoenix26 said: "Also, I assume that any files in the same position on the rebuilt drive 12 could be corrupted too?"
Possibly, if there was data on those sectors, but it will be more difficult to see which files are affected unless you have preexisting checksums or are using btrfs/xfs.
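For the "preexisting checksums" case, a manifest only helps if it was created while the data was known to be good, so the sketch below is mostly useful going forward. It uses `sha256sum` from GNU coreutils; the function names and paths are made up for illustration, and the manifest should be stored outside the directory being hashed and passed as an absolute path.

```shell
#!/bin/bash
# Write a SHA-256 manifest covering every file under a directory.
# make_manifest and changed_files are hypothetical helper names.
make_manifest() {
    local dir="$1" manifest="$2"   # manifest: absolute path outside $dir
    ( cd "$dir" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$manifest"
}

# Print the paths whose current checksum no longer matches the manifest.
changed_files() {
    local dir="$1" manifest="$2"
    # --quiet suppresses the "OK" lines, leaving only failures.
    ( cd "$dir" && sha256sum -c --quiet "$manifest" 2>/dev/null ) \
        | sed -n 's/: FAILED$//p'
}

# Usage (paths are placeholders):
#   make_manifest /mnt/disk12 /boot/disk12.sha256     # while data is known good
#   changed_files /mnt/disk12 /boot/disk12.sha256     # after a rebuild
```

Files that were legitimately modified after the manifest was written will also be reported, so the result is a list of candidates to check rather than proof of corruption.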