Craigb Posted April 11, 2021

Help! I noticed that one of the data disks was showing a red cross. I ran an extended SMART test and no faults were detected. After rebuilding the drive, I noticed that the parity drive had become disabled (red cross) at some point during the rebuild process. The original data disk is now unmountable. The file system is XFS, with a single parity drive and about 40TB of total capacity. No other drive is showing any issues. As a precaution, I re-seated all disks, connectors and HBA cards.

I received the following when I ran xfs_repair from the GUI:

>>>
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.
>>>

This is from the read/parity check history:

>>>
Date / Duration / Speed / Status / Errors
2021-04-09 15:55:44 / 27 min, 4 sec / 3.7 GB/s / OK / 1465130625
2021-02-01 06:56:03 / 18 hr, 3 min, 48 sec / 92.3 MB/s / OK / 0
2020-10-27 09:12:39 / 17 hr, 32 min, 20 sec / 95.0 MB/s / OK / 0
2020-09-22 04:44:56 / 16 hr, 58 min, 34 sec / 98.2 MB/s / OK / 0
>>>

That's a 6TB drive which is about 95% full. Other than data that might have been written to the log by the rebuild process, nothing else has been written there for weeks. I'm not sure what my options are if the drive can't be mounted and the log replayed as suggested, but there is a substantial amount of data that I'd really like to keep if possible. I've included the diagnostics from before the rebuild. Thanks!

nas1-diagnostics-20210409-1217.zip
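Editor's note: the 2021-04-09 entry in the check history above stands out, and a quick back-of-the-envelope calculation shows why. The sketch below assumes the disk holds 6e12 bytes (decimal terabytes) and that the Speed column is simply disk size divided by duration; both are assumptions, and `implied_rate` is a hypothetical helper name, not anything from Unraid.

```python
# Assumption: a "6T" drive is taken as 6e12 bytes (decimal terabytes).
DISK_BYTES = 6e12


def implied_rate(size_bytes, hours=0, minutes=0, seconds=0):
    """Average bytes per second needed to cover size_bytes in the given time."""
    elapsed = hours * 3600 + minutes * 60 + seconds
    return size_bytes / elapsed


# Durations copied from the check history in the post.
rebuild_gb_s = implied_rate(DISK_BYTES, minutes=27, seconds=4) / 1e9
check_mb_s = implied_rate(DISK_BYTES, hours=18, minutes=3, seconds=48) / 1e6

print(f"2021-04-09 run: {rebuild_gb_s:.1f} GB/s")
print(f"2021-02-01 run: {check_mb_s:.1f} MB/s")
```

Under these assumptions both figures line up with the Speed column (3.7 GB/s and 92.3 MB/s). If these are spinning hard disks, a sustained rate in the GB/s range is not realistic for a single drive, so the 27-minute run most likely did not process the data normally, which fits the very large error count on the same row.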
JorgeB Posted April 11, 2021

Unfortunately you rebooted since the rebuild. Did you save those diags by any chance? Still, the errors suggest the rebuild was corrupt.
Craigb Posted April 11, 2021

Thanks!! I have several since then. I think this is the one post rebuild.

nas1-diagnostics-20210409-1458.zip
JorgeB Posted April 12, 2021

17 hours ago, Craigb said: "think this is the one post rebuild."

Nope, no rebuild on those.
Craigb Posted April 12, 2021

Got it - I think. The attached is the syslog which includes the rebuild and related events. I don't have a diagnostics package for that period. At some point I'm going to do a byte level image of that disk to try to preserve whatever may be recoverable from further damage. Thanks!!

syslog-20210409-161331.rar
JorgeB Posted April 12, 2021

Yep, there were write errors on parity before the rebuild even began:

Apr 9 15:28:31 NAS1 kernel: md: disk0 write error, sector=1953567832
Apr 9 15:28:31 NAS1 kernel: md: disk0 write error, sector=1953567840
Apr 9 15:28:31 NAS1 kernel: md: disk0 write error, sector=1953567848

So parity got disabled and the rebuild would be 100% corrupt:

Apr 9 15:28:40 NAS1 kernel: md: recovery thread: recon D12 ...
Apr 9 15:28:40 NAS1 kernel: md: recovery thread: multiple disk errors, sector=8
Apr 9 15:28:40 NAS1 kernel: md: recovery thread: multiple disk errors, sector=16
etc

It's not clear to me how parity is enabled again and disk12 disabled in the first diags you posted, or are they old diags? If not, what happened after this:

On 4/11/2021 at 1:15 PM, Craigb said: "After rebuilding the drive, I noticed that the parity drive had become disabled (red cross) at some point during the rebuild process."
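Editor's note: the check JorgeB does by eye on the syslog can be sketched as a small script. This is a minimal illustration, not an Unraid tool: the regular expressions are derived only from the log lines quoted in the post, the function name `summarize_md_errors` is hypothetical, and treating `disk0` as the parity slot is inferred from the post rather than from any documentation.

```python
import re
from collections import Counter

# Patterns taken from the syslog lines quoted in the post (assumed format).
WRITE_ERR = re.compile(r"md: disk(\d+) write error, sector=(\d+)")
RECON_START = re.compile(r"md: recovery thread: recon ")
RECON_ERR = re.compile(r"md: recovery thread: multiple disk errors, sector=(\d+)")


def summarize_md_errors(lines):
    """Count md write errors per disk slot and rebuild (recon) errors."""
    write_errors = Counter()
    write_errors_before_recon = 0
    recon_errors = 0
    recon_started = False
    for line in lines:
        match = WRITE_ERR.search(line)
        if match:
            write_errors[int(match.group(1))] += 1
            if not recon_started:
                write_errors_before_recon += 1
        elif RECON_START.search(line):
            recon_started = True
        elif RECON_ERR.search(line):
            recon_errors += 1
    return {
        "write_errors": dict(write_errors),
        "write_errors_before_recon": write_errors_before_recon,
        "recon_errors": recon_errors,
    }


# Sample lines copied verbatim from the post.
SAMPLE = [
    "Apr 9 15:28:31 NAS1 kernel: md: disk0 write error, sector=1953567832",
    "Apr 9 15:28:31 NAS1 kernel: md: disk0 write error, sector=1953567840",
    "Apr 9 15:28:31 NAS1 kernel: md: disk0 write error, sector=1953567848",
    "Apr 9 15:28:40 NAS1 kernel: md: recovery thread: recon D12 ...",
    "Apr 9 15:28:40 NAS1 kernel: md: recovery thread: multiple disk errors, sector=8",
    "Apr 9 15:28:40 NAS1 kernel: md: recovery thread: multiple disk errors, sector=16",
]

summary = summarize_md_errors(SAMPLE)
print(summary)
```

On a real syslog you would pass the file's lines instead of `SAMPLE`. Any write errors counted before the recon line appears are the warning sign here: with single parity, a rebuild that starts after parity has already failed has nothing valid to reconstruct from.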
Craigb Posted April 13, 2021

The rebuild completed with errors at 15:55.

The first diagnostic file (nas1-diagnostics-20210409-1217) was taken after disk 12 was disabled, prior to the rebuild attempt.
The second diagnostic file (nas1-diagnostics-20210409-1458) is a continuation of the first diagnostic's syslog, after disk 12 was unassigned and reassigned, but still prior to the rebuild attempt.
The third file (syslog-20210409-161331) is the syslog from immediately after the rebuild. I did not get a diagnostic file at this point.

As of this morning, disk 12 is assigned but unmountable. The parity disk is disabled (red cross). I'm doing a byte level image of the data disk to preserve whatever data might still be recoverable and, barring any suggestions from the forum, will attempt to run xfs_repair. All the syslogs from before the failure through to today are available. Thanks!
JorgeB Posted April 13, 2021

2 hours ago, Craigb said: "prior to the rebuild attempt."

Ah, OK, then post current diags please.
Craigb Posted April 13, 2021

And the most recent diagnostics...

nas1-diagnostics-20210413-1345.zip
JorgeB Posted April 13, 2021

Parity appears OK. If you haven't yet, you should replace cables, then you can try re-enabling parity to see if the emulated disk is better than the current one:

-Tools -> New Config -> Retain current configuration: All -> Apply
-Check all assignments are correct
-IMPORTANT - Check both "parity is already valid" and "maintenance mode" and start the array (note that the GUI will still show that data on parity disk(s) will be overwritten; this is normal as it doesn't account for the checkbox, but it won't be overwritten as long as the box is checked)
-Stop array
-Unassign disk12
-Start array (in normal mode now). Ideally the emulated disk will now mount and the contents look correct; if it doesn't, you should run a filesystem check on the emulated disk
-If the emulated disk mounts and contents look correct, stop the array
-Data on the current disk12 should be in very bad shape, but if you want to check it later, rebuild using a new disk.
Craigb Posted April 13, 2021

Thanks for getting back to me and for the advice! The server is currently down as I'm doing the byte level copy to a new drive. That's got about 5 hours to run but I'll get back on this as soon as it's completed.
Craigb Posted April 14, 2021

Disk image complete, new config step complete, drives are in the correct positions. However, there is no tick box for "parity is already valid". Did I miss something?
Craigb Posted April 14, 2021

Check that. Found it... Need to get my glasses checked!!
Craigb Posted April 14, 2021

Success! The array started without problem. The faulty disk is being emulated and the data appears to be completely intact. The replacement drive goes in this morning along with a second parity drive. Many thanks for your assistance! Very much appreciated!