Data rebuild failure



Running unRAID Pro 6.6.6 with 28 data drives (130TB), dual parity, and a 1TB SSD cache drive.  23 of the data drives are mounted in a 24-bay Supermicro 846 chassis, and the other five are in a Supermicro 5-drive bay connected directly to the server motherboard via shielded SATA cables.  Both parity drives are mounted inside the 24-bay enclosure: one is in a standard drive bay and the other is on the drive bracket attached to the side of the power supply bay.

 

Last night one of the drives in the 5-bay enclosure had a red X next to it in the web GUI.  In the two columns that normally show the capacity of the drive and the amount used, it indicated "Unmountable:  No file system."  I shut down the array and swapped out the 4TB drive that had the error for a new 8TB drive that I had already precleared.  I restarted the array, assigned the new drive to the slot, and started a data rebuild.  While it was rebuilding, another drive in the enclosure started showing all kinds of errors and an enormous number of writes.  The new drive was no longer being written to, so the data rebuild had halted, and the replaced slot was still showing the "Unmountable: No file system" indication.  I shut down the array again and replaced the second drive, the one with all of the write errors, with another precleared 8TB drive.  I also replaced the new drive that I had installed in place of the original drive that showed the errors.  I started the array and assigned the new drives to the two slots: the one where the drive had been having the write errors and the original one with the unmountable file system.  I try to keep several precleared spare drives on hand for just such an emergency.

 

After a while I noticed that a third drive in the enclosure was logging a large number of write errors, both new drives were now showing the "Unmountable:  No file system" message, and the data rebuild had again stopped for both new drives.  I canceled the data rebuild and shut down the array.  I swapped out the 5-bay enclosure for another one that I had on hand and powered up the array once again.  The data rebuild started, and the third drive with the write errors is now behaving normally, but I am still getting the "Unmountable:  No file system" indication for the two new drives.  It looks like a normal data rebuild, but I have no idea what data it's putting on the two drives.  I suspect that whatever is being written is simply corrupted and the data is lost.  I would expect it to show the capacity of each drive plus the amount of data to be restored instead of the file system error message.  The display attached to the server is showing metadata CRC errors, along with a message to unmount and run XFS repair.  I'm at a total loss right now.  I've been running unRAID for over 10 years and I've never seen anything like this before.  I've attached the system log and a couple of screenshots to show the error messages.  The two parity drives are not shown because including them would have cut off the drives at the bottom of the screen.  You will notice that I am also running a preclear on another drive in the background.  The total capacity is shown as 122TB instead of the previous 130TB because of the two missing 4TB data drives (disks 26 and 27).

 

image.thumb.png.82a168834f63e653da3d5c67775d6b85.png

 

image.thumb.png.5365d7255f5fb0e602fca81b4789a969.png

tower-syslog-20181228-1214.zip

Edited by captain_video
Link to comment

Ideally you'd have the diagnostics from when the errors happened, but based on the description the best bet would be to reuse the old disks (if they are still available and SMART checks out OK), do a new config, and re-sync parity.

 

P.S. In this short log there are already some ATA errors on two drives, disks 24 and 27. Check the connections.

 

P.P.S. Always post the complete diagnostics, not just the syslog.

Link to comment

It is worth pointing out that an ‘unmountable’ status is not cleared by doing a rebuild. That status belongs to the emulated drive, and a rebuild simply copies the emulated drive onto the physical one, so after the rebuild you will still have the same ‘unmountable’ status.

 

Typically the correct procedure in such a case is to run a file system check/correct against the emulated disk. If all goes well this will result in the emulated disk becoming mountable again and you can check that the data looks intact. At this point a rebuild can be done to bring the physical disk back into alignment with the emulated disk. Doing it this way you still have the physical disk intact (in case more drastic action is needed to recover data) while you get the emulated disk back into a good state to rebuild from.
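As a rough illustration of the check step described above, here is a minimal Python sketch that only builds and prints the command lines; it does not run anything. It assumes the emulated disks are XFS (the console message in the first post mentions XFS repair), that the array is started in maintenance mode, and that the emulated disk N is exposed as `/dev/mdN` on this release. The disk numbers 26 and 27 are taken from the first post, and the helper name `xfs_check_command` is hypothetical. `xfs_repair -n` is the no-modify mode, which reports problems without writing to the device.

```python
def xfs_check_command(disk_number, dry_run=True):
    """Build an xfs_repair command line for an emulated array disk.

    Assumption: emulated disk N is exposed as /dev/mdN and the array
    is started in maintenance mode before the command is run.
    """
    if disk_number < 1:
        raise ValueError("disk_number must be 1 or greater")
    cmd = ["xfs_repair"]
    if dry_run:
        # -n: check only, report problems, never modify the file system
        cmd.append("-n")
    cmd.append(f"/dev/md{disk_number}")
    return cmd


# Print (do not execute) the check-only commands for the two
# unmountable slots mentioned in the thread.
for disk in (26, 27):
    print(" ".join(xfs_check_command(disk)))
```

Reviewing the check-only output first, and dropping `-n` only once the reported problems look repairable, keeps the emulated disk untouched until you decide to proceed.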

Link to comment

Here's the diagnostics file that you requested.

 

tower-diagnostics-20181229-1036.zip

 

I'm going through the various shares to see what data was lost and it's quite considerable.  I may never get it all sorted out.  I have a share for music and the share folder is showing as completely empty.  I had well over 1,000 CDs ripped to my server and spent the past several months getting them all named and tagged properly.  I'm going through the individual disks one at a time to take inventory of what remains so I can figure out what's missing.  This totally sucks.

Edited by captain_video
Link to comment

Yes, like I suspected:

Dec 29 03:31:22 Tower kernel: md: recovery thread: multiple disk errors, sector=1710153656
Dec 29 03:31:22 Tower kernel: md: disk24 read error, sector=1710153664
Dec 29 03:31:22 Tower kernel: md: recovery thread: multiple disk errors, sector=1710153664
Dec 29 03:31:22 Tower kernel: md: disk24 read error, sector=1710153672
Dec 29 03:31:22 Tower kernel: md: recovery thread: multiple disk errors, sector=1710153672
Dec 29 03:31:22 Tower kernel: md: disk24 read error, sector=1710153680
Dec 29 03:31:22 Tower kernel: md: recovery thread: multiple disk errors, sector=1710153680
Dec 29 03:31:22 Tower kernel: md: disk24 read error, sector=1710153688
Dec 29 03:31:22 Tower kernel: md: recovery thread: multiple disk errors, sector=1710153688
Dec 29 03:31:22 Tower kernel: md: disk24 read error, sector=1710153696
Dec 29 03:31:22 Tower kernel: md: recovery thread: multiple disk errors, sector=1710153696

"md: recovery thread: multiple disk errors" is Unraid speak for "there are errors on more disks than the current redundancy can correct; the rebuild/sync will continue, but there will be some (or a lot of) corruption on the rebuilt disk(s)", so I recommend going back to the original disks (if still available and SMART looks OK).

 

Link to comment
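For anyone who wants to gauge how much of a rebuild was affected, the `md:` lines quoted in the last reply can be tallied from a saved syslog. The following is a minimal Python sketch that assumes the two message formats shown in that excerpt; the function name `summarize_md_errors` is hypothetical, and the sample lines are copied from the excerpt.

```python
import re
from collections import Counter

# Patterns match the two md message formats quoted in the thread.
READ_ERROR = re.compile(r"md: disk(\d+) read error, sector=(\d+)")
MULTI_ERROR = re.compile(r"md: recovery thread: multiple disk errors, sector=(\d+)")


def summarize_md_errors(lines):
    """Return (read errors per disk number, sectors flagged as uncorrectable)."""
    read_errors = Counter()
    uncorrectable_sectors = []
    for line in lines:
        match = READ_ERROR.search(line)
        if match:
            read_errors[int(match.group(1))] += 1
            continue
        match = MULTI_ERROR.search(line)
        if match:
            uncorrectable_sectors.append(int(match.group(1)))
    return read_errors, uncorrectable_sectors


sample = [
    "Dec 29 03:31:22 Tower kernel: md: recovery thread: multiple disk errors, sector=1710153656",
    "Dec 29 03:31:22 Tower kernel: md: disk24 read error, sector=1710153664",
    "Dec 29 03:31:22 Tower kernel: md: recovery thread: multiple disk errors, sector=1710153664",
]
errors, sectors = summarize_md_errors(sample)
print(dict(errors), len(sectors))
```

A large count of flagged sectors suggests the rebuilt disks contain corrupted regions, which is consistent with the advice to fall back to the original disks.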
