Bad disk. Bad super.dat file. Now what?



So, I had a bad disk.  I let it run as simulated from the parity for a while.

Then, I shut down and brought the system back up.  It was at this point that I had an empty array.  The /boot/super.dat file is 0 bytes and is dated from almost two weeks ago.

What I want to do is to use the good disks and the parity drive to rebuild the missing data drive.  I know which drive is parity.

How do I do this?


I don't think it exists, as I upgraded a drive from 3TB to 4TB about 10 days ago -- the day the 0-byte super.dat file was created.  I also replaced a failed drive.  I haven't taken a backup since then.

I do have a screen print of the array configuration from just before I did all this.


This is not good.  You can't build a new config with a missing disk, so there's no way to rebuild the drive.

 

The two options I can think of:

* Email LimeTech support, explain the situation, include the screen print with the desired array config, and ask if Tom can generate a super.dat from it.

* Re-install the bad or failed drive.  In general, such drives are still usable and accessible, if not perfect.  That would hopefully allow you to copy off all or most of the data to a safe place.  Then create your new array, rebuild parity, and move the data back wherever you want.

 

By the way, you may also have issues with your flash drive if the super.dat is corrupted.  Try running a ChkDsk (Check Disk, Scandisk, etc.) on it.


Well, I figure I can get the data off the drives (even the one that failed, as I suspect it was just a bad cable).  The problem is that the disk failure occurred a while ago (maybe 10 days or so), so the 'bad' drive will have stale data.

I guess I'll email LimeTech and see if there's any workarounds.

Super.dat is quite the weak point.  It was clearly corrupted 10 days ago, but the array just went on blindly, waiting for a shutdown/reboot.  It would have been nice if it had detected the imminent problem at some point in those 10 days.

 


Those linked instructions should work, and you can try them if LT does not provide a better way, but since you didn't post a syslog for me to check, there are a couple of requirements:

 

1 - you have to be on v6.1.5 or above

2 - the new replacement disk should be precleared/unformatted and has to be the same size as the disk it's replacing (it may work with a larger disk, but I didn't test it)

 

Don't use the old disk for this; keep it intact in case you need it.

 

You have an added difficulty since you don't know which one is your parity disk. Don't try to identify it by setting all disks as data disks and looking for the unmountable one; I tested this a while ago, and in most cases it changed parity enough that this option would then be unsuccessful.

 

This should work. On the console, type:

 

blkid /dev/sdX1

Replace X with each of your disks except the new replacement and the old failed disk.

 

Output for data disks will look something like this:

/dev/sdm1: UUID="93d5bddd-5460-4c06-9799-887e731b132e" TYPE="xfs"

 

The disk with no output, or without a filesystem "TYPE" in the output, is your parity disk.
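
If you have several disks, one way to run it against all of them at once is a small loop like this (just a sketch; the sdb-sdg range is only an example, adjust it to your actual devices and ignore the lines for the new replacement and old failed disk):

for d in /dev/sd[b-g]1; do echo "== $d"; blkid "$d"; done

The device that prints only its "== ..." header with nothing after it has no recognizable filesystem, i.e. it's the parity candidate.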

 

If you've identified your parity disk and want to continue, then:

 

-assign all disks, including parity and the new replacement disk; double check the parity disk is in the parity slot

-very important: check the box "parity is already valid" before starting the array

-start the array; the new replacement disk is going to appear as unmountable, which is OK for now

-stop the array, unassign the replacement disk (select "no device") and start the array

-the missing disk is going to be emulated and should now mount and have all data available (there's a quick console check below)

-if all looks good, stop the array, reassign the replacement disk and start the array to begin the rebuild
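
If you want to sanity-check from the console while the disk is unassigned, the emulated disk shows up under its normal mount point; something like this (disk3 is just an example, use whichever slot number the missing disk is in):

ls -la /mnt/disk3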

 



Thanks for taking the time, Johnny!

 

I'm running 6.1.7, so this sounds like it could work.  The other thread seems like the exact same problem I had.

 

You said

You have an added difficulty since you don't know which one is your parity disk. Don't try to identify it by setting all disks as data disks and looking for the unmountable one; I tested this a while ago, and in most cases it changed parity enough that this option would then be unsuccessful.

 

I do know which is the parity drive, but I may have broken things anyway...

When I looked at the empty array the first time, I momentarily thought I'd reassign drives.  I assigned the correct drive to the first data slot.  When it didn't show up with the correct FS/size/used/free data (and I noticed this after about 3 seconds), I immediately changed it back to 'unassigned' and came to the forums.  Hopefully that lapse in judgement didn't cause anything to be written to the drive. (?)

 

I don't think that I have a spare drive matching the size of the one that was noted as failed (4TB).  What I do have is a second unRAID server that has 10TB of free storage.  The failed drive was probably only about 25% used.  How can I mount that drive in read-only mode so that I can copy its contents off the drive (and then use the drive as if it were a new one)?  It was formatted as XFS.

 


If I remember correctly, starting the array with the parity drive in a data slot writes something to it that makes this option fail, but you can still try it once you've copied all the data from the old disk; there's nothing to lose. Just make sure you don't assign any data drive to the parity slot.

 

Also, if the old disk is bad, don't try to rebuild; if it works, just copy the files you need from the emulated disk.

 

As for mounting the old disk, you can just assign it to any data slot on this server and start the array without assigning parity, then copy over the LAN, or mount it with the Unassigned Devices plugin on the other server. Another option is booting from an Ubuntu live USB drive on a Windows PC; you will have access to the XFS disk and your Windows disks, so you can easily copy all the data.

 

 


If I remember correctly, starting the array with the parity drive in a data slot writes something to it that makes this option fail, but you can still try it once you've copied all the data from the old disk; there's nothing to lose. Just make sure you don't assign any data drive to the parity slot.

I don't think I did that.  I'm pretty sure that when I momentarily added the drive to the array, it was the old data drive #1 in slot #1.

 

Also, if the old disk is bad, don't try to rebuild; if it works, just copy the files you need from the emulated disk.

I'm in the process of copying that data to my second server.  However, the old disk is stale.  My goal here is to get what I can off of the drive in case my rebuild from parity has some kind of problem.

 

As for mounting the old disk, you can just assign it to any data slot on this server and start the array without assigning parity, then copy over the LAN, or mount it with the Unassigned Devices plugin on the other server. Another option is booting from an Ubuntu live USB drive on a Windows PC; you will have access to the XFS disk and your Windows disks, so you can easily copy all the data.

 

I ended up using a command-line mount statement to mount it read-only.  I am in the process of copying this to my second unRAID server.  Even though it's stale, it's better than nothing.
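
For anyone else in the same spot, the read-only mount plus the copy to the other server looked roughly like this (device name, mount point and destination are placeholders here, not my exact ones, and the rsync assumes SSH is enabled on the second server):

mkdir -p /mnt/recover                          # temporary mount point (placeholder path)
mount -t xfs -o ro /dev/sdX1 /mnt/recover      # mount the old XFS partition read-only; replace sdX with the real device
rsync -av /mnt/recover/ root@192.168.1.50:/mnt/user/recovered/    # copy everything to the second server (placeholder address/share)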

 

Once this drive's contents are copied off of it, am I required to preclear it?  I really just want to start the rebuild -- preclearing it seems like a waste of time.  Well, maybe a preclear will find a problem, but given that my issue had been a cable problem, I don't see much value in wasting a day or two on a preclear -- I'd rather start the rebuild from parity.  If it finds an error, I won't be any worse off.  But if it doesn't find an error, I will have avoided a slow preclear.

 


Once this drive's contents are copied off of it, am I required to preclear it?  I really just want to start the rebuild -- preclearing it seems like a waste of time.  Well, maybe a preclear will find a problem, but given that my issue had been a cable problem, I don't see much value in wasting a day or two on a preclear -- I'd rather start the rebuild from parity.  If it finds an error, I won't be any worse off.  But if it doesn't find an error, I will have avoided a slow preclear.

 

There's no need to preclear; you can use a previously formatted disk, and since you're using a known-bad disk, don't try to rebuild:

 

-assign all disks, including parity and the old bad disk; double check the parity disk is in the parity slot

-very important: check the box "parity is already valid" before starting the array

-start the array; the old disk should mount and show whatever data it has on it

-stop the array, unassign the old disk (select "no device") and start the array

-the missing disk is going to be emulated and should now have all data available, including the data missing from the old disk

-if all looks good, start copying data from the emulated disk

 

 


There's no need to preclear; you can use a previously formatted disk, and since you're using a known-bad disk, don't try to rebuild:

 

-assign all disks, including parity and the old bad disk; double check the parity disk is in the parity slot

-very important: check the box "parity is already valid" before starting the array

-start the array; the old disk should mount and show whatever data it has on it

-stop the array, unassign the old disk (select "no device") and start the array

-the missing disk is going to be emulated and should now have all data available, including the data missing from the old disk

-if all looks good, start copying data from the emulated disk

 

I don't see how unRAID will know which of my data disks is the 'bad' one... since it is stale but has a valid filesystem on it.  The moment I assign all drives, including parity, and bring the array up, won't it somehow write to the parity drive (as soon as any process anywhere writes to the array) and therefore destroy the data I need to recover the bad drive?  (For clarity, I'm referring to the moment between steps 3 and 4 above.)  More specifically, if any data gets written to the bad/stale drive, then the parity bits for those writes will get written to the parity drive... which means I won't be able to restore the drive properly.

 

Maybe I should first start a preclear on the stale drive (and cancel it after a minute or two)?  That should ensure that the old/stale filesystem is not recognized by unRAID, which would prevent unRAID from trying to write to it and destroying the related parity data.

 


While I would prefer to use a precleared disk, I did try with a formatted one and it worked as well; no data from any of the disks is written to the parity disk when using the "parity is already valid" option on v6.1.5 or newer.

 

The only thing I don't recommend is using a larger disk; I didn't test that enough, and while it also appears to work, I remember there were some minor filesystem issues, fixed by running xfs_repair, but I didn't test whether all files were 100% correct with checksums.
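
For reference, if those filesystem issues do come up, an XFS check/repair on an array disk looks something like this (md1 is just an example; run it against the /dev/mdX device with the array started in maintenance mode so parity stays in sync):

xfs_repair -n /dev/md1    # check only, reports problems without changing anything
xfs_repair /dev/md1       # actual repair, only after reviewing the check output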


I don't see how unRAID will know which of my data disks is the 'bad' one... since it is stale but has a valid filesystem on it.

 

Just one more thing: unRAID will never know which is the bad disk; that's why you have to unassign the correct one.  It doesn't matter if it's empty or contains data.


While I would prefer to use a precleared disk, I did try with a formatted one and it worked as well; no data from any of the disks is written to the parity disk when using the "parity is already valid" option on v6.1.5 or newer.

Well, I can't guarantee that I don't have some device on my network that will send a file to unRAID when it's up.  If that happens, then unRAID could try to write to my stale/bad drive.  I think I'll do a partial clear to ensure that the drive can't be written to.


Well, I can't guarantee that I don't have some device on my network that will send a file to unRAID when it's up.  If that happens, then unRAID could try to write to my stale/bad drive.  I think I'll do a partial clear to ensure that the drive can't be written to.

 

If you can, I think you should, as it's safer.
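
If you want something quicker than letting a preclear run for a couple of minutes, just erasing the filesystem signature would also make the old disk show as unmountable; something like this should do it (sdX1 is a placeholder -- triple check the device name, this is destructive and only meant for the old/stale disk after its data has been copied off):

wipefs -a /dev/sdX1    # removes the XFS signature so unRAID no longer recognizes the filesystem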


FYI, this approach worked -- my server was rebuilt from parity and I didn't lose any data.

 

The only strange/worrisome moment was the two array shutdowns I had to do -- each of them took about 10 minutes to complete.  I think the delay was caused by some kind of network share issue with the temporary mount I had created on the problem server to access data on my other unRAID server.  For some reason, that mount became unusable when the array was taken offline; the web GUI became unresponsive, and all filesystem commands were very slow (including lsof, mount, and df).

 

In any event, I'm a happy camper this morning.

 

Oh -- a scan of the USB key did not find any errors, so I am not sure what caused the 0-byte super.dat file.  In any event, I made sure to take a copy of the super.dat file now that it's all working well!
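
For anyone wondering, the backup is just a copy of the file to somewhere off the flash drive, along these lines (the destination share is only an example; the source path is the one mentioned earlier in the thread):

cp /boot/super.dat /mnt/user/backups/super.dat.bak    # destination share is a placeholder, use whatever you like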

 

Thanks!

