87,300,027 Errors, Parity Rebuild - BTRFS Array Drive



I just replaced my parity drive from a 2TB to a 4TB drive.

I ran a parity check prior to removing the old drive with 0 errors.

New parity drive ran 3 cycles of preclear with no issues prior to becoming the new parity drive.

 

During parity rebuild I received a notification that one of my disks had read errors.

Parity rebuild completed successfully, and supposedly all is well.

The number of errors on the main page for that 1 disk (my only array disk that is BTRFS) is 87,300,027.

 

I am not sure what this means exactly, the server thinks everything is fine, device is not disabled, etc...

This drive has no history of SMART errors, etc....

I am currently running a scrub to see if it finds any issues (not exactly sure what that is supposed to do for this situation).
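For reference, a btrfs scrub re-reads all data and metadata on the filesystem and verifies it against the stored checksums, so it is a reasonable way to check for silent corruption after an event like this. It can also be started and monitored from the command line; a sketch, assuming the array disk is mounted at /mnt/disk8 (substitute your actual mount point):

```shell
# Start a scrub on the btrfs filesystem mounted at /mnt/disk8
# (the path is an assumption -- adjust to your actual array disk)
btrfs scrub start /mnt/disk8

# Check progress and the running count of checksum/read errors found
btrfs scrub status /mnt/disk8
```

If the scrub finds checksum mismatches on a single-device filesystem it can report them but not repair them, since there is no second copy to repair from.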

 

So what exactly do read errors mean (especially THAT many!)? I assume, since the rebuild of parity was successful, that the data from that drive was able to be read?

I have another 4TB I plan to install, however I don't want to do anything until I know that I am not making things worse.

I still have the old parity drive here if I need to put it back in to (I guess) rebuild any data that could be bad on the drive with the read errors.

 

Steps/suggestions and what this means exactly would be very appreciated.

 

I'd grab a system log, however it seems with the BTRFS scrub running it refuses to come up in the GUI, it is just blank, and a download of it is an empty file.
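In the meantime the log can usually be read directly over SSH or at the console, since unRAID keeps the live system log at /var/log/syslog; a minimal sketch:

```shell
# View the tail of the live system log directly, bypassing the GUI
tail -n 100 /var/log/syslog

# Copy it to the flash drive so it survives for attaching later
cp /var/log/syslog /boot/syslog-saved.txt
```

Note that /var/log lives in RAM on unRAID, so the log is lost on reboot unless copied off first.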

Also, still running 6B15 (figured I'd upgrade after the parity disk swap).


I have to wait for the scrub to finish to do anything at this point. It's at ~100GB currently, drive is only 1TB, and data on it is only ~300GB, so it shouldn't take too long... Once it does I'll stop the array.

 

This may or may not be odd, but as soon as I hit Scrub for that drive, unRAID told me "Disk 8 in error state (disk dsbl)". All drives are now spun up, and the array status at the bottom shows a yellow triangle, which according to the popup for an array drive means invalid data content.

 

Screenie for better detail attached.

[Attached screenshot: Main.png]


It sounds as if that drive might have dropped offline. If you stop the array, what does the GUI say?

 

 

Scrub finished with zero errors, all was well, so I stopped the array.

 

Attached is a picture; the disk shows as "Not Installed". However, shortly after stopping the array I got the notification (also visible in the picture) that:

Event: unRAID array errors

Subject: Notice [SERVER] - array turned good

Description: Array has 0 disks with read errors

Importance: normal

 

As of right now it shows as no disks present to re-assign to Disk 8.

System log is still blank.

[Attached screenshot: stopped.png]


If the disk is showing as not installed when you stopped the array, then it dropped offline. Normally the only way to recover from this is to reboot the server. Why it dropped offline is the question. It is worth checking that you have not disturbed the power or SATA cables for that drive.
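Before rebooting, it can be worth confirming from the command line whether the kernel still sees the device at all; a sketch (the /dev/sdX device name is an assumption, identify it from the lsblk output first):

```shell
# See which block devices the kernel currently has
lsblk -o NAME,SIZE,MODEL,SERIAL

# Look for recent ATA link errors or device resets in the kernel log
dmesg | grep -iE 'ata[0-9]|sd[a-z]' | tail -n 40

# If the device node is still present, check its SMART health
# (replace /dev/sdX with the actual device)
smartctl -a /dev/sdX
```

If the disk is missing from lsblk entirely, that points to a cable, backplane, controller, or power issue rather than a filesystem problem.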



Thank you for the help so far!

 

The sequence of events, and the fact that I had access to the disk throughout, makes it seem too coincidental for it to have simply dropped offline...

 

Maybe performing a scrub on an array drive changes an attribute of the disk and breaks something that allows it to be part of the array, which is why that disk is now flagged as "new device not in array". If so, then I think this option should be disabled so that clicking Scrub can't break everything and wreak havoc on your setup! I thought it was merely a good way to check for corruption, however I think it broke things!  :-\

 

Let's recap:

So the parity rebuild succeeded, but with a LOT of errors for one specific disk.

Disk shows up as fine, just has a bunch of errors.

I ran a scrub on that disk, which completed successfully; however, AS SOON as I started the scrub, unRAID dropped the disk and said it was unavailable. To me that sounds as if the scrub process took exclusive rights to the disk, unRAID could not access it, and so it flagged it as such.

I stopped the array (the scrub now completed), but that disk slot is empty and I cannot see that specific disk.

I then shut down completely and checked the cables (all is well). On power-up the disk 8 slot is empty, but I can select the same disk as before. If I do, the disk shows blue and the start button states: "Start will bring the array on-line, start Data-Rebuild, then expand the file system (if possible)."

 

This sounds like I am adding a new disk, which I am NOT, it's the exact same disk as before.

Should I tell it to go ahead?...

If parity is intact it shouldn't matter; however, it was when I reconstructed parity (just previously) that I got these 87,300,027 errors, so does that mean I would be reconstructing garbage?

 

If so, and it is actually going to rebuild the disk (which should be unnecessary), I might as well just pull it and put in the other 4TB drive I have sitting here.

I just have no idea why it thinks I am re-assigning disk 8 to this 1TB drive, as it is the exact disk that was there before.

 

 

I now have a system log (not from before; that was always a blank file) covering the time from after I powered down up until now.

 

Am I right in making the assumption that if the parity rebuild succeeded even with the listed read errors, that all is well as it was able to read the data and therefore complete parity?

If the errors during the parity rebuild were unreadable it would have then failed, correct?

[Attached: syslog.txt]


Well, guess what........  :o

 

I decided to start the array unprotected without adding back in a disk to disk 8.

 

I then installed the Unassigned Devices plugin to mount the drive (the previous disk 8) outside of the array and copy the files off of it.

It lists the drive fine: FS btrfs. I hit mount; size 1TB, used 0B.

 

So... I have NO idea (and am also thinking WTF!!) what could have erased everything on this drive, but unless something is wrong with the reporting in Unassigned Devices, all the data there is gone.
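Before trusting the plugin's 0B figure, it may be worth double-checking from the command line whether btrfs itself sees any data on the disk; a sketch, assuming /dev/sdX1 is the old disk 8 partition (identify it with lsblk first):

```shell
# Mount the old disk 8 read-only so nothing can be changed
# (/dev/sdX1 is an assumption -- identify the device with lsblk)
mkdir -p /mnt/old-disk8
mount -o ro /dev/sdX1 /mnt/old-disk8

# Ask btrfs itself how much space is actually allocated and used
btrfs filesystem usage /mnt/old-disk8

# And look for any surviving files at the top level
ls -la /mnt/old-disk8
```

If btrfs also reports essentially zero used space and an empty top level, the data really is gone from that filesystem, not just misreported.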

 

That being the case, if I rebuild that disk onto the same drive (now seen as new) or onto the 4TB I have sitting around, is the data I am putting back riddled with 87,300,027 errors? Or was that just reported as an issue, and does the parity notification "Subject: Notice [SERVER] - Parity disk returned to normal operation" mean all is good, no issue?..

 

Thanks for listening!


I replaced disk 8 with the new 4TB drive, and have started rebuilding the previous disk onto it.

Hopefully I am not reconstructing corrupted data, I would assume that isn't the case.

 

Afterwards I will move the data and reformat as XFS; this is the 2nd time I have had issues with BTRFS.
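For the move itself, one common approach is to copy the rebuilt disk's contents to another array disk, verify the copy, and only then reformat the source; a sketch with assumed mount points (substitute your actual disk paths):

```shell
# Copy everything from the rebuilt disk to another array disk,
# preserving permissions and timestamps
# (paths are assumptions -- adjust to your actual disk layout)
rsync -avh --progress /mnt/disk8/ /mnt/disk9/

# Verify the copy before touching the source: reports any
# files that differ or are missing, prints nothing if identical
diff -rq /mnt/disk8 /mnt/disk9
```

Only reformat the source disk (via the unRAID GUI) once the diff comes back clean.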

