Is my disk failing

dboonthego · January 31, 2017

Some background... I was using the free 3 disk unRAID 5.0-beta14. My hardware except for the drives was made up of spare parts including a case with no side panel. All my drive issues seem to have been due to the SATA cables; mostly because the case was exposed and problems have always happened right after I have messed with the case somehow. Re-seating cables always resolved any errors. Disk 2 always seemed to have the issue.

This past November, re-seating cables was not enough. I had to replace the SATA cable for disk 2 and execute these commands. It fixed some stuff, but I have been skeptical of the drive integrity ever since.

reiserfsck --check /dev/md2

reiserfsck --rebuild-tree /dev/md2

All three of the disks are the same, but disk 2 had the file system errors. It makes a slight clicking noise during spin up, but does not repeat the clicking like a dead drive usually does. The other two drives don't do this. It seemed like when disk 2 was attached, the web interface was slow to become available at boot and slow to respond during page to page clicks. The slow response may have been because of the bad cable.

Fast forward to present day, I upgraded unRAID to 6.2.4 and upgraded all hardware, but am still using the same disks. Things seem to be running fine. I'm just not sure what to make of disk 2. I think I am reading the SMART results correctly. The values look good to me. I also executed both the short and extended SMART self tests and both passed without error. Thoughts?

tower-smart-20170129-2233.zip

SSD · January 31, 2017

You have one pending sector.

Basically this means that the disk has determined that this sector is bad, but it cannot take that sector out of service because it may contain valuable user data. It doesn't know and won't ever take it out of service based on a read error - even a read error that is so bad that it will never succeed. If that sector were ever to be written to, it would give the drive an opportunity to "remap" a spare sector to replace the bad sector, and take the bad sector out of service, all without risking the loss of user data. (This is the magic of the SMART system). Parity checks that do no writing do not afford the ability to do the remapping.

Judging from your SMART log, this sector may have been causing mischief for quite some time. (Look for "Error: UNC 8 sectors at LBA = 0x0000f330 = 62256" in the SMART report.)
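As a quick sanity check on that log line, here is a minimal Python sketch that decodes the hex LBA and lists the span the error refers to. The 512-byte logical sector size is an assumption, and the helper name `lba_range` is made up for illustration. The span is only what the log line names; it does not mean all eight sectors are unreadable.

```python
# Illustrative only: decode the LBA quoted in the SMART error log line.
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def lba_range(hex_lba, count):
    """Return the LBAs named by an error of `count` sectors starting at hex_lba."""
    start = int(hex_lba, 16)
    return range(start, start + count)


affected = lba_range("0x0000f330", 8)
print(affected.start, affected.stop - 1)   # first and last LBA in the span
print(affected.start * SECTOR_BYTES)       # byte offset of the first LBA
```

The decimal value matches the one printed in the log, which confirms the two numbers in that line are the same address in different bases.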

There are a couple of options to get this to clear, including replacing and preclearing the disk.

If all other disks in the array appear good, you could "rebuild" this disk with the pending sector, similar to the procedure used to rebuild a disk that was kicked from an array due to a bad or loose cable. The rebuild would write to each and every sector on the disk, and should provide the disk the opportunity to take corrective action.
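To make the rebuild idea concrete, here is a toy Python sketch, assuming single parity computed as a bytewise XOR across the data disks. The 4-byte "disks" and the helper `xor_blocks` are made up for illustration; this is not unRAID's code.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Toy 4-byte "disks"; the contents are invented for this example.
disk1 = bytes([0x10, 0x22, 0x35, 0x47])
disk2 = bytes([0xA0, 0x0B, 0xC4, 0x5D])   # the disk to be rebuilt
parity = xor_blocks([disk1, disk2])

# Rebuild: every byte of disk2 is recomputed from parity and the other
# disk(s) and then written out, so every sector on the target is written.
rebuilt = xor_blocks([parity, disk1])
print(rebuilt == disk2)
```

Because every position on the target is written, the drive gets its chance to deal with the pending sector along the way. The sketch also shows why the other disks must read cleanly: each rebuilt byte depends on all of them.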

I will note that a good disk will never develop a bad or pending sector, and although sometimes a procedure like I mention above will resolve the issue and allow the drive to stay in service for a period of time, my experience is that once a drive starts to act up, its days are numbered. But that should not deter you from trying to fix it. Just keep an eye on the drive.

dboonthego · January 31, 2017

There are a couple of options to get this to clear, including replacing and preclearing the disk.

Do you mean preclearing a brand new disk and replacing the suspect disk with the brand new one? Or unassign, preclear, and reassign the suspect disk?

If all other disks in the array appear good, you could "rebuild" this disk with the pending sector, similar to the procedure used to rebuild a disk that was kicked from an array due to a bad or loose cable.

I'm not sure of the difference between the two procedures, but if I recall correctly from a few months ago, the array would not start because of the filesystem errors; I had the red X for disk2. I unassigned it (probably not the right thing to do). This is when I executed the smartctl and reiserfsck commands. After that completed, I re-assigned disk2 and started the array. I don't remember exactly, but I think disk2 was rebuilt from parity at this point. Either that or parity was recalculated. Is that the same procedure as "rebuilding with the pending sector"?

trurl · January 31, 2017

There are a couple of options to get this to clear, including replacing and preclearing the disk.

Do you mean preclearing a brand new disk and replacing the suspect disk with the brand new one? Or unassign, preclear, and reassign the suspect disk?

The 1st method is just replacing the disk with another disk, rebuilding onto the new disk as usual, and then preclearing the original disk.

If all other disks in the array appear good, you could "rebuild" this disk with the pending sector, similar to the procedure used to rebuild a disk that was kicked from an array due to a bad or loose cable.

I'm not sure of the difference between the two procedures, but if I recall correctly from a few months ago, the array would not start because of the filesystem errors; I had the red X for disk2. I unassigned it (probably not the right thing to do). This is when I executed the smartctl and reiserfsck commands. After that completed, I re-assigned disk2 and started the array. I don't remember exactly, but I think disk2 was rebuilt from parity at this point. Either that or parity was recalculated. Is that the same procedure as "rebuilding with the pending sector"?

If you have a link maybe we could figure out what you did. Filesystem errors would not keep you from starting. Possibly you corrected the filesystem errors on the emulated disk, then rebuilt to the same disk. Or possibly you rebuilt to the same disk, then corrected the filesystem errors. Either would have worked.

But for the present case, the 2nd method is just saying to rebuild the disk to itself.

Pending means pending reallocation, which means it needs to be reallocated but hasn't been yet.

Both methods rewrite the entire disk. When the disk tries to rewrite the pending sector, it may reallocate it, thus changing a pending sector into a reallocated sector.
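One way to see whether the rewrite had any effect is to compare the raw values of attribute 197 (Current_Pending_Sector) and attribute 5 (Reallocated_Sector_Ct) before and after. The Python sketch below parses text laid out like a `smartctl -A` attribute table. The sample rows are hypothetical and are NOT taken from the attached report, and `raw_values` is a made-up helper name.

```python
# Hypothetical sample in the layout of `smartctl -A` output; not the poster's data.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""


def raw_values(report):
    """Map attribute ID -> integer raw value for rows that begin with a numeric ID."""
    values = {}
    for line in report.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                values[int(fields[0])] = int(fields[9])
            except ValueError:
                continue  # skip raw values that are not plain integers
    return values


attrs = raw_values(SAMPLE)
pending = attrs.get(197, 0)
reallocated = attrs.get(5, 0)
print(f"pending={pending} reallocated={reallocated}")
```

After a full rewrite, the pending count would be expected to drop. Whether the reallocated count rises depends on whether the drive actually remapped the sector or found it writable in place.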

The advantage of method 1 is you have a new disk that should get the data rebuilt to it, and you also have the original disk, which should still have the data on it, though it might be slightly invalid due to missing writes. So you wind up having the original in case there is a problem that keeps the rebuild from succeeding.

The advantage of method 2 is you don't have to have another disk.

The main thing that might keep a rebuild from succeeding is a problem with another disk during the rebuild. Do any of your other disks have SMART issues?

dboonthego · February 1, 2017

Thank you guys for your responses. They have been very helpful.

Filesystem errors would not keep you from starting. Possibly you corrected the filesystem errors on the emulated disk, then rebuilt to the same disk. Or possibly you rebuilt to the same disk, then corrected the filesystem errors. Either would have worked.

Maybe it was a parity check that I could not do while disk2 had the drive error. It is all fuzzy in my head now. I'm second guessing what I really did, but I know I did a "--rebuild-tree" twice and each time there were corrections. I saved the report. This was of course after being directed to do so by the "--check" switch.

I think it went like this:

1. disk2 has drive error (red x)
2. re-seat cables w/ no improvement
3. unassign disk2
4. replace sata cable
5. run reiserfsck --check /dev/md2
6. run reiserfsck --rebuild-tree /dev/md2
7. assign disk2
8. rebuild disk2
9. run reiserfsck --check /dev/md2
10. run reiserfsck --rebuild-tree /dev/md2
11. start array

The main thing that might keep a rebuild from succeeding is a problem with another disk during the rebuild. Do any of your other disks have SMART issues?

The other disks are in good shape, no SMART issues. I did buy an extra drive over Christmas with the forethought that I may need it so I have that option also.

I don't totally understand the preclear stuff. It seems unRAID will zero additional disks when added to the array, but if those disks are precleared, they can be added to the array more quickly? But, if replacing a disk, unRAID must write all bytes to the replacement disk anyway so it isn't necessary? Is that right? Other than being a time saver, are there other benefits like more passes; pre-read, write, post-read?

Is preclearing my suspect drive the way to go? If I replace suspect disk2 now, then decide to "add" it back later as disk3, what will a preclear do for me as opposed to just adding it and letting unRAID zero it then?

trurl · February 1, 2017

unRAID only needs a clear disk when adding a disk to a new slot in an array that already has parity. A clear disk is all zeros and so has no effect on parity, so adding a clear disk keeps parity valid.
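The reason an all-zero disk leaves parity untouched can be shown with a toy Python sketch, assuming single parity computed as a bytewise XOR. The 4-byte "disks" and the helper `xor_parity` are invented for illustration.

```python
def xor_parity(disks):
    """Bytewise XOR across equal-length byte strings."""
    parity = bytearray(len(disks[0]))
    for disk in disks:
        for i, byte in enumerate(disk):
            parity[i] ^= byte
    return bytes(parity)


# Toy 4-byte "disks"; the contents are invented for this example.
disk1 = bytes([0x10, 0x22, 0x35, 0x47])
disk2 = bytes([0xA0, 0x0B, 0xC4, 0x5D])
cleared = bytes(4)  # a "clear" disk: all zeros

before = xor_parity([disk1, disk2])
after = xor_parity([disk1, disk2, cleared])
print(before == after)
```

XOR with zero changes nothing, so the existing parity is already correct for the enlarged array and no parity recalculation is needed.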

Since unRAID will clear a disk if it needs a clear disk, the main advantage of preclear now is for testing (burn-in). It used to be more useful because unRAID used to take your array offline when it needed to clear a disk, but recent versions don't take the array offline to clear.

dboonthego · February 7, 2017

Thanks for your support guys. I unassigned disk2 and ran it through 3 preclear cycles. It looks to have corrected the pending sector; results attached. I am about to add it back to the array and let it rebuild from parity. Is it possible to format XFS before rebuilding? The format option is greyed out.

trurl · February 7, 2017

Thanks for your support guys. I unassigned disk2 and ran it through 3 preclear cycles. It looks to have corrected the pending sector; results attached. I am about to add it back to the array and let it rebuild from parity. Is it possible to format XFS before rebuilding? The format option is greyed out.

Format means "write an empty filesystem to this disk". That is what it has always meant in every operating system you have ever used. unRAID treats this write just like any other, by updating parity so that it agrees that the disk has an empty filesystem on it. Obviously not what you want when trying to rebuild a disk that had files on it.

dboonthego · February 7, 2017

Never mind. It can't be done. Right after posting this, I saw this thread: https://lime-technology.com/forum/index.php?topic=37490.0
