Disk with errors (but green) during parity rebuild

steve1977 · November 17, 2017

Hmm... one more idea. I create a UD formatted as NTFS. I then move the files with "mv" from the array disk to the NTFS disk. Then I delete the files from the NTFS UD. Is that the only working way to get rid of these files elegantly?

pwm · November 17, 2017

XFS might possibly be slower than other file systems when deleting large sets of files. But that isn't normally an important test case. In general, disks tend to be filled and the data read many times, so it's more important to optimize for adding a large number of small files and for being able to read small and large files efficiently.

Personally, I often use smaller, mirrored, drives (and then normally SSD) for handling larger sets of small files. This also means that the "big data" RAID volumes don't need to worry about disk fragmentation.

steve1977 · November 18, 2017

Got it. Let me go the route with NTFS to "fix" the specific issue.

Almost all of my files are rather large, so this issue hopefully will not recur. The only exception may be my iTunes library with its many music files. So, hopefully the disk with my iTunes library does not get corrupted. It would be too large to move outside the array though (I would like to keep UDs below 1TB).

One related question, which may be obvious. What to do with my "old" disk that had shown the 128 errors. Can I still use it by marking these sectors? Or is it a lost disk and I'll need to throw it away / recycle it?

JorgeB · November 18, 2017

What to do with my "old" disk that had shown the 128 errors. Can I still use it by marking these sectors? Or is it a lost disk and I'll need to throw it away / recycle it?

You can try preclearing to see if the pending sector(s) are remapped, but in my experience once a disk starts getting bad sectors it's much more likely to get more in the near future. I would possibly reuse it in a backup server.

steve1977 · November 18, 2017

Is there even a possibility that the disk is fine and the bad sectors were misreported? Maybe because of the XFS issue with many small files?

JorgeB · November 18, 2017

49 minutes ago, steve1977 said:

is there even the possibility that the disk is fine and the bad sectors were mis-reported?

Sometimes disks (mostly WD in my experience) report "false positives", but since the SMART test failed, that is not the case here: the errors are real.

pwm · November 18, 2017

File copy can report errors that are not related to the disk but are caused by access rights, file locking, etc. So the number of errors from copying doesn't relate to the number of actual sector errors on the drive.

But when the SMART test reports errors, it's because a sector really can't be read. The sector may be physically damaged, in which case it must be remapped to a spare sector. But the disk can't do the remapping unless it knows the sector is broken and later sees a write to the sector (so it knows what new information to put into the sector).

The sector may also be physically okay, but some issue (power spike, mechanical bump, etc.) may have caused the write of the data to fail. Once the disk knows what value to write, it may rewrite the same sector and then be satisfied that it can read the data back correctly. So the SMART error can go away without the disk needing to remap the sector.

The sector may also be physically okay, but the head or head assembly can have issues that give the drive problems reading. This would normally cause a very quick increase in the number of failed sectors, and the disk has to be replaced.

The sector (and possibly many other sectors) may be physically damaged because of manufacturing issues, because the drive head has crash-landed and damaged the surface, or maybe because something has happened to the lubrication that is intended to reduce the friction from the air turbulence when the head passes at an extremely low distance from the surface. This is also likely to give increasing failures, and you should consider replacing the drive.

So - if you see a disk that shows a very low number of bad sectors (in my view maybe 1-3) and then stops counting, then it's likely that an extremely localized issue has happened or that the original factory test before shipping failed to detect and remap some bad sectors. After all, all new disks have some bad sectors that get remapped at the factory but aren't visible in the SMART data. I have had two disks that got one or two bad sectors and then ran for 3-5 years with no more issues. But a disk that has many broken sectors or shows an increasing number of bad sectors has to be replaced quickly - you are likely to get even more failures or a total inability to access the drive in the not too distant future.

If you have no warranty on the drive, and it has a low error count, you could possibly store some less important data on it (such as an additional backup copy), keep it powered, and regularly run a long SMART test for a couple of months. If the error counter isn't stable over the next three months, it's likely to continue to increase. Several reports show a very strongly increased probability of a fatal disk error within three months after a drive has ticked up a bad sector counter.

This report is 10 years old so actual figures will probably have changed significantly because of changes in technology used in the drives, but the outcome then was a 39 times larger risk of disk failure within 60 days of the first SMART scan failure:

http://storagemojo.com/2007/02/19/googles-disk-failure-experience/
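
The "watch whether the counter stays stable" routine described above can be sketched roughly as follows. This is only an illustration, assuming you save the attribute table printed by `smartctl -A` at regular intervals; the two sample readings, the function names, and the choice of watched attributes are made up for the example and are not taken from the disk in this thread.

```python
# Attribute names as printed by `smartctl -A`; these two (IDs 5 and 197)
# are the usual bad-sector counters. Watching only these is an assumption.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")

# Hypothetical readings, saved a week apart (not real data from this thread).
SAMPLE_WEEK_1 = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
"""
SAMPLE_WEEK_2 = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       5
"""


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for the watched attributes."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; the raw value is the 10th column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            values[fields[1]] = int(fields[9])
    return values


def counters_growing(older, newer):
    """List the watched attributes whose raw value increased between readings."""
    return [name for name in WATCHED if newer.get(name, 0) > older.get(name, 0)]


week1 = parse_smart_attributes(SAMPLE_WEEK_1)
week2 = parse_smart_attributes(SAMPLE_WEEK_2)
print(counters_growing(week1, week2))  # ['Current_Pending_Sector']
```

An empty list over a few months would match the "stable counter" case; any growth is the pattern that calls for replacing the drive.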

steve1977 · November 22, 2017

I finally succeeded in deleting all the files. So, things should be good now.

Unfortunately, my brand new parity disk now shows some errors. Any insights from the log whether these are HW errors or just cabling?

tower-diagnostics-20171122-2026.zip

JorgeB · November 22, 2017

15 minutes ago, steve1977 said:

Any insights from the log whether these are HW errors or just cabling?

They look more like cable/connection issues, though the LSI driver is not very helpful at showing the error type.

steve1977 · January 4, 2018

Unfortunately, my brand new parity disk (WD 10TB) keeps failing on me. I tried to run an extended SMART test and believe that it completed without errors. I am not sure though, as there may have been restarts or I may just be reading things wrong. May I ask if someone skilled can have a look at my diagnostic log to see whether the extended SMART test actually ran and completed? And whether there is anything suspicious in the log that explains why the disk failed? Thanks for your help!

tower-diagnostics-20180104-0121.zip

JorgeB · January 4, 2018

SMART looks fine and the extended test completed without error:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       960         -
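
For anyone who wants to check this themselves, here is a rough sketch of reading the self-test log that `smartctl -l selftest` prints. The regular expression and the function name are my own assumptions for illustration; the sample text is just the two lines quoted above.

```python
import re

# Matches one row of the self-test log. Columns are separated by runs of
# two or more spaces, which is an assumption based on the quoted output.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<kind>.+?)\s{2,}"
    r"(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\S+)\s*$"
)

SAMPLE_LOG = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       960         -
"""


def latest_extended_test(log_text):
    """Return (status, lifetime_hours) of the newest extended test, or None."""
    for line in log_text.splitlines():
        match = ROW.match(line.strip())
        # The newest entry is listed first, so the first hit is the latest test.
        if match and match.group("kind").startswith("Extended"):
            return match.group("status"), int(match.group("hours"))
    return None


print(latest_extended_test(SAMPLE_LOG))  # ('Completed without error', 960)
```

Comparing the LifeTime(hours) value with the drive's current power-on hours shows how recently the test finished, which helps when restarts make it unclear whether a test really ran to the end.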

steve1977 · January 4, 2018

Unfortunately the disk failed again, while building new parity. Log attached. Any indication? Pointing to a cable issue again?

tower-diagnostics-20180105-0135.zip

JorgeB · January 4, 2018

Unfortunately the mptsas driver is not very helpful with the type of error. You could try swapping it with a disk connected to the onboard controller. That would rule out cables, and if it fails again it would be easier to identify the type of error.

steve1977 · January 8, 2018

Thanks. The extended SMART test completed without any error messages.

Unfortunately, the parity always fails to build. I have now tried 4 times or so, and each time it does not work.

Could it be that Unraid does not support 10TB parity disks yet? I am attaching another diagnostic log.

tower-diagnostics-20180109-0115.zip

Edited January 8, 2018 by steve1977

JorgeB · January 8, 2018

It's still on the same controller.

On 04/01/2018 at 6:11 PM, johnnie.black said:

you could try swapping it with a disk connected on the onboard controller

steve1977 · January 8, 2018

True, I forgot.

I am quite sure that it will work once I swap the disk to the on-board controller. I have never had any disk fail on the on-board controller, while failures on the RAID controller happen quite frequently. However, I have never had a disk fail systematically every single time. So something is more seriously wrong here since I started trying to use the 10TB disk for parity.

Are you suspecting that the controller cannot handle that much data volume?

JorgeB · January 8, 2018

2 minutes ago, steve1977 said:

Are you suspecting that the controller cannot handle that much data volume?

It should if it is working correctly, just trying to rule things out.

steve1977 · January 8, 2018

I will give it a try and report back. I am near certain that the disk will not fail on the on-board controller, but this would not really solve my issue :-( I had already replaced the controller card before, so the card per se should not be the issue. I also changed the cabling. I am running out of options for what I can still do to get it to work.

The "fix common problems" app quite often complains about my restarts/shutdowns not being "clean" and suggests a UPS. I could give this a try, but somehow doubt this is the issue, as the on-board disks are never impacted. Unless a power shortage somehow impacts the controller card more?

steve1977 · January 8, 2018

One more observation. The parity build was previously running at 100-120k/sec. Since adding the 10TB parity disk, the build is running at only 50k/sec. Is this expected behavior or an indication that something went south?

I am currently running the parity build with the 10TB disk connected to the on-board controller. I should know in around 24-48 hours whether this will complete. I can already confirm that it is also only running at 50k/sec.

JorgeB · January 8, 2018

Assuming you mean MB/s: no, it's not normal if the new parity disk is replacing the old one, i.e., there's no extra disk.

steve1977 · January 8, 2018

Yes, MB/s. No, not an extra disk, though the size of the disk has changed (i.e., 6TB->10TB). Is there any indication from the log what could be the reason for the slower speed?

Rebuild will still be running for the next 2 days, but I am confident that it will finish this time as the parity is now on the on-board controller.
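
As a rough sanity check on the two-day estimate, the build time can be worked out from the parity size and the reported speed. This is only a back-of-the-envelope sketch: it assumes a constant average speed and decimal units, and the function name is made up. Real builds usually slow down toward the end of the disk.

```python
def build_hours(parity_size_tb, speed_mb_s):
    """Hours needed to process a parity disk at a constant average speed."""
    size_bytes = parity_size_tb * 1e12   # drive vendors use decimal terabytes
    speed_bytes_s = speed_mb_s * 1e6     # assumes the reported MB/s is decimal too
    return size_bytes / speed_bytes_s / 3600


for speed in (50, 110):
    print(f"10 TB at {speed} MB/s: about {build_hours(10, speed):.0f} hours")
# 10 TB at 50 MB/s: about 56 hours
# 10 TB at 110 MB/s: about 25 hours
```

So 50 MB/s puts a 10TB build at a bit over two days, versus roughly one day at the earlier 100-120 MB/s.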

pwm · January 9, 2018

9 hours ago, steve1977 said:

I will give it a try and report back. I am near certain that the disk will not fail on the on-board controller, but this would not really solve my issue :-( I had already replaced the controller card before, so the card per se should not be the issue. I also changed the cabling. I am running out of options for what I can still do to get it to work.

The "fix common problems" app quite often complains about my restarts/shutdowns not being "clean" and suggests a UPS. I could give this a try, but somehow doubt this is the issue, as the on-board disks are never impacted. Unless a power shortage somehow impacts the controller card more?

If a disk drops out from a controller, then the OS has no way of performing a clean unmount of that file system. So the next boot will show log lines complaining about the state of the file system on these disks. So with a controller issue, warnings about unclean restarts/shutdowns are to be expected.

steve1977 · January 9, 2018

Got it, this is helpful. So, UPS may not be the silver bullet.

I realized that the reason for the decline in speed may be completely unrelated. I actually added a second GPU and thus had to move the controller card to a different slot. Potentially, the "new" slot is slower. Would also explain why it is exactly half of the transfer speed. Besides building the parity, how does this impact once the parity is built? I could consider getting rid of one GPU as I don't think my mobo has a third fast slot available. I have an ASUS X299-A (https://www.asus.com/de/Motherboards/PRIME-X299-A/). Although looking at the screenshot on the link, maybe there is?

I also suspect that this is unrelated to my drive errors, but let me report back on this once the parity is built (or not built). Thanks again for all the help from this forum.

JorgeB · January 9, 2018

2 hours ago, steve1977 said:

I actually added a second GPU and thus had to move the controller card to a different slot. Potentially, the "new" slot is slower. Would also explain why it is exactly half of the transfer speed.

If you're using the bottom x16 slot it will be slower.

2 hours ago, steve1977 said:

Besides building the parity, how does this impact once the parity is built?

Parity checks, disk rebuilds and writes with turbo write enabled.
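
To see why the slot matters for these operations, here is a rough sketch of the per-disk ceiling when every disk on the controller is read at the same time, as happens during a parity build or check. The per-lane figures are the usual approximate values after line encoding. The lane counts and the disk count below are hypothetical examples, not the actual configuration of this server; the real negotiated link width can be read from the LnkSta line of `lspci -vv`.

```python
# Approximate usable bandwidth per lane in MB/s after line encoding
# (8b/10b for PCIe 1.x/2.0, 128b/130b for 3.0). Packet overhead lowers
# real-world throughput further.
MB_PER_LANE = {1: 250, 2: 500, 3: 985}


def per_disk_ceiling(pcie_gen, lanes, disks):
    """Upper bound in MB/s per disk when all disks share the controller link."""
    return MB_PER_LANE[pcie_gen] * lanes / disks


# Hypothetical example: a PCIe 2.0 controller with 8 disks attached.
for lanes in (8, 4, 1):
    print(f"x{lanes}: up to {per_disk_ceiling(2, lanes, 8):.1f} MB/s per disk")
# x8: up to 500.0 MB/s per disk
# x4: up to 250.0 MB/s per disk
# x1: up to 62.5 MB/s per disk
```

The link only becomes the bottleneck once the per-disk ceiling drops below what the disks themselves can sustain, which is why a narrow slot shows up mainly in operations that hit all disks at once.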

steve1977 · January 9, 2018

Got it, thanks. Does my GPU benefit from the "faster" lane or could I swap my GPU with the controller card? It is still building the parity and I am confident it will complete now, but it will take some days at 50 MB/s.

My sense is that my WD 10TB is not fully compatible with my M1015 and that the speed decrease is caused by the change in the lane. Fingers crossed that it completes and confirms this assessment.
