Disk with errors (but green) during parity rebuild



XFS may be slower than other file systems at deleting large sets of files. But that isn't normally an important test case - in general disks tend to be filled and the data read many times, so it's more important to optimize for adding a large number of small files and for being able to read small and large files efficiently.

 

Personally, I often use smaller, mirrored drives (normally SSDs) for handling larger sets of small files. This also means that the "big data" RAID volumes don't need to worry about disk fragmentation.


Got it. Let me go the route with NTFS to "fix" the specific issue.

 

Almost all of my files are rather large, so this issue will hopefully not recur. The only exception may be my iTunes library with its many music files, so hopefully the disk holding that library does not get corrupted. The library is too large to move outside the array, though (I would like to keep UDs below 1TB).

 

One related question, which may be obvious. What to do with my "old" disk that had shown the 128 errors. Can I still use it by marking these sectors? Or is it a lost disk and I'll need to throw it away / recycle it?

What to do with my "old" disk that had shown the 128 errors. Can I still use it by marking these sectors? Or is it a lost disk and I'll need to throw it away / recycle it?

 

You can try preclearing to see if the pending sector(s) are remapped, but in my experience once a disk starts getting bad sectors it's much more likely to get more in the near future. I would possibly re-use it in a backup server.

 

 


File copy can report errors that are not related to the disk but are caused by access rights, file locking etc. So the number of errors from copying doesn't relate to the number of actual sector errors on the drive.

 

But when the SMART test reports errors, it's because a sector really can't be read. The sector may be physically damaged, in which case it must be remapped to a spare sector. But the disk can't do the remapping unless it knows the sector is broken and later sees a write to the sector (so it knows what new information to put into the sector).

 

The sector may also be physically okay, but some issue (power spike, mechanical bump etc) may have caused the write of the data to fail. If the disk knows what value to write, it may rewrite the same sector and then be happy with being able to read back the data correctly. So the SMART error can go away without the disk needing to remap the sector.

 

The sector may also be physically okay, but the head or head assembly can have issues that give the drive problems reading. This would then normally give very quick increases in the number of failed sectors, and the disk has to be replaced.

 

The sector (and possibly many other sectors) may be physically damaged because of manufacturing issues, because the drive head has crash-landed and damaged the surface, or maybe because something has happened to the lubrication that is intended to reduce the friction from the air turbulence when the head passes at extremely low distances from the surface. This is also likely to give increasing failures and you should consider replacing the drive.

 

So - if you see a disk that shows a very low number of bad sectors (in my view maybe 1-3) and then stops counting, it's likely that an extremely localized issue has happened, or that the original factory test before shipping failed to detect and remap some bad sectors. In the end, all new disks have some bad sectors that get remapped at the factory but aren't visible in the SMART data. I have had two disks that got one or two bad sectors and then ran for 3-5 years with no more issues. But a disk that has many broken sectors or shows an increasing number of bad sectors has to be replaced quickly - you are likely to get even more failures or a total inability to access the drive in the not too distant future.

 

If you have no warranty on the drive, and it has a low error count, you could possibly store some less important data on it (such as an additional backup copy), keep it powered, and regularly run a long SMART test for a couple of months - if the error counter isn't stable over the next three months, it's likely to continue to increase. Several reports show that there is a very strongly increased probability of a fatal disk error within three months after a drive has ticked up a bad sector counter.

 

This report is 10 years old so actual figures will probably have changed significantly because of changes in technology used in the drives, but the outcome then was a 39 times larger risk of disk failure within 60 days of the first SMART scan failure:

http://storagemojo.com/2007/02/19/googles-disk-failure-experience/
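
The rule of thumb in this post (a small, stable count of bad sectors can be watched; a large or growing count means replace) could be sketched roughly as below. This is only an illustration: the function name, the returned labels, and the way "growing" is detected are assumptions made here, and the limit of 3 is taken from the "maybe 1-3" remark above, not from any SMART standard.

```python
from typing import Sequence

# Assumption: "low" means at most 3 bad sectors, following the "maybe 1-3" remark.
LOW_COUNT_LIMIT = 3


def assess_disk(bad_sector_history: Sequence[int],
                low_count_limit: int = LOW_COUNT_LIMIT) -> str:
    """Classify a disk from bad-sector counts sampled over time (oldest first).

    Returns "healthy", "watch" or "replace". Hypothetical helper, not part of
    any SMART tool.
    """
    if not bad_sector_history:
        raise ValueError("need at least one reading")

    latest = bad_sector_history[-1]
    if latest == 0:
        return "healthy"

    # Many bad sectors: replace regardless of the trend.
    if latest > low_count_limit:
        return "replace"

    # Simplification: treat any growth after the first non-zero reading
    # as "still counting".
    first_nonzero = next(count for count in bad_sector_history if count > 0)
    if latest > first_nonzero:
        return "replace"

    # Low count that has stopped increasing: keep monitoring.
    return "watch"


if __name__ == "__main__":
    print(assess_disk([0, 2, 2, 2]))  # low and stable
    print(assess_disk([0, 1, 2, 3]))  # low but still growing
    print(assess_disk([0, 128]))      # many bad sectors
```

The readings would come from whatever counter you track between long SMART tests (for example pending or reallocated sectors); how often to sample is left open here.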

  • 1 month later...

Unfortunately, my brand new parity disk (WD 10TB) keeps failing on me. I tried to run an extended SMART test and believe that it completed without errors. I am not sure, though, as there may have been restarts or I may just be reading things wrong. May I ask if someone skilled can have a look at my diagnostic log to see whether the extended SMART test actually ran and completed? And whether there is anything suspicious in the log file that explains why the disk failed? Thanks for your help!

tower-diagnostics-20180104-0121.zip


True, I forgot.

 

I am quite sure that it will work once I swap the disk to the on-board controller. I have never had a disk fail on the on-board controller, while failures on the RAID controller happen quite frequently. However, I have never had a disk fail systematically every single time. So, something worse is going wrong here now that I am trying to use the 10TB disk for parity.

 

Are you suspecting that the controller cannot handle that much data volume?


I will give it a try and report back. I am near certain that the disk will not fail on the on-board controller, but this would not really solve my issue :-( I had already replaced the controller card before, so the card per se should not be the issue. I also changed the cabling. I am running out of options for getting it to work.

 

The "fix common problems" app quite often complains about my restarts/shutdowns not being "clean" and suggests a UPS. I could give this a try, but somehow doubt this is the issue as the on-board disks are never impacted. Unless a power shortage somehow impacts the controller card more?


One more observation. The parity build was previously running at 100-120k/sec. Since adding the 10TB parity disk, the build is running at only 50k/sec. Is this expected behavior or an indication that something went south?

 

I am currently running the parity build with the 10TB disk connected to the on-board controller. I should know in around 24-48 hours whether this will complete. I can already confirm that it is also only running at 50k/sec.


Yes, MB/s. No, not an extra disk, though the size of the disk has changed (i.e., 6TB->10TB). Is there any indication from the log of what could be the reason for the slower speed?

 

Rebuild will still be running for the next 2 days, but I am confident that it will finish this time as the parity is now on the on-board controller.
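
As a rough sanity check on the "next 2 days" estimate, the time to pass over a whole parity disk at the speeds mentioned in this thread can be worked out as below. The sketch assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant speed over the whole disk, which real drives do not deliver, so treat the result as a ballpark figure; the function name is made up for illustration.

```python
def rebuild_hours(disk_tb: float, speed_mb_s: float) -> float:
    """Hours needed to process a whole disk at a constant speed.

    Assumes decimal units: 1 TB = 1e12 bytes, 1 MB = 1e6 bytes.
    """
    if speed_mb_s <= 0:
        raise ValueError("speed must be positive")
    total_bytes = disk_tb * 1e12
    seconds = total_bytes / (speed_mb_s * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    # Speeds reported in the thread: 50 MB/s now, 100-120 MB/s before.
    for speed in (50, 100, 120):
        hours = rebuild_hours(10, speed)
        print(f"10 TB at {speed} MB/s: about {hours:.1f} hours")
```

At 50 MB/s a 10TB disk works out to roughly 55-56 hours, a little over two days, versus about a day at the earlier 100-120 MB/s.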

9 hours ago, steve1977 said:

I will give it a try and report back. I am near certain that the disk will not fail on the on-board controller, but this would not really solve my issue :-( I had already replaced the controller card before, so the card per se should not be the issue. I also changed the cabling. I am running out of options for getting it to work.

 

The "fix common problems" app quite often complains about my restarts/shutdowns not being "clean" and suggests a UPS. I could give this a try, but somehow doubt this is the issue as the on-board disks are never impacted. Unless a power shortage somehow impacts the controller card more?

 

If a disk drops out from a controller, then the OS has no way of performing a clean unmount of that file system. So the next boot will show log lines complaining about the state of the file system on these disks. So with a controller issue, it's expected if you get warnings about unclean restarts/shutdowns.


Got it, this is helpful. So, UPS may not be the silver bullet.

 

I realized that the reason for the decline in speed may be completely unrelated. I actually added a second GPU and thus had to move the controller card to a different slot. Potentially, the "new" slot is slower. Would also explain why it is exactly half of the transfer speed. Besides building the parity, how does this impact once the parity is built? I could consider getting rid of one GPU as I don't think my mobo has a third fast slot available. I have an ASUS X299-A (https://www.asus.com/de/Motherboards/PRIME-X299-A/). Although looking at the screenshot on the link, maybe there is?

 

I also suspect that this is unrelated to my drive errors, but let me report back on this once the parity is built (or not built). Thanks again for all the help from this forum.

2 hours ago, steve1977 said:

I actually added a second GPU and thus had to move the controller card to a different slot. Potentially, the "new" slot is slower. Would also explain why it is exactly half of the transfer speed.

If you're using the bottom x16 slot it will be slower.

 

2 hours ago, steve1977 said:

Besides building the parity, how does this impact once the parity is built?

It impacts parity checks, disk rebuilds and writes with turbo write enabled.


Got it, thanks. Does my GPU benefit from the "faster" lane or could I swap my GPU with the controller card? It is still building the parity and I am confident it will now complete, but it will take some days at 50 MB/s.

 

My sense is that my WD 10TB is not fully compatible with my M1015 and that the speed decrease is caused by the change in the lane. Fingers crossed that it completes and confirms this assessment.
