
Disk Read Errors / SMART Errors (Solved)


AnyColourYouLike


Hey guys, I'm not so good at troubleshooting my drives. I was manually copying some files from /user to /user0 (might be a bad thing to do?), trying to move things from a seeding folder (cache) to my array for storing/streaming. Then suddenly I started getting notifications on my dashboard for read errors... hundreds, then thousands. It's sitting at 6000+ now. The copy itself didn't fail or show errors. I stopped it when I noticed, but the read errors kept going.

 

That's when I stopped the array, started it again in maintenance mode, and ran an extended SMART test, and that's where I am.

 

Any advice? The drive in question is #10. 

 

 

Link to comment

I started moving some things off the problem drive onto another. Fortunately this disk was only recently added to the array after being used in another build, so it's not very full. Most directories are moving without any issue, except the one I was in the process of moving when I first discovered the problem.

 

Though now it is showing reallocated sectors for the first time... This drive must be on its way out fast. I have a drive precleared and ready to rebuild, I just hope everything is good with my parity and there aren't further corruptions. (it's been about a month since a parity check)

Link to comment

Hey I didn't want to make a new thread after having this one so recently. My array is in dire straits right now, I don't know what's happening. 

 

I was planning to remove and replace the drive I asked about in the OP, but then out of nowhere an older drive completely disconnected on me (TOSHIBA_MD04ACA500_Y593KDFZFS9A). There really wasn't any warning with this disk, except that its sister drive, bought on the same date, recently failed as well. So I didn't question the validity of this failure and immediately replaced it with a new 8TB I had precleared and ready to go.

 

Now this is where things get scary. Knowing the drive from the OP was likely on its way out, I watched, biting my nails, as the replacement attempted to rebuild. About 60% into its rebuild, a completely DIFFERENT drive, one of my newest ones, completely disconnected without warning. So here I was, deep into a 24hr rebuild, and I had just lost a drive and had another one that showed read errors and reallocated sectors the day before (OP)... If that wasn't bad enough, there is now ANOTHER drive that started throwing read errors and has a pending sector with multiple UDMA CRC errors where there were none before.

 

I'm freaking out right now. This is a nightmare come true. 

 

Disk 4: New 8TB completely disconnects no warning

Disk 7: A freshly rebuilt 8TB drive (it did finish) albeit with 128 errors (?)

Disk 8: The disk that started throwing multiple errors during the rebuild, including a pending sector

Disk 10: The aforementioned disk in OP that I managed to empty but was reallocating sectors and throwing read errors over the past day.

Where do I go from here? I have the array stopped. I ordered another 8TB replacement drive as well as new mini-SAS to SATA cables for my 9207-8i and additional SATA cables for the 4 additional ports I use from my motherboard. Could this all be symptoms of faulty cables? The Marvell controller on the motherboard, or the 9207-8i? I was really not prepared for this many things to snowball at once.

breathingbag.gif

tower-diagnostics-20190801-1941.zip

 

Link to comment
3 hours ago, AnyColourYouLike said:

Marvell controller on the motherboard, or the 9207-8i? I was really not prepared for this many things to snowball at once.

It doesn't seem controller-related, and the disks aren't connected to the Marvell controller. Do you run periodic health checks on all disks? If you did, you probably would not be in the current multiple-disk-fault state.

Lastly, how long have the failing disks been in use? To check whether a disk itself is at fault, you can run an extended SMART test on it.
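
For anyone who wants to script that kind of periodic health check, here is a minimal Python sketch. It assumes you already have the text of the attribute table that `smartctl -A` prints for a disk and only pulls out the three attributes mentioned in this thread (reallocated sectors, pending sectors, UDMA CRC errors). The function names and the sample values are made up for illustration, and this is not how Unraid's own notifications work.

```python
import re

# Attribute names as they appear in the `smartctl -A` table; these are
# the three counters discussed in this thread.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "UDMA_CRC_Error_Count")


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for the watched attributes."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows look like:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            match = re.match(r"\d+", fields[9])  # leading digits of the raw value
            if match:
                found[fields[1]] = int(match.group())
    return found


def nonzero_attributes(attrs):
    """Names of watched attributes whose raw value is above zero."""
    return sorted(name for name, value in attrs.items() if value > 0)


# Illustrative sample only; the numbers are invented, not from this array.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       12
"""

if __name__ == "__main__":
    print(nonzero_attributes(parse_smart_attributes(SAMPLE)))
```

Any attribute that shows up in the result is worth a closer look. A rising UDMA CRC count usually points at cabling or connections rather than the disk itself, while reallocated and pending sectors point at the disk.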

Link to comment
21 hours ago, Benson said:

It doesn't seem controller-related, and the disks aren't connected to the Marvell controller. Do you run periodic health checks on all disks? If you did, you probably would not be in the current multiple-disk-fault state.

Lastly, how long have the failing disks been in use? To check whether a disk itself is at fault, you can run an extended SMART test on it.

I don't make it a habit to ignore disk errors. As for periodic health checks, I haven't scheduled any extended SMART tests or anything out of the ordinary. I noticed these recent errors through notifications on my main page, and they have only been present for the past few days while I assess my options.

 

21 hours ago, johnnie.black said:

Disk4 looks fine, likely a connection problem.

Disk8 is starting to fail, you should run an extended test.

Disk 8 passed an extended SMART test but the pending sector is what has me worried. 

 

I'm about to power things down and check all connections. I've done this recently, though, which is why I ordered new SATA cables for all connections. Those will be here tomorrow. I just hope you are right about disk4.

 

As for disk8 and disk10, I just purchased replacements for both and will be replacing them. I just don't know which to do first, considering the likelihood of multiple failures during a rebuild. What happens if things go belly up during a rebuild?

 

For what it's worth (not lost data, obviously), all my drives except the oldest 3TB are under warranty. I'd like to address all errors and RMA the affected drives ASAP.

 

What would be my best course of action going forward? The dual parity has gotten me this far (thank jebus), but I still feel like I'm flying too close to the sun here. Also, what are the consequences of parity errors? Is that a sign of corrupted files that I'll never really be sure about?

 

Thank you all for reading my ramblings. I've used Unraid for years and have obviously stumbled my way through it thus far.

 

 

*EDIT* I moved disk4 to another bay and restarted the array, still shows as disconnected. No errors, nothing to explain that one.

 

*Edit2* I screwed up and unassigned disk4 momentarily, now it detects it but won't start the array unless it rebuilds it. I am terrible at this. There doesn't seem to be anything wrong with the disk but I really don't want to put the array through the stress of rebuilding it when 8TB of data sits just fine on the disk as it is. 

WDC_WD40EFRX-68WT0N0_WD-WCC4E2CLT71C-20190802-1942.txt

Link to comment
3 hours ago, AnyColourYouLike said:

Disk 8 passed an extended SMART test but the pending sector is what has me worried. 

Since the extended test passed that pending sector is a "false positive" and can be ignored for now, but that disk will likely fail again sooner rather than later.

 

3 hours ago, AnyColourYouLike said:

What would be my best course of action going forward?

Post new diags so we can see current array status.

Link to comment
5 minutes ago, johnnie.black said:

Since the extended test passed that pending sector is a "false positive" and can be ignored for now, but that disk will likely fail again sooner rather than later.

 

Post new diags so we can see current array status.

I have two new 8TBs I plan to start preclearing overnight. The previous disk4 is unassigned at the moment. I moved any critical data off of disk8. I will be getting new SATA cables in the mail tomorrow, and another 8TB will arrive Monday. I'd like to remove and send in the 4TBs with errors, as they are under warranty until October.

tower-diagnostics-20190802-2352.zip

Link to comment

If no writes were made to the emulated disk you could re-enable disk4, but it would require a parity sync since a few sync errors are expected, and that would cause the same stress as a disk rebuild. If you have a spare I would just rebuild to it and keep old disk4 intact in case something goes wrong during the rebuild.

 

5 hours ago, AnyColourYouLike said:

Also, what are the consequences of parity errors? Is that a sign of corrupted files that I'll never really be sure about?

I forgot to reply to this earlier. If there were multiple errors during a rebuild, the rebuilt disk will have some corruption, but I would need the diags from that rebuild to confirm. As to which files are/were affected, you'd need to have checksums to check them all.

Link to comment
7 hours ago, johnnie.black said:

This assuming there's no reason to think parity isn't 100% valid, if there are doubts might be best to re-enable the disk.

 

I guess I'm not sure how to know if it is 100% valid. Also, I don't think the emulated disk would have been written to, unless starting the array and promptly stopping docker resulted in incidental writes. When you say it could be re-enabled you are referring to a new config while trusting parity? I've had the array started and disks nearby have 5 writes showing since a reboot, could that have also happened to the emulated disk and would that be enough to not go the re-enable route?

 

 

8 hours ago, johnnie.black said:

I forgot to reply to this earlier. If there were multiple errors during a rebuild, the rebuilt disk will have some corruption, but I would need the diags from that rebuild to confirm. As to which files are/were affected, you'd need to have checksums to check them all.

 

The diagnostics on my third post should be right after the disk7 was rebuilt "successfully" with 128 errors. As far as checksums go that is something I would have to have been prepared for correct? Is there an easy way to catalog files and their checksums for the future? I guess I'm not sure how one compares files once they have been replaced by possibly corrupt versions.

 

Looking at my parity check history these aren't my first string of errors. I guess I always assumed these errors were being corrected and that was the point of doing parity checks regularly. 

Parity.PNG

parity2.PNG

Link to comment
On 8/3/2019 at 6:02 PM, AnyColourYouLike said:

When you say it could be re-enabled you are referring to a new config while trusting parity? I've had the array started and disks nearby have 5 writes showing since a reboot, could that have also happened to the emulated disk and would that be enough to not go the re-enable route?

A small number of writes is normal when mounting the disks. You could still do a new config, but as mentioned you'd need to run a correcting check to fix the few expected errors.

 

On 8/3/2019 at 6:02 PM, AnyColourYouLike said:

The diagnostics on my third post should be right after the disk7 was rebuilt "successfully" with 128 errors.

That was a very tricky rebuild. First you had some read errors on disk10, and dual parity saved you there, but then disk4 got disabled, so dual parity couldn't help when there were also read errors on disk8. Every time you see this in the log:

 

Aug  1 03:06:53 Tower kernel: md: recovery thread: multiple disk errors, sector=4510123224

 

is Unraid speak for "there are errors in more disks than current redundancy can correct, the rebuild/sync will continue but there will be some (or a lot) of corruption."
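
To get a feel for how widespread the damage is, you could count those lines in the syslog from the diagnostics. Below is a minimal Python sketch that does this. The pattern is taken from the line quoted above, while the function name is made up for illustration, and you would have to read the lines from the saved syslog yourself.

```python
import re

# Pattern copied from the syslog message quoted above.
PATTERN = re.compile(r"md: recovery thread: multiple disk errors, sector=(\d+)")


def multiple_error_sectors(log_lines):
    """Return the sector numbers from every 'multiple disk errors' line."""
    sectors = []
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            sectors.append(int(match.group(1)))
    return sectors


if __name__ == "__main__":
    lines = [
        "Aug  1 03:06:53 Tower kernel: md: recovery thread: multiple disk errors, sector=4510123224",
    ]
    sectors = multiple_error_sectors(lines)
    print(len(sectors), "sector(s) the rebuild could not reconstruct:", sectors)
```

The list only tells you which array sectors were affected, not which files live on them, so checksums are still needed to find the damaged files.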

 

On 8/3/2019 at 6:02 PM, AnyColourYouLike said:

As far as checksums go that is something I would have to have been prepared for correct? Is there an easy way to catalog files and their checksums for the future?

Yes, you'd need to already have checksums (using for example file integrity plugin or Corz) or be using btrfs.
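
To illustrate the general idea of cataloging checksums ahead of time, here is a minimal Python sketch using SHA-256 from the standard library. It is not how the file integrity plugin or Corz work internally. The function names are made up for illustration, and the manifest is kept in memory, so you would still need to save it somewhere off the disk being checked.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large media files don't fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            manifest[os.path.relpath(full_path, root)] = file_sha256(full_path)
    return manifest


def changed_files(root, manifest):
    """Files from the old manifest that are now missing or hash differently."""
    current = build_manifest(root)
    return sorted(path for path, digest in manifest.items()
                  if current.get(path) != digest)
```

The manifest has to be built while the files are known to be good. After a rebuild with errors, `changed_files` run against the rebuilt disk would list the files whose contents no longer match.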

 

Link to comment
18 hours ago, johnnie.black said:

That was a very tricky rebuild. First you had some read errors on disk10, and dual parity saved you there, but then disk4 got disabled, so dual parity couldn't help when there were also read errors on disk8. Every time you see this in the log...

 

I went ahead and used one of my spares to rebuild disk 4. That finished with 0 errors, though disk 10 did throw read errors, so parity helped me through that one as well. I'm rebuilding disk 10 now, so it's no longer a factor. The other disk with errors, disk 8, seems to be doing just fine, but I'll be pulling it after this next rebuild.

 

I'll be looking into the file integrity plug-in from here on out, this has been quite the ordeal. (just realized I already had it installed just never set it up)

 

I really appreciate the help and guidance you have provided. I will go ahead and mark this post as solved. It's been a bumpy ride but I think I have survived with minimal damage. I guess I'll just have to take a closer look at the disk I rebuilt that had errors. Thankfully nothing I have is really critical data, I'm just a massive hoarder of archival media from my childhood. This might give me an excuse to actually watch it and slow down the collecting ;)

 

Thanks guys. 

Link to comment

Archived

This topic is now archived and is closed to further replies.
