
Disk disabled, content emulated (DRDY ERR ICRC ABRT)


stealth82


Hello, I think this could be my first disk dying of old age, but I wanted some confirmation from the experts here.

Today I was manually copying data with mc from my cache drive to /mnt/disk2 when a notification promptly came in on my iPhone. It was unRAID telling me something was wrong...

 

Now disk2 is emulated, and I tried to check the SMART results to see what happened. Problem is... it says the disk is unavailable and can't be spun up for diagnostics.

 

I checked the syslog, which I attached, and looked up the two errors I saw there: DRDY ERR and ICRC ABRT.

 

They should be, respectively:

 

Drive media issue #1: These are almost always associated with bad sectors.

Drive media issue #2: a pretty good indicator of a poor quality SATA cable
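Errors like these can be filtered out of a syslog with a quick grep. A minimal sketch, with illustrative stand-in lines rather than lines copied from the attached log (the real file is in the diagnostics zip):

```shell
# Sample ATA error lines of the kind the kernel logs for a failing
# link or drive; NOT copied from the actual syslog in this thread.
cat > /tmp/syslog_sample <<'EOF'
kernel: ata5.00: failed command: WRITE FPDMA QUEUED
kernel: ata5.00: status: { DRDY ERR }
kernel: ata5.00: error: { ICRC ABRT }
kernel: mdcmd (59): spindown 2
EOF
# Pull out just the two signatures discussed above.
grep -E 'DRDY ERR|ICRC ABRT' /tmp/syslog_sample
```

Run against the real syslog, this shows at a glance how many commands failed and on which ata port, which helps tell a one-off cable hiccup from a steady stream of errors.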

 

Now the last one made me think. A few weeks ago I bought a Supermicro AOC-SASLP-MV8 controller and two Mini-SAS (SFF-8087) to 4x SATA forward breakout cables. Until a few moments ago, though, I had no issues whatsoever.

 

Is it possible that just one sub-cable out of 4 is bad?

Should I be worried about the cabling, or could the cause simply be the disk's old age?

I say old age because it has shouldered 4y, 6m, 9d, 14h of service so far (I read that stat from its sibling; I have 2 disks bought in the same period).

 

A new 4TB drive is on the way now, and I will have to go through a parity swap procedure when it arrives. Any suggestions before getting into that, or should I just give up on the old disk?

tower-diagnostics-20151130-1731.zip

Link to comment

Are you sure? If it reads "A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.", is that a sign it looks OK?

 

WDC_WD20EARS-00MVWB0_WD-WMAZA0747093-20151130-1731.txt

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.1.13-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               /1:0:1:0
Product:              
User Capacity:        600,332,565,813,390,450 bytes [600 PB]
Logical block size:   774843950 bytes
Physical block size:  1549687900 bytes
Lowest aligned LBA:   14896
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
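For anyone hitting the same message: this is the retry smartmontools is suggesting, spelled out. `/dev/sdX` is a placeholder for the failed disk's device node, and on a SAS HBA like the AOC-SASLP-MV8 forcing the SAT passthrough type may also be needed (the garbage "Vendor: /1:0:1:0" output suggests smartctl fell back to treating the drive as SCSI). The commands are only printed here, not executed:

```shell
# The '-T permissive' flag tells smartctl to continue past a failed
# mandatory SMART command; '-d sat' forces the SCSI-to-ATA translation
# layer, which some HBAs need before SMART data comes through at all.
printf '%s\n' \
  'smartctl -a -T permissive /dev/sdX' \
  'smartctl -a -d sat -T permissive /dev/sdX'
```

If both still return nonsense capacity and block-size values like the ones above, the drive itself is most likely not answering sanely.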

Link to comment


Sorry, my bad. I was looking at the SMART report for disk1. Replace the drive.

 

 

Link to comment

OK, I think the worst case scenario has just occurred.

I wanted to take the disk out, but since I had bought a SATA cage and rewired everything, I wanted to give the disk another try.

The disk came back online and reported no errors. I guess a wire really had come loose - it wasn't the disk; I can't come up with any other explanation.

 

Anyway I put it back into the array and unRAID started rebuilding it.

Some hours into the rebuild, the parity drive started throwing errors (843 in the errors column):

 

ID#	Attribute	Flag	Value	Worst	Thresh	Type	Updated	When failed	Raw value
187	Reported uncorrect	0x0032	017	017	000	Old age	Always	Never	83
197	Current pending sector	0x0012	100	099	000	Old age	Always	Never	128
198	Offline uncorrectable	0x0010	100	099	000	Old age	Offline	Never	128
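Pulled out mechanically, those three rows tell the story. A rough sketch, with the rows pasted into a here-doc as reported above: any nonzero raw value on attributes 187/197/198 means the parity drive has sectors it could not read back.

```shell
# $1 is the attribute ID, $NF the raw value (last whitespace field).
# Nonzero raw counts on 187/197/198 indicate unreadable sectors.
awk '$NF + 0 > 0 { printf "SMART %s: raw value %s (nonzero = bad sectors)\n", $1, $NF }' <<'EOF'
187 Reported uncorrect 0x0032 017 017 000 Old age Always Never 83
197 Current pending sector 0x0032 100 099 000 Old age Always Never 128
198 Offline uncorrectable 0x0010 100 099 000 Old age Offline Never 128
EOF
# flags all three attributes, e.g. "SMART 187: raw value 83 ..."
```

During a rebuild, every one of those unreadable parity sectors translates into a stripe that cannot be reconstructed correctly on the new disk.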

 

The disk that is being rebuilt is toast - its data can't be trusted, so I'm toast too. Am I right?  :'(

Link to comment


That disk should be replaced. Most likely the parity issues are a connection problem caused by your rewiring, since its SMART looked good in your diagnostics. Check your connections, remove the bad drive, and reboot. You should be able to see whether the data is being emulated. If so, you will be able to rebuild onto a new disk.
Link to comment

Unfortunately, I don't think so. The parity drive has always been attached directly to the motherboard with a cable I have no reason to doubt. The connection was, and is, solid.

 

That drive, though, had given me this very same error in the past. After that I put it under observation and ran a couple of preclears on it, and it seemed fine (I think somewhere under 100 reallocated sectors, but no further growth in pending sectors). I guess the best thing would have been to trash it rather than risk it... but I didn't have any disk to spare at the time.

 

Is there any way I can find out which sectors of the rebuilt drive were affected?

 

What I would like to do, if I can isolate the problem, is replace the parity drive with a new disk, but what you are saying makes me think I could try to rebuild again from the "faulty" parity drive. I really don't know what to do now.

Link to comment

I attached a new diagnostic file.

 

I'd really love to know if there's any way to track down whether the rebuild was affected - I think it was - and what data, if any, the bad sectors "landed" on. I say "if any" because the rebuilt disk is 75% full and the errors started appearing in the last 25% of the rebuild, I think. I don't know whether that might mean there were no files there, just empty space being rebuilt.

 

Any insight?

 

P.S. Why does unRAID consider the rebuilt disk OK, considering it knows there were read errors from the parity?

tower-diagnostics-20151203-1024.zip

Link to comment

I don't know much about testing btrfs disks for corruption or fixing them. You can try searching, but I don't think there is much documented in our forum or wiki. Maybe out in the wild wild web, where btrfs is more widely used, there is some documentation you could google.
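For anyone landing here with the same question: a minimal sketch of what a btrfs corruption check could look like, assuming disk2 is indeed btrfs-formatted and mounted at /mnt/disk2 as in this thread. `btrfs scrub` verifies all data and metadata against btrfs's own checksums, so it can report which files are corrupt. The commands are only echoed here, so nothing runs against a live array by accident:

```shell
# Dry run: print the commands rather than execute them.
# 'scrub start -B' runs in the foreground and prints a summary when done;
# 'scrub status' shows progress/results; 'device stats' shows cumulative
# read/write/corruption error counters for the device.
for cmd in \
  "btrfs scrub start -B /mnt/disk2" \
  "btrfs scrub status /mnt/disk2" \
  "btrfs device stats /mnt/disk2"
do
  echo "would run: $cmd"
done
```

If the scrub reports uncorrectable errors, the kernel log (dmesg) normally names the affected files, which would answer the "what data did the bad sectors land on" question above.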

Link to comment

Well, I don't know how to interpret this, but...

 

Inspired by this thread, I just gave a couple of more tries to the issue.

 

I ran a parity sync without corrections, and it finished just a few minutes ago. The reads/writes columns at the end showed 1883 errors, but apart from that, no sync errors?!? How should I interpret the 0 sync errors count? I don't know.

 

Anyway, I'm burying all this. A new parity disk is in the array now and the sync is in progress.

 

tower-diagnostics-20151206-1534.zip

Link to comment

Archived

This topic is now archived and is closed to further replies.
