
Read Error Recovery Failure?



I am concerned that there might be an issue when a drive connected to an AOC-SASLP-MV8 controller reports a read error. In my case the drive is a 1.5TB Samsung.

 

My first hint of a problem came on the third pass of a preclear: something bad happened that caused it to fail completely. I did not capture any documentation that time, but just went on and ran another preclear that finished without error.

- reboot

- Added the drive to the array

- Filled it with data

- Started running a utility to verify that all files copied correctly

- During the file verify I got the following and the drive was disabled

May  7 12:54:37 Cyclops kernel: sas: command 0xf6786240, task 0xc39d7400, timed out: BLK_EH_NOT_HANDLED

May  7 12:54:37 Cyclops kernel: sas: Enter sas_scsi_recover_host

May  7 12:54:37 Cyclops kernel: sas: trying to find task 0xc39d7400

May  7 12:54:37 Cyclops kernel: sas: sas_scsi_find_task: aborting task 0xc39d7400

May  7 12:54:37 Cyclops kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 1701:mvs_abort_task:rc= 5

May  7 12:54:37 Cyclops kernel: sas: sas_scsi_find_task: querying task 0xc39d7400

May  7 12:54:37 Cyclops kernel: /usr/src/sas/trunk/mvsas_tgt/mv_sas.c 1645:mvs_query_task:rc= 5

May  7 12:54:37 Cyclops kernel: sas: sas_scsi_find_task: task 0xc39d7400 failed to abort

May  7 12:54:37 Cyclops kernel: sas: task 0xc39d7400 is not at LU: I_T recover

May  7 12:54:37 Cyclops kernel: sas: I_T nexus reset for dev 0100000000000000

May  7 12:54:37 Cyclops kernel: sas: I_T 0100000000000000 recovered

May  7 12:54:37 Cyclops kernel: sas: --- Exit sas_scsi_recover_host

May  7 12:54:37 Cyclops kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00

May  7 12:54:37 Cyclops kernel: ata2: status=0x41 { DriveReady Error }

May  7 12:54:37 Cyclops kernel: ata2: error=0x04 { DriveStatusError }

May  7 12:54:37 Cyclops kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00

May  7 12:54:37 Cyclops kernel: ata2: status=0x41 { DriveReady Error }

May  7 12:54:37 Cyclops kernel: ata2: error=0x04 { DriveStatusError }

May  7 12:54:37 Cyclops kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00

May  7 12:54:37 Cyclops kernel: ata2: status=0x41 { DriveReady Error }

May  7 12:54:37 Cyclops kernel: ata2: error=0x04 { DriveStatusError }

May  7 12:54:37 Cyclops kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00

May  7 12:54:37 Cyclops kernel: ata2: status=0x41 { DriveReady Error }

May  7 12:54:37 Cyclops kernel: ata2: error=0x04 { DriveStatusError }

May  7 12:54:37 Cyclops kernel: ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00

May  7 12:54:37 Cyclops kernel: ata2: status=0x41 { DriveReady Error }

May  7 12:54:37 Cyclops kernel: ata2: error=0x04 { DriveStatusError }

May  7 12:54:37 Cyclops kernel: sd 2:0:1:0: [sdc] Result: hostbyte=0x00 driverbyte=0x08

May  7 12:54:37 Cyclops kernel: sd 2:0:1:0: [sdc] Sense Key : 0xb [current] [descriptor]

May  7 12:54:37 Cyclops kernel: Descriptor sense data with sense descriptors (in hex):

May  7 12:54:37 Cyclops kernel:        72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00

May  7 12:54:37 Cyclops kernel:        00 00 00 c0

May  7 12:54:37 Cyclops kernel: sd 2:0:1:0: [sdc] ASC=0x0 ASCQ=0x0

May  7 12:54:37 Cyclops kernel: sd 2:0:1:0: [sdc] CDB: cdb[0]=0x28: 28 00 72 d6 21 1f 00 02 00 00

May  7 12:54:37 Cyclops kernel: end_request: I/O error, dev sdc, sector 1926635807

May  7 12:54:37 Cyclops kernel: md: disk3 read error

May  7 12:54:37 Cyclops kernel: handle_stripe read error: 1926635744/3, count: 1

- The previous line repeated several times. Over 150 read errors were reported in the stats screen.

- There were a few more instances of the sense data and disk3 read error messages

- Over the next few hours thousands of these messages were logged

May  7 17:30:47 Cyclops kernel: ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00

May  7 17:30:47 Cyclops kernel: ata2: status=0x51 { DriveReady SeekComplete Error }

May  7 17:30:47 Cyclops kernel: ata2: error=0x04 { DriveStatusError }

- Sprinkled in here and there were these

May  7 17:18:28 Cyclops kernel: mdcmd (15046): spindown 3

May  7 17:18:28 Cyclops kernel: md: disk3: ATA_OP_STANDBYNOW1 ioctl error: -5

- All of the above is from the 5/7 log

- rebooted

- enabled the drive and used the trust my array procedure to restart

- got a handful of parity updates right at the beginning of the parity check (as expected)

- about 3 hours into the parity check pretty much the same thing happened, except this time over 500 read errors were reported in the status display

- See 5/8 log

- rebooted

- removed problem drive from array

- reset the array and rebuilt parity from the remaining drives (I still had not deleted the source files)

- two of the drives in the array were still empty, so I moved one of them to the physical port the problem drive had been plugged into

- packed the new drive full of data, ran file and parity verifies without any problems

 

Now I am nearly certain that this is not a cabling or power problem.

 

Looking at the SMART data for the problem drive I currently see:

187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      4

Right after having the above problems, 197 Current_Pending_Sector showed a count of 2. I then moved the drive to a motherboard port and ran another preclear on it. Now I see:

5 Reallocated_Sector_Ct  0x0033  100  100  010    Pre-fail  Always      -      0

197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0

So it does not look like the read errors resulted in a sector reallocate.
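For reference, a quick way to watch just those attributes from the command line (a minimal sketch, assuming smartctl from smartmontools is available and that the drive is still /dev/sdc as in the syslog above; substitute the actual device name):

# print only attributes 5, 187 and 197 for the suspect drive
smartctl -A /dev/sdc | egrep "Reallocated_Sector_Ct|Reported_Uncorrect|Current_Pending_Sector"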

 

Conclusions:

 

- No cabling or power problems

- The problem drive reported 4 errors while all this was going on

- When an error was reported something (the driver?) went bonkers

- unRAID read error recovery did not work as expected

 

To correct the read errors I had to remove the drive from the array and use preclear. My expectation was that when unRAID encountered a read error during normal processing or a parity check it would use the other drives in the array to reconstruct the missing data and write it back to the block that could not be read, thus allowing the drive to do its sector reallocation if needed. unRAID might have been going through the motions, but maybe the write was not getting to the drive.

 

On the bright side, unRAID did a perfect job of simulating the inaccessible data. The file verifies I was running completed without error, illustrating perfect data recovery. I am sure I could have let unRAID just rebuild the data onto a replacement drive, but I chose not to do that.
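For anyone wanting to run a similar check, here is a minimal sketch of one way such a file verify can be done with checksums (this is only an illustration, not the actual utility I used; /mnt/source and /mnt/disk3 are hypothetical paths):

# build a checksum manifest from the original copies
cd /mnt/source && find . -type f -exec md5sum {} + > /tmp/manifest.md5
# re-check the copies now living on the array and show only mismatches
cd /mnt/disk3 && md5sum -c /tmp/manifest.md5 | grep -v ': OK$'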

 

There may be flaws in my logic and conclusions, but there is no doubt that something bad happened, and it involved the only drive I have for which the 187 Reported_Uncorrect value is not 0. I have not yet added this drive back to the array, but will in the next week or so. Of course I don't know what would have happened if the hardware configuration were different. I suspect something related to the MV8 but don't have enough data to prove anything. All I have is a smoking gun, specifically 187 Reported_Uncorrect = 4.

syslog-2010-05-07.zip

syslog-2010-05-08.zip

Link to comment
My expectation was that when unRAID encountered a read error during normal processing or a parity check it would use the other drives in the array to reconstruct the missing data and write it back to the block that could not be read, thus allowing the drive to do its sector reallocation if needed. unRAID might have been going through the motions, but maybe the write was not getting to the drive.

Your expectation is correct.  The "write"  did get through to the disk, because if it did not, the disk would have immediately been taken out of service.   A single "write" failure will make the drive "red" (invalid) in the management console.

 

You incorrectly assumed the "read" failure was due to an un-readable sector on the physical disk.  If that had occurred, it would have shown itself in the syslog as a "media error", and an UNC (un-correctable) error type would have been reported.  

 

Instead, the error you saw was a "DriveStatusError" error.  

 

To the operating system, it is still a "read" error, and unRAID would indeed supply the contents of the sector needed by getting it from all the other disks and re-creating it.   You do not see any re-allocated sectors or sectors pending re-allocation because the drive itself did not have any un-readable sectors, and none marked for re-allocation.  It was that the drive controller just could not get to the disk.

 

Joe L.

Link to comment
A single "write" failure will make the drive "red" (invalid) in the management console.

Agreed, to the extent the driver is reporting it correctly to the OS.

 

incorrectly assumed the "read" failure was due to an un-readable sector on the physical disk.

The SMART doc I read is pretty clear that "187 Reported_Uncorrect = 4" indicates the drive reported 4 unrecoverable read errors.

 

You do not see any re-allocated sectors or sectors pending re-allocation

Yes I did. Immediately after having this problem SMART showed "197 Current_Pending_Sector = 2". This only changed back to zero after I took the drive out of the array and ran a preclear on it. I understand that just because a sector is marked for pending re-allocation does not necessarily mean it will be reallocated. The drive decides what to do on the next write to that sector.

 

Also note that on both 5/7 and 5/8 the problem started at about the same place on the drive and everything was just fine prior to these errors.

May  7 12:54:37 Cyclops kernel: end_request: I/O error, dev sdc, sector 1926635807

May  8 06:55:12 Cyclops kernel: end_request: I/O error, dev sdc, sector 1926635783

I don't believe that is a coincidence. I suppose the bad behavior could be a controller firmware problem, but that would be very discomforting. I do know that by replacing the drive with one that has never reported a sector read error I have gotten around the problem. I am concerned about what will happen the next time any one of my drives connected to the MV8 reports an unreadable sector. Maybe I will just have to wait and see and hope it does not happen for a long time. I fear that one read error will result in the drive being disabled (red, invalid). This might be the price I pay for using a controller that has only been supported for a month or two.
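A direct read of those two sectors from the raw device would show whether the drive itself can still return them (a minimal sketch, not something from this test; it assumes the drive is out of the started array and is still addressed as /dev/sdc):

# read each failing sector once, bypassing the page cache
dd if=/dev/sdc of=/dev/null bs=512 skip=1926635807 count=1 iflag=direct
dd if=/dev/sdc of=/dev/null bs=512 skip=1926635783 count=1 iflag=direct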

Link to comment

A single "write" failure will make the drive "red" (invalid) in the management console.

Agreed, to the extent the driver is reporting it correctly to the OS.

 

incorrectly assumed the "read" failure was due to an un-readable sector on the physical disk.

The SMART doc I read is pretty clear that "187 Reported_Uncorrect = 4" indicates the drive reported 4 unrecoverable read errors.

 

You do not see any re-allocated sectors or sectors pending re-allocation

Yes I did. Immediately after having this problem SMART showed "197 Current_Pending_Sector = 2". This only changed back to zero after I took the drive out of the array and ran a preclear on it. I understand that just because a sector is marked for pending re-allocation does not necessarily mean it will be reallocated. The drive decides what to do on the next write to that sector.

 

Also note that on both 5/7 and 5/8 the problem started at about the same place on the drive and everything was just fine prior to these errors.

May  7 12:54:37 Cyclops kernel: end_request: I/O error, dev sdc, sector 1926635807

May  8 06:55:12 Cyclops kernel: end_request: I/O error, dev sdc, sector 1926635783

I don't believe that is a coincidence. I suppose the bad behavior could be a controller firmware problem, but that would be very discomforting. I do know that by replacing the drive with one that has never reported a sector read error I have gotten around the problem. I am concerned about what will happen the next time any one of my drives connected to the MV8 reports an unreadable sector. Maybe I will just have to wait and see and hope it does not happen for a long time. I fear that one read error will result in the drive being disabled (red, invalid). This might be the price I pay for using a controller that has only been supported for a month or two.

I did not know the prior history of the drive.  You are correct, too much of a coincidence.  Also true that if a sector marked for re-allocation can be written (and re-read) it will not be re-allocated.  It simply meant it was not written properly the first time.

 

You can, at any time, simulate a "read" or "write" error.

 

Just enter at the command line:

/root/mdcmd rderror N

or

/root/mdcmd wrerror N

 

(where N = 0 through 19. 0=parity drive, 1=disk1, etc)

The next read or write to the given drive will act as if a read or write error  occurred in the disk driver.

 

I might suggest doing this on a disk you can play with (no critical files).
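A minimal sketch of one way to watch the result while trying this (it assumes the standard /var/log/syslog location; note that the test later in this thread invoked the command as "/root/mdcmd set rderror 4", so the exact syntax may differ by release):

# in one telnet session, watch the syslog
tail -f /var/log/syslog
# in a second session, arm a simulated read error on disk4 (example disk number)
/root/mdcmd rderror 4
# then read a file from disk4 and watch for "md: disk4 read error" in the syslog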

Link to comment
/root/mdcmd rderror N

Thanks for the tip. I will figure out some way to safely give that a try.

 

act as if a read or write error  occurred in the disk driver

A literal interpretation suggests that this will exercise the error logic in the OS and unRAID layers, not in the driver itself. If that is the case and the driver itself is actually causing the problem, the simulated error will probably be recovered just fine. I'll provide an update when I can give it a try, but it will be several days. I am thinking I will boot a 3-drive free version of unRAID and create a test array.

 

Also, the fact that "197 Current_Pending_Sector = 2" while the drive was in the array and only accessed via unRAID is not a good sign. I would expect pending re-allocates to always be cleared by properly functioning error recovery.

Link to comment

I decided to take a chance and test the simulated read error with my primary array. Here are the results.

 

Before:

Disk status  Model / Serial No. Temperature Size Free Reads Writes Errors

parity SAMSUNG_HD203WI_S1UYJ1KZ402326 22°C 1,953,514,552 - 40 17 0

disk1 SAMSUNG_HD154UI_S1XWJDWZ203486 18°C 1,465,138,552 42,088 58 5 0

disk2 SAMSUNG_HD154UI_S1XWJDWZ203489 17°C 1,465,138,552 72,340 56 5 0

disk3 SAMSUNG_HD154UI_S1XWJ1KZ104867 19°C 1,465,138,552 101,888,144 58 5 0

disk4 SAMSUNG_HD154UI_S1XWJ1KS912511 18°C 1,465,138,552 1,465,060,992 50 5 0

 

Telnet session:

Cyclops login: root

Linux 2.6.32.9-unRAID.

root@Cyclops:~# /root/mdcmd set rderror 4

cmdOper=set

cmdResult=ok

root@Cyclops:~#

 

Syslog:

May 16 10:14:52 Cyclops in.telnetd[2203]: connect from 192.168.0.21 (192.168.0.21)

May 16 10:14:55 Cyclops login[2204]: ROOT LOGIN on `pts/0' from `192.168.0.21'

May 16 10:15:43 Cyclops kernel: mdcmd (59): set rderror 4

May 16 10:15:43 Cyclops kernel:

May 16 10:17:16 Cyclops kernel: md: disk4 read error

May 16 10:17:16 Cyclops kernel: handle_stripe read error: 262144/4, count: 1

 

After:

Disk status  Model / Serial No. Temperature Size Free Reads Writes Errors

parity SAMSUNG_HD203WI_S1UYJ1KZ402326 22°C 1,953,514,552 - 47 28 0

disk1 SAMSUNG_HD154UI_S1XWJDWZ203486 18°C 1,465,138,552 42,088 59 5 0

disk2 SAMSUNG_HD154UI_S1XWJDWZ203489 18°C 1,465,138,552 72,340 57 5 0

disk3 SAMSUNG_HD154UI_S1XWJ1KZ104867 19°C 1,465,138,552 101,888,144 59 5 0

disk4 SAMSUNG_HD154UI_S1XWJ1KS912511 18°C 1,465,138,552 1,465,060,992 56 17 1

 

Thoughts and conclusions:

1. unRAID's handling of the simulated read error is perfect

2. There is a serious problem with something relating specifically to the AOC-SASLP-MV8 controller. If a Samsung HD154UI reports just one sector read error, the result is a barrage of hundreds of nasty error messages that eventually lead to the drive being disabled. It is entirely possible that the barrage of error messages is the result of continued attempts to access the drive that fail, not because of a hardware problem, but rather because of software driver problems, up until the drive is finally disabled. The unRAID write to the drive apparently does not get through, because the drive is left with a pending re-allocate.

3. Another way to state the problem is that when one sector read error is reported by the drive, the driver loses the ability to communicate correctly with the drive.

4. Even though I am using Samsung drives, my guess is the symptoms would be the same for any drive connected to an AOC-SASLP-MV8. Since there is no way to force the hard drive to report a sector read error, it seems we need to wait for read errors to happen naturally to collect more evidence that supports or refutes my conclusions.

 

Does anyone have drives connected to an AOC-SASLP-MV8 that have had read errors reported in the management console or syslog?

Link to comment

Archived

This topic is now archived and is closed to further replies.
