
Disk1 red-balled for second time - interpretation, please!



For the second time in a few days, my disk1 has red-balled.

 

However, it is a different physical disk in a different physical slot in a different drive cage.  On the first occasion, it was a WD10EARS which failed as sdc.

 

On this occasion I had just inserted a drive into a hot-swap slot and started a pre-clear (sdh).  It appears that, seconds after inserting the drive for pre-clear, my disk1 (sdf), a WD20EARS, failed.  Is there a connection ... perhaps a problem with hot-swap?

 

The significant section of the syslog is:

Aug 17 23:31:08 Tower kernel: mdcmd (105): spindown 1
Aug 17 23:31:18 Tower kernel: mdcmd (106): spindown 0
Aug 17 23:31:18 Tower kernel: mdcmd (107): spindown 3
Aug 18 00:00:01 Tower logger: mover started
Aug 18 00:00:01 Tower logger: skipping */
Aug 18 00:00:01 Tower logger: mover finished
Aug 18 00:00:07 Tower kernel: scsi 0:0:4:0: Direct-Access     ATA      SAMSUNG HD502HJ  0001 PQ: 0 ANSI: 6
Aug 18 00:00:07 Tower kernel: scsi 0:0:4:0: SATA: handle(0x000d), sas_addr(0x4433221105000000), phy(5), device_name(0x0000000000000000)
Aug 18 00:00:07 Tower kernel: scsi 0:0:4:0: SATA: enclosure_logical_id(0x50030480041f4c00), slot(6)
Aug 18 00:00:07 Tower kernel: scsi 0:0:4:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Aug 18 00:00:07 Tower kernel: scsi 0:0:4:0: qdepth(32), tagged(1), simple(1), ordered(0), scsi_level(7), cmd_que(1)
Aug 18 00:00:07 Tower kernel: sd 0:0:4:0: [sdh] 976773168 512-byte logical blocks: (500 GB/465 GiB)
Aug 18 00:00:07 Tower kernel: sd 0:0:4:0: [sdh] Write Protect is off
Aug 18 00:00:07 Tower kernel: sd 0:0:4:0: [sdh] Mode Sense: 7f 00 00 08
Aug 18 00:00:07 Tower kernel: sd 0:0:4:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Aug 18 00:00:07 Tower kernel:  sdh: sdh1
Aug 18 00:00:07 Tower kernel: sd 0:0:4:0: [sdh] Attached SCSI disk
Aug 18 00:00:11 Tower kernel: sd 0:0:3:0: attempting task abort! scmd(f52963c0)
Aug 18 00:00:11 Tower kernel: sd 0:0:3:0: [sdf] CDB: cdb[0]=0x85: 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Aug 18 00:00:11 Tower kernel: scsi target0:0:3: handle(0x000c), sas_address(0x4433221104000000), phy(4)
Aug 18 00:00:11 Tower kernel: scsi target0:0:3: enclosure_logical_id(0x50030480041f4c00), slot(7)
Aug 18 00:00:13 Tower kernel: sd 0:0:3:0: task abort: SUCCESS scmd(f52963c0)
Aug 18 00:00:13 Tower kernel: mdcmd (108): spindown 1
Aug 18 00:03:21 Tower kernel:  sdh: sdh1
Aug 18 00:04:30 Tower kernel: sd 0:0:4:0: attempting task abort! scmd(f5aff600)
Aug 18 00:04:30 Tower kernel: sd 0:0:4:0: [sdh] CDB: cdb[0]=0x85: 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Aug 18 00:04:30 Tower kernel: scsi target0:0:4: handle(0x000d), sas_address(0x4433221105000000), phy(5)
Aug 18 00:04:30 Tower kernel: scsi target0:0:4: enclosure_logical_id(0x50030480041f4c00), slot(6)
Aug 18 00:04:30 Tower kernel: sd 0:0:4:0: task abort: SUCCESS scmd(f5aff600)
Aug 18 00:07:20 Tower kernel: sd 0:0:4:0: attempting task abort! scmd(f53a6240)
Aug 18 00:07:20 Tower kernel: sd 0:0:4:0: [sdh] CDB: cdb[0]=0x85: 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Aug 18 00:07:20 Tower kernel: scsi target0:0:4: handle(0x000d), sas_address(0x4433221105000000), phy(5)
Aug 18 00:07:20 Tower kernel: scsi target0:0:4: enclosure_logical_id(0x50030480041f4c00), slot(6)
Aug 18 00:07:20 Tower kernel: sd 0:0:4:0: task abort: SUCCESS scmd(f53a6240)
Aug 18 00:10:03 Tower mountd[1452]: authenticated mount request from 10.2.1.15:927 for /mnt/cache (/mnt/cache)
Aug 18 00:10:03 Tower mountd[1452]: authenticated mount request from 10.2.1.15:689 for /mnt/user/Movies (/mnt/user/Movies)
Aug 18 00:10:03 Tower mountd[1452]: authenticated mount request from 10.2.1.15:765 for /mnt/disk1 (/mnt/disk1)
Aug 18 00:10:03 Tower kernel: sd 0:0:3:0: [sdf] Device not ready
Aug 18 00:10:03 Tower kernel: sd 0:0:3:0: [sdf]  Result: hostbyte=0x00 driverbyte=0x08
Aug 18 00:10:03 Tower kernel: sd 0:0:3:0: [sdf]  Sense Key : 0x2 [current] 
Aug 18 00:10:03 Tower kernel: sd 0:0:3:0: [sdf]  ASC=0x4 ASCQ=0x2
Aug 18 00:10:03 Tower kernel: sd 0:0:3:0: [sdf] CDB: cdb[0]=0x28: 28 00 30 b6 7d 08 00 00 08 00
Aug 18 00:10:03 Tower kernel: end_request: I/O error, dev sdf, sector 817265928
Aug 18 00:10:03 Tower kernel: md: disk1 read error
Aug 18 00:10:03 Tower kernel: handle_stripe read error: 817265864/1, count: 1
Aug 18 00:10:11 Tower kernel: sd 0:0:3:0: [sdf] Device not ready
Aug 18 00:10:11 Tower kernel: sd 0:0:3:0: [sdf]  Result: hostbyte=0x00 driverbyte=0x08
Aug 18 00:10:11 Tower kernel: sd 0:0:3:0: [sdf]  Sense Key : 0x2 [current] 
Aug 18 00:10:11 Tower kernel: sd 0:0:3:0: [sdf]  ASC=0x4 ASCQ=0x2
Aug 18 00:10:11 Tower kernel: sd 0:0:3:0: [sdf] CDB: cdb[0]=0x2a: 2a 00 30 b6 7d 08 00 00 08 00
Aug 18 00:10:11 Tower kernel: end_request: I/O error, dev sdf, sector 817265928
Aug 18 00:10:11 Tower kernel: md: disk1 write error
Aug 18 00:10:11 Tower kernel: handle_stripe write error: 817265864/1, count: 1
Aug 18 00:10:11 Tower kernel: md: recovery thread woken up ...
Aug 18 00:10:11 Tower kernel: md: recovery thread has nothing to resync
Aug 18 00:11:31 Tower kernel: NTFS driver 2.1.29 [Flags: R/W MODULE].
Aug 18 00:15:58 Tower mountd[1452]: authenticated unmount request from 10.2.1.15:660 for /mnt/user/Movies (/mnt/user/Movies)
Aug 18 00:15:59 Tower mountd[1452]: authenticated unmount request from 10.2.1.15:665 for /mnt/disk1 (/mnt/disk1)
Aug 18 00:15:59 Tower mountd[1452]: authenticated unmount request from 10.2.1.15:669 for /mnt/cache (/mnt/cache)

 

The entire syslog is attached.

 

There is absolutely nothing untoward in the SMART report for that drive.

 

Is anyone able to interpret this, and tell me why the failure occurred?

 

I have gone through the process of stop/remove drive/start/stop/replace drive/start in order to rebuild disk1 on the same physical drive.  The rebuild has reached 7%, and all appears to be proceeding normally.
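In case it's useful to anyone else, I'm also watching the rebuild from the console rather than just the web page.  This is only a rough sketch of what I'm doing - it assumes unRAID's md driver exposes its status at /proc/mdstat, as it does on my build:

# show the unRAID md driver status, including rebuild/resync progress
cat /proc/mdstat

# or poll it every 30 seconds
while true; do cat /proc/mdstat; sleep 30; done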

syslog-2011-08-18.txt.zip

Link to comment

Well, the drive rebuild ran to completion without any apparent problems.

 

Because unRAID was reporting that the parity check had never been run, I ran a non-correcting parity check.  The last time I looked at the status on the web page, it was around 95% complete, with no errors reported.  The next time I looked, the parity check was no longer shown as running, but the drive had red-balled again and is now showing 576 errors.

 

The SMART report still looks very clean:

smartctl -a /dev/sdf (disk1)

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format) family
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WCAZA5985215
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu Aug 18 23:57:04 2011 SGT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
				was completed without error.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (39180) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   201   170   021    Pre-fail  Always       -       4925
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       176
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1606
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       35
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       33
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       912
194 Temperature_Celsius     0x0022   117   106   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
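Since the log shows that no self-tests have ever been run on this drive, I'll probably kick one off from the console as well.  These are just the standard smartctl invocations (assuming the drive is still assigned as /dev/sdf):

# short (~2 minute) and extended (several hour) offline self-tests
smartctl -t short /dev/sdf
smartctl -t long /dev/sdf

# read the results back afterwards
smartctl -l selftest /dev/sdf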

 

Syslog attached.

 

What are all the occurrences of:

Aug 18 17:28:41 Tower kernel: sd 0:0:2:0: attempting task abort! scmd(effa13c0)
Aug 18 17:28:41 Tower kernel: sd 0:0:2:0: [sde] CDB: cdb[0]=0x85: 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Aug 18 17:28:41 Tower kernel: scsi target0:0:2: handle(0x000b), sas_address(0x4433221102000000), phy(2)
Aug 18 17:28:41 Tower kernel: scsi target0:0:2: enclosure_logical_id(0x50030480041f4c00), slot(1)
Aug 18 17:28:41 Tower kernel: sd 0:0:2:0: task abort: SUCCESS scmd(effa13c0)

which start happening after the parity check is started?

 

Then it appears that, at 23:45:34, drive 1 is spun down, and a second later is reported 'not ready', followed by a series of read errors.  What is causing that?
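To get a feel for how widespread these events are, I pulled them straight out of the syslog with grep - nothing clever, and it assumes the live log is still at /var/log/syslog:

# count the task-abort events and see which SCSI targets they hit
grep -c "attempting task abort" /var/log/syslog
grep "attempting task abort" /var/log/syslog | awk '{print $7}' | sort | uniq -c

# pull out everything logged against disk1 (sdf) around the failure
grep sdf /var/log/syslog | less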

syslog-2011-08-19.txt.zip

Link to comment

Thanks for the reply.

 

Different physical drives, in different cages on different ports of the same adapter - but all data drives are on the same adapter and it is only disk1 which has failed.

 

I guess that I will have to move disk1 to a mobo port next.

 

Can anyone tell me what the 'attempting task abort' etc messages mean?  I've never seen them before.

Link to comment

Thanks for the reply.

 

Different physical drives, in different cages on different ports of the same adapter - but all data drives are on the same adapter and it is only disk1 which has failed.

 

I guess that I will have to move disk1 to a mobo port next.

 

Can anyone tell me what the 'attempting task abort' etc messages mean?  I've never seen them before.

Me neither.  Google it.
Link to comment

Thanks for the reply.

 

Different physical drives, in different cages on different ports of the same adapter - but all data drives are on the same adapter and it is only disk1 which has failed.

 

I guess that I will have to move disk1 to a mobo port next.

 

Can anyone tell me what the 'attempting task abort' etc messages mean?  I've never seen them before.

Me neither.  Google it.

 

Thanks for the suggestion, Joe.  If I google 'attempting task abort', I get lots of references to scsi devices, with suggestions that it indicates a hardware error.  If I add 'mptsas2' to the search string I get 'no results'.

 

However, if I google 'attempting task abort mptsas' I find suggestions that patches to the mptsas driver may resolve the issue.  Also, there are suggestions that it only affects WD drives when there is heavy I/O activity - the two drives that have exhibited the problem in my case are a WD10EARS and a WD20EARS.  I find it strange that I could go through a full rebuild of the 2TB drive and then 95% of a parity check before the problem became apparent.

 

I'll continue my search.

 

Edit

This sounds exactly like my situation, except that it relates to the mptsas driver, not mpt2sas.  However, I presume that the mpt2sas driver is heavily based on mptsas, so any faults may have been propagated through to the later driver.

 

Short of building my own kernel, is there any way that I can invoke the LSI-published driver, rather than the one supplied in the unRAID build?
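Before swapping anything, it would help to confirm exactly which mpt2sas version the stock unRAID kernel is using.  These are just standard commands; if the driver is built into the kernel rather than loaded as a module, modinfo and lsmod may show nothing:

# driver version as reported during boot
dmesg | grep -i mpt2sas | head

# module details, if mpt2sas was built as a loadable module
modinfo mpt2sas
lsmod | grep mpt2sas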

 

It's strange that the problem has only become apparent since I installed my 5in3 cages - perhaps the problem is related to timings, and use of the cages introduces some delays?

 

Edit 2

 

Again, the rebuild completed satisfactorily and the subsequent parity check failed.  This time it was disk3 (a 1TB Hitachi) that red-balled, 30% of the way through the parity check.

 

Okay, I still haven't exceeded the number of motherboard SATA ports.  I will move all my drives back to the mobo and try more parity checks - when the disk3 rebuild has completed.

 

I am coming to the conclusion that it is, indeed, a timing problem.  The problem is provoked by the I/O pattern during the parity check (long runs of consecutive reads).  It seems that the rebuild, with its read/write processing, doesn't provoke the problem.

 

The failures may also have something to do with mail arriving asynchronously and being written to the mailstore on disks 1 and 3, although there was no indication in the mail logs that mail had arrived at the time I first experienced the problem.

Link to comment
If I google 'attempting task abort', I get lots of references to scsi devices, with suggestions that it indicates a hardware error.  If I add 'mptsas2' to the search string I get 'no results'.

 

Oops ... I ought to type that correctly!  If I search for '"attempting task abort" mpt2sas', then I do get one hit.  Someone else reported the same messages, but there was no ensuing discussion.  I suspect it's the same problem that is afflicting the mptsas driver.

Link to comment
Okay, I still haven't exceeded the number of motherboard SATA ports.  I will move all my drives back to the mobo and try more parity checks - when the disk3 rebuild has completed.

 

As I expected, the parity check runs to completion, without any errors, when the drives are all connected to the motherboard ports, still using the iStarUSA 5in3 cages.

 

So, this truly does appear to be some interaction between HBA and cages - probably associated with some defect in the general Linux distribution of the mpt2sas driver.  I really need to try the LSI version of the driver.

Link to comment

Try using the HBA with a bare drive. I can't see how the cages could have any effect. They are just direct extensions of the cables.

 

Quite!  But I've been using that HBA for more than three months without any problems, with the same drives.  I installed the cages last week (Aug 12) and every parity check since then has failed with this "attempting task abort" problem.  The same failure has occurred in three different bays, in two different cages, with three different drives.  That would suggest it's not an individual cabling problem, nor a problem with one particular drive or one particular cage - so what's left?

 

Sure, the cages are just an extension of the cables - but that can introduce small timing differences.  I think my next test will be to run a pre-clear on the AOC controller/iStarUSA cage combo, and see what happens.
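For the record, this is how I'll run that test - Joe L.'s preclear script, which I keep on the flash drive (the path and the -l listing option reflect the version I have; adjust the device name to suit):

# list candidate (unassigned) drives, then preclear the one in the cage
/boot/preclear_disk.sh -l
/boot/preclear_disk.sh /dev/sdX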

 

Edit

 

Much as I expected: within seconds of starting the preclear, the log file is full of the same messages for the drive being precleared.

 

This is a fourth drive, and one on which I had run preclear at least six times while it was attached directly to the AOC HBA.

 

I am convinced, beyond reasonable doubt, that some interaction between the HBA and the cages is provoking this problem.  The only suggestion I have for a fix is the page I linked above, where a system builder experiencing this problem (admittedly with the mptsas driver) found that replacing the standard Linux distribution driver with the LSI version cured it.

 

So, I repeat my question:

How can I substitute the LSI derived mpt2sas driver for the driver which comes with unRAID?
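In case it helps anyone answer: my understanding is that, short of rebuilding bzimage/bzroot, the only route would be to compile the LSI source against the exact kernel version unRAID is running, copy the resulting mpt2sas.ko to the flash drive, and swap it in from the go script before the array starts.  The sketch below is only that - a sketch; the /boot/extra path is just a directory I would create, and none of this will work if mpt2sas is compiled into the kernel rather than built as a module:

# in /boot/config/go, before the line that starts emhttp:
# replace the in-tree mpt2sas with the LSI-built module (assumes it was
# compiled against this exact kernel and copied to the flash as shown)
rmmod mpt2sas
insmod /boot/extra/mpt2sas.ko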

Link to comment
