Joe L.

A good, but bad disk... badblocks alone can never be trusted.


I just finished running several full 4-pass write-mode badblocks tests on one of my KNOWN FAILED 2TB Hitachi disk drives. The first stopped about 3/4 of the way through (after about 48 hours) when my laptop crashed and the telnet connection dropped. The second time I used "screen" (just in case my laptop decided to blue-screen again).

badblocks reported 0 errors, and yet the disk has failed SMART, has re-allocated nearly 2000 sectors, and still has a sector pending re-allocation.

 

The full badblocks command I used was:

badblocks -c 1024 -b 65536 -vsw -o /boot/badblocks_out_sdl.txt  /dev/sdl

Beware: this is a destructive test. It will overwrite everything on the disk. Do not use this command on a disk with data you wish to retain.

The larger block size (-b 65536, tested 1024 blocks at a time with -c 1024) allowed it to complete in 64 hours. When it was done, there were no records in the /boot/badblocks_out_sdl.txt file. (In other words, every block on the disk was written and verified with four different patterns of values.)
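As a rough sanity check on those numbers, here is a small Python sketch of the arithmetic. The capacity comes from the smartctl report and the block size from the command line above. The pass count is an assumption: it supposes that -w with no -t option cycles through four patterns and that each pattern costs one full write plus one full read-back. The function names are made up for illustration.

```python
# Figures taken from the post: drive capacity from the smartctl report,
# -b 65536 from the badblocks command line, and the quoted 64-hour runtime.
CAPACITY_BYTES = 2_000_398_934_016
BLOCK_SIZE = 65_536
RUNTIME_HOURS = 64

# Assumption: four patterns, each written once and read back once.
PATTERNS = 4
PASSES = PATTERNS * 2


def block_range(capacity_bytes, block_size):
    """Return (full_blocks, leftover_bytes) for a device of this size."""
    return divmod(capacity_bytes, block_size)


def average_throughput_mb_s(capacity_bytes, passes, hours):
    """Average transfer rate in MB/s (10**6 bytes) over the whole run."""
    return capacity_bytes * passes / (hours * 3600) / 1e6


if __name__ == "__main__":
    full_blocks, leftover = block_range(CAPACITY_BYTES, BLOCK_SIZE)
    print("full 64 KiB blocks:        ", full_blocks)
    print("last block index:          ", full_blocks - 1)
    print("bytes past last full block:", leftover)
    rate = average_throughput_mb_s(CAPACITY_BYTES, PASSES, RUNTIME_HOURS)
    print("average MB/s: %.1f" % rate)
```

The last block index works out to 30523664, which agrees with the "Checking blocks 0 to 30523664" line that badblocks prints in the read-only runs later in this thread. The average rate comes out a little under 70 MB/s, which is only as good as the eight-pass assumption.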

 

Now... all this is great, as I'm learning how to use badblocks and how to track its progress for inclusion in a new version of the preclear script. 

 

However...  I purposely used this specific Hitachi disk because it has a large number of re-allocated sectors.  It was already failing SMART when I started the "badblocks" command on it.  I was curious to see if it would do anything to discover additional un-readable sectors.  The current smart report looks like this:

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     Hitachi HDS5C3020ALA632
Serial Number:    ML0220F30XGTPD
Firmware Version: ML6OA580
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Nov 17 09:10:16 2012 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
				was completed without error.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (22457) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				No Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
				SCT Error Recovery Control supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   098   098   016    Pre-fail  Always       -       131075
  2 Throughput_Performance  0x0005   135   135   054    Pre-fail  Offline      -       98
  3 Spin_Up_Time            0x0007   140   140   024    Pre-fail  Always       -       415 (Average 370)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       25
  5 Reallocated_Sector_Ct   0x0033   002   002   005    Pre-fail  Always   FAILING_NOW 1954
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   146   146   020    Pre-fail  Offline      -       29
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       9139
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       25
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       29
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       29
194 Temperature_Celsius     0x0002   222   222   000    Old_age   Always       -       27 (Min/Max 23/42)
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       2535
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

 

Yes... the badblocks pass was successful.  Its bad-blocks log file was empty.  The last thing it printed was:

Pass Completed: done

Pass completed, 0 bad blocks found.

 

Yet, in the smart report, there is one sector pending re-allocation.  It had to have been set that way in the final verification pass when reading all the zeros, otherwise it would have been re-allocated in the prior writing phase.

197 Current_Pending_Sector  0x0022  100  100  000    Old_age  Always      -      1

 

Also, notice the number of re-allocated sectors.  Yikes!!!

5 Reallocated_Sector_Ct  0x0033  002  002  005    Pre-fail  Always  FAILING_NOW 1954

 

My conclusion: no test is perfect. Even though a 64-hour badblocks test passed with zero bad blocks, it left one sector pending re-allocation on a disk that has already had 1954 sectors re-allocated and that SMART reports as:

SMART overall-health self-assessment test result: FAILED!  Drive failure expected in less than 24 hours. SAVE ALL DATA.

 

I'm going to RMA the disk...  (First Hitachi I've had fail on me)

 

Joe L.


A subsequent "read-only" pass of badblocks resulted in:

badblocks -t0x00 -c 1024 -b 65536 -vsv /dev/sdl

Checking blocks 0 to 30523664

Checking for bad blocks in read-only mode

Testing with pattern 0x00: done

Pass completed, 0 bad blocks found.

Isn't it great, badblocks shows no bad blocks found. >:( >:(>:( >:(

 

However, the disk is still FAILING SMART, and smartctl now shows additional re-allocated sectors (1992, up from 1954) and 12 sectors pending re-allocation (up from 1):

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   095   095   016    Pre-fail  Always       -       19
  2 Throughput_Performance  0x0005   135   135   054    Pre-fail  Offline      -       97
  3 Spin_Up_Time            0x0007   140   140   024    Pre-fail  Always       -       415 (Average 370)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       25
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 1992
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   146   146   020    Pre-fail  Offline      -       29
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       9187
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       25
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       29
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       29
194 Temperature_Celsius     0x0002   214   214   000    Old_age   Always       -       28 (Min/Max 23/42)
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       2577
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       12
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

 

Interesting, isn't it?      I think I'll let badblocks try another read-only pass.

 

Joe L.


A second "read-only" pass of badblocks resulted in exactly the same output:

badblocks -t0x00 -c 1024 -b 65536 -vsv /dev/sdl

Checking blocks 0 to 30523664

Checking for bad blocks in read-only mode

Testing with pattern 0x00: done

Pass completed, 0 bad blocks found.

Looks like un-readable sectors return zeros, otherwise badblocks should fail... right? >:( >:(>:( >:(

 

Oh yes, the disk is still FAILING SMART, and smartctl now shows one additional sector (13 in total) pending re-allocation; the re-allocated count is unchanged at 1992:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   098   098   016    Pre-fail  Always       -       5
  2 Throughput_Performance  0x0005   135   135   054    Pre-fail  Offline      -       97
  3 Spin_Up_Time            0x0007   140   140   024    Pre-fail  Always       -       415 (Average 370)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       25
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 1992
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   146   146   020    Pre-fail  Offline      -       29
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       9195
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       25
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       29
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       29
194 Temperature_Celsius     0x0002   206   206   000    Old_age   Always       -       29 (Min/Max 23/42)
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       2577
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       13
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

 

To those who think badblocks is the best thing since sliced bread... I'm not so sure it is the only thing you need to look at. I think you need to compare the SMART data from before and after it is run.
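A minimal sketch of that before/after comparison, in Python. It assumes the two reports have been saved as text (for example from smartctl -A) and that the attribute rows look like the ones pasted in this thread. The function names, the regular expression and the choice of watched attributes are illustrative, not part of any existing tool.

```python
import re

# One row of the smartctl attribute table, e.g.
#   5 Reallocated_Sector_Ct   0x0033   002   002   005    Pre-fail  Always   FAILING_NOW 1954
# Captured groups: attribute ID, attribute name, leading integer of RAW_VALUE.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

# Attributes that describe remapped or suspect sectors (a judgment call).
WATCHED = (5, 196, 197, 198)


def parse_raw_values(report):
    """Map attribute ID -> (name, raw value) for every table row found."""
    values = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            values[int(match.group(1))] = (match.group(2), int(match.group(3)))
    return values


def smart_changes(before, after, watched=WATCHED):
    """Return {attribute name: raw-value delta} for watched attributes that moved."""
    old, new = parse_raw_values(before), parse_raw_values(after)
    changes = {}
    for attr_id in watched:
        if attr_id in old and attr_id in new:
            name, new_raw = new[attr_id]
            delta = new_raw - old[attr_id][1]
            if delta:
                changes[name] = delta
    return changes


# Rows copied from the first report and from the report after the read-only pass.
BEFORE = """
  5 Reallocated_Sector_Ct   0x0033   002   002   005    Pre-fail  Always   FAILING_NOW 1954
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       2535
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
"""
AFTER = """
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 1992
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       2577
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       12
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    for name, delta in smart_changes(BEFORE, AFTER).items():
        print("%-24s %+d" % (name, delta))
```

Any non-empty result after a "clean" badblocks pass is the kind of signal this post is about: the surface scan reported nothing, yet the drive's own counters moved.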

 

I think I'll let badblocks try another read-write of 0xFF.  See if it fails on the un-readable sectors, assuming it does not re-allocate them all.

 

Joe L.


Maybe a good idea to corroborate your findings by checking the kernel message log for UNCorrectable errors on the drive in question, for the duration of the badblocks run. Sort of a "proof of the pudding ..." rather than rely on SMART to "prove" your case.

 

I'm inclined to agree with your assessment. But it is better to actually find the dead body than just be suspicious based on the odor of decomposition.

 


Maybe a good idea to corroborate your findings by checking the kernel message log for UNCorrectable errors on the drive in question, for the duration of the badblocks run. Sort of a "proof of the pudding ..." rather than rely on SMART to "prove" your case.

That is easy... here they are (the errors in the time period of interest). Remember, this disk is not part of the protected array.

Nov 16 01:37:54 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Nov 16 01:37:54 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 16 01:37:54 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 16 01:37:54 Tower2 kernel: ata12.00: cmd 60/00:b0:00:d4:66/04:00:b5:00:00/40 tag 22 ncq 524288 in
Nov 16 01:37:54 Tower2 kernel:          res 51/40:7a:86:d7:66/00:00:b5:00:00/40 Emask 0x409 (media error) <F>
Nov 16 01:37:54 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 16 01:37:54 Tower2 kernel: ata12.00: error: { UNC }
Nov 16 01:37:54 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 16 01:37:54 Tower2 kernel: ata12: EH complete
Nov 17 09:09:59 Tower2 kernel: NTFS driver 2.1.30 [Flags: R/W MODULE].
Nov 18 02:16:33 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x7800000f SErr 0x0 action 0x0
Nov 18 02:16:33 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 18 02:16:33 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 18 02:16:33 Tower2 kernel: ata12.00: cmd 60/00:d8:00:e0:e1/04:00:75:00:00/40 tag 27 ncq 524288 in
Nov 18 02:16:33 Tower2 kernel:          res 51/40:44:bc:e3:e1/00:00:75:00:00/40 Emask 0x409 (media error) <F>
Nov 18 02:16:33 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 18 02:16:33 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 02:16:33 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 18 02:16:33 Tower2 kernel: ata12: EH complete
Nov 18 12:06:37 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Nov 18 12:06:37 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 18 12:06:37 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 18 12:06:37 Tower2 kernel: ata12.00: cmd 60/00:d8:00:6c:3a/04:00:2d:00:00/40 tag 27 ncq 524288 in
Nov 18 12:06:37 Tower2 kernel:          res 51/40:c6:3a:6c:3a/00:03:2d:00:00/40 Emask 0x409 (media error) <F>
Nov 18 12:06:37 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 18 12:06:37 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 12:06:37 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 18 12:06:37 Tower2 kernel: ata12: EH complete
Nov 18 12:07:22 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Nov 18 12:07:22 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 18 12:07:22 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 18 12:07:22 Tower2 kernel: ata12.00: cmd 60/00:98:00:4c:90/04:00:2d:00:00/40 tag 19 ncq 524288 in
Nov 18 12:07:22 Tower2 kernel:          res 51/40:3c:c4:4e:90/00:01:2d:00:00/40 Emask 0x409 (media error) <F>
Nov 18 12:07:22 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 18 12:07:22 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 12:07:22 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 18 12:07:22 Tower2 kernel: ata12: EH complete
Nov 18 12:20:08 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x7fffffcf SErr 0x0 action 0x0
Nov 18 12:20:08 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 18 12:20:08 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 18 12:20:08 Tower2 kernel: ata12.00: cmd 60/00:30:00:8c:0d/04:00:33:00:00/40 tag 6 ncq 524288 in
Nov 18 12:20:08 Tower2 kernel:          res 51/40:8c:74:8f:0d/00:00:33:00:00/40 Emask 0x409 (media error) <F>
Nov 18 12:20:08 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 18 12:20:08 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 12:20:08 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 18 12:20:08 Tower2 kernel: ata12: EH complete
Nov 18 12:20:40 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0xc SErr 0x0 action 0x0
Nov 18 12:20:40 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 18 12:20:40 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 18 12:20:40 Tower2 kernel: ata12.00: cmd 60/00:10:00:f8:0f/04:00:33:00:00/40 tag 2 ncq 524288 in
Nov 18 12:20:40 Tower2 kernel:          res 51/40:f6:0a:f9:0f/00:02:33:00:00/40 Emask 0x409 (media error) <F>
Nov 18 12:20:40 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 18 12:20:40 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 12:20:40 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 18 12:20:40 Tower2 kernel: ata12: EH complete
Nov 18 12:36:44 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Nov 18 12:36:44 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 18 12:36:44 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 18 12:36:44 Tower2 kernel: ata12.00: cmd 60/00:18:00:0c:d0/04:00:38:00:00/40 tag 3 ncq 524288 in
Nov 18 12:36:44 Tower2 kernel:          res 51/40:43:bd:0c:d0/00:03:38:00:00/40 Emask 0x409 (media error) <F>
Nov 18 12:36:44 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 18 12:36:44 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 12:36:44 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 18 12:36:44 Tower2 kernel: ata12: EH complete
Nov 18 12:42:12 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Nov 18 12:42:12 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 18 12:42:12 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 18 12:42:12 Tower2 kernel: ata12.00: cmd 60/00:a8:00:54:d2/04:00:3a:00:00/40 tag 21 ncq 524288 in
Nov 18 12:42:12 Tower2 kernel:          res 51/40:74:8c:57:d2/00:00:3a:00:00/40 Emask 0x409 (media error) <F>
Nov 18 12:42:12 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 18 12:42:12 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 12:42:12 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 18 12:42:12 Tower2 kernel: ata12: EH complete
Nov 18 12:43:26 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x8 SErr 0x0 action 0x0
Nov 18 12:43:26 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 18 12:43:26 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 18 12:43:26 Tower2 kernel: ata12.00: cmd 60/00:18:00:fc:ed/04:00:3a:00:00/40 tag 3 ncq 524288 in
Nov 18 12:43:26 Tower2 kernel:          res 51/40:2d:d3:ff:ed/00:00:3a:00:00/40 Emask 0x409 (media error) <F>
Nov 18 12:43:26 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 18 12:43:26 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 12:43:26 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 18 12:43:26 Tower2 kernel: ata12: EH complete
Nov 18 12:44:05 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Nov 18 12:44:05 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 18 12:44:05 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 18 12:44:05 Tower2 kernel: ata12.00: cmd 60/00:48:00:1c:25/04:00:3b:00:00/40 tag 9 ncq 524288 in
Nov 18 12:44:05 Tower2 kernel:          res 51/40:fc:04:1d:25/00:02:3b:00:00/40 Emask 0x409 (media error) <F>
Nov 18 12:44:05 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 18 12:44:05 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 12:44:05 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 18 12:44:05 Tower2 kernel: ata12: EH complete
Nov 18 12:45:04 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Nov 18 12:45:04 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 18 12:45:04 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 18 12:45:04 Tower2 kernel: ata12.00: cmd 60/00:38:00:1c:42/04:00:3b:00:00/40 tag 7 ncq 524288 in
Nov 18 12:45:04 Tower2 kernel:          res 51/40:42:be:1f:42/00:00:3b:00:00/40 Emask 0x409 (media error) <F>
Nov 18 12:45:04 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 18 12:45:04 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 12:45:04 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 18 12:45:04 Tower2 kernel: ata12: EH complete
Nov 18 12:45:58 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Nov 18 12:45:58 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 18 12:45:58 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 18 12:45:58 Tower2 kernel: ata12.00: cmd 60/00:c8:00:e0:44/04:00:3b:00:00/40 tag 25 ncq 524288 in
Nov 18 12:45:58 Tower2 kernel:          res 51/40:a8:58:e2:44/00:01:3b:00:00/40 Emask 0x409 (media error) <F>
Nov 18 12:45:58 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 18 12:45:58 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 12:45:58 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 18 12:45:58 Tower2 kernel: ata12: EH complete
Nov 18 12:46:15 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Nov 18 12:46:15 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 18 12:46:15 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 18 12:46:15 Tower2 kernel: ata12.00: cmd 60/00:28:00:0c:5d/04:00:3b:00:00/40 tag 5 ncq 524288 in
Nov 18 12:46:15 Tower2 kernel:          res 51/40:fa:06:0d:5d/00:02:3b:00:00/40 Emask 0x409 (media error) <F>
Nov 18 12:46:15 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 18 12:46:15 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 12:46:15 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 18 12:46:15 Tower2 kernel: ata12: EH complete
Nov 18 12:47:48 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Nov 18 12:47:48 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 18 12:47:48 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 18 12:47:48 Tower2 kernel: ata12.00: cmd 60/00:30:00:18:b4/04:00:3b:00:00/40 tag 6 ncq 524288 in
Nov 18 12:47:48 Tower2 kernel:          res 51/40:dc:24:18:b4/00:03:3b:00:00/40 Emask 0x409 (media error) <F>
Nov 18 12:47:48 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 18 12:47:48 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 12:47:48 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 18 12:47:48 Tower2 kernel: ata12: EH complete
Nov 18 12:56:50 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Nov 18 12:56:50 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 18 12:56:50 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 18 12:56:50 Tower2 kernel: ata12.00: cmd 60/00:a0:00:48:47/04:00:3e:00:00/40 tag 20 ncq 524288 in
Nov 18 12:56:50 Tower2 kernel:          res 51/40:1e:e2:48:47/00:03:3e:00:00/40 Emask 0x409 (media error) <F>
Nov 18 12:56:50 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 18 12:56:50 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 12:56:50 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 18 12:56:50 Tower2 kernel: ata12: EH complete
Nov 18 13:11:20 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x7fc0000f SErr 0x0 action 0x0
Nov 18 13:11:20 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 18 13:11:20 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 18 13:11:20 Tower2 kernel: ata12.00: cmd 60/00:b0:00:cc:cf/04:00:42:00:00/40 tag 22 ncq 524288 in
Nov 18 13:11:20 Tower2 kernel:          res 51/40:c9:37:ce:cf/00:01:42:00:00/40 Emask 0x409 (media error) <F>
Nov 18 13:11:20 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 18 13:11:20 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 13:11:20 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 18 13:11:20 Tower2 kernel: ata12: EH complete
Nov 18 13:47:21 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Nov 18 13:47:21 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 18 13:47:21 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 18 13:47:21 Tower2 kernel: ata12.00: cmd 60/00:80:00:40:68/04:00:54:00:00/40 tag 16 ncq 524288 in
Nov 18 13:47:21 Tower2 kernel:          res 51/40:62:9e:41:68/00:02:54:00:00/40 Emask 0x409 (media error) <F>
Nov 18 13:47:21 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 18 13:47:21 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 13:47:21 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 18 13:47:21 Tower2 kernel: ata12: EH complete
Nov 18 13:50:46 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Nov 18 13:50:46 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 18 13:50:46 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 18 13:50:46 Tower2 kernel: ata12.00: cmd 60/00:90:00:48:ac/04:00:55:00:00/40 tag 18 ncq 524288 in
Nov 18 13:50:46 Tower2 kernel:          res 51/40:8e:72:48:ac/00:03:55:00:00/40 Emask 0x409 (media error) <F>
Nov 18 13:50:46 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 18 13:50:46 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 13:50:46 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 18 13:50:46 Tower2 kernel: ata12: EH complete
Nov 18 17:37:32 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Nov 18 17:37:32 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 18 17:37:32 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 18 17:37:32 Tower2 kernel: ata12.00: cmd 60/00:38:00:14:e3/04:00:c1:00:00/40 tag 7 ncq 524288 in
Nov 18 17:37:32 Tower2 kernel:          res 51/40:24:dc:16:e3/00:01:c1:00:00/40 Emask 0x409 (media error) <F>
Nov 18 17:37:32 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 18 17:37:32 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 17:37:32 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 18 17:37:32 Tower2 kernel: ata12: EH complete
Nov 18 17:44:54 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Nov 18 17:44:54 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 18 17:44:54 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 18 17:44:54 Tower2 kernel: ata12.00: cmd 60/00:b8:00:5c:c2/04:00:c3:00:00/40 tag 23 ncq 524288 in
Nov 18 17:44:54 Tower2 kernel:          res 51/40:28:d8:5f:c2/00:00:c3:00:00/40 Emask 0x409 (media error) <F>
Nov 18 17:44:54 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 18 17:44:54 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 17:44:54 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 18 17:44:54 Tower2 kernel: ata12: EH complete
Nov 18 17:44:57 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Nov 18 17:44:57 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 18 17:44:57 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 18 17:44:57 Tower2 kernel: ata12.00: cmd 60/00:00:00:78:c2/04:00:c3:00:00/40 tag 0 ncq 524288 in
Nov 18 17:44:57 Tower2 kernel:          res 51/40:00:00:78:c2/00:04:c3:00:00/40 Emask 0x409 (media error) <F>
Nov 18 17:44:57 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 18 17:44:57 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 17:44:57 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 18 17:44:57 Tower2 kernel: ata12: EH complete
Nov 18 22:52:17 Tower2 kernel: mdcmd (77): spindown 1
Nov 18 22:52:17 Tower2 kernel: mdcmd (78): spindown 2
Nov 18 22:52:18 Tower2 kernel: mdcmd (79): spindown 4
Nov 18 22:52:18 Tower2 kernel: mdcmd (80): spindown 6
Nov 18 23:52:18 Tower2 kernel: mdcmd (81): spindown 3
Nov 18 23:52:18 Tower2 kernel: mdcmd (82): spindown 5
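For anyone who wants to tally these events rather than eyeball them, here is a hedged Python sketch. It counts the "error: { UNC }" lines per device and tries to decode the failing sector from each "res" line. The byte layout used for the decode (status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) is my reading of the libata message format and should be treated as an assumption, as should the 512-byte sector size. All function names are hypothetical.

```python
import re
from collections import Counter

UNC = re.compile(r"kernel: (ata\d+\.\d+): error: \{ UNC \}")
# Twelve two-digit hex bytes separated by '/' or ':' following "res ".
RES = re.compile(r"res ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")


def failing_lba(res_field):
    """Decode the 48-bit LBA from a libata 'res' field (assumed layout)."""
    b = [int(x, 16) for x in re.split(r"[/:]", res_field)]
    lbal, lbam, lbah = b[3], b[4], b[5]
    hob_lbal, hob_lbam, hob_lbah = b[8], b[9], b[10]
    lba = 0
    for byte in (hob_lbah, hob_lbam, hob_lbal, lbah, lbam, lbal):
        lba = (lba << 8) | byte
    return lba


def summarize_unc(log_text):
    """Return (UNC count per device, list of decoded failing LBAs)."""
    counts = Counter()
    lbas = []
    for line in log_text.splitlines():
        unc = UNC.search(line)
        if unc:
            counts[unc.group(1)] += 1
        res = RES.search(line)
        if res:
            lbas.append(failing_lba(res.group(1)))
    return counts, lbas


def badblocks_block(lba, block_size=65536, sector_size=512):
    """Index of the badblocks block (for -b block_size) containing this LBA."""
    return lba * sector_size // block_size


# Two events copied from the log above.
SAMPLE = """\
Nov 16 01:37:54 Tower2 kernel:          res 51/40:7a:86:d7:66/00:00:b5:00:00/40 Emask 0x409 (media error) <F>
Nov 16 01:37:54 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 02:16:33 Tower2 kernel:          res 51/40:44:bc:e3:e1/00:00:75:00:00/40 Emask 0x409 (media error) <F>
Nov 18 02:16:33 Tower2 kernel: ata12.00: error: { UNC }
"""

if __name__ == "__main__":
    counts, lbas = summarize_unc(SAMPLE)
    print(dict(counts))
    for lba in lbas:
        print(hex(lba), "-> badblocks block", badblocks_block(lba))
```

If the decode is right, each failing LBA should fall inside the 1024-sector range requested by the matching "cmd" line, which is a cheap way to check the assumed layout against a real log.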

 

I completed the read-write test with 0xFF. When it had completed writing 0xFF and was just starting to read them back to verify, I ran a smartctl report. It showed the sectors pending re-allocation had been re-allocated. (There were zero sectors pending re-allocation.)

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   135   135   054    Pre-fail  Offline      -       97
  3 Spin_Up_Time            0x0007   140   140   024    Pre-fail  Always       -       415 (Average 370)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       25
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 1992
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   146   146   020    Pre-fail  Offline      -       29
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       9203
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       25
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       29
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       29
194 Temperature_Celsius     0x0002   206   206   000    Old_age   Always       -       29 (Min/Max 23/42)
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       2577
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

 

Interestingly, the verify pass of badblocks found no issue when reading back the pattern of 0xFF

badblocks -t0xFF -c 1024 -b 65536 -vswv /dev/sdl

Checking for bad blocks in read-write mode

From block 0 to 30523664

Testing with pattern 0xff: done

Reading and comparing: done

Pass completed, 0 bad blocks found.

 

A smartctl report following the completion of the verify showed this, with one additional re-allocated sector:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   097   097   016    Pre-fail  Always       -       458752
  2 Throughput_Performance  0x0005   135   135   054    Pre-fail  Offline      -       97
  3 Spin_Up_Time            0x0007   140   140   024    Pre-fail  Always       -       415 (Average 370)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       25
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 1993
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   146   146   020    Pre-fail  Offline      -       29
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       9211
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       25
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       29
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       29
194 Temperature_Celsius     0x0002   200   200   000    Old_age   Always       -       30 (Min/Max 23/42)
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       2578
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

 

 

I'm inclined to agree with your assessment. But it is better to actually find the dead body than just be suspicious based on the odor of decomposition.

 

I'm going to try another read-write of 0x00 next.

 

Joe L.


That is easy... here they are (the errors in the time-period of interest. ) remember, this disk is not part of the protected array.

Nov 16 01:37:54 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Nov 16 01:37:54 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 16 01:37:54 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
...
Nov 18 02:16:33 Tower2 kernel: ata12.00: exception Emask 0x0 SAct 0x7800000f SErr 0x0 action 0x0
Nov 18 02:16:33 Tower2 kernel: ata12.00: irq_stat 0x48000000
Nov 18 02:16:33 Tower2 kernel: ata12.00: failed command: READ FPDMA QUEUED
Nov 18 02:16:33 Tower2 kernel: ata12.00: cmd 60/00:d8:00:e0:e1/04:00:75:00:00/40 tag 27 ncq 524288 in
Nov 18 02:16:33 Tower2 kernel:          res 51/40:44:bc:e3:e1/00:00:75:00:00/40 Emask 0x409 (media error) <F>
Nov 18 02:16:33 Tower2 kernel: ata12.00: status: { DRDY ERR }
Nov 18 02:16:33 Tower2 kernel: ata12.00: error: { UNC }
Nov 18 02:16:33 Tower2 kernel: ata12.00: configured for UDMA/133
Nov 18 02:16:33 Tower2 kernel: ata12: EH complete
... [19 more] ...

(You started one day/error early; NBD)

 

============================

CORRECTION: [30Nov12]

badblocks does report (very) bad blocks, and, for that, it can not be faulted. The test scenario upon which I based my [(now) misplaced] criticism, below (w/retractions), involves a drive that was throwing UNCorrectable errors, and corresponding nullifying increases/decreases to SMART's Current_Pending_Sector count, BUT none of those (15+) flaky sectors were sufficiently persistent in their flakiness to cross the AHCI driver's RETRY threshold, and, hence, did not result in any error returns to the calling program's (ie, badblocks) read() requests. Thus, badblocks had nothing to report.

 

[it is unfortunate that there is no mechanism available for a (privileged) user program to be informed of drive errors (below the RETRY threshold), provoked by its own read()s. I'll mention this to the author of badblocks, who is also heavily involved with kernel/filesystem development.]

 

This highlights the importance of simultaneously monitoring the tested drive's (mis-) behavior by other means.

=======================================
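
One way to do that monitoring is to capture the SMART attribute table before and after a badblocks pass and compare the raw values. A minimal sketch, assuming the ten-column attribute layout seen in the smartctl reports in this thread (ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value); the function names and the list of watched attributes are my own choices for illustration.

```python
# Attributes worth watching across a test pass (an illustrative selection).
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_smart_attributes(text):
    """Map attribute name -> raw value (int) from smartctl attribute rows."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have a hex FLAG column.
        if len(parts) >= 10 and parts[0].isdigit() and parts[2].startswith("0x"):
            try:
                attrs[parts[1]] = int(parts[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip the row
    return attrs


def smart_changes(before, after, watched=WATCHED):
    """Return {name: (old, new)} for watched attributes whose raw value changed."""
    return {
        name: (before.get(name), after.get(name))
        for name in watched
        if before.get(name) != after.get(name)
    }
```

Saving the report text before and after a pass and feeding both to `parse_smart_attributes` would surface a change such as one additional reallocated sector even when badblocks itself logged nothing.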

 

Yep, those are [REAL / HARD] errors. They were NOT reported to badblocks (via error return from read() calls), and the user should have been apprised of such events.

 

The most important conclusion to draw is:

The badblocks program SUCKS at precisely what it claims to do. FALSE

badblocks(8) - Linux man page
Name
badblocks - search a device for bad blocks

I'm inclined to revise your thread subtitle to:

"badblocks can never be trusted" ALSO FALSE

I completed the read-write test with 0xFF.  When it had completed writing 0xFF and was just starting to read them back to verify, I ran a smartctl report.  It showed the sectors pending re-allocation had been re-allocated. (There were zero sectors pending re-allocation.)

No. (In this case,) It showed that the formerly pending sectors had actually recuperated, and had not been sent to the morgue.

[the morgue population (Reallocated_Sector_Ct) remained at 1992] [cf: Reply#2]

Interestingly, the verify pass of badblocks found no issue when reading back the pattern of 0xFF

At least the program is consistent :(

A smartctl report following the completion of the verify showed this, with one additional re-allocated sector: ...

Yes, that verify run actually created/generated a new morgue resident.

I'm going to try another read-write of 0x00 next.

What's left to prove? You did well!

[but I might have to report you to the ASPCA. I know it's just a lab rat (that 5K3000), but this is bordering on cruelty. :)]

 

--UhClem

 


The most important conclusion to draw is:

The badblocks program SUCKS at precisely what it claims to do.

badblocks(8) - Linux man page
Name
badblocks - search a device for bad blocks

I'm inclined to revise your thread subtitle to:

"badblocks can never be trusted"

Well... it is doing what it was coded to do, but that is not what was intended.  Perhaps it was designed and initially coded prior to SMART firmware performing dynamic re-allocation.

 

If the "writing" of a sector did not re-allocate it, and the value read back was different, then it would detect a bad block.  However, it seems to ignore a read failure, or the failed read returns a block of zeros which it uses while ignoring the read error code. (Does it ignore, or trap, the error?)  Perhaps it uses the same buffer, which still has the contents of the prior block... I'll have to examine the code some day to see what it is actually doing (if UhClem does not beat me to it).

 

With SMART firmware re-allocating or re-writing in place I think you MUST also use smartctl to evaluate what is happening.

 

Once the current read-write of zeros finishes, I'll fill it with 0xFF again and then I'm going to put it in a multi-cycle read-only pass, hoping one of them marks a block as un-readable, and a subsequent one finds it.

 

Oh yes, the disk I'm beating up is under warranty until 2014, and I'm not needing it immediately for storage, so I'll let it suffer through a few more cycles before sending it in for an RMA.

 

I've never seen smart completely exhaust the pool of spare blocks.  Have you?

 

Joe L.

 

I've never seen smart completely exhaust the pool of spare blocks.  Have you?

 

I don't see how you know it has. You might have more spares yet.


I've never seen smart completely exhaust the pool of spare blocks.  Have you?

I don't see how you know it has.

I would guess it would stop re-allocating them, and the number re-allocated would be some nice round number (2000) or a power of 2 (2048) or something like that.
You might have more spares yet.
Don't know either, but I have nothing to lose by running a few more cycles.

 

Joe L.


Aha, so it's not so easy???  Was this version 1.42 of badblocks?

Can't tell...  Apparently it has no way to determine the version number.  It is not printed in its output, nor is it available via some optional parameter.

 

It is the version supplied with the 5.0-rc8a unRAID.

>> Joe, I have started to play with weebotech's advice for using badblocks in addition to your preclear.  Is it possible to use them together?

>> See here: http://lime-technology.com/forum/index.php?topic=23384.msg207021#msg207021

 

Thanks for your research.

You can use them together, but from what I'm learning, the badblocks program can be very misleading and not report bad blocks. 

 

When I do put it into the preclear script I still need to perform many of the other tests to exercise the disk (interspersed seeks of random sectors, first and last sectors), to keep it from just doing a linear seek from track to track.  That will not uncover some mechanical issues as easily.


badblocks alone cannot be trusted. You need to compare the SMART values from before and after the run; that is a given.

 

Each drive's firmware is different in how it handles the error. It's known that a failed read or a short read can return 0's.

I've seen this many times.

 

For my failed WD drives, it did detect the bad sectors and also seemed to refresh other sectors.

I did not use the -c and -b parameters; I used the defaults.

On my simpler hardware, whenever I got an error in the syslog, badblocks incremented the counters.

 

Now I would have to say, if your SMART status is reporting the drive as FAILED or FAILING NOW, badblocks is not going to fix it.

Where it helps: if a SMART long and/or short test reports a failed LBA, badblocks may refresh the bad sector or mark it bad.

 

Part of the issue is that a pending sector is one the firmware has flagged as questionable, or one that took too many attempts to read. It doesn't necessarily mean that the sector was read incorrectly.

 

You need to use both tools, the pattern write/read and the output of SMART, to determine if a drive is safe for use.

 

You have to consider that if badblocks did not detect the sectors in question, dd would not have either. They both run at the same level of interface.

 

 

>> Joe, I have started to play with weebotech's advice for using badblocks in addition to your preclear.  Is it possible to use them together?

>> See here: http://lime-technology.com/forum/index.php?topic=23384.msg207021#msg207021

 

Thanks for your research.

You can use them together, but from what I'm learning, the badblocks program can be very misleading and not report bad blocks.

 

 

You still need to look at the SMART values. If the firmware rereads a sector over and over, finally gets it correct, and then reports the correct data back to the operating system, the program does not know the sector is bad.

It all depends on how the firmware and operating system interact with suspect sectors.

 

 

I have seen this tool refresh pending sectors, reallocate failed sectors (making the drive usable), and report failed sectors when the operating system reports them back to the application.


I cannot update this until I have a home to go to and computer to work with.

 

There is no rush.  Thanks to you guys for the excellent support you provide to the entire community.  Tom's unRaid would be dead without the volunteers here.


I have badblocks 1.42 posted here.

http://code.google.com/p/unraid-weebotech/downloads/detail?name=badblocks-1.42

 

 

I cannot update this until I have a home to go to and computer to work with.

Maybe someone else can compile the latest version statically and post it.

I did a quick compare of the "strings" in your version and the only input option it lists that is not in the one supplied by unRAID is the "-B" option. 

(No idea what it does, as "-B" is not listed in any manual page I've seen)

 

Joe L.


The output is more informative on 1.42.

I was working on adding a MB/s figure to the output, using \n instead of \r, and altering the time per message so the output could be piped.  Unfortunately my machines were ruined.


I have badblocks 1.42 posted here.

http://code.google.com/p/unraid-weebotech/downloads/detail?name=badblocks-1.42

 

 

I cannot update this until I have a home to go to and computer to work with.

Maybe someone else can compile the latest version statically and post it.

I did a quick compare of the "strings" in your version and the only input option it lists that is not in the one supplied by unRAID is the "-B" option. 

(No idea what it does, as "-B" is not listed in any manual page I've seen)

 

Joe L.

 

-B    Use  buffered  I/O  and  do  not  use  Direct I/O, even if it is available.

 

Source: badblocks Manpage

 

EDIT:

And it appears we have version 1.41.11; at least that is the version of the e2fsprogs package we have, which includes badblocks (and as far as I can tell, the same version numbers are used).

 

EDIT #2:

According to the e2fsprogs git repo, the last change to badblocks was 9 months ago, and it was to "honor -s option when in read only -t mode", i.e. to show the progress while in test-only mode. This may be helpful for your script, Joe L., but otherwise weebotech's compiled 1.42 should be sufficient.


-B    Use  buffered  I/O  and  do  not  use  Direct I/O, even if it is available.

Source: badblocks Manpage

Thanks...  I really don't think that will be needed.  The entire disk is written, then read.  No disk buffer cache will cache the entire disk.

EDIT:

And it appears we have version 1.41.11; at least that is the version of the e2fsprogs package we have, which includes badblocks (and as far as I can tell, the same version numbers are used).

Thanks.

EDIT #2:

According to the e2fsprogs git repo, the last change to badblocks was 9 months ago, and it was to "honor -s option when in read only -t mode", i.e. to show the progress while in test-only mode. This may be helpful for your script, Joe L., but otherwise weebotech's compiled 1.42 should be sufficient.

Actually, I can use the "-v" argument twice to get the additional progress output, even with -t in read-only mode, as seen in this example:

root@Tower2:/boot#  badblocks -t0x00 -c 1024 -b 65536 -vsv /dev/sdl

Checking blocks 0 to 30523664

Checking for bad blocks in read-only mode

Testing with pattern 0x00:  0.03% done, 0:12 elapsed

 

I do not need any different version to interface to in a script.  I've already figured that out. 
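
For anyone else wanting to track progress from a script, a minimal sketch of parsing a progress line like the one shown above follows. It assumes the "N% done, H:MM elapsed" wording stays as in that single sample (other versions may add fields), and that the caller has already split the stream on whatever the tool uses to redraw the line (carriage returns or backspaces). The function names are hypothetical.

```python
import re

# Matches e.g. "0.03% done, 0:12 elapsed" anywhere in a chunk of output.
PROGRESS_RE = re.compile(r"(\d+(?:\.\d+)?)% done, ((?:\d+:)+\d+) elapsed")


def parse_progress(chunk):
    """Return (percent, elapsed_seconds) from the last progress report in chunk."""
    matches = PROGRESS_RE.findall(chunk)
    if not matches:
        return None
    percent_text, elapsed_text = matches[-1]
    seconds = 0
    # Elapsed time is colon-separated (m:ss or h:mm:ss); fold it into seconds.
    for part in elapsed_text.split(":"):
        seconds = seconds * 60 + int(part)
    return float(percent_text), seconds


def estimate_remaining(percent, elapsed_seconds):
    """Linear estimate of the seconds left in the current pass, or None at 0%."""
    if percent <= 0:
        return None
    return elapsed_seconds * (100.0 - percent) / percent
```

The estimate is only a straight-line extrapolation; outer and inner tracks read at different speeds, so it will drift over the course of a pass.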

I'm just a tad bit disappointed in how "badblocks" handles "read" errors returned from the OS when reading back the 0x00 pattern for verification.

 

Joe L.

 

I'm just a tad bit disappointed in how "badblocks" handles "read" errors returned from the OS when reading back the 0x00 pattern for verification.

Joe L.

The problem is the kernel did not return an error to the program.

The drive must have recovered and the sector was readable enough with error correction, otherwise you would have gotten a bad block. Either that or the firmware of the Hitachi handles it differently.

 

 

Here's some open code I found for those who want to review.

http://stuff.mit.edu/afs/sipb/user/tytso/e2fsprogs/misc/badblocks.c

 

 

Point is if read(); does not return a -1 the application has no way of knowing the read failed.

If the block that was got matches what it's supposed to match, then the read is assumed to be good.

 

 

There's more going on here that we do not know about.


Just reporting what my google-fu could dig up. A lot of this stuff is over my head. I'm sure if I had the time to read through it all I'd understand it, but for now, I get the gist and can follow what's going on, but can't add anything useful to the conversation...besides the version numbers that is  :o


 

I'm just a tad bit disappointed in how "badblocks" handles "read" errors returned from the OS when reading back the 0x00 pattern for verification.

Joe L.

The problem is the kernel did not return an error to the program.

Yes, but the kernel did return an error... it was even logged in the syslog (see my prior post)

The drive must have recovered and the sector was readable enough with error correction, otherwise you would have gotten a bad block. Either that or the firmware of the Hitachi handles it differently.

you would think so... but not so sure it is being handled correctly in badblocks.

 

Here's some open code I found for those who want to review.

http://stuff.mit.edu/afs/sipb/user/tytso/e2fsprogs/misc/badblocks.c

Thanks, code review is always fun.

Point is if read(); does not return a -1 the application has no way of knowing the read failed.

I agree, BUT if the read returns "-1" and it is ignored, or handled poorly, what happens??

 

I see this in the "read()" code.  look what it does with a -1.  It sets the number of bytes read to 0.

 

got = read (dev, buffer, try * block_size);
if (got < 0)
    got = 0;

 

Then later in the code it seems to use the zero to mark the block as bad...

if (got == 0) {
    bb_count += bb_output(currently_testing++);
}

If the block that was got matches what it's supposed to match, then the read is assumed to be good.

 

 

There's more going on here that we do not know about.

I agree.  The quick code review seems to indicate a read that returns a -1 will result in a bad block being logged.  Guess a read error must return something >= 0.

 

A quick browse of the "read()" man page results in me spotting this sentence:

POSIX allows a read() that is interrupted after reading some data to return -1 (with errno set to EINTR) or to return the number of bytes already read.

 

Since we are reading multiple blocks of data at a time, it is very possible that some data was read from sectors prior to the un-readable sector.  In that case, perhaps the "read()" is returning a value > 0.

We'll probably never know for sure without compiling a version that has more error reporting output, but it is interesting.
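
To make the short-read idea concrete: with -c 1024 and -b 65536, each request covers 1024 × 64 KiB = 64 MiB, so there is plenty of room for a read to come back partially filled. Below is a minimal sketch of one way a scanner could handle all three outcomes (error, short read, full read). It is my own illustration and is not taken from badblocks.c; `scan_range` is a hypothetical name, and it uses ordinary buffered I/O through `os.pread` rather than direct I/O. In Python a failed read raises `OSError` instead of returning -1.

```python
import os


def scan_range(fd, first_block, count, block_size, expected_byte):
    """Return block numbers in [first_block, first_block + count) that could
    not be read or that do not contain the expected fill byte."""
    expected = bytes([expected_byte]) * block_size
    bad = []
    block = first_block
    end = first_block + count
    while block < end:
        want = end - block
        try:
            data = os.pread(fd, want * block_size, block * block_size)
        except OSError:
            data = b""  # the equivalent of read() returning -1
        full = len(data) // block_size
        # Compare every block that came back whole.
        for i in range(full):
            if data[i * block_size:(i + 1) * block_size] != expected:
                bad.append(block + i)
        block += full
        if full == 0:
            # No progress at all: blame the block we were positioned on and
            # step past it. After a short read the loop simply retries from
            # the first block that was not returned.
            bad.append(block)
            block += 1
    return bad
```

The point of the design is that a short read is never silently treated as a complete one: only data that actually arrived is compared, and the remainder is requested again.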

 

Joe L.


One more thing.. I did not see anywhere in the code where the "read buffer" is emptied/marked in any way each time through the loop prior to being passed as a pointer to the read() system call.  In other words, it always seems to have the contents of the prior iteration of the loop.  To make it more foolproof, it probably should have its last byte written to have a value other than what is expected.  That way the subsequent memcmp is guaranteed to fail if the buffer is not entirely written because of an error.
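
A minimal sketch of that buffer-poisoning idea, again as an illustration rather than a patch to badblocks. It poisons the whole buffer rather than just the last byte (the cheaper variant suggested above), and `read_and_verify` is a hypothetical name; `readinto` is assumed to behave like `io.RawIOBase.readinto`, filling the buffer in place and returning the byte count.

```python
def read_and_verify(readinto, buf, expected_byte):
    """Read one block into buf via readinto(buf) and report whether it holds
    nothing but expected_byte. buf is a reusable bytearray."""
    poison = expected_byte ^ 0xFF
    # Overwrite the buffer first, so data left over from the previous
    # iteration can never satisfy the comparison below.
    buf[:] = bytes([poison]) * len(buf)
    try:
        got = readinto(buf)
    except OSError:
        return False  # hard read error
    if got is None or got != len(buf):
        return False  # nothing read, or a short read
    return bytes(buf) == bytes([expected_byte]) * len(buf)
```

With this in place, a read that claims success without actually filling the buffer fails the comparison instead of passing on stale contents.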


Perhaps it was designed and initially coded prior to SMART firmware performing dynamic re-allocation.

While the code to handle SMART resides in the drive's firmware, it really is not a part of the functioning/controlling of the drive. The drive performs reallocation when the conditions for doing so are met, and records the salient parameters for access (ie, Analysis) [and Reporting] by the co-resident SMART code.

Summary: the presence, or absence, of SMART functionality has nothing to do with badblocks performing (or not) according to its stated spec.

 

As I understand it, the closest that SMART gets to affecting the attributes is the "Offline" effort to (successfully) read recently-added Current_Pending_Sector sectors, and, if it does satisfactorily read one, that sector's Current_Pending status is "cleared", the count is decremented, and that sector becomes a normal citizen again.

If the "writing" of a sector did not re-allocate it, and the value read back was different, then it would detect a bad block.  However, it seems to ignore a read failure, or the failed read returns a block of zeros which it uses while ignoring the read error code. (Does it ignore, or trap, the error?)  Perhaps it uses the same buffer, which still has the contents of the prior block... I'll have to examine the code some day to see what it is actually doing (if UhClem does not beat me to it).

Funny ... I did have a cursory look at badblocks code about a year ago. I had written a little program to do (read-only, mostly performance) testing of hard drives, and was looking for some examples of what type of errors to anticipate, and how they were best dealt with. But all of my drives were perfectly healthy.[Weebo might just recall ... I had written here:

Personally, I'm looking forward to one of my own drives becoming "interesting" (like yours, or some variation). I've got some ideas, and some low-level code written to help test them. I just need my own lab rat.

:) ]

So, I did look at badblocks, and was shocked to see that it essentially ignored the error return [from read()] and strictly relied on data comparisons. Never looked at it again; and certainly would not use it myself. And thank you JoeL for providing this "case in point" (now) that does support my quick 5-minute surmisal, about a year ago.

 

Reminder: a failed read() returns (-1) [and sets errno]; the resulting contents of the read()-specified buf are totally/completely undefined (as is the value of the system's seek-pointer for that filedesc now).

 

 


I think we are missing the point.

If the read fails, it returns -1. This is normal system call protocol.

Somehow the read or retry of the read is succeeding. Otherwise the read would return -1.

I know there are errors in the syslog. Either something is going wrong somewhere else, or the read is succeeding through retries.

 

 

We seem to be blaming badblocks for not detecting the read failure, when in fact, if it failed, it would handle it.

I've run this tool on MANY drives that had bad sectors. It finds them and logs them.

If possible, the second, third, or fourth pass reallocates them.

 

 

My friend gave me boxes and boxes of hard drives. This tool has worked to weed out the bad ones or make questionable ones useful again.  After the sectors have been reallocated on one of the 4 passes, I do a SMART long test and the drives pass.

 

 

I'm not saying this is the be all / end all tool to detect and/or fix errors, but I've tested on box loads of drives and it works well.

 

 

I would ask Joe L. to try badblocks again without the speed enhancements and see what happens. I know it would take a long time, but that's what I've used and it found errors and/or reallocated them.

 

 


There is also the very lengthy non destructive read/write/read/rewrite test.

It takes a really long time without the speed enhancements, but it may prove useful in this example.

