Jump to content
vca

[solved] Disk5 redballed while rebuilding Disk4, what to do?

6 posts in this topic Last Reply

Recommended Posts

This past week I have been replacing a few drives that are getting old (about the 3 year mark) and are showing some issues in the SMART reports.

 

Tonight I installed a replacement for disk4 and started the rebuild process.  All seemed to be going well but after about 15 minutes disk5 redballed and the rebuild stopped.

 

When I hit the stop button to take the array offline the system seemed to enter an endless loop of "retry unmounting users share(s)" messages and the console was spewing errors.  After about 10 minutes of this I just hit the reset button and the system rebooted more or less normally.

 

At this point disk5 is showing red and disk4 is orange.

 

I have the original disk4 still and a spare disk that is the same size as disk5.

 

I can think of two options for proceeding, but are there some better ones?

 

1.  put in the original disk4 and the spare disk for disk5 and then rebuild disk5 from parity.  Is this the "trust my parity" process?

 

2.  if disk5 appears to be working still, put the original disk4 back in and then build parity again

 

Regards,

 

Stephen

Share this post


Link to post

This is how I am proceeding. 

 

I took a look at the SMART report for disk5 and compared it to one that I took a week ago. Two lines showed significant changes:

 

A week ago:

183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1

 

After the drive red balled:

183 Runtime_Bad_Block       0x0032   099   099   000    Old_age   Always       -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       20

 

From the increase in UDMA errors I'm guessing that the most likely cause was a SATA cabling glitch and that the drive is actually still good. So I did the usual check all the cables but didn't find anything amiss. 

 

So I decided to try putting the old disk4 back into service and doing a noncorrecting parity check to see if disk5 still matches up.  As the array was not in use during the 15 minutes or so that the failed disk4 rebuild was taking place it seemed like a pretty safe bet that the parity and contents of all the drives would have been unchanged at the time disk5 hit its issue and the rebuild stopped.

 

So after reading various things about the trust my parity procedure I eventually came across this 16-Nov-2013 message from Tom on http://lime-technology.com/forum/index.php?topic=30270.msg272137#msg272137 removing a drive but preserving parity.  In my case it would boil down to the following 3 steps:

 

    1. Stop array, go to Utils page, click 'New Config' and execute that Utility.

    2. Go back to Main, assign Parity, and all devices as they were, assign the old disk4 back into the right spot and leave the new disk4 (the failed rebuild) out of the array.

    3. Click checkbox "Parity is already valid.", and click Start

 

This I did and it appears to be working as expected, except it is doing a "correcting" parity check:

 

Dec 14 09:25:55 saturn kernel: mdcmd (52): check CORRECT (unRAID engine)

 

and the check box beside "correct any parity-check errors..." in the Main GUI page is checked (but is greyed out)

 

This found 60 parity errors in the first 65K block region of the array, which were probably due to the Reiser-FS journal replays that happened when the array started.  I got messages about replaying a few transactions for most of the disks like:

 

Dec 14 09:25:54 saturn kernel: REISERFS (device md10): found reiserfs format "3.6" with standard journal (Routine)
Dec 14 09:25:54 saturn kernel: REISERFS (device md10): using ordered data mode (Routine)
Dec 14 09:25:54 saturn kernel: reiserfs: using flush barriers
Dec 14 09:25:54 saturn kernel: REISERFS (device md10): journal params: device md10, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30 (Routine)
Dec 14 09:25:54 saturn kernel: REISERFS (device md10): checking transaction log (md10) (Routine)
Dec 14 09:25:54 saturn kernel: REISERFS (device md10): replayed 5 transactions in 0 seconds (Minor Issues)

 

anyway it has now got to 3% (100GB) on the parity check without finding any additional parity mismatches, so I'll let it run.

 

 

Share this post


Link to post

Some more excitement, the parity check got to about 33% when it slowed to a crawl, looking at the main page showed no additional errors on the parity check but errors were showing against disk5.

 

Looking at the syslog showed a lot of errors like:

 

Dec 14 16:56:18 saturn kernel: ata2.00: status: { DRDY }
Dec 14 16:56:18 saturn kernel: ata2: hard resetting link
Dec 14 16:56:19 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Dec 14 16:56:19 saturn kernel: ata2.00: configured for UDMA/33
Dec 14 16:56:19 saturn kernel: ata2: EH complete
Dec 14 16:56:19 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen
Dec 14 16:56:19 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Dec 14 16:56:19 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC }
Dec 14 16:56:19 saturn kernel: ata2.00: failed command: READ DMA EXT
Dec 14 16:56:19 saturn kernel: ata2.00: cmd 25/00:f8:38:4d:91/00:03:a0:00:00/e0 tag 0 dma 520192 in
Dec 14 16:56:19 saturn kernel:          res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x50 (ATA bus error)

 

the log is attached (I had to cut off the last hour of repeated errors).  Looking through it shows this sort of activity having appeared and disappeared around 14:28 and 14:58 and then returned and became constant at 15:52.

 

Along with this the SMART report is showing a dramatic increase in the UDMA error count and one more runtime bad block:

 

  1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -       178194368
  3 Spin_Up_Time            0x0003   092   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   095   095   020    Old_age   Always       -       5763
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   071   060   030    Pre-fail  Always       -       13478189
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       6530
10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       34
183 Runtime_Bad_Block       0x0032   098   098   000    Old_age   Always       -       2
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   065   059   045    Old_age   Always       -       35 (Min/Max 33/38)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       9
193 Load_Cycle_Count        0x0032   092   092   000    Old_age   Always       -       16412
194 Temperature_Celsius     0x0022   035   041   000    Old_age   Always       -       35 (0 20 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   195   195   000    Old_age   Always       -       6949

 

At this point I figure I have demonstrated the parity drive is fine and the previous disk4 was accepted back into service ok, so I'm just going to put the spare drive into the array as a replacement for the disk5.

 

Any other thoughts?

syslog-2013-12-14-shorter.zip

Share this post


Link to post

The saga continues.

 

The rebuild of disk5 finished on time over night.  No errors were reported and the log file looked clean:

 

Dec 14 17:28:03 saturn emhttp_event: disks_mounted
Dec 14 17:28:03 saturn kernel: mdcmd (51): check CORRECT
Dec 14 17:28:03 saturn kernel: md: recovery thread woken up ...
Dec 14 17:28:03 saturn kernel: md: recovery thread rebuilding disk5 ...
Dec 14 17:28:04 saturn kernel: md: using 1536k window, over a total of 3907018532 blocks.
Dec 14 17:28:04 saturn emhttp: shcmd (79): :>/etc/samba/smb-shares.conf
Dec 14 17:28:05 saturn emhttp: get_config_idx: fopen /boot/config/shares/ImageBackup.cfg: No such file or directory - assigning defaults
Dec 14 17:28:05 saturn emhttp: Restart SMB...
Dec 14 17:28:05 saturn emhttp: shcmd (80): killall -HUP smbd
Dec 14 17:28:05 saturn emhttp: shcmd (81): ps axc | grep -q rpc.mountd
Dec 14 17:28:05 saturn emhttp: _shcmd: shcmd (81): exit status: 1
Dec 14 17:28:05 saturn emhttp: shcmd (82): /usr/local/sbin/emhttp_event svcs_restarted
Dec 14 17:28:05 saturn emhttp_event: svcs_restarted
Dec 15 01:12:24 saturn kernel: mdcmd (52): spindown 1
Dec 15 01:12:25 saturn kernel: mdcmd (53): spindown 4
Dec 15 01:12:25 saturn kernel: mdcmd (54): spindown 7
Dec 15 01:12:26 saturn kernel: mdcmd (55): spindown 9
Dec 15 01:43:59 saturn kernel: mdcmd (56): spindown 1
Dec 15 01:44:00 saturn kernel: mdcmd (57): spindown 4
Dec 15 01:44:00 saturn kernel: mdcmd (58): spindown 7
Dec 15 01:44:01 saturn kernel: mdcmd (59): spindown 9
Dec 15 02:43:56 saturn kernel: mdcmd (60): spindown 1
Dec 15 02:43:57 saturn kernel: mdcmd (61): spindown 4
Dec 15 02:43:57 saturn kernel: mdcmd (62): spindown 7
Dec 15 02:43:58 saturn kernel: mdcmd (63): spindown 9
Dec 15 03:43:54 saturn kernel: mdcmd (64): spindown 1
Dec 15 03:43:54 saturn kernel: mdcmd (65): spindown 4
Dec 15 03:43:55 saturn kernel: mdcmd (66): spindown 7
Dec 15 03:43:55 saturn kernel: mdcmd (67): spindown 9
Dec 15 04:44:00 saturn kernel: mdcmd (68): spindown 1
Dec 15 04:44:01 saturn kernel: mdcmd (69): spindown 4
Dec 15 04:44:01 saturn kernel: mdcmd (70): spindown 7
Dec 15 04:44:02 saturn kernel: mdcmd (71): spindown 9
Dec 15 05:27:16 saturn dhcpcd[1123]: eth0: renewing lease of 192.168.1.90
Dec 15 05:27:16 saturn dhcpcd[1123]: eth0: acknowledged 192.168.1.90 from 192.168.1.1
Dec 15 05:27:16 saturn dhcpcd[1123]: eth0: leased 192.168.1.90 for 86400 seconds
Dec 15 05:43:59 saturn kernel: mdcmd (72): spindown 1
Dec 15 05:44:00 saturn kernel: mdcmd (73): spindown 4
Dec 15 05:44:00 saturn kernel: mdcmd (74): spindown 7
Dec 15 05:44:01 saturn kernel: mdcmd (75): spindown 9
Dec 15 05:51:38 saturn kernel: md: sync done. time=44615sec
Dec 15 05:51:38 saturn kernel: md: recovery thread sync completion status: 0

 

So at 0651 I launched the parity check, it immediately ran into IO errors:

 

Dec 15 06:51:04 saturn kernel: mdcmd (86): check CORRECT
Dec 15 06:51:04 saturn kernel: md: recovery thread woken up ...
Dec 15 06:51:04 saturn kernel: md: recovery thread checking parity...
Dec 15 06:51:04 saturn kernel: md: using 1536k window, over a total of 3907018532 blocks.
Dec 15 06:51:42 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280901 action 0x6 frozen
Dec 15 06:51:42 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Dec 15 06:51:42 saturn kernel: ata2: SError: { RecovData UnrecovData HostInt 10B8B BadCRC }
Dec 15 06:51:42 saturn kernel: ata2.00: failed command: READ DMA EXT
Dec 15 06:51:42 saturn kernel: ata2.00: cmd 25/00:00:50:60:23/00:02:00:00:00/e0 tag 0 dma 262144 in
Dec 15 06:51:42 saturn kernel:          res 50/00:00:4f:60:23/00:00:00:00:00/e0 Emask 0x50 (ATA bus error)
Dec 15 06:51:42 saturn kernel: ata2.00: status: { DRDY }
Dec 15 06:51:42 saturn kernel: ata2: hard resetting link
Dec 15 06:51:42 saturn kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec 15 06:51:42 saturn kernel: ata2.00: configured for UDMA/133
Dec 15 06:51:42 saturn kernel: ata2: EH complete
Dec 15 06:52:00 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen
Dec 15 06:52:00 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Dec 15 06:52:00 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC }
Dec 15 06:52:00 saturn kernel: ata2.00: failed command: READ DMA EXT
Dec 15 06:52:00 saturn kernel: ata2.00: cmd 25/00:00:88:df:40/00:04:00:00:00/e0 tag 0 dma 524288 in
Dec 15 06:52:00 saturn kernel:          res 50/00:00:87:df:40/00:00:00:00:00/e0 Emask 0x50 (ATA bus error)
Dec 15 06:52:00 saturn kernel: ata2.00: status: { DRDY }
Dec 15 06:52:00 saturn kernel: ata2: hard resetting link
Dec 15 06:52:00 saturn kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec 15 06:52:00 saturn kernel: ata2.00: configured for UDMA/133
Dec 15 06:52:00 saturn kernel: ata2: EH complete
Dec 15 06:52:01 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen
Dec 15 06:52:01 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Dec 15 06:52:01 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC }
Dec 15 06:52:01 saturn kernel: ata2.00: failed command: READ DMA EXT
Dec 15 06:52:01 saturn kernel: ata2.00: cmd 25/00:00:38:11:42/00:04:00:00:00/e0 tag 0 dma 524288 in
Dec 15 06:52:01 saturn kernel:          res 50/00:00:37:11:42/00:00:00:00:00/e0 Emask 0x50 (ATA bus error)
Dec 15 06:52:01 saturn kernel: ata2.00: status: { DRDY }
Dec 15 06:52:01 saturn kernel: ata2: hard resetting link
Dec 15 06:52:01 saturn kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec 15 06:52:01 saturn kernel: ata2.00: configured for UDMA/133
Dec 15 06:52:01 saturn kernel: ata2: EH complete
Dec 15 06:52:06 saturn kernel: ata2: limiting SATA link speed to 1.5 Gbps
Dec 15 06:52:06 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen
Dec 15 06:52:06 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Dec 15 06:52:06 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC }
Dec 15 06:52:06 saturn kernel: ata2.00: failed command: READ DMA EXT
Dec 15 06:52:06 saturn kernel: ata2.00: cmd 25/00:00:10:b0:48/00:04:00:00:00/e0 tag 0 dma 524288 in
Dec 15 06:52:06 saturn kernel:          res 50/00:00:0f:b0:48/00:00:00:00:00/e0 Emask 0x50 (ATA bus error)
Dec 15 06:52:06 saturn kernel: ata2.00: status: { DRDY }
Dec 15 06:52:06 saturn kernel: ata2: hard resetting link
Dec 15 06:52:06 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Dec 15 06:52:06 saturn kernel: ata2.00: configured for UDMA/133
Dec 15 06:52:06 saturn kernel: ata2: EH complete
Dec 15 06:52:06 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen
Dec 15 06:52:06 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Dec 15 06:52:06 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC }
Dec 15 06:52:06 saturn kernel: ata2.00: failed command: READ DMA EXT
Dec 15 06:52:06 saturn kernel: ata2.00: cmd 25/00:00:10:b0:48/00:04:00:00:00/e0 tag 0 dma 524288 in
Dec 15 06:52:06 saturn kernel:          res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x50 (ATA bus error)
Dec 15 06:52:06 saturn kernel: ata2.00: status: { DRDY }
Dec 15 06:52:06 saturn kernel: ata2: hard resetting link
Dec 15 06:52:07 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Dec 15 06:52:07 saturn kernel: ata2.00: configured for UDMA/133
Dec 15 06:52:07 saturn kernel: ata2: EH complete
Dec 15 06:52:07 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen
Dec 15 06:52:07 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Dec 15 06:52:07 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC }
Dec 15 06:52:07 saturn kernel: ata2.00: failed command: READ DMA EXT
Dec 15 06:52:07 saturn kernel: ata2.00: cmd 25/00:00:10:b0:48/00:04:00:00:00/e0 tag 0 dma 524288 in
Dec 15 06:52:07 saturn kernel:          res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x50 (ATA bus error)
Dec 15 06:52:07 saturn kernel: ata2.00: status: { DRDY }
Dec 15 06:52:07 saturn kernel: ata2: hard resetting link
Dec 15 06:52:07 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Dec 15 06:52:07 saturn kernel: ata2.00: configured for UDMA/133
Dec 15 06:52:07 saturn kernel: ata2: EH complete
Dec 15 06:52:07 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen
Dec 15 06:52:07 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Dec 15 06:52:07 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC }
Dec 15 06:52:07 saturn kernel: ata2.00: failed command: READ DMA EXT
Dec 15 06:52:07 saturn kernel: ata2.00: cmd 25/00:00:10:b0:48/00:04:00:00:00/e0 tag 0 dma 524288 in
Dec 15 06:52:07 saturn kernel:          res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x50 (ATA bus error)
Dec 15 06:52:07 saturn kernel: ata2.00: status: { DRDY }
Dec 15 06:52:07 saturn kernel: ata2: hard resetting link
Dec 15 06:52:08 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Dec 15 06:52:08 saturn kernel: ata2.00: configured for UDMA/133
Dec 15 06:52:08 saturn kernel: ata2: EH complete
Dec 15 06:52:08 saturn kernel: ata2.00: limiting speed to UDMA/100:PIO4
Dec 15 06:52:08 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen
Dec 15 06:52:08 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Dec 15 06:52:08 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC }
Dec 15 06:52:08 saturn kernel: ata2.00: failed command: READ DMA EXT
Dec 15 06:52:08 saturn kernel: ata2.00: cmd 25/00:00:10:b0:48/00:04:00:00:00/e0 tag 0 dma 524288 in
Dec 15 06:52:08 saturn kernel:          res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x50 (ATA bus error)
Dec 15 06:52:08 saturn kernel: ata2.00: status: { DRDY }
Dec 15 06:52:08 saturn kernel: ata2: hard resetting link
Dec 15 06:52:08 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Dec 15 06:52:08 saturn kernel: ata2.00: configured for UDMA/100
Dec 15 06:52:08 saturn kernel: ata2: EH complete
Dec 15 06:52:08 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen
Dec 15 06:52:08 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Dec 15 06:52:08 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC }
Dec 15 06:52:08 saturn kernel: ata2.00: failed command: READ DMA EXT
Dec 15 06:52:08 saturn kernel: ata2.00: cmd 25/00:00:10:b0:48/00:04:00:00:00/e0 tag 0 dma 524288 in
Dec 15 06:52:08 saturn kernel:          res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x50 (ATA bus error)
Dec 15 06:52:08 saturn kernel: ata2.00: status: { DRDY }
Dec 15 06:52:08 saturn kernel: ata2: hard resetting link
Dec 15 06:52:09 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Dec 15 06:52:09 saturn kernel: ata2.00: configured for UDMA/100
Dec 15 06:52:09 saturn kernel: sd 2:0:0:0: [sdb]  Result: hostbyte=0x00 driverbyte=0x08
Dec 15 06:52:09 saturn kernel: sd 2:0:0:0: [sdb]  Sense Key : 0xb [current] [descriptor]
Dec 15 06:52:09 saturn kernel: Descriptor sense data with sense descriptors (in hex):
Dec 15 06:52:09 saturn kernel:         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
Dec 15 06:52:09 saturn kernel:         00 00 00 00 
Dec 15 06:52:09 saturn kernel: sd 2:0:0:0: [sdb]  ASC=0x0 ASCQ=0x0
Dec 15 06:52:09 saturn kernel: sd 2:0:0:0: [sdb] CDB: cdb[0]=0x28: 28 00 00 48 b0 10 00 04 00 00
Dec 15 06:52:09 saturn kernel: end_request: I/O error, dev sdb, sector 4763664
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763600/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763608/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763616/5, count: 1
Dec 15 06:52:09 saturn kernel: ata2: EH complete
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763624/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763632/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763640/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763648/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763656/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763664/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763672/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763680/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763688/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763696/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763704/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763712/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763720/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763728/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763736/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763744/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763752/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763760/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763768/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763776/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763784/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763792/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763800/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763808/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763816/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763824/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763832/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763840/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763848/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763856/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763864/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763872/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763880/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763888/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763896/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763904/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763912/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763920/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763928/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763936/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763944/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763952/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763960/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763968/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763976/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763984/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763992/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4764000/5, count: 1
Dec 15 06:52:09 saturn kernel: md: disk5 read error
Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4764008/5, count: 1

 

I then stopped the array and recorded the SMART report for disk5, which now shows one runtime bad block and 668 UDMA errors:

 

183 Runtime_Bad_Block       0x0032   099   099   000    Old_age   Always       -       1
199 UDMA_CRC_Error_Count    0x003e   196   194   000    Old_age   Always       -       668

 

So at this point I started thinking there must be something wrong with the SATA cable or the controller port, so I shutdown the system and moved the drive to a different port and started the parity check again.

 

At this point the parity check is running normally without any issues.

syslog-2013-12-15.zip

Share this post


Link to post

A final note to report that the process of reassembling the array seems to have gone fine.  About half the files on disk5 (the disk that lost communications and was later rebuilt from parity) were backup data files.  As these have a built-in MD5 checksum I was able to test them and they all were fine.  The other files are covered by backup, but I need to run a verify pass over these (to compare them to their backups) to double check.  First I need to modify my backup software a bit to allow me to just test the files on disk5 rather than the whole virtual share that has many disks in it (which is the path that gets backed up).

 

Stephen

Share this post


Link to post

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.