vca Posted December 14, 2013 Share Posted December 14, 2013 This past week I have been replacing a few drives that are getting old (about the 3 year mark) and are showing some issues in the SMART reports. Tonight I installed a replacement for disk4 and started the rebuild process. All seemed to be going well but after about 15 minutes disk5 redballed and the rebuild stopped. When I hit the stop button to take the array offline the system seemed to enter an endless loop of "retry unmounting users share(s)" messages and the console was spewing errors. After about 10 minutes of this I just hit the reset button and the system rebooted more or less normally. At this point disk5 is showing red and disk4 is orange. I have the original disk4 still and a spare disk that is the same size as disk5. I can think of two options for proceeding, but are there some better ones? 1. put in the original disk4 and the spare disk for disk5 and then rebuild disk5 from parity. Is this the "trust my parity" process? 2. if disk5 appears to be working still, put the original disk4 back in and then build parity again Regards, Stephen Quote Link to comment
vca Posted December 14, 2013 Author Share Posted December 14, 2013 I'm running version 5.0-rc11 Stephen Quote Link to comment
vca Posted December 14, 2013 Author Share Posted December 14, 2013 This is how I am proceeding. I took a look at the SMART report for disk5 and compared it to one that I took a week ago. Two lines showed significant changes: A week ago: 183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 1 After the drive red balled: 183 Runtime_Bad_Block 0x0032 099 099 000 Old_age Always - 1 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 20 From the increase in UDMA errors I'm guessing that the most likely cause was a SATA cabling glitch and that the drive is actually still good. So I did the usual check all the cables but didn't find anything amiss. So I decided to try putting the old disk4 back into service and doing a noncorrecting parity check to see if disk5 still matches up. As the array was not in use during the 15 minutes or so that the failed disk4 rebuild was taking place it seemed like a pretty safe bet that the parity and contents of all the drives would have been unchanged at the time disk5 hit its issue and the rebuild stopped. So after reading various things about the trust my parity procedure I eventually came across this 16-Nov-2013 message from Tom on http://lime-technology.com/forum/index.php?topic=30270.msg272137#msg272137 removing a drive but preserving parity. In my case it would boil down to the following 3 steps: 1. Stop array, go to Utils page, click 'New Config' and execute that Utility. 2. Go back to Main, assign Parity, and all devices as they were, assign the old disk4 back into the right spot and leave the new disk4 (the failed rebuild) out of the array. 3. Click checkbox "Parity is already valid.", and click Start This I did and it appears to be working as expected, except it is doing a "correcting" parity check: Dec 14 09:25:55 saturn kernel: mdcmd (52): check CORRECT (unRAID engine) and the check box beside "correct any parity-check errors..." in the Main GUI page is checked (but is greyed out) This found 60 parity errors in the first 65K block region of the array, which were probably due to the Reiser-FS journal replays that happened when the array started. I got messages about replaying a few transactions for most of the disks like: Dec 14 09:25:54 saturn kernel: REISERFS (device md10): found reiserfs format "3.6" with standard journal (Routine) Dec 14 09:25:54 saturn kernel: REISERFS (device md10): using ordered data mode (Routine) Dec 14 09:25:54 saturn kernel: reiserfs: using flush barriers Dec 14 09:25:54 saturn kernel: REISERFS (device md10): journal params: device md10, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30 (Routine) Dec 14 09:25:54 saturn kernel: REISERFS (device md10): checking transaction log (md10) (Routine) Dec 14 09:25:54 saturn kernel: REISERFS (device md10): replayed 5 transactions in 0 seconds (Minor Issues) anyway it has now got to 3% (100GB) on the parity check without finding any additional parity mismatches, so I'll let it run. Quote Link to comment
vca Posted December 15, 2013 Author Share Posted December 15, 2013 Some more excitement, the parity check got to about 33% when it slowed to a crawl, looking at the main page showed no additional errors on the parity check but errors were showing against disk5. Looking at the syslog showed a lot of errors like: Dec 14 16:56:18 saturn kernel: ata2.00: status: { DRDY } Dec 14 16:56:18 saturn kernel: ata2: hard resetting link Dec 14 16:56:19 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310) Dec 14 16:56:19 saturn kernel: ata2.00: configured for UDMA/33 Dec 14 16:56:19 saturn kernel: ata2: EH complete Dec 14 16:56:19 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen Dec 14 16:56:19 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error Dec 14 16:56:19 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC } Dec 14 16:56:19 saturn kernel: ata2.00: failed command: READ DMA EXT Dec 14 16:56:19 saturn kernel: ata2.00: cmd 25/00:f8:38:4d:91/00:03:a0:00:00/e0 tag 0 dma 520192 in Dec 14 16:56:19 saturn kernel: res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x50 (ATA bus error) the log is attached (I had to cut off the last hour of repeated errors). Looking through it shows this sort of activity having appeared and disappeared around 14:28 and 14:58 and then returned and became constant at 15:52. Along with this the SMART report is showing a dramatic increase in the UDMA error count and one more runtime bad block: 1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 178194368 3 Spin_Up_Time 0x0003 092 092 000 Pre-fail Always - 0 4 Start_Stop_Count 0x0032 095 095 020 Old_age Always - 5763 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0 7 Seek_Error_Rate 0x000f 071 060 030 Pre-fail Always - 13478189 9 Power_On_Hours 0x0032 093 093 000 Old_age Always - 6530 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 34 183 Runtime_Bad_Block 0x0032 098 098 000 Old_age Always - 2 184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0 190 Airflow_Temperature_Cel 0x0022 065 059 045 Old_age Always - 35 (Min/Max 33/38) 191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 9 193 Load_Cycle_Count 0x0032 092 092 000 Old_age Always - 16412 194 Temperature_Celsius 0x0022 035 041 000 Old_age Always - 35 (0 20 0 0) 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 195 195 000 Old_age Always - 6949 At this point I figure I have demonstrated the parity drive is fine and the previous disk4 was accepted back into service ok, so I'm just going to put the spare drive into the array as a replacement for the disk5. Any other thoughts? syslog-2013-12-14-shorter.zip Quote Link to comment
vca Posted December 15, 2013 Author Share Posted December 15, 2013 The saga continues. The rebuild of disk5 finished on time over night. No errors were reported and the log file looked clean: Dec 14 17:28:03 saturn emhttp_event: disks_mounted Dec 14 17:28:03 saturn kernel: mdcmd (51): check CORRECT Dec 14 17:28:03 saturn kernel: md: recovery thread woken up ... Dec 14 17:28:03 saturn kernel: md: recovery thread rebuilding disk5 ... Dec 14 17:28:04 saturn kernel: md: using 1536k window, over a total of 3907018532 blocks. Dec 14 17:28:04 saturn emhttp: shcmd (79): :>/etc/samba/smb-shares.conf Dec 14 17:28:05 saturn emhttp: get_config_idx: fopen /boot/config/shares/ImageBackup.cfg: No such file or directory - assigning defaults Dec 14 17:28:05 saturn emhttp: Restart SMB... Dec 14 17:28:05 saturn emhttp: shcmd (80): killall -HUP smbd Dec 14 17:28:05 saturn emhttp: shcmd (81): ps axc | grep -q rpc.mountd Dec 14 17:28:05 saturn emhttp: _shcmd: shcmd (81): exit status: 1 Dec 14 17:28:05 saturn emhttp: shcmd (82): /usr/local/sbin/emhttp_event svcs_restarted Dec 14 17:28:05 saturn emhttp_event: svcs_restarted Dec 15 01:12:24 saturn kernel: mdcmd (52): spindown 1 Dec 15 01:12:25 saturn kernel: mdcmd (53): spindown 4 Dec 15 01:12:25 saturn kernel: mdcmd (54): spindown 7 Dec 15 01:12:26 saturn kernel: mdcmd (55): spindown 9 Dec 15 01:43:59 saturn kernel: mdcmd (56): spindown 1 Dec 15 01:44:00 saturn kernel: mdcmd (57): spindown 4 Dec 15 01:44:00 saturn kernel: mdcmd (58): spindown 7 Dec 15 01:44:01 saturn kernel: mdcmd (59): spindown 9 Dec 15 02:43:56 saturn kernel: mdcmd (60): spindown 1 Dec 15 02:43:57 saturn kernel: mdcmd (61): spindown 4 Dec 15 02:43:57 saturn kernel: mdcmd (62): spindown 7 Dec 15 02:43:58 saturn kernel: mdcmd (63): spindown 9 Dec 15 03:43:54 saturn kernel: mdcmd (64): spindown 1 Dec 15 03:43:54 saturn kernel: mdcmd (65): spindown 4 Dec 15 03:43:55 saturn kernel: mdcmd (66): spindown 7 Dec 15 03:43:55 saturn kernel: mdcmd (67): spindown 9 Dec 15 04:44:00 saturn kernel: mdcmd (68): spindown 1 Dec 15 04:44:01 saturn kernel: mdcmd (69): spindown 4 Dec 15 04:44:01 saturn kernel: mdcmd (70): spindown 7 Dec 15 04:44:02 saturn kernel: mdcmd (71): spindown 9 Dec 15 05:27:16 saturn dhcpcd[1123]: eth0: renewing lease of 192.168.1.90 Dec 15 05:27:16 saturn dhcpcd[1123]: eth0: acknowledged 192.168.1.90 from 192.168.1.1 Dec 15 05:27:16 saturn dhcpcd[1123]: eth0: leased 192.168.1.90 for 86400 seconds Dec 15 05:43:59 saturn kernel: mdcmd (72): spindown 1 Dec 15 05:44:00 saturn kernel: mdcmd (73): spindown 4 Dec 15 05:44:00 saturn kernel: mdcmd (74): spindown 7 Dec 15 05:44:01 saturn kernel: mdcmd (75): spindown 9 Dec 15 05:51:38 saturn kernel: md: sync done. time=44615sec Dec 15 05:51:38 saturn kernel: md: recovery thread sync completion status: 0 So at 0651 I launched the parity check, it immediately ran into IO errors: Dec 15 06:51:04 saturn kernel: mdcmd (86): check CORRECT Dec 15 06:51:04 saturn kernel: md: recovery thread woken up ... Dec 15 06:51:04 saturn kernel: md: recovery thread checking parity... Dec 15 06:51:04 saturn kernel: md: using 1536k window, over a total of 3907018532 blocks. Dec 15 06:51:42 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280901 action 0x6 frozen Dec 15 06:51:42 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error Dec 15 06:51:42 saturn kernel: ata2: SError: { RecovData UnrecovData HostInt 10B8B BadCRC } Dec 15 06:51:42 saturn kernel: ata2.00: failed command: READ DMA EXT Dec 15 06:51:42 saturn kernel: ata2.00: cmd 25/00:00:50:60:23/00:02:00:00:00/e0 tag 0 dma 262144 in Dec 15 06:51:42 saturn kernel: res 50/00:00:4f:60:23/00:00:00:00:00/e0 Emask 0x50 (ATA bus error) Dec 15 06:51:42 saturn kernel: ata2.00: status: { DRDY } Dec 15 06:51:42 saturn kernel: ata2: hard resetting link Dec 15 06:51:42 saturn kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Dec 15 06:51:42 saturn kernel: ata2.00: configured for UDMA/133 Dec 15 06:51:42 saturn kernel: ata2: EH complete Dec 15 06:52:00 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen Dec 15 06:52:00 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error Dec 15 06:52:00 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC } Dec 15 06:52:00 saturn kernel: ata2.00: failed command: READ DMA EXT Dec 15 06:52:00 saturn kernel: ata2.00: cmd 25/00:00:88:df:40/00:04:00:00:00/e0 tag 0 dma 524288 in Dec 15 06:52:00 saturn kernel: res 50/00:00:87:df:40/00:00:00:00:00/e0 Emask 0x50 (ATA bus error) Dec 15 06:52:00 saturn kernel: ata2.00: status: { DRDY } Dec 15 06:52:00 saturn kernel: ata2: hard resetting link Dec 15 06:52:00 saturn kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Dec 15 06:52:00 saturn kernel: ata2.00: configured for UDMA/133 Dec 15 06:52:00 saturn kernel: ata2: EH complete Dec 15 06:52:01 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen Dec 15 06:52:01 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error Dec 15 06:52:01 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC } Dec 15 06:52:01 saturn kernel: ata2.00: failed command: READ DMA EXT Dec 15 06:52:01 saturn kernel: ata2.00: cmd 25/00:00:38:11:42/00:04:00:00:00/e0 tag 0 dma 524288 in Dec 15 06:52:01 saturn kernel: res 50/00:00:37:11:42/00:00:00:00:00/e0 Emask 0x50 (ATA bus error) Dec 15 06:52:01 saturn kernel: ata2.00: status: { DRDY } Dec 15 06:52:01 saturn kernel: ata2: hard resetting link Dec 15 06:52:01 saturn kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Dec 15 06:52:01 saturn kernel: ata2.00: configured for UDMA/133 Dec 15 06:52:01 saturn kernel: ata2: EH complete Dec 15 06:52:06 saturn kernel: ata2: limiting SATA link speed to 1.5 Gbps Dec 15 06:52:06 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen Dec 15 06:52:06 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error Dec 15 06:52:06 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC } Dec 15 06:52:06 saturn kernel: ata2.00: failed command: READ DMA EXT Dec 15 06:52:06 saturn kernel: ata2.00: cmd 25/00:00:10:b0:48/00:04:00:00:00/e0 tag 0 dma 524288 in Dec 15 06:52:06 saturn kernel: res 50/00:00:0f:b0:48/00:00:00:00:00/e0 Emask 0x50 (ATA bus error) Dec 15 06:52:06 saturn kernel: ata2.00: status: { DRDY } Dec 15 06:52:06 saturn kernel: ata2: hard resetting link Dec 15 06:52:06 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310) Dec 15 06:52:06 saturn kernel: ata2.00: configured for UDMA/133 Dec 15 06:52:06 saturn kernel: ata2: EH complete Dec 15 06:52:06 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen Dec 15 06:52:06 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error Dec 15 06:52:06 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC } Dec 15 06:52:06 saturn kernel: ata2.00: failed command: READ DMA EXT Dec 15 06:52:06 saturn kernel: ata2.00: cmd 25/00:00:10:b0:48/00:04:00:00:00/e0 tag 0 dma 524288 in Dec 15 06:52:06 saturn kernel: res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x50 (ATA bus error) Dec 15 06:52:06 saturn kernel: ata2.00: status: { DRDY } Dec 15 06:52:06 saturn kernel: ata2: hard resetting link Dec 15 06:52:07 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310) Dec 15 06:52:07 saturn kernel: ata2.00: configured for UDMA/133 Dec 15 06:52:07 saturn kernel: ata2: EH complete Dec 15 06:52:07 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen Dec 15 06:52:07 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error Dec 15 06:52:07 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC } Dec 15 06:52:07 saturn kernel: ata2.00: failed command: READ DMA EXT Dec 15 06:52:07 saturn kernel: ata2.00: cmd 25/00:00:10:b0:48/00:04:00:00:00/e0 tag 0 dma 524288 in Dec 15 06:52:07 saturn kernel: res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x50 (ATA bus error) Dec 15 06:52:07 saturn kernel: ata2.00: status: { DRDY } Dec 15 06:52:07 saturn kernel: ata2: hard resetting link Dec 15 06:52:07 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310) Dec 15 06:52:07 saturn kernel: ata2.00: configured for UDMA/133 Dec 15 06:52:07 saturn kernel: ata2: EH complete Dec 15 06:52:07 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen Dec 15 06:52:07 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error Dec 15 06:52:07 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC } Dec 15 06:52:07 saturn kernel: ata2.00: failed command: READ DMA EXT Dec 15 06:52:07 saturn kernel: ata2.00: cmd 25/00:00:10:b0:48/00:04:00:00:00/e0 tag 0 dma 524288 in Dec 15 06:52:07 saturn kernel: res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x50 (ATA bus error) Dec 15 06:52:07 saturn kernel: ata2.00: status: { DRDY } Dec 15 06:52:07 saturn kernel: ata2: hard resetting link Dec 15 06:52:08 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310) Dec 15 06:52:08 saturn kernel: ata2.00: configured for UDMA/133 Dec 15 06:52:08 saturn kernel: ata2: EH complete Dec 15 06:52:08 saturn kernel: ata2.00: limiting speed to UDMA/100:PIO4 Dec 15 06:52:08 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen Dec 15 06:52:08 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error Dec 15 06:52:08 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC } Dec 15 06:52:08 saturn kernel: ata2.00: failed command: READ DMA EXT Dec 15 06:52:08 saturn kernel: ata2.00: cmd 25/00:00:10:b0:48/00:04:00:00:00/e0 tag 0 dma 524288 in Dec 15 06:52:08 saturn kernel: res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x50 (ATA bus error) Dec 15 06:52:08 saturn kernel: ata2.00: status: { DRDY } Dec 15 06:52:08 saturn kernel: ata2: hard resetting link Dec 15 06:52:08 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310) Dec 15 06:52:08 saturn kernel: ata2.00: configured for UDMA/100 Dec 15 06:52:08 saturn kernel: ata2: EH complete Dec 15 06:52:08 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen Dec 15 06:52:08 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error Dec 15 06:52:08 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC } Dec 15 06:52:08 saturn kernel: ata2.00: failed command: READ DMA EXT Dec 15 06:52:08 saturn kernel: ata2.00: cmd 25/00:00:10:b0:48/00:04:00:00:00/e0 tag 0 dma 524288 in Dec 15 06:52:08 saturn kernel: res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x50 (ATA bus error) Dec 15 06:52:08 saturn kernel: ata2.00: status: { DRDY } Dec 15 06:52:08 saturn kernel: ata2: hard resetting link Dec 15 06:52:09 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310) Dec 15 06:52:09 saturn kernel: ata2.00: configured for UDMA/100 Dec 15 06:52:09 saturn kernel: sd 2:0:0:0: [sdb] Result: hostbyte=0x00 driverbyte=0x08 Dec 15 06:52:09 saturn kernel: sd 2:0:0:0: [sdb] Sense Key : 0xb [current] [descriptor] Dec 15 06:52:09 saturn kernel: Descriptor sense data with sense descriptors (in hex): Dec 15 06:52:09 saturn kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 Dec 15 06:52:09 saturn kernel: 00 00 00 00 Dec 15 06:52:09 saturn kernel: sd 2:0:0:0: [sdb] ASC=0x0 ASCQ=0x0 Dec 15 06:52:09 saturn kernel: sd 2:0:0:0: [sdb] CDB: cdb[0]=0x28: 28 00 00 48 b0 10 00 04 00 00 Dec 15 06:52:09 saturn kernel: end_request: I/O error, dev sdb, sector 4763664 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763600/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763608/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763616/5, count: 1 Dec 15 06:52:09 saturn kernel: ata2: EH complete Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763624/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763632/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763640/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763648/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763656/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763664/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763672/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763680/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763688/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763696/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763704/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763712/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763720/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763728/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763736/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763744/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763752/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763760/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763768/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763776/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763784/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763792/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763800/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763808/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763816/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763824/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763832/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763840/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763848/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763856/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763864/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763872/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763880/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763888/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763896/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763904/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763912/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763920/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763928/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763936/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763944/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763952/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763960/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763968/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763976/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763984/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763992/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4764000/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4764008/5, count: 1 I then stopped the array and recorded the SMART report for disk5, which now shows one runtime bad block and 668 UDMA errors: 183 Runtime_Bad_Block 0x0032 099 099 000 Old_age Always - 1 199 UDMA_CRC_Error_Count 0x003e 196 194 000 Old_age Always - 668 So at this point I started thinking there must be something wrong with the SATA cable or the controller port, so I shutdown the system and moved the drive to a different port and started the parity check again. At this point the parity check is running normally without any issues. syslog-2013-12-15.zip Quote Link to comment
vca Posted December 20, 2013 Author Share Posted December 20, 2013 A final note to report that the process of reassembling the array seems to have gone fine. About half the files on disk5 (the disk that lost communications and was later rebuilt from parity) were backup data files. As these have a built-in MD5 checksum I was able to test them and they all were fine. The other files are covered by backup, but I need to run a verify pass over these (to compare them to their backups) to double check. First I need to modify my backup software a bit to allow me to just test the files on disk5 rather than the whole virtual share that has many disks in it (which is the path that gets backed up). Stephen Quote Link to comment
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.