vca

Members
  • Posts

    321

Everything posted by vca

  1. As bjp999 suggests, there might be an issue with a drive not always returning the same data each time a block is read. I had a battle with this that is reported here: http://lime-technology.com/forum/index.php?topic=11515.msg109840#msg109840 though this sort of thing appears to be very rare, so it might not be your case at all. If this is the cause, you have to test all the drives by reading the blocks in the region where the parity error is reported. You do this many times, and if one of those drives has this fault the read will occasionally return different data (even though the drives are not being written to). Regards, Stephen
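The repeated-read test described above could be sketched roughly as follows. This is only an illustration, not a tool from the linked thread: the function names and the choice of SHA-1 are assumptions. Note that the operating system's page cache can satisfy repeated reads without touching the disk, so on a real system the cache would have to be dropped or bypassed between passes for the result to mean anything.

```python
import hashlib


def region_digest(path, offset, length, chunk_size=1024 * 1024):
    """Hash `length` bytes of `path` starting at byte `offset`."""
    digest = hashlib.sha1()
    remaining = length
    with open(path, "rb") as f:
        f.seek(offset)
        while remaining > 0:
            data = f.read(min(chunk_size, remaining))
            if not data:  # reached the end of the file or device early
                break
            digest.update(data)
            remaining -= len(data)
    return digest.hexdigest()


def distinct_digests(path, offset, length, passes=10):
    """Read the same region `passes` times and collect the distinct digests.

    A healthy, idle drive should yield exactly one digest; more than one
    means at least one pass returned different data.
    """
    return {region_digest(path, offset, length) for _ in range(passes)}
```

To use it you would call `distinct_digests` on each drive's raw device (a hypothetical path such as `/dev/sdX`), with an offset and length covering the area where the parity error was reported, and treat any result containing more than one digest as suspect.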
  2. From the SMART report your drive shows:

     196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1
     197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
     198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
     199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 274

     So the drive has only one bad block, and that has already been remapped, so the surface is good. But it has 274 UDMA CRC errors, which usually indicate some problem with the SATA or power cabling. So it is time to check (reseat or perhaps replace) the SATA and power cables. If you have a power splitter in the cable then look at that too. Regards, Stephen
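As a rough illustration of the reading above, the sketch below picks the watched attributes out of a pasted SMART attribute table and lists those with a non-zero raw value. The `WATCHED` table, its descriptions and the function name are assumptions made for this example, and the parser only handles raw values that start with a plain integer.

```python
SAMPLE = """\
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 274
"""

# Attribute IDs worth a look, with a short hint (illustrative wording only).
WATCHED = {
    196: "remap events have occurred",
    197: "sectors waiting to be remapped",
    198: "sectors that failed an offline scan",
    199: "interface CRC errors, often cabling",
}


def nonzero_watched(report_text):
    """Return (name, raw_value, hint) for watched attributes with raw > 0."""
    hits = []
    for line in report_text.splitlines():
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header or unrelated line
        attr_id = int(fields[0])
        if attr_id not in WATCHED:
            continue
        raw = int(fields[9].split()[0])
        if raw > 0:
            hits.append((fields[1], raw, WATCHED[attr_id]))
    return hits


for name, raw, hint in nonzero_watched(SAMPLE):
    print(f"{name}: {raw} ({hint})")
```

For the four lines quoted in the post this reports the single remap event and the 274 CRC errors, and stays silent about the two attributes that are zero.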
  3. The preclear of a pair of 4TB Seagate desktop drives finished on the weekend. So here are the results for the old and new (32bit) preclears. Note the old preclear was done on a 2 pass basis.

     == invoked as: ./preclear_disk.sh -c 2 /dev/sdd
     == ST4000DM000-1F2168 Z30093E3
     == Disk /dev/sdd has been successfully precleared
     == with a starting sector of 1
     == Ran 2 cycles
     ==
     == Using :Read block size = 8388608 Bytes
     == Last Cycle's Pre Read Time : 10:57:55 (101 MB/s)
     == Last Cycle's Zeroing time : 10:02:21 (110 MB/s)
     == Last Cycle's Post Read Time : 22:53:14 (48 MB/s)
     == Last Cycle's Total Time : 32:56:35
     ==
     == Total Elapsed Time 76:56:37

     == invoked as: ./pc15b.sh -f /dev/sdc
     == ST4000DM000-1F2168 Z30093E3
     == Disk /dev/sdc has been successfully precleared
     == with a starting sector of 1
     == Ran 1 cycle
     ==
     == Using :Read block size = 8388608 Bytes
     == Last Cycle's Pre Read Time : 11:07:18 (99 MB/s)
     == Last Cycle's Zeroing time : 9:55:48 (111 MB/s)
     == Last Cycle's Post Read Time : 11:39:35 (95 MB/s)
     == Last Cycle's Total Time : 32:43:43
     ==
     == Total Elapsed Time 32:43:43

     == invoked as: ./preclear_disk.sh -c 2 /dev/sdc
     == ST4000DM000-1F2168 W3002WDC
     == Disk /dev/sdc has been successfully precleared
     == with a starting sector of 1
     == Ran 2 cycles
     ==
     == Using :Read block size = 8388608 Bytes
     == Last Cycle's Pre Read Time : 10:56:18 (101 MB/s)
     == Last Cycle's Zeroing time : 9:36:47 (115 MB/s)
     == Last Cycle's Post Read Time : 23:18:09 (47 MB/s)
     == Last Cycle's Total Time : 32:55:57
     ==
     == Total Elapsed Time 76:35:13

     == invoked as: ./pc15b.sh -f /dev/sdd
     == ST4000DM000-1F2168 W3002WDC
     == Disk /dev/sdd has been successfully precleared
     == with a starting sector of 1
     == Ran 1 cycle
     ==
     == Using :Read block size = 8388608 Bytes
     == Last Cycle's Pre Read Time : 11:07:22 (99 MB/s)
     == Last Cycle's Zeroing time : 9:56:03 (111 MB/s)
     == Last Cycle's Post Read Time : 11:39:59 (95 MB/s)
     == Last Cycle's Total Time : 32:44:25
     ==
     == Total Elapsed Time 32:44:25

     Regards, Stephen
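The MB/s figures in these reports can be cross-checked from the elapsed times. The sketch below is not part of the preclear script; it assumes a nominal capacity of 4 x 10^12 bytes and 1 MB = 10^6 bytes, and the function names are made up for the example.

```python
def hms_to_seconds(hms):
    """Convert an 'H:MM:SS' string, as printed in the report, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def throughput_mb_s(capacity_bytes, hms):
    """Average throughput in MB/s for one full pass over the drive."""
    return capacity_bytes / hms_to_seconds(hms) / 1e6


CAPACITY = 4 * 10**12  # nominal 4TB; the exact drive size is an assumption

for label, hms in [("old post-read", "22:53:14"), ("new post-read", "11:39:35")]:
    print(f"{label}: {throughput_mb_s(CAPACITY, hms):.1f} MB/s")
```

Under those assumptions the two post-read times work out to roughly 48.5 and 95.3 MB/s, which is consistent with the 48 and 95 MB/s printed in the reports if the script drops the fractional part.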
  4. I have two of these drives. I used one in my unRAID server for about a year without any issues, but recently I replaced it with the NAS version (I'll use the desktop version for backup storage). The thing that was bothering me about these drives was the UDMA_CRC_Error_Count, though most of that may have come from one cable problem. I just finished retesting these drives by preclearing them, without any indications of trouble. Both of these drives also show large values for the Seek_Error_Rate, and both have the 60/30 numbers for the worst and thresh normalized values, so I figure these are typical of this particular drive. Here are the SMART reports from my drives so you can compare (note the newer version of unRAID has an updated version of the smart tool that gives some better attribute names than the version you have):

     ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
     1 Raw_Read_Error_Rate 0x000f 120 099 006 Pre-fail Always - 2137384
     3 Spin_Up_Time 0x0003 092 092 000 Pre-fail Always - 0
     4 Start_Stop_Count 0x0032 095 095 020 Old_age Always - 5765
     5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
     7 Seek_Error_Rate 0x000f 071 060 030 Pre-fail Always - 13846410
     9 Power_On_Hours 0x0032 093 093 000 Old_age Always - 6620
     10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
     12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 36
     183 Runtime_Bad_Block 0x0032 098 098 000 Old_age Always - 2
     184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
     187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
     188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 0 0
     189 High_Fly_Writes 0x003a 098 098 000 Old_age Always - 2
     190 Airflow_Temperature_Cel 0x0022 071 059 045 Old_age Always - 29 (Min/Max 21/34)
     191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
     192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 9
     193 Load_Cycle_Count 0x0032 092 092 000 Old_age Always - 16430
     194 Temperature_Celsius 0x0022 029 041 000 Old_age Always - 29 (0 20 0 0 0)
     197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
     198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
     199 UDMA_CRC_Error_Count 0x003e 200 195 000 Old_age Always - 6949
     240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 1089h+21m+15.431s
     241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 51506131152
     242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 222555473047

     1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always - 125981968
     3 Spin_Up_Time 0x0003 092 091 000 Pre-fail Always - 0
     4 Start_Stop_Count 0x0032 098 098 020 Old_age Always - 2281
     5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
     7 Seek_Error_Rate 0x000f 067 060 030 Pre-fail Always - 5927616
     9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 2670
     10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
     12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 31
     183 Runtime_Bad_Block 0x0032 099 099 000 Old_age Always - 1
     184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
     187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
     188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 0 0
     189 High_Fly_Writes 0x003a 091 091 000 Old_age Always - 9
     190 Airflow_Temperature_Cel 0x0022 069 050 045 Old_age Always - 31 (Min/Max 21/36)
     191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
     192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 8
     193 Load_Cycle_Count 0x0032 097 097 000 Old_age Always - 6649
     194 Temperature_Celsius 0x0022 031 050 000 Old_age Always - 31 (0 21 0 0 0)
     197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
     198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
     199 UDMA_CRC_Error_Count 0x003e 200 194 000 Old_age Always - 668
     240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 475h+23m+44.493s
     241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 44747895302
     242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 147844353977

     Regards, Stephen
  5. Here's my first result from the 32 bit version of the first beta; this is on a pair of old WD 2TB green drives:

     == invoked as: ./pc15b.sh -f /dev/sdc
     == WDCWD20EARS-00J2GB0 WD-WCAYY0100121
     == Disk /dev/sdc has been successfully precleared
     == with a starting sector of 63
     == Ran 1 cycle
     ==
     == Using :Read block size = 8388608 Bytes
     == Last Cycle's Pre Read Time : 8:08:34 (68 MB/s)
     == Last Cycle's Zeroing time : 10:56:21 (50 MB/s)
     == Last Cycle's Post Read Time : 8:00:06 (69 MB/s)
     == Last Cycle's Total Time : 27:06:01

     == invoked as: ./pc15b.sh -f /dev/sdd
     == WDCWD20EARS-00MVWB0 WD-WCAZA6293604
     == Disk /dev/sdd has been successfully precleared
     == with a starting sector of 63
     == Ran 1 cycle
     ==
     == Using :Read block size = 8388608 Bytes
     == Last Cycle's Pre Read Time : 8:09:39 (68 MB/s)
     == Last Cycle's Zeroing time : 10:57:49 (50 MB/s)
     == Last Cycle's Post Read Time : 7:59:14 (69 MB/s)
     == Last Cycle's Total Time : 27:07:44

     I'll rerun these drives through an old preclear next.

     And here are the results with the old preclear:

     == invoked as: ./preclear_disk.sh /dev/sdc
     == WDCWD20EARS-00J2GB0 WD-WCAYY0100121
     == Disk /dev/sdc has been successfully precleared
     == with a starting sector of 63
     == Ran 1 cycle
     ==
     == Using :Read block size = 8388608 Bytes
     == Last Cycle's Pre Read Time : 7:37:40 (72 MB/s)
     == Last Cycle's Zeroing time : 11:13:52 (49 MB/s)
     == Last Cycle's Post Read Time : 15:02:26 (36 MB/s)
     == Last Cycle's Total Time : 33:54:57
     ==
     == Total Elapsed Time 33:54:57

     == invoked as: ./preclear_disk.sh /dev/sdd
     == WDCWD20EARS-00MVWB0 WD-WCAZA6293604
     == Disk /dev/sdd has been successfully precleared
     == with a starting sector of 63
     == Ran 1 cycle
     ==
     == Using :Read block size = 8388608 Bytes
     == Last Cycle's Pre Read Time : 7:37:22 (72 MB/s)
     == Last Cycle's Zeroing time : 11:13:41 (49 MB/s)
     == Last Cycle's Post Read Time : 14:51:50 (37 MB/s)
     == Last Cycle's Total Time : 33:43:53
     ==
     == Total Elapsed Time 33:43:53

     So the new preclear cut the second pass (post read) time from 15 hours to 8 hours, which is great. One odd thing is that the pre-read time is about 30 minutes longer with the new code. Stephen
  6. Just finished a two pass preclear (the old, slow, version) on a pair of 4TB Seagate NAS drives. Took about 75 hours to run. Looking forward to a faster version. Regards, Stephen
  7. I've been switching from WD Greens (I've replaced 4 so far) to WD Reds and the Seagate NAS drives. Both of these run as cool and have about the same power requirements as the Greens plus run faster. Regards, Stephen
  8. From your SMART report:

     ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
     5 Reallocated_Sector_Ct 0x0033 041 041 140 Pre-fail Always FAILING_NOW 1265
     196 Reallocated_Event_Count 0x0032 001 001 000 Old_age Always - 829
     197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
     198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
     199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
     200 Multi_Zone_Error_Rate 0x0008 191 191 000 Old_age Offline - 1833

     The above lines are of great concern. It's actually rather rare that we get to see a report with the "FAILING_NOW" state set. Usually we see reports with far fewer errors, and we rarely see them with more, probably because by the time a drive gets to this point it fails rapidly... Given that Current_Pending_Sector is zero, I think your drive has successfully remapped all the bad sectors it has found (though I'm not certain that none of your data has been corrupted). But as the Reallocated_Sector_Ct is so high, there might not be many spare sectors left in case further bad spots develop. Certainly WD will RMA this drive (if it is still in warranty); I've done RMAs with them on drives with far less badness. One further note of caution: the one WD drive I have had that showed a significant value for Multi_Zone_Error_Rate failed completely after another 50 hours of heavy use. Copy your data off this drive as soon as you can and then replace it. Regards, Stephen
  9. I'd be a bit worried about your disk or maybe your power supply:

     12 Power_Cycle_Count 0x0032 071 071 020 Old_age Always - 29950

     This is showing that the drive has gone through almost 30,000 power cycles! And since it's only logged about 5000 hours, that is like one every 10 minutes. Seems very strange. Most of my drives (Seagates, WD, Hitachi) have fewer than 20 power cycles in several years of use. Perhaps there is a problem with your power supply or the power connector to the drive? Regards, Stephen
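The "one every 10 minutes" estimate is simple arithmetic on the two numbers quoted in the post. As a quick sketch (the function name is made up for the example, and 5000 is the post's rounded figure for the power-on hours):

```python
def minutes_per_power_cycle(power_on_hours, power_cycles):
    """Average number of powered-on minutes between power cycles."""
    return power_on_hours * 60 / power_cycles


# About 5000 power-on hours and a Power_Cycle_Count raw value of 29950.
print(round(minutes_per_power_cycle(5000, 29950), 1))
```

That comes to almost exactly 10 minutes of running time per power cycle.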
  10. The quantity of errors is not particularly alarming right now, but if you see more appearing during a few passes of preclearing then it's time to either RMA (if the cost makes sense) or toss it in the bin. If it survives several preclear passes then it's probably still safe to use, perhaps as an extra backup copy or a drive to experiment with. Regards, Stephen
  11. You might also have bad RAM in either the unRAID box or the computer these were copied from. I would run the memory tester on all the machines that these files were copied through. Regards, Stephen
  12. I'm doing preclears right now and for the next week... I've got a pair of new Seagate NAS 4TB drives to burn in, a pair of Seagate Desktop 4TB drives to retest, and a pair of old WD 2TB greens that I'm taking out of service (so I'm preclearing them to erase and test). The 4TB NAS drives are just at 26% complete on the post-read of the first of 2 cycles and have taken about 26 hours so far; they'll probably take about 34-38 hours if I recall correctly. So I'd be interested in running the new beta. Regards, Stephen
  13. Unscrambling the important part of the report:

     ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
     1 Raw_Read_Error_Rate 0x002f 119 095 006 Pre-fail Always - 223323636
     3 Spin_Up_Time 0x0023 097 097 000 Pre-fail Always - 0
     4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 104
     5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 2
     7 Seek_Error_Rate 0x002f 075 060 030 Pre-fail Always - 36417414
     9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 2743
     10 Spin_Retry_Count 0x0033 100 100 097 Pre-fail Always - 0
     12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 38
     180 Unknown_HDD_Attribute 0x002b 100 100 000 Pre-fail Always - 2112145650
     183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
     184 End-to-End_Error 0x0032 100 100 097 Old_age Always - 0
     187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 217
     188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
     189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
     190 Airflow_Temperature_Cel 0x0022 081 070 045 Old_age Always - 19 (Min/Max 14/28)
     194 Temperature_Celsius 0x0022 019 040 000 Old_age Always - 19 (0 8 0 0 0)
     195 Hardware_ECC_Recovered 0x003a 059 041 000 Old_age Always - 223323636
     196 Reallocated_Event_Count 0x0032 100 100 036 Old_age Always - 2
     197 Current_Pending_Sector 0x0032 098 098 000 Old_age Always - 112
     198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 99
     199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0

     The following lines are of concern:

     ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
     5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 2
     187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 217
     196 Reallocated_Event_Count 0x0032 100 100 036 Old_age Always - 2
     197 Current_Pending_Sector 0x0032 098 098 000 Old_age Always - 112
     198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 99

     There are 112 currently identified bad blocks that have not been remapped (Current_Pending_Sector), which I think puts this drive into the "do not trust" territory, especially as the drive is not very old (2700 hours). Seeing that it is only a 250GB drive, it's probably not worth the bother of doing an RMA. Copy your data off it soon! Regards, Stephen
  14. When the tape drive I was using for backups started to die back in about 2004 or 2005, I ended up writing my own backup utility, initially to store the backups on DVDs; then, as the cost of hard drives dropped, I switched to using external drives. This utility is written in Python and I use it to back up my unRAID server to removable drives attached to my Windows desktop.

     It is built on the notion of a single full backup followed by an unlimited number of incrementals, so while the first backup takes a lot of time the incrementals run pretty quickly. Typically I run an incremental pass on the weekend to grab all the new media files, a process that might take a half hour or so. The backups are written in user-configurable chunks, typically about 500MB (the system will automatically split large files across multiple chunks), to a drive in my Windows desktop machine. From there they get copied to an external drive in one of my backup media sets.

     I have two media sets; one is kept at a remote location (to further protect against fire, flood or theft - but not far enough away to protect against a meteor strike). Periodically I take the external drive I am currently saving backups to over to the remote location, swap it for the last disk in that set, and bring that disk back. When I return with the swapped disk I update it with the backup chunks that were kept on the workstation in its absence; then I can delete those from the workstation and repeat the process.

     In this way I have quadruple redundancy for all the backed up data almost all the time:

     1. the unRAID disk where the data resides
     2. the unRAID parity protection (not truly a copy, but close)
     3. the copy on the workstation internal cache drive
     4. the copy on the local external drive

     Once the data has been swapped to the remote site, items 3 and 4 become the local external drive and the remote external drive.

     About once every year or two I restart the whole process, because by then I'll have some higher capacity drives that I can use to remove the older (and smaller) backup drives from service. The last time I did this I was able to retire a handful of 500GB drives, replacing them with 2TB units that I had removed from the unRAID box when I started moving to 4TB drives.

     The data on the external drives is checksummed both at the chunk level and at the individual file level. The database that manages this also has a SHA1 hash of every individual file, so in theory I could use it to check against the current contents of the unRAID server without having to access any of the external drives. But I've not written that code yet.

     The backup utility is called ArcvBack and is available on: http://arcvback.com/arcvback.html It currently uses Python 2.5; one of these days I'll have to update it to the Python 3.x series. Regards, Stephen
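The not-yet-written check against the server's current contents might look something like the sketch below. This is not ArcvBack code and does not use its database format: it is written for Python 3, and it assumes the recorded hashes have already been loaded into a plain dict mapping relative paths to SHA-1 hex digests. All names here are hypothetical.

```python
import hashlib
import os


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-1 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tree(root, recorded):
    """Compare files under `root` with recorded hashes.

    `recorded` maps a path relative to `root` to its expected SHA-1 hex digest.
    Returns two lists: paths whose contents differ, and paths that are missing.
    """
    mismatched, missing = [], []
    for rel_path, expected in sorted(recorded.items()):
        full_path = os.path.join(root, rel_path)
        if not os.path.isfile(full_path):
            missing.append(rel_path)
        elif sha1_of_file(full_path) != expected:
            mismatched.append(rel_path)
    return mismatched, missing
```

Because only the stored hashes and the live files are read, a pass like this would not need any of the external backup drives to be attached.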
  15. I'm using one with the X9scm-IIF motherboard, and it works fine. Speed should be the same as your motherboard ports, up to about 120MB/sec when you have 8 drives doing a parity check. Note that I have found that putting both sata2 and sata3 drives on this card at the same time causes a major slowdown in parity checking. When I had a sata2 drive attached my parity check speed was only 60MB/s; after I moved that drive to the motherboard the speed rose to 105MB/s. Also, you must set the disk setting for tunable to something like 1024, otherwise parity check speed will be really bad. It was only 40MB/s for me at the default value of 384. Stephen
  16. You might want to read this post: http://lime-technology.com/forum/index.php?topic=27969.msg248016#msg248016 The multi zone error rate on your drive looks unusual.
  17. I had a drive fail last summer. The first clue was an increase in the multi zone error rate; look for MZER articles. Stephen
  18. A final note to report that the process of reassembling the array seems to have gone fine. About half the files on disk5 (the disk that lost communications and was later rebuilt from parity) were backup data files. As these have a built-in MD5 checksum I was able to test them and they all were fine. The other files are covered by backup, but I need to run a verify pass over these (to compare them to their backups) to double check. First I need to modify my backup software a bit to allow me to just test the files on disk5 rather than the whole virtual share that has many disks in it (which is the path that gets backed up). Stephen
  19. Please post a SMART report for that disk. Regards, Stephen
  20. The saga continues. The rebuild of disk5 finished on time overnight. No errors were reported and the log file looked clean:

Dec 14 17:28:03 saturn emhttp_event: disks_mounted
Dec 14 17:28:03 saturn kernel: mdcmd (51): check CORRECT
Dec 14 17:28:03 saturn kernel: md: recovery thread woken up ...
Dec 14 17:28:03 saturn kernel: md: recovery thread rebuilding disk5 ...
Dec 14 17:28:04 saturn kernel: md: using 1536k window, over a total of 3907018532 blocks.
Dec 14 17:28:04 saturn emhttp: shcmd (79): :>/etc/samba/smb-shares.conf
Dec 14 17:28:05 saturn emhttp: get_config_idx: fopen /boot/config/shares/ImageBackup.cfg: No such file or directory - assigning defaults
Dec 14 17:28:05 saturn emhttp: Restart SMB...
Dec 14 17:28:05 saturn emhttp: shcmd (80): killall -HUP smbd
Dec 14 17:28:05 saturn emhttp: shcmd (81): ps axc | grep -q rpc.mountd
Dec 14 17:28:05 saturn emhttp: _shcmd: shcmd (81): exit status: 1
Dec 14 17:28:05 saturn emhttp: shcmd (82): /usr/local/sbin/emhttp_event svcs_restarted
Dec 14 17:28:05 saturn emhttp_event: svcs_restarted
Dec 15 01:12:24 saturn kernel: mdcmd (52): spindown 1
Dec 15 01:12:25 saturn kernel: mdcmd (53): spindown 4
Dec 15 01:12:25 saturn kernel: mdcmd (54): spindown 7
Dec 15 01:12:26 saturn kernel: mdcmd (55): spindown 9
Dec 15 01:43:59 saturn kernel: mdcmd (56): spindown 1
Dec 15 01:44:00 saturn kernel: mdcmd (57): spindown 4
Dec 15 01:44:00 saturn kernel: mdcmd (58): spindown 7
Dec 15 01:44:01 saturn kernel: mdcmd (59): spindown 9
Dec 15 02:43:56 saturn kernel: mdcmd (60): spindown 1
Dec 15 02:43:57 saturn kernel: mdcmd (61): spindown 4
Dec 15 02:43:57 saturn kernel: mdcmd (62): spindown 7
Dec 15 02:43:58 saturn kernel: mdcmd (63): spindown 9
Dec 15 03:43:54 saturn kernel: mdcmd (64): spindown 1
Dec 15 03:43:54 saturn kernel: mdcmd (65): spindown 4
Dec 15 03:43:55 saturn kernel: mdcmd (66): spindown 7
Dec 15 03:43:55 saturn kernel: mdcmd (67): spindown 9
Dec 15 04:44:00 saturn kernel: mdcmd (68): spindown 1
Dec 15 04:44:01 saturn kernel: mdcmd (69): spindown 4
Dec 15 04:44:01 saturn kernel: mdcmd (70): spindown 7
Dec 15 04:44:02 saturn kernel: mdcmd (71): spindown 9
Dec 15 05:27:16 saturn dhcpcd[1123]: eth0: renewing lease of 192.168.1.90
Dec 15 05:27:16 saturn dhcpcd[1123]: eth0: acknowledged 192.168.1.90 from 192.168.1.1
Dec 15 05:27:16 saturn dhcpcd[1123]: eth0: leased 192.168.1.90 for 86400 seconds
Dec 15 05:43:59 saturn kernel: mdcmd (72): spindown 1
Dec 15 05:44:00 saturn kernel: mdcmd (73): spindown 4
Dec 15 05:44:00 saturn kernel: mdcmd (74): spindown 7
Dec 15 05:44:01 saturn kernel: mdcmd (75): spindown 9
Dec 15 05:51:38 saturn kernel: md: sync done. time=44615sec
Dec 15 05:51:38 saturn kernel: md: recovery thread sync completion status: 0

So at 0651 I launched the parity check, and it immediately ran into IO errors:

Dec 15 06:51:04 saturn kernel: mdcmd (86): check CORRECT
Dec 15 06:51:04 saturn kernel: md: recovery thread woken up ...
Dec 15 06:51:04 saturn kernel: md: recovery thread checking parity...
Dec 15 06:51:04 saturn kernel: md: using 1536k window, over a total of 3907018532 blocks.
Dec 15 06:51:42 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280901 action 0x6 frozen Dec 15 06:51:42 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error Dec 15 06:51:42 saturn kernel: ata2: SError: { RecovData UnrecovData HostInt 10B8B BadCRC } Dec 15 06:51:42 saturn kernel: ata2.00: failed command: READ DMA EXT Dec 15 06:51:42 saturn kernel: ata2.00: cmd 25/00:00:50:60:23/00:02:00:00:00/e0 tag 0 dma 262144 in Dec 15 06:51:42 saturn kernel: res 50/00:00:4f:60:23/00:00:00:00:00/e0 Emask 0x50 (ATA bus error) Dec 15 06:51:42 saturn kernel: ata2.00: status: { DRDY } Dec 15 06:51:42 saturn kernel: ata2: hard resetting link Dec 15 06:51:42 saturn kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Dec 15 06:51:42 saturn kernel: ata2.00: configured for UDMA/133 Dec 15 06:51:42 saturn kernel: ata2: EH complete Dec 15 06:52:00 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen Dec 15 06:52:00 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error Dec 15 06:52:00 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC } Dec 15 06:52:00 saturn kernel: ata2.00: failed command: READ DMA EXT Dec 15 06:52:00 saturn kernel: ata2.00: cmd 25/00:00:88:df:40/00:04:00:00:00/e0 tag 0 dma 524288 in Dec 15 06:52:00 saturn kernel: res 50/00:00:87:df:40/00:00:00:00:00/e0 Emask 0x50 (ATA bus error) Dec 15 06:52:00 saturn kernel: ata2.00: status: { DRDY } Dec 15 06:52:00 saturn kernel: ata2: hard resetting link Dec 15 06:52:00 saturn kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Dec 15 06:52:00 saturn kernel: ata2.00: configured for UDMA/133 Dec 15 06:52:00 saturn kernel: ata2: EH complete Dec 15 06:52:01 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen Dec 15 06:52:01 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error Dec 15 06:52:01 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC } Dec 15 06:52:01 saturn 
kernel: ata2.00: failed command: READ DMA EXT Dec 15 06:52:01 saturn kernel: ata2.00: cmd 25/00:00:38:11:42/00:04:00:00:00/e0 tag 0 dma 524288 in Dec 15 06:52:01 saturn kernel: res 50/00:00:37:11:42/00:00:00:00:00/e0 Emask 0x50 (ATA bus error) Dec 15 06:52:01 saturn kernel: ata2.00: status: { DRDY } Dec 15 06:52:01 saturn kernel: ata2: hard resetting link Dec 15 06:52:01 saturn kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Dec 15 06:52:01 saturn kernel: ata2.00: configured for UDMA/133 Dec 15 06:52:01 saturn kernel: ata2: EH complete Dec 15 06:52:06 saturn kernel: ata2: limiting SATA link speed to 1.5 Gbps Dec 15 06:52:06 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen Dec 15 06:52:06 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error Dec 15 06:52:06 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC } Dec 15 06:52:06 saturn kernel: ata2.00: failed command: READ DMA EXT Dec 15 06:52:06 saturn kernel: ata2.00: cmd 25/00:00:10:b0:48/00:04:00:00:00/e0 tag 0 dma 524288 in Dec 15 06:52:06 saturn kernel: res 50/00:00:0f:b0:48/00:00:00:00:00/e0 Emask 0x50 (ATA bus error) Dec 15 06:52:06 saturn kernel: ata2.00: status: { DRDY } Dec 15 06:52:06 saturn kernel: ata2: hard resetting link Dec 15 06:52:06 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310) Dec 15 06:52:06 saturn kernel: ata2.00: configured for UDMA/133 Dec 15 06:52:06 saturn kernel: ata2: EH complete Dec 15 06:52:06 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen Dec 15 06:52:06 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error Dec 15 06:52:06 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC } Dec 15 06:52:06 saturn kernel: ata2.00: failed command: READ DMA EXT Dec 15 06:52:06 saturn kernel: ata2.00: cmd 25/00:00:10:b0:48/00:04:00:00:00/e0 tag 0 dma 524288 in Dec 15 06:52:06 saturn kernel: res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x50 
(ATA bus error) Dec 15 06:52:06 saturn kernel: ata2.00: status: { DRDY } Dec 15 06:52:06 saturn kernel: ata2: hard resetting link Dec 15 06:52:07 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310) Dec 15 06:52:07 saturn kernel: ata2.00: configured for UDMA/133 Dec 15 06:52:07 saturn kernel: ata2: EH complete Dec 15 06:52:07 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen Dec 15 06:52:07 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error Dec 15 06:52:07 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC } Dec 15 06:52:07 saturn kernel: ata2.00: failed command: READ DMA EXT Dec 15 06:52:07 saturn kernel: ata2.00: cmd 25/00:00:10:b0:48/00:04:00:00:00/e0 tag 0 dma 524288 in Dec 15 06:52:07 saturn kernel: res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x50 (ATA bus error) Dec 15 06:52:07 saturn kernel: ata2.00: status: { DRDY } Dec 15 06:52:07 saturn kernel: ata2: hard resetting link Dec 15 06:52:07 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310) Dec 15 06:52:07 saturn kernel: ata2.00: configured for UDMA/133 Dec 15 06:52:07 saturn kernel: ata2: EH complete Dec 15 06:52:07 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen Dec 15 06:52:07 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error Dec 15 06:52:07 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC } Dec 15 06:52:07 saturn kernel: ata2.00: failed command: READ DMA EXT Dec 15 06:52:07 saturn kernel: ata2.00: cmd 25/00:00:10:b0:48/00:04:00:00:00/e0 tag 0 dma 524288 in Dec 15 06:52:07 saturn kernel: res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x50 (ATA bus error) Dec 15 06:52:07 saturn kernel: ata2.00: status: { DRDY } Dec 15 06:52:07 saturn kernel: ata2: hard resetting link Dec 15 06:52:08 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310) Dec 15 06:52:08 saturn kernel: ata2.00: configured for UDMA/133 Dec 15 06:52:08 saturn 
kernel: ata2: EH complete Dec 15 06:52:08 saturn kernel: ata2.00: limiting speed to UDMA/100:PIO4 Dec 15 06:52:08 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen Dec 15 06:52:08 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error Dec 15 06:52:08 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC } Dec 15 06:52:08 saturn kernel: ata2.00: failed command: READ DMA EXT Dec 15 06:52:08 saturn kernel: ata2.00: cmd 25/00:00:10:b0:48/00:04:00:00:00/e0 tag 0 dma 524288 in Dec 15 06:52:08 saturn kernel: res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x50 (ATA bus error) Dec 15 06:52:08 saturn kernel: ata2.00: status: { DRDY } Dec 15 06:52:08 saturn kernel: ata2: hard resetting link Dec 15 06:52:08 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310) Dec 15 06:52:08 saturn kernel: ata2.00: configured for UDMA/100 Dec 15 06:52:08 saturn kernel: ata2: EH complete Dec 15 06:52:08 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen Dec 15 06:52:08 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error Dec 15 06:52:08 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC } Dec 15 06:52:08 saturn kernel: ata2.00: failed command: READ DMA EXT Dec 15 06:52:08 saturn kernel: ata2.00: cmd 25/00:00:10:b0:48/00:04:00:00:00/e0 tag 0 dma 524288 in Dec 15 06:52:08 saturn kernel: res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x50 (ATA bus error) Dec 15 06:52:08 saturn kernel: ata2.00: status: { DRDY } Dec 15 06:52:08 saturn kernel: ata2: hard resetting link Dec 15 06:52:09 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310) Dec 15 06:52:09 saturn kernel: ata2.00: configured for UDMA/100 Dec 15 06:52:09 saturn kernel: sd 2:0:0:0: [sdb] Result: hostbyte=0x00 driverbyte=0x08 Dec 15 06:52:09 saturn kernel: sd 2:0:0:0: [sdb] Sense Key : 0xb [current] [descriptor] Dec 15 06:52:09 saturn kernel: Descriptor sense data with sense descriptors (in 
hex): Dec 15 06:52:09 saturn kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 Dec 15 06:52:09 saturn kernel: 00 00 00 00 Dec 15 06:52:09 saturn kernel: sd 2:0:0:0: [sdb] ASC=0x0 ASCQ=0x0 Dec 15 06:52:09 saturn kernel: sd 2:0:0:0: [sdb] CDB: cdb[0]=0x28: 28 00 00 48 b0 10 00 04 00 00 Dec 15 06:52:09 saturn kernel: end_request: I/O error, dev sdb, sector 4763664 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763600/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763608/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763616/5, count: 1 Dec 15 06:52:09 saturn kernel: ata2: EH complete Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763624/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763632/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763640/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763648/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763656/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763664/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763672/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763680/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763688/5, count: 1 Dec 15 06:52:09 saturn kernel: 
md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763696/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763704/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763712/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763720/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763728/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763736/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763744/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763752/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763760/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763768/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763776/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763784/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763792/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763800/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763808/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: 
handle_stripe read error: 4763816/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763824/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763832/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763840/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763848/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763856/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763864/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763872/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763880/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763888/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763896/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763904/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763912/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763920/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763928/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763936/5, count: 1 Dec 15 06:52:09 
saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763944/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763952/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763960/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763968/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763976/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763984/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4763992/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4764000/5, count: 1 Dec 15 06:52:09 saturn kernel: md: disk5 read error Dec 15 06:52:09 saturn kernel: handle_stripe read error: 4764008/5, count: 1

I then stopped the array and recorded the SMART report for disk5, which now shows one runtime bad block and 668 UDMA errors:
183 Runtime_Bad_Block 0x0032 099 099 000 Old_age Always - 1
199 UDMA_CRC_Error_Count 0x003e 196 194 000 Old_age Always - 668

At this point I suspected the SATA cable or the controller port, so I shut down the system, moved the drive to a different port and started the parity check again. So far the parity check is running normally without any issues.

syslog-2013-12-15.zip
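A side note on reading the log above: the CDB line and the "I/O error, dev sdb, sector 4763664" line describe the same failed request. A minimal sketch, assuming the standard SCSI READ(10) layout (opcode 0x28, a 4-byte big-endian LBA in bytes 2-5, a 2-byte transfer length in bytes 7-8) and 512-byte sectors; the helper name `decode_read10` is made up for illustration:

```python
from typing import Tuple


def decode_read10(cdb_hex: str) -> Tuple[int, int]:
    """Return (start LBA, sector count) from a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2..5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length in blocks
    return lba, count


# CDB bytes copied from the syslog excerpt above.
lba, count = decode_read10("28 00 00 48 b0 10 00 04 00 00")
print(lba, count, count * 512)  # count * 512 assumes 512-byte sectors
```

With these bytes the start LBA comes out as 4763664, the same sector the kernel reports, and 1024 sectors of 512 bytes is the 524288 shown in the "dma 524288 in" line, so the whole burst of handle_stripe errors traces back to one failed read request.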
  21. Some more excitement: the parity check got to about 33% and then slowed to a crawl. The main page showed no additional parity errors, but errors were showing against disk5. The syslog showed a lot of errors like:
Dec 14 16:56:18 saturn kernel: ata2.00: status: { DRDY }
Dec 14 16:56:18 saturn kernel: ata2: hard resetting link
Dec 14 16:56:19 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Dec 14 16:56:19 saturn kernel: ata2.00: configured for UDMA/33
Dec 14 16:56:19 saturn kernel: ata2: EH complete
Dec 14 16:56:19 saturn kernel: ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen
Dec 14 16:56:19 saturn kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Dec 14 16:56:19 saturn kernel: ata2: SError: { UnrecovData HostInt 10B8B BadCRC }
Dec 14 16:56:19 saturn kernel: ata2.00: failed command: READ DMA EXT
Dec 14 16:56:19 saturn kernel: ata2.00: cmd 25/00:f8:38:4d:91/00:03:a0:00:00/e0 tag 0 dma 520192 in
Dec 14 16:56:19 saturn kernel: res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x50 (ATA bus error)

The log is attached (I had to cut off the last hour of repeated errors). It shows this sort of activity appearing and disappearing around 14:28 and 14:58, then returning and becoming constant at 15:52.

Along with this, the SMART report shows a dramatic increase in the UDMA error count and one more runtime bad block:
1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 178194368
3 Spin_Up_Time 0x0003 092 092 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 095 095 020 Old_age Always - 5763
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 071 060 030 Pre-fail Always - 13478189
9 Power_On_Hours 0x0032 093 093 000 Old_age Always - 6530
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 34
183 Runtime_Bad_Block 0x0032 098 098 000 Old_age Always - 2
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 065 059 045 Old_age Always - 35 (Min/Max 33/38)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 9
193 Load_Cycle_Count 0x0032 092 092 000 Old_age Always - 16412
194 Temperature_Celsius 0x0022 035 041 000 Old_age Always - 35 (0 20 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 195 195 000 Old_age Always - 6949

At this point I figure I have demonstrated that the parity drive is fine, and the previous disk4 was accepted back into service OK, so I'm just going to put the spare drive into the array as a replacement for disk5. Any other thoughts?

syslog-2013-12-14-shorter.zip
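To see when the link resets came and went without scrolling through the whole syslog, the "hard resetting link" lines can be counted per minute. This is only a rough sketch: it assumes the saved syslog has one entry per line in the format shown in this post, and the function name `resets_per_minute` is made up for illustration.

```python
import re
from collections import Counter

# Captures "Mon DD HH:MM" and the ATA port from a "hard resetting link" entry.
RESET = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d):\d\d \S+ kernel: (ata\d+): hard resetting link"
)


def resets_per_minute(lines):
    """Count link resets keyed by (minute, port)."""
    counts = Counter()
    for line in lines:
        m = RESET.match(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts


# Three entries copied from the excerpt in this post.
sample = [
    "Dec 14 16:56:18 saturn kernel: ata2.00: status: { DRDY }",
    "Dec 14 16:56:18 saturn kernel: ata2: hard resetting link",
    "Dec 14 16:56:19 saturn kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)",
]
for (minute, port), n in sorted(resets_per_minute(sample).items()):
    print(minute, port, n)
```

Run against the full file (for example `resets_per_minute(open("syslog"))`, where the filename is an assumption), minutes with no resets simply do not appear, so gaps such as the ones between the 14:28, 14:58 and 15:52 episodes should stand out.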
  22. This is how I am proceeding. I took a look at the SMART report for disk5 and compared it to one that I took a week ago. Two lines showed significant changes:

A week ago:
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 1
After the drive red balled:
183 Runtime_Bad_Block 0x0032 099 099 000 Old_age Always - 1
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 20

From the increase in UDMA errors I'm guessing that the most likely cause was a SATA cabling glitch and that the drive is actually still good. I did the usual check of all the cables but didn't find anything amiss, so I decided to try putting the old disk4 back into service and doing a non-correcting parity check to see if disk5 still matches up. As the array was not in use during the 15 minutes or so that the failed disk4 rebuild was running, it seemed a pretty safe bet that the parity and the contents of all the drives were unchanged when disk5 hit its issue and the rebuild stopped.

After reading various things about the "trust my parity" procedure I eventually came across this 16-Nov-2013 message from Tom about removing a drive but preserving parity: http://lime-technology.com/forum/index.php?topic=30270.msg272137#msg272137

In my case it boils down to the following 3 steps:
1. Stop array, go to Utils page, click 'New Config' and execute that Utility.
2. Go back to Main, assign Parity, and all devices as they were, assign the old disk4 back into the right spot and leave the new disk4 (the failed rebuild) out of the array.
3. Click checkbox "Parity is already valid.", and click Start

This I did and it appears to be working as expected, except that it is doing a "correcting" parity check:
Dec 14 09:25:55 saturn kernel: mdcmd (52): check CORRECT (unRAID engine)
and the check box beside "correct any parity-check errors..." on the Main GUI page is checked (but greyed out).

This found 60 parity errors in the first 65K block region of the array, which were probably due to the ReiserFS journal replays that happened when the array started. I got messages about replaying a few transactions for most of the disks, like:
Dec 14 09:25:54 saturn kernel: REISERFS (device md10): found reiserfs format "3.6" with standard journal (Routine)
Dec 14 09:25:54 saturn kernel: REISERFS (device md10): using ordered data mode (Routine)
Dec 14 09:25:54 saturn kernel: reiserfs: using flush barriers
Dec 14 09:25:54 saturn kernel: REISERFS (device md10): journal params: device md10, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30 (Routine)
Dec 14 09:25:54 saturn kernel: REISERFS (device md10): checking transaction log (md10) (Routine)
Dec 14 09:25:54 saturn kernel: REISERFS (device md10): replayed 5 transactions in 0 seconds (Minor Issues)

Anyway, it has now got to 3% (100GB) on the parity check without finding any additional parity mismatches, so I'll let it run.
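The week-over-week SMART comparison described in this post can be done mechanically. A minimal sketch, assuming the reports are saved as text in the usual `smartctl -A` row layout (ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value); the function names `parse_smart` and `changed` are made up for illustration:

```python
def parse_smart(text):
    """Map attribute name -> integer raw value for smartctl -A style rows."""
    attrs = {}
    for line in text.splitlines():
        f = line.split()
        # Expected columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(f) >= 10 and f[0].isdigit() and f[2].startswith("0x") and f[9].isdigit():
            attrs[f[1]] = int(f[9])
    return attrs


def changed(before, after):
    """Return {name: (old_raw, new_raw)} for attributes whose raw value differs."""
    return {k: (before[k], after[k])
            for k in before if k in after and before[k] != after[k]}


# Rows copied from this post.
week_ago = """\
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 1
"""
after_redball = """\
183 Runtime_Bad_Block 0x0032 099 099 000 Old_age Always - 1
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 20
"""
for name, (old, new) in changed(parse_smart(week_ago), parse_smart(after_redball)).items():
    print(f"{name}: {old} -> {new}")
```

Only raw values are compared here. Rows whose raw field is not a plain integer are skipped, which is a simplification; some drives report raw values in other formats.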
  23. This past week I have been replacing a few drives that are getting old (about the 3 year mark) and are showing some issues in their SMART reports. Tonight I installed a replacement for disk4 and started the rebuild process. All seemed to be going well, but after about 15 minutes disk5 redballed and the rebuild stopped. When I hit the stop button to take the array offline, the system seemed to enter an endless loop of "retry unmounting users share(s)" messages and the console was spewing errors. After about 10 minutes of this I just hit the reset button and the system rebooted more or less normally.

At this point disk5 is showing red and disk4 is orange. I still have the original disk4 and a spare disk that is the same size as disk5. I can think of two options for proceeding, but are there some better ones?
1. Put in the original disk4 and the spare disk for disk5, and then rebuild disk5 from parity. Is this the "trust my parity" process?
2. If disk5 appears to be working still, put the original disk4 back in and then build parity again.

Regards, Stephen
  24. You might want to read this post that I wrote about my experience with a high MZER situation: http://lime-technology.com/forum/index.php?topic=27969.msg248016#msg248016 Based on it, I am guessing your drive won't be long for this world. Regards, Stephen