RobJ Posted February 17, 2009
Looks very good, about typical for new Seagates. I checked, and there is STILL nothing newer than smartctl v5.38. A newer version is needed to properly interpret attributes 240, 241, and 242. (Apparently, logger is too dumb to display tabs (^I, ASCII 9) correctly!)
prostuff1 Posted February 18, 2009
I started another preclear on my old parity drive (a 1TB Seagate) and it seems to be frozen (or at least the screen is). The server is still up and running just fine, and nothing looks to be overheated or anything like that, but the updating seems to have stopped. I set it to run 3 cycles and it looks to have stopped in cycle one, on step 10, at 88%. I can telnet into the tower and type ps -ef, and I can see this:
root 28309 2657 0 07:42 tty1 00:00:57 /bin/bash ./preclear_disk.sh -c
and this:
root 5740 28309 99 16:15 tty1 07:34:56 /bin/bash ./preclear_disk.sh -c
I am not sure if I have two preclears going or not. I don't think I started two, but I guess I could have accidentally. The Seagate drive appears to still be spinning, so I don't think it has crashed and burned. But it would appear that preclear has stopped on the disk. Using unMenu and myMain, I can see that there are no longer any reads or writes going to the disk. This is kind of disturbing. Any input is appreciated.
Joe L. Posted February 18, 2009
Since the one process is a child of the other, I don't think you started two... It is normal to see two processes while it is clearing the drive, as the clear is done in a background process while a foreground process updates the display. On the other hand, it sure looks as if the process has stopped (assuming no read or write activity is actually occurring). You might just abort it by typing "Control-C" in the window where it was started, and try once more after running a smartctl test on the drive. I have seen drives stop and look like they locked up like this when other activity occurred concurrently. It is a matter of a "deadlock", where two processes both wait for a resource to be freed, but each is really waiting for the other. If you have time, and anything else is going on on the server, I'd just try waiting (overnight). Then, I'd let it know who's boss. Joe L.
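The parent/child relationship Joe L. describes can be read straight off the ps -ef columns: the second column is the PID and the third is the parent PID. Below is a minimal sketch of that check, assuming ps -ef style input. The function name is_child_of is made up for illustration and is not part of preclear_disk.sh; the sample lines are the two quoted in the post above.

```shell
#!/bin/bash
# Sketch: decide whether one PID in `ps -ef` output is a child of another.
# In `ps -ef`, column 2 is the PID and column 3 is the parent PID (PPID).

# is_child_of CHILD_PID PARENT_PID   (hypothetical helper; reads ps -ef text on stdin)
is_child_of() {
    awk -v child="$1" -v parent="$2" '
        $2 == child && $3 == parent { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Sample lines copied from the post above.
sample='root 28309 2657 0 07:42 tty1 00:00:57 /bin/bash ./preclear_disk.sh -c
root 5740 28309 99 16:15 tty1 07:34:56 /bin/bash ./preclear_disk.sh -c'

if printf '%s\n' "$sample" | is_child_of 5740 28309; then
    echo "5740 is a child of 28309: one preclear, two processes"
else
    echo "the two processes are unrelated: possibly two preclears"
fi
```

On a live server the same function could be fed real data with something like ps -ef | is_child_of 5740 28309, substituting the PIDs actually shown.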
prostuff1 Posted February 18, 2009
Thanks for the input. I figured I would have to quit it and start over. I ended up killing the preclear and am now running smartctl -t long on the drive. When that finishes I will see what the output is and start a one-cycle preclear on the drive. Thanks, Joe L.
Biggy2872 (Author) Posted February 18, 2009
============================================================================
==
== Disk /dev/sdc has been successfully precleared
==
============================================================================
S.M.A.R.T. error count differences detected after pre-clear
note, some 'raw' values may change, but not be an indication of a problem
20,21c20,21
< Offline data collection status: (0x82) Offline data collection activity
< was completed without error.
---
> Offline data collection status: (0x84) Offline data collection activity
> was suspended by an interrupting command from host.
============================================================================
I've used this script a few times now with no problems... The last time, I got the SMART output copied above. Is this an issue/error? Should I re-run a SMART test? Cheers, Matt
prostuff1 Posted February 18, 2009
OK, the smartctl -t long test finished on my 1TB Seagate.
This is the output of smartctl -l selftest /dev/sdc:
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error        00%        993        -
This is the output of smartctl -A /dev/sdc:
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   099   006    Pre-fail  Always       -       38731720
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       32
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
  7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always       -       9300322
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1001
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       26
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   001   001   000    Old_age   Always       -       381
190 Airflow_Temperature_Cel 0x0022   066   053   045    Old_age   Always       -       34 (Lifetime Min/Max 26/36)
194 Temperature_Celsius     0x0022   034   047   000    Old_age   Always       -       34 (0 18 0 0)
195 Hardware_ECC_Recovered  0x001a   036   025   000    Old_age   Always       -       38731720
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
Some of the above worries me, but I would like some input from the experts in here who are better at reading this than I am. Thanks
RobJ Posted February 19, 2009
"Some of the above worries me, but I would like some input from the experts in here who are better at reading this than I am."
I personally don't think there are any experts yet when it comes to interpreting SMART reports. It is too new a data analysis tool, plus they keep changing and adding attributes, *and* they are different for each drive vendor. All I can give is my impressions from what I have seen so far, and learn from each new one I see. The Raw_Read_Error_Rate and the corresponding Hardware_ECC_Recovered are quite normal for a Seagate, but would be very high for any other brand. The one attribute I would keep an eye on is the Seek_Error_Rate, which seems higher to me than it should be, and has taken a hit in its VALUE. The High_Fly_Writes attribute is rather new, and I don't think anyone really understands it yet. What troubles me about it is that the VALUE and WORST have already bottomed out, but perhaps Seagate themselves did not know how to appropriately scale it. Overall, it looks fine, but I would monitor it once a month for a while. I think you will learn from watching it what is OK and what bears continued monitoring.
RobJ Posted February 19, 2009
> Offline data collection status: (0x84) Offline data collection activity
> was suspended by an interrupting command from host.
I have seen that in 1 or 2 of the other pre_clear reports earlier. It is completely harmless; it just means a test was aborted, so there is no result from it. I suspect that something in the pre_clear script is initiating an offline test that is aborted by a later command to the drive. Possibly just a timing issue.
prostuff1 Posted February 19, 2009
Thanks for the input, RobJ. The stats I was concerned about were the ones that you basically outlined there. The High_Fly_Writes one is one that I am not sure about. I have been searching around trying to understand what exactly it measures, and I am still not quite sure. All I know is that the 1.5TB drive I just ran a 3-cycle preclear on has about 15 High_Fly_Writes "errors" already. I will monitor the drive(s) like you suggested and see if anything changes.
Biggy2872 (Author) Posted February 19, 2009
Thanks, I figured it was probably a timing issue but wanted to make 100% sure. I had not seen this behaviour before. Matt
SSD Posted February 19, 2009
According to the SMART report, you have 381 high fly writes. From reading about it, High_Fly_Writes has to do with a condition the drive detects where the heads get too far above the disk surface during a write. When this is detected, the drive cancels/retries the write. If you had 381 of these out of the millions or billions of write operations, it is likely not too serious. That being said, that seems a pretty high value for this attribute. Nevertheless, I wouldn't be too concerned about it unless you start to see other attribute problems or errors indicating that the drive is malfunctioning. I have never seen a drive go bad because of high fly writes.
prostuff1 Posted February 21, 2009
Thanks for explaining that a little more. Now I know what to pay attention to when looking at some of this SMART information. I just finished running preclear on another 1.5 TB Seagate and all went well. I ran 2 cycles on it and it took about 25 hours all told.
CrashnBrn Posted February 23, 2009
After running the script, it told me that error count differences were detected after the pre-clear. Is this anything to be worried about? The results look similar to a previous person's post, but I want to double check. The drive is a new Seagate 1.5TB St315005n1a1as. It took 12:12:24 to complete the preclear. Edit: Seems like I can't get the image to show here. Here is a link: http://img10.imageshack.us/my.php?image=results.jpg
RobJ Posted February 23, 2009
Looks good. Actually, it's claiming to be 'better than good'. If you check the Raw_Read_Error_Rate, the VALUE starts at an initialized value of 100, and rises to 118! It's like asking for maximum effort from someone, and them responding that they will give 110%, which is not possible, but you know what they mean. In this case, their scientists have measured and calculated appropriate scales for these error rates, and your drive must have such a low error rate, relative to their statistical norms, that their algorithm determined a higher than 100 value. The Seek_Error_Rate stayed at 100. The idea with these seems to be that as the rate of errors increases with wear and tear, the *_Error_Rate VALUE will drop from 100 down to the threshold value, at which point they have decided that the error rate is too high to trust the drive, and it will return a failing SMART grade. With the *_Error_Rate attributes, you should probably ignore the RAW values. They may or may not be actual error counts, but what is being monitored with these attributes is not how many errors there are, but at what rate they are occurring, and how these numbers fit within the expected norms for that drive model.
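RobJ's point, that the normalized VALUE matters more than the RAW number and that trouble starts when VALUE falls to THRESH, can be turned into a small report. The following is only a sketch, assuming smartctl -A style rows as posted earlier in this thread. The function name smart_margins is invented for illustration, and treating a THRESH of 0 as "can never fail" is an assumption based on normalized values bottoming out at 1.

```shell
#!/bin/bash
# Sketch: show how far each normalized SMART VALUE sits above its THRESH.
# Expects `smartctl -A` style rows: ID# NAME FLAG VALUE WORST THRESH TYPE ...

# smart_margins is a hypothetical helper; it reads attribute rows on stdin.
smart_margins() {
    awk '
        # Attribute rows start with a numeric ID and carry a hex FLAG in column 3.
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x/ {
            value = $4 + 0
            thresh = $6 + 0
            # Assumption: a THRESH of 0 can never be reached, so it is never "FAILING".
            status = (thresh > 0 && value <= thresh) ? "FAILING" : "ok"
            printf "%-24s value=%d thresh=%d margin=%d %s\n", $2, value, thresh, value - thresh, status
        }
    '
}

# Three rows copied from the smartctl -A output posted earlier in the thread.
sample='  1 Raw_Read_Error_Rate     0x000f   111   099   006    Pre-fail  Always       -       38731720
  7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always       -       9300322
189 High_Fly_Writes         0x003a   001   001   000    Old_age   Always       -       381'

printf '%s\n' "$sample" | smart_margins
```

On a real drive the input would come from smartctl -A /dev/sdX instead of the sample text. The RAW_VALUE column is deliberately ignored here, in line with the advice above.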
CrashnBrn Posted February 23, 2009
Awesome! The parity rebuild has started. Thanks, RobJ
bluto Posted February 27, 2009
Hi, I've run preclear_disk.sh on a drive. I'm most of the way through. All steps say done, but I'm on the "Post-Read in progress: 88% complete" step. It appears as if the console has locked. I cannot see any updates and it's been a little while (over 2 hours). I can see in top that the "./preclear_disk.sh /dev/sdi" process is pegging one of my processors at 99% CPU utilization. Should I wait it out and hope for the best, or try to kill the process and start over? Attached is my syslog for your review. Thank you.
Joe L. Posted February 27, 2009
One of your disks (/dev/sdi) is having lots of errors, as seen in the excerpt below from your syslog. Either the drive died, or a cable came loose. In either case, I doubt the pre-clear will finish on its own. Joe L.
Feb 26 16:51:40 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Feb 26 16:51:40 Tower kernel: ata8.00: irq_stat 0x00020002, device error via SDB FIS
Feb 26 16:51:40 Tower kernel: ata8.00: cmd 60/00:00:e0:7e:37/02:00:05:00:00/40 tag 0 ncq 262144 in
Feb 26 16:51:40 Tower kernel: res 41/40:00:05:80:37/00:00:05:00:00/40 Emask 0x409 (media error) <F>
Feb 26 16:51:40 Tower kernel: ata8.00: status: { DRDY ERR }
Feb 26 16:51:40 Tower kernel: ata8.00: error: { UNC }
Feb 26 16:51:40 Tower kernel: ata8.00: configured for UDMA/100
Feb 26 16:51:40 Tower kernel: ata8: EH complete
Feb 26 16:51:40 Tower kernel: sd 8:0:0:0: [sdi] 1953525168 512-byte hardware sectors (1000205 MB)
Feb 26 16:51:40 Tower kernel: sd 8:0:0:0: [sdi] Write Protect is off
Feb 26 16:51:40 Tower kernel: sd 8:0:0:0: [sdi] Mode Sense: 00 3a 00 00
Feb 26 16:51:40 Tower kernel: sd 8:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 26 16:51:44 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Feb 26 16:51:44 Tower kernel: ata8.00: irq_stat 0x00060002, device error via SDB FIS
Feb 26 16:51:44 Tower kernel: ata8.00: cmd 60/00:00:e0:7e:37/02:00:05:00:00/40 tag 0 ncq 262144 in
Feb 26 16:51:44 Tower kernel: res 41/40:00:05:80:37/00:00:05:00:00/40 Emask 0x409 (media error) <F>
Feb 26 16:51:44 Tower kernel: ata8.00: status: { DRDY ERR }
Feb 26 16:51:44 Tower kernel: ata8.00: error: { UNC }
Feb 26 16:51:44 Tower kernel: ata8.00: configured for UDMA/100
Feb 26 16:51:44 Tower kernel: ata8: EH complete
Feb 26 16:51:44 Tower kernel: sd 8:0:0:0: [sdi] 1953525168 512-byte hardware sectors (1000205 MB)
Feb 26 16:51:44 Tower kernel: sd 8:0:0:0: [sdi] Write Protect is off
Feb 26 16:51:44 Tower kernel: sd 8:0:0:0: [sdi] Mode Sense: 00 3a 00 00
Feb 26 16:51:44 Tower kernel: sd 8:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 26 16:51:48 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Feb 26 16:51:48 Tower kernel: ata8.00: irq_stat 0x00060002, device error via SDB FIS
Feb 26 16:51:48 Tower kernel: ata8.00: cmd 60/00:00:e0:7e:37/02:00:05:00:00/40 tag 0 ncq 262144 in
Feb 26 16:51:48 Tower kernel: res 41/40:00:05:80:37/00:00:05:00:00/40 Emask 0x409 (media error) <F>
Feb 26 16:51:48 Tower kernel: ata8.00: status: { DRDY ERR }
Feb 26 16:51:48 Tower kernel: ata8.00: error: { UNC }
Feb 26 16:51:48 Tower kernel: ata8.00: configured for UDMA/100
Feb 26 16:51:48 Tower kernel: ata8: EH complete
Feb 26 16:51:48 Tower kernel: sd 8:0:0:0: [sdi] 1953525168 512-byte hardware sectors (1000205 MB)
Feb 26 16:51:48 Tower kernel: sd 8:0:0:0: [sdi] Write Protect is off
Feb 26 16:51:48 Tower kernel: sd 8:0:0:0: [sdi] Mode Sense: 00 3a 00 00
Feb 26 16:51:48 Tower kernel: sd 8:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 26 16:51:53 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Feb 26 16:51:53 Tower kernel: ata8.00: irq_stat 0x00060002, device error via SDB FIS
Feb 26 16:51:53 Tower kernel: ata8.00: cmd 60/00:00:e0:7e:37/02:00:05:00:00/40 tag 0 ncq 262144 in
Feb 26 16:51:53 Tower kernel: res 41/40:00:05:80:37/00:00:05:00:00/40 Emask 0x409 (media error) <F>
Feb 26 16:51:53 Tower kernel: ata8.00: status: { DRDY ERR }
Feb 26 16:51:53 Tower kernel: ata8.00: error: { UNC }
Feb 26 16:51:53 Tower kernel: ata8.00: configured for UDMA/100
Feb 26 16:51:53 Tower kernel: ata8: EH complete
Feb 26 16:51:53 Tower kernel: sd 8:0:0:0: [sdi] 1953525168 512-byte hardware sectors (1000205 MB)
Feb 26 16:51:53 Tower kernel: sd 8:0:0:0: [sdi] Write Protect is off
Feb 26 16:51:53 Tower kernel: sd 8:0:0:0: [sdi] Mode Sense: 00 3a 00 00
Feb 26 16:51:53 Tower kernel: sd 8:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 26 16:51:57 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Feb 26 16:51:57 Tower kernel: ata8.00: irq_stat 0x00060002, device error via SDB FIS
Feb 26 16:51:57 Tower kernel: ata8.00: cmd 60/00:00:e0:7e:37/02:00:05:00:00/40 tag 0 ncq 262144 in
Feb 26 16:51:57 Tower kernel: res 41/40:00:05:80:37/00:00:05:00:00/40 Emask 0x409 (media error) <F>
Feb 26 16:51:57 Tower kernel: ata8.00: status: { DRDY ERR }
Feb 26 16:51:57 Tower kernel: ata8.00: error: { UNC }
Feb 26 16:51:57 Tower kernel: ata8.00: configured for UDMA/100
Feb 26 16:51:57 Tower kernel: ata8: EH complete
Feb 26 16:51:57 Tower kernel: sd 8:0:0:0: [sdi] 1953525168 512-byte hardware sectors (1000205 MB)
Feb 26 16:51:57 Tower kernel: sd 8:0:0:0: [sdi] Write Protect is off
Feb 26 16:51:57 Tower kernel: sd 8:0:0:0: [sdi] Mode Sense: 00 3a 00 00
Feb 26 16:51:57 Tower kernel: sd 8:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 26 16:52:01 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Feb 26 16:52:01 Tower kernel: ata8.00: irq_stat 0x00060002, device error via SDB FIS
Feb 26 16:52:01 Tower kernel: ata8.00: cmd 60/00:00:e0:7e:37/02:00:05:00:00/40 tag 0 ncq 262144 in
Feb 26 16:52:01 Tower kernel: res 41/40:00:05:80:37/00:00:05:00:00/40 Emask 0x409 (media error) <F>
Feb 26 16:52:01 Tower kernel: ata8.00: status: { DRDY ERR }
Feb 26 16:52:01 Tower kernel: ata8.00: error: { UNC }
Feb 26 16:52:01 Tower kernel: ata8.00: configured for UDMA/100
Feb 26 16:52:01 Tower kernel: sd 8:0:0:0: [sdi] Result: hostbyte=0x00 driverbyte=0x08
Feb 26 16:52:01 Tower kernel: sd 8:0:0:0: [sdi] Sense Key : 0x3 [current] [descriptor]
Feb 26 16:52:01 Tower kernel: Descriptor sense data with sense descriptors (in hex):
Feb 26 16:52:01 Tower kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Feb 26 16:52:01 Tower kernel: 05 37 80 05
Feb 26 16:52:01 Tower kernel: sd 8:0:0:0: [sdi] ASC=0x11 ASCQ=0x4
Feb 26 16:52:01 Tower kernel: end_request: I/O error, dev sdi, sector 87523333
Feb 26 16:52:01 Tower kernel: Buffer I/O error on device sdi, logical block 10940416
Feb 26 16:52:01 Tower kernel: Buffer I/O error on device sdi, logical block 10940417
Feb 26 16:52:01 Tower kernel: Buffer I/O error on device sdi, logical block 10940418
Feb 26 16:52:01 Tower kernel: Buffer I/O error on device sdi, logical block 10940419
Feb 26 16:52:01 Tower kernel: Buffer I/O error on device sdi, logical block 10940420
Feb 26 16:52:01 Tower kernel: Buffer I/O error on device sdi, logical block 10940421
Feb 26 16:52:01 Tower kernel: Buffer I/O error on device sdi, logical block 10940422
Feb 26 16:52:01 Tower kernel: Buffer I/O error on device sdi, logical block 10940423
Feb 26 16:52:01 Tower kernel: Buffer I/O error on device sdi, logical block 10940424
Feb 26 16:52:01 Tower kernel: Buffer I/O error on device sdi, logical block 10940425
Feb 26 16:52:01 Tower kernel: ata8: EH complete
Feb 26 16:52:01 Tower kernel: sd 8:0:0:0: [sdi] 1953525168 512-byte hardware sectors (1000205 MB)
Feb 26 16:52:01 Tower kernel: sd 8:0:0:0: [sdi] Write Protect is off
Feb 26 16:52:01 Tower kernel: sd 8:0:0:0: [sdi] Mode Sense: 00 3a 00 00
Feb 26 16:52:01 Tower kernel: sd 8:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 26 16:52:06 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6
Feb 26 16:52:06 Tower kernel: ata8.00: irq_stat 0x00060002, device error via SDB FIS
Feb 26 16:52:06 Tower kernel: ata8.00: cmd 60/00:00:e0:80:37/01:00:05:00:00/40 tag 0 ncq 131072 in
Feb 26 16:52:06 Tower kernel: res 68/02:00:00:00:00/00:00:00:00:68/00 Emask 0x2 (HSM violation)
Feb 26 16:52:06 Tower kernel: ata8.00: status: { DRDY DF DRQ }
Feb 26 16:52:06 Tower kernel: ata8.00: cmd 60/08:08:00:80:37/00:00:05:00:00/40 tag 1 ncq 4096 in
Feb 26 16:52:06 Tower kernel: res 41/40:00:05:80:37/00:00:05:00:00/40 Emask 0x409 (media error) <F>
Feb 26 16:52:06 Tower kernel: ata8.00: status: { DRDY ERR }
Feb 26 16:52:06 Tower kernel: ata8.00: error: { UNC }
Feb 26 16:52:06 Tower kernel: ata8: hard resetting link
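A long syslog like the one above is easier to judge once the repeated entries are counted. Here is a rough sketch that counts the "(media error)" lines and collects the failing sector from each "end_request: I/O error" entry. The function name summarise_disk_errors is hypothetical, and the sample is three lines taken from the excerpt above; reading /var/log/syslog on the server itself is an assumption about where the log lives.

```shell
#!/bin/bash
# Sketch: summarise disk trouble in a syslog excerpt.

# summarise_disk_errors is a hypothetical helper; it reads syslog lines on stdin.
summarise_disk_errors() {
    awk '
        /\(media error\)/ { media++ }
        /end_request: I\/O error, dev / {
            # e.g. "... end_request: I/O error, dev sdi, sector 87523333"
            for (i = 1; i <= NF; i++) {
                if ($i == "dev")    { dev = $(i + 1); sub(/,$/, "", dev) }
                if ($i == "sector") { sector = $(i + 1) }
            }
            bad[dev] = bad[dev] " " sector
        }
        END {
            printf "media error lines: %d\n", media
            for (d in bad) printf "%s failing sectors:%s\n", d, bad[d]
        }
    '
}

# Three lines taken from the syslog excerpt above.
sample='Feb 26 16:52:01 Tower kernel: res 41/40:00:05:80:37/00:00:05:00:00/40 Emask 0x409 (media error) <F>
Feb 26 16:52:01 Tower kernel: ata8.00: error: { UNC }
Feb 26 16:52:01 Tower kernel: end_request: I/O error, dev sdi, sector 87523333'

printf '%s\n' "$sample" | summarise_disk_errors
```

If the same sector keeps turning up across retries, as it does in the excerpt, that fits the read of an unreadable spot on the platter that Joe L. describes, though a cable problem is not ruled out by this summary alone.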
Rob_Esc Posted March 3, 2009
Just wanted to say that I used this script to preclear two new 1 TB drives for my array: a Seagate 7200.11 Barracuda and a WD EACS green. Everything worked great with no problems. Thanks, Joe!
gbdesai Posted March 15, 2009
I've used the preclear script on several of my drives with great results. I just used it on two new drives, a Seagate 1.5TB and a WD 1.5TB (I wanted to see if the WD works well). Anyway, both ran through the single cycle, including pre and post reads. Everything finished fine on both as usual (though the WD took over 18 hours while the Seagate took the usual 12:12). When I shut down, moved them to their new drive locations and started up, both showed as unformatted and I had to format them for them to be usable. In the past I seem to recall they simply became active... Is there anything I did wrong, or is it that I brought both drives up at the same time? I think in the past I did one at a time... G
Joe L. Posted March 15, 2009
They will always need to be formatted... They are partitioned in a special way, but not formatted. The pre-clearing erases almost everything on the entire drive... There is no file-system, just an empty partition defined in a special way that can be recognized. The lengthy clearing step is skipped when adding them to the array. That step takes your array off-line for 4 to 5 hours for a 1.5 TB drive. You did nothing wrong. Instead of it being unavailable for many hours, your array came on-line, and the formatting step takes a minute or two (for that size drive). Instead of your family being unhappy because they wanted to view a movie, they can still get access to the server... even while the new drive is being formatted. Glad the script worked for you... If nothing else, you know that the drive is a bit less likely to suffer a mechanical problem in its first few hours of life in the server. The only time a drive does not need to be formatted explicitly is when it is being used to replace an existing drive. In that situation the bytes from the original drive are written to the replacement, and those bytes represent a formatted drive, so no additional formatting is necessary. In effect you copied the formatting from the old drive to the replacement. It is possible you just did not notice the formatting before, or forgot you had to press the "Format" button... but you probably did. Joe L.
gbdesai Posted March 16, 2009
Joe, thanks for the quick reply. I'm sure I did format them before; I haven't added drives in a while, so it seemed new to me. But you point out the very reasons this script is so awesome: exercising the drives (to see if there are any oddities, which is especially important with these 1.5TB drives) and letting me keep the array live while it does its thing. I didn't realize you could run this on multiple drives at once until this time, when I did 2 simultaneously in different terminal sessions. Way cool. Thanks for taking the time to make such a great tool for all of us! G
Joe L. Posted March 16, 2009
Make sure you check the firmware version on the Seagate 1.5TB drive. There is a whole series of Seagate drives with firmware versions that have a bug that will kill the drive, to the point where the BIOS does not even see it, and it takes a special jig to bring it back to life... You can prevent the failure by upgrading the firmware before the drive dies. Joe L.
gbdesai Posted March 16, 2009
Yeah, all of my drives are either CC1H or CC1J. I bought the server direct from Lime-Tech and it came with 4 of the CC1H, and they work fine (except one of them developed an unrecoverable sector or three and I had to RMA it with Seagate). I also always seem to get CC1Hs from NewEgg. I did get two of the CC1J from Amazon, and apparently there is no firmware to upgrade them to CC1H; you just have to buy 'em that way for now. I did run the latest firmware upgrade util on all the drives and it always said the latest was installed... They've been running for quite a while now with no problems, and that's why I appreciate your script. Thanks again!
GoChris Posted March 16, 2009
I've noticed that drives I pull from my Drobo have SMART turned off. Silly me, I keep forgetting to turn it on before running this, so usually I turn it on and then run it again. Could you update your script to make sure SMART is turned on before it starts? Otherwise, it's working great, and I love that it saves so much downtime when adding drives to the array.
Joe L. Posted March 17, 2009
It is apparently enabled on all my drives... or at least on anything remotely recent. I have a few 8Gig drives where it is not available at all... they pre-date SMART. If you want to add the enable command to your version of preclear_disk.sh, add the smartctl -s on line, the first line of the function body below, to the get_start_smart function:
get_start_smart() {
    # enable SMART on the drive before reading its report
    smartctl -s on $1 >/dev/null 2>&1
    smartctl -d ata -a $1 2>&1 | egrep -v "Power_On_Minutes|Temperature_Celsius" >/tmp/smart_start$$
    cat /tmp/smart_start$$ | logger -tpreclear_disk-start -plocal7.info -is
}
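For anyone who would rather check than enable blindly, the state can be read from the "SMART support is:" lines that smartctl -i prints. This is a small sketch under that assumption. The function name smart_state is made up for illustration and is not part of preclear_disk.sh, and the sample text is a typical pair of lines rather than output from any drive in this thread.

```shell
#!/bin/bash
# Sketch: report whether SMART is enabled, based on `smartctl -i` style text.

# smart_state is a hypothetical helper; it reads smartctl -i output on stdin.
smart_state() {
    awk '
        /^SMART support is:/ {
            # The 4th field is Available/Unavailable/Enabled/Disabled.
            if ($4 == "Enabled")          state = "enabled"
            else if ($4 == "Disabled")    state = "disabled"
            else if ($4 == "Unavailable") state = "unavailable"
        }
        END { print (state ? state : "unknown") }
    '
}

# Typical lines (illustrative, not taken from a drive in this thread).
sample='SMART support is: Available - device has SMART capability.
SMART support is: Enabled'

printf '%s\n' "$sample" | smart_state
```

On the server this could be used as smartctl -i /dev/sdX | smart_state, followed by smartctl -s on /dev/sdX only when the answer is "disabled".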