JDGJr Posted January 31, 2013

System running on 5.0beta12 for months. Last weekend I replaced a 1TB drive with a 2TB and ran a parity check 3 days ago without error. Tonight I noticed the server's drive light was flashing even though the UI showed no drives spinning. One of my older 1.5TB drives (not involved in the upgrade) was marked as disabled. I stopped the array, captured the attached log and powered down - all cables seem to be connected properly.

Should I preclear a brand new 2TB drive and upgrade to it? Thanks for any suggestions on how to proceed. Syslog is attached.

SMART status report after reboot (drive hdb still showing as disabled):

Statistics for /dev/hdb ST31500341AS_9VS0G44C
smartctl -a -d ata /dev/hdb
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11 family
Device Model:     ST31500341AS
Serial Number:    9VS0G44C
Firmware Version: CC1J
User Capacity:    1,500,301,910,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Wed Jan 30 22:11:54 2013 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 617) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       209637772
  3 Spin_Up_Time            0x0003   096   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1171
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always       -       10316718
  9 Power_On_Hours          0x0032   062   062   000    Old_age   Always       -       33405
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       1
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       166
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       262152
189 High_Fly_Writes         0x003a   023   023   000    Old_age   Always       -       77
190 Airflow_Temperature_Cel 0x0022   070   063   045    Old_age   Always       -       30 (Min/Max 24/30)
194 Temperature_Celsius     0x0022   030   040   000    Old_age   Always       -       30 (0 13 0 0)
195 Hardware_ECC_Recovered  0x001a   054   032   000    Old_age   Always       -       209637772
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       85993835204398
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       724182207
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       399207213

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     33405         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

syslog-2013-01-30-hdb_issues.zip
RobJ Posted January 31, 2013

You have several issues, but the drive itself is probably fine; the SMART report certainly looks good.

At (Jan 30) 18:10:32, something major happened to this server, probably a serious electrical spike. Perhaps there was lightning in the area at that time? The server stayed up, but 7 of the 9 drives were hit and reported trouble. 6 of the 7 drives were configured correctly as SATA drives, and the SATA exception handler was able to hard reset them and recover them successfully (although one of them needed a second hard reset). The other drive was hdb (Disk 1). Both it and your parity drive are configured in an IDE emulation mode, and the IDE handler did not know how to recover hdb, which is actually a SATA drive.

You have IDE emulation turned on for 2 of your SATA drives, namely the Parity drive and Disk 1. When you next boot, go into the BIOS settings, look for the SATA mode, and change it to a native SATA mode, preferably AHCI; anything but IDE emulation mode.

You are still running UnRAID v5.0-beta12, which is rather old, includes non-patched Realtek support, and has other issues. I strongly recommend upgrading to v5.0-rc11, the latest release.

When you restart, Disk 1 will still be missing, but you should be able to stop the array if started, unassign Disk 1, start the array, stop the array again, re-assign Disk 1, and start the array again, rebuilding Disk 1.

You also have numerous 'bad method' messages being reported, filling up much of the syslog at an astonishing rate, faster than 5 per second! Search the forums for bad method, and you will find info about it.

I'm dozing off here, so hope I haven't made any mistakes.
JDGJr Posted January 31, 2013

Thanks for the detailed suggestions!

I'm still on beta12 because it has been working for my system for a long time and I wanted to keep everything the same for my drive upgrade last weekend. Was planning to move to rc11 this weekend, but am glad I didn't change versions until I get this figured out.

The bad method messages are from unmenu, which I'll disable until I get the major issues corrected. I don't think they've always been there; maybe that is in combination with beta12 or something.

I did notice this value in the SMART report, which a wiki page said could be a SATA cabling issue:

199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 3

I'll recheck my cabling again today.
Joe L. Posted January 31, 2013

"I did notice this value in the SMART report, which a wiki page said could be a SATA cabling issue: 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 3 I'll recheck my cabling again today."

It typically is caused by un-shielded cables picking up noise from adjacent cables. If you have tie-wrapped your SATA cables to make them neat looking, you've made them MORE susceptible to picking up induced noise from the adjacent cables. Your best bet if you are getting CRC errors is to keep the cables at a distance from each other (or use shielded cables, though almost none of the internal SATA cables sold are shielded).
Joe L. Posted January 31, 2013

"the bad method messages are from unmenu, which I'll disable until I get the major issues corrected. I don't think they've always been there, maybe that is in combination with beta12 or something."

Yes, it is unMENU that is reporting them, but the cause is another device on your LAN that is constantly probing the other devices on the network. Disabling unMENU will not stop the probing or the extra network traffic, which basically only slows down your desired network traffic.

Joe L.
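To put a number on how fast the 'bad method' messages are arriving, something like the following Python sketch could be run against a copy of the syslog. It assumes the traditional syslog line prefix (`Mmm dd HH:MM:SS`, 15 characters) and that each message contains the phrase "bad method". The function name `bad_method_rate` and the sample lines are made up for illustration; they are not taken from the attached syslog.

```python
from collections import Counter

def bad_method_rate(lines, needle="bad method"):
    """Count lines containing `needle`, grouped by their syslog timestamp."""
    per_second = Counter()
    for line in lines:
        if needle in line:
            # The traditional syslog prefix is 15 characters: "Jan 30 18:10:32"
            per_second[line[:15]] += 1
    return per_second

# Hypothetical sample lines, invented for illustration only.
sample = [
    "Jan 30 18:10:32 Tower unmenu: bad method (example text)",
    "Jan 30 18:10:32 Tower kernel: unrelated message",
    "Jan 30 18:10:32 Tower unmenu: bad method (example text)",
    "Jan 30 18:10:33 Tower unmenu: bad method (example text)",
]

counts = bad_method_rate(sample)
peak_stamp, peak = counts.most_common(1)[0]
print(f"busiest second: {peak_stamp} with {peak} messages")
```

Counting per second only shows the rate; it does not identify which device on the LAN is doing the probing.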
JDGJr Posted January 31, 2013

I noticed my signature had old hardware info. I've updated it to be current.

Rob - I have an AmiBios, the latest version available for my BioStar board, and I think the SATA settings you're referring to are on the Southbridge page. The settings that generated the IDE mode you saw were:

OnChip SATA Type = IDE -> AHCI
SATA IDE Combined = enabled
SATA-III Mode = Auto

Changing from 'IDE -> AHCI' to 'AHCI' does not allow the system to boot, and I end up at a hardware summary page. Is this the proper page and settings?

I checked the wiring and made minor changes, restarted and reran the SMART test - the UDMA value remains the same:

Statistics for /dev/hdb ST31500341AS_9VS0G44C
smartctl -a -d ata /dev/hdb
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11 family
Device Model:     ST31500341AS
Serial Number:    9VS0G44C
Firmware Version: CC1J
User Capacity:    1,500,301,910,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Thu Jan 31 11:16:21 2013 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 617) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       209659794
  3 Spin_Up_Time            0x0003   097   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1173
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always       -       10322993
  9 Power_On_Hours          0x0032   062   062   000    Old_age   Always       -       33406
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       1
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       168
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       262152
189 High_Fly_Writes         0x003a   023   023   000    Old_age   Always       -       77
190 Airflow_Temperature_Cel 0x0022   069   063   045    Old_age   Always       -       31 (Min/Max 30/31)
194 Temperature_Celsius     0x0022   031   040   000    Old_age   Always       -       31 (0 13 0 0)
195 Hardware_ECC_Recovered  0x001a   053   032   000    Old_age   Always       -       209659794
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       49581102468911
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       724182207
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       399208342

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     33406         -
# 2  Short offline       Completed without error       00%     33405         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
RobJ Posted January 31, 2013

"I'm still on beta12 because it has been working for my system for a long time and I wanted to keep everything the same for my drive upgrade last weekend. Was planning to move to rc11 this weekend, but am glad I didn't change versions until I get this figured out."

The sooner the better, I think. I believe that RC11 would be more reliable than beta12; many things have been fixed.

"the bad method messages are from unmenu, which I'll disable until I get the major issues corrected. I don't think they've always been there, maybe that is in combination with beta12 or something."

As Joe said, it is creating a fair amount of network traffic, so when you have a chance, I would track that down.

"I did notice this value in the SMART report, which a wiki page said could be a SATA cabling issue: 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 3 I'll recheck my cabling again today."

I did not mention it because that drive has 33406 usage hours on it, so 3 is nothing to worry about unless they are recent, and there is no indication of that. I imagine you have been using this drive for 3 to 4 years? The SMART report does not give us any indication as to when those 3 CRC errors occurred, so they very well could have occurred years ago. Never hurts to recheck the cabling though.

"I noticed my signature had old hardware info. I've updated it to be current. Rob - I have an AmiBios, the latest version available for my BioStar board, and I think the SATA settings you're referring to are on the Southbridge page. The settings that generated the IDE mode you saw were: OnChip SATA Type = IDE -> AHCI, SATA IDE Combined = enabled, SATA-III Mode = Auto. Changing from 'IDE -> AHCI' to 'AHCI' does not allow the system to boot, and I end up at a hardware summary page. Is this the proper page and settings? I checked the wiring and made minor changes, restarted and reran the SMART test - the UDMA value remains the same:"

I did notice in the syslog that you had a BioStar, so I ignored the sig, but that didn't really matter, since I don't know what people see in their BIOS settings anyway. I just try to be general enough when I advise. What you saw and did was correct.

Once in a while, when a drive change is made, the BIOS will rearrange the starting boot drive order. Check to see if your UnRAID flash drive is still the first boot drive.

That "SATA IDE Combined" option is new to me. I think it might be better to disable it; you don't want anything IDE related, except for true IDE drives.
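Since a single SMART report cannot show when the CRC errors happened, one way to tell whether they are still accumulating is to compare the raw values from two reports taken at different times. The Python sketch below does that for a few attribute rows copied from the two reports posted in this thread. The helper names `parse_raw_values` and `changed_attributes` are made up for illustration, and the parsing assumes the ten-column attribute table layout that smartctl 5.40 printed here.

```python
def parse_raw_values(report_text):
    """Map attribute name -> first RAW_VALUE token for each attribute row."""
    values = {}
    for line in report_text.splitlines():
        fields = line.split()
        # Attribute rows: ID#, NAME, FLAG (0x....), VALUE, WORST, THRESH,
        # TYPE, UPDATED, WHEN_FAILED, RAW_VALUE [extra text]
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            values[fields[1]] = fields[9]
    return values

def changed_attributes(before, after):
    """Return {name: (old, new)} for attributes whose raw value differs."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }

# Rows copied from the Jan 30 and Jan 31 reports in this thread.
report_jan30 = """
  4 Start_Stop_Count 0x0032 099 099 020 Old_age Always - 1171
 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 166
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 3
"""
report_jan31 = """
  4 Start_Stop_Count 0x0032 099 099 020 Old_age Always - 1173
 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 168
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 3
"""

changed = changed_attributes(parse_raw_values(report_jan30),
                             parse_raw_values(report_jan31))
for name, (old, new) in changed.items():
    print(f"{name}: {old} -> {new}")
```

With these rows, only the start/stop and power-cycle counts change; UDMA_CRC_Error_Count stays at 3, which is consistent with the errors being old rather than ongoing.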
JDGJr Posted January 31, 2013

I'll get to rc11 as soon as I get my disk 1 back. Tom's notes say to only upgrade a stable system.

Thanks for noting the 'SATA IDE Combined' setting. This got me to work on disabling everything IDE... First I disabled the PCI IDE Busmaster setting on the main settings page. Then I disabled 'SATA IDE Combined' on the SB page, and the hda/hdb drives were no longer presented in the BIOS list. Then I changed the SB setting to 'OnChip SATA Type = AHCI', and the BIOS list showed all drives as boot options (oddly as IDE-blah-blah).

Now, running unraid shows all drives as sd !!! First time I've seen that with this motherboard!

Question: if I use the 'unassign disk1 ... reassign disk1' steps, do I need to run pre-clear? Or, more exactly, will the drive be cleared before the data is written, or is it just written?

I really appreciate the detailed thought and info you've given!
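On Linux kernels of this era, disks handled by the legacy IDE driver appear as hdX, while disks handled through the SATA/SCSI layer appear as sdX, which is why the device names are a quick check that no drive is still in IDE emulation. A minimal Python sketch of that check follows. The function name `legacy_ide_devices` and the device lists are illustrative only, not a listing from this server.

```python
import re

# Legacy IDE driver naming: hda, hdb, ... (SATA/SCSI disks are sda, sdb, ...)
LEGACY_IDE = re.compile(r"^hd[a-z]+$")

def legacy_ide_devices(names):
    """Return the device names that use the legacy IDE 'hdX' scheme."""
    return [name for name in names if LEGACY_IDE.match(name)]

# Illustrative device lists: before and after switching the BIOS to AHCI.
before = ["hda", "hdb", "sda", "sdb"]
after = ["sda", "sdb", "sdc", "sdd"]

print("before:", legacy_ide_devices(before))
print("after: ", legacy_ide_devices(after))
```

An empty result for the second list corresponds to what is described above: every drive showing up as sd.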
RobJ Posted January 31, 2013

"Question: if I use the 'unassign disk1 ... reassign disk1' steps, do I need to run pre-clear? Or more exact, will the drive be cleared before the data is written, or is it just written? I really appreciate the detailed thought and info you've given!"

I admit I like a little appreciation now and then! Thank you.

The rebuild will completely overwrite the drive with the correct data, so no clearing is necessary. If the SMART report had shown anything suspicious, then it might have been a good idea to Preclear it, for the thorough testing it includes, but the report looks fine.
JDGJr Posted January 31, 2013

WOW! That change made a huge difference - this rebuild is running about 10 times faster (90MB/sec) than the one last weekend.

[Edit] It ended up at about 60MB/sec, but that is still much faster than when parity and disk 1 were IDE drives.

For the 'bad method' messages, it seems the plan of attack is to monitor the unmenu log and remove devices from the LAN until they stop, correct? Interestingly, I'm not seeing the entries today. Will have to start turning on devices.
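For a sense of what those rates mean in wall-clock time, here is a back-of-the-envelope Python sketch. It uses the 1.5TB drive's capacity as reported by smartctl earlier in the thread and assumes a constant transfer rate with 1 MB = 1,000,000 bytes; real rebuild speed varies over the course of the run, so treat the result as an estimate only. The function name `rebuild_hours` is made up for illustration.

```python
DISK_BYTES = 1_500_301_910_016  # "User Capacity" from the SMART report above

def rebuild_hours(size_bytes, mb_per_sec):
    """Hours to write size_bytes at a constant rate (1 MB = 1,000,000 bytes)."""
    seconds = size_bytes / (mb_per_sec * 1_000_000)
    return seconds / 3600

for rate in (90, 60):
    print(f"{rate} MB/sec -> about {rebuild_hours(DISK_BYTES, rate):.1f} hours")
```

Under these assumptions the rebuild takes roughly 4.6 hours at 90MB/sec and roughly 6.9 hours at 60MB/sec.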