December 27, 2009
Hi. This morning my parity disk had some errors, so I restarted and started a parity check. After some time it stopped with errors again. So, is the drive damaged?

December 27, 2009
The errors seem to be on ATA7 / SDF, which is your Cache disk.

Dec 27 19:02:56 media-server kernel: scsi 7:0:0:0: Direct-Access ATA SAMSUNG HD154UI 1AG0 PQ: 0 ANSI: 5
Dec 27 19:02:56 media-server kernel: sd 7:0:0:0: [sdf] 2930277168 512-byte logical blocks: (1.50 TB/1.36 TiB)
Dec 27 19:02:56 media-server emhttp: pci-0000:00:1f.2-scsi-0:0:0:0 host7 (sdf) ata-SAMSUNG_HD154UI_S1XWJ9AS700382
Dec 27 19:02:57 media-server kernel: md: import disk0: [8,80] (sdf) SAMSUNG HD154UI S1XWJ9AS700382 offset: 63 size: 1465138552
Dec 27 19:02:57 media-server kernel: md: disk0 replaced
<<...>>
Dec 27 20:20:35 media-server kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 27 20:20:35 media-server kernel: ata7.00: cmd 35/00:d8:cf:37:7e/00:02:05:00:00/e0 tag 0 dma 372736 out
Dec 27 20:20:35 media-server kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 27 20:20:35 media-server kernel: ata7.00: status: { DRDY }
Dec 27 20:20:40 media-server kernel: ata7: link is slow to respond, please be patient (ready=0)
Dec 27 20:20:45 media-server kernel: ata7: device not ready (errno=-16), forcing hardreset
Dec 27 20:20:45 media-server kernel: ata7: soft resetting link
Dec 27 20:20:51 media-server kernel: ata7: link is slow to respond, please be patient (ready=0)
Dec 27 20:20:55 media-server kernel: ata7: SRST failed (errno=-16)
Dec 27 20:20:55 media-server kernel: ata7: soft resetting link
Dec 27 20:21:01 media-server kernel: ata7: link is slow to respond, please be patient (ready=0)
Dec 27 20:21:06 media-server kernel: ata7: SRST failed (errno=-16)
Dec 27 20:21:06 media-server kernel: ata7: soft resetting link
Dec 27 20:21:11 media-server kernel: ata7: link is slow to respond, please be patient (ready=0)
Dec 27 20:21:41 media-server kernel: ata7: SRST failed (errno=-16)
Dec 27 20:21:41 media-server kernel: ata7: soft resetting link
Dec 27 20:21:46 media-server kernel: ata7: SRST failed (errno=-16)
Dec 27 20:21:46 media-server kernel: ata7: reset failed, giving up
Dec 27 20:21:46 media-server kernel: ata7.00: disabled
Dec 27 20:21:46 media-server kernel: ata7.00: device reported invalid CHS sector 0
Dec 27 20:21:46 media-server kernel: ata7: EH complete
Dec 27 20:21:46 media-server kernel: sd 7:0:0:0: [sdf] Unhandled error code
Dec 27 20:21:46 media-server kernel: sd 7:0:0:0: [sdf] Result: hostbyte=0x04 driverbyte=0x00
Dec 27 20:21:46 media-server kernel: end_request: I/O error, dev sdf, sector 92157903
Dec 27 20:21:46 media-server kernel: sd 7:0:0:0: [sdf] Unhandled error code
Dec 27 20:21:46 media-server kernel: sd 7:0:0:0: [sdf] Result: hostbyte=0x04 driverbyte=0x00
Dec 27 20:21:46 media-server kernel: end_request: I/O error, dev sdf, sector 92158631
Dec 27 20:21:46 media-server kernel: sd 7:0:0:0: [sdf] Unhandled error code
Dec 27 20:21:46 media-server kernel: sd 7:0:0:0: [sdf] Result: hostbyte=0x04 driverbyte=0x00
Dec 27 20:21:46 media-server kernel: end_request: I/O error, dev sdf, sector 92159183
Dec 27 20:21:46 media-server kernel: md: disk0 write error
Dec 27 20:21:46 media-server kernel: handle_stripe write error: 92157840/0, count: 1
<<...>>
Dec 27 20:39:01 media-server kernel: mdcmd (621): spindown 0
Dec 27 20:39:01 media-server kernel: md: disk0: ATA_OP_STANDBYNOW1 ioctl error: -5

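For anyone triaging a similar syslog, a small filter can count the block-layer I/O errors per device, which makes it quicker to see which disk the kernel is complaining about. This is only a sketch: the function name `io_error_summary` is made up for illustration, and it assumes the "end_request: I/O error, dev sdX, sector N" wording seen in the log above (other kernel versions may word this line differently).

```shell
# Hypothetical helper (not from the thread): count I/O errors per block
# device in a syslog stream read from stdin. Assumes lines shaped like
#   ... kernel: end_request: I/O error, dev sdf, sector 92157903
io_error_summary() {
    awk '
        /I\/O error, dev / {
            # The device name is the token right after "dev"; drop its comma.
            for (i = 1; i < NF; i++) {
                if ($i == "dev") {
                    dev = $(i + 1)
                    sub(/,$/, "", dev)
                    count[dev]++
                }
            }
        }
        END {
            # One line per device: "<device> <number of I/O errors>"
            for (d in count) printf "%s %d\n", d, count[d]
        }
    ' | sort
}
```

Assuming the log lives at /var/log/syslog, it would be run as `io_error_summary < /var/log/syslog`. A count only says where the errors landed, not whether the drive, the cable, or the power is at fault.
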
December 27, 2009, Author
I think it is the parity:

cache device: pci-0000:02:00.0-ide-0:0 ide2 (hde) ata-IC35L120AVV207-1_VNVD07G4CNG7AL
parity device: pci-0000:00:1f.2-scsi-0:0:0:0 host7 (sdf) ata-SAMSUNG_HD154UI_S1XWJ9AS700382

Do I need a new drive, or is this a software problem?

December 28, 2009
The only way to answer your question is to run a SMART report on the drive and have it report on its health. The errors you are seeing could be caused by a loose cable to the drive (either data or power) or by a problem internal to the drive.

Let us know the output of the following command:

smartctl -a -d ata /dev/sdf

If you are unable to get a report from the drive, stop the array, power down, and check the cabling for loose connections. Then power back up and try once more to run the smartctl command. If the drive is still unresponsive, try a different cable. If it is still unresponsive, it is time for an RMA.

Joe L.

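A note for later readers: whether smartctl "got a report" can also be read from its exit status, which the smartctl man page documents as a bitmask. The helper below is a rough sketch that decodes it; the function name `describe_smartctl_status` and the wording of the messages are mine (paraphrased from the man page as I understand it), not part of smartmontools.

```shell
# Hypothetical helper (not from the thread): decode smartctl's exit status,
# which its man page documents as a bitmask. Pass the status as $1.
describe_smartctl_status() {
    rc=$1
    if [ "$rc" -eq 0 ]; then
        echo "ok: no problems reported"
        return 0
    fi
    bit=0
    # Meanings are listed in bit order, bit 0 first.
    for meaning in \
        "command line did not parse" \
        "device open failed, or no IDENTIFY data returned" \
        "a SMART or ATA command failed, or checksum error" \
        "SMART status says DISK FAILING" \
        "pre-fail attribute at or below threshold" \
        "attribute was at or below threshold in the past" \
        "device error log contains errors" \
        "self-test log contains errors"
    do
        # Print the meaning of every bit that is set in the status.
        if [ $(( (rc >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $meaning"
        fi
        bit=$((bit + 1))
    done
    return 0
}
```

On the live box this would be used as `smartctl -a -d ata /dev/sdf >/dev/null; describe_smartctl_status $?`. If bit 1 is set, smartctl could not talk to the drive at all, which is the case where the cable checks above apply.
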
December 28, 2009
Oops, you're right. That is indeed your parity disk. I'm not sure why I thought it was the cache disk. As to why or what the errors mean, I don't know.

Dec 27 19:04:45 media-server emhttp: shcmd (39): mount -t reiserfs -o noacl,nouser_xattr,noatime,nodiratime /dev/hde1 /mnt/cache >/dev/null 2>&1

December 28, 2009, Author
I had to reboot the system because the SMART report said that it could not find a drive. After the reboot the parity disk appears as new, and SMART reports the following:

root@media-server:~# smartctl -a -d ata /dev/sdf
smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD154UI
Serial Number:    S1XWJ9AS700382
Firmware Version: 1AG01118
User Capacity:    1,500,301,910,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Mon Dec 28 09:36:22 2009 GMT-1

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00)
    Offline data collection activity was never started.
    Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0)
    The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (18553) seconds.
Offline data collection capabilities: (0x7b)
    SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new command.
    Offline surface scan supported.
    Self-test supported.
    Conveyance Self-test supported.
    Selective Self-test supported.
SMART capabilities: (0x0003)
    Saves SMART data before entering power-saving mode.
    Supports SMART auto save timer.
Error logging capability: (0x01)
    Error logging supported.
    General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 32) minutes.
SCT capabilities: (0x003f)
    SCT Status supported.
    SCT Feature Control supported.
    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   068   068   011    Pre-fail  Always       -       10330
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       683
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1301
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       60
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 Unknown_Attribute       0x0033   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   087   075   000    Old_age   Always       -       13 (Lifetime Min/Max 13/17)
194 Temperature_Celsius     0x0022   086   074   000    Old_age   Always       -       14 (Lifetime Min/Max 13/18)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       20832
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   099   099   000    Old_age   Always       -       491
201 Soft_Read_Error_Rate    0x000a   253   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Should I start the system and do a parity check?

December 28, 2009
Since the drive seems to be fine, I'd suspect the cabling to it as the cause of its outage. I'd power down, re-seat the cables, then power back up. If it still responds with a smartctl report, then you are a prime candidate for the "trust my parity disk" procedure described in the wiki:

http://lime-technology.com/wiki/index.php?title=Make_unRAID_Trust_the_Parity_Drive,_Avoid_Rebuilding_Parity_Unnecessarily

Let the resulting parity check complete. It may find errors if you wrote to the array prior to shutting it down (and shutting it down writes to the housekeeping area of each disk, so you will have some small number of errors near the start of the parity check regardless of what you do). It must complete and update the parity disk, so do not stop it before it finishes. Once it has completed a "check", press the button once more; the second "check" should find no errors (and hopefully your parity disk will still be online at its completion).

Joe L.

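The thread doesn't say which numbers led to the "drive seems to be fine" reading. A common rule of thumb (my assumption, not something stated above) is to check that the raw values of Reallocated_Sector_Ct, Current_Pending_Sector, Offline_Uncorrectable and UDMA_CRC_Error_Count are zero, which they all are in the report posted here. A rough sketch of that check, with a made-up function name, reading smartctl output from stdin:

```shell
# Hypothetical helper (not from the thread): read "smartctl -a" output on
# stdin and flag four commonly watched attributes with a non-zero raw value.
# The choice of IDs (5, 197, 198, 199) is an assumption, not from the thread.
smart_quick_check() {
    awk '
        # Attribute rows start with a numeric ID and have at least 10 fields;
        # field 2 is the attribute name and field 10 is the raw value.
        $1 ~ /^[0-9]+$/ && NF >= 10 && ($1 == 5 || $1 == 197 || $1 == 198 || $1 == 199) {
            seen++
            if ($10 + 0 != 0) {
                printf "WARN %s raw=%s\n", $2, $10
                bad++
            }
        }
        END {
            if (seen == 0) print "no attribute rows found"
            else if (bad == 0) print "OK: watched attributes are all zero"
        }
    '
}
```

It would be run as `smartctl -a -d ata /dev/sdf | smart_quick_check`. A clean result only says the drive itself is not reporting trouble; it cannot rule out cabling or power problems, so the re-seating steps above still apply.
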
January 3, 2010, Author
OK, problem solved. The PSU was the problem. After I replaced it, unRAID runs perfectly again. Thanks anyway.