marcusone Posted September 11, 2015

I have errors showing on 2 drives. What is the best way to proceed? Replace one, rebuild, then replace the other? Or replace both (via a clone tool or similar) and then rebuild? A screenshot is attached. Thank you!

edit: snippet from syslog (can post entire thing if you really want?):

Sep 10 23:34:47 RCNAS kernel: md: disk9 read error, sector=1637249936
Sep 10 23:34:47 RCNAS kernel: md: disk7 read error, sector=1637249944
Sep 10 23:34:47 RCNAS kernel: md: disk9 read error, sector=1637249944
Sep 10 23:34:47 RCNAS kernel: md: disk7 read error, sector=1637249952
Sep 10 23:34:47 RCNAS kernel: md: disk9 read error, sector=1637249952
Sep 10 23:34:47 RCNAS kernel: md: disk7 read error, sector=1637249960
Sep 10 23:34:47 RCNAS kernel: md: disk9 read error, sector=1637249960
Sep 10 23:34:47 RCNAS kernel: md: disk7 read error, sector=1637249968
Sep 10 23:34:47 RCNAS kernel: md: disk9 read error, sector=1637249968
Sep 10 23:34:47 RCNAS kernel: md: disk7 read error, sector=1637249976
Sep 10 23:34:47 RCNAS kernel: md: disk9 read error, sector=1637249976
Sep 10 23:34:47 RCNAS kernel: md: disk7 read error, sector=1637249984
Sep 10 23:34:47 RCNAS kernel: md: disk9 read error, sector=1637249984
Sep 10 23:34:47 RCNAS kernel: md: disk7 read error, sector=1637249992
Sep 10 23:34:47 RCNAS kernel: md: disk9 read error, sector=1637249992
Sep 10 23:34:47 RCNAS kernel: md: disk7 read error, sector=1637250000
Sep 10 23:34:47 RCNAS kernel: md: disk9 read error, sector=1637250000
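When a syslog is dominated by repeated lines like these, a per-disk tally can be easier to read than the raw excerpt. The following is a minimal Python sketch, assuming only the `md: diskN read error, sector=S` line format shown in the post above; the function name `summarize_read_errors` is made up for illustration and is not part of unRAID.

```python
import re
from collections import defaultdict

# Matches unRAID md-driver lines such as:
#   "Sep 10 23:34:47 RCNAS kernel: md: disk9 read error, sector=1637249936"
READ_ERROR = re.compile(r"md: disk(\d+) read error, sector=(\d+)")


def summarize_read_errors(log_text):
    """Return {disk_number: (error_count, lowest_sector, highest_sector)}."""
    sectors = defaultdict(list)
    for disk, sector in READ_ERROR.findall(log_text):
        sectors[int(disk)].append(int(sector))
    return {d: (len(s), min(s), max(s)) for d, s in sectors.items()}


# Three lines copied from the excerpt above.
sample = """\
Sep 10 23:34:47 RCNAS kernel: md: disk9 read error, sector=1637249936
Sep 10 23:34:47 RCNAS kernel: md: disk7 read error, sector=1637249944
Sep 10 23:34:47 RCNAS kernel: md: disk9 read error, sector=1637249944
"""

summary = summarize_read_errors(sample)
for disk, (count, lowest, highest) in sorted(summary.items()):
    print(f"disk{disk}: {count} read errors, sectors {lowest}..{highest}")
```

Because the pattern does not depend on line breaks, the same function also works on a log that has been pasted as one long line.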
trurl Posted September 11, 2015

Post the complete syslog and a more complete screenshot.
marcusone Posted September 11, 2015

Here is a larger screenshot. I'm running v5.0.5.

I'm not sure what a full syslog will do, as it's 21 kbytes in size and mostly filled with the error I posted, plus some private information (files, etc.) that I'd rather not share. Is there some way to clean file names out quickly?

I would really appreciate some advice on the best way to replace the 2 failing drives. They are 4.5-year-old WD Green drives, so they are out of warranty and need replacing, and I'd like advice so I can decide on the best method.

Thanks!
marcusone Posted September 11, 2015

I think I got the personal info out of my log, but it won't let me post the zip file, which is still 524 KB in size (not sure why; I keep getting a connection timeout). Either way, it's full of those errors. Here are a few others that may be of interest:

Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS shfs/user: shfs_open: open:
Sep 10 20:02:55 RCNAS shfs/user: shfs_readdir: readdir_r:
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS shfs/user: shfs_readdir: readdir_r:
Sep 10 20:02:55 RCNAS last message repeated 5 times
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS last message repeated 2 times
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS shfs/user: shfs_readdir: readdir_r:
Sep 10 20:02:55 RCNAS last message repeated 5 times
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS shfs/user: shfs_readdir: readdir_r:
Sep 10 20:02:55 RCNAS shfs/user: shfs_readdir: readdir_r:
Sep 10 20:02:55 RCNAS shfs/user: shfs_open: open:
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS shfs/user: shfs_readdir: readdir_r:
Sep 10 20:02:55 RCNAS last message repeated 3 times
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS last message repeated 2 times
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS shfs/user: shfs_readdir: readdir_r:
Sep 10 20:02:55 RCNAS last message repeated 7 times
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
marcusone Posted September 11, 2015

SMART reports on the two drives:

sdq (disk7)

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF)
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZA3407269
LU WWN Device Id: 5 0014ee 600c0d3c7
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Fri Sep 11 10:34:58 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity was suspended by an interrupting command from host. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (35760) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 345) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       2486
  3 Spin_Up_Time            0x0027   242   165   021    Pre-fail  Always       -       2900
  4 Start_Stop_Count        0x0032   093   093   000    Old_age   Always       -       7087
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   047   047   000    Old_age   Always       -       39316
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       88
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       48
193 Load_Cycle_Count        0x0032   176   176   000    Old_age   Always       -       73221
194 Temperature_Celsius     0x0022   120   111   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   172   000    Old_age   Offline      -       84

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     25084         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

disk9 (sdr in screen shot, but now sdm for some reason, perhaps because I stopped the array and assignments can sometimes change on my IBM controller?)
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF)
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZA3269017
LU WWN Device Id: 5 0014ee 6ab6b73eb
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Fri Sep 11 10:36:39 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity was suspended by an interrupting command from host. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (36600) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 353) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   164   021    Pre-fail  Always       -       1908
  4 Start_Stop_Count        0x0032   093   093   000    Old_age   Always       -       7844
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   047   047   000    Old_age   Always       -       39309
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       90
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       52
193 Load_Cycle_Count        0x0032   167   167   000    Old_age   Always       -       101679
194 Temperature_Celsius     0x0022   120   115   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
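Long smartctl reports like the two above are easier to compare once the attribute table is reduced to the handful of counters that matter for failing sectors. Here is a rough Python sketch of that idea. The watch list is an assumption chosen for illustration (it is not an official smartmontools rule), and `flag_attributes` is a hypothetical helper name.

```python
import re

# Attributes that usually deserve a closer look when their raw value is
# non-zero. ASSUMPTION: this list is illustrative, not an official rule.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

# One attribute row in smartctl's table:
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"\d+\s+(?P<name>[A-Za-z][\w-]*)\s+0x[0-9a-f]{4}\s+\d{3}\s+\d{3}\s+\d{3}\s+"
    r"(?:Pre-fail|Old_age)\s+(?:Always|Offline)\s+\S+\s+(?P<raw>\d+)"
)


def flag_attributes(report):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    flagged = {}
    for match in ROW.finditer(report):
        raw = int(match.group("raw"))
        if match.group("name") in WATCHED and raw > 0:
            flagged[match.group("name")] = raw
    return flagged


# Rows copied from the disk7 and disk9 reports above.
disk7_rows = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
200 Multi_Zone_Error_Rate   0x0008   200   172   000    Old_age   Offline      -       84
"""
disk9_rows = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
"""

disk7_flags = flag_attributes(disk7_rows)
disk9_flags = flag_attributes(disk9_rows)
print("disk7:", disk7_flags or "nothing flagged")
print("disk9:", disk9_flags or "nothing flagged")
```

A non-empty result is only a prompt to look closer, for example by running a self-test; it does not by itself prove the drive has failed.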
RobJ Posted September 11, 2015

"edit: snippet from syslog (can post entire thing if you really want?):"

We ALWAYS want! You are welcome to clean out anything personal or private (so long as you keep the file a text file), but we HAVE to have the entire syslog to see the beginning setup and the very first errors that occurred. The errors that occur later are almost never interesting, as they are often just consequences of the original problem. It's the errors associated with the cause of the problem that we need to see. If you prefer, you can chop off the last three quarters of the file, which is all redundant errors. Then zip the syslog and attach it (it compresses to about a tenth of its size), or post a public link to the zip.

Disk 9 looks great, no issues. Disk 7 has a few bad sectors (Current_Pending_Sector count is 2) and needs to be tested and replaced. I strongly recommend avoiding ANY writing to the array. To rebuild Disk 7, you will probably have to trust parity and the current Disk 9. Then you can rebuild onto a new 2TB drive newly assigned to Disk 7. Once the current Disk 7 is out, you can Preclear it a couple of times, make sure there are no more current pending sectors, and hopefully reuse it.
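The trimming and zipping described above can be scripted. The sketch below is a minimal Python illustration, not a vetted sanitizer: it keeps the first quarter of the log, blanks anything following `/mnt/` on a line (on the ASSUMPTION that private file names show up as paths under `/mnt/`), and writes the result into a zip archive in memory. The function names and the sample log lines are made up for the example, and the output should still be read through by hand before posting.

```python
import io
import re
import zipfile


def keep_head(text, fraction=0.25):
    """Return the first `fraction` of the log's lines (at least one line)."""
    lines = text.splitlines()
    count = max(1, int(len(lines) * fraction))
    return lines[:count]


def redact_paths(line):
    """Blank everything after '/mnt/' (assumed location of private names)."""
    return re.sub(r"/mnt/.*", "/mnt/<redacted>", line)


def build_zip(lines, member="syslog.txt"):
    """Pack the lines into a single-member zip and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(member, "\n".join(lines) + "\n")
    return buffer.getvalue()


# Made-up sample: two early lines followed by six repeats of one error.
sample_lines = [
    "Jul 21 22:10:59 RCNAS kernel: boot message",
    "Sep 10 17:00:00 RCNAS example: moving /mnt/user/private/file name.txt",
] + ["Sep 10 20:02:55 RCNAS kernel: md: disk7 read error, sector=1"] * 6

cleaned = [redact_paths(line) for line in keep_head("\n".join(sample_lines))]
archive = build_zip(cleaned)

# Read the archive back to confirm what would be posted.
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    restored = zf.read("syslog.txt").decode("utf-8")
print(restored, end="")
```

To produce a file for attaching, the returned bytes can be written out with something like `open("syslog.zip", "wb").write(archive)`.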
marcusone Posted September 11, 2015

Thanks RobJ. Here is the top portion of my syslog, showing when the errors started (the rest of the log was 99% these errors repeating themselves). Hopefully this helps and you can guide me better! APPRECIATE IT!

syslog-2015-09-11_sm.zip
RobJ Posted September 11, 2015

Syslog is fine for days, with Disk 7 at sdm (sd 3:0:1:0) and Disk 9 at sdp (sd 3:0:4:0). Then it appears you hot-plugged a drive at Sep 10 17:09:20, into slot sd 3:0:5:0, about 10 minutes after Disk 7 spun down. It was assigned the drive symbol sdq. It looks like this:

Sep 10 16:59:05 RCNAS kernel: mdcmd (2813): spindown 7
Sep 10 17:09:20 RCNAS kernel: sd 3:0:1:0: [sdm] Synchronizing SCSI cache
Sep 10 17:09:20 RCNAS kernel: sd 3:0:1:0: [sdm]
Sep 10 17:09:20 RCNAS kernel: Result: hostbyte=0x01 driverbyte=0x00
Sep 10 17:09:20 RCNAS kernel: mpt2sas1: removing handle(0x000a), sas_addr(0x4433221101000000)
Sep 10 17:09:27 RCNAS kernel: scsi 3:0:5:0: Direct-Access ATA WDC WD20EARS-00M AB51 PQ: 0 ANSI: 6
Sep 10 17:09:27 RCNAS kernel: scsi 3:0:5:0: SATA: handle(0x000a), sas_addr(0x4433221101000000), phy(1), device_name(0x0000000000000000)
Sep 10 17:09:27 RCNAS kernel: scsi 3:0:5:0: SATA: enclosure_logical_id(0x500605b00372dfc0), slot(2)
Sep 10 17:09:27 RCNAS kernel: scsi 3:0:5:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Sep 10 17:09:27 RCNAS kernel: scsi 3:0:5:0: qdepth(32), tagged(1), simple(0), ordered(0), scsi_level(7), cmd_que(1)
Sep 10 17:09:27 RCNAS kernel: sd 3:0:5:0: Attached scsi generic sg12 type 0
Sep 10 17:09:27 RCNAS kernel: sd 3:0:5:0: [sdq] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Sep 10 17:09:27 RCNAS kernel: sd 3:0:5:0: [sdq] Write Protect is off
Sep 10 17:09:27 RCNAS kernel: sd 3:0:5:0: [sdq] Mode Sense: 7f 00 00 08
Sep 10 17:09:27 RCNAS kernel: sd 3:0:5:0: [sdq] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Sep 10 17:09:27 RCNAS kernel: sdq: sdq1
Sep 10 17:09:27 RCNAS kernel: sd 3:0:5:0: [sdq] Attached SCSI disk

The messages with sdm look completely innocent, but after this the next mention of Disk 7 involves read errors, so it (sdm) wasn't responding.

Almost the same thing happens again about 2 hours later, this time involving Disk 9 and a second new drive that appears to be hot-plugged in. It's plugged into sd 3:0:6:0 and assigned sdr.

Sep 10 18:50:51 RCNAS kernel: mdcmd (2823): spindown 9
Sep 10 18:50:52 RCNAS kernel: mdcmd (2824): spindown 10
Sep 10 18:50:52 RCNAS kernel: mdcmd (2825): spindown 11
Sep 10 19:12:54 RCNAS kernel: sd 3:0:4:0: [sdp] Synchronizing SCSI cache
Sep 10 19:12:54 RCNAS kernel: sd 3:0:4:0: [sdp]
Sep 10 19:12:54 RCNAS kernel: Result: hostbyte=0x01 driverbyte=0x00
Sep 10 19:12:54 RCNAS kernel: mpt2sas1: removing handle(0x000d), sas_addr(0x4433221102000000)
Sep 10 19:13:02 RCNAS kernel: scsi 3:0:6:0: Direct-Access ATA WDC WD20EARS-00M AB51 PQ: 0 ANSI: 6
Sep 10 19:13:02 RCNAS kernel: scsi 3:0:6:0: SATA: handle(0x000d), sas_addr(0x4433221102000000), phy(2), device_name(0x0000000000000000)
Sep 10 19:13:02 RCNAS kernel: scsi 3:0:6:0: SATA: enclosure_logical_id(0x500605b00372dfc0), slot(1)
Sep 10 19:13:02 RCNAS kernel: scsi 3:0:6:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Sep 10 19:13:02 RCNAS kernel: scsi 3:0:6:0: qdepth(32), tagged(1), simple(0), ordered(0), scsi_level(7), cmd_que(1)
Sep 10 19:13:02 RCNAS kernel: sd 3:0:6:0: Attached scsi generic sg15 type 0
Sep 10 19:13:02 RCNAS kernel: sd 3:0:6:0: [sdr] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Sep 10 19:13:02 RCNAS kernel: sd 3:0:6:0: [sdr] Write Protect is off
Sep 10 19:13:02 RCNAS kernel: sd 3:0:6:0: [sdr] Mode Sense: 7f 00 00 08
Sep 10 19:13:02 RCNAS kernel: sd 3:0:6:0: [sdr] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Sep 10 19:13:02 RCNAS kernel: sdr: sdr1
Sep 10 19:13:02 RCNAS kernel: sd 3:0:6:0: [sdr] Attached SCSI disk
Sep 10 19:39:05 RCNAS kernel: md: disk9 read error, sector=2162691968
Sep 10 19:39:10 RCNAS shfs/user: shfs_readdir:
Sep 10 19:39:10 RCNAS shfs/user: shfs_readdir: readdir_r:
Sep 10 19:39:10 RCNAS kernel: md: disk7 read error, sector=2162691968
Sep 10 19:39:10 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [20369 188461 0x0 SD]
Sep 10 19:39:10 RCNAS kernel: REISERFS (device md9): Remounting filesystem read-only

26 minutes afterward, a read of Disk 9 is attempted and fails, plus Reiser file system corruption is detected, so Disk 9 is remounted read-only, which is going to fail even more I/O. It takes a while, but both drives report a lot of read errors and Disk 9 reports a lot of file system errors; then Disk 7 is also found corrupted and remounted read-only.

My comments so far are strictly based on the syslog. Now when I look at your second screen pic, I'm amazed! Disk 7 shows as sdq(!) and Disk 9 as sdr! So it seems that what *looked* like a hotplug event was either a serious bug in the driver, or the driver thought the drive was disconnected, saw it quickly reconnect, and set it up as a new drive, which would be fatal for unRAID. The message "removing handle" must indicate it was dropping the drive. When you stop the array, unRAID takes a new inventory and recognizes the drives by their serial numbers, so it picked up their new drive device symbols without even noticing they had changed! You commented "sdr in screen shot, but now sdm for some reason". I suspect that Disk 9 had another pseudo 'hotplug event' and was re-assigned a drive device symbol of sdm, because sdm was now available and no longer in use (it used to be Disk 7).

I have no idea what happened. The SAS error handler is singularly uninformative: it did not say anything about drives being disconnected and didn't explain anywhere what was wrong. Perhaps the drives are loose, or vibrated loose? What's really bizarre is that the drives are reported as moved to completely different physical slots (sd 3:0:1:0 to sd 3:0:5:0, and sd 3:0:4:0 to sd 3:0:6:0). Is there any chance that someone pulled the drives out and pushed them back in, into different trays?
What I *can* say is that it's not the fault of the drives. They appear to be fine. Disk 7 does have pending sectors, but they weren't involved in the problems above.

Something else I can't help you with:

Jul 21 22:10:59 RCNAS kernel: vmw_vmci 0000:00:07.7: Found VMCI PCI device at 0x11080, irq 16
Jul 21 22:10:59 RCNAS kernel: vmw_vmci 0000:00:07.7: Using capabilities 0xc
Jul 21 22:10:59 RCNAS kernel: vmw_vmci 0000:00:07.7: irq 74 for MSI/MSI-X
Jul 21 22:10:59 RCNAS kernel: vmw_vmci 0000:00:07.7: irq 75 for MSI/MSI-X
Jul 21 22:10:59 RCNAS kernel: Guest personality initialized and is active
Jul 21 22:10:59 RCNAS kernel: VMCI host device registered (name=vmci, major=10, minor=59)
Jul 21 22:10:59 RCNAS kernel: Initialized host personality
Jul 21 22:10:59 RCNAS vmsvc[1494]: [ warning] [GLib-GObject] invalid (NULL) pointer instance
Jul 21 22:10:59 RCNAS vmsvc[1494]: [critical] [GLib-GObject] g_signal_emit_by_name: assertion `G_TYPE_CHECK_INSTANCE (instance)' failed

It's VMware related, which I have no experience with, but I pay attention to anything reporting a 'warning' and then a 'critical'! It seems to involve the VMCI device, again something I don't know anything about. I suspect something here is broken.
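Returning to the drive-symbol changes traced earlier in this post: the same kind of timeline can be pulled out of a syslog mechanically. The following Python sketch assumes only the two kernel `sd` message forms quoted in the thread (`Synchronizing SCSI cache` and `Attached SCSI disk`); the function name `device_events` and the short labels `sync` and `attach` are invented for the example.

```python
import re

# Kernel "sd" messages of the two forms quoted in the thread, e.g.
#   "Sep 10 17:09:20 RCNAS kernel: sd 3:0:1:0: [sdm] Synchronizing SCSI cache"
#   "Sep 10 17:09:27 RCNAS kernel: sd 3:0:5:0: [sdq] Attached SCSI disk"
EVENT = re.compile(
    r"(?P<when>\w{3} +\d+ \d\d:\d\d:\d\d) \S+ kernel: "
    r"sd (?P<address>\d+:\d+:\d+:\d+): \[(?P<device>sd[a-z]+)\] "
    r"(?P<what>Attached SCSI disk|Synchronizing SCSI cache)"
)


def device_events(log_text):
    """Return (timestamp, scsi_address, device, kind) tuples in log order.

    kind is "attach" for "Attached SCSI disk" lines and "sync" for
    "Synchronizing SCSI cache" lines (labels chosen for this sketch).
    """
    events = []
    for m in EVENT.finditer(log_text):
        kind = "attach" if m.group("what").startswith("Attached") else "sync"
        events.append((m.group("when"), m.group("address"), m.group("device"), kind))
    return events


# Four lines copied from the excerpts in this post.
sample = """\
Sep 10 17:09:20 RCNAS kernel: sd 3:0:1:0: [sdm] Synchronizing SCSI cache
Sep 10 17:09:27 RCNAS kernel: sd 3:0:5:0: [sdq] Attached SCSI disk
Sep 10 19:12:54 RCNAS kernel: sd 3:0:4:0: [sdp] Synchronizing SCSI cache
Sep 10 19:13:02 RCNAS kernel: sd 3:0:6:0: [sdr] Attached SCSI disk
"""

events = device_events(sample)
for when, address, device, kind in events:
    print(f"{when}  {kind:<6}  sd {address}  {device}")
```

An `attach` event long after boot, shortly after a `sync` on a different address, is the pattern described above. Matching the old and new device names to the same physical drive would still need the serial numbers, which these lines do not carry.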
marcusone Posted September 11, 2015

I have dual IBM M1015 (LSI) controllers flashed to IT mode and passed through to unRAID. I haven't hot-swapped or changed anything on this machine for a long time. It was only powered down in July because of a storm (I was even home, so it was shut down properly before the battery backup failed). It sounds like either an issue with VMware pass-through or the Linux driver.

I've rebooted the machine and started the rebuild process for the one disk. Fingers crossed that the full reboot (I shut the hardware down completely to swap the drive, to be safe) has cleared the driver/RAID pass-through issue.

Thank you again for an awesome analysis of my syslog!