advice - 2 drives with errors



I have errors showing on 2 drives... what is the best way to proceed? Replace 1, rebuild, then replace the other?

Or replace both (via clone tool or other, and then rebuild)?

 

Attached is a screenshot.

 

Thank you!

 

Edit: snippet from syslog (can post the entire thing if you really want?):

 

Sep 10 23:34:47 RCNAS kernel: md: disk9 read error, sector=1637249936
Sep 10 23:34:47 RCNAS kernel: md: disk7 read error, sector=1637249944
Sep 10 23:34:47 RCNAS kernel: md: disk9 read error, sector=1637249944
Sep 10 23:34:47 RCNAS kernel: md: disk7 read error, sector=1637249952
Sep 10 23:34:47 RCNAS kernel: md: disk9 read error, sector=1637249952
Sep 10 23:34:47 RCNAS kernel: md: disk7 read error, sector=1637249960
Sep 10 23:34:47 RCNAS kernel: md: disk9 read error, sector=1637249960
Sep 10 23:34:47 RCNAS kernel: md: disk7 read error, sector=1637249968
Sep 10 23:34:47 RCNAS kernel: md: disk9 read error, sector=1637249968
Sep 10 23:34:47 RCNAS kernel: md: disk7 read error, sector=1637249976
Sep 10 23:34:47 RCNAS kernel: md: disk9 read error, sector=1637249976
Sep 10 23:34:47 RCNAS kernel: md: disk7 read error, sector=1637249984
Sep 10 23:34:47 RCNAS kernel: md: disk9 read error, sector=1637249984
Sep 10 23:34:47 RCNAS kernel: md: disk7 read error, sector=1637249992
Sep 10 23:34:47 RCNAS kernel: md: disk9 read error, sector=1637249992
Sep 10 23:34:47 RCNAS kernel: md: disk7 read error, sector=1637250000
Sep 10 23:34:47 RCNAS kernel: md: disk9 read error, sector=1637250000

unraid_2_drive_errors_sept2015.PNG.b5c23330d17d5308731ef5e9b2e475dd.PNG


Here is a larger screenshot. I'm running v5.0.5.

 

Not sure what a full syslog will do, as it's 21kbytes in size and mostly filled with the error I posted, plus some private information (file names, etc.) that I'd rather not share. Is there some way to clean file names out quickly?

 

I would really appreciate some advice on the best way to replace the 2 failing drives. They are 4.5-year-old WD Green drives, so they are out of warranty and need replacing, and I'd like advice so I can decide on the best method.

 

Thanks!

unraid_2_drive_errors_sept2015V2.PNG.4ffa68d552830ac099e1f819b073d403.PNG


I think I got the personal info out of my log, but it won't let me post the zip file, which is still 524KB in size (not sure why, I keep getting a connection timeout) :(

Either way, it's full of those errors. Here are a few others that may be of interest:

 

Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS shfs/user: shfs_open: open: 
Sep 10 20:02:55 RCNAS shfs/user: shfs_readdir: readdir_r: 
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS shfs/user: shfs_readdir: readdir_r: 
Sep 10 20:02:55 RCNAS last message repeated 5 times
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS last message repeated 2 times
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS shfs/user: shfs_readdir: readdir_r: 
Sep 10 20:02:55 RCNAS last message repeated 5 times
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS shfs/user: shfs_readdir: readdir_r: 
Sep 10 20:02:55 RCNAS shfs/user: shfs_readdir: readdir_r: 
Sep 10 20:02:55 RCNAS shfs/user: shfs_open: open: 
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS shfs/user: shfs_readdir: readdir_r: 
Sep 10 20:02:55 RCNAS last message repeated 3 times
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS last message repeated 2 times
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS shfs/user: shfs_readdir: readdir_r: 
Sep 10 20:02:55 RCNAS last message repeated 7 times
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [3193 3194 0x0 SD]
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error
Sep 10 20:02:55 RCNAS kernel: REISERFS error (device md7): zam-7001 reiserfs_find_entry: io error


SMART reports on the two drives:

 

sdq (disk7)

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF)
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZA3407269
LU WWN Device Id: 5 0014ee 600c0d3c7
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Fri Sep 11 10:34:58 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (35760) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 345) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       2486
  3 Spin_Up_Time            0x0027   242   165   021    Pre-fail  Always       -       2900
  4 Start_Stop_Count        0x0032   093   093   000    Old_age   Always       -       7087
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   047   047   000    Old_age   Always       -       39316
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       88
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       48
193 Load_Cycle_Count        0x0032   176   176   000    Old_age   Always       -       73221
194 Temperature_Celsius     0x0022   120   111   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   172   000    Old_age   Offline      -       84

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     25084         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

disk9 (sdr in screen shot, but now sdm for some reason, perhaps because I stopped the array and assignments can sometimes change on my IBM controller?)

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF)
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZA3269017
LU WWN Device Id: 5 0014ee 6ab6b73eb
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Fri Sep 11 10:36:39 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (36600) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 353) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   164   021    Pre-fail  Always       -       1908
  4 Start_Stop_Count        0x0032   093   093   000    Old_age   Always       -       7844
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   047   047   000    Old_age   Always       -       39309
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       90
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       52
193 Load_Cycle_Count        0x0032   167   167   000    Old_age   Always       -       101679
194 Temperature_Celsius     0x0022   120   115   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Edit: snippet from syslog (can post the entire thing if you really want?):

We ALWAYS want!  You are welcome to clean out anything personal or private (so long as you keep the file a text file), but we HAVE to have the entire syslog to see the beginning setup and the very first errors that occurred.  The errors that occur later are almost never interesting, as they are often just consequences of the original problem.  It's the errors associated with the cause of the problem that we need to see.  If you prefer, you can chop off the last 3/4ths of the file, all the redundant errors.  Then zip the syslog and attach it (it compresses to almost a tenth), or post a public link to the zip.
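The cleanup described above could be sketched in Python roughly as follows. This is only a sketch under assumptions: it assumes private file names appear as paths under /mnt/ (not confirmed anywhere in this thread), and the function names, the keep fraction, and the output file names are purely illustrative.

```python
import re
import zipfile

# Assumption: private file names show up as absolute paths under /mnt/.
# Everything from "/mnt/" to the end of the line is masked, so file names
# containing spaces are fully hidden as well.
PATH_RE = re.compile(r"/mnt/.*")


def sanitize_lines(lines, keep_fraction=0.25):
    """Keep the start of the log (setup and first errors) and mask paths."""
    lines = list(lines)
    keep = max(1, int(len(lines) * keep_fraction))
    return [PATH_RE.sub("/mnt/<redacted>", line) for line in lines[:keep]]


def write_sanitized_zip(src="syslog", dst="syslog.zip"):
    """Write the sanitized log, still plain text, into a zip archive."""
    with open(src, errors="replace") as f:
        cleaned = sanitize_lines(f)
    with zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("syslog.txt", "".join(cleaned))
```

Review the result by eye before posting; a pattern like this will not catch private names that appear outside such paths.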

 

Disk 9 looks great, no issues.  Disk 7 has a few bad sectors (Current_Pending_Sector count is 2), needs to be tested and replaced.
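For anyone who wants to do the same check on their own reports, here is a minimal Python sketch that scans the attribute table of a pasted smartctl report and flags a few sector-related attributes with a non-zero RAW_VALUE. The choice of attributes (5, 197, 198) and the function name are assumptions made for illustration, not an official rule.

```python
# Attribute IDs treated as worth a closer look when RAW_VALUE is non-zero.
# This selection is an illustrative assumption.
WATCHED_IDS = {5, 197, 198}


def flag_attributes(report):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    flagged = {}
    for line in report.splitlines():
        parts = line.split()
        # Attribute rows have ten columns and start with a numeric ID.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        attr_id, name, raw = int(parts[0]), parts[1], parts[9]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged
```

Fed the Disk 7 table above, this would report only Current_Pending_Sector with a value of 2; for Disk 9 it would report nothing.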

 

I strongly recommend avoiding ANY writing to the array.  To rebuild Disk 7, you will probably have to trust parity and the current Disk 9.  Then you can rebuild onto a new 2TB drive newly assigned to Disk 7.  Once the current Disk 7 is out, you can Preclear it a couple of times, make sure it shows no more current pending sectors, and hopefully reuse it.


Syslog is fine for days, with Disk 7 at sdm (sd 3:0:1:0) and Disk 9 at sdp (sd 3:0:4:0).

 

Then it appears you hot-plugged a drive in at Sep 10 17:09:20, to slot sd 3:0:5:0, about 10 minutes after Disk 7 spun down.  It's assigned the drive symbol sdq.  It looks like this:

Sep 10 16:59:05 RCNAS kernel: mdcmd (2813): spindown 7
Sep 10 17:09:20 RCNAS kernel: sd 3:0:1:0: [sdm] Synchronizing SCSI cache
Sep 10 17:09:20 RCNAS kernel: sd 3:0:1:0: [sdm] 
Sep 10 17:09:20 RCNAS kernel: Result: hostbyte=0x01 driverbyte=0x00
Sep 10 17:09:20 RCNAS kernel: mpt2sas1: removing handle(0x000a), sas_addr(0x4433221101000000)
Sep 10 17:09:27 RCNAS kernel: scsi 3:0:5:0: Direct-Access    ATA      WDC WD20EARS-00M AB51 PQ: 0 ANSI: 6
Sep 10 17:09:27 RCNAS kernel: scsi 3:0:5:0: SATA: handle(0x000a), sas_addr(0x4433221101000000), phy(1), device_name(0x0000000000000000)
Sep 10 17:09:27 RCNAS kernel: scsi 3:0:5:0: SATA: enclosure_logical_id(0x500605b00372dfc0), slot(2)
Sep 10 17:09:27 RCNAS kernel: scsi 3:0:5:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Sep 10 17:09:27 RCNAS kernel: scsi 3:0:5:0: qdepth(32), tagged(1), simple(0), ordered(0), scsi_level(7), cmd_que(1)
Sep 10 17:09:27 RCNAS kernel: sd 3:0:5:0: Attached scsi generic sg12 type 0
Sep 10 17:09:27 RCNAS kernel: sd 3:0:5:0: [sdq] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Sep 10 17:09:27 RCNAS kernel: sd 3:0:5:0: [sdq] Write Protect is off
Sep 10 17:09:27 RCNAS kernel: sd 3:0:5:0: [sdq] Mode Sense: 7f 00 00 08
Sep 10 17:09:27 RCNAS kernel: sd 3:0:5:0: [sdq] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Sep 10 17:09:27 RCNAS kernel:  sdq: sdq1
Sep 10 17:09:27 RCNAS kernel: sd 3:0:5:0: [sdq] Attached SCSI disk

The messages with sdm look completely innocent, but after this the next mention of Disk 7 involves read errors, so it (sdm) wasn't responding.

 

Almost the same thing happens again about 2 hours later, this time involving Disk 9 and a second new drive that appears to be hotplugged in.  It's plugged into sd 3:0:6:0 and assigned sdr.

Sep 10 18:50:51 RCNAS kernel: mdcmd (2823): spindown 9
Sep 10 18:50:52 RCNAS kernel: mdcmd (2824): spindown 10
Sep 10 18:50:52 RCNAS kernel: mdcmd (2825): spindown 11
Sep 10 19:12:54 RCNAS kernel: sd 3:0:4:0: [sdp] Synchronizing SCSI cache
Sep 10 19:12:54 RCNAS kernel: sd 3:0:4:0: [sdp] 
Sep 10 19:12:54 RCNAS kernel: Result: hostbyte=0x01 driverbyte=0x00
Sep 10 19:12:54 RCNAS kernel: mpt2sas1: removing handle(0x000d), sas_addr(0x4433221102000000)
Sep 10 19:13:02 RCNAS kernel: scsi 3:0:6:0: Direct-Access    ATA      WDC WD20EARS-00M AB51 PQ: 0 ANSI: 6
Sep 10 19:13:02 RCNAS kernel: scsi 3:0:6:0: SATA: handle(0x000d), sas_addr(0x4433221102000000), phy(2), device_name(0x0000000000000000)
Sep 10 19:13:02 RCNAS kernel: scsi 3:0:6:0: SATA: enclosure_logical_id(0x500605b00372dfc0), slot(1)
Sep 10 19:13:02 RCNAS kernel: scsi 3:0:6:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Sep 10 19:13:02 RCNAS kernel: scsi 3:0:6:0: qdepth(32), tagged(1), simple(0), ordered(0), scsi_level(7), cmd_que(1)
Sep 10 19:13:02 RCNAS kernel: sd 3:0:6:0: Attached scsi generic sg15 type 0
Sep 10 19:13:02 RCNAS kernel: sd 3:0:6:0: [sdr] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Sep 10 19:13:02 RCNAS kernel: sd 3:0:6:0: [sdr] Write Protect is off
Sep 10 19:13:02 RCNAS kernel: sd 3:0:6:0: [sdr] Mode Sense: 7f 00 00 08
Sep 10 19:13:02 RCNAS kernel: sd 3:0:6:0: [sdr] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Sep 10 19:13:02 RCNAS kernel:  sdr: sdr1
Sep 10 19:13:02 RCNAS kernel: sd 3:0:6:0: [sdr] Attached SCSI disk
Sep 10 19:39:05 RCNAS kernel: md: disk9 read error, sector=2162691968
Sep 10 19:39:10 RCNAS shfs/user: shfs_readdir:
Sep 10 19:39:10 RCNAS shfs/user: shfs_readdir: readdir_r:
Sep 10 19:39:10 RCNAS kernel: md: disk7 read error, sector=2162691968
Sep 10 19:39:10 RCNAS kernel: REISERFS error (device md9): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [20369 188461 0x0 SD]
Sep 10 19:39:10 RCNAS kernel: REISERFS (device md9): Remounting filesystem read-only

26 minutes afterward, a read of Disk 9 is attempted and fails, and Reiser file system corruption is detected, so Disk 9 is remounted read-only, which is going to make even more I/O fail.

 

It takes a while, but both drives report a lot of read errors and Disk 9 reports a lot of file system errors; then Disk 7 is also found to be corrupted and is remounted read-only.
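Since the first errors matter far more than the flood that follows, a small Python sketch like the one below could pull out the first "read error" line per disk along with a total count. The line format is taken from the excerpts in this thread; the function name and return shape are illustrative assumptions.

```python
import re
from collections import Counter

# Matches lines such as:
# "Sep 10 19:39:05 RCNAS kernel: md: disk9 read error, sector=2162691968"
READ_ERR_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: md: (disk\d+) read error, sector=(\d+)"
)


def summarize_read_errors(lines):
    """Return ({disk: (timestamp, sector) of first error}, Counter of errors)."""
    first_seen = {}
    counts = Counter()
    for line in lines:
        m = READ_ERR_RE.match(line)
        if not m:
            continue
        stamp, disk, sector = m.group(1), m.group(2), int(m.group(3))
        counts[disk] += 1
        # setdefault keeps only the earliest entry for each disk.
        first_seen.setdefault(disk, (stamp, sector))
    return first_seen, counts
```

Running it over a full syslog would show at a glance when each disk first failed a read, which is the part of the log worth reading closely.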

 

My comments so far are strictly based on the syslog.  Now when I look at your second screen pic, I'm amazed!  Disk 7 shows as sdq(!) and Disk 9 as sdr!  So what *looked* like a hotplug event was either a serious bug in the driver, or the driver thought the drive was disconnected and then quickly reconnected, and set it up as a new drive, which would be fatal for unRAID.  The message "removing handle" must indicate that it was dropping the drive.
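One way to spot this pattern in a long syslog is to pair each "removing handle" line with a later attach line that carries the same sas_addr. The Python sketch below does that using the mpt2sas line formats quoted above; treating a matching sas_addr as "the same attachment point coming back" is an assumption on my part, and the function name is illustrative.

```python
import re

REMOVE_RE = re.compile(
    r"removing handle\((0x[0-9a-f]+)\), sas_addr\((0x[0-9a-f]+)\)"
)
ATTACH_RE = re.compile(
    r"scsi (\d+:\d+:\d+:\d+): SATA: handle\((0x[0-9a-f]+)\), sas_addr\((0x[0-9a-f]+)\)"
)


def find_reattached_ports(lines):
    """List attach events whose sas_addr was removed earlier in the log."""
    removed = set()  # sas_addr values seen in "removing handle" lines
    events = []
    for line in lines:
        m = REMOVE_RE.search(line)
        if m:
            removed.add(m.group(2))
            continue
        m = ATTACH_RE.search(line)
        if m and m.group(3) in removed:
            events.append(
                {"sas_addr": m.group(3), "handle": m.group(2), "new_scsi_id": m.group(1)}
            )
            removed.discard(m.group(3))
    return events
```

On the two excerpts above it would report sas_addr 0x4433221101000000 coming back as 3:0:5:0 and 0x4433221102000000 coming back as 3:0:6:0, which fits the idea of a drop and re-detect rather than a genuinely new drive.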

 

If you stop the array, unRAID does a new inventory and recognizes the drives by their serials, so it picked up their changed drive device symbols without even noticing they had changed!  You commented "sdr in screen shot, but now sdm for some reason".  I suspect that Disk 9 had another pseudo 'hotplug event' and was re-assigned the drive device symbol sdm, because sdm was now available and no longer in use (it used to be Disk 7).

 

I have no idea what happened.  The SAS error handler is singularly uninformative: it did not say anything about drives being disconnected and didn't explain anywhere what was wrong.  Perhaps the drives are loose, or vibrated loose?  What's really bizarre is that the drives are reported as having moved to completely different physical slots (sd 3:0:1:0 to sd 3:0:5:0, and sd 3:0:4:0 to sd 3:0:6:0).  Is there any chance that someone pulled the drives out and pushed them back in, into different trays?

 

What I *can* say is that it's not the fault of the drives.  They appear to be fine.  Disk 7 does have pending sectors, but they weren't involved in the problems above.

 

Something else I can't help you with:

Jul 21 22:10:59 RCNAS kernel: vmw_vmci 0000:00:07.7: Found VMCI PCI device at 0x11080, irq 16
Jul 21 22:10:59 RCNAS kernel: vmw_vmci 0000:00:07.7: Using capabilities 0xc
Jul 21 22:10:59 RCNAS kernel: vmw_vmci 0000:00:07.7: irq 74 for MSI/MSI-X
Jul 21 22:10:59 RCNAS kernel: vmw_vmci 0000:00:07.7: irq 75 for MSI/MSI-X
Jul 21 22:10:59 RCNAS kernel: Guest personality initialized and is active
Jul 21 22:10:59 RCNAS kernel: VMCI host device registered (name=vmci, major=10, minor=59)
Jul 21 22:10:59 RCNAS kernel: Initialized host personality
Jul 21 22:10:59 RCNAS vmsvc[1494]: [ warning] [GLib-GObject] invalid (NULL) pointer instance
Jul 21 22:10:59 RCNAS vmsvc[1494]: [critical] [GLib-GObject] g_signal_emit_by_name: assertion `G_TYPE_CHECK_INSTANCE (instance)' failed

It's VMware related, which I have no experience with, but I pay attention to anything that reports a 'warning' and then a 'critical'!  It seems to involve the VMCI device, again something I don't know anything about.  I suspect something here is broken.


I have dual IBM M1015 (LSI) controllers, flashed to IT mode and passed through to unRAID.

 

I haven't hot-swapped or changed anything on this machine for a long time. It was only powered down in July due to a storm (I was even home, so it was shut down properly before the battery backup failed).

 

Sounds like either an issue with VMware pass-through or the Linux driver.  I've rebooted the machine and started the rebuild process for the one disk.  The hardware was completely shut down for the drive swap, to be safe, so fingers crossed that the full reboot has cleared the driver/raid pass-through issue.

 

Thank you again for an awesome analysis of my syslog!
