Hi all,
Been running a 16-drive (15 data + parity) array for almost a year without any issues.
Just had my first problem and I'm hoping for some guidance so I don't lose any data (or at least minimize the loss).
I'm running version 5.0.4 on a Supermicro X9SCM-F-O on ESXi.
There are two M1015s both flashed to P15.
I was watching a movie when streaming froze, and I checked dmesg on unRAID to find:
sd 2:0:2:0: [sdl] CDB:
cdb[0]=0x88: 88 00 00 00 00 01 3c 46 29 c0 00 00 04 00 00 00
scsi target2:0:2: handle(0x000b), sas_address(0x4433221102000000), phy(2)
scsi target2:0:2: enclosure_logical_id(0x500605b0022a1530), slot(1)
sd 2:0:2:0: task abort: SUCCESS scmd(db84c0c0)
sd 2:0:2:0: attempting task abort! scmd(db84c0c0)
sd 2:0:2:0: [sdl] CDB:
cdb[0]=0x88: 88 00 00 00 00 01 3c 46 29 c0 00 00 04 00 00 00
scsi target2:0:2: handle(0x000b), sas_address(0x4433221102000000), phy(2)
scsi target2:0:2: enclosure_logical_id(0x500605b0022a1530), slot(1)
sd 2:0:2:0: task abort: SUCCESS scmd(db84c0c0)
sd 2:0:2:0: attempting task abort! scmd(db84c0c0)
sd 2:0:2:0: [sdl] CDB:
cdb[0]=0x88: 88 00 00 00 00 01 3c 46 29 c0 00 00 04 00 00 00
scsi target2:0:2: handle(0x000b), sas_address(0x4433221102000000), phy(2)
scsi target2:0:2: enclosure_logical_id(0x500605b0022a1530), slot(1)
sd 2:0:2:0: task abort: SUCCESS scmd(db84c0c0)
sd 2:0:2:0: attempting task abort! scmd(db84c0c0)
sd 2:0:2:0: [sdl] CDB:
cdb[0]=0x88: 88 00 00 00 00 01 3c 46 29 c0 00 00 04 00 00 00
scsi target2:0:2: handle(0x000b), sas_address(0x4433221102000000), phy(2)
scsi target2:0:2: enclosure_logical_id(0x500605b0022a1530), slot(1)
sd 2:0:2:0: task abort: SUCCESS scmd(db84c0c0)
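For anyone curious, the repeated CDB above decodes as a standard SCSI READ(16) (opcode 0x88). This is just a sketch of the arithmetic, assuming the usual READ(16) layout (bytes 2-9 are the 64-bit LBA, bytes 10-13 the transfer length in blocks); it touches no disks:

```shell
cdb="88 00 00 00 00 01 3c 46 29 c0 00 00 04 00 00 00"
set -- $cdb
# CDB bytes 2-9: logical block address (big-endian)
lba=$((16#$3$4$5$6$7$8$9${10}))
# CDB bytes 10-13: transfer length in blocks
blocks=$((16#${11}${12}${13}${14}))
echo "LBA $lba, $blocks blocks (~$((lba * 512 / 1000000000)) GB into the disk at 512B/sector)"
```

So the drive is repeatedly failing the same read ~2.7TB into the 3TB disk, which is consistent with one bad region rather than random errors.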
I cleanly shut down and rebooted, and a parity check kicked off. It ground along happily at ~90MB/sec until it hit ~77%, then those same errors started appearing on the same drive. The check then slowed to XXKB/sec, and both the simplefeatures UI on port 80 and the stock UI on port 8080 became _very_ unresponsive.
I had run a successful parity check 100 days prior with no issues and have also checked all cabling thoroughly.
The drives normally run ~30C. They are now up at 38-40C, because it is 35C outside here and they are grinding through parity.
I stopped the check, rebooted, and ran the parity check again; same result. I'll let it keep plugging along at 50KB/sec and hope it eventually speeds up, but the errors are continuing.
I have attached a syslog.
I have attached an older SMART output for the sdl drive. I also tried to run smartctl to capture current SMART info, but it hung and can't be killed with Ctrl+C. Running ps suggests the system had already tried, unsuccessfully, to query SMART:
root@Tower:~# ps ax | grep smart
2193 ? S 0:00 sh -c smartctl -d ata -A /dev/sdl| grep -i temperature
2194 ? D 0:00 smartctl -d ata -A /dev/sdl
8328 pts/1 D+ 0:00 smartctl -a -A /dev/sdl
8848 ? D 0:00 /usr/sbin/smartctl -n standby -A /dev/sdl
8849 ? Z 0:00 [smartctl] <defunct>
8850 ? Z 0:00 [smartctl] <defunct>
8851 ? Z 0:00 [smartctl] <defunct>
9390 pts/2 S+ 0:00 grep smart
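If it helps anyone reading: the `D` in the STAT column means uninterruptible sleep, i.e. the process is blocked inside the kernel waiting on I/O to the drive, which is why Ctrl+C (and even `kill -9`) has no effect; it will only clear when the I/O completes or the device is reset. A generic one-liner to spot such stuck processes (nothing unRAID-specific, just `ps` plus `awk`):

```shell
# List PID and command of every process whose state starts with "D"
# (uninterruptible sleep, blocked on I/O, immune to signals).
ps -eo pid,stat,comm | awk 'NR > 1 && $2 ~ /^D/ {print $1, $3}'
```

The `Z`/`<defunct>` entries are just zombies whose parent hasn't reaped them yet; they're harmless and hold no resources.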
At this point I assume the drive is done, and I'd just like guidance on the best way to proceed. I have an identical, brand-new (still in shrink wrap) 3TB WD Green ready to go.
A couple of other drives have thrown ~100-200 errors during the failed parity checks, so I'm worried I might have multiple drives dying at the same time.
Any suggestions on the best way to proceed to minimize data loss would be greatly appreciated.
Thanks so much in advance and happy holidays to all.
syslog_and_smart.zip