simse Posted April 26, 2019 Hello everyone, I have a SAS controller card flashed to IT mode. I'm not 100% sure which card it is, but it was recommended on the Unraid wiki, and I believe it's an IBM ServeRAID. It's been working great for about two years, but a week ago the two drives connected to it started throwing read errors. The errors are few, and the array seems to be running just fine. Could anyone suggest what might be wrong, or things I should check? I've already ordered a new card and a new SAS-to-SATA splitter cable. Could it be a software, power, firmware, or temperature issue? I hope the fantastic community here at Unraid can help me out. Cheers!
JorgeB Posted April 27, 2019 Please post the diagnostics after some errors occur; there might be some clues there: Tools -> Diagnostics
simse Posted April 27, 2019 Thank you for getting back to me. Here are the diagnostics. The drives with read errors are Disk 1 (sdc) and Disk 8 (sdb). proton-diagnostics-20190427-1128.zip
JorgeB Posted April 27, 2019 Both cases look like an actual disk problem; run an extended SMART test on both drives.
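For reference, the extended test suggested here can be started with smartctl; /dev/sdb and /dev/sdc are the device names from this thread, so substitute your own. Since the real commands need a physical disk, the runnable part below is only a minimal sketch of checking the self-test log, using a sample output line rather than live output:

```shell
# Start an extended (long) offline self-test on each suspect disk
# (commented out here because these need real devices):
#   smartctl -t long /dev/sdb
#   smartctl -t long /dev/sdc
# The test runs in the background; hours later, read the results with:
#   smartctl -l selftest /dev/sdb
# A passing run shows a self-test log line like this sample:
selftest_log='# 1  Extended offline    Completed without error       00%     17458         -'

# Sketch of checking the result line:
if printf '%s\n' "$selftest_log" | grep -q 'Completed without error'; then
  echo "self-test PASSED"
else
  echo "self-test FAILED or still running"
fi
```

Note that a passed self-test only means no unreadable sectors were found on that pass; intermittent media errors can still recur later.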
simse Posted April 27, 2019 Thank you for getting back to me. Both SMART tests passed, and both disks also passed a 24-hour badblocks run on initial install. I'm still 99% certain the problem is the SAS card or cable and NOT the drives.
JorgeB Posted April 28, 2019 The problem was the disks; you can see the UNC @ LBA errors in the SMART reports. They might be intermittent errors, though, so if the long tests passed, keep an eye on them.
simse Posted April 28, 2019 I HIGHLY doubt it's the hard drives failing. One is older than the other, and they both report the same type of UNC error at the EXACT same time. That would point to an error with the SAS controller or something else.
JorgeB Posted April 28, 2019 1 hour ago, simse said: both report the same type of UNC error at the EXACT same time
I don't know where you're seeing this; the errors were at different times and on very different sectors, so it's very unlikely to be a controller problem. Also, UNC @ LBA errors are media errors, not communication errors.
Apr 23 01:02:23 Proton kernel: sd 9:0:0:0: [sdb] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Apr 23 01:02:23 Proton kernel: sd 9:0:0:0: [sdb] tag#0 Sense Key : 0x3 [current] [descriptor]
Apr 23 01:02:23 Proton kernel: sd 9:0:0:0: [sdb] tag#0 ASC=0x11 ASCQ=0x0
Apr 23 01:02:23 Proton kernel: sd 9:0:0:0: [sdb] tag#0 CDB: opcode=0x88 88 00 00 00 00 01 80 2c 79 68 00 00 00 18 00 00
Apr 23 01:02:23 Proton kernel: print_req_error: critical medium error, dev sdb, sector 6445365608
Apr 23 01:02:23 Proton kernel: md: disk8 read error, sector=6445365544
...
Apr 23 02:03:44 Proton kernel: sd 9:0:1:0: [sdc] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Apr 23 02:03:44 Proton kernel: sd 9:0:1:0: [sdc] tag#1 Sense Key : 0x3 [current]
Apr 23 02:03:44 Proton kernel: sd 9:0:1:0: [sdc] tag#1 ASC=0x11 ASCQ=0x0
Apr 23 02:03:44 Proton kernel: sd 9:0:1:0: [sdc] tag#1 CDB: opcode=0x88 88 00 00 00 00 00 02 31 4e 90 00 00 04 00 00 00
Apr 23 02:03:44 Proton kernel: print_req_error: critical medium error, dev sdc, sector 36785808
Apr 23 02:03:44 Proton kernel: md: disk1 read error, sector=36785744
When there's a problem with an LSI controller, it's usually logged like this:
Sep 21 09:12:44 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!
And after this there can also be simultaneous errors on multiple disks at the same sector.
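The distinction drawn here can be checked mechanically: a quick awk pass over the syslog pulls out the timestamp, device, and sector of each medium error, making it easy to see that the two failures hit different disks, roughly an hour apart, at unrelated sectors. A minimal sketch using the two "critical medium error" lines quoted above:

```shell
# The two medium-error lines copied from the syslog excerpt above.
log='Apr 23 01:02:23 Proton kernel: print_req_error: critical medium error, dev sdb, sector 6445365608
Apr 23 02:03:44 Proton kernel: print_req_error: critical medium error, dev sdc, sector 36785808'

# Print timestamp, device, and sector for each medium error; different
# devices, times, and sectors point at the media, not the controller.
printf '%s\n' "$log" | awk '/critical medium error/ {gsub(",", "", $11); print $3, $11, $NF}'
# prints:
#   01:02:23 sdb 6445365608
#   02:03:44 sdc 36785808
```

A shared-controller failure would instead tend to show both disks erroring in the same burst, often at the same sector, alongside controller-level messages like the mpt2sas line above.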
simse Posted April 28, 2019 Brilliant point. But then I'm curious: I've had 8 drives connected to the main SATA controller for 5 years, and not a single one has failed. IF these two WD Red drives connected to the LSI controller are indeed failing, that would mean a 100% failure rate on that controller (I've had 3 Seagate drives connected to it completely fail as well). Am I the unluckiest server owner, or is there something else? That's all I'm trying to find out.
JorgeB Posted April 28, 2019 As mentioned, those errors can be intermittent. If the disks passed the SMART tests they are OK for now; just keep an eye on them.
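"Keeping an eye on the disks" usually means watching the pending and reallocated sector counts in the attribute table from smartctl -A. A minimal sketch of flagging non-zero counts, run here against sample attribute rows (the values below are illustrative, not taken from the diagnostics in this thread):

```shell
# Sample rows in smartctl -A attribute-table format (illustrative values
# only; on a real system you would pipe `smartctl -A /dev/sdb` instead).
attrs='197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0'

# Warn if any pending or reallocated sector count is above zero; the awk
# exit status becomes non-zero when a warning fires.
printf '%s\n' "$attrs" | awk '
  $2 ~ /Pending|Reallocated/ && $NF + 0 > 0 {print "WARNING:", $2, "=", $NF; bad = 1}
  END {exit bad}'
echo "status=$?"
# prints: status=0
```

Rising counts on either attribute after these read errors would support the failing-media diagnosis over the controller theory.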