Georg Posted December 2, 2019 Hi all, I have some weird disk errors that occur when, for instance, I start copying files with Windows Explorer onto a network share, or otherwise use the system. It is not clear to me what triggers them. They usually happen on multiple disks at exactly the same time and do not seem to affect the system too badly, but they usually result in a copy error. After a restart the errors are cleared, but they reoccur. I am also not sure whether this is connected to reaching a high-water mark. In those cases I get the following Windows copy error: Do you know how to avoid this kind of error? Each time a high-water level is reached I have to stop and restart the array at least once before I can start copying again. Shouldn't it normally switch automatically to another disk? A third thing that might play a role is that I sometimes see this and am then unable to view the log file: These errors have happened several times now, so I have multiple diagnostic files: atlas-diagnostics-20191128-2305.zip atlas-diagnostics-20191125-2006.zip Thanks in advance! Regards, Georg
JorgeB Posted December 3, 2019 It's affecting multiple disks on different controllers:
Nov 27 23:42:02 Atlas kernel: sd 13:0:2:0: [sdj] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Nov 27 23:42:02 Atlas kernel: sd 13:0:2:0: [sdj] tag#2 Sense Key : 0x5 [current]
Nov 27 23:42:02 Atlas kernel: sd 13:0:2:0: [sdj] tag#2 ASC=0x20 ASCQ=0x0
Nov 27 23:42:02 Atlas kernel: sd 13:0:2:0: [sdj] tag#2 CDB: opcode=0x88 88 00 00 00 00 00 00 00 00 40 00 00 04 00 00 00
Nov 27 23:42:02 Atlas kernel: print_req_error: critical target error, dev sdj, sector 64
Nov 27 23:42:02 Atlas kernel: sd 13:0:3:0: [sdk] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Nov 27 23:42:02 Atlas kernel: sd 13:0:3:0: [sdk] tag#0 Sense Key : 0x5 [current]
Nov 27 23:42:02 Atlas kernel: sd 13:0:3:0: [sdk] tag#0 ASC=0x20 ASCQ=0x0
Nov 27 23:42:02 Atlas kernel: sd 13:0:3:0: [sdk] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 00 00 00 40 00 00 04 00 00 00
Nov 27 23:42:02 Atlas kernel: print_req_error: critical target error, dev sdk, sector 64
Nov 27 23:42:02 Atlas kernel: sd 12:0:2:0: [sde] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Nov 27 23:42:02 Atlas kernel: sd 12:0:2:0: [sde] tag#0 Sense Key : 0x5 [current]
Nov 27 23:42:02 Atlas kernel: sd 12:0:2:0: [sde] tag#0 ASC=0x20 ASCQ=0x0
Nov 27 23:42:02 Atlas kernel: sd 12:0:2:0: [sde] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 00 00 00 40 00 00 04 00 00 00
Nov 27 23:42:02 Atlas kernel: print_req_error: critical target error, dev sde, sector 64
Nov 27 23:42:02 Atlas kernel: sd 12:0:1:0: [sdd] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Nov 27 23:42:02 Atlas kernel: sd 12:0:1:0: [sdd] tag#1 Sense Key : 0x5 [current]
Nov 27 23:42:02 Atlas kernel: sd 12:0:1:0: [sdd] tag#1 ASC=0x20 ASCQ=0x0
Nov 27 23:42:02 Atlas kernel: sd 12:0:1:0: [sdd] tag#1 CDB: opcode=0x88 88 00 00 00 00 00 00 00 00 40 00 00 04 00 00 00
This would suggest, for example, a power issue. Do you have another PSU you can use to test?
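For readers who want to see what the raw bytes in those log lines say, here is a minimal Python sketch that decodes the 16-byte CDB shown above. The function name `decode_cdb16` and the small lookup tables are illustrative and not part of any Unraid or kernel tool. The tables hold only the standard SCSI values that appear in this log (opcode 0x88 is READ(16), sense key 0x5 is ILLEGAL REQUEST, ASC/ASCQ 0x20/0x00 is INVALID COMMAND OPERATION CODE).

```python
import struct

# Standard SCSI values, limited to the ones that appear in the log above.
OPCODES = {0x88: "READ(16)", 0x8A: "WRITE(16)"}
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {(0x20, 0x00): "INVALID COMMAND OPERATION CODE"}


def decode_cdb16(cdb_hex: str) -> dict:
    """Decode a 16-byte READ(16)/WRITE(16) style CDB given as hex bytes."""
    raw = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(raw) != 16:
        raise ValueError(f"expected 16 bytes, got {len(raw)}")
    # Big-endian layout: opcode, flags, 8-byte LBA, 4-byte transfer length,
    # group number, control byte.
    opcode, _flags, lba, blocks, _group, _control = struct.unpack(">BBQIBB", raw)
    return {
        "opcode": OPCODES.get(opcode, hex(opcode)),
        "lba": lba,
        "blocks": blocks,
    }


cdb = decode_cdb16("88 00 00 00 00 00 00 00 00 40 00 00 04 00 00 00")
print(cdb)
print(SENSE_KEYS[0x5], "/", ASC_ASCQ[(0x20, 0x00)])
```

The decoded LBA matches the "sector 64" in the `print_req_error` lines. In other words, each of the four disks is logged as rejecting an ordinary read at the same instant.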
JorgeB Posted December 3, 2019 It's also a good idea to upgrade the firmware on both LSI controllers, since it is very old.
Georg Posted December 3, 2019 First of all, thank you for your answer!
13 hours ago, johnnie.black said: This would suggest for example a power issue, do you have another PSU you can use to test?
Well, it's not easy, but I could use the one from my PC. I think I will first try to update the SAS 9300 16i, as that should be done anyway. Do you happen to have any tips for that? How likely is it that I kill the card, given that I have no experience with this at all? ^^ (I would try to follow this guide: https://www.broadcom.com/support/knowledgebase/1211161501344/flashing-firmware-and-bios-on-lsi-sas-hbas) I will probably try it on the weekend. When switching the PSU, would you suggest a specific test, or should I just use the server and see what happens? The one installed is a Corsair HX850i and the one currently in my PC is a Corsair RM750x.
JorgeB Posted December 4, 2019 An LSI firmware update is very easy, like doing a BIOS update. After switching the PSU, just use the server normally and look at the log for errors similar to the ones above. You can also run a non-correcting parity check; it will probably make the errors appear sooner.
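Checking the log for similar errors can be partly automated. The following is a rough Python sketch, assuming syslog lines in the same format as the excerpt earlier in this thread. The helper name `group_sense_errors` is made up for illustration. It groups the disks that logged a sense key by SCSI host number (the first field of `sd H:C:T:L`), which makes it easier to see whether errors span controllers or stay on one.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   ... kernel: sd 13:0:2:0: [sdj] tag#2 Sense Key : 0x5 [current]
SENSE_LINE = re.compile(
    r"kernel: sd (\d+):(\d+):(\d+):(\d+): \[(sd[a-z]+)\] .*Sense Key : (0x[0-9a-fA-F]+)"
)


def group_sense_errors(lines):
    """Return {scsi_host: sorted list of (target, device)} for sense-key lines."""
    hits = defaultdict(set)
    for line in lines:
        match = SENSE_LINE.search(line)
        if match:
            host, _channel, target, _lun, dev, _key = match.groups()
            hits[int(host)].add((int(target), dev))
    return {host: sorted(devs) for host, devs in hits.items()}


sample = [
    "Nov 27 23:42:02 Atlas kernel: sd 13:0:2:0: [sdj] tag#2 Sense Key : 0x5 [current]",
    "Nov 27 23:42:02 Atlas kernel: sd 13:0:3:0: [sdk] tag#0 Sense Key : 0x5 [current]",
    "Nov 27 23:42:02 Atlas kernel: sd 12:0:2:0: [sde] tag#0 Sense Key : 0x5 [current]",
    "Nov 27 23:42:02 Atlas kernel: sd 12:0:1:0: [sdd] tag#1 Sense Key : 0x5 [current]",
    "Nov 27 23:42:02 Atlas kernel: print_req_error: critical target error, dev sdj, sector 64",
]
print(group_sense_errors(sample))
```

To run it against a real log, pass the lines of the syslog file from the diagnostics zip (the exact file path depends on the system) instead of `sample`.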
Georg Posted March 28, 2020
On 12/4/2019 at 8:13 AM, johnnie.black said: LSI firmware update is very easy, like doing a bios update.
Sorry for the long delay... I successfully updated the HBA card and at first thought this was the solution, but it wasn't. I also tried another PSU, switching drives around, and changing cables. By doing so, I think I could narrow the error down to one SFF-8643 port of the SAS 9300 16i. On that port it is always the first bay/HDD of the four that has the error. I also got the impression that it occurs more often when writing to that disk than when reading from it, but I might be wrong. Do you think there is a way to troubleshoot this error further?
JorgeB Posted March 29, 2020 If it's always the same port (and just that one), I would just avoid using it.