December 2, 2019

Hi all,

I have some weird disk errors that occur when, for instance, I start copying files with Windows Explorer onto a network share, or otherwise use the system. It is not obvious to me what triggers them. They usually happen on multiple disks at exactly the same time. They do not seem to affect the system too badly, but they usually result in a copy error. After a restart the errors are cleared, of course, but they recur.

I am also not sure whether this is connected to reaching a high-water level mark. In those cases I get the following Windows copy error: (screenshot not preserved)

Do you know how to avoid this kind of error? Each time a high-water level is reached, I have to stop and restart the array at least once before I can start copying again. Shouldn't it normally switch to another disk automatically?

A third thing that might play a role: I sometimes see this and am subsequently unable to view the log file: (screenshot not preserved)

These errors have happened several times now, so I have multiple diagnostics files:

atlas-diagnostics-20191128-2305.zip
atlas-diagnostics-20191125-2006.zip

Thanks in advance!

Regards,
Georg
December 3, 2019 Community Expert

It's affecting multiple disks on different controllers:

    Nov 27 23:42:02 Atlas kernel: sd 13:0:2:0: [sdj] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
    Nov 27 23:42:02 Atlas kernel: sd 13:0:2:0: [sdj] tag#2 Sense Key : 0x5 [current]
    Nov 27 23:42:02 Atlas kernel: sd 13:0:2:0: [sdj] tag#2 ASC=0x20 ASCQ=0x0
    Nov 27 23:42:02 Atlas kernel: sd 13:0:2:0: [sdj] tag#2 CDB: opcode=0x88 88 00 00 00 00 00 00 00 00 40 00 00 04 00 00 00
    Nov 27 23:42:02 Atlas kernel: print_req_error: critical target error, dev sdj, sector 64
    Nov 27 23:42:02 Atlas kernel: sd 13:0:3:0: [sdk] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
    Nov 27 23:42:02 Atlas kernel: sd 13:0:3:0: [sdk] tag#0 Sense Key : 0x5 [current]
    Nov 27 23:42:02 Atlas kernel: sd 13:0:3:0: [sdk] tag#0 ASC=0x20 ASCQ=0x0
    Nov 27 23:42:02 Atlas kernel: sd 13:0:3:0: [sdk] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 00 00 00 40 00 00 04 00 00 00
    Nov 27 23:42:02 Atlas kernel: print_req_error: critical target error, dev sdk, sector 64
    Nov 27 23:42:02 Atlas kernel: sd 12:0:2:0: [sde] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
    Nov 27 23:42:02 Atlas kernel: sd 12:0:2:0: [sde] tag#0 Sense Key : 0x5 [current]
    Nov 27 23:42:02 Atlas kernel: sd 12:0:2:0: [sde] tag#0 ASC=0x20 ASCQ=0x0
    Nov 27 23:42:02 Atlas kernel: sd 12:0:2:0: [sde] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 00 00 00 40 00 00 04 00 00 00
    Nov 27 23:42:02 Atlas kernel: print_req_error: critical target error, dev sde, sector 64
    Nov 27 23:42:02 Atlas kernel: sd 12:0:1:0: [sdd] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
    Nov 27 23:42:02 Atlas kernel: sd 12:0:1:0: [sdd] tag#1 Sense Key : 0x5 [current]
    Nov 27 23:42:02 Atlas kernel: sd 12:0:1:0: [sdd] tag#1 ASC=0x20 ASCQ=0x0
    Nov 27 23:42:02 Atlas kernel: sd 12:0:1:0: [sdd] tag#1 CDB: opcode=0x88 88 00 00 00 00 00 00 00 00 40 00 00 04 00 00 00

This would suggest for example a power issue, do you have another PSU you can use to test?
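For anyone trying to read the kernel lines quoted above, the following is a minimal Python sketch that decodes the sense data and the 16-byte CDB from one of them. The function names (`decode_cdb16`, `describe_sense`) and the lookup tables are made up for this illustration, and the tables only cover the values that actually appear in the log. The byte layout follows the standard READ(16) format (opcode in byte 0, big-endian LBA in bytes 2 to 9, transfer length in bytes 10 to 13).

```python
# Illustrative lookup tables (hypothetical names); only values seen in the
# quoted log are listed, with their standard SCSI meanings.
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {(0x20, 0x00): "INVALID COMMAND OPERATION CODE"}
OPCODES = {0x88: "READ(16)"}


def describe_sense(key, asc, ascq):
    """Turn the numeric sense key and ASC/ASCQ pair into readable text."""
    key_text = SENSE_KEYS.get(key, f"sense key {key:#x}")
    asc_text = ASC_ASCQ.get((asc, ascq), f"ASC {asc:#x} / ASCQ {ascq:#x}")
    return f"{key_text} / {asc_text}"


def decode_cdb16(cdb_hex):
    """Decode a 16-byte CDB given as space-separated hex, as the kernel prints it."""
    raw = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(raw) != 16:
        raise ValueError("expected a 16-byte CDB")
    return {
        "opcode": OPCODES.get(raw[0], f"{raw[0]:#x}"),
        "lba": int.from_bytes(raw[2:10], "big"),      # starting logical block
        "blocks": int.from_bytes(raw[10:14], "big"),  # number of blocks requested
    }


# CDB and sense values copied from the [sdj] lines in the log above.
cdb = decode_cdb16("88 00 00 00 00 00 00 00 00 40 00 00 04 00 00 00")
sense = describe_sense(0x5, 0x20, 0x0)
print(cdb)
print(sense)
```

The decoded LBA of 64 matches the `sector 64` that the kernel reports in the `print_req_error` lines. Every failing request in the excerpt is the same read at the same offset, rejected by several drives within the same second, which fits the observation that the problem is not specific to one disk.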
December 3, 2019 Community Expert

It would also be a good idea to upgrade the firmware on both LSI controllers, since they are very old.
December 3, 2019 Author

First of all, thank you for your answer!

13 hours ago, johnnie.black said:
    This would suggest for example a power issue, do you have another PSU you can use to test?

Well, not easily, but I could use the one from my PC.

I think I will first try to update the SAS 9300-16i, as that should be done anyway. Do you happen to have any tips for that? How likely am I to kill the card, given that I have no experience with this at all? ^^ (I would try to follow this guide: https://www.broadcom.com/support/knowledgebase/1211161501344/flashing-firmware-and-bios-on-lsi-sas-hbas) I will probably try it on the weekend.

When switching the PSU, would you suggest a specific test, or should I just use the server and see what happens? The one installed is a Corsair HX850i and the one currently in my PC is a Corsair RM750x.
December 4, 2019 Community Expert

LSI firmware update is very easy, like doing a BIOS update.

After switching the PSU, just use the server normally and look at the log for errors similar to the ones above. You can also run a non-correcting parity check, which will probably make the errors appear sooner.
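Checking the log for similar errors can be partly automated. The sketch below is one possible way to do it in Python, assuming the syslog uses the same line format as the excerpt quoted earlier in the thread. The names `RESULT_RE` and `summarize` are invented for this example, and the sample lines are copied from that excerpt. It counts failed commands per SCSI host (the first number in `sd 13:0:2:0`) and per device, so it is easy to see whether errors hit one disk, one controller, or several at once.

```python
import re
from collections import Counter

# Matches the "Result:" line that the kernel logs once per failed command,
# e.g. "kernel: sd 13:0:2:0: [sdj] tag#2 UNKNOWN(0x2003) Result: hostbyte=..."
RESULT_RE = re.compile(
    r"kernel: sd (?P<host>\d+):(?P<channel>\d+):(?P<target>\d+):(?P<lun>\d+): "
    r"\[(?P<dev>sd[a-z]+)\] tag#\d+ .*Result: hostbyte="
)


def summarize(lines):
    """Count failed commands per SCSI host and per (device, address) pair."""
    per_host = Counter()
    per_dev = Counter()
    for line in lines:
        m = RESULT_RE.search(line)
        if m is None:
            continue  # not a failed-command header line
        address = f"{m['host']}:{m['channel']}:{m['target']}:{m['lun']}"
        per_host[int(m["host"])] += 1
        per_dev[(m["dev"], address)] += 1
    return per_host, per_dev


# Sample lines taken from the log excerpt earlier in this thread.
SAMPLE = [
    "Nov 27 23:42:02 Atlas kernel: sd 13:0:2:0: [sdj] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08",
    "Nov 27 23:42:02 Atlas kernel: sd 13:0:2:0: [sdj] tag#2 Sense Key : 0x5 [current]",
    "Nov 27 23:42:02 Atlas kernel: sd 13:0:3:0: [sdk] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08",
    "Nov 27 23:42:02 Atlas kernel: sd 12:0:2:0: [sde] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08",
    "Nov 27 23:42:02 Atlas kernel: sd 12:0:1:0: [sdd] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08",
]

per_host, per_dev = summarize(SAMPLE)
for host, count in sorted(per_host.items()):
    print(f"host {host}: {count} failed command(s)")
for (dev, address), count in sorted(per_dev.items()):
    print(f"{dev} at {address}: {count}")
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `summarize(open(path))` with `path` pointing at the syslog from a diagnostics archive. The exact file location depends on the system and is not assumed here.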
March 28, 2020 Author

On 12/4/2019 at 8:13 AM, johnnie.black said:
    LSI firmware update is very easy, like doing a BIOS update.

Sorry for the long silence... I have successfully updated the HBA card. At first I thought that was the solution, but it wasn't. I also tried another PSU, switching drives around, and changing cables. By doing so I think I could narrow the error down to one SFF-8643 port of the SAS 9300-16i. On that port it is always the first bay/HDD of the four that has the error. I also got the impression that it occurs more often when writing to that disk than when reading from it, but I might be wrong.

Do you think there is a way to troubleshoot this error further?
March 29, 2020 Community Expert

If it's always the same port (and just that one) I would just avoid using it.