mgranger Posted December 3, 2019

Last night I moved all my hard drives over to a new server I built. I finally upgraded to a chassis with 16 bays that connects to the backplane with 2 SAS cables. I bought the chassis used off eBay, but the cables were new. I've been reading that most of these errors are fixed with new cables, but these are brand new cables, so I hope it's not that — and if it were, I'd expect more disks than just disk 5 to show errors. I'll post my diagnostics; any advice would be much appreciated. I have already rebuilt disk 5 using the same drive in the same slot in the chassis. I am hoping it is not a bad slot. I noticed the drive showed up in Unassigned Devices while it was spun down in the array, and shortly after that I got the read error.

finalizer-diagnostics-20191203-1855.zip
JorgeB Posted December 3, 2019

There are communication issues with multiple disks:

Dec 3 11:52:37 Finalizer kernel: sd 9:0:1:0: attempting task abort! scmd(00000000375afb15)
Dec 3 11:52:37 Finalizer kernel: sd 9:0:1:0: [sdd] tag#6528 CDB: opcode=0x85 85 06 20 00 d8 00 00 00 00 00 4f 00 c2 00 b0 00
Dec 3 11:52:37 Finalizer kernel: scsi target9:0:1: handle(0x000b), sas_address(0x5003048001d979ad), phy(13)
Dec 3 11:52:37 Finalizer kernel: scsi target9:0:1: enclosure logical id(0x5003048001d979bf), slot(1)
Dec 3 11:52:37 Finalizer kernel: sd 9:0:1:0: device_block, handle(0x000b)
Dec 3 11:52:38 Finalizer kernel: sd 9:0:1:0: task abort: SUCCESS scmd(00000000375afb15)
Dec 3 11:52:39 Finalizer kernel: sd 9:0:1:0: device_unblock and setting to running, handle(0x000b)
Dec 3 11:52:46 Finalizer kernel: sd 9:0:7:0: attempting task abort! scmd(00000000ff7cd007)
Dec 3 11:52:46 Finalizer kernel: sd 9:0:7:0: [sdj] tag#6568 CDB: opcode=0x85 85 06 20 00 d8 00 00 00 00 00 4f 00 c2 00 b0 00
Dec 3 11:52:46 Finalizer kernel: scsi target9:0:7: handle(0x0011), sas_address(0x5003048001d979b3), phy(19)
Dec 3 11:52:46 Finalizer kernel: scsi target9:0:7: enclosure logical id(0x5003048001d979bf), slot(7)
Dec 3 11:52:46 Finalizer kernel: sd 9:0:7:0: device_block, handle(0x0011)
Dec 3 11:52:48 Finalizer kernel: sd 9:0:7:0: task abort: SUCCESS scmd(00000000ff7cd007)
Dec 3 11:52:48 Finalizer kernel: sd 9:0:7:0: device_unblock and setting to running, handle(0x0011)
Dec 3 11:59:29 Finalizer kernel: mdcmd (50): spindown 5
Dec 3 12:00:08 Finalizer kernel: sd 9:0:13:0: attempting task abort! scmd(00000000454c3926)
Dec 3 12:00:08 Finalizer kernel: sd 9:0:13:0: [sdo] tag#6783 CDB: opcode=0x85 85 06 20 00 d8 00 00 00 00 00 4f 00 c2 00 b0 00
Dec 3 12:00:08 Finalizer kernel: scsi target9:0:13: handle(0x000f), sas_address(0x5003048001d979b1), phy(17)
Dec 3 12:00:08 Finalizer kernel: scsi target9:0:13: enclosure logical id(0x5003048001d979bf), slot(5)
Dec 3 12:00:08 Finalizer kernel: sd 9:0:13:0: device_block, handle(0x000f)
Dec 3 12:00:10 Finalizer kernel: sd 9:0:13:0: device_unblock and setting to running, handle(0x000f)
Dec 3 12:00:11 Finalizer kernel: sd 9:0:13:0: [sdo] Synchronizing SCSI cache
Dec 3 12:00:11 Finalizer kernel: sd 9:0:13:0: [sdo] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00

So it's unlikely to be a bad slot. It could be the cables, a power issue, the backplane expander, or the HBA; unfortunately it's not easy to say which one is the culprit without swapping things around to rule them out.
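As an aside, a log like the one above can be summarized with standard tools to see how many devices and enclosure slots are involved. This is a hedged sketch: the sample lines below are copied from the log in this thread; on a live Unraid box you would point the same greps at /var/log/syslog instead.

```shell
# Write a few representative lines from the thread's log to a temp file.
log=$(mktemp)
cat > "$log" <<'EOF'
Dec 3 11:52:37 Finalizer kernel: sd 9:0:1:0: attempting task abort! scmd(00000000375afb15)
Dec 3 11:52:37 Finalizer kernel: scsi target9:0:1: enclosure logical id(0x5003048001d979bf), slot(1)
Dec 3 11:52:46 Finalizer kernel: sd 9:0:7:0: attempting task abort! scmd(00000000ff7cd007)
Dec 3 11:52:46 Finalizer kernel: scsi target9:0:7: enclosure logical id(0x5003048001d979bf), slot(7)
Dec 3 12:00:08 Finalizer kernel: sd 9:0:13:0: attempting task abort! scmd(00000000454c3926)
Dec 3 12:00:08 Finalizer kernel: scsi target9:0:13: enclosure logical id(0x5003048001d979bf), slot(5)
EOF

# Count task aborts per SCSI device address.
grep -o 'sd [0-9:]*: attempting task abort' "$log" | sort | uniq -c

# List the distinct enclosure slots that reported problems.
grep -o 'slot([0-9]*)' "$log" | sort -u

rm -f "$log"
```

Multiple distinct slots showing up is what points away from a single bad bay and toward something shared (cable, expander, HBA, power).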
mgranger Posted December 3, 2019 (Author)

2 minutes ago, johnnie.black said:
"There are communication issues with multiple disks: […] So unlikely to be a bad slot, most likely it can be the cables, power issue, backplane expander or the HBA […]"

Any advice on the best way to troubleshoot this? I don't really have a spare of anything to swap in for testing. Wouldn't other disks also get read errors if it were a cable or HBA issue?
mgranger Posted December 4, 2019 (Author)

Well, maybe this answers something. I was doing a rebuild on Disk 5 (new disk, same slot) and I got more errors on the parity drive and on Disks 4 and 3. This makes me think it has more to do with a cable, the backplane, or the HBA. I have 2 power supplies, so I hope it's not those, but I guess it's possible. I am going to let it rebuild Disk 5 and then probably shut the server down until I can get new cables, unless there is no risk of corrupting the data? Before I tried rebuilding the drive, I checked all the connections to make sure they were seated tightly.
JorgeB Posted December 4, 2019

5 hours ago, mgranger said:
"I got more errors on the parity and Disk 4 and Disk 3."

UDMA CRC errors are usually a bad cable; much less likely, they can also be the controller or backplane.

5 hours ago, mgranger said:
"unless there is no risk of corrupting the data?"

Not with UDMA CRC errors, but if more disks have read errors it can; just post the diags after the rebuild is done if you're not sure.
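For reference, the UDMA CRC count mentioned here is SMART attribute 199, which you can read per disk with `smartctl -A /dev/sdX`. A hedged sketch of pulling the count out of smartctl's attribute table; the sample line below is made up for illustration (the raw value in the last column is cumulative and only ever increases, so what matters is whether it keeps climbing):

```shell
# Sample line in smartctl's usual attribute-table layout (illustrative values).
sample='199 UDMA_CRC_Error_Count   0x003e   200   200   000    Old_age   Always       -       7'

# The raw value is the last field; attribute ID is the first.
echo "$sample" | awk '$1 == 199 { print "CRC errors:", $NF }'
# → CRC errors: 7
```

Because the raw count never resets, note the value before swapping a cable and check whether it increases afterward, rather than reacting to a nonzero total.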
mgranger Posted December 4, 2019 (Author)

OK, the rebuild just finished. Here are the diags. I think it's OK but just want to check. I'm guessing I will get read errors shortly, but I'll just run the server until that happens. My new cables are supposed to arrive on Friday and I will replace the existing ones then, and troubleshoot from there.

finalizer-diagnostics-20191204-1736.zip
JorgeB Posted December 4, 2019

9 minutes ago, mgranger said:
"I think it is ok but just want to check."

It was successful, no read errors on other disks during the rebuild.
mgranger Posted December 4, 2019 (Author)

So the drive I took out of the Disk 5 slot was a 4TB HGST drive. I have it in Unassigned Devices now, but it refuses to mount. Is there a reason this is happening? I figured I would be able to mount it using UD. I can just reformat it now that the other disk has been rebuilt, but I'm trying to understand why this would happen.
JorgeB Posted December 4, 2019

If it's XFS, it won't mount because of a duplicate UUID: the rebuilt disk carries an identical copy of the filesystem, UUID included. You can see why it's not mounting in the syslog, or post diags. If it is the UUID issue, you can change it for the unassigned device with:

xfs_admin -U generate /dev/sdX1

replacing X with the correct letter.
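A hedged sketch of how the duplicate-UUID condition shows up. On a live system you would run `blkid` to list filesystem UUIDs and then the `xfs_admin -U generate` command above; here we just demonstrate spotting a duplicate in blkid-style output (device names and UUIDs below are made up for illustration):

```shell
# Illustrative blkid-style output: the array device and the pulled drive
# report the same filesystem UUID after a rebuild.
blkid_out='/dev/md5: UUID="3af2c1de-0000-0000-0000-000000000001" TYPE="xfs"
/dev/sdo1: UUID="3af2c1de-0000-0000-0000-000000000001" TYPE="xfs"
/dev/sdd1: UUID="9bc4e7aa-0000-0000-0000-000000000002" TYPE="xfs"'

# Print any UUID that appears on more than one device.
echo "$blkid_out" | grep -o 'UUID="[^"]*"' | sort | uniq -d
```

If `uniq -d` prints anything, two devices share a UUID, and the kernel will refuse to mount the second one until it is regenerated.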
mgranger Posted December 4, 2019 (Author)

I typed this in and this is the warning that came up (screenshot attached). Should I delete the log?
JorgeB Posted December 5, 2019

See if the disk mounts before clearing the log: stop the array, then try to mount the disk with UD. If it still doesn't mount, the duplicate UUID isn't the problem (or not the only problem), so run xfs_repair first.
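The order of operations being suggested can be summarized as below. This is a sketch only: /dev/sdX1 is a placeholder, and the commands are printed rather than executed, since xfs_repair must only be run against a real, unmounted partition. The xfs_repair `-n` flag is a read-only check, and `-L` (zero the dirty log, a last resort that can lose recent metadata) is what the warning about the log refers to.

```shell
# Placeholder device; substitute the real partition from UD / blkid.
dev=/dev/sdX1

# Print the plan instead of running it against real hardware.
echo "1. xfs_repair -n $dev        # read-only dry run: report problems, change nothing"
echo "2. xfs_repair $dev           # actual repair; partition must be unmounted"
echo "3. xfs_repair -L $dev        # last resort only: zero the dirty log if step 2 asks"
echo "4. xfs_admin -U generate $dev  # fresh UUID if it still collides with the array disk"
echo "5. retry the mount in Unassigned Devices"
```

Running the dry run first means you can post its output here before anything destructive happens.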