May 17, 2019 I recently upgraded from 6.3.3 (which had been running stable for over a year) to 6.5.3. One drive was unmountable after the upgrade, so I ran xfs_repair. That didn't resolve the issue, so I rebuilt onto a new drive. About a day later I now have 4 unmountable drives and a parity drive offline. On top of that, 6 of my 14 drives now show SMART errors. I have 12 WD Red/White label 8TB drives and 2 Seagate IronWolf Pro 8TB drives (parity). It seems highly unlikely to me that half my drives are actually dying all at once, and I have already tried swapping the cables and the HBA, so I don't think it's hardware related. I also did not back up my 6.3.3 USB drive before updating, which I know I should have. I don't have an option to downgrade back to 6.3.3 either (it just offers a downgrade to 6.5.3, which I'm already on). I have included the 3 log files and SMART data from diagnostics. Any help you could offer would be much appreciated. For now all my data is inaccessible, and I don't know what to do about this. Edited May 17, 2019 by TBSCamCity
May 17, 2019 If you only recently upgraded, what is the rationale for going with the 6.5.3 release, which is nearly a year old? The current stable release is 6.7. BTW: if you just provide the diagnostics zip file, it will give all the information you provided as a single file that is much easier to work with. Edited May 17, 2019 by itimpi
May 17, 2019 Author I upgraded to 6.5.3 because that was marked as the stable release in the GUI. Here is the full zip: main-diagnostics-20190516-2239.zip
May 17, 2019 The problem is that very few people will be running 6.5.3, so support could be difficult. Ideally you should upgrade to 6.6.7 (the previous stable release) or 6.7 (the current, recently released stable release), as whatever problem you are encountering on 6.5.3 may well have been fixed in one of the releases made since then.
May 17, 2019 Disk6 is failing. Other than that, there are a few CRC errors on other disks, which usually indicate a cable problem. SMART for disk2 is missing, and these errors on two disks could also suggest a cable/connection issue:

May 16 22:39:48 Main kernel: sd 1:0:2:0: attempting task abort! scmd(ffff882fffaea548)
May 16 22:39:48 Main kernel: sd 1:0:2:0: [sdd] tag#2 CDB: opcode=0x8a 8a 00 00 00 00 00 00 0e f8 40 00 00 04 00 00 00
May 16 22:39:48 Main kernel: scsi target1:0:2: handle(0x000c), sas_address(0x50030480005b0f48), phy(8)
May 16 22:39:48 Main kernel: scsi target1:0:2: enclosure_logical_id(0x50030480005b0f7f), slot(4)
May 16 22:39:51 Main kernel: sd 1:0:2:0: task abort: SUCCESS scmd(ffff882fffaea548)
May 16 22:39:51 Main kernel: sd 1:0:7:0: attempting task abort! scmd(ffff882fffaea948)
May 16 22:39:51 Main kernel: sd 1:0:7:0: [sdi] tag#0 CDB: opcode=0x8a 8a 00 00 00 00 00 00 0e fc 40 00 00 04 00 00 00
May 16 22:39:51 Main kernel: scsi target1:0:7: handle(0x0011), sas_address(0x50030480005b0f4f), phy(15)
May 16 22:39:51 Main kernel: scsi target1:0:7: enclosure_logical_id(0x50030480005b0f7f), slot(11)
May 16 22:39:51 Main kernel: sd 1:0:7:0: device_block, handle(0x0011)
May 16 22:39:51 Main kernel: sd 1:0:7:0: task abort: SUCCESS scmd(ffff882fffaea948)
May 16 22:39:51 Main kernel: sd 1:0:7:0: device_unblock and setting to running, handle(0x0011)

There are a few disks needing a filesystem check, ideally done after resolving these connection issues.
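
If the syslog is long, it can help to tally which devices the task aborts land on before swapping hardware. The following is only a rough sketch in Python: the function name `summarize_aborts` and the two regular expressions are my own illustration, written against the line format in the excerpt above, and other kernel versions or drivers may log these events differently. The sample lines are copied from that excerpt.

```python
import re
from collections import Counter

# Assumed pattern: a kernel line reporting the start of a task abort,
# e.g. "Main kernel: sd 1:0:2:0: attempting task abort! scmd(...)"
ABORT_RE = re.compile(r"kernel: sd (\d+:\d+:\d+:\d+): attempting task abort!")
# Assumed pattern: a kernel line that pairs a SCSI address with a device name,
# e.g. "Main kernel: sd 1:0:2:0: [sdd] tag#2 CDB: ..."
DEVICE_RE = re.compile(r"kernel: sd (\d+:\d+:\d+:\d+): \[(sd[a-z]+)\]")


def summarize_aborts(lines):
    """Return {scsi_address: (device_name, abort_attempts)} for the given lines."""
    aborts = Counter()
    names = {}
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            aborts[match.group(1)] += 1
        match = DEVICE_RE.search(line)
        if match:
            names[match.group(1)] = match.group(2)
    # Addresses with no "[sdX]" line seen are reported as "unknown".
    return {addr: (names.get(addr, "unknown"), count) for addr, count in aborts.items()}


# Lines taken verbatim from the syslog excerpt in this post.
SAMPLE = [
    "May 16 22:39:48 Main kernel: sd 1:0:2:0: attempting task abort! scmd(ffff882fffaea548)",
    "May 16 22:39:48 Main kernel: sd 1:0:2:0: [sdd] tag#2 CDB: opcode=0x8a 8a 00 00 00 00 00 00 0e f8 40 00 00 04 00 00 00",
    "May 16 22:39:51 Main kernel: sd 1:0:2:0: task abort: SUCCESS scmd(ffff882fffaea548)",
    "May 16 22:39:51 Main kernel: sd 1:0:7:0: attempting task abort! scmd(ffff882fffaea948)",
    "May 16 22:39:51 Main kernel: sd 1:0:7:0: [sdi] tag#0 CDB: opcode=0x8a 8a 00 00 00 00 00 00 0e fc 40 00 00 04 00 00 00",
    "May 16 22:39:51 Main kernel: sd 1:0:7:0: task abort: SUCCESS scmd(ffff882fffaea948)",
]

if __name__ == "__main__":
    for addr, (name, count) in sorted(summarize_aborts(SAMPLE).items()):
        print(f"{addr} ({name}): {count} abort attempt(s)")
```

To run it against a real log, pass the lines of the syslog file from the diagnostics zip instead of `SAMPLE`. Aborts spread across many devices on the same controller or enclosure would point more toward something shared (HBA, backplane, PSU) than toward individual disks.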
May 17, 2019 Author Thanks, my first thought was a connection issue of some sort. I have tried multiple cables and HBAs with no success. I just upgraded to 6.7 and now every disk is unmountable and shows SMART errors. Perhaps my backplane is the issue (it's a Supermicro 846 chassis).
May 17, 2019 13 minutes ago, TBSCamCity said: "Perhaps my backplane is the issue (it's a Supermicro 846 chassis)" Could be; it could also be the HBA, PSU, etc.
May 17, 2019 Author 9 minutes ago, johnnie.black said: "Could be, could also be the HBA, PSU, etc," Yeah, I have spare HBAs, PSUs, and cables from other Supermicro servers, so I have ruled all of those out. I don't have a spare backplane for the 4U chassis to test with, though. I also thought it was very odd timing for one of these to die right when I upgraded to 6.5.3, but I guess it could just be coincidence! I will try to find the culprit this weekend and check back if I can rule out the hardware.