adminmat Posted February 4, 2020 (edited)
[UPDATE] All drives now show up in unRAID as Unassigned Devices. Three 8TB WD white drives. They are showing as "missing" under Array Devices and cannot be added back. Please let me know if you have a solution.
[ORIGINAL POST] Last night I received an email alert that my array disks had errors. I have 3 HDDs in my array. One had many errors (8202) and one had 4. Parity is showing 16,412. I quickly realized that the power was off to my DAS enclosure. My DAS enclosure is powered by an ATX power supply. I'm not sure why or how it turned off, but that's for another discussion. I have not yet shut down / rebooted my unRAID server. I have two cache drives that are still running.
I have since triggered the ATX power supply to come back on. (I have a ground jumper wire permanently attached... I just pulled it out and plugged it back in and the DAS powered up again. I'll need a better solution, obviously. It has been running great for a few months with no downtime.) So the DAS is on, but the disks show no activity and they cannot be spun up by clicking "SPIN UP" in the unRAID GUI.
Should I now reboot the system? What happens next? Will the array repair itself? Why would the disks show errors if the PSU had shut them down suddenly? Is there a chance that the power to them could have been flickering or low? I only have SATA power coming from the ATX PSU to the back of some iStar boxes. The disks and iStar JBOD enclosure should have no control over the PSU.
Here is the diagnostics file. Note that I captured these diagnostics after powering the DAS back on and trying to click "SPIN UP" a few times.
tower-diagnostics-20200204-0359.zip
Edited February 6, 2020 by adminmat
JorgeB Posted February 4, 2020
52 minutes ago, adminmat said: Should I now reboot the system?
Yes.
52 minutes ago, adminmat said: What happens next?
All should be back to normal. Luckily, Unraid failed to write back since multiple disks got disconnected at the same time, so no disk was disabled.
adminmat Posted February 4, 2020 Author
13 minutes ago, johnnie.black said: "Unraid failed to write back since multiple disks got disconnected at the same time, so no disk was disabled."
Thanks JohnnieB. Can you elaborate on this point? Did you get that info from the log file? What do you mean by write back?
adminmat Posted February 5, 2020 Author
So I rebooted but my disks are not showing up. Any ideas what's going on? Is this a hardware issue? I have an external LSI SAS card that's been controlling the disks in a DAS. How do I check the status of the LSI card?
tower-diagnostics-20200204-2022.zip
JorgeB Posted February 5, 2020
11 hours ago, adminmat said: Can you elaborate on this point? Did you get that info from the log file? What do you mean by write back?
Every time there's a read error, Unraid first tries to recalculate that sector using parity and the other drives and then write it back. In this case it couldn't, so it didn't disable the disk. It's in the syslog.
6 hours ago, adminmat said: Is this a hardware issue?
Looks like it. There are problems initializing all disks, so check all connections/power:
Feb 4 20:12:23 Tower kernel: sd 5:0:0:0: rejecting I/O to offline device
Feb 4 20:12:23 Tower kernel: ldm_validate_partition_table(): Disk read failed.
Feb 4 20:12:23 Tower kernel: sd 5:0:0:0: rejecting I/O to offline device
Feb 4 20:12:23 Tower kernel: sdb: unable to read partition table
Feb 4 20:12:23 Tower kernel: sd 5:0:0:0: [sdb] Attached SCSI disk
Feb 4 20:12:23 Tower kernel: mpt2sas_cm0: log_info(0x31110630): originator(PL), code(0x11), sub_code(0x0630)
Feb 4 20:12:23 Tower kernel: sd 5:0:1:0: Power-on or device reset occurred
Feb 4 20:12:23 Tower kernel: sd 5:0:2:0: Power-on or device reset occurred
Feb 4 20:12:23 Tower kernel: mpt2sas_cm0: log_info(0x31110630): originator(PL), code(0x11), sub_code(0x0630)
### [PREVIOUS LINE REPEATED 1 TIMES] ###
Feb 4 20:12:23 Tower kernel: sd 5:0:1:0: [sdc] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Feb 4 20:12:23 Tower kernel: sd 5:0:1:0: [sdc] tag#1 CDB: opcode=0x88 88 00 00 00 00 00 00 00 00 00 00 00 00 08 00 00
Feb 4 20:12:23 Tower kernel: print_req_error: I/O error, dev sdc, sector 0
Feb 4 20:12:23 Tower kernel: Buffer I/O error on dev sdc, logical block 0, async page read
Feb 4 20:12:23 Tower kernel: sd 5:0:1:0: rejecting I/O to offline device
Feb 4 20:12:23 Tower kernel: ldm_validate_partition_table(): Disk read failed.
Feb 4 20:12:23 Tower kernel: sd 5:0:1:0: rejecting I/O to offline device
Feb 4 20:12:23 Tower kernel: sdc: unable to read partition table
Feb 4 20:12:23 Tower kernel: sd 5:0:1:0: [sdc] Attached SCSI disk
Feb 4 20:12:23 Tower kernel: sd 5:0:2:0: Power-on or device reset occurred
Feb 4 20:12:23 Tower kernel: mpt2sas_cm0: log_info(0x31110630): originator(PL), code(0x11), sub_code(0x0630)
Feb 4 20:12:23 Tower kernel: sd 5:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Feb 4 20:12:23 Tower kernel: sd 5:0:2:0: [sdd] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 00 00 00 00 00 00 00 08 00 00
Feb 4 20:12:23 Tower kernel: print_req_error: I/O error, dev sdd, sector 0
Feb 4 20:12:23 Tower kernel: Buffer I/O error on dev sdd, logical block 0, async page read
Feb 4 20:12:23 Tower kernel: sd 5:0:2:0: rejecting I/O to offline device
Feb 4 20:12:23 Tower kernel: ldm_validate_partition_table(): Disk read failed.
Feb 4 20:12:23 Tower kernel: sd 5:0:2:0: rejecting I/O to offline device
Feb 4 20:12:23 Tower kernel: sdd: unable to read partition table
Feb 4 20:12:23 Tower kernel: sd 5:0:2:0: [sdd] Attached SCSI disk
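Editor's sketch: the read-error recovery described in the post above can be illustrated with a simplified single-parity (XOR) model. This is not Unraid's actual code; the function names and the use of None for an unreadable sector are assumptions made for illustration. It shows why nothing could be recalculated, and so nothing written back, once more than one disk dropped out at the same time.

```python
from functools import reduce
from typing import Optional, Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct_sector(
    parity: Optional[bytes],
    data: Sequence[Optional[bytes]],
    failed: int,
) -> Optional[bytes]:
    """Rebuild the sector of data disk `failed` from parity and the other disks.

    None marks a sector that could not be read (hypothetical convention).
    Returns None when reconstruction is impossible, i.e. when parity or any
    other data disk is unreadable as well.
    """
    others = [blk for i, blk in enumerate(data) if i != failed]
    if parity is None or any(blk is None for blk in others):
        return None  # nothing to recalculate from, so nothing to write back
    return xor_blocks([parity, *others])


if __name__ == "__main__":
    disk1 = b"\x01\x02"
    disk2 = b"\x10\x20"
    parity = xor_blocks([disk1, disk2])

    # One disk unreadable: the sector can be recalculated.
    print(reconstruct_sector(parity, [None, disk2], failed=0) == disk1)
    # Both data disks unreadable (enclosure lost power): no reconstruction.
    print(reconstruct_sector(parity, [None, None], failed=0))
```

In this model a single unreadable disk is recoverable, while the case from this thread (every disk in the enclosure offline at once) returns None.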
adminmat Posted February 5, 2020 Author
So I rebooted the server. I even moved the disks to another JBOD enclosure and they are not showing up in unRAID. I have an LSI SAS adapter in my server that runs to the JBOD enclosure. The disks spin up when powered, but I'm not getting any signal. Is there a way I'd be able to see if the LSI controller or the disks are connected? I couldn't find an option in my BIOS or IPMI UI. I have an X10 Supermicro board. Can I look at the SAS adapter in the unRAID terminal?
JorgeB Posted February 6, 2020
7 hours ago, adminmat said: Is there a way I'd be able to see if the LSI controller or the disks are connected?
The LSI card has a BIOS flashed. If that's not appearing during boot, you likely have "Option ROM" or similar disabled in the board BIOS for that slot.
adminmat Posted February 6, 2020 Author (edited)
So in the unRAID terminal I can issue an "lsscsi" command and my 3 disks show up. They are still connected through an LSI SAS controller. But the unRAID GUI still says they are missing. Since the unRAID OS can see the LSI SAS card and the disks, I'm assuming at this point that this is an unRAID OS issue. Is there a command that will re-connect my drives to the unRAID OS?
Edited February 6, 2020 by adminmat
JorgeB Posted February 6, 2020
The disks are connected. The problem is that they are not being initialized correctly, and then Unraid has issues identifying them:
Feb 4 20:12:51 Tower emhttpd: device /dev/sdd problem getting id
Feb 4 20:12:51 Tower emhttpd: device /dev/sdb problem getting id
Feb 4 20:12:51 Tower emhttpd: device /dev/sdc problem getting id
What kind of enclosure are you using? Does it have an expander or controller, or is it SAS direct connect?
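Editor's sketch: the syslog lines quoted in this thread can be grouped per device with a few regular expressions, which makes it easier to see that all three disks fail in the same way. The patterns below are assumptions written against the exact lines quoted here; other kernel or Unraid versions may word these messages differently.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

# Patterns written against the log lines quoted in this thread (assumption:
# other kernel/Unraid versions may phrase these messages differently).
PATTERNS = {
    "problem getting id": re.compile(
        r"emhttpd: device /dev/(sd[a-z]+) problem getting id"
    ),
    "unable to read partition table": re.compile(
        r"kernel: (sd[a-z]+): unable to read partition table"
    ),
    "I/O error": re.compile(r"I/O error, dev (sd[a-z]+),"),
}


def summarize(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each device name (e.g. 'sdb') to the set of problems logged for it."""
    problems: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        for label, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                problems[match.group(1)].add(label)
    return dict(problems)


SAMPLE = [
    "Feb 4 20:12:23 Tower kernel: sdb: unable to read partition table",
    "Feb 4 20:12:23 Tower kernel: print_req_error: I/O error, dev sdc, sector 0",
    "Feb 4 20:12:51 Tower emhttpd: device /dev/sdd problem getting id",
    "Feb 4 20:12:51 Tower emhttpd: device /dev/sdb problem getting id",
]

if __name__ == "__main__":
    for device, found in sorted(summarize(SAMPLE).items()):
        print(device, sorted(found))
```

To run it against a real log, pass the lines of the syslog file from the diagnostics zip to summarize() instead of SAMPLE.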
JorgeB Posted February 6, 2020
The easiest way to troubleshoot this would be to connect one of the disks directly to the HBA on the server. If it works, it's likely an enclosure problem. If it still doesn't, connect it to the onboard SATA controller. If it works there, the HBA/cables are likely the problem.
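Editor's sketch: the isolation steps in the post above amount to a small decision procedure. The function below only encodes the branches stated in the post. The function name, the return strings, and the final "undetermined" branch (a disk that fails in both tests, which the post does not cover) are assumptions.

```python
from typing import Optional


def isolate_fault(
    works_on_hba_direct: bool,
    works_on_onboard_sata: Optional[bool] = None,
) -> str:
    """Suggest the likely culprit from the two tests described in the post.

    works_on_hba_direct:   disk is detected when wired straight to the HBA.
    works_on_onboard_sata: disk is detected on a motherboard SATA port;
                           None means that test has not been run yet.
    """
    if works_on_hba_direct:
        return "enclosure"
    if works_on_onboard_sata is None:
        return "next: test on onboard SATA"
    if works_on_onboard_sata:
        return "HBA or cables"
    # Not covered by the post: the disk fails everywhere it is connected.
    return "undetermined"


if __name__ == "__main__":
    print(isolate_fault(True))
    print(isolate_fault(False))
    print(isolate_fault(False, True))
```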
adminmat Posted February 6, 2020 Author (edited)
54 minutes ago, johnnie.black said: The disks are connected. The problem is that they are not being initialized correctly, and then Unraid has issues identifying them:
Feb 4 20:12:51 Tower emhttpd: device /dev/sdd problem getting id
Feb 4 20:12:51 Tower emhttpd: device /dev/sdb problem getting id
Feb 4 20:12:51 Tower emhttpd: device /dev/sdc problem getting id
What kind of enclosure are you using? Does it have an expander or controller, or is it SAS direct connect?
SAS direct connect via an 8087 adapter and breakout cables:
LSI SAS HBA in my server MB > SFF-8088 cable > SFF-8088 to 8087 adapter > SFF-8087 to SATA breakout cable > back of iStar hot swap enclosure
This is powered by a Corsair SFF PSU via ATX > SATA power cable > back of iStar enclosure
This has been working well for a while. I'm not using an expander.
Here is the server and enclosure cabinet side by side:
Here is the 8088 cable going from the HBA in the server to the cabinet (you can also see the PSU):
I mounted this SFF-8088 to 8087 adapter in the cabinet:
Here is the inside of the cabinet showing the 8087 breakout cables:
The SATA cables plug into the back of the iStar boxes:
I tried moving the disks to another, new iStar enclosure, powered it up, connected the SAS cable, and got the same result.
Edited February 6, 2020 by adminmat
adminmat Posted February 6, 2020 Author
49 minutes ago, johnnie.black said: The easiest way to troubleshoot this would be to connect one of the disks directly to the HBA on the server. If it works, it's likely an enclosure problem. If it still doesn't, connect it to the onboard SATA controller. If it works there, the HBA/cables are likely the problem.
I have tried a 2nd, new iStar enclosure with the same results. Could the Corsair power supply be undervolting, causing an issue?
I can connect directly to the HBA via the external 8088 cable to the 8088 to 8087 adapter, then directly to the drive, using the PSU in the server. My HBA has no 8087 ports on it.
And I can just connect the disk to the motherboard's SATA port, also using the server's PSU.
civic95man Posted February 6, 2020
You might be exceeding the maximum SATA length. You should probably use an expander on the enclosure end.
adminmat Posted February 6, 2020 Author
5 minutes ago, civic95man said: Might be exceeding the maximum Sata length. Probably should use an expander on the enclosure end.
I was concerned about this at first. But the SAS cable maximum length is 10 meters and SATA is 1 meter. My SAS cable is 1 meter and the SATA cable is 0.5 meter.
I still think this is an unRAID OS issue. I'll put another OS on the server tonight and see if I can read/write to some other disks in the enclosure.
This setup has been working perfectly for ~4 months with no issues, and just now it becomes an issue? I've loaded 12 TB onto these disks with plenty of reads and had not one error. I doubt you go from zero errors to "Missing" disks.
JorgeB Posted February 6, 2020
Might be exceeding the maximum Sata length.
Yep, max SATA cable length is 1m total, though if it was working before, it's kind of strange to now have issues with all disks. It could still be a cable gone bad.
But the SAS cable maximum length is 10 meters and SATA is 1 meter.
Yes, but only between two SAS devices, e.g. between an HBA and a SAS expander, then 1 meter to SATA disks.
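Editor's sketch: the cable-length rule stated in the post above can be written as a short check. The 1 m and 10 m limits are the figures quoted in this thread, and the function and parameter names are assumptions for illustration. With the setup described earlier (1 m SFF-8088 cable plus 0.5 m breakout cable, no expander), the whole 1.5 m run counts against the 1 m SATA limit.

```python
SATA_MAX_M = 1.0   # SATA limit quoted in the thread (total run to the disk)
SAS_MAX_M = 10.0   # SAS limit quoted in the thread (between two SAS devices)


def link_within_limits(
    sas_cable_m: float, sata_cable_m: float, has_expander: bool
) -> bool:
    """Check a HBA-to-disk cable run against the limits quoted in the thread.

    With a SAS expander at the enclosure end, the HBA-to-expander cable is a
    SAS-to-SAS link and only the expander-to-disk cable counts as SATA.
    Without one, the SATA disk is driven over the whole run.
    """
    if has_expander:
        return sas_cable_m <= SAS_MAX_M and sata_cable_m <= SATA_MAX_M
    return sas_cable_m + sata_cable_m <= SATA_MAX_M


if __name__ == "__main__":
    # Setup described in this thread: 1 m SFF-8088 + 0.5 m breakout cable.
    print(link_within_limits(1.0, 0.5, has_expander=False))  # False
    print(link_within_limits(1.0, 0.5, has_expander=True))   # True
```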
adminmat Posted February 6, 2020 Author
2 minutes ago, johnnie.black said: Yep, max SATA cable length is 1m total, though if it was working before kind of strange now having issue with all disks, it could still be a cable gone bad.
I just replaced the SFF-8088 cable with a spare and got the same result. It's not the 8087s, because I've tried 2 enclosures with 2 sets of breakout cables.
4 minutes ago, johnnie.black said: Yes, but only between two SAS devices, e.g. between an HBA and a SAS expander, then 1 meter to SATA disks.
So it's either PSU voltage issues, the OS, or the cable resistance having increased for some reason.
JorgeB Posted February 6, 2020
It can't be the OS.
30 minutes ago, adminmat said: I can connect directly to the HBA via external 8088 cable to the 8088 to 8087 adapter then directly to the drive. Using the PSU in the server. My HBA has no 8087 ports on it. And I can just connect the disk to the motherboard's SATA port. Also using the server's PSU.
Those would be the next things to try.
adminmat Posted February 6, 2020 Author (edited)
"It can't be the OS."
I don't see how either. The NVMe cache drives are showing up as available in the GUI.
So connecting one of the 8TB HDDs directly to the Supermicro motherboard SATA port, taping off the 3.3V pin on the WD white drive, and powering it with the server's PSU still shows "missing".
Does it matter that these disks are encrypted?
In terminal I'm getting:
So the same as before. What's next? I could try to install Ubuntu on bare metal and see if I can decrypt the disks and read/write. Could all the HDDs have been killed by a bad PSU?
Edit: Added Diagnostics
tower-diagnostics-20200206-1349.zip
Edited February 6, 2020 by adminmat
JorgeB Posted February 6, 2020
There's a hardware problem somewhere; even SMART can't be read correctly:
ATA_READ_LOG_EXT (addr=0x03:0x00, page=0, n=1) failed: scsi error medium or hardware error (serious)
Read SMART Extended Comprehensive Error Log failed
Read SMART Error Log failed: scsi error medium or hardware error (serious)
ATA_READ_LOG_EXT (addr=0x07:0x00, page=0, n=1) failed: scsi error medium or hardware error (serious)
Read SMART Extended Self-test Log failed
Read SMART Self-test Log failed: scsi error medium or hardware error (serious)
Read SMART Selective Self-test Log failed: scsi error medium or hardware error (serious)
Are you sure the tape is working correctly? Do you have a Molex to SATA adapter? If yes, use that.
adminmat Posted February 6, 2020 Author
I'll swap for the Molex now. I'm sure this doesn't matter, but the disk does show in the BIOS now. It didn't before when it was on the HBA (unexpectedly).
JorgeB Posted February 6, 2020
Just now, adminmat said: It didn't before when it was in the HBA (unexpectedly)
It wouldn't show up in the motherboard BIOS, only in the HBA BIOS.
adminmat Posted February 6, 2020 Author (edited)
I didn't have a Molex adapter, so I removed the pin.
tower-diagnostics-20200206-1430.zip
Good news? The drive now shows in Unassigned Devices! Looks like SMART is still failing?
Edited February 6, 2020 by adminmat
adminmat Posted February 6, 2020 Author
I can see the drive (Disk 1) in "Unassigned Devices" now, but I cannot select the drive in Array Devices using the drop-down menu. When I click the drop-down menu arrow under Disk 1, the only option is "no device".
Surely I have another option other than formatting my drives. I have a significant amount of data on Disk 1. Disk 2 is empty AFAIK. And the 3rd disk is the Parity.
Should I attempt to connect the other 2 disks? Is it possibly waiting on those to be connected? Or could I cause more harm by connecting them? Is there a plugin to help with this?
adminmat Posted February 6, 2020 Author (edited)
I've since removed the LSI SAS HBA from the server to see if that helped. I still have Drive 1 connected via the motherboard SATA port. It's still showing as an Unassigned Device, although it has the same name as before, which is still listed in Array Devices. Any ideas, anyone?
Edited February 6, 2020 by adminmat
itimpi Posted February 6, 2020
2 minutes ago, adminmat said: I've since removed the LSI SAS HBA from the server to see if that helped. Still have Drive 1 connected via MOBO SATA. It's still showing as an Unassigned Device although it has the same name as before which is still listed up in Array Devices. Any ideas, anyone?
There will be some difference if it is still in Unassigned Devices. If the names match, then the reported size may be fractionally different, which will cause Unraid to treat it as a different device.