
All Data Drives Missing: Please Help



[UPDATE] All drives now show up in unRAID as Unassigned Devices. They are three 8TB WD white drives. They are showing as "missing" under Array Devices and cannot be added back. Please let me know if you have a solution.

 

[ORIGINAL POST] Last night I received an email alert that my array disks had errors. I have 3 HDDs in my array. One had many (8202) and one had 4. Parity showing 16,412. I quickly realized that the power was off to my DAS enclosure. My DAS enclosure is powered by an ATX power supply. I'm not sure why or how it turned off but that's for another discussion. I have not yet shut down / rebooted my unRAID server. I have two cache drives that are still running. 

 

I have since triggered the ATX power supply to come back on. (I have a ground jumper wire permanently attached... I just pulled it out and plugged it back and the DAS powered up again. I'll need a better solution obviously. It has been running great for a few months with no downtime)

 

So the DAS is on but the disks show no activity and they can not be spun up by clicking "SPIN UP" in the unRAID GUI.

 

Should I now reboot the system?

What happens next?

Will the array repair itself?

Why would the disks show errors if the PSU had shut them down suddenly? 

Is there a chance that the power to them could have been flickering or low?

I only have SATA power coming from the ATX PSU to the back of some iStar boxes. The disks and iStar JBOD enclosure should have no control over the PSU.

 

Here is the diagnostics file. Note that I captured this Diagnostics after powering the DAS back on and trying to click "SPIN UP" a few times. 

 

 

tower-diagnostics-20200204-0359.zip

Edited by adminmat
Link to comment
11 hours ago, adminmat said:

Can you elaborate on this point? Did you get that info from the log file? What do you mean by write back? 

Every time there's a read error, Unraid first tries to recalculate that sector using parity and the other drives and then write it back. In this case it couldn't, so it didn't disable the disk; it's in the syslog.
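To illustrate the idea, here is a minimal sketch of XOR-based reconstruction, assuming single parity computed as the XOR of all data disks. The 4-byte "sectors" and variable names are made up for the example; this is not Unraid's actual code.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical 4-byte "sectors" on two data disks.
disk1 = bytes([0x12, 0x34, 0x56, 0x78])
disk2 = bytes([0xAB, 0xCD, 0xEF, 0x01])

# Parity is the XOR of all data disks.
parity = xor_blocks([disk1, disk2])

# If disk1 returns a read error, its sector can be rebuilt from parity
# plus every other data disk, and then written back to disk1.
rebuilt_disk1 = xor_blocks([parity, disk2])
```

Note that the rebuild needs parity and all the other data disks to be readable. With the whole enclosure powered off, none of them were, so there was nothing to recalculate from and nothing to write back to.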

 

6 hours ago, adminmat said:

Is this a hardware issue?

Looks like it, problems initializing all disks, check all connections/power:

 

Feb  4 20:12:23 Tower kernel: sd 5:0:0:0: rejecting I/O to offline device
Feb  4 20:12:23 Tower kernel: ldm_validate_partition_table(): Disk read failed.
Feb  4 20:12:23 Tower kernel: sd 5:0:0:0: rejecting I/O to offline device
Feb  4 20:12:23 Tower kernel: sdb: unable to read partition table
Feb  4 20:12:23 Tower kernel: sd 5:0:0:0: [sdb] Attached SCSI disk
Feb  4 20:12:23 Tower kernel: mpt2sas_cm0: log_info(0x31110630): originator(PL), code(0x11), sub_code(0x0630)
Feb  4 20:12:23 Tower kernel: sd 5:0:1:0: Power-on or device reset occurred
Feb  4 20:12:23 Tower kernel: sd 5:0:2:0: Power-on or device reset occurred
Feb  4 20:12:23 Tower kernel: mpt2sas_cm0: log_info(0x31110630): originator(PL), code(0x11), sub_code(0x0630)
### [PREVIOUS LINE REPEATED 1 TIMES] ###
Feb  4 20:12:23 Tower kernel: sd 5:0:1:0: [sdc] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Feb  4 20:12:23 Tower kernel: sd 5:0:1:0: [sdc] tag#1 CDB: opcode=0x88 88 00 00 00 00 00 00 00 00 00 00 00 00 08 00 00
Feb  4 20:12:23 Tower kernel: print_req_error: I/O error, dev sdc, sector 0
Feb  4 20:12:23 Tower kernel: Buffer I/O error on dev sdc, logical block 0, async page read
Feb  4 20:12:23 Tower kernel: sd 5:0:1:0: rejecting I/O to offline device
Feb  4 20:12:23 Tower kernel: ldm_validate_partition_table(): Disk read failed.
Feb  4 20:12:23 Tower kernel: sd 5:0:1:0: rejecting I/O to offline device
Feb  4 20:12:23 Tower kernel: sdc: unable to read partition table
Feb  4 20:12:23 Tower kernel: sd 5:0:1:0: [sdc] Attached SCSI disk
Feb  4 20:12:23 Tower kernel: sd 5:0:2:0: Power-on or device reset occurred
Feb  4 20:12:23 Tower kernel: mpt2sas_cm0: log_info(0x31110630): originator(PL), code(0x11), sub_code(0x0630)
Feb  4 20:12:23 Tower kernel: sd 5:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Feb  4 20:12:23 Tower kernel: sd 5:0:2:0: [sdd] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 00 00 00 00 00 00 00 08 00 00
Feb  4 20:12:23 Tower kernel: print_req_error: I/O error, dev sdd, sector 0
Feb  4 20:12:23 Tower kernel: Buffer I/O error on dev sdd, logical block 0, async page read
Feb  4 20:12:23 Tower kernel: sd 5:0:2:0: rejecting I/O to offline device
Feb  4 20:12:23 Tower kernel: ldm_validate_partition_table(): Disk read failed.
Feb  4 20:12:23 Tower kernel: sd 5:0:2:0: rejecting I/O to offline device
Feb  4 20:12:23 Tower kernel: sdd: unable to read partition table
Feb  4 20:12:23 Tower kernel: sd 5:0:2:0: [sdd] Attached SCSI disk
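If you want to scan a long syslog for this pattern yourself, a rough sketch like the one below can count the failures per device. The sample lines are copied from the excerpt above; the two regular expressions and the `summarize` helper are assumptions written for this example and only cover these two message types.

```python
import re
from collections import defaultdict

SAMPLE = """\
Feb  4 20:12:23 Tower kernel: sd 5:0:0:0: rejecting I/O to offline device
Feb  4 20:12:23 Tower kernel: sdb: unable to read partition table
Feb  4 20:12:23 Tower kernel: sd 5:0:1:0: Power-on or device reset occurred
Feb  4 20:12:23 Tower kernel: print_req_error: I/O error, dev sdc, sector 0
Feb  4 20:12:23 Tower kernel: sdc: unable to read partition table
Feb  4 20:12:23 Tower kernel: print_req_error: I/O error, dev sdd, sector 0
Feb  4 20:12:23 Tower kernel: sdd: unable to read partition table
"""

# Patterns keyed by a short label; each one captures the sdX device name.
PATTERNS = {
    "io_error": re.compile(r"I/O error, dev (sd[a-z]+)"),
    "no_partition_table": re.compile(
        r"kernel: (sd[a-z]+): unable to read partition table"
    ),
}

def summarize(log_text):
    """Count matching messages per device name."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in log_text.splitlines():
        for label, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[match.group(1)][label] += 1
    return {dev: dict(labels) for dev, labels in counts.items()}

summary = summarize(SAMPLE)
for device in sorted(summary):
    print(device, summary[device])
```

Seeing the same failure on every disk at the same timestamp points at something the disks share (power, cable, HBA) rather than at one bad drive.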

 

Link to comment

So I rebooted the server. I even moved the disks to another JBOD enclosure and they are not showing up in unRAID. 

 

I have an LSI SAS adapter in my server that runs to the JBOD enclosure. The disks have power and spin up, but they are not getting any signal. 

 

Is there a way I'd be able to see if the LSI controller or the disks are connected? I couldn't find an option in my BIOS or IPMI UI. I have an X10 Supermicro board. Can I look at the SAS adapter in the unRAID terminal? 

Link to comment

So in the unRAID terminal I can issue the "lsscsi" command and my 3 data disks show up. They are still connected through an LSI SAS controller. But the unRAID GUI still says they are missing. 

 

So I'm assuming at this point since the unRAID OS can see the LSI SAS card and the disks that this is an unRAID OS issue. 

 

Is there a command that will re-connect my drives to the unRAID OS?

 

LSI CLI commands DISKS showing.png

Edited by adminmat
Link to comment

The disks are connected; the problem is that they are not being initialized correctly, and then Unraid has issues identifying them:

 

Feb  4 20:12:51 Tower emhttpd: device /dev/sdd problem getting id
Feb  4 20:12:51 Tower emhttpd: device /dev/sdb problem getting id
Feb  4 20:12:51 Tower emhttpd: device /dev/sdc problem getting id

What kind of enclosure are you using, does it have an expander or controller or is it SAS direct connect?

Link to comment
54 minutes ago, johnnie.black said:

The disks are connected; the problem is that they are not being initialized correctly, and then Unraid has issues identifying them:

 


Feb  4 20:12:51 Tower emhttpd: device /dev/sdd problem getting id
Feb  4 20:12:51 Tower emhttpd: device /dev/sdb problem getting id
Feb  4 20:12:51 Tower emhttpd: device /dev/sdc problem getting id

What kind of enclosure are you using, does it have an expander or controller or is it SAS direct connect?

SAS direct connect via an 8087 adapter and breakout cables:

 

LSI SAS HBA in my server MB  >  to SFF-8088 cable  >  SFF-8088 to 8087 adapter like this  >  to SFF-8087 to SATA breakout cable  >  to back of iStar hot swap enclosure

 

This is powered by a Corsair SFF PSU via ATX  >  SATA power cable  >  Back of iStar enclosure

 

This has been working well for a while. I'm not using an expander. 

 

 

Here is the server and enclosure cabinet side by side:

1r.thumb.png.754ec4ec88d997ece85df3f565be5c1c.png

 

Here is the 8088 cable going from the HBA in server to the cabinet (you can also see the PSU) :

2r.thumb.png.6af9269fde8348906221d2c02c7c48d1.png

 

I mounted this SFF-8088 to 8087 adapter in the cabinet: 

IMG_20190927_135202.thumb.jpg.2db4fe9b02b819a7a1c94822228d74c3.jpg

 

Here is inside of cabinet showing 8087 break-out cables:

3.thumb.jpg.5cbd72bf0bcae61abdbb65ffe4d18bac.jpg

 

The SATA cables plug into the back of the iStar boxes:

4.thumb.jpg.935e3c4c52a9cc052b4b02fbd6b2575b.jpg

 

I tried moving the disks to another, new iStar enclosure, powered it up, connected the SAS cable and same result.

1.png

Edited by adminmat
Link to comment
49 minutes ago, johnnie.black said:

The easiest way to troubleshoot this would be to connect one of the disks directly to the HBA on the server. If it works, it's likely an enclosure problem. If it still doesn't, connect it to the onboard SATA controller; if it works there, the HBA/cables are likely the problem.

 

 

 

I have tried a 2nd, new iStar enclosure with the same results. Could the Corsair power supply be undervolting, causing an issue? I can connect directly to the HBA via the external 8088 cable to the 8088 to 8087 adapter, then directly to the drive, using the PSU in the server. My HBA has no 8087 ports on it.

 

And I can just connect the disk to the motherboard's SATA port. Also using the server's PSU. 

 

 

 

 

Link to comment
5 minutes ago, civic95man said:

Might be exceeding the maximum Sata length.  Probably should use an expander on the enclosure end.

 

I was concerned about this at first. But the SAS cable maximum length is 10 meters and SATA is 1 meter. My SAS cable is 1 meter and the SATA cable is 0.5 meter. I still think this is an unRAID OS issue. I'll put another OS on the server tonight and see if I can read/write to some other disks in the enclosure. 

 

This setup has been working perfectly for ~4 months with no issues, and only now it's an issue? I've loaded 12 TB onto these disks with plenty of reads and have not had one error. 

 

I doubt you go from zero errors to "Missing" disks

Link to comment
Might be exceeding the maximum Sata length.

Yep, max SATA cable length is 1m total, though if it was working before, it's kind of strange to now have issues with all disks. It could still be a cable gone bad.

 

But the SAS cable maximum length is 10 meters and SATA is 1 meter.

Yes, but only between two SAS devices, e.g. between an HBA and a SAS expander, then 1 meter to SATA disks.
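As a worked example of that rule, here is a small sketch that checks a cable path against the two limits quoted in this thread (10 m between two SAS devices, 1 m total to a SATA disk). The `check_path` helper is hypothetical, and the lengths are the 1 m SFF-8088 cable and 0.5 m breakout cable mentioned above; the passive 8088 to 8087 adapter is not counted.

```python
SATA_MAX_M = 1.0   # limit quoted in the thread for a path ending at a SATA disk
SAS_MAX_M = 10.0   # limit quoted in the thread between two SAS devices

def check_path(segments_m, has_expander):
    """Return (ok, detail) for a cable path from the HBA to a SATA disk.

    segments_m: cable lengths in meters, ordered from HBA to disk.
    has_expander: True if a SAS expander sits just before the last segment.
    """
    if has_expander:
        # HBA -> expander is a SAS-to-SAS run; only the last hop is SATA.
        sas_run = sum(segments_m[:-1])
        sata_run = segments_m[-1]
        ok = sas_run <= SAS_MAX_M and sata_run <= SATA_MAX_M
        return ok, f"SAS run {sas_run:.1f} m, SATA run {sata_run:.1f} m"
    # Without an expander the whole passive path counts as one SATA link.
    total = sum(segments_m)
    return total <= SATA_MAX_M, f"passive SATA path {total:.1f} m"

direct = check_path([1.0, 0.5], has_expander=False)
with_expander = check_path([1.0, 0.5], has_expander=True)
print(direct)
print(with_expander)
```

With direct connect the two cables add up to 1.5 m, which is over the 1 m SATA limit. The same cables with an expander at the enclosure end would be within both limits.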

 

 

Link to comment
2 minutes ago, johnnie.black said:

Yep, max SATA cable length is 1m total, though if it was working before, it's kind of strange to now have issues with all disks. It could still be a cable gone bad.

I just replaced the SFF-8088 with a spare and same result. It's not the 8087s because I've tried 2 enclosures and with 2 sets of breakout cables. 

 

4 minutes ago, johnnie.black said:

Yes, but only between two SAS devices, e.g. between an HBA and a SAS expander, then 1 meter to SATA disks.

 

So it's either PSU voltage issues, the OS or the cable resistance increased for some reason. 

Link to comment

It can't be the OS.

 

30 minutes ago, adminmat said:

I can connect directly to the HBA via the external 8088 cable to the 8088 to 8087 adapter, then directly to the drive, using the PSU in the server. My HBA has no 8087 ports on it.

 

And I can just connect the disk to the motherboard's SATA port. Also using the server's PSU. 

Those would be the next things to try.

Link to comment

"It can't be the OS."

I don't see how either. The NVME cache drives are showing up available in the GUI. 

 

So connecting one of the 8TB HDDs directly to the Supermicro motherboard SATA port, taping off the 3.3v pin on the WD white drive, powering with server's PSU, still shows "missing"

 

Does it matter that these disks are encrypted? 

 

In terminal I'm getting:

 

98203087_mbsataport.PNG.560cfeb0bbc71a9a7f41a62d879ec5a8.PNG

 

So the same as before. 

 

What's next? I could try to install Ubuntu on bare metal and see if I can un-encrypt the disks and read/write?

 

Could all the HDDs be killed by a bad PSU?

 

Edit: Added Diagnostics 

tower-diagnostics-20200206-1349.zip

Edited by adminmat
Link to comment

There's a hardware problem somewhere, even SMART can't be read correctly:

ATA_READ_LOG_EXT (addr=0x03:0x00, page=0, n=1) failed: scsi error medium or hardware error (serious)
Read SMART Extended Comprehensive Error Log failed

Read SMART Error Log failed: scsi error medium or hardware error (serious)

ATA_READ_LOG_EXT (addr=0x07:0x00, page=0, n=1) failed: scsi error medium or hardware error (serious)
Read SMART Extended Self-test Log failed

Read SMART Self-test Log failed: scsi error medium or hardware error (serious)

Read SMART Selective Self-test Log failed: scsi error medium or hardware error (serious)

Are you sure the tape is working correctly? Do you have a Molex to SATA adapter? If yes, use that.

Link to comment

I can see the drive (Disk 1) in "Unassigned Devices" now. But I cannot select the drive in Array Devices using the drop-down menu. When I click the drop-down arrow under Disk 1, the only option is "no device".

 

Surely I have another option other than formatting my drives. I have a significant amount of data on Disk 1. Disk 2 is empty AFAIK. And the 3rd disk is the Parity. 


Should I attempt to connect the other 2 disks? Is it possibly waiting on those to be connected? Or could I cause more harm by connecting them?

 

Is there a plugin to help with this? 

 

Link to comment

I've since removed the LSI SAS HBA from the server to see if that helped. Still have Drive 1 connected via MOBO SATA. It's still showing as an Unassigned Device although it has the same name as before which is still listed up in Array Devices. 

 

Any ideas, anyone? 

Edited by adminmat
Link to comment
2 minutes ago, adminmat said:

I've since removed the LSI SAS HBA from the server to see if that helped. Still have Drive 1 connected via MOBO SATA. It's still showing as an Unassigned Device although it has the same name as before which is still listed up in Array Devices. 

 

Any ideas, anyone? 

There will be some difference if it is still in Unassigned Devices. If the names match, then the reported size may be fractionally different, which will cause Unraid to treat it as a different device.
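A rough sketch of that comparison, assuming a disk is matched on both its name and its reported size. The `DiskIdentity` type, the name string, and the byte counts are all made up for illustration; this is not how Unraid stores assignments internally.

```python
from typing import NamedTuple

class DiskIdentity(NamedTuple):
    name: str        # identifier string shown in the GUI (hypothetical value below)
    size_bytes: int  # capacity as reported through the current controller

def describe_mismatch(expected, found):
    """List the fields that differ between the assigned and the detected disk."""
    problems = []
    if expected.name != found.name:
        problems.append("name differs")
    if expected.size_bytes != found.size_bytes:
        diff = found.size_bytes - expected.size_bytes
        problems.append(f"size differs by {diff} bytes")
    return problems

# Same name, but the size reported on the new connection is slightly smaller.
expected = DiskIdentity("EXAMPLE_MODEL_SERIAL", 8_000_000_000_000)
found = DiskIdentity("EXAMPLE_MODEL_SERIAL", 7_999_999_995_904)
mismatch = describe_mismatch(expected, found)
print(mismatch)
```

So it is worth comparing the size shown under Unassigned Devices with the size listed for the missing slot, not just the name.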

Link to comment
