[Plugin] Spin Down SAS Drives


doron


13 minutes ago, dansonamission said:

The drive is a ST4000NM0023 Seagate, revision GS0D and the card HP H220 LSI 9205-8i 9207 with P20 firmware.

The manual for this drive (in fact, for the entire Constellation ES.3 series) seems to indicate (sec 6.1) that an explicit NOTIFY needs to be sent to the device to recover from the spindown mode we're sending the device into (Standby_Z).

This is in contrast with other devices tested (e.g. WD/HGST) that automatically spin up when sent to this state.

 

Has anyone seen positive results with this drive and this plugin?

 

We might end up having to enumerate the drive types where this works well vs. those that fail, and build a white list in the plugin. Nasty 😞

 

Can I get brief messages here from anyone who's using (or tried using) the plugin, reporting success/failure? Just a one liner with:

<HDD Model> <Success/Failure> (<optional comment>)

would be great. Example:

HUH721212AL4200  Success

PM would also work if you don't want to post. Thanks!
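
If anyone wants to tally the replies, a one-liner in that format is easy to parse. Below is a minimal sketch in Python; `Report` and `parse_report` are names made up for this illustration and are not part of the plugin, and the regex assumes the model string contains no spaces.

```python
import re
from typing import NamedTuple, Optional


class Report(NamedTuple):
    model: str
    success: bool
    comment: Optional[str]


# Matches: <HDD Model> <Success/Failure> (<optional comment>)
_REPORT_RE = re.compile(
    r"^\s*(?P<model>\S+)\s+(?P<result>success|failure)\s*"
    r"(?:\((?P<comment>.*)\))?\s*$",
    re.IGNORECASE,
)


def parse_report(line: str) -> Optional[Report]:
    """Return a Report for a well-formed line, or None if it does not match."""
    m = _REPORT_RE.match(line)
    if m is None:
        return None
    return Report(
        model=m.group("model"),
        success=m.group("result").lower() == "success",
        comment=m.group("comment"),
    )


# The example line from the post above.
print(parse_report("HUH721212AL4200  Success"))
```

Lines that carry extra columns (vendor, firmware revision) will not match this pattern and come back as `None`, so they would need to be handled by hand.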

Link to comment

My drives are as follows. They are not tested with the plugin, but they work with manual commands.

 

HGST     HUS724030ALS640   A1C4  Success
HITACHI  HMRSK2000GBAS07K  3P02  Success
 

Will get a Seagate drive, do some testing, and report back.

Edited by SimonF
Additional Info
Link to comment
39 minutes ago, SuperDan said:

HITACHI   HUC106060CSS600 A430 Failure

Read errors when spun down and trying to wake back up.

They do spin back up, but I have had drives randomly (3 times) end up with a red x.

Array is 20 drives of the above type.

 

Thanks. So basically you're saying the result in your case is not consistent? Some (most?) of the time they do spin back up but at times they get the read errors?

 

Could you paste syslog lines from the time of such error that red-x-ed a drive?

Link to comment
1 hour ago, doron said:

Thanks. So basically you're saying the result in your case is not consistent? Some (most?) of the time they do spin back up but at times they get the read errors?

 

That is correct.

 

I had to re-enable the plugin to get the log entries.

This time 2 drives red x'd on me.

The only entries I saw related to those drives are:

 

Oct  7 09:32:46 unNAS kernel: blk_update_request: I/O error, dev sdz, sector 64 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Oct  7 09:32:46 unNAS kernel: md: disk17 read error, sector=0
Oct  7 09:32:51 unNAS kernel: blk_update_request: I/O error, dev sdz, sector 64 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Oct  7 09:32:51 unNAS kernel: md: disk17 write error, sector=0

Oct  7 09:41:37 unNAS kernel: blk_update_request: I/O error, dev sdu, sector 586549720 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Oct  7 09:41:37 unNAS kernel: md: disk15 read error, sector=586549656
Oct  7 09:41:37 unNAS kernel: blk_update_request: I/O error, dev sdu, sector 586549720 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Oct  7 09:41:37 unNAS kernel: md: disk15 write error, sector=586549656
Oct  7 09:41:37 unNAS kernel: blk_update_request: I/O error, dev sdu, sector 64 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Oct  7 09:41:37 unNAS kernel: md: disk15 write error, sector=0

 

Link to comment
1 hour ago, doron said:

Thanks. So basically you're saying the result in your case is not consistent? Some (most?) of the time they do spin back up but at times they get the read errors?

 

Could you paste syslog lines from the time of such error that red-x-ed a drive?

Actually I found more log entries for one of the drives that went red x:

 

Oct  7 09:31:30 unNAS SAS Assist v0.6[36184]: spinning down slot 17, device /dev/sdz (/dev/sg26)
Oct  7 09:32:46 unNAS kernel: sd 2:0:23:0: [sdz] tag#24 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
Oct  7 09:32:46 unNAS kernel: sd 2:0:23:0: [sdz] tag#24 Sense Key : 0x2 [current] [descriptor] 
Oct  7 09:32:46 unNAS kernel: sd 2:0:23:0: [sdz] tag#24 ASC=0x4 ASCQ=0x11 
Oct  7 09:32:46 unNAS kernel: sd 2:0:23:0: [sdz] tag#24 CDB: opcode=0x28 28 00 00 00 00 40 00 00 08 00
Oct  7 09:32:46 unNAS kernel: blk_update_request: I/O error, dev sdz, sector 64 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Oct  7 09:32:46 unNAS kernel: md: disk17 read error, sector=0
Oct  7 09:32:46 unNAS kernel: sd 2:0:8:0: Power-on or device reset occurred
Oct  7 09:32:51 unNAS kernel: sd 2:0:23:0: [sdz] tag#8 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
Oct  7 09:32:51 unNAS kernel: sd 2:0:23:0: [sdz] tag#8 Sense Key : 0x2 [current] [descriptor] 
Oct  7 09:32:51 unNAS kernel: sd 2:0:23:0: [sdz] tag#8 ASC=0x4 ASCQ=0x11 
Oct  7 09:32:51 unNAS kernel: sd 2:0:23:0: [sdz] tag#8 CDB: opcode=0x2a 2a 00 00 00 00 40 00 00 08 00
Oct  7 09:32:51 unNAS kernel: blk_update_request: I/O error, dev sdz, sector 64 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Oct  7 09:32:51 unNAS kernel: md: disk17 write error, sector=0

 

Link to comment

Thanks very much for this.

57 minutes ago, SuperDan said:

Oct  7 09:31:30 unNAS SAS Assist v0.6[36184]: spinning down slot 17, device /dev/sdz (/dev/sg26)
Oct  7 09:32:46 unNAS kernel: sd 2:0:23:0: [sdz] tag#24 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
Oct  7 09:32:46 unNAS kernel: sd 2:0:23:0: [sdz] tag#24 Sense Key : 0x2 [current] [descriptor] 
Oct  7 09:32:46 unNAS kernel: sd 2:0:23:0: [sdz] tag#24 ASC=0x4 ASCQ=0x11 

Aha. That's the infamous/dreaded 02-04-11, same as we've seen on the Seagate. Will have to exclude this series as well 😞
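
For reference, 02-04-11 reads as sense key 0x2 (NOT READY) with ASC/ASCQ 04h/11h, which the standard SCSI code assignments list as "logical unit not ready, notify (enable spinup) required". For anyone who wants to check their own syslog for it, here is a rough Python sketch. The regexes are modelled only on the kernel lines pasted in this thread, and `find_sense_triples` is a hypothetical helper name, so treat it as a starting point rather than a robust parser.

```python
import re

# Only the code discussed in this thread; wording follows the standard
# SCSI ASC/ASCQ assignment for 04h/11h.
KNOWN_SENSE = {
    (0x2, 0x04, 0x11): "NOT READY, notify (enable spinup) required",
}

SENSE_KEY_RE = re.compile(
    r"\[(?P<dev>sd[a-z]+)\] tag#\d+ Sense Key : 0x(?P<key>[0-9a-fA-F]+)"
)
ASC_RE = re.compile(
    r"\[(?P<dev>sd[a-z]+)\] tag#\d+ "
    r"ASC=0x(?P<asc>[0-9a-fA-F]+) ASCQ=0x(?P<ascq>[0-9a-fA-F]+)"
)


def find_sense_triples(lines):
    """Yield (device, key, asc, ascq) for a Sense Key line followed by
    an ASC/ASCQ line for the same device."""
    pending = {}
    for line in lines:
        m = SENSE_KEY_RE.search(line)
        if m:
            pending[m.group("dev")] = int(m.group("key"), 16)
            continue
        m = ASC_RE.search(line)
        if m and m.group("dev") in pending:
            key = pending.pop(m.group("dev"))
            yield (m.group("dev"), key,
                   int(m.group("asc"), 16), int(m.group("ascq"), 16))


# Two lines copied from the log quoted above.
sample = [
    "Oct  7 09:32:46 unNAS kernel: sd 2:0:23:0: [sdz] tag#24 Sense Key : 0x2 [current] [descriptor] ",
    "Oct  7 09:32:46 unNAS kernel: sd 2:0:23:0: [sdz] tag#24 ASC=0x4 ASCQ=0x11 ",
]
for dev, key, asc, ascq in find_sense_triples(sample):
    label = KNOWN_SENSE.get((key, asc, ascq), "not in this small table")
    print(f"{dev}: {key:02x}-{asc:02x}-{ascq:02x} {label}")
```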

 

57 minutes ago, SuperDan said:

Oct  7 09:32:46 unNAS kernel: sd 2:0:8:0: Power-on or device reset occurred

A different drive? Was something else happening at the same time?

 

Link to comment
1 hour ago, doron said:

Thanks very much for this.

Aha. That's the infamous/dreaded 02-04-11, same as we've seen on the Seagate. Will have to exclude this series as well 😞

 

A different drive? Was something else happening at the same time?

 

Maybe a different drive (or drives), since my cache drives are SATA SSDs.

 

May as well add these drives too, since they are HITACHI HUC106060CSS600 drives rebranded as NetApp drives and still suffer from the above problem:

NETAPP  X422_TAL13600A10

NETAPP  X422_HCOBD600A10

 

 

Edited by SuperDan
Link to comment
1 minute ago, SuperDan said:

Something I just noticed, the other user having this problem is using an HP controller

HP H220 LSI 9205-8i 9207

And I am using a DELL H310 LSI MegaRAID SAS 2008

 

Maybe proprietary firmware on the HBA has something to do with it?

Just throwing it out there.

 

Hmm. Certainly a possibility, yes. 

Currently it's second on my list of potential causes, only because the OEM manual of the Seagate described the drive's behavior in the Standby_Z state (this is what we're using to spin the drive down), and it explicitly stated that the drive will require an init op to spin up again. This is in contrast with the HGST/WD SAS drives, which (explicitly) specify that in the same state the drive is ready to spin back up on next I/O.

The manual for your Hitachi is silent about this (it does not say either way), but the 04-11 sense makes me think it behaves like the Seagate.

 

You'd think that standards would be standards, and the drive behavior following a standard SAS/SCSI operation would be standard... go figure.

 

It could still be the controller, the above is just my current thinking.
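
To make the Standby_Z reference concrete: in the SCSI block command set, the standby states are requested with START STOP UNIT (opcode 1Bh), using POWER CONDITION = 3h (STANDBY) and POWER CONDITION MODIFIER = 0h for Standby_Z. The sketch below only builds the 6-byte CDB; it does not send anything to a device. That the plugin's spin-down reduces to exactly this CDB is an assumption for illustration, since the thread only names the target state, and `start_stop_unit_cdb` is a made-up helper name.

```python
def start_stop_unit_cdb(power_condition: int, modifier: int = 0,
                        immed: bool = False, start: bool = False) -> bytes:
    """Build a 6-byte START STOP UNIT CDB (opcode 0x1B)."""
    if not 0 <= power_condition <= 0xF:
        raise ValueError("power_condition must fit in 4 bits")
    if not 0 <= modifier <= 0xF:
        raise ValueError("modifier must fit in 4 bits")
    return bytes([
        0x1B,                                    # operation code
        0x01 if immed else 0x00,                 # IMMED is bit 0 of byte 1
        0x00,                                    # reserved
        modifier,                                # POWER CONDITION MODIFIER (bits 3..0)
        (power_condition << 4) | (0x01 if start else 0x00),  # POWER CONDITION (bits 7..4), START (bit 0)
        0x00,                                    # CONTROL
    ])


STANDBY = 0x3  # POWER CONDITION value for the standby states

cdb = start_stop_unit_cdb(STANDBY, modifier=0)   # Standby_Z
print(cdb.hex(" "))  # prints: 1b 00 00 00 30 00
```

What the drive does on the next I/O after accepting this command (spin up by itself, or answer 02-04-11 until it is explicitly told to start) is the part that seems to differ between setups here.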

Link to comment
30 minutes ago, doron said:

Hmm. Certainly a possibility, yes. 

Currently it's second on my list of potential causes, only because the OEM manual of the Seagate described the drive's behavior in the Standby_Z state (this is what we're using to spin the drive down), and it explicitly stated that the drive will require an init op to spin up again. This is in contrast with the HGST/WD SAS drives, which (explicitly) specify that in the same state the drive is ready to spin back up on next I/O.

The manual for your Hitachi is silent about this (it does not say either way), but the 04-11 sense makes me think it behaves like the Seagate.

 

You'd think that standards would be standards, and the drive behavior following a standard SAS/SCSI operation would be standard... go figure.

 

It could still be the controller, the above is just my current thinking.

Hmm, this has given me the motivation I needed to replace the DELL controller with one that is flashed with the LSI firmware. Been meaning to do it for a while now.

I just ordered it, and once it is installed I'll report back on whether anything changed.

  

Link to comment

[1000:0064] 07:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2116 PCI-Express Fusion-MPT SAS-2 [Meteor] (rev 02)

[10:0:0:0] disk ATA ST3320310CS SC14 /dev/sdl 320GB
[10:0:1:0] disk SEAGATE ST4000NM0023 XMGJ /dev/sdm 4.00TB

 

Testing with the SEAGATE drive works as expected so far, but I have only added it as a standalone drive in a second pool. It spins up when accessed.

 

I will test on the other controller in my system (below) over the weekend, and may add the drive into the pool as the parity drive.

 

[1000:0086] 02:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)

[3:0:0:0] disk HGST HUS724030ALS640 A1C4 /dev/sdd 3.00TB
[3:0:1:0] disk HITACHI HMRSK2000GBAS07K 3P02 /dev/sde 2.00TB
[3:0:2:0] disk HGST HUS724030ALS640 A1C4 /dev/sdf 3.00TB
[3:0:3:0] disk HGST HUS724030ALS640 A1C4 /dev/sdh 3.00TB
[3:0:4:0] disk HGST HUS724030ALS640 A1C4 /dev/sdj 3.00TB
[3:0:5:0] disk HITACHI HMRSK2000GBAS07K 3P02 /dev/sdk 2.00TB

 

Both controllers are in IT Mode on P20.

Link to comment
2 hours ago, SimonF said:

[10:0:1:0] disk SEAGATE ST4000NM0023 XMGJ /dev/sdm 4.00TB

 

testing with SEAGATE drive works as expected so far, but I have only added as a standalone second pool drive. Spins up when accessed.

That's very interesting. 

Could it be that it just takes very long to spin back up? I'm trying to find a plausible explanation as to why the same drive works as expected in your case but failed in the previous poster's case.

I guess the next test needs to be in the array.

 

Or, it might, after all, be the controller. Or the firmware version, although I'd really doubt that.

 

I was just about to push out a new version of the plugin with some changes and improvements, and with a mechanism to filter out certain HDDs by name/vendor. I'll hold off until we have some more insight.
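
As a rough idea of what filtering by name/vendor could look like, here is a sketch in Python. It is not the plugin's code: `EXCLUDED` and `is_excluded` are hypothetical names, and the two entries are just the models reported as failing earlier in this thread, used purely as illustration, since it is not yet clear whether the drive or the controller is at fault.

```python
from fnmatch import fnmatchcase

# Hypothetical (vendor, model-pattern) rules; illustrative only.
EXCLUDED = [
    ("SEAGATE", "ST4000NM0023*"),
    ("HITACHI", "HUC106060CSS600*"),
]


def is_excluded(vendor: str, model: str, rules=EXCLUDED) -> bool:
    """Return True if the drive matches any (vendor, model) rule."""
    v = vendor.strip().upper()
    m = model.strip().upper()
    return any(fnmatchcase(v, rv) and fnmatchcase(m, rm) for rv, rm in rules)


print(is_excluded("SEAGATE", "ST4000NM0023"))   # True
print(is_excluded("HGST", "HUS724030ALS640"))   # False
```

Glob patterns keep the list short when a whole series needs excluding, at the cost of possibly catching a model that behaves fine.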

 

Edited by doron
Link to comment
26 minutes ago, Spazhead said:

FYI not working for my 7 Seagate ST4000NM0023 on Dell H200 IT mode

Thanks for reporting.

Can you elaborate on "not working" in this case - are you seeing i/o errors? Did you get a red x? etc.

Can you share the drive f/w version?

We're seeing different reports wrt this drive so trying to dig further in (f/w, controller etc.).

 

Link to comment

Just a follow up.

I replaced the DELL H310 LSI MegaRAID SAS 2008 controller with a DELL H710 LSI 9207-8i P20 flashed with LSI IT mode firmware version P20 (20.00.07.00).

After 18 hours of running and some other tests, the drives spin up and down as they should, with no errors or red x drives.

So, at least in my case replacing the controller fixed it.

 

 

Edited by SuperDan
Link to comment
50 minutes ago, SuperDan said:

controller fixed it

Thanks for the update. Please report back if any errors happen in the future. Are all the drives you are using HITACHI?

 

Testing with the SEAGATE as parity drive: it performs as expected with manual spin down via command on the 9201-16e. Tomorrow I will test the other controller (Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2, rev 05, Super Micro Computer Inc onboard).

Edited by SimonF
Additional Info
Link to comment
51 minutes ago, SuperDan said:

Just a follow up.

I replaced the DELL H310 LSI MegaRAID SAS 2008 controller with a DELL H710 LSI 9207-8i P20 flashed with LSI IT mode firmware version P20 (20.00.07.00).

After 18 hours of running and doing some other tests the drives spin up/down as they should and no errors or red x drives.

So, at least in my case replacing the controller fixed it.

Thanks - great stuff!

Other reports also begin to confirm that this is probably controller dependent.

 

Do you happen to have the PCI IDs of these two?

For the running one, you can obtain it via "lspci -vnnmm".

For the one you removed - well, if it is connected to another machine...

Link to comment
6 minutes ago, doron said:

Thanks - great stuff!

Other reports also begin to confirm that this is probably controller dependent.

 

Do you happen to have the PCI IDs of these two?

For the running one, you can obtain it via "lspci -vnnmm".

For the one you removed - well, if it is connected to another machine...

Sorry, I did not record the ID for the one I removed.

Unfortunately these are the Mini Mono cards that require a proprietary PCI storage slot on Dell servers and I do not have another server I can plug the old one into.

 

Output of lspci -vnnmm:

 

Slot:   03:00.0
Class:  Serial Attached SCSI controller [0107]
Vendor: Broadcom / LSI [1000]
Device: SAS2308 PCI-Express Fusion-MPT SAS-2 [0087]
SVendor:        Dell [1028]
SDevice:        SAS2308 PCI-Express Fusion-MPT SAS-2 [1f38]
Rev:    05
NUMANode:       0
IOMMUGroup:     18
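
The record above can be reduced to the short vendor:device / subsystem form with a few lines of Python. This is a sketch that assumes the key/value layout shown in this thread; `parse_lspci_record` and `pci_ids` are names invented for the example.

```python
import re

# The numeric ID is the last [xxxx] group on the line; names such as
# "[Meteor]" may appear before it, so anchor at the end.
ID_RE = re.compile(r"\[([0-9a-fA-F]{4})\]\s*$")


def parse_lspci_record(text: str) -> dict:
    """Turn one 'Key: value' record into a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def pci_ids(fields: dict) -> tuple:
    """Return (vendor, device, subsystem vendor, subsystem device) IDs."""
    def tail_id(name):
        m = ID_RE.search(fields.get(name, ""))
        return m.group(1).lower() if m else None
    return tuple(tail_id(n) for n in ("Vendor", "Device", "SVendor", "SDevice"))


# Record copied from the output above.
sample = """Slot:   03:00.0
Class:  Serial Attached SCSI controller [0107]
Vendor: Broadcom / LSI [1000]
Device: SAS2308 PCI-Express Fusion-MPT SAS-2 [0087]
SVendor:        Dell [1028]
SDevice:        SAS2308 PCI-Express Fusion-MPT SAS-2 [1f38]
Rev:    05
"""
vendor, device, svendor, sdevice = pci_ids(parse_lspci_record(sample))
print(f"{vendor}:{device} / {svendor}:{sdevice}")  # prints: 1000:0087 / 1028:1f38
```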

Link to comment
9 hours ago, doron said:

Thanks for reporting.

Can you elaborate on "not working" in this case - are you seeing i/o errors? Did you get a red x? etc.

Can you share the drive f/w version?

We're seeing different reports wrt this drive so trying to dig further in (f/w, controller etc.).

 

 

I guess it just doesn't sleep.

 

Log error:

Oct 10 04:29:19 TOWER kernel: md: do_drive_cmd: disk13: ATA_OP e0 ioctl error: -5

Oct 10 04:29:19 TOWER emhttpd: error: mdcmd, 2723: Input/output error (5): write

 

Drives still work.

Screen Shot 2020-10-10 at 2.59.43 PM.png

Link to comment
2 minutes ago, SuperDan said:

Sorry, I did not record the ID for the one I removed.

Unfortunately these are the Mini Mono cards that require a proprietary PCI storage slot on Dell servers and I do not have another server I can plug the old one into.

Yes, I know these Dell Minis all too well. No worries - you are providing really helpful data.

I'm going to assume the old one is 1000:0073 / 1028:1f51. If anyone can corroborate or correct this (Dell H310 Mini Mono), please do.

Link to comment

The SEAGATE disk works as expected on both controllers:

 

Class:  Serial Attached SCSI controller [0107]
Vendor: Broadcom / LSI [1000]
Device: SAS2308 PCI-Express Fusion-MPT SAS-2 [0086]
SVendor:        Super Micro Computer Inc [15d9]
SDevice:        Onboard SAS2308 PCI-Express Fusion-MPT SAS-2 [0691]
Rev:    05

 

and

 

Class:  Serial Attached SCSI controller [0107]
Vendor: Broadcom / LSI [1000]
Device: SAS2116 PCI-Express Fusion-MPT SAS-2 [Meteor] [0064]
SVendor:        Broadcom / LSI [1000]
SDevice:        9201-16e 6Gb/s SAS/SATA PCIe x8 External HBA [30d0]
Rev:    02

Link to comment

Thanks for making this a plugin.  Hope I can help with some diagnostics.

 

HUS724030ALS640  Fail with *Temp and drive not spun down.  Still operational, seemingly no effect.  Getting read errors after a reboot.

 

image.thumb.png.f8c1d2dca4416530290696be38f50e5d.png

 

Oct 11 20:38:24 Beast kernel: mdcmd (47): spindown 7
Oct 11 20:38:24 Beast kernel: md: do_drive_cmd: disk7: ATA_OP e0 ioctl error: -5
Oct 11 20:38:24 Beast emhttpd: error: mdcmd, 2723: Input/output error (5): write
Oct 11 20:38:24 Beast SAS Assist v0.6[27532]: spinning down slot 7, device /dev/sde (/dev/sg4)

Slot:   06:00.0
Class:  Serial Attached SCSI controller [0107]
Vendor: Broadcom / LSI [1000]
Device: SAS2116 PCI-Express Fusion-MPT SAS-2 [Meteor] [0064]
SVendor:        Broadcom / LSI [1000]
SDevice:        SAS 9201-16i [30c0]
Rev:    02
NUMANode:       0

 

Edited by jenga201
Link to comment

Thanks for taking the time.

13 hours ago, jenga201 said:

Thanks for making this a plugin.  Hope I can help with some diagnostics.

 

HUS724030ALS640  Fail with *Temp and drive not spun down.  Still operational, seemingly no effect. 

Thanks. How did you determine that the drive is not spun down?

 

13 hours ago, jenga201 said:

 

Getting read errors after a reboot.

 

image.thumb.png.f8c1d2dca4416530290696be38f50e5d.png

 


Oct 11 20:38:24 Beast kernel: mdcmd (47): spindown 7
Oct 11 20:38:24 Beast kernel: md: do_drive_cmd: disk7: ATA_OP e0 ioctl error: -5
Oct 11 20:38:24 Beast emhttpd: error: mdcmd, 2723: Input/output error (5): write
Oct 11 20:38:24 Beast SAS Assist v0.6[27532]: spinning down slot 7, device /dev/sde (/dev/sg4)

Slot:   06:00.0
Class:  Serial Attached SCSI controller [0107]
Vendor: Broadcom / LSI [1000]
Device: SAS2116 PCI-Express Fusion-MPT SAS-2 [Meteor] [0064]
SVendor:        Broadcom / LSI [1000]
SDevice:        SAS 9201-16i [30c0]
Rev:    02
NUMANode:       0

 

These two messages (OP e0 error and mdcmd error (5)) are not really read errors. They are the drive/controller's response to the ATA ops sent by Unraid, and are not an indication of a problem. I'm assuming you were receiving them before installing the plugin (and will keep receiving them if you remove it); please correct me if I'm wrong.

 

The next version of the plugin (not pushed out yet) includes a feature that just filters out these syslog messages.
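
To show which lines are meant, here is a small Python sketch that drops exactly those two message types from a list of syslog lines. It is not how the plugin does it (that mechanism is not described in this thread); `NOISE_PATTERNS` and `drop_noise` are hypothetical names, and the patterns are derived only from the lines quoted above.

```python
import re

# Patterns for the two harmless messages discussed above.
NOISE_PATTERNS = [
    re.compile(r"md: do_drive_cmd: disk\d+: ATA_OP e0 ioctl error: -5"),
    re.compile(r"emhttpd: error: mdcmd, \d+: Input/output error \(5\): write"),
]


def drop_noise(lines):
    """Return the lines that match none of the noise patterns."""
    return [ln for ln in lines
            if not any(p.search(ln) for p in NOISE_PATTERNS)]


# Lines copied from the log quoted above.
sample = [
    "Oct 11 20:38:24 Beast kernel: mdcmd (47): spindown 7",
    "Oct 11 20:38:24 Beast kernel: md: do_drive_cmd: disk7: ATA_OP e0 ioctl error: -5",
    "Oct 11 20:38:24 Beast emhttpd: error: mdcmd, 2723: Input/output error (5): write",
    "Oct 11 20:38:24 Beast SAS Assist v0.6[27532]: spinning down slot 7, device /dev/sde (/dev/sg4)",
]
for line in drop_noise(sample):
    print(line)
```

Only the spindown request and the SAS Assist line remain, which makes real I/O errors easier to spot.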

 

Link to comment
