SATA Errors on Spinup

tdallen · April 28, 2015

I've been working on a problem with my server. It has been unstable under unRAID 6, with problems like dropping disks and showing errors on the AOC-SAS2LP. My most recent attempt to fix the problem was to replace the power supply. I think things are working better but I'm still seeing an issue in the logs.

Here is the scenario -

- The server is idle.

- The parity drive is a 6TB WD Red and is spun down.

- The target data drive also happens to be a 6TB WD for this operation, and it is also spun down.

- I invoke Manual Post-Processing in Sickbeard, which is going to immediately move files from cache to the data drive. The cache drive is spun up (thanks Plex).

Here's what I see in the logs:

Apr 27 18:42:43 Tower in.telnetd[2566]: connect from 192.168.1.23 (192.168.1.23)
Apr 27 18:42:45 Tower login[2567]: ROOT LOGIN  on '/dev/pts/0' from 'Berkeley'
Apr 27 19:31:31 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1866:Release slot [1] tag[1], task [ffff8800ba5b4400]:
Apr 27 19:31:31 Tower kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 000007FE,  slot [1].
Apr 27 19:31:31 Tower kernel: sas: sas_ata_task_done: SAS error 8a
Apr 27 19:31:31 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1866:Release slot [2] tag[2], task [ffff8800ba5b4200]:
Apr 27 19:31:31 Tower kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 000007FC,  slot [2].
Apr 27 19:31:31 Tower kernel: sas: sas_ata_task_done: SAS error 8a
Apr 27 19:31:31 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1866:Release slot [3] tag[3], task [ffff8800ba5b4b00]:
Apr 27 19:31:31 Tower kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 000007F8,  slot [3].
Apr 27 19:31:31 Tower kernel: sas: sas_ata_task_done: SAS error 8a
Apr 27 19:31:31 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1866:Release slot [4] tag[4], task [ffff8800ba5b4100]:
Apr 27 19:31:31 Tower kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 000007F0,  slot [4].
Apr 27 19:31:31 Tower kernel: sas: sas_ata_task_done: SAS error 8a
Apr 27 19:31:31 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1866:Release slot [5] tag[5], task [ffff8800ba5b4900]:
Apr 27 19:31:31 Tower kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 000007E0,  slot [5].
Apr 27 19:31:31 Tower kernel: sas: sas_ata_task_done: SAS error 8a
Apr 27 19:31:31 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1866:Release slot [6] tag[6], task [ffff8800ba5b4600]:
Apr 27 19:31:31 Tower kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 000007C0,  slot [6].
Apr 27 19:31:31 Tower kernel: sas: sas_ata_task_done: SAS error 8a
Apr 27 19:31:31 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1866:Release slot [7] tag[7], task [ffff8800bb8f7000]:
Apr 27 19:31:31 Tower kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 00000780,  slot [7].
Apr 27 19:31:31 Tower kernel: sas: sas_ata_task_done: SAS error 8a
Apr 27 19:31:31 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1866:Release slot [8] tag[8], task [ffff8800bb8f7500]:
Apr 27 19:31:31 Tower kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 00000700,  slot [8].
Apr 27 19:31:31 Tower kernel: sas: sas_ata_task_done: SAS error 8a
Apr 27 19:31:31 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1866:Release slot [9] tag[9], task [ffff8800bb8f7700]:
Apr 27 19:31:31 Tower kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 00000600,  slot [9].
Apr 27 19:31:31 Tower kernel: sas: sas_ata_task_done: SAS error 8a
Apr 27 19:31:31 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1866:Release slot [a] tag[a], task [ffff8800bb8f7900]:
Apr 27 19:31:31 Tower kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 00000400,  slot [a].
Apr 27 19:31:31 Tower kernel: sas: sas_ata_task_done: SAS error 8a
Apr 27 19:31:31 Tower kernel: sas: Enter sas_scsi_recover_host busy: 10 failed: 10
Apr 27 19:31:31 Tower kernel: sas: ata10: end_device-0:1: cmd error handler
Apr 27 19:31:31 Tower kernel: sas: ata9: end_device-0:0: dev error handler
Apr 27 19:31:31 Tower kernel: sas: ata10: end_device-0:1: dev error handler
Apr 27 19:31:31 Tower kernel: sas: ata11: end_device-0:2: dev error handler
Apr 27 19:31:31 Tower kernel: sas: ata12: end_device-0:3: dev error handler
Apr 27 19:31:31 Tower kernel: sas: ata13: end_device-0:4: dev error handler
Apr 27 19:31:31 Tower kernel: sas: ata14: end_device-0:5: dev error handler
Apr 27 19:31:32 Tower kernel: ata10.00: exception Emask 0x0 SAct 0x7fe0 SErr 0x0 action 0x6
Apr 27 19:31:32 Tower kernel: ata10.00: failed command: WRITE FPDMA QUEUED
Apr 27 19:31:32 Tower kernel: ata10.00: cmd 61/00:00:38:58:e4/04:00:11:02:00/40 tag 5 ncq 524288 out
Apr 27 19:31:32 Tower kernel:         res 01/04:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
Apr 27 19:31:32 Tower kernel: ata10.00: status: { ERR }
Apr 27 19:31:32 Tower kernel: ata10.00: error: { ABRT }
Apr 27 19:31:32 Tower kernel: ata10.00: failed command: WRITE FPDMA QUEUED
Apr 27 19:31:32 Tower kernel: ata10.00: cmd 61/00:00:38:5c:e4/04:00:11:02:00/40 tag 6 ncq 524288 out
Apr 27 19:31:32 Tower kernel:         res 01/04:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
Apr 27 19:31:32 Tower kernel: ata10.00: status: { ERR }
Apr 27 19:31:32 Tower kernel: ata10.00: error: { ABRT }
Apr 27 19:31:32 Tower kernel: ata10.00: failed command: WRITE FPDMA QUEUED
Apr 27 19:31:32 Tower kernel: ata10.00: cmd 61/00:00:38:60:e4/04:00:11:02:00/40 tag 7 ncq 524288 out
Apr 27 19:31:32 Tower kernel:         res 01/04:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
Apr 27 19:31:32 Tower kernel: ata10.00: status: { ERR }
Apr 27 19:31:32 Tower kernel: ata10.00: error: { ABRT }
Apr 27 19:31:32 Tower kernel: ata10.00: failed command: WRITE FPDMA QUEUED
Apr 27 19:31:32 Tower kernel: ata10.00: cmd 61/00:00:38:64:e4/04:00:11:02:00/40 tag 8 ncq 524288 out
Apr 27 19:31:32 Tower kernel:         res 01/04:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
Apr 27 19:31:32 Tower kernel: ata10.00: status: { ERR }
Apr 27 19:31:32 Tower kernel: ata10.00: error: { ABRT }
Apr 27 19:31:32 Tower kernel: ata10.00: failed command: WRITE FPDMA QUEUED
Apr 27 19:31:32 Tower kernel: ata10.00: cmd 61/00:00:38:68:e4/04:00:11:02:00/40 tag 9 ncq 524288 out
Apr 27 19:31:32 Tower kernel:         res 01/04:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
Apr 27 19:31:32 Tower kernel: ata10.00: status: { ERR }
Apr 27 19:31:32 Tower kernel: ata10.00: error: { ABRT }
Apr 27 19:31:32 Tower kernel: ata10.00: failed command: WRITE FPDMA QUEUED
Apr 27 19:31:32 Tower kernel: ata10.00: cmd 61/00:00:38:6c:e4/04:00:11:02:00/40 tag 10 ncq 524288 out
Apr 27 19:31:32 Tower kernel:         res 01/04:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
Apr 27 19:31:32 Tower kernel: ata10.00: status: { ERR }
Apr 27 19:31:32 Tower kernel: ata10.00: error: { ABRT }
Apr 27 19:31:32 Tower kernel: ata10.00: failed command: WRITE FPDMA QUEUED
Apr 27 19:31:32 Tower kernel: ata10.00: cmd 61/00:00:38:70:e4/04:00:11:02:00/40 tag 11 ncq 524288 out
Apr 27 19:31:32 Tower kernel:         res 41/10:00:38:70:e4/00:00:11:02:00/40 Emask 0x481 (invalid argument) <F>
Apr 27 19:31:32 Tower kernel: ata10.00: status: { DRDY ERR }
Apr 27 19:31:32 Tower kernel: ata10.00: error: { IDNF }
Apr 27 19:31:32 Tower kernel: ata10.00: failed command: WRITE FPDMA QUEUED
Apr 27 19:31:32 Tower kernel: ata10.00: cmd 61/00:00:38:74:e4/04:00:11:02:00/40 tag 12 ncq 524288 out
Apr 27 19:31:32 Tower kernel:         res 01/04:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
Apr 27 19:31:32 Tower kernel: ata10.00: status: { ERR }
Apr 27 19:31:32 Tower kernel: ata10.00: error: { ABRT }
Apr 27 19:31:32 Tower kernel: ata10.00: failed command: WRITE FPDMA QUEUED
Apr 27 19:31:32 Tower kernel: ata10.00: cmd 61/00:00:38:78:e4/04:00:11:02:00/40 tag 13 ncq 524288 out
Apr 27 19:31:32 Tower kernel:         res 01/04:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
Apr 27 19:31:32 Tower kernel: ata10.00: status: { ERR }
Apr 27 19:31:32 Tower kernel: ata10.00: error: { ABRT }
Apr 27 19:31:32 Tower kernel: ata10.00: failed command: WRITE FPDMA QUEUED
Apr 27 19:31:32 Tower kernel: ata10.00: cmd 61/00:00:38:7c:e4/04:00:11:02:00/40 tag 14 ncq 524288 out
Apr 27 19:31:32 Tower kernel:         res 01/04:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
Apr 27 19:31:32 Tower kernel: ata10.00: status: { ERR }
Apr 27 19:31:32 Tower kernel: ata10.00: error: { ABRT }
Apr 27 19:31:32 Tower kernel: ata10: hard resetting link
Apr 27 19:31:32 Tower kernel: ata10.00: configured for UDMA/133
Apr 27 19:31:32 Tower kernel: ata10: EH complete
Apr 27 19:31:32 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1

The full syslog is attached with problems starting on line 1307.
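
For anyone comparing their own logs against the excerpt above, here is a minimal Python sketch that tallies the error flags per ATA device and decodes the starting LBA and transfer length from the `cmd` lines. It assumes the taskfile bytes are printed in the order command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device, which is my understanding of the libata error-handler format and not something stated in this thread. For NCQ commands the sector count is carried in the feature bytes. The names `decode_cmd` and `summarize` are made up for illustration. As a sanity check, the decoded length of the first command should equal the 524288 bytes printed on the same log line.

```python
import re
from collections import Counter

# A few lines copied from the log excerpt above.
SAMPLE = """\
Apr 27 19:31:32 Tower kernel: ata10.00: failed command: WRITE FPDMA QUEUED
Apr 27 19:31:32 Tower kernel: ata10.00: cmd 61/00:00:38:58:e4/04:00:11:02:00/40 tag 5 ncq 524288 out
Apr 27 19:31:32 Tower kernel:         res 01/04:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
Apr 27 19:31:32 Tower kernel: ata10.00: error: { ABRT }
Apr 27 19:31:32 Tower kernel: ata10.00: failed command: WRITE FPDMA QUEUED
Apr 27 19:31:32 Tower kernel: ata10.00: cmd 61/00:00:38:70:e4/04:00:11:02:00/40 tag 11 ncq 524288 out
Apr 27 19:31:32 Tower kernel:         res 41/10:00:38:70:e4/00:00:11:02:00/40 Emask 0x481 (invalid argument) <F>
Apr 27 19:31:32 Tower kernel: ata10.00: error: { IDNF }
"""

# Assumed layout: cmd CC/low five bytes/high-order five bytes/device tag N
CMD_RE = re.compile(
    r"(ata\d+\.\d+): cmd ([0-9a-f]{2})/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/[0-9a-f]{2} tag (\d+)"
)
ERR_RE = re.compile(r"(ata\d+\.\d+): error: \{ ([A-Z ]+) \}")


def decode_cmd(line):
    """Return device, tag, 48-bit LBA and sector count for a 'cmd' line, else None."""
    m = CMD_RE.search(line)
    if not m:
        return None
    dev, _cmd, low, high, tag = m.groups()
    feat, _nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hob_feat, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    # For NCQ (FPDMA) commands the sector count lives in the feature bytes.
    sectors = (hob_feat << 8) | feat
    return {"dev": dev, "tag": int(tag), "lba": lba, "sectors": sectors}


def summarize(log_text):
    """Collect decoded commands and count error flags per (device, flag)."""
    commands, errors = [], Counter()
    for line in log_text.splitlines():
        cmd = decode_cmd(line)
        if cmd:
            commands.append(cmd)
        m = ERR_RE.search(line)
        if m:
            for flag in m.group(2).split():
                errors[(m.group(1), flag)] += 1
    return commands, errors


if __name__ == "__main__":
    cmds, errs = summarize(SAMPLE)
    for c in cmds:
        print(c["dev"], "tag", c["tag"], "LBA", hex(c["lba"]),
              "bytes", c["sectors"] * 512)
    for (dev, flag), count in sorted(errs.items()):
        print(dev, flag, count)
```

To run it against a full syslog, pass the file contents to `summarize` in place of `SAMPLE`. A tally dominated by ABRT on a single device, with one IDNF among them, matches the pattern in the excerpt, but the script only counts what the kernel logged and says nothing about the cause.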

In the past this type of error would typically be accompanied by a dropped disk which required a new config to bring back into the array. It also usually resulted in parity errors which would need to be corrected in the next parity check.

This time (new power supply installed) it appears that a) I still have an error, b) I didn't drop a disk this time, and c) I'm not seeing any parity errors on the check I'm running right now.

Any thoughts? Is this a significant error or something the server is recovering from? Do any of you see anything similar when spinning up 6TB WD Reds, or disks on an AOC-SAS2LP, for immediate access via a program like Sickbeard or Mover? It's hard to declare victory with the new power supply when I'm still seeing this in the logs...

Thanks

syslog.zip

dgaschk · April 28, 2015

Look for BIOS and firmware updates.

tdallen · April 29, 2015

Hi - both are up to date.

remotevisitor · April 29, 2015

My brother and I have both had problems with 6TB WD Red disks connected to an AOC-SASLP-MV8, with very similar symptoms. We had no such problems until we replaced some existing disks with the 6TB ones.

Our workaround was to set the 6TB disks to never spin down. Not an ideal solution, but better than having to reboot, run a file system check, and rebuild parity every few weeks.

However, this past week we both tried setting the disks back to the default spin down to see if the latest beta has cured the problem for us. Time will tell if this is the case.

tdallen · April 29, 2015

Thanks. Is there a way to specify that only certain drives remain spinning? I know where to set them all to spin down (or not), but haven't found where to do it for individual drives...

itimpi · April 29, 2015

Quote from tdallen: "Thanks. Is there a way to specify that only certain drives remain spinning? I know where to set them all to spin down (or not), but haven't found where to do it for individual drives..."

Click on a drive in the Main tab and you can set various parameters (including Spin Down delay) for that particular drive.

dgaschk · April 29, 2015

This may require an HDD or HBA firmware update, or a BIOS update.

tdallen · May 1, 2015

@remotevisitor and @itimpi - Thanks, I've set the two 6TB drives not to spin down. Am I being paranoid? The system appears to have recovered in the example above by resetting the link, but those errors don't look healthy to me and in the past they resulted in parity errors.

@dgaschk - While I agree that these errors look like they could be solved with a firmware-type update, I'm not having any luck. The motherboard BIOS is up to date with the latest 2104. The AOC-SAS2LP is running the latest 4.0.0.1812 BIOS. There isn't much information available on updates to NASWare 3.0 on the WD Red 6TB drives. Both are WD60EFRX 68MYMN1 drives, though, which in theory is a minor update over the original 68MYMN0 drives and, again, the latest available.

So I guess the question is - are the errors above scary enough to take an action like preventing the 6TB drives from spinning down? I don't want to needlessly put wear on the drives, but I also don't want to compromise the integrity of the array. I'd also rather not build out a whole new server, though I have considered it as I've struggled to stabilize unRAID 6 using the Supermicro HBA and disks on this motherboard...
