[4.4.2] file copy failures, exception frozen/timeouts

January 8, 2009

Within the last week, I have been getting exception frozen/timeout errors when transferring files to my server. It had previously been rock solid. From reading previous posts, I suspected a cable issue, so last night I replaced all SATA cables with new ones. I also upgraded to 4.4.2, although that shouldn't have anything to do with this issue. Same problem again today: I get an error when copying files stating that the file can't be found (on the source drive). Also, after the error occurs, unRAID is inaccessible via HTTP (the browser just sits there waiting for a reply from the server). I'm going to perform full SMART tests on all drives, but any advice would be greatly appreciated. I've attached an older log (prior to cable replacement), dated 2009-01-05. Here's an excerpt from my syslog (finally able to attach it!):

Jan  7 19:10:24 NAS kernel: ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Jan  7 19:10:24 NAS kernel: ata3.00: cmd 60/00:00:77:b9:42/04:00:4f:00:00/40 tag 0 ncq 524288 in
Jan  7 19:10:24 NAS kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  7 19:10:24 NAS kernel: ata3.00: status: { DRDY }
Jan  7 19:10:24 NAS kernel: ata3: soft resetting link
Jan  7 19:10:24 NAS kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan  7 19:10:29 NAS kernel: ata3.00: qc timeout (cmd 0x27)
Jan  7 19:10:29 NAS kernel: ata3.00: failed to read native max address (err_mask=0x4)
Jan  7 19:10:29 NAS kernel: ata3.00: HPA support seems broken, skipping HPA handling
Jan  7 19:10:29 NAS kernel: ata3.00: revalidation failed (errno=-5)
Jan  7 19:10:30 NAS kernel: ata3: soft resetting link
Jan  7 19:10:30 NAS kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan  7 19:10:30 NAS kernel: ata3.00: configured for UDMA/133
Jan  7 19:10:30 NAS kernel: ata3: EH complete
Jan  7 19:10:30 NAS kernel: sd 3:0:0:0: [sda] 1465149168 512-byte hardware sectors (750156 MB)
Jan  7 19:10:30 NAS kernel: sd 3:0:0:0: [sda] Write Protect is off
Jan  7 19:10:30 NAS kernel: sd 3:0:0:0: [sda] Mode Sense: 00 3a 00 00
Jan  7 19:10:30 NAS kernel: sd 3:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan  7 19:11:30 NAS kernel: ata3: EH in SWNCQ mode,QC:qc_active 0x1 sactive 0x1
Jan  7 19:11:30 NAS kernel: ata3: SWNCQ:qc_active 0x1 defer_bits 0x0 last_issue_tag 0x0
Jan  7 19:11:30 NAS kernel:   dhfis 0x0 dmafis 0x0 sdbfis 0x0
Jan  7 19:11:30 NAS kernel: ata3: ATA_REG 0x40 ERR_REG 0x0
Jan  7 19:11:30 NAS kernel: ata3: tag : dhfis dmafis sdbfis sacitve
Jan  7 19:11:30 NAS kernel: ata3: tag 0x0: 0 0 0 1  
Jan  7 19:11:30 NAS kernel: ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Jan  7 19:11:30 NAS kernel: ata3.00: cmd 60/00:00:77:b9:42/04:00:4f:00:00/40 tag 0 ncq 524288 in
Jan  7 19:11:30 NAS kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  7 19:11:30 NAS kernel: ata3.00: status: { DRDY }
Jan  7 19:11:31 NAS kernel: ata3: soft resetting link
Jan  7 19:11:31 NAS kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan  7 19:11:31 NAS kernel: ata3.00: configured for UDMA/133
Jan  7 19:11:31 NAS kernel: ata3: EH complete
Jan  7 19:11:31 NAS kernel: sd 3:0:0:0: [sda] 1465149168 512-byte hardware sectors (750156 MB)
Jan  7 19:11:31 NAS kernel: sd 3:0:0:0: [sda] Write Protect is off
Jan  7 19:11:31 NAS kernel: sd 3:0:0:0: [sda] Mode Sense: 00 3a 00 00
Jan  7 19:11:31 NAS kernel: sd 3:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan  7 19:12:31 NAS kernel: ata3: EH in SWNCQ mode,QC:qc_active 0x1 sactive 0x1
Jan  7 19:12:31 NAS kernel: ata3: SWNCQ:qc_active 0x1 defer_bits 0x0 last_issue_tag 0x0
Jan  7 19:12:31 NAS kernel:   dhfis 0x0 dmafis 0x0 sdbfis 0x0
Jan  7 19:12:31 NAS kernel: ata3: ATA_REG 0x40 ERR_REG 0x0
Jan  7 19:12:31 NAS kernel: ata3: tag : dhfis dmafis sdbfis sacitve
Jan  7 19:12:31 NAS kernel: ata3: tag 0x0: 0 0 0 1  
Jan  7 19:12:31 NAS kernel: ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Jan  7 19:12:31 NAS kernel: ata3.00: cmd 60/00:00:77:b9:42/04:00:4f:00:00/40 tag 0 ncq 524288 in
Jan  7 19:12:31 NAS kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  7 19:12:31 NAS kernel: ata3.00: status: { DRDY }
Jan  7 19:12:32 NAS kernel: ata3: soft resetting link
Jan  7 19:12:32 NAS kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan  7 19:12:32 NAS kernel: ata3.00: configured for UDMA/133
Jan  7 19:12:32 NAS kernel: ata3: EH complete
Jan  7 19:12:32 NAS kernel: sd 3:0:0:0: [sda] 1465149168 512-byte hardware sectors (750156 MB)
Jan  7 19:12:32 NAS kernel: sd 3:0:0:0: [sda] Write Protect is off
Jan  7 19:12:32 NAS kernel: sd 3:0:0:0: [sda] Mode Sense: 00 3a 00 00
Jan  7 19:12:32 NAS kernel: sd 3:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan  7 19:13:32 NAS kernel: ata3: EH in SWNCQ mode,QC:qc_active 0x1 sactive 0x1
Jan  7 19:13:32 NAS kernel: ata3: SWNCQ:qc_active 0x1 defer_bits 0x0 last_issue_tag 0x0
Jan  7 19:13:32 NAS kernel:   dhfis 0x0 dmafis 0x0 sdbfis 0x0
Jan  7 19:13:32 NAS kernel: ata3: ATA_REG 0x40 ERR_REG 0x0
Jan  7 19:13:32 NAS kernel: ata3: tag : dhfis dmafis sdbfis sacitve
Jan  7 19:13:32 NAS kernel: ata3: tag 0x0: 0 0 0 1  
Jan  7 19:13:32 NAS kernel: ata3.00: NCQ disabled due to excessive errors
Jan  7 19:13:32 NAS kernel: ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Jan  7 19:13:32 NAS kernel: ata3.00: cmd 60/00:00:77:b9:42/04:00:4f:00:00/40 tag 0 ncq 524288 in
Jan  7 19:13:32 NAS kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  7 19:13:32 NAS kernel: ata3.00: status: { DRDY }
Jan  7 19:13:33 NAS kernel: ata3: soft resetting link
Jan  7 19:13:33 NAS kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan  7 19:13:33 NAS kernel: ata3.00: configured for UDMA/133
Jan  7 19:13:33 NAS kernel: ata3: EH complete
Jan  7 19:13:33 NAS kernel: sd 3:0:0:0: [sda] 1465149168 512-byte hardware sectors (750156 MB)
Jan  7 19:13:33 NAS kernel: sd 3:0:0:0: [sda] Write Protect is off
Jan  7 19:13:33 NAS kernel: sd 3:0:0:0: [sda] Mode Sense: 00 3a 00 00
Jan  7 19:13:33 NAS kernel: sd 3:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
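
For anyone decoding the excerpt above, here is a minimal Python sketch that unpacks the "cmd" line of a libata error report. It assumes the field order the kernel uses for this dump (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and the standard ATA opcode names; decode_cmd and the OPCODES table are illustrative names, not part of any existing tool.

```python
import re

# Opcodes seen in this thread's excerpts, named per the ATA command set.
OPCODES = {
    0x27: "READ NATIVE MAX ADDRESS EXT",
    0x60: "READ FPDMA QUEUED",
    0x61: "WRITE FPDMA QUEUED",
    0xE5: "CHECK POWER MODE",
}

FIVE = r"((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
CMD_RE = re.compile(r"cmd ([0-9a-f]{2})/" + FIVE + "/" + FIVE + r"/([0-9a-f]{2})")


def decode_cmd(line):
    """Return opcode name, LBA and NCQ transfer size for a libata 'cmd' line."""
    m = CMD_RE.search(line)
    if m is None:
        return None
    opcode = int(m.group(1), 16)
    # Low-order taskfile bytes, then the "hob" (high-order) bytes.
    feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in m.group(2).split(":"))
    hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in m.group(3).split(":")
    )
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    sectors = None
    if opcode in (0x60, 0x61):
        # NCQ commands carry the sector count in the feature fields.
        sectors = hob_feature << 8 | feature
    return {
        "opcode": opcode,
        "name": OPCODES.get(opcode, "unknown"),
        "lba": lba,
        "sectors": sectors,
        "bytes": None if sectors is None else sectors * 512,
    }


line = ("Jan  7 19:10:24 NAS kernel: ata3.00: cmd "
        "60/00:00:77:b9:42/04:00:4f:00:00/40 tag 0 ncq 524288 in")
print(decode_cmd(line))
```

For the failing command above this gives a 1024-sector (524288-byte) queued read, which matches the "ncq 524288 in" the kernel printed on the same line.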

January 8, 2009

What motherboard?

January 8, 2009

This looks like an nForce board like mine (an Epox MF570 based on nForce 570 chipset). I found that the swncq=0 boot option greatly alleviated the problem, although it did not completely eliminate these errors. I have been able to run successfully since adding it.

I have now added swncq=0 to the Boot Codes wiki page; I had forgotten to add it before.

January 8, 2009

Author

What motherboard?

EVGA 122-CK-NF68-AR LGA 775 NVIDIA nForce 680i SLI ATX

January 8, 2009

Author

I will try that now, thanks for your help. I see on that page it says "only if needed, for boards based on the nForce 5 or higher chipsets, and only for recent kernel versions included with unRAID v4.4-beta2 or later". I realized the timing of this problem coincides with my upgrade to 4.4 (betas), so I've got my fingers crossed that this is it...

January 8, 2009

Author

So far, so good! I've copied 30 gigs so far, no problem. Thanks again.

January 9, 2009

Maybe somewhat related to what the thread-starter is experiencing:

Starting with unRAID 4.4beta2, one of my seven hard drives has been consistently having problems, whereas everything was fine with unRAID 4.3 final.

I have mentioned this previously in another thread:

http://lime-technology.com/forum/index.php?topic=1771.msg22207#msg22207

I am still getting the same error with both unRAID 4.4 final and unRAID 4.4.2.

I have also tried adding swncq=0 to the boot options in syslinux.cfg, without any effect.

Quick summary:

On my Epox 9NPA+ Ultra (nForce4 Ultra) motherboard with an AMD X2 3800+ CPU, the following happens:

During boot (and continuing forever) I get a lot of these messages:

Jan 9 22:51:47 Tower kernel: ata4: COMRESET failed (errno=-16)
Jan 9 22:51:47 Tower kernel: ata4: hard resetting link
Jan 9 22:51:53 Tower kernel: ata4: link is slow to respond, please be patient (ready=0)
Jan 9 22:51:54 Tower ntpd[1267]: synchronized to 80.239.2.130, stratum 2
Jan 9 21:51:53 Tower ntpd[1267]: time reset -3600.558942 s
Jan 9 21:52:21 Tower kernel: ata4: COMRESET failed (errno=-16)
Jan 9 21:52:21 Tower kernel: ata4: limiting SATA link speed to 1.5 Gbps
Jan 9 21:52:21 Tower kernel: ata4: hard resetting link
Jan 9 21:52:26 Tower kernel: ata4: COMRESET failed (errno=-16)
Jan 9 21:52:26 Tower kernel: ata4: reset failed, giving up
Jan 9 21:52:26 Tower kernel: ata4: exception Emask 0x10 SAct 0x0 SErr 0x1950000 action 0xe frozen t4
Jan 9 21:52:26 Tower kernel: ata4: SError: { PHYRdyChg CommWake Dispar LinkSeq TrStaTrns }
Jan 9 21:52:26 Tower kernel: ata4: hard resetting link
Jan 9 21:52:32 Tower kernel: ata4: link is slow to respond, please be patient (ready=0)
Jan 9 21:52:36 Tower kernel: ata4: COMRESET failed (errno=-16)
Jan 9 21:52:36 Tower kernel: ata4: hard resetting link
Jan 9 21:52:42 Tower kernel: ata4: link is slow to respond, please be patient (ready=0)
Jan 9 21:52:45 Tower emhttp: shcmd (13): /usr/sbin/hdparm -y /dev/sde >/dev/null
Jan 9 21:52:46 Tower kernel: ata4: COMRESET failed (errno=-16)
Jan 9 21:52:46 Tower kernel: ata4: hard resetting link
Jan 9 21:52:52 Tower kernel: ata4: link is slow to respond, please be patient (ready=0)

UPDATE:

By the way, there is an interesting discussion of this Linux kernel bug here:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/256637

January 10, 2009

From the link to that bug discussion, it looks like you have company, if that's any comfort! Perhaps y'all can start your own support group, the Nforce Comreset Sufferers Support Group!

It doesn't look like there is anything we can do to help you. You can wait for them to resolve this, and stay with an older version of unRAID for now. Or get another motherboard. I'm sorry.

The swncq=0 boot option only helps when you see SWNCQ within the exception handler error messages in the syslog.
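
The check described above (is SWNCQ mentioned in the exception handler messages for the failing port?) can be sketched in a few lines of Python. This is only an illustration: the summarize function and the SAMPLE list are made-up names, the sample lines are copied from the excerpts earlier in this thread, and in practice you would feed it the lines of your own syslog.

```python
import re
from collections import Counter

# Matches kernel lines for a SATA port, e.g. "kernel: ata3.00: ..." or "kernel: ata4: ...".
LINE_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")


def summarize(lines):
    """Count 'exception ... frozen' reports per port and note ports that mention SWNCQ."""
    frozen = Counter()
    swncq_ports = set()
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue
        port, msg = m.groups()
        if "SWNCQ" in msg:
            swncq_ports.add(port)
        if msg.startswith("exception") and "frozen" in msg:
            frozen[port] += 1
    return frozen, swncq_ports


# Sample lines taken from the two syslog excerpts posted in this thread.
SAMPLE = [
    "Jan  7 19:11:30 NAS kernel: ata3: EH in SWNCQ mode,QC:qc_active 0x1 sactive 0x1",
    "Jan  7 19:11:30 NAS kernel: ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen",
    "Jan 9 21:52:26 Tower kernel: ata4: COMRESET failed (errno=-16)",
    "Jan 9 21:52:26 Tower kernel: ata4: exception Emask 0x10 SAct 0x0 SErr 0x1950000 action 0xe frozen t4",
]

frozen, swncq_ports = summarize(SAMPLE)
for port in sorted(frozen):
    hint = "SWNCQ seen, swncq=0 may apply" if port in swncq_ports else "no SWNCQ lines"
    print(port, frozen[port], hint)
```

On these samples, ata3 (the thread-starter's port) is flagged as an SWNCQ case, while ata4 (the COMRESET failures) is not, which is consistent with swncq=0 having no effect there.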

January 11, 2009

Will the Nvidia support be resolved any time soon?

If not, how do we add drivers to unRAID to support it better?

I think I will duck now, as I am sure this has been answered somewhere....

Seeing how I made the mistake of buying an Asus P5NT WS, which has NVIDIA dual GigE NICs and SATA controllers...

Which board should I have bought to be 100% supported, using an Intel CPU (socket 775) and DDR2 memory, with 1 PCI-X, many PCI-E, and 1 PCI slot?

Maybe I will swap MB.

January 11, 2009

Although it is true that earlier versions of the nForce chipsets had some serious issues, and the later versions (such as your 680i) have needed some workarounds, I don't believe you have a basis for your pessimism. I have never heard of any issues with the nVidia NICs or SATA controllers. Your specific Asus board does have some bad reviews, especially regarding the PCI-X slot, which may be problematic. What problems have you yourself had with this board?

Elsewhere, I believe, there is a recommendation for a board with both PCI Express and PCI-X slots.

January 11, 2009

Sorry if I sound pessimistic, I do not mean to sound that way.

When it comes to Linux I am a newbie; if this were AIX I would be running around trailblazing!

I love what you guys are doing with unRAID!!

I have come across a few things I am trying to resolve.

One, it seems I can overload the unRAID system when copying files from two computers at once and when the cache drive is having files moved off.

Freeze-ups when the cache drive is full.

I can live with it if I take more time to move things (1 server at a time); then it does not seem to happen. I will capture the syslog next time it happens.

I just had a second drive go red on me. I ran reiserfsck and found no corruption, so when I am done transferring files from my software RAID 5 Windows box, I plan on replacing the cable or port connection and, if that does not work, the drive (with one of the 750GB drives from my Windows server's RAID 5).

I did try running "root@Tower:/usr/src/linux/drivers/md# smartctl -a -d ata /dev/sda" but I get the following error: "smartctl: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory"

When a drive goes red, does unRAID stop writing to it completely?

On the NIC side of things, I see many

"Jan 9 16:09:06 Tower kernel: eth0: too many iterations (6) in nv_nic_irq." on many days

Again, this seems to occur less if I only copy from one location.

I was going to test a full slax install to see how it performed vs the pre-built distro of unRAID. Need time.

Is this patch relevant to the NVIDIA SATA controllers and NICs?

http://www.linuxquestions.org/questions/slackware-14/nvidia-driver-wont-compile-693153/

Thank you for the awesome support! I am slowly turning the tide on my friends who are starting to test unraid for themselves... I am sure you will have more customers shortly.

February 14, 2009

Author

After over a month of no issues, I had multiple drives bomb out on me last night during an attempted periodic cache-to-data move. I had to power down the server ungracefully and reboot. Parity check completed with no errors. I have the swncq=0 boot code specified. I notice that each of the three drives that had 1000+ errors now shows udma_crc_error_count=1 under the myMain Smart View. I don't see how the cables could be bad... but is that what this is telling me? I'm at a loss for how to keep this from happening. Here's part of the 2.5MB syslog (messages like these repeat, along with thousands of lines of handle_stripe read errors):

Feb 13 03:40:52 NAS kernel: ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 13 03:40:52 NAS kernel: ata6.00: cmd e5/00:00:00:00:00/00:00:00:00:00/40 tag 0
Feb 13 03:40:52 NAS kernel:          res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Feb 13 03:40:52 NAS kernel: ata6.00: status: { DRDY }
Feb 13 03:40:52 NAS kernel: ata6: soft resetting link
Feb 13 03:40:52 NAS kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 13 03:40:57 NAS kernel: ata6.00: qc timeout (cmd 0x27)
Feb 13 03:40:57 NAS kernel: ata6.00: failed to read native max address (err_mask=0x4)
Feb 13 03:40:57 NAS kernel: ata6.00: HPA support seems broken, skipping HPA handling
Feb 13 03:40:57 NAS kernel: ata6.00: revalidation failed (errno=-5)
Feb 13 03:40:57 NAS kernel: ata6: soft resetting link
Feb 13 03:40:58 NAS kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 13 03:40:58 NAS kernel: ata6.00: configured for UDMA/133
Feb 13 03:40:58 NAS kernel: ata6: EH complete
Feb 13 03:40:58 NAS kernel: sd 6:0:0:0: [sdf] 1953523055 512-byte hardware sectors (1000204 MB)
Feb 13 03:40:58 NAS kernel: sd 6:0:0:0: [sdf] Write Protect is off
Feb 13 03:40:58 NAS kernel: sd 6:0:0:0: [sdf] Mode Sense: 00 3a 00 00
Feb 13 03:40:58 NAS kernel: sd 6:0:0:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 13 03:41:19 NAS kernel: ata5: EH in SWNCQ mode,QC:qc_active 0x1 sactive 0x1
Feb 13 03:41:19 NAS kernel: ata5: SWNCQ:qc_active 0x1 defer_bits 0x0 last_issue_tag 0x0
Feb 13 03:41:19 NAS kernel:   dhfis 0x1 dmafis 0x0 sdbfis 0x0
Feb 13 03:41:19 NAS kernel: ata5: ATA_REG 0x40 ERR_REG 0x0
Feb 13 03:41:19 NAS kernel: ata5: tag : dhfis dmafis sdbfis sacitve
Feb 13 03:41:19 NAS kernel: ata5: tag 0x0: 1 0 0 1  
Feb 13 03:41:19 NAS kernel: ata5.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Feb 13 03:41:19 NAS kernel: ata5.00: cmd 61/00:00:3f:40:18/04:00:2c:00:00/40 tag 0 ncq 524288 out
Feb 13 03:41:19 NAS kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 13 03:41:19 NAS kernel: ata5.00: status: { DRDY }
Feb 13 03:41:20 NAS kernel: ata5: soft resetting link
Feb 13 03:41:25 NAS kernel: ata5: link is slow to respond, please be patient (ready=0)
Feb 13 03:41:30 NAS kernel: ata5: SRST failed (errno=-16)
Feb 13 03:41:30 NAS kernel: ata5: soft resetting link
Feb 13 03:41:35 NAS kernel: ata5: link is slow to respond, please be patient (ready=0)
Feb 13 03:41:40 NAS kernel: ata5: SRST failed (errno=-16)
Feb 13 03:41:40 NAS kernel: ata5: soft resetting link
Feb 13 03:41:40 NAS kernel: ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 13 03:41:45 NAS kernel: ata5.00: qc timeout (cmd 0x27)
Feb 13 03:41:45 NAS kernel: ata5.00: failed to read native max address (err_mask=0x4)
Feb 13 03:41:45 NAS kernel: ata5.00: HPA support seems broken, skipping HPA handling
Feb 13 03:41:45 NAS kernel: ata5.00: revalidation failed (errno=-5)
Feb 13 03:41:46 NAS kernel: ata5: soft resetting link
Feb 13 03:41:46 NAS kernel: ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 13 03:41:46 NAS kernel: ata5.00: configured for UDMA/133
Feb 13 03:41:46 NAS kernel: ata5: EH complete
Feb 13 03:42:46 NAS kernel: ata5: EH in SWNCQ mode,QC:qc_active 0x1 sactive 0x1
Feb 13 03:42:46 NAS kernel: ata5: SWNCQ:qc_active 0x1 defer_bits 0x0 last_issue_tag 0x0
Feb 13 03:42:46 NAS kernel:   dhfis 0x0 dmafis 0x0 sdbfis 0x0
Feb 13 03:42:46 NAS kernel: ata5: ATA_REG 0x40 ERR_REG 0x0
Feb 13 03:42:46 NAS kernel: ata5: tag : dhfis dmafis sdbfis sacitve
Feb 13 03:42:46 NAS kernel: ata5: tag 0x0: 0 0 0 1  
Feb 13 03:42:46 NAS kernel: ata5.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Feb 13 03:42:46 NAS kernel: ata5.00: cmd 61/00:00:3f:40:18/04:00:2c:00:00/40 tag 0 ncq 524288 out
Feb 13 03:42:46 NAS kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 13 03:42:46 NAS kernel: ata5.00: status: { DRDY }
Feb 13 03:42:46 NAS kernel: ata5: soft resetting link
Feb 13 03:42:52 NAS kernel: ata5: link is slow to respond, please be patient (ready=0)
Feb 13 03:42:57 NAS kernel: ata5: SRST failed (errno=-16)
Feb 13 03:42:57 NAS kernel: ata5: soft resetting link
Feb 13 03:43:02 NAS kernel: ata5: link is slow to respond, please be patient (ready=0)
Feb 13 03:43:07 NAS kernel: ata5: SRST failed (errno=-16)
Feb 13 03:43:07 NAS kernel: ata5: soft resetting link

February 14, 2009

smino: I deeply apologize, I had forgotten about your post, found it about 50 or 60 open Firefox tabs back. I still hope to respond to several of your questions, as I have had similar issues.

Moshpitius: It's hard to be definitive in a case like this, even with all of the evidence. Although I cannot say for sure, it really looks to me as if you had a serious crash or major electrical spike, one that the system was unable to recover from. Can you verify that the UDMA CRC errors occurred at about the same time as the other errors began? Check each drive's SMART report, in the section listing the 5 most recent logged errors, and see whether these appear to have occurred very recently, that is, whether the hour of occurrence is the same number of hours back from the current power-on hours as the start of the errors. As you say, it is very unlikely that the cables all went bad at the same time, especially if that coincided with the start of all the other errors. Another, smaller possibility is that something got far too hot without shutting down the computer.
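
To make the timing check concrete, here is a small Python sketch of the arithmetic. The SMART report gives the drive's current power-on hours and, for each logged error, the power-on lifetime hour at which it occurred; subtracting one from the other tells you how long ago the error happened. The function names and all of the numbers below are hypothetical, chosen only to show the calculation, and are not taken from any drive in this thread.

```python
def hours_since_error(power_on_hours, error_lifetime_hours):
    """How many powered-on hours ago a logged SMART error occurred."""
    return power_on_hours - error_lifetime_hours


def is_recent(power_on_hours, error_lifetime_hours, window_hours=48):
    """True if the error falls within the last window_hours of powered-on time."""
    age = hours_since_error(power_on_hours, error_lifetime_hours)
    return 0 <= age <= window_hours


# Hypothetical values: the drive reports 5230 power-on hours now,
# and its most recent logged error occurred at lifetime hour 5208.
current_hours = 5230
error_hour = 5208

print(hours_since_error(current_hours, error_hour), "hours ago")
print("recent" if is_recent(current_hours, error_hour) else "old")
```

If the server has been powered on continuously, the result can be compared directly with the wall-clock time since the syslog errors began; an error logged thousands of hours back would point to an older, unrelated event.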

After a major problem such as the one you have had, any number of errors could occur, so the size of the syslog and the quantity and type of errors since then only indicate that something serious happened earlier, something unrecoverable at the time. Powering off completely should resolve that, and a parity check should confirm everything is OK.

Was there possibly any external indication of an electrical fault, such as lights dimming or close lightning strike, at the time of start of failure? I don't have any further ideas.

By the way, I have discovered since the earlier discussion of SWNCQ, that I no longer need swncq=0 or noapic for my nForce570 board. See this post. I would be very interested in your comments, and any testing you may attempt.

February 14, 2009

Author

Thank you very much RobJ. I will take a look at the SMART report to confirm the timing. I don't know if there was a major electrical spike, but regrettably my NAS is not running on a UPS (buying one this weekend finally). What's interesting is I notice more exception timeout errors in the most recent syslog (since I powered the server back on). The forum isn't letting me upload attachments, but I copied the error messages for ata5 below... received a similar message for ata4. I ran reiserfsck on a few drives and they were okay. I will inspect the cables now for the heck of it.

Thanks for the update on swncq=0. I will do some testing and post my results.

Feb 14 01:48:26 NAS emhttp: shcmd (26): /usr/sbin/hdparm -y /dev/sdc >/dev/null
Feb 14 01:48:27 NAS emhttp: shcmd (27): /usr/sbin/hdparm -y /dev/sdf >/dev/null
Feb 14 02:13:28 NAS emhttp: shcmd (28): /usr/sbin/hdparm -y /dev/sde >/dev/null
Feb 14 02:30:29 NAS emhttp: shcmd (29): /usr/sbin/hdparm -y /dev/sdd >/dev/null
Feb 14 02:30:29 NAS emhttp: shcmd (30): /usr/sbin/hdparm -y /dev/sdb >/dev/null
Feb 14 02:51:30 NAS emhttp: shcmd (31): /usr/sbin/hdparm -y /dev/sdc >/dev/null
Feb 14 02:51:31 NAS emhttp: shcmd (32): /usr/sbin/hdparm -y /dev/sdf >/dev/null
Feb 14 03:17:32 NAS emhttp: shcmd (33): /usr/sbin/hdparm -y /dev/sde >/dev/null
Feb 14 03:32:33 NAS emhttp: shcmd (34): /usr/sbin/hdparm -y /dev/sdd >/dev/null
Feb 14 03:32:33 NAS emhttp: shcmd (35): /usr/sbin/hdparm -y /dev/sdb >/dev/null
Feb 14 03:52:34 NAS emhttp: shcmd (36): /usr/sbin/hdparm -y /dev/sdc >/dev/null
Feb 14 03:52:35 NAS emhttp: shcmd (37): /usr/sbin/hdparm -y /dev/sdf >/dev/null
Feb 14 04:23:36 NAS emhttp: shcmd (38): /usr/sbin/hdparm -y /dev/sde >/dev/null
Feb 14 04:33:37 NAS emhttp: shcmd (39): /usr/sbin/hdparm -y /dev/sdd >/dev/null
Feb 14 04:33:37 NAS emhttp: shcmd (40): /usr/sbin/hdparm -y /dev/sdb >/dev/null
Feb 14 04:53:38 NAS emhttp: shcmd (41): /usr/sbin/hdparm -y /dev/sdc >/dev/null
Feb 14 04:53:39 NAS emhttp: shcmd (42): /usr/sbin/hdparm -y /dev/sdf >/dev/null
Feb 14 05:28:40 NAS emhttp: shcmd (43): /usr/sbin/hdparm -y /dev/sde >/dev/null
Feb 14 05:29:18 NAS kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 14 05:29:18 NAS kernel: ata5.00: cmd e5/00:00:00:00:00/00:00:00:00:00/40 tag 0
Feb 14 05:29:18 NAS kernel:          res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Feb 14 05:29:18 NAS kernel: ata5.00: status: { DRDY }
Feb 14 05:29:21 NAS kernel: ata5: soft resetting link
Feb 14 05:29:21 NAS kernel: ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 14 05:29:21 NAS kernel: ata5.00: configured for UDMA/133
Feb 14 05:29:21 NAS kernel: ata5: EH complete
Feb 14 05:29:21 NAS kernel: sd 5:0:0:0: [sde] 1953525168 512-byte hardware sectors (1000205 MB)
Feb 14 05:29:21 NAS kernel: sd 5:0:0:0: [sde] Write Protect is off
Feb 14 05:29:21 NAS kernel: sd 5:0:0:0: [sde] Mode Sense: 00 3a 00 00
Feb 14 05:29:21 NAS kernel: sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 14 05:39:41 NAS emhttp: shcmd (44): /usr/sbin/hdparm -y /dev/sdd >/dev/null
Feb 14 05:39:41 NAS emhttp: shcmd (45): /usr/sbin/hdparm -y /dev/sdb >/dev/null
Feb 14 05:54:42 NAS emhttp: shcmd (46): /usr/sbin/hdparm -y /dev/sdc >/dev/null
Feb 14 05:54:42 NAS emhttp: shcmd (47): /usr/sbin/hdparm -y /dev/sdf >/dev/null
Feb 14 06:29:43 NAS emhttp: shcmd (48): /usr/sbin/hdparm -y /dev/sde >/dev/null
Feb 14 06:30:12 NAS kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 14 06:30:12 NAS kernel: ata5.00: cmd e5/00:00:00:00:00/00:00:00:00:00/40 tag 0
Feb 14 06:30:12 NAS kernel:          res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Feb 14 06:30:12 NAS kernel: ata5.00: status: { DRDY }
Feb 14 06:30:16 NAS kernel: ata5: soft resetting link
Feb 14 06:30:16 NAS kernel: ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 14 06:30:16 NAS kernel: ata5.00: configured for UDMA/133
Feb 14 06:30:16 NAS kernel: ata5: EH complete
Feb 14 06:30:16 NAS kernel: sd 5:0:0:0: [sde] 1953525168 512-byte hardware sectors (1000205 MB)
Feb 14 06:30:16 NAS kernel: sd 5:0:0:0: [sde] Write Protect is off
Feb 14 06:30:16 NAS kernel: sd 5:0:0:0: [sde] Mode Sense: 00 3a 00 00
Feb 14 06:30:16 NAS kernel: sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

February 14, 2009

Author

RobJ: Checked on the cables, they appear to be fine. I noticed the following in my syslog for ata4 and ata6:

Feb 14 16:17:17 NAS kernel: ata4.00: HPA detected: current 2930277168, native 18446744072344861488
Feb 14 16:17:17 NAS kernel: ata6.00: HPA detected: current 1953523055, native 1953525168

I was reading a thread about resetting HPA (http://lime-technology.com/forum/index.php?topic=3166.0). Could this be at all related?

I also noticed the following at the end of my current syslog:

Feb 14 17:15:45 NAS kernel: eth0: too many iterations (6) in nv_nic_irq.

Feb 14 17:16:16 NAS last message repeated 35 times

I think I've read about this before, need to find those threads and review to see if this is a problem...

February 15, 2009

Feb 14 16:17:17 NAS kernel: ata4.00: HPA detected: current 2930277168, native 18446744072344861488

This one is not an HPA, but a harmless bug, either in the disk driver, or the drive firmware, and is typical of Seagate 1.5TB drives. It appears as if one of those 'agents' is thinking 32 bit numbers, and the other is thinking 64 bit numbers. For sure, the logging agent is trying to display a much larger number (stored in more bytes) than was provided it, which leads to very large garbage numbers. I'm going to guess that the Seagate is sending a 4 byte number, but the driver is thinking it received an 8 byte number, so it displays the 4 it receives plus the random 4 immediately following. They need to get their protocols back in sync. I'm sure this will disappear in a future kernel release.
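
As a side note on where the huge number comes from: the two values in that log line are related exactly as if the drive's 32-bit sector count had been treated as a signed number and sign-extended to 64 bits, which fills the upper 4 bytes with ones rather than with arbitrary data. The Python sketch below only demonstrates that arithmetic on the logged values; which component (driver or firmware) actually does this is not something the log shows, and sign_extend_32_to_64 is just an illustrative helper name.

```python
current = 2930277168                   # "current" from the ata4.00 log line
native_logged = 18446744072344861488   # "native" from the same line


def sign_extend_32_to_64(value):
    """Interpret the low 32 bits as signed, then view the result as unsigned 64-bit."""
    value &= 0xFFFFFFFF
    if value & 0x80000000:       # top bit set: negative as a signed 32-bit number
        value -= 1 << 32
    return value & 0xFFFFFFFFFFFFFFFF


print(sign_extend_32_to_64(current) == native_logged)
print(hex(native_logged))
print(native_logged & 0xFFFFFFFF == current)
```

Both comparisons hold: the low 32 bits of the bogus "native" value are the drive's real size, so no sectors are actually hidden on that drive.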

Feb 14 16:17:17 NAS kernel: ata6.00: HPA detected: current 1953523055, native 1953525168

This one *is* an HPA, a typical one of 2113 blocks, about 1MB. Since the drive is in production, no longer empty, I would recommend leaving it as it is. This is at the very far end of the drive, so unless you are in the habit of filling your drives up to the very last kilobyte, this is completely inconsequential. Plus, you can confuse the Reiser file system, possibly unRAID itself, and possibly other things if you change the size of the drive. Since most of us leave a small bit of breathing room at the end of our drives, typically 1 to 20 gigabytes, losing 1MB (about one thousandth of a gigabyte) is nothing, not worth the possible problems. The one time it *is* a problem is when you don't realize your parity drive has an HPA and then try to add another drive of the same size. The new drive will look like it is 1MB larger than the parity drive, and unRAID will refuse to add it.
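
The size of the hidden area follows directly from the two numbers in the log line. A quick Python check, assuming the 512-byte sectors that the same syslog reports for this drive:

```python
SECTOR_BYTES = 512           # "512-byte hardware sectors" per the syslog

current = 1953523055         # sectors currently visible on ata6.00
native = 1953525168          # sectors the drive natively has

hidden_sectors = native - current
hidden_bytes = hidden_sectors * SECTOR_BYTES

print(hidden_sectors, "sectors hidden")
print(hidden_bytes, "bytes, or about", round(hidden_bytes / 1e6, 2), "MB")
```

That works out to 2113 sectors, or roughly 1 MB out of a 1 TB drive.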

Feb 14 17:15:45 NAS kernel: eth0: too many iterations (6) in nv_nic_irq.

This appears in my syslogs too, and in smino's. I have only found it in syslogs of boards with nForce chipsets, using the forcedeth network driver. I have NEVER seen any harm come of it, so I've learned to ignore it. From what I have seen, in mine as well as others, it appears to happen only during periods of heavy network traffic. My own hypothesis is that the forcedeth driver was not designed for heavy traffic, that it cannot handle more than 6 requests at a time, and that it prints a warning when it gets more work than it can handle. Again, I have never seen any harm from it. I would like to see it go away of course, so perhaps a future release will improve it. Since I rarely tax the network, at least to my unRAID server, I am OK with the current state. But if I needed more simultaneous streams to my server, I think I would disable the onboard NIC and add one of the high-performance Intel NICs on a PCI Express card.

February 15, 2009

Author

RobJ, thanks again for the detailed response; it is very valuable info for me. I just copied an 8GB file to the server without trouble, but a few minutes later the web interface was unresponsive. Slightly different error messages:

Feb 14 23:49:24 NAS kernel: ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 14 23:49:24 NAS kernel: ata6.00: cmd e5/00:00:00:00:00/00:00:00:00:00/40 tag 0
Feb 14 23:49:24 NAS kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 14 23:49:24 NAS kernel: ata6.00: status: { DRDY }
Feb 14 23:49:24 NAS kernel: ata6: soft resetting link
Feb 14 23:49:24 NAS kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 14 23:49:29 NAS kernel: ata6.00: qc timeout (cmd 0x27)
Feb 14 23:49:29 NAS kernel: ata6.00: failed to read native max address (err_mask=0x4)
Feb 14 23:49:29 NAS kernel: ata6.00: HPA support seems broken, skipping HPA handling
Feb 14 23:49:29 NAS kernel: ata6.00: revalidation failed (errno=-5)
Feb 14 23:49:30 NAS kernel: ata6: soft resetting link
Feb 14 23:49:30 NAS kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 14 23:49:30 NAS kernel: ata6.00: configured for UDMA/133
Feb 14 23:49:30 NAS kernel: ata6: EH complete
Feb 14 23:49:30 NAS kernel: sd 6:0:0:0: [sdf] 1953523055 512-byte hardware sectors (1000204 MB)
Feb 14 23:49:30 NAS kernel: sd 6:0:0:0: [sdf] Write Protect is off
Feb 14 23:49:30 NAS kernel: sd 6:0:0:0: [sdf] Mode Sense: 00 3a 00 00
Feb 14 23:49:30 NAS kernel: sd 6:0:0:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 14 23:50:17 NAS kernel: ata5: EH in SWNCQ mode,QC:qc_active 0x1 sactive 0x1
Feb 14 23:50:17 NAS kernel: ata5: SWNCQ:qc_active 0x1 defer_bits 0x0 last_issue_tag 0x0
Feb 14 23:50:17 NAS kernel:   dhfis 0x1 dmafis 0x0 sdbfis 0x0
Feb 14 23:50:17 NAS kernel: ata5: ATA_REG 0x40 ERR_REG 0x0
Feb 14 23:50:17 NAS kernel: ata5: tag : dhfis dmafis sdbfis sacitve
Feb 14 23:50:17 NAS kernel: ata5: tag 0x0: 1 0 0 1  
Feb 14 23:50:17 NAS kernel: ata5.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Feb 14 23:50:17 NAS kernel: ata5.00: cmd 60/00:00:7f:f1:97/04:00:25:00:00/40 tag 0 ncq 524288 in
Feb 14 23:50:17 NAS kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 14 23:50:17 NAS kernel: ata5.00: status: { DRDY }
Feb 14 23:50:17 NAS kernel: ata5: soft resetting link
Feb 14 23:50:18 NAS kernel: ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 14 23:50:23 NAS kernel: ata5.00: qc timeout (cmd 0x27)
Feb 14 23:50:23 NAS kernel: ata5.00: failed to read native max address (err_mask=0x4)
Feb 14 23:50:23 NAS kernel: ata5.00: HPA support seems broken, skipping HPA handling
Feb 14 23:50:23 NAS kernel: ata5.00: revalidation failed (errno=-5)
Feb 14 23:50:23 NAS kernel: ata5: soft resetting link
Feb 14 23:50:23 NAS kernel: ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 14 23:50:23 NAS kernel: ata5.00: configured for UDMA/133
Feb 14 23:50:23 NAS kernel: ata5: EH complete
Feb 14 23:51:23 NAS kernel: ata5: EH in SWNCQ mode,QC:qc_active 0x1 sactive 0x1
Feb 14 23:51:23 NAS kernel: ata5: SWNCQ:qc_active 0x1 defer_bits 0x0 last_issue_tag 0x0
Feb 14 23:51:23 NAS kernel:   dhfis 0x0 dmafis 0x0 sdbfis 0x0
Feb 14 23:51:23 NAS kernel: ata5: ATA_REG 0x40 ERR_REG 0x0
Feb 14 23:51:23 NAS kernel: ata5: tag : dhfis dmafis sdbfis sacitve
Feb 14 23:51:23 NAS kernel: ata5: tag 0x0: 0 0 0 1  
Feb 14 23:51:23 NAS kernel: ata5.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Feb 14 23:51:23 NAS kernel: ata5.00: cmd 60/00:00:7f:f1:97/04:00:25:00:00/40 tag 0 ncq 524288 in
Feb 14 23:51:23 NAS kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 14 23:51:23 NAS kernel: ata5.00: status: { DRDY }
Feb 14 23:51:24 NAS kernel: ata5: soft resetting link
Feb 14 23:51:24 NAS kernel: ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 14 23:51:24 NAS kernel: ata5.00: configured for UDMA/133
Feb 14 23:51:24 NAS kernel: ata5: EH complete
Feb 14 23:51:24 NAS kernel: sd 5:0:0:0: [sde] 1953525168 512-byte hardware sectors (1000205 MB)
Feb 14 23:52:24 NAS kernel: ata5: EH in SWNCQ mode,QC:qc_active 0x1 sactive 0x1
Feb 14 23:52:24 NAS kernel: ata5: SWNCQ:qc_active 0x1 defer_bits 0x0 last_issue_tag 0x0
Feb 14 23:52:24 NAS kernel:   dhfis 0x0 dmafis 0x0 sdbfis 0x0
Feb 14 23:52:24 NAS kernel: ata5: ATA_REG 0x40 ERR_REG 0x0
Feb 14 23:52:24 NAS kernel: ata5: tag : dhfis dmafis sdbfis sacitve
Feb 14 23:52:24 NAS kernel: ata5: tag 0x0: 0 0 0 1  
Feb 14 23:52:24 NAS kernel: ata5.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Feb 14 23:52:24 NAS kernel: ata5.00: cmd 60/00:00:7f:f1:97/04:00:25:00:00/40 tag 0 ncq 524288 in
Feb 14 23:52:24 NAS kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 14 23:52:24 NAS kernel: ata5.00: status: { DRDY }
Feb 14 23:52:25 NAS kernel: ata5: soft resetting link
Feb 14 23:52:25 NAS kernel: ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 14 23:52:25 NAS kernel: ata5.00: configured for UDMA/133
Feb 14 23:52:25 NAS kernel: ata5: EH complete
Feb 14 23:52:25 NAS kernel: sd 5:0:0:0: [sde] Write Protect is off
Feb 14 23:52:25 NAS kernel: sd 5:0:0:0: [sde] Mode Sense: 00 3a 00 00
Feb 14 23:53:25 NAS kernel: ata5: EH in SWNCQ mode,QC:qc_active 0x1 sactive 0x1
Feb 14 23:53:25 NAS kernel: ata5: SWNCQ:qc_active 0x1 defer_bits 0x0 last_issue_tag 0x0
Feb 14 23:53:25 NAS kernel:   dhfis 0x0 dmafis 0x0 sdbfis 0x0
Feb 14 23:53:25 NAS kernel: ata5: ATA_REG 0x40 ERR_REG 0x0
Feb 14 23:53:25 NAS kernel: ata5: tag : dhfis dmafis sdbfis sacitve
Feb 14 23:53:25 NAS kernel: ata5: tag 0x0: 0 0 0 1  
Feb 14 23:53:25 NAS kernel: ata5.00: NCQ disabled due to excessive errors
Feb 14 23:53:25 NAS kernel: ata5.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Feb 14 23:53:25 NAS kernel: ata5.00: cmd 60/00:00:7f:f1:97/04:00:25:00:00/40 tag 0 ncq 524288 in
Feb 14 23:53:25 NAS kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 14 23:53:25 NAS kernel: ata5.00: status: { DRDY }
Feb 14 23:53:26 NAS kernel: ata5: soft resetting link
Feb 14 23:53:26 NAS kernel: ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 14 23:53:26 NAS kernel: ata5.00: configured for UDMA/133
Feb 14 23:53:26 NAS kernel: ata5: EH complete
Feb 14 23:54:26 NAS kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 14 23:54:26 NAS kernel: ata5.00: cmd 25/00:00:7f:f1:97/00:04:25:00:00/e0 tag 0 dma 524288 in
Feb 14 23:54:26 NAS kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 14 23:54:26 NAS kernel: ata5.00: status: { DRDY }
Feb 14 23:54:26 NAS kernel: ata5: soft resetting link
Feb 14 23:54:27 NAS kernel: ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 14 23:54:27 NAS kernel: ata5.00: configured for UDMA/133
Feb 14 23:54:27 NAS kernel: ata5: EH complete
Feb 14 23:54:27 NAS kernel: sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

I think it's time I start adding boot codes (or removing swncq=0).
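
For anyone trying to read the "cmd" lines above, here is a rough sketch of a decoder. It assumes the usual libata print order for the taskfile (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and only handles the three read commands that show up in this thread. The function name `decode_cmd` is my own invention, not part of any tool.

```python
import re

# Field order assumed for the taskfile printed on libata "cmd" lines.
FIELDS = ("command", "feature", "nsect", "lbal", "lbam", "lbah",
          "hob_feature", "hob_nsect", "hob_lbal", "hob_lbam", "hob_lbah",
          "device")

# Twelve two-digit hex fields separated by "/" or ":".
CMD_RE = re.compile(r"cmd ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")


def decode_cmd(line):
    """Return command, LBA and transfer size for a syslog 'cmd' line, or None."""
    match = CMD_RE.search(line)
    if match is None:
        return None
    values = [int(x, 16) for x in re.split(r"[/:]", match.group(1))]
    tf = dict(zip(FIELDS, values))
    cmd = tf["command"]

    if cmd == 0x60:      # READ FPDMA QUEUED (NCQ): count is in the feature fields
        sectors = ((tf["hob_feature"] << 8) | tf["feature"]) or 65536
        lba48 = True
    elif cmd == 0x25:    # READ DMA EXT: count is in the nsect fields
        sectors = ((tf["hob_nsect"] << 8) | tf["nsect"]) or 65536
        lba48 = True
    elif cmd == 0xC8:    # READ DMA (28-bit): a count of 0 means 256 sectors
        sectors = tf["nsect"] or 256
        lba48 = False
    else:
        return None      # other commands are not handled in this sketch

    lba = (tf["lbah"] << 16) | (tf["lbam"] << 8) | tf["lbal"]
    if lba48:
        lba |= (tf["hob_lbah"] << 40) | (tf["hob_lbam"] << 32) | (tf["hob_lbal"] << 24)
    else:
        lba |= (tf["device"] & 0x0F) << 24   # top 4 LBA bits live in the device field

    return {"command": cmd, "lba": lba, "sectors": sectors, "bytes": sectors * 512}


if __name__ == "__main__":
    ncq = "ata5.00: cmd 60/00:00:7f:f1:97/04:00:25:00:00/40 tag 0 ncq 524288 in"
    dma = "ata5.00: cmd 25/00:00:7f:f1:97/00:04:25:00:00/e0 tag 0 dma 524288 in"
    print(decode_cmd(ncq))
    print(decode_cmd(dma))
```

Run against the ata5 lines in the log above, both the NCQ read (0x60) and the later plain DMA read (0x25) decode to the same starting LBA and the same 524288-byte length, which matches the size printed at the end of each "cmd" line. In other words, the same read keeps timing out on every retry, even after NCQ is turned off.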

May 7, 2009

Hi all, I'm having a very similar problem to this, but starting from slightly different initial conditions. Here is the background to my situation...

Recently, I decided that it was time for another round of upgrades to my unRAID server. I had been running on an Asus P4P800S motherboard (Intel 848P chipset) with 10 SATA drives (2 on the mobo controller and 8 on one of the PCI Promise 8x controllers) mounted in 2 of the IcyDock 3->5 SATA backplanes, and 2 PATA drives running off the mobo PATA controller.

The plan was to eliminate the PATA drives, put in another Promise TX8 and another of the IcyDock 3->5 backplanes. The PATA drives would be replaced with shiny new 1.5 TB drives, one of which would become my new parity drive and the other simply a data upgrade.

At first, things seemed to go pretty well, but it was not long before I started having trouble with the new parity drive. Eventually things got so bad that the drive bricked and I needed to RMA it. With the addition of the new controller card, I also started to see occasional failures like those described in this thread. I asked for some advice in the 4.3 forum (since I was running 4.3 beta6 at the time) and followed the suggestions I received. My mobo BIOS is up to date, I upgraded to unRAID 4.5 beta4, and I checked the SMART status of all my disks, ran self tests, and so on. In general, things seemed to be more or less OK, even though I was running without parity.

When the new parity drive showed up, I racked it up and started to rebuild. I couldn't finish a parity rebuild because one of my drives would always show the errors that are the focus of this thread. At first, it seemed to be associated with a particular 1TB drive, so I ended up getting yet another 1.5TB drive and moving the data from the 1TB drive over to the 1.5TB drive. Later on, however, I started to get the same kind of errors with different drives.

SMART checks generally show that the drives think they are healthy, but the drives sometimes fail simple commands like IDENTIFY and ENABLE_SMART. When the array is offline and I mount individual /dev/sdX1 partitions, I don't seem to have any trouble. It's only when everything is running full-bore during a parity rebuild that I really see the trouble.

Suspecting that it might be a cable or SATA port problem, I have juggled the physical connections of the drives around looking for a pattern, but have not discovered one. I have one data point which seems to indicate that bypassing the SATA backplane made things better for one particular drive (I finally succeeded in rebuilding parity doing this), but many people use these backplanes without a problem, so I am not convinced yet.

While I have had problems with my power supply in the past, I don't think there is any issue right now. I'm using a 650W Corsair TX series PS with a single massive 12v rail. Looking at power consumption from the outside using a kill-a-watt, the system is pulling about 215W total with all the drives spinning, so I don't think there is a problem with an unstable PS.

So, here is what I am wondering.

1) Has anyone on this thread made any progress with their link reset issues? If so, what fixed it for you?

2) Have people had trouble with the IcyDock backplanes? I really like their convenience, plus they look super spiffy, but I am wondering if I would be better off ditching them.

3) I'm putting a ton of load on my PCI bus when I have all 13 of my drives up and running. Both of my Promise controllers seem to be sitting on the same PCI bus, so the transfer rates during rebuild are terrible (~9-10MB/s), and it's because of the PCI channel limit. The P4P800S is pretty ancient these days as well. I have been considering switching to a newer mobo (one with at least 8 SATA ports already on it) and getting one of the new 8 port PCIe 4x SAS controllers which just came out. This should (in theory) eliminate the PCI bottleneck, and might eliminate my link-reset issues, assuming that the trouble is somehow related to my mobo and controller combination. Still, I would rather not spend another $250 unless I had some confidence that things were going to get better.
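
As a back-of-the-envelope check on the PCI limit mentioned in 3), here is a small sketch of the arithmetic. It assumes a standard 32-bit, 33 MHz PCI bus (about 133 MB/s theoretical, shared by every card on it) and assumes 11 of the 13 drives hang off the two PCI controllers, with the other 2 on the mobo ports. The 0.8 efficiency figure is a guess for protocol overhead, not a measured value.

```python
# Rough per-drive bandwidth when several drives share one PCI bus.
# Assumptions (not from the post): 32-bit/33 MHz PCI, 11 drives on the bus,
# and an 80% efficiency factor to account for bus overhead.

PCI_CLOCK_HZ = 33_333_333   # classic PCI clock
PCI_WIDTH_BYTES = 4         # 32-bit bus


def bus_mb_per_s():
    """Theoretical peak of the shared bus in MB/s."""
    return PCI_CLOCK_HZ * PCI_WIDTH_BYTES / 1e6


def per_drive_mb_per_s(drives, efficiency=1.0):
    """Even split of the bus across drives that are all read at once."""
    return bus_mb_per_s() * efficiency / drives


if __name__ == "__main__":
    print(f"bus peak:        {bus_mb_per_s():.1f} MB/s")
    print(f"11 drives, 100%: {per_drive_mb_per_s(11):.1f} MB/s each")
    print(f"11 drives, 80%:  {per_drive_mb_per_s(11, 0.8):.1f} MB/s each")
```

Under those assumptions the even split works out to roughly 12 MB/s per drive at the theoretical peak and just under 10 MB/s at 80% efficiency, which is in the same range as the ~9-10MB/s seen during the rebuild. A parity rebuild reads every drive at the same time, so it is close to the worst case for a shared bus.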

Any thoughts or suggestions? If I were going to replace the mobo, does anyone have a suggestion for a motherboard which works well with the new SAS controller? An example chunk of the syslog is provided below. Let me know if there is more data I can collect.

-john

May  5 21:21:19 mrpink kernel: ata17.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May  5 21:21:19 mrpink kernel: ata17.00: cmd c8/00:08:f7:b8:08/00:00:00:00:00/e0 tag 0 dma 4096 in
May  5 21:21:19 mrpink kernel:          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May  5 21:21:19 mrpink kernel: ata17.00: status: { DRDY }
May  5 21:21:24 mrpink kernel: ata17: link is slow to respond, please be patient (ready=0)
May  5 21:21:29 mrpink kernel: ata17: device not ready (errno=-16), forcing hardreset
May  5 21:21:29 mrpink kernel: ata17: soft resetting link
May  5 21:21:34 mrpink kernel: ata17: link is slow to respond, please be patient (ready=0)
May  5 21:21:39 mrpink kernel: ata17: SRST failed (errno=-16)
May  5 21:21:39 mrpink kernel: ata17: soft resetting link
May  5 21:21:44 mrpink kernel: ata17: link is slow to respond, please be patient (ready=0)
May  5 21:21:49 mrpink kernel: ata17: SRST failed (errno=-16)
May  5 21:21:49 mrpink kernel: ata17: soft resetting link
May  5 21:21:54 mrpink kernel: ata17: link is slow to respond, please be patient (ready=0)
May  5 21:22:25 mrpink kernel: ata17: SRST failed (errno=-16)
May  5 21:22:25 mrpink kernel: ata17: soft resetting link
May  5 21:22:30 mrpink emhttp: shcmd (54): /usr/sbin/hdparm -y /dev/sdn >/dev/null
May  5 21:22:30 mrpink emhttp: shcmd: shcmd (54): exit status: 5
May  5 21:22:30 mrpink kernel: ata17: SRST failed (errno=-16)
May  5 21:22:30 mrpink kernel: ata17: reset failed, giving up
May  5 21:22:30 mrpink kernel: ata17.00: disabled
May  5 21:22:30 mrpink kernel: ata17: EH complete
May  5 21:22:30 mrpink kernel: sd 17:0:0:0: [sdn] Result: hostbyte=0x04 driverbyte=0x00
May  5 21:22:30 mrpink kernel: end_request: I/O error, dev sdn, sector 571639
May  5 21:22:30 mrpink kernel: md: disk9 read error

May 7, 2009

There seem to be a number of different ways that drives fail or produce errors, and many of them use some of the same terms, such as soft and hard reset, frozen, and timeout. Unfortunately, they are often quite different, and other error terms need to be examined to indicate the true problem. In your case, the syslog excerpt does not actually have an obvious error; perhaps it was earlier in the syslog? What is apparent here is that the drive has stopped responding, and after a number of attempts to reset and recover, it is finally disabled, which is fatal. All subsequent errors, such as the read error, can be ignored, because the drive is considered disabled. You cannot conclude ANYTHING here, though, about the drive. It could be a problem with the drive, or the cabling, or the connectors along the path, or the controller, ...

I don't have anything else to go on, so no other comments. Perhaps a copy of the full syslog will reveal more. Just my opinion here: you were running fine before, with drives, controllers, buses, and motherboard all from the same generation. Now you are adding faster drives from the next generation, and trying to get them to wait their turn on the older generation's buses, which may be leading to some of the instability. I do think you would probably be better off with a current motherboard, of good quality, able to keep up with your new drives. I don't like speculating though, and this is pure speculation.

SATA cables and backplanes are also an iffy thing. The fact that some combinations have worked better would seem to indicate that better cabling *might* help.
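
When juggling cables and ports to look for a pattern, it may help to tally the error events per ata port from the full syslog instead of eyeballing it. Below is a minimal sketch that does this. The message strings it matches are the ones that appear in the excerpts in this thread, and the names `classify` and `summarize` are made up for illustration.

```python
import re
import sys
from collections import Counter, defaultdict

# Matches e.g. "kernel: ata17.00: exception ..." or "kernel: ata17: SRST failed ..."
PORT_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")


def classify(message):
    """Map a libata message to a short event name, or None if not of interest."""
    if message.startswith("exception Emask"):
        return "exception"
    if message == "soft resetting link":
        return "soft_reset"
    if message.startswith("SRST failed"):
        return "srst_failed"
    if message.startswith("NCQ disabled"):
        return "ncq_disabled"
    if message == "disabled":
        return "device_disabled"
    return None


def summarize(lines):
    """Count events of interest per ata port."""
    summary = defaultdict(Counter)
    for line in lines:
        match = PORT_RE.search(line)
        if match is None:
            continue
        port, message = match.group(1), match.group(2).strip()
        kind = classify(message)
        if kind is not None:
            summary[port][kind] += 1
    return summary


if __name__ == "__main__":
    # Usage: python tally.py < syslog
    for port, counts in sorted(summarize(sys.stdin).items()):
        print(port, dict(counts))
```

Comparing the per-port counts before and after moving a drive to a different port, cable, or backplane slot should show whether the errors follow the drive or stay with the port. It only counts events; it cannot say which component is at fault.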
