
Bad SATA cable / power problem or Faulty drive?


niven


Hello!

 

I ran a parity check last night. It didn't detect or correct any errors, but to make sure, I checked the syslog. I'm not an expert at reading the syslog, but I usually search it for errors or failures.

 

When I searched for "error", this is what I found:

 

Aug 13 01:21:17 Mediatower kernel: mdcmd (45): check CORRECT

Aug 13 01:21:17 Mediatower kernel: md: recovery thread woken up ...

Aug 13 01:21:17 Mediatower kernel: md: recovery thread checking parity...

Aug 13 01:21:17 Mediatower kernel: md: using 1536k window, over a total of 2930266532 blocks.

Aug 13 01:26:01 Mediatower kernel: ata6.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen

Aug 13 01:26:01 Mediatower kernel: ata6.00: irq_stat 0x08000000, interface fatal error

Aug 13 01:26:01 Mediatower kernel: ata6: SError: { UnrecovData HostInt 10B8B BadCRC }

Aug 13 01:26:01 Mediatower kernel: ata6.00: failed command: READ DMA

Aug 13 01:26:01 Mediatower kernel: ata6.00: cmd c8/00:e8:a0:4b:ce/00:00:00:00:00/e4 tag 20 dma 118784 in

Aug 13 01:26:01 Mediatower kernel:        res 50/00:00:9f:4b:ce/00:00:04:00:00/e0 Emask 0x50 (ATA bus error)

Aug 13 01:26:01 Mediatower kernel: ata6.00: status: { DRDY }

Aug 13 01:26:01 Mediatower kernel: ata6: hard resetting link

Aug 13 01:26:01 Mediatower kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

Aug 13 01:26:01 Mediatower kernel: ata6.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded

Aug 13 01:26:01 Mediatower kernel: ata6.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out

Aug 13 01:26:01 Mediatower kernel: ata6.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out

Aug 13 01:26:01 Mediatower kernel: ata6.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded

Aug 13 01:26:01 Mediatower kernel: ata6.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out

Aug 13 01:26:01 Mediatower kernel: ata6.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out

Aug 13 01:26:01 Mediatower kernel: ata6.00: configured for UDMA/133

Aug 13 01:26:01 Mediatower kernel: ata6: EH complete

Aug 13 01:44:13 Mediatower kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen

Aug 13 01:44:13 Mediatower kernel: ata6.00: irq_stat 0x08000000, interface fatal error

Aug 13 01:44:13 Mediatower kernel: ata6: SError: { UnrecovData 10B8B BadCRC }

Aug 13 01:44:13 Mediatower kernel: ata6.00: failed command: READ DMA EXT

Aug 13 01:44:13 Mediatower kernel: ata6.00: cmd 25/00:28:58:4c:7d/00:00:17:00:00/e0 tag 24 dma 20480 in

Aug 13 01:44:13 Mediatower kernel:        res 50/00:00:57:4c:7d/00:00:17:00:00/e0 Emask 0x10 (ATA bus error)

Aug 13 01:44:13 Mediatower kernel: ata6.00: status: { DRDY }

Aug 13 01:44:13 Mediatower kernel: ata6: hard resetting link

 

I did a search on the error ata6: SError: { UnrecovData 10B8B BadCRC } and I found good information on this page: https://lime-technology.com/wiki/index.php/The_Analysis_of_Drive_Issues

 

The presence of BadCRC is a pretty good indicator of a poor quality SATA cable. However, if a better cable does not solve the issue, then it is probably a power problem (loose power cable or backplane connection, poor connectors, poor power splitter, overloaded power supply, too many drives on power rail, bad power supply, etc).
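
To make the flags in those log lines a little less cryptic, here is a minimal sketch that decodes the SErr hex value from an "exception" line into the short flag names the kernel prints, and reports the lines where BadCRC is set. The bit table follows the SATA SError register layout as I understand it from the Linux libata driver; the function names and the regular expression are my own assumptions, not part of unRAID.

```python
import re

# SError register bit positions and the short names libata prints for them.
SERROR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}

# Matches e.g. "ata6.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 ..."
SERR_RE = re.compile(r"(ata\d+)(?:\.\d+)?: exception .*?SErr (0x[0-9a-fA-F]+)")


def decode_serror(value):
    """Return the names of the flags set in an SError value, lowest bit first."""
    return [name for bit, name in sorted(SERROR_BITS.items()) if value & (1 << bit)]


def crc_events(lines):
    """Return (port, flags) for every exception line whose SErr has BadCRC set."""
    events = []
    for line in lines:
        match = SERR_RE.search(line)
        if match:
            flags = decode_serror(int(match.group(2), 16))
            if "BadCRC" in flags:
                events.append((match.group(1), flags))
    return events


# The two exception lines from the syslog excerpt above.
log = [
    "Aug 13 01:26:01 Mediatower kernel: ata6.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6 frozen",
    "Aug 13 01:44:13 Mediatower kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen",
]

for port, flags in crc_events(log):
    print(port, flags)
```

Both lines decode to the same flag sets the kernel printed on the following "SError:" lines, and both point at port ata6, so only one link is affected in this excerpt.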

 

My system specs are:

 

Motherboard:  ASRock 1155 B75 PRO3-M M-ATX (90-MXGLW0-A0UAYZ)

CPU:  INTEL 3rd Generation Core i3-3220 LGA1155 Boxed (BX80637I33220)

PSU:  CORSAIR HX 650W PSU ATX 12V V2.31, 80 Plus Gold, Modular, 2x 6+2-pin PCIe, 8x SATA, 4x Molex (CP-9020030-EU)

Card: ADAPTEC SATA RAID AAR1430SA PCI KIT (AAR1430SA-ROHS-KIT)

RAM  Corsair Dominator DHX DDR3 1600MHz 8GB with CL9.

HDD's: WD Red 3TB NAS Harddrive x 6

SSD HyperX Fury SSD 120GB 2,5"

Backplanes: ICY BOX IB-555SSK 5Bay Backplane x 2

 

The backplanes use Molex connectors, so I had to buy and use a Molex splitter.

 

From what I have read on the analysis page, I think it's related to a bad SATA cable or a power problem. But I don't know how to find the faulty SATA cable or what the source of the errors is.

 

I will attach the whole syslog so it's easier to identify the problem. (The text file was too large, so I uploaded a ZIP instead.)

 

Thanks for the help and information tips!

mediatower-syslog-20150813-0959.zip

Link to comment

From the syslog -

Aug 12 19:54:01 Mediatower kernel: ata6.00: ATA-9: WDC WD30EFRX-68AX9N0,      WD-WCC1T1055807, 80.00A80, max UDMA/133

...

Aug 12 19:54:01 Mediatower kernel: scsi 6:0:0:0: Direct-Access    ATA      WDC WD30EFRX-68A 0A80 PQ: 0 ANSI: 5

...

Aug 12 19:54:01 Mediatower kernel: sd 6:0:0:0: [sdc] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)

...

Aug 12 19:54:04 Mediatower emhttp: WDC_WD30EFRX-68AX9N0_WD-WCC1T1055807 (sdc) 2930266584

...

Aug 12 19:54:04 Mediatower kernel: md: import disk1: [8,32] (sdc) WDC_WD30EFRX-68AX9N0_WD-WCC1T1055807 size: 2930266532

 

As you said, I'd replace the SATA cable first.

Link to comment

Thanks for the reply, RobJ!

 

As you said, I'd replace the SATA cable first.

 

Yes, I will definitely check if this solves the problem first.

 

How can I determine which SATA cable needs to be replaced? Do I check the model number of the drive (WDC WD30EFRX-68A and WDC_WD30EFRX-68AX9N0_WD-WCC1T1055807), or is it sdc that needs to be replaced?

 

Once I have replaced the correct SATA cable, how can I check whether the problem is fixed? Can I run a test for this?

 

I really appreciate the help and tips, thanks!

Link to comment

I guess I was not clear, all of those lines are referring to the same drive, as it is known by different subsystems.

 

  Disk 1 == sdc == ata6.00 == sd 6:0:0:0 == scsi 6:0:0:0 == WDC WD30EFRX-68AX9N0 with serial = WD-WCC1T1055807
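
If it helps to see where that chain comes from, here is a rough sketch that rebuilds it from the three kinds of syslog lines quoted earlier. It joins the ata line to the md import line through the serial number, and the sd line to the md import line through the sdX name, rather than assuming the ata and SCSI numbers always match. The regular expressions and the `map_drives` name are my own assumptions, fitted only to the lines in this thread.

```python
import re

# "ata6.00: ATA-9: <model>,   <serial>, <firmware>, ..."
ATA_RE = re.compile(r"(ata\d+\.\d+): ATA-\d+: (.+?),\s+(\S+),")
# "sd 6:0:0:0: [sdc] ..."
SD_RE = re.compile(r"sd (\d+:\d+:\d+:\d+): \[(sd[a-z]+)\]")
# "md: import disk1: [8,32] (sdc) <model>_<serial> size: ..."
MD_RE = re.compile(r"md: import (disk\d+): \[\d+,\d+\] \((sd[a-z]+)\) (\S+) size")


def map_drives(lines):
    """Return one dict per array slot, tying together the names used for a drive."""
    ata = {}   # serial -> (ata port, model)
    scsi = {}  # sdX name -> SCSI address
    md = []    # (array slot, sdX name, identifier)
    for line in lines:
        match = ATA_RE.search(line)
        if match:
            ata[match.group(3)] = (match.group(1), match.group(2))
            continue
        match = SD_RE.search(line)
        if match:
            scsi[match.group(2)] = match.group(1)
            continue
        match = MD_RE.search(line)
        if match:
            md.append(match.groups())

    rows = []
    for slot, dev, ident in md:
        for serial, (port, model) in ata.items():
            # The md identifier ends with the drive's serial number.
            if ident.endswith(serial):
                rows.append({"slot": slot, "dev": dev, "ata": port,
                             "scsi": scsi.get(dev), "model": model,
                             "serial": serial})
    return rows


# Lines taken from the syslog excerpt earlier in the thread.
syslog = [
    "Aug 12 19:54:01 Mediatower kernel: ata6.00: ATA-9: WDC WD30EFRX-68AX9N0,      WD-WCC1T1055807, 80.00A80, max UDMA/133",
    "Aug 12 19:54:01 Mediatower kernel: sd 6:0:0:0: [sdc] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)",
    "Aug 12 19:54:04 Mediatower kernel: md: import disk1: [8,32] (sdc) WDC_WD30EFRX-68AX9N0_WD-WCC1T1055807 size: 2930266532",
]

for row in map_drives(syslog):
    print(row)
```

In practice the serial number printed on the drive label is the part that tells you which physical cable to follow.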

 

There is no test.  If the problem was a bad cable, replacing it with a good one should stop the errors from recurring.

 

One thing I recommend is to turn on the SMART Notifications (in Advanced View), then turn off 'Command timeout' and add 199 to 'Custom attribute'.  Attribute 199 is UDMA_CRC_Error_Count, which if constantly increasing is a good indicator of a bad SATA cable.
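
For anyone who prefers to check attribute 199 by hand, here is a small sketch that pulls a raw value out of the attribute table printed by `smartctl -A`. The sample text below is made up for illustration (it is not from this server), and `smart_raw_value` is a hypothetical helper name. On a live system the text would come from running `smartctl -A /dev/sdc` or similar.

```python
# Hypothetical sample in the column layout of `smartctl -A`; values are invented.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       2
"""


def smart_raw_value(smartctl_output, attribute_id):
    """Return the integer raw value of one SMART attribute, or None if absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have ten columns; the first is the ID, the tenth the raw value.
        if len(fields) >= 10 and fields[0] == str(attribute_id):
            return int(fields[9])
    return None


print(smart_raw_value(SAMPLE, 199))  # prints 2 for the invented sample
```

A single reading says little by itself; what matters is whether the number keeps climbing between readings.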

Link to comment
I guess I was not clear, all of those lines are referring to the same drive, as it is known by different subsystems.

 

  Disk 1 == sdc == ata6.00 == sd 6:0:0:0 == scsi 6:0:0:0 == WDC WD30EFRX-68AX9N0 with serial = WD-WCC1T1055807

 

Thanks! It's just me not being used to reading the log correctly.

 

There is no test.  If the problem was a bad cable, replacing it with a good one should stop the errors from recurring.

OK, I understand.

 

One thing I recommend is to turn on the SMART Notifications (in Advanced View), then turn off 'Command timeout' and add 199 to 'Custom attribute'.  Attribute 199 is UDMA_CRC_Error_Count, which if constantly increasing is a good indicator of a bad SATA cable.

 

Ok, thanks for the tips and information! I will report back when I have replaced the SATA cable with a new one.

Link to comment

So I replaced the SATA cables last night and there seem to be no error messages now. I guess I will just have to keep an eye on it in the future.

 

I noticed this in the syslog:

 

Aug 14 22:14:57 Mediatower emhttp: shcmd (54): mkdir -p /mnt/disk1

Aug 14 22:14:57 Mediatower emhttp: shcmd (55): set -o pipefail ; mount -t reiserfs -o noatime,nodiratime /dev/md1 /mnt/disk1 |& logger

Aug 14 22:14:57 Mediatower kernel: REISERFS (device md1): found reiserfs format "3.6" with standard journal

Aug 14 22:14:57 Mediatower kernel: REISERFS (device md1): using ordered data mode

Aug 14 22:14:57 Mediatower kernel: reiserfs: using flush barriers

Aug 14 22:14:57 Mediatower kernel: REISERFS (device md1): journal params: device md1, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30

Aug 14 22:14:57 Mediatower kernel: REISERFS (device md1): checking transaction log (md1)

Aug 14 22:14:57 Mediatower kernel: REISERFS (device md1): Using r5 hash to sort names

Aug 14 22:14:57 Mediatower emhttp: shcmd (56): set -o pipefail ; mount -t reiserfs -o remount,user_xattr,acl,noatime,nodiratime /dev/md1 /mnt/disk1 |& logger

Aug 14 22:14:57 Mediatower emhttp: reiserfs: resizing /mnt/disk1

Aug 14 22:14:57 Mediatower emhttp: shcmd (57): mkdir -p /mnt/disk2

Aug 14 22:14:57 Mediatower emhttp: shcmd (58): set -o pipefail ; mount -t reiserfs -o noatime,nodiratime /dev/md2 /mnt/disk2 |& logger

Aug 14 22:14:57 Mediatower kernel: REISERFS (device md2): found reiserfs format "3.6" with standard journal

Aug 14 22:14:57 Mediatower kernel: REISERFS (device md2): using ordered data mode

Aug 14 22:14:57 Mediatower kernel: reiserfs: using flush barriers

Aug 14 22:14:57 Mediatower kernel: REISERFS (device md2): journal params: device md2, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30

Aug 14 22:14:57 Mediatower kernel: REISERFS (device md2): checking transaction log (md2)

Aug 14 22:14:57 Mediatower kernel: REISERFS (device md2): Using r5 hash to sort names

Aug 14 22:14:57 Mediatower emhttp: shcmd (59): set -o pipefail ; mount -t reiserfs -o remount,user_xattr,acl,noatime,nodiratime /dev/md2 /mnt/disk2 |& logger

Aug 14 22:14:57 Mediatower emhttp: reiserfs: resizing /mnt/disk2

Aug 14 22:14:57 Mediatower emhttp: shcmd (60): mkdir -p /mnt/disk3

Aug 14 22:14:57 Mediatower emhttp: shcmd (61): set -o pipefail ; mount -t reiserfs -o noatime,nodiratime /dev/md3 /mnt/disk3 |& logger

Aug 14 22:14:57 Mediatower kernel: REISERFS (device md3): found reiserfs format "3.6" with standard journal

Aug 14 22:14:57 Mediatower kernel: REISERFS (device md3): using ordered data mode

Aug 14 22:14:57 Mediatower kernel: reiserfs: using flush barriers

Aug 14 22:14:57 Mediatower kernel: REISERFS (device md3): journal params: device md3, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30

Aug 14 22:14:57 Mediatower kernel: REISERFS (device md3): checking transaction log (md3)

Aug 14 22:14:58 Mediatower kernel: REISERFS (device md3): Using r5 hash to sort names

Aug 14 22:14:58 Mediatower emhttp: shcmd (62): set -o pipefail ; mount -t reiserfs -o remount,user_xattr,acl,noatime,nodiratime /dev/md3 /mnt/disk3 |& logger

Aug 14 22:14:58 Mediatower emhttp: reiserfs: resizing /mnt/disk3

Aug 14 22:14:58 Mediatower emhttp: shcmd (63): mkdir -p /mnt/disk4

Aug 14 22:14:58 Mediatower emhttp: shcmd (64): set -o pipefail ; mount -t reiserfs -o noatime,nodiratime /dev/md4 /mnt/disk4 |& logger

Aug 14 22:14:58 Mediatower kernel: REISERFS (device md4): found reiserfs format "3.6" with standard journal

Aug 14 22:14:58 Mediatower kernel: REISERFS (device md4): using ordered data mode

Aug 14 22:14:58 Mediatower kernel: reiserfs: using flush barriers

Aug 14 22:14:58 Mediatower kernel: REISERFS (device md4): journal params: device md4, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30

Aug 14 22:14:58 Mediatower kernel: REISERFS (device md4): checking transaction log (md4)

Aug 14 22:14:58 Mediatower kernel: REISERFS (device md4): Using r5 hash to sort names

Aug 14 22:14:58 Mediatower emhttp: shcmd (65): set -o pipefail ; mount -t reiserfs -o remount,user_xattr,acl,noatime,nodiratime /dev/md4 /mnt/disk4 |& logger

Aug 14 22:14:58 Mediatower emhttp: reiserfs: resizing /mnt/disk4

Aug 14 22:14:58 Mediatower emhttp: shcmd (66): mkdir -p /mnt/disk5

Aug 14 22:14:58 Mediatower emhttp: shcmd (67): set -o pipefail ; mount -t reiserfs -o noatime,nodiratime /dev/md5 /mnt/disk5 |& logger

Aug 14 22:14:58 Mediatower kernel: REISERFS (device md5): found reiserfs format "3.6" with standard journal

Aug 14 22:14:58 Mediatower kernel: REISERFS (device md5): using ordered data mode

Aug 14 22:14:58 Mediatower kernel: reiserfs: using flush barriers

Aug 14 22:14:58 Mediatower kernel: REISERFS (device md5): journal params: device md5, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30

Aug 14 22:14:58 Mediatower kernel: REISERFS (device md5): checking transaction log (md5)

Aug 14 22:14:58 Mediatower kernel: REISERFS (device md5): Using r5 hash to sort names

Aug 14 22:14:58 Mediatower emhttp: shcmd (68): set -o pipefail ; mount -t reiserfs -o remount,user_xattr,acl,noatime,nodiratime /dev/md5 /mnt/disk5 |& logger

Aug 14 22:14:58 Mediatower emhttp: reiserfs: resizing /mnt/disk5

Aug 14 22:14:58 Mediatower emhttp: shcmd (69): mkdir -p /mnt/cache

Aug 14 22:14:58 Mediatower emhttp: shcmd (70): set -o pipefail ; mount -t btrfs -o noatime,nodiratime /dev/sdh1 /mnt/cache |& logger

Aug 14 22:14:58 Mediatower kernel: BTRFS info (device sdh1): disk space caching is enabled

 

What is the message about pipefail? What does it mean? I didn't find anything on the forums.

Is this something I should be concerned about or need to fix?

 

I'll attach the new syslog in case anyone would like to take a look and see if there are any problems I should be worried about.

 

Thanks - niven

mediatower-syslog-20150815-0117.zip

Link to comment

So I replaced the SATA cables last night and there seem to be no error messages now. I guess I will just have to keep an eye on it in the future.

...

Thanks - niven

 

This is correct. Many users mistakenly believe that once a problem is fixed, it will immediately be obvious and the damage done by the problem will be undone. So if a drive is kicked from the array due to a cabling problem, and the cabling problem is fixed, they think that the disk will suddenly be back in the array and all is well. Or if the UDMA_CRC_Error_Count is high due to a cabling issue, that fixing the cabling issue will result in that attribute going back to zero. Neither of these is true. Once a drive is kicked from the array, the user has to rebuild the disk, or perform some more advanced techniques to get it back in the array. If the UDMA_CRC_Error_Count increments, it will never decrement, and you'll want to monitor it to make sure it does not increment further. The only way to tell is by remembering the highest value from when the cabling problem existed, and using that for comparison (myMain gives an easy way to do that).

 

(Note that the only attribute that you would be looking to go back to zero once it increments is the Current_Pending_Sector. Once a pending sector is reallocated or put back in service by the drive, its value will (should) decrement. So you should always be looking for a zero for Current_Pending_Sector.)
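
The comparison described above can be sketched roughly like this: remember the CRC count from when the cabling problem existed, then check later readings against it, and separately expect Current_Pending_Sector to be zero. The function name and the numbers in the usage lines are hypothetical; this is not how myMain does it, just the same idea in a few lines.

```python
def check_after_cable_swap(crc_baseline, crc_now, pending_now):
    """Return a list of findings; an empty list means nothing new to worry about.

    crc_baseline: UDMA_CRC_Error_Count noted while the cabling problem existed
    crc_now:      the same attribute read later (it never goes back down)
    pending_now:  Current_Pending_Sector read later (should return to zero)
    """
    findings = []
    if crc_now > crc_baseline:
        findings.append("CRC count rose by %d since the swap" % (crc_now - crc_baseline))
    if pending_now > 0:
        findings.append("%d sectors still pending" % pending_now)
    return findings


# Invented readings: the count stayed at its old value, so the list is empty.
print(check_after_cable_swap(crc_baseline=2, crc_now=2, pending_now=0))
# Invented readings: the count kept climbing after the swap.
print(check_after_cable_swap(crc_baseline=2, crc_now=5, pending_now=0))
```

Note that a CRC count equal to the baseline is the good outcome here, even though the number itself is not zero.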

Link to comment
set -o pipefail is normal.
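
For background on why it shows up in those mount lines: in bash, a pipeline normally reports only the exit status of its last command, so with `mount ... |& logger` a failed mount would be hidden behind a successful logger. With pipefail set, the pipeline reports the rightmost non-zero status instead. The sketch below only models that rule in Python; it does not run any shell commands, and the exit code 32 is just an example value.

```python
def pipeline_status(exit_codes, pipefail=False):
    """Model the status bash reports for `cmd1 | cmd2 | ...`.

    exit_codes lists each command's exit status, left to right.
    """
    if not pipefail:
        # Default behaviour: only the last command's status is reported.
        return exit_codes[-1]
    # With pipefail: the rightmost non-zero status wins, else zero.
    failures = [code for code in exit_codes if code != 0]
    return failures[-1] if failures else 0


# Example: mount fails with status 32 (example value), logger succeeds.
print(pipeline_status([32, 0]))                 # 0, the failure is hidden
print(pipeline_status([32, 0], pipefail=True))  # 32, the failure is reported
```

So the line in the syslog is just the command being logged with that option in front of it, not an error message.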

 

Ok, thanks for the confirmation trurl!

 

Thank you so much for the information! I'm going to use this in the future. Also, I need to install unMENU and myMain; I haven't done that yet. Are unMENU and myMain compatible with unRAID 6.1?

Link to comment
