Hard disk error messages in syslog. Parity errors detected when checked.


Joe L.


This was initially posted in the 4.5beta12 release thread, but it had nothing to do with the new release, so it is continuing here:

 

I'm having some problems and I don't know whether they are related to beta12 or to a few new hard disks I added during the last couple of days. Basically, adding the new hard disks seems to have worked and everything seems to be fine. But today I started a parity check and it showed 4 errors right when it started. So I checked the syslog and found a few things that worry me. Here's the full syslog:

 

http://madshi.net/syslog.txt

 

Dec  2 13:09:01 Tower kernel: ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec  2 13:09:01 Tower kernel: ata10.00: ATA-7: SAMSUNG HD154UI, 1AG01113, max UDMA7
Dec  2 13:09:01 Tower kernel: ata10.00: 2930277168 sectors, multi 0: LBA48 NCQ (depth 0/32)
Dec  2 13:09:01 Tower kernel: ata10.00: configured for UDMA/133
Dec  2 13:09:01 Tower kernel: ata10: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xf t4
Dec  2 13:09:01 Tower kernel: ata10: hotplug_status 0x2
Dec  2 13:09:01 Tower kernel: ata10: hard resetting link
Dec  2 13:09:01 Tower kernel: ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec  2 13:09:01 Tower kernel: ata10.00: configured for UDMA/133
Dec  2 13:09:01 Tower kernel: ata10: EH complete

 

Is that "exception" supposed to be there? The same thing shows up for ata11 and ata12. And then I have 6 exceptions like these, all for ata8:

 

Dec  2 16:24:07 Tower kernel: ata8.00: exception Emask 0x10 SAct 0x0 SErr 0x780100 action 0x6
Dec  2 16:24:07 Tower kernel: ata8.00: irq_stat 0x08000000
Dec  2 16:24:07 Tower kernel: ata8: SError: { UnrecovData 10B8B Dispar BadCRC Handshk }
Dec  2 16:24:07 Tower kernel: ata8.00: cmd 25/00:f8:0f:98:b8/00:03:22:00:00/e0 tag 0 dma 520192 in
Dec  2 16:24:07 Tower kernel:          res 50/00:00:0e:98:b8/00:00:22:00:00/e0 Emask 0x10 (ATA bus error)
Dec  2 16:24:07 Tower kernel: ata8.00: status: { DRDY }
Dec  2 16:24:07 Tower kernel: ata8: hard resetting link
Dec  2 16:24:07 Tower kernel: ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec  2 16:24:07 Tower kernel: ata8.00: configured for UDMA/133
Dec  2 16:24:07 Tower kernel: ata8: EH complete

 

Are these ok? Or is it a sign of something bad? How can I find out what disk ata8 is exactly?

 

I'd run SMART reports on the drives themselves

 

How do I do that? Do I have to remove the drives from unRAID to do that? Or is there a way to run the SMART report directly in unRAID somehow? Thanks!

 

I'll order some replacement cables...

A good first step.

P.S: How can I find out which drive is "ata8"?

http://ata.wiki.kernel.org/index.php/Libata_error_messages

Port 8 on disk controller 0 (If I understand the wiki correctly)

 

As far as the error itself.  The wiki says:

These bits are set by the SATA host interface in response to error conditions on the SATA link. Unless a drive hotplug or unplug operation occurred, it is generally not normal to see any of these bits set. If they are, it usually points strongly toward a hardware problem (often a bad SATA cable or a bad or inadequate power supply).

 

Since you recently added disks, it could be either.
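
For anyone who wants to check which of "these bits" are set without reading the SError line, here is a minimal bash sketch that decodes the SErr value from the ata8 messages above. The function name decode_serr is made up for illustration, and the bit positions and short names are the ones I recall from the kernel's libata code, so treat the table as an assumption and verify it against the wiki page.

```shell
#!/usr/bin/env bash
# decode_serr: hypothetical helper that prints the names of the bits set in a
# SATA SErr value. Bit numbers and names are assumed to match what libata prints.
decode_serr() {
    local value=$(( $1 ))
    local entry bit name out=""
    for entry in 0:RecovData 1:RecovComm 8:UnrecovData 9:Persist 10:Proto 11:HostInt \
                 16:PHYRdyChg 17:PHYInt 18:CommWake 19:10B8B 20:Dispar 21:BadCRC \
                 22:Handshk 23:LinkSeq 24:TrStaTrns 25:UnrecFIS 26:DevExch; do
        bit=${entry%%:*}     # part before the colon: the bit number
        name=${entry#*:}     # part after the colon: the short name
        if (( value & (1 << bit) )); then
            out+="${out:+ }$name"
        fi
    done
    printf '%s\n' "$out"
}

# SErr value taken from the ata8 exception in the syslog above
decode_serr 0x780100   # prints: UnrecovData 10B8B Dispar BadCRC Handshk
```

The result matches the SError line the kernel printed for ata8, which suggests the table is at least right for these five bits.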

 

Joe L.


Thanks for your help, Joe!

 

Hmmmm... I have several controllers, but none of them has 8 ports. I wish the syslog would list the hard disk serial number for each controller/port.

 

I've replaced Western Digital 1TB disks (spec says 7.4 Watts) with Samsung 1.5TB disks (spec says 6.3 Watts). So power consumption should have gone down.

 

It might be a cable. The big question is how to find out which cable. Of course I could simply replace all cables. But if I knew which disk was ata8, I could try replacing only that cable right now. I have a few cables lying around, but not enough for all disks.

 

Can you give me a hint on how to do that SMART report you mentioned earlier? Can I do that directly within unRAID (e.g. via telnet) or do I have to remove the drives from unRAID and connect them to my PC?

 

Thanks!


P.S: I think I was able to identify ata8: These "ata" numbers seem to match the "host" numbers in the new beta12 "spinup group" edit field. At least I still have 2 hard disks from different manufacturers in my unRAID array, and the "ata" syslog messages for these match the "host" numbers for these two hard disks. If this identification method is correct, ata8 is an "old" hard disk, one which has been there for weeks or even months. That would indicate that the error is not caused by the new hard disks, but that the problem might really be caused by a bad cable. Maybe the cable isn't even bad, but just came loose when I replaced the hard disks during the past few days...

 

Got 2 new BadCRC exceptions in the syslog now during the still running parity check, but the parity error count is still sitting at 4. Strange. At least the 2 new exceptions are again from ata8. So I guess it's really only that one cable that is problematic.


Can you give me a hint on how to do that SMART report you mentioned earlier? Can I do that directly within unRAID (e.g. via telnet) or do I have to remove the drives from unRAID and connect them to my PC?

Thanks!

Yes, you can do it from the telnet prompt.

 

Type

smartctl -a -d ata /dev/XXX

where XXX = sda, sdb, sdc, sdd, etc. for SATA disks (or disks on a controller emulating SCSI disks), or

XXX = hda, hdb, hdc, etc. for older PATA disks (you don't have any PATA disks, but for future reference).
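
If you want a report for several disks in one go, a small bash sketch like the following could generate the commands. The helper name smart_report_cmds is hypothetical; it only prints the same smartctl command shown above for each device name you pass, so you can look the list over before running anything.

```shell
#!/usr/bin/env bash
# smart_report_cmds: hypothetical helper that prints one smartctl command per
# device name given as an argument. It does not run smartctl itself.
smart_report_cmds() {
    local dev
    for dev in "$@"; do
        printf 'smartctl -a -d ata /dev/%s\n' "$dev"
    done
}

# Print the commands for two of the disks; append "| sh" to actually run them.
smart_report_cmds sdh sdi
```

Printing instead of executing is a deliberate choice here, since it makes a mistyped device name easy to spot first.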

 

The device designation is shown in "(parens)" here in your syslog:

Dec  2 13:09:01 Tower emhttp: Device inventory:
Dec  2 13:09:01 Tower emhttp: pci-0000:00:1f.2-scsi-0:0:0:0 host1 (sda) ata-SAMSUNG_HD501LJ_S0MUJ1MP604127
Dec  2 13:09:01 Tower emhttp: pci-0000:00:1f.2-scsi-1:0:0:0 host2 (sdb) ata-SAMSUNG_HD154UI_S1XWJ1KSA12289
Dec  2 13:09:01 Tower emhttp: pci-0000:00:1f.2-scsi-2:0:0:0 host3 (sdc) ata-SAMSUNG_HD154UI_S1Y6J1KS724526
Dec  2 13:09:01 Tower emhttp: pci-0000:00:1f.2-scsi-3:0:0:0 host4 (sdd) ata-SAMSUNG_HD154UI_S1XWJ1KSA12291
Dec  2 13:09:01 Tower emhttp: pci-0000:00:1f.2-scsi-4:0:0:0 host5 (sde) ata-SAMSUNG_HD154UI_S1XWJ1KSA12150
Dec  2 13:09:01 Tower emhttp: pci-0000:00:1f.2-scsi-5:0:0:0 host6 (sdg) ata-WDC_WD1000FYPS-01ZKB0_WD-WCASJ0654104
Dec  2 13:09:01 Tower emhttp: pci-0000:02:00.0-scsi-1:0:0:0 host8 (sdh) ata-SAMSUNG_HD154UI_S1XJJDWS517390
Dec  2 13:09:01 Tower emhttp: pci-0000:05:00.0-scsi-0:0:0:0 host9 (sdi) ata-SAMSUNG_HD154UI_S1XJJDWS517397
Dec  2 13:09:01 Tower emhttp: pci-0000:05:00.0-scsi-1:0:0:0 host10 (sdj) ata-SAMSUNG_HD154UI_S1XJJDWS517392
Dec  2 13:09:01 Tower emhttp: pci-0000:05:00.0-scsi-2:0:0:0 host11 (sdk) ata-SAMSUNG_HD154UI_S1XJJDWS517393
Dec  2 13:09:01 Tower emhttp: pci-0000:05:00.0-scsi-3:0:0:0 host12 (sdl) ata-SAMSUNG_HD103UJ_S13PJ1NQ713356

 

Earlier in your syslog was this set of lines...

Dec  2 13:09:01 Tower kernel: ata8.00: ATA-7: SAMSUNG HD154UI, 1AG01113, max UDMA7
Dec  2 13:09:01 Tower kernel: ata8.00: 2930277168 sectors, multi 0: LBA48 NCQ (depth 31/32)
Dec  2 13:09:01 Tower kernel: ata8.00: configured for UDMA/133

So I know it is one of your SAMSUNG drives.

These lines suggest to me that the drive shows up as SCSI host 8, which points to "/dev/sdh" as the disk.

Dec  2 13:09:01 Tower kernel: scsi 8:0:0:0: Direct-Access    ATA      SAMSUNG HD154UI  1AG0 PQ: 0 ANSI: 5
Dec  2 13:09:01 Tower kernel: sd 8:0:0:0: [sdh] 2930277168 512-byte logical blocks: (1.50 TB/1.36 TiB)
Dec  2 13:09:01 Tower kernel: sd 8:0:0:0: [sdh] Write Protect is off
Dec  2 13:09:01 Tower kernel: sd 8:0:0:0: [sdh] Mode Sense: 00 3a 00 00
Dec  2 13:09:01 Tower kernel: sd 8:0:0:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

 

My bet is the errors are on /dev/sdh.  Try its cable first.  But remember, it could as easily be noise on the power supply line to the disk.  A lower power "green" drive does not mean it is less likely to be voltage sensitive...  You might move the disk to a different power supply connector if the SATA cable swap has no effect. 
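
The lookup I did by eye above can be sketched in bash roughly like this. It assumes the "ataN" number lines up with the "hostN" number in the emhttp device inventory, which is what this syslog suggests but is not something I can promise for every system. The helper name host_to_dev and the /var/log/syslog path in the usage note are assumptions.

```shell
#!/usr/bin/env bash
# host_to_dev: hypothetical helper. Reads syslog text on stdin and prints the
# device node and disk id that the emhttp "Device inventory" lists for hostN.
host_to_dev() {
    local host="host$1"
    awk -v host="$host" '
        /emhttp: pci-/ {
            for (i = 1; i < NF; i++) {
                if ($i == host) {            # exact match, so host1 does not match host10
                    dev = $(i + 1)           # e.g. "(sdh)"
                    gsub(/[()]/, "", dev)    # strip the parentheses
                    print "/dev/" dev, $(i + 2)
                }
            }
        }'
}
```

Used as `host_to_dev 8 < /var/log/syslog`, it should print the /dev/sdX name and serial for host8, which you can then hand to smartctl.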

 

What make/model power supply are you using?

 

Joe L.


It could be something other than the cable... It is as likely to be the power supply or its cabling. Do you have power supply "splitters" in line with the drives?

 

Joe L.


Thanks again, Joe.

 

Yes, sdh is the same drive I also ended up suspecting. I'll try the SMART stuff and cable swapping as soon as the current parity check is through. FWIW, the syslog got another few exceptions from drive ata8. But the parity error count is still at 4. It seems that the kernel can often recover from these exceptions without producing real read errors. Anyway, I'm happy that the exceptions are reproducible by doing a parity check. This way I have a reliable test for whether the problem is really gone or not.
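
To make that before/after comparison less error-prone than counting by eye, a bash sketch along these lines could tally the exception lines per port. The helper name count_ata_exceptions is made up, and the pattern assumes the exception lines look like the ones pasted earlier in this thread.

```shell
#!/usr/bin/env bash
# count_ata_exceptions: hypothetical helper. Reads syslog text on stdin and
# prints "<port> <count>" for every ataN port that logged an
# "exception Emask" line, sorted by port name.
count_ata_exceptions() {
    awk '
        match($0, /kernel: ata[0-9]+(\.[0-9]+)?: exception Emask/) {
            port = substr($0, RSTART + 8, RLENGTH - 8)   # drop the leading "kernel: "
            sub(/[.:].*/, "", port)                      # "ata8.00: exception Emask" -> "ata8"
            count[port]++
        }
        END { for (port in count) print port, count[port] }
    ' | sort
}
```

Running `count_ata_exceptions < /var/log/syslog` (path assumed) after each parity check and comparing the ata8 count would show whether a cable swap made a difference. Note that it also counts the hotplug-style exceptions logged at boot, so only the change between runs is meaningful.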

 

I do use power supply splitters. The power supply didn't have enough power connectors for all 11 drives. If the SATA cable replacement doesn't help, I'll try playing around with the power cables.

 

I've just checked. My power supply is a Corsair HX620W. It is rated at 620W, which should really be more than plenty!


The power supply is good for 21 drives, IF (!) you're using green drives (I tested this myself and can confirm it from my own experience). I had older drives in another box that required twice the current, and that, combined with bad connections on the splitter plugs, gave me bad errors.

 


This review, http://www.hardwaresecrets.com/article/371/1, found that internally the power supply is really a two-rail supply, with one of the rails for the CPU/MB. The two rails are independently monitored for current draw, BUT they are both connected to the same internal 12 volt source.

 

With that in mind, and since the supply uses modular connectors and he has said he uses splitters, odds are high that all the drives are on one rail. Odds are also higher than usual that voltage drops across multiple splitters, cables, and connectors leave a disk more exposed to voltage fluctuations, especially if it sits at the end of a chain of connectors/splitters.

 

If the replacement of the SATA data cable does not fix the issue, it is still a good idea to see if it can be cabled differently to eliminate an intermediate power supply connector or two.

 

Joe L.


Interesting. 8 out of 11 disks are now Samsung EcoGreen F2 1500GB. Plan to move everything to EcoGreen F2 in the next couple of days/weeks.

 

I will definitely do that if the SATA cable replacement doesn't help. However, I'm having a hard time believing that the power supply could be causing trouble, since according to most reviews (including the one you linked to) it tested very well. And I'm only running it at maybe 30-35% of its capacity. Well, anyway, as I said, I'll give it a try. But I'm quite hopeful that the SATA cable replacement will fix the issue.

 

Thanks again for your help, guys!


It is a good supply... I have no doubt, but all it takes is one poor quality connection and the voltage that is very well regulated at the supply will not be within spec at the end of a series of splitters. If possible, use multiple connections to the connectors on the side of the supply. Have as few connections as possible between any given drive and the power supply.

 

Good luck... I'm sure you'll find the cause.

 

Joe L.


Hard disk power supply splitters are of the devil.

After a lot of trouble with poor power connections in my big tower with 6 hard disks (disks that stopped responding during reads/writes, or trouble booting the server), I took out all the power splitters, cut off all the 4-pin male connectors, soldered the 4 wires of each splitter together, and connected those to the PC power supply using a quad screw terminal block. I have never had any of those issues since.

 

 

 


FWIW, ran another parity check yesterday with the cables still as they were. I've got 17 of those CRC complaints, all about ata8, but still the parity check ended with 0 errors. It seems to me that these CRC complaints are properly worked around by the controller or OS somehow. That makes me wonder where those 4 parity errors came from which made me analyse the syslog in the first place.

 

Anyway, will be playing with the cables now...


Several times, I wanted to comment on that, but it did not seem important enough, and you were being ably helped.

 

I have come to think of the BadCRC error flag as equivalent to BadCable, as in "See BadCRC, think BadCable." But the experts do say that it could also be caused by an issue with the power supplied to the drive. I do NOT think it can cause a parity error or data corruption, because any issue with the data, such as an incorrect CRC, causes the data to be re-sent until it IS correct.

 

In my opinion, the 4 parity errors, especially because you said they were at the very beginning of the drive where the 'housekeeping' bytes of the file system are, are probably because of a server crash or power outage sometime between this parity check and the previous parity check.
