Bad drive or bad slot?



Some history...

 

I recently had an 8T drive fail (red ball) in a slot I'll call A.  I had a spare 8T unassigned device, so I swapped it in and it rebuilt fine.

I ran pre-clear on the "failed" drive in slot A and it failed in the zeroing phase..  Ok maybe it was a bad drive.

So I bought another 8T drive to keep as a spare.  I shucked it and threw it into slot B.  Slot B had an old 4T drive that I took out of service for some reason.

I ran pre-clear on the new 8T drive in slot B and it ran fine.  No issues other than a couple of UDMA CRC errors.  It was only like 4 or 5..  So I scratched my head and continued.  I want to remove three 2T disks, so I added my "spare" in slot B to the array.  I moved everything from one of the 2T disks onto the 8T drive in slot B.

That ran fine and came up with maybe one or two more CRC errors.  I zeroed out the 2T drive and emptied the 2nd 2T drive onto the 8T in slot B.  That went fine...  but a couple more CRC errors.  So I'm thinking maybe slot B has some issues with the cable?

 

The first 2T was ready to be removed from the system, so I removed it, created a new config, and trusted parity.  That went fine.  I started a parity check and that was working fine, so I stopped it.  I shut down the array and was going to try the new disk in a different slot.  I moved the new 8T into slot A and removed the "failed" 8T.

I started the system and copied more files from the 2nd 2T disk that I wanted to empty.  After that was done I noticed that the new disk in slot A red balled!!!  It had a bunch of errors (like 6k).  Shoot!  WTF!  So now I've put the new 8T back in slot B, and it's currently being re-built!  It's moving along at 7% with no disk errors.

It's up to 14 CRC errors, but I think that's what it had at the start of the re-build.

 

So do I have a bad drive or bad slot(s)?  The slots are in 2 different 5-bay Norco drive cages.

Trying to figure out how to proceed..  Looking to the experts here..

 

 

 

20 minutes ago, jbuszkie said:

Yeah..  I know the CRC errors are connection issues.. usually..  But these cables haven't been touched in years...

If there are errors there's a problem, and it's not the drive.

 

21 minutes ago, jbuszkie said:

I'm worried about the drive red-balling in slot A...

Is it the drive or the slot...

Without any diags posted we can only guess.

3 minutes ago, jbuszkie said:

They don't contain any historical data about failed disks, right?

They should be saved after the disk gets disabled and before rebooting. Without the syslog, and based on the SMART report, the disk looks OK. CRC errors like the ones mentioned aren't a disk problem, so it's likely the slot/cable.
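One way to tell whether reseating or swapping a cable actually helped is to compare the raw value of SMART attribute 199 (UDMA_CRC_Error_Count) before and after a large transfer. The counter never resets, so what matters is whether it keeps climbing. Below is a minimal sketch, assuming the usual `smartctl -A` attribute-table layout; the sample text and the function name `crc_error_count` are made up for illustration and are not taken from the diagnostics in this thread.

```python
import re

# Hypothetical sample in the usual `smartctl -A` table layout
# (illustrative only, not copied from this thread's diagnostics).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1234
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       14
"""


def crc_error_count(smart_text):
    """Return the raw value of SMART attribute 199, or None if it is absent."""
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; the raw value is column 10.
        if len(fields) >= 10 and fields[0] == "199":
            match = re.match(r"\d+", fields[9])
            return int(match.group()) if match else None
    return None


if __name__ == "__main__":
    before = crc_error_count(SAMPLE)
    print("UDMA CRC errors so far:", before)
```

To use it on a live system you would feed it the text output of `smartctl -A` for the drive in question, note the number, run a rebuild or a big copy, and check again. If the number has not moved, the connection is probably fine now.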


Crap!  Those are two different Norco drive cages. One with the dropped disk and one with the CRC errors. And it's not like I bumped the machine or anything..  I guess it's time to open her up and check for loose cables to the cages!  So the drive might be good???  hmm..

Has anyone seen years of vibrations knock a SATA cable loose on a drive cage?

 

I *think* both of those slots in question are connected to the MB not the LSI card...

Has anyone seen Norco slots go bad?

 

 

Thanks @johnnie.black

Quote

At least that one is, it's using one of the two Asmedia ports.

Ok..  I give up..  How were you able to tell that ata7 and ata8 were the Asmedia ports from the syslog?  Because I know my motherboard..  I could figure it out..  But I can't find anywhere which ports are mapped to which controller in the syslog...  There is no mention of asmedia in the syslog.


With the full diagnostics it would be easy to see it's an Asmedia controller. The syslog alone doesn't show Asmedia; it just shows a two-port controller loading after the first 6 Intel ports. Using the motherboard model, though, I could see it has a 2-port Asmedia controller.

36 minutes ago, jbuszkie said:

Ok..  I give up..  How were you able to tell that ata7 and ata8 were the Asmedia ports from the syslog?  Because I know my motherboard..  I could figure it out..  But I can't find anywhere which ports are mapped to which controller in the syslog...  There is no mention of asmedia in the syslog.

How about this:

[from your 800k syslog] lines 759-763

Mar 11 07:50:54 Tower kernel: ahci 0000:04:00.0: SSS flag set, parallel bus scan disabled
Mar 11 07:50:54 Tower kernel: ahci 0000:04:00.0: AHCI 0001.0200 32 slots 2 ports 6 Gbps 0x3 impl SATA mode
Mar 11 07:50:54 Tower kernel: ahci 0000:04:00.0: flags: 64bit ncq sntf stag led clo pmp pio slum part ccc sxs 
Mar 11 07:50:54 Tower kernel: scsi host7: ahci
Mar 11 07:50:54 Tower kernel: scsi host8: ahci

Then searching for 0000:04:00 leads to: [line 373]

Mar 11 07:50:54 Tower kernel: pci 0000:04:00.0: [1b21:0612] type 00 class 0x010601

And [1b21:0612] is the Vendor ID (Asmedia) : Device ID (ASM1062) pair for that controller.
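The same lookup can be done mechanically. Here is a rough sketch that pairs each `ahci` controller line with the `pci` line for the same bus address and reports the vendor:device ID. The log lines are the ones quoted in this post; the function name `ahci_controllers` and the tiny `KNOWN_IDS` table are my own additions for illustration, and the table only contains the one ID identified in this thread.

```python
import re

# Lines copied from the syslog excerpts quoted in this post.
SYSLOG = """\
Mar 11 07:50:54 Tower kernel: pci 0000:04:00.0: [1b21:0612] type 00 class 0x010601
Mar 11 07:50:54 Tower kernel: ahci 0000:04:00.0: SSS flag set, parallel bus scan disabled
Mar 11 07:50:54 Tower kernel: ahci 0000:04:00.0: AHCI 0001.0200 32 slots 2 ports 6 Gbps 0x3 impl SATA mode
Mar 11 07:50:54 Tower kernel: scsi host7: ahci
Mar 11 07:50:54 Tower kernel: scsi host8: ahci
"""

# Illustrative lookup table; only the ID identified in this thread is listed.
KNOWN_IDS = {("1b21", "0612"): "Asmedia ASM1062"}

PCI_LINE = re.compile(r"kernel: pci (\S+): \[([0-9a-f]{4}):([0-9a-f]{4})\]")
AHCI_LINE = re.compile(r"kernel: ahci (\S+): AHCI \S+ \d+ slots (\d+) ports")


def ahci_controllers(text):
    """Return one dict per AHCI controller: bus address, port count, PCI ID, name."""
    # First pass: remember the vendor:device pair announced for each bus address.
    ids = {m.group(1): (m.group(2), m.group(3)) for m in PCI_LINE.finditer(text)}
    controllers = []
    # Second pass: match each AHCI capability line back to its PCI ID.
    for m in AHCI_LINE.finditer(text):
        address = m.group(1)
        pci_id = ids.get(address)
        controllers.append({
            "address": address,
            "ports": int(m.group(2)),
            "pci_id": pci_id,
            "name": KNOWN_IDS.get(pci_id, "unknown"),
        })
    return controllers


if __name__ == "__main__":
    for ctrl in ahci_controllers(SYSLOG):
        print(ctrl)
```

This only tells you which controller is at which bus address and how many ports it has. Mapping that to ata7/ata8 still relies on the `scsi hostN` lines that follow the controller in the log, as shown above.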

 

"A rose by any other name ... is still a rose."

 

  • 1 year later...

Ugh...  I hate to bring up an old thread...  But I'm having issues again.  It looks like it's the same slot as above. 

 

I just rebooted my server and upon restart drive 8 became red balled!

Nov 12 09:46:52 Tower kernel: ata8.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen
Nov 12 09:46:52 Tower kernel: ata8.00: irq_stat 0x08000000, interface fatal error
Nov 12 09:46:52 Tower kernel: ata8: SError: { Handshk }
Nov 12 09:46:52 Tower kernel: ata8.00: failed command: WRITE DMA EXT
Nov 12 09:46:52 Tower kernel: ata8.00: cmd 35/00:08:30:14:01/00:01:00:02:00/e0 tag 19 dma 135168 out
Nov 12 09:46:52 Tower kernel:         res 50/00:00:37:15:01/00:00:00:02:00/e0 Emask 0x10 (ATA bus error)
Nov 12 09:46:52 Tower kernel: ata8.00: status: { DRDY }
Nov 12 09:46:52 Tower kernel: ata8: hard resetting link
Nov 12 09:47:02 Tower kernel: ata8: softreset failed (1st FIS failed)
Nov 12 09:47:02 Tower kernel: ata8: hard resetting link
Nov 12 09:47:12 Tower kernel: ata8: softreset failed (1st FIS failed)
Nov 12 09:47:12 Tower kernel: ata8: hard resetting link
Nov 12 09:47:47 Tower kernel: ata8: softreset failed (1st FIS failed)
Nov 12 09:47:47 Tower kernel: ata8: limiting SATA link speed to 1.5 Gbps
Nov 12 09:47:47 Tower kernel: ata8: hard resetting link
Nov 12 09:47:52 Tower kernel: ata8: softreset failed (1st FIS failed)
Nov 12 09:47:52 Tower kernel: ata8: reset failed, giving up
Nov 12 09:47:52 Tower kernel: ata8.00: disabled
Nov 12 09:47:52 Tower kernel: ata8: EH complete
Nov 12 09:47:52 Tower kernel: sd 8:0:0:0: [sdg] tag#20 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00 cmd_age=60s
Nov 12 09:47:52 Tower kernel: sd 8:0:0:0: [sdg] tag#20 CDB: opcode=0x8a 8a 00 00 00 00 02 00 01 14 30 00 00 01 08 00 00
Nov 12 09:47:52 Tower kernel: blk_update_request: I/O error, dev sdg, sector 8590005296 op 0x1:(WRITE) flags 0x800 phys_seg 33 prio class 0
Nov 12 09:47:52 Tower kernel: md: disk8 write error, sector=8590005232
Nov 12 09:47:52 Tower kernel: md: disk8 write error, sector=8590005240
Nov 12 09:47:52 Tower kernel: md: disk8 write error, sector=8590005248

 

Now I really believe that the disk is still good.  How do I get Unraid to rebuild onto that disk?  As in, how do I get it to believe that the disk is not redballed and try to rebuild to that disk?

 

The other drive from the above post that had "errors" in this slot has been behaving fine in a different slot for over a year now. So I really think it's an issue with that slot.

 

grr...  I can't remember if I replaced that ata8 cable or not last time.

 

Thanks,

 

Jim

 

 

 

2 minutes ago, jbuszkie said:

As in how do I get it to believe that the disk is not redballed and try to rebuild to that disk?  

Not quite clear what you want to do, there are two options:

1) rebuild on top of the old disk, this is usually the recommended option unless the emulated disk is not mounting.

2) do a new config to re-enable the disk but you'll need to re-sync parity.

 

As for the error, the disk dropped offline; this is usually a power/connection problem.
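To see at a glance which port is dropping, the libata messages in a syslog can be tallied per port. This is a minimal sketch: the sample lines are a subset of the ata8 excerpt posted above, while the `EVENTS` labels and the function name `summarize_ata_events` are invented for illustration. It only counts message types and does not diagnose the cause.

```python
import re
from collections import Counter

# A subset of the ata8 lines from the excerpt posted above.
LOG = """\
Nov 12 09:46:52 Tower kernel: ata8.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen
Nov 12 09:46:52 Tower kernel: ata8: SError: { Handshk }
Nov 12 09:46:52 Tower kernel: ata8: hard resetting link
Nov 12 09:47:02 Tower kernel: ata8: softreset failed (1st FIS failed)
Nov 12 09:47:02 Tower kernel: ata8: hard resetting link
Nov 12 09:47:52 Tower kernel: ata8: reset failed, giving up
Nov 12 09:47:52 Tower kernel: ata8.00: disabled
"""

# Message fragment -> short label (labels are illustrative, not kernel terms).
EVENTS = {
    "exception Emask": "exception",
    "hard resetting link": "hard_reset",
    "softreset failed": "softreset_failed",
    "reset failed, giving up": "gave_up",
    "disabled": "disabled",
}

# Matches "ata8:" as well as "ata8.00:" and captures the port and the message.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")


def summarize_ata_events(text):
    """Count the known event types seen for each ataN port."""
    counts = {}
    for line in text.splitlines():
        m = ATA_LINE.search(line)
        if not m:
            continue
        port, message = m.groups()
        for fragment, label in EVENTS.items():
            if fragment in message:
                counts.setdefault(port, Counter())[label] += 1
    return counts


if __name__ == "__main__":
    for port, events in summarize_ata_events(LOG).items():
        print(port, dict(events))
```

If the same port keeps showing resets with different drives in it, that points at the cable, backplane slot, or controller port for that position, not at the disks.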

2 minutes ago, JorgeB said:

Not quite clear what you want to do, there are two options:

1) rebuild on top of the old disk, this is usually the recommended option unless the emulated disk is not mounting.

2) do a new config to re-enable the disk but you'll need to re-sync parity.

 

 

1)  I want to rebuild on top of the old disk

 

How do I do that?  Unraid has it redballed


This is what I did, based on a post from Squid that I found:

On 11/21/2020 at 12:00 AM, Squid said:

Anytime a disk is redballed (as yours is), you must rebuild the contents of the drive.  You don't need to clear it again.

 

Stop the array, unassign the disk.  Start the array, stop the array, re-assign the disk and restart the array.  A rebuild will happen.

 

 

It seems to be rebuilding.  I'm getting more memory tomorrow, so I'll try to replace that SATA cable then, or switch the cable to my last free slot and mark that slot as bad! 😞

 

4 minutes ago, trurl said:

"manual" link at lower right of the webUI takes you to the current version of the documentation. Also linked at top and bottom of forum.

The manual never used to be useful for stuff like this.  I've always relied on the knowledge here!

It seems like the manual has improved!
