Bad drive? or Bad slot??

jbuszkie · March 11, 2020

Some history...

I had an 8T drive fail (red ball) recently in a slot we'll call A. I had a spare 8T unassigned device, so I swapped it in and it rebuilt fine.

I ran pre-clear on the "failed" drive in slot A and it failed in the zeroing phase.. Ok maybe it was a bad drive.

So I bought another 8T drive to keep as a spare. I shucked it and threw it into slot B. Slot B had an old 4T drive that I took out of service for some reason.

I ran pre-clear on the new 8T drive in slot B and it ran fine. No issues other than a couple UDMA CRC errors. It was only like 4 or 5.. So I scratched my head and continued. I want to remove 3 2T disks so I put my "spare" in slot B in the array. I took all the stuff on one of the 2T disks and put it on the 8T drive in slot B.

That ran fine and came up with maybe one or two more CRC errors. I zeroed out the 2T drive and emptied the 2nd 2T drive onto the 8T in slot B. That went fine... but a couple more CRC errors. So I'm thinking maybe slot B has some issues with the cable?

The first 2T was ready to be removed from the system, so I removed it, created a new config, and trusted parity. That went fine. I started a parity check and that was working fine, so I stopped it. I shut down the array and was going to try the new disk in a different slot. I moved the new 8T into slot A and removed the "failed" 8T.

I started the system and copied more files from the 2nd 2T disk that I wanted to empty. After that was done I noticed that the new disk in slot A red balled!!! It had a bunch of errors (like 6k). Shoot! WTF! So now I put the new 8T back in slot B, and it's currently being rebuilt! It's moving along at 7% with no disk errors.

It's up to 14 CRC errors, but I think that's what it started with at the start of the rebuild.

So do I have a bad drive or bad slot(s)? The slots are in 2 different 5-bay Norco drive cages.

Trying to figure out how to proceed.. Look to the experts here..

JorgeB · March 11, 2020

CRC errors are a connection problem, 9 times out of 10 a bad SATA cable, but could also be the backplane, even the controller, though much less likely.
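One way to keep an eye on this is to pull the CRC counter out of the SMART report before and after reseating or swapping a cable. Below is a minimal sketch, assuming the report is `smartctl -A` style text and that the drive reports the counter as attribute 199 (UDMA_CRC_Error_Count). The sample rows are illustrative only; the raw value of 14 is the count mentioned in this thread, everything else in them is made up for the example.

```python
# Illustrative `smartctl -A` style rows. Only the raw value 14 comes from
# this thread; the other columns are placeholder values for the example.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       14
"""


def crc_error_count(smart_text):
    """Return the raw value of attribute 199, or None if it is not listed."""
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns; the raw value is the last one.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


if __name__ == "__main__":
    print("UDMA CRC errors so far:", crc_error_count(SAMPLE))
```

The raw counter is cumulative, so the absolute number matters less than whether it keeps climbing after the cable or slot has been changed.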

jbuszkie · March 11, 2020

Yeah.. I know the CRC are connection issues.. usually.. But these cables haven't been touched in years... I'm worried about the drive red-balling in slot A...

Is it the drive or the slot...

JorgeB · March 11, 2020

20 minutes ago, jbuszkie said:

Yeah.. I know the CRC are connection issues.. usually.. But these cables haven't been touched in years...

If there are errors there's a problem, and it's not the drive.

21 minutes ago, jbuszkie said:

I'm worried about the drive red-balling in slot A...

Is it the drive or the slot...

Without any diags posted we can only guess.

jbuszkie · March 11, 2020

I'll give the diags.. But I'm not sure what help it will be. They don't contain any historical data about failed disks, right? You would be able to see the CRC errors in the SMART report.. What else is useful in there?

tower-diagnostics-20200311-1114.zip

Edited March 11, 2020 by jbuszkie

JorgeB · March 11, 2020

3 minutes ago, jbuszkie said:

They don't contain any historical data about failed disks, right?

They should be saved after the disk gets disabled and before rebooting. Without the syslog, and based on the SMART report, the disk looks OK. CRC errors, as mentioned, aren't a disk problem, so it's likely the slot/cable.

jbuszkie · March 11, 2020

Yeah, this is after the reboots, during the rebuild, so the disabled disk info is not there...

... I actually found the syslogs..

syslog-20200311-074811.txt syslog-20200311-082322.txt

jbuszkie · March 11, 2020

Look at the 800K one.. The errors start around 8:16ish

JorgeB · March 11, 2020

Yes, looks like a connection/power issue. The disk dropped offline out of the blue, as if the power or SATA cable was pulled.

jbuszkie · March 11, 2020

Crap! Those are two different Norco drive cages. One with the dropped disk and one with the CRC errors. And it's not like I bumped the machine or anything.. I guess it's time to open her up and check for loose cables to the cages! So the drive might be good??? hmm..

Anyone seen years of vibrations knocking a sata cable loose on a drive cage?

I *think* both of those slots in question are connected to the MB not the LSI card...

Has anyone seen norco slots go bad?

Thanks @johnnie.black

JorgeB · March 11, 2020

32 minutes ago, jbuszkie said:

So the drive might be good???

Most likely.

37 minutes ago, jbuszkie said:

I *think* both of those slots in question are connected to the MB

At least that one is, it's using one of the two Asmedia ports.

jbuszkie · March 11, 2020

Quote

At least that one is, it's using one of the two Asmedia ports.

I keep forgetting how to piece everything together in the syslog! I haven't had to look at the syslog in a long time LOL. I have to re-learn each time 😄

jbuszkie · March 11, 2020

Quote

At least that one is, it's using one of the two Asmedia ports.

Ok.. I give up.. How were you able to tell that ata7 and ata8 were the Asmedia ports from the syslog? Because I know my motherboard.. I could figure it out.. But I can't find anywhere which ports are mapped to which controller in the syslog... There is no mention of asmedia in the syslog.

JorgeB · March 11, 2020

With the full diagnostics it would be easy to see it's an Asmedia controller. With just the syslog it doesn't show Asmedia; it just shows a two-port controller loading after the first 6 Intel ports, but using the motherboard model I could see it has a 2-port Asmedia controller.

UhClem · March 11, 2020

36 minutes ago, jbuszkie said:

Ok.. I give up.. How were you able to tell that ata7 and ata8 were the Asmedia ports from the syslog? Because I know my motherboard.. I could figure it out.. But I can't find anywhere which ports are mapped to which controller in the syslog... There is no mention of asmedia in the syslog.

How about this:

[from your 800k syslog] lines 759-763

Mar 11 07:50:54 Tower kernel: ahci 0000:04:00.0: SSS flag set, parallel bus scan disabled
Mar 11 07:50:54 Tower kernel: ahci 0000:04:00.0: AHCI 0001.0200 32 slots 2 ports 6 Gbps 0x3 impl SATA mode
Mar 11 07:50:54 Tower kernel: ahci 0000:04:00.0: flags: 64bit ncq sntf stag led clo pmp pio slum part ccc sxs 
Mar 11 07:50:54 Tower kernel: scsi host7: ahci
Mar 11 07:50:54 Tower kernel: scsi host8: ahci

Then searching for 0000:04:00 leads to: [line 373]

Mar 11 07:50:54 Tower kernel: pci 0000:04:00.0: [1b21:0612] type 00 class 0x010601

And [1b21:0612] is the Vendor ID (Asmedia) : Device ID (ASM1062) pair for that controller.

"A rose by any other name ... is still a rose."
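The lookup above can be sketched in a few lines of Python. This is only a rough illustration: the sample text is the set of syslog lines quoted in this post, the `KNOWN_IDS` table is a hypothetical one-entry lookup built from the ID pair named above, and the script assumes that the `scsi hostN: ahci` lines directly following an `ahci <address>` line belong to that controller, which may not hold if another controller's messages are interleaved.

```python
import re

# Syslog lines quoted in this thread (the pci line appears earlier in the log).
SYSLOG = """\
Mar 11 07:50:54 Tower kernel: pci 0000:04:00.0: [1b21:0612] type 00 class 0x010601
Mar 11 07:50:54 Tower kernel: ahci 0000:04:00.0: SSS flag set, parallel bus scan disabled
Mar 11 07:50:54 Tower kernel: ahci 0000:04:00.0: AHCI 0001.0200 32 slots 2 ports 6 Gbps 0x3 impl SATA mode
Mar 11 07:50:54 Tower kernel: ahci 0000:04:00.0: flags: 64bit ncq sntf stag led clo pmp pio slum part ccc sxs
Mar 11 07:50:54 Tower kernel: scsi host7: ahci
Mar 11 07:50:54 Tower kernel: scsi host8: ahci
"""

PCI_ID = re.compile(r"kernel: pci (\S+): \[([0-9a-f]{4}):([0-9a-f]{4})\]")
AHCI = re.compile(r"kernel: ahci (\S+): ")
HOST = re.compile(r"kernel: scsi host(\d+): ahci")

# Hypothetical lookup table; only the pair identified in this thread.
KNOWN_IDS = {("1b21", "0612"): "ASMedia ASM1062"}


def map_hosts(text):
    """Map scsi host number -> (PCI address, (vendor ID, device ID))."""
    ids = {}        # PCI address -> (vendor, device)
    hosts = {}      # host number -> PCI address
    current = None  # most recent ahci controller seen (assumption: hosts follow it)
    for line in text.splitlines():
        m = PCI_ID.search(line)
        if m:
            ids[m.group(1)] = (m.group(2), m.group(3))
            continue
        m = AHCI.search(line)
        if m:
            current = m.group(1)
            continue
        m = HOST.search(line)
        if m and current is not None:
            hosts[int(m.group(1))] = current
    return {host: (addr, ids.get(addr)) for host, addr in hosts.items()}


if __name__ == "__main__":
    for host, (addr, pair) in sorted(map_hosts(SYSLOG).items()):
        print(f"host{host}: {addr} {pair} {KNOWN_IDS.get(pair, 'unknown')}")
```

In this log the host numbers line up with the ata7 and ata8 ports being asked about, so both land on the controller at 0000:04:00.0.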

jbuszkie · March 11, 2020

Nice! I missed the 0000:04:00.0 being able to be mapped back to a PCI device!

From the LSPCI I also could have seen that 04:00.0 mapped to the ASMedia device!

Thanks,

Jim

jbuszkie · November 12, 2021

Ugh... I hate to bring up an old thread... But I'm having issues again. It looks like it's the same slot as above.

I just rebooted my server and upon restart drive 8 became red balled!

Nov 12 09:46:52 Tower kernel: ata8.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen
Nov 12 09:46:52 Tower kernel: ata8.00: irq_stat 0x08000000, interface fatal error
Nov 12 09:46:52 Tower kernel: ata8: SError: { Handshk }
Nov 12 09:46:52 Tower kernel: ata8.00: failed command: WRITE DMA EXT
Nov 12 09:46:52 Tower kernel: ata8.00: cmd 35/00:08:30:14:01/00:01:00:02:00/e0 tag 19 dma 135168 out
Nov 12 09:46:52 Tower kernel:         res 50/00:00:37:15:01/00:00:00:02:00/e0 Emask 0x10 (ATA bus error)
Nov 12 09:46:52 Tower kernel: ata8.00: status: { DRDY }
Nov 12 09:46:52 Tower kernel: ata8: hard resetting link
Nov 12 09:47:02 Tower kernel: ata8: softreset failed (1st FIS failed)
Nov 12 09:47:02 Tower kernel: ata8: hard resetting link
Nov 12 09:47:12 Tower kernel: ata8: softreset failed (1st FIS failed)
Nov 12 09:47:12 Tower kernel: ata8: hard resetting link
Nov 12 09:47:47 Tower kernel: ata8: softreset failed (1st FIS failed)
Nov 12 09:47:47 Tower kernel: ata8: limiting SATA link speed to 1.5 Gbps
Nov 12 09:47:47 Tower kernel: ata8: hard resetting link
Nov 12 09:47:52 Tower kernel: ata8: softreset failed (1st FIS failed)
Nov 12 09:47:52 Tower kernel: ata8: reset failed, giving up
Nov 12 09:47:52 Tower kernel: ata8.00: disabled
Nov 12 09:47:52 Tower kernel: ata8: EH complete
Nov 12 09:47:52 Tower kernel: sd 8:0:0:0: [sdg] tag#20 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00 cmd_age=60s
Nov 12 09:47:52 Tower kernel: sd 8:0:0:0: [sdg] tag#20 CDB: opcode=0x8a 8a 00 00 00 00 02 00 01 14 30 00 00 01 08 00 00
Nov 12 09:47:52 Tower kernel: blk_update_request: I/O error, dev sdg, sector 8590005296 op 0x1:(WRITE) flags 0x800 phys_seg 33 prio class 0
Nov 12 09:47:52 Tower kernel: md: disk8 write error, sector=8590005232
Nov 12 09:47:52 Tower kernel: md: disk8 write error, sector=8590005240
Nov 12 09:47:52 Tower kernel: md: disk8 write error, sector=8590005248

Now I really believe that the disk is still good. How do I get unraid to try to rebuild onto that disk? As in how do I get it to believe that the disk is not redballed and try to rebuild to that disk?

The other drive from the above post that had "errors" in this slot has been behaving fine in a different slot for over a year now. So I really think it's an issue with that slot.

grr... I can't remember if I replaced that ata8 cable or not last time.

Thanks,

Jim

JorgeB · November 12, 2021

2 minutes ago, jbuszkie said:

As in how do I get it to believe that the disk is not redballed and try to rebuild to that disk?

Not quite clear what you want to do, there are two options:

1) rebuild on top of the old disk, this is usually the recommended option unless the emulated disk is not mounting.

2) do a new config to re-enable the disk but you'll need to re-sync parity.

As for the error, the disk dropped offline, this is usually a power/connection problem.
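For anyone who has to re-learn the syslog each time, here is a rough Python sketch that tallies link resets per ata port, which is the pattern that points at a dropped connection rather than a failing disk. The sample is a subset of the lines posted above; the counter names (`hard_resets`, `failed_softresets`, and so on) are made up for this example and are not anything the kernel or Unraid reports.

```python
import re
from collections import Counter

# A subset of the kernel lines posted earlier in this thread.
LOG = """\
Nov 12 09:46:52 Tower kernel: ata8.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen
Nov 12 09:46:52 Tower kernel: ata8: SError: { Handshk }
Nov 12 09:46:52 Tower kernel: ata8: hard resetting link
Nov 12 09:47:02 Tower kernel: ata8: softreset failed (1st FIS failed)
Nov 12 09:47:02 Tower kernel: ata8: hard resetting link
Nov 12 09:47:12 Tower kernel: ata8: softreset failed (1st FIS failed)
Nov 12 09:47:52 Tower kernel: ata8: reset failed, giving up
Nov 12 09:47:52 Tower kernel: ata8.00: disabled
"""

# Matches both "ata8: ..." and "ata8.00: ..." and captures the port and message.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")


def summarize(text):
    """Return {port: Counter} of link-related events seen for each ata port."""
    summary = {}
    for line in text.splitlines():
        m = ATA_LINE.search(line)
        if not m:
            continue
        port, msg = m.groups()
        counts = summary.setdefault(port, Counter())
        if "hard resetting link" in msg:
            counts["hard_resets"] += 1
        elif "softreset failed" in msg:
            counts["failed_softresets"] += 1
        elif msg.startswith("SError:"):
            counts["serror"] += 1
        elif msg == "disabled":
            counts["disabled"] += 1
    return summary


if __name__ == "__main__":
    for port, counts in summarize(LOG).items():
        print(port, dict(counts))
```

A port that shows repeated resets followed by `disabled` matches the "dropped offline" reading given here; it does not by itself say whether the cable, the backplane, or the power connection is at fault.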

jbuszkie · November 12, 2021

2 minutes ago, JorgeB said:

Not quite clear what you want to do, there are two options:

1) rebuild on top of the old disk, this is usually the recommended option unless the emulated disk is not mounting.

2) do a new config to re-enable the disk but you'll need to re-sync parity.

1) I want to rebuild on top of the old disk

How do I do that? Unraid has it redballed

jbuszkie · November 12, 2021

This is what I did from Squid's post I found

On 11/21/2020 at 12:00 AM, Squid said:

Anytime a disk is redballed (as yours is), you must rebuild the contents of the drive. You don't need to clear it again.

Stop the array, unassign the disk. Start the array, stop the array, re-assign the disk and restart the array. A rebuild will happen.

It seems to be rebuilding. I'm getting more memory tomorrow so I'll try to replace that sata cable tomorrow or switch the cable to my last free slot and mark that slot as bad! 😞

trurl · November 12, 2021

2 hours ago, jbuszkie said:

1) I want to rebuild on top of the old disk

How do I do that? Unraid has it redballed

"manual" link at lower right of the webUI takes you to the current version of the documentation. Also linked at top and bottom of forum.

jbuszkie · November 12, 2021

4 minutes ago, trurl said:

"manual" link at lower right of the webUI takes you to the current version of the documentation. Also linked at top and bottom of forum.

The manual never used to be useful for stuff like this. I've always relied on the knowledge here!

It seems like the manual has improved!
