jbuszkie Posted March 11, 2020

Some history... I had an 8T drive fail (red ball) recently in a slot we'll call A. I had a spare, unassigned 8T drive, so I swapped it in and it rebuilt fine. I ran pre-clear on the "failed" drive from slot A and it failed in the zeroing phase. OK, maybe it was a bad drive.

So I bought another 8T drive to keep as a spare. I shucked it and put it into slot B, which had held an old 4T drive that I had taken out of service for some reason. I ran pre-clear on the new 8T drive in slot B and it ran fine. No issues other than a couple of UDMA CRC errors, only 4 or 5. So I scratched my head and continued.

I want to remove three 2T disks, so I put my "spare" in slot B into the array. I moved everything from one of the 2T disks onto the 8T drive in slot B. That ran fine and came up with maybe one or two more CRC errors. I zeroed out that 2T drive and emptied the second 2T drive onto the 8T in slot B. That went fine... but with a couple more CRC errors. So I'm thinking maybe slot B has an issue with its cable?

The first 2T was ready to be removed from the system, so I removed it, created a new config and trusted parity. That went fine. I started a parity check, and since it was running fine I stopped it. I shut down the array, planning to try the new disk in a different slot: I moved the new 8T into slot A and removed the "failed" 8T. I started the system and copied more files off the second 2T disk that I wanted to empty. After that was done I noticed that the new disk in slot A had red-balled!!! It had a bunch of errors (around 6k). Shoot! WTF!

So now I've put the new 8T back in slot B, and it's currently being rebuilt. It's at 7% with no disk errors. It's up to 14 CRC errors, but I think that's what it had at the start of the rebuild.

So do I have a bad drive or bad slot(s)? The slots are in two different 5-bay Norco drive cages. Trying to figure out how to proceed. Looking to the experts here...
JorgeB Posted March 11, 2020

CRC errors are a connection problem: 9 times out of 10 it's a bad SATA cable, but it could also be the backplane or even the controller, though that's much less likely.
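One rough way to tell whether a cable/slot problem is still active is to check whether the raw UDMA_CRC_Error_Count (SMART attribute 199) keeps climbing between checks, since that counter normally doesn't reset. Below is a minimal Python sketch that parses the attribute table printed by `smartctl -A /dev/sdX`; it assumes the usual 10-column layout with RAW_VALUE last, and the function names are made up for illustration.

```python
import subprocess
from typing import Optional

CRC_ATTR_ID = "199"  # UDMA_CRC_Error_Count in the SMART attribute table


def udma_crc_count(smart_attrs: str) -> Optional[int]:
    """Return the raw UDMA CRC error count from `smartctl -A` text, or None."""
    for line in smart_attrs.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns; RAW_VALUE is the 10th (index 9).
        if len(fields) >= 10 and fields[0] == CRC_ATTR_ID:
            return int(fields[9])
    return None


def crc_increase(before: str, after: str) -> int:
    """CRC errors added between two snapshots (0 if either is missing)."""
    b, a = udma_crc_count(before), udma_crc_count(after)
    if b is None or a is None:
        return 0
    return a - b


def read_smart_attrs(device: str) -> str:
    """Run `smartctl -A <device>` (needs smartmontools and root)."""
    return subprocess.run(["smartctl", "-A", device],
                          capture_output=True, text=True).stdout
```

Taking a snapshot before and after a big copy (like the 2T-to-8T moves above) and comparing them with `crc_increase` shows whether the slot is still producing errors, rather than relying on an old total.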
jbuszkie Posted March 11, 2020

Yeah... I know CRC errors are connection issues, usually. But these cables haven't been touched in years... What worries me is the drive red-balling in slot A. Is it the drive or the slot?
JorgeB Posted March 11, 2020

20 minutes ago, jbuszkie said: Yeah.. I know the CRC are connection issues.. usually.. But these cables haven't been touched in years...

If there are errors there's a problem, and it's not the drive.

21 minutes ago, jbuszkie said: I'm worried about the drive red-balling in slot A... Is it the drive or the slot...

Without any diags posted we can only guess.
jbuszkie Posted March 11, 2020

I'll post the diags, but I'm not sure how much help they'll be. They don't contain any historical data about failed disks, right? You'd be able to see the CRC errors in the SMART report... What else in there is useful?

tower-diagnostics-20200311-1114.zip

Edited March 11, 2020 by jbuszkie
JorgeB Posted March 11, 2020

3 minutes ago, jbuszkie said: They don't contain any historical data about failed disks, right?

They should be saved after the disk gets disabled and before rebooting. Without the syslog, and based on the SMART report, the disk looks OK. Like mentioned, CRC errors aren't a disk problem, so it's likely the slot/cable.
jbuszkie Posted March 11, 2020

Yeah, this is from after the reboots during the rebuild, so the disabled disk info isn't there... Actually, I found the syslogs:

syslog-20200311-074811.txt
syslog-20200311-082322.txt
jbuszkie Posted March 11, 2020

Look at the 800K one. The errors start around 8:16ish.
JorgeB Posted March 11, 2020

Yes, it looks like a connection/power issue: the disk dropped offline out of the blue, as if the power or SATA cable had been pulled.
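To spot this "dropped offline" pattern without reading the whole syslog, one can pull out the libata messages that typically accompany a lost link: SError reports, hard resets, failed soft resets, the link speed being limited, "reset failed, giving up" and "disabled". Below is a minimal Python sketch. The message list is an assumption taken from the log excerpt posted later in this thread, not an exhaustive set, and `summarize_link_trouble` is a hypothetical name.

```python
import re

# libata messages that, taken together, usually mean the link to the disk
# dropped (power/cable/backplane) rather than a media error. The list is an
# assumption based on the log excerpt in this thread, not exhaustive.
LINK_TROUBLE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: (ata\d+)(?:\.\d+)?: "
    r"(SError|hard resetting link|softreset failed|limiting SATA link speed"
    r"|reset failed, giving up|disabled)"
)


def summarize_link_trouble(syslog_text):
    """Map each ataN port to (first timestamp seen, set of message kinds)."""
    summary = {}
    for line in syslog_text.splitlines():
        m = LINK_TROUBLE.match(line)
        if not m:
            continue
        ts, port, kind = m.groups()
        # setdefault keeps the first timestamp; later lines only add kinds.
        _, kinds = summary.setdefault(port, (ts, set()))
        kinds.add(kind)
    return summary
```

A port that shows "reset failed, giving up" followed by "disabled" is one the kernel stopped talking to, which is what the red ball reflects. The first timestamp tells you where to start reading (e.g. the "around 8:16ish" above).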
jbuszkie Posted March 11, 2020

Crap! Those are two different Norco drive cages, one with the dropped disk and one with the CRC errors. And it's not like I bumped the machine or anything. I guess it's time to open her up and check for loose cables to the cages! So the drive might be good??? Hmm... Has anyone seen years of vibration knock a SATA cable loose on a drive cage? I *think* both of the slots in question are connected to the MB, not the LSI card... Has anyone seen Norco slots go bad?

Thanks @johnnie.black
JorgeB Posted March 11, 2020

32 minutes ago, jbuszkie said: So the drive might be good???

Most likely.

37 minutes ago, jbuszkie said: I *think* both of those slots in question are connected to the MB

At least that one is; it's using one of the two Asmedia ports.
jbuszkie Posted March 11, 2020

Quote: At least that one is, it's using one of the two Asmedia ports.

I keep forgetting how to piece everything together in the syslog! I haven't had to look at one in a long time, LOL. I have to re-learn it each time 😄
jbuszkie Posted March 11, 2020

Quote: At least that one is, it's using one of the two Asmedia ports.

OK... I give up. How were you able to tell from the syslog that ata7 and ata8 were the Asmedia ports? Knowing my motherboard, I could figure it out, but I can't find anywhere in the syslog that shows which ports are mapped to which controller... There is no mention of Asmedia in the syslog.
JorgeB Posted March 11, 2020

With the full diagnostics it would be easy to see it's an Asmedia controller. The syslog alone doesn't show Asmedia, just a two-port controller loading after the first six Intel ports, but from the motherboard model I could see it has a two-port Asmedia controller.
UhClem Posted March 11, 2020

36 minutes ago, jbuszkie said: Ok.. I give up.. How were you able to tell that ata7 and ata8 where the Asmedia ports from the syslog? Because I know my motherboard.. I could figure it out.. But I can't find anywhere which ports are mapped to which controller in the syslog... There is no mention of asmedia in the syslog.

How about this: [from your 800k syslog] lines 759-763

Mar 11 07:50:54 Tower kernel: ahci 0000:04:00.0: SSS flag set, parallel bus scan disabled
Mar 11 07:50:54 Tower kernel: ahci 0000:04:00.0: AHCI 0001.0200 32 slots 2 ports 6 Gbps 0x3 impl SATA mode
Mar 11 07:50:54 Tower kernel: ahci 0000:04:00.0: flags: 64bit ncq sntf stag led clo pmp pio slum part ccc sxs
Mar 11 07:50:54 Tower kernel: scsi host7: ahci
Mar 11 07:50:54 Tower kernel: scsi host8: ahci

Then searching for 0000:04:00 leads to: [line 373]

Mar 11 07:50:54 Tower kernel: pci 0000:04:00.0: [1b21:0612] type 00 class 0x010601

And [1b21:0612] is the Vendor ID (Asmedia) : Device ID (ASM1062) pair for that controller.

"A rose by any other name ... is still a rose."
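UhClem's lookup can be automated with a few regular expressions. Below is a minimal Python sketch. It assumes, as in this log, that the `scsi hostN: ahci` lines are printed right after the messages of the AHCI controller that owns them. That ordering is a heuristic that parallel probing could break, so treat the result as a hint, not proof. `map_hosts_to_pci` is a hypothetical name.

```python
import re

AHCI_LINE = re.compile(r"kernel: ahci (\S+): ")
HOST_LINE = re.compile(r"kernel: scsi host(\d+): ahci")
PCI_ID_LINE = re.compile(r"kernel: pci (\S+): \[([0-9a-f]{4}):([0-9a-f]{4})\]")


def map_hosts_to_pci(syslog_text):
    """Map scsi host number -> (PCI address, (vendor ID, device ID) or None)."""
    ids = {}        # PCI address -> (vendor, device), from "pci ...: [vvvv:dddd]"
    hosts = {}      # scsi host number -> PCI address of the owning controller
    current = None  # most recent AHCI controller seen (ordering heuristic)
    for line in syslog_text.splitlines():
        if m := PCI_ID_LINE.search(line):
            ids[m.group(1)] = (m.group(2), m.group(3))
        elif m := AHCI_LINE.search(line):
            current = m.group(1)
        elif (m := HOST_LINE.search(line)) and current:
            hosts[int(m.group(1))] = current
    return {h: (addr, ids.get(addr)) for h, addr in hosts.items()}
```

With the lines quoted above, hosts 7 and 8 both come back as 0000:04:00.0 with ID pair ("1b21", "0612"), i.e. the ASM1062.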
jbuszkie Posted March 11, 2020

Nice! I missed that 0000:04:00.0 could be mapped back to a PCI device! From lspci I could also have seen that 04:00.0 is the ASMedia device!

Thanks, Jim
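On a running system you can skip the syslog. /sys/block/sdX is a symlink whose resolved path passes through the controller's PCI address and the libata port (ataN) the disk is attached to. `lspci -nn -s 04:00.0` then prints the controller's name along with its [vendor:device] pair. Below is a minimal Python sketch that pulls those two pieces out of a resolved path. The function names are hypothetical, and the example path in the test is made up to match this thread's sdg/ata8/04:00.0.

```python
import os
import re

# PCI addresses look like 0000:04:00.0 (domain:bus:device.function).
PCI_ADDR = re.compile(r"\b[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]\b")
ATA_PORT = re.compile(r"/(ata\d+)/")


def controller_for_disk(sysfs_path):
    """Return (controller PCI address, ataN port) from a resolved sysfs path."""
    addrs = PCI_ADDR.findall(sysfs_path)
    port = ATA_PORT.search(sysfs_path)
    # The last PCI address in the chain is the storage controller itself;
    # any earlier ones are the bridges/root ports it hangs off.
    return (addrs[-1] if addrs else None, port.group(1) if port else None)


def resolve_block_device(name):
    """Resolve /sys/block/<name> (e.g. "sdg"); only meaningful on live Linux."""
    return os.path.realpath(os.path.join("/sys/block", name))
```

Disks behind a SAS HBA such as the LSI card won't have an ataN component, so the port comes back as None and only the controller address is useful.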
jbuszkie Posted November 12, 2021

Ugh... I hate to bring up an old thread, but I'm having issues again. It looks like it's the same slot as above. I just rebooted my server, and on restart drive 8 red-balled!

Nov 12 09:46:52 Tower kernel: ata8.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen
Nov 12 09:46:52 Tower kernel: ata8.00: irq_stat 0x08000000, interface fatal error
Nov 12 09:46:52 Tower kernel: ata8: SError: { Handshk }
Nov 12 09:46:52 Tower kernel: ata8.00: failed command: WRITE DMA EXT
Nov 12 09:46:52 Tower kernel: ata8.00: cmd 35/00:08:30:14:01/00:01:00:02:00/e0 tag 19 dma 135168 out
Nov 12 09:46:52 Tower kernel: res 50/00:00:37:15:01/00:00:00:02:00/e0 Emask 0x10 (ATA bus error)
Nov 12 09:46:52 Tower kernel: ata8.00: status: { DRDY }
Nov 12 09:46:52 Tower kernel: ata8: hard resetting link
Nov 12 09:47:02 Tower kernel: ata8: softreset failed (1st FIS failed)
Nov 12 09:47:02 Tower kernel: ata8: hard resetting link
Nov 12 09:47:12 Tower kernel: ata8: softreset failed (1st FIS failed)
Nov 12 09:47:12 Tower kernel: ata8: hard resetting link
Nov 12 09:47:47 Tower kernel: ata8: softreset failed (1st FIS failed)
Nov 12 09:47:47 Tower kernel: ata8: limiting SATA link speed to 1.5 Gbps
Nov 12 09:47:47 Tower kernel: ata8: hard resetting link
Nov 12 09:47:52 Tower kernel: ata8: softreset failed (1st FIS failed)
Nov 12 09:47:52 Tower kernel: ata8: reset failed, giving up
Nov 12 09:47:52 Tower kernel: ata8.00: disabled
Nov 12 09:47:52 Tower kernel: ata8: EH complete
Nov 12 09:47:52 Tower kernel: sd 8:0:0:0: [sdg] tag#20 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00 cmd_age=60s
Nov 12 09:47:52 Tower kernel: sd 8:0:0:0: [sdg] tag#20 CDB: opcode=0x8a 8a 00 00 00 00 02 00 01 14 30 00 00 01 08 00 00
Nov 12 09:47:52 Tower kernel: blk_update_request: I/O error, dev sdg, sector 8590005296 op 0x1:(WRITE) flags 0x800 phys_seg 33 prio class 0
Nov 12 09:47:52 Tower kernel: md: disk8 write error, sector=8590005232
Nov 12 09:47:52 Tower kernel: md: disk8 write error, sector=8590005240
Nov 12 09:47:52 Tower kernel: md: disk8 write error, sector=8590005248

Now, I really believe the disk is still good. How do I get Unraid to rebuild onto that disk? In other words, how do I get it to accept that the disk isn't red-balled and rebuild onto it? The other drive from the post above that had "errors" in this slot has been behaving fine in a different slot for over a year now, so I really think it's an issue with that slot. Grr... I can't remember whether I replaced that ata8 cable last time.

Thanks, Jim
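The error lines above all describe the same failed write. The WRITE(16) CDB has opcode 0x8a, an 8-byte LBA in bytes 2-9 and a 4-byte block count in bytes 10-13. Decoding it gives LBA 8590005296, the sector in the blk_update_request line, and 264 blocks. Assuming 512-byte logical sectors, that is 135168 bytes, matching "dma 135168" in the ATA command. The md sectors are 64 lower, which would fit the data partition starting at sector 64; that partition offset is an assumption, not something the log shows. Below is a minimal Python sketch of the decode. `decode_write16` is a hypothetical name.

```python
SECTOR_BYTES = 512  # assumed logical sector size


def decode_write16(cdb_hex):
    """Decode a SCSI WRITE(16) CDB as printed by the kernel, e.g. '8a 00 ...'."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a WRITE(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: starting LBA
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length
    return lba, blocks, blocks * SECTOR_BYTES


lba, blocks, nbytes = decode_write16("8a 00 00 00 00 02 00 01 14 30 00 00 01 08 00 00")
print(lba, blocks, nbytes)  # 8590005296 264 135168
```

The write only fails after the kernel has reset the link several times and given up on it. That fits JorgeB's reading that the disk dropped off the bus, rather than indicating a bad sector.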
JorgeB Posted November 12, 2021

2 minutes ago, jbuszkie said: As in how do I get it to believe that the disk is not redballed and try to rebuild to that disk?

It's not quite clear what you want to do. There are two options:
1) rebuild on top of the old disk; this is usually the recommended option unless the emulated disk is not mounting.
2) do a new config to re-enable the disk, but you'll need to re-sync parity.

As for the error, the disk dropped offline; this is usually a power/connection problem.
jbuszkie Posted November 12, 2021

2 minutes ago, JorgeB said: Not quite clear what you want to do, there are two options: 1) rebuild on top of the old disk, this is usually the recommended option unless the emulated disk is not mounting. 2) do a new config to re-enable the disk but you'll need to re-sync parity.

1) I want to rebuild on top of the old disk. How do I do that? Unraid has it red-balled.
jbuszkie Posted November 12, 2021

This is what I did, following a post from Squid that I found:

On 11/21/2020 at 12:00 AM, Squid said: Anytime a disk is redballed (as yours is), you must rebuild the contents of the drive. You don't need to clear it again. Stop the array, unassign the disk. Start the array, stop the array, re-assign the disk and restart the array. A rebuild will happen.

It seems to be rebuilding. I'm getting more memory tomorrow, so I'll try replacing that SATA cable then, or switch the cable to my last free slot and mark that slot as bad! 😞
trurl Posted November 12, 2021

2 hours ago, jbuszkie said: 1) I want to rebuild on top of the old disk How do I do that? Unraid has it redballed

The "manual" link at the lower right of the webUI takes you to the current version of the documentation. It's also linked at the top and bottom of the forum.
jbuszkie Posted November 12, 2021

4 minutes ago, trurl said: "manual" link at lower right of the webUI takes you to the current version of the documentation. Also linked at top and bottom of forum.

The manual never used to be useful for stuff like this; I've always relied on the knowledge here! It seems like the manual has improved!