December 18, 2009

About a week ago my Parity drive turned up Red with 128 write errors. I pulled the drive and did a full Smart test in Windows and the drive tested fine. I also tested the most heavily used data disk and it seemed to have quite a few bad sectors, so I replaced it. I also installed un-Notify to head off future disasters. Today un-Notify starts sending me disabled disk errors every thirty minutes, but it also says the disk has PASSED. The Parity disk shows the same 128 write errors it had before, probably a coincidence.

Relevant Syslog:

Dec 17 09:09:34 TerraByteMe kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x400000 action 0x6
Dec 17 09:09:34 TerraByteMe kernel: ata3.00: BMDMA stat 0x5
Dec 17 09:09:34 TerraByteMe kernel: ata3: SError: { Handshk }
Dec 17 09:09:34 TerraByteMe kernel: ata3.00: cmd 35/00:00:3f:40:18/00:04:0c:00:00/e0 tag 0 dma 524288 out
Dec 17 09:09:34 TerraByteMe kernel: res 51/84:d1:3f:40:18/84:00:00:00:00/e0 Emask 0x10 (ATA bus error)
Dec 17 09:09:34 TerraByteMe kernel: ata3.00: status: { DRDY ERR }
Dec 17 09:09:34 TerraByteMe kernel: ata3.00: error: { ICRC ABRT }
Dec 17 09:09:34 TerraByteMe kernel: ata3: hard resetting link
Dec 17 09:09:35 TerraByteMe kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec 17 09:09:35 TerraByteMe kernel: ata3.00: configured for UDMA/133
Dec 17 09:09:35 TerraByteMe kernel: ata3: EH complete

Bad cable? Bad controller? Bad RAM? Any suggestions appreciated.
December 18, 2009

I would rather see the entire syslog, to avoid making bad assumptions. All I can say from this single error is that the ICRC error flag most commonly indicates a bad cable; less likely, it could be a power issue. This IS an interface issue, not a physical drive issue, although I don't like to say much with only one error to go by.

This error did not cause any further issues. A reset was issued, and the link came back at full speed, 3.0 Gbps and UDMA/133. If the drive was disabled, there must therefore be a number of other error messages, culminating in something like: ata3.00: disabled. Every error after that message that refers to ata3.00 can be ignored.
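For anyone who wants to check a full syslog for this pattern, here is a rough Python sketch. It only assumes the two line shapes mentioned in this thread ("ataN.NN: error: { ... }" and "ataN.NN: disabled"); the function name and the regular expressions are my own illustration, not part of unRAID or the kernel, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter, defaultdict

# Assumed line shapes, taken from the excerpt in this thread:
#   "kernel: ata3.00: error: { ICRC ABRT }"
#   "kernel: ata3.00: disabled"
ERROR_RE = re.compile(r"(ata\d+\.\d+): error: \{ ([^}]*)\}")
DISABLED_RE = re.compile(r"(ata\d+\.\d+): disabled")


def summarize_ata_errors(lines):
    """Count error flags per ATA device and collect devices the kernel disabled."""
    flags = defaultdict(Counter)
    disabled = set()
    for line in lines:
        m = ERROR_RE.search(line)
        if m:
            # "ICRC ABRT " -> ["ICRC", "ABRT"]
            flags[m.group(1)].update(m.group(2).split())
            continue
        m = DISABLED_RE.search(line)
        if m:
            disabled.add(m.group(1))
    return dict(flags), disabled


if __name__ == "__main__":
    sample = [
        "Dec 17 09:09:34 TerraByteMe kernel: ata3.00: status: { DRDY ERR }",
        "Dec 17 09:09:34 TerraByteMe kernel: ata3.00: error: { ICRC ABRT }",
        "Dec 17 09:09:34 TerraByteMe kernel: ata3: hard resetting link",
    ]
    flags, disabled = summarize_ata_errors(sample)
    print(flags)
    print(disabled)
```

Many ICRC hits on one device with nothing in the disabled set would fit the "interface problem, kernel never gave up on the drive" picture; it does not prove which part of the interface is at fault.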
December 18, 2009

I forgot to mention that you can use the Trust My Array procedure to quickly restore the array if the drive was disabled because of interface issues and is not actually at fault. You would want to fix the underlying issue first, for example by replacing the cable.
December 18, 2009 Author

"There must be a number of other error messages therefore, if the drive was disabled, and they will culminate in something like: ata3.00: disabled."

I don't see that in the Syslog at all. The weird thing is that when I stopped the array, UnRaid said that I had replaced the Parity disk (see log), which I did not. I did see an IRQ7 error at startup; not sure what that's about.
December 18, 2009

"Did see an IRQ7 error at startup, not sure what that's about."

kernel: spurious 8259A interrupt: IRQ7.

This is practically an identifying characteristic of nForce boards. I have never seen spurious IRQs of any kind on other boards, and I believe I have found this specific spurious IRQ7 on every nForce-based motherboard. It has been harmless in almost all cases, and if it only occurs during boot, then it *is* harmless. But I had a serious problem with it myself, because it occurred again later, and the kernel disabled the interrupt. That was fatal for me, because the kernel had assigned one disk controller (with 2 drives attached), the USB port my flash drive was using, and the network chipset, all to IRQ7! So instantly 2 drives were down, my flash drive was inaccessible, and the networking was down. Not sure how it could get worse! I was able to reserve IRQ 7 in the BIOS settings, which effectively isolated it from ever being used again.

"I don't see that in the Syslog at all. Weird thing is that when I stopped the array, UnRaid said that I had replaced the Parity disk"

You are right, in that the kernel NEVER disabled the drive, and continued on as if the drive was fine. But unRAID had requested a write operation and, after close to a minute of waiting, gave up and called it a write error, which causes unRAID to mark a drive as Disabled (red ball) in its table. The drive was still there though, and when unRAID checked and found it, it made the poor assumption that this was a replacement for the drive it had just marked as disabled. The Trust My Array procedure is perfect for this case.

This is one more case of the communication 'disconnect' between the upper level modules of unRAID and the lower level drivers and exception handlers. In another recent case, the opposite of this one, unRAID requested I/O from a drive, but the kernel lost contact with the drive completely (temporarily).
The driver and exception handler worked hard to reset the drive and recover it, but finally marked it Disabled, and that is somewhat more fatal (more fatal?) than the unRAID module disabling a drive. Later the drive was recovered (actually several times), but it was recovered as a *different* drive, with a different Device ID, and unRAID never knew what had happened below it. unRAID kept it as a good green-balled drive, but recorded read errors for it.

In your case, the drive was reset over and over, successfully each time, so the kernel continued to assume it was fine, even though reads and writes were still not successful. It slowed communications down, on both the SATA link (from 3.0 to 1.5 Gbps) and the DMA channel (from UDMA/133 to UDMA/33), but in both cases that is as low as it can go. I have seen this happen several times before, where the drive and driver are trapped in a loop: communications indicate a problem (Handshk & ICRC), but every reset indicates that the drive is still there, that the communications channels are fine, and that the DMA channel is working. Once it has slowed down to the slowest allowable speed, there is nothing further the current exception handler can do.

There really needs to be a way for the exception handler to test not only that communications are working, but that they are also *productive*. I also wish the exception handler was better designed to inform you of exactly who or what is reporting the original issue, so you could better tell if it is a problem with the drive, the cabling, the controller, etc.

I personally think it is probably related to the cable or connectors, and not the power to the drive, because every error is so consistent, but it could also be a problem with the controller, perhaps a bug, due to some weird combination of circumstances. I don't think we can be more definitive than that, sorry.
You can try replacing the cable, and you can try connecting it to a different port, perhaps on a different controller.
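To see whether a port has been stepped down by repeated resets as described above, a small Python sketch like the following could pull the negotiated speeds out of a syslog. It assumes the "SATA link up N Gbps" and "configured for UDMA/N" line shapes from the excerpt in the first post; the function names are hypothetical, and the 1.5 Gbps and UDMA/33 sample lines are constructed for illustration from the values mentioned in this post, not copied from the actual log.

```python
import re

# Assumed line shapes, based on the excerpt in the first post:
#   "kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)"
#   "kernel: ata3.00: configured for UDMA/133"
LINK_RE = re.compile(r"(ata\d+): SATA link up ([\d.]+) Gbps")
DMA_RE = re.compile(r"(ata\d+)\.\d+: configured for UDMA/(\d+)")


def speed_history(lines, port):
    """Return (link speeds in Gbps, UDMA modes) seen for one port, in log order."""
    link, dma = [], []
    for line in lines:
        m = LINK_RE.search(line)
        if m and m.group(1) == port:
            link.append(float(m.group(2)))
            continue
        m = DMA_RE.search(line)
        if m and m.group(1) == port:
            dma.append(int(m.group(2)))
    return link, dma


def was_downgraded(values):
    """True if the last negotiated value is lower than the first one."""
    return len(values) > 1 and values[-1] < values[0]


if __name__ == "__main__":
    # Illustrative lines only; the last two are made up to show a downgrade.
    sample = [
        "kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
        "kernel: ata3.00: configured for UDMA/133",
        "kernel: ata3: SATA link up 1.5 Gbps",
        "kernel: ata3.00: configured for UDMA/33",
    ]
    link, dma = speed_history(sample, "ata3")
    print(link, was_downgraded(link))
    print(dma, was_downgraded(dma))
```

A port that keeps coming back up after every reset but at a lower link speed and UDMA mode would match the loop described above, which is a reasonable cue to swap the cable or move the drive to another port.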
December 23, 2009 Author

Changing the cable seems to have done the trick; whether it was loose or bad I don't know. I'll just toss it. I've moved a few TB around and the problem has not come back.