October 14, 2008

I always seem to get 30 to 200 sync errors every time I run a full parity check. No error counts are shown for any drive. Attached is my syslog.txt; I stopped the check once I noticed sync errors being logged.

This is a full system without a cache drive (16 drives: 15 data + parity).

Motherboard: Abit AB9 Pro, BIOS 22
Adaptec 4-port SATA card in the 16x slot
2 Rosewill 2-port SATA cards
unRAID: 4.3.beta6

Example entries in the log for ata17 and ata18:

Oct 14 16:28:55 Tower kernel: ata17: SError: { UnrecovData 10B8B Dispar BadCRC Handshk }
Oct 14 16:28:55 Tower kernel: ata17.00: cmd 60/f8:00:c7:82:02/02:00:00:00:00/40 tag 0 ncq 389120 in
Oct 14 16:28:55 Tower kernel:          res 50/00:f8:c7:82:02/00:02:00:00:00/40 Emask 0x10 (ATA bus error)
Oct 14 16:28:55 Tower kernel: ata17.00: status: { DRDY }
Oct 14 16:28:55 Tower kernel: ata17: hard resetting link
Oct 14 16:28:55 Tower kernel: ata17: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 14 16:28:55 Tower kernel: ata17.00: configured for UDMA/133
Oct 14 16:28:55 Tower kernel: ata17: EH complete

1. Is this a known issue with two of the ports on a certain type of controller for this motherboard?
2. Would 4.3.3 fix this? (I wanted to wait for the non-beta 4.4.)
3. How can I find out which drives are associated with ata17 and ata18?

Thanks for any help.
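For question 3, here is a minimal sketch of one way to look up which block device sits behind an ATA port. It assumes a kernel whose resolved `/sys/block/sdX` paths contain an `ataN` component, which is true of many later kernels but may not hold for the kernel in this unRAID release; in that case the mapping simply comes back empty. The function names are made up for illustration, and port multipliers (several drives behind one port) are not handled.

```python
import os
import re

# Matches the "ataN" path component in a resolved sysfs device path.
ATA_PORT = re.compile(r"/ata(\d+)/")


def ata_port_from_sysfs_path(path):
    """Return the ATA port number embedded in a sysfs path, or None."""
    match = ATA_PORT.search(path)
    return int(match.group(1)) if match else None


def map_ata_ports(sys_block="/sys/block"):
    """Map ATA port number -> block device name, e.g. 17 -> 'sdp'."""
    mapping = {}
    if not os.path.isdir(sys_block):
        return mapping
    for name in sorted(os.listdir(sys_block)):
        if not name.startswith("sd"):
            continue
        # /sys/block/sdX is a symlink into the device tree; resolve it.
        resolved = os.path.realpath(os.path.join(sys_block, name))
        port = ata_port_from_sysfs_path(resolved)
        if port is not None:
            mapping[port] = name
    return mapping


if __name__ == "__main__":
    for port, dev in sorted(map_ata_ports().items()):
        print(f"ata{port}: {dev}")
```

The device name can then be matched against the entries under `/dev/disk/by-id` to get the model and serial number.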
October 14, 2008

ata17: (sdp) ata-HDS725050KLA360_KRVN0******KBC
ata18: (sdq) ata-SAMSUNG_HD103UJ_S13PJ******647

Your errors look like communication problems between the drives and your system. Most of the errors are "exception Emask, BadCRC, ATA bus error" sequences, and the rest are "exception Emask, device error" sequences. I believe both kinds are communications related. The number one suspect would be bad or loose cables; next would be (I think) the controller they are attached to. In this case, I believe that may be the onboard JMB SATA controller. I don't think there is anything wrong with the drives.

The sync errors recorded in your syslog occurred in the same second as the communications errors, with link errors logged immediately before and after them. I would have to assume, then, that one of the reads very likely returned corrupted data, which caused a parity mismatch. Unfortunately, this may have caused unRAID to 'correct' the parity drive, but wrongly. So another parity check will have to be performed to reverse the bad corrections. But you should fix the cable or other issues first. These issues are unrelated to which version of unRAID you are using.

You have a number of Hitachi 500GB drives, like the one listed above. Most of them are linking at 3.0 Gbps, but some are linking at 1.5 Gbps, even though they appear to be identical drives connected to SATA II controllers on which other drives link at 3.0 Gbps. I have never had a Hitachi drive, so I don't know if there are any configuration jumpers available on them, but you might want to check that. On the other hand, the Hitachi listed above (sdp) was set up initially at 3.0 Gbps and reset to the same speed, but a later reset resulted in 1.5 Gbps. That is either an unsuccessful negotiation of the faster speed during the communications issues, or evidence of inconsistent negotiation with those Hitachi 500GB drives.

You've got bragging rights with your XOR processing speed, the fastest I have personally seen: "xor using function: pIII_sse (10090.000 MB/sec)". I was particularly surprised because you are using a "CPU: Intel® Core2 Duo CPU E6750 @ 2.66GHz stepping 0b" clocked at "2720.125 MHz" in single core mode.
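On the link speeds discussed above: the negotiated speed can also be read from the SStatus value the kernel prints after each reset (for example "SStatus 123"). The sketch below decodes that hexadecimal value using the SATA SStatus register layout: bits 3:0 are DET (device detection), bits 7:4 are SPD (negotiated speed) and bits 11:8 are IPM (power state). The function name and the returned dictionary are illustrative choices, not something taken from the syslog.

```python
# SPD field values from the SATA SStatus register definition.
SPEEDS = {0: "no link", 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}


def decode_sstatus(hex_text):
    """Decode the hex SStatus value printed by the kernel, e.g. '123'."""
    value = int(hex_text, 16)
    det = value & 0xF          # 3 = device present, PHY link established
    spd = (value >> 4) & 0xF   # negotiated interface speed
    ipm = (value >> 8) & 0xF   # 1 = interface in active state
    return {
        "det": det,
        "spd": spd,
        "ipm": ipm,
        "speed": SPEEDS.get(spd, "unknown"),
        "link_up": det == 3,
    }


if __name__ == "__main__":
    print(decode_sstatus("123"))  # the value from the ata17 reset line
    print(decode_sstatus("113"))  # what a 1.5 Gbps link would report
```

Comparing the decoded speed across successive resets of the same port is one way to spot a drive that dropped from 3.0 Gbps to 1.5 Gbps after an error.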
October 15, 2008

This is PRECISELY WHY we need a parity verify option WITHOUT writing to any drives. If a drive fails while the array is in this condition, any recovery effort is useless. We need to be able to check parity and look for errors without changing anything.

I lost hundreds of gigabytes of data because of a drive failure while my parity was being written incorrectly. In my case, a drive failed to spin up in time and showed as failed, so I did the "correct" thing and replaced it. I rebuilt the drive, and everything looked mostly OK, but parity checks showed thousands of errors. Thinking something was hinky, I tested the original drive, and it was fine. I then killed ANY chance of getting my data back by putting the original drive back in the array and letting it rebuild, instead of hitting restore. Parity checks continued to show thousands of errors, and the rebuilt drive was corrupt.

If I could have checked parity without changing it, I would have run more tests to find the actual problem before trusting the array to rebuild my drive.
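To make the risk concrete, here is a toy sketch of single XOR parity, under the simplifying assumption that each "drive" is a two-byte string. It shows parity computed from good reads, a rebuild that works, and then a rebuild that returns wrong data after parity was "corrected" from one bad read. The byte values and function names are invented for illustration; a real array works sector by sector across every drive.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(surviving_data, parity):
    """Reconstruct one missing drive from the other drives plus parity."""
    return xor_blocks(list(surviving_data) + [parity])


# Three toy data drives and their correct parity.
data = [b"\x11\x22", b"\x0f\xf0", b"\xaa\x55"]
good_parity = xor_blocks(data)

# With good parity, drive 1 can be rebuilt exactly from drives 0 and 2.
ok = rebuild([data[0], data[2]], good_parity)

# A correcting check sees one flipped bit in a read of drive 2
# (say, from a cable error) and rewrites parity to match the bad read.
bad_read = [data[0], data[1], b"\xab\x55"]
wrong_parity = xor_blocks(bad_read)

# Drive 2 itself is fine on disk, so a later rebuild of drive 1 uses
# the true drive 2 contents with the wrong parity and gets bad data.
corrupt = rebuild([data[0], data[2]], wrong_parity)

if __name__ == "__main__":
    print("rebuild with good parity matches: ", ok == data[1])
    print("rebuild with wrong parity matches:", corrupt == data[1])
```

A read-only verify would report the mismatch in the second case without rewriting parity, leaving the good parity in place while the cabling is investigated.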
October 15, 2008

You make a very good case. The problem I have with it, though, is that I can't think of a case where there was any suspicion of trouble like that prior to running the parity check. In most cases it's the parity check that first turns up a problem, and by then it's too late. Plus, a situation like the one you mentioned is very rare.

Perhaps the parity check could be adjusted to delay fixes until the end and ask for permission then. But many or most users would want their parity to be corrected immediately. The ultimate solution would be for the system to check the syslog after detecting a parity error, look for certain types of errors, and react accordingly.
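The idea of checking the syslog after a parity error could be prototyped offline along these lines. This is only a sketch under stated assumptions: it treats a sync error as suspect when any libata error line for any port carries the same timestamp, and the text that marks a sync error (`parity incorrect` here) is an assumed placeholder passed in as a parameter, since the thread does not quote that line. The function name is also made up.

```python
import re

# "Oct 14 16:28:55 Tower kernel: <message>"
LINE = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+(.*)$")
# libata lines that indicate a link or bus problem on a port.
ATA_ERROR = re.compile(
    r"^(ata\d+)(?:\.\d+)?: (?:SError|hard resetting link|exception Emask)"
)


def suspect_sync_errors(lines, sync_pattern=r"parity incorrect"):
    """Return sync-error lines that share a timestamp with ATA link errors.

    sync_pattern is an ASSUMED marker for a sync-error line; adjust it to
    whatever the syslog being examined actually contains.
    """
    sync_re = re.compile(sync_pattern)
    ata_seconds = {}  # timestamp -> set of ports with errors that second
    sync_hits = []    # (timestamp, message) for each sync-error line
    for raw in lines:
        match = LINE.match(raw.strip())
        if not match:
            continue
        stamp, message = match.groups()
        ata = ATA_ERROR.match(message)
        if ata:
            ata_seconds.setdefault(stamp, set()).add(ata.group(1))
        elif sync_re.search(message):
            sync_hits.append((stamp, message))
    return [
        (stamp, message, sorted(ata_seconds[stamp]))
        for stamp, message in sync_hits
        if stamp in ata_seconds
    ]


if __name__ == "__main__":
    with open("syslog.txt") as handle:
        for stamp, message, ports in suspect_sync_errors(handle):
            print(stamp, message, "<- same second as errors on", ports)
```

Any sync error flagged this way is a candidate for a bad read rather than genuinely stale parity, which is the case where an automatic correction does harm.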
October 15, 2008 (Author)

> Your errors look like communication problems between the drives and your system. Most of the errors are exception Emask, BadCRC, ATA bus error sequences, and the rest are exception Emask, device error sequences, both of which, I believe, are communications related. The number one suspect would be bad or loose cables, next would be (I think) the controller they are attached to. In this case, I believe that may be the onboard JMB SATA controller. I don't think there is anything wrong with the drives.

Both drives you identified are on the motherboard's JMicron SATA controller (SATA8 and SATA9). I thought the issues with this controller were fixed in 4.1+?

> The sync errors recorded in your syslog occurred in the same second as communications issues, both immediately before and after, within the same second. I would have to assume then, the very high possibility of corrupted data returned by one of the reads, which caused a parity mismatch. This unfortunately may have caused unRAID to 'correct' the parity drive, but wrongly. So another parity check will have to be performed, to reverse the bad corrections. But you should fix the cable or other issues first.

I might have to get some more SATA ports installed on the PCI bus to get a clean parity build. At this point I might already have some bad bits on two other drives that I rebuilt during hard drive upgrades. Is there any way to know which files might be corrupted now?

> These issues are unrelated to which version of unRAID you are using.

Nice to know!

> You have a number of Hitachi 500GB drives, like the one listed above. They are mostly linking at 3.0 Gbps, but some of them are linking at 1.5 Gbps, and they appear to be identical drives, connected to SATA II controllers, with others linking at 3.0 Gbps.

Some are older SATA I drives.

> You've got bragging rights with your XOR processing speed, fastest I personally have seen: "xor using function: pIII_sse (10090.000 MB/sec)". I was particularly surprised because you are using a "CPU: Intel® Core2 Duo CPU E6750 @ 2.66GHz stepping 0b" clocked at "2720.125 MHz" in single core mode.

Cool! How does the pIII_sse value relate to the rebuild rate shown in the GUI? That is usually around 50 MB/sec.

All drives are on motherboard SATA ports, on the SATA card in the graphics slot, or on cards in the PCIe x1 slots. No PCI stuff.

From jonathanm:

> This is PRECISELY WHY we need a parity verify option WITHOUT writing to any drives.

I TOTALLY agree!