September 23, 2009

Hello! I am having a strange issue at the moment. I recently tracked down a faulty cable that gave me about 700 errors in the parity check. Typical errors were:

Sep 20 03:33:38 Alpha kernel: sd 6:0:0:0: [sdg] 976773168 512-byte hardware sectors (500108 MB)
Sep 20 03:33:38 Alpha kernel: sd 6:0:0:0: [sdg] Write Protect is off
Sep 20 03:33:38 Alpha kernel: sd 6:0:0:0: [sdg] Mode Sense: 00 3a 00 00
Sep 20 03:33:38 Alpha kernel: sd 6:0:0:0: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Sep 20 03:35:31 Alpha kernel: ata6.00: exception Emask 0x12 SAct 0x0 SErr 0x4850400 action 0xe frozen
Sep 20 03:35:31 Alpha kernel: ata6: SError: { Proto PHYRdyChg CommWake LinkSeq DevExch }
Sep 20 03:35:31 Alpha kernel: ata6.00: cmd 25/00:00:07:40:39/00:04:12:00:00/e0 tag 0 dma 524288 in
Sep 20 03:35:31 Alpha kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x16 (ATA bus error)

With the cable replaced, the parity check still reports 62 errors, but nothing related to them appears in the syslog. Here is the last part of the syslog, starting right after the check was launched:

Sep 22 15:33:48 Alpha kernel: md: recovery thread checking parity...
Sep 22 15:33:48 Alpha kernel: md: using 1152k window, over a total of 976762552 blocks.
Sep 22 15:40:56 Alpha ntpd[1204]: time reset -0.424898 s
Sep 22 15:42:24 Alpha ntpd[1204]: synchronized to 207.150.167.80, stratum 2
Sep 22 16:00:38 Alpha ntpd[1204]: time reset -0.370562 s
Sep 22 16:03:18 Alpha ntpd[1204]: synchronized to 207.150.167.80, stratum 2
Sep 22 16:17:17 Alpha ntpd[1204]: time reset -0.427218 s
Sep 22 16:17:59 Alpha ntpd[1204]: synchronized to 207.150.167.80, stratum 2
Sep 22 16:33:52 Alpha ntpd[1204]: time reset -0.389664 s
Sep 22 16:34:30 Alpha in.telnetd[1461]: connect from 192.168.123.12 (192.168.123.12)
Sep 22 16:34:33 Alpha login[1462]: ROOT LOGIN on `pts/0' from `192.168.123.12'
Sep 22 16:37:27 Alpha ntpd[1204]: synchronized to 207.150.167.80, stratum 2
Sep 22 16:53:38 Alpha ntpd[1204]: time reset -0.428178 s
Sep 22 16:54:12 Alpha ntpd[1204]: synchronized to 207.150.167.80, stratum 2
Sep 22 17:18:38 Alpha ntpd[1204]: time reset -0.407110 s
Sep 22 17:19:23 Alpha ntpd[1204]: synchronized to 207.150.167.80, stratum 2
Sep 22 17:34:11 Alpha ntpd[1204]: time reset -0.528294 s
Sep 22 17:36:57 Alpha ntpd[1204]: synchronized to 207.150.167.80, stratum 2
Sep 22 17:52:10 Alpha ntpd[1204]: time reset -0.405057 s
Sep 22 17:52:25 Alpha ntpd[1204]: synchronized to 207.150.167.80, stratum 2
Sep 22 18:08:28 Alpha ntpd[1204]: time reset -0.384115 s
Sep 22 18:09:20 Alpha ntpd[1204]: synchronized to 207.150.167.80, stratum 2
Sep 22 18:14:45 Alpha emhttp: shcmd (19): /usr/sbin/hdparm -y /dev/sdh >/dev/null
Sep 22 18:14:45 Alpha emhttp: shcmd (20): /usr/sbin/hdparm -y /dev/sdb >/dev/null
Sep 22 18:25:14 Alpha ntpd[1204]: time reset -0.374570 s
Sep 22 18:26:21 Alpha ntpd[1204]: synchronized to 207.150.167.80, stratum 2
Sep 22 18:41:18 Alpha ntpd[1204]: no servers reachable
Sep 22 18:42:22 Alpha ntpd[1204]: synchronized to 207.150.167.80, stratum 2
Sep 22 18:45:38 Alpha ntpd[1204]: time reset -0.421187 s
Sep 22 18:49:13 Alpha ntpd[1204]: synchronized to 207.150.167.80, stratum 2
Sep 22 19:04:20 Alpha ntpd[1204]: time reset -0.411492 s
Sep 22 19:05:05 Alpha ntpd[1204]: synchronized to 207.150.167.80, stratum 2
Sep 22 19:12:45 Alpha emhttp: shcmd (21): /usr/sbin/hdparm -y /dev/sdg >/dev/null
Sep 22 19:12:45 Alpha emhttp: shcmd (22): /usr/sbin/hdparm -y /dev/sda >/dev/null
Sep 22 19:21:12 Alpha ntpd[1204]: time reset -0.397888 s
Sep 22 19:21:47 Alpha ntpd[1204]: synchronized to 207.150.167.80, stratum 2
Sep 22 19:41:02 Alpha ntpd[1204]: time reset -0.407795 s
Sep 22 19:43:21 Alpha ntpd[1204]: synchronized to 207.150.167.80, stratum 2
Sep 22 19:55:00 Alpha kernel: md: sync done. time=15671sec rate=62329K/sec
Sep 22 19:55:00 Alpha kernel: md: recovery thread sync completion status: 0
Sep 22 19:57:21 Alpha ntpd[1204]: time reset -0.387149 s
Sep 22 19:57:41 Alpha ntpd[1204]: synchronized to 207.150.167.80, stratum 2

I am also adding the full syslog in case you can find something I missed. Do you have any idea where these errors could come from and how to solve them?
September 24, 2009 (Author)

Bump! I just finished a new check: 68 errors, and still nothing showing in the syslog. Does anyone know a Linux command to get more detail out of the syslog?
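For reading the syslog itself, one option is to strip out the ntpd chatter so that only the kernel's disk-related lines remain. What follows is a minimal sketch, not an unRAID tool: the function name `interesting_lines` and the choice of prefixes (`md:`, `ataN`, `sd N:`) are assumptions based only on the excerpts quoted in this thread, and the pattern would need widening to catch continuation lines such as the `res 40/...` one.

```python
import re

# Kernel message prefixes seen in this thread's excerpts (an assumption,
# not an exhaustive list of what the kernel can log about disks).
KERNEL_PATTERN = re.compile(r"kernel: (md:|ata\d+|sd \d+:)")

def interesting_lines(lines):
    """Keep only kernel md/ata/sd lines, dropping ntpd, emhttp and login noise."""
    return [line.rstrip("\n") for line in lines if KERNEL_PATTERN.search(line)]

# Sample lines copied from the syslog excerpts above.
sample = [
    "Sep 20 03:35:31 Alpha kernel: ata6.00: exception Emask 0x12 SAct 0x0 SErr 0x4850400 action 0xe frozen",
    "Sep 22 15:33:48 Alpha kernel: md: recovery thread checking parity...",
    "Sep 22 15:40:56 Alpha ntpd[1204]: time reset -0.424898 s",
    "Sep 22 18:14:45 Alpha emhttp: shcmd (19): /usr/sbin/hdparm -y /dev/sdh >/dev/null",
    "Sep 22 19:55:00 Alpha kernel: md: sync done. time=15671sec rate=62329K/sec",
]

kept = interesting_lines(sample)
for line in kept:
    print(line)
```

To run it against a real log, pass an open file object instead of `sample`; the log's location on a given server is something to check rather than assume. If the filtered output for Sep 22 contains only the `md:` start and completion lines, that would confirm the drives reported nothing during the check.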
September 24, 2009

Your syslog looks fine, and I'm happy you solved the cabling problem; that seems to be an increasingly common source of 'drive' errors.

I'd like to respectfully address a possible misconception though. As far as I know, drive exceptions/errors are not directly related to parity errors, although it is possible for drive errors to result in an inability to read/write a drive or a sector thereof, resulting in an inability to calculate parity. In particular, interface errors related to the communication of data (e.g. bad cabling) should NEVER (as far as I know!) result in parity errors.

The data path from the media surface of the drive to the original requester is *supposed* to be error-checked and mostly error-corrected the entire way. If data handling functions along the path are working correctly (no firmware or driver crashes), every chunk of data is error-checked at every step of the path. At many steps, if data reception discovers corruption, then operations are transparently repeated until the data is correct. If the data cannot be corrected, then errors are returned (and possibly logged) along the path, but the bad data is NEVER returned (except perhaps in special debug circumstances). Ultimately, the data request is either repeated until correct or cancelled, but the bad data is not returned for use.

A parity calculation therefore is always performed on good data, or it is not performed at all. If it cannot be performed, then there will be obvious errors logged. I'm not positive on this, but there may be circumstances where a parity error is recorded in that case, although I don't think it should be. It should really be called a 'failure to calculate parity' error.

Now for the problems you are having: in most past cases where repeated parity checks kept returning batches of parity errors, often with differing counts, the number one cause was bad memory. I strongly recommend running a memory test on your system. I suspect you will find problems on the first pass, but if not, leave it running all night. Memory tests should be able to run and run and run, with absolutely zero errors returned. If you see even one, then you need to start testing the individual memory modules, attempting to isolate which ones produce errors. Sometimes it is just easier to replace them all with a matched set. It is also possible that your memory settings are not correct.

The next possible source of issues, in my opinion, is heat. Check for overheating of motherboard chipsets, especially the bridge chipsets.
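To illustrate why bad memory can produce parity errors with a clean syslog, here is a minimal sketch. It assumes single parity computed as a byte-wise XOR across the data drives, which is my understanding of how unRAID parity worked at the time; the block contents and the helper names `xor_parity` and `count_mismatches` are made up for the example. The drives hand back good data, a bit flips in RAM afterwards, and the recomputed parity no longer matches what is stored.

```python
from functools import reduce

def xor_parity(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def count_mismatches(blocks, stored_parity):
    """Count byte positions where recomputed parity differs from stored parity."""
    recomputed = xor_parity(blocks)
    return sum(1 for got, want in zip(recomputed, stored_parity) if got != want)

# Three hypothetical data-drive blocks and the parity written for them.
data = [
    bytes([0x11, 0x22, 0x33, 0x44]),
    bytes([0xA0, 0xB0, 0xC0, 0xD0]),
    bytes([0x0F, 0x0E, 0x0D, 0x0C]),
]
parity = xor_parity(data)

# A check on intact data finds nothing.
clean = count_mismatches(data, parity)

# Simulate one bit flipping in RAM after the drive returned correct data.
flipped = bytearray(data[1])
flipped[2] ^= 0x04
corrupted = [data[0], bytes(flipped), data[2]]
bad = count_mismatches(corrupted, parity)

print(clean, bad)
```

No drive or interface reported anything in this scenario, so nothing would reach the syslog, yet the check counts an error. Intermittent flips would also explain counts that differ from run to run.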
September 24, 2009

Another source of issues is certain chipsets and/or controllers which corrupt the data while in transit to the drive. The last one we saw was a controller issue. It was a constant problem until the drives were moved off that controller.
September 25, 2009 (Author)

Hey guys! Thank you for your advice. The first memory check (2 passes) did not indicate any error; I'll let it run for a whole night and post the results tomorrow. If I made a mistake in the memory settings (I do not think I tweaked those though), would that show errors in Memtest?

As for the controller, WeeboTech, do you think a good test would be to swap 2 hard drives between 2 different controllers (one being the suspected one, of course), reassign them correctly and test again?

For the temperature suspect, I do not have any idea how to check that except to reboot and check in the BIOS. Would that be a good way to check it, or is there a Linux command which can gather this kind of info from a working machine?

Again, thank you very much for your replies.
September 25, 2009

"Thank you for your advice. The first memory check (2 passes) did not indicate any error; I'll let it run for a whole night and post the results tomorrow. If I made a mistake in the memory settings (I do not think I tweaked those though), would that show errors in Memtest?"

Memory that is configured too aggressively for performance, or that does not match the motherboard's requirements, can look just like bad memory and fail the memory test. Sometimes you need to use slower settings than the default for a specific memory module to satisfy a particular motherboard. Once you can pass a very long memory test without a single error, you should be fine.

"As for the controller, WeeboTech, do you think a good test would be to swap 2 hard drives between 2 different controllers (one being the suspected one, of course), reassign them correctly and test again?"

I shouldn't speak for him, but I think what he would say is that it would be best to move ALL drives off a suspect controller, and then test. If eliminating the use of a specific controller stops all errors, then you can assume you have found the guilty party.

"For the temperature suspect, I do not have any idea how to check that except to reboot and check in the BIOS. Would that be a good way to check it, or is there a Linux command which can gather this kind of info from a working machine?"

From UnMENU - System Info - CPU Info, you can see your CPU temp(s) and possibly more (it would be nice to see those temps more prominently on the MyMain screens, auto-updated too!). The BIOS setup screen can also be helpful. But the temps that are probably of most interest here are those of the bridge chipsets, and the easiest way to get a 'feel' for them is to *feel* them. What I do is open the case, make sure the machine is running, touch metal parts of the case to ensure I am grounded, then let the back of a finger come close and barely touch the center of the largest chips on the board, being very careful not to touch any metal leads. They should feel cool, warm, or hot, but NOT too hot to touch.

Also, when you first open the case, check for non-spinning fans anywhere. That's almost always a Very Bad Thing.
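On the question of reading temperatures from a running machine: where the kernel has a sensor driver loaded, Linux exposes readings through the hwmon sysfs tree. The sketch below is generic Linux, not an unRAID feature. Whether this particular board's chipset has a supported sensor at all is an assumption, and the helper names `millidegrees_to_celsius` and `read_temps` are made up for the example.

```python
from pathlib import Path

def millidegrees_to_celsius(raw):
    """hwmon temp*_input files hold an integer in thousandths of a degree C."""
    return int(raw.strip()) / 1000.0

def read_temps(root="/sys/class/hwmon"):
    """Return {sensor file path: degrees C} for every readable temperature input."""
    base = Path(root)
    temps = {}
    if not base.is_dir():
        return temps  # no hwmon tree: no sensor drivers loaded, or not Linux
    # Older kernels put the files under hwmonN/device/, newer ones under hwmonN/.
    for pattern in ("hwmon*/temp*_input", "hwmon*/device/temp*_input"):
        for sensor in sorted(base.glob(pattern)):
            try:
                temps[str(sensor)] = millidegrees_to_celsius(sensor.read_text())
            except (OSError, ValueError):
                continue  # sensor present but unreadable; skip it
    return temps

if __name__ == "__main__":
    for name, celsius in read_temps().items():
        print(f"{name}: {celsius:.1f} C")
```

If it prints nothing, no sensor driver is loaded for the board, and the finger test or the BIOS screen remains the practical option.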
September 26, 2009 (Author)

Hi RobJ! After 25 passes the memory test reports green, no errors at all. The chipset heat sinks are warm, as in not hot, just a bit above body temperature; I can definitely leave my finger on them with no discomfort.

So this leaves me with the controllers, which is not that bad as I had my eye on a SATA2 one (4 ports, on eBay for about 40€) to replace the SATA ones I currently have in the server. The question that remains is: which controller to replace? I have 4:

- Motherboard SATA1 (2 connections)
- Motherboard SATA2 (1 connection >> parity drive)
- Adaptec SATA1 on PCI (2 ports)
- SLi (eBay) SATA2 on PCI-Express (2 ports)

As I have no way of knowing which port/drive is faulty, I guess I can only hope it is one of the SATA1 controllers.

Thanks again for your help and advice.
This topic is now archived and is closed to further replies.