August 18, 2010

I'm currently trying to rebuild data onto a new hard drive. Twice now it has started out fine, but eventually resulted in "kernel panic" messages saying it was out of memory with no killable processes. Both times I couldn't access the syslog. I started the rebuild again and have been paying closer attention to it. The rebuild started automatically and was originally transferring at around 17,000 KB/sec, but has now slowed to 1,700 KB/sec. At this rate it will take about ten days to finish. I opened the syslog and see that there is definitely an issue:

Aug 4 21:35:38 Tower emhttp: shcmd (11): killall -HUP smbd
Aug 4 21:35:38 Tower emhttp: shcmd (12): /etc/rc.d/rc.nfsd restart >/dev/null
Aug 4 21:48:00 Tower kernel: hda: dma_timer_expiry: dma status == 0x20
Aug 4 21:48:00 Tower kernel: hda: DMA timeout retry
Aug 4 21:48:00 Tower kernel: hda: timeout waiting for DMA
Aug 4 21:48:00 Tower kernel: hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
Aug 4 21:48:00 Tower kernel: hda: task_in_intr: error=0x04 { DriveStatusError }
Aug 4 21:48:00 Tower kernel: ide: failed opcode was: unknown
Aug 4 21:48:00 Tower kernel: hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
Aug 4 21:48:00 Tower kernel: hda: task_in_intr: error=0x04 { DriveStatusError }
Aug 4 21:48:00 Tower kernel: ide: failed opcode was: unknown
Aug 4 21:48:00 Tower kernel: hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
Aug 4 21:48:00 Tower kernel: hda: task_in_intr: error=0x04 { DriveStatusError }
Aug 4 21:48:00 Tower kernel: ide: failed opcode was: unknown
Aug 4 21:48:00 Tower kernel: hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
Aug 4 21:48:00 Tower kernel: hda: task_in_intr: error=0x04 { DriveStatusError }
Aug 4 21:48:00 Tower kernel: ide: failed opcode was: unknown
Aug 4 21:48:01 Tower kernel: ide0: reset: success
Aug 4 21:48:29 Tower kernel: hda: dma_timer_expiry: dma status == 0x20
Aug 4 21:48:29 Tower kernel: hda: DMA timeout retry
Aug 4 21:48:29 Tower kernel: hda: timeout waiting for DMA
Aug 4 21:48:57 Tower kernel: hda: dma_timer_expiry: dma status == 0x20
Aug 4 21:48:57 Tower kernel: hda: DMA timeout retry
Aug 4 21:48:57 Tower kernel: hda: timeout waiting for DMA
Aug 4 21:49:23 Tower kernel: hda: dma_timer_expiry: dma status == 0x20
Aug 4 21:49:23 Tower kernel: hda: DMA timeout retry
Aug 4 21:49:23 Tower kernel: hda: timeout waiting for DMA
Aug 4 21:50:00 Tower kernel: hda: lost interrupt
Aug 4 21:51:00 Tower last message repeated 2 times
Aug 4 21:54:41 Tower last message repeated 2 times
Aug 4 22:06:57 Tower in.telnetd[1350]: connect from 192.168.1.2 (192.168.1.2)
Aug 4 22:07:02 Tower login[1351]: ROOT LOGIN on `pts/0' from `192.168.1.2'

Any ideas? Full syslog attached. Thanks in advance.

syslog.txt
August 18, 2010

You'd best back up a bit and fill us in on what you are doing, and why.

From your syslog it appears as if you are trying to replace disk2 (/dev/sdc).

Your syslog also shows hardware problems: the OS keeps resetting the disk controllers in an attempt to talk to the disks. Those messages mostly seem to involve /dev/hda (disk4):

Aug 4 21:48:00 Tower kernel: hda: dma_timer_expiry: dma status == 0x20
Aug 4 21:48:00 Tower kernel: hda: DMA timeout retry
Aug 4 21:48:00 Tower kernel: hda: timeout waiting for DMA
Aug 4 21:48:00 Tower kernel: hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
Aug 4 21:48:00 Tower kernel: hda: task_in_intr: error=0x04 { DriveStatusError }
Aug 4 21:48:00 Tower kernel: ide: failed opcode was: unknown
Aug 4 21:48:00 Tower kernel: hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
Aug 4 21:48:00 Tower kernel: hda: task_in_intr: error=0x04 { DriveStatusError }
Aug 4 21:48:00 Tower kernel: ide: failed opcode was: unknown
Aug 4 21:48:00 Tower kernel: hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
Aug 4 21:48:00 Tower kernel: hda: task_in_intr: error=0x04 { DriveStatusError }
Aug 4 21:48:00 Tower kernel: ide: failed opcode was: unknown
Aug 4 21:48:00 Tower kernel: hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
Aug 4 21:48:00 Tower kernel: hda: task_in_intr: error=0x04 { DriveStatusError }
Aug 4 21:48:00 Tower kernel: ide: failed opcode was: unknown
Aug 4 21:48:01 Tower kernel: ide0: reset: success
Aug 4 21:48:29 Tower kernel: hda: dma_timer_expiry: dma status == 0x20
Aug 4 21:48:29 Tower kernel: hda: DMA timeout retry
Aug 4 21:48:29 Tower kernel: hda: timeout waiting for DMA
Aug 4 21:48:57 Tower kernel: hda: dma_timer_expiry: dma status == 0x20
Aug 4 21:48:57 Tower kernel: hda: DMA timeout retry
Aug 4 21:48:57 Tower kernel: hda: timeout waiting for DMA
Aug 4 21:49:23 Tower kernel: hda: dma_timer_expiry: dma status == 0x20
Aug 4 21:49:23 Tower kernel: hda: DMA timeout retry
Aug 4 21:49:23 Tower kernel: hda: timeout waiting for DMA
Aug 4 21:50:00 Tower kernel: hda: lost interrupt

When enough of these messages are written to the syslog, you'll use up all available RAM and crash, exactly as you have experienced.

Basically, disk4 is not working. It might be a bad disk, a bad cable, a bad power connection or splitter, or even a marginal power supply.

When initially booting, almost all your disks report hardware problems:

Aug 4 21:35:28 Tower kernel: ata1: softreset failed (device not ready)
Aug 4 21:35:28 Tower kernel: ata1: failed due to HW bug, retry pmp=0
Aug 4 21:35:28 Tower kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 4 21:35:28 Tower kernel: ata1.00: HPA detected: current 2930277168, native 18446744072344861488
Aug 4 21:35:28 Tower kernel: ata1.00: ATA-7: SAMSUNG HD154UI, 1AG01118, max UDMA7
Aug 4 21:35:28 Tower kernel: ata1.00: 2930277168 sectors, multi 16: LBA48 NCQ (depth 31/32)
Aug 4 21:35:28 Tower kernel: ata1.00: configured for UDMA/133
Aug 4 21:35:28 Tower kernel: ata2: softreset failed (device not ready)
Aug 4 21:35:28 Tower kernel: ata2: failed due to HW bug, retry pmp=0
Aug 4 21:35:28 Tower kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 4 21:35:28 Tower kernel: ata2.00: ATA-8: WDC WD10EADS-65L5B1, 01.01A01, max UDMA/133
Aug 4 21:35:28 Tower kernel: ata2.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 31/32)
Aug 4 21:35:28 Tower kernel: ata2.00: configured for UDMA/133
Aug 4 21:35:28 Tower kernel: ata3: softreset failed (device not ready)
Aug 4 21:35:28 Tower kernel: ata3: failed due to HW bug, retry pmp=0
Aug 4 21:35:28 Tower kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 4 21:35:28 Tower kernel: ata3.00: HPA detected: current 2930277168, native 18446744072344861488
Aug 4 21:35:28 Tower kernel: ata3.00: ATA-7: SAMSUNG HD154UI, 1AG01118, max UDMA7
Aug 4 21:35:28 Tower kernel: ata3.00: 2930277168 sectors, multi 16: LBA48 NCQ (depth 31/32)
Aug 4 21:35:28 Tower kernel: ata3.00: configured for UDMA/133
Aug 4 21:35:28 Tower kernel: ata4: softreset failed (device not ready)
Aug 4 21:35:28 Tower kernel: ata4: failed due to HW bug, retry pmp=0
Aug 4 21:35:28 Tower kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 4 21:35:28 Tower kernel: ata4.00: ATA-8: WDC WD10EADS-00L5B1, 01.01A01, max UDMA/133
Aug 4 21:35:28 Tower kernel: ata4.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 31/32)
Aug 4 21:35:28 Tower kernel: ata4.00: configured for UDMA/133

Now, did disk2 fail? Or are you just trying to upgrade it?

Your big problem seems to be disk4. Can you get a SMART report on disk4?

smartctl -d ata -a /dev/hda
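For anyone who wants to confirm from a syslog which device the kernel complaints cluster on, here is a minimal Python sketch. It is not from the original posts. It assumes the classic syslog layout seen in this thread ("Aug 4 21:48:00 Tower kernel: hda: ..."); the function name `count_kernel_messages` and the short sample list are illustrative only.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   Aug 4 21:48:00 Tower kernel: hda: DMA timeout retry
# and captures the device prefix (hda, ide0, ata1.00, ...).
KERNEL_LINE = re.compile(r"kernel: (?P<dev>[a-z]+[0-9.]*): ")

def count_kernel_messages(lines):
    """Count kernel messages per device prefix."""
    counts = Counter()
    for line in lines:
        match = KERNEL_LINE.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts

# A few lines copied from the excerpt in this thread (illustrative subset).
sample = [
    "Aug 4 21:35:38 Tower emhttp: shcmd (11): killall -HUP smbd",
    "Aug 4 21:48:00 Tower kernel: hda: DMA timeout retry",
    "Aug 4 21:48:00 Tower kernel: hda: timeout waiting for DMA",
    "Aug 4 21:48:01 Tower kernel: ide0: reset: success",
    "Aug 4 21:50:00 Tower kernel: hda: lost interrupt",
]

for device, count in count_kernel_messages(sample).most_common():
    print(device, count)
```

To run it against a real file, pass an open file object instead of the sample list, for example `count_kernel_messages(open("syslog.txt"))`. A device that dominates the tally is a reasonable first suspect, though the count alone cannot tell a bad disk from a bad cable or adapter.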
August 19, 2010 Author

I was replacing disk2 since it looked like it had failed, but now I'm wondering if the problem was actually caused by something else. The web console displayed a red ball next to disk2, so I removed it from the tower and placed it in an external enclosure I had. I tried to check the contents through Windows using some software I had downloaded (I think it was called DiskInternals), and it couldn't locate the drive, so I figured it had died. Maybe I was too hasty in diagnosing the problem on my own. I'm definitely in over my head at this point.

One thing that has me suspicious is that my power supply had very few SATA connectors, and I ended up buying a bunch of cheap IDE to SATA adaptors. I had an extra one lying around, so I swapped out the one that connected to disk4. I figure if I'm lucky, that will solve the problem.

I ran the smartctl command as you suggested, but I'm not sure what I should be looking for. I'm uploading that and a new syslog. So far things appear to be going well, and data is transferring at about 58,000 KB/sec, which is much faster than before.

smart.txt syslog-2010-08-18.txt
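On the question of what to look for in a SMART report, here is a rough Python sketch, offered as one possible starting point and not as a verdict on the attached smart.txt. It assumes the usual attribute table printed by `smartctl -a` (ten whitespace-separated columns ending in RAW_VALUE) and watches four commonly cited attributes. The function name `flag_smart_attributes` and the sample table are made up for illustration.

```python
# Attributes whose raw value is commonly expected to stay at 0 on a healthy drive.
# UDMA_CRC_Error_Count in particular tends to point at cabling or adapters
# rather than at the disk itself.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

def flag_smart_attributes(report_text):
    """Return {attribute_name: raw_value} for watched attributes with raw value > 0."""
    flagged = {}
    for line in report_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged

# Hypothetical sample table, NOT taken from the attached smart.txt.
sample_report = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12
"""

print(flag_smart_attributes(sample_report))
```

An empty result does not prove the drive is healthy; it only means none of the watched counters has moved.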
August 20, 2010 Author

It appears that the IDE to SATA adaptor may have been the problem. When I woke up this morning the terminal showed "id c1 respawning too fast", and the same thing for IDs c2, c3 and c4. I couldn't log in, so I had to do a hard power down. When I restarted, I found that the server had completed the rebuild overnight. It began a parity check this morning, which has now completed. It says parity is valid and all drives are operational, although the parity check found 26988143 errors. Is that something I should be concerned about?
August 20, 2010

You should do another parity check. A few million errors indicates a huge issue.

Expect NO parity errors on this next "check". If you do have errors, you have a hardware issue of some kind.
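To illustrate why a second check is expected to come back clean, here is a toy Python sketch. It rests on two assumptions the thread does not spell out: that parity is a plain XOR across the data disks, and that the check corrects parity wherever it disagrees with the data. The names `xor_blocks` and `parity_check` and the tiny integer "disks" are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR together one block from each data disk."""
    return reduce(lambda a, b: a ^ b, blocks, 0)

def parity_check(data_disks, parity, correct=True):
    """Compare stored parity with the XOR of the data disks.

    Returns the number of mismatching positions. With correct=True
    (an assumption about how the check behaves), each mismatch is
    rewritten in the parity list.
    """
    errors = 0
    for i in range(len(parity)):
        expected = xor_blocks(disk[i] for disk in data_disks)
        if parity[i] != expected:
            errors += 1
            if correct:
                parity[i] = expected
    return errors

# Three tiny "data disks" and a parity disk that is stale in two positions.
data_disks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
parity = [0, 15, 0]

first = parity_check(data_disks, parity)
second = parity_check(data_disks, parity)
print(first, second)
```

Under those assumptions, once the first pass has brought parity back in line, a second pass over unchanged data has nothing left to find. Errors on the second pass would mean something is returning different data between reads, which points at hardware.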
August 20, 2010 Author

I ran another parity check and this one resulted in 0 errors. I'm not convinced I'm totally out of the woods, and plan to keep a close eye on the system. Thank you very much for your help.
September 5, 2010 Author

I'm having trouble again, and it appears disk4 has failed. I've tried a number of different IDE to SATA adaptors and it constantly displays a red ball. Before going out and buying a replacement drive, I figured I would upload the syslog and SMART capture. Any thoughts?

syslog-2010-09-05.txt smart2.txt
September 5, 2010

The SMART report looks quite normal. I don't think the drive has died at all.

It will stay "red" until the array thinks it has been replaced. (It will not automatically go back online if you fixed a bad connection or replaced a bad SATA-IDE adapter.)

To get it back online and have it reconstruct the contents onto itself, perform these steps:

1. Stop the array.
2. Un-assign disk4.
3. Start the array with disk4 un-assigned. (This will cause the array to forget the model/serial number of the "failed" disk4 drive.)
4. Stop the array once more.
5. Re-assign disk4.
6. Start the array once more. This time it will begin the process of reconstructing disk4. (Remember, it was taken offline when a write to it failed, so it needs to be reconstructed.)

Do NOT press the button labeled "restore", as it is actually an "Initialize Configuration and Immediately Invalidate Parity based on any Prior Configuration" button. It would cause you to lose the data on disk4 if it was really defective.

Let's hope it was just the SATA/IDE adapter.

Joe L.