Mailman74 Posted April 2, 2012
I have been having problems for the last 2 days with my server not responding. I have to manually reboot it after trying to reboot via shell in a box. It runs OK for a few hours after a reboot but then becomes unresponsive. I thought it was network related, but the other 4 PCs on the network are fine. I am wondering if it could be my PCI NIC card. I am attaching the syslog from the last startup. I do not see any syslogs on my flash from the last 2-3 weeks.
Attachment: syslog-2012-04-01.txt
Joe L. Posted April 2, 2012
Mailman74 wrote:
    I have been having problems for the last 2 days with my server not responding. I have to manually reboot it after trying to reboot via shell in a box. It runs OK for a few hours after a reboot but then becomes unresponsive. I thought it was network related, but the other 4 PCs on the network are fine. I am wondering if it could be my PCI NIC card. I am attaching the syslog from the last startup. I do not see any syslogs on my flash from the last 2-3 weeks.

Syslogs are not saved on the flash drive automatically. They are in /var/log/

This drive:

Apr 1 19:53:44 Tower kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
Apr 1 19:53:44 Tower kernel: ata1.00: ATA-8: Hitachi HDS5C3020ALA632, ML6OA180, max UDMA/133
Apr 1 19:53:44 Tower kernel: ata1.00: 3907029168 sectors, multi 16: LBA48 NCQ (depth 31/32)
Apr 1 19:53:44 Tower kernel: ata1.00: configured for UDMA/100
Apr 1 19:53:44 Tower kernel: scsi 0:0:0:0: Direct-Access ATA Hitachi HDS5C302 ML6O PQ: 0 ANSI: 5
Apr 1 19:53:44 Tower kernel: sd 0:0:0:0: [sdb] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Apr 1 19:53:44 Tower kernel: sd 0:0:0:0: [sdb] Write Protect is off
Apr 1 19:53:44 Tower kernel: sd 0:0:0:0: [sdb] Mode Sense: 00 3a 00 00
Apr 1 19:53:44 Tower kernel: sd 0:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FU

is having lots of CRC errors (checksum errors in communicating with it over the SATA link).

It is your parity drive:

Apr 1 19:56:31 Tower kernel: md: import disk0: [8,16] (sdb) Hitachi_HDS5C3020ALA632_ML0220F30B4EJD size: 195351455
Apr 1 19:57:25 Tower kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x280000 action 0x6
Apr 1 19:57:25 Tower kernel: ata1.00: irq_stat 0x00020002, device error via D2H FIS
Apr 1 19:57:25 Tower kernel: ata1: SError: { 10B8B BadCRC }
Apr 1 19:57:25 Tower kernel: ata1.00: failed command: READ DMA EXT
Apr 1 19:57:25 Tower kernel: ata1.00: cmd 25/00:70:f0:16:37/00:01:00:00:00/e0 tag 0 dma 188416 in
Apr 1 19:57:25 Tower kernel: res 51/84:41:1f:17:37/00:01:00:00:00/00 Emask 0x10 (ATA bus error)
Apr 1 19:57:25 Tower kernel: ata1.00: status: { DRDY ERR }
Apr 1 19:57:25 Tower kernel: ata1.00: error: { ICRC ABRT }
Apr 1 19:57:25 Tower kernel: ata1: hard resetting link
Apr 1 19:57:28 Tower kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
Apr 1 19:57:28 Tower kernel: ata1.00: configured for UDMA/100
Apr 1 19:57:28 Tower kernel: ata1: EH complete
Apr 1 19:57:32 Tower kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x280000 action 0x6
Apr 1 19:57:32 Tower kernel: ata1.00: irq_stat 0x00020002, device error via D2H FIS
Apr 1 19:57:32 Tower kernel: ata1: SError: { 10B8B BadCRC }
Apr 1 19:57:32 Tower kernel: ata1.00: failed command: READ DMA EXT
Apr 1 19:57:32 Tower kernel: ata1.00: cmd 25/00:f0:48:f3:3e/00:03:00:00:00/e0 tag 0 dma 516096 in
Apr 1 19:57:32 Tower kernel: res 51/84:a1:97:f4:3e/00:02:00:00:00/00 Emask 0x10 (ATA bus error)
Apr 1 19:57:32 Tower kernel: ata1.00: status: { DRDY ERR }
Apr 1 19:57:32 Tower kernel: ata1.00: error: { ICRC ABRT }
Apr 1 19:57:32 Tower kernel: ata1: hard resetting link
Apr 1 19:57:34 Tower kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
Apr 1 19:57:34 Tower kernel: ata1.00: configured for UDMA/100
Apr 1 19:57:34 Tower kernel: ata1: EH complete
Apr 1 19:57:38 Tower last message repeated 34 times
Apr 1 19:57:38 Tower kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x280000 action 0x6
Apr 1 19:57:38 Tower kernel: ata1.00: irq_stat 0x00020002, device error via D2H FIS
Apr 1 19:57:38 Tower kernel: ata1: SError: { 10B8B BadCRC }
Apr 1 19:57:38 Tower kernel: ata1.00: failed command: READ DMA EXT
Apr 1 19:57:38 Tower kernel: ata1.00: cmd 25/00:00:48:ef:46/00:04:00:00:00/e0 tag 0 dma 524288 in
Apr 1 19:57:38 Tower kernel: res 51/84:e1:67:ef:46/00:03:00:00:00/00 Emask 0x10 (ATA bus error)
Apr 1 19:57:38 Tower kernel: ata1.00: status: { DRDY ERR }
Apr 1 19:57:38 Tower kernel: ata1.00: error: { ICRC ABRT }
Apr 1 19:57:38 Tower kernel: ata1: hard resetting link
Apr 1 19:57:41 Tower kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
Apr 1 19:57:41 Tower kernel: ata1.00: configured for UDMA/100
Apr 1 19:57:41 Tower kernel: ata1: EH complete
Apr 1 19:57:44 Tower kernel: ata1: limiting SATA link speed to 1.5 Gbps
Apr 1 19:57:44 Tower kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x280000 action 0x6
Apr 1 19:57:44 Tower kernel: ata1.00: irq_stat 0x00020002, device error via D2H FIS
Apr 1 19:57:44 Tower kernel: ata1: SError: { 10B8B BadCRC }
Apr 1 19:57:44 Tower kernel: ata1.00: failed command: READ DMA EXT
Apr 1 19:57:44 Tower kernel: ata1.00: cmd 25/00:00:48:7b:4d/00:04:00:00:00/e0 tag 0 dma 524288 in
Apr 1 19:57:44 Tower kernel: res 51/84:b1:97:7c:4d/00:02:00:00:00/00 Emask 0x10 (ATA bus error)
Apr 1 19:57:44 Tower kernel: ata1.00: status: { DRDY ERR }
Apr 1 19:57:44 Tower kernel: ata1.00: error: { ICRC ABRT }
Apr 1 19:57:44 Tower kernel: ata1: hard resetting link
Apr 1 19:57:46 Tower kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 10)
Apr 1 19:57:46 Tower kernel: ata1.00: configured for UDMA/100
Apr 1 19:57:46 Tower kernel: ata1: EH complete

Usually, this is bad cabling, or cabling picking up noise from wires it is bundled with, or a bad power supply or one unable to supply the drives clean power.
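For anyone who would rather tally these errors than read them by eye, here is a minimal Python sketch that counts `BadCRC` reports, hard link resets, and link speed downgrades per ATA port. The function name `summarize_ata_errors` and the regular expressions are illustrative assumptions built only from the message formats quoted above; other kernel versions may word these lines differently.

```python
import re
from collections import Counter

# Patterns modelled on the kernel messages quoted in this thread, e.g.
#   "Apr 1 19:57:25 Tower kernel: ata1: SError: { 10B8B BadCRC }"
SERROR_RE = re.compile(r"kernel: (ata\d+): SError: \{([^}]*)\}")
RESET_RE = re.compile(r"kernel: (ata\d+): hard resetting link")
LIMIT_RE = re.compile(r"kernel: (ata\d+): limiting SATA link speed to ([\d.]+) Gbps")


def summarize_ata_errors(lines):
    """Count CRC errors, hard resets and speed downgrades per ATA port."""
    crc_errors = Counter()
    hard_resets = Counter()
    speed_limited_to = {}
    for line in lines:
        match = SERROR_RE.search(line)
        if match:
            # The braces hold space-separated flags such as "10B8B BadCRC".
            if "BadCRC" in match.group(2).split():
                crc_errors[match.group(1)] += 1
            continue
        match = RESET_RE.search(line)
        if match:
            hard_resets[match.group(1)] += 1
            continue
        match = LIMIT_RE.search(line)
        if match:
            # Remember the most recent speed the kernel fell back to.
            speed_limited_to[match.group(1)] = match.group(2)
    return {
        "crc_errors": dict(crc_errors),
        "hard_resets": dict(hard_resets),
        "speed_limited_to": speed_limited_to,
    }
```

Passing an open file such as `open("/var/log/syslog")` to `summarize_ata_errors` should give one count per port. The counts are a lower bound, because syslog collapses repeats into lines like "last message repeated 34 times", which this sketch does not expand.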
Mailman74 Posted April 2, 2012
Joe L. wrote:
    Is having lots of CRC errors (checksum errors in communicating with it over the SATA link). It is your parity drive:
    Apr 1 19:56:31 Tower kernel: md: import disk0: [8,16] (sdb) Hitachi_HDS5C3020ALA632_ML0220F30B4EJD size: 195351455
    Usually, this is bad cabling, or cabling picking up noise from wires it is bundled with, or a bad power supply or one unable to supply the drives clean power.

Thanks a lot for helping me out. OK, I had an extra SATA cable and swapped it in for the old one. What should I do now to see if it fixed the problem? I really do not know how to check the PSU. Do you have any instructions on how to rule out a faulty PSU? I have also attached the new syslog, taken after the new SATA cable and a reboot.
Attachment: syslog-2012-04-01_1.txt
Mailman74 Posted April 2, 2012
If it is the PSU, would this be a good choice? http://www.microcenter.com/single_product_results.phtml?product_id=0376717
dgaschk Posted April 3, 2012
That one should work. There is a good selection here: http://lime-technology.com/forum/index.php?topic=12219.0
Mailman74 Posted April 4, 2012
OK, I checked the cables and they are fine. I bought and installed a new PSU and I am having the same problems. The server will run for a few hours and then become unresponsive. I try rebooting through shell in a box and the server does not reboot. I have to manually shut it down and restart the server, then it will work again. It will run for a few hours and then the same problem comes back. I do have a PCI NIC card installed because the mobo NIC was not working.
dgaschk Posted April 4, 2012
Post a SMART report for the parity drive.
Mailman74 Posted April 4, 2012
dgaschk wrote:
    Post SMART report for the parity drive.

Also, the parity drive is hooked up to a SATA2 Serial ATA II PCI-Express RAID Controller Card (Silicon Image SIL3132). I think I already tried moving the cable to the mobo and it did not solve the problem.
Attachment: smart.txt
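When reading a SMART report for a problem like this, the attributes usually worth a first look are 199 (UDMA_CRC_Error_Count, which tracks interface errors) versus 5 (Reallocated_Sector_Ct) and 197 (Current_Pending_Sector), which point at the disk surface. The Python sketch below assumes the report is the text attribute table printed by `smartctl -A`; the contents of smart.txt are not shown in this thread, so the helper names `parse_smart_attributes` and `classify` and the three-way verdict are purely illustrative, and not every drive reports all three attributes.

```python
import re

# One row of the "smartctl -A" table:
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ATTR_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+\S+\s+(\S+)"
)


def parse_smart_attributes(text):
    """Return {attribute_id: (name, raw_value)} from smartctl -A style text."""
    attrs = {}
    for line in text.splitlines():
        match = ATTR_RE.match(line)
        if match:
            raw = match.group(3)
            # Keep the first raw token; convert to int when it is plain digits.
            value = int(raw) if raw.isdigit() else raw
            attrs[int(match.group(1))] = (match.group(2), value)
    return attrs


def raw_value(attrs, attr_id):
    """Raw value of one attribute, or 0 if missing or not a plain integer."""
    value = attrs.get(attr_id, (None, 0))[1]
    return value if isinstance(value, int) else 0


def classify(attrs):
    """Rough hint only: 'media', 'link' or 'clean' (an assumed heuristic)."""
    if raw_value(attrs, 5) + raw_value(attrs, 197) > 0:
        return "media"  # reallocated or pending sectors: suspect the disk itself
    if raw_value(attrs, 199) > 0:
        return "link"   # CRC errors only: suspect cable, port or controller
    return "clean"
```

A "link" result would be consistent with the cabling or controller theory discussed here, but it is only a hint. The raw CRC counter never resets, so what matters is whether it keeps climbing after a cable or port change.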
greg631 Posted April 4, 2012
Mailman74 wrote:
    OK, I checked the cables and they are fine. I bought and installed a new PSU and I am having the same problems. The server will run for a few hours and then become unresponsive. I try rebooting through shell in a box and the server does not reboot. I have to manually shut it down and restart the server, then it will work again. It will run for a few hours and then the same problem comes back. I do have a PCI NIC card installed because the mobo NIC was not working.

Sounds like my experience too.
Mailman74 Posted April 4, 2012
OK, last night I moved the parity drive cable from the SATA2 Serial ATA II PCI-Express RAID Controller Card and plugged it into the mobo. The server has been up and running all night and I am no longer getting multiple emails about resync rebuilding. We were not using the server during that time, but I am hoping this fixed the problem. Could it be that the RAID card is bad or maybe too slow?
dgaschk Posted April 4, 2012
The syslog indicates a bad SATA cable: http://lime-technology.com/wiki/index.php/The_Analysis_of_Drive_Issues#Drive_Interface_Issues
Mailman74 Posted April 5, 2012
dgaschk wrote:
    The syslog indicates a bad SATA cable: http://lime-technology.com/wiki/index.php/The_Analysis_of_Drive_Issues#Drive_Interface_Issues

I have no idea how to read and understand the syslog. I did move the same cable from the RAID SATA card to the mobo, and the server has been running fine for about 18 hours now.
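As a starting point for reading lines like the ones quoted earlier in this thread, the flags inside the curly braces can be looked up in a small table. The Python sketch below covers only the six flags that appear in this syslog; the wording of each meaning is a plain-language gloss, and the names `FLAG_MEANINGS` and `explain` are made up for illustration.

```python
import re

# Only the flags seen in this thread's syslog; not a complete list.
FLAG_MEANINGS = {
    "10B8B": "10b-to-8b decode error on the SATA link",
    "BadCRC": "CRC error detected on the SATA link",
    "ICRC": "interface CRC error during the data transfer",
    "ABRT": "command aborted",
    "DRDY": "device ready",
    "ERR": "error bit set in the status register",
}

# Matches "SError: { ... }", "status: { ... }" and "error: { ... }".
BRACE_RE = re.compile(r"(SError|status|error): \{([^}]*)\}")


def explain(line):
    """Return [(flag, meaning), ...] for one kernel ATA message line."""
    match = BRACE_RE.search(line)
    if not match:
        return []
    return [
        (flag, FLAG_MEANINGS.get(flag, "not in this small table"))
        for flag in match.group(2).split()
    ]
```

For the line ending in `error: { ICRC ABRT }`, `explain` returns the two flags with their glosses: a transfer that failed its interface checksum and was then aborted. CRC flags like these describe data corrupted in transit between drive and controller, which is why the replies in this thread point at the cable, the port, or the controller card before the disk.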