September 7, 2006
Alright, I am going to do my best to make this as detailed as possible, and hopefully someone can advise how I should proceed. On Tuesday I received a new 400GB HD from Outpost. I was adding it to the array as /disk10. Throughput was very slow and it was hanging the array. Because I had just had a similar problem while adding /disk9, I figured it might be another bad mobile rack or a bad IDE cable. I connected the 400GB hard drive directly to the second bus on the second Promise card with a working IDE cable. I was still having the same problems: telnetting to the array took a few minutes, and loading the web-based management page took a few minutes. Once I got to the management utility I could see that the drive was clearing, but very slowly. So I figured maybe this Promise card is bad, and decided that I would contact Tom to see if I could get a replacement or a loaner card to confirm that the card is what is giving me problems. I brought my array back up, checked parity, and everything was great. Here is the log from the error:
Tower login: root
Linux 2.4.31.
root@Tower:~# tail -f /var/log/syslog Sep 5 20:04:44 Tower kernel: md: import hdl ST3200826A 5ND274LX offset: 63 size: 195360952 Sep 5 20:04:44 Tower kernel: md: import hde HDS722525VLAT80 VN693ECFV03RAD offset: 63 size: 244198552 Sep 5 20:04:48 Tower kernel: md: import hdf ST3300631A 5NF1K71E offset: 63 size: 293036152 Sep 5 20:04:48 Tower kernel: md: import hdg ST3400620A 3QG053Y7 offset: 63 size: 390711352 Sep 5 20:04:48 Tower kernel: md10: new disk Sep 5 20:04:48 Tower kernel: md: blkdev_get error: -6 Sep 5 20:05:28 Tower kernel: hdk: drive_cmd: status=0x51 { DriveReady SeekComplete Error } Sep 5 20:05:28 Tower kernel: hdk: drive_cmd: error=0x04 { DriveStatusError } Sep 5 20:05:28 Tower kernel: hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error } Sep 5 20:05:28 Tower kernel: hde: drive_cmd: error=0x04 { DriveStatusError } Sep 5 20:59:07 Tower smbd[1007]: [2006/09/05 20:59:07, 0] lib/util_sock.c:get_peer_addr(1000) Sep 5 20:59:07 Tower smbd[1007]: getpeername failed. Error was Transport endpoint is not connected Sep 5 20:59:07 Tower smbd[1007]: [2006/09/05 20:59:07, 0] lib/util_sock.c:write_socket_data(430) Sep 5 20:59:07 Tower smbd[1007]: write_socket_data: write failure. Error = Connection reset by peer Sep 5 20:59:07 Tower smbd[1007]: [2006/09/05 20:59:07, 0] lib/util_sock.c:write_socket(455) Sep 5 20:59:07 Tower smbd[1007]: write_socket: Error writing 4 bytes to socket 22: ERRNO = Connection reset by peer Sep 5 20:59:07 Tower smbd[1007]: [2006/09/05 20:59:07, 0] lib/util_sock.c:send_smb(647) Sep 5 20:59:07 Tower smbd[1007]: Error writing 4 bytes to client. -1. (Connection reset by peer) Sep 5 22:03:36 Tower nmbd[764]: [2006/09/05 22:03:35, 0] nmbd/nmbd_namequery.c:query_name_response(101) Sep 5 22:03:36 Tower nmbd[764]: query_name_response: Multiple (2) responses received for a query on subnet 192.168.0.12 for name MSHOME<1d>. Sep 5 22:03:37 Tower nmbd[764]: This response was from IP 192.168.0.10, reporting an IP address of 192.168.0.10. 
Sep 6 00:39:30 Tower smbd[1080]: [2006/09/06 00:39:30, 0] lib/util_sock.c:get_peer_addr(1000) Sep 6 00:39:30 Tower smbd[1080]: getpeername failed. Error was Transport endpoint is not connected Sep 6 00:39:30 Tower smbd[1080]: [2006/09/06 00:39:30, 0] lib/util_sock.c:get_peer_addr(1000) Sep 6 00:39:30 Tower smbd[1080]: getpeername failed. Error was Transport endpoint is not connected Sep 6 00:39:30 Tower smbd[1080]: [2006/09/06 00:39:30, 0] lib/util_sock.c:write_socket_data(430) Sep 6 00:39:30 Tower smbd[1080]: write_socket_data: write failure. Error = Connection reset by peer Sep 6 00:39:30 Tower smbd[1080]: [2006/09/06 00:39:30, 0] lib/util_sock.c:write_socket(455) Sep 6 00:39:30 Tower smbd[1080]: write_socket: Error writing 4 bytes to socket 22: ERRNO = Connection reset by peer Sep 6 00:39:30 Tower smbd[1080]: [2006/09/06 00:39:30, 0] lib/util_sock.c:send_smb(647) Sep 6 00:39:30 Tower smbd[1080]: Error writing 4 bytes to client. -1. (Connection reset by peer) Sep 6 01:12:04 Tower smbd[1105]: [2006/09/06 01:11:52, 0] lib/util_sock.c:get_peer_addr(1000) Sep 6 01:12:04 Tower smbd[1105]: getpeername failed. Error was Transport endpoint is not connected Sep 6 01:12:04 Tower smbd[1105]: [2006/09/06 01:12:04, 0] lib/util_sock.c:get_peer_addr(1000) Sep 6 01:12:04 Tower smbd[1105]: getpeername failed. Error was Transport endpoint is not connected Sep 6 01:12:04 Tower smbd[1105]: [2006/09/06 01:12:04, 0] lib/util_sock.c:write_socket_data(430) Sep 6 01:12:04 Tower smbd[1105]: write_socket_data: write failure. 
Error = Connection reset by peer
Sep 6 01:12:04 Tower smbd[1105]: [2006/09/06 01:12:04, 0] lib/util_sock.c:write_socket(455)
Sep 6 01:12:04 Tower smbd[1105]: write_socket: Error writing 4 bytes to socket 22: ERRNO = Connection reset by peer
Sep 6 01:12:04 Tower smbd[1105]: [2006/09/06 01:12:04, 0] lib/util_sock.c:send_smb(647)

Last night I decided that I really wanted to get the new 400GB HD into the system, so I decided to replace /disk4, because it was an older 80GB drive. I stopped the array, powered off, put in the 400 and brought the array back online. Things started out great: it was rebuilding the data with no problem, and I also noticed that all the orange/amber LEDs were lit solid while rebuilding. I checked the management utility and things were moving at about 18Kbps and it was going to take about 2 hours to complete. I went down and watched some TV with the wife, went on a walk with the dogs, etc. I came back expecting the rebuild to be about half done....

Nope. The first thing I noticed when I sat down at my desk was that the amber LEDs were no longer lit solid, but flashing! The exceptions were /disk4 and /disk5, which were still solid (these are both on the same IDE bus, which would be Promise card 1, ide1). So I refreshed the management utility and saw that things had slowed way down. I was hitting 2-3Kbps and the estimated time had changed to 24+ hours. I decided to let it run through the night to see what would happen. When I woke up it was still at this terrible pace, so I stopped the data rebuild.

The data on the drive is luckily backed up, and I decided to copy the data over to my PC this morning. So the array is currently running unprotected and copying the data over to my PC hard drive, because I don't want to have to reload 80GB.

So, the question is, where the hell do I go from here? Once I copy all of the data over to my PC.... what do I do next? I have a bottleneck somewhere in the system; could this possibly be another bad Promise card?
Here is the log from yesterday and this morning: Sep 6 19:54:00 Tower smbd[1486]: [2006/09/06 19:54:00, 0] lib/util_sock.c:send_smb(647) Sep 6 19:54:00 Tower smbd[1486]: Error writing 4 bytes to client. -1. (Connection reset by peer) Sep 6 20:58:00 Tower smbd[1511]: [2006/09/06 20:58:00, 0] lib/util_sock.c:get_peer_addr(1000) Sep 6 20:58:00 Tower smbd[1511]: getpeername failed. Error was Transport endpoint is not connected Sep 6 20:58:00 Tower smbd[1511]: [2006/09/06 20:58:00, 0] lib/util_sock.c:write_socket_data(430) Sep 6 20:58:00 Tower smbd[1511]: write_socket_data: write failure. Error = Connection reset by peer Sep 6 20:58:00 Tower smbd[1511]: [2006/09/06 20:58:00, 0] lib/util_sock.c:write_socket(455) Sep 6 20:58:00 Tower smbd[1511]: write_socket: Error writing 4 bytes to socket 5: ERRNO = Connection reset by peer Sep 6 20:58:00 Tower smbd[1511]: [2006/09/06 20:58:00, 0] lib/util_sock.c:send_smb(647) Sep 6 20:58:00 Tower smbd[1511]: Error writing 4 bytes to client. -1. (Connection reset by peer) Sep 6 22:34:02 Tower smbd[1537]: [2006/09/06 22:34:02, 0] lib/util_sock.c:read_socket_data(384) Sep 6 22:34:02 Tower smbd[1537]: read_socket_data: recv failure for 4. Error= Connection reset by peer Sep 7 02:11:31 Tower kernel: hdi: dma_timer_expiry: dma status == 0x20 Sep 7 02:11:31 Tower kernel: hdi: timeout waiting for DMA Sep 7 02:11:31 Tower kernel: PDC202XX: Primary channel reset. Sep 7 02:11:31 Tower kernel: hdi: timeout waiting for DMA Sep 7 02:11:31 Tower kernel: hdi: (__ide_dma_test_irq) called while not waiting Sep 7 02:11:31 Tower kernel: hdi: status error: status=0x58 { DriveReady SeekComplete DataRequest } Sep 7 02:11:31 Tower kernel: Sep 7 02:11:31 Tower kernel: hdi: drive not ready for command Sep 7 02:11:31 Tower kernel: hdi: status timeout: status=0xd0 { Busy } Sep 7 02:11:31 Tower kernel: Sep 7 02:11:31 Tower kernel: PDC202XX: Primary channel reset. 
Sep 7 02:11:31 Tower kernel: hdi: drive not ready for command Sep 7 02:11:32 Tower kernel: ide4: reset: success Sep 7 03:43:19 Tower kernel: get_token: status
September 7, 2006
teamhood,
You do have the 400GB drive's select jumper set to CS (Cable Select)? Just a thought.
Regards, TCIII
September 7, 2006 Author
TCIII: First off, thank you very much for responding. I am 99% sure that I checked that it was set to cable select. I am going to double-check when I get home to make sure that it is. I was thinking about this today and was hoping that someone could jump on and let me know if this is OK: if I were to initialize the array, it would wipe out the data on the drive that is missing, correct? I am a bit wary of pressing that reset array option... I just want to make sure that it won't wipe out all my data. Hopefully Tom or someone could jump on here and let me know what I should do. I really wonder what the heck is going on!!
September 7, 2006
teamhood,
A reset of the array configuration will not erase any data. Initializing the array is a different matter: you DO NOT WANT to do that, as it will wipe all of your data. Did you get your Promise HD controller cards from Tom or elsewhere? If you did not buy them from Tom, did you make sure that they had the latest firmware? If the controllers do not have the latest version installed, you can go to the Promise website and download the latest firmware for them.
Regards, TCIII
September 7, 2006 Author
TCIII: Yes, I bought my cards from Tom. I purchased the starter kit.... I am sure Tom is working on updating everything here for the new release, so I don't expect to hear from him about the issue for a few days.
Thank you, Mike
September 8, 2006
Your problems seem to have a common area: timeouts with the new disk drives. Are you using the newer-style 80-conductor cables to wire the new disks to the controllers? Use of the older 40-wire cables would cause all kinds of weird errors at the transfer rates used by these drives.

Also... can you confirm that the BLUE connectors on your cables are plugged into your controllers and/or motherboard, and that the other end (black) connects to your disk drives? Having these backwards would also cause weird errors.

Also, if you have cables with three connectors, make sure you have either:
A disk drive on BOTH of the drive connectors, or
If there is only one drive on the cable, that drive on the black connector at the END of the cable.
It is not correct to have a drive on the middle (grey) connector and nothing on the end connector. (An empty drive tray still counts as nothing.)

This is all described on this page in more detail than I can put in this message: http://www.pcguide.com/ref/hdd/if/ide/confCable80-c.html

I have a feeling that once Linux detects disk errors, it will slow down the disk transfer mode to one where it thinks everything is responding correctly. This might be PIO mode instead of UDMA mode. That would cause your performance to drop big time.
Joe L.
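To see which drives the kernel is actually complaining about, it can help to tally the IDE error lines per device. The following is only a rough sketch in Python (run on a PC against a copy of the syslog, not on the server itself); the function name `count_ide_errors` and the list of "error hint" substrings are assumptions based on the messages posted in this thread, not anything official.

```python
import re
from collections import Counter

# Matches kernel IDE lines such as "Tower kernel: hdi: timeout waiting for DMA".
KERNEL_IDE = re.compile(r"kernel: (hd[a-z]): (.+)")

# Substrings treated as signs of trouble. This list is an assumption drawn
# from the log excerpts in this thread; extend it as needed.
ERROR_HINTS = ("timeout", "not ready", "Error", "BadCRC", "dma_timer_expiry")


def count_ide_errors(lines):
    """Return a Counter mapping device name (e.g. 'hdi') to error-line count."""
    counts = Counter()
    for line in lines:
        match = KERNEL_IDE.search(line)
        if match and any(hint in match.group(2) for hint in ERROR_HINTS):
            counts[match.group(1)] += 1
    return counts


# A few lines copied from the logs above, used as sample input.
sample = [
    "Sep 7 02:11:31 Tower kernel: hdi: dma_timer_expiry: dma status == 0x20",
    "Sep 7 02:11:31 Tower kernel: hdi: timeout waiting for DMA",
    "Sep 7 02:11:31 Tower kernel: PDC202XX: Primary channel reset.",
    "Sep 7 17:48:17 Tower kernel: hdk: drive_cmd: error=0x04 { DriveStatusError }",
]

for device, n in count_ide_errors(sample).most_common():
    print(device, n)
```

For a real log, something like `count_ide_errors(open("syslog"))` would do the same over the whole file. A device that dominates the tally is the first one to check for cabling or drive problems.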
September 8, 2006 Author
Joe, thank you so much for responding to the thread!!! I checked that all my IDE cables are 80-conductor, and yes, all of them are blue/grey/black and are properly connected with the blue end to the MOBO/Promise cards. I don't have any IDE cables that are not attached to a hard drive. I restarted the array after verifying all of the above, and it seems that I am still getting some strange errors; here is the log. It seems that hdi is giving me issues, but honestly, which drive is i?!?!
Sep 7 17:45:44 Tower kernel:
Sep 7 17:45:44 Tower kernel: hdi: drive not ready for command
Sep 7 17:45:44 Tower kernel: hdi: status timeout: status=0xd0 { Busy }
Sep 7 17:45:44 Tower kernel:
Sep 7 17:45:44 Tower kernel: hdj: set_drive_speed_status: status=0xd0 { Busy }
Sep 7 17:45:44 Tower kernel: blk: queue c033c84c, I/O limit 4095Mb (mask 0xffffffff)
Sep 7 17:45:44 Tower kernel: PDC202XX: Primary channel reset.
Sep 7 17:45:44 Tower kernel: hdi: drive not ready for command
Sep 7 17:45:44 Tower kernel: ide4: reset: success
Sep 7 17:45:44 Tower kernel: blk: queue c033c710, I/O limit 4095Mb (mask 0xffffffff)
Sep 7 17:46:04 Tower kernel: hdi: dma_timer_expiry: dma status == 0x60
Sep 7 17:46:04 Tower kernel: hdi: timeout waiting for DMA
Sep 7 17:46:04 Tower kernel: PDC202XX: Primary channel reset.
Sep 7 17:46:04 Tower kernel: hdi: timeout waiting for DMA
Sep 7 17:46:04 Tower kernel: hdi: (__ide_dma_test_irq) called while not waiting
Sep 7 17:46:04 Tower kernel: hdi: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Sep 7 17:46:04 Tower kernel:
Sep 7 17:46:04 Tower kernel: hdi: drive not ready for command
Sep 7 17:46:04 Tower kernel: hdi: status timeout: status=0xd0 { Busy }
Sep 7 17:46:04 Tower kernel:
Sep 7 17:46:04 Tower kernel: PDC202XX: Primary channel reset.
Sep 7 17:46:04 Tower kernel: hdi: drive not ready for command
Sep 7 17:46:05 Tower kernel: ide4: reset: success
Sep 7 17:46:25 Tower kernel: hdi: dma_timer_expiry: dma status == 0x20
Sep 7 17:46:25 Tower kernel: hdi: timeout waiting for DMA
Sep 7 17:46:25 Tower kernel: PDC202XX: Primary channel reset.
Sep 7 17:46:25 Tower kernel: hdi: timeout waiting for DMA
Sep 7 17:46:25 Tower kernel: hdi: (__ide_dma_test_irq) called while not waiting
Sep 7 17:46:25 Tower kernel: hdi: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Sep 7 17:46:25 Tower kernel:
Sep 7 17:46:25 Tower kernel: hdi: drive not ready for command
Sep 7 17:46:25 Tower kernel: hdi: status timeout: status=0xd0 { Busy }
Sep 7 17:46:25 Tower kernel:
Sep 7 17:46:25 Tower kernel: PDC202XX: Primary channel reset.
Sep 7 17:46:25 Tower kernel: hdi: drive not ready for command
Sep 7 17:46:25 Tower kernel: ide4: reset: success
Sep 7 17:47:22 Tower kernel: reiserfs: checking transaction log (device md(9,6)) ...
Sep 7 17:47:22 Tower kernel: for (md(9,6))
Sep 7 17:47:23 Tower kernel: reiserfs: checking transaction log (device md(9,1)) ...
Sep 7 17:47:23 Tower kernel: for (md(9,1))
Sep 7 17:47:23 Tower kernel: reiserfs: checking transaction log (device md(9,8)) ...
Sep 7 17:47:23 Tower kernel: for (md(9,8))
Sep 7 17:47:37 Tower kernel: reiserfs: checking transaction log (device md(9,5)) ...
Sep 7 17:47:37 Tower kernel: for (md(9,5))
Sep 7 17:47:37 Tower kernel: md(9,1):Using r5 hash to sort names
Sep 7 17:47:37 Tower kernel: md(9,8):Using r5 hash to sort names
Sep 7 17:47:37 Tower kernel: md(9,8):can't shrink filesystem on-line
Sep 7 17:47:38 Tower kernel: md(9,6):Using r5 hash to sort names
Sep 7 17:47:38 Tower kernel: md(9,1):can't shrink filesystem on-line
Sep 7 17:47:39 Tower kernel: md(9,6):can't shrink filesystem on-line
Sep 7 17:47:44 Tower kernel: md(9,5):Using r5 hash to sort names
Sep 7 17:47:44 Tower kernel: md(9,5):can't shrink filesystem on-line
Sep 7 17:47:44 Tower kernel: reiserfs: checking transaction log (device md(9,9)) ...
Sep 7 17:47:44 Tower kernel: for (md(9,9))
Sep 7 17:47:46 Tower kernel: md(9,9):Using r5 hash to sort names
Sep 7 17:47:46 Tower kernel: md(9,9):can't shrink filesystem on-line
Sep 7 17:48:11 Tower kernel: reiserfs: checking transaction log (device md(9,2)) ...
Sep 7 17:48:11 Tower kernel: for (md(9,2))
Sep 7 17:48:12 Tower kernel: reiserfs: checking transaction log (device md(9,7)) ...
Sep 7 17:48:12 Tower kernel: for (md(9,7))
Sep 7 17:48:12 Tower kernel: reiserfs: checking transaction log (device md(9,4)) ...
Sep 7 17:48:12 Tower kernel: for (md(9,4))
Sep 7 17:48:13 Tower kernel: md(9,7):Using r5 hash to sort names
Sep 7 17:48:13 Tower kernel: md(9,7):can't shrink filesystem on-line
Sep 7 17:48:14 Tower kernel: md(9,2):Using r5 hash to sort names
Sep 7 17:48:14 Tower kernel: md(9,2):can't shrink filesystem on-line
Sep 7 17:48:16 Tower kernel: reiserfs: checking transaction log (device md(9,3)) ...
Sep 7 17:48:16 Tower kernel: for (md(9,3)) Sep 7 17:48:16 Tower kernel: md(9,3):Using r5 hash to sort names Sep 7 17:48:16 Tower kernel: md(9,4):Using r5 hash to sort names Sep 7 17:48:16 Tower kernel: md(9,3):can't shrink filesystem on-line Sep 7 17:48:16 Tower kernel: md(9,4):can't shrink filesystem on-line Sep 7 17:48:16 Tower kernel: get_token: status Sep 7 17:48:17 Tower kernel: hdk: drive_cmd: status=0x51 { DriveReady SeekComplete Error } Sep 7 17:48:17 Tower kernel: hdk: drive_cmd: error=0x04 { DriveStatusError } Sep 7 17:48:17 Tower kernel: hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error } Sep 7 17:48:17 Tower kernel: hde: drive_cmd: error=0x04 { DriveStatusError } Sep 7 17:48:17 Tower nmbd[764]: [2006/09/07 17:48:17, 0] nmbd/nmbd.c:terminate(56) Sep 7 17:48:17 Tower nmbd[764]: Got SIGTERM: going down... Sep 7 17:48:39 Tower kernel: get_token: status Sep 7 17:48:39 Tower kernel: hdk: drive_cmd: status=0x51 { DriveReady SeekComplete Error } Sep 7 17:48:39 Tower kernel: hdk: drive_cmd: error=0x04 { DriveStatusError } Sep 7 17:48:39 Tower kernel: hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error } Sep 7 17:48:39 Tower kernel: hde: drive_cmd: error=0x04 { DriveStatusError } Sep 7 17:50:21 Tower kernel: get_token: status Sep 7 17:50:21 Tower kernel: hdk: drive_cmd: status=0x51 { DriveReady SeekComplete Error } Sep 7 17:50:21 Tower kernel: hdk: drive_cmd: error=0x04 { DriveStatusError } Sep 7 17:50:21 Tower kernel: hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error } Sep 7 17:50:21 Tower kernel: hde: drive_cmd: error=0x04 { DriveStatusError } Sep 7 17:50:32 Tower kernel: get_token: status Sep 7 17:50:32 Tower kernel: hdk: drive_cmd: status=0x51 { DriveReady SeekComplete Error } Sep 7 17:50:32 Tower kernel: hdk: drive_cmd: error=0x04 { DriveStatusError } Sep 7 17:50:32 Tower kernel: hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error } Sep 7 17:50:32 Tower kernel: hde: drive_cmd: error=0x04 { DriveStatusError } Sep 7 17:50:43 
Tower kernel: get_token: status Sep 7 17:50:44 Tower kernel: hdk: drive_cmd: status=0x51 { DriveReady SeekComplete Error } Sep 7 17:50:44 Tower kernel: hdk: drive_cmd: error=0x04 { DriveStatusError } Sep 7 17:50:44 Tower kernel: hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error } Sep 7 17:50:44 Tower kernel: hde: drive_cmd: error=0x04 { DriveStatusError } Sep 7 17:50:47 Tower kernel: get_token: status Sep 7 17:50:47 Tower kernel: hdk: drive_cmd: status=0x51 { DriveReady SeekComplete Error } Sep 7 17:50:47 Tower kernel: hdk: drive_cmd: error=0x04 { DriveStatusError } Sep 7 17:50:47 Tower kernel: hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error } Sep 7 17:50:47 Tower kernel: hde: drive_cmd: error=0x04 { DriveStatusError } Sep 7 17:50:53 Tower kernel: get_token: status Sep 7 17:50:53 Tower kernel: hdk: drive_cmd: status=0x51 { DriveReady SeekComplete Error } Sep 7 17:50:53 Tower kernel: hdk: drive_cmd: error=0x04 { DriveStatusError } Sep 7 17:50:53 Tower kernel: hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error } Sep 7 17:50:53 Tower kernel: hde: drive_cmd: error=0x04 { DriveStatusError }
September 8, 2006
According to this excerpt from the "go" script, that disk is on the first Promise controller (closest to the CPU on the recommended Intel motherboard):

# Following assumes a 12-disk system (2 promise ide controllers)
# Disks 0..3 (motherboard Pri/Sec)
hdparm -c1d1a0m8A1W1u1 /dev/hda
hdparm -c1d1a0m8A1W1u1 /dev/hdb
hdparm -c1d1a0m8A1W1u1 /dev/hdc
hdparm -c1d1a0m8A1W1u1 /dev/hdd
# Disks 4..7 (pci slot closest to CPU)
hdparm -c1d1a0m8A1W1u1 /dev/hdi
hdparm -c1d1a0m8A1W1u1 /dev/hdj
hdparm -c1d1a0m8A1W1u1 /dev/hdk
hdparm -c1d1a0m8A1W1u1 /dev/hdl
# Disks 8..11 (pci slot further from CPU)
hdparm -c1d1a0m8A1W1u1 /dev/hde
hdparm -c1d1a0m8A1W1u1 /dev/hdf
hdparm -c1d1a0m8A1W1u1 /dev/hdg
hdparm -c1d1a0m8A1W1u1 /dev/hdh

Is that disk different from any of the others? Or on a longer drive cable? I would first try a different cable on that disk, then try moving it to a different controller, or to a different power supply if you have two in the case. Or you could just try moving it to a different slot (if you have removable trays).

You might also try the following command (after logging in using telnet) and post the output:

hdparm -i /dev/hdi

Part of the output of the hdparm -i command will be the drive model number and serial number. They will help you identify the physical drive.

You might also be able to alter the hdparm command for that drive in the "go" script to use a different (slower) UDMA mode and then use the drive without the errors, but Tom might have to advise you on what options you could try.
Joe L.
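To answer "which drive is i?" from the excerpt alone, the device order in the "go" script can be turned into a small lookup. This is just an illustrative sketch in Python: the names `GO_SCRIPT_ORDER`, `CONTROLLERS` and `locate` are made up here, the table simply copies the order and comments of the excerpt above, and the "shares a cable with" part relies on the usual Linux IDE naming where consecutive device letters (hdi/hdj, hdk/hdl, ...) are master and slave on one channel.

```python
# Device order exactly as listed in the "go" script excerpt; the list index
# is the disk slot number used in the script's comments (0..11).
GO_SCRIPT_ORDER = [
    "hda", "hdb", "hdc", "hdd",   # Disks 0..3
    "hdi", "hdj", "hdk", "hdl",   # Disks 4..7
    "hde", "hdf", "hdg", "hdh",   # Disks 8..11
]

# One label per group of four slots, taken from the script's comments.
CONTROLLERS = [
    "motherboard Pri/Sec",
    "pci slot closest to CPU",
    "pci slot further from CPU",
]


def locate(device):
    """Return (slot, controller, cable_partner) for a device such as 'hdi'."""
    slot = GO_SCRIPT_ORDER.index(device)
    controller = CONTROLLERS[slot // 4]
    # Slots pair up as (0,1), (2,3), ...; flipping the lowest bit gives the
    # other device on the same IDE channel (same cable).
    partner = GO_SCRIPT_ORDER[slot ^ 1]
    return slot, controller, partner


slot, controller, partner = locate("hdi")
print(f"hdi is slot {slot} on the {controller}, sharing a cable with {partner}")
```

By this table hdi is slot 4 on the Promise card closest to the CPU, and it shares its cable with hdj, which is worth keeping in mind when both devices show up in the same burst of errors.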
September 8, 2006 Author
Joe, thank you so much for sharing all your ideas and for your willingness to help. To be honest, I held off on getting unRAID because I saw what you all went through when Tom first went MIA. I still wasn't worried, because I noticed that a few members here and over at AVS seemed willing to help everyone make it through. So I really appreciate all of the assistance... I will not stop :-* your ass

I just changed out the IDE cable on the first Promise card, IDE slot one, which connects /disk4 (the new drive that I put in to upgrade from my old 80GB). In other terms, this is technically disk5 in the array (if you count parity as one...). I also put this drive on a different power supply.... Still not doing well here; the rebuild is even slower than before... 218 KB/s

Here is the result of the command that you gave me to run:

Tower login: root
Linux 2.4.31.
root@Tower:~# hdparm -i /dev/hdi

/dev/hdi:

 Model=ST3400620A, FwRev=3.AAD, SerialNo=3QG053Y7
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
 BuffType=unknown, BuffSize=16384kB, MaxMultSect=16, MultSect=8
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=268435455
 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes: pio0 pio1 pio2 pio3 pio4
 DMA modes: mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: device does not report version:

 * signifies the current active mode

I am going to try a few different cables and connect that hard drive directly to the Promise card (removing it from the mobile rack unit).
Thank you again!
-Mike
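For anyone comparing several drives, the interesting fields in that `hdparm -i` output (model, firmware, serial number, and the transfer mode marked with `*`) can be pulled out with a couple of regular expressions. This is a minimal Python sketch, not a full parser; the function name `parse_hdparm_identity` and the returned key names are invented for the example, and it assumes the output looks like the text posted above.

```python
import re


def parse_hdparm_identity(text):
    """Extract model, firmware, serial and active transfer mode from `hdparm -i` text."""
    info = {}
    ident = re.search(r"Model=([^,]+), FwRev=([^,]+), SerialNo=(\S+)", text)
    if ident:
        info["model"], info["firmware"], info["serial"] = (
            group.strip() for group in ident.groups()
        )
    # The active mode is the one prefixed with '*', e.g. "*udma5".
    # The legend line "* signifies ..." has a space after '*', so it does not match.
    active = re.search(r"\*([a-z]*dma\d+|pio\d+)", text)
    if active:
        info["active_mode"] = active.group(1)
    return info


# Relevant part of the output posted above.
sample = """
/dev/hdi:
 Model=ST3400620A, FwRev=3.AAD, SerialNo=3QG053Y7
 PIO modes: pio0 pio1 pio2 pio3 pio4
 DMA modes: mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
 * signifies the current active mode
"""

print(parse_hdparm_identity(sample))
```

If a drive had been dropped to a slower mode after errors, as suggested earlier in the thread, one would expect the `*` to sit on something other than the highest UDMA mode; here the drive still reports udma5 as active.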
September 8, 2006 Author
Okay, here is the next set of logs, taken after I removed both drives on Promise card 1 (disk4 and disk5) from the mobile docks and connected them directly to the Promise card. The system still sees all of the drives, but holy crap, it is taking a lot longer for the disks to mount and the rebuild is going very, VERY slow.

Started
Stop will take the disks off-line.
Data-Rebuild in progress.
Cancel will stop the Data-Rebuild operation. WARNING: canceling Data-Rebuild will leave the array unprotected!
Total size: 390,711,352 KB
Current position: 53,628 (0.0%)
Estimated speed: 983 KB/sec
Estimated finish: 6609.3 minutes

Sep 7 20:19:59 Tower kernel:
Sep 7 20:19:59 Tower kernel: hdi: drive not ready for command
Sep 7 20:19:59 Tower kernel: hdi: status timeout: status=0xd0 { Busy }
Sep 7 20:19:59 Tower kernel:
Sep 7 20:19:59 Tower kernel: hdj: set_drive_speed_status: status=0xd0 { Busy }
Sep 7 20:19:59 Tower kernel: blk: queue c033c84c, I/O limit 4095Mb (mask 0xffffffff)
Sep 7 20:19:59 Tower kernel: PDC202XX: Primary channel reset.
Sep 7 20:19:59 Tower kernel: hdi: drive not ready for command
Sep 7 20:20:00 Tower kernel: ide4: reset: success
Sep 7 20:20:00 Tower kernel: blk: queue c033c710, I/O limit 4095Mb (mask 0xffffffff)
Sep 7 20:20:22 Tower kernel: hdi: dma_timer_expiry: dma status == 0x60
Sep 7 20:20:22 Tower kernel: hdi: timeout waiting for DMA
Sep 7 20:20:22 Tower kernel: PDC202XX: Primary channel reset.
Sep 7 20:20:22 Tower kernel: hdi: timeout waiting for DMA
Sep 7 20:20:22 Tower kernel: hdi: (__ide_dma_test_irq) called while not waiting
Sep 7 20:20:22 Tower kernel: hdj: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 7 20:20:22 Tower kernel: hdj: dma_intr: error=0x84 { DriveStatusError BadCRC }
Sep 7 20:20:22 Tower kernel: hdj: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 7 20:20:22 Tower kernel: hdj: dma_intr: error=0x84 { DriveStatusError BadCRC }
Sep 7 20:20:22 Tower kernel: hdj: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 7 20:20:22 Tower kernel: hdj: dma_intr: error=0x84 { DriveStatusError BadCRC }
Sep 7 20:20:22 Tower kernel: hdj: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 7 20:20:22 Tower kernel: hdj: dma_intr: error=0x84 { DriveStatusError BadCRC }
Sep 7 20:20:22 Tower kernel: blk: queue c033c84c, I/O limit 4095Mb (mask 0xffffffff)
Sep 7 20:20:22 Tower kernel: PDC202XX: Primary channel reset.
Sep 7 20:20:23 Tower kernel: ide4: reset: success
Sep 7 20:20:25 Tower kernel: blk: queue c033c710, I/O limit 4095Mb (mask 0xffffffff)
Sep 7 20:20:46 Tower kernel: hdi: dma_timer_expiry: dma status == 0x60
Sep 7 20:20:46 Tower kernel: hdi: timeout waiting for DMA
Sep 7 20:20:46 Tower kernel: PDC202XX: Primary channel reset.
Sep 7 20:20:46 Tower kernel: hdi: timeout waiting for DMA
Sep 7 20:20:46 Tower kernel: hdi: (__ide_dma_test_irq) called while not waiting
Sep 7 20:20:46 Tower kernel: hdj: dma_intr: status=0x51 { DriveReady SeekComple
Sep 7 20:21:02 Tower kernel: for (md(9,8))
Sep 7 20:21:02 Tower kernel: md(9,8):Using r5 hash to sort names
Sep 7 20:21:02 Tower kernel: md(9,1):Using r5 hash to sort names
Sep 7 20:21:03 Tower kernel: md(9,8):can't shrink filesystem on-line
Sep 7 20:21:04 Tower kernel: md(9,1):can't shrink filesystem on-line
Sep 7 20:21:25 Tower kernel: reiserfs: checking transaction log (device md(9,3)) ...
Sep 7 20:21:25 Tower kernel: for (md(9,3)) Sep 7 20:21:27 Tower kernel: md(9,3):Using r5 hash to sort names Sep 7 20:21:27 Tower kernel: reiserfs: checking transaction log (device md(9,2) ) ... Sep 7 20:21:27 Tower kernel: for (md(9,2)) Sep 7 20:21:27 Tower kernel: md(9,3):can't shrink filesystem on-line Sep 7 20:21:38 Tower kernel: reiserfs: checking transaction log (device md(9,9) ) ... Sep 7 20:21:38 Tower kernel: for (md(9,9)) Sep 7 20:21:39 Tower kernel: md(9,9):Using r5 hash to sort names Sep 7 20:21:39 Tower kernel: md(9,9):can't shrink filesystem on-line Sep 7 20:21:46 Tower kernel: md(9,2):Using r5 hash to sort names Sep 7 20:21:47 Tower kernel: md(9,2):can't shrink filesystem on-line Sep 7 20:21:49 Tower kernel: reiserfs: checking transaction log (device md(9,7) ) ... Sep 7 20:21:49 Tower kernel: for (md(9,7)) Sep 7 20:21:57 Tower kernel: md(9,7):Using r5 hash to sort names Sep 7 20:21:57 Tower kernel: md(9,7):can't shrink filesystem on-line Sep 7 20:22:03 Tower kernel: reiserfs: checking transaction log (device md(9,6) ) ... Sep 7 20:22:03 Tower kernel: for (md(9,6)) Sep 7 20:22:09 Tower kernel: reiserfs: checking transaction log (device md(9,4) ) ... Sep 7 20:22:09 Tower kernel: for (md(9,4)) Sep 7 20:22:12 Tower kernel: md(9,6):Using r5 hash to sort names Sep 7 20:22:13 Tower kernel: md(9,6):can't shrink filesystem on-line Sep 7 20:22:15 Tower kernel: reiserfs: checking transaction log (device md(9,5) ) ... 
Sep 7 20:22:15 Tower kernel: for (md(9,5)) Sep 7 20:22:15 Tower kernel: md(9,5):Using r5 hash to sort names Sep 7 20:22:15 Tower kernel: md(9,5):can't shrink filesystem on-line Sep 7 20:22:15 Tower kernel: md(9,4):Using r5 hash to sort names Sep 7 20:22:15 Tower kernel: md(9,4):can't shrink filesystem on-line Sep 7 20:22:15 Tower kernel: get_token: status Sep 7 20:22:16 Tower kernel: hdk: drive_cmd: error=0x04 { DriveStatusError } Sep 7 20:22:16 Tower nmbd[764]: [2006/09/07 20:22:16, 0] nmbd/nmbd.c:terminate( 56) Sep 7 20:22:16 Tower nmbd[764]: Got SIGTERM: going down... Sep 7 20:22:21 Tower kernel: spurious 8259A interrupt: IRQ15. Sep 7 20:22:32 Tower kernel: get_token: status Sep 7 20:22:32 Tower kernel: hde: drive_cmd: status=0x51 { DriveReady SeekCompl
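As a sanity check on the rebuild estimate quoted in the post above (total size 390,711,352 KB, position 53,628 KB, speed 983 KB/sec), the remaining time is just the remaining data divided by the speed. A small Python sketch of that arithmetic follows; the function name is made up for the example, and how the management page actually computes its figure is not known here.

```python
def rebuild_minutes_remaining(total_kb, position_kb, speed_kb_per_sec):
    """Remaining rebuild time in minutes at a constant speed."""
    remaining_kb = total_kb - position_kb
    return remaining_kb / speed_kb_per_sec / 60


# Figures copied from the status page in the post above.
minutes = rebuild_minutes_remaining(390_711_352, 53_628, 983)
print(f"{minutes:.0f} minutes, about {minutes / 1440:.1f} days")
```

This comes out a little over 6,600 minutes, in the same ballpark as the 6609.3 minutes shown on the page (the small difference may simply be because the displayed speed is rounded, but that is a guess). Either way, at under 1 MB/sec a 400GB rebuild takes days rather than hours.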
September 9, 2006 Author
Tom, if you get a chance, do you think you could share some ideas about what is happening with my array? I actually let the rebuild run for the past two days and it is just about finished, but that still doesn't solve the underlying issue here!
September 10, 2006
In my Parity Sync problem thread you said:

"Glad that your issue worked out... I am still having my issues and I don't know what the hell is causing the issues.... I wish it was a bad hard drive, but I seriously have no clue what is wrong with my array!! Hopefully Tom will jump on this weekend and give me some ideas!!!"

Are you sure it's not a bad hard drive? Have you tried formatting and using the drive in another system? I wasn't sure from the preceding posts how you had verified that the hard drive was OK. The weird thing is that in WinXP, if I get weird problems with disk access/read/write, the FIRST thing I worry about is a bad hard disk. Ironically, since we can get so much info on the problem under unRAID (well, Linux), we look for all sorts of problems other than the hard disk.
September 10, 2006 Author
Well, I guess I assumed it isn't a bad hard drive because it was brand new out of the box... but get this..... I had disk4 and disk5 connected directly to the Promise card. I let it run the data rebuild (it took about 48 hours at the speed it was running). Then I decided to just stop the array and reboot it. I also put the hard drives back in the mobile rack units. I brought the array back online and decided to run a parity check to see what speed I was getting. I don't understand this at all... but it was flying... 20Kbps... so I have no idea what went wrong when replacing a hard drive with a brand new one.... but all seems fine now. By the end of the parity check I was hitting 40Kbps.... ??? So... I don't know, but the problem is currently solved. I am going to throw a 120GB and an 80GB drive into the array as disk10 and disk11, which will completely fill the array. I will let you all know what the hell happens!
September 10, 2006 Author
Problem is not solved.... As I stated, I was able to run a parity check with no problem. I am currently copying ~50GB of data from /disk4 to /disk7 and I am getting very poor performance. I have been tailing the log and I am still seeing an error that I have posted above. Remember how, with the prior OS, we were seeing some very odd things when upgrading a hard drive to a brand new unformatted one? Well, I wonder if this is something similar??

Tom.. what the hell is going on? The array doesn't seem to be meeting expectations! I am sure I am not alone when I say that the data I am storing on my array is very important to me. Yes, I have it backed up, but I want to be able to have it securely accessible... I don't know, but there is definitely a glitch that I am seeing here. It would be great if you jumped on here and gave us some advice. Currently I might be the only one seeing it.... hopefully that is the case, but I would like some support from you about what the heck these errors are.

Note: as I stated in my previous post, I was going to install 2 more hard drives into the array. I have decided against that, since my original problem still seems to be present.

Here is the log:
root@Tower:~# tail -f /var/log/syslog
Sep 9 20:59:37 Tower smbd[1817]: getpeername failed. Error was Transport endpoint is not connected
Sep 9 20:59:37 Tower smbd[1817]: [2006/09/09 20:59:37, 0] lib/util_sock.c:write_socket_data(430)
Sep 9 20:59:37 Tower smbd[1817]: write_socket_data: write failure. Error = Connection reset by peer
Sep 9 20:59:37 Tower smbd[1817]: [2006/09/09 20:59:37, 0] lib/util_sock.c:write_socket(455)
Sep 9 20:59:37 Tower smbd[1817]: write_socket: Error writing 4 bytes to socket 22: ERRNO = Connection reset by peer
Sep 9 20:59:37 Tower smbd[1817]: [2006/09/09 20:59:37, 0] lib/util_sock.c:send_smb(647)
Sep 9 20:59:37 Tower smbd[1817]: Error writing 4 bytes to client. -1.
(Connection reset by peer) Sep 9 21:31:37 Tower smbd[1818]: [2006/09/09 21:31:37, 0] lib/util_sock.c:read_socket_data(384) Sep 9 21:31:37 Tower smbd[1818]: read_socket_data: recv failure for 4. Error= Connection reset by peer Sep 9 22:14:51 Tower kernel: get_token: status
September 10, 2006, Author
So I decided to try a different test: copy from /disk4 to my PC, then from the PC to /disk7. Still SHIT speed. BTW: I am running Gigabit Ethernet with Cat5e cables. In the past I was able to copy from PC > array and from array > array at very, very fast speeds. Now it is all shit.
September 11, 2006, Author
Tom, glad to see that you rolled out the new OS. I was wondering if you could advise what I should do? Should I upgrade and see if maybe the new OS will solve my issues? If not, what the heck should I do?! I currently have my server down because I have no idea what the heck is going on with my array. I would really, really appreciate some support! I know you have been busy and now you are going to have a lot more questions regarding the new OS, but I honestly don't know what I should do!
September 11, 2006
teamhood,

Let's try and solve this. First, Linux tends to be very "heroic" in trying to recover from disk errors. It will retry several times, drop into PIO mode and retry some more. This accounts for your slowdowns. There is very little risk of losing any of your data. The unRAID s/w is very careful about reformatting, etc. Having said that, there are ways to clobber data: don't put a data drive in the parity drive slot!

One way to troubleshoot bad drives or h/w is this...

IMPORTANT: the procedure below to save/restore array config using the "dd" command is ONLY valid for 2.xxxxxx releases - do NOT do this on release 3.0 or later! In the 3.0 release, the "super.dat" file already exists in the flash "config" directory. Saving/restoring array config in 3.0 or later is simply a matter of saving/restoring the file.

So... First, stop your array, telnet to the server, and capture your current array configuration:

dd if=/dev/sda2 of=/boot/super.dat bs=4096 count=1

This will create a file called "super.dat" in the root of your Flash.

Now power down, remove all your "good" drives and set them aside. Install your new drive(s) and make a "mini array". If you only have one drive, install it as disk1. Reboot the array, reset the array config, and run it to see if any errors occur. If you suspect a particular bad slot, put your new drives in there and repeat.

When you're ready to restore your array, install all your good drives back in the system. Power up, telnet into the system and type:

dd of=/dev/sda2 if=/boot/super.dat bs=4096 count=1

Notice we swapped around the "of" and the "if" in the command. Now reboot and your array config should be restored.

Let me know if this kind of procedure will work for you. The general principle is this: always start with something working, and then change just one thing at a time.
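For readers following along, here is a minimal sketch of the save/restore step above. It only builds the two "dd" command lines and never runs them. The function name `super_dat_command` and the `release_major` parameter are hypothetical, made up for illustration; the device, file path and dd options are the ones given in the post.

```python
def super_dat_command(action, release_major):
    """Build (but do not run) the dd command for saving or restoring array config.

    Hypothetical helper for illustration only. The paths and dd options come
    from the post above; nothing is executed here.
    """
    # The post says the dd procedure is ONLY valid for 2.x releases.
    if release_major >= 3:
        raise ValueError(
            "dd procedure is for 2.x only; on 3.0+ save/restore config/super.dat instead"
        )

    device = "/dev/sda2"         # where the 2.x release keeps the array config
    backup = "/boot/super.dat"   # file in the root of the Flash

    if action == "save":
        source, target = device, backup
    elif action == "restore":
        # Same command with "if" and "of" swapped around.
        source, target = backup, device
    else:
        raise ValueError("action must be 'save' or 'restore'")

    return ["dd", f"if={source}", f"of={target}", "bs=4096", "count=1"]


if __name__ == "__main__":
    print(" ".join(super_dat_command("save", 2)))
    print(" ".join(super_dat_command("restore", 2)))
```

The restore line prints with "if" before "of", which dd treats the same as the order shown in the post. Because the restore direction writes straight to /dev/sda2, it is worth reading the printed command twice before typing it.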
September 12, 2006, Author
Tom, thank you so much for getting back to me. Now, when I remove all the hard drives, do you want me to keep parity in the array? The drive in question here is /disk4. I replaced my old 80GB /disk4 with a 400GB drive and it took almost 2 days to rebuild the data. So I am going to follow your steps, but I just want to make sure that you want me to leave the parity drive in place! THANK YOU! -Mike
September 12, 2006
If you leave parity in place then it will likely get re-written with different parity than what you need for your working array. I'm just trying to be safe so that if you should happen to have a *legitimate* failure of one of your other hard drives, you will be able to rebuild it.

To test your /disk4 hard drive separately, it's only necessary to have it be the only one installed in the system. Just install it in any of your disk slots except parity. You'll find that when you Start a new array without a parity drive, the system in effect "disables" the parity drive and goes ahead and brings the array on-line. You should be able to format the drive (if it's unformatted) and read/write it. It will also appear as a share.

I'm working on a "troubleshooting" page for the website & hopefully will have that up in a day or two.
September 12, 2006, Author
Tom, thank you for the prompt reply. I am going to do as you describe and will report my findings in the thread. Thank you, Mike
September 12, 2006, Author
Alright, here is what happened. I checked out drive 4 and copied over 400 megs in a few seconds, with no issues from the drive. I then decided to load each hard drive, copy a chunk of data to each drive, then copy that data back to my PC. Every drive worked perfectly... until /disk9. It is the slow drive here!!
September 12, 2006, Author
Here is the log while trying to copy something over to /disk9 from my PC:

Sep 11 19:40:56 Tower kernel: hdf: dma_timer_expiry: dma status == 0x60
Sep 11 19:40:56 Tower kernel: hdf: timeout waiting for DMA
Sep 11 19:40:56 Tower kernel: PDC202XX: Primary channel reset.
Sep 11 19:40:56 Tower kernel: hdf: timeout waiting for DMA
Sep 11 19:40:56 Tower kernel: hdf: (__ide_dma_test_irq) called while not waiting
Sep 11 19:40:56 Tower kernel: hdf: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Sep 11 19:40:56 Tower kernel:
Sep 11 19:40:56 Tower kernel: hdf: drive not ready for command
Sep 11 19:40:56 Tower kernel: hdf: status timeout: status=0xd1 { Busy }
Sep 11 19:40:56 Tower kernel:
Sep 11 19:40:56 Tower kernel: PDC202XX: Primary channel reset.
Sep 11 19:40:56 Tower kernel: hdf: drive not ready for command
Sep 11 19:40:56 Tower kernel: ide2: reset: success
Sep 11 19:41:02 Tower kernel: hdf: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 11 19:41:02 Tower kernel: hdf: dma_intr: error=0x84 { DriveStatusError BadCRC }
Sep 11 19:41:24 Tower kernel: hdf: dma_timer_expiry: dma status == 0x40
Sep 11 19:41:24 Tower kernel: hdf: timeout waiting for DMA
Sep 11 19:41:24 Tower kernel: PDC202XX: Primary channel reset.
Sep 11 19:41:24 Tower kernel: hdf: timeout waiting for DMA
Sep 11 19:41:24 Tower kernel: hdf: (__ide_dma_test_irq) called while not waiting
Sep 11 19:41:24 Tower kernel: hdf: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Sep 11 19:41:24 Tower kernel:
Sep 11 19:41:24 Tower kernel: hdf: drive not ready for command
Sep 11 19:41:24 Tower kernel: hdf: status timeout: status=0xd1 { Busy }
Sep 11 19:41:24 Tower kernel:
Sep 11 19:41:24 Tower kernel: PDC202XX: Primary channel reset.
Sep 11 19:41:24 Tower kernel: hdf: drive not ready for command
Sep 11 19:41:24 Tower kernel: ide2: reset: success
Sep 11 19:41:44 Tower kernel: hdf: dma_timer_expiry: dma status == 0x40
Sep 11 19:41:44 Tower kernel: hdf: timeout waiting for DMA
Sep 11 19:41:44 Tower kernel: PDC202XX: Primary channel reset.
Sep 11 19:41:44 Tower kernel: hdf: timeout waiting for DMA
Sep 11 19:41:44 Tower kernel: hdf: (__ide_dma_test_irq) called while not waiting
Sep 11 19:41:44 Tower kernel: hdf: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Sep 11 19:41:44 Tower kernel:
Sep 11 19:41:44 Tower kernel: hdf: drive not ready for command
Sep 11 19:41:44 Tower kernel: hdf: status timeout: status=0xd1 { Busy }
Sep 11 19:41:44 Tower kernel:
Sep 11 19:41:44 Tower kernel: PDC202XX: Primary channel reset.
Sep 11 19:41:44 Tower kernel: hdf: drive not ready for command
Sep 11 19:41:44 Tower kernel: ide2: reset: success
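With this many repeated entries it can help to tally the kernel messages per drive rather than read them line by line. Below is a rough sketch, not an unRAID tool: the function name `tally_ide_errors` and the two categories it counts are my own choices, and the sample lines are copied from the logs in this thread. As far as I know, BadCRC on a DMA transfer usually points at cabling or signal trouble rather than the disk surface, but treat that as a hint and not a diagnosis.

```python
import re
from collections import Counter

# Matches kernel IDE messages such as
# "Sep 11 19:40:56 Tower kernel: hdf: timeout waiting for DMA"
KERNEL_IDE = re.compile(r"kernel: (hd[a-z]): (.+)$")


def tally_ide_errors(lines):
    """Count DMA timeouts and BadCRC errors per drive (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = KERNEL_IDE.search(line.strip())
        if not match:
            continue  # not a per-drive kernel message (e.g. smbd, PDC202XX)
        device, message = match.groups()
        if "timeout waiting for DMA" in message:
            counts[(device, "dma_timeout")] += 1
        elif "BadCRC" in message:
            counts[(device, "bad_crc")] += 1
    return counts


# A few lines taken from the logs posted in this thread.
sample = [
    "Sep 11 19:40:56 Tower kernel: hdf: dma_timer_expiry: dma status == 0x60",
    "Sep 11 19:40:56 Tower kernel: hdf: timeout waiting for DMA",
    "Sep 11 19:40:56 Tower kernel: PDC202XX: Primary channel reset.",
    "Sep 11 19:41:02 Tower kernel: hdf: dma_intr: error=0x84 { DriveStatusError BadCRC }",
    "Sep 11 19:49:31 Tower kernel: hdl: timeout waiting for DMA",
]

if __name__ == "__main__":
    for (device, kind), count in sorted(tally_ide_errors(sample).items()):
        print(device, kind, count)
```

To run it against the real log, pass `open("/var/log/syslog")` instead of `sample`. Note that terminal-wrapped pastes (for example "BadCR C") will not match until the wrapped words are rejoined.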
September 12, 2006, Author
Hrmmm, I might have spoken too soon. I just tried copying something to /disk7, and check it:

Sep 11 19:49:31 Tower kernel: hdl: dma_timer_expiry: dma status == 0x60
Sep 11 19:49:31 Tower kernel: hdl: timeout waiting for DMA
Sep 11 19:49:31 Tower kernel: PDC202XX: Secondary channel reset.
Sep 11 19:49:31 Tower kernel: hdl: timeout waiting for DMA
Sep 11 19:49:31 Tower kernel: hdl: (__ide_dma_test_irq) called while not waiting
Sep 11 19:49:31 Tower kernel: hdl: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Sep 11 19:49:31 Tower kernel:
Sep 11 19:49:31 Tower kernel: hdl: drive not ready for command
Sep 11 19:49:31 Tower kernel: hdl: status timeout: status=0xd1 { Busy }
Sep 11 19:49:31 Tower kernel:
Sep 11 19:49:31 Tower kernel: PDC202XX: Secondary channel reset.
Sep 11 19:49:31 Tower kernel: hdl: drive not ready for command
Sep 11 19:49:31 Tower kernel: ide5: reset: success
Sep 11 19:49:31 Tower kernel: blk: queue c033cca0, I/O limit 4095Mb (mask 0xffffffff)
Sep 11 19:49:55 Tower kernel: hdl: dma_timer_expiry: dma status == 0x40
Sep 11 19:49:55 Tower kernel: hdl: timeout waiting for DMA
Sep 11 19:49:55 Tower kernel: PDC202XX: Secondary channel reset.
Sep 11 19:49:55 Tower kernel: hdl: timeout waiting for DMA
Sep 11 19:49:55 Tower kernel: hdl: (__ide_dma_test_irq) called while not waiting
Sep 11 19:49:55 Tower kernel: hdl: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Sep 11 19:49:55 Tower kernel:
Sep 11 19:49:55 Tower kernel: hdl: drive not ready for command
Sep 11 19:49:55 Tower kernel: hdl: status timeout: status=0xd1 { Busy }
Sep 11 19:49:55 Tower kernel:
Sep 11 19:49:55 Tower kernel: PDC202XX: Secondary channel reset.
Sep 11 19:49:55 Tower kernel: hdl: drive not ready for command
Sep 11 19:49:55 Tower kernel: ide5: reset: success
Sep 11 19:50:15 Tower kernel: hdl: dma_timer_expiry: dma status == 0x40
Sep 11 19:50:15 Tower kernel: hdl: timeout waiting for DMA
Sep 11 19:50:15 Tower kernel: PDC202XX: Secondary channel reset.
Sep 11 19:50:15 Tower kernel: hdl: timeout waiting for DMA
Sep 11 19:50:15 Tower kernel: hdl: (__ide_dma_test_irq) called while not waiting
Sep 11 19:50:15 Tower kernel: hdl: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Sep 11 19:50:15 Tower kernel:
Sep 11 19:50:15 Tower kernel: hdl: drive not ready for command
Sep 11 19:50:15 Tower kernel: hdl: status timeout: status=0xd1 { Busy }
Sep 11 19:50:15 Tower kernel:
Sep 11 19:50:15 Tower kernel: PDC202XX: Secondary channel reset.
Sep 11 19:50:15 Tower kernel: hdl: drive not ready for command
Sep 11 19:50:16 Tower kernel: ide5: reset: success
Sep 11 19:50:36 Tower kernel: hdl: dma_timer_expiry: dma status == 0x40
Sep 11 19:50:36 Tower kernel: hdl: timeout waiting for DMA
Sep 11 19:50:36 Tower kernel: PDC202XX: Secondary channel reset.
Sep 11 19:50:36 Tower kernel: hdl: timeout waiting for DMA
Sep 11 19:50:36 Tower kernel: hdl: (__ide_dma_test_irq) called while not waiting
Sep 11 19:50:36 Tower kernel: hdl: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Sep 11 19:50:36 Tower kernel:
Sep 11 19:50:36 Tower kernel: hdl: drive not ready for command
Sep 11 19:50:36 Tower kernel: hdl: status timeout: status=0xd1 { Busy }
Sep 11 19:50:36 Tower kernel:
Sep 11 19:50:36 Tower kernel: PDC202XX: Secondary channel reset.
Sep 11 19:50:36 Tower kernel: hdl: drive not ready for command
Sep 11 19:50:36 Tower kernel: ide5: reset: success
September 12, 2006
so... /disk6,7,8,9 are all slow boats

Interesting... are those 4 disks all on the same controller card? (Not sure how you did your cabling.) Or, are they all the same brand/model?

I have one old 40Gig drive in my array that is veeeeerrrrrrryyyyyyy slow. As in your array, it will take forever to check parity until it gets past the first 40Gig... then it flies... I have not bothered to do anything with it as it is empty and is only there from my experiments when we tracked down the bug in replacing/upgrading drives.

Joe L.
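Part of Joe's question can be worked out from the drive letters alone. With the legacy Linux IDE driver, names are by default handed out two per channel (hda/hdb on ide0, hdc/hdd on ide1, and so on), which agrees with the resets in the logs above: hdf resets ide2 and hdl resets ide5. The small sketch below just does that arithmetic; the function name `ide_location` is made up for illustration, and which physical card owns which channel is an assumption that has to be checked against the actual cabling.

```python
import string


def ide_location(device):
    """Map a legacy IDE name like 'hdf' to (channel, position).

    Hypothetical helper: assumes the default naming where each channel
    gets two consecutive letters, master first and slave second.
    """
    if (
        len(device) != 3
        or not device.startswith("hd")
        or device[2] not in string.ascii_lowercase
    ):
        raise ValueError(f"not a legacy IDE device name: {device!r}")
    index = ord(device[2]) - ord("a")
    channel = f"ide{index // 2}"
    position = "master" if index % 2 == 0 else "slave"
    return channel, position


if __name__ == "__main__":
    # Drives that show errors in the logs posted in this thread.
    for name in ("hde", "hdf", "hdk", "hdl"):
        print(name, *ide_location(name))
    # hde ide2 master
    # hdf ide2 slave
    # hdk ide5 master
    # hdl ide5 slave
```

So the drives throwing errors sit in pairs on ide2 and ide5, which is worth comparing against which cable and which mobile rack each pair shares.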