March 25, 2008
So, I've had my unRAID server up and running for a year or more without any problems (i.e., I've never done anything in terms of upgrading or replacing hard drives). Well, I took advantage of some recent 500GB IDE drives and planned on replacing some old drives. I did everything that is expected, and now when I reboot, the server is crawling at 2,000 KB/sec, which will take about 68 hours to rebuild!!

I did the typical troubleshooting, which consisted of the following:
1. Took the new hard drive out of the Icy Dock and plugged it directly into the Promise card. Result: Still the same. The rebuild starts at 15,000 KB/sec but quickly dies down to 2,000 KB/sec.
2. Attached a new cable and made sure it was snug. Result: Still the same. The rebuild starts at 15k and quickly dies down to 2k.
3. Changed which power supply the new hard drive was attached to. Result: Same as above, no change in speed.

So, what the hell could be causing this slowdown? I never had any problem in the past when it came to parity rebuilds. I replaced my old 120GB drive with a 500GB one, and there was not much data on the old 120GB drive (about 8GB). The log doesn't show anything out of the ordinary at all. Everything is the same as usual... could the Promise card be the cause of the slowdown?

Thank you for any ideas!!! I hate upgrading when the server has functioned perfectly for over a year...
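As a rough sanity check on the 68-hour figure, the arithmetic can be sketched as below. This assumes the rebuild has to cover the full 500 GB (decimal gigabytes) and that the reported rate is in units of 1,024 bytes per second; both are assumptions, and the function name is only illustrative.

```python
def rebuild_hours(disk_bytes: float, rate_kb_per_s: float, kb: int = 1024) -> float:
    """Hours needed to process disk_bytes at a steady rate given in KB/s."""
    seconds = disk_bytes / (rate_kb_per_s * kb)
    return seconds / 3600


DISK_BYTES = 500e9  # assumed: 500 GB drive, decimal gigabytes

slow = rebuild_hours(DISK_BYTES, 2_000)   # the rate the rebuild settles at
fast = rebuild_hours(DISK_BYTES, 15_000)  # the rate the rebuild starts at

print(f"at  2,000 KB/s: {slow:.1f} hours")
print(f"at 15,000 KB/s: {fast:.1f} hours")
```

Under these assumptions the slow rate works out to just under 68 hours, while the starting rate would finish in roughly 9 hours.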
March 25, 2008 (Author)
Well, something interesting pops up in the system log:

Mar 24 05:58:25 Tower kernel: hdj: dma_timer_expiry: dma status == 0x40
Mar 24 05:58:25 Tower kernel: hdj: timeout waiting for DMA
Mar 24 05:58:25 Tower kernel: PDC202XX: Primary channel reset.
Mar 24 05:58:25 Tower kernel: hdj: timeout waiting for DMA
Mar 24 05:58:25 Tower kernel: hdj: (__ide_dma_test_irq) called while not waiting
Mar 24 05:58:25 Tower kernel: hdj: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Mar 24 05:58:25 Tower kernel:
Mar 24 05:58:25 Tower kernel: hdj: drive not ready for command
Mar 24 05:58:25 Tower kernel: hdj: status timeout: status=0xd0 { Busy }
Mar 24 05:58:25 Tower kernel:
Mar 24 05:58:25 Tower kernel: PDC202XX: Primary channel reset.
Mar 24 05:58:25 Tower kernel: hdj: drive not ready for command
Mar 24 05:58:25 Tower kernel: ide4: reset: success
Mar 24 06:12:32 Tower smbd[1270]: [2008/03/24 06:12:32, 0] lib/util_sock.c:get_peer_addr(1000)
Mar 24 06:12:32 Tower smbd[1270]: getpeername failed. Error was Transport endpoint is not connected
Mar 24 06:12:32 Tower smbd[1270]: [2008/03/24 06:12:32, 0] lib/util_sock.c:get_peer_addr(1000)
Mar 24 06:12:32 Tower smbd[1270]: getpeername failed. Error was Transport endpoint is not connected
Mar 24 06:12:32 Tower smbd[1270]: [2008/03/24 06:12:32, 0] lib/util_sock.c:write_socket_data(430)
Mar 24 06:12:32 Tower smbd[1270]: write_socket_data: write failure. Error = Connection reset by peer
Mar 24 06:12:32 Tower smbd[1270]: [2008/03/24 06:12:32, 0] lib/util_sock.c:write_socket(455)
Mar 24 06:12:32 Tower smbd[1270]: write_socket: Error writing 4 bytes to socket 21: ERRNO = Connection reset by peer
Mar 24 06:12:32 Tower smbd[1270]: [2008/03/24 06:12:32, 0] lib/util_sock.c:send_smb(647)
Mar 24 06:12:32 Tower smbd[1270]: Error writing 4 bytes to client. -1. (Connection reset by peer)

And yes, hard drive j is the new hard drive that was installed.
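To see at a glance which device the kernel is complaining about, a syslog excerpt like this one can be tallied per device. The following is a minimal sketch, assuming kernel lines follow the "kernel: hdX: message" shape seen in this post; the function name and regular expression are illustrative, and it counts every kernel message tagged with a device name, not only errors.

```python
import re
from collections import Counter

# Matches kernel lines such as "... Tower kernel: hdj: timeout waiting for DMA"
DEVICE_LINE = re.compile(r"kernel: ((?:hd|sd)[a-z]): ")


def count_device_messages(lines):
    """Count kernel messages per device name (hdX or sdX)."""
    counts = Counter()
    for line in lines:
        match = DEVICE_LINE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# A few lines copied from the syslog excerpt in this post.
sample = [
    "Mar 24 05:58:25 Tower kernel: hdj: dma_timer_expiry: dma status == 0x40",
    "Mar 24 05:58:25 Tower kernel: hdj: timeout waiting for DMA",
    "Mar 24 05:58:25 Tower kernel: PDC202XX: Primary channel reset.",
    "Mar 24 05:58:25 Tower kernel: hdj: drive not ready for command",
    "Mar 24 05:58:25 Tower kernel: ide4: reset: success",
    "Mar 24 06:12:32 Tower smbd[1270]: getpeername failed. Error was Transport endpoint is not connected",
]

counts = count_device_messages(sample)
for device, n in counts.most_common():
    print(device, n)
```

Controller-level lines (PDC202XX, ide4) and Samba lines are deliberately skipped here, so only drive-tagged messages are counted.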
March 25, 2008 (Author)
Well, there is definitely something slowing down the new drive. Check these two speed tests, the new drive being the first:

/dev/hdj:
 Timing cached reads: 2524 MB in 2.00 seconds = 1262.00 MB/sec
 Timing buffered disk reads: 8 MB in 3.58 seconds = 2.23 MB/sec

root@Tower:~# hdparm -tT /dev/hda
/dev/hda:
 Timing cached reads: 2356 MB in 2.00 seconds = 1178.00 MB/sec
 Timing buffered disk reads: 204 MB in 3.01 seconds = 67.77 MB/sec
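The two results are easier to compare when the buffered read figures are pulled out side by side. This is a small sketch that parses the hdparm output pasted above; the parsing function and its regular expressions are illustrative and assume the output keeps this exact layout.

```python
import re

# Output copied from the two hdparm -tT runs in this post.
HDPARM_OUTPUT = """\
/dev/hdj:
 Timing cached reads: 2524 MB in 2.00 seconds = 1262.00 MB/sec
 Timing buffered disk reads: 8 MB in 3.58 seconds = 2.23 MB/sec
/dev/hda:
 Timing cached reads: 2356 MB in 2.00 seconds = 1178.00 MB/sec
 Timing buffered disk reads: 204 MB in 3.01 seconds = 67.77 MB/sec
"""


def buffered_read_speeds(text):
    """Map each device to its buffered disk read speed in MB/sec."""
    speeds = {}
    device = None
    for raw in text.splitlines():
        line = raw.strip()
        dev = re.fullmatch(r"(/dev/\w+):", line)
        if dev:
            device = dev.group(1)
            continue
        hit = re.search(r"buffered disk reads:.*=\s*([\d.]+) MB/sec", line)
        if hit and device:
            speeds[device] = float(hit.group(1))
    return speeds


speeds = buffered_read_speeds(HDPARM_OUTPUT)
ratio = speeds["/dev/hda"] / speeds["/dev/hdj"]
print(speeds)
print(f"hda is about {ratio:.0f}x faster than hdj on buffered reads")
```

The cached read figures are nearly identical for both drives, while the buffered disk reads differ by a factor of about 30, which points at the path to the disk rather than at the host itself.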
March 25, 2008
Since the problem seems to me to be either the hard drive, the cable, or the Promise card, I would pull the drive and install it in another computer and test it. If it proves to be fine, then try another cable, and if no improvement, then you are left with a defective card. Perhaps it is just a bad IDE connector. Whatever the problem, it is probably dropping you into a PIO mode, although it is not evident above. In the syslog, search for 'limiting speed to' or 'configured for PIO' messages.
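The search suggested above can be sketched as a small filter over the syslog lines. The two search phrases come from the post; the function name is illustrative, and the second sample line is a made-up placeholder, not real kernel output.

```python
# Phrases suggested in the post as signs of a fallback to a slower mode.
PATTERNS = ("limiting speed to", "configured for PIO")


def find_pio_hints(lines):
    """Return the lines that contain any of the suggested phrases."""
    return [line.rstrip("\n") for line in lines if any(p in line for p in PATTERNS)]


sample = [
    "Mar 24 05:58:25 Tower kernel: hdj: timeout waiting for DMA",
    "PLACEHOLDER (not real kernel output): hdX configured for PIO",
]

hits = find_pio_hints(sample)
for line in hits:
    print(line)
```

To run it against a real log, pass an open file handle instead of the sample list, for example `find_pio_hints(open(path))`, where `path` is wherever the syslog was saved.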
March 25, 2008 (Author)
Rob,
First, I must say thank you for looking at my post! I attached my most recent log from startup just a few minutes ago. I've been searching through the forums and the wiki in hopes of finding different commands to test the drives. I will pull the drive and hook it up to a different PC.
March 25, 2008 (Author)
root@Tower:~# hdparm -i /dev/hdj

/dev/hdj:

 Model=MAXTOR STM3500630A, FwRev=3.AAE, SerialNo=9QG1DELK
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
 BuffType=unknown, BuffSize=16384kB, MaxMultSect=16, MultSect=off
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=268435455
 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes: pio0 pio1 pio2 pio3 pio4
 DMA modes: mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: device does not report version:

 * signifies the current active mode
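Since the asterisk marks the active transfer mode, it can be picked out of the mode lists programmatically. A minimal sketch, using the mode lines copied from the output above; the function name is illustrative.

```python
import re

# Mode lists copied from the hdparm -i output in this post.
HDPARM_I_MODES = (
    "PIO modes: pio0 pio1 pio2 pio3 pio4 "
    "DMA modes: mdma0 mdma1 mdma2 "
    "UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5"
)


def active_mode(text):
    """Return the mode name prefixed with '*', or None if none is marked."""
    match = re.search(r"\*(\w+)", text)
    return match.group(1) if match else None


print(active_mode(HDPARM_I_MODES))
```

Here the drive reports udma5 as active even though the measured buffered read speed is only about 2 MB/sec, so the reported mode and the measured throughput disagree and the mode line alone does not rule out a problem on the channel.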
March 25, 2008 (Author)
So we can rule out a few more things.
1. PCI slot: I plugged the card into an open slot and the drive exhibited the same speed.
2. Promise card: I had an extra card sitting around, and got the same speed in both PCI slots.
3. Cable: I used two different cables.
4. Hard drive: I put it into my PC, formatted it, and ran a drive test on it. No problem.

So, I'm starting to wonder if the drive that is on the same cable could be a problem? If you look at the log I posted a few posts up, you see that hdi had some sort of errors associated with it... is it possible that the second drive on the same cable could cause all of this?

Thank you to anyone who has input!
March 25, 2008
If hdi is the drive on the same channel/cable, then yes, it could certainly be responsible for the problems. I would like to have been able to see the entire syslog, as I went by your comment that 'The log doesn't show anything out of the ordinary at all', and only errors on hdj were visible to me.

In the second excerpt from your syslog, the primary channel is being reset over and over again. That is time-consuming, would affect both drives on the cable, and would slow both way down.

The duplications in the user shares are not important, just a nuisance to be cleaned up sometime. The SMART I/O errors from 3 different drives are a minor problem. The Samba communications errors at 6:12 make me want to recommend upgrading to a much more recent unRAID version. There are very many bug fixes in the newer Linux kernels, and in the newer Samba version. But most significant are the errors from hdj and hdi, especially the BadCRC errors from hdi.

If you upgrade to a 4.2 version, you can run the smartctl package, and test the SMART system on hdi. I suspect it may show some serious problems with that drive. If you remove the new drive, it would be interesting to see what the syslog shows, especially as to the hdi drive.
March 25, 2008
Did you contact Tom?
If the drive is IDE, another disk on the same cable certainly does not help. Remember that in IDE mode, the bus can be only "read" or only "write" at any one time.
March 25, 2008 (Author)
"If you upgrade to a 4.2 version, you can run the smartctl package, and test the SMART system on hdi. I suspect it may show some serious problems with that drive. If you remove the new drive, it would be interesting to see what the syslog shows, especially as to the hdi drive."

Rob,
I guess I can give the upgrade to 4.2 a go. Do I have to do anything out of the normal scope of upgrading because I have a disk that still needs the data rebuilt? I was thinking about upgrading last night, but I was worried because of the new drive not being fully rebuilt!

Thank you,
Mike
March 25, 2008 (Author)
"did you contact Tom? if the drive is IDE, certainly another disk on the same cable does not help remember in IDE mode, the bus can be only 'read' or only 'write' at the same time"

1. I have not contacted Tom at this point. I hope that some of our users can give me help/ideas before contacting Tom. I want to let Tom keep coding for 4.3 and not take his attention away from that.
2. Yes, both drives hdi and hdj are IDE drives.

Thank you
-Mike
March 25, 2008
"I guess I can give the upgrade to 4.2 a go. Do I have to do anything out of the normal scope of upgrading because I have a disk that still needs the data rebuilt? I was thinking about upgrading last night, but I was worried because of the new drive not being fully rebuilt!"

That's a very good question, and I haven't used v3, so I would prefer someone like Joe or Tom to answer better. I *think* it would be safe to put the old drive back in, start up unRAID with your current flash, check the web management page and the syslog for a good startup, then shut down and upgrade the flash (or replace it with a new upgraded version). Make sure you have a backup of the contents of your flash drive. Then if the upgrade was successful, and the array looks fine, try replacing the drive again. At some point, you might consider replacing the hdi drive first.

If you have room elsewhere, I would try to save the data off hdi and hdj to somewhere else, so that no matter what happens, there will be no data loss. You would no longer have to worry about a failure to rebuild, and with all of the errors on multiple drives, that is a real concern.
March 25, 2008 (Author)
I sent JoeL a PM and asked if he could peep this thread and give some input. I thought about putting my other drive back into the array, but the new drive did show up and I wasn't sure how all that worked out. I will copy all the data off those drives to my PC just to keep the data safe.

Thank you again!
March 25, 2008
teamhood - CRC errors are almost always an issue with cables, or some other problem affecting the signaling on the cable. Are all your IDE drives jumpered for "Cable Select"? If so, you can try setting the drive on the end of the cable to Master and the one below it to Slave - I've seen IDE drives that don't honor the cable select jumper (but remember that you did this if you move the drives around later!).
March 26, 2008 (Author)
Here is where I stand:
1. I tried to make hdi master and hdj slave - still the same performance problems.
2. I replaced my new hard drive with the previous one (hdj) and I still have the performance issues.
3. I am now copying all the data off of both drives to my PC just in case something might happen with either of these two drives. unRAID is trying to rebuild parity, but I stopped it for now while copying over about 200GB of data (at about 2.3MB/sec.... ouch).

Once that is finished I need to try and figure out what the hell is going on with these two drives. I'm not sure if both of them are wanting to kick the bucket or not, but something is messing them up. Please note that both of these drives are cabled directly into a PCI Promise card and are not connected through a mobile rack unit (I removed them from the trays just to make sure those were not causing any problems).

So, that is where I stand... still not sure what the hell is plaguing my configuration, but hopefully someone will come up with something. I'm not going to do any upgrades or anything until this is solved....
March 26, 2008
According to the "mdcmd status" command output, the disk assigned to the first data slot in your array is INVALID. It is /dev/sdc1. Is it one of your new disks? From what you've just written, I'm not sure if the configuration is still the same.

Tom made a great point. Are you sure you don't accidentally have two drives both set to Master, or two both set to Slave?
March 26, 2008 (Author)
Joe,
The first disk that is not valid is the parity drive, because I took out the 'new' hard drive and replaced it with the previous one. Once I did that, I had to restore the array and now I have to build parity. Parity was moving very slowly because of whatever is going on with the two drives hdi and hdk. I will pull a system log once I copy off all the data from the two drives in question and post it.

I am certain that the problem lies with one and/or both of these hard drives. (I guess I could be wrong, but you have to understand that this array has been up and running for almost two years without ANY problems. So, this leads me to believe that me replacing hdj with a larger drive has somehow messed everything up.)

Technology can be your best friend and your worst enemy all in the matter of minutes.
March 26, 2008
Once you get back to a stable system...

To upgrade a disk you can do it two ways.

1st way: Stop Array, power down, replace old drive with new larger drive, power up, start array. It uses parity and the other data drives to rebuild the old contents onto the new drive.

2nd way (and apparently the way you used): Copy data off of the disk(s) to be replaced, Stop Array, power down, replace old drive(s) with new, power up, assign new drive(s) to old slot(s) in array, press Restore to record the new configuration, clear and format new drives, and calculate parity.

Since you replaced two drives at once, you must have used the second method???

I've always just done one disk at a time with the first method. Might I suggest method #1 for you too? At least it minimizes what to undo if you have further problems. Yes, it might take longer, since it has to deal with two disks, one at a time, but it might be safer.

Joe L.
March 26, 2008 (Author)
Joe,
I did not replace two disks at once. I originally replaced disk 10 (hdj) with a new 500GB hard drive. When I went to format/clear/rebuild data on that drive is when I noticed the speed problems that my server is displaying. So, after checking to make sure it wasn't the IDE cables, the mobile rack units, hard drive jumper settings, etc., I decided to remove my 'new' drive and put the original drive back in to see if maybe the new drive was causing these problems.

So, the command that I posted is showing that parity isn't valid because I put my original drive back in and had to restore to allow the array to boot.

I did not want to upgrade this way at all. I wanted to do it the easy way, but there is something messing up my array and it has something to do with hdi and hdk. As stated, I am currently copying all of the data off those two drives to protect that data in case both of those drives die. I still have 15 hours remaining due to only seeing 2.3MB/sec transfer speed from the array. Once that is complete I will start doing some further testing and post a system log so you guys can give me suggestions.
March 28, 2008 (Author)
Here is my system log from a few minutes ago. Looks like hdi and hdj, which are on the same IDE cable, are not in DMA mode... ???
March 28, 2008
Mike, we still need to see a complete syslog. The pieces you have attached are only a small part of it. Instructions for copying it are near the bottom of this page: http://lime-technology.com/wiki/index.php?title=Viewing_the_System_Log

When a device produces a lot of disk errors, it is normal for DMA use to be turned off for that device.
March 30, 2008 (Author)
FULL system log has been attached.

I wasn't around all weekend, so I decided to let the array run and rebuild parity (which took a few days). My array is now back in a good state, but one or both of those drives are causing me problems!! So, for your viewing pleasure, please see the attached system log.

Rob and Joe, I want to say thank you for your interest and input. I am very happy that we have built a good community and I wish I could add value, but I am not a guru like you guys.

-Mike
March 30, 2008
That definitely took quite a while. It sped up some when it finished hdj at the 120GB mark, then really took off after finishing hdi at the 300GB point. By my calculations, it took 29.5 hours to do the first 120GB, 10 more hours to do from 120GB to 200GB, 12 more hours to do from 200GB to 300GB, and 55 minutes to finish from 300GB to 400GB!

You aren't using the fourth IDE channel, IDE3, which is the Secondary channel on the first Promise card. If it is working fine, why not add an IDE cable to it, and hook up ONE of the misbehaving drives to it, either hdi or hdj? That way, each will have a channel to itself, and neither can affect the other. You may or may not have to do one more parity check or sync, but then you should be in a better position to replace one at a time.

Thanks again for the syslog, it really helped. I wish there was a strongly emphasized message somewhere, at the first place any user would go to when having problems with their unRAID setup, to capture a complete syslog and attach it along with a summary of their hardware, and a complete description of their problems and any error messages. It certainly would help in troubleshooting.
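The timings above translate into average throughput per segment as follows. This is a back-of-the-envelope sketch that assumes decimal units (1 GB = 1000 MB); the segment labels and function name are illustrative.

```python
# (label, gigabytes covered, hours taken), taken from the timings in this post.
SEGMENTS = [
    ("0-120 GB", 120, 29.5),
    ("120-200 GB", 80, 10.0),
    ("200-300 GB", 100, 12.0),
    ("300-400 GB", 100, 55 / 60),
]


def mb_per_sec(gigabytes, hours):
    """Average throughput in MB/s, assuming 1 GB = 1000 MB."""
    return gigabytes * 1000 / (hours * 3600)


rates = {label: mb_per_sec(gb, hrs) for label, gb, hrs in SEGMENTS}
for label, rate in rates.items():
    print(f"{label:>10}: {rate:5.2f} MB/s")
```

Under these assumptions the first three segments average roughly 1 to 2.3 MB/s, and the last segment, after both hdj and hdi have dropped out, runs at about 30 MB/s.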
March 30, 2008
Looking at your syslog, there seem to be multiple things wrong:
o Multiple drives are reporting SMART errors. You might try enabling SMART on these drives using this: smartctl -s on /dev/hdk
o Also saw "Index Error" on /dev/sdk - I've never seen this before

One thing you might try is swapping the slots of the failing drives with good drives. For example, if 'disk8' and 'disk9' are the ones causing problems, swap them with, say, 'disk2' and 'disk3'. When you boot the server it will notice the drives have been swapped, but you can then Start the array and it will just record the new disk positions - parity will still be valid. This will tell you a great deal: if the problem "follows the disks", you have a bad disk (or disks); if it stays on the controller, then you might have a bad controller, cable, etc.

Lately we actually had a test server behave in a similar fashion - it turned out that 2 drives were in fact bad. What happened was that one drive was giving spurious problems all along (which is good from a testing standpoint), then another drive happened to fail. This was very baffling to figure out. Rule of thumb: if a problem/bug seems very baffling and does not respond as expected to test conditions, then you are probably dealing with 2 (or more) problems/bugs simultaneously.

Also don't rule out that the addition of a new drive might have sent your power supply over a cliff.