September 2, 2007
I am having a lot of problems getting my unRAID box stable. It has nine 500GB WD SATA II drives connected to three Promise SATA150 TX4 cards. The symptoms are that I sporadically receive countless syslog errors (posted below) and then the box becomes completely unresponsive; I can't even ping it.
I have everything unnecessary turned off in the BIOS, and I have tried swapping PCI slots, rolling back unRAID versions, swapping NICs, turning off spin down, and as many other things as I can think of. I am prepared to post as much information as anyone would like and basically try anything necessary to get to the root of this problem, as it is exceptionally annoying and it's just a matter of time before I lose data. Nothing has been changed in unRAID except turning off ACPI via the syslinux boot code.
Syslog samples:
Sep 2 17:37:43 NASBOX emhttp[1294]: shcmd (12): /usr/sbin/hdparm -y /dev/sde >/dev/null
Sep 2 17:37:44 NASBOX emhttp[1294]: shcmd (13): /usr/sbin/hdparm -y /dev/sdf >/dev/null
Sep 2 17:37:44 NASBOX emhttp[1294]: shcmd (14): /usr/sbin/hdparm -y /dev/sdg >/dev/null
Sep 2 17:37:54 NASBOX kernel: [18173.780487] ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Sep 2 17:37:54 NASBOX kernel: [18173.780500] ata9.00: cmd e0/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
Sep 2 17:37:54 NASBOX kernel: [18173.780502] res 40/00:00:00:40:00/0c:00:3a:00:00/00 Emask 0x4 (timeout)
Sep 2 17:37:55 NASBOX kernel: [18174.110115] ata9: soft resetting port
Sep 2 17:37:55 NASBOX kernel: [18174.110123] ata9: SATA link down (SStatus 614 SControl 300)
Sep 2 17:37:55 NASBOX kernel: [18174.110131] ata9: failed to recover some devices, retrying in 5 secs
Sep 2 17:47:12 NASBOX kernel: [ 234.527461] ata9.00: exception Emask 0x10 SAct 0x0 SErr 0x100 action 0x2
Sep 2 17:47:12 NASBOX kernel: [ 234.527469] ata9.00: (port_status 0x20200000)
Sep 2 17:47:12 NASBOX kernel: [ 234.527478] ata9.00: cmd 25/00:f8:ef:c9:0b/00:01:00:00:00/e0 tag 0 cdb 0x0 data 258048 in
Sep 2 17:47:12 NASBOX kernel: [ 234.527480] res 51/0c:97:50:cb:0b/0c:00:00:00:00/e0 Emask 0x10 (ATA bus error)
Sep 2 17:47:12 NASBOX kernel: [ 234.850346] ata9: soft resetting port
Sep 2 17:47:12 NASBOX kernel: [ 235.010193] ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Sep 2 17:47:12 NASBOX kernel: [ 235.060666] ata9.00: configured for UDMA/33
Sep 2 17:47:12 NASBOX kernel: [ 235.060680] ata9: EH complete
Sep 2 17:47:12 NASBOX kernel: [ 235.129956] sd 9:0:0:0: [sdg] 976773168 512-byte hardware sectors (500108 MB)
Sep 2 17:47:12 NASBOX kernel: [ 235.178340] sd 9:0:0:0: [sdg] Write Protect is off
Sep 2 17:47:12 NASBOX kernel: [ 235.178347] sd 9:0:0:0: [sdg] Mode Sense: 00 3a 00 00
Sep 2 17:47:12 NASBOX kernel: [ 235.255839] sd 9:0:0:0: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO
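One way to read the SStatus values in the kernel lines above is to split them into the three fields the SATA specification defines for that register (DET, SPD, IPM). The following is a minimal Python sketch, not something from unRAID or the kernel: `decode_sstatus` is a hypothetical helper name, and it assumes the kernel prints the register in hexadecimal, which is consistent with 113 appearing next to a 1.5 Gbps link.

```python
# Field layout of the SATA SStatus register:
# bits 0-3 = DET (device detection), bits 4-7 = SPD (speed), bits 8-11 = IPM (power state).
DET = {
    0x0: "no device detected",
    0x1: "device detected, no PHY communication",
    0x3: "device detected, PHY communication established",
    0x4: "PHY offline",
}
SPD = {
    0x0: "no negotiated speed",
    0x1: "Gen1 (1.5 Gbps)",
    0x2: "Gen2 (3.0 Gbps)",
}
IPM = {
    0x0: "not present",
    0x1: "active",
    0x2: "partial",
    0x6: "slumber",
}


def decode_sstatus(hex_text):
    """Split an SStatus value (as printed in syslog, assumed hex) into its fields."""
    value = int(hex_text, 16)
    det = value & 0xF
    spd = (value >> 4) & 0xF
    ipm = (value >> 8) & 0xF
    return {
        "det": DET.get(det, f"reserved (0x{det:x})"),
        "spd": SPD.get(spd, f"reserved (0x{spd:x})"),
        "ipm": IPM.get(ipm, f"reserved (0x{ipm:x})"),
    }


if __name__ == "__main__":
    for raw in ("614", "113"):
        print(raw, decode_sstatus(raw))
```

Under those assumptions, SStatus 113 decodes to an established Gen1 link in the active state, while SStatus 614 decodes to an offline PHY in slumber, which fits the "SATA link down" message logged shortly after the hdparm -y spin-down.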
September 2, 2007
Department of longshots: I seem to recall something about problems with more than two Promise controllers on one motherboard. Not an unRAID issue, but a Promise issue.
/Rene
September 2, 2007
The posts I saw on this topic referred to it being solved by swapping PCI slots, but you did that. A couple of things:
* You mention not having changed anything but ACPI. Does this mean it worked before?
* Did you try turning ACPI back to its original setting?
* Have you tried getting a smaller number of drives working, then adding drives until it is no longer stable?
* Please post your complete config and be specific (mobo, age, CPU, drives, PSU, memory, other devices, brands/models, how hooked up, ...).
* Are there drives hooked up directly to the mobo?
* Are your drives set so they work with SATA150?
* Did you reseat the cards and cables (power and signal)? SATA cables can come loose fairly easily.
* Is your PSU up to the task? For nine drives I would expect at least a good 500W supply.
Bill
September 3, 2007 (Author)
Thanks for the replies. Hopefully I answer all the questions here:
> Department of longshots: I seem to recall something about problems with more than two Promise controllers on one motherboard. Not an unRAID issue, but a Promise issue. /Rene
If anyone can point me at any of these reports, it would be appreciated. I will keep looking as well.
> The posts I saw on this topic referred to it being solved by swapping PCI slots, but you did that.
I have even tried swapping cards out with new ones, as I have spares. Have I tried every mathematical combination of cards and slots? Probably not, but I'm not sure that's completely practical; that's an awful lot of reboots.
> * You mention not having changed anything but ACPI. Does this mean it worked before?
No, sorry, that was perhaps a red herring. All the ACPI fix did was remove a recurring syslog entry complaining about a second CPU (which didn't exist). I applied the fix as per a post here and the error went away.
> * Did you try turning ACPI back to its original setting?
Yes, and the problem persists. As the problem is sporadic, it sometimes takes a while to come back, so at first I thought the ACPI fix had done it, but as I mentioned previously, it was a red herring.
> * Have you tried getting a smaller number of drives working? Then add them until it is no longer stable.
The system has grown, so in essence I have done this. It wasn't until I added this latest drive (the first on the 3rd card) that the problem became quite so prevalent, although it's always been there.
> * Are there drives hooked up directly to the mobo?
No, they are all on the TX4 cards. In fact, the mobo BIOS has all the SATA ports turned off, as well as the secondary IDE. The primary IDE has a DVDR on it.
> * Are your drives set so they work with SATA150?
That's a new one on me. What needs to be done to set a drive to SATA150?
> * Did you reseat the cards and cables (power and signal)? SATA cables can come loose fairly easily.
Obviously there are a lot of cables etc., but they are relatively neat, and I believe they are all OK, having checked them several times.
> * Is your PSU up to the task? For nine drives I would expect at least a good 500W supply.
It's a Hyper Type R 580W power supply, so I assume yes.
> * Please post your complete config - be specific (mobo, age, cpu, drives, psu, memory, other devices, brands/models, how hooked up, ...)
Hyper Type R 580W power supply
Asrock K7S8XE+
AMD Athlon XP 2500+
2 Gail 512MB PC3200 memory sticks
3 Promise SATA150 TX4 controllers in PCI slots 1, 2 and 3
DVDR (would need to identify the brand, but brand new)
Parity WDC WD5000AAKS-0
disk1 WDC WD5000YS
disk2 WDC WD5000AAKS
disk3 WDC WD5000YS
disk4 WDC WD5000YS
disk5 WDC WD5000AAKS
disk6 WDC WD5000YS
disk7 WDC WD5000AAKS
disk8 WDC WD5000YS
September 3, 2007
> * Is your PSU up to the task? For nine drives I would expect at least a good 500W supply
> Its a Hyper Type R 580W power supply so I assume yes.
According to this, your supply has two 12 volt rails. Oftentimes the motherboard is connected to one of the rails. Is it possible you have most or all of your drives on the other rail, so that only a fraction of the supply's advertised capacity is available for the hard disks? One of your 12 volt rails has a capacity of 240 watts, the other 216 watts.
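The rail figures above can be turned into current budgets with simple arithmetic. The following Python sketch is only an illustration: the 240 W and 216 W figures come from the post, while the 2.0 A per-drive spin-up draw on the 12 V line is an assumed placeholder, not a measured or datasheet value for these WD drives. The function names are hypothetical.

```python
RAIL_VOLTS = 12.0


def rail_amps(rail_watts, volts=RAIL_VOLTS):
    """Current a rail can deliver, given its rated wattage."""
    return rail_watts / volts


def spinup_load_amps(drive_count, amps_per_drive):
    """Worst-case 12 V load if every drive spins up at the same moment."""
    return drive_count * amps_per_drive


if __name__ == "__main__":
    # ASSUMPTION: 2.0 A per drive during spin-up; read the real figure
    # from the drive label or datasheet.
    ASSUMED_SPINUP_AMPS = 2.0
    load = spinup_load_amps(9, ASSUMED_SPINUP_AMPS)
    for watts in (240, 216):
        budget = rail_amps(watts)
        print(f"{watts} W rail: {budget:.1f} A available, "
              f"{load:.1f} A needed if all nine drives spin up together")
```

With that assumed figure, nine drives spinning up together would ask for 18 A, which is the entire budget of the 216 W rail before anything else on that rail is counted. Substituting the 12 V current printed on the drive labels gives a real number.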
September 3, 2007
Bill's idea about forced SATA150 operation made me think, so I searched other syslogs and found another user with similar 'exception Emask' errors and one similarity in setup. Like you, they have SATA300 drives connected to SATA150 ports. So I would suggest checking that each of your drives has the SATA150 jumper installed. SATA300 is supposed to be backward-compatible, but most SATA300 drives come with the jumper, just because some first generation SATA ports were not compatible. If this works, give Bill the credit for the idea, and I'll suggest it to the other user too.
September 3, 2007 (Author)
I will give both ideas a go. Obviously stripping out all the drives and checking them will take some time, so if I don't get back to the thread for a couple of days, that will be why. I appreciate the continued help.
September 5, 2007 (Author)
A small update. I've found a pattern that may or may not be a red herring. Several of the fatal lockups seem to happen after /usr/sbin/hdparm -y /dev/sdg >/dev/null, like:
Sep 2 17:37:44 NASBOX emhttp[1294]: shcmd (14): /usr/sbin/hdparm -y /dev/sdg >/dev/null
Sep 2 17:37:54 NASBOX kernel: [18173.780487] ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Does that hint at anything?
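To check whether that pattern holds across a whole syslog rather than a couple of hand-picked lines, something like the following Python sketch could be used. It is an illustration only: the function name, the 30-second window, and the default year (syslog lines carry no year) are all assumptions, and the regular expressions are written against the line format quoted in this thread.

```python
import re
from datetime import datetime

# "Sep 2 17:37:44 NASBOX <rest of message>"
LINE = re.compile(r"^(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+(.*)$")
SPIN_DOWN = re.compile(r"hdparm -y (/dev/\w+)")
EXCEPTION = re.compile(r"(ata\d+(?:\.\d+)?): exception")


def spin_downs_before_exceptions(lines, window_seconds=30, year=2007):
    """Return (device, ata_port, gap_seconds) for each ATA exception that is
    logged within window_seconds after the most recent hdparm -y command."""
    hits = []
    last_spin_down = None  # (timestamp, device) of the latest hdparm -y
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue
        # The year is an assumption; it only matters for date arithmetic.
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        rest = match.group(2)
        spin = SPIN_DOWN.search(rest)
        if spin:
            last_spin_down = (stamp, spin.group(1))
            continue
        exc = EXCEPTION.search(rest)
        if exc and last_spin_down:
            gap = (stamp - last_spin_down[0]).total_seconds()
            if 0 <= gap <= window_seconds:
                hits.append((last_spin_down[1], exc.group(1), gap))
    return hits
```

Fed the two lines quoted above, this reports /dev/sdg followed by an ata9.00 exception 10 seconds later. A long list of such pairs for the same device would support the pattern; exceptions with no preceding spin-down would argue against it.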
September 7, 2007 (Author)
An update. I have changed to 4.2-beta-2 and started again, slowly testing every piece of equipment. As it stands, I am back to all the original hardware, including the 3rd TX4 card, and the system is rock solid: not a single error. The last test will be to add the 9th drive, but I want to run with the current hardware for a bit first to see if it glitches.
September 7, 2007
Under 4.2-beta, click "spin down" and see if it results in error messages in syslog.
September 7, 2007 (Author)
With the current setup of 4.2-beta-2, 3 cards and 8 drives, this created no errors at all. The real test will be when I add the 9th drive later today or perhaps tomorrow. As an aside, "Spin Up" seems to do nothing.
September 7, 2007
> ... As an aside "Spin Up" seems to do nothing.
If the drives are already spun up, then Spin Up will do nothing.
September 7, 2007 (Author)
lol yes, I understand that. Let me be more specific. With some or all of the drives in a spun down state, pressing the spin up button results in none of these drives spinning up, i.e. it does nothing. It also creates zero syslog entries.
September 7, 2007
How about Spin Down? Does that work? What should happen is that the green icons start blinking and the temperatures show '*'. If this works, then click Spin Up and look at the syslog. Are you getting errors reported there?
September 7, 2007 (Author)
With all or some of the drives spun up (solid green icon), if I press the spin down button they all spin down (flashing icon and *). I also see entries in syslog like:
Sep 7 08:43:51 NASUser emhttp[1278]: shcmd (18): /usr/sbin/hdparm -y /dev/sda >/dev/null
However, if I then press the spin up button, absolutely nothing happens and no syslog entries of any sort are created.
September 7, 2007
Since your 9th drive is not hooked up, can you also yank the 3rd Promise card and see if you get the same result with Spin Up/Down?
September 7, 2007 (Author)
I pulled the 3rd card. Exactly the same results: the spin up button does nothing, and the spin down button works as expected. I also tried it in Firefox and IE 6 in case it was something simple, but no change.
September 8, 2007
Release 4.2-beta3 uses a different method for spinning up drives. Please see if this works for your config.
September 8, 2007 (Author)
OK, slowly I am making some steps forward. I added the SATA1 jumper to the drive that was creating thousands of errors and added it to the array. Overnight it came back online with zero errors from this drive. There was one of the errors on another drive, though, which I'm not surprised about since they are all WD drives. I don't have any more jumpers, as the drives don't come with any, so I need to order more and add them to all 9 drives.
I upgraded to 4.2-beta-3 and rebooted. unRAID came online perfectly with zero errors. I pressed the spin down button and could hear the drives going to sleep. However, I then lost telnet and the network connection completely. I switched my KVM to the unRAID box and started typing the command to look at the syslog. Halfway through, my keyboard locked up completely and I couldn't type anymore. Interestingly, my KVM also became locked, so I couldn't switch to another server either.
Any ideas? Should I change from a Realtek NIC to an Intel 1000 MT or the like?
Update: I had to hard reboot the server. When it came up it went into parity checking mode and guess what... the same damn errors, which were gone on the new drive, are back. HELP! This is driving me mad.
September 11, 2007 (Author)
A polite bump. Am I at the point where I have to start changing hardware to see if it is the root of the problem? I don't want to start spending more cash unless I need to, but I will if it helps me and the cause.
September 11, 2007
How are the hard drives connected? That is, via direct cable connection, mobile rack, backplane, other? Carefully check all cabling. SATA-1 cables and connectors are notoriously bad: it is very easy to have a cable "rock" partially off. If you have to, secure cables with a spot of hot glue. The best improvement in SATA-2 is the locking connector.
September 11, 2007 (Author)
The drives are connected directly to the cards. I've checked the cables several times, but I will try replacing a couple and see. The drives slide directly into 5.25" slots using 3.5" to 5.25" metal brackets and have plate fans.
September 11, 2007
> Update: I had to hard reboot the server. When it came up it went into parity checking mode and guess what... same damn errors which were gone on the new drive are back.
What is the h/w config at this state? Specifically, how many Promise controllers and how many drives?
September 11, 2007 (Author)
Exactly the same as it is now: 3 cards and 9 drives. The error is sporadic, but when it comes it seems to come with a vengeance. I can force the error by initiating a parity check; apart from that, it will run in normal operation all day, every day now. I don't believe I have had a complete lockup since I went to the beta.
September 11, 2007
I suggest you use only 2 Promise cards. Is there a reason why you're not using the on-board SATA? Perhaps connect your 9th drive to one of the motherboard SATA ports.