March 13, 2011
Ok, so now I have a question about this "failed" disk.
1) I can see it from the disk share in \\tower\disk4
2) I can run a SMART test which returns no errors.
3) I can run a speed test which shows it at the same speeds as my other drives.
4) BIOS reports the drive as present and SMART capable and fine...
Is this drive really failed, or has unRAID goofed up? Thoughts?
unRAID version is 4.7
I'll attach my syslogs (multiple logs, since I am not sure where/when the error occurred).
Thanks!
Km
logs.zip
March 13, 2011
unRAID only marks a drive as failed, and kicks it from the array, when a write to that drive has failed. If you want to try and reuse the drive you will need to:
1. Stop the array
2. Unassign the drive
3. Start the array
4. Stop the array
5. Assign the drive back to the same slot
6. Start the array
A data rebuild should start, so let it go and see what happens.
March 13, 2011 Author
Is that safe? I'm assuming it'll attempt the rebuild but error out if it can't? The array is unusable during the rebuild, right? I'll rebuild tonight after movie night.
March 13, 2011
Yes, the rebuild is safe. You can use the array while a rebuild is going on. Writing to the array may slow the rebuild down, but you will be fine. If you have movie night tonight and another drive fails... then you are SOL. You should be able to get the array back up and going, but the process will be much more involved.
March 14, 2011 Author
Ok, data-rebuild is in progress, will report in 10 hours.
March 14, 2011 Author
Data rebuild is finished, the drive has a green ball and it says "Parity is valid". Should I assume it's good to go, or run another parity check?
Km.
March 14, 2011
Yes, another "check" is in order. You've just written the drive; a "check" will let you know that what was written is readable.
March 15, 2011 Author
Second parity check completed, parity is valid and all drives are green and good to go... VERY weird. I just don't understand, maybe it was a glitch in a single write attempt? No other tests gave warnings or errors on that drive... Oh well, as long as it's back up and running, I'm a happy camper.
March 15, 2011 Author
Ok, round two of our saga.... Ran the good part of the day with no errors, everything fine. Checked a few minutes ago, now disk THREE is disabled instead of disk FOUR. Stopped array, ran SMART on disk THREE and it came up clean. BIOS SMART shows disk three as clean. Removed, reassigned, data-rebuild now on drive three.... Which will be next? I'm taking bets here, gentlemen!! (purchases small firearm, loads rounds, cocks hammer, places barrel in mouth and watches unRAID screen - waiting... waiting ...)
March 15, 2011
You have (I think) 7 disk drives in your server. Please describe the case, power supply (exact make & model) and disk controller cards. Are you using "locking" SATA cables? How many fans??
Joe L.
March 15, 2011
Give us a detailed list of your hardware. It sounds like you have an intermittent connection or something of that sort. Also, if you can, post some syslogs captured after a disk gets disabled.
March 15, 2011 Author
I am using an aluminum server case, with a 6 drive cage in the front. Two case fans provide airflow directly over the drives, two more fans provide venting from the rear of the machine. The other two drives (7th SATA and "extra" IDE drive) are spaced out in the 5.25" bays. My drive temps have never reached over 36 degrees and usually idle around 28-32.
I do not use locking SATA cables, but on each failure I have checked to see if they are loose, and they are all snug in the connectors.
I am using a 450W power supply, and my UPS indicates I am only drawing ~150W of power, so I don't think I'm underpowering anything.
I'll post my last two syslogs - a screenshot of my drive temps etc can be seen here: http://awesomescreenshot.com/0f69aji51
logs.zip
March 15, 2011
What specific make/model power supply??
March 15, 2011 Author
Hmm, shoot, I'll have to wait until I get home and open up the beast... I'll get that info asap though, thanks for the help!
*EDIT* I just ordered 7 locking SATA cables though, I just think that makes sense.... Is there a recommended PSU (wattage, make etc)?
Km.
March 15, 2011 Author
Mar 15 18:02:15 Angband kernel: ata2: SError: { Persist HostInt PHYRdyChg 10B8B }
Mar 15 18:02:15 Angband kernel: ata2.00: failed command: READ DMA EXT
Mar 15 18:02:15 Angband kernel: ata2.00: cmd 25/00:00:f7:d2:b4/00:04:01:00:00/e0 tag 0 dma 524288 in
Mar 15 18:02:15 Angband kernel: res 50/00:00:f6:d2:b4/00:00:01:00:00/e1 Emask 0x50 (ATA bus error)
Mar 15 18:02:15 Angband kernel: ata2.00: status: { DRDY }
Mar 15 18:02:15 Angband kernel: ata2: hard resetting link
This popped up on the server while in the middle of the data rebuild... any clues?
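One way to see how many ports are throwing errors is to tally the kernel lines per `ataN` port. Below is a minimal Python sketch, not a definitive tool: the sample text is just the lines from the post above, and the list of "trouble" substrings (`MARKERS`) and the function name `count_port_events` are assumptions made up for illustration. A port number is not a disk number, so matching `ata2` to an array slot would still need the drive identification lines from the same syslog.

```python
import re
from collections import Counter

# Kernel lines copied from the post above; a real run would read the full syslog file.
SAMPLE = """\
Mar 15 18:02:15 Angband kernel: ata2: SError: { Persist HostInt PHYRdyChg 10B8B }
Mar 15 18:02:15 Angband kernel: ata2.00: failed command: READ DMA EXT
Mar 15 18:02:15 Angband kernel: res 50/00:00:f6:d2:b4/00:00:01:00:00/e1 Emask 0x50 (ATA bus error)
Mar 15 18:02:15 Angband kernel: ata2.00: status: { DRDY }
Mar 15 18:02:15 Angband kernel: ata2: hard resetting link
"""

# Captures the port prefix ("ata2" from "ata2:" or "ata2.00:") and the message text.
PORT_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

# Substrings counted as trouble. This list is an assumption and is not exhaustive.
MARKERS = ("SError", "failed command", "hard resetting link")


def count_port_events(lines):
    """Count trouble lines per ATA port; lines without a port prefix are skipped."""
    counts = Counter()
    for line in lines:
        match = PORT_RE.search(line)
        if match and any(marker in match.group(2) for marker in MARKERS):
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for port, hits in count_port_events(SAMPLE.splitlines()).most_common():
        print(port, hits)
```

If more than one port shows up in the tally for the full log, that would point toward something shared (power, cage, controller) rather than a single disk, though that reading is a rule of thumb and not a diagnosis.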
March 15, 2011 Author
###########
Replaying journal:
The problem has occurred looks like a hardware problem. If you have bad blocks, we advise you to get a new hard drive, because once you get one bad block that the disk drive internals cannot hide from your sight,the chances of getting more are generally said to become much higher (precise statistics are unknown to us), and this disk drive is probably not expensive enough for you to you to risk your time and data on it. If you don't want to follow that follow that advice then if you have just a few bad blocks, try writing to the bad blocks and see if the drive remaps the bad blocks (that means it takes a block it has in reserve and allocates it for use for of that block number). If it cannot remap the block, use badblock option (-B) with reiserfs utils to handle this block correctly.
bread: Cannot read the block (7739): (Input/output error).
This is the result of reiserfsck on /dev/md3 and /dev/md4.
Am I totally effed?
March 16, 2011 Author
Ok, can someone take a peek at this syslog and tell me which disk is failing? Or if more than one is failing? I'm so frustrated and anxious now that I fear I'm not doing myself much good. Maybe a few more sets of eyes and a clear head can help me out...
syslog-2011-03-16.zip
March 16, 2011
Is there a reason you didn't post the PSU make and model?
The syslog is not that helpful with this type of issue.
Usually when drives are dropping offline it is caused by either a bad data connection or a power problem. A bad data connection can be a bad cable, a loose port, a bad drive cage, or even a bad or poorly seated controller card. Power problems can be an underpowered PSU, a multi-rail PSU, bad power splitters, or a poor power connection.
The locking SATA cables will hopefully fix this, but until it is resolved your drives are going to be dropping off the array.
Unlike Joe L, I do not favor a drive rebuild except as a last resort. I prefer the "trust parity" procedure. You will lose data written to the disk during the time it was simulated by unRAID, but the rest of the data will be intact. Although the rebuild should be writing the exact same data back to the disk, I've always felt writing over the same disk to be putting all your eggs in one basket and slightly risky. Maybe it's just me. But when having the problems you are having with drives dropping, I would not be rebuilding disks.
March 16, 2011 Author
I totally forgot to check the PSU make/model. I think I should just replace it anyway, everything else on the PC is new.
I was trying to ascertain how many drives were falling off, that's why I posted the syslog. I just had a hard time reading it.
The cables are definitely not loose or coming off the drives, I made sure of that last night, but I did order some locking ones to be 100% sure as well.
From this point on I agree, the data rebuilds over and over are prolly not the way to go. I'll let this current one finish and hopefully will have a new PSU tonight to swap in (that's why I was asking for recommendations on PSU makes/models).
I was really at this point hoping for another set of eyes to check the syslog and tell me if, in fact, multiple drives were dropping or just one.
Km
March 16, 2011
You want a single rail 12 volt supply, capable of handling the 8 drives...
Typically I figure about 2 amps capacity for each green drive and 3 for each non-green. That adds up to about 17 amps for just your disk drives. You still need to power the motherboard, disk controller cards, and fans. Figure about 5 amps for them. I'd not go for a supply with less than a 35 amp rating on a single 12 volt rail. If you are using a generic power supply, or one with multiple rails for the 12 volt lines, you probably have, at best, an 18 amp rail available to SHARE between all the disks, the fans, and the motherboard.
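The arithmetic in that rule of thumb can be written out as a short Python sketch. The per-drive and motherboard figures are the ones quoted in the post; the drive mix of seven green plus one non-green is an assumption, chosen only because it reproduces the 17 amp drive total, and the function name `rail_estimate` is made up for illustration.

```python
# Rule-of-thumb figures from the post above, in amps on the 12 V rail.
AMPS_PER_GREEN_DRIVE = 2
AMPS_PER_OTHER_DRIVE = 3
AMPS_BOARD_CARDS_FANS = 5


def rail_estimate(green_drives, other_drives):
    """Return the estimated 12 V load in amps for a given drive mix."""
    drive_amps = (green_drives * AMPS_PER_GREEN_DRIVE
                  + other_drives * AMPS_PER_OTHER_DRIVE)
    return drive_amps + AMPS_BOARD_CARDS_FANS


if __name__ == "__main__":
    # Assumed mix: 7 green + 1 non-green = 17 A for the drives, plus 5 A for the rest.
    total = rail_estimate(green_drives=7, other_drives=1)
    print(f"estimated 12 V load: {total} A")
```

The result is an estimate of load, not a rating to shop for; the 35 amp minimum suggested in the post leaves a good deal of headroom above it.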
March 16, 2011 Author
I am looking at picking up a Thermaltake 750 TRX, here are the specs as posted:
http://www.thermaltakeusa.com/Product.aspx?S=1172&ID=1907
- ATX 12V V2.3 & EPS 12V 2.91 enables most reliable and robust power delivery.
- Guarantee to deliver 750W of continuous power output at 40°C.
- Robust and dedicated single +12V output provides superior performance under all types of system loading.
- 80 PLUS® Standard Certified: 80% or more efficiency at 20%, 50%, and 100% load.
- Double forward ultra-efficient circuitry design for added power savings.
- Ultra-quiet 140mm cooling fan delivers excellent airflow at an exceptionally low noise level by varying the RPM in response to temperature.
- Universal AC input 115V~240V automatically scans and detects the correct voltage for different countries.
- 99% Active Power Factor Correction provides clean and reliable power to your system.
- Cable Management improves internal airflow by reducing cable clutter within PC to promote accelerated heat removal.
- Intel® & AMD CPU compliant.
- Nvidia® & ATI/AMD graphic card compatible.
- Dimension: 150mm(W)x86mm(H)x160mm(L).
- High reliability: MTBF>100,000 hours.
- Built in industrial-grade protections: Over Current, Over Voltage and Short-Circuit protections.
- Safety / EMI: UL/CUL, TUV, FCC, CE and BSMI certified.
- 56 amps over the +12 V rail
March 16, 2011
As described on their web-site, that power supply has a single rail 12 volt supply capable of 56 amps output. It would work fine.
March 17, 2011 Author
Ok, so the new PSU is installed, and things seem to be running way smooth.
For those interested, the old PSU: MiOS 400U (425 W!!!!, yikes!!!!)
Only 24 amps on the 12 volt rail... eek. I think this was to blame.
Thanks for all the help and advice, it was greatly appreciated over the past few nail-biting nights!
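For a rough comparison of the two supplies against the rule of thumb given earlier in the thread (about 17 amps for the drives plus 5 amps for everything else, and a 35 amp minimum rating), here is a small Python sketch. The rail figures are the ones quoted in the thread; the thread does not say whether the old unit's 24 amps sit on one rail or are split across several, so treating it as a single rail is an assumption, and the function name `assess` is made up for illustration.

```python
# Figures quoted earlier in the thread, in amps on the 12 V rail.
ESTIMATED_LOAD_A = 17 + 5   # drives + motherboard, controller cards and fans
RECOMMENDED_MIN_A = 35      # suggested minimum single-rail rating


def assess(rail_amps):
    """Classify a 12 V rail rating against the estimate and the suggested minimum."""
    if rail_amps < ESTIMATED_LOAD_A:
        return "below estimated load"
    if rail_amps < RECOMMENDED_MIN_A:
        return "covers the estimate, below the recommended minimum"
    return "meets the recommended minimum"


if __name__ == "__main__":
    # Assumption: each figure is treated as one rail available to every device.
    for name, amps in {"MiOS 400U": 24, "Thermaltake 750 TRX": 56}.items():
        print(f"{name}: {amps} A -> {assess(amps)}")
```

On paper the old unit clears the estimate with little margin, so the numbers alone do not settle whether it was the culprit.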
March 17, 2011
Will be interesting to see. My bet is that it is something else besides the PSU. My reasoning is that the PSU delivers the most power when starting up and spinning up the drives. The power needed to just keep them spinning, or even to read from them, is much less. If the PSU was booting the computer and spinning up the drives, there should be more than enough power to keep them spinning.
But this is just my opinion. Conventional wisdom here is that a top quality PSU is critical. My anecdotal experience has been that a good PSU is good enough. (All that being said, I am doing an update and got a Corsair 650TX, which is highly regarded here.)
By the way, 750 watts is way overkill. You should have more than enough power.