
I've had lots of drive issues the last few weeks, unlucky or underlying problem?


JustinChase


So, I had my first red-ball, then another, then a drive went unformatted, then my cache did the same.  Once I had most/all of that repaired, I decided to test a 3TB drive that had red-balled by running a preclear.  It came back clean, so I decided to use it to replace a 1TB drive that has been giving me occasional errors, and is much older.

 

Now that the 3TB drive is in the machine, and the rebuild is in progress, it's going REALLY slow.  Current speed is 2.2MB/sec.  It says it will take 19 days to finish the rebuild.  The preclear went just fine, and at a 'normal' speed.  I think it averaged around 90MB/sec for the preclear, and took about a day.
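As a rough sanity check on those numbers, here is a minimal sketch of the arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant speed. The function name is made up for illustration. The result will not match the GUI's 19-day estimate exactly, since that estimate moves with the instantaneous speed.

```python
def full_pass_days(capacity_bytes: float, speed_bytes_per_sec: float) -> float:
    """Days needed to read or write the whole disk once at a constant speed."""
    seconds = capacity_bytes / speed_bytes_per_sec
    return seconds / 86_400  # seconds per day


TB = 10**12  # assumption: decimal terabytes, as drive vendors count them
MB = 10**6   # assumption: the reported MB/sec is decimal megabytes

slow_days = full_pass_days(3 * TB, 2.2 * MB)
normal_hours = full_pass_days(3 * TB, 90 * MB) * 24

print(f"3TB at 2.2MB/sec: {slow_days:.1f} days")
print(f"3TB at 90MB/sec:  {normal_hours:.1f} hours per pass")
```

At 2.2MB/sec a single pass over 3TB lands in the same multi-week range as the reported estimate, while at 90MB/sec a pass takes well under half a day. If the preclear made several full passes over the disk (an assumption here), "about a day" at 90MB/sec is in line with that.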

 

My cache drive has completely died in all this, and can't be recognized in another machine either.

 

I'm wondering if something else is causing all these issues in my server, and how I might get to the bottom of all these errors.  I've spent the last 2 weeks copying data and moving drives and preclearing and rebuilding and checking parity, and I really just want to get back to having a server that serves my media, and not have to spend all my time checking on drives and processes.

 

Suggestions, thoughts, ideas?

syslog.txt

Link to comment

The power supply is fairly new, less than 2 years old anyway.  I don't know how to test it though.  Is there some utility or bench test I can try?

 

I don't really want to buy another power supply to test, since there's no 'specific' issue that would indicate if the new power supply was working any better.

Link to comment

figures.  that's not a disk I've had any problems with before today.

 

Of course, I've moved around disks so often, I could have unseated the cable on that drive to cause this.

 

I suppose it's probably okay to stop the rebuild, shut down the machine, check all connections, and fire it up again.  That's not going to cause any problems with the rebuild is it?

 

Sadly, the drive it's rebuilding was 'empty' before the disk was replaced, since I'd already copied everything to other drives, so I actually thought the rebuild might go fairly quickly.  oh well.

Link to comment


The speed of a rebuild has absolutely nothing to do with how much data is on the disk.    The entire disk is rebuilt from the other drives + parity, regardless of whether it contains data, is all zeroes, or anything in-between.
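To illustrate why the amount of data doesn't matter, here is a minimal sketch of a single-parity rebuild, assuming parity is the byte-wise XOR of all data disks. The disk contents and the `xor_blocks` helper are made up for illustration; this is not unRAID's actual code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data disks of four bytes each; disk2 is "empty" (all zeroes).
disk1 = bytes([0x12, 0x34, 0x56, 0x78])
disk2 = bytes(4)
disk3 = bytes([0xAB, 0xCD, 0xEF, 0x01])
parity = xor_blocks([disk1, disk2, disk3])

# Rebuilding disk2 reads every byte position of every other disk plus parity,
# whether or not disk2 held any files.
rebuilt = xor_blocks([disk1, disk3, parity])
print(rebuilt == disk2)
```

The loop visits every byte position regardless of what the missing disk contained, so the rebuild time depends on the disk size and on the slowest participating drive, not on how full the disk was.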

 

As for using a PSU tester -- not likely to be helpful, since they only show if the various power buses are present and are powering up within the time limits for the "power good" signal.    They don't show issues that may be related to bus loading.    The only reasonable way to confirm whether it's a power issue or not is to try another power supply.    If you don't want to buy a spare unit (I find it handy to have a spare one around, but if you don't "fiddle" a lot with PCs you probably wouldn't have much use for it), then I'd exhaust all other possibilities (cables, etc.) first.

 

Link to comment

It's kind of funny, since this power supply was purchased because I thought my last power supply might be borderline, so I still have the old one sitting here, ready for the next PC I build.  I think it's only a 400W unit, so it wouldn't be a good 'test' unit to put in this machine anyway.

 

I decided to stop the rebuild, check all the cables, and restart.  That was about an hour ago and it's currently chugging along at 100MB/sec, with an estimated finish of about 6 hours from now.

 

Hopefully once finished, I'll run a parity check, then not have any more issues for a good long time.

 

fingers crossed!

 

thanks again to everyone for all the help and suggestions.

Link to comment

Okay, I finally got all drives back into the server, set the cache drive, then reinstalled a few dockers.  While going thru and setting them up, I ran into some more issues.

 

Now, it seems my parity is unrecognized, and unRAID thinks it's a new drive, and is forcing me to do a parity check.  It's going really slow.  Looking at the syslog, it seems yet another drive is giving errors/having issues.

 

I'm so close to just taking everything apart, selling all the parts, and not having a server any more.

 

Does anyone have any suggestions on how to get this F...... server to just F...... run without problems, or do I just sell the parts, and spend all my newfound free time outside, in the real world?

syslog.zip

Link to comment


Wow... you just can't catch a break, can you?

 

Based on issues I've had recently, I would suggest that you do your work under the Xen config. Under non-Xen I was getting errors on the parity drive and parity corrections, but under Xen the parity check runs smoothly without issue. Not sure why, but I added a post under defects about this.

 

Second, I still think your power supply is the most likely culprit. Not quite having enough power to all your drives would cause all sorts of inconsistent issues, and the problems could move from drive to drive as there is no guarantee which drive is being short changed by the PSU.

 

I may be completely wrong here, but I know I've had random errors with different drives when I've added more HDs than my PSU can support.

 

Link to comment


I suspect power supply also.

 

So, if that's correct, what power supply should I buy?  How much power do I need to run 12 drives and a decent video card?  I have a gold certified 650 Watt, single rail power supply currently, and it seems that's not enough.

 

The UPS says I'm only using 129 watts with all drives spun up.
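A steady-state reading like that doesn't capture the moment when all the drives spin up together, so here is a rough back-of-the-envelope sketch of that peak. The per-drive spin-up current and the base system load are assumptions picked for illustration, not measurements from this server. The 116 W video card figure is the maximum quoted later in this thread.

```python
def peak_spinup_watts(drive_count: int, amps_12v_per_drive: float, other_watts: float) -> float:
    """Rough worst case if every drive spins up at the same moment."""
    return drive_count * amps_12v_per_drive * 12.0 + other_watts


ASSUMED_SPINUP_AMPS = 2.0   # assumption: 12 V spin-up current per 3.5" drive
ASSUMED_BASE_WATTS = 100.0  # assumption: motherboard, CPU, fans, and so on
GPU_MAX_WATTS = 116.0       # maximum draw stated for the video card in this thread

peak = peak_spinup_watts(12, ASSUMED_SPINUP_AMPS, ASSUMED_BASE_WATTS + GPU_MAX_WATTS)
print(f"estimated spin-up peak: {peak:.0f} W against a 650 W rating")
```

Under these assumptions the peak stays below the 650 W rating, which would point more toward a faulty unit or bad cabling than a simple shortage of watts. Real figures should come from the drive datasheets and the 12 V rating on the PSU label.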

 

I just don't want to/can't spend another $500 or more to get a new power supply, and a couple more disks in the hopes that this will fix everything.

Link to comment


Is the video card a new addition to your server? If so, maybe removing it will give it enough power??

Link to comment

Post a SMART report for disk2. Check for BIOS and firmware updates

 

Here are smart reports for the drives that look like they are having issues, disk 4 (WD20EACS) and parity (ST4000DM000).  I didn't see anything wrong with disk 2 (), but it's attached also.

 

Also, most recent syslog.

 

I don't know how to read these, so please let me know if I need to act on any of this.

 

thanks again for all the help.

smartWD20EACS.txt

smartST4000DM000.txt

smartHDS5C3030ALA630.txt

syslog.txt

Link to comment


I have 13 drives currently, but am only using on-board video, and have an AX860, which likely gives me lots of headroom.

 

However, it may not be that the 650 you have isn't giving sufficient power - it may just be faulty.

 

Mine is around $180 USD on Newegg, but as mentioned I think it may be overkill. It's only because I've had PSU issues  before that I wanted to give myself a lot of headroom so that as I expand my array I will (hopefully) not have similar issues.

Link to comment

The only drive that would have me worried is the "smartST4000DM000" specifically the Reported_Uncorrect value

 

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   110   099   006    Pre-fail  Always       -       27141960
  3 Spin_Up_Time            0x0003   098   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1393
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       40
  7 Seek_Error_Rate         0x000f   076   060   030    Pre-fail  Always       -       42275386
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7665
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       187
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   094   094   000    Old_age   Always       -       6
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   095   095   000    Old_age   Always       -       5
190 Airflow_Temperature_Cel 0x0022   069   050   045    Old_age   Always       -       31 (Min/Max 31/32)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       226
193 Load_Cycle_Count        0x0032   095   095   000    Old_age   Always       -       10892
194 Temperature_Celsius     0x0022   031   050   000    Old_age   Always       -       31 (0 16 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       1311h+16m+55.411s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       76567108688
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       335881793285
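For anyone who would rather not eyeball these reports, here is a minimal sketch that pulls out the raw counts for a handful of attributes from text in this layout. The watch list and the `flag_attributes` name are my own choices for illustration, not an official list, and a non-zero count is only a hint to keep an eye on the drive, not proof that it is failing.

```python
# Assumption: these are the attributes worth watching; the list is illustrative.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}


def flag_attributes(report: str) -> dict:
    """Return {attribute name: raw count} for watched attributes with a non-zero raw value."""
    flagged = {}
    for line in report.splitlines():
        fields = line.split(None, 9)  # ten columns; the raw value may contain spaces
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # skip headers and anything that is not an attribute row
        name, raw = fields[1], fields[9]
        if name in WATCHED:
            count = int(raw.split()[0])
            if count > 0:
                flagged[name] = count
    return flagged


# Rows copied from the ST4000DM000 report above.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       40
187 Reported_Uncorrect      0x0032   094   094   000    Old_age   Always       -       6
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

print(flag_attributes(sample))
# prints {'Reallocated_Sector_Ct': 40, 'Reported_Uncorrect': 6}
```

Running the same check again after a parity check shows whether the counts are still climbing, which matters more than the absolute numbers.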

Link to comment

Is the video card a new addition to your server? If so, maybe removing it will give it enough power??

 

Actually, yes, it is new to the server.  I'll pull it out and see if that changes anything.

 

I wonder if any of the smart reports indicate a fatal problem with any of the drives, or if it might just be a power supply issue.

 

The video card says it only draws a maximum of 116 Watts, and needs a minimum 400W PS, so maybe I am just running out of clean power here.

 

Although, I'm not even using the video card at this point.  I was hoping to set up a windows VM to see if I could use it, but I guess that's not going to happen.

 

The really sad thing is that I bought a new motherboard and CPU to prepare for the possibility of GPU passthru, which is pretty much all wasted money at this point.  I may take everything apart, put the old motherboard and CPU back into the machine, replace the power supply, and rebuild my HTPC and just go back to having 2 computers, which worked just fine until I decided to 'make it better'.

 

Huge FAIL!!

Link to comment

The only drive that would have me worried is the "smartST4000DM000" specifically the Reported_Uncorrect value

 

I don't know what that means, but that's my parity drive.

 

I have a new 4TB drive I just installed to replace a different failed drive, and it has almost no data on it right now.  I wonder if I should remove it, set it to be parity, then pre-clear/test the current parity drive to see how it looks.

 

If I decided to do that, I honestly don't even know how that would work, as I'd have to remove that drive from the array, then I'd have both parity and that drive red-balled, which I don't think would allow unRAID to even start the array.

 

AARRRGGGGHHHHHH!!!!!!

 

I hate computers sometimes.

 

Also, the other 2TB drive is continually getting reset as per the syslog, so I suspect it's either a drive problem, or perhaps the PS.

 

I'm going to shutdown, remove the video card, then restart and see how things look then.

Link to comment

If you search amazon, I have:

 

Corsair RM Series 1000 Watt ATX/EPS 80PLUS Gold-Certified Power Supply - CP-9020062-NA RM1000

 

I am running 20 hard drives and haven't had any issues, and it is only $149 and it is modular.  1000W should easily handle the 12 drives.

 

I would try the Xen boot before anything else...

 

Some of those counts may not mean anything if they are old.  My parity drive has a few errors on it, but they occurred years ago and haven't increased, and I haven't seen any issues from it.

 

 

Link to comment

Okay, I removed the video card, changed the default boot to XEN, replaced the SATA cable going to Disk4 - WDC_WD20EACS-11BHUB0_WD-WCAZA3758422 (sdd) drive (ata4) and rebooted.

 

disk 4 still throws lots of errors in the syslog, and the parity check is running at less than 1MB/sec.

 

I don't have a drive here that I can use to replace disk4, but if I need to, I could go buy one at Best Buy, which is about the same price as Amazon or Newegg.

 

Honestly, at this point I'd rather just remove the drive and move the data onto one of my other drives, but as slow as it's going, that will take many days to complete.

 

Any ideas/suggestions on how to proceed from here?

syslog.txt

Link to comment

Since you have your old (400W) power supply lying around, switch to that and see if things change.  You may simply have a faulty bus on your new power supply.    The 400W should be plenty as long as you don't reinstall the video card.    If it resolves things, then you'll at least know that a new PSU should both resolve the issues and allow you to use the video card.

 

Link to comment

Archived

This topic is now archived and is closed to further replies.
