
Red Dot



Would it be worth changing controller cards? I could probably get new ones, test them and return them if it makes no difference.

 

Any recommendations?

 

Dan,

 

  The controllers may not be the problem. The problem could be that all your hard drives are the same model and appear to have been bought in two batches (as the power-on hours are in the 3500-4000 hour range).

And one of your HDs had a temperature of 50°C. If one of your HDs has suffered thermal damage, there is a chance that others have too. Unfortunately, WD hard drives do not have a SMART attribute showing the highest temperature reached in the past.

Something of concern (it is from 2010): this article from Tom's Hardware on hard drive reliability - http://www.tomshardware.com/reviews/hdd-reliability-storelab,2681.html

and in the second page there is this snippet:

"Storelab notes that read/write head failure is somewhat characteristic for WD drives. Failures primarily occur as a result of physical impact or overheating (WD heads can be sensitive at temperatures above 45°C)."

 

This is perhaps your bigger problem.

 

  Next - you have two SASLP cards, and some versions of Unraid do not like even one of these cards, never mind two. Unfortunately I do not use them and cannot guide you here, but you should probably stop for now and read the existing posts to see whether RC5 and two SASLP cards are compatible before going ahead with further parity checks.

 

  Third (and this view is not widely shared here) - there is a reason why commercial servers use ECC memory and are powered through a UPS. It takes only a single memory bit flip in the area where Unraid is loaded (Unraid runs from memory), from whatever cause (cosmic radiation, power supply fluctuations), to crash the server. It is one thing when you play with 5-10 year old hardware and the free version, but if you get serious and have 10-15 or more big 2-3-4TB drives, you are risking a lot.

This is my opinion only and, as I said, not widely shared.

Link to comment


You use two SASLP cards but fortunately have only 14 hard drives. Attach 6 of them to the motherboard SATA ports and 8 to a single controller card and try it that way, then swap the cards. I still do not think the cards are faulty, though.

Link to comment

The web interface is unresponsive even with no attempts to access the server.

 

Here is another print screen of the requested command. It repeats itself a few times each minute from about 9am until now, the attached pic is only a segment of that.

 

This may be an issue. Why are there so many smbd processes? What add-ons are running?

 

Are hundreds of smbd processes still running?

Link to comment

The server has been working fine for 6 months or so, although I have had to reboot it several times when it became unresponsive. It seems odd that the problems have only just started, which leads me to believe it isn't a problem with running 2 SATA cards with unRAID.

I'll give your idea a go and plug more drives into the motherboard. I've had my fill of this server today and I'll need to order some new SATA cables anyway.

 

How do I find out how many processes are running?

Link to comment


See Reply 54:

The web interface is unresponsive even with no attempts to access the server.

 

Here is another print screen of the requested command. It repeats itself a few times each minute from about 9am until now, the attached pic is only a segment of that.

 

Are there any new client devices?
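
For the question of how to find out how many processes are running, here is a minimal sketch. It assumes Python 3 is available on the machine, which a stock unRAID install may not provide, and that `ps` accepts the standard `-e -o comm=` options. The function names `count_by_name` and `running_process_counts` are made up for illustration. From the console, `ps -e | grep -c smbd` gives a rough count without any scripting.

```python
import subprocess
from collections import Counter


def count_by_name(ps_output: str) -> Counter:
    """Count processes per command name, given one command name per line."""
    names = (line.strip() for line in ps_output.splitlines())
    return Counter(name for name in names if name)


def running_process_counts() -> Counter:
    """Ask ps for the command name of every process and tally them."""
    # "-e" selects all processes; "-o comm=" prints only the command
    # name, and the trailing "=" suppresses the header line.
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_by_name(result.stdout)
```

Calling `running_process_counts()["smbd"]` returns the number of Samba daemons currently running, and `running_process_counts().most_common(5)` lists the five most numerous process names, which is a quick way to spot a runaway service.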

Link to comment

Another day, another red dot.

Now disk 10 has a red dot next to it and shows as unformatted. All the data from that drive has disappeared from my shares. I've rebooted several times with no change, and without parity protection I guess that data is now gone.

With the array started I click Format, then click Refresh to see the progress. Each time, the drive doesn't even begin to format and the option to format it returns.

Link to comment

I removed disk 10 from the array rather than format it and tried to build parity again.

Fail.

I don't understand this at all. I'm at a total loss and feel like I've wasted over £2k.

I don't believe the drives are damaged from heat; they were very rarely above 50 degrees and they work perfectly as long as I don't run a parity check.

Link to comment

How recently have you run a long memtest (24 hours or longer)?  A stick of memory going bad could cause all sorts of weirdness.

 

Another thing to check would be the processor's heat sink junction, a poor mechanical connection or improper heat sink compound application can cause weirdness under load.

Link to comment

I haven't ever run a memtest but I'll be sure to look into it, thank you for the advice.

The CPU has been working fine and I don't think unRAID will stress an i3 much anyway. I understand RAM can go bad, but can a CPU's heat sink lose its effectiveness?

Link to comment


If the heat sink isn't perfectly parallel to the CPU, it won't be moving as much heat as it should. The fact that the i3 runs so cool means that a poor heatsink interface won't affect you as much or as soon as it would have with an old P4, but the effect could still be there under load. Another issue I see on self-built rigs is WAY too much heatsink compound. The proper amount just barely squeezes to the edge of the processor face when the CPU is clamped down. Ideally you would need only a very small dab, and the CPU and heatsink would be lapped to perfectly flat, mirror-smooth surfaces that squeeze out all the compound save for a micron-thick layer filling the last gaps left by an imperfect surface. Also, if you do remove the heatsink to check the surfaces, be sure to clean both faces and apply fresh compound, and don't move the heatsink once you clamp it down. Wiggling it around just creates voids in the compound.

TL;DR The i3 runs so cool the heatsink almost isn't needed until the CPU is under load, so a marginal install can start to affect things down the road.

Link to comment


Just a thought, but a parity check runs all the drives at the same time for hours. This is the highest heat load your server ever runs under. If you have an airflow problem, and therefore a heat problem, this is the most likely time for it to surface. If this has been going on for some time, the high heat may by now be causing damage across multiple drives. It would likely take some physical issue to cause what seem to be progressive failures of multiple drives. This all sounds like heat may be your issue. WHS would not run a parity check and therefore would not create the same sustained combined load.

Having said this, there isn't an obvious way for a heat issue to cause the interface to drop out or numerous processes to run simultaneously, but that part is above my head. Maybe the memory is heating up as well and affecting the processes?
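
One way to test the heat theory is to log drive temperatures while a parity check runs and compare them with the 45°C figure from the Storelab quote earlier in the thread. The sketch below is only an illustration. It assumes Python 3 and `smartctl` (smartmontools) are installed, and that the drive reports its temperature as the leading number in the raw value of SMART attribute 194, which is common but not universal. The function names and the 45-degree default are choices made for this example.

```python
import subprocess
from typing import Dict, List, Optional


def parse_temperature(smartctl_output: str) -> Optional[int]:
    """Return the raw value of SMART attribute 194 from `smartctl -A` text."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have ten columns; the raw value starts at the tenth.
        if len(fields) >= 10 and fields[0] == "194":
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


def read_temperature(device: str) -> Optional[int]:
    """Run `smartctl -A` on one device, e.g. "/dev/sda" (needs root)."""
    # smartctl uses its exit status as a bit mask of warnings, so a
    # non-zero status is not treated as a failure here.
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return parse_temperature(result.stdout)


def hot_drives(temps: Dict[str, Optional[int]], limit: int = 45) -> List[str]:
    """Names of drives whose temperature is known and above the limit."""
    return sorted(d for d, t in temps.items() if t is not None and t > limit)
```

Running `read_temperature` over each array device every few minutes during a parity check, and passing the results to `hot_drives`, would show whether particular drives or a particular cage run hotter than the rest.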

Link to comment

Yesterday I installed a new case fan, which brought the HDD temps down to a max of 41 during the parity build.

I also changed the SATA cables to my controller card.

The system successfully completed a parity build, which is great news!

The bad news is that shortly afterwards disk 8 got a red dot. It didn't get hot during the parity build so I'm not sure why. So far I've had red dots on disks 7, 8 and 10, all of which are in the same HDD cage but on different SATA cards.

The missing data from disk 10 returned for some unknown reason, so I'm guessing disk 8 isn't broken either.

How do I do a memory test on the server? I googled it but can't find anything for unRAID.

I'll see if I have some CPU paste lying about and refit the CPU heatsink tomorrow as well.

Overall some progress, but it's still not working correctly.

Link to comment

How do I do a memory test on the server? I googled it but can't find anything for unraid.

http://lime-technology.com/wiki/index.php/FAQ#Why_am_I_getting_repeated_parity_errors.3F

The wiki link explains with more detail, but basically...

 

If you have a monitor hooked up to your unRAID server and restart it, you will see a blue box with 2 options. The default option is to load unRAID; if you do nothing, unRAID starts after a few seconds. The other option is MemTest. Select MemTest using the arrow keys on the keyboard attached to your server.

Link to comment

I tried to start it 3 more times but each time I got the same result. I then looked at the memtest file in the latest RC6 download and noticed it was larger than the one on my flash drive, so I copied it over, assuming it was a newer version. When I ran it again it still said 4.10, so I guess it hasn't been updated.

2 more attempts failed to run.

Would bad RAM stop the test from running, or is it something else?

Link to comment

Also, disk 8 with its red dot still works fine. It's part of the array and its files show in my shares. The only difference I can tell is that there is a red dot and no temperature next to it in the web interface.

 

The drive has been taken out of service. The files are still in the array and the disk contents are being computed using all of the other disks. A second disk failure will result in data loss.
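
To illustrate how the contents of a missing disk can be computed from all the others, here is a toy sketch of single-parity reconstruction with XOR, which is the scheme unRAID's single parity disk is generally understood to use. The two-byte "disks" and the function names are invented for the example; a real array works on whole drives, not short byte strings.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def compute_parity(data_disks: List[bytes]) -> bytes:
    """The parity block is the XOR of every data disk."""
    return xor_blocks(data_disks)


def reconstruct(surviving_disks: List[bytes], parity: bytes) -> bytes:
    """Rebuild one missing disk from the parity and all remaining disks."""
    return xor_blocks(surviving_disks + [parity])


# Three toy data disks, two bytes each (made-up values).
disks = [b"\x0f\xa0", b"\x33\x11", b"\xff\x00"]
parity = compute_parity(disks)

# Pretend the middle disk has been red-dotted and taken out of service.
rebuilt = reconstruct([disks[0], disks[2]], parity)
print(rebuilt == disks[1])
```

The reconstruction needs every other disk to be readable. With two disks missing, a single parity block leaves two unknowns in one equation, which is why a second failure means data loss.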

Link to comment

Luckily I have a spare precleared drive here ready to use, so I'll put that in and rebuild the array. I wouldn't be surprised if disk 8 is fine though, as I've had several drives get a red dot and then suddenly work again without me doing anything.

 

Any ideas on the memtest?

Link to comment

Archived

This topic is now archived and is closed to further replies.

