Unraid extremely unstable - drives dropping randomly (even brand new drives)



Hi,

 

I have been playing with Unraid for a couple of weeks now... and based on a lot of people's reviews I was expecting it to be a very easy system to use.

However, I feel that I am plagued by a lot of issues...

 

I posted some time ago about a network issue, for reference (Unraid does not see the internet after a reboot until it has been online for at least ~20h!?!). It is still not resolved.

 

But the issue that is driving me mad is the issue of "disappearing drives"...

 

I have a bunch of drives: 5x 14TB + 8TB + 3TB + 4TB. I was planning on adding all of them to the array, but it seems impossible to get them to stay connected long enough to run preclear.

Some drives (2x 14TB, the 3TB and the 4TB) seem to disappear randomly... Those drives are a mix of brand spanking new drives and old drives, and a mix of shucked drives and proper NAS drives. There seems to be no pattern to the dropping. Unraid shows no error whatsoever, the drives are just no longer there...

 

I can get them to reappear on reboot... or by removing them and plugging them back in...

 

Every single one of the drives that disappears was tested on a Mac and seems to be working just fine...

All SMART and health data on those drives are good. I have re-seated all the cables... I have run ~24h of memory tests without issues...
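Side note in case it helps whoever looks at the diagnostics: the same SMART data can be pulled from the Unraid terminal with smartctl (it ships with Unraid). A minimal example, with /dev/sdX standing in as a placeholder for the actual device:

smartctl -a /dev/sdX         # full report: overall health, attributes, error log
smartctl -t short /dev/sdX   # optionally kick off a short self-test first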

The drives are plugged into a mix of the SAS card and the motherboard directly; that seems to have no relation to the dropping either...

The drives are all connected through cages (https://www.amazon.com/gp/product/B00DGZ42SM/ref=ppx_yo_dt_b_asin_title_o05_s00?ie=UTF8&psc=1). I tried moving drives around to install them in slots where drives do seem to stay... it changes nothing.

 

Also, it seems that when I delete the array (Tools > New Config) all the disappearing drives come back for a while... but that's only temporary.

 

Additionally, when I was running some tests on a previous array (before I got all the drives that I wanted in there), the performance seemed abysmal... especially when moving from the NVMe cache to the array... I am talking single-digit MB/s transfers...
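If it helps to quantify this later: the raw sequential read speed of a single drive can be sanity-checked from the terminal, independent of shares and the mover, with something like the following (/dev/sdX is a placeholder and the drive should be otherwise idle):

hdparm -tT /dev/sdX   # -T = cached reads (memory/bus), -t = buffered reads from the disk itself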

 

I am out of ideas... Unraid has been a much more complicated and frustrating experience than I was imagining....

 

Any ideas?

unraid-diagnostics-20211016-1915.zip

Link to comment

Most errors come from the onboard SATA controller (3 ports used); please try attaching those disks to the LSI HBA.
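You can also watch the syslog live while a drive drops; the kernel usually logs link resets even when the GUI shows nothing. Something along these lines (the exact messages vary by controller):

tail -f /var/log/syslog | grep -iE 'ata|sas|reset|offline'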

 

7 hours ago, citizengray said:

it seems that it's impossible to get them to stay long enough to run pre-clear.

 

Or try preclearing a disk (with only that one disk connected in the system) in different combinations - different controller / with cage / without cage... - to rule out where the problem is.

Edited by Vr2Io
Link to comment

Thank you for taking the time to look into my situation - it is very much appreciated.

 

I would assume that most of the errors come from the onboard SATA controller, as this is where I plugged 4 of the 14TB drives, and it is where I am trying to run preclear on the brand new 14TB that I just acquired.

 

Assuming - for a second - that the issue is not coming from the SATA cages (I am starting to doubt this now), what could be causing such issues with the onboard SATA controller?

The 2 NVMe drives are working flawlessly... I know they are not SATA drives... but they are on the motherboard too...

 

The cages have 2 Molex power inputs. To connect them, I basically daisy-chained them with Molex Y splitters until all 5 cages were connected, and then plugged the last Y into the Molex connector coming directly from the PSU. Is that the right way to do it? Maybe the cages are somehow underpowered and that is causing issues? I did buy a 750W PSU with that in mind... and besides the CPU (Ryzen 5600G) I have nothing else really drawing power... no GPUs, etc.
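Doing some very rough math on that underpowered theory (typical figures, not measured ones): a 3.5" drive can pull on the order of 1.5-2A from the 12V rail at spin-up. Eight drives spinning up behind one daisy-chain is therefore roughly 8 x 2A ≈ 16A - and 20 drives would be ~40A - all passing through the single Molex pin and wire at the start of the chain, which is generally only good for about 10-11A. So the 750W total would not be the limiting factor; the one chain of splitters would be.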

 

I'll play around with disk allocation in and outside of the cages and report back... might take me a while though.

 

Alex

Link to comment

 

1 hour ago, citizengray said:

Is that the right way to do it ?

If they are connected well and reliably then there is no problem, but Y splitters should be avoided because there is a chance of voltage drop under heavy load.
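To put a rough number on that (illustrative values only): by Ohm's law the drop is V = I x R, so a marginal crimp adding ~0.1 ohm of contact resistance drops about 0.8V at an 8A spin-up load - enough to pull a 12V rail down to ~11.2V at the drives, below the usual ±5% tolerance.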

 

1 hour ago, citizengray said:

Maybe somehow the cages are underpowered and it's causing issues ? I did buy a 750W PSU

750W is far more than enough (if there is no monster GPU), but trying another PSU is also good practice.

Edited by Vr2Io
Link to comment

Just a comment as an Unraid user for over 10 years...  A few times over the years I have had issues with drives glitching or dropping off the array unexpectedly.  However, in every single case this was down to poor connections between Molex style plugs and sockets, mostly related to the use of splitters.  What may happen is that the "tube" part of the connector opens up slightly, and makes an intermittent connection on the "pin" part of the connector.  I have started replacing these, but a temporary fix has sometimes been to try closing up the tube a little with a fine-nosed pair of pliers.  

Link to comment

Thank you for the comments. I had noticed that the connections with the Molex Y splitters were not amazing, so I carefully pushed every pin into its tube with a small screwdriver from both sides to make sure there was a strong connection.

 

I am happy to replace the Y connectors, but then what should I use instead? Something like this: https://www.amazon.com/Computer-Supply-Splitter-Internal-Extension/dp/B08CRXG2FW/ref=sr_1_21?dchild=1&keywords=molex+power+splitter&qid=1634488522&sr=8-21 - but I am failing to see how that is really better?

 

Because from the PSU there is only one "line" for Molex; the 2 others are for SATA 15-pin connectors.

Also, I find it curious that the SATA cages have 2 Molex power inputs. Why bother with 2? Why not just one, as the power is ultimately coming from the same source anyway... That just strikes me as odd...

 

I guess I could also add another line of molex to the PSU directly... https://www.amazon.com/COMeap-Molex-Drive-Adapter-Modular/dp/B08DMFYDBN/ref=sr_1_4?dchild=1&keywords=multi+molex+power+cable&qid=1634488409&sr=8-4

 

For reference, my PSU is: https://www.amazon.com/gp/product/B079GFTB8F/ref=ppx_yo_dt_b_asin_title_o00_s02?ie=UTF8&psc=1

And I am connecting 5 cages https://www.amazon.com/gp/product/B00DGZ42SM/ref=ppx_yo_dt_b_asin_title_o01_s00?ie=UTF8&psc=1 with the hope of one day using 20 drives... I got 8 right now.

 

Is there something I can do to test whether the voltage is dropping? I have a regular voltmeter. What should I be looking for?

 

Link to comment
27 minutes ago, citizengray said:

Also I find it curious that the SATA Cages have 2 molex power input ? Why bother with 2 ?

With dual connectors it can handle more power; you will find on gaming GPU cards that the number of power sockets increases as the power draw increases. Each connector should be connected to an individual power wire/path.

 

27 minutes ago, citizengray said:

I have a regular voltmeter ?

Yes, measure the DC voltage; performing some measurements is still better than nothing. If you are not familiar with this then please ignore it.

 

Voltage measurement will provide some basic info; ripple voltage is also a key factor.

 

Edited by Vr2Io
Link to comment

Perusing the forums, I have also read about some folks who had their BIOS misconfigured somehow. Is there anything in particular that I should be looking for in the BIOS to check that it is set correctly? The only things I remember changing are the boot order of devices (basically disabling everything but the flash drive) and enabling virtualization (with the intent of using Docker - but I am nowhere near that).

 

Also, on an unrelated note: I started a preclear last night on those 2 new 14TBs; one was connected to the onboard SATA connectors (via a SATA cage) and the other one to the LSI SAS card (also via a SATA cage, but a different one). This morning the drive on the onboard SATA had halted preclear due to an error - basically the drive disappeared (cannot find file .../hdb etc.) - but the other one was still going strong.
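For what it's worth, one way to check whether the kernel still sees a drive after it "disappears", and which controller each remaining drive hangs off, is something like this from the terminal:

ls -l /dev/disk/by-id/     # drives listed by model/serial; a vanished drive's entries are gone
ls -l /dev/disk/by-path/   # shows which SATA port or HBA each device is attached through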

 

So I took the drive out of the SATA cage connected to the onboard SATA, moved it to a cage connected to the LSI and plugged it in there with the aim of restarting preclear. And as soon as I did that... the other drive failed too - same thing, it seemingly disappeared!?!

 

So very obviously I have some physical instability... but I really don't get it... everything is firmly installed and connected, all drives are properly bolted in, the case is sitting firmly on a shelf, and all the cages are securely mounted in a homemade structure solidly attached to the wall...

Link to comment
5 minutes ago, Vr2Io said:

With dual connectors it can handle more power; you will find on gaming GPU cards that the number of power sockets increases as the power draw increases. Each connector should be connected to an individual power wire/path.

 

Ok, so you are saying that I should have 2 direct lines to the PSU, and each connector should be on a different line? Ideally.

 

5 minutes ago, Vr2Io said:

 

Yes, measure the DC voltage; performing some measurements is still better than nothing. If you are not familiar with this then please ignore it.

 

Voltage measurement will provide some basic info; ripple voltage is also a key factor.

 

 

Ok, which cable should I take my measurements from? 1 yellow, 1 red, 2 blacks? (from the Molex)

 

Link to comment
7 minutes ago, citizengray said:

Ok, so you are saying that I should have 2 direct lines to the PSU, and each connector should be on a different line ? Ideally.

Sure.

 

7 minutes ago, citizengray said:

Ok, which cable should I take my measurements from ? 1 yellow, 1 red, 2 blacks ? (from the Molex)

 

All of them. Black is ground: red to black is 5V, yellow to black is 12V.
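For reference, the ATX spec allows about ±5% on those rails, so a healthy reading is roughly 11.4-12.6V between yellow and black and 4.75-5.25V between red and black. The most telling measurement is at the far end of the chain, at the cage, while the drives are spinning up or preclearing - that is when any drop across the splitters will show up.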

Edited by Vr2Io
Link to comment

Ok, I tried one more thing that I had never tried before.

 

I completely de-powered the whole column of 5 cages. They are basically entirely offline.

 

I have taken out the 2 brand new 14TB drives and connected them directly to the SAS card and to the SATA 15-pin power connectors from the PSU.

 

I started preclear on both. For now it is going strong (they are the only 2 drives on the entire system besides the 2 NVMes).

But what is even more impressive than the drives simply staying connected is the fact that preclear is also running much faster... a sustained read speed of ~260MB/s per drive for now. Whereas before, when I managed to run them through the cage (for a while at least), they were more around ~160-200MB/s.

 

I can't believe the likelihood of all 5 cages being defective at once... (I did not even buy them at the same time).

 

Could they somehow be "incompatible"? Both through the motherboard SATA controller AND the LSI SAS? It feels too much like a coincidence... what am I missing here :(

Link to comment
32 minutes ago, citizengray said:

I can't believe the likelihood of all 5 cages being defective at once... (I did not even buy them at the same time).

 

33 minutes ago, citizengray said:

Could it be somehow "incompatible" ?

 

Maybe eliminate the Y power splitters and try again first, to verify whether the problem still reproduces.

Link to comment

What I do is build something with DIY components. One side comes from the PSU and the other side connects to the hard disks. The 4 terminals are 12V, G, G, 5V.

 

Each "Insulated Terminal Barrier Strip" ( red black yellow ) will insert 3 or 4 wire and I will soldering to ensure reliable connection.

 

I build different hard-disk wire sets (black wire of the correct gauge) for different needs, and just detach/attach them at the terminal block.

 

Each wire connects a maximum of 4 disks and is inserted into DIY SATA plugs at the end.

 

Anyway: max 16 disks, and never any Y power splitters.

 

[Images: photos of the DIY terminal-block power wiring]

 

 

Or something like that; if a connection point has two insulated terminal barrier strips, you can increase the number of connected devices or change the direction of the wiring run.

 

[Image: alternative layout with two terminal barrier strips at one connection point]

Edited by Vr2Io
Link to comment

OK, preclear on those 2x 14TB completed successfully last night. Not a single error, very stable.

As a reminder, those were connected outside of the cages, directly to the PSU with SATA 15-pin cables and to the LSI card.


Next step in my testing: connect only one cage without a Y splitter and preclear another couple of disks in there, to see if the Y splitters are the issue.

(I am waiting to receive by mail the components to build my own Molex cables to power the cages.)

 

Thx again for the time you spent helping me.

Link to comment

Ok... there seems to be a world of progress here.

 

I was able to power 1 cage and add 4 disks to it (3x 14TB + 1x 8TB); the cage is powered directly from the PSU with a chained Molex cable. All connected to the LSI (on one SAS cable).

I started preclear on all 4 disks, and for now it is holding up well (performance and stability) - granted it is only ~3% into pass #1, but that is further than I have ever gotten.

 

And outside of the cages I have the 2x 14TB that just completed preclear + 1 old 3TB (which completed preclear independently some time ago), connected directly to SATA 15-pin power from the PSU and to the LSI with another SAS cable.

 

And somehow - for the moment - everything seems to be holding fine: 4 disks doing preclear, and the 2x 14TB building parity.

 

Wow. I actually still can't believe that all of this seems (to be confirmed) to come down to the power delivery to the cages... f*cking Y splitters... Can't wait to get the parts to build my own cables...

Link to comment
