jang430

Quick, need help. Array showing absurd # of reads and writes on several drives. Shares not accessible


Looks like it:
Aug 18 09:55:38 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!

Try it in a different slot, and also make sure it's sufficiently cooled.
 
You're also having problems with the parity disk, which is on the onboard SATA ports. It dropped offline, so there's no SMART report, but it looks more like a connection problem.
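
For anyone checking their own syslog for the same HBA fault, here is a minimal Python sketch that picks out lines like the one quoted above. The function name `find_hba_faults` is made up for illustration, and the assumption that the message can also appear with an `mpt3sas` prefix or a different `cm` index is mine, not something confirmed in this thread.

```python
import re

# Matches the fault message quoted in this thread. Allowing mpt3sas and other
# controller indexes is an assumption; adjust the pattern to your own log.
HBA_FAULT = re.compile(r"kernel: (mpt[23]sas_cm\d+): SAS host is non-operational")

def find_hba_faults(syslog_text):
    """Return (timestamp, controller) pairs for each fault line found."""
    hits = []
    for line in syslog_text.splitlines():
        match = HBA_FAULT.search(line)
        if match:
            # Classic syslog lines start with "Mon DD HH:MM:SS" (15 characters).
            hits.append((line[:15], match.group(1)))
    return hits

sample = "Aug 18 09:55:38 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!"
print(find_hba_faults(sample))
```

Feeding it the text of the syslog from a diagnostics zip should list every time the controller dropped out, which helps when comparing against parity check start times.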



Though the parity drive is connected directly onboard, the drive that it is reading from is connected to the HBA controller. Could this also cause that to happen? The case is currently wide open and air cooled.



Why did you blacklist smbus? If you boot in safe mode, does the problem persist?

 

Aug 18 12:51:05 Tower kernel: Command line: BOOT_IMAGE=/bzimage vfio-pci.ids=8086:a170 modprobe.blacklist=i2c_i801,i2c_smbus pcie_acs_override=downstream,multifunction initrd=/bzroot
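
To see at a glance which modules a boot line like this blacklists, a small Python sketch along these lines may help. The helper name `blacklisted_modules` is hypothetical; on a running Linux system the same string can be read from `/proc/cmdline`.

```python
def blacklisted_modules(cmdline):
    """Return the module names listed in modprobe.blacklist= on a kernel command line."""
    modules = []
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "modprobe.blacklist" and sep:
            # The value is a comma-separated list of module names.
            modules.extend(name for name in value.split(",") if name)
    return modules

# Command line copied from the log entry quoted in this post.
cmdline = ("BOOT_IMAGE=/bzimage vfio-pci.ids=8086:a170 "
           "modprobe.blacklist=i2c_i801,i2c_smbus "
           "pcie_acs_override=downstream,multifunction initrd=/bzroot")
print(blacklisted_modules(cmdline))
```

This only reports what the command line contains; it does not say who added the option or whether it is related to the drive errors.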

26 minutes ago, jang430 said:

Could this also cause that to happen?

No. Like I mentioned, it looked more like a connection problem, but since there's no SMART report it's more difficult to say for sure.

 

26 minutes ago, jang430 said:

The case is currently wide open and air cooled.

This isn't enough to cool the HBA. LSI HBAs should have some airflow around them; they are designed for server cases, where cooling isn't a problem.

Why did you blacklist smbus? If you boot in safe mode, does the problem persist?
 
Aug 18 12:51:05 Tower kernel: Command line: BOOT_IMAGE=/bzimage vfio-pci.ids=8086:a170 modprobe.blacklist=i2c_i801,i2c_smbus pcie_acs_override=downstream,multifunction initrd=/bzroot

If that's set, it's not something I did deliberately. Where do you set or undo that? I never tried safe mode, since I never thought any of the containers or plug-ins were causing the problem.


This isn't enough to cool the HBA. LSI HBAs should have some airflow around them; they are designed for server cases, where cooling isn't a problem.


When you say connection problem, do you mean the card might not be seated properly, or that the cables might not be connected to the drives properly?

The HBA is in a Node 804 case that has front and back fans in the motherboard chamber. Because I'm planning to upgrade drives, I've kept the case open, though the front and back fans are still running. Is this not enough?


9 minutes ago, jang430 said:

When you say connection problem

The connection problem was about the parity disk. Replace its cables and post new diags after booting, so there's a SMART report.

 

9 minutes ago, jang430 said:

I kept case open, apart from fans in front and back still functioning. Is this not enough?

Not IMHO. That doesn't mean it's the problem, but the HBA will get very hot like that and is likely overheating. You can touch the cooler to check, taking care not to burn a finger.

 


Ok. Noted. Thanks. Will investigate further.



It seems I may have found the root cause. The drives that were causing problems got their power from a Molex-to-SATA splitter, because I don't have enough SATA power connectors. When I joined the Molex from the PSU to the Molex on the splitter, I may not have connected them securely. Though there was power, whatever drive I connected to it ran into problems. I had changed SATA cables, moved to onboard ports, and then to other onboard ports, before I finally looked at the SATA power. I secured the two Molex halves properly, and so far I've finished my parity rebuild, attached other drives to the same power connector, and performed heavy transfers using the Unbalance plugin. So far so good. I never would have suspected this if I hadn't already run out of other things to troubleshoot.

[Attached image: image.png]


@johnnie.black, do you have any idea how to cool a Dell Perc H310? I'm using a tower case, so the area where the PCIe card sits is quite large. There's no active fan at the front of the case, and one at the back for exhaust only.


I cool mine with a large case side fan. If that's not an option, I would use a PCI fan adapter below the HBA. A last resort would be a small fan on the HBA heatsink, but I don't really like that option, as small fans are much more likely to fail and a failure could go unnoticed.

On 8/21/2019 at 8:07 PM, jang430 said:

It seems I may have found the root cause. The drives that were causing problems got their power from a Molex-to-SATA splitter, because I don't have enough SATA power connectors. When I joined the Molex from the PSU to the Molex on the splitter, I may not have connected them securely. Though there was power, whatever drive I connected to it ran into problems. I had changed SATA cables, moved to onboard ports, and then to other onboard ports, before I finally looked at the SATA power. I secured the two Molex halves properly, and so far I've finished my parity rebuild, attached other drives to the same power connector, and performed heavy transfers using the Unbalance plugin. So far so good. I never would have suspected this if I hadn't already run out of other things to troubleshoot.

[Attached image: image.png]

I just registered with the forum to thank you from the bottom of my heart for sharing this. I was having the exact same error issue on my parity drive and was using SATA splitters in exactly the same way, and of course it turned out that was the issue. I was tearing my hair out trying to figure out why a perfectly good brand-new drive started suddenly giving me read errors yet passed all SMART tests and worked just fine in another system. Put it back in, secured the SATA splitter, and now it looks like it's working normally again. What a wonderful, supportive community this is!


@CloudVader, glad that sorted out your problem :D . 

 

My quest isn't done yet :D . That proved to be part of the solution for me, but there's another issue from the post I shared earlier, where all drives in my array have lots of errors (screenshot in an earlier post). @johnnie.black, this time around I was able to swap the Dell Perc I'm using for another unit. So now I have replaced the HBA, used different cables, and connected the SATA power splitter securely, and I still have the same problem: thousands of errors across all drives. This happened during a parity check. This time I suspect the HBA controller rather than the power splitter, probably its ventilation. What I don't understand is that I've had this HBA for over a year, maybe two, always in the same Fractal Design Node 804 with front and rear fans. That had been enough for the longest time. Why are the parity errors showing up now? It seems like nothing changed.

 


Do you have an alternative power supply you could try?    Power supplies often degrade over time and if your supply is struggling to keep all drives powered during a parity check that might explain the errors.

2 hours ago, jang430 said:

Wondering if the Dell Perc H310 requires a fan on the heatsink? 

I recommend having some airflow around it. I'm not a fan of small fans, though; I prefer a large fan close to it, such as one on the side of the case if it's a tower, or a PCI slot adapter.


I also prefer not having any small fans on the Dell Perc. I'm attaching a picture of the Node 804 here, so you can visualize the amount of air that passes through.

[Attached image: image.png]

I didn't use to have any errors on the drives; all of a sudden, I do. In the most recent test that still produced multiple errors on the drives, I had both the left and right sides of my case wide open. I'm not sure whether a lack of pressure within the chamber meant not enough airflow through the Dell heatsink :D If there is such a thing.


As it is summer, there is one more possibility: is the ambient temperature in the room where the system stands higher than usual?

If you have been operating the system at borderline temperatures over the year, this could be a reason. (And a wide-open case does not help heat dissipation, as the airflow is no longer directed over the hot components.)

 

In my office, where my Unraid box stands (also a Node 804, by the way), the temperature this summer is significantly higher than in previous years, and this affects all the equipment within the case. So I am currently running all fans at max level to keep temperatures in the case at an acceptable level.

 

Cooling the HBA with a 40mm fan might not seem to be the best option, but it definitely helps in controlling the temperature of the HBA. Just use a quality fan like the ones from Noctua. If you use a motherboard fan connector to power the fan, you might also be able to get a warning when the fan stops turning.

 

 

Edited by Kevek79
Typo

On 8/21/2019 at 8:07 PM, jang430 said:

It seems I may have found the root cause. The drives that were causing problems got their power from a Molex-to-SATA splitter, because I don't have enough SATA power connectors. When I joined the Molex from the PSU to the Molex on the splitter, I may not have connected them securely. Though there was power, whatever drive I connected to it ran into problems. I had changed SATA cables, moved to onboard ports, and then to other onboard ports, before I finally looked at the SATA power. I secured the two Molex halves properly, and so far I've finished my parity rebuild, attached other drives to the same power connector, and performed heavy transfers using the Unbalance plugin. So far so good. I never would have suspected this if I hadn't already run out of other things to troubleshoot.

[Attached image: image.png]

 

FWIW, you shouldn't use the molded-style splitters; they will fail and probably catch fire.

 

[Attached images: SATA1.JPG, SATA2.JPG]

 


@Kevek79

 

Thanks for the tip. I had a Noctua fan on hand and installed it last night, connected directly to the M/B. I can see the fan running.

 

Unfortunately, I'm still having issues. Same problem. This effectively rules out HBA temperature as the root cause. I am attaching my diagnostics again here. I hope people here can help identify the issue. @johnnie.black, hope you can take a look.

 

@Michael_P, I am using the one marked X :D . My PSU doesn't have enough modular ports to connect the correct type of splitter. I hope it's not that serious.

tower-diagnostics-20190831-0027.zip

11 hours ago, jang430 said:

 

@Michael_P, I am using the one marked X :D . My PSU doesn't have enough modular ports to connect the correct type of splitter. I hope it's not that serious.

The one with the check mark is a splitter too, but it doesn't have molded connectors. The problem with the molded connectors is how the wires are laid in the mold: the only way to QC them and ensure there aren't any strands touching or too close is to X-ray them as they come off the line, and at that price point they're not doing it. Google "molded sata power splitter" and you can read some horror stories.

I had one in my desktop PC for years. One morning I noticed it had shut itself off overnight, so I turned it back on: bright flash, smoke, unhappy noises, and the splitter was on fire. I thought it was just a fluke until, a couple of days later (crazy coincidence), I happened across a YouTube video of a guy building a NAS who had the same thing happen to him. He made a video dissecting the connectors and showing the problem.

 

 

Looks like the HBA is the problem. Try it in a different slot if possible; if not, try a different HBA.

I arrived at the same conclusion. Thanks for the help @Johnnie.black

