Jaster

(Solved/Workaround) HBA/SAS Issues: RAID Failing

9 posts in this topic

After my previous controller went bad, I got a new one.

I was able to run a full rebuild and the system seemed fine. After about two hours of use, however, it becomes unstable: disks start disappearing and I see controller errors in the logs...

The issues started after I began using VMs. I'm not sure whether that is connected. What can I do to rule it out?

knowlage-diagnostics-20190206-2113.zip

Edited by Jaster

Unraid lost communication with the HBA:

Feb  6 18:58:47 Knowlage kernel: mpt2sas_cm0: SAS host is non-operational !!!!

It doesn't look related to the VMs, but reboot and start the same VM again to see whether the same thing happens. If it doesn't, try a different PCIe slot if one is available.

Feb 7 12:56:25 Knowlage kernel: mpt2sas_cm0: diag reset: FAILED
Feb  7 12:56:12 Knowlage kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Feb  7 12:56:24 Knowlage kernel: mpt2sas_cm0: _base_fault_reset_work: Running mpt3sas_dead_ioc thread success !!!!
Feb  7 12:56:24 Knowlage kernel: sd 9:0:0:0: [sdj] Synchronizing SCSI cache

But it seems to recover on its own? I keep getting strange errors. How can I diagnose this, and what could be causing it?

knowlage-diagnostics-20190207-1621.zip
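A small script can make a long syslog easier to scan for controller trouble. The sketch below is a hypothetical helper (not part of Unraid or the mpt2sas driver) that assumes the standard syslog line layout seen in the excerpts above. It keeps only mpt2sas/mpt3sas kernel lines that look like the HBA itself faulting or resetting, so their timestamps can be lined up against the moments disks drop out. The list of "fault" markers is an assumption taken from the lines in this thread, not a complete catalogue of driver messages.

```python
import re
from typing import Iterable, List, Tuple

# Assumed syslog layout, matching the excerpts quoted in this thread, e.g.
# "Feb  7 12:56:12 Knowlage kernel: mpt2sas_cm0: SAS host is non-operational !!!!"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"(?P<ioc>mpt[23]sas_cm\d+): (?P<msg>.*)$"
)

# Message fragments treated as "the HBA itself faulted or reset".
# Chosen from the lines in this thread only; an assumption, not exhaustive.
FAULT_MARKERS = ("non-operational", "diag reset", "dead_ioc", "fault")


def hba_faults(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (timestamp, controller, message) for each fault-like mpt2sas/mpt3sas line."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and any(marker in m.group("msg").lower() for marker in FAULT_MARKERS):
            hits.append((m.group("stamp"), m.group("ioc"), m.group("msg")))
    return hits


HBA_SAMPLE = [
    "Feb 7 12:56:25 Knowlage kernel: mpt2sas_cm0: diag reset: FAILED",
    "Feb  7 12:56:12 Knowlage kernel: mpt2sas_cm0: SAS host is non-operational !!!!",
    "Feb  7 12:56:24 Knowlage kernel: mpt2sas_cm0: _base_fault_reset_work: Running mpt3sas_dead_ioc thread success !!!!",
    "Feb  7 12:56:24 Knowlage kernel: sd 9:0:0:0: [sdj] Synchronizing SCSI cache",
]

# The "sd ... Synchronizing SCSI cache" line is not an mpt*sas line, so it is skipped.
for stamp, ioc, msg in hba_faults(HBA_SAMPLE):
    print(stamp, ioc, msg)
```

To scan a real log, pass an open file instead of the sample list, e.g. `hba_faults(open(path, errors="replace"))`, where `path` is assumed to point at the syslog (on Unraid typically `/var/log/syslog`, or the syslog copy inside a diagnostics zip).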

 

EDIT: I think I found something...

Looking closely at the log excerpt below, the failing disks are all 8TB drives. All other disks seem to operate without issues. Any suggestions?

Feb 7 16:19:42 Knowlage kernel: md: disk1 read error, sector=10744367352
Feb 7 16:19:42 Knowlage kernel: md: disk10 read error, sector=10744367352
Feb 7 16:19:42 Knowlage kernel: md: disk0 read error, sector=10744367352
Feb 7 16:19:42 Knowlage kernel: md: disk3 read error, sector=10744367432
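To check this observation more systematically, the md read errors can be grouped by array slot. The sketch below uses hypothetical helper names and a regex assumed from the four lines quoted above. It also lists sectors that failed on more than one slot; the same sector failing on several disks at once tends to point at a shared component (controller, cabling, backplane) rather than at each drive's media. Mapping slot names such as `disk1` or `disk10` to drive sizes still has to be done by hand from the array view.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Regex assumed from the md lines quoted in this thread, e.g.
# "Feb 7 16:19:42 Knowlage kernel: md: disk1 read error, sector=10744367352"
MD_ERR_RE = re.compile(r"md: (?P<disk>disk\d+) read error, sector=(?P<sector>\d+)")


def read_errors_by_disk(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Group the sectors reported in md read-error lines by array slot name."""
    errors: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        m = MD_ERR_RE.search(line)
        if m:
            errors[m.group("disk")].append(int(m.group("sector")))
    return dict(errors)


def shared_sectors(errors: Dict[str, List[int]]) -> Dict[int, List[str]]:
    """Return sectors that failed on more than one slot, with the slots involved."""
    by_sector: Dict[int, List[str]] = defaultdict(list)
    for disk, sectors in errors.items():
        for sector in set(sectors):
            by_sector[sector].append(disk)
    return {s: sorted(d) for s, d in by_sector.items() if len(d) > 1}


SAMPLE = [
    "Feb 7 16:19:42 Knowlage kernel: md: disk1 read error, sector=10744367352",
    "Feb 7 16:19:42 Knowlage kernel: md: disk10 read error, sector=10744367352",
    "Feb 7 16:19:42 Knowlage kernel: md: disk0 read error, sector=10744367352",
    "Feb 7 16:19:42 Knowlage kernel: md: disk3 read error, sector=10744367432",
]

errors = read_errors_by_disk(SAMPLE)
for disk, sectors in sorted(errors.items()):
    print(disk, len(sectors), "read error(s)")
print("sectors hit on several disks:", shared_sectors(errors))
```

On the four sample lines, `shared_sectors` reports sector 10744367352 for `disk0`, `disk1` and `disk10`, which is the kind of pattern that fits a controller-side problem.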

 

Edited by Jaster

It's either a failing or overheating HBA, a counterfeit Chinese HBA, or some compatibility problem with your board.

I've been reading up on other forums, such as FreeNAS's.

I found two relevant points:

1. LSI HBAs tend to overheat quite quickly. Sometimes the thermal paste has simply dried out or is not seated correctly; you can tell because the heatsink stays fairly cool, since the heat isn't reaching it. A lack of good airflow will overheat the HBA very quickly, so make sure there is plenty of airflow around it.

2. There are several issues with the older 2xxx series regarding "large" drives. Depending on the drive and controller, they start at 6TB. I could not figure out which drive/chipset combinations are affected. No solution other than upgrading to a 3xxx-series controller has been found yet.

Jaster said:

There are several issues with the older 2xxx series regarding "large" drives.

There appear to be some issues with FreeBSD (FreeNAS) and some large-capacity drives, but I've never seen any on this forum or with Linux in general. I run some of these controllers with 8TB disks myself without any problems.

I see the 8TB drives dropping again and again, always with errors/read errors, and as you've seen, the controller then becomes non-operational. There are never any errors on the other drives...

I'm now trying to run the 8TB drives from the motherboard's onboard ports and all the others via the HBA. Maybe that will tell us something.

P.S.: Everything seemed fine as long as I didn't have more than two 8TB drives in there; adding the next two introduced all the issues. Even when the RAID isn't doing anything, the 8TB drives drop after a while (often when they need to spin up).

Edited by Jaster

I had an LSI HBA (a Dell H310) fail in the heat of last summer. I replaced it and that fixed the immediate problem. I then investigated the bad card and ended up taking the heatsink off the chip. The thermal compound had dried out and just crumbled away, so I cleaned it off with isopropyl alcohol, applied fresh MX-4 and reattached the heatsink. I haven't been able to make the card fail since. I've noticed that some people attach small fans to the heatsink to improve local airflow, but I haven't found that necessary.

After removing all 8TB drives from the HBA and adding some extra cooling, the system seems to be stable.

If anyone is interested in investigating further, I'm happy to assist. Otherwise I'll leave it here, since the issue is handled for me: it's not solved, but it works in this configuration.
