(Solved/Workaround) HBA/SAS Issues. RAID failing.


Jaster


After my controller went bad, I got myself a new one.

 

I could run a full rebuild and the system seemed quite fine. After using it for about two hours, it seems to become unstable. Disks start disappearing and I can see some controller errors in the logs...

The issues started after I started using VMs. Not sure if this is somehow connected. What can I do to narrow the issue down?

knowlage-diagnostics-20190206-2113.zip

Edited by Jaster
Link to comment
Feb 7 12:56:25 Knowlage kernel: mpt2sas_cm0: diag reset: FAILED
Feb  7 12:56:12 Knowlage kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Feb  7 12:56:24 Knowlage kernel: mpt2sas_cm0: _base_fault_reset_work: Running mpt3sas_dead_ioc thread success !!!!
Feb  7 12:56:24 Knowlage kernel: sd 9:0:0:0: [sdj] Synchronizing SCSI cache

But it seems to recover by itself...? I keep getting weird errors. How can I diagnose that? What could it be?
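One way to see how often the controller actually falls over is to filter the syslog for the driver's fault messages. The following is only a rough sketch: the marker strings are taken from the log lines quoted above, while the function name `controller_faults` and the assumption that every fault line carries a `mpt2sas_cmN` or `mpt3sas_cmN` prefix are mine.

```python
import re

# Marker strings copied from the log excerpt in this thread.
# Other firmware or driver versions may word these differently.
FAULT_MARKERS = (
    "SAS host is non-operational",
    "diag reset: FAILED",
    "mpt3sas_dead_ioc",
)

# Captures the controller name (e.g. mpt2sas_cm0) and the message text.
DRIVER_LINE = re.compile(r"kernel: (mpt[23]sas_cm\d+): (.*)")


def controller_faults(lines):
    """Return a list of (controller, message) for lines that look like HBA faults."""
    faults = []
    for line in lines:
        match = DRIVER_LINE.search(line)
        if match and any(marker in match.group(2) for marker in FAULT_MARKERS):
            faults.append((match.group(1), match.group(2).strip()))
    return faults


sample = [
    "Feb 7 12:56:25 Knowlage kernel: mpt2sas_cm0: diag reset: FAILED",
    "Feb  7 12:56:12 Knowlage kernel: mpt2sas_cm0: SAS host is non-operational !!!!",
    "Feb  7 12:56:24 Knowlage kernel: mpt2sas_cm0: _base_fault_reset_work: Running mpt3sas_dead_ioc thread success !!!!",
    "Feb  7 12:56:24 Knowlage kernel: sd 9:0:0:0: [sdj] Synchronizing SCSI cache",
]

for controller, message in controller_faults(sample):
    print(controller, "->", message)
```

Feeding it the full syslog from the diagnostics zip instead of `sample` would show whether the faults cluster around particular times, for example when drives spin up.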

knowlage-diagnostics-20190207-1621.zip

 

EDIT: I think I found something...

Taking a close look at the log below, it is only the 8TB disks that are failing. All other disks seem to operate without issues. Any suggestions?

 

Feb 7 16:19:42 Knowlage kernel: md: disk1 read error, sector=10744367352
Feb 7 16:19:42 Knowlage kernel: md: disk10 read error, sector=10744367352
Feb 7 16:19:42 Knowlage kernel: md: disk0 read error, sector=10744367352
Feb 7 16:19:42 Knowlage kernel: md: disk3 read error, sector=10744367432
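To check whether the read errors really are confined to certain disks, the `md:` lines can be grouped by disk number. This is a minimal sketch, assuming the syslog format shown above; the function name `read_errors_by_disk` is made up for illustration, and mapping disk numbers back to drive sizes still has to be done by hand from the array assignment.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Feb 7 16:19:42 Knowlage kernel: md: disk1 read error, sector=10744367352
MD_READ_ERROR = re.compile(r"md: disk(\d+) read error, sector=(\d+)")


def read_errors_by_disk(lines):
    """Return {disk_number: [sector, ...]} for every md read error found."""
    errors = defaultdict(list)
    for line in lines:
        match = MD_READ_ERROR.search(line)
        if match:
            errors[int(match.group(1))].append(int(match.group(2)))
    return dict(errors)


sample = [
    "Feb 7 16:19:42 Knowlage kernel: md: disk1 read error, sector=10744367352",
    "Feb 7 16:19:42 Knowlage kernel: md: disk10 read error, sector=10744367352",
    "Feb 7 16:19:42 Knowlage kernel: md: disk0 read error, sector=10744367352",
    "Feb 7 16:19:42 Knowlage kernel: md: disk3 read error, sector=10744367432",
]

summary = read_errors_by_disk(sample)
for disk, sectors in sorted(summary.items()):
    print(f"disk{disk}: {len(sectors)} read error(s), first sector {sectors[0]}")
```

Several disks reporting an error at the same sector within the same second, as in the excerpt, looks more like a controller or link problem than like independent media defects, though that is an inference and not something the log proves.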

 

Edited by Jaster
Link to comment

I've been reading up on other boards like FreeNAS, etc.

I found out two facts:

1. LSI HBAs tend to overheat quite quickly. Sometimes the thermal grease has dried out or is not seated correctly. You will notice this because the heatsink stays quite cool while the chip underneath overheats. A lack of good airflow will also overheat the HBA very quickly, so make sure there is plenty of airflow around it.

2. There are several issues with older Series (2xxx) regarding "large" drives. Depending on the drive and controller, the problems start at 6TB. I could not figure out which drive/chipset combinations are affected. A solution (other than upgrading to a 3xxx series) hasn't been found yet.

Link to comment
Just now, Jaster said:

There are several issues with older Series (2xxx) regarding "large" drives.

There appear to be some issues with FreeBSD (FreeNAS) and some large-capacity drives, but I have never seen any such issues on this forum or with Linux in general. I have some myself with 8TB disks, without any problems.

Link to comment

I see the 8TB drives dropping again and again. It is always errors/read errors. And, as you've seen, the controller then becomes non-operational. Never any errors on other drives...

I'm trying to run the 8TBs via the board and all others via the HBA. Maybe there will be some results.
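When splitting drives between the board and the HBA like this, it helps to confirm which controller each disk actually ended up on. The sketch below relies on the fact that on Linux each `/sys/block/sdX` entry is a symlink into the device tree, so the resolved path contains the PCI address of the controller. The function names and the example path in the comment are hypothetical, and the PCI addresses will differ per machine.

```python
import os
import re

# PCI addresses look like 0000:01:00.0 (domain:bus:device.function).
PCI_ADDRESS = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]")


def controller_of(sysfs_path):
    """Return the last PCI address in a resolved sysfs path, or None.

    Hypothetical example of a resolved path:
    /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host9/.../block/sdj
    The last PCI address is the device the disk hangs off (here 0000:01:00.0).
    """
    matches = PCI_ADDRESS.findall(sysfs_path)
    return matches[-1] if matches else None


def disks_by_controller(sys_block="/sys/block"):
    """Group sdX device names by the PCI address of their controller."""
    groups = {}
    if not os.path.isdir(sys_block):
        return groups
    for name in sorted(os.listdir(sys_block)):
        if not name.startswith("sd"):
            continue
        real = os.path.realpath(os.path.join(sys_block, name))
        groups.setdefault(controller_of(real), []).append(name)
    return groups


if __name__ == "__main__":
    for pci, disks in disks_by_controller().items():
        print(pci, " ".join(disks))
```

The printed PCI addresses can then be compared against the output of `lspci` to tell the HBA from the onboard SATA controller.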

 

P.S.: Everything seemed okay as long as I didn't have more than two 8TBs in there; the next two introduced all the issues. Even if the RAID isn't doing anything, after a while the 8TBs will drop (often when they need to be spun up).

Edited by Jaster
Link to comment

I had an LSI HBA (a Dell H310) fail in the heat of last summer. I replaced it and that fixed the immediate problem. I then investigated the bad card and ended up taking the heatsink off the chip. The thermal compound had dried out and just crumbled away. So I cleaned it with isopropyl alcohol, replaced it with fresh MX-4 and put the heatsink back on and I haven't been able to make the card fail ever since. I notice that some people attach small fans to the heatsink to improve the local air flow but I haven't found that necessary.

Link to comment

After removing all 8TB drives from the HBA and adding some extra cooling, the system seems to be stable.

If someone is interested in further investigation, I'm happy to assist. Otherwise I'll let it pass as the issue is handled for me - it's not solved, but it works for me in the given configuration.

Link to comment
