Help diagnosing key failure point

Followers

February 18, 20251 yr

Hi there,

I’m using unRAID 7 currently and have been using my system for nearly 9 years. Recently, I’ve been having disk errors that are repeating and need some help to fine tune how I should diagnose the cause of this. Typically, the problem starts like this:

1) Array HDD disk temperatures start turning to “*”

2) These disks eventually start looking like they’re spun down with the grey circle icon, but can’t be spun up

3) Eventually, unRAID will see errors coming out and will mark them as failed disks

4) If I keep letting unRAID run, the system will lock up and I need to force a system restart

Usually this happens over a period of 12-24hr to my array of 17 disks, and ONLY if my Docker *arr stack is running on my SSD cache pool. About 1-2 disks would have errors/failure, but the pattern is random within the set of 17 disks after I restart the array.

As a background, the HDD array disks are connected to a Supermicro SAS backplane, which is connected to a LSI 9305-24i HBA PCIe card. I’ve been running this hardware configuration with same cables stably over the past 5+ years. In addition, I connect around 3 SATA SSDs to the LSI SATA card for my ZFS mirror SSD cache pool.

What I did to figure out the problem were these steps:

1) Ran Memtest86+ successfully and changed to new PSUs

2) Reattached my 3 SATA SSDs to a separate LSI2008 8i card with different cables. Once that was done, the system runs stably. It looks like it’s either a problem with the SATA connectors or the LSI 9405-24i. The LSI runs well when only the backplane and HDD are attached, and if the SSDs are attached onto another separate HBA.

3) So basically, I can solve this problem by separating the cache SSDs onto a different HBA; or, if I don’t run my Docker *arr stack.

At this point, I’m trying to see based on the diagnostics, is it likely a HBA card, SATA cables or power connector failure? Can anyone point me on the next steps to figure out which is the exact cause of my problems? Many thanks!

hk-homelab-diagnostics-20250218-1533.zip

Quote

Solved by ipreferpie

March 22, 20251 yr

Go to solution

February 18, 20251 yr

Community Expert

13 minutes ago, ipreferpie said:

1) Array HDD disk temperatures start turning to “*”

Which disk(s) is currently showing that?

Quote

February 18, 20251 yr

Author

4 minutes ago, JorgeB said:

Which disk(s) is currently showing that?

Right now, I had rebooted and rebuilt the failed disks, so none show failed disks. But around 2 failures ago, I remember was Disk 5 & 11. I’ve have run through around 6-7 disk failure/rebuild scenarios over the past 2 weeks. The most recent one was strange is that unRAID just hanged w/o disk failures and I’m just doing a parity check now.

Quote

February 18, 20251 yr

Community Expert

Post new diags when there's a problem, the syslog starts over after every boot.

Quote

February 18, 20251 yr

Community Expert

The syslog in the diagnostics is the RAM copy and only shows what happened since the reboot. It could be worth enabling the syslog server to get a log that survives a reboot so we can see what happened prior to the reboot. The mirror to flash option is the easiest to set up, but if you are worried about excessive wear on the flash drive you can put your server’s address into the Remote Server field.

Quote

February 18, 20251 yr

Author

Ok I’ll try to replicate the problem and have enabled syslog to flash. Will post new diagnostics once I start getting problems again. Many thanks

Quote

February 20, 20251 yr

Author

Here’s the updated most recent diagnostics. I’ve turned on syslog to flash. What happened this time was slightly different in that there were no temperature blanks, dropped/failed array disks; instead, just a full crash and system reboot. But I’m still very certain it has something to do with the HBA, cables, or power. Many thanks!

hk-homelab-diagnostics-20250220-0854.zip

Quote

February 20, 20251 yr

Community Expert

Feb 20 06:30:02 HK-HomeLab kernel: mpt3sas_cm0 fault info from func: mpt3sas_base_make_ioc_ready

HBA problem, make sure it's well seated and sufficiently cooled, you can also try a different PCIe slot.

Quote

1 month later...

March 22, 20251 yr

Author
Solution

Just after several troubleshooting procedures back and forth in trying cables, PDU, repasting, adding fans and HBA, I discovered the issue. It’s the firmware on my LSI 9405-24i. After updating from 16.00.01 to 16.00.12 everything works now. Before, it was the ZFS and BTRFS SSD trim commands that caused the drops. Thank you for all the forum help in helping me narrow it down! Pls mark as solved.

Quote

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Followers

Go to topic listing

Help diagnosing key failure point

Featured Replies

Solved by ipreferpie

Join the conversation

Account

Navigation

Search

Configure browser push notifications

Chrome (Android)

Chrome (Desktop)

Safari (iOS 16.4+)

Safari (macOS)

Edge (Android)

Edge (Desktop)

Firefox (Android)

Firefox (Desktop)