HDD Failure, HBA Failure, or Both?


natecook


I've been rolling along for years with 2 cache, 2 parity, and 22 data disks. The data and parity disks run through two SASLP cards and one SAS2LP card, and the cache drives connect directly to the motherboard SATA ports. About one time in three during a monthly parity check, one finicky SASLP (the one with data disks 1-8) would start spitting back errors for all 8 of its drives. I assumed it was just overheating, since the issue became very sporadic once I added better ventilation.

 

On Monday, disk 5 on that SASLP was disabled. I/O errors, old drive, simple. I swapped it out and felt great as I started the rebuild. 5 hours later, the SASLP dropped offline. As is tradition, I shut down and tried again, still feeling fine. Another 4-5 hours and the SASLP was offline again. Now, when I reboot, disks 2 and 5 both show up as Unmountable. I tried adding more fans, hoping to just get the data fixed before I started swapping hardware, and now the card fails after just 5 minutes. I ordered a SAS2LP to replace it, thinking all of my issues would be resolved without that bad SASLP.

 

Now, with the brand new SAS2LP installed, the parity sync runs at ~2 MB/s before it eventually fails. Looking at the logs, disk 8 is now throwing tons of I/O errors. FWIW, disk 8 is from the same batch of Seagate 4TB hard drives as the failed disk 5, but so is disk 6, and it's fine.

 

What are my next steps here? I feel like I need to buy better HBA cards and trash some drives, but I don't know where to begin.

pangu-diagnostics-20200221-1859.zip

Link to comment

All LSI cards installed. I noticed immediately that my parity sync speeds were much improved: 125-150 MB/s on the LSI controllers vs 80-100 MB/s with the SASLP controllers. Unfortunately, I am still experiencing these read error issues on drives that are seemingly good.

 

The array's status is:

  • disk 5 - disabled and unmountable (this disk was freshly installed and has not been rebuilt)
  • disk 2 - unprotected and unmountable (this drive was fine until the rebuild issues began)
  • all other disks - protected and mounted

However, when attempting a rebuild, I get random errors on a disk. Sometimes it's disk 4, sometimes 7, sometimes 8. I've tried swapping the cables and controllers around. The only consistent factor is that the read errors pop up on one of disks 1-8, which leads me to believe that the failing SASLP left some damage on those disks. What am I to do next?
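
One way to make this kind of narrowing-down more systematic is to check the erroring disks against every component they share (controller, power cable, and so on), not just the controller. Below is a minimal Python sketch of that idea. The wiring map is entirely hypothetical: the controller names, power-cable names, and which disk hangs off which are invented for illustration and are not taken from this server.

```python
# Hypothetical wiring map: disk number -> the components it depends on.
# All names and groupings below are made up for illustration only.
WIRING = {}
for disk in range(1, 13):
    WIRING[disk] = {
        "controller": "hba-A" if disk <= 4 else "hba-B" if disk <= 8 else "hba-C",
        "power": "psu-cable-1" if disk <= 8 else "psu-cable-2",
    }


def shared_components(wiring, failing_disks):
    """Return the (kind, name) components that every failing disk depends on."""
    common = None
    for disk in failing_disks:
        parts = set(wiring[disk].items())
        # Intersect with what previous failing disks had in common.
        common = parts if common is None else common & parts
    return sorted(common or [])


if __name__ == "__main__":
    # Disks that threw read errors across several rebuild attempts.
    print(shared_components(WIRING, [4, 7, 8]))
```

In this made-up layout the erroring disks sit on different controllers but share one power cable, so the power cable is the only component left in the result. An empty result would mean the failing disks have nothing in common in the map, which points back at the drives themselves or at something the map does not capture.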

pangu-diagnostics-20200223-1721.zip

Link to comment

@johnnie.black that makes so much sense. I was so caught up in my AOC vs LSI controller mixup that I didn’t think about power. I tried a few repair options, but it appears that this whole issue was tied to a bad power cable from my PSU. With a new PSU SATA cable, it’s been running a sync for over 6 hours without issue. Absolutely wild! I would have never thought to check a power cable.

 

What did you see that made you think it was power?
