
Troubleshooting frequent drive disables


Solved by gregtamaki


Hello,

For the past few months I have been having one or two drives get disabled in my Unraid box (dual parity), *almost* always on boot. It was maybe once a month over the past year, then twice a week this past month. It's often the same 3-4 drives/slots (5 and 7, I think, are the usual culprits). Additionally, one of my cache SSDs hasn't connected since I booted last night, but I'm in the middle of another rebuild so I can't reboot to see if it comes back.


Setup:
- PRIME B660-PLUS D4 motherboard with 12th Gen i5
- SuperMicro 846 4u case and stock 1200w PSU
- BPN-SAS2-846EL1 Backplane
- LSI 9207i
- 2x SAS parity drives, ManyX SATA drives, and 3 SSDs for cache
- I have my critical stuff backed up to a NAS, but I'm really emotionally attached to my Linux ISOs and would not like to have to find them again.

 

Additional background:
- The first time I plugged it in after a long-distance move two years ago, I got a lovely pop/sizzle from the Supermicro mobo near where the power supply connects, and it never powered on again. I have since replaced all of the internal components save for the backplane/drives and one of the PSUs: new ASUS mobo, 64GB memory, LSI card. One PSU and the LSI SAS card were dead (at least I think that was the issue).

Things I have done:
-In the past few weeks I added two 16TB drives and removed four of my 4TB drives to reduce failure points, following SpaceInvaderOne's guide (https://www.youtube.com/watch?v=nV5snitWrBk). I think on the third drive the script that writes the zeros didn't complete and I broke parity. Scary next day of rebuilding parity!
-Killed the spin-down on two of my 8TB Seagates per this thread (https://forums.unraid.net/topic/103938-69x-lsi-controllers-ironwolf-disks-disabling-summary-fix/); that seems to have been the cause of the disconnects that weren't on boot
-Swapped to the backup PSU I had (the loud 1200W one that originally came with the case and was never plugged in)
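Regarding the zero-write script in the first point: a quick sanity check I could have run is comparing the cleared drive against /dev/zero before trusting parity. The real target would be the cleared disk (something like /dev/sdX); the sketch below runs against a small zero-filled image instead so nothing touches a live drive.

```shell
# Make a 4 MiB all-zeros image standing in for the cleared drive.
dd if=/dev/zero of=/tmp/fakedisk.img bs=1M count=4 status=none

# cmp stops at the first differing byte; if it instead reports EOF on
# the image, every byte matched /dev/zero, i.e. the "drive" is fully
# zeroed. (|| true because cmp exits non-zero on EOF.)
result=$(cmp /tmp/fakedisk.img /dev/zero 2>&1 || true)
case "$result" in
  *"EOF on /tmp/fakedisk.img"*) status="all zeros" ;;
  *) status="non-zero bytes found" ;;
esac
echo "$status"
```

If this prints anything other than "all zeros", the clear didn't finish and adding the drive would break parity.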

 

My options to try next:
-Swap in a backup backplane (BPN-SAS2-846EL1) that I ordered on eBay when the disconnects started happening more frequently

-Swap in another LSI 9207i card I bought as a backup after the first one died

-Swap out the SAS cables (again)
-Order a power distribution board (I think it's a PDB-PT846-2824), but they are pretty expensive. If I were in the US I would have just bought a second 846 by now as a parts donor. Considering just getting a regular ATX supply instead and following one of the guides where people mount one in the case.
-Connect the cache pool SSDs directly to the motherboard; it looks like I should have had it like this all along, since my card/backplane doesn't support TRIM
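For that last option, a way to confirm TRIM actually reaches the SSDs once they're on the motherboard ports (device and mount point are examples for this box; commands are echoed as a sketch since they need root and the real hardware):

```shell
SSD=/dev/sdb       # placeholder for one cache SSD
POOL=/mnt/cache    # Unraid mounts the cache pool here

# Non-zero DISC-GRAN/DISC-MAX columns mean the kernel can pass
# discards through to the drive; fstrim should then report bytes
# trimmed instead of erroring out.
echo "lsblk --discard $SSD"
echo "fstrim -v $POOL"
```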


Questions:
-Is there anything I'm missing from my diag logs?
-Am I missing anything else I should consider doing?

 

I'm not very smart, so there are likely things that I have set up in a stupid way or something I am missing.

 

Edit: removed diag after a month


Thank you!!

 

1) Yep, I did have this same issue on my two ST8000VN004s, where they would disable on spin-up while the system was running. I ended up disabling spin-down on those two. I just changed all disks to never spin down, but I'm 95% sure I had that setting for most of the past few months. Not sure if that's causing any issues with my other drives disabling on boot... but I do have many IronWolfs. I am currently rebuilding #12, which is a 16TB IronWolf, though.

2) Ah, good call. Disk 12 is rebuilding now to pull it back into the array; I paused that, ran xfs_repair on it, then resumed the data rebuild. Hope that helps.
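For anyone following along, the xfs_repair sequence is roughly this, with the array started in maintenance mode. The commands are echoed as a sketch since they need root and the live array device; /dev/md12 is disk 12 on this box, and newer Unraid releases name the device /dev/md12p1 instead.

```shell
DEV=/dev/md12   # example: array device for disk 12 (newer Unraid: /dev/md12p1)

echo "xfs_repair -n $DEV"   # dry run first: report problems, write nothing
echo "xfs_repair $DEV"      # real repair once the dry-run output looks sane
```

Running it against the md device (rather than the raw sdX disk) keeps parity in sync with the repairs.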

 

Once my array finishes rebuilding, I'm going to

1) Put back in my old SQ PSU

2) Move SSDs from backplane to mobo

3) Swap HBA and SAS cable

4) Swap backplane

 

Probably will pause between 3 & 4 for a few days to see how things are going.


  • Solution

I swapped the HBA and PSU. After two days without issue, I re-enabled drive spin-down for one of my ST16000NM000J drives. Within hours it spun down and disabled on spin-up. I don't see anyone else in that thread having many issues outside of the ST8000VN004 and ST8000DM004, but I went ahead and disabled spin-down and followed the instructions in the above-linked thread to disable EPC and low current spin-up on all my Seagates (some don't have EPC). If this works I'll update that thread with my results, as I don't think anyone else there had this issue with the ST16000NM000J.
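What I ran per Seagate, going from that thread (flags as I recall them, so double-check against each tool's --help; SeaChest_Info --scan lists the /dev/sgN handles). Echoed as a sketch since the SeaChest tools need root and real drives attached:

```shell
DRIVE=/dev/sg3   # placeholder handle for one Seagate; repeat per drive

# Disable Extended Power Conditions (skipped on drives without EPC)
echo "SeaChest_PowerControl -d $DRIVE --EPCfeature disable"
# Disable low current spin-up, the other half of the fix in that thread
echo "SeaChest_Configure -d $DRIVE --lowCurrentSpinup disable"
```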
