
Drives drop out randomly (kinda)



I have been using Unraid for years and have upgraded along the way, but this last jump is making me question not only my tech skills and manhood but my sanity as well.

 

System Specs:
 

Quote

PCPartPicker Part List: https://pcpartpicker.com/list/jr9ct7

CPU: AMD Ryzen 9 5950X 3.4 GHz 16-Core Processor  ($358.00 @ Newegg)
Motherboard: Asus PRIME X570-PRO ATX AM4 Motherboard  ($377.36 @ Amazon)
Memory: G.Skill Ripjaws V 128 GB (4 x 32 GB) DDR4-3200 CL16 Memory  ($259.99 @ Amazon)
Storage: Samsung 870 Evo 2 TB 2.5" Solid State Drive  ($179.99 @ Amazon)
Storage: Samsung 870 Evo 2 TB 2.5" Solid State Drive  ($179.99 @ Amazon)
Storage: Samsung 870 Evo 2 TB 2.5" Solid State Drive  ($179.99 @ Amazon)
Storage: Samsung 870 Evo 2 TB 2.5" Solid State Drive  ($179.99 @ Amazon)
Storage: Sabrent Rocket 4 Plus 8 TB M.2-2280 PCIe 4.0 X4 NVME Solid State Drive  ($1165.00 @ Amazon)
Storage: Sabrent Rocket 4 Plus 8 TB M.2-2280 PCIe 4.0 X4 NVME Solid State Drive  ($1165.00 @ Amazon)
Storage: HP 7 TB 3.5" 7200 RPM Internal Hard Drive
Storage: HP 7 TB 3.5" 7200 RPM Internal Hard Drive
Storage: HP 7 TB 3.5" 7200 RPM Internal Hard Drive
Storage: HP 7 TB 3.5" 7200 RPM Internal Hard Drive
Storage: HP 7 TB 3.5" 7200 RPM Internal Hard Drive
Storage: HP 7 TB 3.5" 7200 RPM Internal Hard Drive
Storage: HP 7 TB 3.5" 7200 RPM Internal Hard Drive
Storage: Seagate IronWolf NAS 8 TB 3.5" 7200 RPM Internal Hard Drive  ($149.88 @ Amazon)
Storage: Seagate IronWolf NAS 8 TB 3.5" 7200 RPM Internal Hard Drive  ($149.88 @ Amazon)
Storage: Seagate IronWolf NAS 8 TB 3.5" 7200 RPM Internal Hard Drive  ($149.88 @ Amazon)
Storage: Seagate IronWolf NAS 8 TB 3.5" 7200 RPM Internal Hard Drive  ($149.88 @ Amazon)
Storage: Seagate IronWolf NAS 8 TB 3.5" 7200 RPM Internal Hard Drive  ($149.88 @ Amazon)
Storage: Seagate IronWolf NAS 8 TB 3.5" 7200 RPM Internal Hard Drive  ($149.88 @ Amazon)
Storage: Seagate IronWolf NAS 8 TB 3.5" 7200 RPM Internal Hard Drive  ($149.88 @ Amazon)
Storage: Seagate IronWolf NAS 8 TB 3.5" 7200 RPM Internal Hard Drive  ($149.88 @ Amazon)
Storage: Seagate IronWolf Pro NAS 12 TB 3.5" 7200 RPM Internal Hard Drive  ($229.99 @ Adorama)
Storage: Seagate IronWolf Pro NAS 12 TB 3.5" 7200 RPM Internal Hard Drive  ($229.99 @ Adorama)
Storage: Seagate IronWolf Pro NAS 16 TB 3.5" 7200 RPM Internal Hard Drive  ($299.95 @ Amazon)
Storage: Seagate IronWolf Pro NAS 16 TB 3.5" 7200 RPM Internal Hard Drive  ($299.95 @ Amazon)
Storage: Seagate IronWolf Pro NAS 16 TB 3.5" 7200 RPM Internal Hard Drive  ($299.95 @ Amazon)
Storage: Seagate IronWolf Pro NAS 16 TB 3.5" 7200 RPM Internal Hard Drive  ($299.95 @ Amazon)
Storage: Seagate IronWolf Pro NAS 16 TB 3.5" 7200 RPM Internal Hard Drive  ($299.95 @ Amazon)
Storage: Seagate IronWolf Pro NAS 16 TB 3.5" 7200 RPM Internal Hard Drive  ($299.95 @ Amazon)
Storage: Seagate IronWolf Pro NAS 16 TB 3.5" 7200 RPM Internal Hard Drive  ($299.95 @ Amazon)
Video Card: PNY VCQP4000-PB Quadro P4000 8 GB Video Card  ($425.00 @ Amazon)
Power Supply: SilverStone Technology ST1300-TI, 80 Plus Titanium 1300W Fully Modular ATX/PS2 Power Supply, SST-ST1300-TI-X  ($340.99 @ Newegg)
Case Fan: Noctua NF-A12x25 PWM chromax.black.swap 60.09 CFM 120 mm Fan  ($34.95 @ Amazon)
Case Fan: Noctua NF-A12x25 PWM chromax.black.swap 60.09 CFM 120 mm Fan  ($34.95 @ Amazon)
Case Fan: Noctua NF-A12x25 PWM chromax.black.swap 60.09 CFM 120 mm Fan  ($34.95 @ Amazon)
Case Fan: Noctua NF-A12x25 PWM chromax.black.swap 60.09 CFM 120 mm Fan  ($34.95 @ Amazon)
Fan Controller: Razer RZ34-02140700-R3M1 Fan Controller  ($48.96 @ Amazon)

Parts notes: I have had the motherboard since 2021, the RAM is about a year old, and the CPU is about 6 months old. The 16 TB drives, SAS controller, power supply, case, SSDs, and NVMe drives are new.

The issues started when I installed the RAM. After about a week, the system would lock up randomly and complain about BTRFS corruption. I decided to move the data to my Synology and start over with a better CPU and a new case.

 

Issues:

Rebuild is very slow

Disks drop off and return with a new device name/address (e.g., sdb becomes sdac), errors pop up on the disks, and SMART becomes unresponsive for that disk. (Note: it is not always the same drive, and it has included drives of other sizes.) The issue only seems to happen during a parity check/rebuild.
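Since the sdX letters change when a disk drops and reconnects, it can help to track disks by their stable identifiers instead. The following is a minimal sketch, assuming a Linux system that exposes the usual `/dev/disk/by-id` symlinks; the function names `map_ids_to_devices` and `changed_devices` are made up for illustration and are not part of Unraid.

```python
import os

def map_ids_to_devices(by_id_dir="/dev/disk/by-id"):
    """Return {stable_id: current_device_node} for whole disks.

    Partition entries (names containing "-part") are skipped.
    Returns an empty dict if the directory does not exist.
    """
    mapping = {}
    if not os.path.isdir(by_id_dir):
        return mapping
    for name in sorted(os.listdir(by_id_dir)):
        if "-part" in name:
            continue
        # Each entry is a symlink to the current node, e.g. /dev/sdb
        mapping[name] = os.path.realpath(os.path.join(by_id_dir, name))
    return mapping

def changed_devices(before, after):
    """Return {id: (old_node, new_node)} for ids whose node changed.

    new_node is None if the id is missing from the second snapshot.
    """
    return {
        disk_id: (node, after.get(disk_id))
        for disk_id, node in before.items()
        if node != after.get(disk_id)
    }
```

Taking one snapshot before starting a parity check and another after a drop-out, then passing both to `changed_devices`, would show which serial numbers moved to a new device node.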

Troubleshooting I have completed so far:

  1. Initial Hardware Tests
    1. PC Doctor Tests: I ran initial hardware diagnostics using PC Doctor to check for any immediate issues.
    2. Burn-In Tests: Conducted a burn-in test at a PC shop, which included SMART tests on the drives. All tests passed without errors.
  2. Power Supply Upgrade
    1. Power Supply Upgrade: Upgraded the power supply to a 1300W unit because the existing one might have been reaching its maximum output, potentially causing instability.
  3. CPU and Drive Upgrades
    1. CPU Upgrade: Replaced the CPU with a 16-core, 32-thread processor to enhance performance and address potential CPU-related issues.
    2. Drive Upgrades: Installed new 16TB drives during a case upgrade.
      Also moved existing 12TB and 8TB drives from an old Synology system to the new setup.
    3. SSD Installation: Added 4 SSDs on a PCI card connected via SATA to the server to expand storage capabilities.
    4. Cache NVMe Upgrade: Upgraded the NVMe cache to a 4TB model to increase cache size and performance.
  4. Memory Upgrade and Issues
    1. RAM Upgrade: Increased RAM from 32GB to 128GB. This upgrade caused issues such as cache corruption and the system becoming unresponsive after 96-130 hours on the old system.
  5. Further Hardware Troubleshooting
    1. PC Shop Inspection: Took the server to a PC shop for a thorough check to ensure no hardware issues were overlooked.
    2. SAS Cable Replacement: Swapped out and replaced SAS cables to new backplanes to ensure proper connectivity and eliminate cable faults.
    3. Firmware and BIOS Updates: Updated the firmware on the 9305-24i controller and confirmed the controller was running in IT mode.
    4. Drive Relocation: Moved drives to new bays to see if physical placement was causing issues.
    5. External Drive Tests: Tested the drives outside of the server to verify their functionality.
    6. CPU and Memory Tests: Tests were conducted on the CPU and memory to rule out any potential faults for over a week.
  6. Observations and Investigations
    1. Drive Unmounting and Remounting: Initially, only two drives would unmount and remount, but eventually all drives on one backplane were affected, and then drives on other backplanes.
    2. Random Drive Placement: After moving the server to the PC shop, drives were randomly placed back into the array by size, which could have caused issues.
    3. Suspected Cable Issues: I considered that a single miniSAS-to-SATA cable might be causing problems, especially if the problematic drives shared it, so I replaced SAS cable 1.
    4. Controller Considerations: Ordered another 9305-24i controller as some SATA drives directly connected to the motherboard were missing.
    5. Placed drives into random bays.
    6. Separated the SAS backplanes onto different SATA rails.
    7. Replaced the USB stick.
    8. Did a fresh install of Unraid.
    9. The issue happens in both safe mode and normal mode.
  7. Changed the Seagate Settings with this guide:

I have attached diagnostics, screenshots, and photos of the system. I am out of ideas and need advice.

unnamed.jpg

Screenshot 2024-04-18 100732.png

unnamed7.jpg

unnamed6.jpg

unnamed5.png

unnamed4.jpg

unnamed3.png

unnamed2.png

unnamed1.png

2024-07-28 09_30_04-7-Zip.png

tower-diagnostics-20240417-1722(1).zip tower-diagnostics-20240420-1617.zip tower-diagnostics-20240728-0930.zip

  • 2 weeks later...

It only seems to be the 16 TB drives and, every once in a while, a 12 TB one. 1300 W should be more than enough; what is the best way to check the power supply to rule it in or out?

 

One of the strange things is that it seems to happen to the same drives, normally a parity (protection) drive or one of the first 5.


I shuffled the drives to rule out the backplanes and the possibility that one of them was overloaded.
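One rough way to sanity-check the power question is to estimate the 12 V load when the drives spin up together, both for the whole system and for a single cable or backplane feed. This is only a back-of-the-envelope sketch: the 2.0 A per-drive spin-up figure and the 4-drive grouping are assumed placeholders, not values taken from these drives, and the real numbers should come from each drive's datasheet and the PSU label.

```python
# All electrical figures here are ASSUMED placeholders, not measured values;
# replace them with the numbers from each drive's datasheet.
RAIL_VOLTS = 12.0
ASSUMED_SPINUP_AMPS = 2.0  # assumed peak 12 V draw per 3.5" drive at spin-up

def spinup_load(n_drives, amps_per_drive=ASSUMED_SPINUP_AMPS):
    """Return (amps, watts) on the 12 V rail if n_drives spin up together."""
    amps = n_drives * amps_per_drive
    return amps, amps * RAIL_VOLTS

# 24 HDDs is the count from the parts list (7 + 8 + 2 + 7).
total_amps, total_watts = spinup_load(24)
print(f"all drives: {total_amps:.0f} A, {total_watts:.0f} W on 12 V")

# Hypothetical grouping: 4 drives sharing one PSU cable / backplane feed.
cable_amps, _ = spinup_load(4)
print(f"one 4-drive feed: {cable_amps:.0f} A on 12 V")
```

Under these assumptions the total sits well inside what a 1300 W unit can deliver, so the more useful comparison is the per-feed figure against what a single PSU cable or connector is rated for.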

Edited by Exilepc

It normally drops out one drive at a time (not always the same drive), but it has happened to up to three drives at the same time.

 

For example:

Boot 1, start check:

Drive 3 drops out, starts to have errors, stops showing its temperature, and shows up in unmapped drives

Boot 2, start check:

Drives 5 and 1 drop out, start to have errors, stop showing their temperatures, and show up in unmapped drives

Boot 3:

Drive 2 is marked as failed

Boot 4, start check:

No issues

Boot 5, start check:

Drives 2, 3, and 5 start to have errors, stop showing their temperatures, and show up in unmapped drives
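To see whether particular slots stand out, the boot-by-boot notes above can be tallied. This is just a small sketch using the five boots listed; the `tally_dropouts` helper is a made-up name, and boot 3 is counted as one event for drive 2 because that drive was marked as failed.

```python
from collections import Counter

# Drop-outs per boot as listed above; boot 3 records drive 2 being marked failed.
boots = {1: [3], 2: [5, 1], 3: [2], 4: [], 5: [2, 3, 5]}

def tally_dropouts(boots):
    """Count how many boots each drive number was affected in."""
    counts = Counter()
    for drives in boots.values():
        counts.update(drives)
    return counts

counts = tally_dropouts(boots)
for drive, n in sorted(counts.items()):
    print(f"drive {drive}: {n} event(s)")
```

With only five boots the sample is small, but extending the `boots` dictionary after each check would show over time whether the failures follow specific drives, slots, or cables.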

 

14 hours ago, Exilepc said:

not always the same drive

If it's not always the same drive, it still suggests, as mentioned, a power/connection issue. Are you using any power splitters? You can also try a different PSU if available; it won't be a lack of power, but the PSU may not be working correctly.
