
Help with Multiple Drive Failure


Solved by SYMYAY


Three of my drives recently went offline, with the prompt "Errors occurred - Check SMART report". When I checked the SMART report, I didn't see anything out of the ordinary, and the "SMART health status" area shows as passed. I doubt that all three drives failed at the same time. I'm not sure what happened here. Any ideas? Thanks in advance!

 

Below is the SMART report for one of the drives; I redacted the serial number.

 

 

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-4.19.107-Unraid] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HP
Product:              MB6000JEFND
Revision:             HPD2
Compliance:           SPC-4
User Capacity:        6,001,175,126,016 bytes [6.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500834bf0df
Serial number:        Z4D1M3SA00***********
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Aug 31 15:15:03 2021 CST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     57 C
Drive Trip Temperature:        60 C

Manufactured in week 13 of year 2015
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  123
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  1977
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     325690.146           0
write:         0        0         0         0          0     256955.992           0
verify:        0        0         0         0          0      53703.942           0

Non-medium error count: 22626511

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   32803                 - [-   -    -]
# 2  Background short  Completed                   -      11                 - [-   -    -]
# 3  Background short  Completed                   -       3                 - [-   -    -]
# 4  Background short  Completed                   -       2                 - [-   -    -]

Long (extended) Self-test duration: 36900 seconds [615.0 minutes]

Background scan results log
  Status: waiting until BMS interval timer expires
    Accumulated power on time, hours:minutes 42973:56 [2578436 minutes]
    Number of background scans performed: 129,  scan progress: 0.00%
    Number of background medium scans performed: 129

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 0
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=1
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000c500834bf0dd
    attached SAS address = 0x500262d0cd7a5b20
    attached phy identifier = 9
    Invalid DWORD count = 159
    Running disparity error count = 160
    Loss of DWORD synchronization = 30
    Phy reset problem = 36
    Phy event descriptors:
     Invalid word count: 159
     Running disparity error count: 160
     Loss of dword synchronization count: 30
     Phy reset problem count: 36
relative target port id = 2
  generation code = 0
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000c500834bf0de
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0

 

1.PNG

2.PNG

3.PNG

18 minutes ago, JorgeB said:

This is normal with SAS devices; SMART in the GUI only works correctly with SATA. Diagnostics from before rebooting might give some clues.

Thanks for your advice. I downloaded the diagnostics, but I have no idea what everything is supposed to mean. Would you mind taking a peek? Thanks a lot.

 

The ones that are currently down are Parity 2 and Disk 3.

tower-diagnostics-20210831-1650.zip

2 hours ago, JorgeB said:

Disks look OK. The diags are from after rebooting, so we can't see what happened, but multiple disk errors are usually a power/connection/controller problem. One thing you should do is update the LSI to the latest firmware, since it's on a very old one; then, if it happens again, grab diags before rebooting.

Thank you so much! I'll try to update the LSI firmware right away and check the cable in the case. Really appreciate your help! Viva Unraid!
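
For anyone reading a similar SAS report by hand, here is a minimal Python sketch that pulls out the fields most relevant to a power/connection question: the temperature margin, the non-medium error count, and the per-phy link counters. The function name `summarize_sas_report` and the regexes are my own and only assume the text layout shown in the report above. The phy counters are cumulative over the drive's life, so nonzero values are a hint to check cabling and power, not proof of a recent fault.

```python
import re

# Counter labels as they appear in the "Protocol Specific port log page" section.
PHY_COUNTERS = (
    "Invalid DWORD count",
    "Running disparity error count",
    "Loss of DWORD synchronization",
    "Phy reset problem",
)

# Excerpt copied from the report posted above.
SAMPLE = """\
Current Drive Temperature:     57 C
Drive Trip Temperature:        60 C
Non-medium error count: 22626511
  phy identifier = 0
    attached phy identifier = 9
    Invalid DWORD count = 159
    Running disparity error count = 160
    Loss of DWORD synchronization = 30
    Phy reset problem = 36
  phy identifier = 1
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
"""


def summarize_sas_report(text):
    """Extract temperature margin, non-medium errors, and per-phy link counters."""
    summary = {"temp_margin_c": None, "non_medium_errors": None, "phys": []}

    current = re.search(r"Current Drive Temperature:\s+(\d+) C", text)
    trip = re.search(r"Drive Trip Temperature:\s+(\d+) C", text)
    if current and trip:
        # Degrees left before the drive reaches its trip temperature.
        summary["temp_margin_c"] = int(trip.group(1)) - int(current.group(1))

    non_medium = re.search(r"Non-medium error count:\s+(\d+)", text)
    if non_medium:
        summary["non_medium_errors"] = int(non_medium.group(1))

    phy = None
    for raw in text.splitlines():
        line = raw.strip()
        # Matches "phy identifier = N" but not "attached phy identifier = N".
        ident = re.match(r"phy identifier = (\d+)$", line)
        if ident:
            phy = {"phy": int(ident.group(1))}
            summary["phys"].append(phy)
            continue
        if phy is None:
            continue
        for name in PHY_COUNTERS:
            counter = re.match(re.escape(name) + r" = (\d+)$", line)
            if counter:
                phy[name] = int(counter.group(1))
    return summary


if __name__ == "__main__":
    print(summarize_sas_report(SAMPLE))
```

On the excerpt above this reports a margin of only 3 C below the trip temperature and nonzero link counters on the connected phy (phy 0), while the unconnected phy 1 is all zeros.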

  • 3 months later...
  • Solution

Just a quick update for the issue I had.

 

Updating the firmware was just a temporary fix. I figured out the root cause last week when I heard clicking noises from the disks.

 

I used a single cable from the power supply to power 6 enterprise 6TB disks, and apparently that cable couldn't deliver enough power for all of them.

 

Switching to a 1000W modular power supply with plenty of SATA connectors, and connecting at most 4 drives per cable, solved my problem completely. Hopefully this helps anyone else who has the same issue.
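
As a rough illustration of the arithmetic behind that fix, here is a small Python sketch. The per-drive current of 2.0 A is a placeholder assumption, not a figure from this thread or from the drive's datasheet, so substitute the spin-up current from your own drive's spec sheet. Only the drive counts (6 drives, at most 4 per cable) come from the post above.

```python
import math

# ASSUMPTION: placeholder peak current per drive during spin-up, in amps.
# This is NOT a measured or datasheet value for the drives in this thread.
ASSUMED_SPINUP_AMPS_PER_DRIVE = 2.0


def cable_load_amps(drives, amps_per_drive=ASSUMED_SPINUP_AMPS_PER_DRIVE):
    """Worst-case current on one cable if all its drives spin up together."""
    return drives * amps_per_drive


def cables_needed(drives, max_drives_per_cable):
    """Minimum number of power cables for a given per-cable drive limit."""
    return math.ceil(drives / max_drives_per_cable)


if __name__ == "__main__":
    print(cable_load_amps(6))    # all 6 drives on a single cable
    print(cable_load_amps(4))    # the 4-drive limit used in the fix
    print(cables_needed(6, 4))   # cables required for 6 drives
```

Under that placeholder figure, six drives on one cable draw half again as much as four, and six drives need at least two cables at four per cable. Compare the real number against what your power supply's cable and connectors are rated for.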
