[6.9.2] 2 Failed Drives. What Do I do now?

mattcrem · May 15, 2021

So I woke up today to two failed drives as per the below screenshot. The situation is as follows:

Disk 3 - marked "unmountable: not mounted". I have connected it via a USB HDD dock to my Windows laptop. Both FTK Imager and DiskInternals LinuxReader can access the drive and show me the folders and files on it. I'm not sure the data is actually readable, but the folder structure seems intact.

Disk 4 - Shown as failed. Both FTK Imager and LinuxReader show me the file structure of this drive.

Since there are two failed drives, I can't mount the array, not even in maintenance mode. So what are my options here to try and retrieve the data and rebuild the array?

Edited May 15, 2021 by mattcrem
updated Images

Frank1940 · May 15, 2021

The first thing the Gurus (I am not even close to one with this type of problem!) are going to need is a Diagnostics file: Tools >>> Diagnostics. Post it in a new post. (No one will know that you did it if you add it to your first post by editing it!)

mattcrem · May 15, 2021

48 minutes ago, Frank1940 said:

The first thing the Gurus (I am not even close to one with this type of problem!) are going to need is a Diagnostics file: Tools >>> Diagnostics. Post it in a new post. (No one will know that you did it if you add it to your first post by editing it!)

you're right, thanks for reminding me, attaching diagnostics:

cremonanas-diagnostics-20210515-1749.zip

JorgeB · May 16, 2021

Run a filesystem check on disk1. Disk4 appears to be really failing; you can run an extended SMART test to confirm, and if it fails, replace the disk.

sota · May 16, 2021

Here's my suggestion:

Since you can see the contents of the disks on another system, copy a couple items out to spot check their viability.

If good, copy everything/clone the drive(s) to new one(s). Now you'll at least have a copy.

Once that's done, run extended SMART tests on both to see what their actual condition is.

@JorgeB I looked at the diags, and can see where disk1 is poo, but disk4 appears to mount fine... unless i'm reading the log wrong.
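One way to sketch the clone-first step suggested above, assuming GNU ddrescue is installed on the machine doing the copy (the thread does not name a cloning tool, so ddrescue is a swapped-in choice). The device names and map file are placeholders, and the function only prints the commands so nothing is overwritten by accident:

```shell
#!/bin/sh
# Print (do not run) a two-pass GNU ddrescue plan.
# src, dst and map are placeholders supplied by the caller; none of them
# come from this thread. Double-check which device is which before running.
clone_plan() {
  src=$1; dst=$2; map=$3
  # Pass 1: copy everything that reads cleanly, skipping the slow scraping phase.
  echo "ddrescue -f -n $src $dst $map"
  # Pass 2: return to the bad areas with direct disc access and up to 3 retries.
  echo "ddrescue -f -d -r3 $src $dst $map"
}

clone_plan /dev/sdX /dev/sdY rescue.map
```

The map file lets the second pass resume where the first one stopped, so the failing disk is read as little as possible.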

JorgeB · May 16, 2021

6 minutes ago, sota said:

@JorgeB I looked at the diags, and can see where disk1 is poo, but disk4 appears to mount fine... unless i'm reading the log wrong.

Don't understand what you mean: disk1 is unmountable but enabled, disk4 is disabled but mounting.

mattcrem · May 17, 2021

On 5/16/2021 at 12:49 PM, sota said:

Here's my suggestion:

Since you can see the contents of the disks on another system, copy a couple items out to spot check their viability.

If good, copy everything/clone the drive(s) to new one(s). Now you'll at least have a copy.

Once that's done, run extended SMART tests on both to see what their actual condition is.

@JorgeB I looked at the diags, and can see where disk1 is poo, but disk4 appears to mount fine... unless i'm reading the log wrong.

Disk 1 happens to have just media files. I connected the disk directly to a Windows PC and exported some of them with FTK Imager; however, neither Media Player Classic nor VLC is able to play back these files.

I have now remounted everything in my Unraid server and am using TeraCopy to move everything to a local hard disk, to see if I can open the files that way.

Disk 4 was just added to the array but I never actually had anything stored on it if that helps.

On 5/16/2021 at 9:35 AM, JorgeB said:

Run a filesystem check on disk1. Disk4 appears to be really failing; you can run an extended SMART test to confirm, and if it fails, replace the disk.

I have not done this yet, however checking dmesg I see a long list of bad sectors in disk1.

JorgeB · May 17, 2021

48 minutes ago, mattcrem said:

I have not done this yet, however checking dmesg I see a long list of bad sectors in disk1.

There's no SMART report for disk1 on the diags, see if you can get one manually.

mattcrem · May 18, 2021

23 hours ago, JorgeB said:

There's no SMART report for disk1 on the diags, see if you can get one manually.

smartctl -a -d ata /dev/sdd
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.10.28-Unraid] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST2000DM001-1CH164
Serial Number:    Z340CGBC
LU WWN Device Id: 5 000c50 064b34a34
Firmware Version: CC27
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue May 18 13:57:19 2021 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Read SMART Data failed: Input/output error

=== START OF READ SMART DATA SECTION ===
Error SMART Status command failed: Input/output error
SMART Status command failed: Input/output error
SMART overall-health self-assessment test result: UNKNOWN!
SMART Status, Attributes and Thresholds cannot be read.

SMART Error Log Version: 1
ATA Error Count: 554 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 554 occurred at disk power-on lifetime: 35743 hours (1489 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00      03:37:23.810  READ FPDMA QUEUED
  61 00 00 ff ff ff 4f 00      03:37:23.809  WRITE FPDMA QUEUED
  60 00 f8 48 02 00 40 00      03:37:23.407  READ FPDMA QUEUED
  60 00 f8 48 01 00 40 00      03:37:23.406  READ FPDMA QUEUED
  60 00 78 c8 00 00 40 00      03:37:23.406  READ FPDMA QUEUED

Error 553 occurred at disk power-on lifetime: 35742 hours (1489 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 78 ff ff ff 4f 00      03:15:15.182  READ FPDMA QUEUED
  60 00 40 ff ff ff 4f 00      03:15:15.182  READ FPDMA QUEUED
  61 00 38 ff ff ff 4f 00      03:15:15.181  WRITE FPDMA QUEUED
  61 00 40 ff ff ff 4f 00      03:15:15.094  WRITE FPDMA QUEUED
  61 00 30 ff ff ff 4f 00      03:15:15.082  WRITE FPDMA QUEUED

Error 552 occurred at disk power-on lifetime: 35742 hours (1489 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 a8 ff ff ff 4f 00      03:15:11.427  WRITE FPDMA QUEUED
  60 00 38 ff ff ff 4f 00      03:15:11.422  READ FPDMA QUEUED
  60 00 d0 ff ff ff 4f 00      03:15:11.422  READ FPDMA QUEUED
  61 00 40 ff ff ff 4f 00      03:15:11.420  WRITE FPDMA QUEUED
  61 00 38 ff ff ff 4f 00      03:15:11.418  WRITE FPDMA QUEUED

Error 551 occurred at disk power-on lifetime: 35742 hours (1489 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 20 ff ff ff 4f 00      03:15:07.023  WRITE FPDMA QUEUED
  60 00 60 ff ff ff 4f 00      03:15:07.005  READ FPDMA QUEUED
  60 00 40 ff ff ff 4f 00      03:15:07.005  READ FPDMA QUEUED
  60 00 98 ff ff ff 4f 00      03:15:07.004  READ FPDMA QUEUED
  61 00 60 ff ff ff 4f 00      03:15:07.003  WRITE FPDMA QUEUED

Error 550 occurred at disk power-on lifetime: 35742 hours (1489 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 88 ff ff ff 4f 00      03:15:03.238  READ FPDMA QUEUED
  61 00 88 ff ff ff 4f 00      03:15:03.236  WRITE FPDMA QUEUED
  60 00 a8 ff ff ff 4f 00      03:15:03.167  READ FPDMA QUEUED
  60 00 58 ff ff ff 4f 00      03:15:03.167  READ FPDMA QUEUED
  60 00 48 ff ff ff 4f 00      03:15:03.166  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     31226         -
# 2  Short offline       Completed without error       00%     17784         -
# 3  Short offline       Completed without error       00%     17643         -
# 4  Short offline       Completed without error       00%     17502         -
# 5  Short offline       Completed without error       00%     17361         -
# 6  Short offline       Completed without error       00%     17256         -
# 7  Short offline       Completed without error       00%     17115         -
# 8  Short offline       Completed without error       00%     16975         -
# 9  Short offline       Completed without error       00%     16892         -
#10  Short offline       Completed without error       00%     16751         -
#11  Short offline       Completed without error       00%     16608         -
#12  Short offline       Completed without error       00%     16467         -
#13  Short offline       Completed without error       00%     16326         -
#14  Short offline       Completed without error       00%     16185         -
#15  Short offline       Completed without error       00%     16050         -
#16  Short offline       Completed without error       00%     16020         -
#17  Short offline       Completed without error       00%     15934         -
#18  Short offline       Completed without error       00%     15801         -
#19  Short offline       Completed without error       00%     15660         -
#20  Short offline       Completed without error       00%     15518         -
#21  Short offline       Completed without error       00%     15377         -

Selective Self-tests/Logging not supported
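A small sketch for summarizing an error log like the one above: it reads smartctl output on stdin and counts the entries by error type. The function name is made up for illustration, and the sample input is just the five register lines from this report.

```shell
#!/bin/sh
# Tally the error types in the "SMART Error Log" part of smartctl -a output.
# Reads the report on stdin and prints one "<type> <count>" line per type.
tally_smart_errors() {
  awk '/Error: [A-Z]+ at LBA/ {
         # The error type (UNC, WP, ...) is the word right after "Error:".
         for (i = 1; i < NF; i++) if ($i == "Error:") count[$(i + 1)]++
       }
       END { for (t in count) print t, count[t] }' | sort
}

# The five register lines from the report, in log order (554 down to 550).
tally_smart_errors <<'EOF'
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
EOF
```

For this report it prints `UNC 3` and `WP 2`. On a live system the same function could be fed directly, for example `smartctl -a -d ata /dev/sdd | tally_smart_errors`. Keep in mind the drive only stores the five most recent of its 554 logged errors.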

JorgeB · May 18, 2021

It's not showing any attributes, run an extended SMART test.

mattcrem · May 18, 2021

So I ran the below:

~# smartctl -d ata -tlong /dev/sdd
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.10.28-Unraid] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

Read SMART Data failed: Input/output error

It just shows the I/O error less than a minute after I enter the command.

JorgeB · May 18, 2021

Yeah, best to replace a disk that doesn't even report SMART data correctly.

mattcrem · May 18, 2021

The problem is that I have two failed drives. I have replaced one of them with a fresh new drive (as I'm out of SATA ports), and Unraid tells me that there are too many bad drives, so the array won't start at all. Is there any way around this?

JorgeB · May 18, 2021

You only have one disabled drive; the other one is just unmountable. You can replace disk4 and see if disk1 can be used for the rebuild without errors, then replace disk1. If there are read errors on disk1, the rebuilt disk will likely have some corruption, resulting in some data loss, but there's not much you can do about that with single parity if disk1 is also failing.
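Why a bad read on disk1 ends up as corruption on the rebuilt disk4 can be shown with a toy example. It assumes single parity is a plain XOR across the data disks. The three-disk layout and the byte values are invented for illustration, and a wrong byte stands in for a sector that cannot be read correctly.

```shell
#!/bin/sh
# XOR together any number of byte values (decimal or 0x-prefixed hex).
xor_all() {
  acc=0
  for b in "$@"; do acc=$(( $acc ^ $b )); done
  echo "$acc"
}

# Hypothetical byte at the same offset on three data disks.
disk1=0xA5; disk2=0x3C; disk4=0x5F

# Parity is computed while all data disks are healthy.
parity=$(xor_all "$disk1" "$disk2" "$disk4")

# Rebuilding disk4 XORs the parity with every other data disk.
good=$(xor_all "$parity" "$disk1" "$disk2")
# Same rebuild, but disk1 returns a wrong byte (0xA4 instead of 0xA5).
bad=$(xor_all "$parity" 0xA4 "$disk2")

printf 'original disk4 byte:    %d\n' "$(( $disk4 ))"
printf 'rebuilt, disk1 read OK: %d\n' "$good"
printf 'rebuilt, disk1 misread: %d\n' "$bad"
```

The second line matches the original byte and the third does not: with one parity disk there is no second source to cross-check against, so whatever disk1 returns during the rebuild is baked into the new disk4.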
