On Failed Drives, Failing Drives, and Parity.


I've been in a long process of moving my data storage from Drobo to unRAID (you can see most of the saga in my other topics; there aren't many). Until recently, I had the following:

  • 8Tb WDC_WD80EFAX-68KNBN0 (parity)
  • 6Tb WDC_WD60EFAX-68JH4N0 (data; I planned to also have this be parity when all was finished)
  • 2Tb WD2003FZEX (data)

 

The 6Tb and the 2Tb had all my data, and parity was valid. At that point, I also had 3x 3Tb WD drives and 1x 3Tb Seagate drive still operational in the Drobo, with all the data on them as well (that's where it all came from).

 

Things were working well, but I wanted dual parity drives, and I wanted to make use of all my 3Tb drives from the Drobo. However, I only had 6 SATA ports on my motherboard, so I bought what I thought was a good SATA expansion card, with 8 ports. I plugged all my Drobo drives into that card, and began the preclear process on them.

 

When preclear was done, I added them to the array, but one drive consistently seemed to get kicked out of the array... I would add it, but it would appear in Unassigned Devices when I started the array. Figuring this issue was due to drive failure rather than some sort of issue with the SATA expansion card, I didn't think much of it.

 

All seemed well with the remaining 3, however, so I used unBalance to scatter (in move mode, argh!) my data from the 6Tb drive to all the other ones. That seemed to work well, but then more problems began to appear.

 

One drive (disk 5 in the array) seemed to intermittently error out. The data that had been moved to it was still available due to the parity drive most of the time, but at one point I noticed that a lot of files and folders were missing from the array, and they were all items that had been on that drive. Not sure why parity wasn't "picking up the slack" in that case.

 

That was when I posted the earlier diagnostics (in this thread), and was told that the "drives attached to [my SATA card] are continually resetting, probably severely impacting performance."

 

So that's when I shut everything down, and connected as many drives as I could directly to the motherboard. I had also added a 1Tb SSD as a cache drive at an earlier point, so I then had what you see in the attached diagnostics:

  • 8Tb WDC_WD80EFAX-68KNBN0 (parity)
  • 6Tb WDC_WD60EFAX-68JH4N0 (data, but largely empty)
  • 2Tb WD2003FZEX (data)
  • 3Tb WDC_WD30EZRX-00D8PB0 (data)
  • 3Tb ST3000VN007 (data, but see below)
  • 1Tb Samsung SSD 860 EVO (cache)

 

One thing to note is that the drive that was intermittently erroring out (Disk 5) is physically disconnected for now as I ran out of ports.

 

Now, another drive, the Seagate, has begun to report hundreds of thousands of read errors (currently 261,818), but without getting kicked from the array. The data on there is family photos going back about 2 decades, but it's also all backed up on Google Photos, so it's (ironically for family photos) not unrecoverable.

 

I'm currently waiting for another SATA card to arrive, hopefully this week. Then I'll be able to plug back in Disk 5, and hopefully it will be intact and working properly with a proper SATA card.

 

My questions for all you wizards out there are:

  1. Am I doing things "right", or is there anything else I should be doing now? 
  2. Exactly what should I do once the replacement SATA card arrives? Is there a way to check if it's well-behaved before plugging drives into it?
  3. Is there any way to "rebuild" the data that is on the missing Disk 5 from parity, so that it's "actually" stored on one of the other disks that isn't missing?

 

Thanks for any help!

cube-diagnostics-20211010-0843.zip

Edited by Sandwich
Added 3rd question at end
41 minutes ago, Sandwich said:

What would happen if I replaced disk 4 with disk 5? The parity would then be emulating disk 4, and 5 would hopefully be available regularly, right?

No, because disk5 is already disabled; just connecting it won't re-enable it. It would need to be rebuilt, but since disk4 has issues, that's not possible.
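
To illustrate why a second problem disk blocks the rebuild: a single parity drive can be thought of as a byte-by-byte XOR across all the data disks. The sketch below is a toy model in Python, not unRAID's actual code; the 4-byte "disks" and their contents are made up for illustration and are not taken from the diagnostics.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR several equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data disks, 4 bytes each (illustrative contents only).
disk3 = b"\x10\x20\x30\x40"
disk4 = b"\x0a\x0b\x0c\x0d"
disk5 = b"\xff\x00\xff\x00"

# Single parity: the XOR of every data disk.
parity = xor_blocks([disk3, disk4, disk5])

# One disk missing: parity XOR the surviving disks gives the missing one back.
rebuilt_disk5 = xor_blocks([parity, disk3, disk4])

# Two disks missing (or one missing and one unreadable): the same operation
# only yields disk4 XOR disk5, which cannot be split into the two originals.
combined = xor_blocks([parity, disk3])
```

Here `rebuilt_disk5` matches `disk5`, while `combined` matches neither `disk4` nor `disk5`. That is the situation with a disabled disk5 and a disk4 throwing read errors: the rebuild would need clean reads from every other disk.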

3 hours ago, JorgeB said:

You can do a new config with disk5 to recover that, assuming the disk is OK

When you say "new config", do you mean to effectively recreate the current array config, with the addition of Disk 5? Or create a new, separate, second array (is that possible?) with Disk 5 as the only assigned device?


Ok, so I have the SATA card, and it seems to at the very least allow unRAID to recognize drives plugged into it. The other card did that as well, so I'm not sure what that's worth.

 

In any case, I have a crucial question: I have TWO disks that were excluded from the array when I had to stop using the previous SATA card, and I'm not sure which one was "disk5". The only way I have to reliably identify them is by their manufacturer ID (e.g. WD30EZRX-00DC0B0). Can you tell me which was disk5? If it helps, I'm attaching an earlier diagnostic from Sep 26th; I think I might have had all the disks attached at that point.

Alternatively, if there's no way to tell what the manufacturer ID was for disk5, is there a way to browse the files on the disk without having to add it to the array? I'm pretty sure disk5 had lots of data, and disk6(?) was empty.

cube-diagnostics-20210926-2115.zip

Edited by Sandwich
Typo

I've replaced cables, and now all drives seem to be getting recognized consistently. So I started the array and it began a parity rebuild, which finished "successfully". During that process, two drives reported nearly-identical numbers of read errors: WDC_WD30EZRX-00D8PB0_WD-WMC4N1244814 - 3 TB (sdf) with 732,487,239 errors (disk 3), and ST3000VN007-2E4166_Z730JMM1 - 3 TB (sdg) with 732,562,482 errors (disk 4). 

 

Additionally, I do appear to have lost all the data on disks 3, 4, and 5, despite them still showing as partly/mostly full.

 

Finally, when I try to browse the filesystems of 3, 4, or 5, it just says "No listing: Too many files".

 

What's going on?

 

Fresh diags attached.

cube-diagnostics-20211019-1139.zip

Edited by Sandwich
disk clarification

You're still having multiple ATA errors on multiple disks:

 

Oct 18 22:09:43 Cube kernel: ata10.00: status: { DRDY }
Oct 18 22:09:43 Cube kernel: ata10.00: failed command: WRITE FPDMA QUEUED
Oct 18 22:09:43 Cube kernel: ata10.00: cmd 61/40:88:08:58:00/05:00:00:00:00/40 tag 17 ncq dma 688128 out
Oct 18 22:09:43 Cube kernel:         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 18 22:09:43 Cube kernel: ata10.00: status: { DRDY }
Oct 18 22:09:43 Cube kernel: ata10.00: failed command: WRITE FPDMA QUEUED
Oct 18 22:09:43 Cube kernel: ata10.00: cmd 61/f8:90:48:5d:00/04:00:00:00:00/40 tag 18 ncq dma 651264 out
Oct 18 22:09:43 Cube kernel:         res 40/00:01:01:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 18 22:09:43 Cube kernel: ata10.00: status: { DRDY }
Oct 18 22:09:43 Cube kernel: ata10: hard resetting link
Oct 18 22:09:43 Cube ntpd[1794]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Oct 18 22:09:48 Cube kernel: ata10: link is slow to respond, please be patient (ready=0)
Oct 18 22:09:53 Cube kernel: ata10: COMRESET failed (errno=-16)
Oct 18 22:09:53 Cube kernel: ata10: hard resetting link
Oct 18 22:09:53 Cube kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 310)
Oct 18 22:09:53 Cube kernel: ata10.00: configured for UDMA/133
Oct 18 22:09:53 Cube kernel: ata10: EH complete
Oct 18 22:10:11 Cube kernel: ata8.00: READ LOG DMA EXT failed, trying PIO
Oct 18 22:10:11 Cube kernel: ata8: failed to read log page 10h (errno=-5)
Oct 18 22:10:11 Cube kernel: ata8.00: exception Emask 0x1 SAct 0xc02000 SErr 0x0 action 0x0
Oct 18 22:10:11 Cube kernel: ata8.00: irq_stat 0x40000001
Oct 18 22:10:11 Cube kernel: ata8.00: failed command: READ FPDMA QUEUED
Oct 18 22:10:11 Cube kernel: ata8.00: cmd 60/20:68:00:ee:4a/00:00:0a:00:00/40 tag 13 ncq dma 16384 in
Oct 18 22:10:11 Cube kernel:         res 51/04:b8:80:87:00/00:00:00:00:00/40 Emask

 

These are usually a power/connection problem, and the data should come back after a reboot.
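
If it helps to see which ports are affected, here is a rough Python sketch that tallies the failure and reset messages per ATA port. The sample lines are copied from the excerpt above; the function name, the regular expression, and the list of message fragments are my own choices for illustration, not part of any unRAID tool.

```python
import re
from collections import Counter

# A few lines copied from the syslog excerpt above.
SAMPLE_LOG = """\
Oct 18 22:09:43 Cube kernel: ata10.00: failed command: WRITE FPDMA QUEUED
Oct 18 22:09:43 Cube kernel: ata10: hard resetting link
Oct 18 22:09:53 Cube kernel: ata10: COMRESET failed (errno=-16)
Oct 18 22:09:53 Cube kernel: ata10: hard resetting link
Oct 18 22:10:11 Cube kernel: ata8.00: failed command: READ FPDMA QUEUED
"""

# Matches "kernel: ata<N>", an optional ".<device>", then ": <message>".
LINE_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

# Message fragments treated as signs of link trouble (an assumed, partial list).
PATTERNS = ("failed command", "hard resetting link", "COMRESET failed")


def count_ata_events(log_text):
    """Count matching messages per (port, pattern) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        port, message = match.groups()
        for pattern in PATTERNS:
            if pattern in message:
                counts[(port, pattern)] += 1
    return counts


if __name__ == "__main__":
    for (port, pattern), n in sorted(count_ata_events(SAMPLE_LOG).items()):
        print(f"{port}: {pattern} x{n}")
```

Run against a full syslog from the diagnostics zip, a tally like this should show whether the errors cluster on one or two ports (pointing at a cable or power connector) or are spread across every port on the same controller.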

