Both partiy disks fail on Raid 6 array at the same time. Help please!

ShitTheBed · January 15

Hello Unraiders,

I came home to both parity disks being disabled, with a whopping 2000+ errors on each parity disk.

I checked the logs and they both failed very close in time to each other.

There is a plugin that is mentioned in the logs too...

The beggining of the errors starts with:

Jan 15 01:31:56 VAULT kernel: ata2.00: exception Emask 0x0 SAct 0x10000 SErr 0x0 action 0x6 frozen
Jan 15 01:31:56 VAULT kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jan 15 01:31:56 VAULT kernel: ata2.00: cmd 61/08:80:80:08:00/00:00:00:00:00/40 tag 16 ncq dma 4096 out
Jan 15 01:31:56 VAULT kernel:         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 15 01:31:56 VAULT kernel: ata2.00: status: { DRDY }
Jan 15 01:31:56 VAULT kernel: ata2: hard resetting link

It is an 8 disk array, raid 6. I am worried for my NAS!

I find it very strange that both parity disks fail together when the array has being going for 2 years. I restarted twice, and both disks remain disabled. This message popped up:

image.png.aa83d68845e3ea6b1f5f6f6c5c7bf11d.png

Adnd the 2k errors on each parity disk were zeroed away:

Unraid version is: Version 6.12.6 2023-12-01

I have left the machine off whilst I post here.

Any help would be appreciated.

- ShitTheBed

itimpi · January 15

The parity drives will remain disabled until you go through the process of rebuilding them. If you want to try rebuilding parity back to the same drives then the process is covered here in the online documentation accessible via the ‘Manual’ link at the bottom of the GUI or the DOCS link at the top of each forum page.

The error you posted looks rather like a cabling or power issue so you should check that before attempting any rebuild. The system’s diagnostics might also give clues as to what was happening.

ShitTheBed · January 15

Hi Itimpi,

Thank you for your prompt response!

I have the diagnostics, I collected them before restart. Find them attached.

Can you guide me on where to look for clues?

Thanks again

vault-diagnostics-20240115-1723.zip

JorgeB · January 15

Jan 15 01:32:11 VAULT kernel: ata1: softreset failed (1st FIS failed)
Jan 15 01:32:11 VAULT kernel: ata1: hard resetting link
Jan 15 01:32:16 VAULT kernel: ata6: link is slow to respond, please be patient (ready=0)
Jan 15 01:32:16 VAULT kernel: ata2: softreset failed (1st FIS failed)
Jan 15 01:32:16 VAULT kernel: ata2: hard resetting link
Jan 15 01:32:20 VAULT kernel: ata5: softreset failed (1st FIS failed)
Jan 15 01:32:20 VAULT kernel: ata5: hard resetting link
Jan 15 01:32:20 VAULT kernel: ata6: found unknown device (class 0)
### [PREVIOUS LINE REPEATED 1 TIMES] ###
Jan 15 01:32:21 VAULT kernel: ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jan 15 01:32:21 VAULT kernel: ata1: softreset failed (1st FIS failed)
Jan 15 01:32:21 VAULT kernel: ata1: hard resetting link
Jan 15 01:32:26 VAULT kernel: ata6.00: qc timeout after 5000 msecs (cmd 0xec)
Jan 15 01:32:26 VAULT kernel: ata6.00: failed to IDENTIFY (I/O error, err_mask=0x5)
Jan 15 01:32:26 VAULT kernel: ata6.00: revalidation failed (errno=-5)
Jan 15 01:32:26 VAULT kernel: ata6: hard resetting link
Jan 15 01:32:30 VAULT kernel: ata5: softreset failed (1st FIS failed)
Jan 15 01:32:30 VAULT kernel: ata5: hard resetting link
Jan 15 01:32:36 VAULT kernel: ata6: softreset failed (1st FIS failed)
Jan 15 01:32:36 VAULT kernel: ata6: hard resetting link
Jan 15 01:32:46 VAULT kernel: ata6: softreset failed (1st FIS failed)
Jan 15 01:32:46 VAULT kernel: ata6: hard resetting link
Jan 15 01:32:51 VAULT kernel: ata2: softreset failed (1st FIS failed)
Jan 15 01:32:51 VAULT kernel: ata2: limiting SATA link speed to 3.0 Gbps
Jan 15 01:32:51 VAULT kernel: ata2: hard resetting link
Jan 15 01:32:56 VAULT kernel: ata1: softreset failed (1st FIS failed)
Jan 15 01:32:56 VAULT kernel: ata1: limiting SATA link speed to 3.0 Gbps
Jan 15 01:32:56 VAULT kernel: ata1: hard resetting link
Jan 15 01:32:56 VAULT kernel: ata2: softreset failed (1st FIS failed)
Jan 15 01:32:56 VAULT kernel: ata2: reset failed, giving up
Jan 15 01:32:56 VAULT kernel: ata2.00: disable device

Problem with all 4 disks connected to the onboard controller at the same time, unless they share a splitter it could be a controllers issue.

ShitTheBed · January 15

Very insightful. I see. The case im using is UNAS-810a. It comes with a built in controller. From the pic it looks like two x4 controllers:

202649.22222e91435fa96ff1755a2776d8e1e9.

i will find out if the 4 ones with issues were on same controller.

i take it that the two non parity disks were fine after the 'issue', but parity cant tolerate the interuption?

itimpi · January 15

Just now, ShitTheBed said:

take it that the two non parity disks were fine after the 'issue', but parity cant tolerate the interuption?

Unraid starts disabling drives when they drop offline, but only disables the number equivalent to the number of parity drives. I think it was just that the parity drives were the first ones noticed.

ShitTheBed · January 15

Aha i see! I am going to fully back up data first. Will see if i can source a new controller, it might be tricky as its part of the (expensive!) case.

Im guessing a brown out aswell as a psu issue could cause this?

ShitTheBed · January 15

UPDATE: I zerod in on which physical drives were disabled. As JorgeB highlighted, the log shows 4 going down. I've bolded them in the list below:

ata1 => sdb 1TB SSD
ata2 => sdc 1TB SSD
ata5 => sdd HDD (parity 2)
ata6 => sde HDD (parity)
ata9 => sdf HDD
ata10 => sdg HDD
ata11 => sdh HDD
ata12 => sdi HDD
ata13 => sdj HDD
ata14 => sdk HDD

The top two SSD's form a 1tb mirrored cache. The bottom 8 HDD form an array using two of these backplanes:

https://www.u-nas.com/xcart/cart.php?target=product&product_id=17703

I thought for sure the 4 effected drives were going to be 4 HDD's on one backplane. It wasn't! It was both SSD's (not on a backplane - they go directly into mobo) that form the cache and both parity drives (on one backplane).

Now I will pop of case cover and check if drives share a PCIE card, or plug right into mobo, or both...

Another twist - the four failed drives all plug directly into mobo! The other 6 HDD's plug into the PCIE card:

I'm now thinking power issues. PSU or unstable power coming into PSU.

I tell my wife about the issue, and how it happened at 1:31 am. She thinks she went into garage (where NAS is) at that time for midnight weatabix (pregnant). She checks fitbit, the watch that tracks sleep, and confirms she did indeed!

We had an electrician in a week ago who installed lots of automatic lights in garage...

I'm now thinking the new light set up isnt playing nice with the NAS. I will leave them always on for now, so no power spikes. I will look into how it is wired, and if I can get hardware to monitor power supply to NAS.

I will do another update if I can confirm lights causing power spikes.

Thanks for the leads guys!

Both partiy disks fail on Raid 6 array at the same time. Help please!

Recommended Posts

ShitTheBed

Link to comment

itimpi

Link to comment

ShitTheBed

Link to comment

JorgeB

Link to comment

ShitTheBed

Link to comment

itimpi

Link to comment

ShitTheBed

Link to comment

ShitTheBed

Link to comment

Join the conversation