
Both parity disks failed on a RAID 6 array at the same time. Help please!



Hello Unraiders,

 

I came home to both parity disks being disabled, with a whopping 2000+ errors on each parity disk.

I checked the logs and they both failed very close in time to each other.

There is a plugin that is mentioned in the logs too...

 

https://pastebin.com/9pg56jC2

 

The errors begin with:

Jan 15 01:31:56 VAULT kernel: ata2.00: exception Emask 0x0 SAct 0x10000 SErr 0x0 action 0x6 frozen
Jan 15 01:31:56 VAULT kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jan 15 01:31:56 VAULT kernel: ata2.00: cmd 61/08:80:80:08:00/00:00:00:00:00/40 tag 16 ncq dma 4096 out
Jan 15 01:31:56 VAULT kernel:         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 15 01:31:56 VAULT kernel: ata2.00: status: { DRDY }
Jan 15 01:31:56 VAULT kernel: ata2: hard resetting link

 

It is an 8-disk array, RAID 6. I am worried for my NAS!

I find it very strange that both parity disks failed together when the array has been running for 2 years. I restarted twice, and both disks remain disabled. This message popped up:

 

image.png.aa83d68845e3ea6b1f5f6f6c5c7bf11d.png

 

And the 2k errors on each parity disk were reset to zero:

 

image.thumb.png.b143fa3dfdbc4c475642c45e0cc79543.png

 

Unraid version is: Version 6.12.6 2023-12-01

 

I have left the machine off whilst I post here.

Any help would be appreciated.

 

- ShitTheBed

Link to comment

The parity drives will remain disabled until you go through the process of rebuilding them.  If you want to try rebuilding parity back to the same drives then the process is covered here in the online documentation accessible via the ‘Manual’ link at the bottom of the GUI or the DOCS link at the top of each forum page.

 

The error you posted looks rather like a cabling or power issue so you should check that before attempting any rebuild.    The system’s diagnostics might also give clues as to what was happening.

 

 

Link to comment
Jan 15 01:32:11 VAULT kernel: ata1: softreset failed (1st FIS failed)
Jan 15 01:32:11 VAULT kernel: ata1: hard resetting link
Jan 15 01:32:16 VAULT kernel: ata6: link is slow to respond, please be patient (ready=0)
Jan 15 01:32:16 VAULT kernel: ata2: softreset failed (1st FIS failed)
Jan 15 01:32:16 VAULT kernel: ata2: hard resetting link
Jan 15 01:32:20 VAULT kernel: ata5: softreset failed (1st FIS failed)
Jan 15 01:32:20 VAULT kernel: ata5: hard resetting link
Jan 15 01:32:20 VAULT kernel: ata6: found unknown device (class 0)
### [PREVIOUS LINE REPEATED 1 TIMES] ###
Jan 15 01:32:21 VAULT kernel: ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jan 15 01:32:21 VAULT kernel: ata1: softreset failed (1st FIS failed)
Jan 15 01:32:21 VAULT kernel: ata1: hard resetting link
Jan 15 01:32:26 VAULT kernel: ata6.00: qc timeout after 5000 msecs (cmd 0xec)
Jan 15 01:32:26 VAULT kernel: ata6.00: failed to IDENTIFY (I/O error, err_mask=0x5)
Jan 15 01:32:26 VAULT kernel: ata6.00: revalidation failed (errno=-5)
Jan 15 01:32:26 VAULT kernel: ata6: hard resetting link
Jan 15 01:32:30 VAULT kernel: ata5: softreset failed (1st FIS failed)
Jan 15 01:32:30 VAULT kernel: ata5: hard resetting link
Jan 15 01:32:36 VAULT kernel: ata6: softreset failed (1st FIS failed)
Jan 15 01:32:36 VAULT kernel: ata6: hard resetting link
Jan 15 01:32:46 VAULT kernel: ata6: softreset failed (1st FIS failed)
Jan 15 01:32:46 VAULT kernel: ata6: hard resetting link
Jan 15 01:32:51 VAULT kernel: ata2: softreset failed (1st FIS failed)
Jan 15 01:32:51 VAULT kernel: ata2: limiting SATA link speed to 3.0 Gbps
Jan 15 01:32:51 VAULT kernel: ata2: hard resetting link
Jan 15 01:32:56 VAULT kernel: ata1: softreset failed (1st FIS failed)
Jan 15 01:32:56 VAULT kernel: ata1: limiting SATA link speed to 3.0 Gbps
Jan 15 01:32:56 VAULT kernel: ata1: hard resetting link
Jan 15 01:32:56 VAULT kernel: ata2: softreset failed (1st FIS failed)
Jan 15 01:32:56 VAULT kernel: ata2: reset failed, giving up
Jan 15 01:32:56 VAULT kernel: ata2.00: disable device

There was a problem with all 4 disks connected to the onboard controller at the same time. Unless they share a splitter, it could be a controller issue.
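For anyone wanting to repeat this check on their own syslog, here is a minimal Python sketch that tallies which `ataN` ports logged reset or failure messages. The function name `ports_with_errors` and the `TROUBLE` list of message fragments are illustrative assumptions, not anything Unraid ships; the sample lines are copied from the log above.

```python
import re
from collections import defaultdict

# Matches kernel libata lines such as "... kernel: ata2.00: failed command: ..."
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

# Message fragments treated as signs of trouble (illustrative, not exhaustive).
TROUBLE = ("softreset failed", "hard resetting link", "failed command",
           "reset failed", "disable device", "failed to IDENTIFY")

def ports_with_errors(log_text):
    """Return {port: number of trouble lines} for each ataN port in the log."""
    counts = defaultdict(int)
    for line in log_text.splitlines():
        m = ATA_LINE.search(line)
        if m and any(t in m.group(2) for t in TROUBLE):
            counts[m.group(1)] += 1
    return dict(counts)

sample = """\
Jan 15 01:32:11 VAULT kernel: ata1: softreset failed (1st FIS failed)
Jan 15 01:32:16 VAULT kernel: ata6: link is slow to respond, please be patient (ready=0)
Jan 15 01:32:16 VAULT kernel: ata2: softreset failed (1st FIS failed)
Jan 15 01:32:20 VAULT kernel: ata5: softreset failed (1st FIS failed)
Jan 15 01:32:36 VAULT kernel: ata6: softreset failed (1st FIS failed)
"""

print(sorted(ports_with_errors(sample)))  # ['ata1', 'ata2', 'ata5', 'ata6']
```

If the affected ports all sit on one controller, that points at something they share (controller, cable bundle, or power) rather than at the individual drives.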

Link to comment

Very insightful. I see. The case I'm using is a UNAS-810a. It comes with a built-in controller. From the pic it looks like two x4 controllers:

202649.22222e91435fa96ff1755a2776d8e1e9.

I will find out if the 4 with issues were on the same controller.

I take it that the two non-parity disks were fine after the 'issue', but parity can't tolerate the interruption?

Link to comment
Just now, ShitTheBed said:

take it that the two non-parity disks were fine after the 'issue', but parity can't tolerate the interruption?

Unraid starts disabling drives when they drop offline, but only disables the number equivalent to the number of parity drives.   I think it was just that the parity drives were the first ones noticed.

Link to comment

UPDATE: I zeroed in on which physical drives were disabled. As JorgeB highlighted, the log shows 4 going down. They are the first four in the list below:

 

ata1 => sdb 1TB SSD
ata2 => sdc 1TB SSD
ata5 => sdd HDD (parity 2)
ata6 => sde HDD (parity)

ata9 => sdf HDD    
ata10 => sdg HDD
ata11 => sdh HDD
ata12 => sdi HDD
ata13 => sdj HDD
ata14 => sdk HDD
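A mapping like the one above can usually be recovered from sysfs, because on Linux the resolved path of `/sys/block/sdX` normally contains the `ataN` port name for drives handled by libata. The sketch below is one way to do it in Python; the function names are my own, and the example path (including its PCI address) is made up for illustration. Drives behind controllers that do not go through libata will simply not appear.

```python
import os
import re

def ata_port_from_sysfs_path(path):
    """Extract 'ataN' from a resolved sysfs device path, or None if absent."""
    m = re.search(r"/(ata\d+)/", path)
    return m.group(1) if m else None

def map_ata_to_block(sys_block="/sys/block"):
    """Return {ataN: sdX} for every sd* device whose sysfs path names an ATA port."""
    mapping = {}
    if not os.path.isdir(sys_block):
        return mapping
    for dev in sorted(os.listdir(sys_block)):
        if not dev.startswith("sd"):
            continue
        resolved = os.path.realpath(os.path.join(sys_block, dev))
        port = ata_port_from_sysfs_path(resolved)
        if port:
            mapping[port] = dev
    return mapping

# Hypothetical resolved path, for illustration only (the PCI address is invented).
example = "/sys/devices/pci0000:00/0000:00:17.0/ata5/host4/target4:0:0/4:0:0:0/block/sdd"
print(ata_port_from_sysfs_path(example))  # ata5

# On a real Linux box this prints one "ataN => sdX" line per libata drive.
for port, dev in map_ata_to_block().items():
    print(port, "=>", dev)
```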

 

The top two SSDs form a 1 TB mirrored cache. The bottom 8 HDDs form an array using two of these backplanes:

https://www.u-nas.com/xcart/cart.php?target=product&product_id=17703

 

I thought for sure the 4 affected drives were going to be 4 HDDs on one backplane. They weren't! It was both SSDs that form the cache (not on a backplane, they go directly into the mobo) and both parity drives (on one backplane).

 

Now I will pop off the case cover and check if the drives share a PCIe card, or plug right into the mobo, or both...

 

Another twist: the four failed drives all plug directly into the mobo! The other 6 HDDs plug into the PCIe card:

 

1669805076_WhatsAppImage2024-01-15at23_21.54_9f3ab4aa.thumb.jpg.07bf63aeba1cead2cf87173dd55308ea.jpg

 

I'm now thinking power issues. PSU or unstable power coming into PSU.

 

I told my wife about the issue, and how it happened at 1:31 am. She thought she went into the garage (where the NAS is) around that time for midnight Weetabix (pregnant). She checked her Fitbit, the watch that tracks sleep, and confirmed she did indeed!

 

917760270_WhatsAppImage2024-01-15at23_30.42_a5099c71.thumb.jpg.2c3d5df2cf2ae383bceef2f190c8ba4a.jpg

 

We had an electrician in a week ago who installed lots of automatic lights in the garage...

 

I'm now thinking the new light setup isn't playing nice with the NAS. I will leave the lights always on for now, so there are no power spikes. I will look into how it is wired, and whether I can get hardware to monitor the power supply to the NAS.

 

I will post another update if I can confirm the lights are causing power spikes.

 

Thanks for the leads guys! :)

 

 

 

Link to comment
