Parity Errors then an Unexplained Parity Drive Error



About a month ago my server warned 4 times over 3 hours of “[5] reallocated sector ct” and once of “[187] reported uncorrect”. At the same time it was running its parity check, at the end of which it also reported parity errors. (I have it run a parity check w/ corrections weekly, and this was the first time it EVER reported parity errors.)

 

Understanding that the Data drive is probably on the way out the door, I acknowledged the errors and restarted the server to clear the warning and see how long it would take to report more errors.

 

That disk has not reported any more errors since then. However, the parity check has reported errors each week since. Then yesterday, before the parity check was finished, it sent me the following:

 

“Event: Unraid array errors
Subject: Warning [THE-DARK-TOWER] - array has errors
Description: Array has 1 disk with read errors
Importance: warning
Parity disk - ST4000VN008-2DR166_ZDHAF8R2 (sdg) (errors 134) “

 

Running extended SMART tests on all Data disks (including the one that reported errors) returns “Completed without error”. However, the SMART test on the Parity drive ends with “Interrupted (host reset)”.

 

All disks are 4TB Seagate Ironwolf NAS. The drive that gave the initial errors was put into service in May 2018 (currently shows 37245 power on hours). The Parity drive was only installed last November (currently shows 6715 hours, and is thankfully still under warranty). (Also note that the Parity drive has a green thumbs up and says “Healthy” on the dashboard)

 

Am I looking at replacing both drives?

Any opinions on whether it was errors on the Data drive or problems with the Parity drive that has been causing the parity errors?

Any suggestions on further tests of the Parity drive to aid in a warranty claim?

 

Thanks for your help

Link to comment
29 minutes ago, geofbennett said:

I have it run a parity check w/ corrections weekly

Scheduled checks should be noncorrecting.

 

29 minutes ago, geofbennett said:

SMART test on the Parity drive ends with “Interrupted (host reset)

You have to disable spindown on the disk to get extended self-test to complete.

 

11 minutes ago, JorgeB said:

Please post the diagnostics.

attach to your NEXT post

Link to comment
2 hours ago, trurl said:

You have to disable spindown on the disk to get extended self-test to complete.

 

/Settings/DiskSettings - Default spin down delay is already set to "Never".  Is there another way to disable spindown?

 

I've set the parity check scheduler "Write corrections to parity disk:" to "No", but the check mark next to "Write corrections to parity" on Main does not go away.  I suspect that check mark only applies if I tell it to check parity outside the schedule, correct?

 

Diagnostics attached

the-dark-tower-diagnostics-20220816-1520.zip

Link to comment
1 hour ago, geofbennett said:

that check mark only applies if I tell it to check parity outside the schedule, correct?

correct

4 hours ago, geofbennett said:

the Data drive is probably on the way out

you didn't actually tell us which drive that was so I had to open them all

 

I would replace disk3

Link to comment
15 minutes ago, trurl said:

you didn't actually tell us which drive that was so I had to open them all

 

 

Sorry about that...  Yes, Disk 3 is the one that reported errors last month.

 

Any ideas about the error report for the parity drive?  Or should I just not worry about it?  That's what's kind of concerning to me.  I'm cool with waiting a bit to see if more errors appear if it is only one disk, but being there are 2 and one of them is Parity I'm getting a little nervous.

Link to comment
13 hours ago, geofbennett said:

cool with waiting a bit to see if more errors appear if it is only one disk, but being there are 2 and one of them is Parity I'm getting a little nervous.

Not sure why you would be cool with only one disk with errors. And parity isn't any more important than any other disk, arguably it is the least important since it contains none of your data.

Link to comment
48 minutes ago, trurl said:

Not sure why you would be cool with only one disk with errors. And parity isn't any more important than any other disk, arguably it is the least important since it contains none of your data.

 

It was my understanding that the Parity drive is what enables you to rebuild a Data drive if it should fail, but if you only have a single parity drive and multiple drives fail at the same time then you will lose data.  Is that not true?
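To make the single-parity point concrete, here is a minimal sketch assuming parity is a plain byte-wise XOR across the data disks. The three "disks" and their contents are made up for illustration, and `xor_blocks` is a hypothetical helper, not anything from Unraid itself.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data disks, each holding one 4-byte block.
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0x0F, 0xF0, 0xAA, 0x55])
disk3 = bytes([0x01, 0x02, 0x03, 0x04])

# The parity disk stores the XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk3])

# One disk lost (disk3): XOR of parity and the survivors gives it back.
rebuilt_disk3 = xor_blocks([parity, disk1, disk2])

# Two disks lost (disk2 and disk3): parity plus the one survivor only
# yields disk2 XOR disk3, which cannot be split back into the two originals.
combined = xor_blocks([parity, disk1])
```

In this sketch `rebuilt_disk3` equals `disk3`, while `combined` is only the XOR of the two missing disks. Losing the parity disk alone costs no data, since parity can be recomputed from the data disks, but the array has no protection until it is rebuilt.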

Link to comment

I don't think it's a cable problem. There are (suggestive) indications in the syslog that the problem with your Disk3 is due to a flaky SATA port (ata6) on your motherboard (chipset). I would swap connections at the motherboard between Disk3 and another DiskN. If the problems DO "transfer" to DiskN (and stay on ata6), that eliminates Disk3 and its cable, and pins it on the board.

 

If not, ...

 

Disclaimer: not an Unraid user (just like fun problems)

 

 

Link to comment

Swapped Disk 3 and Parity. I used the new cables, but couldn't switch the cables at the motherboard, had to switch the cables at the drives because one of the plugs has a 90deg bend and the other port on the board is obstructed.

 

For my own edification, which log and which details are we looking at for the ATA errors?  (if you can explain without too much effort that is, I'm so grateful for the help, I want to learn more but I don't want to put you guys out any more than I have to)

the-dark-tower-diagnostics-20220817-1409.zip

Link to comment

ATA errors are in the syslog, like these:

 

Aug 17 14:06:48 The-Dark-Tower kernel: ata4: link is slow to respond, please be patient (ready=0)
Aug 17 14:06:50 The-Dark-Tower kernel: ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 17 14:06:50 The-Dark-Tower kernel: ata4.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
Aug 17 14:06:50 The-Dark-Tower kernel: ata4.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Aug 17 14:06:50 The-Dark-Tower kernel: ata4.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Aug 17 14:06:50 The-Dark-Tower kernel: ata4.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
Aug 17 14:06:50 The-Dark-Tower kernel: ata4.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Aug 17 14:06:50 The-Dark-Tower kernel: ata4.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Aug 17 14:06:50 The-Dark-Tower kernel: ata4.00: configured for UDMA/133

 

Fewer than before so far, but they followed the parity disk, so that suggests a device problem.
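For anyone who wants to compare ports across several diagnostics, a rough sketch of tallying these messages per ATA port could look like the following. The regular expression, the `count_slow_links` name, and the choice to match only the "link is slow to respond" text are assumptions for illustration; the sample lines are taken from the excerpt above.

```python
import re
from collections import Counter

# Matches kernel lines such as "... kernel: ata4: ..." or "... kernel: ata4.00: ..."
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

def count_slow_links(lines):
    """Count 'link is slow to respond' messages per ATA port."""
    counts = Counter()
    for line in lines:
        match = ATA_LINE.search(line)
        if match and "link is slow to respond" in match.group(2):
            counts[match.group(1)] += 1
    return counts

sample = [
    "Aug 17 14:06:48 The-Dark-Tower kernel: ata4: link is slow to respond, please be patient (ready=0)",
    "Aug 17 14:06:50 The-Dark-Tower kernel: ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)",
    "Aug 17 14:06:50 The-Dark-Tower kernel: ata4.00: configured for UDMA/133",
]

slow = count_slow_links(sample)
```

To run it against a real log, pass an open file instead of the sample list, for example `count_slow_links(open("syslog.txt"))`, where the file name is a placeholder for wherever the syslog was extracted. A port whose count stays high after a drive or cable swap is the one to look at more closely.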

Link to comment

Thanks again.

 

After resetting everything back the way it was before (including the original cables), I see that ata6 is showing the "slow to respond" message as well as some other messages that do not appear for any of the other ports.

 

Ok, new drive should be here Friday.  Once it is installed should I check the diagnostics before or after rebuilding parity? Or Both?

the-dark-tower-diagnostics-20220817-1704.zip

Link to comment
  • 1 month later...

Just as a follow up and to close this out in case anybody has the same or similar problem in the future...

 

I replaced the parity drive, restarted, and rebuilt parity over 30 days ago.  Every weekly parity check since has turned up with zero errors.

 

I also reset the data drive that had reported errors, and it has not reported any problems since (though I have a feeling it will before long).

 

The attached diagnostics no longer show any of the ata links as slow to respond.

 

Thanks again to everybody for their help.

the-dark-tower-diagnostics-20220922-1035.zip
