Oh oh - 2128 parity errors - General Support

August 16, 2023

Hello all.

Today (Tuesday) I ran a parity check and the result was 2128 parity errors. Oh oh. After running some self-tests I think I've identified the culprit disk, but I'm seeking input from more experienced users to make sure my conclusions are reasonable.

I've been running weekly parity checks for about a year (daily checks prior to that). They have all had 0 errors until recently. Aug 6 reported 226, and Aug 13 reported 128. So this concerned me but I wasn't getting other disk errors that gave me reason to suspect a particular disk. So I started keeping a closer eye on the array disks. I ran extended self-tests on all 3 disks - 1 parity and 2 data disks. The self-tests on the parity disk and data disk 1 finished normally and I didn't see anything that jumped out at me. But disk 2 had some trouble finishing the self-test - I had to turn off the spin-down to get the self test to finish, and it finally did.

Yesterday I ran another parity check and the result was zero errors. Today I ran the parity check again, and today there were a whopping 2128 errors. At least that looks whopping to me.

What is interesting is that I was also running the extended self-test on disk 2 at the same time. I thought the self test wouldn't/shouldn't be affecting unraid processes. Is that wrong?

So I think disk 2 is having some issues, and when I look at the self-test report I see

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   083   064   006    -    183312320
...
195 Hardware_ECC_Recovered  -O-RC-   083   064   000    -    183312320

The other 2 disks are around 79,400 for both these attributes. (I'm assuming the numbers for these 2 attributes are a real count of errors.)

Since an ECC recovery appears to have been accomplished for all the read errors, I would expect that the data should be fine, but the parity check errors suggest that there are errors being introduced onto the disk.

I'd like to ask for some guidance from the community. Is the conclusion that disk 2 is having problems sound? Are there other diagnostics I should check? Are there other possible causes of the parity errors I should check before replacing disk 2?

There appears to be nothing on disk 2. I have isos configured to go there but I am not running any VMs, so there are no ISOs. I seem to have the unraid configuration set up to be storing my data all on disk 1 - 635 GB (on 2TB drives). This appears to be consistent with the High-water allocation setting.

I am attaching diagnostics and the self-test reports for all 3 disks.

Thank you for any and all help and feedback.

chip-un-diagnostics-20230816-0004.zip Parity-smart-20230814-2304.zip Disk1-smart-20230814-2305.zip Disk2-smart-20230815-2303.zip

August 16, 2023

Community Expert

5 hours ago, MrChip said:

So I think disk 2 is having some issues,

Those are OK; with Seagate drives there's a specific way to read them.

RAM would be the #1 suspect. It's very uncommon for the cause to be a disk, though it can happen. Start by running memtest for at least a couple of hours; more would be better. Alternatively, remove one of your RAM sticks and run two parity checks (the 1st one can still find errors). If the 2nd one finds more, try the other stick for 2 tests; if the errors persist with either stick, that would basically rule out a RAM issue.

August 17, 2023

Author

On 8/16/2023 at 6:00 AM, JorgeB said:

RAM would be the #1 suspect,

Thanks for the input, but what's the number 2 suspect? I ran Memtest86 (10.5) - 4 passes, 5 hours and 8 minutes elapsed time - and there were zero errors.

What's the way to read the Seagate drive figures? All 3 drives are Seagate and were bought at the same time, but just one has the error rate in the hundreds of millions. The other two are in the 79,000 range. I'm again thinking that disk 2 has issues.

August 17, 2023

Community Expert

Seagate values for those attributes do have a meaning, they are just multibit. You need to convert the raw value to hex: the last 8 hex digits show the total number of reads/seeks, so if there are only 8 digits (or fewer) it's fine. If there are more than 8 digits, the extra leading digits show the actual number of errors, e.g. for disk2:

RAW Value - 183312320 - in hex AED 1FC0 - so there are only values in the last 8 hex digits = 0 errors

This would be an example from a drive with actual errors:

RAW Value - 126005584255 - convert to hex 1D 5684 A17F - again the last 8 digits don't matter, but now we have 1D errors; convert back to decimal and that is a value of 29

You can usually eyeball the number: only huge numbers are reason for concern. You can also get the error value directly with smartctl:

smartctl -a -v 1,raw48:54 /dev/sdX
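
The hex conversion described above can also be sketched in a few lines of Python. This is only a minimal sketch that assumes the layout described in this post (the last 8 hex digits, i.e. the low 32 bits, hold the read/seek count and anything above them is the error count); the function name `split_seagate_raw` is made up for illustration and is not part of any tool.

```python
def split_seagate_raw(raw_value: int) -> tuple[int, int]:
    """Split a Seagate raw attribute value into (errors, operations).

    Assumption taken from this thread: the low 32 bits (last 8 hex digits)
    count reads/seeks, and any digits above them count actual errors.
    """
    operations = raw_value & 0xFFFFFFFF  # last 8 hex digits
    errors = raw_value >> 32             # whatever sits above those digits
    return errors, operations


# The two raw values quoted in this post.
for raw in (183312320, 126005584255):
    errors, operations = split_seagate_raw(raw)
    print(f"raw={raw} hex={raw:X} errors={errors} operations={operations}")
```

For the two values in this post it reports 0 and 29 errors, matching the manual conversion.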

August 17, 2023

Author

Okay, that's helping me understand the numbers. Thanks.

So using the smartctl command suggested I see an error count of 0 for all 3 disks. But now I'm still wondering why the figure for disk 2 is so much higher than the other two disks. For (parity/disk1/disk2) the Raw_Read_Error_Rate was (in the OP) (79448/79456/183312320). If I understand correctly, since the error count is 0 for all three, these numbers are the number of reads/seeks. I don't understand the difference across the three disks. If the values for the parity disk and disk1 are sensible, then the value of 183312320 for disk2 doesn't make sense, as that disk is essentially empty. And all 3 disks were bought at the same time to populate this server - their Power_On_Hours is (21736/21752/21710), so disk2 isn't any older. Disk2 should have fewer reads/seeks, not several orders of magnitude more.

Or maybe the parity and disk1 figures are too low for Raw_Read_Error_Rate? Only 79k after years of service and plenty of parity checks?

August 17, 2023

Author

I've been watching the server more carefully today, and things have changed quite a bit from the OP. In the OP the Raw_Read_Error_Rate was 79448/79456/183312320 (parity/disk1/disk2). Today I have:

root@chip-un:~# smartctl -a -v 1,raw48 /dev/sdd | grep Raw_Read
  1 Raw_Read_Error_Rate     0x000f   082   064   006    Pre-fail  Always       -       165959888
root@chip-un:~# smartctl -a -v 1,raw48 /dev/sde | grep Raw_Read
  1 Raw_Read_Error_Rate     0x000f   082   064   006    Pre-fail  Always       -       168463584
root@chip-un:~# smartctl -a -v 1,raw48 /dev/sdf | grep Raw_Read
  1 Raw_Read_Error_Rate     0x000f   082   064   006    Pre-fail  Always       -       173822392
root@chip-un:~#

(sdd=parity, sde=disk1, sdf=disk2)
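
As a quick check on the three readings above, here is a minimal Python sketch that pulls the raw value out of each line and applies the split JorgeB described (anything above the low 32 bits counted as errors). The helper name and the dictionary labels are illustrative only, and the sketch assumes the plain `-v 1,raw48` output format shown here, where the raw value is the last column.

```python
def raw_from_attribute_line(line: str) -> int:
    """Return the RAW_VALUE column (last whitespace-separated field) as an int."""
    return int(line.split()[-1])


# The three lines from the terminal output above, labelled by array role.
readings = {
    "parity (sdd)": "  1 Raw_Read_Error_Rate     0x000f   082   064   006    Pre-fail  Always       -       165959888",
    "disk1 (sde)": "  1 Raw_Read_Error_Rate     0x000f   082   064   006    Pre-fail  Always       -       168463584",
    "disk2 (sdf)": "  1 Raw_Read_Error_Rate     0x000f   082   064   006    Pre-fail  Always       -       173822392",
}

for name, line in readings.items():
    raw = raw_from_attribute_line(line)
    # Assumed layout: low 32 bits = reads/seeks, bits above = errors.
    print(f"{name}: errors={raw >> 32} operations={raw & 0xFFFFFFFF}")
```

All three values fit in 8 hex digits, so under that reading each disk shows 0 errors.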

Quite the big change for parity and disk1, but now they are in the same ballpark with disk2. So I'm no longer thinking that disk2 is the most likely culprit.

I did another parity check today - 1516 parity errors this time.

Today I did not write corrections to parity, but last time I did. I'm not sure which way to go on this question. If one of my data drives is failing, could writing parity mess things up even further?

August 17, 2023

Author

One thought I've had is to take disk2 out of the array and then check parity on just disk1 for a few days. If the parity errors cease, then disk 2 is suspect. Otherwise disk1 or parity is suspect.

Could the disk controller be a source of these issues? I have these disks on an HBA.
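
The reasoning behind the isolation idea above is that a parity mismatch by itself does not say which device is wrong. The following is a toy Python sketch of that point, assuming single parity behaves as a bytewise XOR across the data disks (which is how Unraid's single parity is generally described); the byte strings and function names are invented for illustration.

```python
from functools import reduce


def xor_parity(*disks: bytes) -> bytes:
    """Bytewise XOR across equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))


def mismatches(parity: bytes, *disks: bytes) -> int:
    """Count positions where stored parity disagrees with recomputed parity."""
    return sum(p != q for p, q in zip(parity, xor_parity(*disks)))


disk1 = bytes([0x12, 0x34, 0x56, 0x78])
disk2 = bytes([0x00, 0x00, 0x00, 0x00])  # stands in for an essentially empty disk
parity = xor_parity(disk1, disk2)

# Flip one bit on each device in turn; the check result is the same every time.
bad_disk1 = bytes([disk1[0] ^ 0x01]) + disk1[1:]
bad_disk2 = bytes([disk2[0] ^ 0x01]) + disk2[1:]
bad_parity = bytes([parity[0] ^ 0x01]) + parity[1:]

print(mismatches(parity, bad_disk1, disk2))
print(mismatches(parity, disk1, bad_disk2))
print(mismatches(bad_parity, disk1, disk2))
```

Each call reports one mismatch, so the count alone cannot point at a specific disk; removing one disk from the array and re-checking is one way to narrow it down.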

August 18, 2023

Community Expert

Could be a disk, could be the controller, could still be RAM: memtest not finding an error is never definitive. Since you have two sticks of RAM, you could try with just one; if there are still errors, try the other. That would basically rule out RAM. Don't forget that the 1st check can still find errors even if the issue was fixed, so you always need to run 2 checks when testing.

August 18, 2023

Author

Thanks.

I'm now getting file system errors on disk1:

root@chip-un:~# cd /mnt/disk1
root@chip-un:/mnt/disk1# ls ./User/kntc/Accounting 
/bin/ls: cannot access './User/kntc/Accounting': Structure needs cleaning
root@chip-un:/mnt/disk1#

~~I'm starting another thread to ask about fixing a file system on an Unraid array.~~ Edit: there's plenty of posts/info about fixing filesystem issues, so I didn't start a new post.

It's not likely a coincidence having the parity errors and the file system errors. But the cause isn't clear. I'm leaning to disk or controller, but memory is still possible too.

Edited August 18, 2023 by MrChip
Correction.

August 19, 2023

Author

I'm glad I'm well backed up ...

I've put the array into maintenance mode and performed file system check (through the GUI). There were a lot of issues that were fixed. I restored affected files from backup.

All disks pass a file system check fine now.

I brought the array online normally and did a parity check (with correction). Over 16k corrections, but I guess that may not be a big surprise given the file system damage that had occurred.

I did a second parity check right after the first one. It'll finish in a few minutes, but it's already over 17k corrections. That's way out of whack.

I'm not sure how to approach this issue from here. JorgeB suggested a memory issue, but my feeling is that if a memory issue were creating that many parity corrections, there would be other signs and symptoms in the system.

August 19, 2023

Author

Hmm, I just found these logs on the disk log for the parity drive. It shows wsdd is failing at times - segfault and general protection fault. I see them only in the disk log for the parity drive, not the other drives. I'm not sure if they are significant to my parity correction problem or not.

Oh, and they also appear in the syslog, but fewer of them: there are more wsdd segfaults in the parity drive's log than in the syslog over the same time period.

Here's what appears in the parity drive log:

Aug 17 11:15:33 chip-un kernel: sd 7:0:0:0: [sdd] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
Aug 17 11:15:33 chip-un kernel: sd 7:0:0:0: [sdd] 4096-byte physical blocks
Aug 17 11:15:33 chip-un kernel: sd 7:0:0:0: [sdd] Write Protect is off
Aug 17 11:15:33 chip-un kernel: sd 7:0:0:0: [sdd] Mode Sense: 7f 00 10 08
Aug 17 11:15:33 chip-un kernel: sd 7:0:0:0: [sdd] Write cache: enabled, read cache: enabled, supports DPO and FUA
Aug 17 11:15:33 chip-un kernel: sdd: sdd1
Aug 17 11:15:33 chip-un kernel: sd 7:0:0:0: [sdd] Attached SCSI disk
Aug 17 11:18:35 chip-un emhttpd: ST2000DM008-2FR102_ZK303HMX (sdd) 512 3907029168
Aug 17 11:18:35 chip-un kernel: mdcmd (1): import 0 sdd 64 1953514552 0 ST2000DM008-2FR102_ZK303HMX
Aug 17 11:18:35 chip-un kernel: md: import disk0: (sdd) ST2000DM008-2FR102_ZK303HMX size: 1953514552
Aug 17 11:18:35 chip-un emhttpd: read SMART /dev/sdd
Aug 17 11:18:38 chip-un root: /usr/sbin/wsdd
Aug 17 11:24:21 chip-un root: /usr/sbin/wsdd
Aug 17 11:24:23 chip-un kernel: wsdd[16602]: segfault at 1004d ip 0000000000403c92 sp 00007fffcd51c250 error 4 in wsdd[402000+4000]
Aug 18 13:19:13 chip-un emhttpd: read SMART /dev/sdd
Aug 18 13:19:55 chip-un root: /usr/sbin/wsdd
Aug 18 13:43:37 chip-un emhttpd: read SMART /dev/sdd
Aug 18 13:43:39 chip-un root: /usr/sbin/wsdd
Aug 18 13:44:01 chip-un root: /usr/sbin/wsdd
Aug 18 13:44:03 chip-un kernel: wsdd[14070]: segfault at f0007624 ip 0000000000403c92 sp 00007ffe41a216b0 error 4 in wsdd[402000+4000]
Aug 18 13:44:08 chip-un root: /usr/sbin/wsdd
Aug 18 13:44:08 chip-un wsdd[15554]: set_multicast: Failed to set IPv4 multicast
Aug 18 13:44:08 chip-un wsdd[15554]: Failed to add multicast for WSDD: Address already in use
Aug 18 13:44:08 chip-un wsdd[15554]: set_multicast: Failed to set IPv4 multicast
Aug 18 18:48:28 chip-un emhttpd: read SMART /dev/sdd
Aug 18 18:49:09 chip-un root: /usr/sbin/wsdd
Aug 18 18:51:06 chip-un emhttpd: read SMART /dev/sdd
Aug 18 18:51:06 chip-un root: /usr/sbin/wsdd
Aug 18 18:51:27 chip-un root: /usr/sbin/wsdd
Aug 18 18:51:29 chip-un kernel: wsdd[28652]: segfault at e08ade5f ip 0000000000403c92 sp 00007ffea454b4a0 error 4 in wsdd[402000+4000]
Aug 18 22:29:04 chip-un emhttpd: read SMART /dev/sdd
Aug 18 22:29:25 chip-un root: /usr/sbin/wsdd
Aug 18 22:31:45 chip-un emhttpd: read SMART /dev/sdd
Aug 18 22:31:46 chip-un root: /usr/sbin/wsdd
Aug 18 22:32:07 chip-un root: /usr/sbin/wsdd
Aug 18 22:32:09 chip-un kernel: traps: wsdd[19417] general protection fault ip:403c92 sp:7fff4f11f7c0 error:0 in wsdd[402000+4000]
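
To compare how often wsdd crashes show up in the disk log versus the main syslog, the fault lines can be counted with a short script. This is a minimal Python sketch, not an Unraid tool: the function name is hypothetical, the regular expression is written only against the two kernel message shapes visible in the excerpt above, and the sample list reuses three of those lines.

```python
import re
from collections import Counter

# Matches "wsdd[PID]: segfault ..." and "wsdd[PID] general protection fault ...".
FAULT = re.compile(r"wsdd\[(\d+)\]:? (segfault|general protection fault)")


def count_wsdd_faults(lines):
    """Return a Counter of fault kinds for kernel lines that report a wsdd crash."""
    counts = Counter()
    for line in lines:
        match = FAULT.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


sample = [
    "Aug 17 11:24:23 chip-un kernel: wsdd[16602]: segfault at 1004d ip 0000000000403c92 sp 00007fffcd51c250 error 4 in wsdd[402000+4000]",
    "Aug 18 13:44:08 chip-un wsdd[15554]: set_multicast: Failed to set IPv4 multicast",
    "Aug 18 22:32:09 chip-un kernel: traps: wsdd[19417] general protection fault ip:403c92 sp:7fff4f11f7c0 error:0 in wsdd[402000+4000]",
]
print(count_wsdd_faults(sample))
```

Feeding it the lines of each log file in turn (for example via `open(path)`) would give per-file totals to compare; the full excerpt above contains three segfault lines and one general protection fault.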

August 19, 2023

Community Expert

2 hours ago, MrChip said:

but my feeling is that if a memory issue were creating that many parity corrections, there would be other signs and symptoms in the system.

Like file system corruption? That's also a possible result of bad RAM. If it happened on multiple disks it's unlikely to be a disk problem; it could be the board/controller. Did you do this?

On 8/18/2023 at 8:33 AM, JorgeB said:

since you have two sticks of RAM, you could try with just one; if there are still errors, try the other. That would basically rule out RAM. Don't forget that the 1st check can still find errors even if the issue was fixed, so you always need to run 2 checks when testing.

August 23, 2023

Author
Solution

Disk1 has now failed (several days ago now). A file system check showed tons of errors, and the root of the file system was unreadable. I took disk1 out of the array so the array had just disk2 and parity. I rebuilt the parity and restored the data from backup. I've run several parity checks since then and all are zero errors.

I bought a new disk, zeroed it, and added it to the array today. A parity-sync is in progress. I'll run a parity check daily for a time to keep an eye on it.

I think disk1 was the source of the parity error that prompted my OP, but I'll wait a few parity check cycles before I conclude that. I want to thank JorgeB for his input and suggestions - much appreciated.

August 30, 2023

Author

After replacing disk1 and rebuilding the parity, and waiting for several parity check cycles, I conclude that my parity errors reported in the OP were due to a failing hard drive. Since replacing the drive I have no parity errors reported over multiple parity checks. I consider this issue to be resolved.


Solved by MrChip
