August 16, 2023

Hello all. Today (Tuesday) I ran a parity check and the result was 2128 parity errors. Uh oh. After running some self-tests I think I've identified the culprit disk, but I'm seeking input from more experienced users to make sure my conclusions are reasonable.

I've been running weekly parity checks for about a year (daily checks prior to that). They all had 0 errors until recently: Aug 6 reported 226, and Aug 13 reported 128. This concerned me, but I wasn't getting other disk errors that gave me reason to suspect a particular disk, so I started keeping a closer eye on the array disks.

I ran extended self-tests on all 3 disks - 1 parity and 2 data disks. The self-tests on the parity disk and data disk 1 finished normally and nothing jumped out at me. But disk 2 had some trouble finishing the self-test - I had to turn off the spin-down before it finally completed.

Yesterday I ran another parity check and the result was zero errors. Today I ran the parity check again, and there were a whopping 2128 errors. At least that looks whopping to me. What is interesting is that I was also running the extended self-test on disk 2 at the same time. I thought the self-test wouldn't/shouldn't affect Unraid processes. Is that wrong?

So I think disk 2 is having some issues, and when I look at the self-test report I see:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   083   064   006    -    183312320
...
195 Hardware_ECC_Recovered  -O-RC-   083   064   000    -    183312320

The other 2 disks are around 79,400 for both these attributes. (I'm assuming the numbers for these 2 attributes are a real count of errors.)

Since an ECC recovery appears to have been accomplished for all the read errors, I would expect the data to be fine, but the parity check errors suggest that errors are being introduced onto the disk.

I'd like to ask for some guidance from the community. Is the conclusion that disk 2 is having problems sound? Are there other diagnostics I should check? Are there other possible causes of the parity errors I should check before replacing disk 2?

There appears to be nothing on disk 2. I have isos configured to go there, but I am not running any VMs, so there are no ISOs. My Unraid configuration seems to be storing all my data on disk 1 - 635 GB (on 2TB drives). This appears to be consistent with the High-water allocation setting.

I am attaching diagnostics and the self-test reports for all 3 disks. Thank you for any and all help and feedback.

Attachments: chip-un-diagnostics-20230816-0004.zip, Parity-smart-20230814-2304.zip, Disk1-smart-20230814-2305.zip, Disk2-smart-20230815-2303.zip
August 16, 2023 Community Expert

5 hours ago, MrChip said: So I think disk 2 is having some issues,

Those are OK; with Seagate drives there's a specific way to read them.

RAM would be the #1 suspect. It's very uncommon for it to be a disk, though it can happen. Start by running memtest for at least a couple of hours (more would be better). Alternatively, remove one of your RAM sticks and run two parity checks (the 1st one can still find errors). If the 2nd one finds more, try the other stick for 2 tests; that would basically rule out a RAM issue.
August 17, 2023 Author

On 8/16/2023 at 6:00 AM, JorgeB said: RAM would be the #1 suspect,

Thanks for the input, but what's the number 2 suspect? I ran Memtest86 (10.5) - 4 passes, 5 hours and 8 minutes elapsed time - and there were zero errors.

What's the way to read the Seagate drive figures? All 3 drives are Seagate and were bought at the same time, but just one has the error rate in the millions. The other two are in the 79,000 range. I'm again thinking that disk 2 has issues.
August 17, 2023 Community Expert

Seagate values for those attributes do have a meaning, they are just multibit, so you need to convert them to hex. The last 8 hex digits show the total number of reads/seeks. If there are only 8 digits (or fewer) it's fine; if there are more than 8 digits, the extra leading digits show the actual number of errors. E.g. for disk2:

RAW Value - 183312320 - in hex AED 1FC0 - so there are only values in the last 8 hex digits = 0 errors

This would be an example from a drive with actual errors:

RAW Value - 126005584255 - convert to hex 1D 5684 A17F - again the last 8 digits don't matter, but now we have 1D errors; convert that back to decimal and it is a value of 29

You can usually eyeball the number; only huge numbers are reason for concern. You can also get the error value directly with SMARTCTL:

smartctl -a -v 1,raw48:54 /dev/sdX
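The hex conversion described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this post (last 8 hex digits = reads/seeks, anything above them = errors); the function name `split_seagate_raw` is made up for the example and is not part of smartctl or any library.

```python
def split_seagate_raw(raw):
    """Split a Seagate-style raw value into (error_count, operation_count).

    Assumption taken from the post above: the last 8 hex digits (low 32 bits)
    count reads/seeks, and any digits above them hold the error count.
    """
    raw &= (1 << 48) - 1            # SMART raw values are 48 bits wide
    errors = raw >> 32              # hex digits above the last 8
    operations = raw & 0xFFFFFFFF   # last 8 hex digits
    return errors, operations


# The two raw values quoted in the post.
for raw in (183312320, 126005584255):
    errors, operations = split_seagate_raw(raw)
    print(f"raw={raw} hex={raw:X} errors={errors} operations={operations}")
```

For the two values quoted above this gives 0 errors for 183312320 (hex AED1FC0) and 29 errors for 126005584255 (hex 1D5684A17F), matching the manual conversion.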
August 17, 2023 Author

Okay, that's helping me understand the numbers. Thanks. Using the smartctl command suggested, I see an error count of 0 for all 3 disks.

But now I'm still wondering why the figure for disk 2 is so much higher than for the other two disks. For (parity/disk1/disk2) the Raw_Read_Error_Rate in the OP was (79448/79456/183312320). If I understand correctly, since the error count is 0 for all three, these numbers are the number of reads/seeks.

I don't understand the difference across the three disks. If the values for the parity disk and disk1 are sensible, then the value of 183312320 for disk2 doesn't make sense, as that disk is essentially empty. And all 3 disks were bought at the same time to populate this server - their Power_On_Hours is (21736/21752/21710), so disk2 isn't any older. Disk2 should have fewer reads/seeks, not several orders of magnitude more. Or maybe the parity and disk1 figures are too low for Raw_Read_Error_Rate? Only 79k after years of service and plenty of parity checks?
August 17, 2023 Author

I've been watching the server more carefully today, and things have changed quite a bit from the OP. In the OP the Raw_Read_Error_Rate was 79448/79456/183312320 (parity/disk1/disk2). Today I have:

root@chip-un:~# smartctl -a -v 1,raw48 /dev/sdd | grep Raw_Read
  1 Raw_Read_Error_Rate     0x000f   082   064   006    Pre-fail  Always       -       165959888
root@chip-un:~# smartctl -a -v 1,raw48 /dev/sde | grep Raw_Read
  1 Raw_Read_Error_Rate     0x000f   082   064   006    Pre-fail  Always       -       168463584
root@chip-un:~# smartctl -a -v 1,raw48 /dev/sdf | grep Raw_Read
  1 Raw_Read_Error_Rate     0x000f   082   064   006    Pre-fail  Always       -       173822392
root@chip-un:~#

(sdd=parity, sde=disk1, sdf=disk2)

That's quite a big change for parity and disk1, but now they are in the same ballpark as disk2. So I'm no longer thinking that disk2 is the most likely culprit.

I did another parity check today - 1516 parity errors this time. Today I did not write corrections to parity, but last time I did. I'm not sure which way to go on this question. If one of my data drives is failing, could writing parity mess things up even further?
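As a quick cross-check of the three readings above, here is a small Python sketch that pulls the RAW_VALUE column out of each attribute row and applies the rule from JorgeB's earlier post (anything above the last 8 hex digits is the error count). The helper names are hypothetical, and the rows are copied from the output quoted in this post; it assumes the raw value is always the last whitespace-separated column.

```python
def raw_from_attribute_row(row):
    """Return RAW_VALUE, assumed to be the last column of a smartctl attribute row."""
    return int(row.split()[-1])


def split_raw(raw):
    """Return (error_count, operation_count) using the 8-hex-digit rule."""
    return raw >> 32, raw & 0xFFFFFFFF


# Attribute rows as quoted in the post (sdd=parity, sde=disk1, sdf=disk2).
ROWS = {
    "parity (sdd)": "1 Raw_Read_Error_Rate 0x000f 082 064 006 Pre-fail Always - 165959888",
    "disk1 (sde)": "1 Raw_Read_Error_Rate 0x000f 082 064 006 Pre-fail Always - 168463584",
    "disk2 (sdf)": "1 Raw_Read_Error_Rate 0x000f 082 064 006 Pre-fail Always - 173822392",
}

summary = {name: split_raw(raw_from_attribute_row(row)) for name, row in ROWS.items()}

for name, (errors, operations) in summary.items():
    print(f"{name}: errors={errors} operations={operations}")
```

All three values fit within 8 hex digits, so under that rule the error count is 0 for every disk and the figures are operation counts only.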
August 17, 2023 Author

One thought I've had is to take disk2 out of the array and then check parity on just disk1 for a few days. If the parity errors cease, then disk2 is suspect. Otherwise disk1 or parity is suspect.

Could the disk controller be a source of these issues? I have these disks on an HBA.
August 18, 2023 Community Expert

Could be a disk, could be the controller, could still be RAM, since memtest not finding an error is never definitive. Since you have two sticks of RAM you could try with just one; if there are still errors, try the other. That would basically rule out RAM. Don't forget that the 1st check can still find errors even if the issue was fixed, so you always need to run 2 checks when testing.
August 18, 2023 Author

Thanks. I'm now getting file system errors on disk1:

root@chip-un:~# cd /mnt/disk1
root@chip-un:/mnt/disk1# ls ./User/kntc/Accounting
/bin/ls: cannot access './User/kntc/Accounting': Structure needs cleaning
root@chip-un:/mnt/disk1#

I'm starting another thread to ask about fixing a file system on an Unraid array. Edit: there are plenty of posts/info about fixing filesystem issues, so I didn't start a new post.

It's not likely a coincidence to have the parity errors and the file system errors together. But the cause isn't clear. I'm leaning toward disk or controller, but memory is still possible too.

Edited August 18, 2023 by MrChip: Correction.
August 19, 2023 Author

I'm glad I'm well backed up ...

I put the array into maintenance mode and performed a file system check (through the GUI). There were a lot of issues that were fixed. I restored affected files from backup. All disks pass a file system check fine now.

I brought the array online normally and did a parity check (with correction). Over 16k corrections, but I guess that may not be a big surprise given the file system damage that had occurred.

I did a second parity check right after the first one. It'll finish in a few minutes, but it's already over 17k corrections. That's way out of whack.

I'm not sure how to approach this issue from here. JorgeB suggested a memory issue, but my feeling is that if a memory issue were creating that many parity corrections, there would be other signs and symptoms in the system.
August 19, 2023 Author

Hmm, I just found these entries in the disk log for the parity drive. They show wsdd failing at times - segfault and general protection fault. I see them only in the disk log for the parity drive, not the other drives. I'm not sure whether they are significant to my parity correction problem or not. Oh, and they do also appear in the syslog, but fewer of them - I mean that there are more wsdd segfaults in the parity drive log than in the syslog over the same time period.

Here's what appears in the parity drive log:

Aug 17 11:15:33 chip-un kernel: sd 7:0:0:0: [sdd] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
Aug 17 11:15:33 chip-un kernel: sd 7:0:0:0: [sdd] 4096-byte physical blocks
Aug 17 11:15:33 chip-un kernel: sd 7:0:0:0: [sdd] Write Protect is off
Aug 17 11:15:33 chip-un kernel: sd 7:0:0:0: [sdd] Mode Sense: 7f 00 10 08
Aug 17 11:15:33 chip-un kernel: sd 7:0:0:0: [sdd] Write cache: enabled, read cache: enabled, supports DPO and FUA
Aug 17 11:15:33 chip-un kernel: sdd: sdd1
Aug 17 11:15:33 chip-un kernel: sd 7:0:0:0: [sdd] Attached SCSI disk
Aug 17 11:18:35 chip-un emhttpd: ST2000DM008-2FR102_ZK303HMX (sdd) 512 3907029168
Aug 17 11:18:35 chip-un kernel: mdcmd (1): import 0 sdd 64 1953514552 0 ST2000DM008-2FR102_ZK303HMX
Aug 17 11:18:35 chip-un kernel: md: import disk0: (sdd) ST2000DM008-2FR102_ZK303HMX size: 1953514552
Aug 17 11:18:35 chip-un emhttpd: read SMART /dev/sdd
Aug 17 11:18:38 chip-un root: /usr/sbin/wsdd
Aug 17 11:24:21 chip-un root: /usr/sbin/wsdd
Aug 17 11:24:23 chip-un kernel: wsdd[16602]: segfault at 1004d ip 0000000000403c92 sp 00007fffcd51c250 error 4 in wsdd[402000+4000]
Aug 18 13:19:13 chip-un emhttpd: read SMART /dev/sdd
Aug 18 13:19:55 chip-un root: /usr/sbin/wsdd
Aug 18 13:43:37 chip-un emhttpd: read SMART /dev/sdd
Aug 18 13:43:39 chip-un root: /usr/sbin/wsdd
Aug 18 13:44:01 chip-un root: /usr/sbin/wsdd
Aug 18 13:44:03 chip-un kernel: wsdd[14070]: segfault at f0007624 ip 0000000000403c92 sp 00007ffe41a216b0 error 4 in wsdd[402000+4000]
Aug 18 13:44:08 chip-un root: /usr/sbin/wsdd
Aug 18 13:44:08 chip-un wsdd[15554]: set_multicast: Failed to set IPv4 multicast
Aug 18 13:44:08 chip-un wsdd[15554]: Failed to add multicast for WSDD: Address already in use
Aug 18 13:44:08 chip-un wsdd[15554]: set_multicast: Failed to set IPv4 multicast
Aug 18 18:48:28 chip-un emhttpd: read SMART /dev/sdd
Aug 18 18:49:09 chip-un root: /usr/sbin/wsdd
Aug 18 18:51:06 chip-un emhttpd: read SMART /dev/sdd
Aug 18 18:51:06 chip-un root: /usr/sbin/wsdd
Aug 18 18:51:27 chip-un root: /usr/sbin/wsdd
Aug 18 18:51:29 chip-un kernel: wsdd[28652]: segfault at e08ade5f ip 0000000000403c92 sp 00007ffea454b4a0 error 4 in wsdd[402000+4000]
Aug 18 22:29:04 chip-un emhttpd: read SMART /dev/sdd
Aug 18 22:29:25 chip-un root: /usr/sbin/wsdd
Aug 18 22:31:45 chip-un emhttpd: read SMART /dev/sdd
Aug 18 22:31:46 chip-un root: /usr/sbin/wsdd
Aug 18 22:32:07 chip-un root: /usr/sbin/wsdd
Aug 18 22:32:09 chip-un kernel: traps: wsdd[19417] general protection fault ip:403c92 sp:7fff4f11f7c0 error:0 in wsdd[402000+4000]
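For anyone wanting to tally faults like these across a longer log, here is a rough Python sketch that picks out the wsdd fault lines and groups them by instruction pointer. The regular expression is an assumption based only on the two line shapes visible in the excerpt above (the "segfault at ... ip ..." form and the "traps: ... general protection fault ip:..." form); other kernel versions may format these lines differently.

```python
import re
from collections import Counter

# Matches the two wsdd fault formats seen in the excerpt; purely illustrative.
FAULT_RE = re.compile(
    r"wsdd\[(?P<pid>\d+)\]:? (?P<kind>segfault|general protection fault)"
    r".*?\bip[: ](?P<ip>[0-9a-f]+)"
)

# The four fault lines copied from the log above.
LOG_LINES = [
    "Aug 17 11:24:23 chip-un kernel: wsdd[16602]: segfault at 1004d ip 0000000000403c92 sp 00007fffcd51c250 error 4 in wsdd[402000+4000]",
    "Aug 18 13:44:03 chip-un kernel: wsdd[14070]: segfault at f0007624 ip 0000000000403c92 sp 00007ffe41a216b0 error 4 in wsdd[402000+4000]",
    "Aug 18 13:44:08 chip-un wsdd[15554]: set_multicast: Failed to set IPv4 multicast",
    "Aug 18 18:51:29 chip-un kernel: wsdd[28652]: segfault at e08ade5f ip 0000000000403c92 sp 00007ffea454b4a0 error 4 in wsdd[402000+4000]",
    "Aug 18 22:32:09 chip-un kernel: traps: wsdd[19417] general protection fault ip:403c92 sp:7fff4f11f7c0 error:0 in wsdd[402000+4000]",
]


def faults_by_ip(lines):
    """Count wsdd faults per instruction pointer (as an integer)."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            counts[int(match.group("ip"), 16)] += 1
    return counts


counts = faults_by_ip(LOG_LINES)
for ip, n in counts.items():
    print(f"ip=0x{ip:x} faults={n}")
```

On the lines shown, all four faults share the same instruction pointer (403c92). Whether that has any bearing on the parity errors is not something this tally can answer.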
August 19, 2023 Community Expert

2 hours ago, MrChip said: but my feeling is that if a memory issue was creating that many parity corrections that there would be other signs and symptoms in the system.

Like file system corruption? That's also a possible result of bad RAM. If it happened on multiple disks it's unlikely to be a disk problem; it could be the board/controller. Did you do this?

On 8/18/2023 at 8:33 AM, JorgeB said: since you have two sticks of RAM you could try with just one, if still errors try the other, that would basically rule out RAM, don't forget that the 1st check can still find errors even if the issue was fixed, so always need to run 2 checks when testing.
August 23, 2023 Author Solution

Disk1 has now failed (several days ago). A file system check showed tons of errors, and the root of the file system was unreadable.

I took disk1 out of the array so the array had just disk2 and parity. I rebuilt the parity and restored the data from backup. I've run several parity checks since then and all report zero errors.

I bought a new disk, zeroed it, and added it to the array today. A parity-sync is in progress. I'll run a parity check daily for a time to keep an eye on it.

I think disk1 was the source of the parity errors that prompted my OP, but I'll wait a few parity check cycles before I conclude that.

I want to thank JorgeB for his input and suggestions - much appreciated.
August 30, 2023 Author

After replacing disk1, rebuilding the parity, and waiting for several parity check cycles, I conclude that the parity errors reported in the OP were due to a failing hard drive. Since replacing the drive I have had no parity errors reported over multiple parity checks. I consider this issue to be resolved.