Greg-Mega Posted December 1, 2020 (edited)

Migrated to Unraid exactly 12 months ago this week, and I'm having my first issue I'm struggling to solve. Recent parity check history:

2020-12-01, 18:54:27   9 hr, 5 min, 8 sec    122.3 MB/s   OK   1083 errors   CORRECTIVE
2020-12-01, 08:56:03   8 hr, 56 min, 2 sec   124.4 MB/s   OK   364 errors    NON-CORRECTIVE
2020-11-15, 10:36:24   9 hr, 26 sec          123.4 MB/s   OK   0 errors      NON-CORRECTIVE

I'm currently running another non-corrective parity check; it's about 70% complete and already at 416 errors. I haven't had any power failures or unexpected shutdowns of any sort; the machine is on a giant UPS and locked in a rack. The only thing I can identify is that the sectors are very close together between the corrective check earlier today and the non-corrective check I'm running now:

Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689640
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689648
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689656
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689664
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689672
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689680
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689688
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689696
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689704
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689712
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689720
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689728
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689736
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689744
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689752
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689760
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689768
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689776
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689784
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689792
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689800
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689808
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689816
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689824
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689832

And now:

Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689840
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689848
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689856
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689864
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689872
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689880
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689888
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689896
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689904
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689912
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689920
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689928
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689936
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689944

I have checked the SMART data on all my disks: 0 reallocated sectors and 0 current pending sectors on every one. Three of my Constellation ES.3s have a few UDMA CRC errors (counts of 1, 14 and 47), but as far as I can tell they haven't gone up, and they're certainly not climbing rapidly. Nothing else obvious; these were spare disks from my business I had lying around, and they only have 11 months of power-on time.

The only thing I've really changed since the 15th was migrating some Hyper-V VMs to KVM. I've checked my VM configs to make sure I wasn't writing directly to any array disks (e.g. /mnt/diskX/); they're all pointed at either /mnt/user/share or, mostly, unassigned disks (e.g. /mnt/disks/MODEL_NUMBER/). I did change a setting on the domains share to disable caching while I was converting some VHDX files to qcow2 (Use cache = No; I think the default was Prefer), but I doubt that would cause parity issues.

Hardware is a Dell T430, E5-2620 v4, 96GB DDR4 registered ECC, 2x 750W PSUs, an LSI MegaRAID SAS-3 3108 (in IT mode) and an LSI SAS 2008 (with IT firmware), running Unraid 6.8.0. Drive arrangement is 4TB drives, mainly Constellation ES.3 and WD Black: 7 data, 2 parity.

My plan at this stage is to run another corrective parity check overnight, with all VMs (except one small VM) shut down, in the hope it fixes things, and see what happens. I'm not sure where else to start looking (diags attached); hopefully someone can spot something I'm overlooking.

unraid-diagnostics-20201201-2333.zip

Edited December 5, 2020 by Greg-Mega
JorgeB Posted December 1, 2020

ECC RAM should rule that out. It could be a board/CPU/controller issue, or just a disk returning bad data; unfortunately there's no easy way to tell except to start swapping things around.
Greg-Mega Posted December 1, 2020 (Author)

2020-12-02, 10:21:25   8 hr, 57 min, 2 sec   124.2 MB/s   OK   2396 errors

Just finished another parity check: 2396 errors this time around! It seems to be getting worse. No idea where to even start looking for the culprit from here.
trurl Posted December 2, 2020

Have you run an extended SMART test on each disk?
Greg-Mega Posted December 2, 2020 (Author)

Running one right now on each disk; hopefully it identifies something! These look like they're going to take a while to complete, so I will report back!
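In case it's useful to anyone following along, kicking the tests off on every disk at once is just a loop over smartctl (which ships with Unraid). A sketch: `start_long_tests` is a made-up helper, and the device names are examples, so check yours first (e.g. with `ls -l /dev/disk/by-id`).

```shell
# Sketch: start an extended (long) SMART self-test on several disks at once.
# start_long_tests is a made-up helper name; smartctl ships with Unraid.
start_long_tests() {
    for dev in "$@"; do
        echo "Starting extended self-test on $dev"
        smartctl -t long "$dev" || echo "  (smartctl failed on $dev)"
    done
}
# e.g.: start_long_tests /dev/sd[b-j]
# check progress/results later with: smartctl -l selftest /dev/sdb
```

The tests run inside the drives themselves, so they all proceed in parallel; only the kickoff is sequential.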
Greg-Mega Posted December 2, 2020 (Author, edited)

One disk I failed to really look at was my cache disk (an Intel SSD), and this probably doesn't look good:

187 Uncorrectable error count   0x000f   115   115   050   Pre-fail   Always   Never   105985041

I have a heap of spare SSDs, and they're larger than this one, so I guess I could swap it out after the extended SMART tests are done. Would that cause the parity errors, though, it being a cache disk and not part of the array? I wouldn't have thought so, which is why I didn't really look at its SMART health.

Edited December 2, 2020 by Greg-Mega
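A quick way to watch a raw value like that over time, if anyone wants it. A sketch: `smart_attr` is a made-up helper that just picks the last column of the matching attribute row in `smartctl -A` output.

```shell
# Sketch: pull the raw value of one SMART attribute (e.g. 187) out of
# `smartctl -A` output. smart_attr is a made-up helper name.
# Note: some raw values carry suffixes (e.g. temperature "36 (Min/Max
# 20/45)"), in which case the last field alone isn't the whole story.
smart_attr() { awk -v id="$1" '$1 == id { print $NF }'; }
# e.g.: smartctl -A /dev/sdX | smart_attr 187
```

Rerunning it after a parity check makes it easy to see whether the count is still climbing.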
JorgeB Posted December 2, 2020

The cache disk won't have anything to do with parity errors. It could be a disk returning bad data without showing anything in SMART (there was a case like that just recently), but it could also be some other hardware problem.
Greg-Mega Posted December 2, 2020 (Author)

I just found a docker container with a log location set to /mnt/disks/SSD/appdata instead of /mnt/cache/appdata. I shut down all my dockers last night before I ran the corrective check, so I'm just running another non-corrective check now. Could that do it? I know it would if it were a data disk, but it doesn't seem right that a cache disk would cause it. I guess I'll know in a few hours! The extended SMART tests all came back fine.
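For anyone wanting to audit this across all containers in one go, the host paths can be grepped straight out of the docker templates on the flash drive. A sketch: `list_host_paths` is a made-up helper name, and the templates-user folder is the usual dockerMan location, so adjust if yours differs.

```shell
# Sketch: list every /mnt/... host path referenced by the Unraid docker
# templates, so stray locations like /mnt/disks/SSD/appdata stand out.
# list_host_paths is a made-up helper; templates-user is the usual
# dockerMan folder on the flash drive.
list_host_paths() { grep -ho '/mnt/[^<"[:space:]]*' "$@" | sort -u; }
# e.g.: list_host_paths /boot/config/plugins/dockerMan/templates-user/*.xml
```

One stray path per container is easy to miss in the GUI; a deduplicated list makes the odd one out obvious.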
JorgeB Posted December 2, 2020

25 minutes ago, Greg-Mega said:
  Could that do it?

No.
Greg-Mega Posted December 3, 2020 (Author, edited)

Just replaced all my SFF-8088 cables, the backplane, and the cache drive - it actually reported errors when I stopped the array, so I moved everything off it and installed a new one. I've just started a parity check with write corrections enabled, so in about 24 hours, after this one and another non-corrective check are done, I'll know something! Fingers crossed!

I do see one change already since swapping things: the parity check is running at 160MB/sec, where it was maxing out at 140MB/sec before.

Total size: 4 TB
Elapsed time: 10 minutes
Current position: 98.1 GB (2.5 %)
Estimated speed: 165.7 MB/sec
Estimated finish: 6 hours, 33 minutes
Sync errors corrected: 0

So I'm hoping that's a good sign!

Edited December 3, 2020 by Greg-Mega
Greg-Mega Posted December 3, 2020 (Author)

First pass done; corrected 5835 errors. The second pass is running now.
Greg-Mega Posted December 4, 2020 (Author)

Total size: 4 TB
Elapsed time: 4 hours, 21 minutes
Current position: 2.38 TB (59.4 %)
Estimated speed: 136.5 MB/sec
Estimated finish: 3 hours, 18 minutes
Sync errors corrected: 0

Well, this is looking promising! Errors appeared around 30-40% on previous runs!
Greg-Mega Posted December 4, 2020 (Author)

658 corrected errors this time around... it's an improvement. I'm wondering if it's heat-related, so I'm going to relocate the server out of the rack and into my office for the weekend, which is a fair bit cooler.
Greg-Mega Posted December 4, 2020 (Author)

Quote:
  Unraid Parity check: 05-12-2020 05:42 AM
  Notice [UNRAID] - Parity check finished (0 errors)
  Duration: 8 hours, 35 minutes, 45 seconds. Average speed: 129.3 MB/s

I moved the server into my office and increased the minimum fan speeds from 10% to 40%. That made it quite a bit louder, but it dropped the temperature of one of my parity drives by 6 degrees, and I can actually feel some airflow being sucked over the backplane now. When I moved it from the rack to my office, I noticed the heatsinks on my HBAs were fairly warm (not scorch-your-fingers hot), so the extra circulation should help them too. I think the errors were caused by the HBA controllers getting hot; the drives themselves weren't crazy hot.

Now, the bigger issue is where to relocate the rack temporarily. Eventually I'll build out downstairs (12 m x 15 m), but at the moment it's just a slab, core-filled blockwork walls and a glass sliding door that I'm using as a workshop, and since we're renovating another house, all my money is tied up in finishing that. So I'm thinking our 4th bedroom, which is covered by the ducted HVAC, is going to become a makeshift server room. Kind of wish it was just a drive failing, haha.
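If anyone wants to keep an eye on drive temps from the shell rather than the GUI, here's a rough sketch. Assumptions: `temp_c` is a made-up helper; attribute 194 (Temperature_Celsius) holds the current temperature in its raw-value column on most ATA disks, and the device glob is an example.

```shell
# Sketch: quick temperature sweep across drives using smartctl.
# temp_c is a made-up helper; on most ATA disks, SMART attribute 194
# (Temperature_Celsius) reports the current temp as its raw value.
temp_c() { awk '$1 == 194 { print $10 }'; }
for dev in /dev/sd?; do
    t=$(smartctl -A "$dev" 2>/dev/null | temp_c)
    echo "$dev: ${t:-n/a} C"
done
```

Running it before and after a fan-speed change gives a quick sanity check that the airflow tweak actually reached the drives.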