
(Solved) Parity Check Errors



I migrated to Unraid 12 months ago almost to the week, and I'm now having my first issue that I'm struggling to solve.

2020-12-01, 18:54:27	9 hr, 5 min, 8 sec	122.3 MB/s	OK	1083	CORRECTIVE
2020-12-01, 08:56:03	8 hr, 56 min, 2 sec	124.4 MB/s	OK	364 	NON-CORRECTIVE
2020-11-15, 10:36:24	9 hr, 26 sec		123.4 MB/s	OK	0	NON-CORRECTIVE

I'm currently running another non-corrective parity check; it's about 70% complete and is at 416 errors.

 

I haven't had any power failures or unexpected shutdowns of any sort; the machine is on a giant UPS and locked in a rack.

 

The only thing I can identify is that the sectors are very close together between the corrective check earlier today and the non-corrective check I'm running now:

Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689640
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689648
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689656
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689664
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689672
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689680
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689688
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689696
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689704
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689712
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689720
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689728
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689736
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689744
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689752
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689760
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689768
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689776
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689784
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689792
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689800
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689808
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689816
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689824
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689832

And now:

Dec  1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689840
Dec  1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689848
Dec  1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689856
Dec  1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689864
Dec  1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689872
Dec  1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689880
Dec  1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689888
Dec  1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689896
Dec  1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689904
Dec  1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689912
Dec  1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689920
Dec  1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689928
Dec  1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689936
Dec  1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689944
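For anyone wanting to check the same pattern in their own syslog, here is a rough Python sketch that pulls the sector numbers out of lines like the ones above and groups them into contiguous runs. The regular expression is an assumption based only on the sample lines quoted here, and the helper names (`parse_sectors`, `contiguous_runs`) are made up for illustration. The step of 8 is simply the spacing visible between consecutive sectors in both excerpts.

```python
import re

# Pattern is an assumption derived from the log lines quoted in this post.
LINE_RE = re.compile(r"md: recovery thread: PQ (corrected|incorrect), sector=(\d+)")

def parse_sectors(log_text):
    """Return (status, sector) pairs for every matching log line."""
    return [(m.group(1), int(m.group(2))) for m in LINE_RE.finditer(log_text)]

def contiguous_runs(sectors, step=8):
    """Group sector numbers into (first, last) runs whose neighbours differ by `step`."""
    runs = []
    for s in sorted(sectors):
        if runs and s - runs[-1][1] == step:
            runs[-1][1] = s          # extend the current run
        else:
            runs.append([s, s])      # start a new run
    return [tuple(r) for r in runs]

# Last two lines of the corrective check and first two of the non-corrective one.
sample = """\
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689824
Dec  1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689832
Dec  1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689840
Dec  1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689848
"""

entries = parse_sectors(sample)
runs = contiguous_runs([sector for _, sector in entries])
print(runs)
```

On this sample the four sectors form a single run: the first "incorrect" sector (3305689840) sits exactly 8 sectors after the last "corrected" one (3305689832), so the two checks are reporting one unbroken region rather than scattered errors.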

I have checked the SMART data on all my disks: every disk reports 0 reallocated sectors and 0 current pending sectors. Three of my Constellation ES.3s have a few UDMA CRC errors (counts of 1, 14 and 47), but I don't think they have changed; they're certainly not climbing rapidly. Nothing else is obvious. These were spare disks from my business that I had lying around, and they only have 11 months of uptime.

 

The only thing I've really changed since the 15th was migrating some Hyper-V VMs to KVM. I've checked my VM configs to make sure I wasn't writing directly to any array disks (e.g. /mnt/diskX/), but they all point either to /mnt/user/share or, mainly, to unassigned disks (e.g. /mnt/disks/MODEL_NUMBER/). I did change a setting on the domains share to disable caching while I was converting some VHDX files to qcow2 (use cache = no; I think the default here was prefer), but I doubt that would cause parity issues.
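The path check I did by eye could be scripted along these lines. This is only a sketch: the function name `classify_path` and the category labels are made up for illustration, and the mount point meanings are just the ones described in this thread (/mnt/diskX for an individual array disk, /mnt/user for user shares, /mnt/disks for disks outside the array, /mnt/cache for the cache). The example paths are hypothetical.

```python
import re

def classify_path(path):
    """Classify a storage path by the mount point conventions described in this thread."""
    # Accept backslash-style paths as typed in the post and normalise them.
    normalized = path.replace("\\", "/")
    if not normalized.startswith("/"):
        normalized = "/" + normalized
    if re.match(r"^/mnt/disk\d+(/|$)", normalized):
        return "array disk (direct)"
    if re.match(r"^/mnt/user(/|$)", normalized):
        return "user share"
    if re.match(r"^/mnt/disks(/|$)", normalized):
        return "unassigned disk"
    if re.match(r"^/mnt/cache(/|$)", normalized):
        return "cache"
    return "other"

# Hypothetical paths, for illustration only.
for p in ["/mnt/disk3/domains/vm1.img",
          "/mnt/user/domains/vm1.img",
          "/mnt/disks/MODEL_NUMBER/vm1.qcow2",
          "/mnt/cache/appdata"]:
    print(p, "->", classify_path(p))
```

Anything that comes back as "array disk (direct)" would be worth a second look, since those paths go straight to a single array disk instead of through a user share.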

 

Hardware is a Dell T430, E5-2620 v4, 96GB DDR4 Registered ECC, 2x 750W PSU, LSI MegaRAID SAS-3 3108 (in IT mode) & LSI SAS 2008 (w/ IT firmware), running Unraid 6.8.0.

 

Drive arrangement is 4TB drives, mainly Constellation ES.3/WD Black: 7 data, 2 parity.

 

My thinking at this stage is to shut down all VMs (except one small VM) and run another corrective parity check overnight in the hope that it fixes things, then see what happens. I'm not sure where else to start looking (diags attached); hopefully someone can spot something I'm overlooking.

 

unraid-diagnostics-20201201-2333.zip

Edited by Greg-Mega

One disk I failed to really look at was my cache disk (an Intel SSD). This probably doesn't look good:

187	Uncorrectable error count	0x000f	115	115	050	Pre-fail	Always	Never	105985041

I have a heap of spare SSDs that are also larger than this one, so I guess I could switch it out after the extended SMART test is done. Would a cache disk cause the parity errors, though, given it's not part of the array? I wouldn't have thought so, which is why I didn't really look at its SMART health.

 

 


I just found a docker container with a log location set to /mnt/disks/SSD/appdata instead of /mnt/cache/appdata. 

I shut down all my dockers last night before I ran the corrective check, so I'm just running another non-corrective check. Could that do it? I know it would if it was writing to a data disk, but it doesn't seem right that a cache disk would cause it. I guess I'll know in a few hours!
 

The extended SMART tests all came back fine.


I just replaced all my SFF-8088 cables, the backplane and the cache drive. The cache drive actually reported errors when I stopped the array, so I moved everything off it and installed a new one. I've just started a parity check with write corrections enabled, so in about 24 hours, once this and another non-corrective check are done, I'll know something. Fingers crossed!

 

I do actually see one change since I did this: my parity check is running at 160MB/sec, where it was maxing out at 140MB/sec before.

Total size:		4 TB	
Elapsed time:		10 minutes	
Current position:	98.1 GB (2.5 %)	
Estimated speed:	165.7 MB/sec	
Estimated finish:	6 hours, 33 minutes	
Sync errors corrected:	0

So I'm hoping that's a good sign! 
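As a sanity check on the estimate above, the finish time can be worked out from the position and speed. A small sketch, assuming decimal units (1 TB = 1000 GB, 1 GB = 1000 MB) and a nominal 4000 GB total; the function names are made up for illustration.

```python
def eta_seconds(total_gb, done_gb, speed_mb_s):
    """Seconds remaining at a constant speed, using decimal units (1 GB = 1000 MB)."""
    remaining_mb = (total_gb - done_gb) * 1000
    return remaining_mb / speed_mb_s

def format_hm(seconds):
    """Format a duration as whole hours and minutes (minutes rounded down)."""
    hours, rem = divmod(int(round(seconds)), 3600)
    return f"{hours} h {rem // 60} min"

# Figures taken from the status readout above.
eta = eta_seconds(total_gb=4000, done_gb=98.1, speed_mb_s=165.7)
print(format_hm(eta))
```

That works out to roughly six and a half hours, which lines up with the estimated finish shown. The estimate assumes the speed holds for the whole run.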


 

Quote

Unraid Parity check: 05-12-2020 05:42 AM

Notice [UNRAID] - Parity check finished (0 errors)
Duration: 8 hours, 35 minutes, 45 seconds. Average speed: 129.3 MB/s

I moved the server into my office and increased the minimum fan speed from 10% to 40%, which made it quite a bit louder and dropped the temperature of one of my parity drives by 6 degrees. I can actually feel some airflow getting sucked over the backplane now. When I moved it from the rack to my office, my HBAs and their heatsinks felt fairly warm, though not scorch-your-fingers hot, so this has got more air circulating around them too.

 

I think the errors were caused by the HBA controllers getting hot; the drives weren't crazy hot.

 

Now, the bigger issue is where to relocate the rack temporarily until I build out downstairs (12 m x 15 m). At the moment it's just a slab, a core-filled blockwork wall and a glass sliding door, and I'm using it as a workshop. We're renovating another house, so all the money I'd need to finish my house is tied up in that. I'm thinking our 4th bedroom, which is covered by the ducted HVAC, is going to become a makeshift server room. Kind of wish it was just a drive failing, haha.
