Greg-Mega Posted December 1, 2020 (edited)

Migrated to Unraid exactly 12 months ago this week, and I'm having my first issue I'm struggling to solve. Recent parity check history:

2020-12-01, 18:54:27   9 hr, 5 min, 8 sec    122.3 MB/s   OK   1083 errors   CORRECTIVE
2020-12-01, 08:56:03   8 hr, 56 min, 2 sec   124.4 MB/s   OK   364 errors    NON-CORRECTIVE
2020-11-15, 10:36:24   9 hr, 26 sec          123.4 MB/s   OK   0 errors      NON-CORRECTIVE

I'm currently running another non-corrective parity check; it's about 70% complete and already at 416 errors. I haven't had any power failures or unexpected shutdowns of any sort; the machine is on a giant UPS and locked in a rack. The only thing I can identify is that the sectors are very close together between the corrective check earlier today and the non-corrective check I'm running now:

Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689640
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689648
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689656
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689664
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689672
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689680
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689688
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689696
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689704
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689712
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689720
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689728
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689736
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689744
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689752
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689760
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689768
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689776
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689784
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689792
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689800
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689808
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689816
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689824
Dec 1 13:00:58 UNRAID kernel: md: recovery thread: PQ corrected, sector=3305689832

And now:

Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689840
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689848
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689856
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689864
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689872
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689880
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689888
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689896
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689904
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689912
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689920
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689928
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689936
Dec 1 22:41:18 UNRAID kernel: md: recovery thread: PQ incorrect, sector=3305689944

I have checked the SMART data on all my disks: 0 reallocated sectors and 0 current pending sectors on every one. Three of my Constellation ES.3s have a few UDMA CRC errors (counts of 1, 14 and 47), but as far as I can tell they haven't gone up, and they're certainly not climbing rapidly. Nothing else obvious; these were spare disks from my business I had lying around, and they only have 11 months of power-on time.

The only thing I've really changed since the 15th was migrating some Hyper-V VMs to KVM. I've checked my VM configs to make sure I wasn't writing directly to any array disks (e.g. /mnt/diskX/); they're all pointed at either /mnt/user/share or, mostly, unassigned disks (e.g. /mnt/disks/MODEL_NUMBER/). I did change a setting on the domains share to disable caching while I was converting some VHDX files to qcow2 (Use cache = No; I think the default was Prefer), but I doubt that would cause parity issues.

Hardware is a Dell T430, E5-2620 v4, 96GB DDR4 registered ECC, 2x 750W PSUs, an LSI MegaRAID SAS-3 3108 (in IT mode) and an LSI SAS 2008 (with IT firmware), running Unraid 6.8.0. Drive arrangement is 4TB drives, mainly Constellation ES.3 and WD Black: 7 data, 2 parity.

My plan at this stage is to run another corrective parity check overnight, with all VMs (except one small VM) shut down, in the hope it fixes things, and see what happens. I'm not sure where else to start looking (diags attached); hopefully someone can spot something I'm overlooking.

unraid-diagnostics-20201201-2333.zip

Edited December 5, 2020 by Greg-Mega
JorgeB Posted December 1, 2020

ECC RAM should rule that out. It could be a board/CPU/controller issue, or just a disk returning bad data; unfortunately there's no easy way to tell except to start swapping things around.
Greg-Mega Posted December 1, 2020 (Author)

2020-12-02, 10:21:25   8 hr, 57 min, 2 sec   124.2 MB/s   OK   2396 errors

Just finished another parity check: 2396 errors this time around! It seems to be getting worse. No idea where to even start looking for the culprit from here.
trurl Posted December 2, 2020

Have you run an extended SMART test on each disk?
Greg-Mega Posted December 2, 2020 (Author)

Running one right now on each disk; hopefully it identifies something! These look like they're going to take a while to complete, so I will report back!
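In case it's useful to anyone following along, kicking the tests off on every disk at once is just a loop over smartctl (which ships with Unraid). A sketch: `start_long_tests` is a made-up helper, and the device names are examples, so check yours first (e.g. with `ls -l /dev/disk/by-id`).

```shell
# Sketch: start an extended (long) SMART self-test on several disks at once.
# start_long_tests is a made-up helper name; smartctl ships with Unraid.
start_long_tests() {
    for dev in "$@"; do
        echo "Starting extended self-test on $dev"
        smartctl -t long "$dev" || echo "  (smartctl failed on $dev)"
    done
}
# e.g.: start_long_tests /dev/sd[b-j]
# check progress/results later with: smartctl -l selftest /dev/sdb
```

The tests run inside the drives themselves, so they all proceed in parallel; only the kickoff is sequential.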
Greg-Mega Posted December 2, 2020 (Author, edited)

One disk I failed to really look at was my cache disk (an Intel SSD), and this probably doesn't look good:

187 Uncorrectable error count   0x000f   115   115   050   Pre-fail   Always   Never   105985041

I have a heap of spare SSDs, and they're larger than this one, so I guess I could swap it out after the extended SMART tests are done. Would that cause the parity errors, though, it being a cache disk and not part of the array? I wouldn't have thought so, which is why I didn't really look at its SMART health.

Edited December 2, 2020 by Greg-Mega
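A quick way to watch a raw value like that over time, if anyone wants it. A sketch: `smart_attr` is a made-up helper that just picks the last column of the matching attribute row in `smartctl -A` output.

```shell
# Sketch: pull the raw value of one SMART attribute (e.g. 187) out of
# `smartctl -A` output. smart_attr is a made-up helper name.
# Note: some raw values carry suffixes (e.g. temperature "36 (Min/Max
# 20/45)"), in which case the last field alone isn't the whole story.
smart_attr() { awk -v id="$1" '$1 == id { print $NF }'; }
# e.g.: smartctl -A /dev/sdX | smart_attr 187
```

Rerunning it after a parity check makes it easy to see whether the count is still climbing.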
JorgeB Posted December 2, 2020

The cache disk won't have anything to do with parity errors. It could be a disk returning bad data without showing anything in SMART (there was a case like that just recently), but it could also be some other hardware problem.
Greg-Mega Posted December 2, 2020 (Author)

I just found a docker container with a log location set to /mnt/disks/SSD/appdata instead of /mnt/cache/appdata. I shut down all my dockers last night before I ran the corrective check, so I'm just running another non-corrective check now. Could that do it? I know it would if it were a data disk, but it doesn't seem right that a cache disk would cause it. I guess I'll know in a few hours! The extended SMART tests all came back fine.
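For anyone wanting to audit this across all containers in one go, the host paths can be grepped straight out of the docker templates on the flash drive. A sketch: `list_host_paths` is a made-up helper name, and the templates-user folder is the usual dockerMan location, so adjust if yours differs.

```shell
# Sketch: list every /mnt/... host path referenced by the Unraid docker
# templates, so stray locations like /mnt/disks/SSD/appdata stand out.
# list_host_paths is a made-up helper; templates-user is the usual
# dockerMan folder on the flash drive.
list_host_paths() { grep -ho '/mnt/[^<"[:space:]]*' "$@" | sort -u; }
# e.g.: list_host_paths /boot/config/plugins/dockerMan/templates-user/*.xml
```

One stray path per container is easy to miss in the GUI; a deduplicated list makes the odd one out obvious.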
JorgeB Posted December 2, 2020

25 minutes ago, Greg-Mega said:
  Could that do it?

No.
Greg-Mega Posted December 3, 2020 (Author, edited)

Just replaced all my SFF-8088 cables, the backplane, and the cache drive - it actually reported errors when I stopped the array, so I moved everything off it and installed a new one. I've just started a parity check with write corrections enabled, so in about 24 hours, after this one and another non-corrective check are done, I'll know something! Fingers crossed!

I do see one change already since swapping things: the parity check is running at 160MB/sec, where it was maxing out at 140MB/sec before.

Total size: 4 TB
Elapsed time: 10 minutes
Current position: 98.1 GB (2.5 %)
Estimated speed: 165.7 MB/sec
Estimated finish: 6 hours, 33 minutes
Sync errors corrected: 0

So I'm hoping that's a good sign!

Edited December 3, 2020 by Greg-Mega
Greg-Mega Posted December 3, 2020 (Author)

First pass done; corrected 5835 errors. The second pass is running now.
Greg-Mega Posted December 4, 2020 (Author)

Total size: 4 TB
Elapsed time: 4 hours, 21 minutes
Current position: 2.38 TB (59.4 %)
Estimated speed: 136.5 MB/sec
Estimated finish: 3 hours, 18 minutes
Sync errors corrected: 0

Well, this is looking promising! Errors appeared around 30-40% on previous runs!
Greg-Mega Posted December 4, 2020 (Author)

658 corrected errors this time around... it's an improvement. I'm wondering if it's heat-related, so I'm going to relocate the server out of the rack and into my office for the weekend, which is a fair bit cooler.
Greg-Mega Posted December 4, 2020 (Author)

Quote:
  Unraid Parity check: 05-12-2020 05:42 AM
  Notice [UNRAID] - Parity check finished (0 errors)
  Duration: 8 hours, 35 minutes, 45 seconds. Average speed: 129.3 MB/s

I moved the server into my office and increased the minimum fan speeds from 10% to 40%. That made it quite a bit louder, but it dropped the temperature of one of my parity drives by 6 degrees, and I can actually feel some airflow being sucked over the backplane now. When I moved it from the rack to my office, I noticed the heatsinks on my HBAs were fairly warm (not scorch-your-fingers hot), so the extra circulation should help them too. I think the errors were caused by the HBA controllers getting hot; the drives themselves weren't crazy hot.

Now, the bigger issue is where to relocate the rack temporarily. Eventually I'll build out downstairs (12 m x 15 m), but at the moment it's just a slab, core-filled blockwork walls and a glass sliding door that I'm using as a workshop, and since we're renovating another house, all my money is tied up in finishing that. So I'm thinking our 4th bedroom, which is covered by the ducted HVAC, is going to become a makeshift server room. Kind of wish it was just a drive failing, haha.
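If anyone wants to keep an eye on drive temps from the shell rather than the GUI, here's a rough sketch. Assumptions: `temp_c` is a made-up helper; attribute 194 (Temperature_Celsius) holds the current temperature in its raw-value column on most ATA disks, and the device glob is an example.

```shell
# Sketch: quick temperature sweep across drives using smartctl.
# temp_c is a made-up helper; on most ATA disks, SMART attribute 194
# (Temperature_Celsius) reports the current temp as its raw value.
temp_c() { awk '$1 == 194 { print $10 }'; }
for dev in /dev/sd?; do
    t=$(smartctl -A "$dev" 2>/dev/null | temp_c)
    echo "$dev: ${t:-n/a} C"
done
```

Running it before and after a fan-speed change gives a quick sanity check that the airflow tweak actually reached the drives.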