February 16, 2017
Good day Crew,
In addition to my monthly automated parity check, I run a parity check just before and just after every version update (currently on 6.3.1). Today was my server's automated run (the 15th) and it found 5 errors.
The server has not been physically touched in a LONG time, other than to press the power button on the front after a powerdown due to a system update. From my (limited) knowledge the SMART data of the drives looks good. I have dual parity, and all drives have been triple cleared. All array drives are XFS. My single cache drive is BTRFS.
Does anyone have any idea WHAT may have caused the parity errors and what my next steps should be? Diagnostic logs are attached.
On older servers parity errors were always due to bad drives (obvious SMART issues), bundling SATA cables together, or cracking open the case to install new drives and jostling SATA cables. This server has not shown a single parity error since it went into service about 4 years ago.
Thanks!
February 16, 2017
Bit of a mystery. The only two things that I know for sure should cause a parity error are a hard shutdown and a memory issue.
Your parity errors are in a tight little group and could be just one block of writes. That tends to make a memory error unlikely, although I can't rule it out.
Feb 15 04:40:18 Tower2 kernel: md: recovery thread: PQ corrected, sector=3519069768
Feb 15 04:40:18 Tower2 kernel: md: recovery thread: PQ corrected, sector=3519069776
Feb 15 04:40:18 Tower2 kernel: md: recovery thread: PQ corrected, sector=3519069784
Feb 15 04:40:18 Tower2 kernel: md: recovery thread: PQ corrected, sector=3519069792
Feb 15 04:40:18 Tower2 kernel: md: recovery thread: PQ corrected, sector=3519069800
The first and last of those sectors are exactly 32 sectors = 16K apart. That could very well be the size of the unRAID parity block reads and writes. The error is about 1.7T into the parity check.
I would suggest running another parity check. Sometimes I've seen that a second parity check will flip the parity on the exact same set of sectors back. That would indicate some misread or mishandling of that 16K block on the original parity check.
Sorry for the trouble. Looks like you did most everything right in your build. Maybe someone else will find something or have another suggestion.
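For anyone who wants to check the arithmetic on the five "PQ corrected" log lines in this post, here is a minimal Python sketch. It assumes the kernel is reporting 512-byte sectors (an assumption, the log itself does not say), and the names SECTOR_BYTES and corrected_sectors are made up for illustration.

```python
import re

SECTOR_BYTES = 512  # assumption: the md driver counts 512-byte sectors

LOG = """\
Feb 15 04:40:18 Tower2 kernel: md: recovery thread: PQ corrected, sector=3519069768
Feb 15 04:40:18 Tower2 kernel: md: recovery thread: PQ corrected, sector=3519069776
Feb 15 04:40:18 Tower2 kernel: md: recovery thread: PQ corrected, sector=3519069784
Feb 15 04:40:18 Tower2 kernel: md: recovery thread: PQ corrected, sector=3519069792
Feb 15 04:40:18 Tower2 kernel: md: recovery thread: PQ corrected, sector=3519069800
"""


def corrected_sectors(text):
    """Return the sector numbers from 'PQ corrected' lines, in log order."""
    return [int(m) for m in re.findall(r"PQ corrected, sector=(\d+)", text)]


sectors = corrected_sectors(LOG)
# Distinct gaps between consecutive entries; a single value means they are evenly spaced.
strides = sorted({b - a for a, b in zip(sectors, sectors[1:])})
# Distance from the first logged sector to the last one.
span = sectors[-1] - sectors[0]

print("entries:", len(sectors))
print("stride(s) in sectors:", strides)
print("first-to-last span:", span, "sectors =", span * SECTOR_BYTES // 1024, "KiB")
print("offset of first error: %.2f TB" % (sectors[0] * SECTOR_BYTES / 1e12))
```

Under the 512-byte assumption the entries are evenly spaced 8 sectors apart and the first-to-last span is 32 sectors (16 KiB). If each entry stands for one 4 KiB block (again an assumption), the five entries together would cover 40 sectors, i.e. 20 KiB, so the "16K" figure is best read as the distance between the first and last entry.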
February 16, 2017
Without looking at your diagnostics, I can say that I had exactly that error: 5 errors in contiguous blocks on a server with a SAS2LP-MV8 controller and IOMMU enabled. I disabled IOMMU and ran a correcting parity check and the problem was fixed. See here.
February 16, 2017
Having written the above I decided I ought to take a look at your diagnostics. I see:
IOMMU enabled
Marvell controller
My advice: either disable IOMMU in the BIOS or, if you really need hardware passthrough to VMs, replace your controller.
February 16, 2017 Author
Thanks bjp999 & John_M!
I kicked off a parity check this AM… hopefully the system ‘will flip the parity on the exact same set of sectors back’ and I can go back to trouble-free backups.
Hooking up a monitor will be a bit of a PITA, as I will need to pull the rig down to attach a long GPU cable (the machine is in a tight, well-vented cabinet above head level). Before physically moving the machine I will try altering syslinux.conf from
append initrd=/bzroot
to
append iommu=pt initrd=/bzroot
If I do need to hook up the GPU cable, I will run a memtest before messing with the IOMMU.
The (Marvell) controller in use happens to be the motherboard controller, which has been in use for the ~4 year lifespan. My VMs do not utilize passthrough… I have a Windows 7 VM and an Elementary OS VM that are both turned on for a couple of hours per month as needed (but not while parity is running). I also have a single docker (CrashPlan).
Just so odd that this occurred out of the blue (after upgrading to 6.3.1 I ran the pre/post parity check with no issues, as is the norm).
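The syslinux.conf change described in this post can also be scripted instead of hand-edited. This is only a sketch: the function name add_append_param and the SAMPLE boot entry are hypothetical, and it works on text in memory rather than touching any real file, so the actual path and layout on a given flash drive still need checking.

```python
def add_append_param(config_text, param="iommu=pt"):
    """Insert `param` right after 'append' on each append line that lacks it."""
    out = []
    for line in config_text.splitlines():
        words = line.split()
        if words and words[0].lower() == "append" and param not in words[1:]:
            # Keep the original indentation, then rebuild the line with the new parameter first.
            indent = line[: len(line) - len(line.lstrip())]
            line = indent + " ".join([words[0], param] + words[1:])
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical boot entry, made up for the demo; a real config will differ.
SAMPLE = """\
label unRAID OS
  kernel /bzimage
  append initrd=/bzroot
"""

print(add_append_param(SAMPLE), end="")
```

Running it twice gives the same result as running it once, so it is safe to re-apply. Keep a copy of the original file before writing anything back.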
February 16, 2017
I don't think the parity errors were introduced during the scheduled parity check. In my case that is impossible because my scheduled checks are non-correcting. Therefore the only time they could have occurred is during normal writes to the array. Both P and Q parity were affected.
February 16, 2017 Author
This server's only SMB share is for the local Windows 7 VM, which is seldom used (only when I need to use Zortrax software to prep an STL for printing). The main purpose of this server is hosting a Linux desktop from which I run rsync copies from my primary server, which are then backed up to the CrashPlan cloud. In theory, these bulk writes all go to the cache drive. Still... interesting issue here.
February 16, 2017
"I don't think the parity errors were introduced during the scheduled parity check. In my case that is impossible because my scheduled checks are non-correcting. Therefore the only time they could have occurred is during normal writes to the array. Both P and Q parity were affected."
Are you saying he may have a small corruption in a file, unless it happened to hit empty space? Do you believe that what got written to the disk is wrong and what got written to the parity was correct? If so, could he have done a rebuild and fixed it if he had run a non-correcting check?
Also, with no idea which file is on that exact sector (a knowable thing, but with no way that I know of to figure it out), there is no way to know what file is affected without md5s or a backup. Would be time consuming. (Back in the old days, there was a great tool called "The Norton Utilities" that would let you walk the file allocation chains on the disk. Wouldn't have been easy, but I bet I could have figured out which file it was with that tool on an old FAT (FAT32?) drive. Wish something like that existed for XFS!)
BTW - I have a special interest. I am on 6.0.1 with a SAS2LP and a SASLP in my server (add-on cards). Is this "iommu=pt" something I should do preemptively when I upgrade to 6.3?
February 16, 2017 Author
"Is this 'iommu=pt' something I should do preemptively when I upgrade to 6.3?"
-> http://lime-technology.com/forum/index.php?topic=40683.0
Second parity check just finished with zero errors.
Launching a third parity check.
February 16, 2017
"Are you saying he may have a small corruption in a file, unless it happened to hit empty space?"
I can't say that with any degree of certainty. I asked the question about the implications of simultaneous errors in both P and Q parity around the time of unRAID 6.2's release. With single parity the conclusion has always been that any parity check error in the absence of any obvious hardware problems is due to an error in the parity rather than in the data. With dual parity, a parity check indicating an error in P but not a corresponding one in Q quite reasonably indicates the error is indeed in P; similarly, an error in Q but not a corresponding one in P quite reasonably indicates the error is indeed in Q. But what if errors in both P and Q are found simultaneously - does that really indicate a data error? Well, the consensus was no, probably not, and even if it did there's nothing to be done about it anyway because there's no way of determining which of the data disks is affected. So the best thing to do is correct both P and Q and, if possible, check the files for corruption using checksums and replace from backups.
Since my particular server is used to back up another using a regular invocation of rsync, I changed the syntax slightly to force it to use checksums instead of timestamps. It took a long time but decided that the two servers were indeed already in sync. So my conclusion was that the five blocks of errors had indeed been simultaneous P and Q parity errors and that the actual data files were fine.
What I'd really like to try is a 16 disk array protected by four parity disks. Four parity bits protecting 2^4 data disks would give real scope for pinpointing data errors, but others see no point in extending beyond dual parity.
"Do you believe that what got written to the disk is wrong and what got written to the parity was correct? If so, could he have done a rebuild and fixed it if he had run a non-correcting check?"
I think it's possible but, in the light of my experience, unlikely. Though our situations, which on the face of it look almost identical, may not be so in reality.
"Also, with no idea which file is on that exact sector (a knowable thing, but with no way that I know of to figure it out), there is no way to know what file is affected without md5s or a backup. Would be time consuming. (Back in the old days, there was a great tool called 'The Norton Utilities' that would let you walk the file allocation chains on the disk. Wouldn't have been easy, but I bet I could have figured out which file it was with that tool on an old FAT (FAT32?) drive. Wish something like that existed for XFS!)"
Well, I should think the file system itself could be used to work out the answer. I know of no way of querying it though. I was lucky because I had another copy of every file and I think the rsync method I used was valid.
"BTW - I have a special interest. I am on 6.0.1 with a SAS2LP and a SASLP in my server (add-on cards). Is this 'iommu=pt' something I should do preemptively when I upgrade to 6.3?"
I didn't try the "iommu=pt" option because I don't need IOMMU and, according to the thread, it doesn't work in all cases anyway. IOMMU was an unnecessary complication that I was able simply to switch off. A lot of people are reporting issues with Marvell-based controllers at the moment and a fix doesn't seem to be forthcoming. If I wanted to enable IOMMU I would use an LSI-based SAS card instead.
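The checksum comparison described in this post was done with rsync (its -c/--checksum option compares file contents rather than size and timestamp). For anyone without a second server to rsync against, the same idea can be sketched in Python: hash every file in two trees and list the relative paths that differ. The function names are made up for illustration, and the demo uses throwaway temporary directories rather than real shares.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk=1 << 20):
    """Hash a file in 1 MiB chunks so large files do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def tree_hashes(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    hashes = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            hashes[os.path.relpath(full, root)] = sha256_of(full)
    return hashes


def differing_files(src_root, dst_root):
    """Return relative paths that are missing on one side or differ in content."""
    src, dst = tree_hashes(src_root), tree_hashes(dst_root)
    return sorted(p for p in src.keys() | dst.keys() if src.get(p) != dst.get(p))


# Demo: two small trees that differ in exactly one file.
with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    for root in (a, b):
        with open(os.path.join(root, "ok.bin"), "wb") as f:
            f.write(b"same on both sides")
    with open(os.path.join(a, "bad.bin"), "wb") as f:
        f.write(b"original")
    with open(os.path.join(b, "bad.bin"), "wb") as f:
        f.write(b"0riginal")
    demo_result = differing_files(a, b)

print(demo_result)
```

Like the rsync run, this reads every byte on both sides, so on a multi-terabyte array it will take a long time. An empty list means the two copies agree.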
February 16, 2017
My issue (and landS's, if it turns out to be the same) is not the classic "Marvell bug" as described in RobJ's thread. I haven't seen SATA link resets and disks dropping off-line, for example. But it has to be related, as I believe JustinChase's probably was, because disabling IOMMU completely fixed it. Many users have abandoned their SASLPs and SAS2LPs that have worked well until recently and it's becoming difficult to recommend them to potential purchasers any more. That said, they are among the few cards that work out of the box, without any tedious reflashing, and as long as IOMMU is disabled I'm happy to continue using them.
FWIW, I have three unRAID servers and all have Marvell-based cards but none has IOMMU enabled. The one I mentioned has a SAS2LP, the server it backs up has a SASLP and my HP Microserver Gen 8 has a 9235-based SATA/e-SATA card.
I will be replacing the SASLP soon, though not for this reason. It currently has only four disks connected but I want to add more without significantly increasing my parity check times. The most cost effective replacement seems to be the Dell H310, which has four times the PCIe bandwidth and would completely avoid any Marvell-related problems. So I found a used but tested and guaranteed server pull on eBay, flashed it to IT mode and it's waiting to be installed when I move that server to a bigger case.
February 16, 2017
"Second parity check just finished with zero errors. Launching a third parity check."
I'm trying to follow exactly what you are doing and the results you're getting. After Brian pointed out those five blocks of parity errors in your log you ran a correcting parity check which found zero errors? Did I get that right? Or do you mean it found those five blocks of errors and corrected them and you now have zero errors? Sorry to be pedantic but the difference is crucial. If you're unsure, please post your current syslog.
February 16, 2017 Author
All parity runs have been set to correcting, I believe.
Pre 1 - ~48 monthly auto parity runs and 2 parity runs at each update point for 4 years, all was OK
1 - V 6.3.0 pre-update, manually ran parity, all was OK
2 - V 6.3.1 post-update, manually ran parity, all was OK
3 - Monthly auto run of parity, 5 errors found... posted on the board here
4 - "After Brian pointed out those five blocks of parity errors in your log you ran a correcting parity check which found zero errors?" Correct. The system did NOT find any errors.
5 - Kicking off another parity check to see if it again finds zero errors -> current stage.
February 16, 2017
That isn't necessarily a different result from mine. In my case the five blocks of P and Q errors were detected by the scheduled non-correcting check and corrected by the subsequent correcting parity check, which I ran after disabling IOMMU. In your case the five blocks of P and Q errors were detected and corrected by the scheduled correcting check and the subsequent check showed that both parities are still in sync with the data.
I don't think the third check you're currently running will reveal anything but you might want to try writing some more files to the array and then doing a parity check. If my theory is right you should see no more parity errors if IOMMU is disabled but you may see some if it is enabled.
February 16, 2017 Author
And am I risking a meltdown if I continue to run with IOMMU enabled and see no more parity errors? I will need to pull the machine off the rack this summer to add another data drive... and will happily switch the IOMMU setting to off in the BIOS then (I take it just disabling all VT should take care of that?).
Again... just strange that this is the first time any errors have shown in 4 years... I have had dual parity since the day the official build supporting it came out, so a good number of checks have been run.
February 16, 2017
My feeling is that whenever you write to the array you risk not correctly updating parity but I don't have a lot of evidence to offer in order to convince you. You may want to risk it but you might consider increasing the frequency of your parity checks.
You only need to disable Intel VT-d or AMD-Vi. The other one (Intel VT-x or AMD-V) you can leave enabled.
I don't know your upgrade history but the "Marvell bug" came to light some time in 2015 and RobJ posted in June. Since around November 2016 there has been a marked increase in people with Marvell-related issues, possibly as they switched to the 6.3-rcs. Do a bit of searching within this General Support board and see what you find. I've been trying to keep track of related reports and I've helped one or two people who are clearly affected by the Marvell bug. It can be a tricky one because some people are not affected at all. Some people believe it to be a driver issue and others consider it to be a card firmware issue. Whatever, it can be eliminated entirely either by disabling IOMMU or by using a different SAS/SATA controller.
I'm not telling you what you should do, I'm just offering my advice based on my own experience in response to a request for help.
February 17, 2017 Author
I really appreciate the help John_M. The last parity check found no issues -> I will run another the next time I write data to the rig. Writes to this machine only occur about once every week or two (it is the backup to my main unRAID machine)…
"My feeling is that whenever you write to the array you risk not correctly updating parity but I don't have a lot of evidence to offer in order to convince you. You may want to risk it but you might consider increasing the frequency of your parity checks."
Ah… so if a disk goes bad, and I rebuild, I may rebuild corrupted data? Given that I have been running in this manner for ~4 years... and on v 6.3 since stable... may parity be trusted at all?… After disabling IOMMU on the motherboard should I do a new config, preclear both parity drives, and then rebuild parity?
Or does this mean I can increase parity checks to once a week as a validating/correcting tool to correct any further instances of the Marvell bug? If this is the case, I will leave IOMMU on until I pull the machine down to add another data drive in a couple of months… unless I see further parity corrections occurring.
"You only need to disable Intel VT-d or AMD-Vi. The other one (Intel VT-x or AMD-V) you can leave enabled."
Awesome.
"I don't know your upgrade history"
I was using v6 beta/rc for some time after v5. This machine never had v4 on it. Once I was on 6.2 stable I remained on that until 6.3 and 6.3.1 stable came out.
Gotchya… so the Marvell bug appears to be a manifestation of the 6.3 releases.
February 17, 2017
"Ah… so if a disk goes bad, and I rebuild, I may rebuild corrupted data?"
Yes. If you had needed to do a rebuild on the day before your scheduled parity check - the one that found the five errors - you would have rebuilt corrupted data.
"Given that I have been running in this manner for ~4 years... and on v 6.3 since stable... may parity be trusted at all?"
It seems to be a recent problem, affecting some from around mid-2015 and others more recently.
"… After disabling IOMMU on the motherboard should I do a new config, preclear both parity drives, and then rebuild parity?"
No. There's no need for any of that. Disable IOMMU and do a correcting parity check. If it finds any errors they will be corrected and that's the end of it. Alternatively, if your faith in Marvell cards is shaken, fit an LSI card instead. Then you can leave IOMMU on if you need it, though if you don't I'd switch it off anyway.
"Or does this mean I can increase parity checks to once a week as a validating/correcting tool to correct any further instances of the Marvell bug? If this is the case, I will leave IOMMU on until I pull the machine down to add another data drive in a couple of months… unless I see further parity corrections occurring."
The easiest thing is to do one of the above and move on. If you do neither then you can't trust parity. In your case that situation is manageable insofar as you can run a correcting parity check after your weekly backup. But why not simply fix the problem instead of merely treating the symptoms?
"Gotchya… so the Marvell bug appears to be a manifestation of the 6.3 releases."
It seems to be affecting more people now that unRAID 6.3 is released, but when I noticed it I was running 6.2, and when the original manifestation of the bug appeared the current release must have been 6.1 or 6.0. Perhaps the problem is with the 64-bit version of the driver.
Note though that the original bug was much more obvious - disks being dropped as the controller lost contact with them. So the problem was obvious and action necessary. What you and I are seeing is more subtle and more dangerous as a result, though it affects far fewer people. On the other hand, even now there are plenty of users who are completely unaffected.
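To make the "you would have rebuilt corrupted data" point concrete, here is a toy Python model of a rebuild from parity that is out of step with the data. It only models simple XOR parity (the P disk). The Q parity in a dual-parity array uses different math and is not modelled here, and the three tiny "disks" are invented values.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild(missing, disks, parity):
    """Reconstruct disk number `missing` from the surviving disks plus parity."""
    survivors = [d for i, d in enumerate(disks) if i != missing]
    return xor_blocks(survivors + [parity])


# Three invented 4-byte data disks.
disks = [b"\x11\x22\x33\x44", b"\xaa\xbb\xcc\xdd", b"\x01\x02\x03\x04"]

good_p = xor_blocks(disks)
# Simulate a parity write that went wrong: the first parity byte is off.
stale_p = bytes([good_p[0] ^ 0xFF]) + good_p[1:]

ok = rebuild(1, disks, good_p)
bad = rebuild(1, disks, stale_p)

print("rebuilt from good parity matches original: ", ok == disks[1])
print("rebuilt from stale parity matches original:", bad == disks[1])
```

The rebuilt disk is wrong at exactly the position where parity was wrong and correct everywhere else, which is why a correcting parity check run soon after each batch of writes narrows the window of exposure.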
February 17, 2017 Author
This is perfect John_M, thank you greatly!
For now I will run correcting parity checks after the weekly write to the array.
As soon as I pull the machine down from the low clearance rack to add an additional data drive, I will hook up a monitor and disable IOMMU by disabling Intel VT-d in the BIOS. I will also install a long GPU cable so that if monitor access is needed in the future, it is much easier to gain.
I would physically pull the machine down NOW, however I am recovering from a broken spine, so if I can put off moving a heavy overhead object while building up some muscle first, then I will wait. And for now, while inconvenient, the issue is manageable.
The primary server (a Supermicro D525) also has dual parity, but is on an Intel controller.
February 17, 2017
Ouch! I understand your reluctance to wrestle with a heavy object. I wish you a speedy recovery.
February 17, 2017 Author
Thanks for the well-wishing! It's an old injury that flares up every couple of years and requires some hefty PT to get back to a 'normal' lifestyle.
I really appreciate your taking the time to provide a thorough education on what it is I am likely dealing with.
The likelihood of my primary server losing more than 2 data disks shortly after writing backup files to the secondary server, before a correcting parity check can be run, is thankfully relatively small! And within a couple of months VT-d will be disabled.
February 27, 2017 Author
@John_M @bjp999
New wrinkle, crew. This morning I received the following email:
TOWER_2 Status: Notice [TOWER2] - bunker verify command
Event: unRAID file corruption
Subject: Notice [TOWER2] - bunker verify command
Description: Found 1 file with SHA256 hash key corruption
Importance: alert
SHA256 hash key mismatch, /mnt/disk3/Share/Folder/Filename.ext is corrupted
No additional data has been written to the server since all of this started. So I:
strapped on my back brace
pulled the machine physically down
inserted the new HGST drive for preclearing (and soon to be added data capacity)
moved ALL of my drives from the on-board SATA controller to the Supermicro AOC-SASLP-MV8 flashed to Firmware_3.1.0.21. Previously only 3 of the 6 (now 7) drives were hooked up to the SAS controller
did NOT disable passthrough at this time, but as no data/parity/cache drives are using the motherboard's SATA controller, this should be OK?
left the external drive cage (preclearing duties only) and the BD-ROM drive hooked up to the motherboard's SATA controller
I assume I should delete the file referenced by bunker verify as having become corrupted and then replace it with the original file... What other actions are now recommended?
Edit: Dang... the SASLP uses a Marvell controller... the motherboard's SATA controller is an Intel unit. Dang faulty memory!
lspci:
00:00.0 Host bridge: Intel Corporation 2nd Generation Core Processor Family DRAM Controller (rev 09)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200/2nd Generation Core Processor Family PCI Express Root Port (rev 09)
00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09)
00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
00:16.0 Communication controller: Intel Corporation 7 Series/C216 Chipset Family MEI Controller #1 (rev 04)
00:1a.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #2 (rev 04)
00:1b.0 Audio device: Intel Corporation 7 Series/C216 Chipset Family High Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation 7 Series/C216 Chipset Family PCI Express Root Port 1 (rev c4)
00:1c.4 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 5 (rev c4)
00:1c.5 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c4)
00:1d.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation H77 Express Chipset LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
00:1f.3 SMBus: Intel Corporation 7 Series/C216 Chipset Family SMBus Controller (rev 04)
02:00.0 RAID bus controller: Marvell Technology Group Ltd. MV64460/64461/64462 System Controller, Revision B (rev 01)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 09)
04:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 03)
Thanks!
2017.02.27 Edit. Ordered a Dell PERC H310. Cross flashed to IT mode per: http://lime-technology.com/forum/index.php?topic=12767.msg259006#msg259006
Update on 23.02.2017
Firmware is still P20.00.07.00
Uses sas2flsh through the whole process.
Tested on a "backflashed" H200, to be confirmed on a stock H200 card and on H310's.
Card backup is now dumping the full flash. This can be used to restore the initial condition of the card.
Added script for automatic SAS address extraction. No reboot necessary any more.
https://www.mediafire.com/?0op114fpim9xwwf
MD5: to be generated
Make sure you read and understand the __READMEFIRST.txt before starting!
Edited February 27, 2017 by landS
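Since the mix-up in this post was over which controller is the Marvell one, a quick filter over the lspci text can settle it. This is a sketch that assumes lspci's default "slot class: device" line format. The three sample lines are copied from the listing in this post, while the STORAGE_CLASSES list and the function name are assumptions and not exhaustive.

```python
# Three lines copied from the lspci listing in this post.
LSPCI = """\
00:1f.2 SATA controller: Intel Corporation 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
02:00.0 RAID bus controller: Marvell Technology Group Ltd. MV64460/64461/64462 System Controller, Revision B (rev 01)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 09)
"""

# Assumed list of class names worth looking at; extend as needed.
STORAGE_CLASSES = (
    "SATA controller",
    "RAID bus controller",
    "Serial Attached SCSI controller",
)


def storage_controllers(text):
    """Return (slot, class, device name) for each line whose class is a storage class."""
    found = []
    for line in text.splitlines():
        slot, _, rest = line.partition(" ")
        dev_class, sep, name = rest.partition(": ")
        if sep and dev_class in STORAGE_CLASSES:
            found.append((slot, dev_class, name))
    return found


controllers = storage_controllers(LSPCI)
marvell_slots = [slot for slot, _cls, name in controllers if "Marvell" in name]

for slot, dev_class, name in controllers:
    print(slot, "|", dev_class, "|", name)
print("Marvell storage controllers at:", marvell_slots)
```

On these sample lines the Intel on-board SATA controller sits at 00:1f.2 and the Marvell device at 02:00.0, which matches the correction in the edit above: the add-on SASLP is the Marvell part.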
March 1, 2017 Author
Copied data to the array, waited for mover to finish, ran parity check: so far 2 sync errors have been found and corrected.... Looks like I need to pull the rig down and swap the Supermicro SASLP for a Dell PERC H310 flashed to LSI IT mode.
Edited March 1, 2017 by landS
March 17, 2017 Author
Zero problems with multiple writes and multiple parity checks in the 2 weeks since swapping in the H310 card. Also, parity check and disk access speeds are way up.