RageInvader Posted March 27, 2016
I have been loving unRAID for about 3 years now. I recently pre-cleared a new drive and added it to my array. While parity was syncing it corrected errors, and the weekly parity sync afterwards corrected more. That led me to check the logs, where I discovered I'm getting 'ata' errors on the drives attached to my Supermicro SAS2LP-MV8, which are causing parity sync errors in the hundreds.

After a week of troubleshooting (checking cables, buying new cables, reseating the PCI cards), I have isolated the issue: it only happens when my GTX 670 GPU is installed in the machine. I have tried the SAS card in the PCI-e 2.0 slot with the GPU in the 3.0 slot, and the other way round. If both cards are installed, I get the errors every 10 minutes or so. I have also tried with the GPU installed but its PCI-e power from the PSU not attached, possibly ruling out a bad PSU. My GPU has been out for 48 hours now with no errors, so I'm almost certain that having it installed is causing the issues.

Unfortunately I don't have full diagnostics from when the errors were occurring, only the full syslog, because I only read the read-first post when I came to post in the forums, and I'm worried about my data getting corrupted if I put my GPU back in. I have attached my full syslog from when the errors were occurring, and also my full diagnostics taken with no errors.

Many thanks in advance for your help.

syslog-witherrors.txt.zip
ness-diagnostics-20160327-0759-noerrors.zip
John_M Posted March 27, 2016
Are the two cards sharing the same interrupt?
RageInvader Posted March 27, 2016 (Author)
"Are the two cards sharing the same interrupt?"
I'm unsure what you mean. How would I find out if they are? They show up in unRAID in different IOMMU groups, if that has anything to do with what you mean?
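One way to answer the interrupt question is to look at /proc/interrupts, where a line that lists more than one device name after the counters indicates a shared IRQ. The following is only a rough sketch: the function name `shared_irqs` and the sample text are made up for illustration (the sample is not taken from this machine), and the parsing assumes device names contain no spaces and are separated by ", ".

```python
def shared_irqs(interrupts_text):
    """Return {irq_number: [device names]} for IRQs listing more than one device."""
    shared = {}
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        irq = label.strip()
        if not sep or not irq.isdigit():
            continue  # skip the CPU header and named rows such as NMI or LOC
        fields = rest.split()
        # Drop the leading per-CPU interrupt counters.
        while fields and fields[0].isdigit():
            fields.pop(0)
        if not fields:
            continue
        # Devices sharing an IRQ are listed as "name1, name2".
        chunks = " ".join(fields).split(", ")
        if len(chunks) < 2:
            continue
        # The first chunk still carries the controller/handler columns;
        # its last token is assumed to be the first device name.
        first_device = chunks[0].split()[-1]
        shared[int(irq)] = [first_device] + [c.strip() for c in chunks[1:]]
    return shared


# Hypothetical sample in the /proc/interrupts layout (illustration only).
SAMPLE = """           CPU0       CPU1
  0:         46          0   IO-APIC   2-edge      timer
 16:       1200        340   IO-APIC  16-fasteoi   mvsas, nvidia
 27:        900          0   PCI-MSI 512000-edge      ahci
NMI:          3          2   Non-maskable interrupts
"""

print(shared_irqs(SAMPLE))
```

On a live Linux system the same function could be fed the real file contents, for example `shared_irqs(open("/proc/interrupts").read())`. An empty result would suggest that no two devices share an IRQ.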
John_M Posted March 28, 2016
I meant, are they both being allocated the same IRQ? But, looking at your syslog, it looks like a DMA problem:

Mar 23 23:58:44 Ness kernel: dmar: DMAR:[DMA Write] Request device [04:00.0] fault addr ff600000
Mar 23 23:58:44 Ness kernel: DMAR:[fault reason 05] PTE Write access is not set
Mar 23 23:58:44 Ness kernel: dmar: DRHD: handling fault status reg 3
Mar 23 23:58:44 Ness kernel: dmar: DMAR:[DMA Write] Request device [04:00.0] fault addr ff601000
Mar 23 23:58:44 Ness kernel: DMAR:[fault reason 05] PTE Write access is not set
Mar 23 23:58:44 Ness kernel: dmar: DRHD: handling fault status reg 3
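To see how often these faults occur and which PCI device raises them, the syslog can be scanned for the "Request device" lines quoted in this thread. This is a minimal sketch, assuming the fault lines always follow the format shown here; `count_dmar_faults` and `FAULT_RE` are names chosen for illustration, not part of any unRAID tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   dmar: DMAR:[DMA Write] Request device [04:00.0] fault addr ff600000
FAULT_RE = re.compile(
    r"DMAR:\[DMA (Read|Write)\] Request device \[([0-9a-fA-F:.]+)\] "
    r"fault addr ([0-9a-fA-F]+)"
)


def count_dmar_faults(lines):
    """Count DMAR faults per (PCI device address, DMA direction)."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            direction, device, _addr = match.groups()
            counts[(device, direction)] += 1
    return counts


# The six syslog lines quoted in this thread.
SAMPLE_LOG = [
    "Mar 23 23:58:44 Ness kernel: dmar: DMAR:[DMA Write] Request device [04:00.0] fault addr ff600000",
    "Mar 23 23:58:44 Ness kernel: DMAR:[fault reason 05] PTE Write access is not set",
    "Mar 23 23:58:44 Ness kernel: dmar: DRHD: handling fault status reg 3",
    "Mar 23 23:58:44 Ness kernel: dmar: DMAR:[DMA Write] Request device [04:00.0] fault addr ff601000",
    "Mar 23 23:58:44 Ness kernel: DMAR:[fault reason 05] PTE Write access is not set",
    "Mar 23 23:58:44 Ness kernel: dmar: DRHD: handling fault status reg 3",
]

for (device, direction), n in count_dmar_faults(SAMPLE_LOG).items():
    print(f"{device}: {n} DMA {direction} fault(s)")
```

The device address in brackets (04:00.0 here) is a PCI bus address, so it can be compared against `lspci` output to identify which card issued the faulting request.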
RageInvader Posted March 29, 2016 (Author)
I'm sorry, but again I have no idea what DMA is. I have googled the error lines but can't find anything relevant to me. I have been away all weekend with my GPU removed from the system, but after 3-4 days with no errors I got some this morning, so I have attached new logs.

Many thanks

ness-diagnostics-20160329-1611-nogpu.zip
JorgeB Posted March 29, 2016
Those look to me like the errors that affect a few users with the SAS2LP and v6. AFAIK there's no solution, as it only affects a few users and LT could not replicate the issue. The SAS2LP works well for most, myself included.

If it were me, I'd try to sell it on eBay and buy an LSI-based controller like the IBM M1015 or Dell H310. You can probably do it without losing any money.

In the meantime, you can try moving as many array disks as you can to the onboard ports and see if it gets better.
RageInvader Posted March 29, 2016 (Author)
I have had my controller for about a year and a half now. I originally had it in an old Dell server, then moved my unRAID setup into my gaming machine, where it had been running for 6 months with no issues. I added a new drive to the system and all went well, with the parity sync fine, then a week later I was seeing these errors.
RageInvader Posted March 30, 2016 (Author)
Okay, after some more research with my machine, I now have both my GPU and Marvell controller installed. After doing a BIOS update, VT-d was disabled, and no errors have occurred in 48 hours. So I'm almost certain this is related to this bug here:
https://lime-technology.com/forum/index.php?topic=40683.0

Although I'm completely baffled as to why I'm only getting these errors now, and never before.
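Whether VT-d is actually active can be checked from the running system by listing the kernel's IOMMU groups, which Linux exposes under /sys/kernel/iommu_groups. The sketch below assumes that layout; the function name `iommu_groups` is hypothetical, and an empty result is only a hint that the IOMMU is off, not proof.

```python
import os


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group name: sorted PCI device addresses} found under root."""
    groups = {}
    if not os.path.isdir(root):
        return groups  # directory absent: no IOMMU groups exposed
    for name in os.listdir(root):
        devices_dir = os.path.join(root, name, "devices")
        if os.path.isdir(devices_dir):
            groups[name] = sorted(os.listdir(devices_dir))
    return groups


groups = iommu_groups()
if groups:
    for name in sorted(groups):
        print(f"group {name}: {', '.join(groups[name])}")
else:
    print("No IOMMU groups found; VT-d is probably disabled or unsupported.")
```

With VT-d disabled in the BIOS, this would be expected to report no groups, which matches the configuration that ran for 48 hours without errors.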