Rich Posted January 15, 2017 (Author)

Ok, this seems strange. I've just disabled vt-d and am running another parity check (so far, so good), and decided to run lspci again. The info has changed.

Before (vt-d enabled):

01:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9485 SAS/SATA 6Gb/s controller (rev c3)
06:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9480 SAS/SATA 6Gb/s RAID controller (rev c3)

Now (vt-d disabled):

01:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9485 SAS/SATA 6Gb/s controller (rev c3)
06:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9485 SAS/SATA 6Gb/s controller (rev c3)

The number and description of the second card are different: it now reports as 88SE9485 SAS/SATA 6Gb/s controller (rev c3). Should this randomly change? I assumed it was effectively the version number and name of the card. I've not attempted to update anything, simply disabled vt-d and rebooted.
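One way to tell whether the device itself changed or only the human-readable name lookup is to compare the numeric PCI IDs from `lspci -nn` rather than the name strings. A small sketch of a helper that does this (the sample output format in the usage comment is an assumption based on typical `lspci -nn` output; `1b4b` is Marvell's PCI vendor ID, as quoted later in this thread):

```shell
# Extract and count the numeric [vendor:device] PCI IDs of Marvell
# controllers from lspci -nn output, so two runs can be diffed by ID
# rather than by name string.
# Typical usage on the live system:
#   lspci -nn | marvell_ids
marvell_ids() {
  grep -i 'marvell' | grep -Eo '\[1b4b:[0-9a-f]{4}\]' | sort | uniq -c
}
```

If the IDs (e.g. `1b4b:9485` vs `1b4b:9480`) are stable across reboots but the names differ, it is the name lookup that changed, not the hardware.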
John_M Posted January 15, 2017

Could you have mistakenly replaced the wrong card? What does lspci give if you enable VT-d again?
John_M Posted January 15, 2017

> I don't have any issues with IOMMU and the SAS2. (and I'm a 9485)

Thanks for that data point, Squid. My experience of unRAID doesn't extend as far back as the v5 betas, but I'd be interested to read the thread about having to re-flash the card to get it recognised. I'll try to search for it.
Squid Posted January 15, 2017

http://lime-technology.com/forum/index.php?topic=29052.0
John_M Posted January 15, 2017

Thanks, I'll have a read.
Rich Posted January 15, 2017 (Author)

Both lspci runs were done today. I swapped the second and RMA'd cards over on Thursday, so it's definitely not a card mix-up. I'm going to let this parity check run, as so far it is error-free, then I'll re-enable vt-d and see what lspci says. It's a bit weird though, isn't it?
John_M Posted January 15, 2017

Yes, it's certainly strange. Sorry about the mix-up suggestion - I did have to ask, though.
Squid Posted January 15, 2017

> Thanks, I'll have a read.

The fix for it happened in 5.03: "linux: patch mvsas driver to recognize newer AOC-SAS2LP-MV8 cards with PCI ID 1b4b:9485". But before that fix came into being you had to reflash the firmware with the same version to get the card recognized. Like I said, no clue if it'll work, but it is a possibility. I don't know if anyone has ever taken the time to figure out the age of the cards that have the issue. All I can say for certain is that my card did have the updated PCI ID, a reflash made it work for me and others, and I do not have the IOMMU trouble.
Rich Posted January 15, 2017 (Author)

> Yes, it's certainly strange. Sorry about the mix-up suggestion - I did have to ask, though.

Lol, no worries.
Rich Posted January 16, 2017 (Author)

Well, the first parity check after disabling vt-d has come back with zero errors of any kind. I am going to wait until the weekend and then run another check and, if it comes back clear, try re-enabling vt-d and adding iommu=pt to the syslinux.conf. Then after that I'm going to re-flash the BIOS of the new controller and see what that does. Fingers crossed!
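For anyone following along, `iommu=pt` goes on the kernel append line in `syslinux.cfg` on the unRAID flash drive. A minimal sketch of what the edited stanza might look like (the label name and the other append options are assumptions taken from a typical default config; your file may differ):

```
label unRAID OS
  menu default
  kernel /bzimage
  append iommu=pt initrd=/bzroot
```

`iommu=pt` puts the IOMMU into passthrough mode for host devices while still allowing it to be used for VM device assignment, which is why it is worth trying before disabling vt-d outright.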
John_M Posted January 16, 2017

I'm encouraged by that result. It sounds like a good plan. Make sure you do some writing to the array before your next parity check.
Rich Posted January 17, 2017 (Author)

That's the reason I've been waiting a week between tests, although last night my VM and Docker backup cron job ran, so that's at least 200GB written to the array. So I think only waiting five days should be ok this time.
Rich Posted January 21, 2017 (Author)

The 'five days later' parity check ran perfectly, with nothing entered in the syslog other than the start and finish. I'd written well over 300GB to the array during the five days as well, so I'm confident in the result.

I can already confirm that re-enabling vt-d and adding iommu=pt to syslinux.conf has no effect, as errors appeared within the first hour of a parity check. So it is looking very strongly like the vt-d / controller conflict theory is correct.

I have now removed iommu=pt from syslinux.conf, reflashed the new controller with the latest firmware (which it was already on), and am about to start another parity check. Fingers crossed!!
Rich Posted January 21, 2017 (Author)

I just got timeout errors, so reflashing the controller hasn't sorted the problem. So for now I'm going to have to run with vt-d disabled and no passthroughs to my VM, which is not what I want to do at all.

I'm not sure what to try now. I think it's safe to say there's definitely a problem between the controller and vt-d, but does that lie with the motherboard, the controller, or unRAID? What's really annoying is that the exact same model card is sitting in the port above and working perfectly.

Is there a process I can follow to officially raise this with Limetech? I plan on contacting Supermicro support as well to see if they can help.

Jan 21 15:12:10 unRAID kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Jan 21 15:12:10 unRAID kernel: sas: trying to find task 0xffff88035619e500
Jan 21 15:12:10 unRAID kernel: sas: sas_scsi_find_task: aborting task 0xffff88035619e500
Jan 21 15:12:10 unRAID kernel: sas: sas_scsi_find_task: task 0xffff88035619e500 is aborted
Jan 21 15:12:10 unRAID kernel: sas: sas_eh_handle_sas_errors: task 0xffff88035619e500 is aborted
Jan 21 15:12:10 unRAID kernel: sas: ata18: end_device-8:3: cmd error handler
Jan 21 15:12:10 unRAID kernel: sas: ata15: end_device-8:0: dev error handler
Jan 21 15:12:10 unRAID kernel: sas: ata16: end_device-8:1: dev error handler
Jan 21 15:12:10 unRAID kernel: sas: ata17: end_device-8:2: dev error handler
Jan 21 15:12:10 unRAID kernel: sas: ata18: end_device-8:3: dev error handler
Jan 21 15:12:10 unRAID kernel: ata18.00: exception Emask 0x0 SAct 0x8000000 SErr 0x0 action 0x6 frozen
Jan 21 15:12:10 unRAID kernel: ata18.00: failed command: READ FPDMA QUEUED
Jan 21 15:12:10 unRAID kernel: ata18.00: cmd 60/00:00:e8:b3:02/04:00:05:00:00/40 tag 27 ncq 524288 in
Jan 21 15:12:10 unRAID kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 21 15:12:10 unRAID kernel: ata18.00: status: { DRDY }
Jan 21 15:12:10 unRAID kernel: ata18: hard resetting link
Jan 21 15:12:11 unRAID kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!
Jan 21 15:12:13 unRAID kernel: drivers/scsi/mvsas/mv_sas.c 1430:mvs_I_T_nexus_reset for device[3]:rc= 0
Jan 21 15:12:13 unRAID kernel: ata18.00: configured for UDMA/133
Jan 21 15:12:13 unRAID kernel: ata18: EH complete
Jan 21 15:12:13 unRAID kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
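Rather than waiting a full parity check to learn whether the errors are back, the syslog can be polled for the mvsas/NCQ-timeout signature shown above. A small sketch (the patterns are taken from the log lines quoted in this post, and the `/var/log/syslog` path in the usage comment is an assumption; adjust both for your system):

```shell
# Count lines matching the mvsas error-handler / NCQ timeout signature
# in a syslog file. A rising count during a parity check means the
# timeouts have returned.
# Typical usage: sas_timeouts /var/log/syslog
sas_timeouts() {
  grep -cE 'mvs_I_T_nexus_reset|FPDMA QUEUED|Emask 0x4 \(timeout\)' "$1"
}
```

Running this every few minutes during the first hour of a check would have caught the failure here well before the check finished.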
John_M Posted January 21, 2017

The problem is believed to be with the driver, according to RobJ's thread. The only sure way to be rid of it seems to be to use HBAs based on a different controller. Popular choices include the re-flashed Dell PERC H310 and the similarly re-flashed IBM ServeRAID M1015. Both are available reasonably cheaply, but neither works straight out of the box.
Rich Posted January 21, 2017 (Author)

If it was the driver, though, wouldn't I be seeing the same behaviour from both controllers?
John_M Posted January 21, 2017

If both controllers were identical, then yes, probably. But one of them appears to be based on the 9485, while the other appears to be based on the 9480. Except that they identified as the same on one occasion - did you investigate that further? Also, not everyone is affected, as Squid pointed out. Have you tried swapping them over (motherboard slots)?
Rich Posted January 21, 2017 (Author)

I ran lspci again after re-enabling vt-d and got the same result as my previous second lspci run:

01:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9485 SAS/SATA 6Gb/s controller (rev c3)
06:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9485 SAS/SATA 6Gb/s controller (rev c3)

So they're still identical. It's really weird that it changed at all, but even stranger that it changed without me making any modification to the card (it was before I reflashed the BIOS). I'm not sure what else I can do to investigate.

Yes, I have swapped the cards over: when the second / newer card got RMA'd, I swapped the first / older card into the opposite slot, but the problems followed the new card to its different slot (I hope that makes sense).
John_M Posted January 22, 2017

That is a mystery, then. The only suggestion I have left is to look at alternative SAS HBAs.
Rich Posted January 22, 2017 (Author)

Ok, I guess I'll have to admit defeat and start looking. Thanks for all your help.
limetech Posted January 24, 2017

Rich, I've read through the thread and am wondering if you've tried physically swapping the two controllers around in their PCI slots. That is, with the current config:

Controller #1 is in PCI slot A
Controller #2 is in PCI slot B

which controller and which port show the errors? Next, physically swap the two controllers so that:

Controller #1 is in PCI slot B
Controller #2 is in PCI slot A

Now, do the errors follow the slot or follow the controller? If they follow the controller, then almost certainly this is a defective controller. If they follow the slot, then almost certainly this is a hardware/BIOS/controller firmware issue. It's possible it's a driver issue too, though that's less likely, since the card does do millions of I/Os successfully. BTW, unRAID uses stock kernel drivers for this chipset. If the problem follows the slot, another thing to try is a different physical slot, if you have one on your motherboard. Finally, if you can say that disabling vt-d definitely solves the problem, then for sure this is a BIOS/controller problem, right?

I will say, in my experience we have had issues with multiple controllers plugged into the same motherboard: problems where a controller works perfectly if there is only one of them installed, but starts failing if two or more are installed. For example, several years ago we were using x2 Adaptec 4-port SATA controllers in one of our server products. In a cost-cutting measure, we tested a few other controllers and settled on a "Rosewill" model which worked perfectly in testing, but the only testing done was with a single controller in a motherboard. We placed a large order and spent a long time pulling hair and gnashing teeth when, no matter what, x2 installed in the same motherboard simply would not work reliably. We ended up RMA'ing the entire batch and went back to Adaptec.
Rich Posted January 31, 2017 (Author)

Thanks for the post and your feedback, Limetech, very much appreciated. I have swapped the cards over already and the errors followed the card. That was an RMA'd card, though, which generated exactly the same errors as the card it replaced, and I find it hard to believe I received two faulty controllers in a row.

For the moment I've disabled vt-d on the motherboard and everything appears to be running smoothly. The only thing I was concerned about was not being able to pass through my Belkin UPS to my Windows VM for auto shutdown, but the 'Network UPS Tools (NUT)' plugin has solved that problem for the moment.

For what it's worth, I believe the problem lies either with the Marvell controller and its firmware on the card, or with a mystery bug similar to what was mentioned above, where the motherboard can't handle two controllers simultaneously. For the moment, though, I have the vt-d workaround, which will at least keep my parity drives and data synced and safe.

Thank you to all who helped with this, much appreciated.