Repeated parity sync errors after server upgrade SAS2LP-MV8


Rich

OK, this seems strange. I've just disabled vt-d and am running another parity check (so far, so good). I decided to run lspci again and the info has changed:

 

Before (vt-d enabled)

01:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9485 SAS/SATA 6Gb/s controller (rev c3)
06:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9480 SAS/SATA 6Gb/s RAID controller (rev c3)

 

Now (vt-d disabled)

01:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9485 SAS/SATA 6Gb/s controller (rev c3)
06:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9485 SAS/SATA 6Gb/s controller (rev c3)

 

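The model number and description of the card at 06:00.0 are now different: it reports as 88SE9485 SAS/SATA 6Gb/s controller (rev c3) instead of 88SE9480 SAS/SATA 6Gb/s RAID controller (rev c3).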

 

Should this randomly change? I assumed it was effectively the version number and name of the card. I've not attempted to update anything; I simply disabled vt-d and rebooted  ???
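
One way to confirm which device changed between boots is to compare the two lspci captures by bus address. The following is a minimal Python sketch, assuming both outputs have been saved as text; `parse_lspci` and `changed_devices` are hypothetical helper names, not part of any unRAID tool. Running `lspci -nn` instead would also print the numeric vendor:device ID, which is less ambiguous than the text description.

```python
import re

# Matches lines such as:
# 06:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9485 ... (rev c3)
LSPCI_LINE = re.compile(
    r"^(?P<slot>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+(?P<cls>[^:]+):\s+(?P<desc>.+)$"
)

# Captures copied from the post above.
BEFORE = """\
01:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9485 SAS/SATA 6Gb/s controller (rev c3)
06:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9480 SAS/SATA 6Gb/s RAID controller (rev c3)
"""
AFTER = """\
01:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9485 SAS/SATA 6Gb/s controller (rev c3)
06:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9485 SAS/SATA 6Gb/s controller (rev c3)
"""


def parse_lspci(text):
    """Map each bus address (e.g. '06:00.0') to its device description."""
    devices = {}
    for line in text.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if match:
            devices[match.group("slot")] = match.group("desc")
    return devices


def changed_devices(before, after):
    """Return {slot: (old_desc, new_desc)} for slots whose description differs."""
    old, new = parse_lspci(before), parse_lspci(after)
    return {
        slot: (old[slot], new[slot])
        for slot in sorted(old.keys() & new.keys())
        if old[slot] != new[slot]
    }


if __name__ == "__main__":
    for slot, (was, now) in changed_devices(BEFORE, AFTER).items():
        print(f"{slot}: {was!r} -> {now!r}")
```

With the two captures from this post, only 06:00.0 is reported as changed; 01:00.0 is identical in both.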

Link to comment

I don't have any issues with IOMMU and the SAS2.  (and I'm a 9485)

 

Thanks for that data point, Squid.

 

My experience of unRAID doesn't extend as far back as v5 betas but I'd be interested to read the thread about having to re-flash the card to get it to be recognised. I'll try to search for it.

 

Link to comment

Both lspci runs were today, and I swapped the second and RMA'ed cards over on Thursday, so it's definitely not a card mix-up.

 

I'm going to let this parity check run, as so far it is error free, then I'll re-enable vt-d and see what lspci says.

 

It's a bit weird though, isn't it  :-\

Link to comment

Thanks, I'll have a read.

The fix for it happened on 5.03

- linux: patch mvsas driver to recognize newer AOC-SAS2LP-MV8 cards with PCI ID 1b4b:9485

 

But before that fix came into being you had to reflash the firmware with the same version to get it to be recognized.  Like I said, no clue if it'll work, but it is a possibility.  Don't know if anyone has ever taken the time to figure out the age of the cards that have the issue...  All I can say for certain is that my card did have the updated PCI ID and a reflash made it work for myself and others, and I do not have the IOMMU trouble.

Link to comment

Well the first parity check after disabling vt-d has come back with zero errors of any kind.

I am going to wait until the weekend and then run another check. If it comes back clear, I'll try re-enabling vt-d and adding iommu=pt to the syslinux.conf. After that I'm going to re-flash the BIOS of the new controller and see what that does. Fingers crossed!

Link to comment

The '5 days later' parity check ran perfectly, with nothing entered in the syslog other than the start and finish. I'd written well over 300GB to the array during the five days as well, so I'm confident in the result  :D

 

I can already confirm that re-enabling vt-d and adding iommu=pt to syslinux.conf has no effect, as errors appeared within the first hour of a parity check. So it is looking very strongly like the vt-d / controller conflict theory is correct.

 

I have now removed iommu=pt from syslinux.conf and have reflashed the new controller with the latest firmware (which it was already on) and am about to start another parity check. Fingers crossed!!

Link to comment

I just got timeout errors, so reflashing the controller hasn't sorted the problem  :(

 

So for now I'm going to have to run with vt-d disabled and no passthroughs to my VM, which is not what I want to do at all.

 

I'm not sure what to try now. I think it's safe to say there's definitely a problem between the controller and vt-d, but does that lie with the motherboard, the controller or unRAID? What's really annoying is that the exact same model card is sitting in the port above as well and working perfectly  :-\

 

Is there a process I can follow to officially raise this to Limetech? I plan on contacting Supermicro support as well to see if they can help.

 

Jan 21 15:12:10 unRAID kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Jan 21 15:12:10 unRAID kernel: sas: trying to find task 0xffff88035619e500
Jan 21 15:12:10 unRAID kernel: sas: sas_scsi_find_task: aborting task 0xffff88035619e500
Jan 21 15:12:10 unRAID kernel: sas: sas_scsi_find_task: task 0xffff88035619e500 is aborted
Jan 21 15:12:10 unRAID kernel: sas: sas_eh_handle_sas_errors: task 0xffff88035619e500 is aborted
Jan 21 15:12:10 unRAID kernel: sas: ata18: end_device-8:3: cmd error handler
Jan 21 15:12:10 unRAID kernel: sas: ata15: end_device-8:0: dev error handler
Jan 21 15:12:10 unRAID kernel: sas: ata16: end_device-8:1: dev error handler
Jan 21 15:12:10 unRAID kernel: sas: ata17: end_device-8:2: dev error handler
Jan 21 15:12:10 unRAID kernel: sas: ata18: end_device-8:3: dev error handler
Jan 21 15:12:10 unRAID kernel: ata18.00: exception Emask 0x0 SAct 0x8000000 SErr 0x0 action 0x6 frozen
Jan 21 15:12:10 unRAID kernel: ata18.00: failed command: READ FPDMA QUEUED
Jan 21 15:12:10 unRAID kernel: ata18.00: cmd 60/00:00:e8:b3:02/04:00:05:00:00/40 tag 27 ncq 524288 in
Jan 21 15:12:10 unRAID kernel:         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 21 15:12:10 unRAID kernel: ata18.00: status: { DRDY }
Jan 21 15:12:10 unRAID kernel: ata18: hard resetting link
Jan 21 15:12:11 unRAID kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!
Jan 21 15:12:13 unRAID kernel: drivers/scsi/mvsas/mv_sas.c 1430:mvs_I_T_nexus_reset for device[3]:rc= 0
Jan 21 15:12:13 unRAID kernel: ata18.00: configured for UDMA/133
Jan 21 15:12:13 unRAID kernel: ata18: EH complete
Jan 21 15:12:13 unRAID kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
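
For anyone comparing runs with vt-d enabled and disabled, counting these timeouts per ATA port makes the comparison less subjective. The following is a minimal Python sketch, assuming the libata messages keep the format shown above (a "failed command" line followed shortly by a "res ... (timeout)" line); `count_timeouts` is a hypothetical helper name and the sample is trimmed from the excerpt above.

```python
import re
from collections import Counter

# "ata18.00: failed command: READ FPDMA QUEUED" -> port "ata18", command text
FAILED_CMD = re.compile(r"kernel: (ata\d+)\.\d+: failed command: (.+)$")
# The result line carries the reason, e.g. "Emask 0x4 (timeout)"
TIMEOUT = re.compile(r"Emask 0x[0-9a-f]+ \(timeout\)")

# Lines copied from the syslog excerpt above.
SAMPLE = """\
Jan 21 15:12:10 unRAID kernel: ata18.00: exception Emask 0x0 SAct 0x8000000 SErr 0x0 action 0x6 frozen
Jan 21 15:12:10 unRAID kernel: ata18.00: failed command: READ FPDMA QUEUED
Jan 21 15:12:10 unRAID kernel: ata18.00: cmd 60/00:00:e8:b3:02/04:00:05:00:00/40 tag 27 ncq 524288 in
Jan 21 15:12:10 unRAID kernel:         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 21 15:12:10 unRAID kernel: ata18.00: status: { DRDY }
Jan 21 15:12:10 unRAID kernel: ata18: hard resetting link
"""


def count_timeouts(lines):
    """Count (port, command) pairs whose result line reports a timeout."""
    timeouts = Counter()
    last_failed = None  # (port, command) from the most recent "failed command" line
    for line in lines:
        failed = FAILED_CMD.search(line)
        if failed:
            last_failed = (failed.group(1), failed.group(2).strip())
            continue
        if last_failed and TIMEOUT.search(line):
            timeouts[last_failed] += 1
            last_failed = None
    return timeouts


if __name__ == "__main__":
    for (port, command), n in count_timeouts(SAMPLE.splitlines()).items():
        print(f"{port}: {n} timeout(s) on {command}")
```

The same function can be fed the lines of a saved syslog file; a port that only ever appears in the count while vt-d is enabled would support the conflict theory.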

Link to comment

The problem is believed to be with the driver, according to RobJ's thread. The only sure way to be rid of it seems to be to use HBAs based on a different controller. Popular choices include the re-flashed Dell PERC H310 and the similarly re-flashed IBM ServeRAID M1015. Both are available reasonably cheaply but neither works straight from the box.

Link to comment

If both controllers were identical, then yes, probably. But one of them appears to be based on the 9485, while one appears to be based on the 9480. Except that they identified as the same on one occasion - did you investigate that further? Also, not everyone is affected, as Squid pointed out. Have you tried swapping them over (motherboard slots)?

Link to comment

I ran lspci again after re-enabling vt-d and got the same result as my second lspci above:

01:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9485 SAS/SATA 6Gb/s controller (rev c3)
06:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9485 SAS/SATA 6Gb/s controller (rev c3)

So they're still identical. It's really weird that it changed at all, but even stranger that it changed without me making any modification to the card (it was before reflashing the BIOS). I'm not sure what else I can do to investigate.

 

Yeah I have swapped the cards over, when the second / newer card got RMA'd I swapped the first / older card into the opposite slot then, but the problems followed the new card to its different slot (I hope that makes sense ???).

Link to comment

Rich-

I've read through the thread and am wondering if you've tried physically swapping the two controllers around in their PCI slots?

 

That is, current config:

Controller #1 is in PCI slot A

Controller #2 is in PCI slot B

results in errors on which controller and which port?

 

next, physically swap the two controllers so that:

Controller #1 is in PCI slot B

Controller #2 is in PCI slot A

now do the errors follow the slot or follow the controller?

 

If they follow the controller then almost certainly this is a defective controller.

If they follow the slot then almost certainly this is a hardware/bios/controller firmware issue.  It could be a driver issue too (less likely since the card does do millions of I/Os successfully).  BTW unRAID uses stock kernel drivers for this chipset.

 

If the problem follows the slot, another thing to try is a different physical slot if you have one on your motherboard.

 

Finally, if you can say disabling vt-d definitely solves the problem, then for sure this is a bios/controller problem, right?

 

I will say, in my experience we have had issues with multiple controllers plugged into the same motherboard.  Problems where a controller works perfectly if there is only one of them installed, but start failing if two or more are installed.  For example several years ago we were using x2 Adaptec 4-port SATA controllers in one of our server products.  In a cost-cutting measure, tested a few other controllers and settled on a "Rosewill" model which worked perfectly in testing, but only testing done was with single controller in a motherboard.  Placed a large order and spent a long time pulling hair and gnashing teeth when no matter what, x2 installed in same motherboard simply would not work reliably.  Ended up RMA'ing the entire batch and went back to Adaptec.

Link to comment

Thanks for the post and your feedback Limetech, very much appreciated  :)

 

I have swapped the cards over already and the errors followed the card. That was an RMA'd card though which generated exactly the same errors as the card it replaced and I find it hard to believe I received two faulty controllers in a row  ???

 

For the moment I've disabled vt-d on the motherboard and everything appears to be running smoothly. The only thing I was concerned about was not being able to pass through my Belkin UPS to my Windows VM for auto shutdown, but the 'Network UPS Tools (NUT)' plugin has solved that problem for the moment.

 

For what it's worth, I believe that the problem lies with either the Marvell controller and its firmware on the card, or a mystery bug similar to what was mentioned above, with the motherboard not being able to handle two controllers simultaneously.

 

For the moment though, I have found the vt-d workaround, which will at least keep my parity drives and data synced and safe  :)

 

Thank you to all who helped with this, much appreciated.

Link to comment
