Unraid OS version 6.10.3-rc1 available



The primary purpose of this release is to address the issue seen with many HP Microserver Gen8/9 servers where data corruption could occur if Intel VT-d is enabled.

 

As always, please make a flash backup before upgrading: Main/Flash/Flash Backup.

 

While we have not identified the exact kernel commit that introduced this issue, we believe there is a viable solution. The solution involves changing the default IOMMU operational mode in the Linux kernel from "DMA Translation" to "Pass-through" (equivalent to the "iommu=pt" kernel option). At first we thought the 'tg3' network driver was the culprit; however, after thorough investigation, we think this was coincidental and we have removed the code that "blacklists" the tg3 driver.
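For anyone on an older release who set pass-through mode by hand, here is a minimal Python sketch that checks whether a kernel command line explicitly asks for it. The function names are made up for illustration. It only looks at boot parameters (`iommu=pt` or `iommu.passthrough=1`), so it should return False on this release even though pass-through is active, because here the default is compiled into the kernel rather than passed at boot.

```python
from pathlib import Path


def iommu_passthrough_requested(cmdline: str) -> bool:
    """Return True if the boot parameters explicitly request IOMMU pass-through.

    Only whole tokens are matched, so e.g. "intel_iommu=on" does not count.
    """
    tokens = cmdline.split()
    return "iommu=pt" in tokens or "iommu.passthrough=1" in tokens


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the running kernel's boot parameters (Linux only)."""
    return Path(path).read_text().strip()
```

On a running server, `iommu_passthrough_requested(read_cmdline())` gives the answer for the current boot.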

 

We have decided to publish this release on the Unraid OS next branch so that users with test servers may give it a try. To update to this release, navigate to Tools/Update OS and select 'next' under Branch. As soon as we have confirmation from more HP Microserver users that no more "DMAR: ERROR" syslog messages are generated, we will publish the 6.10.3 stable release. Similarly, since we have effectively changed the IOMMU mode, we would be interested to know if any VM issues arise. In all our testing there have been no issues.
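If you want to check your own log for these messages, here is a rough Python sketch. The pattern is modeled on the sample lines quoted in this thread, the function name is made up, and the syslog path in the usage note is an assumption, so adjust it to wherever your log lives.

```python
import re
from typing import Iterable, List

# Matches kernel lines such as:
#   "Jun  7 10:09:44 Tower kernel: DMAR: ERROR: DMA PTE for vPFN ..."
DMAR_ERROR = re.compile(r"kernel: DMAR: ERROR: ")


def find_dmar_errors(lines: Iterable[str]) -> List[str]:
    """Return every log line that reports a DMAR error, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if DMAR_ERROR.search(line)]
```

Usage would be something like `find_dmar_errors(open("/var/log/syslog", errors="replace"))`. An empty list after a parity check is the result we are hoping for.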

 

More info by @JorgeB a few posts down:

6 hours ago, JorgeB said:

Many thanks to @jmztaylor and @Monteroman for helping test this new IOMMU mode. They are using different affected servers, one Lenovo X3100 and one HP Microserver G8. With both it was very easy to trigger the DMAR errors by starting a parity check; errors would start repeating after just a few seconds, e.g.:

 

With the Lenovo before this release:

Jun  7 10:09:41 Tower kernel: md: recovery thread: check P ...
Jun  7 10:09:44 Tower kernel: DMAR: ERROR: DMA PTE for vPFN 0xb0780 already set (to b0780003 not 17594b803)
Jun  7 10:09:44 Tower kernel: ------------[ cut here ]------------
Jun  7 10:09:44 Tower kernel: WARNING: CPU: 4 PID: 6907 at drivers/iommu/intel/iommu.c:2336 __domain_mapping+0x2e3/0x362

 

With the HP:

May 19 06:56:40 Tower kernel: md: recovery thread: check P ...
May 19 06:56:56 Tower kernel: DMAR: ERROR: DMA PTE for vPFN 0xb5f80 already set (to b5f80003 not 1636f4803)
May 19 06:56:56 Tower kernel: ------------[ cut here ]------------
May 19 06:56:56 Tower kernel: WARNING: CPU: 2 PID: 5826 at drivers/iommu/intel/iommu.c:2408 __domain_mapping+0x2e5/0x390

 

With the new release both ran a parity check for over 10 minutes with VT-d enabled and no signs of any errors. I'm confident this solves the DMAR/corruption issues for all affected platforms. As a bonus, this IOMMU pass-through mode can apparently have better performance, and some Linux distros have already switched to using it by default.
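To put a number on "a few seconds", the delay between the start of the parity check and the first DMAR error can be computed from log excerpts like the ones above. This is only a sketch: the function names are made up, the year is not part of the syslog timestamp so it is passed in (2022 assumed here), and month parsing assumes an English locale.

```python
from datetime import datetime
from typing import Iterable, Optional


def _timestamp(line: str, year: int) -> datetime:
    # Classic syslog stamps ("Jun  7 10:09:41") occupy the first 15 characters.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def seconds_to_first_dmar_error(lines: Iterable[str], year: int = 2022) -> Optional[float]:
    """Seconds from the parity check start line to the first DMAR error, or None."""
    started = None
    for line in lines:
        if "md: recovery thread: check" in line:
            started = _timestamp(line, year)
        elif started is not None and "DMAR: ERROR" in line:
            return (_timestamp(line, year) - started).total_seconds()
    return None
```

Fed the excerpts above, this gives 3.0 seconds for the Lenovo and 16.0 seconds for the HP.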

 

Main kudos go to LT's @eschultz: he's the one who came up with the solution, and he also found the doc below, which has some more info about this mode for anyone interested:

 

https://lenovopress.lenovo.com/lp1467.pdf

 

 

P.S. This will also fix the bzimage checksum error many Dell server users were experiencing during boot after updating to v6.10.x. The fix for that was also to use iommu=pt, which is probably what saved some of those servers from experiencing the DMAR/corruption problem.

 

 

 

 

 


Version 6.10.3-rc1 2022-06-10

Improvements

Plugin authors: A plugin file may include a tag which displays a markdown formatted message when a new version is available. Use this to give instructions or warnings to users before the upgrade is done.

Changed default kernel IOMMU operation mode from "DMA Translation" to "Pass-through".

Removed 'tg3' blacklisting.

Brought back color-coding in logging windows.

Bug fixes

Fix issue detecting Mellanox NIC.

Misc. webGUI bug fixes

Change Log vs. Unraid OS 6.10.2

Base distro:

  • no changes

Linux kernel:

  • version 5.15.46-Unraid
  • CONFIG_IOMMU_DEFAULT_PASSTHROUGH: Passthrough

Management:

  • startup: improve network device detection
  • webgui: Added color coding in log files
  • webgui: In case of flash corruption try the test again
  • webgui: Improved syslog reading
  • webgui: Added log size setting when viewing syslog
  • webgui: Plugin manager: add ALERT message function
  • webgui: Add INFO icon to banner
  • webgui: Added translations to PageMap page
  • webgui: Fix: non-correcting parity check actually correcting if non-English language pack installed
  • webgui: Updated azure/gray themes
    • Better support for Firefox
    • Move utilization and notification indicators to the right
Link to comment

I installed this update and now my system boots from the USB drive, gets as far as the "Unraid OS" entry, and will not continue loading the OS. I have also tried "Unraid OS GUI". The system was fine before the update. Like a dumb ass I was tired and did not make a manual backup of the drive before updating; mine is days old.

 

Any idea how to recover?

Link to comment
1 hour ago, SFord said:

Like a dumb ass I was tired and did not make a manual backup of the drive before updating,

Make a backup of current config folder, recreate the flash drive manually or by using the USB tool, then restore the config folder.
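A minimal Python sketch of the backup and restore steps, assuming the flash drive is mounted on another computer. The mount paths and function names are hypothetical; recreating the flash drive itself still has to be done manually or with the USB tool in between the two calls.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_config(flash_root: str, backup_root: str) -> Path:
    """Copy <flash_root>/config into a new timestamped folder under backup_root."""
    src = Path(flash_root) / "config"
    if not src.is_dir():
        raise FileNotFoundError(f"no config folder at {src}")
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"config-{stamp}"
    shutil.copytree(src, dest)
    return dest


def restore_config(backup_dir: str, flash_root: str) -> Path:
    """Copy a saved config folder onto the recreated flash drive (Python 3.8+)."""
    dest = Path(flash_root) / "config"
    # dirs_exist_ok lets the saved files overwrite the fresh default config.
    shutil.copytree(backup_dir, dest, dirs_exist_ok=True)
    return dest
```

For example, `saved = backup_config("/mnt/usb", "/home/me/unraid-backups")` before recreating the drive, then `restore_config(saved, "/mnt/usb")` afterwards.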

Link to comment
3 hours ago, SFord said:

Like a dumb ass I was tired and did not make a manual backup of the drive before updating

 

@limetech Please don't forget to add the usual update disclaimer "[...]Before updating take a backup[...]". This is not only for user security, it's for your security as well ;-)

 

Link to comment


I’ve been running 6.10.3 RC1 all day since working with @JorgeB and it has been 100% fine. I have been running my Docker apps, all of my drives are BTRFS, and I have not had a single issue all day. The log has been free of the errors I saw on the previous 6.10.x versions.

 

Link to comment
20 hours ago, JorgeB said:

Make a backup of current config folder, recreate the flash drive manually or by using the USB tool, then restore the config folder.

Once I got some sleep and pulled the drive and copied the folder all was good. I appreciate a clear real answer. I made all the mistakes and did not stick with my 3/2/1 backups. I got cocky and that's when the gods will rise up and "byte" you in the ass.

 

Thank you everyone who answered.

Link to comment
12 hours ago, nraygun said:

Will this fix allow the posting of a DMAR error when there really is a DMAR error?

If I understand this correctly, there can't be DMAR errors, since there's no DMA remapping. In the diagram from the PDF linked above, pass-through mode is the one on the right:

[Image: IOMMU mode diagram from the linked PDF]

Link to comment
On 6/10/2022 at 4:10 PM, limetech said:

 

Version 6.10.3-rc1 2022-06-10

Bug fixes

Fix issue detecting Mellanox NIC.

...

Hi unRAIDers, can anyone say for sure if there is a problem with ALL Mellanox NICs with the 6.10.x update or just certain models? I'm running dual-port 10GbE Mellanox ConnectX-3 Pro NICs and I want to upgrade to the latest stable version (6.10.2), but will wait for the next stable release if it's not working quite right with my NIC setup.

 

Thanks guys!!

Link to comment
7 hours ago, Joseph said:

Hi unRAIDers, can anyone say for sure if there is a problem with ALL Mellanox NICs with the 6.10.x update or just certain models?

I believe the main issue is if you want to use a Mellanox NIC as eth0, or if you add a Mellanox NIC when running v6.10.x. If it is working with v6.9.x and it's not set as eth0, it should remain working after updating to v6.10.2.

Having said that, I suspect v6.10.3 stable is going to be released very soon, so it's probably best just to wait. If you want to update now, update to v6.10.3-rc1, which should be basically the same as v6.10.3 final.

 

 

Link to comment
5 hours ago, JorgeB said:

I believe the main issue is if you want to use a Mellanox NIC as eth0, or if you add a Mellanox NIC when running v6.10.x. If it is working with v6.9.x and it's not set as eth0, it should remain working after updating to v6.10.2.

Having said that, I suspect v6.10.3 stable is going to be released very soon, so it's probably best just to wait. If you want to update now, update to v6.10.3-rc1, which should be basically the same as v6.10.3 final.

 

 

I have a Mellanox x2 set as eth0. Is this an issue? I haven’t noticed anything except some short pauses every few minutes with file transfers over SMB. Should I update to 6.10.3-rc1?

Link to comment
6 hours ago, JorgeB said:

I believe the main issue is if you want to use a Mellanox NIC as eth0, or if you add a Mellanox NIC when running v6.10.x. If it is working with v6.9.x and it's not set as eth0, it should remain working after updating to v6.10.2.

Having said that, I suspect v6.10.3 stable is going to be released very soon, so it's probably best just to wait. If you want to update now, update to v6.10.3-rc1, which should be basically the same as v6.10.3 final.

 

 

Thanks for the clarification... think I'll wait until the rc becomes a stable release.

Link to comment
1 hour ago, wgstarks said:

I haven’t noticed anything except some short pauses every few minutes with file transfers over SMB. Should I update to 6.10.3-rc1?

That should be unrelated. The issue is that with at least some models, or in some configs, you can't set a Mellanox NIC as eth0: it won't be eth0 after rebooting.
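One way to check after a reboot whether eth0 is still the Mellanox card is to look at which kernel driver the interface is bound to. A small Python sketch, with made-up function names. It relies on the usual Linux sysfs layout, and treating any driver name starting with "mlx" (e.g. mlx4_core or mlx5_core) as Mellanox is my assumption, not something confirmed for every model.

```python
import os
from typing import Optional


def nic_driver(iface: str, sysfs_net: str = "/sys/class/net") -> Optional[str]:
    """Return the kernel driver bound to iface, or None if it cannot be determined."""
    link = os.path.join(sysfs_net, iface, "device", "driver")
    if not os.path.exists(link):
        return None
    # The "driver" entry is a symlink whose target directory is named after the driver.
    return os.path.basename(os.path.realpath(link))


def looks_like_mellanox(iface: str, sysfs_net: str = "/sys/class/net") -> bool:
    """Heuristic (assumption): Mellanox drivers have names starting with 'mlx'."""
    driver = nic_driver(iface, sysfs_net)
    return driver is not None and driver.startswith("mlx")
```

If `looks_like_mellanox("eth0")` is still True after a few reboots, you are probably not hitting this issue.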

Link to comment
17 minutes ago, JorgeB said:

That should be unrelated. The issue is that with at least some models, or in some configs, you can't set a Mellanox NIC as eth0: it won't be eth0 after rebooting.

I guess I’m not having an issue then. The Mellanox card is still eth0 after several reboots. Maybe it’s only certain cards.

Link to comment