DomUK Posted May 30, 2022
Dell R720xd affected as well.
DomUK Posted May 30, 2022
The Dell R430 also didn't upgrade cleanly to 6.10.2. The blank tg3.conf on the boot drive sorted both the R430 and the R720xd after updating.
limetech (Author) Posted May 30, 2022
49 minutes ago, John_M said: Watching from the sideline because my Gen8 still has its original non-VT-d-capable Celeron processor, but wouldn't a better solution be to disable VT-d automatically via syslinux.cfg when the problematic configuration is detected instead of disabling the NIC? It would still take some users by surprise, of course, but at least they'd still be able to connect to their servers.
AFAIK it's not possible to programmatically disable VT-d. The way the kernel initializes is based on whether VT-d is enabled or not.
The current approach was taken out of an abundance of caution. Going into a 3-day holiday here in the US, I decided it's better for users to lose network connection (which I agree sucks) than to suffer data loss when we know data loss is possible (that would suck even more).
I've just added some code to the downloaded 'unRAIDServer.plg' file that will detect the combination of the 'tg3' module loaded and VT-d enabled, and will bail out of the upgrade unless the ./config/modprobe.d/tg3.conf file exists. This should greatly help those upgrading, but new users on an affected platform will still see no ethernet.
It is going to take us some time to get this fixed; we will probably have to go purchase a known-affected platform. The issue is acknowledged here: https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-c04565693
Why this has suddenly happened is a mystery.
Frank1940 Posted May 30, 2022
11 minutes ago, limetech said: I've just added some code to the downloaded 'unRAIDServer.plg' file that will detect the combination of the 'tg3' module loaded and VT-d enabled, and will bail out of the upgrade unless the ./config/modprobe.d/tg3.conf file exists.
Be sure to provide some way for those of us who are trying to provide support by changing the version number (6.10.2a or 6.10.3). There is already enough confusion over this issue.
limetech (Author) Posted May 30, 2022
4 hours ago, Frank1940 said: Be sure to provide some way for those of us who are trying to provide support by changing the version number (6.10.2a or 6.10.3). There is already enough confusion over this issue.
When you click 'Check for Updates' it downloads the 'unRAIDServer.plg' file from our download server. When this file is 'executed' and detects tg3 present and IOMMU enabled, it does this:
    echo "NOTE: combination of NIC using tg3 driver and Intel VT-d enabled may cause DATA CORRUPTION on some platforms."
    echo "Please disable VT-d in BIOS or pass 'intel_iommu=off' on syslinux kernel append line."
    echo "Alternatively create 'config/modprobe.d/tg3.conf' file:"
    echo " touch /boot/config/modprobe.d/tg3.conf # if your platform is not affected"
    echo "or"
    echo " echo 'blacklist tg3' > /boot/config/modprobe.d/tg3.conf # to blacklist the tg3 driver"
    echo
    exit 1
The script only checks for the existence of the modprobe.d/tg3.conf file, not its content. Hence the user can choose to blacklist or not.
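The guard described in the post above can be sketched roughly as follows. This is only an illustration, not the actual 'unRAIDServer.plg' code: the function name `tg3_upgrade_guard` is hypothetical, and the two detection methods (a `module/tg3` directory under sysfs for a loaded driver, and a non-empty `class/iommu` directory for an active IOMMU) are assumptions about how such a check could be done. The directories are passed as arguments so the logic can be tried against a scratch tree rather than the live system.

```shell
# Hypothetical sketch of the upgrade guard; not the real plg code.
tg3_upgrade_guard() {
  sys_root="${1:-/sys}"     # assumed sysfs mount point
  boot_root="${2:-/boot}"   # Unraid flash mount point

  # Assumption: a loaded module shows up as a directory under sys/module.
  [ -d "$sys_root/module/tg3" ] || return 0

  # Assumption: an active IOMMU exposes at least one entry under sys/class/iommu.
  [ -n "$(ls -A "$sys_root/class/iommu" 2>/dev/null)" ] || return 0

  # As in the post: only the existence of tg3.conf matters, not its content.
  [ -e "$boot_root/config/modprobe.d/tg3.conf" ] && return 0

  echo "tg3 driver and IOMMU both active and no tg3.conf found: refusing to upgrade"
  return 1
}
```

Called with no arguments it would look at `/sys` and `/boot`; a non-zero return corresponds to the `exit 1` that stops the upgrade.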
John_M Posted May 30, 2022
17 minutes ago, limetech said: The issue is acknowledged here: https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-c04565693
Thanks for that. The document suggests, as an alternative to disabling IOMMU, "Disable HP Shared Memory in the network adapter Option ROM" and gives instructions on how to do it. Maybe someone with affected hardware who is prepared to take a risk could give that a try?
nraygun Posted May 30, 2022
Well, wait a minute. I had blanked out the tg3.conf on my Dell R710, which uses the bnx2 module. Before blanking it out, I was getting lots (screenfuls) of DMAR errors. Now I get 2 on boot but then none (so far) after that. Shouldn't I have had NO errors? Is there something else going on here?
Frank1940 Posted May 30, 2022
1 hour ago, limetech said:
    echo "NOTE: combination of NIC using tg3 driver and Intel VT-d enabled may cause DATA CORRUPTION"
    echo "Please disable VT-d in BIOS or pass 'intel_iommu=off' on syslinux kernel append line."
    echo "Alternatively create 'config/modprobe.d/tg3.conf' file:"
    echo " touch /boot/config/modprobe.d/tg3.conf # if your platform is not affected"
    echo "or"
    echo " echo 'blacklist tg3' > /boot/config/modprobe.d/tg3.conf # to blacklist the tg3 driver"
    echo
    exit 1
Am I correct in assuming that the upgrade box will remain open with the text displayed and the actual upgrade process terminated (the normal expectation for a script with an exit status of '1')?
PeteAsking Posted May 30, 2022
After reading this forum post I am now too scared to update Unraid. It is very worrying if we are stuck between no network and possible data loss after an update. The update procedure is already a risky process at times, since cloning a USB stick means the original is blacklisted, so reverting is not easy (as the cloned stick is useless), and now we are heading into territory where specific hardware can cause catastrophic service impact on things as simple as a NIC.
Obviously I am not trying to say anyone is to blame, nor to suggest I have a better solution to the problem, but as a user of Unraid I can see how this would cause the reactions I am seeing on the forums. It is very much a product that encourages consolidating many services onto one dedicated box, so when that service is broken, multiple different parts of the network can be affected: NAS, DHCP, DNS, web services, etc.
I feel like not everyone is a forum user and this could be better communicated, or something could be built into the update procedure, like a "known issues, tick this box to proceed with the upgrade" type thing, where when you update it says "some users may lose connectivity, click I agree here to accept this and continue the update and agree you have read what to do if this affects you (link to instructions here)" or something. (Not claiming this is the best solution, just saying what came into my head upon reading this.)
Edited May 30, 2022 by PeteAsking
Spiritreader Posted May 30, 2022
When editing a Docker container, Unraid no longer remembers that I picked the "Advanced View". Instead it always shows me the basic view. Is this intentional? It's a bit cumbersome having to click this every time, as I regularly use and modify the fields that are hidden in basic view.
bonienl Posted May 30, 2022
38 minutes ago, Spiritreader said: Is this intentional?
Yes, configuration is always opened in basic mode.
BRiT Posted May 30, 2022
1 hour ago, PeteAsking said: After reading this forum post I am now too scared to update unraid. Very worrying if we are between no network or possible data loss after an update.
If you're already on an identified hardware setup and on 6.10 or 6.10.1, you're already in a possible data loss situation. The data loss is not tied to the 6.10.2 upgrade.
Spiritreader Posted May 30, 2022
58 minutes ago, bonienl said: Yes, configuration is always opened in basic mode.
Ah, I realized that template authoring mode got disabled for some reason. It probably had to do with me restoring a bunch of settings from backup while fixing another issue. Thanks for the reply nonetheless!
nraygun Posted May 30, 2022
2 hours ago, PeteAsking said: After reading this forum post I am now too scared to update unraid.
Right there with you. I just went back to 6.9.2 until this all gets sorted out. Can't say I was comfortable with the thought that I might not have data corruption. I also had a problem with My Servers that is resolved now that I went back to 6.9.2. It was giving me an unraid-api error.
Thanks and good luck to the team investigating this DMAR error situation!
PeteAsking Posted May 31, 2022
(Not sure if we can just be candid and post our thoughts, but assuming it's OK.)
I've been thinking, and I am unsure whether my worries are unfounded, but if people avoid upgrading because the process is considered a possible risk, then it could be the case that not a lot of people actually try the RC versions of releases as a result. This might make limetech's job difficult when releasing a new version. Maybe we as members of the Unraid community should try to form a beta testing task force that could assist limetech if it worked within constraints they provided.
At the moment my impression (which could be incorrect) is that it's more a matter of passive testing. As in, the RC release is posted to the forum and anyone who feels like it can try it out and provide feedback. This passive testing might not be as effective as, say, a committed group of 100 or so people who pledge to update and will email their logs, along with relevant hardware info, for limetech to review even if no real issues are detected at all. This provides many different real systems, both with and without issues, for them to compare and look at. The other problem is that many of us (like me) might not know what errors to really look for even if things appear to be working in the short term, but a once-over from limetech might uncover inconsistencies in logs we don't fully appreciate and provide a more active 'search and identify' type of beta testing path.
If something like that was desirable, then I'm sure a bunch of us could get together and step up to commit to testing RC versions and giving any info and logs to limetech to review. I'm pretty sure that as long as there was a semi-reasonable way to revert in case of a non-bootable or unusable system, a lot of people could pledge to commit to testing.
It might be a fun thing for a group of us to get together and do. Not sure what other people think; I'm just saying we all hang around the forums anyway. If everyone disagrees that's OK too, I was just saying what came into my head. I'm not telling anyone what to do.
trurl Posted May 31, 2022
5 hours ago, PeteAsking said: cloning a usb stick means the original is blacklisted so reverting back is not easy
Reverting is easy and cloning flash is unnecessary.
PeteAsking Posted May 31, 2022
10 minutes ago, trurl said: Reverting is easy and cloning flash is unnecessary
If the system does not boot, it is not as easy as unplugging a stick and plugging in the cloned stick, regardless of how easy it is claimed to be. Unplugging one thing and plugging in another is literally going to be the easiest conceivable option in a disaster situation where your entire network is down with no internet because the single device that hosts everything is down. Just saying.
That is why enterprise equipment has a flash and a backup flash, for example, and you can select which flash to boot from when it starts up, e.g. Netgear switches and whatever else has a dual boot type thing. I think even some home motherboards have a dual boot BIOS thing. Same reasoning. It's just how other people solve the problem with a 100% guaranteed rollback, since a different chip, a different flash boot, or a different cloned stick is being relied upon.
I feel you have to go the extra mile to reassure people that rollback can't fail, rather than just saying 'yeah, you read these instructions and it's super easy to follow if something breaks', as this relies on the user not being an idiot like me in order to complete the task correctly. There is also a time element. Follow instructions = time. Unplug one thing and plug in another thing = 30 seconds. That is a very different scenario in a time-sensitive window.
Edited May 31, 2022 by PeteAsking
trurl Posted May 31, 2022
To revert, just copy the 'previous' folder on flash to the top level.
Or, if you have a recent flash backup, you can install any version of Unraid you want on that original flash drive and get your configuration from the config folder of your backup.
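As a rough sketch of the first option, assuming that "copy to the top level" means copying the contents of the 'previous' folder over the root of the flash drive (that reading is an assumption, as is the hypothetical helper name `revert_to_previous`), the step could look like this from a shell. The flash path is an argument so the sketch can be tried on a copy of the flash first.

```shell
# Hypothetical helper; reads "copy to the top level" as copying the folder's contents.
revert_to_previous() {
  flash="${1:-/boot}"   # Unraid mounts the flash drive at /boot
  if [ ! -d "$flash/previous" ]; then
    echo "no 'previous' folder found on $flash" >&2
    return 1
  fi
  # Overwrite the top-level files with the saved copies from 'previous'.
  cp -r "$flash/previous/." "$flash/"
}
```

The same copy can be done from any other computer by plugging the flash drive in and dragging the files across, which is what matters when the server itself will not boot.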
PeteAsking Posted May 31, 2022
I understand you feel comfortable and have various methods that you have used to do this restore procedure, and it seems simple to you. I am simply giving you my feeling as a non-expert, and I feel my experience is both valid and possibly shared by other users who are in the same non-expert position I am in.
However, I will leave it at that. If you are happy with how the update rollout is going and feel there is no need to improve the situation, then I am happy to leave it to you guys and just get the all clear when it's safe to upgrade. I'm not being mean or anything; I really am happy to chill out and just get told when to jog on and hit update. It doesn't faze me and I don't feel any need to be combative about any of it, if that's what people like. Peace
Thorsten Posted May 31, 2022
The error "DMAR: ERROR: DMA PTE for vPFN" is also reported on the Ubuntu bug page. Affected system: Linux kernel 5.15.0.27.30 on an HPE ProLiant DL20 Gen9 server. See here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1970453
With the same workaround: "Setting the intel_iommu=off kernel boot parameter seems to work around the problem."
Also an interesting comment in the bug report: "I hit this bug upgrading my home server (proliant microserver gen9) and it seems to be causing memory corruption when it occurs ( at least in combination with zfs ). Using zfs mirrored root I experienced this issue after only a few minutes uptime, with DMAR messages flooding the log and very high CPU usage. After rebooting with intel_iommu=off things are back to normal, but a zfs scrub indicated several thousand checksum errors detected on the root volume, some of them unrecoverable that had to be restored from backup, and a separate zfs RAIDZ1 volume experienced corrupted metadata and had to be rolled back with some data loss."
Edited May 31, 2022 by Thorsten
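For anyone wanting to check their own system for the same symptoms, a minimal sketch follows. Both function names are hypothetical. The first counts kernel log lines that carry the error signature quoted above, and the second reports whether `intel_iommu=off` is on the running kernel's command line (read from `/proc/cmdline` by default). Neither says anything about whether data has already been corrupted.

```shell
# Count log lines containing the DMAR error signature quoted in the post.
# Reads log text on stdin, for example:  dmesg | count_dmar_errors
count_dmar_errors() {
  # grep -c prints 0 but exits non-zero when nothing matches; mask that status.
  grep -c 'DMAR: ERROR: DMA PTE for vPFN' || true
}

# Succeeds if intel_iommu=off is present in the given kernel command line file.
iommu_off_requested() {
  grep -qw 'intel_iommu=off' "${1:-/proc/cmdline}"
}
```

A non-zero count together with a failing `iommu_off_requested` would suggest the workaround has not been applied yet.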
JorgeB Posted May 31, 2022
1 hour ago, Thorsten said: The error with "DMAR: ERROR: DMA PTE for vPFN" is also reported on the Ubuntu Bug page. Affected system: Linux kernel 5.15.0.27.30 on an HPE ProLiant DL20 Gen9 server. See here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1970453
Good find, looks like the exact same issue.
urhellishntemre Posted May 31, 2022
I've got a weird issue that maybe y'all can help me fix. I know I don't have diags, but I don't know if I can find the diags for this issue.
I was using the built-in noVNC viewer for both my Windows 10 VM and the backblaze_personal_backup docker from CA. Both noVNC instances worked in 6.10.1 and also 6.10.0, but neither can connect under 6.10.2. The Windows 10 VM (and also a new Windows 11 VM I was going to set up) says "Failed to connect to server". backblaze_personal_backup says "server disconnected, error code 1006".
When I try to get to the logs for backblaze_personal_backup, which used to work fine in 6.10.1 and 6.10, they now just show a blank screen. The Windows 10 VM just has logs pertaining to VM settings as far as I can tell.
I looked through my other dockers, and binhex-krusader (which also uses noVNC) works fine. I have also tried rebooting the dockers/VMs, and turning both the Docker service and the VM service off and then on. I can connect to the noVNC instances via the noVNC viewer on my desktop, but I haven't run across this before.
Where would I be able to get the logs/diags for these issues to help you guys help me out?
JorgeB Posted May 31, 2022
33 minutes ago, urhellishntemre said: I was using the built in NoVNC viewer for both my Windows 10 VM, and backblaze_personal_backup docker from CA. Both NoVNC instances worked in 6.10.1, and also 6.10.0. But both can't connect under 6.10.2.
Try clearing the browser history, cookies, etc.
bitcore Posted May 31, 2022
I agree, a link to the discussion forum in the little release notes provided in the webGUI (the "i" button) would be very helpful and courteous to users. Additionally, I'd love to have more than just the changelog in there. The same text blurbs that are posted in the forum post for the release notes would also be helpful.
I'm personally pretty diligent about reading the upgrade threads from top to bottom, and I'll admit that though it's great we have organized release threads, it's a little annoying to have to search for the thread to be sure I don't run into any show-stoppers. A link would really be lovely.
Anyway, thanks for the release!
Edit: Also, my upgrade from 6.10.1 to 6.10.2 went just fine.
Edited May 31, 2022 by bitcore
eltonk Posted May 31, 2022
23 hours ago, Frank1940 said: On the banner on top of the first page that comes up with the GUI (most likely the MAIN tab), there is a big 'Upgrade now' button in a banner box. (I know it is there because I usually wait a bit to upgrade until I can read the release notes and a couple of pages of comments in the release thread.) I don't recall there even being an 'I' button on that banner. It seems that too many folks simply will click on anything. (Malware writers often use this same behavior pattern to have the unsuspecting do their bidding...)
So you are saying that we should not trust Unraid's top banner telling us to "update" the system because it could be malware?! This is a joke, right?