sorano

Everything posted by sorano

  1. I really appreciate you taking the time to try to help. Right now I'm just going to accept that this piece of trash Asus motherboard is fucking broken with SR-IOV and plan better for my next build. No matter what I do, sriov_numvfs will not show up in sysfs for the device, so the echoing has no effect.
  2. Well, in the last post you told me I need to vfio-bind the NIC before configuring VFs, right? So after that I added the first two lines:
     # VFIO bind Intel X520 before creating Virtual Functions
     bash /boot/config/vfio-pci-bind.sh 8086:154d 0000:06:00.0
     bash /boot/config/vfio-pci-bind.sh 8086:154d 0000:06:00.1
     Then the next two lines create the actual VFs:
     # Create Intel X520 Virtual Functions
     echo 4 > /sys/bus/pci/devices/0000:06:00.0/sriov_numvfs
     echo 4 > /sys/bus/pci/devices/0000:06:00.1/sriov_numvfs
     (This is the part that does not work, since sriov_numvfs is not visible under /sys/bus/pci/devices/0000:06:00.0/, so the echo does nothing.)
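     (Side note for anyone following along: a minimal sanity check, assuming the PF is still at 0000:06:00.0, is whether the kernel exposed any SR-IOV attributes for the device at all before trying the echo:)
         # List any SR-IOV related sysfs attributes for the physical function
         ls /sys/bus/pci/devices/0000:06:00.0/ | grep sriov
         # If SR-IOV was initialized, sriov_totalvfs reports the maximum VF count
         cat /sys/bus/pci/devices/0000:06:00.0/sriov_totalvfs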
  3. Damn, I really hoped that would have been it, but it's still not working. Same as before. I cannot even create VFs even though the X520 is VFIO-bound: sriov_numvfs still does not exist for the card:
  4. After spending way too many hours trying to get a Mellanox Virtual Function running under Windows, I finally gave up and bought an Intel X520-DA2. New card, new problems. It's like the kernel is ignoring the SR-IOV functions of the card for some reason. I checked in the BIOS but could not find anything related to activating SR-IOV for the card either. Since sriov_numvfs does not exist for the device, I cannot get any VFs.
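     (A hedged way to see whether the hardware even advertises SR-IOV, and whether the kernel complained about it at boot, assuming the X520 is still at 06:00.0:)
         # Does the device advertise the SR-IOV PCI capability?
         lspci -s 06:00.0 -vvv | grep -A4 "SR-IOV"
         # Any SR-IOV or MMIO/BAR complaints for the device in the kernel log?
         dmesg | grep -iE "sr-iov|06:00"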
  5. I have a ConnectX-3 Pro. But I'm having issues with SR-IOV to a Windows VM, as in this thread: I was discussing my issue with the moderators in the VFIO Discord and one of them said I should try newer Mellanox drivers.
  6. @ich777 If you have the time and possibility I would like to request the Mellanox 4.9-2.2.4.0 LTS driver: https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed/#tabs-2
     Release notes/Installation: https://docs.mellanox.com/display/MLNXENv492240/Release+Notes
     Linux Driver Compatibility Matrix: https://www.mellanox.com/support/mlnx-ofed-matrix
     Currently, UnRAID 6.9.1 ships with kernel 5.10.21-Unraid and the in-tree mlx4 driver:
     mlx4_core: Mellanox ConnectX core driver v4.0-0
     modinfo mlx4_en
     filename:     /lib/modules/5.10.21-Unraid/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_en.ko.xz
     version:      4.0-0
     license:      Dual BSD/GPL
     description:  Mellanox ConnectX HCA Ethernet driver
     author:       Liran Liss, Yevgeny Petrilin
     srcversion:   EE160E8DB5FA601160D41B2
     depends:      mlx4_core
     retpoline:    Y
     intree:       Y
     name:         mlx4_en
     vermagic:     5.10.21-Unraid SMP mod_unload
     parm:         udp_rss:Enable RSS for incoming UDP traffic or disabled (0) (uint)
     parm:         pfctx:Priority based Flow Control policy on TX[7:0]. Per priority bit mask (uint)
     parm:         pfcrx:Priority based Flow Control policy on RX[7:0]. Per priority bit mask (uint)
     parm:         inline_thold:Threshold for using inline data (range: 17-104, default: 104) (uint)
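     (For comparison once a newer build is available, one way to see which mlx4 driver and firmware an interface is actually running, assuming the ConnectX port shows up as eth0:)
         # Driver name, driver version and firmware version of the live interface
         ethtool -i eth0
         # Version of the in-tree module that would be replaced
         modinfo mlx4_core | grep -E "^(filename|version)"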
  7. It's Windows 10 20H2. But yeah, it seems to be related to that driver mess, just like you wrote. While troubleshooting I added the VF to an Ubuntu VM that just boots off the Ubuntu 20.10 live ISO, and the VF worked straight away. Guess I'm going to be finding and testing a lot of drivers; if you have any recommendation it would be greatly appreciated. The card is updated with the latest official firmware, 2.42.5000.
  8. So I tested passing through my Mellanox ConnectX-3 Pro in UnRAID 6.9.1 stable and this bug is still here.
  9. Seeing this, I decided to put my Mellanox ConnectX-3 Pro into my UnRAID server. I created 4 VFs following @ConnectivIT's guide using Option 1 (didn't set any static MAC yet). Then I added one of the VFs to my Win10 gaming workstation. However, Windows Device Manager doesn't want to start the device. There is a Mellanox ConnectX-3 VPI (MT041900) Virtual Network Adapter visible, but it is stopped with code 43. I tried installing the 5.50.53000 WinOF driver from the Nvidia site, but still no go. Any suggestion for a solution would be greatly appreciated!
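     (One thing I have seen suggested for code 43 on Mellanox VFs, sketched here under the assumption that the parent port is eth0 and the passed-through VF is VF 0, is giving the VF a fixed MAC on the host before starting the VM:)
         # Assign a static MAC to VF 0 on the parent interface (example address)
         ip link set eth0 vf 0 mac 02:11:22:33:44:55
         # Confirm the per-VF settings
         ip link show eth0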
  10. Yeah, I'm also seeing a lot higher CPU usage on my gaming VM since the update. Upgraded from 6.8.3 to 6.9.1. It was fine at the start; after a while I started to notice that mouse movement would freeze and stutter, checked the load on the VM's isolated cores in the unraid webui, 99-100% on all of them. Since then I've been trying loads of edits to get it better, and updated the XML to use Q35-5.1 since that is supposed to be a lot better with VFIO. Still haven't come up with anything really good though. 😒
  11. I had been running memtest86 v8.4 for a couple of hours in order to rule out bad RAM, but it was not showing any errors. So I decided it could be worth trying out btrfs restore; unfortunately the outcome was pretty similar to mounting read-only and copying. Some files restored fine, but the big img files that are of interest would just keep looping. I'm going back to memtest and will leave it running overnight to get a more reliable result.
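      (For reference, a btrfs restore attempt generally looks like this; a sketch, with /dev/sdl1 from the earlier posts and /mnt/recovery standing in for whatever scratch target directory is used:)
          # Copy whatever btrfs restore can read, without mounting the filesystem
          btrfs restore -v /dev/sdl1 /mnt/recovery
          # -i ignores errors on individual files instead of stopping
          btrfs restore -iv /dev/sdl1 /mnt/recovery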
  12. I checked up on how the copying was going and mc had stopped with: Should I try to mount with other options like degraded,usebackuproot? Or just accept that those files have been lost?
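      (If another attempt is worth it, the recovery-oriented mount would look something like this, reusing the /pleasework mount point; usebackuproot asks btrfs to fall back to an older tree root, and degraded only matters if a pool member is missing:)
          # Read-only mount that tries older copies of the tree root
          mount -o ro,usebackuproot /dev/sdl1 /pleasework
          # Add degraded only if a device has dropped out of the pool
          mount -o ro,degraded,usebackuproot /dev/sdl1 /pleasework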
  13. Appreciate the help! I managed to mount the cache array with mount -o ro,notreelog,nologreplay /dev/sdl1 /pleasework and it's currently copying the data to the main array. Is there any risk that the files that are currently being copied have become corrupted? Is there anything I should pay extra attention to in order to prevent this from happening again? And can I trust the unraid webui that this disk is the cause of the problem? It seems that whenever there are issues with the cache array, the error always points at the "first" disk in the cache array. After it's finished copying the data, is there any point in trying to run a btrfs check --repair? I was planning to redesign the array after 6.9 stable, but sometimes things don't go according to plan :P.
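      (For the record, the repair in question is btrfs check --repair and it runs against the unmounted device; a sketch only, with the read-only pass first, since --repair is generally a last resort once everything recoverable has been copied off:)
          # Non-destructive check on the unmounted device
          btrfs check --readonly /dev/sdl1
          # Destructive repair; last resort after the data is safely copied off
          btrfs check --repair /dev/sdl1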
  14. Greetings. Last night my cache array stopped working as normal. What I suspect happened is that some docker container filled the free space. I'm unsure how to continue to resolve this. So far I have started the array in maintenance mode and ran the btrfs check (readonly), which shows this at the beginning:
      [1/7] checking root items
      [2/7] checking extents
      bad key ordering 42 43
      bad key ordering 42 43
      bad key ordering 42 43
      bad key ordering 42 43
      bad block 16614422200320
      ERROR: errors found in extent allocation tree or chunk allocation
      [3/7] checking free space cache
      there is no free space entry for 16470947741696-16470947749888
      there is no free space entry for 16470947741696-16473475514368
      cache appears valid but isn't 16468106805248
      [4/7] checking fs roots
      bad key ordering 42 43
      Then it continues with a lot of "bad key ordering 42 43" and after that lots of files like:
      unresolved ref dir 6983009 index 30 namelen 15 name ClamWinPortable filetype 0 errors 3, no dir item, no dir index
      root 5 inode 75545 errors 2000, link count wrong
      What gives me some hope is the:
      cache appears valid but isn't 16468106805248
      So, hopefully some btrfs magic can be made to get it back online.
      Edit: I never saw the cache array being near full in the webui, so I'm thinking something like inodes/allocation full? Any help is appreciated. I've attached diagnostics. tower-diagnostics-20210124-1049.zip
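      (Since the suspicion is that a container filled the pool, chunk allocation is worth a look, not just the free-space number in the webui; a sketch, assuming the pool can be mounted, even read-only, at /mnt/cache:)
          # Data vs. metadata chunk allocation and real unallocated space
          btrfs filesystem usage /mnt/cache
          # Per-device overview; works without mounting
          btrfs filesystem show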
  15. Ok, it finished now: Then I stopped maintenance mode and started the array with the disk still unassigned, and my missing folders are back! What is my next step now? Stop the array and re-assign the disk?
  16. Thanks for helping, @johnnie.black. It's running now. Is there anything I can do when it's done in order to try to salvage any data that was lost?
  17. So I stopped the array, started it in maintenance mode and ran the check with -nv. It doesn't tell me to run xfs_repair, so what would my next step be?
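      (For context, the maintenance-mode check with -nv is a no-modify run; the actual repair is the same command without -n. A sketch, assuming the affected disk is disk 1 so the md device is /dev/md1, which keeps parity in sync:)
          # No-modify check: reports problems, changes nothing
          xfs_repair -nv /dev/md1
          # Actual repair, run from maintenance mode against the md device
          xfs_repair -v /dev/md1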
  18. So when I stopped my array for a reboot to replace a cache SSD, it wouldn't stop and I had to power cycle the host after a long wait. Once it was back up I got a warning about an array disk that had errored out. I started the array with the disk unassigned and then I saw this in the syslog: Then I noticed some of the emulated shares are missing folders and files. I'm running unraid 6.7.2. What should be my next step in order to try to salvage any data that is missing? I've attached diagnostics. tower-diagnostics-20190627-1838.zip
  19. Diagnostics are attached in the first post. Give me a min and I'll locate the time when it happened. In the syslog file:
      May 16 12:29:52 Tower sshd[7773]: Accepted password for root from 10.0.1.11 port 51344 ssh2
      May 16 12:30:12 Tower kernel: usb 4-1: USB disconnect, device number 2
      So between 12:29:52 and 12:30:12 I ran hdparm -W1 /dev/sdb
  20. Ok, I forgot to mention I power cycled the cabinet that hosts the drives that had errors, after the first stop of the array. I take it nothing can be done to save the lost files now? About split level, I'm running: Automatically split any directory as required. I tried browsing the share from the web GUI; how could I verify whether they are anywhere else, @itimpi?
  21. The latest Fix Common Problems version was complaining that one of my disks did not have write cache enabled. So in order to fix that I ran hdparm -W1 /dev/sdb (while the array was still mounted). hdparm took longer than normal and ended with a failure; then, just seconds after hdparm ended, unraid started reporting errors on all 4 disks connected to the same hdd cabinet as /dev/sdb. Eventually unraid dropped one of the disks and marked it as failed. I stopped the array. Unassigned the disk that had been marked as failed. Started the array. Stopped the array. Re-assigned the disk that had failed. Now it's running Parity-Sync/Data-Rebuild. The thing is, I browsed some of my shares and noticed there were folders that were just empty. Will the files inside those folders be recreated when the parity/data rebuild is completed, or did I suffer major data loss? If so, can I do anything to try to recover those files? (No backup.) Edit: Forgot diagnostics. tower-diagnostics-20190516-1724.zip
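      (For anyone checking the same warning: hdparm can report the current write-cache state without changing anything; a sketch using the same /dev/sdb as above:)
          # Query only: prints the current write-caching setting
          hdparm -W /dev/sdb
          # Enable write cache (the command that was run here)
          hdparm -W1 /dev/sdb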
  22. I did the: "... start the array without any cache devices assigned, stop the array, assigned them all (there can't be the overwrite warning for any cache device) and start again, for the latter best to temporarily disable any services using the cache pool, like docker and/or VMs." And everything went well. Big up @johnnie.black
  23. I felt the need to clarify how I ended up vfio-binding the HBA. Might save someone else from suffering the same fate. At first when I upgraded to 6.7.0 I had problems booting unraid properly, while safe mode worked. After troubleshooting the non-boot I found the cause to be something installed from the devpack plugin, which I removed and could then boot normally. But by then I had already made a clean unraid install on the USB and just copied over my config folder (but I forgot the syslinux folder). The forgotten syslinux folder caused the pcie_acs_override=downstream to disappear from the append line in my syslinux configuration, which in turn caused my HBA and GPU to end up in the same IOMMU group. The new vfio-pci.cfg will by default pass through everything in the IOMMU group, and that is what screwed me over when I started my array. So all in all, bad luck mixed with some laziness screwed me over. Don't be lazy, people, and double check your configurations! Now I'm hoping some btrfs guru will drop some magic repair command on me to save the day.
      IOMMU group 1:
      [8086:1901] 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05)
      [8086:1905] 00:01.1 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x8) (rev 05)
      [10de:1b81] 01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)
      [10de:10f0] 01:00.1 Audio device: NVIDIA Corporation GP104 High Definition Audio Controller (rev a1)
      [1000:0072] 02:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
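      (For reference, the line in question lives in /boot/syslinux/syslinux.cfg; a rough sketch of what the default boot entry looks like with the override in place, exact labels and extra parameters will differ per setup:)
          label Unraid OS
            menu default
            kernel /bzimage
            append pcie_acs_override=downstream initrd=/bzroot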
  24. So I just upgraded to 6.7.0, and when I rebooted after adding the new vfio-pci.cfg to pass through my GPUs, unraid somehow thought I passed through my HBA (and my dumb ass didn't notice), which in turn caused me to start the BTRFS cache array with just 2 SSDs (instead of the normal 10; the missing 8 disks are all connected to the "passed through" HBA). Now my cache array is in the "Unmountable: No file system" state and I'm scared. I've rolled back the vfio-pci.cfg and unraid can see all SSDs, but I need help to repair my BTRFS RAID10 cache filesystem. I haven't started the array after my first fuckup, and if I try to assign the disks to cache now I get the "All existing data on this device will be OVERWRITTEN when array is Started." warning. Technically the data should still be on the disks, but the metadata and logs are probably bonkers. I've encountered "Unmountable: No file system" on the main array before, and at that time I repaired it with xfs_repair. But that was just 1 disk. Any guidance on how to proceed to repair the cache array would be very much appreciated.
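      (Before assigning anything, a non-destructive way to confirm all ten pool members are still visible with intact btrfs metadata; a sketch, with /dev/sdX1 as a placeholder for one of the cache SSDs:)
          # List every btrfs filesystem the kernel can see, with member devices and UUIDs
          btrfs filesystem show
          # Peek at a single member's superblock without mounting (placeholder device)
          btrfs inspect-internal dump-super /dev/sdX1 | head -n 20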