juanamingo

Everything posted by juanamingo

  1. Gotcha - thank you sir! It looks like I'd have to compile a custom kernel to suppress that.
  2. Well, I finally got a chance to look into this. I have 4 NVMe slots on my board and 2 NVMe drives in a cache pool. I pulled both drives and installed heatsinks on them, because every so often when the mover was running one of the drives would hit 50-60C and I didn't like that much. (I think it was the drive that was erroring, but I'm not sure.) Originally the drive with the errors was in slot 0 and the 2nd drive in slot 1. When reinstalling the drives I put the drive with the errors in slot 2 and the 2nd drive in slot 0, leaving slots 1 and 3 unoccupied. I'm still seeing the errors, and they've followed the drive... Some Google research seems to indicate this is a harmless error - BUT in a week or so the log folder will fill up and I'll need to delete syslog.1 or reboot (see the log-cleanup sketch after this list). Any suggestions besides seeing if Samsung will replace the drive?
  3. Thanks - I'll try that as soon as I can pull the server, and I'll let you know. I shouldn't be worried about a drive reporting only 315 power-on hours when it should have > 1500?
  4. Good afternoon all! Fix Common Problems just notified me that my log folder was filling up - currently about 67% full. I took a look and saw that there are two or three syslog files totaling about 256 MB, so I took a look in the newest syslog and I'm seeing this repeated:

     Jun 6 15:31:25 Guardian kernel: nvme 0000:02:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
     Jun 6 15:31:25 Guardian kernel: nvme 0000:02:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
     Jun 6 15:31:25 Guardian kernel: nvme 0000:02:00.0: [ 0] RxErr (First)
     Jun 6 15:31:25 Guardian kernel: nvme 0000:02:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: It has been corrected by h/w and requires no further action
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: event severity: corrected
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: Error 0, type: corrected
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: section_type: PCIe error
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: port_type: 0, PCIe end point
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: version: 0.2
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: command: 0x0406, status: 0x0010
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: device_id: 0000:02:00.0
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: slot: 0
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: secondary_bus: 0x00
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: vendor_id: 0x144d, device_id: 0xa80a
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: class_code: 010802
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0000
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: Error 1, type: corrected
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: section_type: PCIe error
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: port_type: 0, PCIe end point
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: version: 0.2
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: command: 0x0406, status: 0x0010
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: device_id: 0000:02:00.0
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: slot: 0
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: secondary_bus: 0x00
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: vendor_id: 0x144d, device_id: 0xa80a
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: class_code: 010802
     Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0000

     device_id 0000:02:00.0 is my Samsung 980 Pro NVMe drive, which is the 2nd drive in my cache pool. I haven't noticed this error before and have been running this setup for about 3-4 months. The only thing that has changed is upgrading from 6.9 -> 6.10.1 -> 6.10.2. One thing that's weird: I looked at the attributes for the drive and see "Power on hours 315 (13d, 3h)" - which is NOT right... Its partner drive has "Power on hours 2,605", which is ~108 days or ~3 1/2 months, and that sounds about right as they were installed at almost the same time (about a week apart). Any suggestions? (See the smartctl sketch after this list.) guardian-diagnostics-20220606-1527.zip
  5. Thanks! I haven't had a chance to get back to this, but if I'm able to try that, I'll let you know how it goes.
  6. Interesting... one of the few things I haven't tried is pulling the Quadro... Maybe roll back the driver too... I'll give those a whirl and report back. Thanks!
  7. Anyone have any ideas? Any other info or configs I need to provide? I've been trying every setting I could think of or came across, but no luck so far. I did read that Threadrippers were problematic for passthrough, but also read that it had been fixed in 6.8 or 6.9 (don't remember which).
  8. I contacted Supermicro about this and they had nothing illuminating to add, other than that they recommend populating the GPU in the 1st slot (closest to the CPU) - which is counterintuitively named Slot 7... So I moved the GPU to that slot, bound it to vfio at boot again, rebooted, and updated the VM. I'm stuck in D3 now. Here are the logs while stuck in D3 and after force stop (see the recovery sketch after this list). Edit: Now getting `internal error: Unknown PCI header type '127' for device '0000:41:00.0'` when starting the VM. guardian-diagnostics-20220128-1638-after-force-stop.zip guardian-diagnostics-20220128-1635-_in-D3.zip
  9. Well, I thought I had made some progress... I tried stubbing the card in syslinux, but it didn't appear to work (it still showed in the Nvidia Driver card list). I removed that entry, tried vfio-bind instead, and rebooted. The card didn't show under the Nvidia list, and I passed it through without a BIOS to the VM. I ran my benchmark and it ran for a good 10 minutes; I thought I was good... and then it crashed just like before. I'm not sure if it makes a difference, so I dumped diagnostics before and after force stopping the VM. My board _only_ supports UEFI boot, so I can't try legacy. It looks like one of my M.2 cache drives dropped when the card crashed... maybe there's something wrong with the drives... guardian-diagnostics-20220127-1848-crashed.zip guardian-diagnostics-20220127-1851-force-stop.zip
  10. I'm pretty sure you'd only notice if you were benchmarking; at least that's been my experience. Hope it helps! As far as your GPU passthrough goes - have you installed the Nvidia drivers and run anything graphically intensive on it? Is it stable? I'm having issues as soon as I do anything GPU-intensive and was curious if you did anything special, or if it just worked. I posted my woes here
  11. Just out of curiosity, why pass the drive through? Just to install directly to the drive? You could mount the drive separately (or add another cache pool with that drive) and then point the vdisk image at that drive, making it the size of the drive (see the sketch after this list). That'd be cleaner IMHO - if you ever wanted to upgrade the M.2 down the road, you'd just have one file to move to the new drive.
  12. Hi all, I'm trying to consolidate several machines into one - specifically two gaming VMs with dedicated cards. I recently updated my hardware from an 8-core Xeon X99 build to a 32-core Threadripper Pro. The update went flawlessly and it's been running solid for over a week. The board BIOS is up to date (1.0C) and BIOS settings are stock / default. I'm having issues with VM stability when using passthrough graphics.

      I have a Quadro P2200 that's dedicated to Plex transcoding, and then a GTX 1080. (I also have a GTX 980 currently plugged in to check if it affects overall thermals, but all my testing below was done without that card in the system.) The 1080 is passed through to a freshly created Windows 10 VM, using a BIOS that I dumped from the card itself and removed the header from (per SIO's video). I have set it to 4 cores / 8 threads (specifically cores 5/37, 7/39, 9/41, 11/43), 16GB RAM, Q35-5.1, USB 3.0 qemu XHCI, and the only USB devices passed through are the keyboard and mouse (see the verification sketch after this list). I dumped the BIOS for both cards using GPU-Z on an up-to-date Win 10 machine (non-VM), and if it matters, I used HexFiend on a Mac to edit the ROM. The GTX 1080 came from this machine. I've run the Heaven benchmark for an hour and it's fine, so I believe the card is OK. My IOMMU groups are correctly separated from what I can see; each card is in its own group.

      When starting the VM with the card passed through, it disappears from the Nvidia Driver plugin page - as I would expect. When the VM is shut down or force stopped, it re-appears. I believe that is correct behavior. The card shows up in Windows Device Manager. I updated Windows completely before passing the card through, using VNC. I set up RDP and connected, installed the latest Nvidia driver (and GeForce Experience), and everything seemed fine. Disconnecting from RDP and using a directly attached monitor (HDMI) and the passed-through kbd/mouse, I can log in and everything seems fine. I installed the latest MSI Afterburner, and everything seems within spec; GPU temp was around 53C.

      I installed the Heaven benchmark and it ran for a few minutes, then the screen garbled and went black. I can't get to the VM via monitor / keyboard, RDP or VNC. Recently I added VNC as the primary and the 1080 as the secondary graphics adapter to see if I could VNC to the machine when it "crashes", but nothing works. It's locked / crashed. If I force stop the VM and restart it, I can get access again via monitor / RDP / VNC. Any time I try anything graphically intensive, I get this crash / hard lockup on the VM within a few minutes. Any thoughts or suggestions on what else to check? I've attached diagnostics from after the crash / force stop of the VM. guardian-diagnostics-20220126-1625.zip
  13. No worries on that - it just stood out as odd to me, is all. It could be that the other one did the same and I had just rebooted, so uptime was low. I'm getting old, so let's go with bad memory =) I was reading it as there being about 50% CPU usage; this is what top and the UI are showing now. I'm guessing now that the usage is host-related - not container-related - just like the uptime.
  14. It's been a while since I used the UI - and granted, it was a different docker - but I have a couple of questions. 1) The uptime - I thought I remembered it being the container uptime, not the server uptime? 2) The load averages - on a fresh install (no servers etc.), unless I'm reading them wrong, they seem like they're high? The Unraid dashboard, top and htop aren't showing any weird usage.
  15. (...The Castle of) aaarrrrggh. Oooohoohohooo!... No, no. 'Aaaauugggh', at the back of the throat. Aaauugh. N-- no. No, no, no, no. 'Oooooooh', in surprise and alarm. Nope... I think I'm the peasant now... My browser auto-filled the user as `root` because of the last MineOS container and I didn't think to change it... `nobody` worked! (I figured `Arrrgh` fit well with my reaction when I realized I was dumb.) Thanks again man!
  16. Help! Help! I'm being repressed! /appdata is on the cache drive already, but I did change it to use /mnt/cache/appdata/binhex-mineos-node and started it. At the login, using root / mineos, I get: Have to restart the container to get back to the login or I get:
  17. Ahh! Now we see the violence inherent in the system!
  18. Hey man, first off - thanks for all the work you do with all your docker images! Much appreciated. Secondly... "Ni"! Finally, to my issue... I just did a fresh install of this image, changed nothing on the install (not even the pw - and none of the ports are conflicting), get to the login, log in with root / mineos, and get a `This site can't be reached` error; the web UI seems to have crashed and I have to restart it to get back to the login. Can't get in no matter what. Any suggestions (anything else I can provide detail-wise to help troubleshoot)? Thanks in advance!
  19. Correct, I thought it was working because I never saw any errors pop up in the logs, and I never looked any further. I'm in the same boat as you, wanting to keep all my drives on my HBA, but it is what it is. If I have some time this week, I'll swap to the motherboard ports, and when I do I'll post back with results. Not sure when I'll have time though...
  20. Thanks! Next time I perform maintenance, I'll give that a try!
  21. Ran that and got back:

      /etc/libvirt: 924.1 MiB (969015296 bytes) trimmed
      /var/lib/docker: 12.3 GiB (13221044224 bytes) trimmed

      So then it's Not available / Not supported / Not whatever. Not having paid much attention to it in the past (I always assumed it was working), is that something I should worry about / investigate and get working (the cache SSD trim)? (See the trim sketch after this list.)
  22. That's what I wasn't 100% sure of - I figured I'd see something related to the cache SSD being trimmed, but I'm not seeing it. Does that mean it's not available/supported for trim, or just not running? I have the SSD trim schedule set to daily at 00:00.
  23. Sorry that didn't help! Yep, 6.5.3 here. I just double-checked and the trim plugin is indeed enabled and running daily at 00:00. I looked at my logs for last night and don't see any errors - but I also don't see any explicit 'trim' log entries other than:

      Aug 9 00:00:02 Guardian root: /etc/libvirt: 923 MiB (967876608 bytes) trimmed
      Aug 9 00:00:02 Guardian root: /var/lib/docker: 12.2 GiB (13096083456 bytes) trimmed

      but I don't believe those are actually trim-related. I'm hesitant to update the firmware of the HBA because "if it ain't broke, don't fix it" and I don't have time to deal with a downed / glitchy system. That's the whole reason I switched from the Marvell-based 2760A to this HBA - I was tired of having my parity drives redball on me weekly. If there's something I can try out or test, let me know and I'll do my best.
  24. I'm having the same issue, and not only does typing the '|' character ('shift + \') produce '>' and not the expected '|', but the '<' key ('shift + ,') also produces '>' and not the expected '<'. Typing the non-shifted characters, \ and , works fine. I've tried the browser VNC, TightVNC and TigerVNC. Unraid 6.5.3.

      Arch Linux: typing 'SHIFT + \' produces '>' and 'SHIFT + ,' produces '>'
      Ubuntu 14.04: typing 'SHIFT + \' produces '>' and 'SHIFT + ,' produces '>'
      OSX High Sierra: both 'SHIFT + \' and 'SHIFT + ,' produce a '+' over top a '_' (a "plus-minus" symbol if you will)
      Windows 10: the '\' and '|' can be typed fine, but 'SHIFT + ,' produces '|' instead of '<' and 'SHIFT + .' produces '|' instead of '>'

      If any logs or anything else are needed, please let me know (see the keymap sketch after this list).
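
A note on post 2 above (the AER messages filling /var/log): below is a minimal sketch, assuming Unraid's default layout where /var/log is a small RAM-backed filesystem and the rotated file is the syslog.1 mentioned in the post. The pci=noaer kernel parameter in the last comment is a commonly cited workaround, not something suggested in this thread, and it may not silence firmware-first APEI/GHES reports like the ones shown in post 4.

     # See how much space the syslog files are taking
     du -h /var/log/syslog*

     # Reclaim space without a reboot by emptying the rotated file
     truncate -s 0 /var/log/syslog.1

     # Possible (untested here) suppression: add pci=noaer to the append line in
     # /boot/syslinux/syslinux.cfg and reboot; whether it also quiets the APEI
     # "Hardware Error" reports is hardware/firmware dependent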
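For the power-on-hours discrepancy in post 4, a sketch of how the two pool members could be compared directly from the console; the /dev/nvme0n1 and /dev/nvme1n1 device names are assumptions and may be swapped on this system.

     # Map nvme0/nvme1 to their PCI addresses to see which one is 0000:02:00.0
     ls -l /sys/class/nvme/

     # Compare SMART power-on hours as reported by the drives themselves
     smartctl -a /dev/nvme0n1 | grep -i 'power on hours'
     smartctl -a /dev/nvme1n1 | grep -i 'power on hours'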
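On the `Unknown PCI header type '127'` error in posts 8 and 9: that message generally means the GPU never came back from its crashed/D3 state, so its PCI config space reads back as all ones. A sketch of checks commonly suggested in passthrough threads (not advice from this one); 0000:41:00.0 is the address from post 8, and a full host reboot may still be the only reliable recovery.

     # Confirm the card is bound to vfio-pci (not nvidia) before the VM starts
     lspci -nnk -s 41:00.0

     # After a crash, try detaching and rescanning the device instead of rebooting;
     # this only helps if the card can actually be reset
     echo 1 > /sys/bus/pci/devices/0000:41:00.0/remove
     echo 1 > /sys/bus/pci/rescan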
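A sketch of the layout suggested in post 11, with an entirely hypothetical pool name (vmpool), VM folder and size: create a vdisk that fills the dedicated pool and point the VM at it instead of passing the NVMe device through. Unraid will also create the vdisk for you if you just set the location and size in the VM form; the commands below only illustrate the idea.

     # Hypothetical example: a raw vdisk occupying most of a dedicated 1 TB pool
     mkdir -p /mnt/vmpool/domains/Win10
     qemu-img create -f raw /mnt/vmpool/domains/Win10/vdisk1.img 950G

     # Then set the VM's primary vdisk location to that file in the VM settings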
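For the setup described in post 12, a sketch of how the CPU pinning and the custom ROM can be double-checked from the host; 'Windows10' is a placeholder for whatever the VM is actually named.

     # Show the current vCPU-to-host-core pinning (should list 5/37, 7/39, 9/41, 11/43)
     virsh vcpupin Windows10

     # Confirm the header-stripped ROM file is actually referenced by the GPU hostdev
     virsh dumpxml Windows10 | grep -i rom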
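On the trim question in posts 21-23: the two "trimmed" lines in those logs come from the loopback-mounted libvirt and docker images, not from the cache SSD itself. A sketch of how the cache mount could be trimmed and checked by hand, assuming the /mnt/cache path used elsewhere in these posts; note that many SAS HBAs do not pass TRIM through to SATA SSDs, which would explain the cache device never showing up in the trim output.

     # Verbose manual trim of the cache filesystem; a byte count means TRIM works end to end
     fstrim -v /mnt/cache

     # Check whether the block devices advertise discard support (non-zero DISC-GRAN/DISC-MAX)
     lsblk --discard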
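For the shifted-key mix-ups in post 24, one commonly suggested thing to check (again, not advice from this thread) is the VNC keymap configured on the VM, since a server-side keymap that doesn't match the client layout is a usual suspect for wrong shifted characters. 'Windows10' is a placeholder VM name.

     # See whether a fixed keymap is set on the VM's VNC graphics device
     virsh dumpxml Windows10 | grep -i keymap

     # If one is set (e.g. keymap='en-us'), it can be changed or removed with:
     virsh edit Windows10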