
Sinopsis

Members
  • Posts

    22

Posts posted by Sinopsis

  1. On 8/15/2022 at 9:53 AM, dustyken said:

    Having the same issue as @SockDust.  Anyone know how to proceed?

     

    I was able to solve this by starting another container with the MySQL version shown in the log file, connecting to that container, and shutting MySQL down safely with the following command:

     

    mysqladmin shutdown -p

     

    Then restart your original container with the latest tag (or whichever tag you prefer).
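    Putting it together, the whole recovery looked roughly like the sketch below. The image tag, container name, and data path are examples, not anything official; substitute the exact MySQL version your log file reports and the appdata path your own container uses.

    # Start a throwaway container with the MySQL version the log file reports (8.0.29 here is
    # only an example), pointed at the existing data directory (example path).
    docker run -d --name mysql-recovery \
      -v /mnt/user/appdata/mysql:/var/lib/mysql \
      mysql:8.0.29

    # Shut MySQL down cleanly inside that container (prompts for the root password).
    docker exec -it mysql-recovery mysqladmin shutdown -p

    # Remove the temporary container, then start the original container again with its usual tag.
    docker rm mysql-recovery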

  2. I was watching the system log this time when it crashed. This was in it, and the console output is a little different this time:

     

    Jul 9 23:29:17 SERVER1 kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
    Jul 9 23:29:17 SERVER1 kernel: PGD 0 P4D 0
    Jul 9 23:29:17 SERVER1 kernel: Oops: 0000 [#1] SMP PTI
    Jul 9 23:29:17 SERVER1 kernel: CPU: 5 PID: 3593 Comm: CPU 10/KVM Tainted: G W O 4.19.107-Unraid #1
    Jul 9 23:29:17 SERVER1 kernel: Hardware name: Supermicro X9DRH-7TF/7F/iTF/iF/X9DRH-7TF/7F/iTF/iF, BIOS 3.3 07/13/2018
    Jul 9 23:29:17 SERVER1 kernel: RIP: 0010:drop_spte+0x4b/0x78 [kvm]
    Jul 9 23:29:17 SERVER1 kernel: Code: 4c 01 e0 72 09 ba ff ee 00 00 48 c1 e2 1f 48 01 d0 ba f5 ff 7f 00 4c 89 e6 48 c1 e8 0c 48 c1 e2 29 48 c1 e0 06 48 8b 54 10 28 <48> 2b 72 40 48 89 d7 48 c1 fe 03 e8 63 d6 ff ff 48 89 ef 48 89 c6
    Jul 9 23:29:17 SERVER1 kernel: RSP: 0018:ffffc9000ce53c50 EFLAGS: 00010202
    Jul 9 23:29:17 SERVER1 kernel: RAX: 000000007f20a640 RBX: ffffc900243250e0 RCX: 0000000000000000
    Jul 9 23:29:17 SERVER1 kernel: RDX: 0000000000000000 RSI: ffff889fc8299668 RDI: 7fffc4408733186c
    Jul 9 23:29:17 SERVER1 kernel: RBP: ffffc9000cb14000 R08: 0000000000000001 R09: 0000000000000000
    Jul 9 23:29:17 SERVER1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff889fc8299668
    Jul 9 23:29:17 SERVER1 kernel: R13: 0000000000000000 R14: ffff8884a1450000 R15: ffff8884a1450008
    Jul 9 23:29:17 SERVER1 kernel: FS: 0000152a383ff700(0000) GS:ffff889fff940000(0000) knlGS:0000000000000000
    Jul 9 23:29:17 SERVER1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Jul 9 23:29:17 SERVER1 kernel: CR2: 0000000000000040 CR3: 0000000124c1e005 CR4: 00000000000626e0
    Jul 9 23:29:17 SERVER1 kernel: Call Trace:
    Jul 9 23:29:17 SERVER1 kernel: kvm_zap_rmapp+0x3a/0x5e [kvm]
    Jul 9 23:29:17 SERVER1 kernel: ? kvm_io_bus_read+0x43/0xcc [kvm]
    Jul 9 23:29:17 SERVER1 kernel: kvm_unmap_rmapp+0x5/0x9 [kvm]
    Jul 9 23:29:17 SERVER1 kernel: kvm_handle_hva_range+0x11c/0x159 [kvm]
    Jul 9 23:29:17 SERVER1 kernel: ? kvm_zap_rmapp+0x5e/0x5e [kvm]
    Jul 9 23:29:17 SERVER1 kernel: kvm_mmu_notifier_invalidate_range_start+0x49/0x8f [kvm]
    Jul 9 23:29:17 SERVER1 kernel: __mmu_notifier_invalidate_range_start+0x78/0xc9
    Jul 9 23:29:17 SERVER1 kernel: change_protection+0x300/0x879
    Jul 9 23:29:17 SERVER1 kernel: change_prot_numa+0x13/0x22
    Jul 9 23:29:17 SERVER1 kernel: task_numa_work+0x20b/0x2b5
    Jul 9 23:29:17 SERVER1 kernel: task_work_run+0x77/0x88
    Jul 9 23:29:17 SERVER1 kernel: exit_to_usermode_loop+0x4b/0xa2
    Jul 9 23:29:17 SERVER1 kernel: do_syscall_64+0xdf/0xf2
    Jul 9 23:29:17 SERVER1 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
    Jul 9 23:29:17 SERVER1 kernel: RIP: 0033:0x152a3f5e14b7
    Jul 9 23:29:17 SERVER1 kernel: Code: 00 00 90 48 8b 05 d9 29 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a9 29 0d 00 f7 d8 64 89 01 48
    Jul 9 23:29:17 SERVER1 kernel: RSP: 002b:0000152a383fe678 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    Jul 9 23:29:17 SERVER1 kernel: RAX: 0000000000000000 RBX: 000000000000ae80 RCX: 0000152a3f5e14b7
    Jul 9 23:29:17 SERVER1 kernel: RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000001f
    Jul 9 23:29:17 SERVER1 kernel: RBP: 0000152a3988a2c0 R08: 000055c2583d0770 R09: 000000000000ffff
    Jul 9 23:29:17 SERVER1 kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
    Jul 9 23:29:17 SERVER1 kernel: R13: 0000152a3dcc0002 R14: 0000000000001072 R15: 0000000000000000
    Jul 9 23:29:17 SERVER1 kernel: Modules linked in: vhost_net tun vhost tap kvm_intel kvm cdc_acm ccp xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle ip6table_filter ip6_tables xt_nat veth macvlan ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs md_mod ixgbe(O) sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper isci ipmi_ssif intel_cstate mpt3sas nvme libsas i2c_i801 ahci raid_class pcc_cpufreq scsi_transport_sas intel_uncore i2c_core intel_rapl_perf nvme_core libahci wmi ipmi_si button [last unloaded: tun]
    Jul 9 23:29:17 SERVER1 kernel: CR2: 0000000000000040
    Jul 9 23:29:17 SERVER1 kernel: ---[ end trace 1c4b462ac4b3e0e1 ]---
    Jul 9 23:29:17 SERVER1 kernel: RIP: 0010:drop_spte+0x4b/0x78 [kvm]
    Jul 9 23:29:17 SERVER1 kernel: Code: 4c 01 e0 72 09 ba ff ee 00 00 48 c1 e2 1f 48 01 d0 ba f5 ff 7f 00 4c 89 e6 48 c1 e8 0c 48 c1 e2 29 48 c1 e0 06 48 8b 54 10 28 <48> 2b 72 40 48 89 d7 48 c1 fe 03 e8 63 d6 ff ff 48 89 ef 48 89 c6
    Jul 9 23:29:17 SERVER1 kernel: RSP: 0018:ffffc9000ce53c50 EFLAGS: 00010202
    Jul 9 23:29:17 SERVER1 kernel: RAX: 000000007f20a640 RBX: ffffc900243250e0 RCX: 0000000000000000
    Jul 9 23:29:17 SERVER1 kernel: RDX: 0000000000000000 RSI: ffff889fc8299668 RDI: 7fffc4408733186c
    Jul 9 23:29:17 SERVER1 kernel: RBP: ffffc9000cb14000 R08: 0000000000000001 R09: 0000000000000000
    Jul 9 23:29:17 SERVER1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff889fc8299668
    Jul 9 23:29:17 SERVER1 kernel: R13: 0000000000000000 R14: ffff8884a1450000 R15: ffff8884a1450008
    Jul 9 23:29:17 SERVER1 kernel: FS: 0000152a383ff700(0000) GS:ffff889fff940000(0000) knlGS:0000000000000000
    Jul 9 23:29:17 SERVER1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Jul 9 23:29:17 SERVER1 kernel: CR2: 0000000000000040 CR3: 0000000124c1e005 CR4: 00000000000626e0

     


  3. Not sure if this is somehow related, but twice today while the mover was running, I started getting tons of errors like this:

     

    Jul 9 17:00:59 SERVER1 move: move: create_parent: /mnt/cache/media/Movies/The Fifth Element (1997) (PG-13)/extrafanart error: Read-only file system
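    If anyone else hits this, a quick way I'd check whether the cache mount itself has flipped to read-only (rather than a single share) is below; /mnt/cache is just the usual cache mount point on my box, and the grep patterns are only a starting point.

    # "ro" instead of "rw" in the mount options means the filesystem remounted itself read-only.
    grep ' /mnt/cache ' /proc/mounts

    # Recent kernel messages usually say why it went read-only (I/O errors, filesystem errors, etc.).
    dmesg | grep -iE 'read-only|remount|error' | tail -n 20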

     

     

  4. 1 minute ago, jonp said:

    Ok, to be fair, Hyper-V and KVM are not anywhere close on the spectrum of hypervisors and if other underlying gear changed (including the HBA and storage), that obviously could have an impact.  What about BIOS updates?  Any available?  Another thing you could try would be to disable IOMMU in the BIOS to see if that has any impact.

    For sure, Hyper-V is rather lacking, although, to be fair, if it had USB passthrough I probably would have just left it as a Windows box on a RAID10 volume :)  I'm much more comfortable with M$.

     

    No, it already has the most current BIOS update, from 7/2017, and I think the only thing that update addressed was the Spectre vulnerability.

     

    I'll try moving it off 0,12 and see if it's more stable. If it crashes again, I'll swap the USB stick and disks to the 2nd box and move that box's components to this box to see if I experience the same behavior. If so, I'll try disabling IOMMU (not familiar with that); some notes to myself on that are below.
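    (Notes to self on the IOMMU option: besides the BIOS toggle, it looks like the IOMMU can also be disabled with a kernel boot parameter. On Unraid that would presumably mean editing the append line in the syslinux config on the flash drive; the path and exact layout below are my assumption, not something confirmed in this thread.)

    # /boot/syslinux/syslinux.cfg (assumed location of the Unraid boot menu config)
    label Unraid OS
      menu default
      kernel /bzimage
      append intel_iommu=off initrd=/bzroot
    # intel_iommu=off disables the IOMMU on Intel platforms; remove it again to re-enable.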

  5. 1 minute ago, jonp said:

    Wow, that's pretty concerning.  If there is no hardware pass-through happening and you're getting these kinds of crashes, it leads me to believe there's a buggy BIOS on your hardware.  What is the underlying hardware on this system?

     

    I pulled a pair of these out of our datacenter and brought them home:

     

    https://www.supermicro.com/products/motherboard/Xeon/C600/X9DRH-7F.cfm

     

    They were rock solid as our Hyper-V hypervisors for several years with no issues.

     

    The only difference I can think of is that I've flashed the onboard LSI 2208 to act as a 2308 HBA instead.

  6. 2 minutes ago, jonp said:

    Ok, what happens if you path the storage to something other than that PCIe NVMe Unassigned Device?  Again, the goal here is to narrow down the root cause or what combination is causing it.

     

    Another thing you could try would be changing the Machine Type or the BIOS type to see if that has an effect.

    I had crashes before with the default path (on the cache mount), but couldn't get the console to come up via IPMI in the previous crashes, so I was unable to see the call stack. This is the first time it's crashed where I was able to not only see the console but interact with it: I could log in and use the CLI, but had no network connectivity. I couldn't shut down the VM gracefully or even force-stop it.

     

    I hate trying to troubleshoot problems that I can't reproduce to test :/

  7. 3 minutes ago, Jerky_san said:

    Spin locks, I believe, are when something just constantly sits there waiting on something. I can see you're allocating core 0/12. I would say don't do that, because Unraid will ALWAYS use core 0/12 even when you try isolating it. It just doesn't work, so I'd highly suggest removing that one. I'm still researching, but who knows, it might fix it ^_^ lol

    OK, I've unselected CPUs 0/12 from the VM.  The crashes are pretty random and don't seem to follow any pattern that I can see.

     

    Unrelated, but should we also try to prevent Docker from running on 0/12? Something like the sketch below, maybe?
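    In plain Docker terms (outside of whatever per-container pinning the Unraid UI does), I'm picturing something like this; the container name and image are placeholders, and the CPU list assumes a 24-thread box where 0 and 12 are the pair to avoid.

    # Restrict a container to every CPU except 0 and 12.
    docker run -d --name example-app \
      --cpuset-cpus="1-11,13-23" \
      example/image:latest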

  8. 3 minutes ago, jonp said:

    Hi there,

     

    Are you trying to pass through the NVMe drive to the VM directly?  If so, try not doing that and see if you can reproduce the lockup.  If so, then the issue stems from the underlying hardware/VM configuration.  If the issue goes away, then you know it's isolated to that PCIe device.

    No, I'm not trying to pass it through.  I just have my VM storage set to the unassigned device that happens to be that PCIe NVMe drive.  In my case, that's /mnt/disks/VirtualMachines/

     

     

     

  9. Update:

     

    If I'm reading this correctly:

     

    root@SERVER1:/sys# lscpu --all --extended
    CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ    MINMHZ
      0    0      0    0 0:0:0:0          yes 2500.0000 1200.0000
      1    0      0    1 1:1:1:0          yes 2500.0000 1200.0000
      2    0      0    2 2:2:2:0          yes 2500.0000 1200.0000
      3    0      0    3 3:3:3:0          yes 2500.0000 1200.0000
      4    0      0    4 4:4:4:0          yes 2500.0000 1200.0000
      5    0      0    5 5:5:5:0          yes 2500.0000 1200.0000
      6    1      1    6 6:6:6:1          yes 2500.0000 1200.0000
      7    1      1    7 7:7:7:1          yes 2500.0000 1200.0000
      8    1      1    8 8:8:8:1          yes 2500.0000 1200.0000
      9    1      1    9 9:9:9:1          yes 2500.0000 1200.0000
     10    1      1   10 10:10:10:1       yes 2500.0000 1200.0000
     11    1      1   11 11:11:11:1       yes 2500.0000 1200.0000
     12    0      0    0 0:0:0:0          yes 2500.0000 1200.0000
     13    0      0    1 1:1:1:0          yes 2500.0000 1200.0000
     14    0      0    2 2:2:2:0          yes 2500.0000 1200.0000
     15    0      0    3 3:3:3:0          yes 2500.0000 1200.0000
     16    0      0    4 4:4:4:0          yes 2500.0000 1200.0000
     17    0      0    5 5:5:5:0          yes 2500.0000 1200.0000
     18    1      1    6 6:6:6:1          yes 2500.0000 1200.0000
     19    1      1    7 7:7:7:1          yes 2500.0000 1200.0000
     20    1      1    8 8:8:8:1          yes 2500.0000 1200.0000
     21    1      1    9 9:9:9:1          yes 2500.0000 1200.0000
     22    1      1   10 10:10:10:1       yes 2500.0000 1200.0000
     23    1      1   11 11:11:11:1       yes 2500.0000 1200.0000
    root@SERVER1:/sys# 

    Then the logical CPU selection corresponds to:

     

    0,12, 1,13, 2,14, 3,15, 4,16, 5,17 are physical CPU #1

    6,18, 7,19, 8,20, 9,21, 10,22, 11,23 are physical CPU #2

     

    Which makes sense, but shoots a hole in my theory about the PCIe bus :(
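    A quicker way to double-check that pairing than eyeballing the lscpu table is to ask sysfs for each CPU's hyperthread siblings; this is standard kernel sysfs, nothing Unraid-specific:

    # Print each logical CPU alongside its hyperthread sibling(s).
    for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
      echo "$(basename $cpu): $(cat $cpu/topology/thread_siblings_list)"
    done
    # e.g. "cpu0: 0,12" would confirm that 0 and 12 are two threads of the same physical core.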

  10. I feel pretty confident that the lockups have to do with the VMs.  I rebuilt this box and right now have only one VM on it.  I see KVM references in the call stack in the crash information.

     

    My first thought is that the storage the VM is on might be plugged into a PCIe lane that is connected to a different physical CPU.

     

    It's on an Intel 750 PCIe NVMe drive plugged into PCIe slot 2, which, according to the diagram on page 1-4 of this manual: https://www.supermicro.com/manuals/motherboard/C606_602/MNL-1306.pdf  should be CPU1.

     

    In the attached "Capture.PNG", which CPUs might be physical CPU 1 and which might be physical CPU 2? (A sysfs check that might answer this directly is sketched below.)

     


    Capture.PNG
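    For reference, PCI devices expose which NUMA node (i.e. which physical CPU's root complex) they hang off of in sysfs, which may be easier than tracing the slot diagram. The device address below is an example, not the actual address on my board:

    # Find the NVMe drive's PCI address, then ask which NUMA node it is attached to.
    lspci | grep -i 'non-volatile'
    cat /sys/bus/pci/devices/0000:02:00.0/numa_node
    # 0 would mean CPU1, 1 would mean CPU2; -1 means the firmware didn't report it.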

  11. I see the benefits, but as someone who primarily deals with enterprise systems, I prefer to have direct support for products I pay for, even if I have to pay more. 

     

    Maybe I'm being overly critical because I'm frustrated by having so many issues with the system (besides the one I posted about, which is more of an annoyance than anything).

     

    Random hard locks: I have to power-cycle the server to get it back.  I just pulled this server out of our datacenter, where it was one of our primary hypervisors and had been rock solid for years.

     

    Active Directory integration seems completely broken.  Every time it reboots it shows as "unjoined", and the logs are full of "root: chown: invalid user: 'Domain Admins:Domain Users'" errors when it finally does show as joined.

     

    I've got about 10 days left on this trial, and at this point I'm considering scrapping it completely and just using Proxmox with the hardware RAID controller.  I liked the idea of not having all 28 disks (2 different servers, same specs) spun up the majority of the time, which is why I was looking at this in the first place.

  12. 5 hours ago, johnnie.black said:

    That would be better asked on the UD support thread, but no Unraid flash drive should ever appear as an unassigned device.

    Maybe, but that is ALSO NOT WHAT I POSTED ABOUT.  I posted asking why the installer is only partitioning half of my flash drive.  Is this the level of support I should expect if I decide to purchase a license?  

    Reformatting it might solve the fat_free_clusters error, but it won't solve the issue I posted about, nor would it explain why I see the same behavior on two different machines with two different flash drives.

  14. 1 hour ago, trurl said:

    If your boot flash is showing up in Unassigned Devices it has already disconnected. 

     

    Make sure you are booting from USB2 port. 

    It's been showing in Unassigned Devices since the initial install, on both servers.  The servers only have USB 2.0 ports (they're older Supermicro servers).

     

  15. I've been running the trial for about 2 weeks now on a brand new https://amzn.to/2Yz2Amc

     

    I'm getting flash write errors.  It's also showing as only 16GB in Unassigned Devices, but the fdisk -l output is below:

     

    Disk /dev/sda: 28.67 GiB, 30765219840 bytes, 60088320 sectors
    Disk model: Cruzer Fit      
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disklabel type: dos
    Disk identifier: 0x00000000
    
    Device     Boot Start      End  Sectors  Size Id Type
    /dev/sda1  *     2048 60088319 60086272 28.7G  c W95 FAT32 (LBA)

    I bought a second of the same flash drive and am seeing the same thing on a second server.
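    If I'm reading the fdisk output right, the partition itself spans the full ~28.7 GiB, so the 16GB figure may be coming from the FAT filesystem inside the partition (or just the UD display) rather than the partition table. A couple of quick checks, assuming the stick is /dev/sda and mounted at /boot as usual:

    blockdev --getsize64 /dev/sda    # raw size of the stick, in bytes
    blockdev --getsize64 /dev/sda1   # size of the partition, in bytes
    df -h /boot                      # size of the FAT filesystem actually created on it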

     

     
