• eth0 hardware hang


    MorgothCreator
    • Solved

    This issue does not appear if the network is not stressed enough, but now I have 11 VMs running on this machine, all with a fairly large amount of traffic on the network, and every few hours eth0 hangs indefinitely.

     

    Sometimes this issue can be fixed with a network restart, but sometimes only with a full reboot.

     

    The kernel log is:

     

    Mar 31 06:24:59 Tower kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
    Mar 31 06:24:59 Tower kernel:  TDH                  <22>
    Mar 31 06:24:59 Tower kernel:  TDT                  <78>
    Mar 31 06:24:59 Tower kernel:  next_to_use          <78>
    Mar 31 06:24:59 Tower kernel:  next_to_clean        <22>
    Mar 31 06:24:59 Tower kernel: buffer_info[next_to_clean]:
    Mar 31 06:24:59 Tower kernel:  time_stamp           <101b4a085>
    Mar 31 06:24:59 Tower kernel:  next_to_watch        <23>
    Mar 31 06:24:59 Tower kernel:  jiffies              <101b4a800>
    Mar 31 06:24:59 Tower kernel:  next_to_watch.status <0>
    Mar 31 06:24:59 Tower kernel: MAC Status             <40080083>
    Mar 31 06:24:59 Tower kernel: PHY Status             <796d>
    Mar 31 06:24:59 Tower kernel: PHY 1000BASE-T Status  <3800>
    Mar 31 06:24:59 Tower kernel: PHY Extended Status    <3000>
    Mar 31 06:24:59 Tower kernel: PCI Status             <10>
    Mar 31 06:25:01 Tower root: Checking network connection at 192.168.128.1 ...
    Mar 31 06:25:01 Tower kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
    Mar 31 06:25:01 Tower kernel:  TDH                  <22>
    Mar 31 06:25:01 Tower kernel:  TDT                  <78>
    Mar 31 06:25:01 Tower kernel:  next_to_use          <78>
    Mar 31 06:25:01 Tower kernel:  next_to_clean        <22>
    Mar 31 06:25:01 Tower kernel: buffer_info[next_to_clean]:
    Mar 31 06:25:01 Tower kernel:  time_stamp           <101b4a085>
    Mar 31 06:25:01 Tower kernel:  next_to_watch        <23>
    Mar 31 06:25:01 Tower kernel:  jiffies              <101b4b000>
    Mar 31 06:25:01 Tower kernel:  next_to_watch.status <0>
    Mar 31 06:25:01 Tower kernel: MAC Status             <40080083>
    Mar 31 06:25:01 Tower kernel: PHY Status             <796d>
    Mar 31 06:25:01 Tower kernel: PHY 1000BASE-T Status  <3800>
    Mar 31 06:25:01 Tower kernel: PHY Extended Status    <3000>
    Mar 31 06:25:01 Tower kernel: PCI Status             <10>
    Mar 31 06:25:03 Tower kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
    Mar 31 06:25:03 Tower kernel:  TDH                  <22>
    Mar 31 06:25:03 Tower kernel:  TDT                  <78>
    Mar 31 06:25:03 Tower kernel:  next_to_use          <78>
    Mar 31 06:25:03 Tower kernel:  next_to_clean        <22>
    Mar 31 06:25:03 Tower kernel: buffer_info[next_to_clean]:
    Mar 31 06:25:03 Tower kernel:  time_stamp           <101b4a085>
    Mar 31 06:25:03 Tower kernel:  next_to_watch        <23>
    Mar 31 06:25:03 Tower kernel:  jiffies              <101b4b7c0>
    Mar 31 06:25:03 Tower kernel:  next_to_watch.status <0>
    Mar 31 06:25:03 Tower kernel: MAC Status             <40080083>
    Mar 31 06:25:03 Tower kernel: PHY Status             <796d>
    Mar 31 06:25:03 Tower kernel: PHY 1000BASE-T Status  <3800>
    Mar 31 06:25:03 Tower kernel: PHY Extended Status    <3000>
    Mar 31 06:25:03 Tower kernel: PCI Status             <10>
    Mar 31 06:25:05 Tower kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
    Mar 31 06:25:05 Tower kernel:  TDH                  <22>
    Mar 31 06:25:05 Tower kernel:  TDT                  <78>
    Mar 31 06:25:05 Tower kernel:  next_to_use          <78>
    Mar 31 06:25:05 Tower kernel:  next_to_clean        <22>
    Mar 31 06:25:05 Tower kernel: buffer_info[next_to_clean]:
    Mar 31 06:25:05 Tower kernel:  time_stamp           <101b4a085>
    Mar 31 06:25:05 Tower kernel:  next_to_watch        <23>
    Mar 31 06:25:05 Tower kernel:  jiffies              <101b4bf80>
    Mar 31 06:25:05 Tower kernel:  next_to_watch.status <0>
    Mar 31 06:25:05 Tower kernel: MAC Status             <40080083>
    Mar 31 06:25:05 Tower kernel: PHY Status             <796d>
    Mar 31 06:25:05 Tower kernel: PHY 1000BASE-T Status  <3800>
    Mar 31 06:25:05 Tower kernel: PHY Extended Status    <3000>
    Mar 31 06:25:05 Tower kernel: PCI Status             <10>
    Mar 31 06:25:05 Tower kernel: ------------[ cut here ]------------
    Mar 31 06:25:05 Tower kernel: NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
    Mar 31 06:25:05 Tower kernel: WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x150/0x1a8
    Mar 31 06:25:05 Tower kernel: Modules linked in: xt_CHECKSUM iptable_mangle ipt_REJECT ebtable_filter ebtables ip6table_filter ip6_tables vhost_net tun vhost tap ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_filter ip_tables nf_nat nfsd lockd grace sunrpc md_mod bonding x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc wmi_bmof mxm_wmi aesni_intel aes_x86_64 crypto_simd cryptd glue_helper intel_cstate intel_uncore intel_rapl_perf e1000e i2c_i801 ahci i2c_core libahci wmi video thermal fan backlight acpi_pad button pcc_cpufreq
    Mar 31 06:25:05 Tower kernel: CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.18.20-unRAID #1
    Mar 31 06:25:05 Tower kernel: Hardware name: System manufacturer System Product Name/STRIX B250F GAMING, BIOS 1207 09/04/2018
    Mar 31 06:25:05 Tower kernel: RIP: 0010:dev_watchdog+0x150/0x1a8
    Mar 31 06:25:05 Tower kernel: Code: 15 fd 97 00 00 75 36 4c 89 ef c6 05 09 fd 97 00 01 e8 93 c5 fd ff 89 e9 4c 89 ee 48 c7 c7 ee 0f d9 81 48 89 c2 e8 53 c0 b2 ff <0f> 0b eb 0f ff c5 48 81 c2 40 01 00 00 39 cd 75 98 eb 13 48 8b 83
    Mar 31 06:25:05 Tower kernel: RSP: 0018:ffff880636d83ea0 EFLAGS: 00010286
    Mar 31 06:25:05 Tower kernel: RAX: 0000000000000000 RBX: ffff8806174bc3b0 RCX: 0000000000000007
    Mar 31 06:25:05 Tower kernel: RDX: 0000000000000000 RSI: ffff880636d96470 RDI: ffff880636d96470
    Mar 31 06:25:05 Tower kernel: RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000020400
    Mar 31 06:25:05 Tower kernel: R10: 00000000000003da R11: 0000000000012ce0 R12: ffff8806174bc39c
    Mar 31 06:25:05 Tower kernel: R13: ffff8806174bc000 R14: ffff880616616280 R15: 0000000000000003
    Mar 31 06:25:05 Tower kernel: FS:  0000000000000000(0000) GS:ffff880636d80000(0000) knlGS:0000000000000000
    Mar 31 06:25:05 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Mar 31 06:25:05 Tower kernel: CR2: 000015418ae5bcb0 CR3: 0000000001e0a004 CR4: 00000000003626e0
    Mar 31 06:25:05 Tower kernel: Call Trace:
    Mar 31 06:25:05 Tower kernel: <IRQ>
    Mar 31 06:25:05 Tower kernel: call_timer_fn+0x18/0x7b
    Mar 31 06:25:05 Tower kernel: ? qdisc_reset+0xc0/0xc0
    Mar 31 06:25:05 Tower kernel: expire_timers+0x7f/0x8e
    Mar 31 06:25:05 Tower kernel: run_timer_softirq+0x72/0x120
    Mar 31 06:25:05 Tower kernel: ? enqueue_hrtimer.isra.3+0x23/0x27
    Mar 31 06:25:05 Tower kernel: ? __hrtimer_run_queues+0xd7/0x105
    Mar 31 06:25:05 Tower kernel: ? recalibrate_cpu_khz+0x1/0x1
    Mar 31 06:25:05 Tower kernel: ? ktime_get+0x3a/0x8d
    Mar 31 06:25:05 Tower kernel: __do_softirq+0xce/0x1e2
    Mar 31 06:25:05 Tower kernel: irq_exit+0x5e/0x9d
    Mar 31 06:25:05 Tower kernel: smp_apic_timer_interrupt+0x7e/0x91
    Mar 31 06:25:05 Tower kernel: apic_timer_interrupt+0xf/0x20
    Mar 31 06:25:05 Tower kernel: </IRQ>
    Mar 31 06:25:05 Tower kernel: RIP: 0010:cpuidle_enter_state+0xe8/0x141
    Mar 31 06:25:05 Tower kernel: Code: ff 45 84 ff 74 1d 9c 58 0f 1f 44 00 00 0f ba e0 09 73 09 0f 0b fa 66 0f 1f 44 00 00 31 ff e8 73 af be ff fb 66 0f 1f 44 00 00 <48> 2b 1c 24 b8 ff ff ff 7f 48 b9 ff ff ff ff f3 01 00 00 48 39 cb
    Mar 31 06:25:05 Tower kernel: RSP: 0018:ffffc900031e3ea0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
    Mar 31 06:25:05 Tower kernel: RAX: ffff880636da0c00 RBX: 00001a4e35e90c1c RCX: 000000000000001f
    Mar 31 06:25:05 Tower kernel: RDX: 00001a4e35e90c1c RSI: 0000000020b84733 RDI: 0000000000000000
    Mar 31 06:25:05 Tower kernel: RBP: ffff880636da8d00 R08: 0000000000000002 R09: 0000000000020480
    Mar 31 06:25:05 Tower kernel: R10: 00000000009c38a0 R11: 0000671842b060fd R12: 0000000000000004
    Mar 31 06:25:05 Tower kernel: R13: 0000000000000004 R14: ffffffff81e589b8 R15: 0000000000000000
    Mar 31 06:25:05 Tower kernel: do_idle+0x192/0x20e
    Mar 31 06:25:05 Tower kernel: cpu_startup_entry+0x6a/0x6c
    Mar 31 06:25:05 Tower kernel: start_secondary+0x197/0x1b2
    Mar 31 06:25:05 Tower kernel: secondary_startup_64+0xa5/0xb0
    Mar 31 06:25:05 Tower kernel: ---[ end trace 4bc8f5a4b3412996 ]---
    Mar 31 06:25:05 Tower kernel: e1000e 0000:00:1f.6 eth0: Reset adapter unexpectedly
    Mar 31 06:25:05 Tower kernel: bond0: link status definitely down for interface eth0, disabling it
    Mar 31 06:25:05 Tower kernel: device eth0 left promiscuous mode
    Mar 31 06:25:05 Tower kernel: bond0: now running without any active interface!
    Mar 31 06:25:05 Tower kernel: br0: port 1(bond0) entered disabled state
    Mar 31 06:25:09 Tower kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
    Mar 31 06:25:09 Tower kernel: bond0: link status definitely up for interface eth0, 1000 Mbps full duplex
    Mar 31 06:25:09 Tower kernel: bond0: making interface eth0 the new active one
    Mar 31 06:25:09 Tower kernel: device eth0 entered promiscuous mode
    Mar 31 06:25:09 Tower kernel: bond0: first active interface up!
    Mar 31 06:25:09 Tower kernel: br0: port 1(bond0) entered blocking state
    Mar 31 06:25:09 Tower kernel: br0: port 1(bond0) entered forwarding state
    Mar 31 06:25:11 Tower root: wap_check: Network connection is down, restarting network ...

     

     

    The last line and "Tower root: Checking network connection at 192.168.128.1 ..." come from my script, which pings the router every two minutes; without this script to restart the machine, it remains down indefinitely.

     

    The script to check the network is:

     

    #!/bin/bash
    logger "Checking network connection at 192.168.128.1 ..."
    # One probe; "100%" in the ping output means total packet loss.
    if ping -c1 192.168.128.1 2>&1 | grep -q '100%'; then
            logger "wap_check: Network connection is down, restarting network ..."
            reboot
    fi
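    Since a network restart sometimes clears the hang (as noted above), a gentler variant could bounce the interface first and reboot only as a last resort. This is a hypothetical sketch, not the script actually used on the server: the interface name (eth0) and gateway (192.168.128.1) are taken from this report, and the `WAP_CHECK_RUN` guard variable is an addition so the functions can be tested without touching the network.

```shell
#!/bin/bash
# Hypothetical variant of the wap_check script above: bounce the NIC first
# and reboot only if that fails. Interface name (eth0) and gateway
# (192.168.128.1) are taken from this report; adjust for your setup.

GATEWAY=${GATEWAY:-192.168.128.1}
IFACE=${IFACE:-eth0}

# True when the supplied ping output indicates total packet loss.
probe_failed() {
    echo "$1" | grep -q '100% packet loss'
}

# Escalating recovery: link bounce first, reboot as the last resort.
recover() {
    logger "wap_check: network down, bouncing $IFACE ..."
    ip link set "$IFACE" down
    sleep 2
    ip link set "$IFACE" up
    sleep 10
    if probe_failed "$(ping -c 3 -W 2 "$GATEWAY" 2>&1)"; then
        logger "wap_check: bounce failed, rebooting ..."
        reboot
    fi
}

# Guarded so the functions can be sourced or tested without touching the
# network; set WAP_CHECK_RUN=1 in the cron entry on the real server.
if [ "${WAP_CHECK_RUN:-0}" = "1" ] && probe_failed "$(ping -c 3 -W 2 "$GATEWAY" 2>&1)"; then
    recover
fi
```

    Separately, for e1000e "Detected Hardware Unit Hang" reports like the log above, a commonly suggested mitigation (worth trying, though not guaranteed) is disabling segmentation offload on the NIC, e.g. `ethtool -K eth0 tso off gso off`, since the hang is often triggered by offloaded transmits under load.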

     

     

    System information:

     

    Model: Custom

    M/B: ASUSTeK COMPUTER INC. - STRIX B250F GAMING

    CPU: Intel® Core™ i3-7100 CPU @ 3.90GHz

    HVM: Enabled

    IOMMU: Enabled

    Cache: 128 kB, 512 kB, 3072 kB

    Memory: 24 GB (max. installable capacity 64 GB)

    Network: bond0: fault-tolerance (active-backup), mtu 1500
     eth0: 1000 Mb/s, full duplex, mtu 1500

    Kernel: Linux 4.18.20-unRAID x86_64

    OpenSSL: 1.1.1a

    Uptime: 0 days, 13:39:28

     




    User Feedback

    Recommended Comments

    Ehh, yes, it is a bit too much (it is a 4-thread CPU), but it does not eat too much CPU power, and each VM is configured with 512 MB of RAM and one virtual core.

    The problem is not the CPU; it is the network that is hanging, and the kernel does not know what to do with this hang, or something is wrong with the VirtualBox network driver.

    I watched the LEDs on the network card, and when the network is dead, all the status LEDs on the card go off for three seconds, then light up for 7-8 seconds and show network activity, and they keep doing this indefinitely; but the rest of the machine works fine, and the HDD LED shows normal activity.

    Tomorrow I will receive the new hardware for this machine, a Ryzen 7 1700 (8 cores, 16 threads), and I will see how the machine behaves.


    I guess you have the wrong expectations here. Even if you don't see full utilisation of all cores, the CPU has to queue up the operations of all the VMs, which causes delays that can show up as hiccups or lags. The dropped network you see can be one effect of that. Each packet sent or received across that NIC has to be processed by Unraid to transfer it to the correct VM/Docker. In a situation where the processing queue is already filled past a certain point, this can lead to a dropped packet and a re-request, depending on the protocol, or in the worst case a reset of the device. You will definitely see better performance with a newer, higher-core-count platform, that's true, but even on an 8-core CPU, 11 VMs all running at the same time generate a lot of I/O operations. Sharing cores between VMs and Dockers running at the same time can and will affect the performance of all components and devices used by those VMs/Dockers. Separate everything as much as you can to reduce side effects like stutter or lag ;)


    I killed three of the VMs, and now it is working OK without any hang. I see that there is a limit of a maximum of two virtual cores per processor thread; I saw this explicitly documented for Synology DSM, which uses the same VirtualBox and does not let you allocate more than double the number of processor threads. So the maximum viable allocation in my case is 8 virtual cores on a 2-core, 4-thread processor.

    About the wrong expectations: I wondered if I could push the machine to do this work until the new hardware arrives from the store.

    Ehh, my bad, I will change the status of this thread to "other" and "solved", and it will remain for others to see as documentation :D

    Edited by MorgothCreator

    Are you talking about Unraid, or are you using VirtualBox or Synology DSM? These are 3 fundamentally different software solutions.


    I am talking about unRaid.

    VirtualBox is used in unRaid and Synology DSM, so the limitations are the same.
    The only difference is that Synology DSM does not let you allocate more virtual cores than double the threads of the processor.


    Have you ever heard what nested virtualization is and what limitations it has? Virtualizing something inside an already virtualized environment is NOT a good idea. You will face a lot of performance issues. Why are you using an extra layer of virtualization if you don't need to? To fool your machine into thinking it has more power than it actually has? That's not how it works!


    All my VMs from the above discussion are on a bare-metal unRaid machine.

    I only compared unRaid OS with Synology DSM OS, which both use VirtualBox.

    Who would think to virtualize inside another virtualized environment :))))

    42 minutes ago, MorgothCreator said:

    VirtualBox is used in unRaid and Synology DSM, so the limitations are the same.

    To sum things up, VirtualBox is NOT used in Unraid. Unraid uses QEMU/KVM, which is a type 1 hypervisor, whereas VirtualBox is a type 2 hypervisor. Those are 2 completely different things. 😉


    Ahh, sorry, I am new to unRaid; I had not come across this part until now and assumed that it uses VirtualBox :D, my bad.

    But it seems that the limitations are the same :D


    Small advice: try to separate the VMs from each other. Try not to use the same cores on different VMs at the same time, or you will end up with the situation you saw before.

     

    Let's say you have 8 cores. Limit your total number of running VMs to a maximum of 7. Don't use core 0; it is always used by Unraid itself to manage all the background work. And try not to split a core and its hyperthread between 2 VMs: if you have a core+HT pair, give it to the same VM. Sure, it will work, but you will drastically drop your performance and end up with some weird errors.
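    On Unraid (QEMU/KVM via libvirt), the pinning described above is expressed in the VM's domain XML. A minimal sketch, assuming a 4-core/8-thread host where core 0 is left to Unraid and one VM gets core 1 plus its hyperthread; the sibling numbering (threads 1 and 5) is an assumption and varies by CPU, so check `lscpu -e` on the actual machine:

```xml
<domain type='kvm'>
  <!-- 2 vCPUs, each pinned to one thread of the same physical core -->
  <vcpu placement='static'>2</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='1'/>
    <vcpupin vcpu='1' cpuset='5'/>
  </cputune>
  ...
</domain>
```

    Giving each VM a disjoint core+HT pair this way is what prevents two VMs from contending for the same physical core.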




