e1000e Detected Hardware Unit Hang


sota


Apr 21 23:14:44 Tigger kernel: e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
Apr 21 23:14:44 Tigger kernel:  TDH                  <2d>
Apr 21 23:14:44 Tigger kernel:  TDT                  <44>
Apr 21 23:14:44 Tigger kernel:  next_to_use          <44>
Apr 21 23:14:44 Tigger kernel:  next_to_clean        <2c>
Apr 21 23:14:44 Tigger kernel: buffer_info[next_to_clean]:
Apr 21 23:14:44 Tigger kernel:  time_stamp           <13a8c3651>
Apr 21 23:14:44 Tigger kernel:  next_to_watch        <2d>
Apr 21 23:14:44 Tigger kernel:  jiffies              <13a8c3f00>
Apr 21 23:14:44 Tigger kernel:  next_to_watch.status <0>
Apr 21 23:14:44 Tigger kernel: MAC Status             <80083>
Apr 21 23:14:44 Tigger kernel: PHY Status             <796d>
Apr 21 23:14:44 Tigger kernel: PHY 1000BASE-T Status  <3800>
Apr 21 23:14:44 Tigger kernel: PHY Extended Status    <3000>
Apr 21 23:14:44 Tigger kernel: PCI Status             <10>
Apr 21 23:14:46 Tigger kernel: e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
Apr 21 23:14:46 Tigger kernel:  TDH                  <2d>
Apr 21 23:14:46 Tigger kernel:  TDT                  <44>
Apr 21 23:14:46 Tigger kernel:  next_to_use          <44>
Apr 21 23:14:46 Tigger kernel:  next_to_clean        <2c>
Apr 21 23:14:46 Tigger kernel: buffer_info[next_to_clean]:
Apr 21 23:14:46 Tigger kernel:  time_stamp           <13a8c3651>
Apr 21 23:14:46 Tigger kernel:  next_to_watch        <2d>
Apr 21 23:14:46 Tigger kernel:  jiffies              <13a8c46c0>
Apr 21 23:14:46 Tigger kernel:  next_to_watch.status <0>
Apr 21 23:14:46 Tigger kernel: MAC Status             <80083>
Apr 21 23:14:46 Tigger kernel: PHY Status             <796d>
Apr 21 23:14:46 Tigger kernel: PHY 1000BASE-T Status  <3800>
Apr 21 23:14:46 Tigger kernel: PHY Extended Status    <3000>
Apr 21 23:14:46 Tigger kernel: PCI Status             <10>
Apr 21 23:14:48 Tigger kernel: e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
Apr 21 23:14:48 Tigger kernel:  TDH                  <2d>
Apr 21 23:14:48 Tigger kernel:  TDT                  <44>
Apr 21 23:14:48 Tigger kernel:  next_to_use          <44>
Apr 21 23:14:48 Tigger kernel:  next_to_clean        <2c>
Apr 21 23:14:48 Tigger kernel: buffer_info[next_to_clean]:
Apr 21 23:14:48 Tigger kernel:  time_stamp           <13a8c3651>
Apr 21 23:14:48 Tigger kernel:  next_to_watch        <2d>
Apr 21 23:14:48 Tigger kernel:  jiffies              <13a8c4ec0>
Apr 21 23:14:48 Tigger kernel:  next_to_watch.status <0>
Apr 21 23:14:48 Tigger kernel: MAC Status             <80083>
Apr 21 23:14:48 Tigger kernel: PHY Status             <796d>
Apr 21 23:14:48 Tigger kernel: PHY 1000BASE-T Status  <3800>
Apr 21 23:14:48 Tigger kernel: PHY Extended Status    <3000>
Apr 21 23:14:48 Tigger kernel: PCI Status             <10>
Apr 21 23:14:49 Tigger kernel: e1000e 0000:00:19.0 eth0: Reset adapter unexpectedly
Apr 21 23:14:50 Tigger kernel: bond0: (slave eth0): link status definitely down, disabling slave
Apr 21 23:14:50 Tigger kernel: device eth0 left promiscuous mode
Apr 21 23:14:50 Tigger kernel: bond0: now running without any active interface!
Apr 21 23:14:50 Tigger kernel: br0: port 1(bond0) entered disabled state
Apr 21 23:14:53 Tigger kernel: e1000e 0000:00:19.0 eth0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Apr 21 23:14:53 Tigger kernel: bond0: (slave eth0): link status definitely up, 1000 Mbps full duplex
Apr 21 23:14:53 Tigger kernel: bond0: (slave eth0): making interface the new active one
Apr 21 23:14:53 Tigger kernel: device eth0 entered promiscuous mode
Apr 21 23:14:53 Tigger kernel: bond0: active interface up!
Apr 21 23:14:53 Tigger kernel: br0: port 1(bond0) entered blocking state
Apr 21 23:14:53 Tigger kernel: br0: port 1(bond0) entered forwarding state

 

Started having this problem recently. I'm not physically at the machine right now, but does this look like a faulty card, a bad port on the switch, or a bad cable? Or is it something else, software-related?

 

Machine is basically a glorified file cabinet, running a single Windows 7 x64 VM for SageTV (I haven't gotten the Docker container to work to my liking, and I'm under a time crunch with a failing physical SageTV server and a 4/25 "hard" cutover date for Cablevision switching to encrypted channels).

Everything seemed to be working fine until a couple of days ago, when I kept (and keep) getting disconnected while remoted into the VM via AnyDesk. I finally tried to watch the log, only to discover /var/log was full. I caught the above after the most recent disconnect.

 

Diagnostics and syslog.1 are attached.

syslog.zip tigger-diagnostics-20220421-2322.zip
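
In case it's useful to anyone hitting the same thing, this is roughly how I was checking (a rough sketch; the syslog path may differ on other setups):

 df -h /var/log
 grep -i "Detected Hardware Unit Hang" /var/log/syslog | tail -n 20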

  • 1 month later...
  • 1 month later...
  • 2 weeks later...
On 8/10/2022 at 2:57 AM, smeehrrr said:

I started hitting this today after upgrading to 6.10.3.  Attaching diagnostics just in case that's helpful.

nova-diagnostics-20220809-1753.zip (193.51 kB)

I have exactly the same issue... it worked great for almost a year on Unraid 6.9.2. Then I upgraded to Unraid 6.10.3, and now my server's four Ethernet ports (Intel Pro/1000 NICs) suddenly go down one after another until bond0 is completely gone and SSH connections drop. Not even the local terminal responds properly afterwards (I have a keyboard and monitor attached because of this error). Typing reboot -f doesn't work either... I have to force a shutdown by long-pressing the power button.


For what it's worth, running

 ethtool -K eth0 tso off

made the problem go away for me. I don't know what I'm giving up in terms of performance, but it was a good enough temporary workaround to get all my files copied over.

I did not have any of your additional issues with the terminal, etc.; just a momentary hitch that was long enough to interrupt file copies but not long enough to make SSH drop.
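
If you want to confirm what the command changes, checking the offload flags before and after should show it (a rough sketch; exact feature names can vary by driver):

 ethtool -k eth0 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload'
 ethtool -K eth0 tso off
 ethtool -k eth0 | grep tcp-segmentation-offload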

 


Just a quick update from my side. After many more hours of trial and error, I had no ideas left about what could possibly be causing this issue. I tried everything from rolling back to doing a fresh installation of Unraid (I even bought a new flash stick in case mine was faulty), and I ran a memtest for about three and a half days, with every pass completing without a single error.
After trying almost everything I felt really lost, because I had no further ideas about what could possibly cause this. Then I googled a bit for similar issues, not specifically for Unraid but for computer systems in general (I searched for the symptoms). At some point I found several interesting articles about "half faulty" PSUs that still power the computer/server but no longer deliver a constant or correct voltage. I tested this theory by letting the server (Unraid 6.10.3) run in safe mode without any Docker containers or other services, just plain Unraid. The server then stayed up for 3 days without any crash. As soon as I rebooted back to normal mode and started some load-intensive Docker containers, the server crashed again. Then I borrowed an unused power supply from a friend and temporarily installed it in the server... and voila, it worked like it had all the months before. So I bought a new power supply, installed it, and the server has now been up and running again for 4 days. On Monday and yesterday I was intensively stress testing it: no more crashes, everything is back to normal. Since the server was still powering on and booting normally, the PSU was actually the last thing I thought of. Anyway, a really strange issue that almost drove me to the madhouse, but I could finally solve it.

PS: Sorry for my English; it is not my mother tongue.

  • 1 year later...
On 8/18/2022 at 5:15 PM, smeehrrr said:

For what it's worth, running

 ethtool -K eth0 tso off

made the problem go away for me. I don't know what I'm giving up in terms of performance, but it was a good enough temporary workaround to get all my files copied over.

I did not have any of your additional issues with the terminal, etc.; just a momentary hitch that was long enough to interrupt file copies but not long enough to make SSH drop.

 

Thank you immensely to @smeehrrr for sharing this crucial workaround. For several years I've been grappling with persistent network issues on my Unraid server, to the point of real frustration, and implementing the suggested command has finally brought relief. The problem began around 2018-2019; my system had been functioning smoothly before that. The resolution provided here has been a game changer, and I'm deeply grateful for the shared knowledge and support.

I'm curious about the underlying mechanics of this fix. Is there a more permanent solution that can be implemented? I'm keen to understand why this particular command was effective, especially considering the issue's persistence from 2018-2019 to the present day (Unraid Version: 6.12.6).

Here are some specifics of my setup for context:

Server Model: LENOVO ThinkServer TS440

BIOS Version: FBKTDIAUS (Dated: Thu 16 Sep 2021)

Processor: Intel® Xeon® CPU E3-1275L v3 @ 2.70GHz

Any additional insights or suggestions for a long-term resolution would be greatly appreciated!
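
In case it helps anyone landing here later: one way to keep the workaround across reboots would be to run the command at startup. A minimal sketch, assuming Unraid's standard /boot/config/go startup script and that the affected NIC is eth0 (adjust to your interface name):

 #!/bin/bash
 # Start the Management Utility (default contents of the go script)
 /usr/local/sbin/emhttp &
 # Workaround for e1000e "Detected Hardware Unit Hang": disable TCP segmentation offload
 ethtool -K eth0 tso off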

