Network interfaces keep going down

cinereus · September 29, 2022

I have two interfaces, eth0 and eth1.

eth0 goes directly to the router and provides an internet connection.

eth1 goes directly to my PC.

Last night I noticed that eth0 had dropped. I unplugged and replugged the ethernet cable and all was fine.

I had to do this one more time later in the evening.

When I woke up this morning I found that eth0 had dropped to 100 Mbps.

I unplugged and replugged and it reconnected instantly at 1000 Mbps.

A couple of hours later I noticed that eth1 wasn't working. I unplugged and replugged which did nothing.

Now eth1 is flickering between connecting and disconnecting. It stays connected long enough to show "unidentified network" but doesn't have time to obtain an IP address before it says "interface down" and my PC says "not connected".

As I was trying to diagnose this it seems eth0 has gone down completely and won't connect at all. This means I can't even get diagnostics.

On the outside it seems that the onboard ethernet is just dying! Is this possible? Could I see anything in diagnostics to show whether this is the case?

The hardware is a Supermicro SuperChassis CSE-826 with an X9DRH-7TF V1.02 motherboard.
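
On the question of what can be checked directly: Linux exposes per-interface link state under /sys/class/net. The following is a minimal sketch, assuming Python 3 is available on the server (the thread does not confirm this). The function name `read_link_state` is made up for illustration; the attribute files it reads (operstate, carrier, speed, carrier_changes) are standard sysfs entries, though some of them are unreadable while a link is down.

```python
from pathlib import Path


def read_link_state(iface, sysfs_root="/sys/class/net"):
    """Return operstate, carrier, speed and carrier_changes for one interface.

    Values are returned as strings exactly as the kernel reports them,
    or None when an attribute is missing or cannot be read.
    """
    base = Path(sysfs_root) / iface
    state = {}
    for name in ("operstate", "carrier", "speed", "carrier_changes"):
        try:
            state[name] = (base / name).read_text().strip()
        except OSError:
            # Reading can fail while the link or the interface is down.
            state[name] = None
    return state


if __name__ == "__main__":
    # eth0 and eth1 are the interface names used in this thread.
    for iface in ("eth0", "eth1"):
        print(iface, read_link_state(iface))
```

A carrier_changes value that keeps climbing between runs would point to a link that is repeatedly dropping, which is the symptom described above.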

cinereus · September 29, 2022

Managed to log in long enough to get diagnostics:

fs-diagnostics-20220929-1325.zip

cinereus · September 29, 2022

Clearer diagnostics after reboot:

And here's the system log:

Sep 29 14:29:34 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Up 100 Mbps, Flow Control: RX/TX
Sep 29 14:29:34 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Down
Sep 29 14:29:40 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Up 100 Mbps, Flow Control: RX/TX
Sep 29 14:29:40 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Down
Sep 29 14:29:46 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Up 100 Mbps, Flow Control: RX/TX
Sep 29 14:29:47 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Down
Sep 29 14:29:56 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Up 100 Mbps, Flow Control: RX/TX
Sep 29 14:29:56 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Down
Sep 29 14:30:20 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Up 100 Mbps, Flow Control: RX/TX
Sep 29 14:30:20 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Down
Sep 29 14:30:35 fs emhttpd: cmd: /usr/local/emhttp/plugins/user.scripts/showLog.php dropbox and drive sync
Sep 29 14:30:44 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Up 100 Mbps, Flow Control: RX/TX
Sep 29 14:30:45 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Down
Sep 29 14:30:57 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Up 100 Mbps, Flow Control: RX/TX
Sep 29 14:30:57 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Down
Sep 29 14:31:14 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Up 100 Mbps, Flow Control: RX/TX
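
To put a number on how often the link is flapping, a log like the one above can be summarized with a short script. This is only a sketch, assuming Python 3 and the ixgbe message format shown in this thread. The name `summarize_link_events` is hypothetical, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches the ixgbe "NIC Link is Up/Down" lines as they appear in this thread.
LINK_RE = re.compile(
    r"kernel: ixgbe \S+ (?P<iface>\w+): NIC Link is (?P<state>Up|Down)"
    r"(?: (?P<speed>\d+ [MG]bps))?"
)


def summarize_link_events(lines):
    """Count Up/Down events per interface and the speeds reported on Up."""
    events = Counter()
    speeds = Counter()
    for line in lines:
        m = LINK_RE.search(line)
        if not m:
            continue  # unrelated syslog line
        events[(m["iface"], m["state"])] += 1
        if m["speed"]:
            speeds[(m["iface"], m["speed"])] += 1
    return events, speeds


sample = [
    "Sep 29 14:29:34 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Up 100 Mbps, Flow Control: RX/TX",
    "Sep 29 14:29:34 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Down",
    "Sep 29 14:29:40 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Up 100 Mbps, Flow Control: RX/TX",
    "Sep 29 14:29:40 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Down",
]
events, speeds = summarize_link_events(sample)
print(dict(events))
print(dict(speeds))
```

Feeding it the full syslog instead of `sample` would show whether the drops cluster on one interface and whether the link ever comes up at anything other than 100 Mbps.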

fs-diagnostics-20220929-1434.zip

JorgeB · September 29, 2022

For eth0, this looks more like a connection problem. Try replacing the cable or using a different switch/router.

eth1 crashed, only a reboot will fix that:

Sep 29 05:36:42 fs kernel: DMAR: DRHD: handling fault status reg 2
Sep 29 05:36:42 fs kernel: DMAR: [DMA Read] Request device [05:00.1] PASID ffffffff fault addr f4a19000 [fault reason 06] PTE Read access is not set
Sep 29 05:36:42 fs kernel: DMAR: [DMA Read] Request device [05:00.1] PASID ffffffff fault addr f9d54000 [fault reason 06] PTE Read access is not set

Unrelated, but the server is detecting RAM errors; you should fix that.

cinereus · September 30, 2022

12 hours ago, JorgeB said:
For eth0, this looks more like a connection problem. Try replacing the cable or using a different switch/router.

eth1 crashed, only a reboot will fix that:
Sep 29 05:36:42 fs kernel: DMAR: DRHD: handling fault status reg 2
Sep 29 05:36:42 fs kernel: DMAR: [DMA Read] Request device [05:00.1] PASID ffffffff fault addr f4a19000 [fault reason 06] PTE Read access is not set
Sep 29 05:36:42 fs kernel: DMAR: [DMA Read] Request device [05:00.1] PASID ffffffff fault addr f9d54000 [fault reason 06] PTE Read access is not set
Unrelated, but the server is detecting RAM errors; you should fix that.

Thanks. I have now rebooted and will see how it goes. Where do you see the eth1 crash?

JorgeB · September 30, 2022

It starts with the log snippet posted above, device 05:00.1 is eth1.
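
The link between the DMAR fault and eth1 comes from the PCI address: the ixgbe lines print the address next to the interface name, and the DMAR lines print the same address in brackets. The sketch below makes that lookup explicit, assuming Python 3 and the message formats seen in this thread. The name `faults_by_interface` is invented for illustration.

```python
import re

# DMAR fault lines name the faulting PCI device as [bus:device.function].
DMAR_RE = re.compile(
    r"DMAR: \[DMA (?:Read|Write)\] Request device \[(?P<bdf>[0-9a-f:.]+)\]"
)
# ixgbe lines print "ixgbe <domain>:<bus:device.function> <iface>:".
IXGBE_RE = re.compile(
    r"ixgbe (?:[0-9a-f]{4}:)?(?P<bdf>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) (?P<iface>\w+):"
)


def faults_by_interface(lines):
    """Map each PCI address seen in a DMAR fault to its interface name."""
    bdf_to_iface = {}
    fault_bdfs = []
    for line in lines:
        m = IXGBE_RE.search(line)
        if m:
            bdf_to_iface[m["bdf"]] = m["iface"]
        m = DMAR_RE.search(line)
        if m:
            fault_bdfs.append(m["bdf"])
    # Addresses with no matching ixgbe line are reported as "unknown".
    return {bdf: bdf_to_iface.get(bdf, "unknown") for bdf in fault_bdfs}


sample = [
    "Sep 29 05:36:42 fs kernel: DMAR: [DMA Read] Request device [05:00.1] PASID ffffffff fault addr f4a19000 [fault reason 06] PTE Read access is not set",
    "Oct 3 17:56:18 fs kernel: ixgbe 0000:05:00.1 eth1: NIC Link is Up 1 Gbps, Flow Control: RX/TX",
]
print(faults_by_interface(sample))
```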

cinereus · October 3, 2022

eth1 is going up and down constantly again even after a reboot. Any idea what the issue is that's causing it to crash every other day?

fs-diagnostics-20221003-1735.zip

JorgeB · October 3, 2022

It crashed again:

Oct  1 13:44:26 fs kernel: DMAR: DRHD: handling fault status reg 2

Try updating to v6.10.3 or v6.11.0 since from v6.10.3 DMA remapping is no longer used, and that appears to be what's causing the problem.

cinereus · October 3, 2022

8 minutes ago, JorgeB said:
It crashed again:
Oct  1 13:44:26 fs kernel: DMAR: DRHD: handling fault status reg 2
Try updating to v6.10.3 or v6.11.0 since from v6.10.3 DMA remapping is no longer used, and that appears to be what's causing the problem.

I don't get why it would have worked for years with no issue before though?

The syslog keeps repeating this. What does it mean?

Oct 3 17:56:14 fs kernel: ixgbe 0000:05:00.1 eth1: Detected Tx Unit Hang
Oct 3 17:56:14 fs kernel: Tx Queue <20>
Oct 3 17:56:14 fs kernel: TDH, TDT <0>, <2>
Oct 3 17:56:14 fs kernel: next_to_use <2>
Oct 3 17:56:14 fs kernel: next_to_clean <0>
Oct 3 17:56:14 fs kernel: tx_buffer_info[next_to_clean]
Oct 3 17:56:14 fs kernel: time_stamp <115869ae9>
Oct 3 17:56:14 fs kernel: jiffies <11586a9c0>
Oct 3 17:56:14 fs kernel: ixgbe 0000:05:00.1 eth1: tx hang 17692 detected on queue 20, resetting adapter
Oct 3 17:56:14 fs kernel: ixgbe 0000:05:00.1 eth1: initiating reset due to tx timeout
Oct 3 17:56:14 fs kernel: ixgbe 0000:05:00.1 eth1: Reset adapter
Oct 3 17:56:14 fs kernel: ixgbe 0000:05:00.1 eth1: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Oct 3 17:56:14 fs kernel: ixgbe 0000:05:00.1 eth1: TXDCTL.ENABLE for one or more queues not cleared within the polling period
Oct 3 17:56:14 fs kernel: ixgbe 0000:05:00.1: master disable timed out
Oct 3 17:56:18 fs kernel: ixgbe 0000:05:00.1 eth1: NIC Link is Up 1 Gbps, Flow Control: RX/TX
Oct 3 17:56:24 fs kernel: ixgbe 0000:05:00.1 eth1: Detected Tx Unit Hang
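
One way to read the hang report: the time_stamp and jiffies fields are hexadecimal tick counters, so their difference shows roughly how long the oldest unfinished transmit had been waiting when the driver gave up. The sketch below does that subtraction, assuming Python 3. The name `stalled_jiffies` is made up, and `ASSUMED_HZ` is a guess at the kernel tick rate, which the thread does not state.

```python
import re

# The driver prints both counters in hexadecimal inside angle brackets.
FIELD_RE = re.compile(r"kernel: \s*(time_stamp|jiffies)\s+<([0-9a-f]+)>")


def stalled_jiffies(lines):
    """Return jiffies minus time_stamp from one Tx hang report, or None."""
    vals = {}
    for line in lines:
        m = FIELD_RE.search(line)
        if m:
            vals[m.group(1)] = int(m.group(2), 16)
    if "time_stamp" in vals and "jiffies" in vals:
        return vals["jiffies"] - vals["time_stamp"]
    return None


sample = [
    "Oct 3 17:56:14 fs kernel: time_stamp <115869ae9>",
    "Oct 3 17:56:14 fs kernel: jiffies <11586a9c0>",
]
delta = stalled_jiffies(sample)
ASSUMED_HZ = 1000  # assumption: ticks per second; not given anywhere in the thread
print(delta, "jiffies, about", delta / ASSUMED_HZ, "s if HZ is", ASSUMED_HZ)
```

The TDH/TDT pair in the same report (0 and 2) also differs, which suggests descriptors were queued but never processed by the hardware.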

JorgeB · October 3, 2022

3 minutes ago, cinereus said:

I don't get why it would have worked for years with no issue before though?

The NICs might be going bad; if one goes, it's expected that the other goes at the same time. Still, it won't hurt to upgrade to see if there's any difference, and you should upgrade anyway since v6.9.32 is quite old now.

4 minutes ago, cinereus said:

The syslog keeps repeating this. What does it mean?

That's because of the earlier crash.

cinereus · October 3, 2022

7 minutes ago, JorgeB said:

The NICs might be going bad; if one goes, it's expected that the other goes at the same time. Still, it won't hurt to upgrade to see if there's any difference, and you should upgrade anyway since v6.9.32 is quite old now.

That's because of the earlier crash.

Thanks. If my NICs are "going bad" is there anything I can do? I think I'd need to replace the whole motherboard?!

JorgeB · October 3, 2022

You can install add-on NICs.

cinereus · October 4, 2022

8 hours ago, JorgeB said:
It crashed again:
Oct  1 13:44:26 fs kernel: DMAR: DRHD: handling fault status reg 2
Try updating to v6.10.3 or v6.11.0 since from v6.10.3 DMA remapping is no longer used, and that appears to be what's causing the problem.

After installing the update I'm getting this on eth0:

Oct  4 01:57:51 fs kernel: tun: Universal TUN/TAP device driver, 1.6
Oct  4 01:57:54 fs  ntpd[1467]: Listen normally on 4 eth0 192.168.0.250:123
Oct  4 01:57:54 fs  ntpd[1467]: Listen normally on 5 eth0 [fe80::ec4:7aff:fe59:76ee%5]:123
Oct  4 01:57:54 fs  ntpd[1467]: new interface(s) found: waking up resolver
Oct  4 02:04:12 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Down
Oct  4 02:04:13 fs  ntpd[1467]: Deleting interface #4 eth0, 192.168.0.250#123, interface stats: received=40, sent=40, dropped=0, active_time=379 secs
Oct  4 02:04:13 fs  ntpd[1467]: 216.239.35.0 local addr 192.168.0.250 -> <null>
Oct  4 02:04:13 fs  ntpd[1467]: 216.239.35.4 local addr 192.168.0.250 -> <null>
Oct  4 02:04:13 fs  ntpd[1467]: 216.239.35.8 local addr 192.168.0.250 -> <null>
Oct  4 02:04:13 fs  ntpd[1467]: 216.239.35.12 local addr 192.168.0.250 -> <null>
Oct  4 02:04:13 fs  ntpd[1467]: Deleting interface #5 eth0, fe80::ec4:7aff:fe59:76ee%5#123, interface stats: received=0, sent=0, dropped=0, active_time=379 secs
Oct  4 02:04:15 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Up 1 Gbps, Flow Control: RX/TX
Oct  4 02:04:16 fs  ntpd[1467]: Listen normally on 6 eth0 192.168.0.250:123
Oct  4 02:04:16 fs  ntpd[1467]: Listen normally on 7 eth0 [fe80::ec4:7aff:fe59:76ee%5]:123
Oct  4 02:04:16 fs  ntpd[1467]: new interface(s) found: waking up resolver
Oct  4 02:04:22 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Down
Oct  4 02:04:24 fs  ntpd[1467]: Deleting interface #6 eth0, 192.168.0.250#123, interface stats: received=4, sent=4, dropped=0, active_time=8 secs
Oct  4 02:04:24 fs  ntpd[1467]: 216.239.35.0 local addr 192.168.0.250 -> <null>
Oct  4 02:04:24 fs  ntpd[1467]: 216.239.35.4 local addr 192.168.0.250 -> <null>
Oct  4 02:04:24 fs  ntpd[1467]: 216.239.35.8 local addr 192.168.0.250 -> <null>
Oct  4 02:04:24 fs  ntpd[1467]: 216.239.35.12 local addr 192.168.0.250 -> <null>
Oct  4 02:04:24 fs  ntpd[1467]: Deleting interface #7 eth0, fe80::ec4:7aff:fe59:76ee%5#123, interface stats: received=0, sent=0, dropped=0, active_time=8 secs
Oct  4 02:04:25 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Up 1 Gbps, Flow Control: RX/TX
Oct  4 02:04:27 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Down
Oct  4 02:04:35 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Up 100 Mbps, Flow Control: RX/TX
Oct  4 02:04:36 fs  ntpd[1467]: Listen normally on 8 eth0 192.168.0.250:123
Oct  4 02:04:36 fs  ntpd[1467]: Listen normally on 9 eth0 [fe80::ec4:7aff:fe59:76ee%5]:123
Oct  4 02:04:36 fs  ntpd[1467]: new interface(s) found: waking up resolver
Oct  4 02:05:30 fs  vnstatd[6518]: Detected bandwidth limit for "eth0" changed from 1000 Mbit to 100 Mbit.
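
The drop from 1 Gbps to 100 Mbps in the log above can also be picked out automatically. This is a rough sketch, assuming Python 3 and the ixgbe "NIC Link is Up" format shown here. The name `find_downshifts` is hypothetical, and the two sample lines are taken from the log above.

```python
import re

UP_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ [\d:]{8}) .*ixgbe \S+ (?P<iface>\w+): "
    r"NIC Link is Up (?P<num>\d+) (?P<unit>[MG])bps"
)


def find_downshifts(lines):
    """Return (timestamp, iface, old_mbps, new_mbps) for each speed drop."""
    last = {}   # last negotiated speed per interface, in Mbps
    drops = []
    for line in lines:
        m = UP_RE.search(line.strip())
        if not m:
            continue
        mbps = int(m["num"]) * (1000 if m["unit"] == "G" else 1)
        prev = last.get(m["iface"])
        if prev is not None and mbps < prev:
            drops.append((m["ts"], m["iface"], prev, mbps))
        last[m["iface"]] = mbps
    return drops


sample = [
    "Oct  4 02:04:25 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Up 1 Gbps, Flow Control: RX/TX",
    "Oct  4 02:04:35 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Up 100 Mbps, Flow Control: RX/TX",
]
for drop in find_downshifts(sample):
    print(drop)
```

A link that renegotiates at a lower speed after a series of drops is often a sign of a marginal cable, port, or connector, although the log alone cannot say which end is at fault.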

JorgeB · October 4, 2022

Looks like the NICs really have a problem, assuming you replaced/swapped cables before.

cinereus · October 7, 2022

On 10/4/2022 at 8:19 AM, JorgeB said:

Looks like the NICs really have a problem, assuming you replaced/swapped cables before.

eth0 has been stuck at 100 Mbps since my reboot. Just bought a new cable for eth0 and now get this:

Oct  7 13:55:33 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Down
Oct  7 13:55:35 fs  ntpd[1467]: Deleting interface #36 eth0, 192.168.0.250#123, interface stats: received=696, sent=696, dropped=0, active_time=176867 secs
Oct  7 13:55:35 fs  ntpd[1467]: 216.239.35.0 local addr 192.168.0.250 -> <null>
Oct  7 13:55:35 fs  ntpd[1467]: 216.239.35.4 local addr 192.168.0.250 -> <null>
Oct  7 13:55:35 fs  ntpd[1467]: 216.239.35.8 local addr 192.168.0.250 -> <null>
Oct  7 13:55:35 fs  ntpd[1467]: 216.239.35.12 local addr 192.168.0.250 -> <null>
Oct  7 13:55:35 fs  ntpd[1467]: Deleting interface #37 eth0, fe80::ec4:7aff:fe59:76ee%5#123, interface stats: received=0, sent=0, dropped=0, active_time=176867 secs
Oct  7 14:07:37 fs  ntpd[1467]: Listen normally on 38 eth0 192.168.0.250:123
Oct  7 14:07:37 fs  ntpd[1467]: Listen normally on 39 eth0 [fe80::ec4:7aff:fe59:76ee%5]:123
Oct  7 14:07:37 fs  ntpd[1467]: new interface(s) found: waking up resolver
Oct  7 14:07:40 fs kernel: ixgbe 0000:05:00.0 eth0: NIC Link is Down
Oct  7 14:07:42 fs  ntpd[1467]: Deleting interface #38 eth0, 192.168.0.250#123, interface stats: received=2, sent=8, dropped=0, active_time=5 secs
Oct  7 14:07:42 fs  ntpd[1467]: 216.239.35.0 local addr 192.168.0.250 -> <null>
Oct  7 14:07:42 fs  ntpd[1467]: 216.239.35.4 local addr 192.168.0.250 -> <null>
Oct  7 14:07:42 fs  ntpd[1467]: 216.239.35.8 local addr 192.168.0.250 -> <null>
Oct  7 14:07:42 fs  ntpd[1467]: 216.239.35.12 local addr 192.168.0.250 -> <null>
Oct  7 14:07:42 fs  ntpd[1467]: Deleting interface #39 eth0, fe80::ec4:7aff:fe59:76ee%5#123, interface stats: received=0, sent=0, dropped=0, active_time=5 secs
Oct  7 14:12:39 fs  ntpd[1467]: no peer for too long, server running free now

The dashboard says "interface down" with the brand new cable. Swapping back to the old cable also gives "interface down", not even the 100 Mbps I had before.

What gives?!

cinereus · October 7, 2022

Here are diagnostics after a fresh reboot where eth0 is still not working.

fs-diagnostics-20221007-1425.zip

JorgeB · October 7, 2022

Didn't we already believe the problem was the NICs?

cinereus · October 7, 2022

16 minutes ago, JorgeB said:

Didn't we already believe the problem was the NICs?

It's hard to tell. They have been working solidly for the last couple of days. I don't understand these errors in the syslog.

cinereus · October 7, 2022

Curiouser and curiouser.

eth0 refused to connect to my router after trying multiple cables. However, after moving the router connection to eth1, eth0 now works fine with other connections. After several hours of testing, the cable that was previously limited to 100 Mbps is now very happy at 1 Gbps.

Not sure whether diagnostics say anything sensible about this?

fs-diagnostics-20221007-1603.zip

Edited October 7, 2022 by cinereus
