Oddness with eth0 going away and coming back



My motherboard has an onboard NIC that uses the r8169 driver.

 

I get occasional problems with the network going away and coming back:

 

Jul  3 23:08:03 Tower kernel: r8169: eth0: link up
Jul  3 23:08:57 Tower kernel: r8169: eth0: link up
Jul  3 23:11:39 Tower kernel: r8169: eth0: link up
Jul  3 23:12:14 Tower in.telnetd[2006]: connect from 192.168.0.15 (192.168.0.15)
Jul  3 23:12:33 Tower login[2007]: ROOT LOGIN  on `pts/1' from `192.168.0.15'
Jul  3 23:21:34 Tower emhttp: shcmd (80): /usr/sbin/hdparm -y /dev/sdc >/dev/null
Jul  3 23:27:21 Tower kernel: r8169: eth0: link up
Jul  3 23:29:04 Tower in.telnetd[2026]: connect from 192.168.0.15 (192.168.0.15)
Jul  3 23:29:15 Tower login[2027]: ROOT LOGIN  on `pts/2' from `192.168.0.15'
Jul  3 23:31:34 Tower emhttp: shcmd (81): /usr/sbin/hdparm -y /dev/sdg >/dev/null
Jul  3 23:31:35 Tower emhttp: shcmd (82): /usr/sbin/hdparm -y /dev/sda >/dev/null
Jul  3 23:31:45 Tower emhttp: shcmd (83): /usr/sbin/hdparm -y /dev/sdd >/dev/null
Jul  3 23:31:46 Tower emhttp: shcmd (84): /usr/sbin/hdparm -y /dev/sde >/dev/null
Jul  3 23:39:48 Tower emhttp: shcmd (85): /usr/sbin/hdparm -y /dev/sdf >/dev/null
Jul  3 23:49:11 Tower emhttp: shcmd (86): /usr/sbin/hdparm -y /dev/sdg >/dev/null
Jul  3 23:53:02 Tower kernel: r8169: eth0: link up
Jul  3 23:53:38 Tower last message repeated 2 times
Jul  3 23:56:32 Tower last message repeated 2 times
Jul  3 23:58:26 Tower last message repeated 2 times
Jul  3 23:59:56 Tower kernel: r8169: eth0: link up
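
A rough way to quantify flapping like this is to count the "link up" lines per hour, including the ones that syslog compresses into "last message repeated N times". The following is only a minimal Python sketch, assuming the syslog line format shown above; the function name count_link_up is made up for illustration, and it assumes a "repeated" line refers to the message logged immediately before it.

```python
import re
from collections import Counter

# Matches e.g. "Jul  3 23:08:03 Tower kernel: r8169: eth0: link up"
LINK_UP = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}):\d{2}:\d{2} \S+ kernel: r8169: \w+: link up\s*$"
)
# Matches e.g. "Jul  3 23:53:38 Tower last message repeated 2 times"
REPEATED = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}):\d{2}:\d{2} \S+ last message repeated (\d+) times\s*$"
)


def count_link_up(lines):
    """Return a Counter mapping 'Mon D HH' to the number of link-up events."""
    counts = Counter()
    prev_was_link_up = False
    for line in lines:
        m = LINK_UP.match(line)
        if m:
            counts[" ".join(m.group(1).split())] += 1
            prev_was_link_up = True
            continue
        r = REPEATED.match(line)
        if r:
            # Only count repeats that directly follow a link-up message
            # (assumption: the repeat refers to the previous logged line).
            if prev_was_link_up:
                counts[" ".join(r.group(1).split())] += int(r.group(2))
            continue
        # Any other message breaks the "repeated" chain.
        prev_was_link_up = False
    return counts
```

To use it, pass the lines of a saved syslog, for example `count_link_up(open("syslog.txt"))`, where syslog.txt is a hypothetical local copy of the log.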

 

The switch is solid, the cable is good, and this server was rock solid in its life before unRAID.  Doing some googling, I did find an interesting post in a Linux kernel bug discussion:

 

Backported Realtek 8169/8168/8101 (aka r8169) driver from 2.6.30 for users of earlier Ubuntu releases (Jaunty/9.04, e.g.). The 2.6.30 version of the driver has big stability fixes for NICs with significant load, but a significant NIC API change in 2.6.29 prevents later code from compiling/running on earlier kernels. This package set fixes that. This PPA will be obsolete in Karmic (which, at this point, is looking like it'll be running 2.6.30 already), but in the meantime...

 

Maybe I'm barking up the wrong tree, but could there be a problem with the current version of the driver that's causing this? The current kernel reports itself as a 2.6.29 version.

 

Putting files on, I never had any trouble (probably because the throughput was always 10-14MB/s), whereas pulling files off gets around 60MB/s, at least until it starts having issues and eth0 goes away and comes back.

 

I've attached my dmesg and syslog.

Link to comment

Jul  3 22:21:33 Tower kernel: ------------[ cut here ]------------
Jul  3 22:21:33 Tower kernel: WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0xf8/0x178()
Jul  3 22:21:33 Tower kernel: Hardware name: EP45-DS4P
Jul  3 22:21:33 Tower kernel: NETDEV WATCHDOG: eth0 (r8169): transmit timed out
Jul  3 22:21:33 Tower kernel: Modules linked in: md_mod ahci sata_sil24 libata r8169
Jul  3 22:21:33 Tower kernel: Pid: 0, comm: swapper Not tainted 2.6.29.1-unRAID #2
Jul  3 22:21:33 Tower kernel: Call Trace:
Jul  3 22:21:33 Tower kernel:  [<c0120e07>] warn_slowpath+0x74/0x8a
Jul  3 22:21:33 Tower kernel:  [<c021edab>] ? cpumask_next_and+0x26/0x37
Jul  3 22:21:33 Tower kernel:  [<c011b88c>] ? find_busiest_group+0x208/0x686
Jul  3 22:21:33 Tower kernel:  [<c011a353>] ? enqueue_task_fair+0x96/0x9d
Jul  3 22:21:33 Tower kernel:  [<c022291b>] ? strlcpy+0x17/0x48
Jul  3 22:21:33 Tower kernel:  [<c02e9283>] dev_watchdog+0xf8/0x178
Jul  3 22:21:33 Tower kernel:  [<c0133b25>] ? sched_clock_cpu+0x147/0x154
Jul  3 22:21:33 Tower kernel:  [<c02e918b>] ? dev_watchdog+0x0/0x178
Jul  3 22:21:33 Tower kernel:  [<c01281af>] run_timer_softirq+0x105/0x158
Jul  3 22:21:33 Tower kernel:  [<c0124a48>] __do_softirq+0x84/0x121
Jul  3 22:21:33 Tower kernel:  [<c0124b1a>] do_softirq+0x35/0x3a
Jul  3 22:21:33 Tower kernel:  [<c0124d97>] irq_exit+0x38/0x3a
Jul  3 22:21:33 Tower kernel:  [<c0110df0>] smp_apic_timer_interrupt+0x74/0x82
Jul  3 22:21:33 Tower kernel:  [<c01034c0>] apic_timer_interrupt+0x28/0x30
Jul  3 22:21:33 Tower kernel:  [<c025a1a4>] ? acpi_idle_enter_bm+0x23d/0x2ab
Jul  3 22:21:33 Tower kernel:  [<c02c1fcd>] cpuidle_idle_call+0x60/0x97
Jul  3 22:21:33 Tower kernel:  [<c01019dd>] cpu_idle+0x50/0x64
Jul  3 22:21:33 Tower kernel:  [<c03422be>] start_secondary+0x18a/0x18f
Jul  3 22:21:33 Tower kernel: ---[ end trace 65760633120bbdea ]---
Jul  3 22:21:33 Tower kernel: r8169: eth0: link up

 

This is the more significant part, a 'Call Trace' after "NETDEV WATCHDOG: eth0 (r8169): transmit timed out".  This has been an infrequent problem, always associated with the Realtek driver, but it is not limited to the current version of the driver; it occurred several versions back too.  What others have done is purchase and install a good LAN card, such as an Intel PRO/1000, and disable the onboard Realtek NIC.

 

I suggested the Intel PRO/1000 because, although it costs more than other gigabit LAN cards, it is better quality and has more headroom at higher traffic loads (as I understand it).  And since you already have fairly high-end equipment ...

 

I must say that, for what it's worth, you now hold the new BogoMIPS record, 22664.05 BogoMIPS total!  The 'Intel® Core2 Quad CPU Q9550 (@ 2.83GHz)' has a lot to do with that!

 

While we certainly will be moving to 2.6.30 and later, there is no telling when.  Tom's philosophy in the past has always appeared conservative: stay several versions behind, with an emphasis on stability, which is what you would expect of a good storage server.  With the recent betas he has been upgrading more aggressively, and 2.6.29 is pretty close to current, but then they are betas.  I don't know what he will decide for the v4.5 final release.

Link to comment

Thanks for that.  I was thinking of getting one of the motherboards in the unRAID HCL so that I can lose the video card as well.

 

If I do change the motherboard and all the disks are attached to a new controller, will unRAID refuse to start when it sees that the world is quite different?  I'll obviously know which physical drive maps to which unRAID drive (parity, disk1-5), but I don't want it to start 'fixing' things based on what it first detects...

Link to comment

If too many disks are missing or in the wrong place, the array will not start.  Just assign the disks to the correct slots and then hit the Start button.

Link to comment

Ok so you have the same NIC chipset on your board.

 

There are a number of boards in the HCL with specifically this Realtek 8111C chipset that don't mention any issues.  Perhaps they were tested prior to unRAID 4.5 beta 6.

 

Can anyone confirm not seeing these

 

NETDEV WATCHDOG: eth0 (r8169): transmit timed out

 

in syslog using a Realtek 8111C NIC, and advise what version of unRAID they are running?

 

Link to comment

I'm afraid most users with that chipset have not had a problem with it, and many are using the latest v4.5-beta6.  That is why I said that it is quite infrequent.  I believe I have seen 3, maybe 4, cases of it.  Sorry, you are one of the 'lucky' ones.

Link to comment

Well, for what it's worth I thought I should post the motherboard model that I've had the trouble with.

 

Gigabyte GA-EP45-DS4P rev 1.0

 

I've resigned myself for now (until the next beta; apparently it may be fixed in the next kernel) to using it for archive only. I never had a problem putting files on it, probably because with parity being calculated it's capped at 15MB/sec. I have also built another unRAID box using a Gigabyte GA-G31M-ES2L (also a Realtek NIC!).  With this second box I've had no problems at all with 4.5b6, moving large files to and from it with no kernel errors in syslog.

Link to comment
  • 2 months later...

I've had the same problem for the first time today. I'm currently running BubbaRaid Version 0.01.17-Beta based on unRAID 4.4.2 on a Gigabyte GA-MA74GM-S2. This server has been running 24/7 for the last 3 weeks - that's when I built it - and it's the first time it occurred. Let's hope it will not start occurring more often.

Link to comment
  • 4 weeks later...

 

Hi everybody, I have the same issue. Several weeks ago I posted the problem in the beta6 thread. I'm copying it here for the hardware info:

 

-----------------------------------------------------------------------------------------

Sometimes, when I want to watch a movie, this happens:

 

1º- I power on the Tower

2º- I power on the HTPC

3º- I select the movie using XBMC, which is pointing to smb://tower/PelisHD ---> User share.

4º- After 30-60 minutes of the movie, it stops. This is because the HTPC has lost the connection with Tower.

5º- I check the Tower with ANOTHER computer but the Tower doesn't respond. No http://tower, NO PING, nothing.

6º- Finally I go to where the Tower is and plug in a keyboard (you can see this in the syslog). I can access the disks, I can see in "ifconfig" that Tower has an IP, and... finally, when I type "top".... the PC that I used in 5º starts to receive ping replies.

7º- Now I can return to my HTPC and continue watching the movie without interruptions.

 

I attach the syslog, where it is easy to find the moment I start to type on the Tower (Sep  8 18:45:50 Tower kernel: usb 4-2: new low speed USB device using uhci_hcd and address 2)

 

My configuration is this:

---

Unraid 4.5 beta6 - Plus key

USB: Sandisk cruzer (16gb)

 

Mobo: Asus P5Q-VM (asus)

  Intel® G45 / ICH10

  Realtek® 8111C PCI-E Gigabit LAN controllers, featuring AI Net2

 

CPU: Celeron S 430 (1x 1800 MHz)

RAM: 2 GB

HDD: 4 disks >> parity and disk1: Seagate ST31000528AS (1TB), others: 1 Hitachi 750 GB and 1 seagate 500GB

 

The router is an Apple Time Capsule with DHCP, but I always give the IP 10.0.1.2 to Tower (MAC reservation).

CAT6 cable.

---

-------------------------------------------------------------------------

 

Joe L. gave me this answer:

 

See this answer for some things to try (basically, they moved the network card to a less-used IRQ): http://graag.blogspot.com/2007/12/netdev-watchdog-eth0-transmit-timed-out.html  It is apparently a bug in the driver for some Realtek chipsets.  Worst case, install a different network card.
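
Before moving the card to another IRQ, it helps to know whether eth0 currently shares its interrupt line with other devices. Below is a rough Python sketch (not from Joe L.'s answer) that parses the text of /proc/interrupts; the helper name irq_sharing is made up, and it assumes the usual layout where the device names are the last, comma-separated field of each numbered line.

```python
def irq_sharing(interrupts_text, device="eth0"):
    """Return {irq_number: [devices on that line]} for lines listing `device`."""
    shared = {}
    for line in interrupts_text.splitlines():
        irq, sep, rest = line.partition(":")
        if not sep or not irq.strip().isdigit():
            continue  # skip the CPU header and named rows such as NMI/LOC
        parts = rest.split(",")
        fields = parts[0].split()
        if not fields:
            continue
        # The first device name is the last whitespace-separated token
        # before the first comma; the remaining names follow the commas.
        devices = [fields[-1]] + [p.strip() for p in parts[1:]]
        if device in devices:
            shared[int(irq)] = devices
    return shared
```

On the server itself this could be called as `irq_sharing(open("/proc/interrupts").read())`; a result whose list has more than one entry means eth0 is sharing that IRQ with the other devices named.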

 

In the end, though, I will replace the NIC; I will try the Intel PRO/1000 or something like it. Can anyone suggest another one?

 

 

Link to comment

If it is the same issue as the one reported in this thread, then you will find within your syslog a section very similar to the "[ cut here ]" section in my post above.  If not, then it is something else, and you will need to post your syslog.  Please see the Troubleshooting link in my sig for instructions on capturing and posting your syslog.
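
For anyone who wants to check a saved syslog quickly, here is a minimal Python sketch that looks for the watchdog line quoted in this thread. It is only an illustration: the function name find_watchdog_timeouts is made up, and it assumes the timestamp is the first three whitespace-separated fields of each line, as in the excerpts above.

```python
import re

# Matches e.g. "NETDEV WATCHDOG: eth0 (r8169): transmit timed out"
WATCHDOG = re.compile(r"NETDEV WATCHDOG: (\S+) \(([^)]+)\): transmit timed out")


def find_watchdog_timeouts(lines):
    """Return a list of (timestamp, interface, driver) for each timeout line."""
    hits = []
    for line in lines:
        m = WATCHDOG.search(line)
        if m:
            # Timestamp such as "Jul  3 22:21:33" normalised to single spaces.
            timestamp = " ".join(line.split()[:3])
            hits.append((timestamp, m.group(1), m.group(2)))
    return hits
```

An empty result suggests the problem is something other than the r8169 transmit timeout discussed here, in which case the full syslog is still needed.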

Link to comment

 

Sorry, I forgot it.

 

I have attached it now, but I will point out some lines:

 

...
Sep  8 17:55:31 Tower emhttp: shcmd (22): /etc/rc.d/rc.nfsd restart | logger
Sep  8 18:03:33 Tower shfs0: duplicate object: /mnt/disk2/PelisHD/.DS_Store
Sep  8 18:03:33 Tower shfs0: duplicate object: /mnt/disk3/PelisHD/.DS_Store
Sep  8 18:04:31 Tower shfs0: duplicate object: /mnt/disk2/PelisHD/.DS_Store
Sep  8 18:04:31 Tower shfs0: duplicate object: /mnt/disk3/PelisHD/.DS_Store
Sep  8 18:04:36 Tower shfs0: duplicate object: /mnt/disk2/PelisHD/.DS_Store
Sep  8 18:04:36 Tower shfs0: duplicate object: /mnt/disk3/PelisHD/.DS_Store
Sep  8 18:45:50 Tower kernel: usb 4-2: new low speed USB device using uhci_hcd and address 2
Sep  8 18:45:51 Tower kernel: usb 4-2: configuration #1 chosen from 1 choice
Sep  8 18:45:51 Tower kernel: input: MLK Trust Deskset 15177 as /devices/pci0000:00/0000:00:1a.1/usb4/4-2/4-2:1.0/input/input3
Sep  8 18:45:51 Tower kernel: sunplus 0003:04FC:05D8.0001: input,hidraw0: USB HID v1.00 Keyboard [MLK Trust Deskset 15177] on usb-0000:00:1a.1-2/input0
Sep  8 18:45:51 Tower kernel: sunplus 0003:04FC:05D8.0002: fixing up Sunplus Wireless Desktop report descriptor
Sep  8 18:45:51 Tower kernel: input: MLK Trust Deskset 15177 as /devices/pci0000:00/0000:00:1a.1/usb4/4-2/4-2:1.1/input/input4
Sep  8 18:45:51 Tower kernel: sunplus 0003:04FC:05D8.0002: input,hiddev96,hidraw1: USB HID v1.00 Mouse [MLK Trust Deskset 15177] on usb-0000:00:1a.1-2/input1
Sep  8 18:46:20 Tower login[1289]: ROOT LOGIN  on `tty1'
Sep  8 18:46:56 Tower kernel: ------------[ cut here ]------------
Sep  8 18:46:56 Tower kernel: WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0xf8/0x178()
Sep  8 18:46:56 Tower kernel: Hardware name: P5Q-VM
Sep  8 18:46:56 Tower kernel: NETDEV WATCHDOG: eth0 (r8169): transmit timed out
Sep  8 18:46:56 Tower kernel: Modules linked in: md_mod ata_piix libata r8169
Sep  8 18:46:56 Tower kernel: Pid: 0, comm: swapper Not tainted 2.6.29.1-unRAID #2
Sep  8 18:46:56 Tower kernel: Call Trace:
Sep  8 18:46:56 Tower kernel:  [<c0120e07>] warn_slowpath+0x74/0x8a
Sep  8 18:46:56 Tower kernel:  [<c01190ed>] ? enqueue_task+0xd/0x18
Sep  8 18:46:56 Tower kernel:  [<c011c0fb>] ? try_to_wake_up+0x12b/0x136
Sep  8 18:46:56 Tower kernel:  [<c011c11d>] ? wake_up_state+0xa/0xc
Sep  8 18:46:56 Tower kernel:  [<c0129510>] ? signal_wake_up+0x23/0x31
Sep  8 18:46:56 Tower kernel:  [<c012968f>] ? complete_signal+0x171/0x178
Sep  8 18:46:56 Tower kernel:  [<c012992c>] ? send_signal+0x1ab/0x1c2
Sep  8 18:46:56 Tower kernel:  [<c01356dc>] ? getnstimeofday+0x51/0xdc
Sep  8 18:46:56 Tower kernel:  [<c022291b>] ? strlcpy+0x17/0x48
Sep  8 18:46:56 Tower kernel:  [<c02e9283>] dev_watchdog+0xf8/0x178
Sep  8 18:46:56 Tower kernel:  [<c0110039>] ? safe_smp_processor_id+0x39/0x84
Sep  8 18:46:56 Tower kernel:  [<c01285de>] ? update_process_times+0x49/0x4e
Sep  8 18:46:56 Tower kernel:  [<c01379cb>] ? tick_periodic+0x62/0x64
Sep  8 18:46:56 Tower kernel:  [<c02e918b>] ? dev_watchdog+0x0/0x178
Sep  8 18:46:56 Tower kernel:  [<c01281af>] run_timer_softirq+0x105/0x158
Sep  8 18:46:56 Tower kernel:  [<c0124a48>] __do_softirq+0x84/0x121
Sep  8 18:46:56 Tower kernel:  [<c0124b1a>] do_softirq+0x35/0x3a
Sep  8 18:46:56 Tower kernel:  [<c0124d97>] irq_exit+0x38/0x3a
Sep  8 18:46:56 Tower kernel:  [<c0104a69>] do_IRQ+0x67/0x7e
Sep  8 18:46:56 Tower kernel:  [<c01033a7>] common_interrupt+0x27/0x2c
Sep  8 18:46:56 Tower kernel:  [<c025a32e>] ? acpi_idle_enter_simple+0x11c/0x186
Sep  8 18:46:56 Tower kernel:  [<c02c1fcd>] cpuidle_idle_call+0x60/0x97
Sep  8 18:46:56 Tower kernel:  [<c01019dd>] cpu_idle+0x50/0x64
Sep  8 18:46:56 Tower kernel:  [<c0336813>] rest_init+0x53/0x55
Sep  8 18:46:56 Tower kernel: ---[ end trace b36bc0f67b222a45 ]---
Sep  8 18:46:56 Tower kernel: r8169: eth0: link up

 

At 18:04:36 I was watching a movie normally.

But at around 18:40 it stopped, so I plugged in a keyboard to check.

 

 

Link to comment

That is definitely the same problem, and so the same advice I gave above would apply.  However, there was a hint in the thread related to the new unRAID v4.5-beta7 release that there are newer drivers included, so upgrading to it is worth a try.  Once you have seen this Realtek problem, you can probably expect to see it again, unless the new unRAID release has the patched Realtek driver (r8169).  It would be good to hear from someone what the r8169 version is, in v4.5-beta7.  Previous releases contain r8169, version 2.3LK-NAPI.

 

One note, your syslog (and what is visible above) is dated September 8, so either this was an old syslog, or your system clock is wrong.

Link to comment

4.5b7 has 2.3LK-NAPI  (Or at least that string is still in the kernel module)

 

Link to comment
