Server stops responding on network - unraid 4.7



Recently my server has started to randomly stop responding on the network.  It won't respond to \\Tower\*, to typing the IP address directly into the browser, to telnet sessions, or even to a ping.  It is, however, still running, and I am able to power it down by pressing the power button on the case.  When it reboots everything is just fine, except I've lost the syslog.  Is there a way to capture the contents of the syslog as part of the shutdown process so I can see what's happening?  I'm trying to avoid having to hook up a monitor and keyboard.  I've attached a copy of my syslog after rebooting, just FYI.

syslog-2011-04-05.txt

Link to comment

Thanks for the reply.  I've seen that mentioned before, but since my problem is a loss of network communication, I was concerned that the important info might not be written to the syslog until after the connection was lost, so it would never show up in the telnet session.  I suppose it's worth a try.  Otherwise I'll have to hook up a monitor and keyboard until I can get this figured out.

Link to comment

You could do something like the following while troubleshooting. I wouldn't necessarily put it in the GO script (unless you have it check/wait until the shares are available).

 

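# Stop syslogd so the log file is not written to while it is moved
/etc/rc.d/rc.syslog stop
sleep 5
DT_STAMP=$(date +%Y%m%d-%H%M)
# Keep any previous copy on the array under a timestamped name
[ -f /mnt/user/pmm/log/syslog ] && mv /mnt/user/pmm/log/syslog "/mnt/user/pmm/log/syslog_${DT_STAMP}"
sleep 2
# Copy the live log to the array, then point /var/log/syslog at that copy
cp /var/log/syslog /mnt/user/pmm/log/
rm -f /var/log/syslog
ln -s /mnt/user/pmm/log/syslog /var/log/syslog
/etc/rc.d/rc.syslog start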

 

This will put the syslog on the array, but then it wouldn't get copied to the USB drive on a normal shutdown (I haven't dug into that yet).
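For the "check/wait until shares are available" part, here is a rough sketch of a polling helper. The function name `wait_for_dir` and the 300 second timeout are made up for illustration; the directory is the one used in the script above.

```shell
#!/bin/sh
# Wait until a directory exists, polling once per second.
# Usage: wait_for_dir <dir> <timeout_seconds>
# Returns 0 as soon as the directory is there, 1 if the timeout expires first.
# (wait_for_dir is a hypothetical helper, not part of unRAID.)
wait_for_dir() {
    dir=$1
    timeout=$2
    waited=0
    while [ ! -d "$dir" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example: only relocate the syslog once the share has appeared.
# wait_for_dir /mnt/user/pmm/log 300 && echo "share is up, safe to move syslog"
```

If the share never shows up, the helper returns non-zero and the syslog is simply left where it is.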

 

 

Link to comment

Thanks for the help guys, but I think after many hours of searching that I finally have my answer.  According to this post the unMenu powerdown script already saves the syslog for me.  I believe that it is supposed to copy it to the flash drive in /boot/logs/.  So I should have the info I need in that file, because my server was not responding on the network last night and I had to reboot it.  I feel kind of stupid that it was there all the time and I didn't even know it.

Link to comment

Yep, the syslog from yesterday when I shut down was in the /boot/logs folder on my flash drive.  I've attached a copy of the entire syslog. Here is the part from my syslog that concerns me. These were the last entries before I initiated the powerdown.  It looks like something went wrong with the NIC but I don't really know what to make of this.

 

Apr  5 15:44:19 Tower kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0xff/0x17f()
Apr  5 15:44:19 Tower kernel: Hardware name: C2SEA
Apr  5 15:44:19 Tower kernel: NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
Apr  5 15:44:19 Tower kernel: Modules linked in: md_mod xor i2c_i801 i2c_core ahci r8169
Apr  5 15:44:19 Tower kernel: Pid: 0, comm: swapper Not tainted 2.6.32.9-unRAID #8
Apr  5 15:44:19 Tower kernel: Call Trace:
Apr  5 15:44:19 Tower kernel:  [<c102449e>] warn_slowpath_common+0x60/0x77
Apr  5 15:44:19 Tower kernel:  [<c10244e9>] warn_slowpath_fmt+0x24/0x27
Apr  5 15:44:19 Tower kernel:  [<c123b505>] dev_watchdog+0xff/0x17f
Apr  5 15:44:19 Tower kernel:  [<c1037139>] ? sched_clock_cpu+0x136/0x14a
Apr  5 15:44:19 Tower kernel:  [<c123b406>] ? dev_watchdog+0x0/0x17f
Apr  5 15:44:19 Tower kernel:  [<c102bb23>] run_timer_softirq+0x105/0x158
Apr  5 15:44:19 Tower kernel:  [<c1028261>] __do_softirq+0x84/0xf8
Apr  5 15:44:19 Tower kernel:  [<c10282fb>] do_softirq+0x26/0x2b
Apr  5 15:44:19 Tower kernel:  [<c1028556>] irq_exit+0x29/0x2b
Apr  5 15:44:19 Tower kernel:  [<c10118f0>] smp_apic_timer_interrupt+0x6f/0x7d
Apr  5 15:44:19 Tower kernel:  [<c10031f6>] apic_timer_interrupt+0x2a/0x30
Apr  5 15:44:19 Tower kernel:  [<c10085f9>] ? mwait_idle+0x4c/0x52
Apr  5 15:44:19 Tower kernel:  [<c12108ad>] cpuidle_idle_call+0x28/0x9b
Apr  5 15:44:19 Tower kernel:  [<c1001a14>] cpu_idle+0x3a/0x4e
Apr  5 15:44:19 Tower kernel:  [<c129c662>] start_secondary+0x195/0x19a
Apr  5 15:44:19 Tower kernel: ---[ end trace ccea7bb31804fb24 ]---
Apr  5 15:44:21 Tower kernel: r8169: eth0: link up
Apr  5 15:50:03 Tower kernel: r8169: eth0: link up
Apr  5 15:55:27 Tower kernel: r8169: eth0: link up
Apr  5 15:56:14 Tower kernel: mdcmd (18): spindown 2
Apr  5 17:19:40 Tower kernel: r8169: eth0: link up
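To pull just these entries out of a saved log, something like the following should work. It is only a sketch: the two function names are made up, and the loop over /boot/logs/* assumes the saved logs there are plain text files.

```shell
#!/bin/sh
# Print every NETDEV WATCHDOG line found in a syslog file.
# (find_watchdog_timeouts and count_link_up are hypothetical helper names.)
find_watchdog_timeouts() {
    grep 'NETDEV WATCHDOG' "$1"
}

# Count how many times eth0 reported "link up" in a syslog file.
count_link_up() {
    grep -c 'eth0: link up' "$1"
}

# Example: scan every saved log on the flash drive.
# for f in /boot/logs/*; do
#     echo "== $f"
#     find_watchdog_timeouts "$f"
#     echo "link up messages: $(count_link_up "$f")"
# done
```

Repeated "link up" messages with no matching "link down" in between, as in the excerpt above, are worth counting because they show how often the NIC was reset.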

syslog.zip

Link to comment

@dgaschk - I didn't see your reply until this morning.  I checked last night and this morning using the "Show new replies to your posts" link at the top of the forum and there were no replies to this message.  However when I checked the actual post there was a reply. I'll launch the memtest when I get home this evening.

 

I'm curious though, did you see something in my syslog that would indicate a problem with my RAM or is this just a standard troubleshooting step that will help isolate the issue?

Link to comment

I launched memtest when I got home from work and let it run all night.  I stopped it this morning after 20 passes (~13 hours) with 0 errors.  So it looks like my RAM is not the problem.  I suppose I'm still looking for an explanation as to what these entries from my syslog mean ...

 

Apr  5 15:44:19 Tower kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0xff/0x17f()
Apr  5 15:44:19 Tower kernel: Hardware name: C2SEA
Apr  5 15:44:19 Tower kernel: NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
Apr  5 15:44:19 Tower kernel: Modules linked in: md_mod xor i2c_i801 i2c_core ahci r8169
Apr  5 15:44:19 Tower kernel: Pid: 0, comm: swapper Not tainted 2.6.32.9-unRAID #8
Apr  5 15:44:19 Tower kernel: Call Trace:
Apr  5 15:44:19 Tower kernel:  [<c102449e>] warn_slowpath_common+0x60/0x77
Apr  5 15:44:19 Tower kernel:  [<c10244e9>] warn_slowpath_fmt+0x24/0x27
Apr  5 15:44:19 Tower kernel:  [<c123b505>] dev_watchdog+0xff/0x17f
Apr  5 15:44:19 Tower kernel:  [<c1037139>] ? sched_clock_cpu+0x136/0x14a
Apr  5 15:44:19 Tower kernel:  [<c123b406>] ? dev_watchdog+0x0/0x17f
Apr  5 15:44:19 Tower kernel:  [<c102bb23>] run_timer_softirq+0x105/0x158
Apr  5 15:44:19 Tower kernel:  [<c1028261>] __do_softirq+0x84/0xf8
Apr  5 15:44:19 Tower kernel:  [<c10282fb>] do_softirq+0x26/0x2b
Apr  5 15:44:19 Tower kernel:  [<c1028556>] irq_exit+0x29/0x2b
Apr  5 15:44:19 Tower kernel:  [<c10118f0>] smp_apic_timer_interrupt+0x6f/0x7d
Apr  5 15:44:19 Tower kernel:  [<c10031f6>] apic_timer_interrupt+0x2a/0x30
Apr  5 15:44:19 Tower kernel:  [<c10085f9>] ? mwait_idle+0x4c/0x52
Apr  5 15:44:19 Tower kernel:  [<c12108ad>] cpuidle_idle_call+0x28/0x9b
Apr  5 15:44:19 Tower kernel:  [<c1001a14>] cpu_idle+0x3a/0x4e
Apr  5 15:44:19 Tower kernel:  [<c129c662>] start_secondary+0x195/0x19a
Apr  5 15:44:19 Tower kernel: ---[ end trace ccea7bb31804fb24 ]---
Apr  5 15:44:21 Tower kernel: r8169: eth0: link up
Apr  5 15:50:03 Tower kernel: r8169: eth0: link up
Apr  5 15:55:27 Tower kernel: r8169: eth0: link up
Apr  5 15:56:14 Tower kernel: mdcmd (18): spindown 2
Apr  5 17:19:40 Tower kernel: r8169: eth0: link up

Link to comment

What happens at your console, are you able to type anything there, or is that all locked up as well?  Also, does it lock up on its own, or just during a file copy?

 

I am having a similar issue, but mine occurs during file copies, http://lime-technology.com/forum/index.php?topic=11826.0

 

I have had no luck in tracking a solution and it is becoming a little aggravating, because there is very little information available as to a potential cause.

 

My network is not that complex that I would be the only one to experience this so that is certainly why I am beginning to get nervous.

Link to comment

Basically I got home from work and my wife just says, "What happened to all the movies?  XBMC is saying they are not available."  So I checked the unRAID and unMenu webGUI but was unable to connect.  I tried to telnet and I could not connect.  I tried to ping and got no response.  I went downstairs to the server and the NIC was still on (i.e. lights were on and blinking) and the switch was showing the server present as well.  I pushed the power button to see if the server would reboot and it did.  I now have a monitor and a keyboard hooked up so the next time it happens I'll check the console.

 

Here is what I think was happening when the link was lost...I think the server was streaming a movie to my HTPC when the timeout occurred.  Again I was not home so I'm not sure.  I do know that at 15:56:14 a spindown command was issued to drive 2.  My drives are all on 15 min spindown timers so that means that something was accessing the server at ~15:41.  The transmit queue problem occurred at 15:44:19. Just to confirm that a movie was playing I asked my wife - "Hey, were you watching a movie on the HTPC this afternoon?"  The answer I got was something like this - "I don't know, I don't keep track of what I do with the HTPC." Thanks honey, that helps a lot.
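The timing can be double-checked with a bit of shell arithmetic. This is just a sketch with a made-up helper name, and it assumes both timestamps fall on the same day.

```shell
#!/bin/sh
# Subtract a number of minutes from an HH:MM:SS timestamp.
# Usage: minus_minutes <HH:MM:SS> <minutes>
# (minus_minutes is a hypothetical helper; it does not handle crossing midnight.)
minus_minutes() {
    h=${1%%:*}
    rest=${1#*:}
    m=${rest%%:*}
    s=${rest#*:}
    # Strip one leading zero so values like 08 are not read as octal
    h=${h#0}; m=${m#0}; s=${s#0}
    total=$(( h * 3600 + m * 60 + s - $2 * 60 ))
    printf '%02d:%02d:%02d\n' $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

# Spindown logged at 15:56:14 with a 15 minute timer:
minus_minutes 15:56:14 15    # prints 15:41:14
```

So the last access to drive 2 was at about 15:41:14, roughly three minutes before the transmit queue timeout at 15:44:19.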

 

I did find this post on the forums where a few other users have had this same problem.  It does not appear to happen very often.  I guess I'm one of the unlucky ones. The suggestion from Rob J. was to get an Intel NIC because it seems this issue is specific to the Realtek 8111 NICs. I see this as my last resort.

 

Now I also found this bug report - Bug 538920 - r8169 netdev timeout when aspm is enabled which seems to indicate that setting aspm to off may fix the problem.  The bug is reported against Fedora but it is the same hardware (Realtek 8111C) and driver (r8169) that I'm using so I think that it is applicable to unraid.  Is this correct?  Reading through all the posts I see two possible solutions:

1) Set kernel boot parameter pcie_aspm=off

or

2) Use kernel 2.6.37 and newer

 

I'm not even sure if either of these two options would apply to unraid.  I don't think that using a newer kernel is possible - doesn't limetech have to do that?  But I'm not sure about the aspm parameter.  Maybe that is a possibility.

Link to comment

Here is what I've done in the past 3-4 weeks (in order)...

  • Upgraded to 4.7
  • Removed the jumpers from all my EARS drives (see below)
  • Replaced a WD10EARS data drive with a WD20EARS
  • Added a user script to control the speed of my case fan based on HDD temps
  • Installed new RAM (2x2GB)
  • Installed a non-array drive
  • Installed sabnzbd package onto the non-array drive
  • Reverted back to original RAM (but only using a single 2GB module now)

 

To remove the jumpers I started by upgrading to 4.7 and then preclearing a new EARS drive unjumpered.  I then removed a jumpered EARS drive in my array and replaced it with an unjumpered one.  After rebuilding the drive and verifying parity I then removed the jumper from the drive I had just replaced and then precleared it. I repeated this sequence until all my jumpered EARS drives were unjumpered.

 

Now I've had this problem I think about 3-4 times in the past month.  I believe that this problem (loss of network connection) occurred once before I made any of these changes, so I don't believe that these changes have anything to do with my problem - but I suppose I could be wrong.  Also, about 4-6 weeks ago I began to use my server a lot more for streaming to my HTPC.  So I would say that my usage of the server has actually been the biggest change. From what I have read about this issue I think that it is related to the 8111C NIC, so it makes sense that the problem has surfaced now that I'm using the server a lot more.

Link to comment

Well, I found an Active State Power Management (ASPM) option in my BIOS and set that to DISABLED. That did not work at all.  The server booted but the array would not start, and eventually my server turned itself off.  Time to visit Google. I'm still trying to figure out how to set the kernel boot parameter pcie_aspm=off, if that's even possible in unRAID.
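For what it's worth, here is a rough sketch of how the parameter could be added, assuming unRAID boots through syslinux and that its config on the flash drive is /boot/syslinux.cfg with an "append" line. Both are assumptions I have not confirmed on 4.7, and `add_kernel_param` is a made-up helper name.

```shell
#!/bin/sh
# Append a kernel parameter to every "append" line of a syslinux-style config.
# Usage: add_kernel_param <config_file> <parameter>
# Leaves a backup in <config_file>.bak and does nothing if the parameter
# is already present. (add_kernel_param is a hypothetical helper.)
add_kernel_param() {
    cfg=$1
    param=$2
    if grep -q "^[[:space:]]*append .*$param" "$cfg"; then
        return 0
    fi
    cp "$cfg" "$cfg.bak"
    sed "/^[[:space:]]*append /s/\$/ $param/" "$cfg.bak" > "$cfg"
}

# Example (the config path is an assumption, check your flash drive first):
# add_kernel_param /boot/syslinux.cfg pcie_aspm=off
# After rebooting, confirm the kernel actually received it:
# grep -o 'pcie_aspm=[a-z]*' /proc/cmdline
```

If the server fails to boot afterwards, the .bak copy can be restored by plugging the flash drive into another machine.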

Link to comment
