[Edit: Not solved, happened again] Loss of connectivity, eth0 drops randomly - R720

Keexrean · May 28, 2020

Before you ask: no diagnostics attached yet. I'm waiting for another drop to happen so I can grab a fresh set, but I'm putting the post out in the wild anyway in case someone has valuable input!

Dell PowerEdge R720, stock 4-port NIC at the back.

It randomly loses its connection. The console is responsive, the server is responsive, everything else looks normal, but no amount of unplugging and replugging the Ethernet cable fixes it, and neither does bringing the port down and up from the GUI. A reboot is needed.

I tried assigning port 4 as eth0 instead of port 1, and it happened again yesterday.

Twist: I have an SFP+ back-to-back link to another computer, which let me reach the webGUI through that link. So no, it's not a GUI crash, and indeed everything looks normal on it, except obviously that the server lost connectivity to the network and thus to the internet too, so every Docker container and service I run is down for everyone else.

If anyone has encountered this on these servers already, I'd gladly hear about it!

I've heard this issue is actually somewhat known on the R720 and similar servers, and I'm contemplating getting a PCIe NIC as a workaround. So if anyone has an example of a good, reliable, affordable PCIe NIC with 2 or more ports that also works in low profile (I may or may not mount it in the low-profile slots, don't know yet), let me know!

Side note: no, I can't just use bonding to have all 4 ports connected as one for extra failover and such. My router is too dumb for that, and my switch is unmanaged and doesn't support it. My network equipment is quite basic aside from that back-to-back SFP+ link... which doesn't count as a network.

Edited June 27, 2020 by Keexrean

Keexrean · June 25, 2020

28 days of uptime now, and still going strong. I... just don't get it, I touched nothing. Ugh.

EDIT:

I rebooted the server (because I was doing stuff on it and needed to reboot).
It booted, all nice and dandy. I left home. I came back to 3 hours of uptime, and about 1 hour in, it had happened AGAIN out of nowhere, after being issue-free for almost a month.

And wiggling the Ethernet cables on both ends does nothing.
I have to shut down all Docker containers and VMs, unplug both ends of the cables and plug them back in...

And I don't know if it's linked, but I also had an issue with "tainted VMs" in libvirt (don't know why), along with the libvirt service sometimes never finishing startup (it stays displayed as starting in the bottom-left corner of the webGUI) and autostart VMs not starting.

*sigh*

procyon-diagnostics-20200626-1227.zip

Edited June 26, 2020 by Keexrean

Keexrean · June 26, 2020

And after like... half an hour of uptime? Happened again!
Here is the diag for this one.

I later ran '/etc/rc.d/rc.inet1 restart', which, without even touching the cable, made ~~everything work again~~... Scratch that: restarting the Docker containers was enough to get them back on the network, but restarting the VM didn't give it network access back, and after stopping libvirt it was impossible to start it again. Had to reboot.

Also for information:

- eth0 is on the R720's stock 4-port NIC, connected to my LAN
- eth1 is a back-to-back SFP+ link to another box (main workstation), which was really useful to access the GUI when eth0 failed
- eth2 is an SFP+ port connected to nothing
- eth3, eth4 and eth5 are the other 3 ports of the R720's stock 4-port NIC
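
For anyone wanting to double-check a mapping like the one above, here is a minimal Python sketch that lists each interface with its kernel driver and link state by reading the standard Linux sysfs tree under /sys/class/net. The names `list_interfaces` and `base` are made up for illustration, having Python available on the server is an assumption, and this was not run on the machine in this thread.

```python
import os


def _read(path):
    """Return the stripped content of a sysfs file, or None if unreadable."""
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        # e.g. 'carrier' cannot be read while the interface is administratively down
        return None


def _driver(iface_path):
    """Return the kernel driver name bound to the interface, if any."""
    link = os.path.join(iface_path, "device", "driver")
    if not os.path.exists(link):
        return None  # virtual interfaces (lo, bridges) have no device/driver
    return os.path.basename(os.path.realpath(link))


def list_interfaces(base="/sys/class/net"):
    """Return {iface: {"driver", "operstate", "carrier"}} read from sysfs.

    'base' is a hypothetical parameter added so the function can be
    pointed at a test directory instead of the real sysfs tree.
    """
    if not os.path.isdir(base):
        return {}
    result = {}
    for iface in sorted(os.listdir(base)):
        path = os.path.join(base, iface)
        if not os.path.isdir(path):
            continue  # skip plain files such as bonding_masters
        result[iface] = {
            "driver": _driver(path),
            "operstate": _read(os.path.join(path, "operstate")),
            "carrier": _read(os.path.join(path, "carrier")),
        }
    return result


if __name__ == "__main__":
    for name, info in list_interfaces().items():
        print(name, info["driver"], info["operstate"], info["carrier"])
```

Running it once when the network is healthy and once right after a drop would show whether the kernel itself reports the link as down (`operstate` of `down`, `carrier` of `0`) or still believes it is up.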

procyon-diagnostics-20200626-1318.zip

Edited June 26, 2020 by Keexrean

Keexrean · July 14, 2020

And again... and again...

procyon-diagnostics-20200714-0018.zip procyon-diagnostics-20200713-2010.zip

Keexrean · July 14, 2020

And again

procyon-diagnostics-20200714-0910.zip

Keexrean · December 22, 2020

Okay, update. Since that fiasco, I basically dropped $$$ on a second add-in NIC, a dual 10 Gbps RJ45 card (to be eth0), on top of the original quad 1 Gbps NIC and the already added dual 10 Gbps SFP+ card.

And so far it was working great!
Except today.

After 72 days of uptime, it crashed... but not the OG NIC! The SFP+ NIC is now the buggy one! Unplugging and plugging back in didn't do a thing.

Reboot fixed it... but I still find this surprisingly unpredictable and unsolvable behavior. It wouldn't be much of an issue on a desktop distro, but it's quite worrying on a server-oriented distro AND on server hardware, mind you.

(I wouldn't make much of this kind of issue on a desktop; it's really because it's a server that I take it seriously.)

procyon-diagnostics-20201222-1913.zip

Keexrean · April 1, 2021

Still happeeeeens... including in the middle of a backuuuuuup.....

Keexrean · April 8, 2021

Hello hi. About 2 weeks into using Unraid 6.9, trying 2 of my 3 known-working Solarflare SFN5122N cards, and I've already had to reboot TWICE because Unraid just casually and randomly forgets a network card is supposed to actually fulfill a function.

Basically, no other info than "ethx link is down, please check cables".

Oh, I've got cables. Direct-attach copper or fiber, which color, length or thickness do you fancy? Even with the ones that never dropped a packet in back-to-back connections between my Proxmox and workstation boxes, Unraid still has no clue a cable is plugged in at all. Windows and Proxmox visibly have no issue seeing and using these NICs, but apparently Unraid hasn't got its goggles on and has a bad case of wrist stumps.

Point is.

I can't use the onboard NIC, so I have a NetXtreme II BCM57810 dual 10 Gbps card for Ethernet cabling, because Unraid misbehaves with the onboard NIC: it only works when I pass the thing through to a VM. Great, I don't need that!

I apparently now can't use my SFP+ Solarflare card either, and will have to get another SKU/brand because Unraid can't deal with that either?

And in 10 months, basically no one cared. That's awesome.

shergar · April 10, 2021

Hi mate, thanks for taking the time to reply to my post. Take a look at the CPU temp app if you have it and see what temperature your NIC is running at. I have stumbled across what appears to be a card overheating issue: my 10GbE NIC was running at 104 deg C, which is over the threshold where it automatically shuts off the network connection. It appears from another OS forum that previous versions of the driver may not have had visibility of this and would quite happily let the card run at silly temps. The new Linux kernel may have updated the driver to a more recent one in which temperature control is a thing, and once it trips the card will not reset until restart. I have a 40mm fan on the way from Amazon; I will keep you posted on the results.
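
As a rough way to check the same thing without a plugin, the sketch below reads temperatures from the Linux hwmon sysfs tree (/sys/class/hwmon), where `temp*_input` files hold millidegrees Celsius. It is only a sketch under assumptions: not every NIC driver exposes a hwmon sensor, so the card may simply not show up, and the names `read_hwmon_temps`, `hot_sensors` and the 100 deg C limit in the example are made up for illustration, not taken from any driver.

```python
import glob
import os


def _read(path):
    """Return the stripped content of a sysfs file, or None if unreadable."""
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return None


def read_hwmon_temps(base="/sys/class/hwmon"):
    """Return a list of (chip, label, degrees_celsius) tuples from hwmon.

    'base' is a hypothetical parameter so a test directory can be used.
    """
    readings = []
    for chip_dir in sorted(glob.glob(os.path.join(base, "hwmon*"))):
        chip = _read(os.path.join(chip_dir, "name")) or os.path.basename(chip_dir)
        for temp_file in sorted(glob.glob(os.path.join(chip_dir, "temp*_input"))):
            raw = _read(temp_file)
            if raw is None:
                continue
            try:
                millideg = int(raw)  # hwmon reports millidegrees Celsius
            except ValueError:
                continue
            prefix = temp_file[: -len("_input")]
            # Fall back to the file prefix (e.g. "temp1") when no label exists
            label = _read(prefix + "_label") or os.path.basename(prefix)
            readings.append((chip, label, millideg / 1000.0))
    return readings


def hot_sensors(readings, limit_c):
    """Return the readings at or above limit_c (an arbitrary, user-chosen limit)."""
    return [r for r in readings if r[2] >= limit_c]


if __name__ == "__main__":
    for chip, label, temp in hot_sensors(read_hwmon_temps(), limit_c=100.0):
        print(f"{chip} {label}: {temp:.1f} deg C")
```

Logging this every minute or so before a drop would show whether the NIC temperature climbs toward a figure like the 104 deg C mentioned above right before the link goes down.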

shergar · April 12, 2021

As promised, I have an update. The new fan is zip-tied to my NIC. Temp has dropped from 104 deg C at idle to a very acceptable 64 deg C under load. No dropped connections as yet with about an hour and a half of uptime. Looks promising.

Eugene D · October 3, 2021

On 4/12/2021 at 4:19 PM, shergar said:

As promised, I have an update. The new fan is zip-tied to my NIC. Temp has dropped from 104 deg C at idle to a very acceptable 64 deg C under load. No dropped connections as yet with about an hour and a half of uptime. Looks promising.

Hopefully I can get a response on this thread. Shergar, what system do you have for your Unraid/10G NIC? I have a Chenbro NR12000 with an HP/QLogic NC523SFP. I ordered a 40mm fan off Amazon but I don't see a way to make it fit, so I'll have to look and see if there's something smaller. Do you know of any other fixes for the overheating in a server chassis for these cards, aside from installing the thing in a refrigerator?

Eugene D · October 10, 2021

Well, here's my update. I got a Noctua 40x40x10 fan and zip-tied it to the heat sink of the card (it barely fits; in fact you cannot screw it down to anything because you need wiggle room for it to get past the components on the motherboard), and so far (about a day) it seems to be JUST enough to keep the card active. I have not been able to test it under load, as I have yet to figure out how to use it instead of the built-in 1Gb NICs on the board, but at least at idle it isn't quitting on me. Temps as reported by "System Temp" at idle (whenever I've checked, which isn't constantly) are in the 90s (I've seen both low and high 90s; 99 at the time of writing this, with about 6.7 hours of uptime), so I'm not expecting good results when I manage to figure out how to utilize it instead of the other NICs.

For system specs reference, I am using a Chenbro NR12000, a 1U 12-bay chassis with a TYAN S5512 motherboard. In this case there is no room for vertical PCI cards, and when using a riser card only the one x16 slot is usable, so there is little to no room for much of anything.

Edited October 10, 2021 by Eugene D
typo
