Freeze / crash issue resulting in network lock-up and BTRFS corruption?



Hello fellow Unraid users,

 

I have been running Unraid for a year now, but recently I have run into a weird issue:

Every once in a while (sometimes after a few days, sometimes after a few weeks), I notice my network slowly failing.

 

My setup is kind of like this:

The Unraid PC and a Windows PC are plugged into a 10-gigabit switch (Netgear GS110MX) using Asus XG-C100C network cards. Both are also connected via their onboard gigabit NICs as a backup; on Unraid I configured this as an "active-backup" bond. The NAS uses an ASRock B450 Pro4 (AM4) motherboard.
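For reference, an active-backup bond on Unraid comes down to a few lines in /boot/config/network.cfg roughly like the following (illustrative values, not copied from my config, and assuming I'm reading the config format right):

```
# /boot/config/network.cfg (excerpt, illustrative values)
BONDING="yes"
BONDNAME="bond0"
BONDNICS="eth0 eth1"
BONDING_MODE="1"   # mode 1 = active-backup in Linux bonding terms
```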

The Wi-Fi access point and internet gateway are also plugged into this switch, as are the media streaming devices. All other devices are on a separate gigabit switch, which uplinks to the 10G switch.

 

Whenever the network issues start, it usually begins with the NAS disconnecting (Plex streams interrupt, the web GUI is down, SSH is down). Then, slowly, the other devices on the 10G switch start to fail, until finally the whole network is down and no device on that switch can reach any other. However, all the devices on the second gigabit switch can still communicate fine.

The way to resolve this is to power-cycle the 10G switch; communication is then immediately restored for all devices except the Unraid server, which is apparently completely frozen by this point: even with a monitor and keyboard plugged in there is no response until I press the reset button. And almost every time I do that, the BTRFS filesystem on my SSD corrupts and I have to reformat it.

 

I first sent the switch in to be replaced, but the replacement has exactly the same issue. Since the problem starts with the Unraid server, and Unraid is also frozen after the fact, I now suspect the issue lies with Unraid rather than the switch.

While I have seen many customer reviews saying the switch can fail under high load, all of my crashes happened under essentially zero load, e.g. while I was asleep.

 

Has anyone experienced anything similar?

 

How could I go about diagnosing this further?

Is a switch issue causing the NAS to freeze, or is the NAS freezing up and somehow locking up the switch?

(Next time it happens I will try unplugging only the NAS from the switch to see if that resolves the issue.)

 

Thanks and best regards

 

nethub-diagnostics-20200402-1947.zip


Based on your recommendation I updated the BIOS and set Power Supply Idle Control to Typical. I also checked the RAM speed: it was on Auto (which for this RAM defaults to 2400 MHz, if I read the BIOS right; 2 sticks in 4 slots, 2nd-gen Ryzen), but I set it to a fixed 2400 MHz anyway.

 

Will keep you updated if the crash happens again.

Edited by permissionBRICK
  • 2 weeks later...

No, sadly it happened again tonight. The switch locked up and got power-cycled by the watchdog script I made, but the NAS was still stuck and I had to hard-reset it. Any more ideas? (Luckily, it appears there was no corruption this time.)
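(For anyone curious, the watchdog amounts to something like the sketch below. This is not my exact script: the switch IP, the failure threshold, and the power_cycle_switch() hook are placeholders for whatever smart plug or PDU feeds your switch.)

```python
import subprocess
import time

SWITCH_IP = "192.168.1.2"   # placeholder: the switch's management IP
FAIL_LIMIT = 3              # consecutive failed probes before power-cycling
POLL_SECONDS = 30

def should_power_cycle(consecutive_failures, limit=FAIL_LIMIT):
    """Pure decision logic: only act after `limit` failures in a row."""
    return consecutive_failures >= limit

def switch_alive(host):
    """One ICMP probe; True if the switch answered within 2 seconds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def power_cycle_switch():
    """Placeholder: toggle the smart plug / PDU outlet feeding the switch."""

def watchdog_loop():
    failures = 0
    while True:
        failures = 0 if switch_alive(SWITCH_IP) else failures + 1
        if should_power_cycle(failures):
            power_cycle_switch()
            failures = 0
        time.sleep(POLL_SECONDS)
```

The threshold exists so a single dropped ping doesn't trigger a pointless power cycle.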

 

Well, since I just went AMD on my main workstation and freed up an MSI board and an i7-7700K CPU, I will try swapping over to that for now.

Edited by permissionBRICK
  • 3 weeks later...

Well, sadly, it happened again. This time I didn't rush to restart the NAS and spent some time trying to analyze the issue live.

 

Since last time, I have swapped the motherboard for an MSI Z270 Gaming Pro Carbon, the CPU for an Intel i7-7700K, and the dual SATA SSD cache for a single NVMe SSD.

The only components that stayed the same were the 10G card, the RAM, the PSU, the hard drives, and the Unraid config on my USB stick.

 

First I used Wireshark to check whether garbled data was being sent over the network. However, on both the 10G port and the backup 1G port of my PC I could only see packets from my own machine, plus, on the 1G link, some packets from devices on the second switch that sits between the main switch and this port.

 

Like last time, plugging a monitor into the NAS did nothing. First I unplugged the NAS's 10G link from the switch; this restored limited communication for the devices on the 10G switch. Then I unplugged the backup link to the 1G switch, which restored full communication for all other devices. Replugging the 10G link changed nothing, but when I replugged the 1G link, devices started to fail again. This might be because the 10G card does not re-initialize its connection. I even tried to rule out the 10G card hard-locking the system by hot-unplugging it, but that changed nothing.


Then I powered the NAS down, reinstalled the 10G card, and restarted it. Luckily there was no corruption this time. However, when I stopped the Docker service to check it and then restarted it, the same thing happened again right away.

 

This is where it gets weird.

 

All the devices slowly started to block or drop communication until nothing was working anymore. I used the opportunity to plug the 1G cable from the NAS directly into my PC to look for packets with Wireshark; again, no packets were received. Then I plugged the 1G cable back into the switch, which immediately caused all devices to fail again.

 

Finally, I hard-reset the NAS again, and since then I have not been able to reproduce the issue today.

 

So apparently this issue happens with both 1G and 10G switches and with both Intel and AMD motherboards. It locks up the NAS as well as any network switch connected to it, even across unplugging and replugging the LAN ports, yet apparently no packets are being sent. I'm slowly running out of ideas here; can anyone help me out?

 

Is it sending garbled data that isn't in packet form? If so, why would that propagate to the second switch?

 

Edit: it just occurred to me to set up a remote syslog server, in case it happens again.
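For anyone wanting to do the same, the receiving end can be a stock rsyslog instance with a UDP input along these lines (example values; the log path is arbitrary). If I remember the menus right, the Unraid side is then just a matter of pointing the remote syslog server setting at that machine's IP.

```
# /etc/rsyslog.conf on the receiving machine (example values)
module(load="imudp")               # accept syslog over UDP
input(type="imudp" port="514")     # standard syslog port
*.*  /var/log/unraid-remote.log    # write everything to one file
```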

Edited by permissionBRICK

Another wild idea: could the active-backup bond, plugged into two different switches, have caused a switching loop because both interfaces used the same MAC for some reason, with each switch thinking the other one was the path to that MAC?

 

However, if that was the cause, I'm not sure what happened the first time, since back then I think both bond links were plugged into the 10G switch. Also, back then power-cycling the 10G switch completely fixed the network issues, whereas today that only helped a little: both switches seemed to lock up as long as one bond link was attached and the two switches were connected to each other. Perhaps the same thing happened inside the 10G switch, if its 10G ports are switched separately from its 1G ports.

 

Also, I tried repeatedly switching the active interface on the bond, but that didn't trigger the error.


No, that didn't fix it either. 

I set bonding to Disabled and configured the second interface to not get an IP at all (I really only need it for Wake-on-LAN).

However, today the exact same thing happened. While it was happening, I tried unplugging the cable from the 1G switch and plugging it into the 10G switch; the lock-up continued.

 

The remote syslog files contained no entries from around the time the issue happened.

 

Anyone got any more ideas?

 

I will now try disabling the link entirely to see if that changes anything.

Edited by permissionBRICK
15 hours ago, permissionBRICK said:

I set bonding to Disabled and configured the second interface to not get an IP at all (I really only need it for Wake-on-LAN).

I have exactly the same setup: the onboard LAN is for WOL purposes only. The problem seems related to looping, but I can't identify the source.

 

Is the link from the 10G switch to the 1G switch a single Ethernet cable?

 

When the problem occurs, do all the port LEDs flash rapidly? (You say the network went down slowly, so this is different from a traditional loop.)

 

Does any VM or Docker container have a wrong MAC address configured that conflicts with a real network device? (Some other posts also report Unraid taking down the whole network, but those stories usually don't have an ending.)

Edited by Benson
12 hours ago, Benson said:

I have exactly the same setup: the onboard LAN is for WOL purposes only. The problem seems related to looping, but I can't identify the source.

 

Is the link from the 10G switch to the 1G switch a single Ethernet cable?

 

When the problem occurs, do all the port LEDs flash rapidly? (You say the network went down slowly, so this is different from a traditional loop.)

 

Does any VM or Docker container have a wrong MAC address configured that conflicts with a real network device? (Some other posts also report Unraid taking down the whole network, but those stories usually don't have an ending.)

Yes, there is only one link cable between the two switches.

When the NAS is plugged into the 1G switch, the link cable between the 1G and 10G switches flashes rapidly during the issue. The other ports don't flash as much, so I guess the buffers are overflowing rather than a broadcast storm overloading the network.

 

I don't know enough about data-link-layer logic to guess why the switches blink as if they are receiving data, yet when I plug the cable directly into the PC, Wireshark doesn't show any packets.

 

None of the VMs were running at any point when the issues occurred. As for Docker containers, the only unusual thing is the lancache container that I run on br0 with a separate IP, but running ifconfig inside that container also shows a different MAC.

 

Edited by permissionBRICK

To be honest, I have no idea why the network would slowly go down and the storage filesystem would corrupt, but I agree the problem source is Unraid.

My setup is a bit different: both the 10G and 1G links were plugged into the same switch, i.e. not across two switches. So to reproduce the problem I would need to set it up like yours, and not on my production Unraid; I think I will try this on a test PC.

 

 

Edited by Benson

Okay, it happened again, this time while the 1G link was down. This resulted in a network lock-up as well, but once I reset the switch, the network was fine again while the server remained frozen. So I guess that when the server freezes, the 1G NIC does re-establish its link if it is lost, while the 10G card doesn't, but both lock up the network when it happens.

 

I also found these threads, which seem to describe exactly my issue:

The first one seems to indicate that a memory issue might have been the cause; however, I would find it strange if a generic memory defect caused such similar, specific issues for several people...

Nevertheless, I will try swapping out the RAM again, and have the current modules run memtest for a week to be sure...

 

Edited by permissionBRICK
4 hours ago, permissionBRICK said:

Okay, it happened again, this time while the 1G Link was down.

This proves it's not related to the network devices.

 

4 hours ago, permissionBRICK said:

have the current modules run memtest for a week to be sure...

A pair of Ballistix RAM modules running at just 2400 MHz. I suppose you used the same modules in the AMD and Intel builds and got the same problem, so maybe one of them has gone bad.

 

I suggest you test them in a more aggressive way, i.e. set the RAM speed to 2666 MHz, or run the Windows-based HCI Design MemTest.

All my DIY builds get this as a quick memory check; I usually run one or two passes, which takes just several hours.

 

 https://www.hcidesign.com/memtest/

 

Edited by Benson
5 hours ago, Benson said:

This proves it's not related to the network devices.

 

No, sadly it doesn't. The 10G link was still connected, and it locked up the network again as well.

5 hours ago, Benson said:

A pair of Ballistix RAM modules running at just 2400 MHz. I suppose you used the same modules in the AMD and Intel builds and got the same problem, so maybe one of them has gone bad.

 

I suggest you test them in a more aggressive way, i.e. set the RAM speed to 2666 MHz, or run the Windows-based HCI Design MemTest.

All my DIY builds get this as a quick memory check; I usually run one or two passes, which takes just several hours.

 

 https://www.hcidesign.com/memtest/

 

Yeah, I have now put the old 4 GB modules back into the system, so I have plenty of time to test the RAM for issues. Thanks!

 

  • 2 weeks later...

I have now been running memory tests on the 32 GB of RAM on the old board (where the same issue used to occur) for two weeks: one week of MemTest86 and one week of Windows Memory Diagnostic on max settings. Both found no errors. I even ran the memtest tool @Benson suggested, but again no errors.

Here is a screenshot of the result:

[memtest result screenshot attached]

Edited by permissionBRICK
