February 8, 2023

To begin, I am an IT technician with 10 years of experience under my belt. Asking for help (in general) is very hard for me, so admittedly I've been holding off on posting to these forums. After about a year of troubleshooting I'm truly out of ideas and can't think of where to go next.

I have a Dell R710 running the last supported BIOS, Lifecycle Controller, and iDRAC package. It is equipped with a PERC H200 flashed to IT/JBOD mode. Periodically the server crashes and I lose everything. There is no rhyme or reason as to why, and the only indicator I get (when I'm able to catch it) is that my VMs and Dockers begin to crash, slow down, or act weird. For instance, I am running Home Assistant in a VM converted from the image on their official website, and my Z-Wave plugin begins to lose connectivity with my Z-Wave network.

I have read several posts regarding similar issues and found that some people have had luck changing their Docker network interface from macvlan to ipvlan (which I have done) and running Memtest86 to verify RAM functionality. I ran Memtest for about 13 hours and was unable to get it to fail. I have BTRFS, TRIM, and mover schedules set, so those are all being taken care of. No SMART errors either.

At this point I am at a loss and cannot for the life of me figure out how to mitigate these crashes. Any input or guidance would be appreciated. All I want is for this thing to work consistently, and with the struggles of life and taking on a mortgage I'm running out of time and energy to fix this.

Attached are my diagnostics. If someone could please go through them and give me some pointers, I would truly appreciate it. I feel like I'm just an idiot and missed a setting somewhere. Thanks again in advance, I can provide more information if required!

zserve-diagnostics-20230207-2144.zip
February 8, 2023 (Author)

11 hours ago, trurl said: setup syslog server

Syslog has already been set up! Sorry, I wrote this while tired last night. Before this last crash I set up syslog and mirrored it all to flash, so I have logs from while it was crashing. However, after skimming through them I couldn't see anything out of the ordinary. Would you be interested in taking a look?
February 8, 2023 (Author)

23 minutes ago, JorgeB said: Looks like NIC problems, try with a different NIC if available.

Just out of curiosity, where in the logs were you able to determine that? Right now I'm actually using all 4 NICs in a LAG to my switch (the switch has been configured for it as well). I also had these exact issues when I used just one NIC, but I haven't tried removing the LAG and using another one of the 4. Is the driver acting up, or does it appear to be hardware related?
February 8, 2023

Look at the syslog, anything that mentions bnx2 is about the NICs, e.g.:

Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: <--- start TBDC dump --->
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: TBDC free cnt: 32
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: LINE CID BIDX CMD VALIDS
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 00 001300 c560 00 [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 01 001300 c560 00 [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 02 001080 a2f8 00 [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 03 001080 a300 00 [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 04 000800 1f38 00 [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 05 001300 1110 00 [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 06 001300 1120 00 [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 07 001000 6308 00 [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 08 001000 62e8 00 [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 09 059f80 ff50 fc [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 0a 1ffd80 f660 f4 [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 0b 1eaa80 f6f8 af [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 0c 0fff80 3cc8 3d [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 0d 129f80 df90 a5 [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 0e 1fd780 7fd0 ff [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 0f 0fff80 57f8 6e [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 10 1bef00 7fd8 d1 [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 11 1dfe80 7ee8 ef [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 12 17ec80 dbf0 ff [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 13 01ac80 ddf0 7b [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 14 00ff00 dff8 46 [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 15 1b1d80 76e8 ff [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 16 1fff80 3328 7f [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 17 1d3b80 fdb0 c3 [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 18 0d7f00 af48 5d [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 19 1bdf80 9de0 f4 [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 1a 0ff300 ebd0 eb [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 1b 1f7d80 b9f8 ff [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 1c 0fe980 fdf0 fd [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 1d 1ff980 6770 ff [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 1e 197f00 f7f8 fc [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: 1f 137980 fff8 ed [0]
Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: <--- end TBDC dump --->
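For anyone combing a saved syslog for the same kind of entries, here is a minimal sketch of counting kernel lines that mention a given driver, grouped by day. It assumes the classic `Mon DD HH:MM:SS host kernel: ...` layout seen in the lines quoted in this thread; the function name `count_driver_lines` and the file path in the usage note are hypothetical, not part of Unraid.

```python
import re
from collections import Counter

# Classic syslog prefix as seen in this thread, e.g.
# "Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: ..."
LINE_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"\d{2}:\d{2}:\d{2}\s+\S+\s+kernel:\s+(?P<msg>.*)$"
)


def count_driver_lines(lines, driver="bnx2"):
    """Count kernel log lines mentioning `driver`, keyed by 'Mon D'.

    `count_driver_lines` is a hypothetical helper written for this sketch.
    Lines that do not match the assumed syslog layout are skipped.
    """
    pattern = re.compile(rf"\b{re.escape(driver)}\b")
    per_day = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and pattern.search(match.group("msg")):
            key = f"{match.group('month')} {int(match.group('day'))}"
            per_day[key] += 1
    return per_day


if __name__ == "__main__":
    # Two lines copied from the posts in this thread.
    sample = [
        "Jan 19 05:52:07 zServe kernel: bnx2 0000:01:00.0 eth0: TBDC free cnt: 32",
        "Feb 2 07:45:54 zServe kernel: macvlan_process_broadcast+0xea/0x128 [macvlan]",
    ]
    print(count_driver_lines(sample, driver="bnx2"))
```

To run it against a real log, something like `count_driver_lines(open("syslog.txt", errors="replace"))` would work, where `syslog.txt` stands in for wherever the mirrored syslog was saved. A day with a burst of driver messages is a reasonable place to start reading.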
February 8, 2023 (Author)

1 hour ago, JorgeB said: Look at the syslog, anything that mentions bnx2 is about the NICs

Gotcha, are you seeing anything suspicious after that date? I changed some of my logs to just show the last 2 days and haven't seen anything related to that since. Not sure if that was due to me moving the connection or what. Regardless, I'll give disabling the LAG a shot and move the primary port over to eth1. Will report back on what I find!
February 8, 2023

17 minutes ago, echooffzack said: are you seeing anything suspicious after that date?

Not really, and I didn't even notice the date. I do also see this:

Feb 2 07:45:54 zServe kernel: macvlan_process_broadcast+0xea/0x128 [macvlan]

Macvlan call traces are usually the result of having dockers with a custom IP address and will end up crashing the server. Upgrading to v6.10 or later and switching to ipvlan should fix it (Settings -> Docker Settings -> Docker custom network type -> ipvlan; advanced view must be enabled, top right).
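Kernel call-trace frames such as the one quoted above follow the pattern `symbol+0xoffset/0xsize [module]`, so the module involved can be pulled out mechanically. The following is a rough sketch under that assumption; `parse_frame` and `modules_in_trace` are hypothetical helper names made up for illustration.

```python
import re

# One stack frame in a kernel call trace: symbol+0xOFFSET/0xSIZE [module]
# The trailing "[module]" is absent for symbols built into the kernel.
FRAME_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>[^\]]+)\])?"
)


def parse_frame(line):
    """Return symbol, offset, size and module for a frame line, or None."""
    match = FRAME_RE.search(line)
    if match is None:
        return None
    return {
        "symbol": match.group("symbol"),
        "offset": int(match.group("offset"), 16),
        "size": int(match.group("size"), 16),
        "module": match.group("module"),  # None for built-in symbols
    }


def modules_in_trace(lines):
    """Return the sorted set of module names seen in any frame line."""
    found = set()
    for line in lines:
        frame = parse_frame(line)
        if frame and frame["module"]:
            found.add(frame["module"])
    return sorted(found)


if __name__ == "__main__":
    # Line copied from the post above.
    line = "Feb 2 07:45:54 zServe kernel: macvlan_process_broadcast+0xea/0x128 [macvlan]"
    print(parse_frame(line))
```

If `macvlan` shows up in the result of `modules_in_trace` for a log captured after the switch to ipvlan, that would suggest the setting did not take effect for every container network.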
February 8, 2023 (Author)

25 minutes ago, JorgeB said: Not really, and didn't even notice the date, I do also see this: Feb 2 07:45:54 zServe kernel: macvlan_process_broadcast+0xea/0x128 [macvlan] Macvlan call traces are usually the result of having dockers with a custom IP address and will end up crashing the server, upgrading to v6.10 or later and switching to ipvlan should fix it (Settings -> Docker Settings -> Docker custom network type -> ipvlan (advanced view must be enabled, top right)).

I should note that the last time the system crashed was after I switched from macvlan. I initially did some research on Unraid crashing after 2+ days and found a couple of posts where you had mentioned making that switch, and since I am using VLANs for my dockers/VMs/cameras, I made the change.

I believe the last crash was within the last 2 days, hopefully that helps with combing through the logs. I believe it was between 12-1AM on Feb 7th. I remember waking up to my house freezing because Home Assistant hadn't recognized I was home and didn't make the adjustment to my thermostat. I tried changing it in HA but the UI wouldn't load. I woke up that morning at 7am and told Unraid to start the array again.

Again, I appreciate your help with this so far. My only other guess is that my Home Assistant VM could be contributing, but I couldn't imagine a VM would have that much power over the entire kernel. HA has been pretty straightforward and isn't very problematic. I had to make some modifications to the actual image before I got it working; I downloaded the KVM file from them directly and imported it into the VM manager.

Is this a common point of failure on a lot of Unraid installs? After going through some previous posts, it seems a lot of people run into issues with the networking side of Unraid. I'm just worried I've overcomplicated a lot of this with a modded HBA and all of these VLANs.
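To narrow a large syslog down to a suspected crash window like the one described above, a small filter on the timestamp prefix can help. This is only a sketch: it assumes the `Mon DD HH:MM:SS` prefix shown in the quoted lines, an English month abbreviation (`%b` depends on the locale), and that the year has to be supplied because classic syslog timestamps omit it. The name `lines_in_window` is hypothetical.

```python
from datetime import datetime


def lines_in_window(lines, start, end, year=2023):
    """Return the syslog lines whose timestamp lies within [start, end].

    Classic syslog timestamps carry no year, so `year` is an assumption
    supplied by the caller. Lines that cannot be parsed are skipped.
    """
    selected = []
    for line in lines:
        parts = line.split(None, 3)  # month, day, time, rest of the line
        if len(parts) < 4:
            continue
        try:
            stamp = datetime.strptime(
                f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue
        if start <= stamp <= end:
            selected.append(line)
    return selected


if __name__ == "__main__":
    log = [
        # Real line quoted earlier in this thread (outside the window).
        "Feb 2 07:45:54 zServe kernel: macvlan_process_broadcast+0xea/0x128 [macvlan]",
        # Made-up placeholder line, only here to exercise the filter.
        "Feb 7 00:30:00 zServe kernel: placeholder message for illustration",
    ]
    window = lines_in_window(
        log, datetime(2023, 2, 7, 0, 0), datetime(2023, 2, 7, 1, 0)
    )
    for entry in window:
        print(entry)
```

Widening the window by an hour or so on either side is probably sensible, since the first symptoms were noticed some time after whatever triggered them.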
February 8, 2023

8 minutes ago, echooffzack said: Is this a common point of failure on a lot of Unraid installs?

Macvlan yes, for a considerable number of users; networking in general, no.