February 7, 2016
Hi Everyone,
First issue with unRAID after using it for a few months. In the 2 months of using unRAID I've had 2 instances where the box drops off the network. VMs are uncontactable, the unRAID web GUI will not load, I cannot ping the box, CCTV cameras stop recording, etc. The way to get the box back online is to disconnect the network cable and reconnect it; then everything wakes back up.
The log doesn't show it going down, only me disconnecting and reconnecting:
Feb 6 19:00:01 unRAID logger: mover finished
Feb 6 19:03:37 unRAID kernel: e1000e: eth0 NIC Link is Down
Feb 6 19:03:37 unRAID kernel: br0: port 1(eth0) entered disabled state
Feb 6 19:03:46 unRAID kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Feb 6 19:03:46 unRAID kernel: br0: port 2(eth1) entered listening state
Feb 6 19:03:46 unRAID kernel: br0: port 2(eth1) entered listening state
Feb 6 19:04:01 unRAID kernel: br0: port 2(eth1) entered learning state
Feb 6 19:04:16 unRAID kernel: br0: topology change detected, propagating
Feb 6 19:04:16 unRAID kernel: br0: port 2(eth1) entered forwarding state
Feb 6 20:00:01 unRAID logger: mover started
The box is set to use a static IP and the NIC is an HP Dual Port PCI-e Gigabit Server Network Adapter NC360T.
Would appreciate any advice on how to resolve this issue.
Cheers, Kris
February 8, 2016 Author
I was thinking of re-enabling my on-board NIC and teaming them. Would I receive notification if the NC360T (or on-board) NIC becomes unresponsive? So far I haven't with just the card being run singly.
May 18, 2016 Author
So this is still an issue. I swapped from the NC360T NIC to the onboard Realtek NIC. Same issue.
May 19, 2016
I had this issue also. It seemed to be connected to a network copy I did from a VM (Windows 10) to a USB external drive share via Unassigned Devices. I was also able to replicate the network drop by setting up a WiFi printer share on the VM (Windows 10) and having another machine connect to the printer share. I had to disconnect the cable and reconnect it to get access back to the machine. I was also able to replicate the issue using another network card.
I have changed what I do (I don't share the printer and no longer share USB drives via Unassigned Devices), so I have not had the issue come back. I have also upgraded my server and now use 6.2 beta. But I did have this very issue. (Glad it was not just me.)
May 19, 2016 Author
Quote: "I had this issue also, it seemed to be connected when I did a network copy from a VM (Windows 10) to a USB external drive share via unassigned devices. [...]"
Interesting! I have 2 VMs, each connected to an unassigned (HDD) device. Do you think this could be why? I'm going to see if I can use your method to replicate the error. Really appreciate your reply. This is the first lead I've had since getting the issue.
May 19, 2016 Author
I'm changing the way the VMs see the disks. I'm adding them to the shares: setting Global Share Settings to exclude the drive, then creating a share and setting that individual share to include the (globally excluded) drive. This way I don't have to do the XML hack to pass the drive through, but at the same time only the VM(s) will be writing to that particular drive.
May 19, 2016
Being new to unRAID and Linux I have no idea how it could be linked to Unassigned Devices. But it was linked to activity in the VM....
May 20, 2016
Interesting! I'm just testing out the trial version before I part with my wonga and I had the same problem yesterday evening. All network activity died and the UI stopped responding. The weird thing was that I could still telnet in. I've since changed the NIC and uninstalled the Unassigned plugin, thinking they were at fault. However, after reading this, I realised that I also had a VM open and was directly passing an unassigned drive through (it's old data in software RAID).
I'm going to have to start the VM again to try and transfer the data off my drive, and will keep an eye on things. It will be a real shame if this does cause issues, as I prefer to pass through devices rather than use shares.
May 20, 2016 Author
I ended up not adding the drives to the array, as I don't want the drives to be added to the parity pool. I used the Unassigned Devices plugin to mount the drives and create shares. I flipped back to the NC360T NIC too. As we all know, this can take weeks to rear its head again. But stay posted, and update with your findings.
May 20, 2016
Quote: "Interesting! I'm just testing out the trial version before I part with my wonga and I had the same problem yesterday evening. All network activity died and the UI stopped responding. The weird thing was that I could still telnet in. [...]"
You're using the beta? There is a known issue with a hanging array, which I think is your issue, given the clue that you can still telnet in, i.e. as long as nothing is querying the array, it's fine.
May 22, 2016
Quote: "You're using beta? There is a known issue with hanging array which I think is your issue, give the clue that you still can telnet in i.e. as long as it's not querying the array, it's fine."
Yes I am. I'm moving from a Xen virtualised Dom0 and there seemed to be some notable improvements in the beta regarding virtualisation. To be honest, I did read the known issues before deciding whether to install and I don't remember reading about the hanging array. If that's the case, at least it's known. Once it's fixed I'll post my findings on stability again for direct passthrough. Thanks for the heads-up!
May 25, 2016
Quote: "Yes I am. I'm moving from a Xen virtualised Dom0 and there seemed to be some notable improvements in the beta regarding virtualisation. [...]"
The hanging array only seems to happen to some people (the devs have said they just can't reproduce it in the lab). They have proposed the workaround of increasing the md stripe attribute, and it seems to fix it.
May 27, 2016
Quote: "7 days strong so far. *fingers crossed*"
I've not had the issue reoccur yet either. I've disabled the bond as well, as I did have an instance where br0 disappeared and only eth0 was present, causing my VMs to have difficulty connecting to my LAN. I'm going to do some large-scale copies between my VM (passthrough drive) and the unRAID share, so it will be interesting to see if it all works! I'll report back with my findings.
Quote: "The hanging array only seems to happen to some people (as the devs have said they just can't reproduce it in the lab). They have proposed the workaround to increase the md stripe attribute and it seems to fix it."
Interesting! I'll not change any stripe settings yet (I don't want to make too many changes before testing) and see how things go!
June 2, 2016 Author
SIGH. It happened again. No GUI, no ping, no Telnet, no SSH. What should I be looking for?
My Internet went down at 1:20am. unRAID lost its network connection at 5:55am. Any connection?
June 2, 2016
Install the Fix Common Problems plugin and run troubleshooting mode.
June 2, 2016 Author
Quote: "Install fix common problems plugin and run troubleshooting mode"
Good idea Squid, that's now done. Cheers
June 14, 2016 Author
OK, it happened again. I had Fix Common Problems troubleshooting mode enabled. The CCTV camera stopped being visible at 17:29. I unplugged the network cable on the unRAID box at 17:32, reconnected it and everything came alive. I've attached the log files from this time.
All I've been able to find so far is the kernel detecting the link being unplugged then replugged:
Jun 14 17:22:38 unRAID logger: Fix Common Problems: rootfs (/) currently 3 % full
Jun 14 17:31:20 unRAID kernel: e1000e: eth2 NIC Link is Down
Jun 14 17:31:20 unRAID kernel: br0: port 3(eth2) entered disabled state
Jun 14 17:31:23 unRAID kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Jun 14 17:31:23 unRAID kernel: br0: port 3(eth2) entered listening state
Jun 14 17:31:23 unRAID kernel: br0: port 3(eth2) entered listening state
Jun 14 17:31:38 unRAID kernel: br0: port 3(eth2) entered learning state
Jun 14 17:31:53 unRAID kernel: br0: topology change detected, propagating
Jun 14 17:31:53 unRAID kernel: br0: port 3(eth2) entered forwarding state
Nothing around the time of 17:29. What/where should I be looking?
Log zip too big to attach (250kb), find it here: http://s000.tinyupload.com/index.php?file_id=00205517858879269474
syslog: http://s000.tinyupload.com/?file_id=95662302419157254579
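For anyone combing through a saved syslog the same way, here is a minimal sketch that pulls the NIC link events out of the log and measures each Down-to-Up gap. It assumes the kernel message format shown in this thread; syslog timestamps carry no year, so `year` is a supplied assumption, and `link_events` and `outages` are illustrative names, not part of any unRAID tool.

```python
import re
from datetime import datetime

# Matches kernel link-state lines such as:
#   "Jun 14 17:31:20 unRAID kernel: e1000e: eth2 NIC Link is Down"
LINK_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+\S+\s+"
    r"(?P<nic>eth\d+) NIC Link is (?P<state>Up|Down)"
)

# Two lines copied from the June 14 log excerpt in this thread.
SAMPLE = [
    "Jun 14 17:31:20 unRAID kernel: e1000e: eth2 NIC Link is Down",
    "Jun 14 17:31:23 unRAID kernel: e1000e: eth2 NIC Link is Up 1000 Mbps "
    "Full Duplex, Flow Control: None",
]


def link_events(lines, year=2016):
    """Return (timestamp, nic, state) tuples for every link up/down line."""
    events = []
    for line in lines:
        m = LINK_RE.match(line)
        if m:
            # The year is an assumption: syslog lines do not include one.
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            events.append((ts, m["nic"], m["state"]))
    return events


def outages(events):
    """Pair each Down with the next Up on the same NIC."""
    down_since = {}
    result = []
    for ts, nic, state in events:
        if state == "Down":
            down_since[nic] = ts
        elif nic in down_since:
            result.append((nic, down_since.pop(nic), ts))
    return result


if __name__ == "__main__":
    for nic, down, up in outages(link_events(SAMPLE)):
        print(f"{nic}: down {down:%H:%M:%S}, up {up:%H:%M:%S}, "
              f"gap {(up - down).total_seconds():.0f}s")
```

On the June 14 lines this reports a 3-second gap on eth2, which lines up with the manual unplug and replug rather than with the 17:29 camera dropout.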
June 15, 2016
I don't see a connection with the networking, so that may be a 'red herring', i.e. irrelevant. What I do see is the CCTV VM running all out (100% CPU or very close to it!) since at least June 3 8:10am, when you did the Troubleshooting mode tail. You booted the system on May 19, but there are no syslogs before June 3, apparently all deleted. All syslog activity since is Troubleshooting logging, plus a little Mover logging.
You need to reboot now, both to stop the Troubleshooting mode and to restart the CCTV VM. The problem is something wrong in that VM, running wild. If you have trouble again, check a ps report and see what the CPU is doing in that VM. If it's maxed out, restart the VM. I can't tell what OS is running in there, but you'll need to debug it. For help with that, you'll need to go to whoever helped set it up.
You probably should have rebooted or restarted CCTV on June 3 or sooner, whenever you first detected CCTV trouble!
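A rough sketch of the "check a ps report" step: given the text of `ps aux`, pick out each qemu process and its %CPU column. It assumes the standard `ps aux` column order and that the VM name follows qemu's `-name` option, as in the line posted later in this thread; `qemu_cpu_usage` is an illustrative helper, not an unRAID command.

```python
import re

# The VM name follows "-name"; newer libvirt versions write "-name guest=NAME,...".
QEMU_NAME_RE = re.compile(r"-name\s+(?:guest=)?([^,\s]+)")

# The (truncated) qemu line posted in this thread.
SAMPLE_PS = (
    "root 14894 91.1 36.1 13128848 8946816 ? SLl May19 36235:15 "
    "/usr/bin/qemu-system-x86_64 -name CCTV -S -machine pc-i440fx-2.3,accel=kvm"
)


def qemu_cpu_usage(ps_text):
    """Return {vm_name: percent_cpu} for every qemu process in `ps aux` text."""
    usage = {}
    for line in ps_text.splitlines():
        # ps aux columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or "qemu-system" not in fields[10]:
            continue
        m = QEMU_NAME_RE.search(fields[10])
        if not m:
            continue
        try:
            usage[m.group(1)] = float(fields[2])
        except ValueError:
            continue  # header line or unexpected column contents
    return usage


if __name__ == "__main__":
    for name, pcpu in sorted(qemu_cpu_usage(SAMPLE_PS).items()):
        print(f"{name}: {pcpu:.1f}% CPU")
```

To use it on a live box, feed it the captured output of `ps aux` instead of `SAMPLE_PS`.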
June 15, 2016 Author
Hi Robj,
I see what you're saying; the ps report shows 91% CPU usage:
root 14894 91.1 36.1 13128848 8946816 ? SLl May19 36235:15 /usr/bin/qemu-system-x86_64 -name CCTV -S -machine pc-i440fx-2.3,accel=kvm,usb=off,mem-merge=off -cpu host,hv_time,hv_relaxed,hv_vapic,hv_spinlocks=0x1fff -m 8192 -realti
I think this is false though.
[Screenshot: unRAID Dashboard]
[Screenshot: from inside the CCTV VM (Windows 2012 R2)]
In saying that, the machine has been rebooted many times, especially when swapping the NICs to test. The problem has persisted.
Not sure why I should have rebooted the CCTV VM when all VMs and the unRAID box itself were contactable?
June 16, 2016
I don't think it's false, just not the whole story. Your graph is showing aggregate totals, and with extra CPUs, it looks fine. But are those extras available in that VM? If only cores 2 and 6 were assigned to it (speculating, I don't know, but they're the busiest), then CCTV would be experiencing huge lag, perhaps sufficient at moments to cause timeouts, resulting in occasional small breakages.
What's a little odd is that on June 3 it was at 100%, then later it had dropped to 99, then sometime later to 98, and very slowly slid to the 91 you saw. With each drop, average lag will slowly decrease, but it's still high enough that in moments of high demand, you could still see brief timeouts.
I still think you need to find out what is using so much CPU in that VM. High CPU means high internal processing, not I/O, which means the CPU is a bottleneck! That's ridiculous! I/O should almost always be the bottleneck.
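One possible explanation for the slow slide from 100 to 91: the %CPU column in procps `ps` is total CPU time divided by how long the process has been running, so it is a lifetime average rather than a reading of the current moment. For a multi-threaded process such as qemu, CPU time is summed across threads, so the figure can also exceed 100%. The sketch below just does that arithmetic on the TIME and %CPU values posted in this thread; the function names are illustrative, and it only handles the `MMM:SS` form of TIME shown here.

```python
def cputime_minutes(time_field):
    """Convert a ps TIME value in MMM:SS form (e.g. '36235:15') to minutes."""
    minutes, seconds = time_field.split(":")
    return int(minutes) + int(seconds) / 60


def lifetime_pcpu(time_field, elapsed_minutes):
    """%CPU as ps computes it: CPU time used / time the process has existed."""
    return 100 * cputime_minutes(time_field) / elapsed_minutes


def implied_elapsed_days(time_field, pcpu):
    """Work backwards: how old must the process be to show this %CPU?"""
    return cputime_minutes(time_field) / (pcpu / 100) / (24 * 60)


if __name__ == "__main__":
    # Values taken from the ps line posted in this thread.
    days = implied_elapsed_days("36235:15", 91.1)
    print(f"implied process age: {days:.1f} days")
```

With the posted values the implied process age is about 27.6 days, consistent with a VM started on May 19. A lifetime average of 91% says the VM has been busy for most of that period, but it cannot say how busy it is right now; a tool that samples over a short interval would be needed for that.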
June 16, 2016
Quote: "I don't think it's false, just not the whole story. Your graph is showing aggregate totals, and with extra CPU's, it looks fine. [...]"
Just a note that the ps dump from FCP is unaltered (it only omits anything that is at 0%). TBH, between the Dynamix display and the ps output, I'd be more inclined to trust the ps output. That being said, IIRC the system load that was also logged never went over 1, which means that (assuming you're running a multicore CPU) the CPU wasn't starved either.
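The load-average reasoning can be sketched as a simple rule of thumb: divide the load average by the number of logical CPUs, and treat a sustained value above 1.0 per core as a sign of CPU contention. The threshold and the helper names are assumptions for illustration; note also that on Linux the load average counts tasks in uninterruptible sleep as well as runnable ones, so it is only a rough indicator.

```python
import os


def load_per_core(load_avg, cores):
    """Load average normalised by the number of logical CPUs."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_avg / cores


def cpu_starved(load_avg, cores, threshold=1.0):
    """Rule of thumb (an assumption): more than `threshold` tasks per core."""
    return load_per_core(load_avg, cores) > threshold


if __name__ == "__main__":
    one_min, five_min, fifteen_min = os.getloadavg()  # Unix only
    cores = os.cpu_count() or 1
    for label, load in (("1 min", one_min), ("5 min", five_min),
                        ("15 min", fifteen_min)):
        state = "starved" if cpu_starved(load, cores) else "ok"
        print(f"{label}: load {load:.2f} on {cores} cores -> {state}")
```

By this measure a load that never exceeds 1 on a multicore host is well under the threshold, which is the point being made above. It describes the host as a whole, though, and says nothing about contention on the specific cores a VM is pinned to.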
June 20, 2016 Author
Did some reading on CPU core allocation and I think I've found that my setup was poor / wrong. Here is the old vs. new setup:
[Screenshot: old vs. new CPU core allocation]
I read that the unRAID OS favours core 0, so I left this unallocated. I separated the 2 VMs to ensure they didn't overlap on the same cores.
Is this better / correct?
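The two rules described here (leave core 0 for the host, and don't let VMs share cores) are easy to check mechanically. A minimal sketch follows; the core numbers and the second VM's name are hypothetical, since the actual allocation was only shown in a screenshot, and `check_pinning` is an illustrative helper. It does not account for hyperthread sibling pairs.

```python
def check_pinning(vm_cores, reserved=frozenset({0})):
    """Return a list of problems: VMs using reserved cores or sharing cores.

    vm_cores maps a VM name to the set of host cores pinned to it.
    """
    problems = []
    names = sorted(vm_cores)
    for i, a in enumerate(names):
        on_reserved = vm_cores[a] & reserved
        if on_reserved:
            problems.append(f"{a} uses reserved core(s) {sorted(on_reserved)}")
        for b in names[i + 1:]:
            shared = vm_cores[a] & vm_cores[b]
            if shared:
                problems.append(f"{a} and {b} share core(s) {sorted(shared)}")
    return problems


if __name__ == "__main__":
    # Hypothetical layouts: the real ones were only posted as a screenshot.
    old = {"CCTV": {0, 1, 2}, "VM2": {2, 3}}
    new = {"CCTV": {1, 2, 3}, "VM2": {4, 5}}
    for label, layout in (("old", old), ("new", new)):
        issues = check_pinning(layout)
        print(label, "->", issues if issues else "no overlap, core 0 free")
```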