dgomel Posted January 16, 2020 (edited) Dear Gurus, I need your help with a problem. Over the last couple of months my Unraid server has become unstable. It had been working perfectly for years. Recently I added a hard drive and updated Unraid to a newer version. I attempted some troubleshooting but couldn't find the root cause. The problem is that once in a while everything stack. No networking. The fans are on, no beeps from the BIOS. I think it could be a memory problem somewhere in the upper addresses or some other kind of hardware problem. Please help me find the root cause. tower-diagnostics-20200116-0928.zip Edited March 24, 2020 by dgomel: Solved
dgomel Posted January 16, 2020 Author (edited) Memtest86+ shows no problems after a couple of passes. I see fast growth in zombie processes in top: around 100 in 5 minutes. Could this be the problem? I traced the source of the zombies to a specific container. The problem started before I set up that container, so it seems irrelevant. Edited January 16, 2020 by dgomel
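For anyone chasing a similar zombie build-up, a rough way to see which parent is leaking them is to group defunct processes by parent PID. The sketch below is only an illustration: it assumes a procps-style `ps` that accepts `-eo stat,ppid,comm`, the function names are made up here, and nothing in it is specific to Unraid.

```python
import subprocess
from collections import Counter


def read_ps() -> str:
    """Return the output of `ps -eo stat,ppid,comm` (assumes procps-style ps)."""
    result = subprocess.run(
        ["ps", "-eo", "stat,ppid,comm"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def zombies_by_parent(ps_output: str) -> Counter:
    """Count zombie (state Z) processes per parent PID."""
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 2:
            continue
        stat, ppid = fields[0], fields[1]
        # The header row ("STAT PPID COMMAND") is skipped by these checks too.
        if stat.startswith("Z") and ppid.isdigit():
            counts[int(ppid)] += 1
    return counts
```

Calling `zombies_by_parent(read_ps()).most_common(5)` on the server would list the five parents holding the most zombies. Running it twice a few minutes apart shows whether one parent's count keeps growing.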
dgomel Posted January 20, 2020 Author (edited) I found the problem. The problem started when I enabled VT-d with a SYBA SI-PEX40108 (Marvell 88SE9215) installed. After replacing it with an LSI 9211-8i, everything went back to working. Update: unfortunately, the problem persists. It looks like it depends on VM usage. Latest diagnostics and syslog attached. tower-diagnostics-20200121-0930.zip syslog (6).zip Edited January 21, 2020 by dgomel: Problem persists.
jonp Posted January 21, 2020 Hi there, Looks like you need to update your forum signature, as your hardware has changed from AMD to Intel ;-). That said, I'm not seeing any events in the logs themselves that show a problem. And I'm not sure what this is supposed to mean:
On 1/16/2020 at 9:11 AM, dgomel said: The problem is that once in a while everything stack.
Also hoping that you removed this container while troubleshooting this:
On 1/16/2020 at 9:51 AM, dgomel said: I localized a source of zombies to specific container. The problem started before I implemented the container. So, seems irrelevant.
At this point, I would suggest hooking up a monitor and keyboard to the system and tailing the log (command is tail /var/log/syslog -f). This will begin printing the log out to the screen. Then try to get the system to crash again and capture whatever was printed to the screen (use your phone to take a picture if necessary). This should give us at least some information to point towards a cause. I'd also check for a BIOS update on your motherboard.
dgomel Posted January 21, 2020 Author I mirrored the syslog to flash and provided it above. I can't see anything in the syslog. The outages can easily be found by the gaps in the output followed by a new startup sequence. A console is connected. The BIOS is the latest for the motherboard.
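The "gaps in the output followed by a new startup sequence" check can be automated rather than eyeballed. The following is a minimal sketch under a few assumptions: lines start with a classic syslog timestamp such as `Mar 6 13:07:46` (no year, so one is passed in), a boot is recognised by the `kernel: Linux version` banner seen in this thread's logs, and the ten-minute default threshold is arbitrary. The function names are hypothetical.

```python
from datetime import datetime, timedelta

BOOT_MARKER = "kernel: Linux version"  # banner printed early in every boot


def parse_stamp(line, year):
    """Parse the leading 'Mon D HH:MM:SS' of a syslog line, or return None."""
    parts = line.split(None, 3)
    if len(parts) < 4:
        return None
    try:
        return datetime.strptime(
            f"{year} {' '.join(parts[:3])}", "%Y %b %d %H:%M:%S"
        )
    except ValueError:
        return None


def find_outages(lines, year, max_gap=timedelta(minutes=10)):
    """Return (kind, previous_line_time, this_line_time) tuples.

    kind is "gap" when two consecutive lines are further apart than max_gap,
    and "boot" when a line carries the kernel boot banner.
    Logs that cross a year boundary are not handled.
    """
    events = []
    prev = None
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue
        if prev is not None and stamp - prev > max_gap:
            events.append(("gap", prev, stamp))
        if BOOT_MARKER in line:
            events.append(("boot", prev, stamp))
        prev = stamp
    return events
```

A "gap" immediately followed by a "boot" is the pattern described above: the log goes silent and then a fresh startup begins. An idle server also produces quiet periods, so the threshold needs tuning to how chatty the log normally is.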
dgomel Posted January 21, 2020 Author
27 minutes ago, jonp said: you need to update your forum signature
Done. No improvement.
jonp Posted January 21, 2020 The syslog-to-flash method is not valid for capturing log events related to major crashes like the ones you're experiencing. The problem is that the hang can occur before the write to the flash can complete. This is why I am suggesting you connect a monitor and keyboard to the system.
dgomel Posted January 21, 2020 Author Got it. Will work this out. Thanks for your time.
dgomel Posted January 21, 2020 Author I think I can safely exclude overheating of the hardware. I created a high CPU load along with average I/O and kept it running for a couple of hours. No crashes.
dgomel Posted January 22, 2020 Author One of my theories was that hardware monitoring in the BIOS could be involved. Yesterday I went to check this and found nothing related to a temperature threshold. While I was there, I changed the CPU governor setting to Performance mode. In addition, I found yesterday that dynamix.system.temp.plg hadn't been updated for a while. When I tried to update it, the update failed, so I uninstalled it and installed it again. After these two changes the system has been running for a day with no crashes. I'll keep monitoring.
dgomel Posted January 29, 2020 Author (edited) I think it's safe to move the case to solved. I didn't try changing the CPU governor, but something tells me the plugin was most probably the root cause. Thanks for your help. Update: the problem persists. It depends on virtualization being enabled. The system works with the VM manager and Docker turned off. So far I have tested all components except the motherboard; that test is upcoming. I was able to catch a failure on the console once. Screenshot attached. Attaching the cumulative syslog as well. Would appreciate your thoughts. tower-diagnostics-20200226-0756.zip syslog.zip Edited February 26, 2020 by dgomel: Problem persists.
dgomel Posted March 10, 2020 Author I just completely upgraded the hardware... and got my first restart this morning. I'll give it another week or two before I drop the product completely. To say I'm frustrated is a bit of an understatement.
dgomel Posted March 10, 2020 Author I'm attaching fresh diagnostics, on the new hardware... tower-diagnostics-20200310-0943.zip
jonp Posted March 10, 2020
On 1/21/2020 at 12:50 PM, jonp said: At this point, I would suggest hooking up a monitor and keyboard to the system and tailing the log (command is tail /var/log/syslog -f). This will begin printing the log out to the screen. Then try to get the system to crash again and capture whatever was printed to the screen (use your phone to take a picture if necessary). This should give us at least some information to point towards a cause.
This was my last major suggestion for you so we could determine what is causing the instability. We don't see any events in the log because the system is crashing before they can be written. By doing what I quoted above, you will see the last events prior to the crash and can post those here. The picture you've posted above only has a few lines of log and then I see a printout from top. This cuts out a lot of other log entries. Please just boot up the system, attach a monitor and keyboard, log in via the console, type "tail /var/log/syslog -f" and leave that up until it crashes. Then capture a picture of what's on the screen and post it here.
dgomel Posted March 10, 2020 Author (edited) On the new hardware (new MB/CPU/memory/PSU) I see system restarts. It doesn't hang as it did before, so the tactic of waiting at the console doesn't work :(. Edited March 11, 2020 by dgomel
dgomel Posted March 16, 2020 Author BTW, the old hardware was reused as a desktop and works fine under Win10.
Dissones4U Posted March 16, 2020 @dgomel
From lsscsi.txt:
[0:0:0:0] disk UFD 2.0 Silicon-Power8G 1100 /dev/sda /dev/sg0
state=running queue_depth=1 scsi_level=5 type=0 device_blocked=0 timeout=30
dir: /sys/bus/scsi/devices/0:0:0:0 [/sys/devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.0/host0/target0:0:0/0:0:0:0]
The reviews I've found say that's not a great flash drive. Have you ruled out the flash as the problem?
From syslog.txt:
Mar 10 08:31:59 Tower kernel: usb 1-6: new high-speed USB device number 3 using xhci_hcd (from lsusb.txt Device 003: ID 090c:1000 Silicon Motion, Inc. - Taiwan (formerly Feiya Technology Corp.) Flash Drive)
Mar 10 08:31:59 Tower kernel: usb-storage 1-6:1.0: USB Mass Storage device detected
Mar 10 08:31:59 Tower kernel: scsi host0: usb-storage 1-6:1.0
Mar 10 08:31:59 Tower kernel: scsi 0:0:0:0: Direct-Access UFD 2.0 Silicon-Power8G 1100 PQ: 0 ANSI: 4
From lspci.txt:
00:14.0 USB controller [0c03]: Intel Corporation 200 Series/Z370 Chipset Family USB 3.0 xHCI Controller [8086:a2af]
Subsystem: Micro-Star International Co., Ltd. [MSI] 200 Series PCH USB 3.0 xHCI Controller [1462:7a70]
Kernel driver in use: xhci_hcd
This is the only USB controller listed in lspci.txt, so make sure your flash is not using USB3. Hopefully this leads you in the right direction; if not, always include diagnostics. At this point I'd say you should also include the syslog from the syslog server folder.
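The kernel line quoted above already carries two useful facts: "high-speed" is the USB 2.0 signalling rate (480 Mbit/s), and `xhci_hcd` is the driver for xHCI host controllers, which also serve USB 2.0 devices. The line therefore shows how the stick negotiated, not which physical port it sits in. As a hedged sketch, the same fields can be pulled out of a whole syslog with a regular expression modelled on that one message; the function name is invented for illustration, and other speed strings (for example "SuperSpeed") are assumed to follow the same wording.

```python
import re

# Modelled on: "usb 1-6: new high-speed USB device number 3 using xhci_hcd"
USB_ATTACH = re.compile(
    r"usb (?P<port>[\d.-]+): new (?P<speed>.+?) USB device "
    r"number (?P<num>\d+) using (?P<driver>\S+)"
)


def usb_attach_events(lines):
    """Return (port, speed, driver) for every USB attach message found."""
    events = []
    for line in lines:
        match = USB_ATTACH.search(line)
        if match:
            events.append((match["port"], match["speed"], match["driver"]))
    return events
```

Feeding it the syslog before and after moving the flash drive makes it easy to confirm that the bus/port path (here `1-6`) and the driver actually changed.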
dgomel Posted March 16, 2020 Author (edited) Thanks, I appreciate your input. The USB drive is the only part, besides the array, that was not replaced. The flash drive is indeed in a USB3 slot. I'm going to switch it immediately. I'm attaching a cumulative log, starting from the first boot on the new hardware. syslog (11).zip Edited March 16, 2020 by dgomel
Dissones4U Posted March 16, 2020
37 minutes ago, dgomel said: I'm attaching a cumulative log, starting from the first boot on new hardware.
Hopefully moving from USB3 to USB2 fixes it (also remember that the quality of the flash is important, see the recommendations here). If you continue to have issues: I'm still learning how to read these logs, but could the snippet below indicate trouble starting SMB shares, and would that restart the server?
Mar 6 13:07:38 Tower avahi-daemon[9564]: Service group file /services/smb.service vanished, removing services.
Mar 6 13:07:38 Tower emhttpd: shcmd (123): /etc/rc.d/rc.nfsd stop
Mar 6 13:07:38 Tower rpc.mountd[9540]: Caught signal 15, un-registering and exiting.
Looking at the next snippet, it is clear that the server just snaps into a reboot rather than the hang originally described, so I suspect that tailing the log would not help in this case.
Mar 6 13:07:45 Tower rpc.mountd[14158]: Caught signal 15, un-registering and exiting.
Mar 6 13:07:46 Tower sshd[9433]: Received signal 15; terminating.
Mar 6 13:07:46 Tower haveged: haveged: Stopping due to signal 15
Mar 6 13:07:46 Tower ntpd[1749]: ntpd exiting on signal 1 (Hangup)
Mar 6 13:07:46 Tower ntpd[1749]: 127.127.1.0 local addr 127.0.0.1 -> <null>
Mar 6 13:07:46 Tower ntpd[1749]: 45.62.214.53 local addr 192.168.22.47 -> <null>
Mar 6 13:07:46 Tower ntpd[1749]: 216.232.132.31 local addr 192.168.22.47 -> <null>
Mar 6 13:07:46 Tower ntpd[1749]: 209.115.181.107 local addr 192.168.22.47 -> <null>
Mar 6 13:07:46 Tower kernel: nfsd: last server has exited, flushing export cache
Mar 6 13:07:46 Tower rc.inet1: ip -4 route flush default dev br0
Mar 6 13:07:46 Tower rc.inet1: ip -4 addr flush dev br0
Mar 6 13:07:46 Tower rc.inet1: ip link set br0 down
Mar 6 13:07:46 Tower kernel: br0: port 1(eth0) entered disabled state
Mar 6 13:07:46 Tower rc.inet1: ip link set eth0 promisc off nomaster
Mar 6 13:07:46 Tower kernel: device eth0 left promiscuous mode
Mar 6 13:07:46 Tower kernel: br0: port 1(eth0) entered disabled state
Mar 6 13:07:46 Tower rc.inet1: ip link set br0 down
Mar 6 13:07:46 Tower rc.inet1: ip link del br0
Mar 6 13:07:46 Tower rc.inet1: ip link set lo down
REBOOTS HERE (I'm just not sure what the above info implies)
Mar 6 13:09:19 Tower kernel: microcode: microcode updated early to revision 0xca, date = 2019-10-03
Mar 6 13:09:19 Tower kernel: Linux version 4.19.94-Unraid (root@Develop) (gcc version 9.2.0 (GCC)) #1 SMP Thu Jan 9 08:20:36 PST 2020
Mar 6 13:09:19 Tower kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot
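On what the excerpt above might imply: signal 15 is SIGTERM, which services normally receive during a regular shutdown or reboot sequence, and the `rc.inet1` lines show the network being taken down in order. That pattern suggests the system went through an orderly shutdown before this boot rather than losing power or panicking, although the log alone does not say what initiated it. A rough sketch for sorting boots in a cumulative syslog into the two cases follows; the marker strings come from this thread's logs, while the function name, the labels, and the 50-line lookback are arbitrary assumptions.

```python
BOOT_MARKER = "kernel: Linux version"  # banner printed early in every boot
SHUTDOWN_HINTS = ("signal 15",)        # services exiting on SIGTERM


def classify_boots(lines, lookback=50):
    """Label each boot by what the log shows just before it.

    Returns (line_index, label) pairs. The label is "orderly" if any of the
    `lookback` lines before the boot banner contains a shutdown hint, else
    "abrupt". A boot at the very start of the log is reported as "abrupt"
    only because there is nothing before it to inspect.
    """
    results = []
    for index, line in enumerate(lines):
        if BOOT_MARKER not in line:
            continue
        window = lines[max(0, index - lookback):index]
        orderly = any(hint in prev for prev in window for hint in SHUTDOWN_HINTS)
        results.append((index, "orderly" if orderly else "abrupt"))
    return results
```

An "abrupt" result matches the original hangs and hard resets, where nothing is logged before the next boot. An "orderly" one points toward something requesting a shutdown or reboot, which is worth knowing before spending more time on console captures.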
dgomel Posted March 17, 2020 Author I switched over to USB2. I'll keep monitoring. Thanks for the reply and analysis.
dgomel Posted March 24, 2020 Author The system has been stable for a week since the switch from USB3 to USB2. @Dissones4U thanks for your findings.