(SOLVED) - My Unraid Got unstable

dgomel · January 16, 2020

Dear Gurus,

I need your help with a problem. Over the last couple of months my Unraid server has become unstable. It had been working perfectly for years.

Recently I added a hard drive and updated Unraid to a newer version.

I attempted some troubleshooting but couldn't find the root cause.

The problem is that once in a while everything stack. No networking. The fans are on, no beeps from BIOS.

I think it could be a memory problem somewhere in the upper addresses or some other kind of hardware problem.

Please help me find the root cause.

tower-diagnostics-20200116-0928.zip

Edited March 24, 2020 by dgomel
Solved

dgomel · January 16, 2020

Memtest86+ shows no problems after a couple of passes.

I see fast growth in zombie processes in top, around 100 in 5 minutes. Could this be a problem?

I localized a source of zombies to specific container. The problem started before I implemented the container. So, seems irrelevant.
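For anyone tracking a similar symptom, here is a minimal sketch of one way to see which parent process is leaving zombies behind. It assumes the procps `ps` found on most Linux systems; the function name `zombies_by_parent` is made up for illustration.

```shell
# Count zombie (defunct) processes per parent PID.
# Input (stdin): lines of "STAT PPID", as printed by `ps -eo stat=,ppid=`.
# Output: "<count> <ppid>", largest count first.
zombies_by_parent() {
    awk '$1 ~ /^Z/ { n[$2]++ } END { for (p in n) print n[p], p }' | sort -rn
}
```

On the server this would be run as `ps -eo stat=,ppid= | zombies_by_parent`. The PID in the second column can then be looked up with `ps -o pid=,args= -p` followed by that PID.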

Edited January 16, 2020 by dgomel

dgomel · January 20, 2020

~~I found the problem.~~ The problem started when I enabled VT-d with a SYBA SI-PEX40108 (Marvell 88SE9215) installed. After replacing it with an LSI 9211-8i, everything went back to working.

Unfortunately, the problem persists. It looks like it depends on the use of VMs.

Latest Diagnostics and Syslog attached.

tower-diagnostics-20200121-0930.zip syslog (6).zip

Edited January 21, 2020 by dgomel
Problem persists.

jonp · January 21, 2020

Hi there,

Looks like you need to update your forum signature as your hardware has changed from AMD to Intel ;-). That said, I'm not seeing any events in the logs themselves that show a problem. And I'm not sure what this is supposed to mean:

On 1/16/2020 at 9:11 AM, dgomel said:

The problem is that once in a while everything stack.

Also hoping that you removed this container while troubleshooting this:

On 1/16/2020 at 9:51 AM, dgomel said:

I localized a source of zombies to specific container. The problem started before I implemented the container. So, seems irrelevant.

At this point, I would suggest hooking up a monitor and keyboard to the system and tailing the log (command is tail /var/log/syslog -f). This will begin printing the log out to the screen. Then try to get the system to crash again and capture whatever was printed to the screen (use your phone to take a picture if necessary). This should give us at least some information to point towards a cause.

I'd also check for a BIOS update on your motherboard.

dgomel · January 21, 2020

I forwarded the syslog to flash and attached it along with the diagnostics. I can't see anything in the syslog.

The outages can easily be found by the gaps in the log followed by a new boot sequence.
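A rough sketch of how those gaps can be located in a cumulative syslog: it prints every boot marker together with the last line logged before it. It assumes each boot logs a `kernel: Linux version` line near its start and that every line begins with the usual month, day and time fields; `boots_in_syslog` is a hypothetical helper name.

```shell
# Print each boot marker and the last line logged before that boot.
# Lines sharing the marker's timestamp (early boot messages such as the
# microcode line) are skipped when picking the "last line before boot".
# Reads the files given as arguments, or stdin if there are none.
boots_in_syslog() {
    awk '
    {
        ts = $1 " " $2 " " $3              # month, day, time
        if (ts != cur) { last = prev; cur = ts }
    }
    /kernel: Linux version/ {
        if (last != "") print "last line before boot: " last
        print "boot marker (line " NR "): " $0
    }
    { prev = $0 }
    ' "$@"
}
```

Run against the extracted log (for example `boots_in_syslog syslog.txt`), the time difference between each pair of printed lines gives the length of the outage.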

Console is connected. The BIOS is the latest for the MB.

dgomel · January 21, 2020

27 minutes ago, jonp said:

you need to update your forum signature

Done. No improvement

jonp · January 21, 2020

The syslog to flash method is not valid for capturing log events related to major crashes like you're experiencing. The problem is that the hang can occur before the write to the flash can occur. This is why I am suggesting you connect a monitor / keyboard to the system.

dgomel · January 21, 2020

Got it. Will work this out. Thanks for your time.

dgomel · January 21, 2020

I think I can safely exclude hardware overheating. I created a high CPU load along with average IO and kept it running for a couple of hours. No crashes.

dgomel · January 22, 2020

One of my theories was that hardware monitoring in the BIOS could be involved. Yesterday I went to check this and found nothing related to a temperature threshold. While I was at it, I changed the CPU governor setting to Performance mode.
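If the governor was changed on the Linux side (an assumption, since the post does not say where the setting lives), the active value can be read back from sysfs. A small sketch, with `show_governors` as an illustrative name and the standard Linux cpufreq path as the default:

```shell
# Print the active cpufreq governor for each CPU.
# $1 (optional): sysfs CPU directory; defaults to the standard Linux location.
show_governors() {
    base="${1:-/sys/devices/system/cpu}"
    for f in "$base"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue          # no cpufreq driver, or the glob did not match
        cpu=${f#"$base"/}                # strip the directory prefix
        printf '%s %s\n' "${cpu%%/*}" "$(cat "$f")"
    done
}
```

Calling `show_governors` with no argument after a reboot shows whether the Performance setting actually stuck.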

In addition, I found yesterday that dynamix.system.temp.plg hadn't been updated for a while. When I tried to update it, the update failed. So I uninstalled it and installed it again.

After these two changes the system has been running for a day with no crashes. I'll keep monitoring.

dgomel · January 29, 2020

~~I think it's safe to mark the case as solved. I didn't try changing the CPU governor, but something tells me the plugin is most probably the root cause.~~

~~Thanks for your help.~~

The problem persists. It depends on virtualization being enabled: the system is stable with the VM manager and Docker turned off.

So far I have tested all components except the MB; that test is upcoming. I managed to catch one failure on the console. Screenshot attached. Attaching the cumulative syslog as well. I would appreciate your thoughts.

tower-diagnostics-20200226-0756.zip syslog.zip

Edited February 26, 2020 by dgomel
Problem persists.

dgomel · March 10, 2020

I just completely upgraded the hardware... and got my first restart this morning.

I'll give it another week or two before I drop the product completely.

To say I'm frustrated is a bit of an understatement.

dgomel · March 10, 2020

I'm attaching fresh diagnostics from the new hardware...

tower-diagnostics-20200310-0943.zip

jonp · March 10, 2020

On 1/21/2020 at 12:50 PM, jonp said:

At this point, I would suggest hooking up a monitor and keyboard to the system and tailing the log (command is tail /var/log/syslog -f). This will begin printing the log out to the screen. Then try to get the system to crash again and capture whatever was printed to the screen (use your phone to take a picture if necessary). This should give us at least some information to point towards a cause.

This was my last major suggestion for you so we could determine what is happening that is causing the instability. We don't see any events in the log because the system is crashing before they can be written. By doing what I quoted above, you will see the last events prior to the crash and can post those here. The picture you've posted above only has a few lines of log and then I see a printout from Top. This cuts out a lot of other log entries. Please just boot up the system, attach a monitor and keyboard, login via the console, and type "tail /var/log/syslog -f" and just leave that up until it crashes. Then capture a picture of what's on the screen and post it here.

dgomel · March 10, 2020

On the new hardware (new MB/CPU/memory/PSU) I see system restarts. It doesn't hang as it did before.

So the tactic of waiting at the console doesn't work :(.

Edited March 11, 2020 by dgomel

dgomel · March 16, 2020

BTW, the old hardware was reused as a desktop and works fine under Win10.

Dissones4U · March 16, 2020

@dgomel

From lsscsi.txt

Quote

[0:0:0:0] disk UFD 2.0 Silicon-Power8G 1100 /dev/sda /dev/sg0
state=running queue_depth=1 scsi_level=5 type=0 device_blocked=0 timeout=30
dir: /sys/bus/scsi/devices/0:0:0:0 [/sys/devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.0/host0/target0:0:0/0:0:0:0]

The reviews I've found say that's not a great flash drive. Have you ruled out the flash as the problem?

From syslog.txt

Quote

Mar 10 08:31:59 Tower kernel: usb 1-6: new high-speed USB device number 3 using xhci_hcd

(from lsusb.txt: Device 003: ID 090c:1000 Silicon Motion, Inc. - Taiwan (formerly Feiya Technology Corp.) Flash Drive)

Mar 10 08:31:59 Tower kernel: usb-storage 1-6:1.0: USB Mass Storage device detected
Mar 10 08:31:59 Tower kernel: scsi host0: usb-storage 1-6:1.0
Mar 10 08:31:59 Tower kernel: scsi 0:0:0:0: Direct-Access UFD 2.0 Silicon-Power8G 1100 PQ: 0 ANSI: 4

From lspci.txt

Quote

00:14.0 USB controller [0c03]: Intel Corporation 200 Series/Z370 Chipset Family USB 3.0 xHCI Controller [8086:a2af]
Subsystem: Micro-Star International Co., Ltd. [MSI] 200 Series PCH USB 3.0 xHCI Controller [1462:7a70]
Kernel driver in use: xhci_hcd

This is the only USB controller listed in lspci.txt. Make sure your flash is not using a USB3 port.

Hopefully this leads you in the right direction. If not, always include diagnostics. At this point I'd say you should also include the syslog from the syslog server folder.
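As a rough cross-check, the kernel's USB enumeration lines can be pulled out of a syslog to see the speed each device negotiated. This is only a sketch: `usb_speeds` is a made-up helper name, and it assumes the message format shown in the syslog excerpt above.

```shell
# List USB enumeration events from a syslog as "<port> <speed class> <host driver>".
# "high-speed" is the USB 2.0 rate (480 Mb/s); "SuperSpeed" means USB 3.x.
# Reads the files given as arguments, or stdin if there are none.
usb_speeds() {
    sed -n 's/.*kernel: usb \([0-9.-]*\): new \(.*\) USB device number [0-9]* using \(.*\)$/\1 \2 \3/p' "$@"
}
```

Two caveats: a USB 2.0 stick still enumerates as high-speed when it is plugged into a USB3 port, and on some chipsets `xhci_hcd` serves the USB2 ports as well, so this output alone may not identify the physical port. `lsusb -t` on the live system prints the same information as a tree.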

dgomel · March 16, 2020

Thanks, I appreciate your input.

The USB drive is the only part, besides the array, that was not replaced.

The USB drive is indeed in a USB3 port. I'm going to switch it immediately.

I'm attaching a cumulative log, starting from the first boot on new hardware.

syslog (11).zip

Edited March 16, 2020 by dgomel

Dissones4U · March 16, 2020

37 minutes ago, dgomel said:

I'm attaching a cumulative log, starting from the first boot on new hardware.

Hopefully moving from USB3 to USB2 fixes it (also remember that the quality of the flash is important, see the recommendations here). If you continue to have issues: I'm still learning how to read these logs, but could the snippet below indicate trouble starting the SMB shares, and would that restart the server?

Quote

Mar 6 13:07:38 Tower avahi-daemon[9564]: Service group file /services/smb.service vanished, removing services.
Mar 6 13:07:38 Tower emhttpd: shcmd (123): /etc/rc.d/rc.nfsd stop
Mar 6 13:07:38 Tower rpc.mountd[9540]: Caught signal 15, un-registering and exiting.

Looking at the next snippet, it is clear that the server just snaps into a reboot rather than the hang originally described, so I suspect that tailing the log would not help in this case.

Quote

Mar 6 13:07:45 Tower rpc.mountd[14158]: Caught signal 15, un-registering and exiting.
Mar 6 13:07:46 Tower sshd[9433]: Received signal 15; terminating.
Mar 6 13:07:46 Tower haveged: haveged: Stopping due to signal 15
Mar 6 13:07:46 Tower ntpd[1749]: ntpd exiting on signal 1 (Hangup)
Mar 6 13:07:46 Tower ntpd[1749]: 127.127.1.0 local addr 127.0.0.1 -> <null>
Mar 6 13:07:46 Tower ntpd[1749]: 45.62.214.53 local addr 192.168.22.47 -> <null>
Mar 6 13:07:46 Tower ntpd[1749]: 216.232.132.31 local addr 192.168.22.47 -> <null>
Mar 6 13:07:46 Tower ntpd[1749]: 209.115.181.107 local addr 192.168.22.47 -> <null>
Mar 6 13:07:46 Tower kernel: nfsd: last server has exited, flushing export cache
Mar 6 13:07:46 Tower rc.inet1: ip -4 route flush default dev br0
Mar 6 13:07:46 Tower rc.inet1: ip -4 addr flush dev br0
Mar 6 13:07:46 Tower rc.inet1: ip link set br0 down
Mar 6 13:07:46 Tower kernel: br0: port 1(eth0) entered disabled state
Mar 6 13:07:46 Tower rc.inet1: ip link set eth0 promisc off nomaster
Mar 6 13:07:46 Tower kernel: device eth0 left promiscuous mode
Mar 6 13:07:46 Tower kernel: br0: port 1(eth0) entered disabled state
Mar 6 13:07:46 Tower rc.inet1: ip link set br0 down
Mar 6 13:07:46 Tower rc.inet1: ip link del br0
Mar 6 13:07:46 Tower rc.inet1: ip link set lo down

REBOOTS HERE, (I'm just not sure what the above info implies)

Mar 6 13:09:19 Tower kernel: microcode: microcode updated early to revision 0xca, date = 2019-10-03
Mar 6 13:09:19 Tower kernel: Linux version 4.19.94-Unraid (root@Develop) (gcc version 9.2.0 (GCC)) #1 SMP Thu Jan 9 08:20:36 PST 2020
Mar 6 13:09:19 Tower kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot

dgomel · March 17, 2020

I switched over to USB2. I'll keep monitoring. Thanks for the reply and analysis.

dgomel · March 24, 2020

The system has been stable for a week after the switch from USB3 to USB2. @Dissones4U thanks for your findings.
