(SOLVED) - My Unraid Got unstable


dgomel


Dear Gurus,

I need your help with a problem. Over the last couple of months my Unraid server has become unstable. It had been working perfectly for years.

Recently I added a hard drive and updated Unraid to a newer version.

I attempted some troubleshooting but couldn't find the root cause.

The problem is that once in a while everything gets stuck. No networking. The fans keep running, and there are no beeps from the BIOS.

I think it could be a memory problem somewhere in the upper addresses, or some other kind of hardware problem.

Please help me find the root cause.

tower-diagnostics-20200116-0928.zip

Edited by dgomel
Solved
Link to comment

Memtest86+ shows no problems after a couple of passes.

I see fast growth in the number of zombie processes visible in top: around 100 in 5 minutes. Could this be the problem?

I traced the source of the zombies to a specific container. The problem started before I set up that container, so it seems irrelevant.
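For anyone who wants to repeat that check, here is a minimal sketch of one way to group zombie processes by their parent PID. It assumes Python 3 and a procps-style `ps` are available (a stock Unraid install may not include Python); the function names `zombie_parents` and `live_zombie_parents` are made up for this example.

```python
import subprocess
from collections import Counter


def zombie_parents(ps_output: str) -> Counter:
    """Count zombie processes per parent PID.

    Expects the text produced by `ps -eo pid,ppid,stat,comm`.
    A process is a zombie when its STAT column starts with "Z".
    """
    counts = Counter()
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 3:
            continue  # blank or malformed line
        ppid, stat = fields[1], fields[2]
        if stat.startswith("Z"):
            counts[int(ppid)] += 1
    return counts


def live_zombie_parents() -> Counter:
    """Run ps on the local machine and count zombies per parent (Linux only)."""
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return zombie_parents(out)
```

Calling `live_zombie_parents().most_common(5)` lists the parents with the most zombies; looking up those parent PIDs in top or ps shows which service or container is failing to reap its children.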

Edited by dgomel
Link to comment

I found the problem. It started when I enabled VT-d with a SYBA SI-PEX40108 (Marvell 88SE9215) installed. After replacing it with an LSI 9211-8i, everything went back to working.

Unfortunately, the problem persists. It looks like it depends on VM usage.

Latest Diagnostics and Syslog attached. 

 

tower-diagnostics-20200121-0930.zip syslog (6).zip

Edited by dgomel
Problem persists.
Link to comment

Hi there,

 

Looks like you need to update your forum signature as your hardware has changed from AMD to Intel ;-).  That said, I'm not seeing any events in the logs themselves that show a problem.  And I'm not sure what this is supposed to mean:

On 1/16/2020 at 9:11 AM, dgomel said:

The problem is that once in a while everything stack.

Also hoping that you removed this container while troubleshooting this:

On 1/16/2020 at 9:51 AM, dgomel said:

I localized a source of zombies to specific container. The problem started before I implemented the container. So, seems irrelevant.

 

At this point, I would suggest hooking up a monitor and keyboard to the system and tailing the log (command is tail /var/log/syslog -f).  This will begin printing the log out to the screen.  Then try to get the system to crash again and capture whatever was printed to the screen (use your phone to take a picture if necessary).  This should give us at least some information to point towards a cause.

 

I'd also check for a BIOS update on your motherboard.

Link to comment

One theory was that hardware monitoring in the BIOS might be involved. Yesterday I went to check this and found nothing related to a temperature threshold. While I was there, I changed the CPU governor setting to Performance mode.

In addition, I found yesterday that dynamix.system.temp.plg hadn't been updated for a while. When I tried to update it, the update failed, so I uninstalled it and installed it again.

After these two changes the system has been running for a day with no crashes. I'll keep monitoring.

Link to comment

I think it's safe to mark the case as solved. I didn't try changing the CPU governor, but something tells me the plugin is most probably the root cause.

Thanks for your help. 

 

The problem persists. It depends on virtualization being enabled: the system is stable with the VM manager and Docker turned off.

So far I have tested all components except the motherboard; that test is upcoming. I was able to catch a failure on the console once. Screenshot attached. Attaching the cumulative syslog as well. I would appreciate your thoughts.

IMG-1173.jpg

tower-diagnostics-20200226-0756.zip syslog.zip

Edited by dgomel
Problem persists.
Link to comment
  • 2 weeks later...
On 1/21/2020 at 12:50 PM, jonp said:

At this point, I would suggest hooking up a monitor and keyboard to the system and tailing the log (command is tail /var/log/syslog -f).  This will begin printing the log out to the screen.  Then try to get the system to crash again and capture whatever was printed to the screen (use your phone to take a picture if necessary).  This should give us at least some information to point towards a cause.

 

This was my last major suggestion for you so we could determine what is happening that is causing the instability.  We don't see any events in the log because the system is crashing before they can be written.  By doing what I quoted above, you will see the last events prior to the crash and can post those here.  The picture you've posted above only has a few lines of log and then I see a printout from Top.  This cuts out a lot of other log entries.  Please just boot up the system, attach a monitor and keyboard, login via the console, and type "tail /var/log/syslog -f" and just leave that up until it crashes.  Then capture a picture of what's on the screen and post it here.
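To illustrate the point about entries being lost before they are written, here is a rough sketch of a follower that copies new syslog bytes to a second file and forces each batch to disk. It assumes Python 3 is available on the server (it may not be on a stock Unraid install), and the function names and destination path are hypothetical; it is not an Unraid feature. Even with the fsync, lines logged in the final moments before a hard hang can still be lost, so the photo of the console remains the more reliable capture.

```python
import os
import time


def drain(src_path: str, dst_path: str, offset: int) -> int:
    """Append everything in src_path past `offset` to dst_path and fsync it.

    Returns the new offset. If the source is now shorter than `offset`
    (truncated or rotated), copying restarts from the beginning.
    """
    if os.path.getsize(src_path) < offset:
        offset = 0
    with open(src_path, "rb") as src:
        src.seek(offset)
        data = src.read()
    if data:
        with open(dst_path, "ab") as dst:
            dst.write(data)
            dst.flush()
            os.fsync(dst.fileno())  # ask the OS to push the bytes to the device
    return offset + len(data)


def follow(src_path: str, dst_path: str, interval: float = 1.0) -> None:
    """Poll src_path forever, mirroring new bytes to dst_path."""
    offset = 0
    while True:
        offset = drain(src_path, dst_path, offset)
        time.sleep(interval)
```

A call such as `follow("/var/log/syslog", "/boot/logs/syslog-mirror.txt")` would mirror the log to the flash drive; the destination path here is only an example, and frequent writes add wear to the flash.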

Link to comment

@dgomel

  • From lsscsi.txt
Quote

[0:0:0:0]    disk    UFD 2.0  Silicon-Power8G  1100  /dev/sda   /dev/sg0 
  state=running queue_depth=1 scsi_level=5 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/0:0:0:0  [/sys/devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.0/host0/target0:0:0/0:0:0:0]

The reviews I've found say that's not a great flash drive. Have you ruled out the flash as the problem?

  • From syslog.txt
Quote

Mar 10 08:31:59 Tower kernel: usb 1-6: new high-speed USB device number 3 using xhci_hcd

  • (from lsusb.txt: Device 003: ID 090c:1000 Silicon Motion, Inc. - Taiwan (formerly Feiya Technology Corp.) Flash Drive)

Mar 10 08:31:59 Tower kernel: usb-storage 1-6:1.0: USB Mass Storage device detected
Mar 10 08:31:59 Tower kernel: scsi host0: usb-storage 1-6:1.0
Mar 10 08:31:59 Tower kernel: scsi 0:0:0:0: Direct-Access     UFD 2.0  Silicon-Power8G  1100 PQ: 0 ANSI: 4

  • From lspci.txt
Quote

00:14.0 USB controller [0c03]: Intel Corporation 200 Series/Z370 Chipset Family USB 3.0 xHCI Controller [8086:a2af]
    Subsystem: Micro-Star International Co., Ltd. [MSI] 200 Series PCH USB 3.0 xHCI Controller [1462:7a70]
    Kernel driver in use: xhci_hcd

This is the only USB controller listed in lspci.txt, so make sure your flash is not using USB3.

Hopefully this leads you in the right direction. If not, always include diagnostics. At this point I'd say you should also include the syslog from the syslog server folder.
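As a small aid for reading such enumeration lines, the sketch below pulls the negotiated speed out of a kernel message. In these messages "high-speed" is the USB 2.0 rate (480 Mbit/s) and speeds beginning with "SuperSpeed" are USB 3.x, while `xhci_hcd` only names the controller driver, which also serves USB 2.0 devices. The function names are made up for this example, and the log line cannot tell you whether the physical port is a USB 3 port, since a USB 2.0 stick enumerates as high-speed either way.

```python
import re

# Matches lines such as:
#   "... kernel: usb 1-6: new high-speed USB device number 3 using xhci_hcd"
ENUM_RE = re.compile(
    r"usb (\S+): new (.+?) USB device number (\d+) using (\S+)"
)


def parse_usb_enumeration(line: str):
    """Return (port, speed, device_number, driver) or None if no match."""
    m = ENUM_RE.search(line)
    if m is None:
        return None
    port, speed, number, driver = m.groups()
    return (port, speed, int(number), driver)


def is_usb3_speed(speed: str) -> bool:
    """True when the kernel reported a SuperSpeed (USB 3.x) link."""
    return speed.startswith("SuperSpeed")
```

Applied to the syslog line quoted above, this reports port `1-6`, speed `high-speed`, device number 3 and driver `xhci_hcd`, so that stick was negotiating at USB 2.0 speed at that boot.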

 

Link to comment
37 minutes ago, dgomel said:

I'm attaching a cumulative log, starting from the first boot on new hardware.

Hopefully moving from USB3 to USB2 fixes it (also remember that the quality of the flash is important; see the recommendations here). If you continue to have issues: I'm still learning how to read these logs, but could the snippet below indicate trouble starting the SMB shares, and would that restart the server?

Quote

Mar  6 13:07:38 Tower avahi-daemon[9564]: Service group file /services/smb.service vanished, removing services.
Mar  6 13:07:38 Tower emhttpd: shcmd (123): /etc/rc.d/rc.nfsd stop
Mar  6 13:07:38 Tower rpc.mountd[9540]: Caught signal 15, un-registering and exiting.

Looking at the next snippet, it is clear that the server just snaps into a reboot rather than the hang originally described, so I suspect that tailing the log would not help in this case.

Quote

Mar  6 13:07:45 Tower rpc.mountd[14158]: Caught signal 15, un-registering and exiting.
Mar  6 13:07:46 Tower sshd[9433]: Received signal 15; terminating.
Mar  6 13:07:46 Tower haveged: haveged: Stopping due to signal 15
Mar  6 13:07:46 Tower ntpd[1749]: ntpd exiting on signal 1 (Hangup)
Mar  6 13:07:46 Tower ntpd[1749]: 127.127.1.0 local addr 127.0.0.1 -> <null>
Mar  6 13:07:46 Tower ntpd[1749]: 45.62.214.53 local addr 192.168.22.47 -> <null>
Mar  6 13:07:46 Tower ntpd[1749]: 216.232.132.31 local addr 192.168.22.47 -> <null>
Mar  6 13:07:46 Tower ntpd[1749]: 209.115.181.107 local addr 192.168.22.47 -> <null>
Mar  6 13:07:46 Tower kernel: nfsd: last server has exited, flushing export cache
Mar  6 13:07:46 Tower rc.inet1: ip -4 route flush default dev br0
Mar  6 13:07:46 Tower rc.inet1: ip -4 addr flush dev br0
Mar  6 13:07:46 Tower rc.inet1: ip link set br0 down
Mar  6 13:07:46 Tower kernel: br0: port 1(eth0) entered disabled state
Mar  6 13:07:46 Tower rc.inet1: ip link set eth0 promisc off nomaster
Mar  6 13:07:46 Tower kernel: device eth0 left promiscuous mode
Mar  6 13:07:46 Tower kernel: br0: port 1(eth0) entered disabled state
Mar  6 13:07:46 Tower rc.inet1: ip link set br0 down
Mar  6 13:07:46 Tower rc.inet1: ip link del br0
Mar  6 13:07:46 Tower rc.inet1: ip link set lo down

 

REBOOTS HERE, (I'm just not sure what the above info implies)


Mar  6 13:09:19 Tower kernel: microcode: microcode updated early to revision 0xca, date = 2019-10-03
Mar  6 13:09:19 Tower kernel: Linux version 4.19.94-Unraid (root@Develop) (gcc version 9.2.0 (GCC)) #1 SMP Thu Jan 9 08:20:36 PST 2020
Mar  6 13:09:19 Tower kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot
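One way to check every restart in a long cumulative syslog is sketched below. It is a heuristic only: it treats a "kernel: Linux version" line as the start of a boot and looks at the lines just before it for services being stopped with signal 15 (SIGTERM), which is what an orderly shutdown or reboot usually leaves behind, whereas a hard hang or power loss normally leaves none. The function name, labels and window size are assumptions made for this example.

```python
BOOT_MARKER = "kernel: Linux version"  # first kernel banner line of a boot
SHUTDOWN_HINT = "signal 15"            # services being stopped with SIGTERM


def classify_boots(lines, window=40):
    """Label each boot found in `lines` (a list of syslog lines).

    Returns a list of (line_index, label) where label is:
      "orderly" - shutdown messages appear shortly before the boot
      "abrupt"  - earlier lines exist but contain no shutdown messages
      "unknown" - the boot is at the very start of the log
    """
    results = []
    for i, line in enumerate(lines):
        if BOOT_MARKER not in line:
            continue
        before = lines[max(0, i - window):i]
        if not before:
            label = "unknown"
        elif any(SHUTDOWN_HINT in prev for prev in before):
            label = "orderly"
        else:
            label = "abrupt"
        results.append((i, label))
    return results
```

On the excerpt above, the boot at 13:09:19 would be labelled "orderly", which suggests something requested that restart rather than the machine freezing; it does not say what requested it.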

 

Link to comment
