(SOLVED) - My Unraid Got unstable

Dear Gurus,

I need your help with a problem. Over the last couple of months my Unraid server has become unstable. It had been working perfectly for years.

Recently I added a hard drive and updated Unraid to a newer version.

I attempted some troubleshooting but couldn't find the root cause.

The problem is that once in a while everything gets stuck. No networking. The fans are on, and there are no beeps from the BIOS.

I think it could be a memory problem somewhere in the upper addresses, or some other kind of hardware problem.

Please help me find the root cause.

tower-diagnostics-20200116-0928.zip

Edited by dgomel
Solved

  • Author

Memtest86+ showed no problems over a couple of passes.

I see fast growth in zombie processes, visible in top: around 100 in 5 minutes. Could this be the problem?

I traced the source of the zombies to a specific container. The problem started before I set up that container, so it seems irrelevant.
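
A minimal sketch for tracking down where zombies come from, assuming a Linux host with /proc mounted (the script is not from this thread). It groups zombie processes by parent PID, since the parent is the process that is failing to reap them. The names `parse_stat`, `zombies_by_parent`, and `read_proc_stats` are made up for illustration.

```python
from collections import Counter
from pathlib import Path


def parse_stat(text):
    """Return (pid, comm, state, ppid) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' instead of splitting on whitespace.
    head, _, tail = text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    fields = tail.split()
    return int(pid_str), comm, fields[0], int(fields[1])


def zombies_by_parent(stat_texts):
    """Count zombie (state 'Z') processes per parent PID."""
    counts = Counter()
    for text in stat_texts:
        _pid, _comm, state, ppid = parse_stat(text)
        if state == "Z":
            counts[ppid] += 1
    return counts


def read_proc_stats(proc_root="/proc"):
    """Yield the stat text of every process currently visible under proc_root."""
    root = Path(proc_root)
    if not root.is_dir():
        return  # no procfs here (assumption: only Linux hosts have one)
    for entry in root.iterdir():
        if entry.name.isdigit():
            try:
                yield (entry / "stat").read_text()
            except OSError:
                continue  # process exited while we were scanning


if __name__ == "__main__":
    for ppid, n in zombies_by_parent(read_proc_stats()).most_common(5):
        print(f"parent PID {ppid}: {n} zombie(s)")
```

Running it a few minutes apart shows whether one parent keeps accumulating zombies; `ps -o pid,comm -p <ppid>` then names that parent.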

Edited by dgomel

  • Author

I found the problem. It started when I enabled VT-d while using a SYBA SI-PEX40108 with the Marvell 88SE9215. After replacing it with an LSI 9211-8i, everything went back to working.

Unfortunately, the problem persists. It looks like it depends on the usage of VMs.

Latest Diagnostics and Syslog attached. 

 

tower-diagnostics-20200121-0930.zip syslog (6).zip

Edited by dgomel
Problem persists.

Hi there,

 

Looks like you need to update your forum signature as your hardware has changed from AMD to Intel ;-).  That said, I'm not seeing any events in the logs themselves that show a problem.  And I'm not sure what this is supposed to mean:

On 1/16/2020 at 9:11 AM, dgomel said:

The problem is that once in a while everything stack.

Also hoping that you removed this container while troubleshooting this:

On 1/16/2020 at 9:51 AM, dgomel said:

I traced the source of the zombies to a specific container. The problem started before I set up that container, so it seems irrelevant.

 

At this point, I would suggest hooking up a monitor and keyboard to the system and tailing the log (command is tail /var/log/syslog -f).  This will begin printing the log out to the screen.  Then try to get the system to crash again and capture whatever was printed to the screen (use your phone to take a picture if necessary).  This should give us at least some information to point towards a cause.

 

I'd also check for a BIOS update on your motherboard.

  • Author

I forwarded the syslog to flash and attached it with the diagnostics above. I can't see anything in the syslog.

The outages are easy to find from the gaps in the log output, each followed by a new boot sequence.

A console is connected. The BIOS is the latest for the motherboard.
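
The gap-then-boot pattern described above can be searched for mechanically. A rough sketch, with some assumptions: the log uses the classic syslog timestamp with no year (so a year is passed in, and 2020 is assumed here), every boot logs a `kernel: Linux version` line, and a one-minute silence counts as a gap. `find_gaps_and_boots` is a hypothetical helper name, not an Unraid tool.

```python
from datetime import datetime, timedelta

# Assumption: each boot writes a line containing this text.
BOOT_MARKER = "kernel: Linux version"


def parse_ts(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp; syslog omits the year."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def find_gaps_and_boots(lines, year=2020, max_gap=timedelta(minutes=1)):
    """Return (gaps, boots).

    gaps  -- list of (last_seen, next_seen) pairs where the log was silent
             for longer than max_gap
    boots -- timestamps of lines containing BOOT_MARKER
    """
    gaps, boots = [], []
    prev = None
    for line in lines:
        try:
            ts = parse_ts(line, year)
        except ValueError:
            continue  # blank line or text without a syslog timestamp
        if prev is not None and ts - prev > max_gap:
            gaps.append((prev, ts))
        if BOOT_MARKER in line:
            boots.append(ts)
        prev = ts
    return gaps, boots
```

A gap that ends at a boot timestamp is an outage candidate; a quiet but healthy system can also produce gaps, so the threshold needs tuning per server.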

  • Author
27 minutes ago, jonp said:

you need to update your forum signature

Done. No improvement ;)

The syslog to flash method is not valid for capturing log events related to major crashes like you're experiencing.  The problem is that the hang can occur before the write to the flash can occur.  This is why I am suggesting you connect a monitor / keyboard to the system.

  • Author

Got it. Will work this out. Thanks for your time.

  • Author

I think I can safely rule out hardware overheating. I generated a high CPU load along with moderate I/O and kept it running for a couple of hours. No crashes.

7BEB0E27-9CD3-4843-805F-CABB408C42EC.jpeg

  • Author

One of my theories was that hardware monitoring in the BIOS might be involved. Yesterday I went to check this and found nothing related to a temperature threshold. While I was there, I changed the CPU governor setting to Performance mode.

In addition, I found yesterday that dynamix.system.temp.plg hadn't been updated for a while. When I tried to update it, the update failed, so I uninstalled it and installed it again.

After these two changes the system has been running for a day with no crashes. I'll keep monitoring.

  • Author

I think it is safe to mark the case as solved. I didn't try changing the CPU governor back, but something tells me the plugin is most probably the root cause.

Thanks for your help. 

 

The problem persists. It depends on virtualization being enabled: the system works with the VM manager and Docker turned off.

So far I have tested all components except the motherboard; that test is upcoming. I was able to catch a failure on the console once. Screenshot attached. I'm attaching the cumulative syslog as well. I would appreciate your thoughts.

IMG-1173.jpg

tower-diagnostics-20200226-0756.zip syslog.zip

Edited by dgomel
Problem persists.

  • 2 weeks later...
  • Author

I just completely upgraded the hardware... and got my first restart this morning.

I'll give it another week or two before I drop the product completely.

To say I'm frustrated is a bit of an understatement.

On 1/21/2020 at 12:50 PM, jonp said:

At this point, I would suggest hooking up a monitor and keyboard to the system and tailing the log (command is tail /var/log/syslog -f).  This will begin printing the log out to the screen.  Then try to get the system to crash again and capture whatever was printed to the screen (use your phone to take a picture if necessary).  This should give us at least some information to point towards a cause.

 

This was my last major suggestion for you so we could determine what is happening that is causing the instability.  We don't see any events in the log because the system is crashing before they can be written.  By doing what I quoted above, you will see the last events prior to the crash and can post those here.  The picture you've posted above only has a few lines of log and then I see a printout from Top.  This cuts out a lot of other log entries.  Please just boot up the system, attach a monitor and keyboard, login via the console, and type "tail /var/log/syslog -f" and just leave that up until it crashes.  Then capture a picture of what's on the screen and post it here.

  • Author

On the new hardware (new MB/CPU/memory/PSU) I see system restarts. It doesn't hang as it did before.

So the tactic of waiting at the console doesn't work :(.

Edited by dgomel

  • Author

BTW, the old hardware was reused as a desktop and works fine under Win10.

@dgomel

  • From lsscsi.txt
Quote

[0:0:0:0]    disk    UFD 2.0  Silicon-Power8G  1100  /dev/sda   /dev/sg0 
  state=running queue_depth=1 scsi_level=5 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/0:0:0:0  [/sys/devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.0/host0/target0:0:0/0:0:0:0]

The reviews I've found say that's not a great flash drive. Have you ruled out the flash drive as the problem?

  • From syslog.txt
Quote

Mar 10 08:31:59 Tower kernel: usb 1-6: new high-speed USB device number 3 using xhci_hcd

  • (from lsusb.txt: Device 003: ID 090c:1000 Silicon Motion, Inc. - Taiwan (formerly Feiya Technology Corp.) Flash Drive)

Mar 10 08:31:59 Tower kernel: usb-storage 1-6:1.0: USB Mass Storage device detected
Mar 10 08:31:59 Tower kernel: scsi host0: usb-storage 1-6:1.0
Mar 10 08:31:59 Tower kernel: scsi 0:0:0:0: Direct-Access     UFD 2.0  Silicon-Power8G  1100 PQ: 0 ANSI: 4

  • From lspci.txt
Quote

00:14.0 USB controller [0c03]: Intel Corporation 200 Series/Z370 Chipset Family USB 3.0 xHCI Controller [8086:a2af]
    Subsystem: Micro-Star International Co., Ltd. [MSI] 200 Series PCH USB 3.0 xHCI Controller [1462:7a70]
    Kernel driver in use: xhci_hcd

This is the only USB controller listed in lspci.txt. Make sure your flash drive is not plugged into a USB3 port.
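
For what it's worth, the enumeration line in the syslog already carries the port path, the negotiated speed, and the driver. A small sketch that pulls those out; `parse_enumeration` and `is_usb3_link` are illustrative names, and the assumption that USB 3.x links are logged with a word starting with "SuperSpeed" should be checked against your kernel version. Note that "high-speed" is the USB 2.0 signalling rate, and a USB 2.0 stick negotiates it even in a USB3 port, so this shows the link speed rather than which physical port is in use.

```python
import re

# Matches kernel lines such as:
#   usb 1-6: new high-speed USB device number 3 using xhci_hcd
ENUM_RE = re.compile(
    r"usb (?P<port>[\d.-]+): new (?P<speed>.+?) "
    r"USB device number (?P<num>\d+) using (?P<driver>\S+)"
)


def parse_enumeration(line):
    """Return port, speed, device number and driver, or None if no match."""
    m = ENUM_RE.search(line)
    if m is None:
        return None
    return {
        "port": m["port"],
        "speed": m["speed"],
        "number": int(m["num"]),
        "driver": m["driver"],
    }


def is_usb3_link(speed):
    """Assumption: the kernel names USB 3.x link speeds 'SuperSpeed...'."""
    return speed.lower().startswith("superspeed")
```

Feeding it the `usb 1-6` line quoted above gives port `1-6`, speed `high-speed`, and driver `xhci_hcd`.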

Hopefully this leads you in the right direction. If not, always include diagnostics. At this point I'd say you should also include the syslog from the syslog server folder.

 

  • Author

Thanks, I appreciate your input.

The USB drive is the only part, besides the array, that was not replaced.

The USB drive is indeed in a USB3 port. I'm going to switch it immediately.

I'm attaching a cumulative log, starting from the first boot on new hardware.

syslog (11).zip

Edited by dgomel

37 minutes ago, dgomel said:

I'm attaching a cumulative log, starting from the first boot on new hardware.

Hopefully moving from USB3 to USB2 fixes it (also remember that the quality of the flash drive is important, see the recommendations here). I'm still learning how to read these logs, but if you continue to have issues: could the snippet below indicate trouble starting the SMB shares, and would that restart the server?

Quote

Mar  6 13:07:38 Tower avahi-daemon[9564]: Service group file /services/smb.service vanished, removing services.
Mar  6 13:07:38 Tower emhttpd: shcmd (123): /etc/rc.d/rc.nfsd stop
Mar  6 13:07:38 Tower rpc.mountd[9540]: Caught signal 15, un-registering and exiting.

Looking at the next snippet, it is clear that the server just snaps into a reboot rather than the hang originally described, so I suspect that tailing the log would not help in this case.

Quote

Mar  6 13:07:45 Tower rpc.mountd[14158]: Caught signal 15, un-registering and exiting.
Mar  6 13:07:46 Tower sshd[9433]: Received signal 15; terminating.
Mar  6 13:07:46 Tower haveged: haveged: Stopping due to signal 15
Mar  6 13:07:46 Tower ntpd[1749]: ntpd exiting on signal 1 (Hangup)
Mar  6 13:07:46 Tower ntpd[1749]: 127.127.1.0 local addr 127.0.0.1 -> <null>
Mar  6 13:07:46 Tower ntpd[1749]: 45.62.214.53 local addr 192.168.22.47 -> <null>
Mar  6 13:07:46 Tower ntpd[1749]: 216.232.132.31 local addr 192.168.22.47 -> <null>
Mar  6 13:07:46 Tower ntpd[1749]: 209.115.181.107 local addr 192.168.22.47 -> <null>
Mar  6 13:07:46 Tower kernel: nfsd: last server has exited, flushing export cache
Mar  6 13:07:46 Tower rc.inet1: ip -4 route flush default dev br0
Mar  6 13:07:46 Tower rc.inet1: ip -4 addr flush dev br0
Mar  6 13:07:46 Tower rc.inet1: ip link set br0 down
Mar  6 13:07:46 Tower kernel: br0: port 1(eth0) entered disabled state
Mar  6 13:07:46 Tower rc.inet1: ip link set eth0 promisc off nomaster
Mar  6 13:07:46 Tower kernel: device eth0 left promiscuous mode
Mar  6 13:07:46 Tower kernel: br0: port 1(eth0) entered disabled state
Mar  6 13:07:46 Tower rc.inet1: ip link set br0 down
Mar  6 13:07:46 Tower rc.inet1: ip link del br0
Mar  6 13:07:46 Tower rc.inet1: ip link set lo down

 

REBOOTS HERE, (I'm just not sure what the above info implies)


Mar  6 13:09:19 Tower kernel: microcode: microcode updated early to revision 0xca, date = 2019-10-03
Mar  6 13:09:19 Tower kernel: Linux version 4.19.94-Unraid (root@Develop) (gcc version 9.2.0 (GCC)) #1 SMP Thu Jan 9 08:20:36 PST 2020
Mar  6 13:09:19 Tower kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot
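
On the question of what the lines before the reboot imply: signal 15 is SIGTERM, the signal sent to services when they are asked to stop. A hedged sketch that lists such lines logged shortly before each boot; the five-minute window, the assumed year, and the helper name `sigterm_before_boots` are my own choices, not anything from the diagnostics. Finding several of them right before a boot would be consistent with a requested shutdown rather than a sudden power loss, but treat that as a hint, not a diagnosis.

```python
import re
from datetime import datetime, timedelta

# Assumption: each boot writes a line containing this text.
BOOT_MARKER = "kernel: Linux version"
# "signal 15" only; the \b keeps "signal 150" out, and "signal 1" never matches.
SIGTERM_RE = re.compile(r"signal 15\b")


def parse_ts(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp; syslog omits the year."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def sigterm_before_boots(lines, year=2020, window=timedelta(minutes=5)):
    """For each boot, return (boot_time, sigterm_lines_within_window)."""
    recent = []   # (timestamp, line) pairs mentioning signal 15 since last boot
    results = []
    for line in lines:
        try:
            ts = parse_ts(line, year)
        except ValueError:
            continue  # blank line or text without a syslog timestamp
        if BOOT_MARKER in line:
            hits = [text for when, text in recent if ts - when <= window]
            results.append((ts, hits))
            recent = []
        elif SIGTERM_RE.search(line):
            recent.append((ts, line))
    return results
```

An empty list for a boot means no service logged a SIGTERM in the window, which is what an abrupt reset would look like in the log.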

 

  • Author

I switched over to USB2. I'll keep monitoring. Thanks for the reply and analysis.

  • Author

The system has been stable for a week since the switch from USB3 to USB2. @Dissones4U thanks for your findings.

Archived

This topic is now archived and is closed to further replies.
