Disappointed in stability

Nexus · August 2, 2022

I have been running an Unraid server for about 3 years, and recently completed a new build designed around lower energy consumption, lower temps and more energy-efficient parts. Prior to this build, I was running Unraid on an old gaming PC that was loud, huge and sucked down energy. My new server's footprint is much smaller, and I'm happy with the hardware side of things all around.

BUT: Something is seriously wrong with my current setup. Specifically, my Docker setup seems very unstable and frankly quite fragile.

I am running the following:

AdGuard-Home
binhex-krusader
Cloudflare-DDNS
HomeBridge
NginxProxyManager
Plex-Media-Server
Portainer
scrypted

And I *THINK* it's either scrypted or Plex that is causing problems. For the past few weeks, on roughly 25% of mornings I wake up to find that the Unraid web UI is not responding and I can't SSH into the box. I need to do a hard power reset.
Of course, after the system comes back up it wants to do a parity check.

One of these resets corrupted my Plex DB (which I had migrated from my v1 server without issue). The media files are fine, but I lost all the additional metadata (lists, closed captions, cover art), and I don't know how to recover that part of the equation.

I thought the way Docker was designed is that it's not supposed to bring down the host. Why does this keep happening?

Any idea where to start trying to get things back to a stable place? I am stumped.

Thanks.

Edited August 2, 2022 by Nexus
Update

JorgeB · August 2, 2022

Enable the syslog server and post that together with the complete diagnostics after the next crash.

Nexus · August 2, 2022

Thanks Jorge. I have done that, and I will share the logs when the inevitable crash happens.

Nexus · August 3, 2022

Following up: Today the Scrypted container seized up.

I had to stop the Docker service, and then it would not restart. So I took the array offline and did a clean reboot.

Attached are logs from the USB boot drive and two system diagnostics: one from before the reboot and one from after.

Archive 2.zip

Edited August 3, 2022 by Nexus

Nexus · August 3, 2022

Damn it. It did it again. I was streaming a movie from my Plex container, went to the UI to check whether it was using hardware encoding, and the UI was unstable.

My system crashed - AGAIN.
I can't SSH into the box and now I have to do ANOTHER power cycle.

This is supremely frustrating. Is there another way I can get direct & timely support from Limetech? While I appreciate the peer-to-peer model, I'd prefer a more direct line to the company.

Attached are logs from after this crash.

More logs.zip

Kilrah · August 3, 2022

Your system reports hardware errors (that you've apparently ignored), might want to check mcelog / run a memtest.

Your 2 sticks of RAM are mismatched and support different speeds, could be worth trying them separately and seeing if you're stable that way.

Edited August 3, 2022 by Kilrah

ChatNoir · August 3, 2022

1 hour ago, Nexus said:

Is there another way I can get direct & timely support from Limetech? While I appreciate the peer-to-peer model, I'd prefer a more direct line to the company.

There is paid support from Limetech.

But it seems that the page behind the link I saved for it is being rebuilt and is currently lacking relevant information.

Nexus · August 3, 2022

56 minutes ago, Kilrah said:

Your system reports hardware errors (that you've apparently ignored), might want to check mcelog / run a memtest.

Your 2 sticks of RAM are mismatched and support different speeds, could be worth trying them separately and seeing if you're stable that way.

Curious. Thanks. Where can I see that error in the logs I submitted? The sticks are showing the same speed in the BIOS, but clearly they are not the same?

Kilrah · August 3, 2022

10 minutes ago, Nexus said:

Where can I see that error in the logs I submitted?

e.g.

Aug  2 17:25:32 Altair8800 kernel: mce: [Hardware Error]: Machine check events logged

Aug  2 17:42:04 Altair8800 root: Fix Common Problems: Error: Machine Check Events detected on your server ** Ignored

For the 2nd line you likely have gotten a notification and then manually set it to be ignored.
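For anyone who wants to check their own syslog for the same entries, here is a minimal Python sketch. It assumes the log has been read into a string and that the entries look like the two lines quoted above; the function name `find_mce_lines` and the third sample line are made up for illustration.

```python
import re

# Patterns based on the two syslog lines quoted in this thread.
MCE_PATTERNS = (
    re.compile(r"mce: \[Hardware Error\]"),
    re.compile(r"Machine Check Events detected"),
)

def find_mce_lines(log_text):
    """Return every log line that matches one of the MCE patterns."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in MCE_PATTERNS)
    ]

# The first two lines are copied from the thread; the third is invented
# filler to show that unrelated lines are skipped.
sample = (
    "Aug  2 17:25:32 Altair8800 kernel: mce: [Hardware Error]: Machine check events logged\n"
    "Aug  2 17:42:04 Altair8800 root: Fix Common Problems: Error: Machine Check Events detected on your server ** Ignored\n"
    "Aug  2 17:43:00 Altair8800 root: hypothetical unrelated line\n"
)

for hit in find_mce_lines(sample):
    print(hit)
```

To run it against a real log, read the saved syslog file into a string and pass that to `find_mce_lines` instead of `sample`.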

10 minutes ago, Nexus said:

They are showing the same speed in the BIOS but clearly they are not?

They are both shown to be running at 2133MHz, but looking up the part numbers, the 32GB one is rated for 3000MHz and the 16GB one for 2400MHz. Running fast RAM too slow can sometimes cause issues (rarely), but mostly, different rated speeds mean all the timings are likely to be different, and in that case mixing can sometimes be iffy.

Edited August 3, 2022 by Kilrah

Nexus · August 3, 2022

3 minutes ago, Kilrah said:
e.g.
Aug  2 17:25:32 Altair8800 kernel: mce: [Hardware Error]: Machine check events logged
Aug  2 17:42:04 Altair8800 root: Fix Common Problems: Error: Machine Check Events detected on your server ** Ignored
They are both shown to be running at 2133MHz but looking up part number the 32GB one is rated 3000MHz and the 16GB one is 2400. Sometimes running fast RAM too slow can cause issues (rarely), but mostly having different rated speeds means all timings are likely to be different and in this case mixing can sometimes be iffy.

I'll pull the 16GB stick out.

Re the machine check error: where can I find what was logged? I enabled mce, but I don't know where to get the details.

Kilrah · August 3, 2022

mcelog --client

should list what happened, some threads for reference:

Edited August 3, 2022 by Kilrah

JorgeB · August 3, 2022

Aug  2 11:26:53 Altair8800 kernel: macvlan_broadcast+0x116/0x144 [macvlan]
Aug  2 11:26:53 Altair8800 kernel: macvlan_process_broadcast+0xc7/0x110 [macvlan]

Besides the possible hardware issues already mentioned, this will also make Unraid crash. These call traces are usually the result of having Docker containers with a custom IP address; switching to ipvlan should fix it (Settings -> Docker Settings -> Docker custom network type -> ipvlan; advanced view must be enabled, top right).
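To see whether macvlan call traces show up in a saved syslog, and how often, a rough Python sketch follows. It assumes trace lines have the same shape as the two quoted above (symbol, offset, then the module name in square brackets at the end of the line); real traces can contain other line shapes that this pattern will not match. The name `count_trace_modules` is hypothetical.

```python
import re
from collections import Counter

# Matches kernel trace lines shaped like:
#   "... kernel: macvlan_broadcast+0x116/0x144 [macvlan]"
# and captures the module name inside the trailing brackets.
TRACE_LINE = re.compile(r"kernel: \S+\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]\s*$")

def count_trace_modules(log_text):
    """Count how many trace lines mention each kernel module."""
    counts = Counter()
    for line in log_text.splitlines():
        match = TRACE_LINE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Both lines are copied from the post above.
sample = (
    "Aug  2 11:26:53 Altair8800 kernel: macvlan_broadcast+0x116/0x144 [macvlan]\n"
    "Aug  2 11:26:53 Altair8800 kernel: macvlan_process_broadcast+0xc7/0x110 [macvlan]\n"
)

counts = count_trace_modules(sample)
print(counts["macvlan"])
```

A non-zero count for `macvlan` in the log from before a crash would be consistent with the diagnosis above; after switching to ipvlan, new logs should no longer contain these lines.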

Nexus · August 3, 2022

8 hours ago, Kilrah said:
mcelog --client
should list what happened, some threads for reference:

I looked in the log, and all I see is the machine check event error. I turned on mcelog, but I don't see anything specific being logged.

Edited August 3, 2022 by Nexus

Nexus · August 3, 2022

8 hours ago, JorgeB said:
Aug  2 11:26:53 Altair8800 kernel: macvlan_broadcast+0x116/0x144 [macvlan]
Aug  2 11:26:53 Altair8800 kernel: macvlan_process_broadcast+0xc7/0x110 [macvlan]
Besides the possible hardware issues already mentioned, this will also make Unraid crash. These call traces are usually the result of having Docker containers with a custom IP address; switching to ipvlan should fix it (Settings -> Docker Settings -> Docker custom network type -> ipvlan; advanced view must be enabled, top right).

Thanks. I made that change. Thank you.
