
Inaccessible after a few hours or days - Unable to further investigate what's going on


manrw
Go to solution Solved by JorgeB,


Hello,
for the last 3 months I have been having issues with random crashes on my Unraid server.

It had been running well before that. At first I suspected a hardware failure, so I checked the disk devices and the logs for errors.
I noticed that there were PCIe errors for one of my SATA controllers which stated that an error was corrected.
Other than that, there was nothing unusual.
Every time the server "crashes", USB power on the chassis goes out and I'm unable to establish any connection via IPv4 (ICMP, HTTP and SSH).
Pressing the power button once did not shut down the server gracefully (even after 15 minutes) and I always had to press for 5 seconds to restart.
After that, the server booted back up and worked normally.
Since I didn't have much time due to Christmas and moving to a new apartment, I was unable to investigate further, and the server kept crashing at intervals ranging from 2 hours to 3 days.
There is no pattern between the crashes and they seem random to me.

 

Now I finished moving and replaced the SATA controller with a better one.
I no longer see the PCIe errors but the server is still crashing.
Since I was using an old Z170A board (almost no settings, unable to change C states, ...) with an I7 7600k before, I switched to a Gigabyte B760 Gaming with I3 12100 for much lower idle consumption.
This configuration has been working well but the crashes are still happening randomly.
Due to the replacements and upgrades, I suspect that this isn't a hardware failure.
I am at the end of my troubleshooting steps and would appreciate it if you could help me with the next steps.

image.png.ab441816775c6239b98c32896e9fc492.png

Further context:

  • My server never had a power loss
  • I'm running Ubiquiti DM & multiple switches for my network
  • Power is controlled over multiple NETIO 4C PDUs - but they never turned off according to the history
  • After a crash, HDMI no longer gives any output on the screen when I try to check what happened
  • UniFi controller shows the device as offline after crashing

tower-diagnostics-20231217-1909.zip


Hi, thanks for the reply. I just set up rsyslog on my Raspberry Pi and will wait for the next crash.
This may take a few hours to days, since I don't have reproduction steps for it.
I actually turned on the cloning of the syslog to the flash drive a few days ago - would you want that?

Syslog file on Raspberry Pi looks like this:
image.thumb.png.8d9cbe187e89779ed1021fdf622e29ea.png
I assume that's ok?

EDIT: Can confirm that Remote Syslog is working: It just updated with some messages from a plugin. I'll let you know once I see another crash.
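For anyone who wants to check a remote syslog receiver without waiting for a plugin to log something, here is a minimal sketch that sends one test message over UDP using Python's standard `logging.handlers.SysLogHandler`. The function name `send_test_message`, the IP address in the usage comment, and port 514 are assumptions for illustration; adjust them to match your own receiver.

```python
import logging
import logging.handlers


def send_test_message(host: str, port: int = 514,
                      text: str = "remote syslog test") -> None:
    """Send a single syslog message over UDP to host:port."""
    # SysLogHandler uses UDP by default when given a (host, port) tuple.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    logger = logging.getLogger("syslog-test")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    try:
        logger.info(text)
    finally:
        # Detach and close so repeated calls do not send duplicates.
        logger.removeHandler(handler)
        handler.close()


# Example call (hypothetical address of the Raspberry Pi):
# send_test_message("192.168.1.50")
```

If the message shows up in the log file on the receiving side, the network path and the receiver are working. UDP gives no delivery confirmation, so a missing message can mean either a network or a receiver problem.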

Edited by manrw

Thanks, I'll let it run in safe mode. But without Docker, it's useless to me.
I see a Call trace mentioning Postgres and "php-fpm" in the log. I assume php-fpm is from Unraid itself but Postgres is not.
Could a Docker container be crashing Unraid? I do run two instances of Postgres.
"Corrupted page table at address 55b437033b50" suggests to me that there could be corrupted data in one of the databases.
If one container accesses said data only periodically or upon requests, this would explain the seemingly random behavior of the crash.
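To see which processes show up next to kernel errors in a saved syslog, a rough sketch like the following can help. It assumes the log contains the usual Linux kernel oops header with a `Comm: <process>` field, and it only looks for the two phrases mentioned above ("Call Trace" and "Corrupted page table"). The function name and the file path in the usage comment are hypothetical, and a process name appearing in a trace is a hint about what was running at the time, not proof of the cause.

```python
import re
from collections import Counter

# Phrases taken from the log described above; matched case-insensitively.
MARKERS = ("call trace", "corrupted page table")
# Kernel oops headers usually contain "Comm: <process name>".
COMM_RE = re.compile(r"\bComm: (\S+)")


def summarize_kernel_errors(lines):
    """Return (lines containing a marker, Counter of process names seen)."""
    marker_hits = []
    comms = Counter()
    for line in lines:
        lowered = line.lower()
        if any(marker in lowered for marker in MARKERS):
            marker_hits.append(line.rstrip("\n"))
        match = COMM_RE.search(line)
        if match:
            comms[match.group(1)] += 1
    return marker_hits, comms


# Example usage (hypothetical path to the mirrored syslog):
# with open("/var/log/remote/tower.log", errors="replace") as fh:
#     hits, comms = summarize_kernel_errors(fh)
# print(comms.most_common(5))
```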

  • 2 weeks later...
6 hours ago, manrw said:

After debugging for multiple weeks, I can confirm that one of the Docker containers caused the crashes.
However, I migrated to a different OS due to the frustration and other problems I had.
Thanks for your help 👍

Which docker was it? I'm dealing with something similar and this could be useful info.


I wasn't able to confirm it 100%, because I had to wait days for each crash, but I suspect that it was one of the following containers:

- GitLab
- Postgres14
- Postgres15
- pgAdmin4

 

I'd assume that deleting the database would be enough to get rid of the error. It should be noted, though, that the crash could be caused by a different error that we don't see in the log. The Postgres error is just the last entry, which makes it quite likely that it's the reason for the crash.

 

Furthermore, I can confirm that no plugin or Unraid itself was causing the crash.

Edited by manrw
