Server unresponsive and unreachable


shEiD


My secondary/backup/testing server just hung for the 3rd time in 2 days. This never happened before; until now it was working perfectly.

 

I am in the process of copying all the files from that backup server (about 8-9TB) onto my main server. My plan was to copy in stages, one large folder (about 0.5TB - 1TB) at a time.

 

All 3 times, the backup server hung during a large rsync transfer.

At first everything was perfectly fine - I did 2 or 3 rsync transfers, about 1.2TB in total.

 

I fired up another rsync and went to sleep. The next morning I noticed that the backup server had hung. My SSH session to the backup server was disconnected and the WebUI was not working, so I tried to ping it:

Pinging 192.168.1.22 with 32 bytes of data:
Reply from 192.168.1.11: Destination host unreachable.
Reply from 192.168.1.11: Destination host unreachable.

The monitor on the backup server showed a black screen and the keyboard did nothing. All I could think of at that point was to force a shutdown with the power button. I turned it off and on again. The server booted normally and did a parity check (~9 hours) - no errors.

Although I had no idea what happened, I was happy that the parity check returned 0 errors. The only linux experience I have is unraid, so after a couple of hours of googling, all I could think of was to set up remote syslog. So that's what I did. I set up both of my unraid machines to act as a remote syslog server for each other, hoping this would help me find out what the problem was.

 

I fired up another rsync and went on with my day. A couple of hours into the transfer the backup server hung again. Exactly the same - unresponsive and unreachable. I looked at the remote syslog and did not see anything that explained what happened, at least to me (as linux illiterate as I am). Well, one weird thing - there were a ton of these share cache full messages, and one of them was the last message in the syslog before the backup server hung:

Sep  2 19:19:39 unGiga shfs: share cache full
Sep  2 19:19:39 unGiga shfs: share cache full
Sep  2 19:19:39 unGiga shfs: share cache full
Sep  2 19:19:39 unGiga shfs: share cache full
Sep  2 19:19:39 unGiga shfs: share cache full
Sep  2 19:20:47 unGiga ool www[5686]: /usr/local/emhttp/plugins/recycle.bin/scripts/rc.recycle.bin 'empty'
Sep  2 19:20:48 unGiga Recycle Bin: User: Recycle Bin has been emptied
Sep  2 19:20:53 unGiga ool www[5810]: /usr/local/emhttp/plugins/recycle.bin/scripts/rc.recycle.bin 'clear'
Sep  2 19:27:14 unGiga shfs: share cache full
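To see how these messages cluster in the minutes before a hang, they can be counted per minute in the saved remote syslog. This is only a minimal sketch: it assumes the timestamp layout shown in the excerpt above, and the function name `count_cache_full` and the example file name are made up for illustration.

```shell
# Count "shfs: share cache full" lines per minute.
# Reads syslog text on stdin; assumes each line starts with "Mon DD HH:MM:SS host ...".
count_cache_full() {
  grep -F 'shfs: share cache full' |
    awk '{ print $1, $2, substr($3, 1, 5) }' |  # keep month, day, HH:MM
    uniq -c                                      # count consecutive identical minutes
}

# Example (hypothetical file name):
# count_cache_full < syslog-192.168.1.22.log
```

Each output line is a count followed by the minute it belongs to, so a sudden burst right before the last logged entry is easy to spot.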

That did not help at all... The cache pool on my backup server was actually pretty empty.

 

Again, I did some googling. And again, I forced my backup server off with the power button and turned it back on. But this time, I logged in on the server itself and started tailing the syslog, hoping that the monitor would stay on and I could see the errors if the same crap happened again.

tail -f /var/log/syslog

Parity check - 0 errors. 👍

 

Until then, I had always run rsync on the backup server itself, copying files into the main server's share mounted locally. Because those repeated log messages said something about a share, I decided to switch it up. I fired up another rsync transfer, but this time I was copying over SSH, and not into the mounted share. And went to sleep.

When I woke up, I found that the backup server had hung again, for the 3rd time.
And... everything was the same. Remote syslog showed nothing informative (at least to me) again, ping failed again, and the monitor was black again. The keyboard did not work, I tried CTRL+C, any other keys - nothing.

Then I came here and started writing this post, asking for help. 🤪
The server is still "powered on". I decided not to force a power down this time, in case there's anything that can or needs to be done while trying to find out what the hell is going on.

 

I ran a ~24 hour memory test on the backup server about 6 months ago - perfectly fine, no errors.

I've attached the whole remote syslog.


Thanks in advance for any help.

 

syslog-192.168.1.22.zip

Edited by shEiD
Link to comment
4 minutes ago, trurl said:

Sorry, please start the array and post new diagnostics.

That would start a parity check due to unclean shutdown.

 

It looks like your shares have a very large Minimum Free setting. If you don't specify units, such as 100MB or 20GB, then the default unit is KB.

 

So 500000000 would be 500GB
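The arithmetic behind that figure can be checked with a quick sketch. It assumes decimal units (1 GB = 1,000,000 KB); if the setting is counted in binary units instead, the same value comes to roughly 477 GiB. The function name `min_free_kb_to_gb` is hypothetical.

```shell
# Convert a unitless Minimum Free value (interpreted as KB) to whole GB.
# Assumption: decimal units, i.e. 1 GB = 1,000,000 KB.
min_free_kb_to_gb() {
  echo $(( $1 / 1000000 ))
}

min_free_kb_to_gb 500000000   # prints 500
```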

Link to comment
1 hour ago, trurl said:

Sorry, please start the array and post new diagnostics.

Here's the diagnostics with a running array.

ungiga-diagnostics-20210904-0125.zip

Although, I have already rebooted normally one more time without starting the array, just to avoid the parity check. I decided enough is enough: after 2 parity checks in 2 days (0 errors), I feel a 3rd would turn out the same. There should never have been any writes to the array when the server hung...
Sorry if that messed up the diagnostics... did it? I actually thought that if you don't capture the info before a restart, it is useless anyway, since unraid always runs from memory and resets completely on reboot?

 

1 hour ago, trurl said:

It looks like your shares have a very large Minimum Free setting. If you don't specify units, such as 100MB or 20GB, then the default unit is KB.

 

So 500000000 would be 500GB

Yep, that's how I have set them, using the units: from the smallest at 3GB to the largest at 500GB.
I usually tend not to fill any drives past 90% on my main server. And yes, on this backup server 500GB is normally way too much for 3TB drives, but meh - this was just temporary. This is as much a testing server as it is a backup server. That's why I'm copying everything out to the main one. I want to create and properly test multiple btrfs cache pools using these drives.

I would love sooooo much to use btrfs pools, if they weren't so "anecdotally scary unreliable" and there weren't so many warnings not to trust btrfs all over the internets 😟

Link to comment
