
Server died a day after 6.12.10 upgrade



Last year I upgraded my server to 6.12.x and immediately suffered from stability issues due to btrfs. I downgraded back to 6.11.x and the server has been online without interruption for another 8 months. On Monday I upgraded again, this time recreating the cache pool as zfs. The server was stable for about 24 hours but died in the early hours of this morning, although clearly not with a btrfs issue this time. I was unable to contact the server over the network and had to force a reboot. I've grabbed and attached diagnostics, although the syslog data in them is from post-reboot. I also have a copy of the syslog data that I had sent to an external server, so I've pulled about 10K lines from that and attached it here as well.

 

Hopefully there's something in here to give pointers.  Again, I'd stress that the server has been running without the slightest hiccup on UPS for well over a year, interrupted only by the issues that I had during the last aborted attempt to move to 6.12.  It would be really nice if I didn't have to downgrade again!

 

thanks

tower-diagnostics-20240410-1022.zip syslogresults_20240410_102435.zip


There are macvlan-related call traces, and those will usually end up crashing the server. Switching to ipvlan should fix it: Settings -> Docker Settings -> Docker custom network type -> ipvlan (advanced view must be enabled, top right), then reboot.
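
If you want to check a saved syslog copy for these yourself, here is a rough Python sketch. It just does a case-insensitive search for the word "macvlan" and reports the line numbers. The function name and the assumption that the traces contain that literal text are mine, not anything Unraid ships, so treat it as a starting point only.

```python
import re
from typing import Iterable, List, Tuple


def find_marker_lines(
    lines: Iterable[str], markers: Tuple[str, ...] = ("macvlan",)
) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line containing one of the markers.

    Matching is case-insensitive. The default marker is an assumption:
    it presumes the call traces mention "macvlan" literally.
    """
    pattern = re.compile("|".join(re.escape(m) for m in markers), re.IGNORECASE)
    hits = []
    for number, line in enumerate(lines, start=1):
        if pattern.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

To use it, open the syslog file (with errors="replace" in case of odd bytes) and pass the file object to find_marker_lines. If nothing comes back after the switch to ipvlan, that is at least a hint that the traces have stopped.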


Thanks, I'll make that switch.  I knew about that from the last upgrade, but the advice back then was also steering towards a second NIC (which I added at the time) and so this time I wasn't changing anything until I was more sure it was necessary.  


After making the ipvlan switch, the server was stable until today. Today I spotted that my containers were effectively unreachable (web servers responded, but timed out or returned no content after logon). I found that running ls /mnt would hang the terminal, but the same on /mnt/disk1 was fine. There was nothing in syslog at the time the issues were seen, but there was an error much earlier in the day (when things still seemed okay), e.g.

PANIC: zfs: removing nonexistent segment from range tree 

 

I couldn't reboot the server because issuing a reboot would also hang, so eventually I had to do a hard reset. After the reset the array got stuck starting, with the cache pool the apparent culprit. I rebooted (which was now possible) with a plan to mount the cache read-only, but after the reboot the array started fine. This all feels like it's related to the cache pool (single SSD), but again I'd had zero problems before the 6.12.x upgrade when on btrfs, and I only moved to zfs to get around the apparent issues with btrfs on 6.12.x. So my question at this point, assuming that zfs doesn't like something about my hardware that btrfs on 6.11.x was fine with, is whether I should consider reformatting the cache pool to xfs instead.

 

thanks


Just to add: the server has been back online for under an hour, and the symptoms have returned, with /mnt/user access hanging, the admin UI unavailable, etc. I can't run diagnostics because that also hangs, but there's nothing new in syslog since the server/disk became unresponsive. In other scenarios I'd think that this was likely to be a disk issue, but again it seems like a massive coincidence that I had no issues for the 6 months before the upgrade and the zfs change.

37 minutes ago, turnma said:

PANIC: zfs: removing nonexistent segment from range tree 

This suggests a problem with a zfs filesystem. Since there's no fsck for zfs, you would need to back up and recreate the pool. Do you have more than one zfs filesystem?
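
If it helps to answer that, here is a minimal Python sketch that lists the imported zfs pools and their reported health by calling zpool list -H -o name,health (scripted mode, tab-separated, no header). The function names are made up for illustration, and it assumes the zpool command is on the PATH and still responds. On a server that is already hanging, it may well hang too.

```python
import subprocess
from typing import Dict


def parse_zpool_list(output: str) -> Dict[str, str]:
    """Parse tab-separated 'name<TAB>health' lines into a dict."""
    pools = {}
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, health = line.split("\t")[:2]
        pools[name] = health
    return pools


def list_pools() -> Dict[str, str]:
    """Run zpool and return {pool_name: health}. Assumes zpool is installed."""
    result = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,health"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_zpool_list(result.stdout)
```

More than one entry in the result means more than one pool. A pool can still show as healthy here while having the kind of metadata problem the panic points to, so this only tells you which pools exist, not which one is at fault.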

 

 
