Server Bad State


Recommended Posts

A few times since upgrading my parity drive to a larger one, I am having issues with my server going into a bad state. I feel like this is similar to what I initially saw happening when I switched from Intel to AMD but the issues went away after I change some recommended settings but I think I recall I even changed the BIOS back to defaults. It has been fine for a few years now. It seems like if I have to hard reset the system, it will boot up, run file while parity check is happening, but then later in the day, or next, the UI becomes unresponsive. I am able to SSH in and reset the UI. And say right now, I see that my docker containers are all but 1 stopped and the stopped ones are throwing an error. VMs will throw an error if I try to start them. Attached is my diags.

diagnostics-20240108-1437.zip

Link to comment
Jan  7 08:45:37 YorksHomeServer kernel: macvlan_broadcast+0x10a/0x150 [macvlan]
Jan  7 08:45:37 YorksHomeServer kernel: ? _raw_spin_unlock+0x14/0x29
Jan  7 08:45:37 YorksHomeServer kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]

Macvlan call traces will usually end up crashing the server, switching to ipvlan should fix it (Settings -> Docker Settings -> Docker custom network type -> ipvlan (advanced view must be enabled, top right)), then reboot.

Link to comment

Appreciate the suggestion. Yesterday the UI died on me a few times, once while stopping docker. I made the change and so far it has been holding since yesterday. It is possible that maybe docker wasn't failing on me except that one time which I posted this 2 days ago but I was assuming it also had been.

Link to comment
  • 3 weeks later...

I think I am still having related issues, or maybe it is something else. I again had hard reset the box, and after that I had an issue which I created another post about with my cache drive which it is working again. Twice though docker containers were not running, stopped I guess at some point, and wouldn't start until I rebooted. Currently this was happening, I am trying to stop the array but it is not stopping. Attached is the current diag and I have tried physically rebooting it yet. Maybe this is a separate issue from before but i'm not certain.

diagnostics-20240127-2005.zip

Edited by LimeB
Link to comment

There are still macvlan call traces, and not really a surprise since according to the diags you are still using macvlan, also filesystem corruption on cache, possibly due to bad RAM, run memtest, but first recommend correctly installing your sticks, both are on the same channel, that's worse for both performance and stability.

Link to comment

Maybe the macvlan inadvertently got changed when I was then messing with some other settings for working on something else. I felt like it was staying stable after I initially changed to ipvlan but thinking back, it must have got changed back and I became unstable again. Now it is changed again and I'll see how it holds up, along with running a memtest.

  • Like 1
Link to comment

After my last post, system booted up and went into a bad state probably within minutes but I didn't notice for a few hours. I went ahead and ran a memtest and no errors with 9 passes. I have moved the memory into dual channel, and I am still having issues. Now what I am seeing is within minutes of the array being started, my docker containers will crash, and can't start. I am seeing my cache drive being unmountable as per my other post. I can run the command you suggested, it gets me running again. Maybe I am running into multiple issues here because I'm not seeing the initial UI problem anymore but now more just docker keeps failing because of my cache drive. Attached is a diag after the latest boot and crash. This last time I did disable docker, started the array, then enabled docker.

diagnostics-20240204-0119.zip

Link to comment
Feb  4 01:15:26 YorksHomeServer kernel: BTRFS: error (device dm-9: state A) in __btrfs_free_extent:3072: errno=-2 No such entry
Feb  4 01:15:26 YorksHomeServer kernel: BTRFS info (device dm-9: state EA): forced readonly
Feb  4 01:15:26 YorksHomeServer kernel: BTRFS error (device dm-9: state EA): failed to run delayed ref for logical 1785080889344 num_bytes 16384 type 176 action 2 ref_mod 1: -2

 

There is corruption of the pool filesystem, suggest backing up and re-formatting, if it happens again soon it would suggest an underlying hardware issue.

Link to comment

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.