
4 years of stability has gone horribly wrong in the last month... crashes every other day, btrfs "errors". Please help.


tdotr6

Hello unRAID Team! 

 

I have had nothing but stability with my system for the last several years, rock solid. 

 

Recently I've been encountering errors every day or every other day, and I can't figure out a pattern or trigger. A friend of mine is having almost identical issues, and the only thing we've both done recently is upgrade to 6.12.6. We run almost the same containers, but our hardware is different: he's on a later-gen Intel with DDR5, while I'm still on DDR4.

 

I have attached two diagnostics dumps from two different times when I experienced the following:

 

I check the dashboard in the morning and find one or two containers stopped. I try to start them and can't. It's never the same containers. If I stop another container at that point, it won't start either, and even though the others look like they're running, they don't work correctly. For instance, Grafana is running but I can't reach the dashboard; I just get {"traceID":""}. Yet I can get to Frigate and access all of my cameras and footage. It's very strange!

 

After a reboot everything seems fine again, with no ongoing errors, until the issue repeats.

 

Checking the syslogs today, I do see the following repeated errors:

 

[screenshots of the repeated syslog errors attached]

 

svr-diagnostics-20231216-1719.zip svr-diagnostics-20231222-0956.zip

Link to comment

First thing, change the Docker custom network type to ipvlan and reboot.
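If you want to confirm which driver your custom Docker networks are currently using, you can check from the console; something along these lines should work (br0 is just an example name, adjust it to your own custom network):

# List all Docker networks and their drivers (macvlan vs ipvlan)
docker network ls

# Show the driver of a specific custom network
docker network inspect br0 --format '{{.Driver}}'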

 

Dec 22 02:56:31 SVR kernel: BTRFS error (device sdf1): block=272887431168 write time tree block corruption detected

 

This error usually indicates a hardware issue, most often bad RAM, but with the current kernel there have been some possible false positives, so I would recommend running memtest, and if nothing is found, try to recreate the pool or use zfs instead.

Link to comment

A memtest has been done and I don't have bad RAM; this is fairly new RAM as well.

 

I also rebuilt the cache pool a few weeks ago, when I was first experiencing errors:
[screenshot of the earlier errors attached]

 

When I rebuilt, I replaced two SSDs in the cache pool with brand new ones because I thought maybe they were dying, as they were six years old. All four SSDs in the pool are now very new.

 

My friend is also getting very similar errors (corrupt 8 instead of 12) on his server, and he is on totally different hardware. I can have him post his dumps to this thread as well. Same thing with the container errors in the morning, etc. The server can run fine for a few days with no issues and then suddenly have problems. Again, the system was VERY stable before upgrading from .4.

 

Is there a way we can just roll back to before .4, i.e. two upgrade versions back? I only upgraded again because I thought it would fix the issues; I should have rolled back then. :(

 

RE: First thing, change the Docker custom network type to ipvlan and reboot.

Will do. 

 

Edited by tdotr6
Link to comment
2 minutes ago, tdotr6 said:

Is there a way we can just roll back to before .4, i.e. two upgrade versions back? I only upgraded again because I thought it would fix the issues; I should have rolled back then.

You can manually upgrade or downgrade to any Unraid release for which you have the zip file version of the release as described here in the online documentation accessible via the ‘Manual’ link at the bottom of the GUI or the DOCS link at the top of each forum page. The 6.12.4 release is still available for download from the Unraid site.
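Roughly speaking, the manual procedure comes down to copying the bz* files from the release zip onto the flash drive and rebooting. A sketch from the console might look like the following (the zip path and filename are just examples, and backing up the current files first is a good idea):

# Back up the release files currently on the flash drive
mkdir -p /boot/previous
cp /boot/bz* /boot/previous/

# Unpack the downloaded 6.12.4 zip and copy its bz* files to the flash, then reboot
unzip /path/to/unRAIDServer-6.12.4-x86_64.zip -d /tmp/unraid-6.12.4
cp /tmp/unraid-6.12.4/bz* /boot/
reboot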

Link to comment
38 minutes ago, tdotr6 said:

this is fairly new RAM as well. 

This doesn't mean it's not bad.

 

39 minutes ago, tdotr6 said:

When I rebuilt, I replaced two SSDs in the cache pool with brand new ones because I thought maybe they were dying, as they were six years old. All four SSDs in the pool are now very new.

Unlike the other error I mentioned before, where there can be false positives, those checksum errors by themselves are always caused by bad RAM (or the board/CPU).
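You can also keep an eye on the per-device error counters and run a scrub to see whether new checksum errors keep appearing; from the console something like this works (assuming the pool is mounted at /mnt/cache):

# Per-device counters: read/write/flush errors, corruption and generation errors
btrfs device stats /mnt/cache

# Re-verify all data and metadata checksums on the pool
btrfs scrub start /mnt/cache
btrfs scrub status /mnt/cache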

Link to comment

Yes, I understand that; I am just pointing out it's new as well as tested.

 

Quote

Unlike the other error I mentioned before, where there can be false positives, those checksum errors by themselves are always caused by bad RAM (or the board/CPU).

[screenshot of the checksum errors attached]

 

Are you referring to these checksum errors?

 

Any other tests you'd recommend for testing the board/CPU?

Link to comment

I think I will take two sticks of RAM from a machine I've had since 2017 and swap them in.

 

That said, I have looked at some of my dashboards, and it looks like the crash happens in the first one, svr-diagnostics-20231222-0956.zip, at 2:56 AM.

 

Dec 22 02:56:31 SVR kernel: BTRFS critical (device sdf1): corrupt leaf: root=5 block=272887431168 slot=110 ino=5203363 file_offset=229376, invalid ram_bytes for file extent, have 65535, should be aligned to 4096
Dec 22 02:56:31 SVR kernel: BTRFS info (device sdf1): leaf 272887431168 gen 51281 total ptrs 194 free space 16 owner 5
Dec 22 02:56:31 SVR kernel:     item 0 key (5203347 1 0) itemoff 16123 itemsize 160
Dec 22 02:56:31 SVR kernel:         inode generation 40834 size 42 mode 40755


I must ask as well: do you see a similarity between the crashes in svr-diagnostics-20231222-0956.zip and svr-diagnostics-20231222-1126.zip?

 

Both were taken before a reboot while experiencing the issue, i.e. the container errors had started and I was unable to start a stopped container again.

 

I would suspect that if it's RAM I would see something similar in both, but I'm not experienced at looking at these logs like you are. I don't see a similarity in the crashes that points to a RAM (or other hardware) issue.
 

@JorgeB - https://pastebin.com/BKKBwb3a  You referenced the checksum errors as the thing for me to focus on, but I have to disagree. I export my syslogs, so I haven't lost any details from the last 30 days... it just doesn't seem to be the smoking gun you've pointed it out to be.
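If it helps, I can also run a read-only check against that device while the pool is stopped; something like this is what I have in mind (sdf1 taken from the log above, and only run while the filesystem is unmounted):

# Read-only filesystem check, makes no changes to the device
btrfs check --readonly /dev/sdf1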

Edited by tdotr6
Link to comment
4 hours ago, JorgeB said:

That's your right.

What's with the attitude? You initially ignored that I had already done a memtest that PASSED before I came to this post, and I have also posted a diagnostics dump from a few weeks prior, when the system had the same issue and I remembered to grab the dump before I rebooted.

 

Why ignore it? Why ignore the extra logs I am giving you to make your product better? 

 

It is very interesting that, with the amount of logs I've provided, you've still yet to review the diagnostics from the earlier crash. And I've provided 30 days' worth of logs covering all the BTRFS, checksum, and kernel errors.

 

Yet you're going to reply with "that's your right"? I guess it's your right not to actually want to help someone who has been a dedicated user, supporter, and patron of this product for nearly a decade. The first time I come looking for help, this is how I'm treated. Maybe it's just time to look at unRAID alternatives.

 

[screenshot attached]

 

 

Link to comment
3 minutes ago, tdotr6 said:

Yet you're going to reply with "that's your right"? I guess it's your right not to actually want to help someone who has been a dedicated user, supporter, and patron of this product for nearly a decade. [...] Maybe it's just time to look at unRAID alternatives.

You realize you are responding to another user, not a Limetech employee?

Link to comment

Passing a memory test doesn't mean the RAM is good. It's very possible that the conditions that trigger the failure aren't being reproduced by the load in the test. The fact that there are checksum errors means that SOMEWHERE in the path the data takes there is an issue. Easiest to test is the RAM modules, because it's easy to remove a portion, test for the errors to reoccur, switch to the unused modules, and repeat. If the errors still occur with either half, the next likely candidate would probably be the motherboard, or possibly even the memory on the drive itself. The CPU is also involved, but that usually affects more things.

 

Until the checksum errors stop coming back, you have a memory issue. That doesn't mean the RAM is bad, but it's the first logical place to troubleshoot.
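On top of memtest, a user-space memory tester run while the server is under its normal load can sometimes catch what an idle boot-time test misses. memtester is one option, assuming you can get it installed on Unraid (for example via a plugin):

# Lock and test roughly 4 GB of RAM for 3 passes while the system is otherwise busy
memtester 4G 3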

 

Also, if you are unhappy with the peer-to-peer support, Unraid does have priority paid support available: https://unraid.net/services

Link to comment

I'm sorry if you took my reply the wrong way; that was not my intention. I only meant that I can offer advice, and you are free to follow it or not.

 

Based on the btrfs write time corruption error and the checksum errors, I still think bad RAM is a strong possibility, though of course I cannot be certain. Still, that's where I would start. You can also try this, as suggested:

 

On 12/22/2023 at 4:03 PM, JorgeB said:

try to recreate the pool or use zfs instead.

Back up and reformat the pool with zfs; if a zfs pool also has issues, it would basically confirm there's some hardware issue going on.
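A rough sketch of the copy step, assuming the pool is mounted at /mnt/cache and there is room on an array disk (stop the Docker and VM services first so nothing is writing to the pool):

# Copy the pool contents to a backup folder on an array disk
rsync -avh --progress /mnt/cache/ /mnt/disk1/cache_backup/

# After recreating the pool as zfs, copy everything back
rsync -avh --progress /mnt/disk1/cache_backup/ /mnt/cache/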

 

Link to comment
  • Solution
On 12/23/2023 at 8:59 AM, itimpi said:


You realize you are responding to another user, not a Limetech employee?

A moderator is very much a representative of Limetech, my dude.

Quote

Until the checksum errors stop coming back, you have a memory issue. That doesn't mean the RAM is bad, but it's the first logical place to troubleshoot.

What? You still haven't checked out the secondary diagnostics I posted. Great support from the peers, lol.

 

Thanks, mods. I am going to take this time to move myself to TrueNAS Scale.

 

This was the push I needed.

 

Thank you. 

Edited by tdotr6
Link to comment
6 minutes ago, tdotr6 said:

A moderator is very much a representative of Limetech, my dude.

No, just another unpaid volunteer with more work to do.

 

8 minutes ago, tdotr6 said:

Still haven't checked out the secondary Diag posted

For some reason, that one isn't working. It appears to be an image of an attachment, not a real attached diagnostics file that we can download.

Link to comment
Quote

For some reason, that one isn't working. It appears to be an image of an attachment, not a real attached diagnostics file that we can download.

It is posted above. Yes, that was indeed a screenshot, showing he didn't even look at it yet continued to provide bad advice.

 

@dirkinthedark already did that, and agreed there were issues with that as well.

 

SEEMS that way, for now: although I have run macvlan for many years, two updates ago they made a big change around this, and they clearly broke it for setups that were working, causing some kernel panics and other issues. Since I have been on ipvlan for a few days, no issues. I very much do want to look at going to TrueNAS; once Coral support is available, I'm jumping ship.

Link to comment
  • 2 weeks later...

It's clear this is caused by the recent patch in which Unraid says they've resolved the macvlan issue.

 

I've been on macvlan for many years with this hardware and it has been stable.

 

Now that I've had issues after the patch, and as you suggested @JorgeB, I've changed to ipvlan and we're stable with 0 errors for 17 days.

 

Unraid should be responding to this, as I'm not the only one who has been experiencing it since the recent patch.

 

Link to comment

AFAIK the macvlan issue is fixed since v6.12.4. There are two options: users that don't really need macvlan can just change to ipvlan, which is the easier option; users that do require macvlan can still use it if they disable bridging for eth0. If you try to use macvlan with bridging enabled there can be macvlan call traces that end up crashing the server, but this is not new; macvlan issues have been a problem since at least v6.5, though it looks like with each newer release more users are affected.
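A quick way to see from the console whether bridging is currently enabled is to check for the br0 interface; if it exists, bridging is on for eth0 (the actual setting lives on the Network Settings page, with the array stopped):

# List bridge interfaces; br0 being present means eth0 bridging is enabled
ip -br link show type bridge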

Link to comment
