
4 years of stability has gone horribly wrong in the last month... crashes every other day, btrfs "errors". Please help.


tdotr6

Hello unRAID Team! 

 

I have had nothing but stability with my system for the last several years, rock solid. 

 

Recently I've been encountering errors every day or every other day, and I can't figure out a pattern or trigger. A friend of mine is having almost identical issues, and the only thing we've both done recently is upgrade to 6.12.6. We run almost the same containers, but our hardware is different: he's on a later-gen Intel with DDR5, while I'm still on DDR4.

 

I have attached two diagnostics dumps from two different times when I experienced the following:

 

I check the dashboard in the morning and find one or two containers stopped. I try to start them and can't. It's never the same containers. If I stop another container at that point, it won't start either, and even though the others look like they're running, they don't work correctly. For instance, Grafana is running but I can't reach the dashboard; I just get {"traceID":""}. Yet I can get to Frigate and access all of my cameras and footage. It's very strange!

 

After a reboot everything seems fine again, with no ongoing errors, until the issue repeats.

 

Checking the syslogs today, I do see the following repeated errors:

 

[screenshots of the repeated syslog errors attached]

 

svr-diagnostics-20231216-1719.zip svr-diagnostics-20231222-0956.zip

Link to comment

First thing, change the Docker custom network type to ipvlan and reboot.
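If you want to confirm which driver your custom Docker networks are currently using, you can check from the console; something along these lines should work (br0 is just an example name, adjust it to your own custom network):

# List all Docker networks and their drivers (macvlan vs ipvlan)
docker network ls

# Show the driver of a specific custom network
docker network inspect br0 --format '{{.Driver}}'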

 

Dec 22 02:56:31 SVR kernel: BTRFS error (device sdf1): block=272887431168 write time tree block corruption detected

 

This error usually indicates a hardware issue, most often bad RAM, but with the current kernel there have been some possible false positives, so I would recommend running memtest, and if nothing is found, try to recreate the pool or use zfs instead.

Link to comment

A memtest has been done and I don't have bad RAM; this is fairly new RAM as well.

 

I also rebuilt the cache pool a few weeks ago, when I was first experiencing errors:
[screenshot of the earlier errors attached]

 

When I rebuilt, I replaced two SSDs in the cache pool with brand new ones because I thought maybe they were dying, as they were six years old. All four SSDs in the pool are now very new.

 

My friend is also getting very similar errors (corrupt 8 instead of 12) on his server, and he is on totally different hardware. I can have him post his dumps to this thread as well. Same thing with the container errors in the morning, etc. The server can run fine for a few days with no issues and then suddenly have problems. Again, the system was VERY stable before upgrading from .4.

 

Is there a way we can just roll back to before .4, i.e. two upgrade versions back? I only upgraded again because I thought it would fix the issues; I should have rolled back then. :(

 

RE: First thing, change the Docker custom network type to ipvlan and reboot.

Will do. 

 

Edited by tdotr6
Link to comment
2 minutes ago, tdotr6 said:

Is there a way we can just roll back to before .4, i.e. two upgrade versions back? I only upgraded again because I thought it would fix the issues; I should have rolled back then.

You can manually upgrade or downgrade to any Unraid release for which you have the zip file version of the release as described here in the online documentation accessible via the ‘Manual’ link at the bottom of the GUI or the DOCS link at the top of each forum page. The 6.12.4 release is still available for download from the Unraid site.
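Roughly speaking, the manual procedure comes down to copying the bz* files from the release zip onto the flash drive and rebooting. A sketch from the console might look like the following (the zip path and filename are just examples, and backing up the current files first is a good idea):

# Back up the release files currently on the flash drive
mkdir -p /boot/previous
cp /boot/bz* /boot/previous/

# Unpack the downloaded 6.12.4 zip and copy its bz* files to the flash, then reboot
unzip /path/to/unRAIDServer-6.12.4-x86_64.zip -d /tmp/unraid-6.12.4
cp /tmp/unraid-6.12.4/bz* /boot/
reboot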

Link to comment
38 minutes ago, tdotr6 said:

this is fairly new RAM as well. 

This doesn't mean it's not bad.

 

39 minutes ago, tdotr6 said:

When I rebuilt, I replaced two SSDs in the cache pool with brand new ones because I thought maybe they were dying, as they were six years old. All four SSDs in the pool are now very new.

Unlike the other error I mentioned before, where there can be false positives, those checksum errors by themselves are always caused by bad RAM (or the board/CPU).
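You can also keep an eye on the per-device error counters and run a scrub to see whether new checksum errors keep appearing; from the console something like this works (assuming the pool is mounted at /mnt/cache):

# Per-device counters: read/write/flush errors, corruption and generation errors
btrfs device stats /mnt/cache

# Re-verify all data and metadata checksums on the pool
btrfs scrub start /mnt/cache
btrfs scrub status /mnt/cache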

Link to comment

Yes, I understand that; I am just pointing out it's new as well as tested.

 

Quote

Unlike the other error I mentioned before, where there can be false positives, those checksum errors by themselves are always caused by bad RAM (or the board/CPU).

[screenshot of the checksum errors attached]

 

Are you referring to these checksum errors?

 

Any other tests you'd recommend for testing the board/CPU?

Link to comment

I think I will take two sticks of RAM from a machine I've had since 2017 and swap them in.

 

That said, I have looked at some of my dashboards, and it looks like the crash happens in the first one, svr-diagnostics-20231222-0956.zip, at 2:56 AM.

 

Dec 22 02:56:31 SVR kernel: BTRFS critical (device sdf1): corrupt leaf: root=5 block=272887431168 slot=110 ino=5203363 file_offset=229376, invalid ram_bytes for file extent, have 65535, should be aligned to 4096
Dec 22 02:56:31 SVR kernel: BTRFS info (device sdf1): leaf 272887431168 gen 51281 total ptrs 194 free space 16 owner 5
Dec 22 02:56:31 SVR kernel:     item 0 key (5203347 1 0) itemoff 16123 itemsize 160
Dec 22 02:56:31 SVR kernel:         inode generation 40834 size 42 mode 40755


I must ask as well: do you see a similarity between the crashes in svr-diagnostics-20231222-0956.zip and svr-diagnostics-20231222-1126.zip?

 

Both were taken before a reboot while experiencing the issue, i.e. the container errors had started and I was unable to start a stopped container again.

 

I would suspect that if it's RAM I would see something similar in both, but I'm not experienced at looking at these logs like you are. I don't see a similarity in the crashes that points to a RAM (or other hardware) issue.
 

@JorgeB - https://pastebin.com/BKKBwb3a  You referenced the checksum errors as the thing for me to focus on, but I have to disagree. I export my syslogs, so I haven't lost any details from the last 30 days... it just doesn't seem to be the smoking gun you've pointed it out to be.
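If it helps, I can also run a read-only check against that device while the pool is stopped; something like this is what I have in mind (sdf1 taken from the log above, and only run while the filesystem is unmounted):

# Read-only filesystem check, makes no changes to the device
btrfs check --readonly /dev/sdf1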

Edited by tdotr6
Link to comment
4 hours ago, JorgeB said:

That's your right.

What's with the attitude? You initially ignored that I had already done a memtest that PASSED before I came to this post, and I have also posted a diagnostics dump from a few weeks prior, when the system had the same issue and I remembered to grab the dump before I rebooted.

 

Why ignore it? Why ignore the extra logs I am giving you to make your product better? 

 

It is very interesting that, with the amount of logs I've provided, you've still yet to review the diagnostics from the earlier crash. And I've provided 30 days' worth of logs covering all the BTRFS, checksum, and kernel errors.

 

Yet you're going to reply with "that's your right"? I guess it's your right not to actually want to help someone who has been a dedicated user, supporter, and patron of this product for nearly a decade. The first time I come looking for help, this is how I'm treated. Maybe it's just time to look at unRAID alternatives.

 

[screenshot attached]

 

 

Link to comment
3 minutes ago, tdotr6 said:

Yet you're going to reply with "that's your right"? I guess it's your right not to actually want to help someone who has been a dedicated user, supporter, and patron of this product for nearly a decade. [...] Maybe it's just time to look at unRAID alternatives.

You realize you are responding to another user, not a Limetech employee?

Link to comment

Passing a memory test doesn't mean the RAM is good. It's very possible that the conditions that trigger the failure aren't being reproduced by the load in the test. The fact that there are checksum errors means that SOMEWHERE in the path the data takes there is an issue. Easiest to test is the RAM modules, because it's easy to remove a portion, test for the errors to reoccur, switch to the unused modules, and repeat. If the errors still occur with either half, the next likely candidate would probably be the motherboard, or possibly even the memory on the drive itself. The CPU is also involved, but that usually affects more things.

 

Until the checksum errors stop coming back, you have a memory issue. That doesn't mean the RAM is bad, but it's the first logical place to troubleshoot.
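On top of memtest, a user-space memory tester run while the server is under its normal load can sometimes catch what an idle boot-time test misses. memtester is one option, assuming you can get it installed on Unraid (for example via a plugin):

# Lock and test roughly 4 GB of RAM for 3 passes while the system is otherwise busy
memtester 4G 3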

 

Also, if you are unhappy with the peer-to-peer support, Unraid does have priority paid support available: https://unraid.net/services

Link to comment

I'm sorry if you took my reply the wrong way; that was not my intention. I only meant that I can offer advice, and you are free to follow it or not.

 

Based on the btrfs write time corruption error and the checksum errors, I still think bad RAM is a strong possibility, though of course I cannot be certain. Still, that's where I would start. You can also try this, as suggested:

 

On 12/22/2023 at 4:03 PM, JorgeB said:

try to recreate the pool or use zfs instead.

Back up and reformat the pool with zfs; if a zfs pool also has issues, it would basically confirm there's some hardware issue going on.
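A rough sketch of the copy step, assuming the pool is mounted at /mnt/cache and there is room on an array disk (stop the Docker and VM services first so nothing is writing to the pool):

# Copy the pool contents to a backup folder on an array disk
rsync -avh --progress /mnt/cache/ /mnt/disk1/cache_backup/

# After recreating the pool as zfs, copy everything back
rsync -avh --progress /mnt/disk1/cache_backup/ /mnt/cache/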

 

Link to comment
  • Solution
On 12/23/2023 at 8:59 AM, itimpi said:


You realize you are responding to another user, not a Limetech employee?

A moderator is very much a representative of Limetech, my dude.

Quote

Until the checksum errors stop coming back, you have a memory issue. That doesn't mean the RAM is bad, but it's the first logical place to troubleshoot.

What? You still haven't checked out the secondary diagnostics I posted. Great support from the peers, lol.

 

Thanks, mods. I am going to take this time to move myself to TrueNAS Scale.

 

This was the push I needed.

 

Thank you. 

Edited by tdotr6
Link to comment
6 minutes ago, tdotr6 said:

A moderator is very much a representative of Limetech, my dude.

No, just another unpaid volunteer with more work to do.

 

8 minutes ago, tdotr6 said:

Still haven't checked out the secondary Diag posted

For some reason, that one isn't working. It appears to be an image of an attachment, not a real attached diagnostics file that we can download.

Link to comment
Quote

For some reason, that one isn't working. It appears to be an image of an attachment, not a real attached diagnostics file that we can download.

It is posted above. Yes, that was indeed a screenshot, showing he didn't even look at it yet continued to provide bad advice.

 

@dirkinthedark already did that, and agreed there were issues with that as well.

 

SEEMS that way, for now: although I have run macvlan for many years, two updates ago they made a big change around this, and they clearly broke it for setups that were working, causing some kernel panics and other issues. Since I have been on ipvlan for a few days, no issues. I very much do want to look at going to TrueNAS; once Coral support is available, I'm jumping ship.

Link to comment
  • 2 weeks later...

It's clear this is caused by the recent patch in which Unraid says they've resolved the macvlan issue.

 

I've been on macvlan for many years with this hardware and it has been stable.

 

Now that I've had issues after the patch, and as you suggested @JorgeB, I've changed to ipvlan and we're stable with 0 errors for 17 days.

 

Unraid should be responding to this, as I'm not the only one who has been experiencing it since the recent patch.

 

Link to comment

AFAIK the macvlan issue is fixed since v6.12.4. There are two options: users that don't really need macvlan can just change to ipvlan, which is the easier option; users that do require macvlan can still use it if they disable bridging for eth0. If you try to use macvlan with bridging enabled there can be macvlan call traces that end up crashing the server, but this is not new; macvlan issues have been a problem since at least v6.5, though it looks like with each newer release more users are affected.
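A quick way to see from the console whether bridging is currently enabled is to check for the br0 interface; if it exists, bridging is on for eth0 (the actual setting lives on the Network Settings page, with the array stopped):

# List bridge interfaces; br0 being present means eth0 bridging is enabled
ip -br link show type bridge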

Link to comment
