
(6.12.6) Call traces crashing server; even after switch to IPVLAN as requested


Go to solution Solved by SinoBreizh,


Hello,

 

I have recently installed the binhex qbittorrent-vpn Docker container, and have been experiencing regular crashes ever since (around once a day). The system was perfectly stable previously, with an uptime of around 4 months.

 

I believe I have narrowed down the issue to one mentioned in the 6.12.4 and 6.12.6 change logs, related to MACVLAN call traces. The problem itself is far beyond my knowledge level, but I have applied the fix suggested by Unraid in the change logs: I have switched "Docker custom network type" from MACVLAN to IPVLAN. Yet the crashes still occur, albeit with different errors in the logs.
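
One way to double-check that no Docker network is still on the macvlan driver after the switch is to list each network with its driver. Below is a minimal Python sketch, assuming the output of `docker network ls --format '{{.Name}} {{.Driver}}'` has been captured as text. The function name and the sample network names are made up for illustration and are not taken from the attached diagnostics.

```python
def networks_using_driver(listing: str, driver: str = "macvlan") -> list[str]:
    """Return the names of networks whose driver equals `driver`.

    `listing` is expected to hold one "<name> <driver>" pair per line.
    """
    matches = []
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, net_driver = parts
        if net_driver == driver:
            matches.append(name)
    return matches


# Hypothetical sample listing (assumption, for illustration only).
sample = """bridge bridge
host host
none null
br0 ipvlan
"""

print("still on macvlan:", networks_using_driver(sample))
print("on ipvlan:", networks_using_driver(sample, "ipvlan"))
```

An empty list for macvlan would suggest the setting change took effect for every custom network; it says nothing about whether the remaining call traces have another cause.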

 

Therefore, here are my questions to those more knowledgeable than me:

  1. I have a Realtek RTL8125B 2.5G NIC on my motherboard, but it is not in use (not plugged in), as I use a generic Chinese X520-DA1 10G NIC instead for connectivity, which has worked without issues for a year. The Fix Common Problems plugin warns that the RTL8125B NIC is known to cause instability, and a plugin is available for an alternative driver. Can the Realtek chip cause instability even when it is not plugged in and not in use? Should I install the driver then?
  2. I have, to the best of my knowledge, applied the recommendations found here in the 6.12.4 changelog. Yet the crashes still occur. Please find attached the syslog of the latest crash, as well as the diagnostics obtained just after reboot. The syslog is full of errors related to "call trace" and "khugepaged Tainted", which is why I believe it is related to the MACVLAN call traces issue. What should I do next to attempt to fix the issue?

 

Thanks in advance for your help, and I wish you all a (soon to be) happy new year!

tower-diagnostics-20231231-1113.zip tower-syslog-previous-20231231-1012.zip

Link to comment

I've checked your post, and while some things are similar ("Call Trace"; <TASK>; the general syntax of the errors), it's not exactly the same.

 

For instance, in your post you say the errors should always start and end with 

BUG: kernel NULL pointer dereference, address: blablabla

 

Yet a Ctrl+F search of my syslog yields nothing of that kind.

 

Same with

__filemap_get_folio+0x98/0x1ff

 

All I get is

filemap_migrate_folio+0x1b/0x62
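
The manual Ctrl+F comparison above can be scripted so it is easy to repeat on each new syslog. This is only a sketch: the three patterns are the strings quoted in this post, while the dictionary keys, the function name, and the sample text are assumptions made for illustration.

```python
import re

# Patterns quoted in this post. The first two are the signatures from the
# linked thread; the third is the one that actually appears in my syslog.
SIGNATURES = {
    "null_deref": r"BUG: kernel NULL pointer dereference",
    "filemap_get_folio": r"__filemap_get_folio\+0x",
    "filemap_migrate_folio": r"filemap_migrate_folio\+0x",
}


def count_signatures(log_text: str) -> dict[str, int]:
    """Count how many times each known signature occurs in the log text."""
    return {
        name: len(re.findall(pattern, log_text))
        for name, pattern in SIGNATURES.items()
    }


# Tiny stand-in for a syslog excerpt (assumption, not the real file).
sample = "Call Trace:\n <TASK>\n filemap_migrate_folio+0x1b/0x62\n"

for name, hits in count_signatures(sample).items():
    print(f"{name}: {hits}")
```

To run it against a real log, read the extracted syslog file into a string and pass that to `count_signatures` instead of `sample`.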

 

I've also read that I should remove certain plugins you mentioned. I'm however reluctant to do so as some, most notably appdata backup, are essential to backing up saves from the game servers I host. In your thread, is the top post updated with conclusions from further down the thread, or should I browse it to see if some plugins end up confirmed as culprits?

Edited by SinoBreizh
Link to comment

First of all, happy new year! Didn't expect anyone to be online so soon in 2024.

 

Got it, I'll dig in and report back as soon as my parity drive finishes rebuilding.

 

I think it's completely unrelated, but I woke up this morning to a disabled parity drive and a bunch of errors in syslog. I wasn't too fresh this morning and forgot to download the syslog and the diagnostics, but the drive does pass the SMART test with no issues, and I checked all cable connections to make sure there were no loose fittings.

 

So I don't think the problem is from the drive's health or my SATA cables. All I can think of is a brownout/very short power cut over the new year. And I don't think it's related to the main problem I posted here for.

Link to comment

I've read your thread @JorgeB, and I've decided to try another option proposed there, before I commit to disabling plugins I rely on.

 

I've edited my binhex/qbittorrent-vpn container to pull from @binhex's libtorrentv1 library, and will run it as a daily driver to see if things improve. If no crashes/similar errors occur, then we can confirm your thread was the issue, without having to resort to wiping my plugins. If it still crashes, then I'll wipe the plugins as you asked.

 

From my understanding of the thread you linked, this issue (if it is indeed what you suspect) should be solved with the release of 6.13 since it will run on the 6.5+ kernel. Correct?

 

Anyways, thanks as usual for your time and help.

Link to comment

I was about to say things were looking good since I switched qBittorrent to use libtorrentv1 instead of libtorrentv2, but I encountered a crash while doing something else.

 

The problem is reproducible 100% of the time. I had attempted to copy a large (125GB) file over SMB from my NVMe-equipped desktop to my NVMe Unraid cache pool, to test the SSD's SLC cache limit. However, at some point during the copy process, throughput falls to zero and Unraid crashes within the following seconds.

 

I haven't included the syslog or diagnostics, because there are no errors. Nothing. Only the console hooked up to the server has info:

 

Crash 1 starts with:

BUG: Bad page state in process shfs pfn:952a66

 

Crash 2 starts with:

BUG: Bad page state in process smbd pfn: 2a090b

 

Crash 3 starts with:

general protection fault, probably for non-canonical address
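
To compare the console messages across crashes, the "Bad page state" headers can be split into the process name and the page frame number. A rough sketch follows; the regular expression and function name are my own, and the optional space after `pfn:` is tolerated on the assumption that it is only a transcription difference from the console photos.

```python
import re

# Matches e.g. "BUG: Bad page state in process shfs pfn:952a66".
# Whitespace after "pfn:" is optional (assumption, see note above).
BAD_PAGE = re.compile(
    r"BUG: Bad page state in process (\S+)\s+pfn:\s*([0-9a-fA-F]+)"
)


def parse_bad_page(line: str):
    """Return (process, pfn as int) or None if the line has another format."""
    match = BAD_PAGE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2), 16)


crash_headers = [
    "BUG: Bad page state in process shfs pfn:952a66",
    "BUG: Bad page state in process smbd pfn: 2a090b",
    "general protection fault, probably for non-canonical address",
]

for header in crash_headers:
    print(parse_bad_page(header))
```

In the three headers above, the affected process and the page frame number differ from one crash to the next, and the third crash is a different kind of fault altogether.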

 

Most crashes end with a kernel panic. Others are an endless loop of call traces.

 

What makes me think it is related to the issue I posted here for is the fact that disabling Docker seems to solve the issue entirely. I have not been able to reproduce the crash with Docker disabled. I can copy the 125GB file over SMB back and forth with no issue, and it fully saturates my 10G network (1.2GB/s).
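
As a sanity check on the "fully saturates" figure, here is the arithmetic as a short Python sketch. It assumes decimal units (1 GB = 10^9 bytes, 10G = 10 Gbit/s) and ignores protocol overhead; the variable names are illustrative.

```python
LINK_GBIT_PER_S = 10.0    # nominal speed of the 10G link
OBSERVED_GB_PER_S = 1.2   # SMB throughput reported above
FILE_SIZE_GB = 125.0      # size of the test file

# 8 bits per byte: 10 Gbit/s corresponds to 1.25 GB/s before any overhead.
line_rate_gb_per_s = LINK_GBIT_PER_S / 8

# Fraction of the theoretical line rate actually achieved.
utilisation = OBSERVED_GB_PER_S / line_rate_gb_per_s

# Time needed for one full copy at the observed rate.
copy_seconds = FILE_SIZE_GB / OBSERVED_GB_PER_S

print(f"line rate:   {line_rate_gb_per_s:.2f} GB/s")
print(f"utilisation: {utilisation:.0%}")
print(f"copy time:   {copy_seconds:.0f} s")
```

Under those assumptions, 1.2GB/s is about 96% of the theoretical maximum, and a single 125GB copy takes a little under two minutes.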

 

Yet unlike in my previous comments, the syslog is completely clean. So it can't be the same issue? And what makes the issue reproducible every single time; why is a large file copy over SMB what triggers the qBittorrent kernel panic? SMART data shows the SSD is only at 20TB written out of a marketed life of 200TB. The few attributes I get seem to indicate all is normal. The Crucial P1 is known as a hot running drive, but unreliable it is not. I will be purchasing another NVMe drive to rule out the hardware side - I needed a second drive for parity on the cache pool anyway.

 

For now I'm just confused. I thought I had narrowed down the issue, but a whole other can of worms is now open. Any help is appreciated.

20240102_232810.jpg

20240102_232818.jpg

20240102_235322.jpg

20240103_000213.jpg

Link to comment

Never mind, it now crashes even with Docker disabled when sending the test file. I guess it has to be a different issue then.

 

Gives the same error though:

general protection fault, probably for non-canonical address

 

I give up for tonight. I'm going to bed and letting Memtest run overnight, just in case.

 

-------------------------

 

Edit: Welp, Memtest immediately failed with thousands of errors. That also explains why Unraid would crash when ingesting a big file.

 

These sticks were working fine before and are months old. I guess my "new year power brownout" theory is gaining traction. I'll individually test the modules tomorrow. But for fuck's sake; I was thinking of buying a UPS too since I'm starting to use Unraid more.

 

I'm not closing this thread as solved until I can confirm everything works with good RAM, hope you don't mind.

Edited by SinoBreizh
Link to comment
  • Solution

I'm updating this thread just because it can prove useful to someone, even though it has spiralled out of control into troubleshooting hell.

 

I've narrowed the issue down to my RAM. I have two kits of identical RAM, bought at separate times. Both kits have the same model numbers, specs, and timings. Both pass MemTest without issue on their own. Put them together however, and they fail MemTest. After further investigation, I have noticed that, while they have the same model numbers, frequencies, and timings, the older kit of RAM is actually single rank, while the newer kit is dual rank. I never knew the kits were different in this manner (as it is not advertised anywhere on the online shop or the manufacturer's website) and never expected this to be an issue - I had always read that the memory controller would "pick the slower of the two" kits and work from there. I also expected memory incompatibilities to be clear and evident, where the PC wouldn't boot at all; not this unexplainable crash situation. This is the only explanation I can find for otherwise identical kits of RAM passing MemTest individually, but failing when put together, no matter what slots are used.

 

I am currently using only the dual rank dual channel kit of RAM, and I've had consistent uptime with no crashes for around a day now. I have also failed to cause crashes as I did before, by sending a huge 125GB file over SMB. I have even reverted qBittorrent back to libtorrentv2, with no issues so far. I'll update the post if I spoke too soon, but this seems to be it. For now I'm flairing this as the solution, in case it can help someone.

 

TL;DR: when encountering seemingly unexplainable crashes, running MemTest can be a good first step to narrow the issue down.

Link to comment
On 1/5/2024 at 5:39 PM, SinoBreizh said:

This is the only explanation I can find for otherwise identical kits of RAM passing MemTest individually, but failing when put together, no matter what slots are used.

There is another possibility: the RAM controller circuits may not be able to run all the sticks at once due to load. Unless you mean that memtest fails with 1 stick from one set and 1 stick from the other, so that only 2 sticks total still fail.
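
The reasoning in the last two posts can be written down as a small decision helper. This is only a sketch of the logic discussed here, not a diagnostic tool: the function name, parameters, and result strings are hypothetical, and `True` means that configuration passed memtest.

```python
from typing import Optional


def interpret_memtest(
    kit_a_alone: bool,
    kit_b_alone: bool,
    all_sticks: bool,
    one_from_each: Optional[bool] = None,
) -> str:
    """Map memtest pass/fail results to the explanations raised in the thread."""
    if not (kit_a_alone and kit_b_alone):
        return "a kit fails on its own: suspect that kit"
    if all_sticks:
        return "all sticks pass together: no memory fault reproduced"
    if one_from_each is None:
        return "mixed setup fails: also test one stick from each kit"
    if one_from_each:
        # Two mixed sticks are fine, four are not.
        return "only the full set fails: controller load is plausible"
    # Even two mixed sticks fail, so the number of sticks is not the trigger.
    return "two mixed sticks fail: kit mismatch is plausible"


# Results reported in the solution post: each kit passes alone,
# all sticks together fail, the two-stick mixed test is not stated.
print(interpret_memtest(True, True, False))
```

With only the results reported so far, the helper asks for the two-stick mixed test, which is exactly the missing data point raised in this reply.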

Link to comment
