General protection fault SMP PTI - unraid 6.5.3


tencent

I had never had any issues with my unRAID box until today. The web UI becomes unresponsive a minute or so after unRAID finishes booting. Working at the machine's console in safe mode, I was able to disable Docker, virtual machines, and auto-start of the array. This did not fix the issue: after about a minute the machine still throws a general protection fault.
 

Aug 18 16:54:03 unraid sshd[4913]: Accepted password for root from 10.0.0.2 port 35396 ssh2
Aug 18 16:54:36 unraid kernel: general protection fault: 0000 [#1] SMP PTI
Aug 18 16:54:36 unraid kernel: Modules linked in: md_mod e1000e ptp pps_core atlantic x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd i2c_i801 i2c_core ahci intel_cstate intel_uncore libahci intel_rapl_perf button [last unloaded: pps_core]
Aug 18 16:54:36 unraid kernel: CPU: 0 PID: 4538 Comm: nginx Not tainted 4.14.49-unRAID #1
Aug 18 16:54:36 unraid kernel: Hardware name: PDS Inc. Vector SW/DQ67SW, BIOS SWQ6710H.86A.0068.2017.0601.1423 06/01/2017
Aug 18 16:54:36 unraid kernel: task: ffff88042b7c0d80 task.stack: ffffc90001cd8000
Aug 18 16:54:36 unraid kernel: RIP: 0010:kfree_skb_list+0x7/0x19
Aug 18 16:54:36 unraid kernel: RSP: 0018:ffffc90001cdbd20 EFLAGS: 00010206
Aug 18 16:54:36 unraid kernel: RAX: 0000000000000000 RBX: ffff8803fc07dc00 RCX: ffffffff81c8cd40
Aug 18 16:54:36 unraid kernel: RDX: 0000000000002000 RSI: 0000000000000000 RDI: 447b1011dd35fbed
Aug 18 16:54:36 unraid kernel: RBP: ffff880429f58ff9 R08: 0000000000000fb7 R09: 00000000068310ee
Aug 18 16:54:36 unraid kernel: R10: ffffc90001cdbd20 R11: 0000000000000000 R12: 0000000000000000
Aug 18 16:54:36 unraid kernel: R13: ffff8803fc07dc00 R14: ffff8804298fe51c R15: ffff8804298fe000
Aug 18 16:54:36 unraid kernel: FS:  00001466f9b7e740(0000) GS:ffff88043e200000(0000) knlGS:0000000000000000
Aug 18 16:54:36 unraid kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 18 16:54:36 unraid kernel: CR2: 00001481f8f886c0 CR3: 00000004087d4005 CR4: 00000000000606f0
Aug 18 16:54:36 unraid kernel: Call Trace:
Aug 18 16:54:36 unraid kernel: skb_release_data+0x85/0xe4
Aug 18 16:54:36 unraid kernel: __kfree_skb+0x9/0x12
Aug 18 16:54:36 unraid kernel: tcp_recvmsg+0x7a0/0x7b8
Aug 18 16:54:36 unraid kernel: ? do_iter_readv_writev+0xe4/0x105
Aug 18 16:54:36 unraid kernel: inet_recvmsg+0x7c/0x8a
Aug 18 16:54:36 unraid kernel: SyS_recvfrom+0xb1/0xfa
Aug 18 16:54:36 unraid kernel: ? do_writev+0x53/0xb1
Aug 18 16:54:36 unraid kernel: ? do_writev+0x53/0xb1
Aug 18 16:54:36 unraid kernel: do_syscall_64+0x6d/0xfe
Aug 18 16:54:36 unraid kernel: entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Aug 18 16:54:36 unraid kernel: RIP: 0033:0x1466f955a38a
Aug 18 16:54:36 unraid kernel: RSP: 002b:00007ffe181569d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002d
Aug 18 16:54:36 unraid kernel: RAX: ffffffffffffffda RBX: 0000000000abf450 RCX: 00001466f955a38a
Aug 18 16:54:36 unraid kernel: RDX: 0000000000001d8e RSI: 0000000000b30312 RDI: 0000000000000016
Aug 18 16:54:36 unraid kernel: RBP: 0000000000001d8e R08: 0000000000000000 R09: 0000000000000000
Aug 18 16:54:36 unraid kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000b30312
Aug 18 16:54:36 unraid kernel: R13: 0000000000afaf50 R14: 0000000000a2d248 R15: 0000000000a5f6c8
Aug 18 16:54:36 unraid kernel: Code: ff 48 85 ff 74 1d 8b 87 dc 00 00 00 ff c8 75 02 eb 0f f0 ff 8f dc 00 00 00 0f 88 39 83 12 00 75 02 eb cc c3 48 85 ff 75 01 c3 53 <48> 8b 1f e8 ce ff ff ff 48 85 db 48 89 df 75 f0 5b c3 65 8b 05 
Aug 18 16:54:36 unraid kernel: RIP: kfree_skb_list+0x7/0x19 RSP: ffffc90001cdbd20
Aug 18 16:54:36 unraid kernel: ---[ end trace a0f8bbb6eb993e0e ]---


As the log shows, the fault is thrown about 30 seconds after I open the SSH shell, and that is when the web UI becomes unavailable. I have attached the anonymized diagnostics from the web tool, obviously captured before the inevitable lockout. I looked for solutions, but I am seeing a lot of noise from other distros hitting this issue over the last few months and no resolutions. Any suggestions?
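One detail in the trace can be checked mechanically. The bytes marked `<48> 8b 1f` in the Code line decode to `mov (%rdi),%rbx`, and RDI holds 447b1011dd35fbed. On x86-64, dereferencing a non-canonical address (one whose upper bits are not a sign extension of bit 47) raises a general protection fault rather than a page fault. The Python sketch below assumes 48-bit virtual addressing, which the log does not state outright, and `is_canonical` is a name made up for illustration.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if addr is a canonical x86-64 virtual address.

    Assumes va_bits-bit virtual addressing (48 is an assumption here):
    bits 63..(va_bits - 1) must be all zeros or all ones.
    """
    upper = addr >> (va_bits - 1)             # bits 63..47 when va_bits == 48
    all_ones = (1 << (64 - va_bits + 1)) - 1  # 17 one-bits when va_bits == 48
    return upper == 0 or upper == all_ones


# Register values copied from the trace above.
registers = {
    "RDI": 0x447B1011DD35FBED,  # pointer being dereferenced when the fault hit
    "RBX": 0xFFFF8803FC07DC00,  # an ordinary-looking kernel pointer, for comparison
}

for name, value in registers.items():
    print(f"{name} = {value:#018x} canonical: {is_canonical(value)}")
```

RDI fails the check while RBX passes, which is consistent with a corrupted pointer reaching `kfree_skb_list`. It does not by itself say whether the corruption came from hardware or software.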

unraid-diagnostics-20180818-1653.zip

The first pass of memtest completed with no issues, but I'll let it run a few more hours today. I would not expect memory to be the issue here, as launching in safe mode does not exhibit the problem. The error only appears after the system is fully online, which leads me to believe plugins are the likely culprit.

It appears I'm a version behind on BIOS updates for my board, so I'm going to give that a try, but I still suspect the issue is entirely runtime and specifically caused by one of the plugins.

I do not remember specifically, but generally just a few tools such as iperf and others that I have needed for testing. I do not believe I'm using anything from Nerd Pack that actively runs, though I suppose a plugin could be using it indirectly. Is there a way to check from safe mode?

nerd pack

git-2.14.1-x86_64-1.txz
iperf-3.1.6-x86_64-1cf.txz
perl-5.26.1-x86_64-4.txz
unrar-5.4.5-x86_64-1.txz


Flashed the BIOS and let the system come up and idle for a bit while running tail on the syslog. Not seeing the error pop up yet. If it goes a while longer without throwing the fault, I'll try bringing my array back online.

Brought my array up in maintenance mode, which seemed to work fine. Attempting to stop the array caused a fault, which appeared to trigger diagnostics, which then threw another fault. I could see this happen on the monitor, but my syslog tail stalled as SSH appeared to die. I pulled the USB drive, but the syslog in the diagnostics zips ends at the point where the diagnostics were collected and does not show the faults I saw printed on the monitor.

Any suggestions of what I should try next?

I decided to try running a live USB of Lubuntu to test the rig. Using Phoronix, I was able to run through most of the compiler benchmark (everything that could be easily installed) and it completed without issue, so my USB controller, CPU, and RAM all appear to be in working order. Using gparted, I was able to look at all my disks and they all showed up nominally, so no obvious issue there. Maybe my unRAID USB drive has gone bad? Any suggestions on what steps I should take next, such as doing a clean unRAID install to my USB drive?

I gave it a try, copying over only the array config, and ended up with the same issue when trying to use the array. I found others hitting my issue with btrfs on other operating systems, so I'm in the process of moving my data off the array so I can try a different filesystem for it.

1 hour ago, tencent said:

I'm in the process of moving my data off the array so I can try a different filesystem for the array.

 

Give XFS a try. That's what most unRAID users use and it's the default. Don't waste your time with ReiserFS as it's only there for legacy support.

 

Do you have a syslog that actually includes the general protection fault? The one in the diagnostics you uploaded doesn't as it stops at 16:53:17, 46 seconds before the snippet you pasted into your OP. I'd like to take a look at that missing 46 seconds to see if there's any clue there.

 

@John_M The machine ended up running into it again. I killed apcupsd, and a few minutes later it threw the fault; I was able to cat the log to the boot directory before the system locked up. Interestingly, it took significantly longer than usual for the system to lock up, so detaching all the drives improved the uptime to a few minutes instead of the previous ~1 minute.

syslog
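Since the faults keep getting lost when the system locks up, one option is to mirror the syslog to the flash drive continuously instead of copying it by hand at the last moment. The following is only a rough Python sketch: it assumes Python is available on the server (it is not part of a stock unRAID install), that the live log is `/var/log/syslog`, and that the flash is mounted at `/boot`. The function names and the destination file name are hypothetical.

```python
import os
import time


def mirror_log(src: str, dst: str, offset: int = 0) -> int:
    """Append everything in src from byte offset onward to dst.

    Returns the new offset so the caller can resume from there.
    """
    with open(src, "rb") as fin:
        fin.seek(offset)
        chunk = fin.read()
        new_offset = fin.tell()
    if chunk:
        with open(dst, "ab") as fout:
            fout.write(chunk)
            fout.flush()
            os.fsync(fout.fileno())  # push the data to the device before a possible lockup
    return new_offset


def follow(src: str, dst: str, interval: float = 1.0) -> None:
    """Poll src forever, mirroring newly written data to dst."""
    offset = 0
    while True:
        offset = mirror_log(src, dst, offset)
        time.sleep(interval)


# Example call (both paths are assumptions, adjust to the actual system):
# follow("/var/log/syslog", "/boot/syslog-mirror.txt")
```

The `fsync` after each write is the point of the exercise: lines that are only buffered in memory are exactly the ones that disappear when the kernel faults.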

@tencent

Well done for catching the log before it was lost! You're correct about the USB cable to your UPS becoming unplugged. It looks like the daemon respawned and got upset by the pre-existing lock file. Nothing to do with your problem but curious that it didn't handle the situation more gracefully.

 

The only thing I can see that troubles me is this:

Aug 21 15:09:55 unraid kernel: mm/pgtable-generic.c:40: bad pmd ffff88042a39a000(0001000000000000)
Aug 21 15:09:56 unraid kernel: BUG: Bad rss-counter state mm:ffff88042913fb40 idx:1 val:3
Aug 21 15:09:56 unraid kernel: BUG: non-zero nr_ptes on freeing mm: 1
Aug 21 15:11:20 unraid kernel: general protection fault: 0000 [#1] SMP PTI

If it is a kernel bug that caused your crash, it's curious that it took over a minute to do it. Still, it's in the memory management department, and that's consistent with a general protection fault.
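To put a number on "over a minute", the gap between the first warning and the fault can be pulled out of the syslog with a few lines of Python. This is a minimal sketch: the marker strings are simply the ones visible in the excerpt above, the year is an assumption (syslog timestamps omit it; 2018 is taken from the diagnostics file name), and `kernel_alerts` is a made-up helper name.

```python
import re
from datetime import datetime

# "Mon DD HH:MM:SS host kernel: message"
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+(.*)$")
MARKERS = ("bad pmd", "BUG:", "general protection fault")


def kernel_alerts(lines, year=2018):
    """Yield (timestamp, message) for kernel lines containing a marker.

    The year is an assumption because syslog timestamps do not carry one.
    """
    for line in lines:
        m = LINE_RE.match(line)
        if m and any(marker in m.group(2) for marker in MARKERS):
            ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            yield ts, m.group(2)


sample = """\
Aug 21 15:09:55 unraid kernel: mm/pgtable-generic.c:40: bad pmd ffff88042a39a000(0001000000000000)
Aug 21 15:09:56 unraid kernel: BUG: Bad rss-counter state mm:ffff88042913fb40 idx:1 val:3
Aug 21 15:09:56 unraid kernel: BUG: non-zero nr_ptes on freeing mm: 1
Aug 21 15:11:20 unraid kernel: general protection fault: 0000 [#1] SMP PTI
"""

alerts = list(kernel_alerts(sample.splitlines()))
gap = alerts[-1][0] - alerts[0][0]
print(f"{len(alerts)} alerts, {gap.total_seconds():.0f} s from first to last")
# prints: 4 alerts, 85 s from first to last
```

Run against a full syslog rather than the four-line sample, the same function would also show whether the "bad pmd" warning precedes every fault or only this one.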

 

How long have you been running version 6.5.3 of unRAID? I'm wondering if it's very recently or if your server has run that version successfully for at least a while.

 

There isn't a newer version to try - not even a beta or rc - so would you be interested in downgrading just to see if it makes a difference? Assuming you updated using the normal method you should be able to revert to your previous version (and a different kernel) simply by copying all the bz* files in the previous folder on your boot flash and pasting them into the root of your boot flash so that they replace the newer files of the same name. There's a GUI method (Tools -> Update OS -> Restore) but I wouldn't want your server to crash part way through the process, so better to plug the flash into a different PC.
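For anyone who would rather script the manual copy on the other PC, here is a hedged Python sketch of the step described above: copy every `bz*` file from the `previous` folder into the root of the flash drive. The mount point in the example call is hypothetical, `restore_previous` is a made-up name, and the function defaults to a dry run so nothing is overwritten until the file list has been checked.

```python
import glob
import os
import shutil


def restore_previous(flash_root: str, dry_run: bool = True) -> list:
    """Copy bz* files from <flash_root>/previous into <flash_root>.

    Returns the names of the files that were (or, in a dry run, would be)
    copied. Existing files of the same name in the root are overwritten.
    """
    copied = []
    pattern = os.path.join(flash_root, "previous", "bz*")
    for src in sorted(glob.glob(pattern)):
        if not os.path.isfile(src):
            continue  # skip anything that is not a regular file
        name = os.path.basename(src)
        if not dry_run:
            shutil.copy2(src, os.path.join(flash_root, name))
        copied.append(name)
    return copied


# Example call (the mount point is an assumption, use wherever the flash is mounted):
# print(restore_previous("/media/UNRAID"))                 # list only
# print(restore_previous("/media/UNRAID", dry_run=False))  # actually copy
```

Keeping a copy of the current `bz*` files somewhere else first makes it easy to go back to 6.5.3 if the downgrade changes nothing.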

Archived

This topic is now archived and is closed to further replies.
