[SOLVED-WORK-AROUND] ReiserFS - kernel panic and stall on CPU

razorslinky · October 14, 2014

Other threads reporting the same type of issue. This will be updating randomly as I search through the forums:

http://lime-technology.com/forum/index.php?topic=35357.0

Syslog is too large to attach

Lately it appears that Unraid has been freezing multiple times a day since I've updated to 6.0-beta10a. I never had any freezing issues while I was on 6.0-beta6. It usually appears to freeze after the mover script has run but that's not always the case. I've ran a reiserfsck --check /dev/md(1-13) and only one drive needed to be repaired so I don't think it's an issue with the drives.

At first I thought it was BTRFS on the cache drive since it was reporting that the drive was full even though there was 60GB of 120GB free so I copied everything off of it and formatted the drive to XFS as I had to keep scrubbing the drive.

I believe it has to do with Docker because all of my XBMC machines are unable to access the MariaDB database or play video but I am able to browse /mnt/user via CLI but unable to access the GUI. I will be moving SQL to it's own VM just in case it's MariaDB causing the issue as I've recieved kernel panics saying unraid ran out of memory.

Here's my only manual entry in the Go file. I've also deleted these lines just to verify if these settings were causing any issues but unraid would still randomly freeze.

# OOM daemon trick : do not kill unRaid WebUI and SMB shares too easily.

pgrep -f "/usr/local/sbin/emhttp" | while read PID; do echo -1000 > /proc/$PID/oom_score_adj; done

pgrep -f "/usr/local/sbin/shfs" | while read PID; do echo -1000 > /proc/$PID/oom_score_adj; done

pgrep -f "/usr/sbin/smbd" | while read PID; do echo -1000 > /proc/$PID/oom_score_adj; done

Here's the first 6 lines of the kernel panics after Mover has been started:

Oct 14 03:04:25 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 3} (t=6001 jiffies g=4020303 c=4020302 q=18721)

Oct 14 03:04:25 Tower kernel: sending NMI to all CPUs:

Oct 14 03:04:25 Tower kernel: NMI backtrace for cpu 3

Oct 14 03:04:25 Tower kernel: CPU: 3 PID: 32325 Comm: shfs Not tainted 3.16.3-unRAID #3

Oct 14 03:04:25 Tower kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/14/2014

Oct 14 03:04:25 Tower kernel: task: ffff880139b94ec0 ti: ffff880011e50000 task.ti: ffff880011e50000

The nzbget segfaults have been fixed in 1144 per this thread. I just now updated to the newest version so nzbget should no longer segfault.

http://nzbget.net/forum/viewtopic.php?f=3&t=1536

Hardware:

Motherboard: GA-990FXA-UD3 (BIOS has been updated to the newest version F10d. I was on F6 months ago)

http://www.gigabyte.com/products/product-page.aspx?pid=3996#bios

CPU: AMD Phenom II X4 955

RAM: 16GB DDR3

ESXi 5.5 Update 2 (2068190)

Both SAS cards are in pass-through mode:

SAS1: LSI1068E

SAS2: 88SE9485

CPU: 2 virtual sockets, 2 cores per socket (Shares: Normal, Unlimited Limit, No reservation)

Memory: 4GB (Shares: High, Unlimited Limit 4096 reservation)

Docker images:

Couchpotato

MariaDB

NZBGet

I have two other VMs (XBMC/Plex/Mediabrowser server and Server2012R2) running that have never froze or had issues. I can provide any ESXi screenshots/information if needed.

jonp · October 14, 2014

So are you running unRAID as a virtual machine under VM Ware?

razorslinky · October 14, 2014

So are you running unRAID as a virtual machine under VM Ware?

Yes I am.

razorslinky · October 14, 2014

Just ran mover via the GUI and it did not freeze until I ran "git update"

Here's the following kernel panic:

https://paste.ee/p/ew93C

Oct 14 15:06:14 Tower logger: mover finished

Oct 14 15:10:42 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 2} (t=6000 jiffies g=751905 c=751904 q=20375)

Oct 14 15:10:42 Tower kernel: sending NMI to all CPUs:

Oct 14 15:10:42 Tower kernel: NMI backtrace for cpu 2

Oct 14 15:10:42 Tower kernel: CPU: 2 PID: 4096 Comm: shfs Not tainted 3.16.3-unRAID #3

Oct 14 15:10:42 Tower kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/14/2014

Oct 14 15:10:42 Tower kernel: task: ffff880139166660 ti: ffff8800a0300000 task.ti: ffff8800a0300000

Both GUI and SMB are not accessible. Running "ls /mnt/user" froze while I am in via putty froze the terminal. The vmware console is still accessible.

I've attached two screenshots of htop and top running.

I'm not sure if this is a unraid issue or esxi issue as it appears to be pointing to the CPU...

sureguy · October 14, 2014

I also have freezing issues with unraid running as an esxi 5.5 guest under beta9 and beta 10a. I've reverted to 6beta6 as it doesn't exhibit the same behaviour.

Here are some logs from 6beta9 - I can reproduce on 6beta10a if necessary:

http://lime-technology.com/forum/index.php?topic=35357.msg330154#msg330154

razorslinky · October 16, 2014

I also have freezing issues with unraid running as an esxi 5.5 guest under beta9 and beta 10a. I've reverted to 6beta6 as it doesn't exhibit the same behaviour.

Here are some logs from 6beta9 - I can reproduce on 6beta10a if necessary:

http://lime-technology.com/forum/index.php?topic=35357.msg330154#msg330154

Interesting... I'm going to test some theories over the weekend such as moving from ESXi to Xen 4.4.1 and turning off Docker completely.

For now, I've posted under the beta10a Announcements board so I can grab Limetech's attention to this:

http://lime-technology.com/forum/index.php?topic=35493.msg333743#msg333743

sureguy,

When running beta10a and it unraid froze, was your CPU usage extremely high and it did randomly freeze or did it freeze when/after the mover script ran? Vsphere was reporting that it was using 100% of the CPU.

sureguy · October 16, 2014

sureguy,

When running beta10a and it unraid froze, was your CPU usage extremely high and it did randomly freeze or did it freeze when/after the mover script ran? Vsphere was reporting that it was using 100% of the CPU.

CPU usage was very high. On unraid the load average was over 100 at one point. Vsphere was reporting 100% cpu usage as well.

It wasn't freezing when I was running mover, I was actually running bitrot.sh which adds file hashes to the extended attributes of files.

WeeboTech · October 16, 2014

sureguy,

When running beta10a and it unraid froze, was your CPU usage extremely high and it did randomly freeze or did it freeze when/after the mover script ran? Vsphere was reporting that it was using 100% of the CPU.

CPU usage was very high. On unraid the load average was over 100 at one point. Vsphere was reporting 100% cpu usage as well.

It wasn't freezing when I was running mover, I was actually running bitrot.sh which adds file hashes to the extended attributes of files.

Load average of 100 means there were allot of processes waiting to run.

In that case, something may have went haywire and fired off allot of parallel processes.

bitrot will spike the CPU, but it should not spike it that high.

mover will also make the cpu slightly busy, but not with a load average of 100.

I would suggest you do a ps -ef when the load average is that high to see what else is going on.

A bad file name could accidentally spawn processes if the shells that utilize them are not quoting them.

I'm not saying this is the problem, but some detective work could point to the sort of such a high load average.

StevenD · October 16, 2014

FYI...

I've been running unRAID as a guest on ESXi for well over a year now without so much as a hiccup. Currently running ESXi 5.5U2 and unRAID 6-beta10a.

RobJ · October 16, 2014

Some ideas (from someone who is admittedly not very knowledgeable here, as to VM's etc) -

* the kernel stalls/issues occur in exactly 3 minute intervals

* one occurred near the end of a large file copy as part of a mover transfer, the other occurred several minutes after the mover was finished

* they generally always appear to begin with a CPU stall, on an Arch system interrupt, that occurs during Reiser file system management

* you don't mention anything about Arch, but it looks like Arch is involved (to this not very Arch or VM experienced observer); would it be possible to avoid anything Arch related and test again, just to see how the behavior changes?

* you mention having 16GB, but this syslog only shows 4GB available; some of the behavior makes me wonder if there was memory pressure involved, some need for VMware to reorganize the memory in the background; would it be possible to increase the memory to 6GB or 8GB and see if it makes any difference?

Scratch them, see http://lime-technology.com/forum/index.php?topic=35788.msg333810#msg333810.

* we like syslogs zipped and attached, they are highly compressible, yours zipped to 83KB

WeeboTech · October 16, 2014

After looking closer at the syslog (thanks for pointing that out RobJ). Could there be some kind of conflict with ESX unRAID as a guest and dockers?

I noticed that 4 cpus are allocated to the unRAID system.

I remember reading years ago, that vmware will pause a VM if it requires multiple cores and the other cores are busy elsewhere.

Can't remember where I read that though.

Perhaps you could allocated only 2 cores to this virtual machine and see how things work out.

jonp · October 16, 2014

Issues with running unRAID 6 as a guest can be tricky to diagnose. You're really nesting virtualization this way and could create issues.

sureguy · October 16, 2014

Issues with running unRAID 6 as a guest can be tricky to diagnose. You're really nesting virtualization this way and could create issues.

If I don't have any dockers, am I still nesting virtualization?

jonp · October 16, 2014

Issues with running unRAID 6 as a guest can be tricky to diagnose. You're really nesting virtualization this way and could create issues.

If I don't have any dockers, am I still nesting virtualization?

So is docker even running as a service?

What boot mode for unRAID are you loading?

sureguy · October 16, 2014

Issues with running unRAID 6 as a guest can be tricky to diagnose. You're really nesting virtualization this way and could create issues.

If I don't have any dockers, am I still nesting virtualization?

So is docker even running as a service?

What boot mode for unRAID are you loading?

Default boot mode. I was able to trigger the issue in safe mode with the default go script. I don't know if docker is run when safe mode is selected.

jonp · October 16, 2014

Hmm. If you boot unRAID as the host, do you have any issues?

sureguy · October 16, 2014

Hmm. If you boot unRAID as the host, do you have any issues?

I was able to trigger a crash running as a host. It seemed to take longer running for the crash to happen.

jonp · October 16, 2014

Hmm. If you boot unRAID as the host, do you have any issues?

I was able to trigger a crash running as a host. It seemed to take longer running for the crash to happen.

Hmm, can you share a syslog running unRAID as host for the crash? That would help track the issue.

sureguy · October 16, 2014

Hmm. If you boot unRAID as the host, do you have any issues?

I was able to trigger a crash running as a host. It seemed to take longer running for the crash to happen.

Hmm, can you share a syslog running unRAID as host for the crash? That would help track the issue.

I'm away from home right now, but will do that when I'm home.

jonp · October 16, 2014

Awesome! Thanks man! Syslogs are the key...

RobJ · October 16, 2014

I was going to reply over in sureguy's thread, where he had posted syslogs, but seemed better to move it here.

Attached is the last backup of the syslog taken about 20 minutes before totally giving up (before the manual shutdown procedure).

Because razorslinky had linked to you as someone with the same issue, I took a look at your syslogs of 9-22 and 9-27. Your issues did not sound to me to be the same, so I wasn't expecting anything. I then noticed that the 9-22 syslog was using VMware with 4GB, so thought it *could* be similar, but discovered it quit with a 'kernel Null pointer dereference' bug, which seems very different. I didn't expect anything from the 9-27 syslog, because it had no VMware, used 32GB, and had unraidsafemode set, but surprise surprise there it was, exactly like razorslinky's series of CPU stalls and kernel issues, each 3 minutes apart. Some differences, there is no mover involved, so that rules out the mover as a cause, and no VMware and 32GB of memory, so that rules VMware and low memory out. I'll report this over there.

What's constant:

* begins with CPU stall

* an NMI is sent to all CPU's, resulting in all CPU's reporting back with an 'NMI backtrace'

* first CPU to report back is 3, usually the only active one, same for both razorslinky and sureguy

* kernel module involved with CPU 3 is shfs, others later are mostly the swapper module

* code trace begins with timer interrupt calls, usually identical or very similar to:

Sep 27 04:36:33 phatstore kernel: <IRQ>
Sep 27 04:36:33 phatstore kernel: [<ffffffff8102feaf>] ? arch_trigger_all_cpu_backtrace+0xc4/0xcd

Sep 27 04:36:33 phatstore kernel: [<ffffffff8107d136>] rcu_check_callbacks+0x1e3/0x503

Sep 27 04:36:33 phatstore kernel: [<ffffffff8108511c>] ? tick_sched_handle+0x34/0x34

Sep 27 04:36:33 phatstore kernel: [<ffffffff810493ee>] update_process_times+0x38/0x60

Sep 27 04:36:33 phatstore kernel: [<ffffffff8108511a>] tick_sched_handle+0x32/0x34

Sep 27 04:36:33 phatstore kernel: [<ffffffff81085151>] tick_sched_timer+0x35/0x53

Sep 27 04:36:33 phatstore kernel: [<ffffffff8105adfc>] __run_hrtimer.isra.28+0x57/0xb0

Sep 27 04:36:33 phatstore kernel: [<ffffffff8105b2d5>] hrtimer_interrupt+0xd9/0x1c0

Sep 27 04:36:33 phatstore kernel: [<ffffffff8102e7bc>] local_apic_timer_interrupt+0x4f/0x52

Sep 27 04:36:33 phatstore kernel: [<ffffffff8102eb8e>] smp_apic_timer_interrupt+0x3a/0x4b

Sep 27 04:36:33 phatstore kernel: [<ffffffff815dfc1d>] apic_timer_interrupt+0x6d/0x80

Sep 27 04:36:33 phatstore kernel: <EOI>

* the first line above involved arch_trigger_all_cpu_backtrace, probably the 3 minute scheduled call to log all CPU back traces (arch is probably not Arch, perhaps architectural?)

* it then continues with Reiser file management function calls, only small variations but very similar to:

Sep 27 04:36:33 phatstore kernel: [<ffffffff811432d5>] ? __discard_prealloc+0x9b/0xb1
Sep 27 04:36:33 phatstore kernel: [<ffffffff8114334e>] reiserfs_discard_all_prealloc+0x43/0x4c

Sep 27 04:36:33 phatstore kernel: [<ffffffff8115f682>] do_journal_end+0x4e1/0xc57

Sep 27 04:36:33 phatstore kernel: [<ffffffff81160352>] journal_end+0xad/0xb4

Sep 27 04:36:33 phatstore kernel: [<ffffffff81162801>] reiserfs_xattr_set+0xd2/0x114

Sep 27 04:36:33 phatstore kernel: [<ffffffff8116311b>] user_set+0x3f/0x4d

Sep 27 04:36:33 phatstore kernel: [<ffffffff81162bb2>] reiserfs_setxattr+0x9b/0xa9

Sep 27 04:36:33 phatstore kernel: [<ffffffff8110b981>] __vfs_setxattr_noperm+0x69/0xd5

Sep 27 04:36:33 phatstore kernel: [<ffffffff8110ba69>] vfs_setxattr+0x7c/0x99

Sep 27 04:36:33 phatstore kernel: [<ffffffff8110bb9e>] setxattr+0x118/0x162

Sep 27 04:36:33 phatstore kernel: [<ffffffff810f83b4>] ? final_putname+0x2f/0x32

Sep 27 04:36:33 phatstore kernel: [<ffffffff810fc0f5>] ? user_path_at_empty+0x60/0x87

Sep 27 04:36:33 phatstore kernel: [<ffffffff810f1b41>] ? __sb_start_write+0x99/0xcd

Sep 27 04:36:33 phatstore kernel: [<ffffffff81107da1>] ? __mnt_want_write+0x43/0x4a

Sep 27 04:36:33 phatstore kernel: [<ffffffff8110bd92>] SyS_setxattr+0x69/0xab

Sep 27 04:36:33 phatstore kernel: [<ffffffff815dee29>] system_call_fastpath+0x16/0x1b

* this repeats every 3 minutes for CPU 3, so it appears to be stuck!

* then all of the rest of the CPU's report in, not in order, usually just idling so just variations on idling code calls

* once for razorslinky, it was nzbget working (CPU 2)

* it then pauses exactly 3 minutes before repeating the same sequence

I then happened to notice that the Call Trace (see below) for the 'kernel Null pointer dereference' bug of the 9-22 syslog was EXACTLY identical to one of the Reiser file system code traces of the 9-27 syslog! I mean EXACTLY, down to the code addresses, function names, and code offsets! So it's just a variation of the issue above.

Sep 22 11:58:55 phatstore kernel: [<ffffffff8114334e>] reiserfs_discard_all_prealloc+0x43/0x4c
Sep 22 11:58:55 phatstore kernel: [<ffffffff8115f682>] do_journal_end+0x4e1/0xc57

Sep 22 11:58:55 phatstore kernel: [<ffffffff81160352>] journal_end+0xad/0xb4

Sep 22 11:58:55 phatstore kernel: [<ffffffff81162801>] reiserfs_xattr_set+0xd2/0x114

Sep 22 11:58:55 phatstore kernel: [<ffffffff8116311b>] user_set+0x3f/0x4d

Sep 22 11:58:55 phatstore kernel: [<ffffffff81162bb2>] reiserfs_setxattr+0x9b/0xa9

Sep 22 11:58:55 phatstore kernel: [<ffffffff8110b981>] __vfs_setxattr_noperm+0x69/0xd5

Sep 22 11:58:55 phatstore kernel: [<ffffffff8110ba69>] vfs_setxattr+0x7c/0x99

Sep 22 11:58:55 phatstore kernel: [<ffffffff8110bb9e>] setxattr+0x118/0x162

Sep 22 11:58:55 phatstore kernel: [<ffffffff810f83b4>] ? final_putname+0x2f/0x32

Sep 22 11:58:55 phatstore kernel: [<ffffffff810fc0f5>] ? user_path_at_empty+0x60/0x87

Sep 22 11:58:55 phatstore kernel: [<ffffffff810f1b41>] ? __sb_start_write+0x99/0xcd

Sep 22 11:58:55 phatstore kernel: [<ffffffff81107da1>] ? __mnt_want_write+0x43/0x4a

Sep 22 11:58:55 phatstore kernel: [<ffffffff8110bd92>] SyS_setxattr+0x69/0xab

Sep 22 11:58:55 phatstore kernel: [<ffffffff815dee29>] system_call_fastpath+0x16/0x1b

I have to wonder if we aren't dealing with another Reiser bug here, due to the recent Reiser code 'refactoring'. My gut feeling here though is this could be a kernel bug, which will be fixed by a later patch, Linux kernel 3.16.4 or later?

It looks like the CPU stall triggers a debug state where the kernel logs exactly what each CPU is doing every 3 minutes. Perhaps this is configured somewhere? It does allow us to see that most of the CPU's are almost always idling, but CPU 3 is stuck in Reiser code.

WeeboTech · October 17, 2014

Sep 27 04:36:33 phatstore kernel: [<ffffffff81162801>] reiserfs_xattr_set+0xd2/0x114
Sep 27 04:36:33 phatstore kernel: [<ffffffff8116311b>] user_set+0x3f/0x4d
Sep 27 04:36:33 phatstore kernel: [<ffffffff81162bb2>] reiserfs_setxattr+0x9b/0xa9
Sep 27 04:36:33 phatstore kernel: [<ffffffff8110b981>] __vfs_setxattr_noperm+0x69/0xd5
Sep 27 04:36:33 phatstore kernel: [<ffffffff8110ba69>] vfs_setxattr+0x7c/0x99
Sep 27 04:36:33 phatstore kernel: [<ffffffff8110bb9e>] setxattr+0x118/0x162

Maybe there's another reiserfs bug in the extended attributes code.

Something that bitrot exercises allot. I.E. It puts data into the extended attributes.

I know the rsync does a copy with extended attributes also, but there shouldn't be all that much in the cache extended attributes.

Unless purposely put there. i.e. bitrot or AFP(I believe)

razorslinky · October 17, 2014

Thanks everyone for taking a look into this issue. I really appreciate the help and support.

RobJ: I'll zip and attach the syslog next time unraid freezes, which should be a little after the mover script is ran. I'll keep uploading the syslogs as unraid freezes just to help narrow down the issue.

WeeboTech: I'll run the "ps -ef" command when unraid freezes, zip it up and attach it here too.

StevenD: how many CPU/Cores & RAM do you have assigned to unraid? I want to match my settings to yours just to see if it makes any difference.

Let me know if there's anything that you want me to test, even if it's a beta kernel. I don't mind living on the bleeding

edge.

AFP and NFS have been turned off and I've reassigned all CPUs on ESXi to be one less than the max available.

StevenD · October 17, 2014

razorslinky · October 17, 2014

Unraid froze again during the mover script. Here are the syslogs and "ps -ef" logs.

Looks like CPU & Reiserfs causing the kernel panics:

Oct 17 03:20:43 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 1} (t=6000 jiffies g=1676792 c=1676791 q=19435)

Oct 17 03:20:43 Tower kernel: sending NMI to all CPUs:

Oct 17 03:20:43 Tower kernel: NMI backtrace for cpu 1

Oct 17 03:20:43 Tower kernel: CPU: 1 PID: 15800 Comm: shfs Not tainted 3.16.3-unRAID #3

Oct 17 03:20:43 Tower kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/14/2014

Oct 17 03:20:43 Tower kernel: task: ffff8800a7e156a0 ti: ffff88009a048000 task.ti: ffff88009a048000

Oct 17 03:20:43 Tower kernel: RIP: 0010:[<ffffffff81032618>] [<ffffffff81032618>] flat_send_IPI_mask+0x77/0x86

Oct 17 03:20:43 Tower kernel: [<ffffffff81143592>] ? reiserfs_discard_all_prealloc+0x37/0x4c

Oct 17 03:20:43 Tower kernel: [<ffffffff8114359e>] ? reiserfs_discard_all_prealloc+0x43/0x4c

Oct 17 03:20:43 Tower kernel: [<ffffffff8115f8d2>] do_journal_end+0x4e1/0xc57

Oct 17 03:20:43 Tower kernel: [<ffffffff811605a2>] journal_end+0xad/0xb4

Oct 17 03:20:43 Tower kernel: [<ffffffff81150eca>] reiserfs_dirty_inode+0x6c/0x7c

Oct 17 03:20:43 Tower kernel: [<ffffffff81095450>] ? from_kuid+0x9/0xb

Oct 17 03:20:43 Tower kernel: [<ffffffff8110dfb2>] __mark_inode_dirty+0x2f/0x1de

Oct 17 03:20:43 Tower kernel: [<ffffffff8114cd1a>] reiserfs_setattr+0x262/0x294

Oct 17 03:20:43 Tower kernel: [<ffffffff810f1b45>] ? __sb_start_write+0x99/0xcd

Oct 17 03:20:43 Tower kernel: [<ffffffff810fc109>] ? user_path_at_empty+0x60/0x87

Oct 17 03:20:43 Tower kernel: [<ffffffff81105492>] notify_change+0x1dd/0x2d1

Oct 17 03:20:43 Tower kernel: [<ffffffff81112014>] utimes_common+0x114/0x174

Oct 17 03:20:43 Tower kernel: [<ffffffff8111215f>] do_utimes+0xeb/0x123

Oct 17 03:20:43 Tower kernel: [<ffffffff811122ff>] SyS_futimesat+0x8a/0xa5

Oct 17 03:20:43 Tower kernel: [<ffffffff8111232e>] SyS_utimes+0x14/0x16

Oct 17 03:20:43 Tower kernel: [<ffffffff815df469>] system_call_fastpath+0x16/0x1b

syslog-10-17-14.zip

ps-10-17-14.zip

[SOLVED-WORK-AROUND] ReiserFS - kernel panic and stall on CPU

Recommended Posts

Link to comment

Top Posters In This Topic

Popular Days

Top Posters In This Topic

Popular Days

Posted Images

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Join the conversation