[SOLVED-WORK-AROUND] ReiserFS - kernel panic and stall on CPU



Can you temporarily turn off the dockers and run the mover script?

 

It really does seem that something is interfering with the CPU and/or reiserfs.

Are you using pass through for the controller?

 

Is there anything in the esx logs pertaining to the disks?

 

Did you notice all those SMBD's running?

Link to comment

Unraid froze again during the mover script. Here are the syslogs and "ps -ef" logs.

 

Looks like the CPU and ReiserFS are causing the kernel panics:

 

Oct 17 03:20:43 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 1}  (t=6000 jiffies g=1676792 c=1676791 q=19435)

Oct 17 03:20:43 Tower kernel: sending NMI to all CPUs:

Oct 17 03:20:43 Tower kernel: NMI backtrace for cpu 1

Oct 17 03:20:43 Tower kernel: CPU: 1 PID: 15800 Comm: shfs Not tainted 3.16.3-unRAID #3

Oct 17 03:20:43 Tower kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/14/2014

Oct 17 03:20:43 Tower kernel: task: ffff8800a7e156a0 ti: ffff88009a048000 task.ti: ffff88009a048000

Oct 17 03:20:43 Tower kernel: RIP: 0010:[<ffffffff81032618>]  [<ffffffff81032618>] flat_send_IPI_mask+0x77/0x86

 

Oct 17 03:20:43 Tower kernel: [<ffffffff81143592>] ? reiserfs_discard_all_prealloc+0x37/0x4c

Oct 17 03:20:43 Tower kernel: [<ffffffff8114359e>] ? reiserfs_discard_all_prealloc+0x43/0x4c

Oct 17 03:20:43 Tower kernel: [<ffffffff8115f8d2>] do_journal_end+0x4e1/0xc57

Oct 17 03:20:43 Tower kernel: [<ffffffff811605a2>] journal_end+0xad/0xb4

Oct 17 03:20:43 Tower kernel: [<ffffffff81150eca>] reiserfs_dirty_inode+0x6c/0x7c

Oct 17 03:20:43 Tower kernel: [<ffffffff81095450>] ? from_kuid+0x9/0xb

Oct 17 03:20:43 Tower kernel: [<ffffffff8110dfb2>] __mark_inode_dirty+0x2f/0x1de

Oct 17 03:20:43 Tower kernel: [<ffffffff8114cd1a>] reiserfs_setattr+0x262/0x294

Oct 17 03:20:43 Tower kernel: [<ffffffff810f1b45>] ? __sb_start_write+0x99/0xcd

Oct 17 03:20:43 Tower kernel: [<ffffffff810fc109>] ? user_path_at_empty+0x60/0x87

Oct 17 03:20:43 Tower kernel: [<ffffffff81105492>] notify_change+0x1dd/0x2d1

Oct 17 03:20:43 Tower kernel: [<ffffffff81112014>] utimes_common+0x114/0x174

Oct 17 03:20:43 Tower kernel: [<ffffffff8111215f>] do_utimes+0xeb/0x123

Oct 17 03:20:43 Tower kernel: [<ffffffff811122ff>] SyS_futimesat+0x8a/0xa5

Oct 17 03:20:43 Tower kernel: [<ffffffff8111232e>] SyS_utimes+0x14/0x16

Oct 17 03:20:43 Tower kernel: [<ffffffff815df469>] system_call_fastpath+0x16/0x1b

 

That's quite a ps report!  I'll let more experienced people comment, but can't help wondering what that smbd with PID 6712 is doing, and shfs (PID 4233) numbers seem high.

 

You have the same series of stalls, but then they stop with:

Oct 17 03:34:00 Tower kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008

Oct 17 03:34:00 Tower kernel: IP: [<ffffffff81143521>] __discard_prealloc+0x97/0xb1

Oct 17 03:34:00 Tower kernel: PGD 7c1ea067 PUD 7d13a067 PMD 0

Oct 17 03:34:00 Tower kernel: Oops: 0002 [#1] SMP

Oct 17 03:34:00 Tower kernel: Modules linked in: veth xt_nat ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_nat_ipv4 nf_nat iptable_filter ip_tables md_mod mvsas libsas mptsas mptscsih mptbase scsi_transport_sas i2c_piix4 ata_piix

Oct 17 03:34:00 Tower kernel: CPU: 1 PID: 15800 Comm: shfs Not tainted 3.16.3-unRAID #3

Oct 17 03:34:00 Tower kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/14/2014

...

Oct 17 03:34:00 Tower kernel: Call Trace:

Oct 17 03:34:00 Tower kernel: [<ffffffff8114359e>] reiserfs_discard_all_prealloc+0x43/0x4c

Oct 17 03:34:00 Tower kernel: [<ffffffff8115f8d2>] do_journal_end+0x4e1/0xc57

Oct 17 03:34:00 Tower kernel: [<ffffffff811605a2>] journal_end+0xad/0xb4

Oct 17 03:34:00 Tower kernel: [<ffffffff81150eca>] reiserfs_dirty_inode+0x6c/0x7c

Oct 17 03:34:00 Tower kernel: [<ffffffff81095450>] ? from_kuid+0x9/0xb

Oct 17 03:34:00 Tower kernel: [<ffffffff8110dfb2>] __mark_inode_dirty+0x2f/0x1de

Oct 17 03:34:00 Tower kernel: [<ffffffff8114cd1a>] reiserfs_setattr+0x262/0x294

Oct 17 03:34:00 Tower kernel: [<ffffffff810f1b45>] ? __sb_start_write+0x99/0xcd

Oct 17 03:34:00 Tower kernel: [<ffffffff810fc109>] ? user_path_at_empty+0x60/0x87

Oct 17 03:34:00 Tower kernel: [<ffffffff81105492>] notify_change+0x1dd/0x2d1

Oct 17 03:34:00 Tower kernel: [<ffffffff81112014>] utimes_common+0x114/0x174

Oct 17 03:34:00 Tower kernel: [<ffffffff8111215f>] do_utimes+0xeb/0x123

Oct 17 03:34:00 Tower kernel: [<ffffffff811122ff>] SyS_futimesat+0x8a/0xa5

Oct 17 03:34:00 Tower kernel: [<ffffffff8111232e>] SyS_utimes+0x14/0x16

Oct 17 03:34:00 Tower kernel: [<ffffffff815df469>] system_call_fastpath+0x16/0x1b
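
For anyone comparing these traces by eye, here is a minimal sketch that pulls the function names out of backtrace lines like the ones above. The helper name `frames` and the sample text are only for illustration; the sample lines are copied from the trace in this post. Frames printed with a leading "?" are ones the kernel unwinder was not sure about, so the sketch flags them separately.

```python
import re

# Matches backtrace frames such as
#   [<ffffffff8115f8d2>] do_journal_end+0x4e1/0xc57
# An optional "?" before the symbol marks a frame the unwinder is unsure about.
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def frames(log_text):
    """Return a list of (symbol, is_reliable) for every frame in log_text."""
    return [(m.group(2), m.group(1) is None) for m in FRAME.finditer(log_text)]


# Four lines copied from the Oct 17 03:34:00 trace.
SAMPLE = """\
Oct 17 03:34:00 Tower kernel: [<ffffffff8114359e>] reiserfs_discard_all_prealloc+0x43/0x4c
Oct 17 03:34:00 Tower kernel: [<ffffffff8115f8d2>] do_journal_end+0x4e1/0xc57
Oct 17 03:34:00 Tower kernel: [<ffffffff81095450>] ? from_kuid+0x9/0xb
Oct 17 03:34:00 Tower kernel: [<ffffffff8114cd1a>] reiserfs_setattr+0x262/0x294
"""

if __name__ == "__main__":
    for symbol, reliable in frames(SAMPLE):
        print(symbol, "" if reliable else "(uncertain)")
```

Running the same function over the 03:20:43 stall and the 03:34:00 oops should make it easy to check whether both go through the same utimes -> reiserfs_setattr -> journal_end -> reiserfs_discard_all_prealloc path.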

Link to comment


I will be moving the Docker services to a dedicated VM tonight to see if it still freezes once the Docker services are turned off and Mover is run during the early morning.

 

Two controllers are currently in passthrough mode to the unraid vm, so there are no logs that reference the hard drives.

 

I noticed all of the SMBD's running but I'm not sure why so many are running. Any way to find out why?
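
One rough way to start answering that is to count processes per command in the saved "ps -ef" log. This is only a sketch: the function name `count_commands` is made up, and the sample rows below are invented for illustration (only PIDs 4233 and 6712 come from this thread; every other value is a placeholder, not taken from the attached log). On a live system, Samba's own `smbstatus` tool should also show which client each smbd is serving, since smbd normally forks one process per client connection.

```python
from collections import Counter


def count_commands(ps_ef_text):
    """Count processes per executable name in `ps -ef` output.

    Assumes the usual 8-column layout: UID PID PPID C STIME TTY TIME CMD.
    """
    counts = Counter()
    for line in ps_ef_text.splitlines()[1:]:   # skip the header row
        fields = line.split(None, 7)           # CMD is everything after column 7
        if len(fields) == 8:
            exe = fields[7].split()[0]         # drop the command's arguments
            counts[exe.rsplit("/", 1)[-1]] += 1  # keep only the base name
    return counts


# Invented sample rows, NOT copied from the attached ps log.
SAMPLE = """\
UID        PID  PPID  C STIME TTY          TIME CMD
root      4233     1  2 Oct16 ?        00:31:10 /usr/local/sbin/shfs /mnt/user
root      6712  1511  0 Oct16 ?        00:00:05 /usr/sbin/smbd -D
root      6713  1511  0 Oct16 ?        00:00:00 /usr/sbin/smbd -D
"""

if __name__ == "__main__":
    for name, n in count_commands(SAMPLE).most_common():
        print(n, name)
```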

Link to comment
  • 3 weeks later...

Has this issue been resolved?  I'm having pretty much the same issue with my UnRaid VM setup since updating from 5.05 to 6.0b10.

 

Nope. The issue hasn't been resolved. I've moved everything over from Docker to a dedicated VM and disabled the Mover script and Unraid hasn't frozen at all. Once I manually run the Mover script, it freezes and then I have to hard-reboot Unraid from vSphere. I then have to review the files that were moved over because every once in a while there are corrupted files moved from the cache drive to another drive. I just got sick of hearing "XBMC froze again and I can't watch anything" or "My show stopped half way through. You need to redownload it."

 

I'm waiting for the newest version of Unraid and crossing my fingers that it has more ReiserFS fixes in the newest kernel. My only other option is to purchase a 4TB drive to manually copy over and convert one drive at a time to either XFS or EXT4 as I think this is a ReiserFS issue. The only thing I haven't tried is running Unraid on bare-metal and not under ESXi but I'm baffled that it worked flawlessly with every other version and not beta10(a).

 

But between having a newborn child and lack of sleep, I'm not looking forward to putting that type of time and energy into it... Maybe in a month I'll go down that route, but right now it seems like that would be easier than manually running the Mover script 4-5 times every time the cache drive gets filled up.

 

Limetech or staff,

 

Are there any ReiserFS patches that were rolled into the newest kernel? Has anyone tried to replicate this issue in the lab? There are a few of us here that are posting about this issue which makes me wonder if there are more people out there having the same issue but not saying anything.

 

I can give you guys full access to my server if needed for testing. I don't mind setting you guys up with TeamViewer and Vsphere access if anyone wants to poke around.

Link to comment

Yeah, I kept having people saying the same thing: "Why does your Plex show the movies but say I can't play them?", etc.  So I ended up having to just revert back to 5.0.6.  I needed to have CrashPlan, and the only way to get it working on 6.0b10 was to use Docker. Since Docker would just cause my box to hard freeze and force me to reboot via vSphere, I really had no other choice.

 

Hopefully we will see a fix to this in a future release, as I would really like to use the new features of 6.0.

Link to comment
  • 2 weeks later...

After manually running the Mover 6 times, I wrote down which files were causing Unraid/ReiserFS to hard freeze. This is far from conclusive, as I'm still testing my theory, but I believe the issue is due to my LSI1068E card. I haven't yet verified whether it also freezes when copying to hard drives connected to the Marvell 88SE9485 card.

 

Mover would freeze when copying files to the following drives: sdn (3 times), sdo (once), sdj (once), sdp (once).

 

I ran the following command: "lsscsi -g" and here's the output:

 

[2:0:0:0]    disk    ATA      WDC WD30EZRX-00D 0A80  /dev/sdc  /dev/sg3

[2:0:1:0]    disk    ATA      WDC WD30EZRX-00M 0A80  /dev/sdd  /dev/sg4

[2:0:2:0]    disk    ATA      WDC WD30EZRX-00D 0A80  /dev/sde  /dev/sg5

[2:0:3:0]    disk    ATA      ST2000DL003-9VT1 CC32  /dev/sdf  /dev/sg6

[2:0:4:0]    disk    ATA      Samsung SSD 840  BB6Q  /dev/sdg  /dev/sg7

[2:0:5:0]    disk    ATA      WDC WD30EZRX-00D 0A80  /dev/sdh  /dev/sg8

[4:0:0:0]    disk    ATA      WDC WD20EARS-00S 0A80  /dev/sdi  /dev/sg9

[4:0:1:0]    disk    ATA      WDC WD20EARS-00M AB50  /dev/sdj  /dev/sg10

[4:0:2:0]    disk    ATA      WDC WD20EARX-00P AB51  /dev/sdk  /dev/sg11

[4:0:3:0]    disk    ATA      WDC WD20EADS-00W 0A01  /dev/sdl  /dev/sg12

[4:0:4:0]    disk    ATA      Hitachi HDS5C302 A580  /dev/sdm  /dev/sg13

[4:0:5:0]    disk    ATA      Hitachi HDS5C302 A580  /dev/sdn  /dev/sg14

[4:0:6:0]    disk    ATA      Hitachi HDS5C302 A580  /dev/sdo  /dev/sg15

[4:0:7:0]    disk    ATA      ST2000DL003-9VT1 CC32  /dev/sdp  /dev/sg16

 

The one thing all four of those drives have in common is SCSI host 4, i.e. [4:0:X:0].
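
The host lookup can be double-checked with a short script. This is a sketch, not part of the original post: `host_by_device` and `FROZEN` are names chosen here, and the text is the "lsscsi -g" output pasted above.

```python
import re

# Output of "lsscsi -g" as posted above.
LSSCSI = """\
[2:0:0:0]    disk    ATA      WDC WD30EZRX-00D 0A80  /dev/sdc  /dev/sg3
[2:0:1:0]    disk    ATA      WDC WD30EZRX-00M 0A80  /dev/sdd  /dev/sg4
[2:0:2:0]    disk    ATA      WDC WD30EZRX-00D 0A80  /dev/sde  /dev/sg5
[2:0:3:0]    disk    ATA      ST2000DL003-9VT1 CC32  /dev/sdf  /dev/sg6
[2:0:4:0]    disk    ATA      Samsung SSD 840  BB6Q  /dev/sdg  /dev/sg7
[2:0:5:0]    disk    ATA      WDC WD30EZRX-00D 0A80  /dev/sdh  /dev/sg8
[4:0:0:0]    disk    ATA      WDC WD20EARS-00S 0A80  /dev/sdi  /dev/sg9
[4:0:1:0]    disk    ATA      WDC WD20EARS-00M AB50  /dev/sdj  /dev/sg10
[4:0:2:0]    disk    ATA      WDC WD20EARX-00P AB51  /dev/sdk  /dev/sg11
[4:0:3:0]    disk    ATA      WDC WD20EADS-00W 0A01  /dev/sdl  /dev/sg12
[4:0:4:0]    disk    ATA      Hitachi HDS5C302 A580  /dev/sdm  /dev/sg13
[4:0:5:0]    disk    ATA      Hitachi HDS5C302 A580  /dev/sdn  /dev/sg14
[4:0:6:0]    disk    ATA      Hitachi HDS5C302 A580  /dev/sdo  /dev/sg15
[4:0:7:0]    disk    ATA      ST2000DL003-9VT1 CC32  /dev/sdp  /dev/sg16
"""

# Capture the SCSI host number (first field of [H:C:T:L]) and the block device.
ROW = re.compile(r"^\[(\d+):\d+:\d+:\d+\].*?(/dev/sd[a-z]+)\s", re.M)

# Drives Mover was writing to when it froze, per the list above.
FROZEN = ["sdn", "sdo", "sdj", "sdp"]


def host_by_device(text):
    """Map a device name such as 'sdn' to its SCSI host number."""
    return {dev.rsplit("/", 1)[-1]: int(host) for host, dev in ROW.findall(text)}


if __name__ == "__main__":
    hosts = host_by_device(LSSCSI)
    print({dev: hosts[dev] for dev in FROZEN})
```

Note that host 4 also carries sdi, sdk, sdl and sdm, which have not shown up in a freeze so far, so this is a correlation to test rather than proof.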

 

Running "lsscsi  -t -H" gave me the following output:

 

[2]    mvsas        sas:0x5005043011ab0000

[4]    mptsas        sas:0x50030480006582f0

 

Running "lspci"

0b:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)

13:00.0 RAID bus controller: Marvell Technology Group Ltd. Device 9485 (rev 03)

 

All of this points to the mptsas driver which is what the LSI1068E card needs.
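
To make the host-to-driver step explicit, here is a small sketch that reads the "lsscsi -t -H" output quoted above. The function name `driver_by_host` is hypothetical; the input is copied from this post.

```python
# Output of "lsscsi -t -H" as posted above.
HOSTS = """\
[2]    mvsas        sas:0x5005043011ab0000
[4]    mptsas        sas:0x50030480006582f0
"""


def driver_by_host(text):
    """Map SCSI host number -> kernel driver name from `lsscsi -t -H` output."""
    drivers = {}
    for line in text.splitlines():
        parts = line.split()
        # Expect at least "[N]" followed by the driver name.
        if len(parts) >= 2 and parts[0].startswith("["):
            drivers[int(parts[0].strip("[]"))] = parts[1]
    return drivers


if __name__ == "__main__":
    print(driver_by_host(HOSTS))
```

Host 4 resolves to mptsas (the LSI Fusion-MPT SAS driver) and host 2 to mvsas (the Marvell driver), which matches the two cards in the lspci output.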

 

I currently have an IBM M1015 card that I need to flash over to LSI SAS2008 IT mode. It will replace the SAS1068E card, which is in pass-through mode in ESXi.

 

Limetech, Jonp, RobJ, WeeboTech:

 

Have there been any changes/updates to the MPTSAS driver or anything related to it in the kernel? I didn't see any mention of it in the changelog from Beta6 onwards.

 

I will post another reply when the LSI1068E card has been replaced and I have run the mover more than once.

Link to comment

Razorslinky,

 

If we are both seeing the same issue then changing the card over to the M1015 won't make a difference.  That is the exact card I am using and it looks to be causing the same issues.

 

Well shit...

 

If it still freezes after I replace the card then it HAS to be the ReiserFS/kernel code which means we have to wait for 6.0-beta11 or move away from ReiserFS to either XFS or EXT4 (no more reiserfs for me. I don't trust it anymore since it's no longer being maintained?)

 

I'll keep reviewing the syslog to see if it freezes when copying over to the hard drives under the Marvell chipset. I'm just getting REALLY annoyed with the amount of data corruption that I'm seeing, since it has to be powered down every time.

Link to comment

I'm being cautiously optimistic, but I believe my issue could be fixed (fingers crossed). Mover successfully copied over 50GB (not much, I know) without issues or kernel panics. Previously, whenever Mover was run manually, it would always freeze a good 15-20 minutes into the move, and it would take me about an hour to move 50+ GB of data.

 

Mover has been scheduled to run at 3AM every day to test my theory about the LSI1068E being the issue.

 

The LSI1068E was replaced with an M1015 re-flashed to the newest firmware, 9211_8i_P20-IT:

http://www.lsi.com/products/host-bus-adapters/pages/lsi-sas-9211-8i.aspx#tab/tab4.

 

Something I haven't seen in a while:

Nov 17 00:14:58 Tower logger: mover started

Nov 17 00:36:29 Tower logger: mover finished
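
For reference, the run time between those two lines can be worked out with a few lines of Python. This is a sketch: `elapsed` is a name chosen here, and the year is an assumption because syslog timestamps do not carry one.

```python
from datetime import datetime

START = "Nov 17 00:14:58 Tower logger: mover started"
END = "Nov 17 00:36:29 Tower logger: mover finished"


def elapsed(start_line, end_line, year=2014):
    """Return the timedelta between two syslog lines.

    The first 15 characters of a syslog line are the 'Mon DD HH:MM:SS'
    timestamp; the year is not logged, so it is passed in (assumed 2014).
    """
    def stamp(line):
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

    return stamp(end_line) - stamp(start_line)


if __name__ == "__main__":
    took = elapsed(START, END)
    print(took)
    # Rough throughput, ASSUMING the ~50 GB figure refers to this run.
    print(50e9 / took.total_seconds() / 1e6, "MB/s")
```

That works out to 21 minutes 31 seconds, compared with roughly an hour for a similar amount of data on the old card.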

 

Damn it! Ran the bitrot utility to stress test everything and just received a kernel panic.

 

Strange that Mover didn't cause the kernel panic, because it copied most of the files over to the hard drives under the new SAS card.

 

Now it's time to play the waiting game for beta11, or try running Unraid on bare metal or under KVM/Xen on Arch.

Link to comment

Given the CPU in your sig and the issues involved, I would lower the number of CPUs assigned to the virtual machine to 2.

Yes you can have 8 threads, but hyperthreading is not the same as a core.

I'm taking a stab at a potential timing issue as cores are allocated.

 

Another issue could be just a fault in the reiserfs code.

 

I noticed this below.

Oct 17 03:20:43 Tower kernel: [<ffffffff8114cd1a>] reiserfs_setattr+0x262/0x294 

and there was mention of the bitrot script.

 

That seems to be a good candidate for testing both CPU contention and I/O.

 

If I had a spare drive, I might convert one to XFS on the suspect controller and beat the hell out of it with jobs, file scans and copies.

Link to comment


Alright... Updated the CPU (virtual & cores) to 2 virtual sockets and 1 core per socket.

 

Manually ran Mover and another kernel panic happened within 5 minutes.

 

Attached is my syslog. Looks like it's the same Reiserfs & CPU error.

razorslinky-syslog-11-17-14.zip

Link to comment

Hmm, that's interesting.

 

Nov 17 17:28:32 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 0}  (t=6000 jiffies g=16963 c=16962 q=35413)

 

Are there any ESX events logged?

 

Are you booting in XEN or the other non XEN mode?

Nov 17 17:27:32 Tower logger: ./TV/Marry Me/Season 01/Marry Me - s01e04 - Annicurser-me - 720p-WEBDL - BTN.mkv
Nov 17 17:28:32 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 0}  (t=6000 jiffies g=16963 c=16962 q=35413)
Nov 17 17:28:32 Tower kernel: sending NMI to all CPUs:
Nov 17 17:28:32 Tower kernel: NMI backtrace for cpu 0
Nov 17 17:28:32 Tower kernel: CPU: 0 PID: 6857 Comm: shfs Not tainted 3.16.3-unRAID #3
Nov 17 17:28:32 Tower kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/14/2014
Nov 17 17:28:32 Tower kernel: task: ffff8800ba266660 ti: ffff88005e7e4000 task.ti: ffff88005e7e4000
Nov 17 17:28:32 Tower kernel: RIP: 0010:[<ffffffff81032618>]  [<ffffffff81032618>] flat_send_IPI_mask+0x77/0x86
Nov 17 17:28:32 Tower kernel: RSP: 0018:ffff88013fc03df8  EFLAGS: 00000046
Nov 17 17:28:32 Tower kernel: RAX: 0000000000000c00 RBX: 0000000000000002 RCX: 0000000000000007
Nov 17 17:28:32 Tower kernel: RDX: ffffffff817fb940 RSI: 0000000000000002 RDI: 0000000000000082
Nov 17 17:28:32 Tower kernel: RBP: ffff88013fc03e18 R08: 0000000000000086 R09: ffffffff819a00cc
Nov 17 17:28:32 Tower kernel: R10: ffff880100002dd8 R11: 00000000000005eb R12: 0000000000000c00
Nov 17 17:28:32 Tower kernel: R13: 0000000000000003 R14: ffffffff81813580 R15: 0000000000000000
Nov 17 17:28:32 Tower kernel: FS:  00002b029cb40700(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
Nov 17 17:28:32 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 17 17:28:32 Tower kernel: CR2: 000000000068a744 CR3: 000000007dd92000 CR4: 00000000000007f0
Nov 17 17:28:32 Tower kernel: Stack:
Nov 17 17:28:32 Tower kernel: 0000000000000082 0000000000000001 ffff88013fc0e210 0000000000000000
Nov 17 17:28:32 Tower kernel: ffff88013fc03e30 ffffffff8102fe80 ffffffff81813580 ffff88013fc03e88
Nov 17 17:28:32 Tower kernel: ffffffff8107d142 ffff88013fc0df48 ffffffff8186b850 0000000000000000
Nov 17 17:28:32 Tower kernel: Call Trace:
Nov 17 17:28:32 Tower kernel: <IRQ> 
Nov 17 17:28:32 Tower kernel: [<ffffffff8102fe80>] arch_trigger_all_cpu_backtrace+0x95/0xcd
Nov 17 17:28:32 Tower kernel: [<ffffffff8107d142>] rcu_check_callbacks+0x1e3/0x503
Nov 17 17:28:32 Tower kernel: [<ffffffff81085128>] ? tick_sched_handle+0x34/0x34
Nov 17 17:28:32 Tower kernel: [<ffffffff810493fa>] update_process_times+0x38/0x60
Nov 17 17:28:32 Tower kernel: [<ffffffff81085126>] tick_sched_handle+0x32/0x34
Nov 17 17:28:32 Tower kernel: [<ffffffff8108515d>] tick_sched_timer+0x35/0x53
Nov 17 17:28:32 Tower kernel: [<ffffffff8105ae08>] __run_hrtimer.isra.28+0x57/0xb0
Nov 17 17:28:32 Tower kernel: [<ffffffff8105b2e1>] hrtimer_interrupt+0xd9/0x1c0
Nov 17 17:28:32 Tower kernel: [<ffffffff8102e7bc>] local_apic_timer_interrupt+0x4f/0x52
Nov 17 17:28:32 Tower kernel: [<ffffffff8102eb8e>] smp_apic_timer_interrupt+0x3a/0x4b
Nov 17 17:28:32 Tower kernel: [<ffffffff815e025d>] apic_timer_interrupt+0x6d/0x80
Nov 17 17:28:32 Tower kernel: <EOI> 
Nov 17 17:28:32 Tower kernel: [<ffffffff81143511>] ? __discard_prealloc+0x87/0xb1
Nov 17 17:28:32 Tower kernel: [<ffffffff8114359e>] reiserfs_discard_all_prealloc+0x43/0x4c
Nov 17 17:28:32 Tower kernel: [<ffffffff8115f8d2>] do_journal_end+0x4e1/0xc57
Nov 17 17:28:32 Tower kernel: [<ffffffff811605a2>] journal_end+0xad/0xb4
Nov 17 17:28:32 Tower kernel: [<ffffffff81150eca>] reiserfs_dirty_inode+0x6c/0x7c
Nov 17 17:28:32 Tower kernel: [<ffffffff81095450>] ? from_kuid+0x9/0xb
Nov 17 17:28:32 Tower kernel: [<ffffffff8110dfb2>] __mark_inode_dirty+0x2f/0x1de
Nov 17 17:28:32 Tower kernel: [<ffffffff8114cd1a>] reiserfs_setattr+0x262/0x294
Nov 17 17:28:32 Tower kernel: [<ffffffff810f1b45>] ? __sb_start_write+0x99/0xcd
Nov 17 17:28:32 Tower kernel: [<ffffffff810fc109>] ? user_path_at_empty+0x60/0x87
Nov 17 17:28:32 Tower kernel: [<ffffffff81105492>] notify_change+0x1dd/0x2d1
Nov 17 17:28:32 Tower kernel: [<ffffffff81112014>] utimes_common+0x114/0x174
Nov 17 17:28:32 Tower kernel: [<ffffffff8111215f>] do_utimes+0xeb/0x123
Nov 17 17:28:32 Tower kernel: [<ffffffff811122ff>] SyS_futimesat+0x8a/0xa5
Nov 17 17:28:32 Tower kernel: [<ffffffff8111232e>] SyS_utimes+0x14/0x16
Nov 17 17:28:32 Tower kernel: [<ffffffff815df469>] system_call_fastpath+0x16/0x1b
Nov 17 17:28:32 Tower kernel: Code: f3 90 eb f0 44 89 e8 c1 e0 18 89 04 25 10 93 5f ff 89 d8 44 09 e0 41 81 cc 00 04 00 00 83 fb 02 41 0f 44 c4 89 04 25 00 93 5f ff <57> 9d 66 66 90 66 90 58 5b 41 5c 41 5d 5d c3 55 48 89 e5 41 55 
Nov 17 17:28:32 Tower kernel: NMI backtrace for cpu 1
Nov 17 17:28:32 Tower kernel: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.16.3-unRAID #3
Nov 17 17:28:32 Tower kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/14/2014
Nov 17 17:28:32 Tower kernel: task: ffff880139b1c6e0 ti: ffff880139b58000 task.ti: ffff880139b58000
Nov 17 17:28:32 Tower kernel: RIP: 0010:[<ffffffff81032448>]  [<ffffffff81032448>] native_apic_mem_write+0xc/0xe
Nov 17 17:28:32 Tower kernel: RSP: 0018:ffff88013fd03f90  EFLAGS: 00000046
Nov 17 17:28:32 Tower kernel: RAX: ffffffff817fb940 RBX: 0000000000000000 RCX: 0000000000000000
Nov 17 17:28:32 Tower kernel: RDX: 00000000ffffffed RSI: 0000000000000000 RDI: 00000000000000b0
Nov 17 17:28:32 Tower kernel: RBP: ffff88013fd03f90 R08: 0000000000000000 R09: 0000000000000000
Nov 17 17:28:32 Tower kernel: R10: ffffffff8106aa65 R11: 0000000000000400 R12: ffff880139b58000
Nov 17 17:28:32 Tower kernel: R13: ffff880139b58000 R14: ffff880139b58000 R15: 0000000000000000
Nov 17 17:28:32 Tower kernel: FS:  00002ad709007700(0000) GS:ffff88013fd00000(0000) knlGS:0000000000000000
Nov 17 17:28:32 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 17 17:28:32 Tower kernel: CR2: 00007fdc0e45408c CR3: 000000008852c000 CR4: 00000000000007e0
Nov 17 17:28:32 Tower kernel: Stack:
Nov 17 17:28:32 Tower kernel: ffff88013fd03fa8 ffffffff8102eb7f 0000000000000000 ffff880139b5bec0
Nov 17 17:28:32 Tower kernel: ffffffff815e025d ffff880139b5be38 <EOI>  ffff880139b5bec0 ffff880139b5be68
Nov 17 17:28:32 Tower kernel: 0000000000000400 ffffffff8106aa65 0000000000000000 0000000000000000
Nov 17 17:28:32 Tower kernel: Call Trace:
Nov 17 17:28:32 Tower kernel: <IRQ> 
Nov 17 17:28:32 Tower kernel: [<ffffffff8102eb7f>] smp_apic_timer_interrupt+0x2b/0x4b
Nov 17 17:28:32 Tower kernel: [<ffffffff815e025d>] apic_timer_interrupt+0x6d/0x80
Nov 17 17:28:32 Tower kernel: <EOI> 
Nov 17 17:28:32 Tower kernel: [<ffffffff8106aa65>] ? pick_next_task_fair+0x37e/0x402
Nov 17 17:28:32 Tower kernel: [<ffffffff81034cc0>] ? native_safe_halt+0x6/0x8
Nov 17 17:28:32 Tower kernel: [<ffffffff810125db>] default_idle+0x9/0xd
Nov 17 17:28:32 Tower kernel: [<ffffffff81012c6d>] arch_cpu_idle+0xa/0xc
Nov 17 17:28:32 Tower kernel: [<ffffffff8106f316>] cpu_startup_entry+0x12c/0x232
Nov 17 17:28:32 Tower kernel: [<ffffffff8102d21f>] start_secondary+0x1bf/0x1c4
Nov 17 17:28:32 Tower kernel: Code: 41 5d 41 5e 5d c3 83 c8 ff c3 90 83 3d cd c8 95 00 01 76 0a 55 48 89 e5 e8 02 ae ff ff 5d c3 55 89 ff 48 89 e5 89 b7 00 90 5f ff <5d> c3 55 89 ff 48 89 e5 8b 87 00 90 5f ff 5d c3 55 48 8b 05 c8 
...<snip>...
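
A sketch for pairing each stall with the last file Mover logged before it, which may help when building a list of suspect files or target disks. The function name is made up, the year is an assumption (syslog has none), and it assumes Mover logs each path via "logger: ./..." as in the excerpt above. The two sample lines are copied from that excerpt.

```python
import re
from datetime import datetime

STALL = re.compile(
    r"rcu_sched self-detected stall on CPU \{\s*(\d+)\}\s+\(t=(\d+) jiffies"
)

SAMPLE = [
    "Nov 17 17:27:32 Tower logger: ./TV/Marry Me/Season 01/Marry Me - s01e04 - Annicurser-me - 720p-WEBDL - BTN.mkv",
    "Nov 17 17:28:32 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 0}  (t=6000 jiffies g=16963 c=16962 q=35413)",
]


def last_file_before_stalls(lines, year=2014):
    """Return (path, cpu, jiffies, seconds_since_path) for each stall line."""
    results, last = [], None
    for line in lines:
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue                      # not a timestamped syslog line
        if " logger: ./" in line:
            last = (stamp, line.split(" logger: ", 1)[1])
            continue
        m = STALL.search(line)
        if m and last:
            gap = (stamp - last[0]).total_seconds()
            results.append((last[1], int(m.group(1)), int(m.group(2)), gap))
    return results


if __name__ == "__main__":
    for path, cpu, jiffies, gap in last_file_before_stalls(SAMPLE):
        print(f"CPU {cpu}: {jiffies} jiffies, {gap:.0f}s after {path}")
```

In this excerpt the stall is reported exactly 60 seconds after the file name was logged, and the kernel reports t=6000 jiffies. If the stall began as soon as that file was started (an assumption), that would imply a 100 Hz tick and that the CPU was stuck for the whole minute.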

Link to comment
  • 2 weeks later...


I didn't see any ESXi logs but I did the following as another test:

 

Installed a fresh, up-to-date version of Arch and XEN (4.4.1) on an extra SSD

Created an unRAID DomU

Unraid booted and started up just fine

Manually started Mover

 

Crashed HARD again:

Nov 30 00:45:46 Tower kernel: REISERFS error (device md1): vs-4010 is_reusable: block number is out of range 763574476 (488378638)

Nov 30 00:45:46 Tower kernel: general protection fault: 0000 [#1] SMP

Nov 30 00:45:46 Tower kernel: Modules linked in: veth xt_nat ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_nat_ipv4 nf_nat iptable_filter ip_tables md_mod mvsas libsas mpt2sas raid_class scsi_transport_sas

Nov 30 00:45:46 Tower kernel: CPU: 1 PID: 3381 Comm: shfs Not tainted 3.16.3-unRAID #3

Nov 30 00:45:46 Tower kernel: task: ffff88012889c6e0 ti: ffff8800c89bc000 task.ti: ffff8800c89bc000

Nov 30 00:45:46 Tower kernel: RIP: e030:[] [] search_by_key+0x2e/0xca7

Nov 30 00:45:46 Tower kernel: RSP: e02b:ffff8800c89bf9b8 EFLAGS: 00010246

Nov 30 00:45:46 Tower kernel: RAX: 0000000000000001 RBX: 19217dd6988a7fec RCX: 0000000000000001

Nov 30 00:45:46 Tower kernel: RDX: ffff8800c89bfb40 RSI: ffff8800c89bfb20 RDI: ffff8800c89bfb40

Nov 30 00:45:46 Tower kernel: RBP: ffff8800c89bfad8 R08: 0000000000000003 R09: 00000000134e88c4

Nov 30 00:45:46 Tower kernel: R10: 000000000d8e0fa6 R11: 0000000000000000 R12: 0000000000000000

Nov 30 00:45:46 Tower kernel: R13: ffff8800c89bfd18 R14: ffff8800c89bfb20 R15: ffff8800c89bfb40

Nov 30 00:45:46 Tower kernel: FS: 00002b4668782700(0000) GS:ffff88012af00000(0000) knlGS:0000000000000000

Nov 30 00:45:46 Tower kernel: CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b

Nov 30 00:45:46 Tower kernel: CR2: 00002ae682bb51c0 CR3: 0000000005496000 CR4: 0000000000000660

Nov 30 00:45:46 Tower kernel: Stack:

Nov 30 00:45:46 Tower kernel: 0000000000000006 ffff88012af13680 0000000000000000 ffff8800c89bf9e0

Nov 30 00:45:46 Tower kernel: ffffffff81006a25 ffff880000000001 ffffffff810119cc ffff8800c89bfa10

Nov 30 00:45:46 Tower kernel: ffffffff8106464b ffff88012af13680 0000000000013680 ffff8800c89bfa38
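
A quick sanity check on the vs-4010 line above. As far as I can tell, the first number is the block that was requested and the number in parentheses is the filesystem's total block count; that reading, the 4 KiB block size, and the helper names are all assumptions of this sketch.

```python
import re

MSG = ("REISERFS error (device md1): vs-4010 is_reusable: "
       "block number is out of range 763574476 (488378638)")

BLOCK_SIZE = 4096  # assumption: 4 KiB filesystem blocks


def parse_out_of_range(msg):
    """Return (requested_block, block_count) from a vs-4010 message."""
    m = re.search(r"out of range (\d+) \((\d+)\)", msg)
    return int(m.group(1)), int(m.group(2))


def to_tb(blocks, block_size=BLOCK_SIZE):
    """Convert a block count to decimal terabytes."""
    return blocks * block_size / 1e12


if __name__ == "__main__":
    requested, total = parse_out_of_range(MSG)
    print("filesystem size :", round(to_tb(total), 2), "TB")
    print("requested offset:", round(to_tb(requested), 2), "TB")
    print("past the end    :", requested > total)
```

Under those assumptions the filesystem on md1 is about 2 TB and the requested block sits well beyond its end, which would fit a corrupted block pointer on disk rather than a controller timeout. That is an interpretation, not something the log states directly.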

 

Logs have also been attached.

 

My next plan of attack:

 

Move ALL of the data off the cache drive

Backup the docker.img

Backup the Appdata

Reformat the cache drive to BTRFS

Create a VM that has all of the apps that are needed

Keep unraid as stock as possible.

 

Pray to the unraid gods that beta6 fixes all of these crazy issues. I'm REALLY hoping beta11 comes out this week. If beta6 crashes I may just have to temporarily move all of my data to Arch with either aufs or mhddfs until the new beta is released.

unraid_xen_11-30-14.zip

Link to comment

Make sure to post the results!  I'm interested to see what happens, and if I should go for the upgrade.

 

Mover script is running right now and will be moving about 40GB worth of data. If Mover is successful, I will then run the Bitrot script which will really stress test the new kernel, etc...

 

Will update once the Mover script has been completed and start the Bitrot script.

 

I've been checking the announcement forum twice a day for the new release... so I'm keeping my fingers crossed.

Link to comment

Are you freaking kidding me. What bullshit.... Keeps saying it's a ReiserFS issue... So I guess it's either migrate each drive one by one to XFS or move to beta6, which I don't want to do.

 

Nov 30 17:15:34 Tower logger: ./TV/The Newsroom (2012)/Season 03/The Newsroom (2012) - s03e03 - Main Justice - 720p-HDTV - KILLERS.mkv
Nov 30 17:16:34 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 1}  (t=6000 jiffies g=34323 c=34322 q=56064)
Nov 30 17:16:34 Tower kernel: Task dump for CPU 1:
Nov 30 17:16:34 Tower kernel: shfs            R  running task        0  4026      1 0x00000008
Nov 30 17:16:34 Tower kernel: 0000000000000000 ffff88013fd03de8 ffffffff8105cc09 0000000000000001
Nov 30 17:16:34 Tower kernel: 0000000000000001 ffff88013fd03e00 ffffffff8105f2c4 ffffffff81822d00
Nov 30 17:16:34 Tower kernel: ffff88013fd03e30 ffffffff810766a5 ffffffff81822d00 ffff88013fd0e0c0
Nov 30 17:16:34 Tower kernel: Call Trace:
Nov 30 17:16:34 Tower kernel: <IRQ>  [<ffffffff8105cc09>] sched_show_task+0xbe/0xc3
Nov 30 17:16:34 Tower kernel: [<ffffffff8105f2c4>] dump_cpu_task+0x34/0x38
Nov 30 17:16:34 Tower kernel: [<ffffffff810766a5>] rcu_dump_cpu_stacks+0x6a/0x8c
Nov 30 17:16:34 Tower kernel: [<ffffffff81078ead>] rcu_check_callbacks+0x1e1/0x4ff
Nov 30 17:16:34 Tower kernel: [<ffffffff81086659>] ? tick_sched_handle+0x34/0x34
Nov 30 17:16:34 Tower kernel: [<ffffffff8107ac1a>] update_process_times+0x38/0x60
Nov 30 17:16:34 Tower kernel: [<ffffffff81086657>] tick_sched_handle+0x32/0x34
Nov 30 17:16:34 Tower kernel: [<ffffffff8108668e>] tick_sched_timer+0x35/0x53
Nov 30 17:16:34 Tower kernel: [<ffffffff8107b149>] __run_hrtimer.isra.29+0x57/0xb0
Nov 30 17:16:34 Tower kernel: [<ffffffff8107b634>] hrtimer_interrupt+0xd9/0x1c0
Nov 30 17:16:34 Tower kernel: [<ffffffff8102ea78>] local_apic_timer_interrupt+0x4f/0x52
Nov 30 17:16:34 Tower kernel: [<ffffffff8102ee4a>] smp_apic_timer_interrupt+0x3a/0x4b
Nov 30 17:16:34 Tower kernel: [<ffffffff815ead9d>] apic_timer_interrupt+0x6d/0x80
Nov 30 17:16:34 Tower kernel: <EOI>  [<ffffffff81147b3b>] ? __discard_prealloc+0xad/0xb1
Nov 30 17:16:34 Tower kernel: [<ffffffff81147ba2>] reiserfs_discard_all_prealloc+0x43/0x4c
Nov 30 17:16:34 Tower kernel: [<ffffffff81163ed6>] do_journal_end+0x4e1/0xc57
Nov 30 17:16:34 Tower kernel: [<ffffffff81164ba6>] journal_end+0xad/0xb4
Nov 30 17:16:34 Tower kernel: [<ffffffff811554ce>] reiserfs_dirty_inode+0x6c/0x7c
Nov 30 17:16:34 Tower kernel: [<ffffffff81096e84>] ? from_kuid+0x9/0xb
Nov 30 17:16:34 Tower kernel: [<ffffffff811119aa>] __mark_inode_dirty+0x2f/0x1de
Nov 30 17:16:34 Tower kernel: [<ffffffff8115131e>] reiserfs_setattr+0x262/0x294
Nov 30 17:16:34 Tower kernel: [<ffffffff810f5505>] ? __sb_start_write+0x9a/0xce
Nov 30 17:16:34 Tower kernel: [<ffffffff810ffb8d>] ? user_path_at_empty+0x60/0x87
Nov 30 17:16:34 Tower kernel: [<ffffffff81108ea6>] notify_change+0x1dd/0x2d1
Nov 30 17:16:34 Tower kernel: [<ffffffff81115a20>] utimes_common+0x114/0x174
Nov 30 17:16:34 Tower kernel: [<ffffffff81115b6b>] do_utimes+0xeb/0x123
Nov 30 17:16:34 Tower kernel: [<ffffffff81115d0b>] SyS_futimesat+0x8a/0xa5
Nov 30 17:16:34 Tower kernel: [<ffffffff81115d3a>] SyS_utimes+0x14/0x16
Nov 30 17:16:34 Tower kernel: [<ffffffff815e9fa9>] system_call_fastpath+0x16/0x1b

razorslinky-11-30-14-beta12.zip

Link to comment

And another kernel panic

 

Nov 30 17:58:09 Tower avahi-daemon[3928]: server.c: Packet too short or invalid while reading response record. (Maybe a UTF-8 problem?)
Nov 30 18:06:39 Tower kernel: INFO: rcu_sched self-detected stall on CPU { 0}  (t=6000 jiffies g=47573 c=47572 q=71300)
Nov 30 18:06:39 Tower kernel: Task dump for CPU 0:
Nov 30 18:06:39 Tower kernel: shfs            R  running task        0 10443      1 0x00000008
Nov 30 18:06:39 Tower kernel: 0000000000000000 ffff88013fc03de8 ffffffff8105cc09 0000000000000000
Nov 30 18:06:39 Tower kernel: 0000000000000000 ffff88013fc03e00 ffffffff8105f2c4 ffffffff81822d00
Nov 30 18:06:39 Tower kernel: ffff88013fc03e30 ffffffff810766a5 ffffffff81822d00 ffff88013fc0e0c0
Nov 30 18:06:39 Tower kernel: Call Trace:
Nov 30 18:06:39 Tower kernel: <IRQ>  [<ffffffff8105cc09>] sched_show_task+0xbe/0xc3
Nov 30 18:06:39 Tower kernel: [<ffffffff8105f2c4>] dump_cpu_task+0x34/0x38
Nov 30 18:06:39 Tower kernel: [<ffffffff810766a5>] rcu_dump_cpu_stacks+0x6a/0x8c
Nov 30 18:06:39 Tower kernel: [<ffffffff81078ead>] rcu_check_callbacks+0x1e1/0x4ff
Nov 30 18:06:39 Tower kernel: [<ffffffff81086659>] ? tick_sched_handle+0x34/0x34
Nov 30 18:06:39 Tower kernel: [<ffffffff8107ac1a>] update_process_times+0x38/0x60
Nov 30 18:06:39 Tower kernel: [<ffffffff81086657>] tick_sched_handle+0x32/0x34
Nov 30 18:06:39 Tower kernel: [<ffffffff8108668e>] tick_sched_timer+0x35/0x53
Nov 30 18:06:39 Tower kernel: [<ffffffff8107b149>] __run_hrtimer.isra.29+0x57/0xb0
Nov 30 18:06:39 Tower kernel: [<ffffffff8107b634>] hrtimer_interrupt+0xd9/0x1c0
Nov 30 18:06:39 Tower kernel: [<ffffffff8102ea78>] local_apic_timer_interrupt+0x4f/0x52
Nov 30 18:06:39 Tower kernel: [<ffffffff8102ee4a>] smp_apic_timer_interrupt+0x3a/0x4b
Nov 30 18:06:39 Tower kernel: [<ffffffff815ead9d>] apic_timer_interrupt+0x6d/0x80
Nov 30 18:06:39 Tower kernel: <EOI>  [<ffffffff81154fd0>] ? unfix_nodes+0x13f/0x14b
Nov 30 18:06:39 Tower kernel: [<ffffffff81147b89>] ? reiserfs_discard_all_prealloc+0x2a/0x4c
Nov 30 18:06:39 Tower kernel: [<ffffffff81147ba2>] ? reiserfs_discard_all_prealloc+0x43/0x4c
Nov 30 18:06:39 Tower kernel: [<ffffffff81163ed6>] do_journal_end+0x4e1/0xc57
Nov 30 18:06:39 Tower kernel: [<ffffffff81164ba6>] journal_end+0xad/0xb4
Nov 30 18:06:39 Tower kernel: [<ffffffff8114b8d9>] reiserfs_unlink+0x1bf/0x21f
Nov 30 18:06:39 Tower kernel: [<ffffffff810fc287>] ? link_path_walk+0x67/0x70c
Nov 30 18:06:39 Tower kernel: [<ffffffff810ff1ed>] vfs_unlink+0xa7/0x120
Nov 30 18:06:39 Tower kernel: [<ffffffff810ff351>] do_unlinkat+0xeb/0x1ee
Nov 30 18:06:39 Tower kernel: [<ffffffff810f7750>] ? SyS_newlstat+0x25/0x2e
Nov 30 18:06:39 Tower kernel: [<ffffffff810fffe8>] SyS_unlink+0x11/0x13
Nov 30 18:06:39 Tower kernel: [<ffffffff815e9fa9>] system_call_fastpath+0x16/0x1b

razorslinky-11-30-14-beta12-2.zip
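To compare the two stall traces above without eyeballing them, something like the following could pull out the function names and list the ones both traces share. This is only a sketch: the regular expression is an assumption about the frame format seen in these logs, and the sample strings are shortened excerpts from the two traces. Frames the kernel prints with a leading `?` are addresses found on the stack that the unwinder could not confirm, so they are flagged and left out of the comparison.

```python
import re

# Matches e.g. "[<ffffffff81163ed6>] do_journal_end+0x4e1/0xc57",
# with an optional "? " in front of the function name.
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x")


def frames(trace_text):
    """Return (function_name, reliable) pairs for each stack frame line."""
    result = []
    for line in trace_text.splitlines():
        match = FRAME.search(line)
        if match:
            result.append((match.group(2), match.group(1) is None))
    return result


def shared_reliable(trace_a, trace_b):
    """Reliable function names present in both traces, in trace_a order."""
    in_b = {name for name, ok in frames(trace_b) if ok}
    shared = []
    for name, ok in frames(trace_a):
        if ok and name in in_b and name not in shared:
            shared.append(name)
    return shared


# Shortened excerpts from the 17:16:34 and 18:06:39 traces.
first = """\
<EOI>  [<ffffffff81147b3b>] ? __discard_prealloc+0xad/0xb1
[<ffffffff81147ba2>] reiserfs_discard_all_prealloc+0x43/0x4c
[<ffffffff81163ed6>] do_journal_end+0x4e1/0xc57
[<ffffffff81164ba6>] journal_end+0xad/0xb4
[<ffffffff811554ce>] reiserfs_dirty_inode+0x6c/0x7c
"""
second = """\
[<ffffffff81147ba2>] ? reiserfs_discard_all_prealloc+0x43/0x4c
[<ffffffff81163ed6>] do_journal_end+0x4e1/0xc57
[<ffffffff81164ba6>] journal_end+0xad/0xb4
[<ffffffff8114b8d9>] reiserfs_unlink+0x1bf/0x21f
"""

print(shared_reliable(first, second))  # ['do_journal_end', 'journal_end']
```

On these excerpts the entry points differ (`reiserfs_dirty_inode` in one, `reiserfs_unlink` in the other), but both end up in the reiserfs journal commit path.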

Link to comment

An interesting observation...

 

Since I've had nothing but issues with the Mover script, I've decided to re-balance all of my folders manually (I don't want two TV shows to be spread across 4 disks), and I noticed that I haven't had any kernel panics when moving folders via MC. I've moved over 100GB of data between disks without a single kernel panic. So I'm thinking it has something to do with Mover or rsync.

 

My goal for this week is to get all of my folders on one disk and not spread across all 13 as I will be moving over to Arch and AUFS3 until I hear back from Limetech or someone who is knowledgeable in the kernel/reiserfs portion of unraid.

 

I REALLY REALLY want to stick with Unraid as I like the GUI, Docker and support forum, but I can't keep having major kernel panic issues and a massive amount of data corruption without any type of explanation (is it my Mobo? ESXi? ReiserFS? Kernel?)

Link to comment

I hesitate to respond here because I am not a Linux expert, but I think you may have jumped the gun on this one.  Both syslogs show only what you posted, namely just a single CPU stall, and neither appears to be a kernel panic.  Unless you have something else following those lines, I think they may actually be harmless (but I'm not an authority here).  The CPU stall lines begin as an 'INFO', not a panic or other serious issue, and there are no stack dumps, no 'BUG's, no kernel NULL pointer dereferences, and no NMI's called.  It's possible there has been a configuration change as to how a CPU stall is handled, but otherwise it looks like a major improvement over the previous failures.  Are you actually seeing corruption or crashes?  I note that the second syslog indicates an unclean shutdown from the previous session.
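For anyone who wants to check their own syslog the same way, here is a rough sketch that counts the kinds of lines mentioned above. The pattern list is an assumption based only on the markers named in this post (stall INFO lines, panics, BUGs, NULL pointer dereferences, NMIs); it is not an exhaustive list of kernel messages, and `classify` is just an illustrative name.

```python
import re
from collections import Counter

# Patterns are assumptions taken from the messages discussed in this thread.
MARKERS = {
    "stall": re.compile(r"INFO: rcu_sched self-detected stall"),
    "panic": re.compile(r"Kernel panic"),
    "bug": re.compile(r"\bBUG:"),
    "null_deref": re.compile(r"NULL pointer dereference"),
    "nmi": re.compile(r"NMI backtrace|sending NMI"),
}


def classify(lines):
    """Count how many lines match each marker."""
    counts = Counter()
    for line in lines:
        for label, pattern in MARKERS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


# Hypothetical usage; the syslog path is an assumption and may differ:
# with open("/var/log/syslog", errors="replace") as handle:
#     print(classify(handle))
```

A log with a non-zero "stall" count and zeros everywhere else would match the reading above, that is, a stall report rather than a panic.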

 

I cannot tell if you are using different hardware, but the SAS support has changed from a Fusion MPT SAS driver 'mptbase' (version 3.04.20) to the 'mpt2sas' driver (version 16.100.00.00).  Perhaps Tom may have a comment on the difference.

Link to comment

I will admit that I did jump the gun, as I am not a Linux expert either and just assumed it was a kernel panic because it looked like the previous errors I was getting. I just searched for reiserfs and assumed it was the same type of kernel panic/issue.

 

I did remove an older SAS expander card and installed an IBM M1015 re-flashed to LSI 2008 IT mode, which explains the difference in the driver.

 

What's strange is that when Unraid "stalls," I am unable to browse via UNC, access the GUI, or shut it down properly but command line is fine... I'll attempt it again and see if I was just being impatient..

 

There is some data corruption (only with newly moved files) but that's because I'm doing an unclean shutdown due to Unraid being unresponsive.

 

I sent Tom an email (nice and polite as you don't get anywhere being an asshole) asking if he could take a look and let me know if I am missing something. I just don't want the wife waking me up at 3AM asking why the server has frozen and she can't watch anything while feeding the little one, which means I check via the console and see the dreaded kernel panic screen and have to force restart.  I think the lack of sleep is messing with me and I'm just jumping to conclusions...

 

I'm going to do one more test... I'll re-run the Mover script, do a clean shutdown and then run the bitrot script on 1 disk and see what happens.

 

Also, I've moved about 250GB+ worth of data between disks via MC (getting between 50-80MB/s)... Which is great news... I'm just baffled at what the Mover script is doing to cause issues. (This gives me hope that it's something on my end and not Unraid)

Link to comment

I too am having the exact same issue.  I have been running Unraid ver 5 with no issues for the past year under ESXi.

 

I upgraded to unraid ver 6 when it entered public beta, again with no issues.  However, past beta ver 6, after an average of 2 days, the webgui would not be accessible, nor the user shares.  I did still have access to SSH and direct access to the virtual machine in ESXi, but the unraid server would not respond to console shutdown commands or commands to restart Unraid.

 

I thought it might have something to do with the mover script.  Once the mover script finished running, access to the webgui became noticeably slower if it was accessible at all.  Running the mover script twice pretty much guaranteed a problem.  I thought, well I can do without a cache drive, however turning off all access to the cache drive did not 'fix' the issue, only added a little bit of time before the gui stopped responding.

 

After downgrading back to unraid 5, once again I have not had any issues. 

 

 

I was so excited to see beta 12 released, but I am disappointed to see this bug still exists.  I wish I could put beta 12 up and grab some of the log files from my server to help diagnose this issue; however, having the media server go down randomly is not something I am willing to risk.

Link to comment
