CPU stall on Beta 8

September 1, 2014

I was stopping the array, and the GUI stopped responding after "Stopping AFP". The ssh sessions I had already established also hung. I checked the console, and it was blank and seemed unresponsive. I ended up having to power cycle the server to get it to boot. I did see this message in the ssh sessions I had open when it hung:

mdcmd (54): nocheck

md: nocheck_array: check not active

INFO: rcu_sched self-detected stall on CPU { 0} (t=6001 jiffies g=860756 c=860755 q=436)

sending NMI to all CPUs:

NMI backtrace for cpu 0

CPU: 0 PID: 4334 Comm: tail Not tainted 3.16.0-unRAID #6

Hardware name: 113 1/113-M2-E113, BIOS 6.00 PG 09/30/2008

task: ffff88013990b720 ti: ffff88007eaf8000 task.ti: ffff88007eaf8000

RIP: 0010:[<ffffffff8102feea>] [<ffffffff8102feea>] arch_trigger_all_cpu_backtrace+0xbf/0xcd

RSP: 0018:ffff88013fc03e28 EFLAGS: 00000046

RAX: 0000000000000000 RBX: 0000000000002710 RCX: 0000000000000040

RDX: 0000000000000001 RSI: 0000000000000100 RDI: 0000000000418958

RBP: ffff88013fc03e30 R08: 0000000000000046 R09: 0000000000000000

R10: 0000000000000000 R11: ffffffff818f2ed0 R12: ffff88013fc0e210

R13: 0000000000000000 R14: ffffffff8176b580 R15: 0000000000000000

FS: 00002ab4aa258b80(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000

CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b

CR2: 00002b009aaff610 CR3: 000000007ebe3000 CR4: 00000000000007f0

Stack:

ffffffff8176b580 ffff88013fc03e88 ffffffff8107d15e ffff88013fc0df48

ffffffff817c2750 0000000000000000 00000000000001b4 ffff88013990b720

0000000000000000 0000000000000000 ffffffff81085144 ffff88013fc0df48

Call Trace:

<IRQ>

[<ffffffff8107d15e>] rcu_check_callbacks+0x1e3/0x503

[<ffffffff81085144>] ? tick_sched_handle+0x34/0x34

[<ffffffff8104941e>] update_process_times+0x38/0x60

[<ffffffff81085142>] tick_sched_handle+0x32/0x34

[<ffffffff81085179>] tick_sched_timer+0x35/0x53

[<ffffffff8105ae2c>] __run_hrtimer.isra.28+0x57/0xb0

[<ffffffff8105b305>] hrtimer_interrupt+0xd9/0x1c0

[<ffffffff8102e7fc>] local_apic_timer_interrupt+0x4f/0x52

[<ffffffff8102ebce>] smp_apic_timer_interrupt+0x3a/0x4b

[<ffffffff8155919d>] apic_timer_interrupt+0x6d/0x80

<EOI>

[<ffffffff813094ee>] ? vgacon_scroll+0xc2/0x2a6

[<ffffffff8136bc8f>] scrup+0xc5/0xe2

[<ffffffff8136bcd5>] lf+0x29/0x61

[<ffffffff8136e948>] do_con_trol+0x197/0x1404

[<ffffffff8137034e>] do_con_write.part.21+0x799/0x7d5

[<ffffffff81074196>] ? console_unlock+0x318/0x347

[<ffffffff813703d6>] con_write+0x20/0x33

[<ffffffff8135e254>] do_output_char+0x8b/0x1a6

[<ffffffff8135edbd>] n_tty_write+0x30a/0x401

[<ffffffff81061ff5>] ? wake_up_process+0x32/0x32

[<ffffffff8135bf33>] tty_write+0x19b/0x21d

[<ffffffff8135eab3>] ? process_echoes+0x69/0x69

[<ffffffff810ef9ad>] vfs_write+0xb5/0x169

[<ffffffff810efeef>] SyS_write+0x42/0x86

[<ffffffff815583a9>] system_call_fastpath+0x16/0x1b

Code: 00 bb 10 27 00 00 be 00 01 00 00 48 c7 c7 d0 22 7c 81 e8 56 b7 2a 00 85 c0 74 0b f0 80 25 be 5e 8b 00 fe 5b 5d c3 bf 58 89 41 00 <e8> 27 91 2a 00 ff cb 75 d2 eb e5 c3 66 90 48 63 c7 81 c7 07 02

NMI backtrace for cpu 1

CPU: 1 PID: 4253 Comm: shfs Not tainted 3.16.0-unRAID #6

Hardware name: 113 1/113-M2-E113, BIOS 6.00 PG 09/30/2008

task: ffff8800b1632760 ti: ffff88007e470000 task.ti: ffff88007e470000

RIP: 0033:[<00002ab6d112d1b0>] [<00002ab6d112d1b0>] 0x2ab6d112d1b0

RSP: 002b:00002ab6d235eeb8 EFLAGS: 00000202

RAX: 0000000000000000 RBX: 00002ab6d0f37c48 RCX: 000000000000da48

RDX: 0000000000610158 RSI: 0000000000000000 RDI: 0000000000610158

RBP: 00000000006100f0 R08: 0000000000000000 R09: 000000000000109d

R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000610138

R13: 0000000000610158 R14: 00002ab6d235f9c0 R15: 00000000006100f0

FS: 00002ab6d235f700(0000) GS:ffff88013fc80000(0000) knlGS:0000000000000000

CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

CR2: 00002b35f7f00ac0 CR3: 000000007e255000 CR4: 00000000000007e0

September 2, 2014

Author

This server crashes almost every time I try doing data copies to/from the array. Most of the time during a parity check. I have reverted back to beta7 to see if it goes back to being stable. My other server has beta8, and hasn't crashed yet...

September 2, 2014

This server crashes almost every time I try doing data copies to/from the array. Most of the time during a parity check. I have reverted back to beta7 to see if it goes back to being stable. My other server has beta8, and hasn't crashed yet...

Did you mean you have reverted to beta 6 or beta 7? Your line above is unclear unless there's a typo.

This is similar to the crashes we've been seeing with 6b7 during copy from one array volume to another:

http://lime-technology.com/forum/index.php?topic=34904.0

Haven't tried 6b8 yet.

September 2, 2014

Author

It seemed very stable under b7. I migrated about 6 disks over to XFS and btrfs under b7 and it never had a single crash. Since I upgraded to b8, it seems to crash and lock up under any I/O load, or sometimes just when stopping the array.

September 3, 2014

Author

Rolled back to beta7, which had been stable, and it made it through the parity build OK. But then I started 2 rsyncs to move data from 2 reiserfs volumes to XFS volumes, and now it has locked up tight again...

So I guess it might be similar to the other reports of lockups. Not sure what really changed other than converting disks to XFS, but those conversions went fine...

September 3, 2014

Please post full syslog. You might have to set up

tail -f /var/log/syslog

in a telnet window. If you don't want to post the syslog here, please email it to me:

[email protected]

September 3, 2014

Author

I have restarted the server back with beta8 and kicked off the rsync's again....

I have a backup of the syslog, tail via ssh sessions, and a tail running in the background writing to the flash... Hopefully that will capture something.

I will post the info once it crashes...

September 4, 2014

Author

Just had the hang, and the tail caught a bunch of errors on the parity drive. In the zip attached, I have the syslog backup from after boot, the syslog tail, and a smart report on the parity disk.

unraid.zip

September 4, 2014

Just had the hang, and the tail caught a bunch of errors on the parity drive. In the zip attached, I have the syslog backup from after boot, the syslog tail, and a smart report on the parity disk.

Just a bit of info: the errors are on the ATA bus, not the drive. The drive is fine. The SATA error reported is 0xffffffff, meaning all error flags are set, which is meaningless. Something else in the system went wrong and hung the I/O, but no further info is available.

September 4, 2014

Author

Not sure if it helps, but I had top running at the time of the freeze...

https://www.dropbox.com/s/zqsz2irr8iowijf/Screen%20Shot%202014-09-03%20at%2010.00.10%20PM.png?dl=0

September 5, 2014

Author

Interesting: from the top I had running, smbd was the top process when it crashed last time. I started up the server and manually stopped Samba. It actually ran all the way through a parity check for 10+ hours and didn't crash. I did see a group of similar bus errors, but the server didn't crash. I am going to try some rsyncs with Samba stopped and see how it runs..

September 5, 2014

Author

So with Samba shut down, the same 2 rsyncs that would lock the server ran all night. They transferred almost 1.5TB between them. Could there be some conflict going on with the smbd daemon? I tried using an NFS client while the rsyncs were running and didn't see any issues.

September 5, 2014

Author

Started Samba back up, and after about 10-15 minutes the rsyncs stopped and it started spitting out massive amounts of XFS, disk, and reiserfs errors..

Here is the syslog, and a screen shot of some XFS errors on the console:

syslog: https://www.dropbox.com/s/czkqzwx0v7zsgi6/syslog?dl=0

Console: https://www.dropbox.com/s/0rk0spkbcko2npm/2014-09-05%2007.14.18.jpg?dl=0

And now disk2 is red-balled, but a SMART report showed no disk errors...

September 6, 2014

Author

Guess I am totally out of ideas... I rebuilt parity with smbd down. Once it was complete, I kicked off a parity check to make sure all was good, and sometime overnight it locked up again... I didn't have a tail of syslog going since I was confident it was working. Running the auto parity check now with a tail of syslog...

September 9, 2014

Looking into this...

September 12, 2014

Author

I tried using Beta9 on this server, and after about an hour it hung, but I didn't have a tail running on the log. I restarted, which kicked off a parity check. It is still running, but it is kicking out strange CPU stall messages in syslog..

syslog.zip

September 12, 2014

You have a whole lot of stuff going on in that syslog. My suggestion is to please simplify back to no extra plugins, maybe disable AFP if you can run without it. Then run like that and see if issues persist. If not, then add stuff back one at a time.

September 12, 2014

Author

The only 2 plugins that I am running are for APC and cache_dirs. I went to edit the boot directory to disable those plugins before rebooting, and for some reason /boot is empty. I tried to just reboot, and of course I am remote right now and it is hung in limbo... I'll try getting it started when I get home, run with no plugins or AFP, and see what happens.

September 13, 2014

Author

Booted the server with no plugins and AFP disabled, and after a couple of hours the server was hung, with no errors on the tail from the syslog.

I restarted the server with no plugins, and with AFP, Samba, and NFS all stopped. It has now run all night, is about 75% complete on the parity check, and is still running fine.

The only other thing that was strange, on the first hang I was checking the web UI pretty frequently clicking on the main tab to update the % complete and it was progressing. I waited a few hours, and then when I checked the web UI it was hung. When I checked the tail, the last message was pretty recent to me checking the UI and seeing it was hung.

This morning I checked via shell first, looking around the system before touching the web UI and all was good. Not sure if that was all just coincidence...

The only thing that is pretty consistent is that if I cut off all client access during the parity build, it usually completes successfully.

I have run rsync's on the system with all external access stopped and seemed to work fine. The main client is a SageTV server that is capturing 1080i video. I wonder if it is the high write access coming in via SMB from that server that could be causing the issue?

September 14, 2014

Author

I brought up all client access, left SageTV down and have rsync'd about 1TB of data to other drives without issue. I am wondering if it is related to the heavy IO from sage to XFS disks. I am going to switch the share to include only a btrfs volume and restart it and see how it runs writing new content to btrfs instead of XFS.

September 14, 2014

Author

Tested recordings and rsyncing both to a btrfs volume and haven't seen any issues. Might just convert all disks to btrfs to see if it keeps the server stable...

September 16, 2014

Author

Was going pretty well, migrated about 4 disks to btrfs from xfs without issue, and just hung with:

Sep 16 02:25:07 Dumpster kernel: ata14.00: exception Emask 0x52 SAct 0x0 SErr 0xffffffff action 0xe frozen

Sep 16 02:25:07 Dumpster kernel: ata14: SError: { RecovData RecovComm UnrecovData Persist Proto HostInt PHYRdyChg PHYInt CommWake 10B8B Dispar BadCRC Handshk LinkSeq TrStaTrns UnrecFIS DevExch }

Sep 16 02:25:07 Dumpster kernel: ata14.00: failed command: WRITE DMA EXT

Sep 16 02:25:07 Dumpster kernel: ata14.00: cmd 35/00:00:d0:6c:05/00:04:be:00:00/e0 tag 16 dma 524288 out

Sep 16 02:25:07 Dumpster kernel: res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x56 (ATA bus error)

Sep 16 02:25:07 Dumpster kernel: ata14.00: status: { DRDY }

Sep 16 02:25:07 Dumpster kernel: ata14: hard resetting link

It is a strange coincidence, most of the time I find it hung, I have just checked the web page via iPad browser and find it down and the errors seem to start right about that time..

September 17, 2014

Author

After a parity check and migrating multiple disks over from xfs to btrfs, last night during a copy a ton of errors were produced on one of the btrfs volumes, but the server didn't hang; it just shows a lot of errors on the webui. I have attached the syslog.

https://www.dropbox.com/s/fpnh2zgghwv0vie/dumpster.zip?dl=0

September 17, 2014

After a parity check and migrating multiple disks over from xfs to btrfs, last night during a copy a ton of errors were produced on one of the btrfs volumes, but the server didn't hang; it just shows a lot of errors on the webui. I have attached the syslog.

Looks like a hardware issue with the 2 identical 2-port cards connecting Disk 1 through Disk 4. At Sep 17 01:20:44, Disk 2 (sdb, ata10.00) became unresponsive. After a minute of trying to reset it and reestablish the connection, the kernel gave up at 01:21:44 and disabled it (all subsequent Disk 2 error messages can be ignored). Seconds later, at 01:22:15, Disk 1, Disk 3, and Disk 4 also dropped away, unresponsive, and it appears all 3 were subsequently disabled (again, you can ignore all subsequent disk error messages). These 4 drives are on the first 2 identical 2-port cards. Since all 4 drives lost communications at almost the same time, you have to assume that a common factor caused both cards to crash. The good news is that there is probably nothing at all wrong with any of the drives. A possible fix to test is to move the 4 drives to the motherboard ports, even if that is possibly a little slower.
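For anyone who wants to repeat this kind of check on a saved syslog, here is a minimal Python sketch that groups libata exception lines by port, so that ports failing within seconds of each other stand out. It assumes the standard syslog prefix seen in this thread (for example `Sep 16 02:25:07 Dumpster kernel: ata14.00: exception ...`). The name `exceptions_by_port` is only illustrative, and mapping an `ataN` port back to a physical disk or card still has to be done by hand from the boot messages.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Sep 16 02:25:07 Dumpster kernel: ata14.00: exception Emask 0x52 ...
# Assumption: classic syslog timestamp, a hostname, then "kernel:".
EXCEPTION_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"(?P<port>ata\d+)(?:\.\d+)?:\s+exception\b"
)


def exceptions_by_port(lines):
    """Return {port: [timestamps]} for every libata 'exception' line."""
    hits = defaultdict(list)
    for line in lines:
        match = EXCEPTION_RE.match(line)
        if match:
            hits[match.group("port")].append(match.group("stamp"))
    return dict(hits)


# Two lines copied from the syslog excerpt posted in this thread.
sample = [
    "Sep 16 02:25:07 Dumpster kernel: ata14.00: exception Emask 0x52 "
    "SAct 0x0 SErr 0xffffffff action 0xe frozen",
    "Sep 16 02:25:07 Dumpster kernel: ata14: hard resetting link",
]

for port, stamps in exceptions_by_port(sample).items():
    print(port, stamps)
```

To run it against a real log, pass an open file instead of `sample`, e.g. `exceptions_by_port(open("syslog"))`, where `syslog` is whatever copy of the log you saved.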

Was going pretty well, migrated about 4 disks to btrfs from xfs without issue, and just hung with:

Sep 16 02:25:07 Dumpster kernel: ata14.00: exception Emask 0x52 SAct 0x0 SErr 0xffffffff action 0xe frozen

Sep 16 02:25:07 Dumpster kernel: ata14: SError: { RecovData RecovComm UnrecovData Persist Proto HostInt PHYRdyChg PHYInt CommWake 10B8B Dispar BadCRC Handshk LinkSeq TrStaTrns UnrecFIS DevExch }

Sep 16 02:25:07 Dumpster kernel: ata14.00: failed command: WRITE DMA EXT

Sep 16 02:25:07 Dumpster kernel: ata14.00: cmd 35/00:00:d0:6c:05/00:04:be:00:00/e0 tag 16 dma 524288 out

Sep 16 02:25:07 Dumpster kernel: res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x56 (ATA bus error)

Sep 16 02:25:07 Dumpster kernel: ata14.00: status: { DRDY }

Sep 16 02:25:07 Dumpster kernel: ata14: hard resetting link

It is a strange coincidence, most of the time I find it hung, I have just checked the web page via iPad browser and find it down and the errors seem to start right about that time..

This syslog piece shows an unusual crash, probably of a card or controller chip, which responded to the exception handler with SATA error code 0xffffffff, meaning all possible SATA error flags are being raised. Since that is essentially impossible, I have to assume the controller has crashed. This is a different session and syslog, so I cannot assume it is the same controller as in the other syslog, but it does seem suspicious.
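To make the "all flags raised" point concrete, here is a small Python sketch that decodes an SErr value into the flag names the kernel prints. The bit positions are taken from my reading of the `SERR_*` constants in the Linux libata headers, so treat the table as an assumption to verify against your kernel source. The function names are just illustrative.

```python
# SError bit positions as named in libata's log output.
# Assumption: positions follow the SERR_* constants in the kernel headers.
SERROR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the names of all known flags set in a 32-bit SErr value."""
    return [name for bit, name in sorted(SERROR_BITS.items())
            if value & (1 << bit)]


def unnamed_bits(value):
    """Return set bit positions that the table above has no name for."""
    return [bit for bit in range(32)
            if value & (1 << bit) and bit not in SERROR_BITS]


def reads_all_ones(value):
    """True when the register read back with every one of its 32 bits set."""
    return value & 0xFFFFFFFF == 0xFFFFFFFF


serr = 0xFFFFFFFF  # the value from the log excerpt quoted above
print(decode_serror(serr))
print(unnamed_bits(serr))
print(reads_all_ones(serr))
```

A value with every named flag set, plus bits the table has no name for, is much more likely a register that could not be read properly than a link that suffered every possible error at once, which is consistent with the reading above that the controller itself fell over.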

I should add that these drive and controller issues may or may not have any connection to your CPU stall issues reported earlier.

September 18, 2014

Author

Glad you noticed that...

The onboard ports make it crash pretty quickly. I had switched to the external cards about a year ago when I figured out the onboard ports would make it crash, and it had been pretty stable under v5 with the pci-e cards. I just ordered a new 8 port controller that is pretty stable in my other server under v6, so I will give that a try this weekend..

Thanks for your help..
