Big trouble, SYSLOG reports "rcu: INFO: rcu_bh self-detected stall on CPU"

joebot · September 27, 2020

A few days ago, after making a simple change to my router (probably unrelated; I was testing Pi-hole and told my router to use public DNS instead of the Pi-hole), I looked over at the dashboard and noticed that it had frozen. At the time it froze, all of my CPU indicators were maxed out. I waited for a while, and it didn't come back. None of the disks were being accessed, and since it was hard-crashed, I figured I'd better just cycle the power.

On reboot (boot to gui, normal mode), the unraid gui environment loaded, but the webserver would not start and none of my server's services were operational. I can reboot into safe mode and see the webui just fine. I have verified that the server is connecting to the network (it pings just fine, and ifconfig reports the correct network configuration).

I really need help figuring this out, as all my home automations are down and it's making the wife angry.

Facts recap:

server hung during normal operation (pegged CPUS, no disk access, webui crashed)
server will boot to normal mode, but the webui does not function
server will boot to safe mode with functional webui
server has no problem talking to the network
I don't know what I'm doing
Wife is mad

A few other interesting things:

When booted to normal mode (webui doesn't work), I can see that one of the cpu cores is pegged at 100%. I have no idea why.
Upon trying to access the webui from a different computer (while booted into normal mode), SYSLOG reports "rcu: INFO: rcu_bh self-detected stall on CPU"

I have attached a syslog (starts with enabling logging to flash booted to safe mode, followed by a reboot to normal mode, followed by me fruitlessly trying to safely shut the server down again) and my diagnostics file.

Please help!! Thanks!

syslog skippy-diagnostics-20200927-1553.zip

Edited September 28, 2020 by joebot

joebot · September 29, 2020

Anyone have an idea? I'm getting desperate here. I've tried disabling plugins and then dockers (but not at the same time). Neither try helped.

akshunj · September 30, 2020

I'm relatively new to unraid, but I do have some linux knowledge. Are you on stable (just curious)?

From your syslog, it looks like your cache filesystem has become corrupt (maybe from your first dirty reboot?) and the kernel is squawking about not being able to mount it. I don't think the cache filesystem is mounted in safe-mode, but I could be wrong (again, I am an unraid noob). I think there's a way to repair the file system from safe mode using "maintenance mode" on the webgui, but someone else would have to give you more guidance. I'm also not sure if this is your only problem.

JorgeB · September 30, 2020

Yes, cache filesystem is crashing, you need to re-format the pool, some recovery options here if needed.

joebot · September 30, 2020

11 hours ago, akshunj said:

I'm relatively new to unraid, but I do have some linux knowledge. Are you on stable (just curious)?

From your syslog, it looks like your cache filesystem has become corrupt (maybe from your first dirty reboot?) and the kernel is squawking about not being able to mount it. I don't think the cache filesystem is mounted in safe-mode, but I could be wrong (again, I am an unraid noob). I think there's a way to repair the file system from safe mode using "maintenance mode" on the webgui, but someone else would have to give you more guidance. I'm also not sure if this is your only problem.

5 hours ago, JorgeB said:

Yes, cache filesystem is crashing, you need to re-format the pool, some recovery options here if needed.

Yeah I'm on the stable branch. Two questions:

What in the syslog data tipped you off about the cache filesystem? I totally didn't see that.
Considering that my server hard crashed, what were my options for returning to normal operation? Hard reboot seemed like the only option since it was unresponsive...

JorgeB · September 30, 2020

3 minutes ago, joebot said:

What in the syslog data tipped you off about the cache filesystem? I totally didn't see that.

btrfs starts crashing right after cache mount:

Sep 27 15:56:57 Skippy kernel: BTRFS info (device sdc1): enabling ssd optimizations
Sep 27 15:56:57 Skippy kernel: BTRFS info (device sdc1): start tree-log replay
Sep 27 15:56:57 Skippy kernel: ------------[ cut here ]------------
Sep 27 15:56:57 Skippy kernel: kernel BUG at fs/btrfs/extent-tree.c:6862!
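As a rough way to spot this kind of thing yourself, the sketch below filters a saved syslog for kernel crash markers and btrfs complaints. The function name `scan_syslog` and the pattern list are my own assumptions, not an Unraid tool, and the pattern is certainly not exhaustive.

```shell
#!/bin/sh
# Sketch: list kernel-level filesystem trouble in a saved syslog.
# scan_syslog is a hypothetical helper name, not part of Unraid.
# Usage: scan_syslog /path/to/syslog
scan_syslog() {
    # -n prefixes each hit with its line number,
    # -i ignores case, -E enables the alternation syntax.
    grep -n -i -E 'kernel BUG at|BTRFS (error|critical|warning)|cut here|self-detected stall' "$1"
}
```

Run against the excerpt above, it would flag the `cut here` and `kernel BUG` lines and skip the two `BTRFS info` lines. A hit that appears right after a mount message is a hint to look at that filesystem first.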

6 minutes ago, joebot said:

Considering that my server hard crashed, what were my options for returning to normal operation? Hard reboot seemed like the only option since it was unresponsive...

Yep

joebot · September 30, 2020

21 minutes ago, JorgeB said:

btrfs starts crashing right after cache mount:

Thanks a ton - I would have never interpreted that as an unmountable cache file system. Do you have any tips to help me spot something like this in the future?

Edited September 30, 2020 by joebot

joebot · September 30, 2020

7 hours ago, JorgeB said:

Yes, cache filesystem is crashing, you need to re-format the pool, some recovery options here if needed.

I'm trying to mount the cache pool as prescribed in that link. I have a pool of two SSDs. I tried to mount the first disk with:

mount -o usebackuproot,ro /dev/sdd1 /x

and it returned "Segmentation fault" - what is that all about?!

...so I tried mounting the other drive with:

mount -o usebackuproot,ro /dev/sdc1 /x

and the cursor simply did a carriage return when I pushed enter and nothing has happened... what does that mean, if anything?

Edited September 30, 2020 by joebot

JorgeB · September 30, 2020

14 minutes ago, joebot said:

and it returned "Segmentation fault" - what is that all about?!

It means it's still crashing even in ro mode, try btrfs restore, also on that link.
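For reference, a minimal sketch of what a `btrfs restore` run might look like. The device `/dev/sdc1` comes from the mount attempt above; the destination `/mnt/disk1/restore` is an assumed path on an array disk with enough free space, and `restore_cmd` is a hypothetical helper that only prints the command, so nothing touches the disks until you run it yourself.

```shell
#!/bin/sh
# restore_cmd MODE DEVICE DEST  (hypothetical helper, prints only)
# MODE "list" adds -D (dry run: only list what would be recovered).
# MODE "copy" prints the real restore command. -v is verbose output.
restore_cmd() {
    mode=$1; dev=$2; dest=$3
    case "$mode" in
        list) printf 'btrfs restore -D -v %s %s\n' "$dev" "$dest" ;;
        copy) printf 'btrfs restore -v %s %s\n' "$dev" "$dest" ;;
        *)    echo "usage: restore_cmd list|copy DEVICE DEST" >&2; return 2 ;;
    esac
}

# Assumed device and destination; adjust both before running anything.
restore_cmd list /dev/sdc1 /mnt/disk1/restore
restore_cmd copy /dev/sdc1 /mnt/disk1/restore
```

Doing the dry run first shows whether the tool can read the damaged filesystem at all before any data gets copied.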

joebot · September 30, 2020

2 minutes ago, JorgeB said:

It means it's still crashing even in ro mode, try btrfs restore, also on that link.

ok will do. also, I just noticed that one of my CPUs is still pegged at 100%. any idea why that might be? I'm still in safe mode, but I have a few plugins loaded.

JorgeB · September 30, 2020

Best to reboot after the crash/segfault.

joebot · September 30, 2020

57 minutes ago, JorgeB said:

Best to reboot after the crash/segfault.

ok that helped! now, another important question - how do I mount one of my array drives without messing up parity? I'm asking because I got as far as mounting one of the disks (manually, through CLI, to /mnt/rescue) and added a folder called 'cachedump' before realizing that this might be a terrible idea. did I totally donk up the array?

I haven't written to that folder I made on the array disk, but I did create the folder. Also, after starting the array in safe mode, I did not find a /disks folder in /mnt. should I have?

Edited September 30, 2020 by joebot

JorgeB · September 30, 2020

/mnt/rescue would be in RAM, just unassign the cache devices, start the array normally and use it to restore the data.

JorgeB · September 30, 2020

20 minutes ago, joebot said:

I got as far as mounting one of the disks

Oh, and this is enough to cause a few sync errors (unless done in read-only mode); you'll need to run a parity check.

joebot · September 30, 2020

ok. i assume that I can just delete the folder I created on the disk (/mnt/rescue/cachedump) and then parity check will fix it?

JorgeB · September 30, 2020

No need, after you reboot the next time it will be gone.

joebot · September 30, 2020

is there something special I need to do to unassign the cache disks? when I change the drop down to "no device" the system complains and grays out the start array button. sorry for being so dense, and thank you so much for your help thus far!

JorgeB · September 30, 2020

There's a checkbox next to array start button to allow starting without cache assigned.

joebot · September 30, 2020

ok I got the array running without the cache but then I realized that I have a backup of my cache pool because I'm using the automatic backup plugin (CA Auto Backup). is there a way to leverage that to make this easier?

JorgeB · September 30, 2020

If there's nothing else you need on cache you can wipe the devices and re-format, then restore from backup.

joebot · September 30, 2020

if I backed up appdata that should pretty much cover me for my dockers, right? There's also a libvirt backup, but I don't have any VMs on this machine. I've confirmed that nothing has changed with my dockers since the last backup. All the plugins reside on the flash drive, right?

All my shares are set to either cache "Yes" or "No", with the exception of System Data, which I set to "Prefer"...

Since system data is set to prefer and there's nothing in the backup directory for that, does that mean that I will have problems if I wipe the disks and use plugin's restore function?? Is there anything else that I could conceivably have only on my cache disk? I haven't made any transfers to the array in quite some time before the crash.

JorgeB · September 30, 2020

16 minutes ago, joebot said:

if I backed up appdata that should pretty much cover me for my dockers, right?

Yep.

16 minutes ago, joebot said:

Since system data is set to prefer and there's nothing in the backup directory for that, does that mean that I will have problems if I wipe the disks and use plugin's restore function??

That will likely have the docker and libvirt images, docker can be recreated, libvirt doesn't matter if you don't have VMs.

17 minutes ago, joebot said:

Is there anything else that I could conceivably have only on my cache disk? I haven't made any transfers to the array in quite some time before the crash.

Then everything should be on the array, assuming mover has been running without issues.
