
Big trouble, SYSLOG reports "rcu: INFO: rcu_bh self-detected stall on CPU"


joebot


A few days ago, after making a simple change to my router (probably unrelated: I was testing Pi-hole and told my router to use public DNS instead of Pi-hole), I looked over at the dashboard and noticed that it had frozen.  At the time it froze, all of my CPU indicators were maxed out.  I waited for a while, and it didn't come back.  None of the disks were being accessed, and since it was hard-crashed, I figured I'd better just cycle the power.

 

On reboot (boot to GUI, normal mode), the Unraid GUI environment loaded, but the web server would not start and none of my server's services were operational.  I can reboot into safe mode and see the webui just fine.  I have verified that the server is connecting to the network (it responds to pings, and ifconfig reports the correct network configuration).

 

I really need help figuring this out, as all my home automations are down and it's making the wife angry.

 

Facts recap:

  • server hung during normal operation (pegged CPUs, no disk access, webui crashed)
  • server will boot to normal mode, but the webui does not function
  • server will boot to safe mode with functional webui
  • server has no problem talking to the network
  • I don't know what I'm doing
  • Wife is mad :(

 

A few other interesting things:

  • When booted to normal mode (webui doesn't work), I can see that one of the cpu cores is pegged at 100%. I have no idea why.
  • Upon trying to access the webui from a different computer (while booted into normal mode), SYSLOG reports "rcu: INFO: rcu_bh self-detected stall on CPU"

 

I have attached a syslog (starts with enabling logging to flash booted to safe mode, followed by a reboot to normal mode, followed by me fruitlessly trying to safely shut the server down again) and my diagnostics file.

 

Please help!! Thanks!

syslog skippy-diagnostics-20200927-1553.zip

Edited by joebot

I'm relatively new to unraid, but I do have some linux knowledge.  Are you on stable (just curious)?  

 

From your syslog, it looks like your cache filesystem has become corrupt (maybe from your first dirty reboot?) and the kernel is squawking about not being able to mount it.  I don't think the cache filesystem is mounted in safe-mode, but I could be wrong (again, I am an unraid noob).  I think there's a way to repair the file system from safe mode using "maintenance mode" on the webgui, but someone else would have to give you more guidance.  I'm also not sure if this is your only problem.

11 hours ago, akshunj said:

I'm relatively new to unraid, but I do have some linux knowledge.  Are you on stable (just curious)?  

 

From your syslog, it looks like your cache filesystem has become corrupt (maybe from your first dirty reboot?) and the kernel is squawking about not being able to mount it.  I don't think the cache filesystem is mounted in safe-mode, but I could be wrong (again, I am an unraid noob).  I think there's a way to repair the file system from safe mode using "maintenance mode" on the webgui, but someone else would have to give you more guidance.  I'm also not sure if this is your only problem.

5 hours ago, JorgeB said:

Yes, cache filesystem is crashing, you need to re-format the pool, some recovery options here if needed.

Yeah I'm on the stable branch. Two questions:

 

  1. What in the syslog data tipped you off about the cache filesystem?  I totally didn't see that.
  2. Considering that my server hard crashed, what were my options for returning to normal operation? Hard reboot seemed like the only option since it was unresponsive...
3 minutes ago, joebot said:

What in the syslog data tipped you off about the cache filesystem?  I totally didn't see that.

btrfs starts crashing right after cache mount:

Sep 27 15:56:57 Skippy kernel: BTRFS info (device sdc1): enabling ssd optimizations
Sep 27 15:56:57 Skippy kernel: BTRFS info (device sdc1): start tree-log replay
Sep 27 15:56:57 Skippy kernel: ------------[ cut here ]------------
Sep 27 15:56:57 Skippy kernel: kernel BUG at fs/btrfs/extent-tree.c:6862!
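
To spot this kind of thing in a saved syslog, one option is to search for the kernel's crash markers and print the few lines that come before each hit, since those usually show what the kernel was doing at the time. The following is only a minimal sketch and not something from this thread: the function name `find_kernel_bugs` is made up, and the two search strings are simply the ones that appear in this topic.

```shell
#!/bin/bash
# Hypothetical helper (not from the thread): print every line in a syslog
# that contains a kernel crash marker, plus a few lines of leading context.
find_kernel_bugs() {
    local logfile="$1"
    local context="${2:-3}"   # how many preceding lines to show (default 3)
    # -E: extended regex, -n: show line numbers, -B: lines of context before
    grep -E -n -B "$context" 'kernel BUG at|self-detected stall' "$logfile"
}

# Usage (the path is an assumption; point it at wherever your syslog is saved):
#   find_kernel_bugs /var/log/syslog
```

In the excerpt above, the context lines would show the BTRFS "start tree-log replay" message immediately before the BUG line, which is what ties the crash to the cache mount.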

 

6 minutes ago, joebot said:

Considering that my server hard crashed, what were my options for returning to normal operation? Hard reboot seemed like the only option since it was unresponsive...

Yep

21 minutes ago, JorgeB said:

btrfs starts crashing right after cache mount:

 

Thanks a ton - I would have never interpreted that as an unmountable cache file system.  Do you have any tips to help me spot something like this in the future?

Edited by joebot
7 hours ago, JorgeB said:

Yes, cache filesystem is crashing, you need to re-format the pool, some recovery options here if needed.

I'm trying to mount the cache pool as described in that link.  I have a pool of two SSDs.  I tried to mount the first disk with:

 

mount -o usebackuproot,ro /dev/sdd1 /x

 

and it returned "Segmentation fault" - what is that all about?!

 

...so I tried mounting the other drive with:

 

mount -o usebackuproot,ro /dev/sdc1 /x

 

and the cursor simply did a carriage return when I pushed enter and nothing has happened... what does that mean, if anything?
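
For what it's worth, `mount` prints nothing when it succeeds, so a silent return to the prompt is not by itself a sign of trouble. A minimal sketch of a wrapper that runs the same command and then confirms the result might look like this. It is not from the thread: `try_ro_mount` and the `DRY_RUN` variable are hypothetical names, and the wrapper cannot help if the mount hangs inside the kernel.

```shell
#!/bin/bash
# Hypothetical wrapper (not from the thread) around the read-only mount
# command used above. With DRY_RUN=1 it only prints what it would run.
try_ro_mount() {
    local dev="$1" mnt="$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: mount -o usebackuproot,ro $dev $mnt"
        return 0
    fi
    mkdir -p "$mnt"
    mount -o usebackuproot,ro "$dev" "$mnt"
    local rc=$?
    if [ "$rc" -ne 0 ]; then
        # Show the most recent kernel messages to see why the mount failed
        echo "mount exited with status $rc; recent kernel messages:" >&2
        dmesg | tail -n 20 >&2
        return "$rc"
    fi
    # mountpoint -q exits 0 only if the directory is a mount point
    if mountpoint -q "$mnt"; then
        echo "$dev is mounted read-only at $mnt"
    else
        echo "mount returned 0 but $mnt is not a mount point" >&2
        return 1
    fi
}

# Usage (needs root; device and mount point as in the post above):
#   try_ro_mount /dev/sdc1 /x
```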

Edited by joebot
57 minutes ago, JorgeB said:

Best to reboot after the crash/segfault.

OK, that helped! Now, another important question: how do I mount one of my array drives without messing up parity?  I'm asking because I got as far as mounting one of the disks (manually, through the CLI, to /mnt/rescue) and added a folder called 'cachedump' before realizing that this might be a terrible idea.  Did I totally donk up the array?

 

I haven't written to that folder I made on the array disk, but I did create the folder.  Also, after starting the array in safe mode, I did not find a /disks folder in /mnt.  Should I have?

Edited by joebot

If I backed up appdata, that should pretty much cover me for my dockers, right?  There's also a libvirt backup, but I don't have any VMs on this machine.  I've confirmed that nothing has changed with my dockers since the last backup.  All the plugins reside on the flash drive, right?

 

All my shares are set to either cache "Yes" or "No", with the exception of System Data, which I set to "Prefer"...

 

Since System Data is set to "Prefer" and there's nothing in the backup directory for that, does that mean I will have problems if I wipe the disks and use the plugin's restore function?  Is there anything else that I could conceivably have only on my cache disk?  I haven't made any transfers to the array in quite some time before the crash.

16 minutes ago, joebot said:

if I backed up appdata that should pretty much cover me for my dockers, right?

Yep.

 

16 minutes ago, joebot said:

Since system data is set to prefer and there's nothing in the backup directory for that, does that mean that I will have problems if I wipe the disks and use plugin's restore function??

That share will likely hold the docker and libvirt images. The docker image can be recreated, and libvirt doesn't matter if you don't have VMs.

 

17 minutes ago, joebot said:

Is there anything else that I could conceivably have only on my cache disk?  I haven't made any transfers to the array in quite some time before the crash.

Then everything should be on the array, assuming mover has been running without issues.
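
If the cache can be mounted read-only, one way to double-check this before wiping the pool is to list the files that exist on the cache but not on the array. This is a minimal sketch, not something from the thread: `only_in_first` is a made-up helper, and the paths in the usage line are assumptions (`/x` is the mount point used earlier in this topic, and `/mnt/user0` is assumed to be the array-only view of the user shares). File names containing newlines are not handled.

```shell
#!/bin/bash
# Hypothetical helper (not from the thread): list regular files that exist
# under the first directory but have no file at the same relative path
# under the second directory. Only paths are compared, not file contents.
only_in_first() {
    local a="$1" b="$2"
    local list_a list_b
    list_a=$(mktemp)
    list_b=$(mktemp)
    # Build sorted lists of relative paths; LC_ALL=C keeps sort and comm
    # in agreement about ordering.
    (cd "$a" && find . -type f | LC_ALL=C sort) > "$list_a"
    (cd "$b" && find . -type f | LC_ALL=C sort) > "$list_b"
    # comm -23 prints only the lines unique to the first list
    LC_ALL=C comm -23 "$list_a" "$list_b"
    rm -f "$list_a" "$list_b"
}

# Usage (both paths are assumptions, see above):
#   only_in_first /x /mnt/user0
```

Anything this prints lives only on the cache and would need to be copied off before the pool is re-formatted.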
