Troubleshooting crash maybe related to a bad docker container

JohanSF · November 8, 2018

This is a continuation of:

and

with diagnostics as per Squid's instructions. However, I did have to reboot in order to start the Docker service again.
hal9000-diagnostics-20181108-1715.zip

In response to

I do have my appdata on the cache drive. I also think the mover did run last night; the 260 GB here makes sense, as I downloaded large content just after the first crash. But it does indeed seem to have something to do with the cache drive and/or a container.

Edited November 8, 2018 by JohanSF

JorgeB · November 8, 2018

Your cache drive is fully allocated and needs a balance, see here:

https://lime-technology.com/forums/topic/62230-out-of-space-errors-on-cache-drive/?do=findComment&comment=610551

JohanSF · November 8, 2018

32 minutes ago, johnnie.black said:

Your cache drive is fully allocated and needs a balance, see here:

https://lime-technology.com/forums/topic/62230-out-of-space-errors-on-cache-drive/?do=findComment&comment=610551

Alright, I don't really know what I am doing, but you are asking me to run

btrfs balance start -dusage=75 /mnt/cache

in the console right?

Edited November 8, 2018 by JohanSF

JorgeB · November 8, 2018

Yes, and as mentioned in the linked thread:

Quote

If you get ENOSPC, lower the 75 until you can complete a balance, e.g. try -dusage=50, 25 and so on, then run it again with a higher number until it completes with at least 75.
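The stepping described in the quote can be sketched as a small loop. This is only an illustration, not something from the linked thread: it starts with a low `-dusage` value and works upward, which ends at the same place as lowering 75 after an ENOSPC. The function name `balance_stepwise` and the `BTRFS_CMD` override are hypothetical; only `btrfs balance start -dusage=N <mount>` is the real command.

```shell
#!/bin/sh
# Sketch: run a data balance in steps, raising the usage filter each pass.
# BTRFS_CMD is a hypothetical override so the loop can be dry-run with a stub.
BTRFS_CMD="${BTRFS_CMD:-btrfs}"

balance_stepwise() {
    mount_point="$1"
    shift
    for pct in "$@"; do
        echo "balancing data chunks that are at most ${pct}% full"
        # Stop at the first failure (for example ENOSPC) and report the step.
        if ! "$BTRFS_CMD" balance start "-dusage=${pct}" "$mount_point"; then
            echo "balance failed at -dusage=${pct}" >&2
            return 1
        fi
    done
}

# Example (needs root and a mounted btrfs pool), left commented out:
# balance_stepwise /mnt/cache 10 25 50 75
```

If a step fails, rerunning with a lower first value is the manual equivalent of "lower the 75" in the quote.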

JohanSF · November 8, 2018

I think it's doing its thing now.

chrome_2018-11-08_18-11-35.png.6c791a55524ed0d03a477776f9cd707a.png

JorgeB · November 8, 2018

Yes, if it were going to error out it would usually do so quickly. If you want, post new diags when it finishes to confirm all is well.

JohanSF · November 8, 2018

I got this:

image.png.5ea98bbc4902cb23c3c5b6a4150f0205.png

I stopped the parity check and tried this:
image.png.a70771b045476d13d6696000217b905d.png

All good now? If yes, that was an easy fix. Can you explain more about what is going on and how you diagnosed it?

New diagnostics:

hal9000-diagnostics-20181108-1833.zip

Edited November 8, 2018 by JohanSF

JorgeB · November 8, 2018

Yes, it's fine, and it shouldn't happen again. This only happens with older kernels, or with users who came from older kernels and never ran a balance, which I assume is your case.

Before:

                  Data      Metadata System              
Id Path           single    single   single   Unallocated
-- -------------- --------- -------- -------- -----------
 1 /dev/nvme0n1p1 474.93GiB  2.01GiB  4.00MiB    56.00KiB
-- -------------- --------- -------- -------- -----------
   Total          474.93GiB  2.01GiB  4.00MiB    56.00KiB
   Used           240.03GiB  1.27GiB 80.00KiB

After:

                  Data      Metadata System              
Id Path           single    single   single   Unallocated
-- -------------- --------- -------- -------- -----------
 1 /dev/nvme0n1p1 253.01GiB  3.01GiB  4.00MiB   220.92GiB
-- -------------- --------- -------- -------- -----------
   Total          253.01GiB  3.01GiB  4.00MiB   220.92GiB
   Used           240.02GiB  1.24GiB 64.00KiB

The problem was unallocated space: you didn't have any.
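To put numbers on the two tables: the data chunks were allocated to almost the whole device but only about half full, and the balance compacted them. A rough sketch of that arithmetic follows. The helper names are hypothetical, and it is an assumption that the tables come from something like `btrfs filesystem usage -T /mnt/cache`, since the post does not name the command.

```shell
#!/bin/sh
# Sketch: how full are the allocated data chunks? Values are in GiB and are
# copied from the Before/After tables. chunk_fill_pct is a hypothetical helper.
chunk_fill_pct() {
    used="$1"
    allocated="$2"
    LC_ALL=C awk -v u="$used" -v a="$allocated" 'BEGIN { printf "%.1f\n", u / a * 100 }'
}

# Read a usage table on stdin and print the Unallocated figure from the
# Total row (the last column of that row).
unallocated_total() {
    awk '$1 == "Total" { print $NF }'
}

chunk_fill_pct 240.03 474.93   # before the balance: prints 50.5
chunk_fill_pct 240.02 253.01   # after the balance: prints 94.9
```

Roughly the same amount of data is stored in both cases. What changed is that about 220 GiB went back to the unallocated pool, so new chunks can be allocated again.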

JohanSF · November 8, 2018

I cannot thank you enough, it is good to have a stable system again.

JohanSF · November 9, 2018

I celebrated too early. The whole unRaid server crashed again during the night. It must have been before 3:40 am, as the mover has not run.

Here is the syslog and diagnostics:

syslog.txt (I know that I used the server to watch something on Plex up until about 11 pm)

hal9000-diagnostics-20181109-0622.zip

It should also not be caused by my Ryzen 1700 processor as I have the zenstates script applied to disable C6 states:

image.png.c2b3eb91307b624563a902054cfc753b.png

Edited November 9, 2018 by JohanSF

JohanSF · November 9, 2018

I just updated to 6.6.5 and started the array. Next to the Array status on the main page it now says "BTRFS operation is running".

Now it is unresponsive. Should I hard-restart the machine?

This is becoming a little scary.

Edited November 9, 2018 by JohanSF

bonienl · November 9, 2018

Either a balance or scrub operation is being performed, and the array cannot be stopped until this operation is completed.
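If the terminal is still reachable, which of the two is running can be checked from there. A minimal sketch, assuming the cache pool is mounted at /mnt/cache as earlier in this thread; `show_btrfs_operations` and the `BTRFS_CMD` override are hypothetical names, while `btrfs balance status` and `btrfs scrub status` are the real subcommands.

```shell
#!/bin/sh
# Sketch: print the balance and scrub state of one btrfs mount point.
# BTRFS_CMD is a hypothetical override so the function can be tried with a stub.
BTRFS_CMD="${BTRFS_CMD:-btrfs}"

show_btrfs_operations() {
    mount_point="$1"
    echo "== balance =="
    # balance status exits non-zero while a balance is running, so the exit
    # code is ignored here and only the printed status is used.
    "$BTRFS_CMD" balance status "$mount_point" || true
    echo "== scrub =="
    "$BTRFS_CMD" scrub status "$mount_point" || true
}

# Example (needs root), left commented out:
# show_btrfs_operations /mnt/cache
```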

JohanSF · November 9, 2018

4 minutes ago, bonienl said:

Either a balance or scrub operation is being performed, and the array cannot be stopped until this operation is completed.

Ok. It is unresponsive in the way that on the main page, everything on the page under the disk status boxes is now missing. Using my phone with teamviewer to see it.

I can also see that the log has red errors. I can post that when I get home.

Edited November 9, 2018 by JohanSF

bonienl · November 9, 2018

1 minute ago, JohanSF said:

Ok. It is unresponsive in the way that on the main page, everything on the page under the disk status boxes is now missing. Using my phone with teamviewer to see it.

That doesn't sound right. Do you use BTRFS for the array or the cache or both?

JohanSF · November 9, 2018

Only for the cache.

bonienl · November 9, 2018

It might be a corrupted cache file system. Can you post diagnostics? If the GUI doesn't work, then use terminal/telnet and type 'diagnostics'; the zip file will be saved on your flash device in the /logs folder.

Probably need the help of the true expert @johnnie.black
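Once the command finishes, the newest archive can be picked out with something like the sketch below. The helper name `latest_diagnostics` is hypothetical, and `/boot/logs` is an assumption about where the flash device's /logs folder is mounted; the file name pattern follows the archives already posted in this thread.

```shell
#!/bin/sh
# Sketch: print the newest diagnostics archive in a directory.
# latest_diagnostics is a hypothetical helper; /boot/logs is an assumed
# default for the /logs folder on the flash device.
latest_diagnostics() {
    log_dir="${1:-/boot/logs}"
    # ls -t sorts newest first; errors are silenced when nothing matches.
    ls -t "$log_dir"/*-diagnostics-*.zip 2>/dev/null | head -n 1
}

# Example, after running `diagnostics` in the terminal:
# latest_diagnostics /boot/logs
```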

JohanSF · November 9, 2018

1 hour ago, bonienl said:

It might be a corrupted cache file system. Can you post diagnostics? If the GUI doesn't work, then use terminal/telnet and type 'diagnostics'; the zip file will be saved on your flash device in the /logs folder.

Probably need the help of the true expert @johnnie.black

I can click Download diagnostics but it is collecting diagnosis information forever and the download never happens.

Trying with the terminal method I get "Starting diagnostics collection..." and nothing happens.

Update: It seems I cannot restart it remotely, so I will have to do a hard reset when I get home. I really hope the cache drive is not corrupted.

Edited November 9, 2018 by JohanSF

JorgeB · November 9, 2018

If you can't get diags before rebooting, grab and post them right after rebooting.

JohanSF · November 9, 2018

Got home to this log:

Restarted the machine with hardware button. Here are the diagnostics before starting the array:

hal9000-diagnostics-20181109-1539.zip

It started, parity check runs and dockers started too. I am looking at this now:

Should I start the Troubleshooting Mode in "Fix Common Problems"?

Edit: Not sure I can do that though, the "Scanning" when I enter the page seems to stay there forever. This is in the log:

image.png.1c37414509a0354f6ee7110e0ea19ccd.png

Edited November 9, 2018 by JohanSF

John_M · November 9, 2018

I'd like to know exactly what the nginx errors you're seeing are about, as I've seen them myself on occasion but I've never seen an explanation for them. The Web GUI pages are really quite complicated and for nginx to serve one up it has to retrieve the sources from multiple locations, most of which are dynamic and dependent on scripts completing and returning the necessary code. That looks as though it's failing here and causing the unresponsiveness.

JohanSF · November 9, 2018

Can I restart the plugin? I don't want to restart the whole server now that the parity check is running.

JorgeB · November 9, 2018

2 hours ago, JohanSF said:

diagnostics before starting the array:

Sorry, my fault, I meant diags right after starting the array.

JohanSF · November 9, 2018

Here:

hal9000-diagnostics-20181109-1756.zip

JorgeB · November 9, 2018

Except for the nginx errors, whose meaning I also don't know, though I see them frequently, all appears good; the cache is fine.

JohanSF · November 9, 2018

That is good news. Do you have any idea what to do about them? Do I have to restart?
