app data backup tanking server, possible other issues... no idea


1812


The original topic was called "QEMU eating vdisks making them non-bootable? Or just bad luck?", but I think I've found the culprit, or at least a culprit... or I don't know. See the update below.

 

I've had this happen on 2 different servers over the past 3-4 days.

 

A couple of days ago, I was using a VM on one server, and the VM just stopped working (like a hard power off), but the server was still on. After that, neither of the VMs on that machine would boot, even after a server reboot.

 

This morning, I woke up to find my other main server pegged at 100% cpu usage. I was able to get into the terminal and run pftop which showed qemu was the culprit.

 

--------

UPDATE

So after recreating the firewall VM, I told appdata backup to run. It did, shutting down the dockers and doing its thing, but it returned errors and never restarted the dockers. Then the array went to undefined and showed:

 

 

 

 

 

1389364048_ScreenShot2018-11-29at6_17_56AM.png.b50d175c0ece19e09c2530c525324718.png

 

So I had to do a hard reset of the entire server. When my firewall VM tried to boot, it was stuck at the shell. Even recreating the XML didn't help; I had to rebuild the entire VM.

 

I've got to rebuild the other 2 VMs on the first server this morning... just curious if anyone else has had anything like this pop up on 6.6.5?

 

-----

 

 

after rebuilding the firewall vm, i then ran app data backup. it stopped all the dockers, did some things, then reported an error. the array then went to an undefined state:

266303223_ScreenShot2018-11-29at9_47_35AM.thumb.png.2da4219bc89ec05d22bf4ab2b9aaba95.png

 

 

 

 

1814002914_ScreenShot2018-11-29at9_47_41AM.thumb.png.ea02653d81d9ffa1383f417195751c04.png

 

 

all discs now show under unassigned... and the dashboard shows no icons

 

1463679960_ScreenShot2018-11-29at9_50_59AM.png.21fbf2b2097e4067fff5c9a08a8bb370.png

 

2126553506_ScreenShot2018-11-29at10_03_09AM.thumb.png.d71c22432708cfceee8552afd95997a6.png

1358880308_ScreenShot2018-11-29at10_03_23AM.png.63e6839bd76fe5315d3cf90856857d4c.png

 

 

But the firewall is still running.... i attempted to get the diagnostics.zip but clicking download only reloads the page. however I was able to copy the entire syslog from the webgui:  ----------> syslog.rtf

 

I also noticed that unassigned devices does not mount smb folders anymore, and it shows an error in the log, but manually clicking mount will connect. Same issue with disks not auto-mounting in unassigned devices.

 

rebooting the server clears the issue, but I'm not sure for how long. I've disabled app data backup for the moment.

 

 

any thoughts anyone????

 

 

Edited by 1812
Link to comment
2 hours ago, Squid said:

diagnostics after this happens again would be a plus

I know it would be helpful, but it ends up locking up everything and won't download them. As a side update, I am starting to think that it isn't an app data backup issue but rather a docker one. It hung for quite a long time after restarting Krusader last night. After it finally reloaded, I attempted a docker scrub and it locked up.....I'll have more time to investigate this coming week. Maybe a corrupted image that is causing issues when appdata backup runs and cycles the dockers off/on?

 

 

There is a copy of the syslog in the original post but i'll repost here:

 

On 11/29/2018 at 9:04 AM, 1812 said:
9 hours ago, John_M said:

How is it if you boot into Safe Mode?

server boots fine but no issues observed because nothing is loaded in terms of plugins, dockers, vms, etc...

 

for the moment, as long as I don't touch it or mess with dockers that have already auto-started, it seems stable and the running dockers are operational.

Link to comment

Then, when you're in the mood to play around, put fix common problems into troubleshooting mode and then do a backup

 

You have something else going on than the high I/O being generated from the reads on your cache drive.

 

I can guarantee that the backup plugin will not stop your array (ever), does not stop VMs (ever). 

Link to comment
19 minutes ago, 1812 said:

server boots fine but no issues observed because nothing is loaded in terms of plugins, dockers, vms, etc...

Which is a good thing. You have a stable platform and by selectively switching the higher level functions off and on you should be able to isolate the culprit with relative ease.

Link to comment
23 hours ago, Squid said:

Then, when you're in the mood to play around, put fix common problems into troubleshooting mode and then do a backup 

 

You have something else going on than the high I/O being generated from the reads on your cache drive.

 

I can guarantee that the backup plugin will not stop your array (ever), does not stop VMs (ever). 

 

will do and report back in a few days. I often forget about the troubleshooting mode in fcp... thanks!

Link to comment
On 12/1/2018 at 10:10 AM, Squid said:

I can guarantee that the backup plugin will not stop your array (ever), does not stop VMs (ever). 

 

OK. So I manually ran the appdata backup after putting fix common problems in troubleshooting mode. It ran the backup and said it was done with errors. The dockers never restarted. Then errors started pouring into the syslog. I also attached a video showing how wonky it was afterward (at the bottom of this post), which also shows it not downloading diagnostics. The array status kept changing from green to orange.

 

 

It was all very exciting ... lol...

 

It did not stop the firewall this time. a reboot returned it to normal operation (I wasn't previously able to reboot via webgui but could this time.)

 

there was also something that i assume was a popup that never rendered correctly on the right side that said undefined:undefined undefined undefined.

220092787_ScreenShot2018-12-03at12_35_04PM.thumb.png.8537c9d4ab1c167a60d2df080565ebe7.png

 

 

I was able to get to the syslog, and while it wouldn't download, i could copy/paste it into a document. it is here: errors.rtf
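For anyone wanting to skim a pasted syslog like this, here is a minimal sketch of one way to tally error-looking lines per process. It assumes the log has been saved as plain text (not RTF) and that lines follow the classic "Mon DD HH:MM:SS host tag[pid]: message" layout. The function name `summarize` and the keyword list are my own assumptions for illustration, not anything Unraid or the plugins provide.

```python
import re
from collections import Counter

# Assumed layout: "Dec  3 12:40:01 host tag[pid]: message" (classic syslog style).
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s+(?P<host>\S+)\s+"
    r"(?P<tag>[^\s:\[]+)(?:\[\d+\])?:\s*(?P<msg>.*)$"
)

# Hypothetical keyword list; adjust to whatever shows up in your own log.
KEYWORDS = ("error", "fail", "timeout", "call trace")


def summarize(text):
    """Count lines whose message contains a keyword, grouped by process tag.

    Returns (counts, first_seen): a Counter of matching lines per tag and a
    dict mapping each tag to the timestamp of its first matching line.
    """
    counts = Counter()
    first_seen = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything that is not a syslog-style line
        message = match.group("msg").lower()
        if any(keyword in message for keyword in KEYWORDS):
            tag = match.group("tag")
            counts[tag] += 1
            first_seen.setdefault(tag, match.group("stamp"))
    return counts, first_seen
```

Calling `summarize()` on the text of the saved file and looking at `counts.most_common()` next to `first_seen` gives a rough idea of which process started complaining first. It is only a filter for reading the log faster, not a diagnosis.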

 

I looked in the appdata backup location for the tar to check the "verify errors occurred" message indicated by the logs, but only found 3 older backups from mid November and no new backups. The libvirt backup has a timestamp that coincides with this backup, so it was copied over. There were no other browser tabs open at the time either.

 

FCP syslog tail: FCPsyslog_tail.txt

 

What looks like an auto-generated diag file from shutdown: brahms1-diagnostics-20181203-1247.zip

 

and the movie:

 

 

 

 

Pretty neat, huh?

 

Edited by 1812
Link to comment