Server hang after upgrading from 6.3.0


Recommended Posts

Ultimately my opinion here is that anyone who has "hanging" issues when utilizing a scheduled backup with CA backup has other issues with the server.  Whilst I'm not saying or trying to imply that the backup plugin is perfect, there is nothing in there that would cause a server to actually crash.

 

The backup is simply an rsync command followed by an rm -rf if necessary.

 

And in such crash issues, running the script manually succeeds with no problems.

 

Net result is that unfortunately, I am not of much help here.  About the only suggestion I have is to limit the disks utilized to a single one (ideally XFS), either within the share settings or in the backup settings.

Link to comment


Thanks for your input Squid, and I completely understand your answer. However, could it be something to do with stopping all dockers and restarting them all at once before the actual rsync command is run? I have previously limited all share writes to one XFS disk and this didn't help. This is obviously a weird issue and not many people seem to be having it, so I know it's hard to pinpoint answers. I'm a noob and just throwing my two cents around is all.

Sent from my SM-G930F using Tapatalk

Link to comment
37 minutes ago, mgladwin said:

However could it be something to do with stopping all dockers and restarting them all at once

The exact same process is done every time you stop and start the array....

38 minutes ago, mgladwin said:

This is obviously a weird issue and not many people seem to be having it, so I know it's hard to pinpoint answers.

Yeah....  Only thing that is similar is that some users also have problems with unRaid's mover, and no adequate answer has ever been given (or found for that matter)...  And they both utilize rsync.  

 

One user found that if CA backup doesn't delete the old expired backup sets, then the crash didn't happen.  This tends to imply an issue with the filesystem, since the script issues a standard Linux (rm -rf) command to delete a directory.

 

The only log that I've seen showing the crash shows that rsync just plain stopped.  Didn't exit or anything.  Just stopped doing what it was doing and went into (presumably) an infinite loop.  Implies to me again a file system issue, or an rsync issue.  The script starts the rsync command and then just sits there twiddling its electronic thumbs until the command completes (successfully or not)

 

To my knowledge, anyone who has seen the problem can run the backup manually and have it succeed all the time.  Implies there's nothing wrong with the file system or with rsync.  (And if anything, running the script manually is more likely to cause a GUI crash since the GUI is actually somewhat involved (starts it and monitors it).  The script itself has no concept of how it was started, and does not adjust itself at all to whether it's started via cron or via the GUI)

 

Possible related issue: a few people have seen shfs running at 100% and locking up the system on unRaid.  AFAIK it tends to get blamed on RFS (rightly or wrongly).

 

TLDR: filesystem issues (whether direct corruption or an issue with unRaid's user share system) are at the top of my list of suspects, but my arguments for it also have arguments against it.

 

As noted, the vast majority of users have zero issues.  Only a very small handful of users are seeing this behaviour.

 

For lack of anything else, my suggestions are:

 

Run memtest from the boot menu for at least a single pass.

Convert the entire array over to XFS.  Don't mix filesystem types.  This includes the cache drive.  Anecdotal evidence continually shows that BTRFS does have some problems. 

Within the backup settings (not share settings), confine the destination to a single disk.

 

If possible after a crash / lockup, SSH into the server, grab diagnostics, and run this command

 

cp  /var/lib/docker/unraid/ca.backup.datastore/appdata_backup.log   /boot

 

And then post the last bunch of lines from that file (now on the flash drive) along with the diagnostics (and a picture of what's on the locally attached monitor if there is one)

Link to comment

Every time I've had an issue I haven't been able to SSH in or use local console. With that said, I might use the user scripts plugin to run the appdata backup log copy command you suggested say every 30 seconds and overwrite or append file or something to help capture this. Does that sound reasonable to try in my case? Or maybe just a "tail" capture would suffice? I have recently ran jb's unbalance to help move data around array while I was converting all disks to xfs. This all happened without issue and I believe it uses same rsync command with different flags though. Have done everything else you suggested in your last post bar changing my cache from btrfs to xfs. I have activated appdata backup again to disk 3 (xfs) only. See how tonight goes.
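That periodic-capture idea could look something like the sketch below (a hypothetical user script; the log path is taken from Squid's earlier command, everything else is assumed). Copying only the tail keeps the repeated writes to the flash drive small:

```shell
#!/bin/sh
# Hypothetical periodic-capture sketch: keep only the last lines of the
# backup log somewhere that survives a hard reset.

capture_tail() {
    log="$1"    # e.g. /var/lib/docker/unraid/ca.backup.datastore/appdata_backup.log
    out="$2"    # e.g. /boot/appdata_backup_tail.txt
    tail -n 50 "$log" > "$out" 2>/dev/null
}

# A user script would then loop, e.g.:
#   while true; do capture_tail "$LOG" "$OUT"; sleep 30; done
```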

Sent from my SM-G930F using Tapatalk

Link to comment
5 minutes ago, mgladwin said:

Every time I've had an issue I haven't been able to SSH in or use local console. With that said, I might use the user scripts plugin to run the appdata backup log copy command you suggested say every 30 seconds and overwrite or append file or something to help capture this. Does that sound reasonable to try in my case? Or maybe just a "tail" capture would suffice? I have recently ran jb's unbalance to help move data around array while I was converting all disks to xfs. This all happened without issue and I believe it uses same rsync command with different flags though. Have done everything else you suggested in your last post bar changing my cache from btrfs to xfs. I have activated appdata backup again to disk 3 (xfs) only. See how tonight goes.

Sent from my SM-G930F using Tapatalk
 

The backup log will survive a reboot, so a tail isn't really necessary.  If anything, install Fix Common Problems and start up troubleshooting mode, and then afterwards upload the end of the backup log, FCPsyslog_tail.txt, and the last generated diagnostics.  

 

EDIT: and a pic of what's on the monitor before you reset

 

Edited by Squid
Link to comment

Ok, so I saw there was a "copy rsync log to flash" setting in the CA Backup settings, which I wondered if it did the same thing as you suggested. So I changed the settings back to backing up to a user share and not a particular disk, and thought I would test to see what was actually output to the flash drive. I came back in 5 minutes and it's locked up just the same!

SSH'ed in and logged in, ran " cp  /var/lib/docker/unraid/ca.backup.datastore/appdata_backup.log   /boot " which gave me this (I assumed a typical timeout or something):

 

[Screenshot attachment: 20170401_132214.jpg]

 

Then this was added to it after about 1 minute:

 

[Screenshot attachment: 20170401_132257.jpg]

 

And this is local monitor:-

 

[Screenshot attachment: 20170401_132240.jpg]

 

Right now I can't access the GUI, dockers, anything. I will hard restart, put it in troubleshooting mode, and repeat the same procedure, hopefully capturing the logs needed.

 

Will report back again when that's done.

 

Cheers

Link to comment
32 minutes ago, mgladwin said:

Ok so I saw there was a copy rsync log to flash setting in the CA Backup Settings which I wondered if this did same thing as you suggested.

It is, but by the time it copies the log it's too late for what I wanted.  It takes a while because it is a huge file.

 

IRQ 16: disabled    Now that's what I was looking for.  :D (Well not exactly, but it is something to go on, and in retrospect makes some sense)

 

After you reboot, can you post the output of

 

cat   /proc/interrupts

 

 

 

 

Edited by Squid
Link to comment

Sorry was one step ahead already.

 

I rebooted, put it in troubleshooting mode, and re-ran a manual appdata backup.

Seemed to finish OK (on the status page in CA backup settings) and I was able to use some dockers but not all (like some had restarted but not all of them).

Still no GUI, and I forgot to check the local monitor! I also forgot to capture the CA appdata backup log. Doh!

This time I could seemingly also still use the console, so I did the " cat   /proc/interrupts " as requested.

All the info (I remembered to get) is attached to this post. And now I'm in trouble with my 4 year old daughter as her movie on Plex keeps stopping!

Look forward to hearing from you Squid!

Cheers.

 

FCPsyslog_tail.txt

tower-diagnostics-20170401-1351.zip

20170401_141024.jpg

 

EDIT: last few lines of appdata_backup.log (seems all ok to me)

 

2017/04/01 14:03:03 [30434] >f+++++++++ unifi/logs/server.log.1
2017/04/01 14:03:03 [30434] >f+++++++++ unifi/logs/server.log.2
2017/04/01 14:03:03 [30434] >f+++++++++ unifi/logs/server.log.3
2017/04/01 14:03:03 [30434] cd+++++++++ unifi/run/
2017/04/01 14:03:03 [30434] >f+++++++++ unifi/run/firmware.json
2017/04/01 14:03:03 [30434] >f+++++++++ unifi/run/update.json
2017/04/01 14:03:03 [30434] sent 23,661,736,341 bytes  received 2,539,866 bytes  41,048,180.76 bytes/sec
2017/04/01 14:03:03 [30434] total size is 23,642,044,893  speedup is 1.00
Restarting Duckdns
Restarting letsencrypt
Restarting NZBGet
Restarting ombi
Restarting openvpn-as
Restarting plex
Restarting plexpy
Restarting quassel-core
Restarting radarr
Restarting sonarr
Restarting tvheadend
Restarting unifi
Backup/Restore Complete.  Rsync Status: Success
Deleting Dated Backup set: /mnt/user/appdata_backup/[email protected]
Deleting /mnt/user/appdata_backup/[email protected]
Deleting Dated Backup set: /mnt/user/appdata_backup/[email protected]
Deleting /mnt/user/appdata_backup/[email protected]

 

Edited by mgladwin
Link to comment

I think @RobJ might be able to help out here on why IRQ 16 is getting disabled (it appears to happen under high I/O load).

 

Since that screenshot was presumably after the reboot (when IRQ 16 is functioning normally), can you try and replicate this again, and after the IRQ gets disabled do the same 

 

cat /proc/interrupts again as a comparison  (But do it after your daughter goes to sleep)

 

Link to comment
No worries. Will try and replicate. Generally I don't have access to console in any form after a lock up. How could I capture interrupts in this case? That last cat interrupts I posted was after a lock up but I seemed to still have SSH console available and I don't remember seeing the irq 16 message to be honest.

Sent from my SM-G930F using Tapatalk

Link to comment
12 minutes ago, mgladwin said:


No worries. Will try and replicate. Generally I don't have access to console in any form after a lock up. How could I capture interrupts in this case? That last cat interrupts I posted was after a lock up but I seemed to still have SSH console available and I don't remember seeing the irq 16 message to be honest.

Sent from my SM-G930F using Tapatalk
 

If you have no access to the console, then you won't be able to do anything.

 

But tomorrow sometime I'm going to get you to change the rsync parameters in the options to lower its I/O bandwidth to see if it makes a difference.   We're down to it being a hardware issue.   No clue as to what the actual problem is or why it's happening, but we may be able to work around it nonetheless.   I just gotta do some testing to see what to get you to change in the parameters

Link to comment

Try this out:  (Note that for users who have the lockups when a backup is NOT set to run, I can guarantee that your issue has nothing to do with me, whether or not CA Appdata Backup is installed.)  It also appears that this issue is purely a hardware / driver issue.  The following is a possible workaround that may help.

 

Update CA Appdata Backup to 2017.04.01

 

Create this file on the flash drive:

/config/plugins/ca.backup/nice

 

Edit the file, and have it contain the following:

 

nice  -n19  ionice  -c3

 

Then try the backup again.  What that line above is doing is running the rsync and the rm command at the lowest priority for both CPU and I/O.  No guarantees, but maybe just maybe....
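A sketch of how a prefix file like that might be consumed (an assumed mechanism for illustration; the plugin's actual code may differ): the file's contents get prepended to the command that is run.

```shell
#!/bin/sh
# Hypothetical: read an optional prefix (e.g. "nice -n19 ionice -c3")
# from a file and prepend it to whatever command follows.
run_with_prefix() {
    prefix_file="$1"
    shift
    prefix=""
    [ -f "$prefix_file" ] && prefix=$(cat "$prefix_file")
    # word-splitting of $prefix is intentional: an empty prefix
    # simply disappears and the command runs normally
    $prefix "$@"
}

# e.g. run_with_prefix /boot/config/plugins/ca.backup/nice rsync -a SRC DEST
```

nice -n19 drops the process to the lowest CPU priority, and ionice -c3 puts it in the idle I/O scheduling class, so it only gets disk time when nothing else wants it.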

Edited by Squid
Link to comment
7 hours ago, Squid said:

Try this out:  (Note that for users who have the lockups when a backup is NOT set to run, I can guarantee that your issue has nothing to do with me, whether or not CA Appdata Backup is installed.)  It also appears that this issue is purely a hardware / driver issue.  The following is a possible workaround that may help.

 

Update CA Appdata Backup to 2017.04.01

 

Create this file on the flash drive:

/config/plugins/ca.backup/nice

 

Edit the file, and have it contain the following:

 

nice  -n19  ionice  -c3

 

Then try the backup again.  What that line above is doing is running the rsync and the rm command at the lowest priority for both CPU and I/O.  No guarantees, but maybe just maybe....

 

I have also done as suggested and will report back soon. Either way, thanks very much for your help on this Squid.

 

EDIT: So I added the 'nice' file as per above and updated CA Appdata Backup. I also upgraded unRAID to 6.3.3.

Put unRAID in troubleshooting mode and ran a manual appdata backup.

Captured appdata_backup.log and all seemed to run/finish perfectly. All dockers came back up, and the Web GUI and SSH window were both responsive after the backup. I'm calling that a clean, successful backup. It also didn't seem to take any longer than previous backups. I will continue to monitor over the next few days and obviously report back. So far so good!

Again @Squid thank you very much for your effort to help us all out and I hope this solution can help fix others having similar problems.

I'm not counting my chickens just yet, but it seems we might be on a winner.

Edited by mgladwin
Link to comment

Ok, so the time is now nearly 8:15 am and my backup is still going (started at 3 am). I think it was doing this yesterday as well, and it's also slowing down my parity check (I think).

 

No dockers were started yesterday due to the backup running (I think), the parity check was running at about 4 MB/s, and everything else was unresponsive after about 2 min of trying to do something. So while the backups are still going, the server acts like it is hung.  Also, yesterday afternoon I tried deleting old backups over the keep threshold, and it took over an hour and the server was unresponsive for a short while (like it had hung).

 

Are we starting to see that Backup & Restore is hogging the system during backups?

 

My current parity check is going at 1.2 MB/s and is going to take 3 days 21 hrs to complete just under 4 TB of the 8 TB parity drive.

 

One thing is for sure: I'm able to use the GUI and view the backup that's currently happening.

Link to comment

There's something else going on. My mod is a hail Mary workaround that might mask the problem, based upon the IRQ disabled message posted above. Ultimately this issue is out of my hands, and more info (especially screenshots of the local monitor) is what's required to help others assist you guys.

Sent from my LG-D852 using Tapatalk

Link to comment
Just now, mgladwin said:

Unfortunately still getting IRQ 16 disabled and lockups. Will continue to fault find, but to be honest it's way above me. Worth starting a new thread for this @Squid?

Sent from my SM-G930F using Tapatalk
 

I would say yes.  It's beyond me as we're going into hardware issues...

Link to comment

So I decided to upgrade to 6.3.3 two days ago, which is why I haven't posted anything back.

 

So far so good; the last two nights I haven't had any hangs/lockups.

 

I did note that in 6.3.3 reiserfsprogs has been downgraded; not sure if this is why there are no lockups, or even if it makes any difference.

 

@mgladwin have you upgraded to 6.3.3 yet?

Link to comment
