
Another Freeze - screen photo this time


tiwing


As many have had happen in these parts, I've had 2 unraid boxes freeze and become entirely unresponsive. Powering off and rebooting solves the problem for a while. One box goes down almost weekly; the other lasts 3 weeks or so.

 

Both are 6.1.9. Both have ~250GB twin SSD cache pools with mover enabled.

Both are Xeon W3550, although the 8GB machine was running on an older i5 until recently and experienced the same problem. Both are HP-based machines, bought off-lease because they're cheap :)

 

Machine 1

8GB Ram, 2 Win10 VMs using 6GB total Ram and 3 cores.

No plugins, no dockers. It's a very bare system intended only for point of sale and backup file storage.

2x 4TB WD Reds.

 

Machine 2

24 GB Ram

1 Win10 VM using 4GB Ram and 2 cores

FTPD plugin

Couchpotato and Deluge dockers running, MythTV not running.

7 TB total across a mix of 2 TB and 3 TB drives.

Machine is lightly used.

 

I managed to get a photo of the 24GB machine after it froze... attached.

 

Thoughts?

 

Cheers. tiwing.

[Attachment: 20160505_224405.jpg - screen photo of the frozen 24GB machine]

Link to comment

The common factor in recent freezes, in my opinion, seems to be VM's on recent unRAID versions, possibly Win10 VM's.  I don't know why.  KVM has not had the long history that VMware and others have had, so perhaps it's just 'youthful inexperience', lacking all the workarounds the others have added.

 

You're also running cache pools on both, something that can be problematic too if they become corrupted, which is a possibility after any freeze. And if they have problems, they can fill up the logging space, which can cause more issues.

 

Try the Fix Common Problems plugin on both. And keep an eye on the log space on the Dashboard. At the first sign of trouble, grab the diagnostics.
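
If it's easier from a console or SSH session, here's a rough sketch of the same checks (assuming a stock 6.1 layout where /var/log sits on a small tmpfs in RAM; if your build includes the command-line diagnostics script, it saves the same zip as the webGUI's Tools -> Diagnostics page):

# See how full the log filesystem is
df -h /var/log

# If it's filling, find the biggest offenders
du -sh /var/log/* 2>/dev/null | sort -h | tail

# Capture a diagnostics zip (same as Tools -> Diagnostics in the webGUI)
diagnostics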

Link to comment

I have a similar issue with one of my servers (it lasts about 2 weeks). In my case the GUI sometimes locks up, but at other times it just displays error messages about not being able to write due to no space. I did find that removing a plugin temporarily solves the problem (removing one bought me a couple of hours the last time it happened).

 

In my case at least, I know the log space (as well as everything else in the df output) showed space available, but the console was filled with error messages like yours.

 

The only thing I see in common is the couchpotato docker (not sure if we have the same version). In my case I have a few dockers, but no VM's.

Link to comment
The common factor in recent freezes, in my opinion, seems to be VM's on recent unRAID versions, possibly Win10 VM's.

I did have this issue on one server before installing a VM though - and it looks like bondoo0 might have the same issue as me with no VMs running. Still, might be a contributing factor. In order to test that, I'll probably move my VMs onto a single unraid box and leave the other without VMs. If that proves to be the problem... it becomes decision time for what to do longer term. I chose unraid because of its ability to run VMs on a single piece of back room hardware fronted by "disposable" thin clients in a somewhat dirty environment ...

 

You're also running cache pools on both, something that can be problematic too if they become corrupted, which is a possibility after any freeze. And if they have problems, they can fill up the logging space, which can cause more issues.

 

Try the Fix Common Problems plugin on both. And keep an eye on the log space on the Dashboard. At the first sign of trouble, grab the diagnostics.

Thank you - I've run the Fix Common Problems plugin on both machines. My 24GB machine had a number of minor things - mostly 32 bit packages on the flash drive - and a few plugins not set to auto update. My 8GB machine had nothing found. I haven't kept a super close eye on log space, but every time I have looked it's been at 2%. I did have the mover writing to the log, which I've turned off. I've also extended the DHCP lease to 10 days to reduce those writes to the log. Maybe that will keep the log from filling up so quickly while I'm not looking (around 4am, going by the logger).

 

Is there any way to check for cache pool corruption?

 

The only thing I see in common is the couchpotato docker
True - but that's only on one of my servers, and I have the same issue with both servers... :)

 

In this case, from looking at the installpkg script, it is unable to write to /var/log

What is the installpkg script? I haven't been manually installing anything when these freezes happen... unless I'm assuming wrong from the name of that script. Could it perhaps be a permissions thing, where some process doesn't have permission to write to the log? Damn, I wish I knew more about this stuff. Hopefully one day I'll be helping, not asking for help!

 

It looks like there might be some similarity between my screenshot and this thread: https://lime-technology.com/forum/index.php?topic=41963.0 (Squid?)

 

 

Link to comment

It looks like there might be some similarity between my screenshot and this thread: https://lime-technology.com/forum/index.php?topic=41963.0 (Squid?)

 

Definitely good to try a df -h if you can, to see if a filesystem was full (/var/log seems to be the common one to fill up), and of course keep an eye on the %'s in the dashboard. I can say that in my case it wasn't full when I had the issue, but hopefully you have more luck tracking down the issue than I have had so far :)
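
If you want to catch it in the act, a simple way (just a sketch) is to leave a couple of SSH sessions open watching it:

# Re-check the log filesystem every 60 seconds
watch -n 60 df -h /var/log

# In a second session, follow the syslog for the first errors
tail -f /var/log/syslog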

Link to comment

In this case, from looking at the installpkg script, it is unable to write to /var/log

What is the installpkg script? I haven't been manually installing anything when these freezes happen... unless I'm assuming wrong from the name of that script. Could it perhaps be a permissions thing, where some process doesn't have permission to write to the log? Damn, I wish I knew more about this stuff. Hopefully one day I'll be helping, not asking for help!

 

It looks like there might be some similarity between my screenshot and this thread: https://lime-technology.com/forum/index.php?topic=41963.0 (Squid?)

installpkg is a Slackware script that installs packages. All installed plugins use it. Unless you caught when those messages appeared on the screen and we could correlate that to a time in the syslog, it's meaningless. I was just pointing out that those messages on the screen meant that it couldn't write to /var/log (presumably it was trying to write an error message about being unable to write an error message).
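
To illustrate what I mean (the path and package name below are made up), a plugin's .plg file ends up running something along these lines, which is why a full or unwritable /var/log makes an install complain:

# installpkg is the stock Slackware installer; it records each install under /var/log/packages
installpkg /boot/config/plugins/example/example-package.txz

# The records it leaves behind:
ls /var/log/packages | tail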

 

Permissions on /var/log? I guess it's possible, but I doubt it.

 

unregister_netdevice (the link you posted) is a harmless but annoying error message that pops up on all open SSH / local terminals. I usually only see it under heavy I/O on the docker image.
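
If you want to confirm that's the message you're seeing, a quick check (just a sketch) is:

# The kernel message reads like "unregister_netdevice: waiting for ... to become free"
grep -i unregister_netdevice /var/log/syslog
dmesg | grep -i unregister_netdevice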

 

Link to comment

Out of curiosity, on the machine where the 32 bit packages were found - did you originally upgrade it from unRAID v5?

 

Also, I should probably adjust the wording in the error message, but you're going to have to reboot after deleting the offending files.

Link to comment

The common factor in recent freezes, in my opinion, seems to be VM's on recent unRAID versions, possibly Win10 VM's.

I did have this issue on one server before installing a VM though - and it looks like bondoo0 might have the same issue as me with no VMs running. Still, might be a contributing factor. In order to test that, I'll probably move my VMs onto a single unraid box and leave the other without VMs. If that proves to be the problem... it becomes decision time for what to do longer term. I chose unraid because of its ability to run VMs on a single piece of back room hardware fronted by "disposable" thin clients in a somewhat dirty environment ...

I didn't want to imply unRAID can't run VM's successfully; many users are running them without issue. But there's a significant number with freezing issues lately, and I don't know what's common to their setups.

 

I haven't kept a super close eye on log space, but every time I have looked it's been at 2%. I did have the mover writing to the log, which I've turned off.

 

Is there any way to check for cache pool corruption?

It would be great if there was a simple and easy tool for that, but there isn't, not one that we have found so far.

 

I've also extended the DHCP lease to 10 days to reduce those writes to the log. Maybe that will keep the log from filling up so quickly while I'm not looking (around 4am, going by the logger).

Better yet, switch to static IP's!  ;)

Link to comment

Out of curiosity, on the machine where the 32 bit packages were found - did you originally upgrade it from unRAID v5?

 

Also, I should probably adjust the wording in the error message, but you're going to have to reboot after deleting the offending files.

Hi, it was a fresh v6 install, to 6.1.7 I think. But I accidentally tried to install a 32 bit plugin from an old forum thread and only realized it when it didn't work! I have removed the files and rebooted... thanks. Your other information - all good to know, thank you.

 

Better yet, switch to static IP's!  ;)

I'm lazy. :) I've been using static DHCP (is that a thing? Hopefully you know what I mean) for the key machines on my network, but yeah, I should probably switch to static...

 

 

Link to comment

Better yet, switch to static IP's!  ;)

I'm lazy. :) I've been using static DHCP (is that a thing? Hopefully you know what I mean) for the key machines on my network, but yeah, I should probably switch to static...

Yeah I knew what you meant by 'static DHCP'!  Good name for it.

 

But the reason I've begun strongly favoring static IP's is that I have seen at least one case, maybe a second one too, where Dockers, the networking, and possibly the virtual bridging had issues immediately after the first IP lease renewal. Something does not appear to be working right when it's renewed, and a bunch of network errors start clogging the syslog from that point on. It seems prudent, therefore, to remove this potential source of network trouble, as well as another source of /var/log filling. Especially when it's so easy to do!
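
For reference only, and purely from memory (the key names can differ between versions, and the supported way is the Settings -> Network Settings page in the webGUI), a static setup in /boot/config/network.cfg looks roughly like this, with the addresses made up for the example:

# /boot/config/network.cfg - illustrative sketch only; back up the original first
USE_DHCP="no"
IPADDR="192.168.1.50"
NETMASK="255.255.255.0"
GATEWAY="192.168.1.1"
DNS_SERVER1="192.168.1.1"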

Link to comment
  • 2 months later...

I want to provide an update to this thread. The problem I was having seems to have been solved! I have had both unraid boxes running fine for over 90 days now with no problems whatsoever.

 

I found a possible solution in another thread (I can't find it again now to link to it). It appears the issue is caused by running a Windows 10 VM connected to a wireless printer (my servers are on unRAID 6.1.9). Hard-wiring the printer to the network seems to have solved the problem. No idea why it worked that way, but for anyone else who has the same kind of intermittent failure, try running a network cable to your printer instead of using it over WiFi.

 

This also explains why one server went down weekly, and the other less frequently - it was due to the frequency of printing at different locations.

 

Hopefully this post is useful to someone else who might have the same kind of problem.

 

Cheers,

Tiwing.

Link to comment

Archived

This topic is now archived and is closed to further replies.
