Server crashing again

stublehustle · February 16, 2017

My server had been crashing recently, and we tracked it to a corrupt cache drive, which I cleared and started over with. Today the server has been crashing again. I'm not very good at reading logs, but from what I can tell it might have something to do with BTRFS, which leads me to believe it's the cache drive again, though I'm not sure what would be corrupting it. Here are all my logs and a screenshot of the monitor when I plugged it in. Any help would be greatly appreciated.

FCPsyslog_tail.zip

server-diagnostics-20170215-1939.zip

John_M · February 16, 2017

The Fix Common Problems syslog tail shows a segfault following a BTRFS checksum error on the filesystem contained in your docker.img file and then an Out Of Memory error:

Feb 14 20:33:55 Server kernel: BTRFS warning (device loop0): loop0 checksum verify failed on 191053824 wanted 850A780D found C12EEF04 level 0

Feb 14 20:33:58 Server kernel: notify[10708]: segfault at 0 ip 00000000005f42ad sp 00007ffda29924e0 error 4 in php[400000+724000]

Feb 14 20:34:02 Server kernel: Plex Media Serv invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=0, order=0, oom_score_adj=0

Later there's BTRFS corruption found on your cache disk itself:

Feb 15 03:03:11 Server kernel: BTRFS critical (device sde1): corrupt leaf, bad key order: block=173359104, root=1, slot=151

repeated many times, plus a call trace. Later there's more BTRFS trouble:

Feb 15 19:07:57 Server kernel: kernel BUG at fs/btrfs/ctree.c:3168!

Feb 15 19:07:57 Server kernel: invalid opcode: 0000 [#1] PREEMPT SMP

then a kernel oops and finally:

Feb 15 19:07:57 Server kernel: Fixing recursive fault but reboot is needed!
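For anyone who finds a raw syslog hard to read, the events picked out above can be counted with a short script. This is only a rough sketch: the regular expressions are based on the log lines quoted in this post, and the category names and the `summarize` helper are made up for illustration, not part of unRAID or any tool.

```python
import re
from collections import Counter

# Patterns modelled on the syslog lines quoted in this thread.
# The category names are illustrative, not an official taxonomy.
PATTERNS = {
    "btrfs_checksum": re.compile(r"BTRFS warning .*checksum verify failed"),
    "btrfs_corrupt_leaf": re.compile(r"BTRFS critical .*corrupt leaf"),
    "segfault": re.compile(r"segfault at "),
    "oom_killer": re.compile(r"invoked oom-killer"),
    "kernel_bug": re.compile(r"kernel BUG at "),
}


def classify(line):
    """Return the first matching category name, or None for other lines."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def summarize(lines):
    """Count each category and record the timestamp of its first occurrence."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        kind = classify(line)
        if kind is None:
            continue
        counts[kind] += 1
        # Classic syslog timestamps ("Feb 14 20:33:55") occupy the first 15 characters.
        first_seen.setdefault(kind, line[:15])
    return counts, first_seen
```

Passing the lines of an extracted syslog file to `summarize` gives a count per category and the time each one first appeared, which makes the order of events (checksum error, segfault, OOM, corrupt leaf, kernel BUG) easier to see.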

Since it seems to have begun with a segfault, you should run Memtest for 48 hours, as a memory problem is a relatively easy thing to detect and fix. I don't know whether the BTRFS errors are a symptom or a cause, but once the RAM is either eliminated as the culprit or replaced, you need to address both the file system errors on the cache and those in the docker.img file.

It's probably going to be easier to delete the docker.img file, then re-create it and download the containers again than to try to repair it - but fix the cache disk problem first. Regarding the cache disk itself, you say you have had problems with it before. I don't know anything about that and you haven't provided a link so I'll just say that you might want to consider replacing the cache disk.

Your magnetic disks look OK but they all show evidence of having had cable problems in the past. You may be aware of this already - if not, you'll want to make sure the SATA cables are all seated securely at both ends.

stublehustle · February 16, 2017

I'll run Memtest for the next 48 hours and see what that turns up. I've got another SSD lying around, so I may throw that in and take out the current drive. I was using old cables and actually replaced all of them over the weekend with new ones, so I'll double-check my seating. I'll let you know what the Memtest turns up.

Thanks!

John_M · February 16, 2017

The cable-caused errors are historic: they are recorded in the SMART reports, not in the current syslog. So, as you recently replaced the cables, you have probably fixed that problem. No harm in checking again, though.
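If you want to keep an eye on this yourself, one common indicator of SATA cabling trouble is the SMART attribute `UDMA_CRC_Error_Count` (ID 199). The thread doesn't say which attribute the diagnosis here was based on, so treat that choice as an assumption. The sketch below pulls the raw value out of the attribute table printed by `smartctl -A`; the function name is made up for illustration, and the column layout is assumed to be the usual ten-column smartmontools table.

```python
def crc_error_count(smartctl_output):
    """Return the raw UDMA_CRC_Error_Count from `smartctl -A` text, or None.

    Assumes the usual attribute table layout, where the raw value is the
    tenth whitespace-separated column of the attribute's row.
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == "UDMA_CRC_Error_Count":
            try:
                return int(fields[9])
            except ValueError:
                return None  # raw value was not a plain integer
    return None
```

The raw count is cumulative over the drive's life, so a non-zero value on its own only tells you there were errors at some point. What matters after a cable swap is whether the number keeps climbing between checks.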

stublehustle · February 21, 2017

The first Memtest run turned up some memory errors, so I pulled the DIMMs I suspected were bad and I'm running a new Memtest. So far, so good, but I'm letting it run another 12 hours to complete the full 48 again.

I'll rebuild my cache drive afterwards and hopefully the problem will be solved. Since I'm going to be using just one cache drive and don't plan to add a second in the immediate future, would I be better off running XFS on the cache drive?

I do have a question about the RAM. I had 24GB, made up of two 8GB DIMMs and two 4GB DIMMs. I took out the two 4GB DIMMs, which leaves me with 16GB. I was reading in this forum (I think) that it's better to run two sticks than four. Will 16GB be adequate for running Plex, CouchPotato, SickRage, DuckDNS, Deluge and any future dockers, or would I be better served by adding the 8GB back in once I get replacements from Kingston?

John_M · February 21, 2017

If you have a UPS you might as well stick with BTRFS, though some people would recommend switching to XFS, which seems to recover better from unclean shutdowns.

Four sticks of unbuffered RAM can present too big a load for some motherboards, so unless yours is on the manufacturer's qualified RAM list it's safer to stick with two. That said, many people get away with it, though to stand a fighting chance I'd try to use the same type. Of those dockers you mention, I only run Plex. The best way to find out is to try it and see how you get on.
