[SOLVED] WebGUI crashing at times, general protection faults in syslog



My server has been SUPER reliable for the last few years.  Now I'm seeing all shares drop off randomly, and python (or other modules) throwing general protection faults at times.  Like this:

 

Nov 21 07:28:09 Tower kernel: traps: python[9560] general protection ip:2ac4d15689fc sp:7ffc761d11e8 error:0 in libpython2.7.so.1.0[2ac4d1486000+341000]

 

Attaching diagnostics.

 

EDIT:

Seeing tons of segfaults now.  Where should I begin?  RAM?

 

Nov 21 09:51:01 Tower kernel: monitor[6649]: segfault at 2b17ffd61320 ip 00002b17ff24daae sp 00007ffe1262d060 error 4 in libcrypto.so.1.0.0[2b17ff126000+227000]
Nov 21 09:51:01 Tower crond[1715]: exit status 139 from user root /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null
Nov 21 09:51:01 Tower kernel: sendmail[6651]: segfault at 2b82438ac320 ip 00002b8242d98aae sp 00007ffc535588a0 error 4 in libcrypto.so.1.0.0[2b8242c71000+227000]
Nov 21 09:51:49 Tower kernel: php[6863]: segfault at 2b4eef209320 ip 00002b4eee6f5aae sp 00007ffd3f57e7c0 error 4 in libcrypto.so.1.0.0[2b4eee5ce000+227000]
Nov 21 09:51:49 Tower kernel: php[6864]: segfault at 2b99cc3c9320 ip 00002b99cb8b5aae sp 00007ffd19739ed0 error 4 in libcrypto.so.1.0.0[2b99cb78e000+227000]
Nov 21 09:51:49 Tower kernel: php[6865]: segfault at 2afcd970a320 ip 00002afcd8bf6aae sp 00007fffa0818c50 error 4 in libcrypto.so.1.0.0[2afcd8acf000+227000]
Nov 21 09:51:49 Tower kernel: php[6866]: segfault at 2add95a2b320 ip 00002add94f17aae sp 00007ffcfb922570 error 4 in libcrypto.so.1.0.0[2add94df0000+227000]
Nov 21 09:51:49 Tower kernel: php[6867]: segfault at 2b860b9ae320 ip 00002b860ae9aaae sp 00007ffe00302870 error 4 in libcrypto.so.1.0.0[2b860ad73000+227000]
Nov 21 09:51:49 Tower kernel: php[6868]: segfault at 2aaed9715320 ip 00002aaed8c01aae sp 00007ffe5ef804a0 error 4 in libcrypto.so.1.0.0[2aaed8ada000+227000]
Nov 21 09:52:01 Tower kernel: monitor[6921]: segfault at 2b460a4eb320 ip 00002b46099d7aae sp 00007fff6cc50b80 error 4 in libcrypto.so.1.0.0[2b46098b0000+227000]
Nov 21 09:52:01 Tower crond[1715]: exit status 139 from user root /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null
Nov 21 09:52:01 Tower kernel: sendmail[6922]: segfault at 2b3ebfd57320 ip 00002b3ebf243aae sp 00007ffd81cb4140 error 4 in libcrypto.so.1.0.0[2b3ebf11c000+227000]
Nov 21 09:52:49 Tower kernel: php[7128]: segfault at 2b6112a09320 ip 00002b6111ef5aae sp 00007fffda511930 error 4 in libcrypto.so.1.0.0[2b6111dce000+227000]
Nov 21 09:52:49 Tower kernel: php[7133]: segfault at 2b28e06a0320 ip 00002b28dfb8caae sp 00007ffd866faa00 error 4 in libcrypto.so.1.0.0[2b28dfa65000+227000]
Nov 21 09:52:49 Tower kernel: php[7134]: segfault at 2aeddf86d320 ip 00002aedded59aae sp 00007ffeb5d9ba80 error 4 in libcrypto.so.1.0.0[2aeddec32000+227000]
Nov 21 09:52:49 Tower kernel: php[7135]: segfault at 2b62987c6320 ip 00002b6297cb2aae sp 00007fff08d0e7d0 error 4 in libcrypto.so.1.0.0[2b6297b8b000+227000]
Nov 21 09:52:49 Tower kernel: php[7136]: segfault at 2b393178f320 ip 00002b3930c7baae sp 00007ffee8fe1e30 error 4 in libcrypto.so.1.0.0[2b3930b54000+227000]
Nov 21 09:53:01 Tower crond[1715]: exit status 139 from user root /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null
Nov 21 09:53:01 Tower kernel: monitor[7186]: segfault at 2b8211835320 ip 00002b8210d21aae sp 00007fffb7fb0eb0 error 4 in libcrypto.so.1.0.0[2b8210bfa000+227000]
Nov 21 09:53:01 Tower kernel: sendmail[7187]: segfault at 2ad01ca84320 ip 00002ad01bf70aae sp 00007fff3563b9f0 error 4 in libcrypto.so.1.0.0[2ad01be49000+227000]
Nov 21 09:53:49 Tower kernel: php[7396]: segfault at 2ab4cb87e320 ip 00002ab4cad6aaae sp 00007ffc231fe230 error 4 in libcrypto.so.1.0.0[2ab4cac43000+227000]
Nov 21 09:53:49 Tower kernel: php[7397]: segfault at 2ab1d5d20320 ip 00002ab1d520caae sp 00007ffcfeeceef0 error 4 in libcrypto.so.1.0.0[2ab1d50e5000+227000]
Nov 21 09:53:49 Tower kernel: php[7398]: segfault at 2af207ec6320 ip 00002af2073b2aae sp 00007ffe803a92b0 error 4 in libcrypto.so.1.0.0[2af20728b000+227000]
Nov 21 09:53:49 Tower kernel: php[7399]: segfault at 2b935cf29320 ip 00002b935c415aae sp 00007fff93c40bc0 error 4 in libcrypto.so.1.0.0[2b935c2ee000+227000]
Nov 21 09:53:49 Tower kernel: php[7400]: segfault at 2b7803aad320 ip 00002b7802f99aae sp 00007ffc05db27a0 error 4 in libcrypto.so.1.0.0[2b7802e72000+227000]
Nov 21 09:54:01 Tower kernel: monitor[7473]: segfault at 2b2140905320 ip 00002b213fdf1aae sp 00007fff4167ae10 error 4 in libcrypto.so.1.0.0[2b213fcca000+227000]
Nov 21 09:54:01 Tower crond[1715]: exit status 139 from user root /usr/local/emhttp/plugins/dynamix/scripts/monitor &> /dev/null
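For anyone reading a log like this, one quick check is whether the crashes all land on the same spot inside the library. Below is a rough sketch, not an official tool, that parses the kernel's segfault lines and subtracts the library's load address from the instruction pointer and the faulting address. The helper name `parse_segfault` and the field names are made up for illustration. The decoding of `error 4` assumes the standard x86 page-fault error bits (bit 0 = page was present, bit 1 = write access, bit 2 = user mode).

```python
import re

# Matches kernel lines of the form:
#   kernel: php[6863]: segfault at ADDR ip IP sp SP error E in LIB[BASE+SIZE]
SEGFAULT_RE = re.compile(
    r"kernel: (?P<proc>[\w.-]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<lib>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return a dict describing one segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    base = int(m["base"], 16)
    err = int(m["err"], 16)
    return {
        "proc": m["proc"],
        "lib": m["lib"],
        # Offsets relative to where the library was mapped in this process.
        "ip_offset": int(m["ip"], 16) - base,
        "addr_offset": int(m["addr"], 16) - base,
        # x86 page-fault error code bits (assumed, see lead-in).
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


# Two lines copied from the log above.
SAMPLE = [
    "Nov 21 09:51:01 Tower kernel: monitor[6649]: segfault at 2b17ffd61320 "
    "ip 00002b17ff24daae sp 00007ffe1262d060 error 4 in "
    "libcrypto.so.1.0.0[2b17ff126000+227000]",
    "Nov 21 09:51:01 Tower kernel: sendmail[6651]: segfault at 2b82438ac320 "
    "ip 00002b8242d98aae sp 00007ffc535588a0 error 4 in "
    "libcrypto.so.1.0.0[2b8242c71000+227000]",
]

for line in SAMPLE:
    info = parse_segfault(line)
    print(info["proc"], info["lib"], hex(info["ip_offset"]), hex(info["addr_offset"]))
# monitor libcrypto.so.1.0.0 0x127aae 0xc3b320
# sendmail libcrypto.so.1.0.0 0x127aae 0xc3b320
```

Both processes fault at the same offset (0x127aae) in libcrypto while doing a user-mode read of an unmapped page. Different programs dying at the identical instruction is at least consistent with one damaged in-memory copy of the shared library rather than bugs in each program, though the log alone does not prove that.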


 

I wanted to attach an updated diagnostics report, but I get a segmentation fault when attempting that:

root@Tower:~# diagnostics
Segmentation fault
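The `exit status 139` lines from crond in the syslog and the `Segmentation fault` message here look like the same failure seen from two angles. This is a small sketch, assuming the usual shell convention that a status above 128 means "128 plus the signal number" and assuming crond is passing that shell status through. The function name `describe_exit_status` is hypothetical.

```python
import signal


def describe_exit_status(status):
    """Translate a shell-style exit status into a short description."""
    if status > 128:
        signum = status - 128
        try:
            return "killed by " + signal.Signals(signum).name
        except ValueError:
            # Number does not correspond to a signal known on this platform.
            return f"killed by unknown signal {signum}"
    return f"exited with code {status}"


print(describe_exit_status(139))  # on Linux: killed by SIGSEGV
```

So 139 = 128 + 11, and signal 11 on Linux is SIGSEGV, which matches the kernel's segfault lines for `monitor` at the same timestamps.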

 

Attaching a Syslog instead.

tower-diagnostics-20161121-0930.zip

syslog.txt

Link to comment

Went home over lunch and ran chkdsk on my flash drive.  Came up clean.

 

Hooked up a monitor and keyboard, and manually set the memory speed to 1600MHz instead of Auto.  All the timings and voltage look correct (9-9-9-24, 1.5V).

 

Running memtest now because I'm suspecting bad RAM.  I put this RAM in (brand new) on 11/6 without running any tests on it.

Link to comment

I'm going to say this was a bad stick of RAM.  I got home after work and found 3 separate errors on the memtest screen.  All from Test 7 of Pass #1.  No other errors, even though the system had gone all the way through Pass #5 by that time.

 

Here's what I saw:

 

XRfZUG7.jpg

 

Removed and reseated the sticks and tried again.  Immediately got another error when memtest got to Test 7.  Noticed this error was around the 2GB mark also, so I assumed that it was either the stick in DIMM slot 0 causing the issue, or DIMM slot 0 itself.

 

Pulled the stick out of DIMM slot 0, and replaced it with the stick from slot 1 (in an effort to rule out slot 0).

 

Started memtest again, this time telling it to just run test 7.  Let it do a dozen passes with no errors, and then called it a day and booted back into unRAID.  It's been running error free under heavy use, including a non-correcting parity check, SABnzbd doing its thing for most of the night, and several Plex transcodes.

 

I guess I'll have to decide whether to RMA this or see if Fry's will exchange.

 

Fingers crossed!

Link to comment

One last update:

 

Exchanged the bad RAM kit at Fry's yesterday.

 

Put the new kit through 2 full passes of memtest, plus 10 passes of just test #7.  No errors.  Server has been running all night using the full 16GB with no issues.

 

Last week when this issue first came up, I had been getting BTRFS errors all through the syslog and assumed it was BTRFS corruption or failing cache drives.  I reformatted as XFS and the BTRFS issues went away (of course) but I continued to notice other weird issues.  Now I realize that it was all due to the bad RAM, so I've gone back to using a cache pool ;)

 

Link to comment
