[Solved] Please HELP! I think i broke my server

ttttubby · October 26, 2018

Ok, let me start at the beginning.

7-8 years ago I built an unraid server using an ASRock - H77 Pro4/MVP motherboard and an Intel Pentium G630T CPU with 8GB of ram. I currently have 8 drives in my array (9 with the Parity drive) and a 500GB cache drive that is at least 9 years old (maybe 10). I've never had a drive failure, though I did remove a 1TB drive about 6 months back because I was running out of SATA connectors between the ones on the motherboard and the two extras that I have from a SYBA Sil3132 PCI-E to SATA II controller card that I installed. Around mid-September, Unraid became unresponsive and I narrowed it down to the SYBA expansion card which had always been a bit wonky, requiring me to physically reseat the SATA connectors any time the server was powered down in order for its drives to be recognized. I replaced the expansion card with another of the same model and everything seemed to be working again + the SATA connectors no longer needed to be reseated. At this time I was running Unraid 6.5.1.

Fast forward to a couple of weeks ago (early October) and I started noticing that my plex docker wasn't working properly. One of my libraries seems to have become corrupt, and I couldn't delete it or refresh it from within the plex webgui. Not a big deal, because I didn't use plex that often, so I put it in the back of my mind and didn't let it bother me. Then, about a week later I noticed that the mover didn't seem to be moving my downloads from the cache drive to the array on schedule. It was supposed to do so nightly, but I started to notice downloads hanging out on the cache drive a week after they downloaded. As one of the uses for my server is to archive photos for my wife's photography business this worried me since stuff that stayed on the cache was unprotected. I could invoke the mover manually, and everything seemed to work, but this was a more immediate problem than the one facing plex.

Yesterday (10/25) I decided to see if I could figure out the problems. I started by rebooting the server (which may or may not have fixed the mover issue) and then proceeded to update to Unraid 6.6.3. It was at this time that I started checking the logs and started noticing some pretty significant errors (which I can't really make heads or tails of). I wanted to reboot again last night, but I downloaded the diagnostics file and the syslog before doing so. See attachments to this post.

After I rebooted, I noticed btrfs errors which, through Google, led me to believe that the problem was with my docker image. So I thought maybe, if I deleted my docker image and repulled my docker apps, I might be able to fix my problem. Not so. I think I may have made my problem worse. I'll respond to this post to explain why and to upload today's diagnostics/syslog.

tower-diagnostics-20181025-2343.zip

tower-syslog-20181025-2358.zip

ttttubby · October 26, 2018

So I deleted my docker image and started to repull my docker apps (using my user templates). Couchpotato pulled fine. So far so good. SabNZB on the other hand was a disaster. I started getting the following error:

Error: failed to register layer: ApplyLayer exit status 1 stdout: stderr: unexpected EOF

I looked this up and the consensus was to try again. This time it seemed to get a little bit further, but then it spat out

Error: failed to register layer: ApplyLayer exit status 1 stdout: stderr: archive/tar: invalid tar header

It was then that I looked at my system log and I was getting a ton of CPU errors like the following:

Tower kernel: CPU: 1 PID: 13982 Comm: udevadm Tainted: G B 4.18.15-unRAID #1

Followed by a stream of:

Tower kernel: swap_info_get: Bad swap file entry 3ffffbdffffff

and

Tower kernel: swap_info_get: Bad swap file entry 3fffffdffffff
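As an aside, the two swap entry values quoted above can be compared bit by bit. The short Python sketch below is only an illustration: the helper name `differing_bits` is made up for this example, and a small difference between two values is merely consistent with a flipped bit, it does not prove a memory fault by itself.

```python
# The two swap entry values are copied from the syslog lines quoted above.
entry_a = 0x3ffffbdffffff
entry_b = 0x3fffffdffffff


def differing_bits(x, y):
    """Return the positions (0 = least significant) of the bits that differ."""
    diff = x ^ y  # XOR leaves a 1 only where the two values disagree
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


# Show which bit positions differ between the two logged values.
print(differing_bits(entry_a, entry_b))
```

The two values differ in a single bit position, which is the kind of pattern a memory test is meant to confirm or rule out.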

I've attached my diagnostics file and syslog file from today, which includes these new errors.

Right now my data on the array seems safe, but I'm worried that my cache drive might be done (despite the fact that my main page says no errors). I'm also worried that something might be bad with my CPU and/or motherboard. Maybe it's my SATA cables?

Please help.

tower-diagnostics-20181026-1048.zip

tower-syslog-20181026-1049.zip

JorgeB · October 26, 2018

Lots of segfaulting and call traces, first thing you should do is run memtest for a few hours, up to 24.
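For anyone who wants a rough count of these symptoms in their own syslog, here is a minimal Python sketch. The pattern strings and the function name `count_error_lines` are assumptions made for this example, not anything taken from the diagnostics in this thread, so adjust them to whatever your log actually contains.

```python
import re
from collections import Counter

# Hypothetical patterns; adjust them to match what your own syslog contains.
PATTERNS = {
    "segfault": re.compile(r"segfault", re.IGNORECASE),
    "call_trace": re.compile(r"Call Trace", re.IGNORECASE),
    "bad_swap_entry": re.compile(r"Bad swap file entry"),
}


def count_error_lines(lines):
    """Count how many lines match each named pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Small sample built from the log lines quoted earlier in the thread.
sample = [
    "Tower kernel: swap_info_get: Bad swap file entry 3ffffbdffffff",
    "Tower kernel: swap_info_get: Bad swap file entry 3fffffdffffff",
    "Tower kernel: CPU: 1 PID: 13982 Comm: udevadm Tainted: G B 4.18.15-unRAID #1",
]
print(count_error_lines(sample))
```

To run it against a real log, pass the lines of the extracted syslog file instead of `sample`, for example `count_error_lines(open("syslog.txt", errors="replace"))`, where `syslog.txt` is a placeholder name.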

ttttubby · October 26, 2018

1 minute ago, johnnie.black said:

Lots of segfaulting and call traces, first thing you should do is run memtest for a few hours, up to 24.

So you think it could be my memory? It would just... go bad?

JorgeB · October 26, 2018

1 minute ago, ttttubby said:

It would just... go bad?

It's known to happen...like anything else, just because it was working OK yesterday doesn't mean it is today.

ttttubby · October 26, 2018

When I tried to recreate my docker image I tinkered with the settings a bit, moving the image into a /cache/system/docker/ folder and bumping the image size from 30gb to 50gb. Just now I deleted that image again, set it back to 30gb and placed it back in /cache/ and I was able to pull my dockers for sabnzb, couchpotato, sickbeard, dropbox, and krusader again without any errors. So that stuff is working again thankfully. I'm not going to bother with Plex for now though, and I'd still very much like to get to the bottom of this.

ttttubby · October 26, 2018

Ok, I'll try that. I get that 24 hours would be ideal, but you think a few hours would be long enough? Also, what settings should I use for memtest?

JorgeB · October 26, 2018

A few hours could be enough; sometimes a few minutes are enough to start getting errors. But if there aren't any errors, the longer you run it the better. No settings are needed, just boot to memtest and it will start running.

JorgeB · October 26, 2018

P.S: there are also some ATA errors on the cache device, nothing very worrying for now but you should check/replace cables.

ttttubby · October 26, 2018

Does this mean I failed the memory test?

JorgeB · October 26, 2018

Yep, and in a big way. Not really surprising considering all the segfaulting. If you have more than one DIMM you can test one at a time.

ttttubby · October 26, 2018

Ok,

It looks like I found the bad stick. How much of a performance impact would I get if I dropped down to 4gb from 8gb on my server? Does it matter that I'm dropping from dual channel to single? It just so happens that I have another couple of sticks of 4gb ddr3. Should I replace the one still in there? Add them for a total of 12GB?

JorgeB · October 26, 2018

For a basic NAS 4GB is enough, but it might impact your docker performance, depending on how many containers you run. You can try it and see.

Dual or single channel is not a big difference, it won't be noticeable. Use as much RAM as you need for your config.

ttttubby · October 26, 2018

Secondary question,

What kind of performance differences would I see if I bumped my Pentium dual core (basically a celeron replacement) to a 3rd gen Core i5 quad core? Thinking about buying one used on ebay and I'm wondering if it's worth it for improved docker performance (for things like Plex in particular).

JorgeB · October 27, 2018

It should help with Plex if you're doing hardware transcoding but I never used Plex so not the guy to ask.

JorgeB · October 27, 2018

Also, since you mentioned that your wife uses the server to archive some photos, it might be a good idea to check some of the most recently uploaded ones for corruption. It doesn't take many flipped bits to completely corrupt a photo.
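One way to do that check, assuming the original files still exist somewhere else (a camera card or the PC they were uploaded from), is to compare checksums between the originals and the server copies. The Python sketch below is only an illustration of that idea: the function names and directory arguments are made up for this example, and it can only flag files whose originals are still available.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large photos are not read into RAM at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(original_dir, server_dir):
    """Return relative paths that are missing or differ in the server copy."""
    original_dir, server_dir = Path(original_dir), Path(server_dir)
    bad = []
    for src in sorted(original_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(original_dir)
        dst = server_dir / rel
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            bad.append(str(rel))
    return bad
```

Calling `find_mismatches("/path/to/originals", "/mnt/user/photos")` would list the files to re-upload; both paths here are placeholders, and the share name in particular is hypothetical. Ideally run it from a machine with known-good RAM, since hashing on a box with bad memory can produce false mismatches.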
