willdouglas Posted March 7, 2018 (edited)

Hi All,

OS Version: 6.4.1
Hardware:
- Supermicro X9-DRL-iF
- 2 x Xeon E5-2670
- 8 x 8GB DDR3
- 6 x 8TB HGST (5 data, 1 parity)
- 2 x 240GB Kingston SSD, used as a cache pool
- 1 x Kingston DataTraveler 8GB for boot
- 1 x Hauppauge WinTV-quadHD (haven't tried to use it yet)
- 1 x LSI 9211-8i (works great!)
- 1 x Mellanox 10G network card (works great!)

Installed apps:
- binhex Plex
- binhex Deluge
- binhex nzbget
- binhex Sonarr
- binhex Radarr
- linuxserver plexpy
- CA Fix Common Problems - reports the unclean shutdowns afterward, but other than that it just wants me to install the TRIM plugin and the Docker update check.

The system has been locking up solid roughly once per day over the last two days, both times at some point midday when the server should be seeing little to no utilization. The system has only been up for eight days, and I didn't see any lockups until midday on day seven, then again midday on day eight. I pulled the tower from its home in the crawlspace after the first lockup and moved it to a table with a monitor so I could observe, but I haven't seen anything out of the ordinary. It's not overheating, and as far as I can remember the only change I made was specifying the parity drive and allowing it to build after the rsync copy from my old server completed. There were single SMART warnings (I don't recall the specifics) on two of the 8TB drives when I first installed them, but I acknowledged them and haven't seen anything since. I've rolled back to no parity to see if it will lock up tomorrow. I could attach some logs if anything interesting happens without a lockup, but no errors are present besides the output dumped to the console, which is there when I get to the locked-up box.

Edited March 10, 2018 by willdouglas: listed wrong SAS card, corrected.
willdouglas Posted March 7, 2018 - Locked up without the parity drive. Re-adding it to the array. After parity finishes I'll shut down cleanly and start removing things. TBC.
willdouglas Posted March 8, 2018 (edited) - Too impatient to let this thing repeatedly freak out while I'm away. Feels like flirting with disaster. Redid the PSU calcs to make sure I'm not underpowering it, and I have quite a bit of leeway. Removed the tuner card since it wasn't being used anyway, flashed the BIOS (probably the same revision, forgot to check), and am currently running Memtest to verify the RAM while looking for some good burn-in software. Edited March 8, 2018 by willdouglas: forgot some words.
FreeMan Posted March 8, 2018 - When you bring unRAID back up, go to Fix Common Problems and put it in troubleshooting mode. That will write the generated diagnostics and a tail of the syslog to the flash drive. That way, if/when it crashes again, you'll at least have the most recent events written to the syslog and can post them here for someone to review. I am not the guru who will interpret them for you, but I can tell you to get 'em so one of the resident gurus can!
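For reference, the syslog capture part of troubleshooting mode can be approximated by hand. A hedged sketch (the plugin's actual implementation differs in detail; paths assume a stock Unraid layout, with a fallback location so the snippet also runs off-box):

```shell
# Snapshot the tail of the syslog onto the flash drive, which is
# mounted at /boot on Unraid and survives a hard lockup.
SYSLOG=/var/log/syslog
DEST=/boot/logs                                   # flash drive location on Unraid
mkdir -p "$DEST" 2>/dev/null || DEST=/tmp/unraid-logs   # fallback when /boot is absent
mkdir -p "$DEST"
# The redirect creates the file even if the syslog is missing.
tail -n 500 "$SYSLOG" > "$DEST/FCPsyslog_tail.txt" 2>/dev/null || true
```

Running this from cron (or letting the plugin do it) means the last few hundred log lines before a lockup are preserved across the forced reboot.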
willdouglas Posted March 9, 2018 - I will do that now. 24 hours of memory tests haven't revealed any issues, so it's time for the next thing.
willdouglas Posted March 10, 2018 - Daily lockup achieved, logs captured and attached. cadence-diagnostics-20180309-1217.zip FCPsyslog_tail.txt
willdouglas Posted March 10, 2018 - Logs from the second lockup. cadence-diagnostics-20180310-0257.zip FCPsyslog_tail.txt
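When reading these syslog tails, a quick grep for the usual lockup signatures saves scrolling. A sketch (the sample lines below are fabricated placeholders for illustration, not lines from the attached diagnostics; only the "hangcheck value past margin" message is one actually reported in this thread):

```shell
# Search a captured syslog tail for kernel messages that commonly
# accompany a hard lockup. Point the grep at a real FCPsyslog_tail.txt;
# the here-doc sample exists only to demonstrate the patterns.
cat > /tmp/sample_tail.txt <<'EOF'
Mar 10 02:41:03 cadence kernel: BUG: soft lockup - CPU#3 stuck for 23s!
Mar 10 02:41:10 cadence kernel: Call Trace:
Mar 10 02:41:55 cadence kernel: Hangcheck: hangcheck value past margin!
EOF
grep -E 'soft lockup|hard LOCKUP|hangcheck|Call Trace|Out of memory|BUG:' \
  /tmp/sample_tail.txt > /tmp/triage.txt
cat /tmp/triage.txt
```

Any hits narrow the search: soft/hard lockup and hangcheck lines point at a wedged CPU, while "Out of memory" points at a runaway container instead.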
willdouglas Posted March 10, 2018 - I'm not seeing anything pop up using stress or memory testers. I also updated my post with the correct SAS controller; it's an LSI 9211-8i. I verified it has the most recent IT firmware flashed (20.00.07.00-IT; nothing newer on their site, anyway). Going to start trying permutations of apps, I guess. I feel like the hardware is good: the only new purchases for this build were the drives and the controller, and the server previously did duty as a virtualization server for training, so if it was going to flake out I'd hope to have seen it impacting VMs before now.
damonkey Posted March 11, 2018 - I wanted to let you know I have started an mprime Large FFT test to run overnight on my system. I doubt it will go that long, since it seems to lock up within a few hours. Let me know if you find the culprit on your system.
willdouglas Posted March 11, 2018 (edited) - I'm breaking records over here with twenty hours of uptime without a lockup. Once I hit 24 hours I'll start the first app, probably Plex, then 24 hours later start another, and so on. I thought I had them all installed and running before the lockups started, but I might not have. I did update the library shares before the crashes started, so maybe it's tied to Plex usage as people started adding load. Edited March 11, 2018 by willdouglas: added some words.
damonkey Posted March 12, 2018 - Sounds like you might have passed the 24hr mark. Let me know what you find after starting your apps. I have now passed the 20hr mark.
willdouglas Posted March 12, 2018 - I've broken 48 hours of uptime. I feel safe ruling out Plex at this point. Time for the next app!
willdouglas Posted March 13, 2018 - 72 hours of uptime. I enabled Deluge, nzbget, and Sonarr all at the same time yesterday and triggered a couple of searches to see if they were playing nicely together. No lockups in the last 24 hours. At this point I only have Radarr and plexpy left to re-enable, and I'm kinda tempted to just not enable them, but I will forge ahead and enable Radarr.
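The one-app-per-day approach above can be written down as a checklist so each change maps to a 24-hour soak window. A sketch, not real automation (container names are taken from the app list in the opening post; the actual template names on the box may differ):

```shell
# Emit a soak-test plan: start one container per day, so any lockup
# can be pinned to the most recent change.
: > /tmp/soak_plan.txt
day=1
for app in binhex-plex binhex-deluge binhex-nzbget binhex-sonarr \
           binhex-radarr plexpy; do
  echo "day $day: docker start $app, then wait 24h before the next change" >> /tmp/soak_plan.txt
  day=$((day + 1))
done
cat /tmp/soak_plan.txt
```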
damonkey Posted March 14, 2018 - Sounds like things are looking up. I have most of my plugins and apps reinstalled, minus Radarr. But I also have not reinstalled some of my memory. I think I will let you take that Radarr step first; it always seemed to be a little flaky.
willdouglas Posted March 14, 2018 (edited) - Fail! Made it somewhere around nineteen hours, I think. Going to leave Radarr disabled and see how long an uptime counter I can accrue. EDIT: forgot to mention I upgraded to 6.5.0 before starting Radarr, so I was on a fresh reboot and upgrade, possibly hosing my troubleshooting by changing a variable. It seemed so innocent, so safe! Uptime was great! FCPsyslog_tail.txt cadence-diagnostics-20180314-0542.zip Edited March 14, 2018 by willdouglas.
willdouglas Posted March 16, 2018 - I'm going to, possibly unfairly, blame the binhex-radarr container. They did warn me: it said it was under active development, but I've been using Radarr for a while, how bad could it be?! I'm at just over 48 hours of uptime, so I'm going to put the box back in the crawlspace where it lives and continue without Radarr for now, maybe run it on another host for the time being.
willdouglas Posted April 1, 2018 - Alright, cutting the Docker apps down to Sonarr, nzbget, and Plex, I've somehow extended the average uptime to three or four days, and I have a new symptom to add. If I catch the lockup after network access drops, but before the "hangcheck value past margin" messages start appearing, there is limited access to the box at the console. I wasn't able to log in, but I was able to get it to respond to key presses at what felt like a fifteen- to twenty-second delay. I ended up rebooting rather than fighting with it because there was a request for Plex access from family, but I'll re-enable logging and see if I can catch it quickly enough. I do have an IPMI controller onboard, so I'll see if I can get into the box next time to poke around and maybe dump logs to another box for easy viewing.
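Shipping the syslog to another box would capture those final messages even through a hard lockup, since they leave the machine the moment they are logged. Unraid of this era ships rsyslog, so a one-line forwarding rule is enough; a hedged config sketch (the drop-in path and target IP are assumptions, and the change won't survive a reboot unless reapplied, e.g. from the go file):

```
# /etc/rsyslog.d/remote.conf  (assumed drop-in location)
# Forward everything to a syslog listener on another machine.
# @ = UDP, @@ = TCP; 192.168.1.50 is a placeholder address.
*.* @192.168.1.50:514
```

After restarting rsyslog, any box on the LAN running a syslog daemon configured to listen on UDP 514 will hold the last lines written before the freeze.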
willdouglas Posted April 20, 2018 - Figured I should follow up on this. I never figured out what the issue was, or managed to pull usable info out of a crash. I replaced the USB drive with a newer one, and on the fresh install I only installed the official Plex docker. All other apps are running in a VM on unRAID until I feel like investing time in this again.
willdouglas Posted April 26, 2018 - Ten days of uptime. I'm comfortable leaving well enough alone. End thread.
damonkey Posted April 26, 2018 - Hey, good news!! Mine has been shut down for a few weeks now; moving to Texas. I miss it :(
willdouglas Posted June 19, 2018 - Good luck with the move. 63 days of uptime.