BLKMGK

Members
  • Content Count

    972
  • Joined

  • Last visited

Community Reputation

10 Good

About BLKMGK

  • Rank
    Advanced Member

Converted

  • Gender
    Undisclosed

  1. Well, sitting at over 6 days uptime now, including a full scheduled parity check without errors. I'm not yet positive what cleared this error and that sux. Temps have been cooler and I've had the house open, but I've also had a second known good UPS inline with the server. The only major things that changed between months of uptime and barely days were a new dedicated 30amp electrical circuit and a new 2U UPS, and warmer temps coincided with that. I guess I'll be patient but for now it's reliable - pretty frustrating. I guess I'll also mark the previously swapped PSU good and feel okay about keeping it with my spare chassis - it would've been nice had that been it, as that's the easiest thing to swap on a rackmount. I suppose I could put it in the chassis attached to the known good UPS and see if the server toggles between them, but I'm not sure how I'd get the info. Anyway, I'll post anything new here and sure hope like heck it stays up and no one else goes through this grief.
  2. Not the power supply. Ran a memtest for over an hour, no issues. I have put a spare UPS in line between the recently installed LARGE UPS and the server to see if the issue is coming from the new rackmount UPS - the one I just had to run a new circuit to install. If it's the UPS I'll be upset but happy to have found this. If this doesn't work then I'll be booting it in "safe mode" to see if that helps <sigh> and probably running a day's worth of memtest!
  3. Well, that was short lived - based on text messages I got from friends it died at 1:30, so it was up maybe 5 hours. I have swapped in a new PSU and touched nothing else - it's begun a parity check. If it goes down this time I may try moving it off my uber expensive UPS and onto a small portable one I used previously, just to rule that sucker out. The one it's on now has a dedicated circuit and brand new batteries, but was put in use not too long before these problems cropped up. I'll be pretty upset if that thing is the issue!
  4. I'm aware of the hazards of overclocking, I don't think I've owned a computer that wasn't overclocked in some way in the last 30 or so years if not longer. This thing would be water cooled if I could fit it without cutting. The settings now are close to stock, and Eco mode should be keeping it from boosting much and have it using less power. Underclocking is often done to save power but I've got enough containers running that I'd prefer not to go that route unless pushed. SuperMicro chassis out of the box are quite loud. SuperMicro makes an SQ model PSU, however, that's dead silent and will only run its fan when necessary; this cuts easily 50% of the noise. The rear fans are the next loudest but with a little tweaking, fans that are normally used in the middle of the chassis can be used there, and that's what I've done. I did at one time try multiple third party fans that everyone claimed worked - they didn't lol. So yeah, this is cooled with good fans and not dead silent but WAY quieter than stock for sure. The CPU cooler is the AMD unit; 80mm vertical coolers work in this chassis but not 120mm. The OEM cooler seems to be handling the job fine - and has pretty colors when I pop the top 😛
Update: For the first time in a month, as far as I've seen, Parity Check completed successfully - whew! I feel a little more comfortable maybe swapping a drive, no additional SMART errors. Also no closer to solving the sudden reboot but I think I can rule out the parity process - that's GOOD news at least. If it goes down again that PSU is coming out for sure. Fingers crossed that's not soon and I can go back to my normal 6 month long uptime. Wish I had a better resolution but I'll keep watching it.
Update 2: Well, that didn't last long - failed sometime after 5am last night. No time to swap the PSU as I'm on the way out the door but I guess I'll be doing that tonight. 64 in the house last night so certainly not a cooling issue, sheesh. Logs show nothing at all.
  5. This is a rack mount server chassis, it can actually accommodate 2x PSU in a failover setup but I've only got one installed currently. I have only briefly pondered the PSU as it's pretty good quality and made for this kind of (ab)use. The chassis holds 24 spinners and can accommodate multiple SSD too. I do actually have a spare, more than one if I'm willing to suffer turbine whine come to think of it lol. If it goes down again (it's at 40% thru the check now) I may do this first thing, as swapping it is one of the easiest things I can do actually! I'll be shocked if that's the problem but as easily as it's replaced and with a spare on hand that seems a good first step. Oh and yes, I cleaned the heatsink fins. This system has only been together just short of a year but since it seemed like a heat issue that was one of the first changes. Slowing the CPU and lowering voltages got temps down quite a bit (over 20C) so I no longer think this is heat. I also just realized I need to set up a syslog server somewhere to capture logs off box, I hate that we lose them when a system goes down. I might try building a script to dump them onto disk storage. I just came home and had this machine sleeping, which broke the SSH connection, ugh. This would be WAY easier if the syslog gave me clues. I look forward to the next release with its kernel having better support for my hardware and its sensors!
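The "script to dump them onto disk storage" idea from the post above could look something like the following. This is only a minimal sketch, not anything the poster actually ran: it assumes the live log is a plain file such as `/var/log/syslog` held in RAM, and the function names and any destination path are made up for illustration. It copies whatever was appended since the last pass and forces it to disk with `fsync`, so the last lines before a hard reset have a chance of surviving.

```python
import os
import time


def append_new_lines(src_path, dst_path, offset):
    """Copy bytes added to src_path since `offset` into dst_path.

    Returns the new offset to pass in on the next call.
    """
    size = os.path.getsize(src_path)
    if size < offset:
        # Source shrank: it was rotated or truncated, so start from the top.
        offset = 0
    if size == offset:
        return offset  # nothing new to copy
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        src.seek(offset)
        chunk = src.read(size - offset)
        dst.write(chunk)
        dst.flush()
        # Push the data out of OS buffers so a sudden reboot does not lose it.
        os.fsync(dst.fileno())
    return offset + len(chunk)


def follow(src_path, dst_path, interval=5.0):
    """Poll forever, persisting new log data every `interval` seconds."""
    offset = 0
    while True:
        offset = append_new_lines(src_path, dst_path, offset)
        time.sleep(interval)
```

Calling `follow("/var/log/syslog", "<some path on persistent storage>")` from a background job would keep a rolling copy. The destination is deliberately left open here; frequent small writes to a flash boot device cause wear, so an array or cache path may be the better choice.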
  6. About three weeks ago my server dropped offline and I found it sitting at the main boot screen awaiting my crypto key input. I have a pretty solid UPS and my hardware is pretty new so this was puzzling; I was out of town at the time. I thought that it might have gotten hot but logs showed no error (tailed in an SSH session). I brought the system up and it began a parity check. The last position I saw, about a day later, was some 90% complete - then it dropped again. It did this one more time and then I was home. Each time it seems to get close to completing parity and then appears to cold boot; my logs show nothing untoward - no errors. This is an AMD 3700 on an ASRock TaiChi board, 32gig RAM, 4U SuperMicro case with good cooling fans. When I returned home I flashed the firmware to the latest version, lowered clock speed, put the CPU in "eco mode", and increased fan speeds. Temps don't show on the main screen for me in unRAID but in the diag screen of the BIOS I noted a significant drop in CPU temp. I also slowed my memory to a default 2400MHz vs its rated 3K+ and lowered the RAM voltage somewhat. I ran Memtest through a couple of iterations but not for terribly long as I wanted my server back up. About 19 hours later the system dropped again <sigh> This time when it came up I halted the parity check (no errors noted in previous runs until it booted) and the system ran fine for an entire week - until today. Since this issue began this is the longest it's run and I had surmised it was possibly an issue with parity building; now it's dropped again and I'm not so sure. The only thing I have seen in the way of drive errors is a single drive that's showing a slowly increasing UDMA CRC error count in SMART reporting. This is a Seagate 4TB drive; if I'm seeing 90+ percent complete that drive shouldn't be in play.
I'd like to replace it, but if I cannot complete a parity check I fear rebuilding onto a larger drive isn't going to be possible, and realizing that drive is likely not even being accessed when the system drops makes me wonder what the real issue is. My Parity drive is 12TB, and I have a total of 17 data drives of various sizes - 4, 5, 8, and 10TB. I have some SSD attached via SATA and a PCIe 4 NVME as cache. I do have a backup but that's a last gasp as it's a huge undertaking to restore. My fear is something will get corrupted during one of these boots and I'll lose encrypted files. My backups are daily and every file gets accessed - no errors occurring. I'm stumped and would like some suggestions please! The lack of log entries is pretty frustrating but makes me wonder if this is a straight hardware issue. I'm hesitant to swap parts without a clearer indication of what's dorked up. I'm letting it run parity again now; temp in the house has dropped with the outside temp so temp should NOT be an issue - drive temps are 28-34C with most on the lower end of that range. If it drops this time I'm thinking a day's worth of Memtest maybe? My drive formats are XFS, I suppose those could be checked for corruption just in case. I might do that tonight starting with the larger drives.
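The reasoning that the 4TB drive "shouldn't be in play" at 90%+ can be checked with a little arithmetic. The sketch below rests on an assumption the post does not state outright: that the parity check walks every drive in parallel from the start, and that the percentage shown is relative to the size of the parity (largest) drive. The function names are made up for illustration.

```python
def finish_percent(parity_tb, size_tb):
    """Parity-check percentage at which a data drive of size_tb has been fully read."""
    return 100.0 * size_tb / parity_tb


def drives_in_play(parity_tb, data_sizes_tb, percent_done):
    """Data drive sizes that still extend past the current check position."""
    position_tb = parity_tb * percent_done / 100.0
    return [size for size in data_sizes_tb if size > position_tb]


# Sizes taken from the post: 12TB parity, data drives of 4, 5, 8 and 10TB.
sizes = [4, 5, 8, 10]
print(round(finish_percent(12, 4), 1))  # 33.3
print(drives_in_play(12, sizes, 40))    # [5, 8, 10]
print(drives_in_play(12, sizes, 90))    # []
```

Under that assumption the 4TB drive drops out at about 33%, and past roughly 83% (10 of 12TB) no data drive is being read at all, only the parity drive. If that holds, a crash at 90%+ would point away from any individual data drive.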
  7. Successful upgrade from an old RC - my funky full encrypted disk setup moved over just fine, thanks guys! Looking forward to the next revision to support my X570 TaiChi, will test when NVIDIA supports it! P.S. Docker Swarm?
  8. I don't have active cooling on it; next time I pull it out I'll see what I can do about adding some cooling. I have airflow as it's a SuperMicro chassis but it's not ducted over there. Watching things further, I think the main issue could be the Mover process destroying performance when it runs. My cache drive is an M2 PCIE4 drive but when Mover fires the system becomes nearly unresponsive. Just frustrated I suppose, as I bump the space limits on the 1TB cache drive moving videos around of late and performance tanks hard. I'm on 6.8 RC7 which has been stable but it looks like I'm two revs behind so perhaps there's help to be had there. I don't think I want the release 6.8 though as I think that was a kernel step backwards! Parity check speed seems low as well at 113MB/s - 100+TB of space with a 10TB parity drive. It takes a day to check but other than being "slow" it doesn't impact things too badly.
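The "takes a day" figure lines up with the numbers in the post. A quick sketch of the arithmetic, assuming decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes) and treating 113MB/s as the average over the whole run; the function name is made up for illustration:

```python
def parity_check_hours(parity_tb, avg_mb_per_s):
    """Hours needed to read a parity drive end to end at a given average speed."""
    total_bytes = parity_tb * 1e12
    seconds = total_bytes / (avg_mb_per_s * 1e6)
    return seconds / 3600.0


print(round(parity_check_hours(10, 113), 1))  # 24.6
```

So a 10TB parity drive at 113MB/s works out to a bit under 25 hours. The check duration depends on the size of the parity drive, not on the 100+TB of total array space.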
  9. Purchased one of these a while ago as I was no longer able to run 3x dual port cards. New Ryzen boards don't have the slots and I have to run a video card now (which I use for transcoding so no biggie). I've noticed that I seem to bottleneck during parity checks, and when Mover strikes I see it bogging down too. Could I have made a better choice? I need to support a max of around 24 drives; I've got an expander kicking around but have never used it. Would I be better off using that somehow with an existing 8i 2 port card? I see fairly significant IOWait times in NetData from time to time when really pushing data around but some of this could be my drives. Can I do better for a reasonable cost? Edit: reading some other posts it looks like I ought to have enough speed for spinning rust. Perhaps it's simply my drives after all, but seeing IOWait numbers climb from time to time is pretty frustrating!
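For anyone wanting to sanity-check the IOWait numbers outside of NetData, the same figure can be derived from two samples of the aggregate `cpu` line in `/proc/stat` on Linux. This is a rough sketch with made-up function names, not how NetData itself computes it. The fields after `cpu` are user, nice, system, idle, iowait, irq, softirq, steal, and then guest counters, which are already included in user and so are left out of the total.

```python
def parse_cpu_line(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat as ints."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(field) for field in line.split()[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two counter samples."""
    deltas = [a - b for a, b in zip(after, before)]
    total = sum(deltas[:8])  # user..steal; guest time is already counted in user
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


# Two hand-written samples standing in for reads of /proc/stat a few seconds apart.
first = parse_cpu_line("cpu  100 0 50 800 50 0 0 0 0 0\n")
second = parse_cpu_line("cpu  150 0 70 1000 180 0 0 0 0 0\n")
print(iowait_percent(first, second))  # 32.5
```

On a real box the two inputs would come from reading `/proc/stat` twice with a short sleep in between. Sustained high values while Mover or a parity check runs would suggest the CPU is waiting on storage rather than being busy itself.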
  10. Just wanted to post and say THANK YOU!!!!!! Thanks to the efforts of the guys recompiling for the NVIDIA driver I was able to load up RC7 tonight and give it a spin. The issues described above appear to be FIXED! My Go script no longer has to create a link to a cleartext password in order for me to boot my server - woohoo! I manually entered my password and the server started just fine - big Snoopy Dance! My thanks to the @limetech guys and to @dlandon for solving this - much appreciated!!
  11. I've always wanted an easy way to PXE boot, this sounds promising! Thank you!
  12. You have to admit updates have been coming pretty quickly, thus far I've seen no huge showstoppers and am pretty excited about the new code! Lots of improvements and some of the speed issues of the past seem solved.
  13. I was mentioning it as an aside as it was something odd I had noticed and I'm not sure what's causing it, RC related or otherwise. You need not be so defensive.
  14. I'm now on RC5. One thing I've noticed when going down is that a second drive I've got mounted with Unassigned Devices seems to hang and gets forcibly unmounted. Each time the array comes up it forces a parity check now. This started a few RCs back but I figured it had to do with other things going on. It's formatted XFS and I see errors about the XFS drive not unmounting go by as it goes down, and that drive is the only one I've got formatted XFS. Not clear to me what does it; I can say I don't stop VMs or containers before rebooting but this drive isn't used for that anyway <shrug>. Just mentioning it...