Possible hardware error? Troubleshooting assistance plz (6.8.3)


BLKMGK


About three weeks ago my server dropped offline and I found it sitting at the main boot screen awaiting my crypto key input. I have a pretty solid UPS and my hardware is pretty new, so this was puzzling; I was out of town at the time. I thought it might have gotten hot, but the logs (tailed in an SSH session) showed no errors. I brought the system up and it began a parity check. The last position I saw, about a day later, was some 90% complete - then it dropped again. It did this one more time before I got home. Each time it gets close to completing parity and appears to cold boot, and my logs show nothing untoward - no errors.

 

This is an AMD 3700 on an ASRock Taichi board, 32GB RAM, in a 4U SuperMicro case with good cooling fans. When I returned home I flashed the firmware to the latest version, lowered the clock speed, put the CPU in "eco mode", and increased fan speeds. Temps don't show on the main screen for me in unRAID, but in the diag screen of the BIOS I noted a significant drop in CPU temp. I also slowed my memory to a default 2400MHz vs its rated 3K+, and lowered the RAM voltage somewhat. I ran Memtest through a couple of iterations, but not for terribly long as I wanted my server back up. About 19 hours later the system dropped again <sigh> This time when it came up I halted the parity check (no errors noted in previous runs until it booted) and the system ran fine for an entire week - until today. That's the longest it has run since this issue began, and I had surmised it was possibly an issue with parity building; now that it's dropped again I'm not so sure.

 

The only drive errors I have seen are on a single drive whose UDMA CRC error count in SMART reporting is slowly increasing. It's a Seagate 4TB drive; if the check is reaching 90+ percent complete, that drive shouldn't even be in play at that point. I'd like to replace it, but I fear that won't be possible if I can't complete a parity check to rebuild onto a larger drive, and realizing it's likely not being accessed when the system drops makes me wonder what the real issue is. My parity drive is 12TB, and I have a total of 17 data drives of various sizes - 4, 5, 8, and 10TB. I have some SSDs attached via SATA and a PCIe 4 NVMe as cache. I do have a backup, but that's a last gasp as it's a huge undertaking to restore. My fear is that something will get corrupted during one of these boots and I'll lose encrypted files. My backups run daily and every file gets accessed - no errors occurring.
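From what I've read, a rising UDMA CRC count usually points at the SATA cable or backplane connection rather than the platters themselves, so I've been keeping an eye on just that attribute with something like this (/dev/sdX is a placeholder for the suspect drive):

```bash
# print SMART attributes for the suspect drive and pull out the CRC counter
# (attribute 199, UDMA_CRC_Error_Count - a rising raw value suggests cabling)
smartctl -A /dev/sdX | grep -i crc
```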

 

I'm stumped and would like some suggestions, please! The lack of log entries is pretty frustrating, but it makes me wonder if this is a straight hardware issue. I'm hesitant to swap parts without a clearer indication of what's dorked up. I'm letting it run parity again now; the temp in the house has dropped with the outside temp, so temperature should NOT be an issue - drive temps are 28-34C, with most at the lower end of that range. If it drops this time, I'm thinking maybe a day's worth of Memtest? My drives are formatted XFS, so I suppose they could be checked for corruption just in case. I might do that tonight, starting with the larger drives.
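If I do go the XFS route, the check itself is simple enough from the console - something like this with the array started in maintenance mode (disk number below is a placeholder; one run per disk):

```bash
# with the array in maintenance mode, /dev/md1 corresponds to disk 1;
# -n is no-modify mode, so this only reports problems without touching anything
xfs_repair -n /dev/md1
# only if it reports corruption, re-run without -n to actually repair
```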


Can we assume that you checked to make sure the CPU cooling fins were clean?  

 

One thing to consider is the PS.  There have been several instances over the past several months where replacement of the PS has cleared up situations such as you have described.  With 17 data drives, it would probably be difficult to borrow one from a friend.  You might want to consider purchasing one from a vendor with a generous return policy.  (If you do order one, make sure it has a single +12V rail and an adequate current rating to accommodate the 36-to-48A current surge when all of the drives spin up.)
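For a rough sense of where that 36-to-48A figure comes from: a 3.5" drive typically draws around 2 to 2.5A from the +12V rail while spinning up (check each model's datasheet - the per-drive figure here is an assumption), so a quick back-of-envelope for parity plus 17 data drives looks like this:

```bash
# back-of-envelope +12V spin-up surge; the 2.0-2.5 A per-drive draw is an
# assumption typical of 3.5" drives - verify against your drives' datasheets
DRIVES=18                               # 17 data + 1 parity
echo "low:  $(( DRIVES * 2 )) A"        # 36 A
echo "high: $(( DRIVES * 25 / 10 )) A"  # 45 A
```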

7 hours ago, Frank1940 said:

Can we assume that you checked to make sure the CPU cooling fins were clean? [...] One thing to consider is the PS.

 

This is a rack-mount server chassis; it can actually accommodate 2x PSUs in a failover setup, but I've only got one installed currently. I have only briefly pondered the PSU, as it's pretty good quality and made for this kind of (ab)use. The chassis holds 24 spinners and can accommodate multiple SSDs too. I do actually have a spare - more than one, if I'm willing to suffer turbine whine, come to think of it lol. If it goes down again (it's at 40% through the check now) I may do this first, as swapping it is actually one of the easiest things I can do! I'll be shocked if that's the problem, but since it's easily replaced and I have a spare, that seems a good first step.

 

Oh, and yes, I cleaned the heatsink fins. This system has only been together just short of a year, but since it seemed like a heat issue that was one of the first changes. Slowing the CPU and lowering voltages got temps down quite a bit (over 20C), so I no longer think this is heat :( 

 

I also just realized I need to set up a syslog server somewhere to capture logs off-box; I hate that we lose them when a system goes down. I might try building a script to dump them onto disk storage. I just came home to find this machine asleep, which broke the SSH connection, ugh. This would be WAY easier if the syslog gave me clues. I look forward to the next release, with its kernel having better support for my hardware and its sensors!
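Something like this rough loop is what I had in mind in the meantime (the destination path is just a placeholder for one of my shares) - unRAID keeps /var/log in RAM, so anything not copied off is gone after a hard reset:

```bash
#!/bin/bash
# rough sketch: mirror the in-RAM syslog to array storage once a minute so
# entries survive a hard reset; destination path is a placeholder
DEST=/mnt/user/system/logs
mkdir -p "$DEST"
while true; do
    cp /var/log/syslog "$DEST/syslog-$(date +%F).txt"
    sleep 60
done
```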

6 hours ago, BLKMGK said:

I also just realized I need to set up a syslog server somewhere to capture logs off-box; I hate that we lose them when a system goes down. I might try building a script to dump them onto disk storage.

See Here:

 

    https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/?tab=comments#comment-781601

 

6 hours ago, BLKMGK said:

This system has only been together just short of a year but since it seemed like a heat issue that was one of the first changes. Slowing the CPU and lowering voltages got temps down quite a bit (over 20C) so I no longer think this is heat

Also, overclocking of Unraid servers is not recommended.  I am not sure what the recommendation would be for underclocking.   Try running with stock settings as that is typically the most stable setup.

 

Rack-type boxes are usually designed to run in harsh environments, but most of them are quite noisy.  😉  Addressing the noise issue is tricky, as 'moving air' (think wind) is noisy by itself.  (Designing fans that can move large volumes of air quietly against static pressure is difficult.  There are some companies that make quiet fans which do so, but one almost needs to be an engineer to select which of their products is best suited for a particular application.)


I'm aware of the hazards of overclocking; I don't think I've owned a computer that wasn't overclocked in some way in the last 30 or so years, if not longer. This thing would be water-cooled if I could fit it without cutting. The settings now are close to stock, and Eco mode should keep it from boosting much while using less power. Underclocking is often done to save power, but I've got enough containers running that I'd prefer not to go that route unless pushed.

 

SuperMicro chassis out of the box are quite loud. However, SuperMicro makes an SQ model PSU that's dead silent and will only run its fan when necessary; that cuts easily 50% of the noise. The rear fans are the next loudest, but with a little tweaking, fans that are normally used in the middle of the chassis can be used there, and that's what I've done. I did at one time try multiple third-party fans that everyone claimed worked - they didn't lol. So yeah, this is cooled with good fans and not dead silent :) But WAY quieter than stock for sure. The CPU cooler is the AMD unit; 80mm vertical coolers work in this chassis but not 120mm. The OEM cooler seems to be handling the job fine - and has pretty colors when I pop the top 😛 

 

Update: For the first time in a month, as far as I've seen, the parity check completed successfully - whew! I feel a little more comfortable about maybe swapping a drive; no additional SMART errors. I'm also no closer to solving the sudden reboots, but I think I can rule out the parity process - that's GOOD news at least. If it goes down again, that PSU is coming out for sure. Fingers crossed that's not soon and I can go back to my normal 6-month uptimes. Wish I had better resolution, but I'll keep watching it.

 

Update 2: Well, that didn't last long - it failed sometime after 5am last night. No time to swap the PSU as I'm on the way out the door, but I guess I'll be doing that tonight. It was 64F in the house last night, so certainly not a cooling issue, sheesh. Logs show nothing at all.


Well, that was short-lived - based on text messages I got from friends, it died at 1:30, so it was up maybe 5 hours. I have swapped in a new PSU and touched nothing else - it's begun a parity check. If it goes down this time I may try moving it off my uber-expensive UPS and onto a small portable one I used previously, just to rule that sucker out. The one it's on now has a dedicated circuit and brand-new batteries, but it was put into use not long before these problems cropped up. I'll be pretty upset if that thing is the issue!


Not the power supply. Ran a Memtest for over an hour, no issues. I have put a spare UPS inline between the recently installed LARGE UPS and the server to try and see if the issue is coming from the new rackmount UPS - the one I just had to run a new circuit to install. If it's the UPS I'll be upset, but happy to have found it. If this doesn't work, then I'll be booting in "safe mode" to see if that helps <sigh> and probably running a full day's worth of Memtest!
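Since the UPS is now the prime suspect, I may also start logging its telemetry to see whether a power event lines up with the crash times - a rough sketch, assuming the UPS is USB-connected with unRAID's built-in apcupsd support enabled (the path and interval are placeholders):

```bash
#!/bin/bash
# rough sketch: poll UPS telemetry every 5 minutes so a transfer/battery
# event can be matched against crash times; assumes apcupsd is running
# and the destination share exists
LOG=/mnt/user/system/logs/ups.log
while true; do
    date >> "$LOG"
    apcaccess status >> "$LOG"
    sleep 300
done
```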


Well, I'm sitting at over 6 days of uptime now, including a full scheduled parity check without errors. I'm not yet positive what cleared this error, and that sux. Temps have been cooler and I've had the house open, but I've also had a second known-good UPS inline with the server. The only major things that changed between months of uptime and barely days of it were a new dedicated 30-amp electrical circuit and a new 2U UPS; warmer temps coincided too. I guess I'll be patient, but for now it's reliable - pretty frustrating. I guess I'll also mark the previously swapped PSU good and feel okay about keeping it with my spare chassis - it would've been nice had that been it, as that's the easiest thing to swap on a rackmount. I suppose I could put it in the chassis attached to the known-good UPS and see if the server toggles between them, but I'm not sure how I'd get the info. Anyway, I'll post anything new here and sure hope like heck it stays up and no one else goes through this grief.

