FunkadelicRelic Posted October 24, 2018
Hi all. I was wondering if someone might be able to help me troubleshoot an issue I have been seeing since upgrading from 6.5.3. Every now and then (usually after about a day or two of uptime), my server shuts itself down.
It was running just fine for weeks on 6.5.3, and when 6.6.x was released I upgraded using the GUI. Since then, the shutdowns have been occurring all the time. I've been upgrading as new releases have come out, but the issue persists. Most recently I deployed a new USB key straight on version 6.6.3, but it's still the same thing.
What's odd is that it seems to be doing a shutdown or a hard crash, as the server will be in a powered-off state, not sitting at a console or panic screen.
I've done everything I can think of to rule out hardware - temp monitoring through IPMI and BIOS, Memtest, reset BIOS settings, reseated power/reset headers on the motherboard, checked the event log on IPMI etc. Nothing appears to be wrong. I should note that this hardware was running NAS4Free for about 4 years with no issues prior to migrating to unRAID a month or two ago.
Are there any diagnostics I can perform on the OS? The diagnostics from the web GUI seem to reset after a power loss, so they don't show anything meaningful.
Here are the stats for my server. I'm running only a few minimal Docker instances (Plex, Sonarr etc) and CPU/mem utilisation is very low.
Model: Custom
M/B: Supermicro - X9SCL/X9SCM
CPU: Intel® Xeon® CPU E3-1230 V2 @ 3.30GHz
HVM: Enabled
IOMMU: Enabled
Cache: 256 kB, 1024 kB, 8192 kB
Memory: 32 GB Single-bit ECC (max. installable capacity 32 GB)
Network: eth0: 1000 Mb/s, full duplex, mtu 1500
eth3: 1000 Mb/s, full duplex, mtu 1500
Kernel: Linux 4.18.15-unRAID x86_64
OpenSSL: 1.1.0i
Uptime: 0 days, 00:06:45
JonathanM Posted October 24, 2018
This sort of issue is usually caused by a failing power supply. What model is it, and how old? How many hard drives, and which models? Graphics cards? VM passthrough for graphics?
Frank1940 Posted October 24, 2018
Is it possible that something could be hitting the power switch on the case? (I can recall a couple of instances of this in the past. One time, it was a cat. Children are another possibility.)
Is it possible that someone is having some fun with you and shutting it down remotely via the GUI? (Try changing the server password.)
Are you getting a parity check on restart (hard shutdown) or a normal shutdown (no parity check)?
@jonathanm has a good possibility with the PS suggestion.
itimpi Posted October 24, 2018
Are you saying the server is being powered off (rather than crashing but staying powered on)? This distinction is important, as powering down always needs a specific trigger to activate it. Typically this would be hardware related. If the server is staying powered on but unresponsive (i.e. crashed), then a software problem also becomes a likely candidate.
In the short term, the first step is to install the "Fix Common Problems" plugin and put it into troubleshooting mode (note that this mode gets unset if you reboot), which might help in capturing some useful log information.
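The point of capturing logs before the next power loss can be illustrated with a small script. This is only a minimal sketch of the general idea (periodically appending new log lines to storage that survives a power cut), not a description of how the Fix Common Problems plugin works internally. The paths in the usage note (`/var/log/syslog` for the live log, `/boot/logs/` on the flash drive) and the function names are assumptions for illustration.

```python
import os
import time
from pathlib import Path


def append_new_log_lines(src: Path, dst: Path, offset: int = 0) -> int:
    """Append bytes written to src since `offset` onto dst; return the new offset."""
    size = src.stat().st_size
    if size < offset:
        # The source shrank (rotated or truncated), so start again from the top.
        offset = 0
    dst.parent.mkdir(parents=True, exist_ok=True)
    with src.open("rb") as fin, dst.open("ab") as fout:
        fin.seek(offset)
        chunk = fin.read()
        fout.write(chunk)
        fout.flush()
        # Force the data onto the device so it survives an abrupt power-off.
        os.fsync(fout.fileno())
    return offset + len(chunk)


def mirror_forever(src: Path, dst: Path, interval_s: float = 10.0) -> None:
    """Hypothetical helper: keep mirroring src to dst until the machine dies."""
    offset = 0
    while True:
        offset = append_new_log_lines(src, dst, offset)
        time.sleep(interval_s)
```

Under those assumptions it could be started with `mirror_forever(Path("/var/log/syslog"), Path("/boot/logs/syslog-mirror.txt"))`. After an unexpected power-off, the last lines in the mirror file are the most interesting ones. Frequent writes do add wear to a USB flash drive, so this is something to run while troubleshooting rather than permanently.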
FunkadelicRelic (Author) Posted October 24, 2018
2 minutes ago, jonathanm said: This sort of issue is usually caused by a failing power supply. What model is it, and how old? How many hard drives, and which models? Graphics cards? VM passthrough for graphics?
Quite possibly. However, during the Memtest, which I ran for over a full day, there were no outages. Since typing the message above, it went again! It was up for about 30 minutes this time.
The PSU is an EVGA BQ modular 500 watt, if I recall. Graphics - none, all VM functionality disabled. I'm very much using this as a traditional NAS server. Nothing fancy.
FunkadelicRelic (Author) Posted October 24, 2018
1 minute ago, Frank1940 said: Is it possible that something could be hitting the power switch on the case? (I can recall a couple of instances of this in the past. One time, it was a cat. Children are another possibility.) Is it possible that someone is having some fun with you and shutting it down remotely via the GUI? (Try changing the server password.) Are you getting a parity check on restart (hard shutdown) or a normal shutdown (no parity check)? @jonathanm has a good possibility with the PS suggestion.
It's a fair suggestion, but unless a cat could get into my garage, open my locked rack, climb up about 30U and power it off, I'm going to say no. Although I don't ever trust a cat.
In seriousness though, it's definitely not someone powering it down. It is locked away and secure. The GUI shutdown is interesting, but I don't think it could be that. I'm behind a firewall which I monitor, and the server itself is in a constrained VLAN with limited connectivity, so I think I'm OK from that standpoint - strong password too.
A parity check seems to kick in on the reboot. Most times it completes with zero issues, but as I mentioned in my last reply, this latest stop happened after about 30 minutes, so that check will never have finished.
Oh - forgot to mention - 4 HDDs (2TB), two of which are parity, and one 500GB SSD cache.
FunkadelicRelic (Author) Posted October 24, 2018
5 minutes ago, itimpi said: Are you saying the server is being powered off (rather than crashing but staying powered on)? This distinction is important, as powering down always needs a specific trigger to activate it. Typically this would be hardware related. If the server is staying powered on but unresponsive (i.e. crashed), then a software problem also becomes a likely candidate. In the short term, the first step is to install the "Fix Common Problems" plugin and put it into troubleshooting mode (note that this mode gets unset if you reboot), which might help in capturing some useful log information.
Full power off. What is interesting is that when I power it back up the first time, the server doesn't boot from the USB key - I get the 'no OS' warning. If I power it off and then on again, it boots fine. This is new since 6.6.3. I thought it could be a bad stick, which is why I built a new one.
I've got the FCP app - I'll try to get it into troubleshooting mode - haven't checked that option yet!
Frank1940 Posted October 24, 2018
I would bet on the PS. Cheap (and quick) to test (especially with only four HDs). If you have none in house, try borrowing one from a friend (or purchase one from a vendor with a 30-day return policy).
EDIT: I had a PS go bad, but it would reboot the server. In my case, it was an easy find. I had replaced a dual-rail supply with a single-rail one a week earlier, so the PS was the last thing changed in the system...
itimpi Posted October 24, 2018
When it comes back up, does it start an automatic parity check? If so, that suggests an unexpected hardware power-off. I would then be looking at the power supply, and at items like the cooling system (in case the CPU is triggering a thermal power-down).
If a parity check is not initiated on powering up, that indicates a tidy shutdown. The only candidate I can think of off-hand that might trigger that would be a UPS-related event. Are you using a UPS?
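The triage logic from the replies so far can be written down as a small decision function. This is just a sketch that encodes the reasoning given in this thread; the function name, labels and suspect lists are illustrative and not part of any Unraid tool.

```python
from typing import List, Tuple


def classify_shutdown(powered_off: bool, parity_check_on_boot: bool) -> Tuple[str, List[str]]:
    """Map two observations to a shutdown type and the suspects named in the thread."""
    if not powered_off:
        # Still powered on but unresponsive: software joins the list of suspects.
        return "crash", ["software problem", "hardware"]
    if parity_check_on_boot:
        # Power vanished without a tidy shutdown, so the array was left unclean.
        return "unclean power-off", ["power supply", "cooling / thermal power-down"]
    # Powered off, but the array was stopped cleanly first.
    return "tidy shutdown", ["UPS-related event"]


label, suspects = classify_shutdown(powered_off=True, parity_check_on_boot=True)
print(label, suspects)
```

For the server in this thread (fully powered off, parity check starting on the next boot) the function lands in the "unclean power-off" branch, which is why the replies concentrate on the power supply and cooling.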
FunkadelicRelic (Author) Posted October 24, 2018
18 minutes ago, Frank1940 said: I would bet on the PS. Cheap (and quick) to test (especially with only four HDs). If you have none in house, try borrowing one from a friend (or purchase one from a vendor with a 30-day return policy). EDIT: I had a PS go bad, but it would reboot the server. In my case, it was an easy find. I had replaced a dual-rail supply with a single-rail one a week earlier, so the PS was the last thing changed in the system...
I can certainly give it a go - I have a spare kicking around here, so no harm in trying. I'll post back with my findings.
FunkadelicRelic (Author) Posted October 24, 2018
19 minutes ago, itimpi said: When it comes back up, does it start an automatic parity check? If so, that suggests an unexpected hardware power-off. I would then be looking at the power supply, and at items like the cooling system (in case the CPU is triggering a thermal power-down). If a parity check is not initiated on powering up, that indicates a tidy shutdown. The only candidate I can think of off-hand that might trigger that would be a UPS-related event. Are you using a UPS?
That's interesting to know - it certainly seems like a dirty shutdown then. Thermals are no problem - IPMI monitoring of temps and the event logs indicate no issues. It's also cold in the UK at the moment - I reckon I could run this thing passively with no problems. Joking aside, that was my first hunch, but I am certain that it's thermally sound.
I'm going to try the PSU as you and others have suggested. I'll keep you all posted!
FunkadelicRelic (Author) Posted October 24, 2018
Thanks everyone for the help so far. Going to try the PSU - fairly unanimous responses so far, so we'll see how that goes.
FunkadelicRelic (Author) Posted October 25, 2018
23 hours ago, FunkadelicRelic said: Thanks everyone for the help so far. Going to try the PSU - fairly unanimous responses so far, so we'll see how that goes.
OK - new PSU is in. Replaced with a spare 450W EVGA CX modular PSU. First power-up was fine, and the parity check is in progress. I'll find out soon enough if that was the culprit. I'll keep you posted.
FunkadelicRelic (Author) Posted October 25, 2018
1 hour ago, FunkadelicRelic said: OK - new PSU is in. Replaced with a spare 450W EVGA CX modular PSU. First power-up was fine, and the parity check is in progress. I'll find out soon enough if that was the culprit. I'll keep you posted.
Well - the server just lost power again with the new PSU. It must have stayed up for an hour or so this time. Does anyone have any other suggestions?
Frank1940 Posted October 25, 2018
Do you have a UPS, and are you using the UPS shutdown app? You can easily check a UPS by shutting your server down and connecting a lamp load of a couple of hundred watts to it. Pull the power plug and see how long the battery lasts.
FunkadelicRelic (Author) Posted October 26, 2018
16 hours ago, Frank1940 said: Do you have a UPS, and are you using the UPS shutdown app? You can easily check a UPS by shutting your server down and connecting a lamp load of a couple of hundred watts to it. Pull the power plug and see how long the battery lasts.
Unfortunately not. I'm now working on the assumption that something is shorting out. The rackmount chassis has been in place for years, so I wouldn't imagine it's the motherboard - but I do have questions about the backplane and disk trays. I'm going to swap everything into a spare case with no backplane or caddies and see what happens. Guess that's my weekend project sorted.
FunkadelicRelic (Author) Posted November 12, 2018
So I seem to have resolved this, although I'm not entirely sure what was causing the issue, as I changed quite a few things in the process.
To summarise, it must have been environmental. I switched the PSU and changed out the CPU fan and TIM. Still crashed. In the end, I had a spare 4U case of identical make, so I decided to lift and shift the components into the spare case, backplane included.
When moving the motherboard, I noticed the board had quite a lot of thick black gunk (I thought it was scorch marks to start with!) around some of the capacitors. I gave it a good clean with 99% isopropyl alcohol and it cleaned right up. I re-assembled everything and it seems to be OK. It's now been up for over 24 hours, so fingers crossed.
If I were to guess, I'd imagine it was a fan cable or one of the case headers causing a short.
Anyway - I appreciate everyone's help trying to diagnose this.
Regards. Tom.
John_M Posted November 12, 2018
If you have leaking capacitors you need to replace them. In practice that probably means replacing the motherboard.
FunkadelicRelic (Author) Posted November 12, 2018
Just now, John_M said: If you have leaking capacitors you need to replace them. In practice that probably means replacing the motherboard.
Agreed, though in this case they aren't leaking; it's just years of crud build-up. As mentioned, I don't think they were the cause of the issue, but they certainly do look nice and clean now.
Frayedknot Posted November 12, 2018
I have the same issue. Going back to 6.5.3 seems to be stable. I've tried 6.6.0, 6.6.1, and 6.6.2. I have since jumped to 6.6.4 and 6.6.5 and it has been good. I thought it was possibly related to a Realtek LAN card, but because my system hard-stops, that seems different from those symptoms. Neither my console nor my syslog showed anything.
Are you now running 6.6.3, or did you upgrade to 6.6.4/6.6.5? Maybe it's even a plugin that got an update!? Just throwing wild ideas out now.
Soldius Posted November 13, 2018
Someone mentioned that this exact symptom was fixed by uninstalling the Unassigned Devices plugin, deleting its residual folder from the flash drive, and then reinstalling it. Try that, and see if it makes a difference.
Archived
This topic is now archived and is closed to further replies.