zleef Posted February 14, 2017

Hi all,

I'm losing my mind trying to solve this. Any suggestions for debugging it would be greatly appreciated!

Here's the summary: my server crashes and restarts at intervals ranging from a few minutes to a few hours, without any errors that are obvious to me. I'm embarrassed to say it has always had some form of this issue, although it used to happen much less frequently (every few days to every few months), so it was a low-priority thing I never got around to fixing beyond spending a few hours on it here and there. But lately (in the last week or so, right around when I upgraded to unRAID 6.3.0), something has changed and it's happening at an alarming rate.

I should say, I'm not a hardware guy. I know enough to get myself in trouble (ahem, see this post), but I'm far from having any hardware expertise.

Here's what I have tried:

- Tailing the syslog from the IPMI console before and after the server crashes. No errors; it just restarts.
- Over time, I've replaced the memory, SATA cables, motherboard, and PSU. Each time, the issue has returned. The only things left are the hard drives and the CPU.
- A few days ago I bought a new USB flash drive (Lexar JumpDrive) and tried reinstalling unRAID on it. A fresh install with the default config (no license, so the disks weren't mounted) didn't immediately cause the crash as before. However, when I copied over my existing config, the server crashed pretty quickly. I didn't let the server run for long with the default config (maybe 2 hours with no crash). I'm open to leaving that running for longer, but because it was running in such a minimal capacity, it didn't feel like a good test.

My current theory is that one of the disks is having problems (one of the disks in the array is ~5 years old), but none of the SMART reports seem to indicate failure (although that could very well be my inexperience in reading those reports).

I'm open to suggestions. Let me know what other information I can provide. I've attached a diagnostics dump from earlier today.

tower-diagnostics-20170214-0844.zip
testdasi Posted February 14, 2017

I think you can use the debug mode of the Fix Common Problems plugin; it will log continuously until a crash. That will probably help more with diagnostics.
zleef Posted February 14, 2017 (Author)

Thanks testdasi, I installed Fix Common Problems and started "Troubleshooting Mode". Unfortunately, I'm not seeing anything jump out in the persisted syslog, but I've attached it just in case. I tried to run the extended scan, but the server restarted minutes after I started it.

FCPsyslog_tail.txt
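For anyone following along, the general idea of keeping a copy of the log that survives a sudden restart can be sketched as below. This is not how the Fix Common Problems plugin is implemented; it is only a minimal illustration, and the function names, the polling interval, and the paths in the usage comment are assumptions.

```python
import os
import time


def copy_new_bytes(src_path, dst_path, offset):
    """Append whatever was written to src_path after `offset` to dst_path.

    Returns the new offset. If the source shrank (for example after log
    rotation), copying starts again from the beginning of the source.
    """
    if os.path.getsize(src_path) < offset:
        offset = 0
    with open(src_path, "rb") as src:
        src.seek(offset)
        chunk = src.read()
    if chunk:
        with open(dst_path, "ab") as dst:
            dst.write(chunk)
            dst.flush()
            # Ask the OS to push the data to the device so that an abrupt
            # reset loses as little of the log as possible.
            os.fsync(dst.fileno())
    return offset + len(chunk)


def follow(src_path, dst_path, interval=1.0):
    """Poll src_path forever, mirroring new data into dst_path."""
    offset = 0
    while True:
        offset = copy_new_bytes(src_path, dst_path, offset)
        time.sleep(interval)


# Hypothetical usage (both paths are assumptions, adjust to your system):
# follow("/var/log/syslog", "/boot/logs/syslog-copy.txt")
```

Polling once a second is a trade-off: a shorter interval captures more of the final moments before a crash but writes to the destination device more often.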
Frank1940 Posted February 14, 2017

I looked through your diagnostics file and didn't see anything. Your disks appear to be fine.

Have you cleaned out the inside of the case recently and verified that all fans are running and the CPU cooling fins are not clogged with dirt? Have you installed Dynamix System Temperature? Have you run Memtest for 24 hours? (I know you replaced the memory...)

What plugins, Dockers, or VMs have you installed? Most of what I saw were common plugins that most folks have, so that should not be an issue here.

I saw your second post about it locking up while running the troubleshooting scan. I would be looking at the CPU overheating as a cause...
zleef Posted February 14, 2017 (Author)

Quote (Frank1940): Have you cleaned out the inside of the case recently and verified that all fans are running and the CPU cooling fins are not clogged with dirt?

Yes, I just cleaned it (after your post). The CPU cooling fan is not clogged. The server lasted a little longer right after the cleaning (maybe 1.5 hours), but it's now back to restarting every ~10 minutes or so.

Quote (Frank1940): Have you run Memtest for 24 hours?

It's been a while since I last did it. I don't believe I've done it with the "new" memory, so I'll go ahead and do that just to rule it out (I'll probably wait until tonight to start).

Quote (Frank1940): What plugins, Dockers, or VMs have you installed? Most of what I saw were common plugins that most folks have, so that should not be an issue here.

Plugins (listing all, including defaults):

- CA Auto Update Application
- CA Backup
- CA Cleanup Appdata
- Community Applications
- Dynamix System Temperature (installed today)
- Dynamix webGui
- Fix Common Problems (installed today)
- Preclear Disks
- Speedtest Command Line Tool
- unRAID server OS

Docker:

- I have several containers, but to help narrow down the problem I've disabled Docker completely (and still have the problem).
Quote (Frank1940): Saw your second post about it locking up while running the troubleshooting scan. I would be looking at the CPU overheating as a cause...

Pardon my ignorance here, but is it possible for the reported temperatures to be incorrect? Running the `sensors` command from the command line, I get the following:

    coretemp-isa-0000
    Adapter: ISA adapter
    CPU Temp:       +32.0°C  (high = +86.0°C, crit = +96.0°C)
    Core 0:         +32.0°C  (high = +86.0°C, crit = +96.0°C)
    Core 1:         +30.0°C  (high = +86.0°C, crit = +96.0°C)
    Core 2:         +30.0°C  (high = +86.0°C, crit = +96.0°C)
    Core 3:         +31.0°C  (high = +86.0°C, crit = +96.0°C)

Even if we assume the temperature doubles in the minutes between my checking it and the server crashing, it's still below the high and critical levels.

Just to show the full output:

    $ sensors
    nct6776-isa-0a30
    Adapter: ISA adapter
    Vcore:          +1.46 V  (min = +1.02 V, max = +1.69 V)
    in1:            +1.84 V  (min = +1.55 V, max = +2.02 V)
    AVCC:           +3.34 V  (min = +2.90 V, max = +3.66 V)
    +3.3V:          +3.34 V  (min = +2.83 V, max = +3.66 V)
    in4:            +1.50 V  (min = +0.97 V, max = +1.65 V)
    in5:            +1.27 V  (min = +1.07 V, max = +1.39 V)
    in6:            +1.47 V  (min = +0.89 V, max = +1.23 V)  ALARM
    3VSB:           +3.34 V  (min = +2.83 V, max = +3.66 V)
    Vbat:           +3.15 V  (min = +2.50 V, max = +3.60 V)
    fan1:             0 RPM  (min = 712 RPM)  ALARM
    fan2:          2142 RPM  (min = 712 RPM)
    fan3:           948 RPM  (min = 712 RPM)
    fan4:             0 RPM  (min = 712 RPM)  ALARM
    fan5:          2177 RPM  (min = 712 RPM)
    SYSTIN:         +32.0°C  (high = +85.0°C, hyst = +80.0°C)  sensor = thermistor
    CPUTIN:         +27.5°C  (high = +85.0°C, hyst = +80.0°C)  sensor = thermistor
    AUXTIN:          +1.0°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
    PECI Agent 0:    +0.0°C  (high = +80.0°C, hyst = +75.0°C)  ALARM  (crit = +100.0°C)
    PCH_CHIP_TEMP:   +0.0°C
    PCH_CPU_TEMP:    +0.0°C
    PCH_MCH_TEMP:    +0.0°C
    intrusion0:     ALARM
    intrusion1:     OK
    beep_enable:    enabled

    acpitz-virtual-0
    Adapter: Virtual device
    temp1:          +27.8°C  (crit = +101.0°C)
    MB Temp:        +29.8°C  (crit = +101.0°C)

    coretemp-isa-0000
    Adapter: ISA adapter
    CPU Temp:       +32.0°C  (high = +86.0°C, crit = +96.0°C)
    Core 0:         +32.0°C  (high = +86.0°C, crit = +96.0°C)
    Core 1:         +30.0°C  (high = +86.0°C, crit = +96.0°C)
    Core 2:         +30.0°C  (high = +86.0°C, crit = +96.0°C)
    Core 3:         +31.0°C  (high = +86.0°C, crit = +96.0°C)

By the way, I just noticed the *ALARM* warning on several of the sensors. Any clues there?

I should mention that ever since I got this board, the fans have run at what I believe is full speed. I've never spent enough time to figure out the problem, but looking at the sensor readings, I wonder if it's because the board expects fan1 and fan4 to run at a minimum of 712 RPM, but I don't have any fans hooked up to those headers?
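As a rough aid for reading output like this, the sketch below compares each reading against the limits printed on the same line and lists the ones that fall outside them. The sample lines are copied from the `sensors` output above; the function name and the regular expressions are illustrative assumptions, and the script only checks the printed limits, not whether those limits are sensible for the board.

```python
import re

# Sample lines copied from the `sensors` output posted above.
SAMPLE = """\
Vcore: +1.46 V (min = +1.02 V, max = +1.69 V)
in5: +1.27 V (min = +1.07 V, max = +1.39 V)
in6: +1.47 V (min = +0.89 V, max = +1.23 V) ALARM
fan1: 0 RPM (min = 712 RPM) ALARM
fan2: 2142 RPM (min = 712 RPM)
Core 0: +32.0°C (high = +86.0°C, crit = +96.0°C)
"""

NUM = r"[+-]?\d+(?:\.\d+)?"
# "label: value ..." at the start of a line.
READING = re.compile(rf"^(?P<label>[^:]+):\s*(?P<value>{NUM})")
# "min = x", "max = x", "high = x" or "crit = x" anywhere on the line.
LIMIT = re.compile(rf"(?P<kind>min|max|high|crit)\s*=\s*(?P<limit>{NUM})")


def out_of_range(text):
    """Return (label, value, reason) for readings outside their printed limits."""
    problems = []
    for line in text.splitlines():
        match = READING.match(line.strip())
        if not match:
            continue  # chip names, adapter lines, non-numeric entries
        label = match.group("label")
        value = float(match.group("value"))
        limits = {kind: float(limit) for kind, limit in LIMIT.findall(line)}
        if "min" in limits and value < limits["min"]:
            problems.append((label, value, "below min"))
            continue
        for kind in ("max", "high", "crit"):
            if kind in limits and value > limits[kind]:
                problems.append((label, value, f"above {kind}"))
                break
    return problems


for label, value, reason in out_of_range(SAMPLE):
    print(f"{label}: {value} is {reason}")
```

On the sample lines this flags in6 (above its max) and fan1 (below its min), which matches the ALARM markers in the original output. A fan header with nothing connected reads 0 RPM and would be flagged in the same way.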
Frank1940 Posted February 15, 2017

From what you have posted, I don't think you have a heat-related issue either. But I would be concerned about the in6 voltage in the sensors report being high. Has this MB been used previously in another computer setup where someone might have been attempting to overclock?

I hope someone else can jump in and give you some insight here.
zleef Posted February 15, 2017 (Author)

The MB was purchased new, so to my knowledge it shouldn't ever have been overclocked.
John_M Posted February 15, 2017

From your syslog:

    Feb 14 08:35:40 Tower kernel: mce: [Hardware Error]: Machine check events logged

Run that Memtest that you were planning to run, and run it for a good long time. If the memory proves to be good, then finding out what is actually wrong is not going to be easy.
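A line like that is easy to miss in a long log. A minimal sketch for pulling machine-check lines out of a saved syslog is shown below; the pattern and function name are assumptions chosen for illustration, the first sample line is the one quoted above, and the second is a placeholder that is not real log output.

```python
import re

# First line: copied from the syslog excerpt quoted above.
# Second line: a placeholder (not real log output) that the filter should skip.
SAMPLE = """\
Feb 14 08:35:40 Tower kernel: mce: [Hardware Error]: Machine check events logged
placeholder line that should be ignored
"""

# Assumed pattern: anything tagged by the kernel's mce subsystem or marked
# as a hardware error.
PATTERN = re.compile(r"\bmce:|\[Hardware Error\]|Machine check", re.IGNORECASE)


def hardware_error_lines(lines):
    """Return the lines that look like machine-check or hardware-error reports."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


for hit in hardware_error_lines(SAMPLE.splitlines()):
    print(hit)

# Hypothetical usage against a saved log file (the path is an assumption):
# with open("FCPsyslog_tail.txt", errors="replace") as f:
#     for hit in hardware_error_lines(f):
#         print(hit)
```

The message only says that machine check events were logged; it does not identify the failing component, which is why a decoder such as mcelog is the usual next step.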
zleef Posted February 15, 2017 (Author)

What would you consider a sufficient amount of time? 24 hours? 48?

In the likely case (based on my luck) that the memory is good, does anyone have suggestions for next steps?
John_M Posted February 15, 2017

I'd do it for 48 hours. If the RAM is good, I'd use the Nerd Tools to install mcelog and see if that reveals anything. But do the Memtest first. The newer version, which you can download and run from its own USB device, is better than the version you select from the boot menu.
zleef Posted February 15, 2017 (Author)

Thanks for the suggestions, John_M. I've been running Memtest from the boot menu, but I will download and run the newer version just to be thorough. I'll report back after 48 hours of Memtest. Thanks!
John_M Posted February 15, 2017

The stand-alone version needs its own USB stick. It dual boots: BIOS boot runs version 4 and UEFI boot runs version 7. If you can, run the later version.
gubbgnutten Posted February 15, 2017

...and please do interpret "If you can, run the later version" as "Run version 7 unless it is completely impossible".
zleef Posted February 15, 2017 (Author)

Haha, thanks for the laugh. Per your gentle suggestion, MemTest86 V7.2 is now running! 48 hours to go...
gubbgnutten Posted February 15, 2017

You're welcome! I'm sure your memory modules will enjoy the hammer tests.

You never really know which memory test result to hope for when there is an obscure problem present...
zleef Posted February 16, 2017 (Author)

So I just went to check in on MemTest, and it said something like "all tests passed, press any key to continue". I'm attaching the screenshot of the next page, which showed the summary.

I was assuming that this would just continue to run until I stopped it. Do I need to configure it somewhere to keep running, or should I just continue to restart the test every ~6 hours?