generalkenobi Posted September 13, 2023

TL;DR: My Unraid server keeps restarting roughly every 90 minutes. I've tried disabling VMs and Docker, rolling back versions, and memory tests, with no fix. A month ago I swapped the CPU and changed from a PCI SATA card to an HBA adapter; the system was stable for a while but is now unstable again. Hardware: i9-13900K, Asus ProArt Z690, 64 GB RAM. No clues in rsyslog and no kernel panics. Open to any suggestions or insights from the diagnostics.

I have had issues with Unraid restarting roughly every 90 minutes. I have the diagnostics attached.

Things I've tried:
- Disabled VM Manager and Docker and left them disabled: no fix.
- Changed Docker to ipvlan: no fix.
- Rolled back from 6.12.4 to 6.12.2: no fix.
- Rolled back from 6.12.2 to 6.11.5: no fix.
- Pulled out the memory sticks and ran memtest for 24 hours straight, with all passes.

I had a cache SSD ("cache_ssd") that had some bad sectors. I removed it from the system just in case, though it is still physically connected; I just haven't had the time to open the case.

A month ago, I swapped in a new CPU of the same model, thinking the kernel panics were caused by the CPU (I previously had issues per my previous post). However, the system was then stable for a month straight with no issues after I changed from a PCI SATA card to an HBA adapter.

I have an i9-13900K on an Asus ProArt Z690-CREATOR WIFI with 64 GB of RAM. No GPU installed. I'm using the onboard 10G port for the primary connection. I also have a 10GbE PCI card for a direct connection to my backup Unraid server, so I can rsync data back and forth, but that's been offline for a while. I have (8) 18TB disks with (1) parity, plus an NVMe drive and an SSD marked as cache, though I'm really just using them to serve Docker containers so as not to stress the array.

I set up a rsyslog server and attached that as well. It leaves no trace before a reboot, and I don't see any kernel panics the way I did a month ago. I am at a loss for what to do next. Any wild ideas from anyone?
Or someone who sees something in the diagnostics that I don't?

tower-diagnostics-20230912-2120.zip
messages
JorgeB Posted September 13, 2023

If it's really restarting on its own, as opposed to just crashing or hanging, it's likely a hardware problem. Do you have a different PSU you could test with?
itimpi Posted September 13, 2023

You could also try enabling the syslog server (probably with the Mirror to Flash option set) to get a syslog that survives a reboot, so we can see what leads up to a restart. If you use the mirror option, the syslog file is stored in the 'logs' folder on the flash drive.
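With the mirror option on, a quick sanity check from the console might look like this (a sketch: /boot is the usual Unraid flash mount point, and the exact file names under logs/ can vary):

```shell
# Look for the flash-mirrored syslog on the Unraid USB stick.
# Assumption: the flash drive is mounted at /boot (the Unraid default).
LOGDIR=/boot/logs

if [ -d "$LOGDIR" ]; then
    ls -lt "$LOGDIR"                           # newest mirrored logs first
    tail -n 50 "$LOGDIR/syslog" 2>/dev/null    # last lines before the previous reboot
else
    echo "no mirrored syslog found at $LOGDIR (is Mirror to Flash enabled?)"
fi
```

Because the file lives on the flash drive rather than in RAM, the lines written just before an unclean restart should still be there after the machine comes back up.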
generalkenobi Posted September 13, 2023

Thanks. Those are good points. I turned on syslog to flash and got these logs. The last restart was around 12:40 today; it booted back up at "Sep 13 12:40:20." This restart came after I had stopped the array and let it sit idle. The lines preceding it are:

Sep 13 12:37:03 Tower emhttpd: spinning down /dev/sdf
Sep 13 12:37:12 Tower emhttpd: spinning down /dev/sdd
Sep 13 12:37:20 Tower emhttpd: spinning down /dev/sdk
Sep 13 12:37:20 Tower emhttpd: spinning down /dev/sdi
Sep 13 12:37:37 Tower emhttpd: spinning down /dev/sdg
Sep 13 12:37:46 Tower emhttpd: spinning down /dev/sde
Sep 13 12:37:46 Tower emhttpd: spinning down /dev/sdl
Sep 13 12:38:09 Tower emhttpd: read SMART /dev/sdf
Sep 13 12:38:09 Tower emhttpd: spinning down /dev/sdj
Sep 13 12:38:09 Tower emhttpd: spinning down /dev/sdc
Sep 13 12:38:18 Tower emhttpd: read SMART /dev/sdd
Sep 13 12:38:18 Tower emhttpd: spinning down /dev/sdh

That doesn't look bad to me. I'll try a power supply swap next once I work out whether I have enough SATA power cables.

syslog
JorgeB Posted September 14, 2023

Not surprisingly, there's nothing relevant logged, which also points to a hardware issue.
generalkenobi Posted September 15, 2023

I swapped in a new power supply. So far, it's been up for 25 hours with no issues. A parity check is running and has corrected over 11,000 errors. Yikes. This may be solved. Holding out hope it remains stable over the next few days. Thanks for the suggestions.
generalkenobi Posted September 15, 2023

Hmm. Another restart after the power supply swap. Any other ideas?

Since my last post, the parity check had finished. I moved the server to another outlet on a 1500 W UPS, so I don't think it's the outlet. My girlfriend overloaded the breaker with her hair dryer, but NUT did its job, so I doubt it's the move to a different outlet. But maybe the UPS is faulty?

Sep 15 10:22:34 Tower upsmon[5986]: UPS [email protected] on battery
....
Sep 15 10:27:35 Tower upsmon[5986]: Signal 10: User requested FSD

(NUT is configured to shut down after 5 minutes on battery.)

The syslog looks similar, with nothing I can see. The restart occurs at Sep 15 12:47:31 in the syslog.

I ran sensors to get CPU temps now (with Docker and the parity check running and ~5% CPU load). I don't know what they were before the restart, but I wondered about the CPU overheating. Those seem fine too.

root@Tower:~# sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +44.0°C  (high = +80.0°C, crit = +100.0°C)
Core 0:        +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 4:        +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 8:        +41.0°C  (high = +80.0°C, crit = +100.0°C)
Core 12:       +43.0°C  (high = +80.0°C, crit = +100.0°C)
Core 16:       +44.0°C  (high = +80.0°C, crit = +100.0°C)
Core 20:       +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 24:       +40.0°C  (high = +80.0°C, crit = +100.0°C)
Core 28:       +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 32:       +40.0°C  (high = +80.0°C, crit = +100.0°C)
Core 33:       +40.0°C  (high = +80.0°C, crit = +100.0°C)
Core 34:       +40.0°C  (high = +80.0°C, crit = +100.0°C)
Core 35:       +40.0°C  (high = +80.0°C, crit = +100.0°C)
Core 36:       +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 37:       +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 38:       +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 39:       +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 40:       +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 41:       +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 42:       +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 43:       +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 44:       +38.0°C  (high = +80.0°C, crit = +100.0°C)
Core 45:       +38.0°C  (high = +80.0°C, crit = +100.0°C)
Core 46:       +38.0°C  (high = +80.0°C, crit = +100.0°C)
Core 47:       +38.0°C  (high = +80.0°C, crit = +100.0°C)

nvme-pci-0300
Adapter: PCI adapter
Composite:     +49.9°C  (low = -273.1°C, high = +81.8°C) (crit = +84.8°C)
Sensor 1:      +49.9°C  (low = -273.1°C, high = +65261.8°C)
Sensor 2:      +52.9°C  (low = -273.1°C, high = +65261.8°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1:         +27.8°C  (crit = +105.0°C)

eth0-pci-0600
Adapter: PCI adapter
PHY Temperature: +62.0°C
MAC Temperature: +62.0°C

eth2-pci-0200
Adapter: PCI adapter
PHY Temperature: +53.0°C
MAC Temperature: +53.0°C

syslog
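For reference, a "shut down after 5 minutes on battery" rule like the one in the log above is typically implemented with NUT's upssched timer mechanism. A hedged sketch of what such a configuration can look like (file paths and the timer name are illustrative assumptions, not taken from this system):

```
# /etc/nut/upsmon.conf (fragment): hand ONBATT/ONLINE events to upssched
NOTIFYCMD /usr/sbin/upssched
NOTIFYFLAG ONBATT SYSLOG+EXEC
NOTIFYFLAG ONLINE SYSLOG+EXEC

# /etc/nut/upssched.conf: start a 300 s timer on battery, cancel it if power returns
CMDSCRIPT /etc/nut/upssched-cmd
PIPEFN /run/nut/upssched.pipe
LOCKFN /run/nut/upssched.lock
AT ONBATT * START-TIMER onbatt-shutdown 300
AT ONLINE * CANCEL-TIMER onbatt-shutdown

# /etc/nut/upssched-cmd: when the timer fires, request a forced shutdown
#   case "$1" in
#       onbatt-shutdown) /usr/sbin/upsmon -c fsd ;;
#   esac
```

Under a setup like this, the "Signal 10: User requested FSD" line is upsmon acting on a deliberate forced-shutdown (FSD) request five minutes after going on battery, not a crash.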
JorgeB Posted September 16, 2023

Unfortunately, there's still nothing relevant logged.
generalkenobi Posted September 27, 2023 [Solution]

Following back up on this. It seems that the old UPS was somehow cutting power to the outlet when it was overloaded. I moved the server to another UPS and haven't had any problems since. Thanks all!