generalkenobi Posted September 13, 2023

TL;DR: My Unraid server keeps restarting roughly every 90 minutes. I've tried disabling VMs and Docker, rolling back versions, and memory tests, with no fix. A month ago I swapped the CPU and changed from a PCI SATA card to an HBA adapter; the system was stable for a while but is now unstable again. Hardware: i9-13900K, Asus ProArt Z690, 64 GB RAM. No clues in rsyslog and no kernel panics. Open to any suggestions or insights from the diagnostics.

I have had issues with Unraid restarting roughly every 90 minutes. I have the diagnostics attached.

Things I've tried:
- Disabled VM Manager and Docker and left them disabled: no fix.
- Changed Docker to ipvlan: no fix.
- Rolled back from 6.12.4 to 6.12.2: no fix.
- Rolled back from 6.12.2 to 6.11.5: no fix.
- Pulled out the memory sticks and ran memtest for 24 hours straight, with all passes.

I had a cache SSD ("cache_ssd") that had some bad sectors. I removed it from the system just in case, though it is still physically connected; I just haven't had the time to open the case.

A month ago, I swapped in a new CPU of the same model, thinking the kernel panics were caused by the CPU (I previously had issues per my previous post). However, the system was then stable for a month straight with no issues after I changed from a PCI SATA card to an HBA adapter.

I have an i9-13900K on an Asus ProArt Z690-CREATOR WIFI with 64 GB of RAM. No GPU installed. I'm using the onboard 10G port for the primary connection. I also have a 10GbE PCI card for a direct connection to my backup Unraid server, so I can rsync data back and forth, but that's been offline for a while. I have (8) 18TB disks with (1) parity, plus an NVMe drive and an SSD marked as cache, though I'm really just using them to serve Docker containers so as not to stress the array.

I set up a rsyslog server and attached that as well. It leaves no trace before a reboot, and I don't see any kernel panics the way I did a month ago. I am at a loss for what to do next. Any wild ideas from anyone?
Or someone who sees something in the diagnostics that I don't?

tower-diagnostics-20230912-2120.zip
messages
JorgeB Posted September 13, 2023

If it's really restarting on its own, as opposed to just crashing or hanging, it's likely a hardware problem. Do you have a different PSU you could test with?
itimpi Posted September 13, 2023

You could also try enabling the syslog server (probably with the Mirror to Flash option set) to get a syslog that survives a reboot, so we can see what leads up to a restart. If you use the mirror option, the syslog file is stored in the 'logs' folder on the flash drive.
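With the mirror option on, a quick sanity check from the console might look like this (a sketch: /boot is the usual Unraid flash mount point, and the exact file names under logs/ can vary):

```shell
# Look for the flash-mirrored syslog on the Unraid USB stick.
# Assumption: the flash drive is mounted at /boot (the Unraid default).
LOGDIR=/boot/logs

if [ -d "$LOGDIR" ]; then
    ls -lt "$LOGDIR"                           # newest mirrored logs first
    tail -n 50 "$LOGDIR/syslog" 2>/dev/null    # last lines before the previous reboot
else
    echo "no mirrored syslog found at $LOGDIR (is Mirror to Flash enabled?)"
fi
```

Because the file lives on the flash drive rather than in RAM, the lines written just before an unclean restart should still be there after the machine comes back up.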
generalkenobi Posted September 13, 2023

Thanks. Those are good points. I turned on syslog to flash and got these logs. The last restart was around 12:40 today; it booted back up at "Sep 13 12:40:20." This restart came after I had stopped the array and let it sit idle. The lines preceding it are:

Sep 13 12:37:03 Tower emhttpd: spinning down /dev/sdf
Sep 13 12:37:12 Tower emhttpd: spinning down /dev/sdd
Sep 13 12:37:20 Tower emhttpd: spinning down /dev/sdk
Sep 13 12:37:20 Tower emhttpd: spinning down /dev/sdi
Sep 13 12:37:37 Tower emhttpd: spinning down /dev/sdg
Sep 13 12:37:46 Tower emhttpd: spinning down /dev/sde
Sep 13 12:37:46 Tower emhttpd: spinning down /dev/sdl
Sep 13 12:38:09 Tower emhttpd: read SMART /dev/sdf
Sep 13 12:38:09 Tower emhttpd: spinning down /dev/sdj
Sep 13 12:38:09 Tower emhttpd: spinning down /dev/sdc
Sep 13 12:38:18 Tower emhttpd: read SMART /dev/sdd
Sep 13 12:38:18 Tower emhttpd: spinning down /dev/sdh

That doesn't look bad to me. I'll try a power supply swap next once I work out whether I have enough SATA power cables.

syslog
JorgeB Posted September 14, 2023

Not surprisingly, there's nothing relevant logged, which also points to a hardware issue.
generalkenobi Posted September 15, 2023

I swapped in a new power supply. So far, it's been up for 25 hours with no issues. A parity check is running and has corrected over 11,000 errors. Yikes. This may be solved. Holding out hope it remains stable over the next few days. Thanks for the suggestions.
generalkenobi Posted September 15, 2023

Hmm. Another restart after the power supply swap. Any other ideas?

Since my last post, the parity check had finished. I moved the server to another outlet on a 1500 W UPS, so I don't think it's the outlet. My girlfriend overloaded the breaker with her hair dryer, but NUT did its job, so I doubt it's the move to a different outlet. But maybe the UPS is faulty?

Sep 15 10:22:34 Tower upsmon[5986]: UPS [email protected] on battery
....
Sep 15 10:27:35 Tower upsmon[5986]: Signal 10: User requested FSD

(NUT is configured to shut down after 5 minutes on battery.)

The syslog looks similar, with nothing I can see. The restart occurs at Sep 15 12:47:31 in the syslog.

I ran sensors to get CPU temps now (with Docker and the parity check running and ~5% CPU load). I don't know what they were before the restart, but I wondered about the CPU overheating. Those seem fine too.

root@Tower:~# sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +44.0°C  (high = +80.0°C, crit = +100.0°C)
Core 0:        +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 4:        +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 8:        +41.0°C  (high = +80.0°C, crit = +100.0°C)
Core 12:       +43.0°C  (high = +80.0°C, crit = +100.0°C)
Core 16:       +44.0°C  (high = +80.0°C, crit = +100.0°C)
Core 20:       +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 24:       +40.0°C  (high = +80.0°C, crit = +100.0°C)
Core 28:       +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 32:       +40.0°C  (high = +80.0°C, crit = +100.0°C)
Core 33:       +40.0°C  (high = +80.0°C, crit = +100.0°C)
Core 34:       +40.0°C  (high = +80.0°C, crit = +100.0°C)
Core 35:       +40.0°C  (high = +80.0°C, crit = +100.0°C)
Core 36:       +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 37:       +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 38:       +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 39:       +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 40:       +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 41:       +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 42:       +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 43:       +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 44:       +38.0°C  (high = +80.0°C, crit = +100.0°C)
Core 45:       +38.0°C  (high = +80.0°C, crit = +100.0°C)
Core 46:       +38.0°C  (high = +80.0°C, crit = +100.0°C)
Core 47:       +38.0°C  (high = +80.0°C, crit = +100.0°C)

nvme-pci-0300
Adapter: PCI adapter
Composite:     +49.9°C  (low = -273.1°C, high = +81.8°C) (crit = +84.8°C)
Sensor 1:      +49.9°C  (low = -273.1°C, high = +65261.8°C)
Sensor 2:      +52.9°C  (low = -273.1°C, high = +65261.8°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1:         +27.8°C  (crit = +105.0°C)

eth0-pci-0600
Adapter: PCI adapter
PHY Temperature: +62.0°C
MAC Temperature: +62.0°C

eth2-pci-0200
Adapter: PCI adapter
PHY Temperature: +53.0°C
MAC Temperature: +53.0°C

syslog
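For reference, a "shut down after 5 minutes on battery" rule like the one in the log above is typically implemented with NUT's upssched timer mechanism. A hedged sketch of what such a configuration can look like (file paths and the timer name are illustrative assumptions, not taken from this system):

```
# /etc/nut/upsmon.conf (fragment): hand ONBATT/ONLINE events to upssched
NOTIFYCMD /usr/sbin/upssched
NOTIFYFLAG ONBATT SYSLOG+EXEC
NOTIFYFLAG ONLINE SYSLOG+EXEC

# /etc/nut/upssched.conf: start a 300 s timer on battery, cancel it if power returns
CMDSCRIPT /etc/nut/upssched-cmd
PIPEFN /run/nut/upssched.pipe
LOCKFN /run/nut/upssched.lock
AT ONBATT * START-TIMER onbatt-shutdown 300
AT ONLINE * CANCEL-TIMER onbatt-shutdown

# /etc/nut/upssched-cmd: when the timer fires, request a forced shutdown
#   case "$1" in
#       onbatt-shutdown) /usr/sbin/upsmon -c fsd ;;
#   esac
```

Under a setup like this, the "Signal 10: User requested FSD" line is upsmon acting on a deliberate forced-shutdown (FSD) request five minutes after going on battery, not a crash.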
JorgeB Posted September 16, 2023

Unfortunately, there's still nothing relevant logged.
generalkenobi Posted September 27, 2023 [Solution]

Following back up on this. It seems that the old UPS was somehow cutting power to the outlet when it was overloaded. I moved the server to another UPS and haven't had any problems since. Thanks all!