Posted January 23, 2021

Hi all, I wonder if someone can help me identify the cause of a regular crash of my server? It seems to coincide with sustained high disk I/O, as far as I can tell. It is more likely to happen if I run my Rclone-to-Gdrive backup script, back up other machines onto the array, or copy a few hundred gigabytes to the server over a few days. If I keep usage light and avoid those things, it keeps running for over a month; otherwise it dies at least once a fortnight.

When the crash happens, VMs remain responsive but glitchy (particularly with internet/network access), but once a VM is rebooted it won't come back up. The WebUI stops responding, while the physical console still partially responds (accessed via IPMI on a Supermicro board). I cannot ping into or out of Unraid and get no network traffic at all, yet the interface shows as up, has an IP address, and has link lights on. At the console I can run basic commands like cd and ls, but anything more involved, such as unmounting a drive or stopping a service, just hangs; the shutdown -r command also hangs it completely, requiring a manual hardware reset.

After the unclean reboot it understandably runs a parity check, which has never yet flagged any parity errors. When I go to start my Windows 10 VM, it takes almost two hours to boot into Windows; it only takes this long after a crash, never during a normal VM reboot, and that VM is on a dedicated SSD.

I managed to copy the /var/log folder to my array prior to rebooting on the most recent occasion, so I'll trawl through those later today and post anything relevant. Apart from that, my specs:

Unraid 6.8.3
Supermicro X8DTN+-F
104GB RAM
LSI 9207-8i
Supermicro BPN-SAS2-836EL1 expander backplane

Diagnostics attached; these were taken after the reboot (I couldn't get to the web interface prior to rebooting).

Edit: Since I wrote the above, the server has crashed again, this time in less than 24 hours, even with minimal workload apart from the parity check; the crash happened shortly after the parity check finished (17 hours). It's currently still in the crashed state, but when I tried to retrieve the diagnostics via the console the diagnostics process errored and quit (I didn't note the error at the time). Since then I tried to view the syslog manually with tail and it froze; it won't let me exit or even log in on another terminal, it just sits with a blinking cursor. After checking the USB, the diagnostics file did not complete, so it's safe to assume we've lost this syslog.

unraid-diagnostics-20210121-1330.zip
syslog-after-reboot.zip
January 23, 2021 Author

Following up on the copy of /var/log that I thought I had backed up: although the command completed successfully, the file does not exist on the array. I guess it never actually got written to the array because of the unclean shutdown, and that was my only hope of obtaining a syslog so far. I'll have to wait for it to crash again and try another method of extracting it.

Edit: I have now set up syslog to save to a mounted disk rather than the array path, using the method here. It seems to be saving fine so far, so now we just wait for the next crash to occur. I'll hammer the drives a bit and see how soon I can trigger it.

Edited January 23, 2021 by fitzy89
additional information added
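In case it's useful to anyone following along, this is roughly what I should have done when grabbing the logs by hand before the reset (a minimal sketch; the destination path is a placeholder for a disk mounted outside the array on my box, not a real share name):

```bash
# Copy the current logs to a mounted disk and force them out of the page
# cache, so the copy survives the hard reset that follows.
cp -r /var/log "/mnt/disks/scratch/crash-logs-$(date +%Y%m%d-%H%M)"
sync
```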
January 23, 2021 Author

8 hours ago, trurl said:

Thanks for providing this. I have set it to mirror syslog to flash for the time being. In addition, as noted in my edited comment above, I've also started logging to one of the local disks.

Edit: It's been running flat out for a few hours now: every docker and VM running, parity sync in progress, local PCs backing up to a share, and I'm just about to start my rclone backup. So far so good; it seems to be stable at the moment and coping surprisingly well with the load.

Overnight there did appear to be a lot of failed SSH login attempts from a local IP address, but that IP doesn't exist on my network and there's no sign that it ever did: no logs on my router and no DHCP leases that would indicate what that IP belongs to. This server isn't publicly accessible either (only my Plex docker has a port forwarded, nothing else). I've attached the syslog from overnight in case it helps; at the moment the server is still running.

syslog-1611366845

Edited January 23, 2021 by fitzy89
Added detail
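For the failed logins, this is roughly how I pulled them out of the syslog (a rough sketch; /var/log/syslog is the live log on my box, adjust if you log elsewhere):

```bash
# Show the failed SSH password attempts and the source addresses they came from.
grep 'Failed password' /var/log/syslog

# Count attempts per source IP to see whether it's a single host hammering away.
grep 'Failed password' /var/log/syslog | grep -oE 'from [0-9.]+' | sort | uniq -c
```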
February 12, 2021 Author

Hi again all, we have another crash! 19 days of uptime on this occasion. I think I've now worked out that it's working with VMs that triggers it: there are some odd quirks leading up to it, but rebooting a VM will always finish it off. I've attached the latest syslog from just prior to the crash.

After the network dropped out, I logged into the console and ran the "poweroff" command to try to encourage a graceful shutdown, but it just printed the usual "system is going down NOW" message and stuck at the normal blinking cursor.

Since the previous time, I have really been hammering the storage, copying terabytes of data here and there, with every VM and every docker started and running, though not actively used for gaming as they previously were. I'm now of the opinion that it isn't linked to storage as I had previously thought, and is instead solely something to do with the VM subsystem/QEMU. Just before this issue, I was stopping all VMs and dockers because I wanted to swap a drive in my array, and it was upon stopping the last VM that the freeze happened.

Where do I go from here?

syslog.zip
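Next time it locks up I'm planning to grab a bit more VM/QEMU state from the console before the hard reset, along the lines of the sketch below (just the commands I intend to try; I haven't confirmed any of them will still respond once the box is in that state):

```bash
# See what libvirt thinks the VMs are doing and whether any QEMU processes
# are stuck in uninterruptible (D) state.
virsh list --all
ps -eo pid,stat,wchan:32,cmd | grep '[q]emu'

# Pull any hung-task or QEMU/libvirt messages from the tail of the syslog.
grep -iE 'hung task|blocked for more than|qemu|libvirt' /var/log/syslog | tail -n 50
```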