Hi all, I'd appreciate your thoughts on this!
Background
I set up my first Unraid server earlier this year. I'm enjoying it, but ever since I set it up I've experienced periodic crashes, typically three per week. Sometimes it reboots by itself, and sometimes I need to restart it manually using the physical switch.
The syslog never showed anything useful, as it cut off when the errors occurred. Since I switched from writing the log to the cache to logging to a Raspberry Pi, I'm getting a lot more useful information, which I have attached. A lot of different errors start occurring at timestamp 2024-08-25T21:23:36+01:00, after which the server hangs (i.e. I can't access the web UI) and I need to go and physically reboot it.
Some potentially revealing excerpts from the attached syslog:
2024-08-25T21:23:36+01:00 Tower kernel: __vm_enough_memory: pid: 15270, comm: ffmpeg, no enough memory for the allocation
2024-08-25T21:23:36+01:00 Tower kernel: ffmpeg[15270]: segfault at 57bf3d211548 ip 000014b85694426c sp 00007ffdb9dc03f0 error 6 in ld-linux-x86-64.so.2[14b856937000+25000] likely on CPU 6 (core 3, socket 0)
2024-08-25T21:23:45+01:00 Tower kernel: SQUASHFS error: xz decompression failed, data probably corrupt
2024-08-25T21:23:45+01:00 Tower kernel: SQUASHFS error: Failed to read block 0x735c18: -5
2024-08-25T21:24:07+01:00 Tower kernel: BTRFS critical (device loop2): corrupt leaf: root=2 block=205144064 slot=119, bad key order, prev (9223372054590193664 168 4096) current (17735421952 168 4096)
2024-08-25T21:24:07+01:00 Tower kernel: BTRFS info (device loop2): leaf 205144064 gen 62292 total ptrs 198 free space 839 owner 2
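In case it helps anyone skim the attached log, here is a rough sketch of how the errors could be tallied by type, to see which kind shows up first. The patterns are my own guesses based only on the excerpts above, not an exhaustive list of kernel messages, and `tally_file` and its path argument are just illustrative names.

```python
import re
from collections import Counter

# Assumed patterns, taken only from the excerpts above; extend as needed.
PATTERNS = {
    "memory": re.compile(r"__vm_enough_memory|Out of memory"),
    "segfault": re.compile(r"segfault at"),
    "squashfs": re.compile(r"SQUASHFS error"),
    "btrfs": re.compile(r"BTRFS (critical|error)"),
}


def tally_errors(lines):
    """Return (first_seen, counts) for each error category found in lines.

    Assumes each line starts with a timestamp followed by a space,
    as in the remote syslog excerpts above.
    """
    counts = Counter()
    first_seen = {}
    for line in lines:
        timestamp, _, message = line.partition(" ")
        for name, pattern in PATTERNS.items():
            if pattern.search(message):
                counts[name] += 1
                # Keep only the earliest timestamp per category.
                first_seen.setdefault(name, timestamp)
    return first_seen, counts


def tally_file(path):
    """Run tally_errors over a syslog file (path is hypothetical)."""
    with open(path, errors="replace") as f:
        return tally_errors(f)
```

Sorting the categories by their first-seen timestamp should show whether the memory allocation failures really do precede the SQUASHFS and BTRFS errors, as they do in the excerpts.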
Hardware
Mobo: Gigabyte Z690M DS3H DDR4, BIOS F28
CPU: i5-12500T
RAM: 16 GB DDR4
Array: 2x 10 TB HDD
Cache: 1x 512 GB SSD
Flash: 8 GB Kingston DataTraveler DTSE9
PSU: Corsair 400 W
Unraid version: 6.12.13
Potential causes
- PSU - Unlikely, I've tested it with another and the problem still occurs.
- Memory - I think this is also unlikely? Despite the messages in the syslog, it passes 24+ hours of memtest without any errors. I've tried with XMP enabled and disabled. I used the memory in my old desktop PC for years without issue.
- Flash drive - I've read that the SQUASHFS errors point towards a failing flash drive, so this is a possibility. At least my server is taking backups and uploading them to the cloud. That said, chkdsk doesn't report any errors on the drive.
- Cache drive - BTRFS errors indicate something is wrong, but I'm not sure that would cause the issues I'm seeing. I read somewhere that ZFS is preferred. I'll migrate things over at some point if that's the case.
- Motherboard - I think this is the most likely culprit, as I bought it second-hand, and very cheaply! In my mind, a bad motherboard could explain the surge of apparently unrelated errors?
- CPU - A possibility, as it's another second-hand component, but I think it's less likely to be the problem than the motherboard.
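On the flash drive point, a more direct check than chkdsk might be to compare the boot files against their checksums. Below is a minimal sketch, assuming each boot file on the flash has a sidecar file with the same name plus `.sha256`, and that the first token in that sidecar is the hex digest. Both are assumptions on my part, so please correct me if the layout is different. `verify_dir` is just an illustrative name.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large boot images are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dir(directory):
    """Return {filename: True/False} for every file with a .sha256 sidecar.

    Assumption: the sidecar's first whitespace-separated token is the
    expected hex digest of the file it sits next to.
    """
    results = {}
    for sidecar in sorted(Path(directory).glob("*.sha256")):
        target = sidecar.with_suffix("")  # e.g. bzimage.sha256 -> bzimage
        if not target.is_file():
            continue
        tokens = sidecar.read_text().split()
        expected = tokens[0].lower() if tokens else ""
        results[target.name] = sha256_of(target) == expected
    return results
```

This could be run from any machine with Python and the flash drive plugged in, passing the drive's mount point as `directory`. Any `False` entry would suggest a boot file no longer matches its recorded checksum, though a clean result would not rule out intermittent read errors.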
I think my next step is to buy another motherboard, but I'd appreciate your ideas before I go and purchase anything! Thanks!