(SOLVED) Random hangs & reboots
Thank you for taking the time to offer your opinions. I solved the issue eventually. I found that Unraid was completely stable if I ran it with only one stick of memory. It didn't matter which stick it was, or which slot it was in. But whenever I tried to run two sticks simultaneously, I would encounter crashes pretty quickly. So I'm guessing it's an issue with running in dual-channel mode and/or with the memory controller, something like that. Despite that, I never encountered any memtest errors, even when using two sticks of RAM. I wanted to upgrade to a higher capacity eventually anyway, so I bought a 1x32GB memory kit and it's been stable since then. It's a bit of a bodge since there could still be an issue with my CPU and/or motherboard, but it works at least.
-
Hi all, I'd appreciate your thoughts on this!

Background

I set up my first Unraid server earlier this year. I'm enjoying it, but ever since I set it up I've experienced periodic crashes, typically 3 per week. Sometimes it reboots by itself and sometimes I need to restart it manually using the physical switch.

The syslog was never showing anything useful, as it cut off when the errors occurred. But since I switched from logging to the cache to logging to a Raspberry Pi, I'm getting a lot more useful information, which I have attached. A lot of different errors start occurring at timestamp 2024-08-25T21:23:36+01:00, after which the server hangs (i.e. I can't access the web UI) and I need to go and physically reboot it.

Some potentially revealing excerpts from the attached syslog:

2024-08-25T21:23:36+01:00 Tower kernel: __vm_enough_memory: pid: 15270, comm: ffmpeg, no enough memory for the allocation
2024-08-25T21:23:36+01:00 Tower kernel: ffmpeg[15270]: segfault at 57bf3d211548 ip 000014b85694426c sp 00007ffdb9dc03f0 error 6 in ld-linux-x86-64.so.2[14b856937000+25000] likely on CPU 6 (core 3, socket 0)
2024-08-25T21:23:45+01:00 Tower kernel: SQUASHFS error: xz decompression failed, data probably corrupt
2024-08-25T21:23:45+01:00 Tower kernel: SQUASHFS error: Failed to read block 0x735c18: -5
2024-08-25T21:24:07+01:00 Tower kernel: BTRFS critical (device loop2): corrupt leaf: root=2 block=205144064 slot=119, bad key order, prev (9223372054590193664 168 4096) current (17735421952 168 4096)
2024-08-25T21:24:07+01:00 Tower kernel: BTRFS info (device loop2): leaf 205144064 gen 62292 total ptrs 198 free space 839 owner 2

Hardware

Mobo: Gigabyte Z690M DS3H DDR4, BIOS F28
CPU: i5-12500T
RAM: 16 GB DDR4
Array: 2x 10 TB HDD
Cache: 1x 512 GB SSD
Flash: 8 GB Kingston DataTraveler DTSE9
PSU: Corsair 400 W
Unraid version: 6.12.13

Potential causes

- PSU - Unlikely. I've tested with another PSU and the problem still occurs.
- Memory - I think this is also unlikely? Despite the messages in the syslog, it passes 24+ hours of memtest without any errors. I've tried with XMP enabled and disabled. I used the memory in my old desktop PC for years without issue.
- Flash drive - I've read that the SQUASHFS errors point towards a failing flash drive, so this is a possibility. My server is taking backups and uploading them to the cloud, at least. But chkdsk doesn't report any errors.
- Cache drive - The BTRFS errors indicate something is wrong, but I'm not sure that would cause the issues I'm seeing. I read somewhere that ZFS is preferred. I'll migrate things over at some point if that's the case.
- Motherboard - I think this is the most likely culprit, as I bought it second-hand, and very cheaply! In my mind, the surge of apparently unrelated errors could potentially be explained by a bad motherboard?
- CPU - A possibility, as it's another second-hand component, but I think it is less likely to be the problem than the motherboard.

I think my next step is to buy another motherboard, but I'd appreciate your ideas before I go and purchase anything!

Thanks!
capable_badger