Unraid

SciKo Logic

  1. Just to give a brief update for anyone following this thread, I went on to: replace the CPU with an i7 12700K; replace the HBA card with an identical model (LSI 9207-8i); remove my Nvidia GPU (4060 Ti); fit a Sparkle Arc A380 6GB Elf GPU for QSV purposes; uninstall the NUT plugin (because it kept losing its connection to the UPS) and use the built-in UPS integration in Unraid; and install Perforce version control in a Docker container (skipping the need for a VM). Since then I have not seen a single issue with unresponsiveness or Unraid misbehaving to the point of any service failing or becoming unavailable. I have yet to confirm everything holds up in the long run, but as we speak it has been 9 days since my last reboot (when I decided to give up on NUT and use the Unraid UPS integration instead). Syslog is clean with no scary entries or errors. I believe, like JorgeB said, that the issues came from multiple sources and were not necessarily connected. Having an Nvidia GPU seemed to taint the kernel, which in itself is probably quite harmless, but I'd rather not have that for peace of mind. The driver crashes and initramfs data corruption, if I had to guess, were probably the main reason Unraid entered an unresponsive state only recoverable with a forced restart, and the underlying issue was thus an unstable CPU (13900K). I cannot say for sure that some bug, corruption or user-side setting in my VM didn't also cause some havoc or add to the chaos, or that having data on the SSD cache with no protection against corruption didn't also introduce behavior that could lead to Unraid failing somewhere. Ultimately, though, faulty hardware trumps most other issues and I suspect this was my main culprit all along. Unless something resurfaces, god forbid, I'll leave this thread solved and hope it helps someone else if ever in need. Huge thanks to @JorgeB, without whom I would never have decided to order a new CPU.
  2. Yeah, I was afraid of that. I've ordered a new LSI 9207-8i HBA and a dedicated 120 mm fan that I will use to bombard it with air. Once it arrives I'll continue with testing the RAM, and if it doesn't show any signs of trouble I'm preparing to invest in a new CPU. Just to be able to rule it out though, you don't think the motherboard can conjure these kinds of errors, right? The common thread so far has essentially been me picking incompatible or poorly compatible (with Unraid) components, and as a result I've replaced 70% of all hardware so far with better components, or at least ones confirmed working on an Unraid server. I had never encountered CPU instability like that of the 13900K generation. The one I use for my PC threw ICEs at me when compiling code, but it has been behaving all right since I updated the BIOS and set it to use Intel defaults. I imagined this would be similar for the Unraid build, but if nothing else, lesson learned. Not worth the headaches if the compromise is instability. I greatly appreciate all the help so far @JorgeB 🙏 I'll come back to update this once I receive the new HBA and can continue testing for faulty hardware.
  3. Yikes, the HBA seems to have straight up died. I can't get any sign of life out of it, and when I start Unraid now all disks are simply missing. I guess the HBA card needs replacing too 🫠 Would it be considered dreaming to try and pin the above issues on the HBA card?
  4. I'll shut down and re-seat the HBA, then begin the RAM stick test right away 🙏 Are there any CPU recommendations specific to Unraid (with 4K transcoding capabilities), or should I start looking for a stable CPU in general?
  5. I was about to get started on the single RAM stick tests today when I realized that all hell broke loose tonight. A few (<10) errors on all array disks, and they were all spun down as a result. This has never happened before, and while I would expect it to be possible for some of the disks, maybe even most, I wouldn't expect the parity disk, which is much newer, to have reached its end of life. I'm a bit overwhelmed by everything crashing down atm. I'm not sure if these errors are connected to my previous issues, if they could be a result of starting a parity check (which I did yesterday evening) after an unclean shutdown, or if this is somehow a blessing where the root cause is revealing itself. Below is from the syslog, starting from when I initiated the parity check:
     Sep 17 02:05:49 SkyNet kernel: mdcmd (36): check correct
     Sep 17 02:05:49 SkyNet kernel: md: recovery thread: check P ...
     Sep 17 02:08:23 SkyNet usbhid-ups[4339]: nut_libusb_get_report: Input/Output Error
     Sep 17 02:08:25 SkyNet usbhid-ups[4339]: Reconnecting. If you saw "nut_libusb_get_interrupt: Input/Output Error" or similar message in the log above, try setting "pollonly" flag in "ups.conf" options section for this driver!
     Sep 17 02:17:45 SkyNet Parity Check Tuning: Manual Correcting Parity-Check detected
     Sep 17 02:17:47 SkyNet Parity Check Tuning: Manual Correcting Parity-Check: Manually resumed
     Sep 17 02:24:09 SkyNet kernel: ------------[ cut here ]------------
     Sep 17 02:24:09 SkyNet kernel: kernel BUG at drivers/md/unraid.c:1617!
     Sep 17 02:24:09 SkyNet kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
     Sep 17 02:24:09 SkyNet kernel: CPU: 10 UID: 0 PID: 8327 Comm: unraidd0 Tainted: P O 6.12.24-Unraid #1
     Sep 17 02:24:09 SkyNet kernel: Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE
     Sep 17 02:24:09 SkyNet kernel: Hardware name: ASUS System Product Name/PRIME Z790M-PLUS, BIOS 1810 12/02/2024
     Sep 17 02:24:09 SkyNet kernel: RIP: 0010:unraidd+0x1189/0x1280 [md_mod]
     Sep 17 02:24:09 SkyNet kernel: Code: 00 83 3d cd 32 00 00 03 7e 16 41 8b 56 98 89 e9 48 c7 c7 c1 a3 83 a1 48 8b 73 20 e8 d1 de 2e e0 41 f6 86 69 ff ff ff 02 75 02 <0f> 0b 48 8b 43 20 49 03 47 18 41 c7 46 b0 00 10 00 00 49 8b 56 10
     Sep 17 02:24:09 SkyNet kernel: RSP: 0018:ffffc90003c43da8 EFLAGS: 00010246
     Sep 17 02:24:09 SkyNet kernel: RAX: 0000000000000000 RBX: ffff8881792986c8 RCX: 0000000000000000
     Sep 17 02:24:09 SkyNet kernel: RDX: 0000000000000000 RSI: ffffffff82e46f60 RDI: ffff88810ae7be38
     Sep 17 02:24:09 SkyNet kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
     Sep 17 02:24:09 SkyNet kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff88815504b908
     Sep 17 02:24:09 SkyNet kernel: R13: ffff888179298810 R14: ffff888179298888 R15: ffff888179305220
     Sep 17 02:24:09 SkyNet kernel: FS: 0000000000000000(0000) GS:ffff88903f280000(0000) knlGS:0000000000000000
     Sep 17 02:24:09 SkyNet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     Sep 17 02:24:09 SkyNet kernel: CR2: 000000000053dd30 CR3: 0000000147f74004 CR4: 0000000000772ef0
     Sep 17 02:24:09 SkyNet kernel: PKRU: 55555554
     Sep 17 02:24:09 SkyNet kernel: Call Trace:
     Sep 17 02:24:09 SkyNet kernel: <TASK>
     Sep 17 02:24:09 SkyNet kernel: ? preempt_latency_start+0x2b/0x50
     Sep 17 02:24:09 SkyNet kernel: ? _raw_spin_lock_irqsave+0x1f/0x30
     Sep 17 02:24:09 SkyNet kernel: md_thread+0xf6/0x130 [md_mod]
     Sep 17 02:24:09 SkyNet kernel: ? __pfx_autoremove_wake_function+0x10/0x10
     Sep 17 02:24:09 SkyNet kernel: ? __pfx_md_thread+0x10/0x10 [md_mod]
     Sep 17 02:24:09 SkyNet kernel: kthread+0xec/0x100
     Sep 17 02:24:09 SkyNet kernel: ? __pfx_kthread+0x10/0x10
     Sep 17 02:24:09 SkyNet kernel: ret_from_fork+0x21/0x40
     Sep 17 02:24:09 SkyNet kernel: ? __pfx_kthread+0x10/0x10
     Sep 17 02:24:09 SkyNet kernel: ret_from_fork_asm+0x1a/0x30
     Sep 17 02:24:09 SkyNet kernel: </TASK>
     Sep 17 02:24:09 SkyNet kernel: Modules linked in: nf_conntrack_netlink veth udp_diag wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha xt_nat xt_CHECKSUM ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle iptable_mangle ipvlan vhost_net tun vhost vhost_iotlb tap xt_conntrack xt_MASQUERADE nfnetlink xfrm_user xfrm_algo ip6table_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype md_mod zfs(PO) spl(O) ntfs3 tcp_diag inet_diag ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs af_packet cfg80211 rfkill 8021q garp mrp bridge stp llc bonding tls xe drm_gpuvm drm_exec gpu_sched drm_ttm_helper drm_suballoc_helper i915 intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm iosf_mbi drm_buddy crct10dif_pclmul crc32_pclmul crc32c_intel ttm ghash_clmulni_intel sha512_ssse3 sha256_ssse3 i2c_algo_bit sha1_ssse3 drm_display_helper aesni_intel crypto_simd cryptd drm_kms_helper rapl drm mei_hdcp
     Sep 17 02:24:09 SkyNet kernel: mei_pxp intel_cstate mpt3sas wmi_bmof intel_gtt mei_me i2c_i801 nvme agpgart e1000e intel_uncore i2c_smbus tpm_crb mei ahci nvme_core tpm_tis i2c_core raid_class tpm_tis_core scsi_transport_sas libahci tpm video thermal fan libaescfb ecdh_generic wmi ecc backlight acpi_tad acpi_pad button
     Sep 17 02:24:09 SkyNet kernel: ---[ end trace 0000000000000000 ]---
     Sep 17 02:24:09 SkyNet kernel: pstore: backend (efi_pstore) writing error (-28)
     Sep 17 02:24:09 SkyNet kernel: RIP: 0010:unraidd+0x1189/0x1280 [md_mod]
     Sep 17 02:24:09 SkyNet kernel: Code: 00 83 3d cd 32 00 00 03 7e 16 41 8b 56 98 89 e9 48 c7 c7 c1 a3 83 a1 48 8b 73 20 e8 d1 de 2e e0 41 f6 86 69 ff ff ff 02 75 02 <0f> 0b 48 8b 43 20 49 03 47 18 41 c7 46 b0 00 10 00 00 49 8b 56 10
     Sep 17 02:24:09 SkyNet kernel: RSP: 0018:ffffc90003c43da8 EFLAGS: 00010246
     Sep 17 02:24:09 SkyNet kernel: RAX: 0000000000000000 RBX: ffff8881792986c8 RCX: 0000000000000000
     Sep 17 02:24:09 SkyNet kernel: RDX: 0000000000000000 RSI: ffffffff82e46f60 RDI: ffff88810ae7be38
     Sep 17 02:24:09 SkyNet kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
     Sep 17 02:24:09 SkyNet kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff88815504b908
     Sep 17 02:24:09 SkyNet kernel: R13: ffff888179298810 R14: ffff888179298888 R15: ffff888179305220
     Sep 17 02:24:09 SkyNet kernel: FS: 0000000000000000(0000) GS:ffff88903f280000(0000) knlGS:0000000000000000
     Sep 17 02:24:09 SkyNet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     Sep 17 02:24:09 SkyNet kernel: CR2: 000000000053dd30 CR3: 0000000147f74004 CR4: 0000000000772ef0
     Sep 17 02:24:09 SkyNet kernel: PKRU: 55555554
     Sep 17 02:24:09 SkyNet kernel: ------------[ cut here ]------------
     Sep 17 02:24:09 SkyNet kernel: WARNING: CPU: 10 PID: 8327 at kernel/exit.c:886 do_exit+0x80/0x8c0
     Sep 17 02:24:09 SkyNet kernel: Modules linked in: nf_conntrack_netlink veth udp_diag wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha xt_nat xt_CHECKSUM ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle iptable_mangle ipvlan vhost_net tun vhost vhost_iotlb tap xt_conntrack xt_MASQUERADE nfnetlink xfrm_user xfrm_algo ip6table_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype md_mod zfs(PO) spl(O) ntfs3 tcp_diag inet_diag ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs af_packet cfg80211 rfkill 8021q garp mrp bridge stp llc bonding tls xe drm_gpuvm drm_exec gpu_sched drm_ttm_helper drm_suballoc_helper i915 intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm iosf_mbi drm_buddy crct10dif_pclmul crc32_pclmul crc32c_intel ttm ghash_clmulni_intel sha512_ssse3 sha256_ssse3 i2c_algo_bit sha1_ssse3 drm_display_helper aesni_intel crypto_simd cryptd drm_kms_helper rapl drm mei_hdcp
     Sep 17 02:24:09 SkyNet kernel: mei_pxp intel_cstate mpt3sas wmi_bmof intel_gtt mei_me i2c_i801 nvme agpgart e1000e intel_uncore i2c_smbus tpm_crb mei ahci nvme_core tpm_tis i2c_core raid_class tpm_tis_core scsi_transport_sas libahci tpm video thermal fan libaescfb ecdh_generic wmi ecc backlight acpi_tad acpi_pad button
     Sep 17 02:24:09 SkyNet kernel: CPU: 10 UID: 0 PID: 8327 Comm: unraidd0 Tainted: P D O 6.12.24-Unraid #1
     Sep 17 02:24:09 SkyNet kernel: Tainted: [P]=PROPRIETARY_MODULE, [D]=DIE, [O]=OOT_MODULE
     Sep 17 02:24:09 SkyNet kernel: Hardware name: ASUS System Product Name/PRIME Z790M-PLUS, BIOS 1810 12/02/2024
     Sep 17 02:24:09 SkyNet kernel: RIP: 0010:do_exit+0x80/0x8c0
     Sep 17 02:24:09 SkyNet kernel: Code: 24 74 04 75 13 b8 01 00 00 00 41 89 6c 24 60 48 c1 e0 22 49 89 44 24 70 4c 89 ef e8 7a d7 ab 00 48 83 bb 28 08 00 00 00 74 02 <0f> 0b 48 8b bb 40 07 00 00 e8 22 d6 ab 00 48 8b 83 38 07 00 00 83
     Sep 17 02:24:09 SkyNet kernel: RSP: 0018:ffffc90003c43ee0 EFLAGS: 00010286
     Sep 17 02:24:09 SkyNet kernel: RAX: 0000000000000001 RBX: ffff88810abdc300 RCX: 0000000000000000
     Sep 17 02:24:09 SkyNet kernel: RDX: 0000000000000001 RSI: 0000000000002710 RDI: 00000000ffffffff
     Sep 17 02:24:09 SkyNet kernel: RBP: 000000000000000b R08: 0000000000000000 R09: ffff88815039d400
     Sep 17 02:24:09 SkyNet kernel: R10: 0000000000000001 R11: 0000000000aaaaaa R12: ffff8881511f9540
     Sep 17 02:24:09 SkyNet kernel: R13: ffff888108530000 R14: 0000000000000002 R15: ffffffff82283b5a
     Sep 17 02:24:09 SkyNet kernel: FS: 0000000000000000(0000) GS:ffff88903f280000(0000) knlGS:0000000000000000
     Sep 17 02:24:09 SkyNet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     Sep 17 02:24:09 SkyNet kernel: CR2: 000000000053dd30 CR3: 0000000147f74004 CR4: 0000000000772ef0
     Sep 17 02:24:09 SkyNet kernel: PKRU: 55555554
     Sep 17 02:24:09 SkyNet kernel: Call Trace:
     Sep 17 02:24:09 SkyNet kernel: <TASK>
     Sep 17 02:24:09 SkyNet kernel: ? __pfx_md_thread+0x10/0x10 [md_mod]
     Sep 17 02:24:09 SkyNet kernel: ? kthread+0xec/0x100
     Sep 17 02:24:09 SkyNet kernel: make_task_dead+0x104/0x110
     Sep 17 02:24:09 SkyNet kernel: rewind_stack_and_make_dead+0x16/0x20
     Sep 17 02:24:09 SkyNet kernel: RIP: 0000:0x0
     Sep 17 02:24:09 SkyNet kernel: Code: Unable to access opcode bytes at 0xffffffffffffffd6.
     Sep 17 02:24:09 SkyNet kernel: RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX: 0000000000000000
     Sep 17 02:24:09 SkyNet kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
     Sep 17 02:24:09 SkyNet kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
     Sep 17 02:24:09 SkyNet kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
     Sep 17 02:24:09 SkyNet kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
     Sep 17 02:24:09 SkyNet kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
     Sep 17 02:24:09 SkyNet kernel: </TASK>
     Sep 17 02:24:09 SkyNet kernel: ---[ end trace 0000000000000000 ]---
     Sep 17 04:32:53 SkyNet kernel: pcieport 0000:00:1b.4: AER: Multiple Correctable error message received from 0000:00:1b.4
     Sep 17 04:32:53 SkyNet kernel: pcieport 0000:00:1b.4: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
     Sep 17 04:32:53 SkyNet kernel: pcieport 0000:00:1b.4: device [8086:7a44] error status/mask=00002001/00002000
     Sep 17 04:32:53 SkyNet kernel: pcieport 0000:00:1b.4: [ 0] RxErr (First)
     Sep 17 04:32:53 SkyNet kernel: mpt2sas_cm0: SAS host is non-operational !!!!
     Sep 17 04:32:54 SkyNet kernel: mpt2sas_cm0: SAS host is non-operational !!!!
     Sep 17 04:32:55 SkyNet kernel: mpt2sas_cm0: SAS host is non-operational !!!!
     Sep 17 04:32:56 SkyNet kernel: mpt2sas_cm0: SAS host is non-operational !!!!
     Sep 17 04:32:57 SkyNet kernel: mpt2sas_cm0: SAS host is non-operational !!!!
     Sep 17 04:32:58 SkyNet kernel: mpt2sas_cm0: SAS host is non-operational !!!!
     Sep 17 04:32:58 SkyNet kernel: mpt2sas_cm0: _base_fault_reset_work: Running mpt3sas_dead_ioc thread success !!!!
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:1:0: [sdc] tag#8818 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=DRIVER_OK cmd_age=3s
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:1:0: [sdc] tag#8818 CDB: opcode=0x35 35 00 00 00 00 00 00 00 00 00
     Sep 17 04:32:58 SkyNet kernel: I/O error, dev sdc, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:6:0: [sdh] tag#8819 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=DRIVER_OK cmd_age=3s
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:6:0: [sdh] tag#8819 CDB: opcode=0x35 35 00 00 00 00 00 00 00 00 00
     Sep 17 04:32:58 SkyNet kernel: I/O error, dev sdh, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:6:0: [sdh] tag#8820 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=DRIVER_OK cmd_age=0s
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:6:0: [sdh] tag#8820 CDB: opcode=0x88 88 00 00 00 00 02 00 2a da 60 00 00 00 08 00 00
     Sep 17 04:32:58 SkyNet kernel: I/O error, dev sdh, sector 8592743008 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
     Sep 17 04:32:58 SkyNet kernel: md: disk0 read error, sector=8592742944
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:1:0: [sdc] tag#8821 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=DRIVER_OK cmd_age=0s
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:1:0: [sdc] tag#8821 CDB: opcode=0x88 88 00 00 00 00 02 00 2a da 60 00 00 00 08 00 00
     Sep 17 04:32:58 SkyNet kernel: I/O error, dev sdc, sector 8592743008 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
     Sep 17 04:32:58 SkyNet kernel: md: disk5 read error, sector=8592742944
     Sep 17 04:32:58 SkyNet kernel: XFS (md5p1): log I/O error -5
     Sep 17 04:32:58 SkyNet kernel: XFS (md5p1): Filesystem has been shut down due to log error (0x2).
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:0:0: [sdb] Synchronizing SCSI cache
     Sep 17 04:32:58 SkyNet kernel: XFS (md5p1): Please unmount the filesystem and rectify the problem(s).
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=DRIVER_OK
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:1:0: [sdc] Synchronizing SCSI cache
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:1:0: [sdc] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=DRIVER_OK
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:2:0: [sdd] Synchronizing SCSI cache
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:2:0: [sdd] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=DRIVER_OK
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:3:0: [sde] Synchronizing SCSI cache
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:3:0: [sde] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=DRIVER_OK
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:4:0: [sdf] Synchronizing SCSI cache
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:4:0: [sdf] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=DRIVER_OK
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:5:0: [sdg] Synchronizing SCSI cache
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:5:0: [sdg] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=DRIVER_OK
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:6:0: [sdh] Synchronizing SCSI cache
     Sep 17 04:32:58 SkyNet kernel: sd 9:0:6:0: [sdh] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=DRIVER_OK
     I notice it mentions that the SAS host is non-operational. If that was the cause, I could see how it would affect all disks and make them report errors, but I dare not draw any conclusions without an Unraid adult in the room 😄
     skynet-diagnostics-20250917-1129.zip
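The question raised above (did one controller fault take out every disk at once?) can be checked mechanically. The following is a minimal Python sketch, not an Unraid tool: it assumes the syslog has been copied out of the diagnostics archive as plain text, and the function name `summarize` and both regular expressions are illustrative, written only against the two message formats visible in the excerpt above.

```python
import re
from collections import defaultdict

# Block-layer error lines as they appear in the excerpt, e.g.
# "Sep 17 04:32:58 SkyNet kernel: I/O error, dev sdc, sector 0 op 0x1:(WRITE) ..."
IO_ERROR = re.compile(
    r"^(\w{3}\s+\d+\s+[\d:]{8}) \S+ kernel: I/O error, dev (\w+),"
)
# HBA driver lines reporting that the controller itself stopped responding.
HBA_DOWN = re.compile(
    r"^(\w{3}\s+\d+\s+[\d:]{8}) \S+ kernel: (mpt\dsas_cm\d+): "
    r"SAS host is non-operational"
)


def summarize(lines):
    """Return (timestamp of first HBA fault or None,
    {timestamp: set of devices that logged an I/O error at that time})."""
    first_hba_fault = None
    errors_by_time = defaultdict(set)
    for line in lines:
        hba = HBA_DOWN.match(line)
        if hba:
            if first_hba_fault is None:
                first_hba_fault = hba.group(1)
            continue
        io = IO_ERROR.match(line)
        if io:
            errors_by_time[io.group(1)].add(io.group(2))
    return first_hba_fault, dict(errors_by_time)
```

Feeding it the lines of a saved syslog (for example `summarize(open("syslog.txt"))`, where `syslog.txt` is a placeholder filename) gives one timestamp for the first controller fault and, per second, the set of devices that failed. Several devices failing in the same second right after a controller fault would be consistent with the HBA, rather than the individual disks, being the common cause.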
  6. I forgot to mention I did a 7+ hour memtest as well, without errors! I can try one stick at a time. How long would you say is typically satisfactory before I move on to the next? A day or several? I am also suspicious of my CPU. I've seen it mentioned before in logs, but when looking it up I could never conclude it actually was the CPU. In BIOS I've tried to set everything to Intel defaults and turned XMP off completely. Do you think the CPU may need to be replaced completely, or are there ways to make it stable? I chose it as I wanted to be able to do 4K transcoding with QuickSync, but perhaps there are better, more stable options? Cheers for the quick response!
  7. Not sure as of yet, but I recently started digging into this and posted a new thread with more relevant information pertaining to my situation. I suspect my efforts so far haven't fully solved the matter, but it's possible some have reduced the severity at least. I found some errors and warnings in the syslog which I mention in my post below as well. Perhaps some of those can be found in your syslog too? My post is over here: https://forums.unraid.net/topic/193565-unraid-sporadically-becomes-unresponsive-until-forced-unclean-shutdownreboot/
  8. I've been having a recurring issue for well over a year now, though I believe it has become more and more frequent with time, to the point where the server can hardly stay online for more than a week or so before silently locking up and becoming unresponsive and unreachable (including Docker containers). As far as Unraid and Linux in general go I am less than capable most of the time, but I've somewhat acquainted myself with Docker throughout the years, which thankfully is mostly what I run on this server. This past week or so I've decided to start digging for real to bring stability back to the server. The most glaring finding, now that I've realized the corruption risks of unclean shutdowns, was that the single SSD cache (i.e. no RAID) that's been in use has seen 71 unclean shutdowns. I've since done the following to address everything I've managed to understand:
     - Rebuilt the docker.img (it was likely corrupted; I seem to remember having to increase its size to 128 GB since 64 GB was too small, but as we speak Docker now takes 18 GB).
     - Removed the invoke-ai Docker container (AI image generation using the Nvidia GPU).
     - Removed 2 node-js containers.
     - Set CPU and memory limits on all containers.
     - Deleted the libvirt.img (VM) and removed the single VM I used for Perforce version control (this image, like the docker.img, was stored on the SSD cache and perhaps corrupted too).
     - Bought a second 1 TB NVMe SSD, installed it, backed up the remaining cache data (appdata and another share with Docker data), formatted both SSDs as btrfs, recreated the cache pool and set it to RAID1.
     - Bought a UPS (APC 2200 VA, 1320 W, Model: BGM2200B-GR), installed the NUT plugin and successfully tested the automatic shutdown functionality.
     - Inserted the boot flash drive in my Windows machine and scanned for corruption but found none (Right click drive > Tools > Error Checking > Check).
     After those changes, a day or so after a restart, I noticed the parity check getting stuck again (at 9.0% by the time I went to bed and still 9.0% at lunch the following day), as I've seen before. This time the NUT plugin was no longer showing the UPS status and wouldn't turn on anymore, complaining about stale data and suggesting the pollonly flag as a possible solution. I tried the suggestion but NUT would not come back on. Instead I decided to go for a reboot, as I was suspecting the old issues. As usual when parity checks have gotten stuck before, trying to reboot got the server hard locked, requiring me to force a shutoff after at least 30 minutes of waiting for graceful approaches (including pressing the physical power button once, which would make the console log start outputting via HDMI again). The last line I'd see it get stuck on was something like "mdcmd: stop". I apologize for the wall of text, I just want to make sure I cover anything that may be of help! Since this unclean shutdown I've uninstalled the Nvidia driver plugin and turned off my ollama Docker container (since it uses the GPU for LLM purposes). I know I've seen some kernel error saying something like "ALREADY_EXISTS" and I read it was GPU related. In my case though I'm hoping this is something I can turn back on unless it proves to be the reason for my issues. I've seen CPU errors in the syslog saying Unraid was tainted due to a proprietary module, which also made me think it was the Nvidia GPU.
     As we speak, the syslog doesn't have any errors left besides a warning I found eerie, saying:
     Initramfs unpacking failed: XZ-compressed data is corrupt
     But then I just went in again to look and found the same UPS error as mentioned above. This time the NUT GUI still seems to hold up, but I'll post it here just in case it bears any importance:
     Sep 16 17:28:31 SkyNet usbhid-ups[4339]: nut_libusb_get_report: Input/Output Error
     Sep 16 17:28:33 SkyNet usbhid-ups[4339]: Reconnecting. If you saw "nut_libusb_get_interrupt: Input/Output Error" or similar message in the log above, try setting "pollonly" flag in "ups.conf" options section for this driver!
     To conclude and clarify, my main goal is to get this server stable, running the Docker containers without silently entering a state of unresponsiveness. I'd greatly appreciate your help. While I can't confirm things are still as bad as before this last week, I suspect they are, and evidently the problem is not yet fully addressed! Thank you kindly in advance for any help!
     skynet-diagnostics-20250916-1748.zip
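On the "tainted due to a proprietary module" observation: the kernel prints the taint letters on the `Tainted:` line of an oops header, and marks each contributing module with the same letters in parentheses in the `Modules linked in:` list. The Python sketch below pulls both out of a log line. It is illustrative only: the function names are made up here, the regular expressions are written against the whitespace-collapsed lines as they appear in this thread, and the three meanings are the ones the kernel itself spells out in these logs (`[P]=PROPRIETARY_MODULE`, `[O]=OOT_MODULE`, `[D]=DIE`).

```python
import re

# Only the three taint letters that appear in this thread's logs.
TAINT_MEANINGS = {
    "P": "proprietary module loaded",
    "O": "out-of-tree module loaded",
    "D": "kernel died (oops or BUG)",
}

# "Tainted: P D O 6.12.24-Unraid #1" -> captures "P D O"
TAINT_LINE = re.compile(r"Tainted: ([A-Z ]+?) \d")
# "zfs(PO)" -> ("zfs", "PO")
TAINTING_MODULE = re.compile(r"(\w+)\(([A-Z]+)\)")


def taint_flags(line):
    """Return the taint letters on an oops header line, or [] if none."""
    match = TAINT_LINE.search(line)
    if not match:
        return []
    return [letter for letter in match.group(1) if letter != " "]


def tainting_modules(line):
    """Map module name -> taint letters from a 'Modules linked in:' line."""
    return dict(TAINTING_MODULE.findall(line))
```

Run against the kernel trace from 17 September quoted elsewhere on this page, `tainting_modules` would report `zfs` and `spl`, and that module list contains no Nvidia module. That suggests the `P` and `O` flags on that boot came from the ZFS modules rather than from a GPU driver, so the taint by itself may say little about the Nvidia card.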
  9. I've been experiencing this type of unresponsive outage for well over a year now, maybe two. But they've recently become increasingly common: often only days apart now instead of months like before. No doubt I've installed a few more Docker containers over the years, two of which I've had reason to believe were problematic at some point, though I had no conclusive evidence (ollama and invokeai). My docker.img file is rather large. I don't think much else sticks out except an Nvidia GPU (used solely for the invokeai Docker container). I use the Intel iGPU for everything else (or such is the intention at least). Did anyone resolve this issue? Hearing that reseating the ethernet cable seemed to bring the server back from limbo makes me wonder if it's network related? Perhaps hardware or BIOS related, if the ethernet port was unresponsive until unplugged (and plugged back in)? I'm sure I'm not alone in wanting the server to be up and reliable for any extended period of time when one might not be able to physically maintain it (be it vacation or whatnot). Sorry for necroing, but this is an issue that I reckon would be helpful for a sizeable portion of the community to get to the bottom of. Regards
  10. I've been noticing this since updating to 6.12.13. Docker container update checks and plugin checks take substantially longer than before, although I do not remember if it started immediately, since it took a while before I realized update checks were consistently slower. Most everything else seems to be running as usual. I have a stable and fast internet connection, and anything outside of Unraid is seemingly unaffected and resolves DNS etc. just fine (I use Pi-hole to Cloudflare IPv4).
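One way to narrow down a slowdown like the one described above is to time name lookups on the affected machine and compare them with another machine on the same network. The Python sketch below does only that. It is a generic illustration under stated assumptions: which hosts Unraid actually contacts for update checks is not stated in the post, so any hostnames passed in are placeholders, and slow DNS is just one of several possible causes.

```python
import socket
import time


def time_lookup(host, resolver=socket.getaddrinfo):
    """Time one name lookup; return (succeeded, seconds elapsed)."""
    start = time.perf_counter()
    try:
        resolver(host, 443)
        ok = True
    except OSError:  # socket.gaierror is a subclass of OSError
        ok = False
    return ok, time.perf_counter() - start


def slowest(hosts, resolver=socket.getaddrinfo):
    """Look up every host once and return (host, seconds) for the slowest.

    Expects a non-empty list of hostnames.
    """
    timings = [(host, time_lookup(host, resolver)[1]) for host in hosts]
    return max(timings, key=lambda pair: pair[1])
```

The `resolver` parameter exists only so the timing logic can be exercised without network access. In normal use it is left at its default, which goes through the system resolver and therefore through whatever DNS server (here, Pi-hole) the machine is configured to use.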
  11. Thank you I'll take a look at that! In this particular instance (copy between shares via SATA) the NIC shouldn't matter though right?
  12. Yeah I temporarily turned off the cache for the two shares in question (shares: server-migration -> media) since my cache is only 1TB. My intention is then to have a cache -> array setup for the media share. As of just now it looks like the copy transaction is finally finished although it still confuses me greatly what caused it to prolong for roughly 36 hours. I will take your advice and turn off mover logging. As for this type of stalling I'm wondering if there can be some hardware level caching going on in the mechanical HDDs that could stall once the supposed cache is full? It will probably be quite a while until I make another copy transaction this large but I'd nevertheless like to know what went wrong or was overlooked this time.
  13. I've run into something I don't quite understand and was hoping someone here has some knowledge on the matter. I've been trying to complete my server migration by copying large chunks of backed-up data from an isolated share (its disk will be made into a parity disk when done). This has worked great, seeing as the last job copied ~3.5TB. For this latest job, however, I chose to copy the remaining ~7.15TB in one go, and it has shown some strange behaviour which I can't make out: is it purely a UI issue or is something else going on? Specifically, after quite some time, the progress and transfer speed stop updating in the webGUI, even though the animated circle icon keeps spinning, suggesting the UI hasn't frozen completely. Multiple times I have waited this stall out and eventually, hours later, it starts updating again, but last time I noticed it regressed from 100% to 92% (or likely much less, but I noticed it at 92%). During these stalls I have noticed the disks are still reading and writing, as shown in the disk array view, so I've assumed it's just a disconnect between the UI and the actual copy job. Now I'm not so sure anymore and am starting to worry that I'm stuck in a loop and need to cancel and redo the copy. For clarity, I'm using the built-in copy feature. Does anyone have an idea what this phenomenon is? Diagnostics from today, taken during the screenshotted stall at 100%, are provided. skynet-diagnostics-20230727-1534.zip
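When the webGUI progress stalls like this, one independent check is to measure how much data has actually landed in the destination and whether that number is still growing. The following Python sketch is a generic illustration, not part of Unraid's copy feature: the function names are made up, and the path in the usage note is an assumption based on the share name mentioned in this thread.

```python
import os


def tree_size(root):
    """Total size in bytes of all regular files under root.

    Symbolic links are skipped so nothing is counted twice.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # A file can vanish or be renamed while a copy is running.
                pass
    return total


def throughput(size_before, size_after, seconds):
    """Average bytes per second written between two size samples."""
    return (size_after - size_before) / seconds
```

Sampling `tree_size("/mnt/user/media")` (a hypothetical destination path) twice, a few minutes apart, and passing both values to `throughput` shows whether the copy is still making progress, regardless of what the progress bar reports. A result near zero over several samples would point to a genuine stall rather than a display issue. Walking a multi-terabyte tree is itself slow and adds disk load, so widely spaced samples are advisable.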
  14. All right, it's time to post a resolution to this, as I've just gotten to the point of rediscovering all my old drives via my new HBA card. I can't really say why my LSI 9300-8i didn't work, so my theories may be of little worth. A stab in the dark would be that SAS3 devices are not as widely compatible with consumer-grade motherboards, and so no matter which board I used the card was not recognized at all at the hardware level. To the happy news! I've recently installed my newly arrived LSI SAS 9207-8i HBA card into a PCIe Gen4 slot on my ASUS PRIME Z790M-PLUS, and I can happily confirm the drives are now discovered and the disk array is complete and ready to start up again. I have far from tested this setup to any extent but intend to in the coming days. Should anyone be in the same predicament I was in, however, I can now at least offer my step forward from that. Fingers crossed it's smooth sailing from here!
  15. I'd also like to chime in here and inform that Fast Boot was indeed my issue as well. Nothing less than disabling it fully ended in a successful outcome. Thank you kindly for sharing this!
