Craig Dennis

Members
  • Posts

    41

Everything posted by Craig Dennis

  1. After removing two sticks of RAM I was still experiencing crashes. I replaced them with the other two sticks and have been up for 24 hours. Memtest reported no errors. I also thought the Netdata docker was causing issues due to some errors in the log:
     Sep 22 21:22:32 Sakaar kernel: netdata[23864]: segfault at 30 ip 0000149933ef42c0 sp 000014992f5af500 error 4 in ld-musl-x86_64.so.1[149933eb1000+4c000] likely on CPU 9 (core 1, socket 0)
     Sep 22 21:22:32 Sakaar kernel: Code: 29 45 31 c0 31 c9 31 d2 4c 89 e6 bf 0f 00 00 00 31 c0 e8 d1 80 fc ff 89 c3 85 c0 0f 84 83 00 00 00 e8 7a 4c fc ff 8b 18 eb 7a <8b> 4d 30 48 8d 5c 24 0e be 22 00 00 00 31 c0 48 8d 15 aa db 03 00
     I re-enabled it with the second pair of RAM sticks and have had no issues.
  2. This is all too familiar. 1st pass memtest with no issues. I'll remove 2 RAM sticks. Thanks for the help (again).
  3. I ran that a few weeks ago. I'll run it again. Could this also indicate a CPU hardware fault?
  4. How did you identify the faulty disk? Mine are all showing as good.
  5. I am having issues after my server has been running for 30 mins to an hour. The UI crashes completely and I get output to the screen showing `tainted` and other worrying things. I have been having issues with overheating so I'm wondering if I have damaged the CPU somehow. Services still appear to be running (e.g. I can still access Plex) but not the Unraid web UI. Can someone help me decipher the errors please? Attached are the diagnostics. Below is also the output from the syslog mirrored to the flash drive (full log from today attached as well).
     Sep 22 10:24:16 Sakaar kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
     Sep 22 10:24:16 Sakaar kernel: #PF: supervisor read access in kernel mode
     Sep 22 10:24:16 Sakaar kernel: #PF: error_code(0x0000) - not-present page
     Sep 22 10:24:16 Sakaar kernel: PGD 0 P4D 0
     Sep 22 10:24:16 Sakaar kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
     Sep 22 10:24:16 Sakaar kernel: CPU: 5 PID: 23791 Comm: python3 Tainted: P U O 6.1.49-Unraid #1
     Sep 22 10:24:16 Sakaar kernel: Hardware name: To Be Filled By O.E.M. Z590M Pro4/Z590M Pro4, BIOS P2.20 06/06/2022
     Sep 22 10:24:16 Sakaar kernel: RIP: 0010:get_mmap_base+0xe/0x47
     Sep 22 10:24:16 Sakaar kernel: Code: ff ff 48 8d 73 38 49 89 e8 4c 89 e1 48 8d 7b 30 5b 48 89 c2 5d 41 5c e9 5e fe ff ff 0f 1f 44 00 00 65 48 8b 14 25 c0 cb 01 00 <f6> 42 10 02 48 8b 82 f8 03 00 00 75 0d 85 ff 74 1f 48 8b 40 28 c3
     Sep 22 10:24:16 Sakaar kernel: RSP: 0018:ffffc9002c64bd60 EFLAGS: 00010246
     Sep 22 10:24:16 Sakaar kernel: RAX: 0000000000000000 RBX: 0000000000009000 RCX: 0000000000000000
     Sep 22 10:24:16 Sakaar kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
     Sep 22 10:24:16 Sakaar kernel: RBP: 0000000000000000 R08: 0000000000000022 R09: 0000000000000000
     Sep 22 10:24:16 Sakaar kernel: R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000022
     Sep 22 10:24:16 Sakaar kernel: R13: 0000000000000000 R14: ffff8883fad30cc0 R15: 0000000000000000
     Sep 22 10:24:16 Sakaar kernel: FS: 00001523c6161b48(0000) GS:ffff88904f740000(0000) knlGS:0000000000000000
     Sep 22 10:24:16 Sakaar kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     Sep 22 10:24:16 Sakaar kernel: CR2: 0000000000000000 CR3: 000000065009a005 CR4: 0000000000770ee0
     Sep 22 10:24:16 Sakaar kernel: PKRU: 55555554
     Sep 22 10:24:16 Sakaar kernel: Call Trace:
     Sep 22 10:24:16 Sakaar kernel: <TASK>
     Sep 22 10:24:16 Sakaar kernel: ? __die_body+0x1a/0x5c
     Sep 22 10:24:16 Sakaar kernel: ? page_fault_oops+0x329/0x376
     Sep 22 10:24:16 Sakaar kernel: ? do_user_addr_fault+0x12e/0x48d
     Sep 22 10:24:16 Sakaar kernel: ? exc_page_fault+0xfb/0x11d
     Sep 22 10:24:16 Sakaar kernel: ? asm_exc_page_fault+0x22/0x30
     Sep 22 10:24:16 Sakaar kernel: ? get_mmap_base+0xe/0x47
     Sep 22 10:24:16 Sakaar kernel: arch_get_unmapped_area_topdown+0xdd/0x1b2
     Sep 22 10:24:16 Sakaar kernel: ? preempt_latency_start+0x1e/0x46
     Sep 22 10:24:16 Sakaar kernel: get_unmapped_area+0xc4/0x14f
     Sep 22 10:24:16 Sakaar kernel: do_mmap+0x110/0x428
     Sep 22 10:24:16 Sakaar kernel: vm_mmap_pgoff+0xbb/0x112
     Sep 22 10:24:16 Sakaar kernel: ksys_mmap_pgoff+0x138/0x166
     Sep 22 10:24:16 Sakaar kernel: do_syscall_64+0x68/0x81
     Sep 22 10:24:16 Sakaar kernel: entry_SYSCALL_64_after_hwframe+0x64/0xce
     Sep 22 10:24:16 Sakaar kernel: RIP: 0033:0x1523c61001c2
     Sep 22 10:24:16 Sakaar kernel: Code: f6 c1 10 74 0f 4c 89 4c 24 08 e8 d5 fc 01 00 4c 8b 4c 24 08 48 63 d5 4c 63 d3 4d 63 c6 b8 09 00 00 00 4c 89 e7 4c 89 ee 0f 05 <48> 89 c7 48 83 f8 ff 75 20 4d 85 e4 75 1b 83 e3 30 48 c7 c0 f4 ff
     Sep 22 10:24:16 Sakaar kernel: RSP: 002b:00007ffedcd4b860 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
     Sep 22 10:24:16 Sakaar kernel: RAX: ffffffffffffffda RBX: 0000000000000022 RCX: 00001523c61001c2
     Sep 22 10:24:16 Sakaar kernel: RDX: 0000000000000003 RSI: 0000000000009000 RDI: 0000000000000000
     Sep 22 10:24:16 Sakaar kernel: RBP: 0000000000000003 R08: ffffffffffffffff R09: 0000000000000000
     Sep 22 10:24:16 Sakaar kernel: R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000000
     Sep 22 10:24:16 Sakaar kernel: R13: 0000000000009000 R14: 00000000ffffffff R15: 00001523c615fb00
     Sep 22 10:24:16 Sakaar kernel: </TASK>
     Sep 22 10:24:16 Sakaar kernel: Modules linked in: wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb tap ipvlan veth xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter xfs md_mod tcp_diag inet_diag nct6775 nct6775_core hwmon_vid ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs af_packet bridge 8021q garp mrp stp llc ixgbe xfrm_algo mdio e1000e zfs(PO) i915 zunicode(PO) intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal zzstd(O) intel_powerclamp coretemp kvm_intel zlua(O) zavl(PO) kvm icp(PO) iosf_mbi drm_buddy i2c_algo_bit ttm drm_display_helper drm_kms_helper mei_pxp mei_hdcp crct10dif_pclmul crc32_pclmul crc32c_intel zcommon(PO) drm ghash_clmulni_intel sha512_ssse3
     Sep 22 10:24:16 Sakaar kernel: aesni_intel znvpair(PO) mei_me intel_gtt crypto_simd spl(O) cryptd wmi_bmof intel_cstate mpt3sas agpgart intel_uncore i2c_i801 nvme i2c_smbus raid_class sr_mod i2c_core mei ahci nvme_core scsi_transport_sas cdrom input_leds joydev led_class libahci syscopyarea sysfillrect sysimgblt fb_sys_fops video tpm_crb tpm_tis tpm_tis_core wmi tpm backlight intel_pmc_core acpi_tad acpi_pad button unix [last unloaded: xfrm_algo]
     Sep 22 10:24:16 Sakaar kernel: CR2: 0000000000000000
     Sep 22 10:24:16 Sakaar kernel: ---[ end trace 0000000000000000 ]---
     Sep 22 10:24:16 Sakaar kernel: RIP: 0010:get_mmap_base+0xe/0x47
     Sep 22 10:24:16 Sakaar kernel: Code: ff ff 48 8d 73 38 49 89 e8 4c 89 e1 48 8d 7b 30 5b 48 89 c2 5d 41 5c e9 5e fe ff ff 0f 1f 44 00 00 65 48 8b 14 25 c0 cb 01 00 <f6> 42 10 02 48 8b 82 f8 03 00 00 75 0d 85 ff 74 1f 48 8b 40 28 c3
     Sep 22 10:24:16 Sakaar kernel: RSP: 0018:ffffc9002c64bd60 EFLAGS: 00010246
     Sep 22 10:24:16 Sakaar kernel: RAX: 0000000000000000 RBX: 0000000000009000 RCX: 0000000000000000
     Sep 22 10:24:16 Sakaar kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
     Sep 22 10:24:16 Sakaar kernel: RBP: 0000000000000000 R08: 0000000000000022 R09: 0000000000000000
     Sep 22 10:24:16 Sakaar kernel: R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000022
     Sep 22 10:24:16 Sakaar kernel: R13: 0000000000000000 R14: ffff8883fad30cc0 R15: 0000000000000000
     Sep 22 10:24:16 Sakaar kernel: FS: 00001523c6161b48(0000) GS:ffff88904f740000(0000) knlGS:0000000000000000
     Sep 22 10:24:16 Sakaar kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     Sep 22 10:24:16 Sakaar kernel: CR2: 0000000000000000 CR3: 000000065009a005 CR4: 0000000000770ee0
     Sep 22 10:24:16 Sakaar kernel: PKRU: 55555554
     Sep 22 10:24:16 Sakaar kernel: note: python3[23791] exited with irqs disabled
     sakaar-diagnostics-20230922-1257.zip syslog
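For anyone wading through a mirrored syslog like the one above, here is a minimal sketch of pulling out just the kernel lines that usually mark a crash. The marker list and the `find_kernel_events` helper are my own illustrative assumptions, not an Unraid tool, and the regex assumes the classic `Mon DD HH:MM:SS host kernel:` prefix seen in this log.

```python
import re

# Substrings that commonly show up in kernel crash reports. This list is an
# illustrative assumption, not an exhaustive catalogue.
MARKERS = ("BUG:", "Oops:", "Tainted:", "Call Trace:", "segfault at")

# Matches lines such as "Sep 22 10:24:16 Sakaar kernel: <message>".
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\skernel:\s(?P<msg>.*)$"
)


def find_kernel_events(text):
    """Return (timestamp, message) pairs for kernel lines containing a marker."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and any(marker in match.group("msg") for marker in MARKERS):
            events.append((match.group("ts"), match.group("msg")))
    return events


SAMPLE = """\
Sep 22 10:24:16 Sakaar kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Sep 22 10:24:16 Sakaar kernel: #PF: supervisor read access in kernel mode
Sep 22 10:24:16 Sakaar kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Sep 22 10:24:16 Sakaar kernel: CPU: 5 PID: 23791 Comm: python3 Tainted: P U O 6.1.49-Unraid #1
"""

for ts, msg in find_kernel_events(SAMPLE):
    print(ts, "|", msg)
```

To run it against a real file, read the file's text and pass it to `find_kernel_events`. It only narrows down where to look; it does not say anything about the root cause.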
  6. Mine was related to overheating. Mover caused the NVMe and HDDs to run hot and overheat the CPU, causing a shutdown. The BIOS meant it would start up again. Solved for me. I used NetData to track all temps; some were _much_ higher than the Unraid Temp addon was reporting.
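As a cross-check on what a dashboard reports, a rough sketch of reading temperatures straight from the Linux hwmon sysfs tree, where `temp*_input` files hold millidegrees Celsius. The function names and the 80 °C limit are assumptions for illustration only; which sensors appear depends on the drivers loaded.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"chip/label": degrees_C} for every temp*_input file under root."""
    temps = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for sensor in sorted(chip.glob("temp*_input")):
            # An optional temp*_label file gives a friendlier sensor name.
            label_file = chip / sensor.name.replace("_input", "_label")
            if label_file.exists():
                label = label_file.read_text().strip()
            else:
                label = sensor.name[: -len("_input")]
            try:
                millidegrees = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # some sensors fail to read; skip them
            temps[f"{chip_name}/{label}"] = millidegrees / 1000.0
    return temps


def over_threshold(temps, limit_c=80.0):
    """Return only the sensors at or above limit_c (an assumed threshold)."""
    return {name: value for name, value in temps.items() if value >= limit_c}
```

Calling `over_threshold(read_hwmon_temps())` while Mover is running would show which sensors, if any, are running hot at that moment.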
  7. My issue has been resolved but what an issue. I had one bad RAM module as well as intermittent overheating. No wonder it was so difficult to isolate.
  8. It looks like I had one bad RAM stick that has subsequently failed. Possibly due to overheating.
  9. I have commented on other posts because my server seems to crash when Plex or Jellyfin are scanning files or running scheduled maintenance. It also seems to crash while trying to install Windows on a VM. There was nothing in the logs which points to a hardware issue but I've swapped almost everything out. I recently reduced my RAM speed to 2666 based on a suggestion from another post and now it's crashing more frequently. I managed to capture the screen as the server crashed just now and caught the following error. WARNING: CPU: 9 PID: 0 at arch/x86/kernel/fpu/core.c:424 kernel_fpu_begin_mask+0x30/0xcc And then the server crashed again while trying to download the syslog. I'm currently trying to diagnose the issue but can't find results here or on the wider internet regarding the message. Has the RAM speed change made things worse? Or merely exposed the RAM as the potential hardware issue? I ran memtest a few days ago and all was clear. Has anyone encountered this before? sakaar-diagnostics-20230811-2237.zip
  10. I just installed Jellyfin and went through the setup process and the initial scanning killed the server. Again nothing in the logs. This suggests the issue _is_ at least partly a hardware issue. All I have left are the CPU and motherboard but both work fine unless intensely scanning files. I have tried:
      - New power supply
      - New cables
      - Memtest (clear)
      - Safe mode (with and without starting the array)
      - Multiple different Plex docker images
      - None of the HDD or NVMe are reporting errors
      I have a new motherboard and the CPU appears to work fine (hw transcoding fine). Could this be a HDD issue? I can run for days with no crash until I start the Plex docker. Then scheduled maintenance kills the server. Complete panic and reboot but nothing in the logs.
      Aug 10 15:18:08 Sakaar kernel: eth0: renamed from veth8e9cec0
      Aug 10 15:18:08 Sakaar kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth2823d2e: link becomes ready
      [crash occurred at approx 15:58]
      Aug 10 15:59:33 Sakaar kernel: microcode: microcode updated early to revision 0x56, date = 2022-08-02
      Aug 10 15:59:33 Sakaar kernel: Linux version 6.1.38-Unraid (root@Develop-612) (gcc (GCC) 12.2.0, GNU ld version 2.40-slack151) #2 SMP PREEMPT_DYNAMIC Mon Jul 10 09:50:25 PDT 2023
      Attached is my syslog (mirrored to the Flash drive) and diagnostics in case someone can see something I can't.
      EDIT: Jellyfin shows some failures in the logs prior to the crash but there are a few of the same type and I'm not sure that it could cause a crash.
      MediaBrowser.Common.FfmpegException: ffmpeg image extraction failed for file:"/media/tv/Taskmaster/Season 12/Taskmaster (2015) - S12E01 - An Imbalance in the Poppability [WEBDL-1080p][AAC 2.0][x264]-NTb.mkv"
        at MediaBrowser.MediaEncoding.Encoder.MediaEncoder.ExtractImageInternal(String inputPath, String container, MediaStream videoStream, Nullable`1 imageStreamIndex, Nullable`1 threedFormat, Nullable`1 offset, Boolean useIFrame, Nullable`1 targetFormat, CancellationToken cancellationToken)
        at MediaBrowser.MediaEncoding.Encoder.MediaEncoder.ExtractImage(String inputFile, String container, MediaStream videoStream, Nullable`1 imageStreamIndex, MediaSourceInfo mediaSource, Boolean isAudio, Nullable`1 threedFormat, Nullable`1 offset, Nullable`1 targetFormat, CancellationToken cancellationToken)
        at Emby.Server.Implementations.MediaEncoder.EncodingManager.RefreshChapterImages(Video video, IDirectoryService directoryService, IReadOnlyList`1 chapters, Boolean extractImages, Boolean saveChapters, CancellationToken cancellationToken)
      sakaar-diagnostics-20230810-1602.zip syslog (15)
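When a crash leaves nothing in the log, the boot banner itself is the only marker. A minimal sketch, assuming the mirrored syslog keeps the `Mon DD HH:MM:SS` prefix and that `kernel: Linux version` starts each boot, of listing the last line written before every reboot and the size of the gap. The helper names are hypothetical, and because syslog timestamps carry no year, the year is passed in and logs spanning New Year are not handled.

```python
import re
from datetime import datetime

TS_RE = re.compile(r"^(\w{3}\s+\d+\s\d{2}:\d{2}:\d{2})\s")


def parse_ts(line, year=2023):
    """Parse the leading syslog timestamp, or return None if there is none."""
    match = TS_RE.match(line)
    if not match:
        return None
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")


def find_reboots(lines, boot_marker="kernel: Linux version", year=2023):
    """Return (last_line_before_boot, boot_time, gap) for each boot banner."""
    stamped = [(parse_ts(line, year), line.rstrip("\n")) for line in lines]
    stamped = [(ts, line) for ts, line in stamped if ts is not None]
    reboots = []
    for i, (boot_ts, line) in enumerate(stamped):
        if boot_marker not in line:
            continue
        # Walk back to the nearest line logged strictly before the boot second,
        # so early boot messages (e.g. microcode) are not mistaken for it.
        for prev_ts, prev_line in reversed(stamped[:i]):
            if prev_ts < boot_ts:
                reboots.append((prev_line, boot_ts, boot_ts - prev_ts))
                break
    return reboots


SAMPLE_LOG = """\
Aug 10 15:18:08 Sakaar kernel: eth0: renamed from veth8e9cec0
Aug 10 15:18:08 Sakaar kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth2823d2e: link becomes ready
Aug 10 15:59:33 Sakaar kernel: microcode: microcode updated early to revision 0x56, date = 2022-08-02
Aug 10 15:59:33 Sakaar kernel: Linux version 6.1.38-Unraid (root@Develop-612) (gcc (GCC) 12.2.0, GNU ld version 2.40-slack151) #2 SMP PREEMPT_DYNAMIC Mon Jul 10 09:50:25 PDT 2023
"""

for last_line, boot_ts, gap in find_reboots(SAMPLE_LOG.splitlines()):
    print(f"boot at {boot_ts}, silent for {gap}, last line: {last_line}")
```

A long silent gap before the banner is consistent with a hard lock-up or power loss rather than a clean shutdown, although an idle server can also simply log nothing for a while.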
  11. Good to know. I'm also on 6.12.2 so may not be the same issue. I have noticed a new parity check accumulation feature (I don't know when this was added) which may have been conflicting with the Parity Check Tuning plugin. I only recently enabled Mover logging so we'll see. I've manually initiated Mover now and have also changed the scheduled time to see if the crashes also change. It could also be hardware. With nothing in logs it's difficult to tell. The constant hard reboots since the 6.12 i915 issues could have damaged something.
  12. My server has been crashing consistently in the middle of the night (full reboot) and Mover is the only thing scheduled to run at that time. The last few log entries are related to Mover and Parity Check Tuning. I have removed that plugin and will see if the crashes persist. No errors in syslog. Also ordered a new PSU to eliminate that. I don't really want to run Memtest but I guess I'll have to if none of these work.
  13. Unraid is now crashing every night at around 3am (corresponding to the scheduled maintenance window - periodic scanning disabled). I'm going to change that and see if the reboots change as well. Currently running:
      Unraid 6.12.2
      Plex 4.108.0
      I have ordered a new power supply to rule that out.
  14. Ok it seems like every time the server transcodes something it crashes. After several hours of stability I tried to watch something on my phone, opened Plex Dash app and it was a spinning wheel. Then I got a notification saying the server was offline (from Home Assistant). I wonder if this is a CPU issue or a RAM issue triggered by Plex transcoding. It does transcode, and shows hw in the details. This time it stayed off instead of rebooting. I’m not sure why it was crashing every hour before and disabling/enabling the scanner fixed that. More testing needed.
  15. No crash yet after 3 hours (over 2 since enabling hourly scan) so maybe not Plex library scan. I’ll enable a new Docker every couple of hours to see if it happens again.
  16. Unraid 6.12.1 Plex 4.1.08 (official) My server was rebooting every hour (or thereabouts) and it used to be every day (for the last week) and I realised I’d changed the library scan interval in Plex. I ran Unraid in safe mode and turned off Docker etc.: no reboots. Enabled Docker and Plex: reboots every hour. Turned off scheduled library scan and uptime was 8 hours plus (overnight) so I’m concluding that’s the culprit. I’ve just re-enabled library scan for hourly so we’ll see if it crashes in an hour. I have a ping sensor on Home Assistant so I can clearly see when it went down and came back. Any ideas what would cause this behaviour? Next step is a fresh Plex install in a new container.
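To check whether the reboots really line up with an hourly schedule, a small sketch of turning ping-sensor samples into outage windows and the spacing between them. The `(timestamp, is_up)` sample format and both function names are assumptions for illustration; this is not how Home Assistant exports its history.

```python
from datetime import datetime


def outage_windows(samples):
    """samples: chronological (datetime, is_up) pairs.

    Returns a list of (down_at, up_at); up_at is None if still down at the end.
    """
    windows = []
    down_at = None
    for ts, is_up in samples:
        if not is_up and down_at is None:
            down_at = ts  # first failed ping of a new outage
        elif is_up and down_at is not None:
            windows.append((down_at, ts))
            down_at = None
    if down_at is not None:
        windows.append((down_at, None))
    return windows


def intervals_between_outages(windows):
    """Time between the start of each outage and the start of the next."""
    starts = [down_at for down_at, _ in windows]
    return [later - earlier for earlier, later in zip(starts, starts[1:])]
```

If the intervals cluster around one hour, that supports the scheduled-task theory; irregular spacing points back towards hardware or load.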
  17. Are you concluding from this crash that the issue is with either the Docker service itself or a container you're running? Would disabling 'auto start' for all containers and rebooting determine if it's a container or the service itself (if it stays up for 4+ hours)? I'm by no means an expert but am experiencing similar issues with 6.12.
  18. @ich777 I appreciate your help. So what are the steps I actually need to perform? The issue is still happening on 6.12.1. I removed everything I entered previously (I think) and then I went back through the steps you mentioned earlier: libkmod: kmod_config_parse: /etc/modprobe.d/i915.conf line 1: ignoring bad line starting with 'enable_dc=0' Attached is diagnostics from just after the reboot. I have set up a syslog server and mirrored it to flash; also attached. I don't see any other errors though (like power_well). sakaar-diagnostics-20230622-2129.zip syslog (1)
  19. Fair. I'm just trying to follow instructions from the thread. Sorry, I am mistaken. I am on 6.12.0 (I forgot I updated when it became available). Where are the contents from this file? They should be: options i915 enable_dc=0 (please note that if you are using the default editor from OSX it will destroy the formatting and Linux can't read it) The contents have only ever shown `enable_dc=0` but the OSX formatting makes sense (as the error never occurred until today when I ssh'd in)
  20. Running rc8 and for whatever reason I am unable to make `enable_dc=0` stick. @ich777 I must be doing something wrong. It resets to -1 every reboot. When I run through the commands again (remove/rescan etc.) I get the following error: libkmod: kmod_config_parse: /etc/modprobe.d/i915.conf line 1: ignoring bad line starting with 'enable_dc=0' I have it applied via `/boot/config/modprobe.d/i915.conf`. I also have it in the `syslinux.config` in the flash page. But whenever I reboot and `cat /sys/module/i915/parameters/enable_dc` I get `-1`.
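The libkmod message is consistent with a line that lacks the `options i915` prefix, or with a file saved using non-Unix line endings. A rough sketch of checking both before rebooting follows. The command list is my reading of the modprobe.d man page and may be incomplete, backslash line continuations are ignored, and `check_modprobe_conf` and `read_module_param` are hypothetical helper names.

```python
from pathlib import Path

# Commands a modprobe.d line may start with. Assumed from the modprobe.d man
# page; newer kmod releases may accept more.
KNOWN_COMMANDS = {"alias", "blacklist", "config", "install", "options", "remove", "softdep"}


def check_modprobe_conf(data: bytes):
    """Return a list of (line_number, problem) tuples; line 0 means whole file."""
    problems = []
    if b"\r" in data:
        problems.append((0, "file contains carriage returns (non-Unix line endings)"))
    text = data.decode("utf-8", errors="replace")
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    for number, line in enumerate(text.split("\n"), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are fine
        if stripped.split()[0] not in KNOWN_COMMANDS:
            problems.append(
                (number, f"line does not start with a known command: {stripped!r}")
            )
    return problems


def read_module_param(module, param, root="/sys/module"):
    """Return the live value of a module parameter, or None if it is not exposed."""
    path = Path(root) / module / "parameters" / param
    return path.read_text().strip() if path.exists() else None
```

For example, `check_modprobe_conf(Path("/boot/config/modprobe.d/i915.conf").read_bytes())` should come back empty for a file containing only `options i915 enable_dc=0`, and `read_module_param("i915", "enable_dc")` reads the same value as the `cat` command above.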
  21. For what it's worth, when I installed a discrete GPU and still used the iGPU (both enabled in bios - GPU passed through to a VM) the system stopped crashing (was running rc-7).
  22. syslinux.config changes did not have any effect. I tried both modprobe commands and rebooted in between; the server still crashes at the 20-30 min mark. I will try updating to rc-7 next.
  23. @ich777 Awesome. Thanks. Maybe a silly question, will I be able to use the Intel GPU TOP plugin after I have changed these settings?
  24. @ich777 I followed your guide above (still on rc-6) and the system crashed at the usual time. I will try with other DC values on rc-6 before upgrading to rc-7 (although I might skip it if the fix has a regression as reported above). Can someone confirm this working on rc-7?
  25. Yeah I was on RC6 but there’s a chance I put the flag in the wrong location (not in modprobe). If I get a chance I’ll test the correct location, then upgrade and test again.