jsmontague Posted October 29, 2023
For the last week I've been trying to track down a hardware issue. The logs have pointed me all over the place and I haven't gotten any closer, so hopefully I can get some direction from the group here. I attached diagnostics from earlier and again from today. This morning I decided to swap RAM, as I had a lab box with 32GB of RAM that has had no issues and that I could install. Afterwards I now have different errors, and they still potentially point to a RAM issue.
alexandria-diagnostics-20231029-1423.zip alexandria-diagnostics-20231024-0631.zip
JorgeB Posted October 30, 2023
The only issue I see logged is macvlan call traces, and those will usually end up crashing the server. Switching to ipvlan should fix it (Settings -> Docker Settings -> Docker custom network type -> ipvlan; advanced view must be enabled, top right).
jsmontague Posted October 30, 2023 Author
Thanks Jorge, I'll flip that over to ipvlan today. Not sure if captured in the diags but I still see the below in my syslog output.
Oct 29 11:12:51 Alexandria kernel: clocksource: timekeeping watchdog on CPU18: hpet wd-wd read-back delay of 82063ns
Oct 29 11:12:51 Alexandria kernel: clocksource: wd-tsc-wd read-back delay of 124876ns, clock-skew test skipped!
Oct 29 11:30:49 Alexandria kernel: clocksource: timekeeping watchdog on CPU5: hpet wd-wd read-back delay of 53079ns
Oct 29 11:30:49 Alexandria kernel: clocksource: wd-tsc-wd read-back delay of 100990ns, clock-skew test skipped!
Oct 30 04:00:23 Alexandria kernel: sd 1:0:8:0: [sdj] tag#2980 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=0s
Oct 30 04:00:23 Alexandria kernel: sd 1:0:8:0: [sdj] tag#2980 Sense Key : 0x5 [current]
Oct 30 04:00:23 Alexandria kernel: sd 1:0:8:0: [sdj] tag#2980 ASC=0x21 ASCQ=0x0
Oct 30 04:00:23 Alexandria kernel: sd 1:0:8:0: [sdj] tag#2980 CDB: opcode=0x42 42 00 00 00 00 00 00 00 18 00
Oct 30 04:00:23 Alexandria kernel: critical target error, dev sdj, sector 467807832 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 2
Oct 30 04:00:23 Alexandria kernel: BTRFS warning (device sdj1): failed to trim 1 device(s), last error -121
Oct 30 04:18:49 Alexandria kernel: sd 1:0:7:0: [sdi] tag#2726 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=0s
Oct 30 04:18:49 Alexandria kernel: sd 1:0:7:0: [sdi] tag#2726 Sense Key : 0x5 [current]
Oct 30 04:18:49 Alexandria kernel: sd 1:0:7:0: [sdi] tag#2726 ASC=0x21 ASCQ=0x0
Oct 30 04:18:49 Alexandria kernel: sd 1:0:7:0: [sdi] tag#2726 CDB: opcode=0x42 42 00 00 00 00 00 00 00 18 00
Oct 30 04:18:49 Alexandria kernel: critical target error, dev sdi, sector 935397439 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 2
Oct 30 04:18:49 Alexandria kernel: BTRFS warning (device sdi1): failed to trim 1 device(s), last error -121
Oct 30 04:18:49 Alexandria root: /etc/libvirt: 920 MiB (964657152 bytes) trimmed on /dev/loop3
Oct 30 04:18:49 Alexandria root: /var/lib/docker: 19.4 GiB (20827594752 bytes) trimmed on /dev/loop2
Oct 30 04:18:49 Alexandria root: /mnt/nvme: 2.2 TiB (2396491730944 bytes) trimmed on /dev/nvme0n1p1
JorgeB Posted October 30, 2023
Not sure what the first one is about, but it looks harmless. The second one is trim related; most likely either the devices or the controller don't support trim.
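For anyone reading the trim errors above, the raw SCSI fields can be turned into names with a small lookup. This is only a minimal Python sketch: the function name `decode_trim_failure` is made up for illustration, and the tables hold just the codes that appear in this thread, with names taken from the standard SCSI code assignments and the Linux errno list. It shows that the failing command was an UNMAP (the SCSI form of trim) rejected as an illegal request, which fits the "trim related" reading but does not by itself say whether the disks or the controller are responsible.

```python
import re

# Lookup tables limited to the codes seen in this thread (not a full SCSI table).
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {(0x21, 0x00): "LOGICAL BLOCK ADDRESS OUT OF RANGE"}
OPCODES = {0x42: "UNMAP"}
LINUX_ERRNO = {121: "EREMOTEIO"}

SENSE_RE = re.compile(r"Sense Key : (0x[0-9a-fA-F]+)")
ASC_RE = re.compile(r"ASC=(0x[0-9a-fA-F]+) ASCQ=(0x[0-9a-fA-F]+)")
OPCODE_RE = re.compile(r"CDB: opcode=(0x[0-9a-fA-F]+)")
ERRNO_RE = re.compile(r"last error -(\d+)")


def decode_trim_failure(lines):
    """Return human-readable names for the SCSI/errno codes found in syslog lines."""
    decoded = {}
    for line in lines:
        m = SENSE_RE.search(line)
        if m:
            decoded["sense_key"] = SENSE_KEYS.get(int(m.group(1), 16), "unknown")
        m = ASC_RE.search(line)
        if m:
            pair = (int(m.group(1), 16), int(m.group(2), 16))
            decoded["asc_ascq"] = ASC_ASCQ.get(pair, "unknown")
        m = OPCODE_RE.search(line)
        if m:
            decoded["opcode"] = OPCODES.get(int(m.group(1), 16), "unknown")
        m = ERRNO_RE.search(line)
        if m:
            decoded["errno"] = LINUX_ERRNO.get(int(m.group(1)), "unknown")
    return decoded


if __name__ == "__main__":
    sample = [
        "kernel: sd 1:0:8:0: [sdj] tag#2980 Sense Key : 0x5 [current]",
        "kernel: sd 1:0:8:0: [sdj] tag#2980 ASC=0x21 ASCQ=0x0",
        "kernel: sd 1:0:8:0: [sdj] tag#2980 CDB: opcode=0x42 42 00 00 00 00 00 00 00 18 00",
        "kernel: BTRFS warning (device sdj1): failed to trim 1 device(s), last error -121",
    ]
    print(decode_trim_failure(sample))
```

The same failure shows up for both sdj and sdi, while the NVMe device and the loop devices trim fine, so the sketch is mostly useful for confirming that every failing line carries the same codes.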
jsmontague Posted October 30, 2023 Author
I didn't realize some disks/controllers don't support trim for SSDs. I'll dig into that this week, thanks Jorge!
jsmontague Posted October 30, 2023 Author
Here are some logs I got right before my server locked up; I'm not sure of the root cause of these either.
Oct 28 21:20:11 Alexandria kernel: BTRFS critical (device loop2): corrupt leaf: root=2 block=13575421952 slot=43, invalid key objectid, have 18446612789731309144 expect to be aligned to 4096
Oct 28 21:20:11 Alexandria kernel: BTRFS info (device loop2): leaf 13575421952 gen 4571109 total ptrs 201 free space 2780 owner 2
Oct 28 21:20:11 Alexandria kernel: #011item 0 key (13517111296 169 0) itemoff 16250 itemsize 33
Oct 28 21:20:11 Alexandria kernel: #011#011extent refs 1 gen 4540846 flags 2
Oct 28 21:20:11 Alexandria kernel: #011#011ref#0: tree block backref root 20620
Oct 28 21:20:11 Alexandria kernel: #011item 1 key (13517127680 169 0) itemoff 16190 itemsize 60
JorgeB Posted October 30, 2023
It shows that the image is corrupt, and loop2 is usually the docker image; if so, delete and re-create it.
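The "expect to be aligned to 4096" part of the BTRFS message is plain arithmetic, and it can be checked by hand. The sketch below is only an illustration in Python (the helper name `alignment_remainder` is made up); it takes the objectid from the log line and compares it with the neighbouring keys that btrfs printed for the same leaf.

```python
ALIGNMENT = 4096  # alignment named in the kernel message

# Values copied from the syslog lines in this thread.
bad_objectid = 18446612789731309144
neighbour_keys = [13517111296, 13517127680]


def alignment_remainder(value, alignment=ALIGNMENT):
    """Return value modulo alignment; 0 means the value is aligned."""
    return value % alignment


if __name__ == "__main__":
    print(hex(bad_objectid), alignment_remainder(bad_objectid))
    for key in neighbour_keys:
        print(hex(key), alignment_remainder(key))
```

The neighbouring keys come out aligned and sit around 13.5 GB, while the rejected value is not aligned and is close to 2^64, far beyond any offset a docker image file could contain. That matches the "image is corrupt" reading; it does not say what caused the corruption.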
jsmontague Posted October 30, 2023 Author
Could changing RAM resolve that? I haven't re-created the image, but since swapping RAM it hasn't come back yet either (still early though, so it could).
JorgeB Posted October 30, 2023
Bad RAM can corrupt the image, but just replacing the RAM afterwards won't fix it; it still needs to be recreated, see here. It doesn't delete your containers.
jsmontague Posted October 30, 2023 Author
Perfect, I'll add recreating the docker image to the list! Appreciate it
jsmontague Posted October 31, 2023 Author
My logs are full of nginx out-of-memory errors. Attached new diag. Is this tied to a page being left open in a browser on a device that is asleep overnight, or something else? Can't believe that's a thing, but that's what a few of the Google results said when I searched for the log messages.
Oct 30 20:00:33 Alexandria kernel: clocksource: timekeeping watchdog on CPU6: hpet wd-wd read-back delay of 54266ns
Oct 30 20:00:33 Alexandria kernel: clocksource: wd-tsc-wd read-back delay of 106577ns, clock-skew test skipped!
Oct 30 20:24:51 Alexandria nginx: 2023/10/30 20:24:51 [crit] 12347#12347: ngx_slab_alloc() failed: no memory
Oct 30 20:24:51 Alexandria nginx: 2023/10/30 20:24:51 [error] 12347#12347: shpool alloc failed
Oct 30 20:24:51 Alexandria nginx: 2023/10/30 20:24:51 [error] 12347#12347: nchan: Out of shared memory while allocating message of size 27101. Increase nchan_max_reserved_memory.
Oct 30 20:24:51 Alexandria nginx: 2023/10/30 20:24:51 [error] 12347#12347: *1306818 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/devices?buffer_length=1 HTTP/1.1", host: "localhost"
Oct 30 20:24:51 Alexandria nginx: 2023/10/30 20:24:51 [error] 12347#12347: MEMSTORE:00: can't create shared message for channel /devices
Oct 30 20:24:52 Alexandria nginx: 2023/10/30 20:24:52 [crit] 12347#12347: ngx_slab_alloc() failed: no memory
Oct 30 20:24:52 Alexandria nginx: 2023/10/30 20:24:52 [error] 12347#12347: shpool alloc failed
Oct 30 20:24:52 Alexandria nginx: 2023/10/30 20:24:52 [error] 12347#12347: nchan: Out of shared memory while allocating message of size 27101. Increase nchan_max_reserved_memory.
Oct 30 20:24:52 Alexandria nginx: 2023/10/30 20:24:52 [error] 12347#12347: *1306826 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/devices?buffer_length=1 HTTP/1.1", host: "localhost"
Oct 30 20:24:52 Alexandria nginx: 2023/10/30 20:24:52 [error] 12347#12347: MEMSTORE:00: can't create shared message for channel /devices
Oct 30 20:24:56 Alexandria nginx: 2023/10/30 20:24:56 [crit] 12347#12347: ngx_slab_alloc() failed: no memory
Oct 30 20:24:56 Alexandria nginx: 2023/10/30 20:24:56 [error] 12347#12347: shpool alloc failed
Oct 30 20:24:56 Alexandria nginx: 2023/10/30 20:24:56 [error] 12347#12347: nchan: Out of shared memory while allocating message of size 27106. Increase nchan_max_reserved_memory.
Oct 30 20:24:56 Alexandria nginx: 2023/10/30 20:24:56 [error] 12347#12347: *1306868 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/devices?buffer_length=1 HTTP/1.1", host: "localhost"
Oct 30 20:24:56 Alexandria nginx: 2023/10/30 20:24:56 [error] 12347#12347: MEMSTORE:00: can't create shared message for channel /devices
Oct 30 20:24:57 Alexandria nginx: 2023/10/30 20:24:57 [crit] 12347#12347: ngx_slab_alloc() failed: no memory
Oct 30 20:24:57 Alexandria nginx: 2023/10/30 20:24:57 [error] 12347#12347: shpool alloc failed
Oct 30 20:24:57 Alexandria nginx: 2023/10/30 20:24:57 [error] 12347#12347: nchan: Out of shared memory while allocating message of size 27111. Increase nchan_max_reserved_memory.
Oct 30 20:24:57 Alexandria nginx: 2023/10/30 20:24:57 [error] 12347#12347: *1306881 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/devices?buffer_length=1 HTTP/1.1", host: "localhost"
Oct 30 20:24:57 Alexandria nginx: 2023/10/30 20:24:57 [error] 12347#12347: MEMSTORE:00: can't create shared message for channel /devices
Oct 30 20:24:58 Alexandria nginx: 2023/10/30 20:24:58 [crit] 12347#12347: ngx_slab_alloc() failed: no memory
Oct 30 20:24:58 Alexandria nginx: 2023/10/30 20:24:58 [error] 12347#12347: shpool alloc failed
Oct 30 20:24:58 Alexandria nginx: 2023/10/30 20:24:58 [error] 12347#12347: nchan: Out of shared memory while allocating message of size 27111. Increase nchan_max_reserved_memory.
Oct 30 20:24:58 Alexandria nginx: 2023/10/30 20:24:58 [error] 12347#12347: *1306889 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/devices?buffer_length=1 HTTP/1.1", host: "localhost"
Oct 30 20:24:58 Alexandria nginx: 2023/10/30 20:24:58 [error] 12347#12347: MEMSTORE:00: can't create shared message for channel /devices
alexandria-diagnostics-20231031-0844.zip
JorgeB Posted October 31, 2023
Try booting in safe mode and/or closing any browser windows open to the GUI; only open one when you need to use it, then close it again.
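To check whether closing browser windows actually makes a difference, it can help to tally the nchan errors in a saved syslog before and after. This is a minimal Python sketch that assumes the log lines look like the ones posted above; the function name `summarize_nchan_errors` is made up for illustration and is not part of nginx, nchan, or Unraid.

```python
import re
from collections import Counter

OOM_RE = re.compile(
    r"nchan: Out of shared memory while allocating message of size (\d+)"
)
CHANNEL_RE = re.compile(r"can't create shared message for channel (\S+)")


def summarize_nchan_errors(lines):
    """Count nchan out-of-shared-memory errors and the channels they hit."""
    sizes = []
    channels = Counter()
    for line in lines:
        m = OOM_RE.search(line)
        if m:
            sizes.append(int(m.group(1)))
        m = CHANNEL_RE.search(line)
        if m:
            channels[m.group(1)] += 1
    return {
        "oom_count": len(sizes),
        "largest_message": max(sizes, default=0),
        "channels": dict(channels),
    }


if __name__ == "__main__":
    # The path is an assumption; point it at wherever the syslog copy was saved.
    with open("syslog.txt", encoding="utf-8", errors="replace") as handle:
        print(summarize_nchan_errors(handle))
```

In the excerpt above every failure is on the /devices channel with messages of roughly 27 KB, so a count that keeps climbing with no GUI tab open would suggest looking elsewhere, for example at plugins via safe mode.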
craze1994 Posted October 31, 2023
Hello, I was wondering if anyone would be able to help me solve my issue. I received an error stating that "Rootfs file is getting full (currently 99 % used)". I currently have 288 GB of RAM installed in my server, and the meters indicate that I am using 51% of that RAM. Attached is the diagnostic file; I also ran the memorystorage.plg plugin and have included that report below. In case it helps, I am currently moving files from NAS storage to Unraid using rsync and have set up SMB to the NAS storage.
plugin: installing: memorystorage.plg
Executing hook script: pre_plugin_checks
plugin: downloading: memorystorage.plg ... done
Executing hook script: pre_plugin_checks
This script may take a few minutes to run, especially if you are manually mounting a remote share outside of /mnt/disks or /mnt/remotes
/usr/bin/du --exclude=/mnt/user --exclude=/mnt/user0 --exclude=/mnt/disks --exclude=/proc --exclude=/sys --exclude=/var/lib/docker --exclude=/boot --exclude=/mnt -h -d2 / 2>/dev/null | grep -v 0$' '
13M /tmp/fix.common.problems
4.0K /tmp/ca_notices
2.2M /tmp/pkg
16K /tmp/unassigned.devices
12M /tmp/community.applications
20K /tmp/notifications
596K /tmp/plugins
4.0K /tmp/emhttp
28M /tmp
439M /usr/bin
316M /usr/lib
682M /usr/lib64
25M /usr/libexec
217M /usr/local
46M /usr/sbin
177M /usr/share
4.1M /usr/src
1.1M /usr/man
20K /usr/info
956K /usr/doc
312K /usr/include
1.9G /usr
23M /sbin
52K /run/libvirt
8.0K /run/docker
4.0K /run/avahi-daemon
4.0K /run/elogind
4.0K /run/dbus
308K /run/udev
4.0K /run/blkid
468K /run
4.0K /lib64/xfsprogs
24K /lib64/pkgconfig
1.5M /lib64/elogind
4.0K /lib64/e2fsprogs
1.6M /lib64/security
36M /lib64
25K /lib/dhcpcd
189M /lib/firmware
18K /lib/modprobe.d
90M /lib/modules
512 /lib/systemd
8.0M /lib/udev
286M /lib
8.0K /etc/vulkan
4.0K /etc/nvidia-container-runtime
8.0K /etc/docker
4.0K /etc/OpenCL
4.0K /etc/rsyslog.d
4.0K /etc/netatalk
352K /etc/libvirt
332K /etc/libvirt-
284K /etc/fonts
12K /etc/gtk-3.0
44K /etc/zfs
4.0K /etc/pkcs11
144K /etc/lvm
8.0K /etc/libnl
8.0K /etc/ssmtp
20K /etc/samba
8.0K /etc/php.d
20K /etc/php-fpm.d
36K /etc/nginx
2.5M /etc/file
24K /etc/avahi
48K /etc/apcupsd
12K /etc/sysstat
48K /etc/security
248K /etc/ssl
596K /etc/ssh
48K /etc/mcelog
92K /etc/mc
48K /etc/logrotate.d
36K /etc/iproute2
9.1M /etc/udev
20K /etc/modprobe.d
104K /etc/pam.d
4.0K /etc/elogind
12K /etc/cron.daily
4.0K /etc/cron.d
8.0K /etc/dbus-1
4.0K /etc/sasl2
144K /etc/bash_completion.d
80K /etc/default
168K /etc/rc.d
8.0K /etc/acpi
84K /etc/profile.d
120K /etc/X11
16M /etc
12M /bin
1.1G /var/local
4.0K /var/kerberos
20K /var/state
2.1M /var/cache
16K /var/named
16K /var/tmp
16K /var/spool
1.8M /var/log
2.7M /var/lib
1.1G /var
93G /root/dest
4.0K /root/.config
93G /root
96G /
0 /mnt/rootshare
0 /mnt/addons
0 /mnt
Finished.
NOTE: If there is any subdirectory from /mnt appearing in this list, then that means that you have (most likely) a docker app which is directly referencing a non-existant disk or cache pool
script: memorystorage.plg executed
Executing hook script: post_plugin_checks
Please, if anyone can help: I am a noob with Unraid and learning as I go.
gizmosunraid-diagnostics-20231031-1540.zip
itimpi Posted October 31, 2023
2 hours ago, craze1994 said: 93G /root/dest
What is this - it is not normal.
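Spotting an outlier like /root/dest in a long `du -h -d2` listing is easier when the sizes are converted to bytes and sorted. The following is a rough Python sketch, not part of the memorystorage plugin; the helper names `parse_size` and `largest_dirs` are made up, and it assumes one "size path" pair per line with the 1024-based K/M/G/T suffixes that `du -h` prints.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a du -h size such as '93G', '4.0K' or '512' to bytes."""
    if text[-1] in UNITS:
        return int(float(text[:-1]) * UNITS[text[-1]])
    return int(text)


def largest_dirs(du_lines, top=3, depth=2):
    """Return the `top` largest (bytes, path) entries at the given path depth."""
    entries = []
    for line in du_lines:
        size, path = line.split(maxsplit=1)
        path = path.strip()
        # Depth 2 keeps entries like /root/dest and skips roll-ups like /root or /.
        if path.count("/") == depth:
            entries.append((parse_size(size), path))
    return sorted(entries, reverse=True)[:top]


if __name__ == "__main__":
    sample = [
        "439M /usr/bin",
        "682M /usr/lib64",
        "1.1G /var/local",
        "93G /root/dest",
        "4.0K /root/.config",
        "93G /root",
        "96G /",
    ]
    for size, path in largest_dirs(sample):
        print(f"{size:>15,d}  {path}")
```

On the listing posted above this puts /root/dest far ahead of everything else. Since Unraid keeps its root filesystem in RAM, a directory of that size under /root would plausibly account for the rootfs warning; whether it is an rsync destination that was pointed at the wrong path is a guess the thread does not confirm.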
jsmontague Posted November 1, 2023 Author
Looking for other ideas to pinpoint my issue. The server locked up yesterday evening a little after 5:00pm local time, but there was nothing in the syslog beforehand, and I was not able to use the onboard keyboard/mouse or SSH to try and pull diagnostics. I had to hard power cycle it to get it back up.
alexandria-diagnostics-20231101-0830.zip syslog-192.168.36.150.log
JorgeB Posted November 1, 2023
Did you try what I suggested above? Log is still being spammed with the same errors.
jsmontague Posted November 1, 2023 Author
I can't boot into safe mode as I need the services to be working, but I did shut down my browser before I put my laptop to sleep. Is there anything else that could cause that? Is this a bug? I've never had to close a browser to any system before to keep it from crashing. The server was running for hours after those logs, so I dismissed them as the root cause. I will see if I can identify any other system that might have a browser open to Unraid.
Edited November 1, 2023 by jsmontague
jsmontague Posted November 1, 2023 Author
I have my server boot into WEB mode but leave the connected monitor/keyboard+mouse off. Could it be that web session?
JorgeB Posted November 1, 2023
1 hour ago, jsmontague said: I can't boot into safe mode as I need the services to be working
VMs and docker still work in safe mode, just not plugins.
jsmontague Posted November 1, 2023 Author
I did not know that. Next lockup I'll put it in safe mode to rule out plugins! I confirmed with netstat that I have no open HTTPS session to the box; will report back in a couple of days if we're still up! Thanks Jorge!
jsmontague Posted November 2, 2023 Author
The server started a parity check and is running at 1-5 MB/s, when a parity check typically runs at around 150 MB/s. I don't see anything in the logs other than the below. I'm going to cancel the parity check and restart the server into safe mode today.
Nov 1 22:00:01 Alexandria kernel: mdcmd (39): check NOCORRECT
Nov 1 22:00:01 Alexandria kernel:
Nov 1 22:00:01 Alexandria kernel: md: recovery thread: check P Q ...
alexandria-diagnostics-20231102-0641.zip
JorgeB Posted November 2, 2023
There was something else reading from disks 5, 6 and 11.
jsmontague Posted November 2, 2023 Author
Yup, synching is trying to rebuild its database, just found it. Thanks Jorge! It still feels like the system is not performing as it has in the past, but I can't sort out why, so every little thing has me running down rabbit holes. Appreciate your help!
jsmontague Posted November 3, 2023 Author
So far so good, but I saw the below in the log.
Nov 2 15:59:17 Alexandria kernel: mce: [Hardware Error]: Machine check events logged
Nov 2 15:59:17 Alexandria kernel: mce: [Hardware Error]: Machine check events logged
mcelog: failed to prefill DIMM database from DMI data
Kernel does not support page offline interface
mcelog: Cannot read sysfs field /sys/kernel/security/lockdown: No such file or directory
Kernel in lockdown. Cannot enable DIMM error location reporting
Fallback Socket memory error count 163 exceeded threshold: 163 in 24h
Location SOCKET:1 CHANNEL:? DIMM:? []
Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
Hardware event. This is not a software error.
MCE 0 not finished?
CPU 8 BANK 7 TSC 1e0037342d89a
MISC 152561e86 ADDR 52d7e2940
TIME 1698789886 Tue Oct 31 17:04:46 2023
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL2_ERR
Transaction: Memory read error
STATUS cc00290000010092 MCGSTATUS 0
MCGCAP 1000c19 APICID 20 SOCKETID 1
MICROCODE 42e
CPUID Vendor Intel Family 6 Model 62 Step 4
Hardware event. This is not a software error.
MCE 0
CPU 8 BANK 7 TSC 189d3a0f42916
MISC 30684286 ADDR 4ab0d1940
TIME 1698958757 Thu Nov 2 15:59:17 2023
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL2_ERR
Transaction: Memory read error
STATUS cc00040000010092 MCGSTATUS 0
MCGCAP 1000c19 APICID 20 SOCKETID 1
MICROCODE 42e
CPUID Vendor Intel Family 6 Model 62 Step 4
Hardware event. This is not a software error.
MCE 1
CPU 8 BANK 11 TSC 189d3a0f42916
MISC 90000000000208c ADDR 742c52000
TIME 1698958757 Thu Nov 2 15:59:17 2023
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER MS_CHANNEL2_ERR
Transaction: Memory scrubbing error
MemCtrl: Corrected patrol scrub error
STATUS 8c000050000800c2 MCGSTATUS 0
MCGCAP 1000c19 APICID 20 SOCKETID 1
MICROCODE 42e
CPUID Vendor Intel Family 6 Model 62 Step 4
Hardware event. This is not a software error.
MCE 2
CPU 8 BANK 7 TSC 189d3a0f501de
MISC 202ebe86 ADDR 4d4647700
TIME 1698958757 Thu Nov 2 15:59:17 2023
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL2_ERR
Transaction: Memory read error
STATUS cc00044000010092 MCGSTATUS 0
MCGCAP 1000c19 APICID 20 SOCKETID 1
MICROCODE 42e
CPUID Vendor Intel Family 6 Model 62 Step 4
Hardware event. This is not a software error.
MCE 3
CPU 8 BANK 7 TSC 189d3a0f571d2
MISC 2076e086 ADDR 4ab0d5540
TIME 1698958757 Thu Nov 2 15:59:17 2023
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL2_ERR
Transaction: Memory read error
STATUS cc00030000010092 MCGSTATUS 0
MCGCAP 1000c19 APICID 20 SOCKETID 1
MICROCODE 42e
CPUID Vendor Intel Family 6 Model 62 Step 4
Hardware event. This is not a software error.
MCE 4
CPU 8 BANK 7 TSC 189d3a0f5e316
MISC 205a9686 ADDR 4ab0d5d40
TIME 1698958757 Thu Nov 2 15:59:17 2023
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL2_ERR
Transaction: Memory read error
STATUS cc0000c000010092 MCGSTATUS 0
MCGCAP 1000c19 APICID 20 SOCKETID 1
MICROCODE 42e
CPUID Vendor Intel Family 6 Model 62 Step 4
JorgeB Posted November 3, 2023
That looks like a RAM issue.
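The STATUS values in the mcelog dump can be decoded by hand to see why this reads as a memory problem. Below is a minimal Python sketch, assuming the architectural IA32_MCi_STATUS layout from Intel's documentation (flag bits 63 down to 57, and the memory controller error code pattern 0000 0000 1MMM CCCC in the low 16 bits); the function name `decode_mci_status` is made up, and model-specific fields are deliberately ignored.

```python
# Architectural flag bits at the top of IA32_MCi_STATUS.
FLAG_BITS = {
    63: "VAL",    # register contents valid
    62: "OVER",   # error overflow
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MISC register valid
    58: "ADDRV",  # ADDR register valid
    57: "PCC",    # processor context corrupt
}

# MMM field of a memory controller error code.
MEM_OPS = {
    0b000: "generic",
    0b001: "read",
    0b010: "write",
    0b011: "address/command",
    0b100: "scrub",
}


def decode_mci_status(status_hex):
    """Decode flag bits and the memory controller error code from a STATUS value."""
    status = int(status_hex, 16)
    flags = [name for bit, name in sorted(FLAG_BITS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"flags": flags, "corrected": "UC" not in flags}
    # Memory controller errors match 0000 0000 1MMM CCCC.
    if (code & 0xFF80) == 0x0080:
        channel = code & 0xF
        result["operation"] = MEM_OPS.get((code >> 4) & 0x7, "reserved")
        result["channel"] = None if channel == 0xF else channel
    return result


if __name__ == "__main__":
    for status in ("cc00290000010092", "8c000050000800c2"):
        print(status, decode_mci_status(status))
```

Every STATUS value posted above decodes to a corrected error on memory channel 2 (reads on bank 7, a patrol scrub on bank 11), all reported on socket 1, which agrees with mcelog's own RD_CHANNEL2_ERR and MS_CHANNEL2_ERR labels. The sketch cannot identify the DIMM; mcelog itself reports the location as CHANNEL:? DIMM:?.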