Identifying hardware issue



For the last week I've been trying to track down a hardware issue. The logs have pointed me all over the place and I haven't gotten any closer, so hopefully I can get some direction from the group here. I've attached diagnostics from earlier and again from today. This morning I decided to swap the RAM, since I had a lab box with 32GB of RAM that has had no issues. After the swap I now see different errors, and they still potentially point to a RAM issue.

alexandria-diagnostics-20231029-1423.zip alexandria-diagnostics-20231024-0631.zip


Thanks Jorge, I'll flip that over to ipvlan today. Not sure if it was captured in the diags, but I still see the below in my syslog output.

 

Oct 29 11:12:51 Alexandria kernel: clocksource: timekeeping watchdog on CPU18: hpet wd-wd read-back delay of 82063ns
Oct 29 11:12:51 Alexandria kernel: clocksource: wd-tsc-wd read-back delay of 124876ns, clock-skew test skipped!
Oct 29 11:30:49 Alexandria kernel: clocksource: timekeeping watchdog on CPU5: hpet wd-wd read-back delay of 53079ns
Oct 29 11:30:49 Alexandria kernel: clocksource: wd-tsc-wd read-back delay of 100990ns, clock-skew test skipped!

 

Oct 30 04:00:23 Alexandria kernel: sd 1:0:8:0: [sdj] tag#2980 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=0s
Oct 30 04:00:23 Alexandria kernel: sd 1:0:8:0: [sdj] tag#2980 Sense Key : 0x5 [current] 
Oct 30 04:00:23 Alexandria kernel: sd 1:0:8:0: [sdj] tag#2980 ASC=0x21 ASCQ=0x0 
Oct 30 04:00:23 Alexandria kernel: sd 1:0:8:0: [sdj] tag#2980 CDB: opcode=0x42 42 00 00 00 00 00 00 00 18 00
Oct 30 04:00:23 Alexandria kernel: critical target error, dev sdj, sector 467807832 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 2
Oct 30 04:00:23 Alexandria kernel: BTRFS warning (device sdj1): failed to trim 1 device(s), last error -121
Oct 30 04:18:49 Alexandria kernel: sd 1:0:7:0: [sdi] tag#2726 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=0s
Oct 30 04:18:49 Alexandria kernel: sd 1:0:7:0: [sdi] tag#2726 Sense Key : 0x5 [current] 
Oct 30 04:18:49 Alexandria kernel: sd 1:0:7:0: [sdi] tag#2726 ASC=0x21 ASCQ=0x0 
Oct 30 04:18:49 Alexandria kernel: sd 1:0:7:0: [sdi] tag#2726 CDB: opcode=0x42 42 00 00 00 00 00 00 00 18 00
Oct 30 04:18:49 Alexandria kernel: critical target error, dev sdi, sector 935397439 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 2
Oct 30 04:18:49 Alexandria kernel: BTRFS warning (device sdi1): failed to trim 1 device(s), last error -121
Oct 30 04:18:49 Alexandria root: /etc/libvirt: 920 MiB (964657152 bytes) trimmed on /dev/loop3
Oct 30 04:18:49 Alexandria root: /var/lib/docker: 19.4 GiB (20827594752 bytes) trimmed on /dev/loop2
Oct 30 04:18:49 Alexandria root: /mnt/nvme: 2.2 TiB (2396491730944 bytes) trimmed on /dev/nvme0n1p1
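
For anyone trying to read the trim errors above, here is a minimal Python sketch that maps the numeric fields to names. The lookup tables cover only the values that appear in this log and use the names from the standard SCSI code tables; the `describe` helper and the table names are my own illustration, not part of any Unraid tool.

```python
import errno
import os
import re

# Lookup tables limited to the values seen in the log above
# (names as given in the standard SCSI code tables).
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {(0x21, 0x00): "LOGICAL BLOCK ADDRESS OUT OF RANGE"}
OPCODES = {0x42: "UNMAP"}


def describe(lines):
    """Pull sense key, ASC/ASCQ and CDB opcode out of kernel sd log lines."""
    text = "\n".join(lines)
    key = int(re.search(r"Sense Key : (0x[0-9a-fA-F]+)", text).group(1), 16)
    asc, ascq = (
        int(value, 16)
        for value in re.search(
            r"ASC=(0x[0-9a-fA-F]+) ASCQ=(0x[0-9a-fA-F]+)", text
        ).groups()
    )
    opcode = int(re.search(r"CDB: opcode=(0x[0-9a-fA-F]+)", text).group(1), 16)
    return {
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "asc_ascq": ASC_ASCQ.get((asc, ascq), f"{asc:#x}/{ascq:#x}"),
        "opcode": OPCODES.get(opcode, hex(opcode)),
    }


# Three lines copied from the sdj error above.
sample = [
    "Oct 30 04:00:23 Alexandria kernel: sd 1:0:8:0: [sdj] tag#2980 Sense Key : 0x5 [current] ",
    "Oct 30 04:00:23 Alexandria kernel: sd 1:0:8:0: [sdj] tag#2980 ASC=0x21 ASCQ=0x0 ",
    "Oct 30 04:00:23 Alexandria kernel: sd 1:0:8:0: [sdj] tag#2980 CDB: opcode=0x42 42 00 00 00 00 00 00 00 18 00",
]
result = describe(sample)
print(result)

# BTRFS reports "last error -121"; on Linux errno 121 is EREMOTEIO.
print(errno.errorcode.get(121), os.strerror(121))
```

Read this way, the drives appear to be rejecting the UNMAP (trim) command itself rather than reporting a media fault, but that is only my reading of the codes.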

 


Here are some logs I got right before my server locked up. I'm not sure of the root cause of these either.

 

 

Oct 28 21:20:11 Alexandria kernel: BTRFS critical (device loop2): corrupt leaf: root=2 block=13575421952 slot=43, invalid key objectid, have 18446612789731309144 expect to be aligned to 4096
Oct 28 21:20:11 Alexandria kernel: BTRFS info (device loop2): leaf 13575421952 gen 4571109 total ptrs 201 free space 2780 owner 2
Oct 28 21:20:11 Alexandria kernel: #011item 0 key (13517111296 169 0) itemoff 16250 itemsize 33
Oct 28 21:20:11 Alexandria kernel: #011#011extent refs 1 gen 4540846 flags 2
Oct 28 21:20:11 Alexandria kernel: #011#011ref#0: tree block backref root 20620
Oct 28 21:20:11 Alexandria kernel: #011item 1 key (13517127680 169 0) itemoff 16190 itemsize 60
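
A quick way to see what the "corrupt leaf" message is complaining about is to redo the alignment check by hand. This is a minimal Python sketch using only numbers from the log above; the `is_aligned` helper is my own name for illustration.

```python
# Alignment the tree checker expected, taken from the log message.
ALIGNMENT = 4096

# Key objectid reported as invalid in the "corrupt leaf" line.
bad_objectid = 18446612789731309144

# Objectids of the two neighbouring items printed from the same leaf.
neighbours = [13517111296, 13517127680]


def is_aligned(value, alignment=ALIGNMENT):
    """Return True when value is a whole multiple of alignment."""
    return value % alignment == 0


print(f"bad objectid  = {bad_objectid:#x}, aligned: {is_aligned(bad_objectid)}")
for value in neighbours:
    print(f"neighbour key = {value:#x}, aligned: {is_aligned(value)}")
```

The neighbouring keys pass the check and the reported one does not. Printed in hex, the bad value also looks nothing like the block addresses around it, which is at least consistent with stray data ending up in the key, though that is a guess on my part.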

 


 

My logs are full of nginx out-of-memory errors. New diag attached. Is this tied to a page being left open in the browser of a device that is asleep overnight, or is it something else? I can't believe that's a thing, but it came up in a few of the Google results I found when searching for the log messages.

 

 

Oct 30 20:00:33 Alexandria kernel: clocksource: timekeeping watchdog on CPU6: hpet wd-wd read-back delay of 54266ns
Oct 30 20:00:33 Alexandria kernel: clocksource: wd-tsc-wd read-back delay of 106577ns, clock-skew test skipped!
Oct 30 20:24:51 Alexandria nginx: 2023/10/30 20:24:51 [crit] 12347#12347: ngx_slab_alloc() failed: no memory
Oct 30 20:24:51 Alexandria nginx: 2023/10/30 20:24:51 [error] 12347#12347: shpool alloc failed
Oct 30 20:24:51 Alexandria nginx: 2023/10/30 20:24:51 [error] 12347#12347: nchan: Out of shared memory while allocating message of size 27101. Increase nchan_max_reserved_memory.
Oct 30 20:24:51 Alexandria nginx: 2023/10/30 20:24:51 [error] 12347#12347: *1306818 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/devices?buffer_length=1 HTTP/1.1", host: "localhost"
Oct 30 20:24:51 Alexandria nginx: 2023/10/30 20:24:51 [error] 12347#12347: MEMSTORE:00: can't create shared message for channel /devices
Oct 30 20:24:52 Alexandria nginx: 2023/10/30 20:24:52 [crit] 12347#12347: ngx_slab_alloc() failed: no memory
Oct 30 20:24:52 Alexandria nginx: 2023/10/30 20:24:52 [error] 12347#12347: shpool alloc failed
Oct 30 20:24:52 Alexandria nginx: 2023/10/30 20:24:52 [error] 12347#12347: nchan: Out of shared memory while allocating message of size 27101. Increase nchan_max_reserved_memory.
Oct 30 20:24:52 Alexandria nginx: 2023/10/30 20:24:52 [error] 12347#12347: *1306826 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/devices?buffer_length=1 HTTP/1.1", host: "localhost"
Oct 30 20:24:52 Alexandria nginx: 2023/10/30 20:24:52 [error] 12347#12347: MEMSTORE:00: can't create shared message for channel /devices
Oct 30 20:24:56 Alexandria nginx: 2023/10/30 20:24:56 [crit] 12347#12347: ngx_slab_alloc() failed: no memory
Oct 30 20:24:56 Alexandria nginx: 2023/10/30 20:24:56 [error] 12347#12347: shpool alloc failed
Oct 30 20:24:56 Alexandria nginx: 2023/10/30 20:24:56 [error] 12347#12347: nchan: Out of shared memory while allocating message of size 27106. Increase nchan_max_reserved_memory.
Oct 30 20:24:56 Alexandria nginx: 2023/10/30 20:24:56 [error] 12347#12347: *1306868 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/devices?buffer_length=1 HTTP/1.1", host: "localhost"
Oct 30 20:24:56 Alexandria nginx: 2023/10/30 20:24:56 [error] 12347#12347: MEMSTORE:00: can't create shared message for channel /devices
Oct 30 20:24:57 Alexandria nginx: 2023/10/30 20:24:57 [crit] 12347#12347: ngx_slab_alloc() failed: no memory
Oct 30 20:24:57 Alexandria nginx: 2023/10/30 20:24:57 [error] 12347#12347: shpool alloc failed
Oct 30 20:24:57 Alexandria nginx: 2023/10/30 20:24:57 [error] 12347#12347: nchan: Out of shared memory while allocating message of size 27111. Increase nchan_max_reserved_memory.
Oct 30 20:24:57 Alexandria nginx: 2023/10/30 20:24:57 [error] 12347#12347: *1306881 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/devices?buffer_length=1 HTTP/1.1", host: "localhost"
Oct 30 20:24:57 Alexandria nginx: 2023/10/30 20:24:57 [error] 12347#12347: MEMSTORE:00: can't create shared message for channel /devices
Oct 30 20:24:58 Alexandria nginx: 2023/10/30 20:24:58 [crit] 12347#12347: ngx_slab_alloc() failed: no memory
Oct 30 20:24:58 Alexandria nginx: 2023/10/30 20:24:58 [error] 12347#12347: shpool alloc failed
Oct 30 20:24:58 Alexandria nginx: 2023/10/30 20:24:58 [error] 12347#12347: nchan: Out of shared memory while allocating message of size 27111. Increase nchan_max_reserved_memory.
Oct 30 20:24:58 Alexandria nginx: 2023/10/30 20:24:58 [error] 12347#12347: *1306889 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/devices?buffer_length=1 HTTP/1.1", host: "localhost"
Oct 30 20:24:58 Alexandria nginx: 2023/10/30 20:24:58 [error] 12347#12347: MEMSTORE:00: can't create shared message for channel /devices
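
To get a feel for how often this happens and how large the rejected messages are, something like the following Python sketch could be run over the syslog. The regular expression is written against the exact line format shown above; the `summarize` helper is hypothetical, and the sample lines are copied from this post.

```python
import re
from collections import Counter

# Matches the nchan "Out of shared memory" lines as they appear in this syslog.
NCHAN_OOM = re.compile(
    r"nginx: (?P<minute>\d{4}/\d{2}/\d{2} \d{2}:\d{2}):\d{2} \[error\] .*"
    r"nchan: Out of shared memory while allocating message of size (?P<size>\d+)"
)


def summarize(lines):
    """Count nchan out-of-memory errors per minute and collect message sizes."""
    per_minute = Counter()
    sizes = []
    for line in lines:
        match = NCHAN_OOM.search(line)
        if match:
            per_minute[match.group("minute")] += 1
            sizes.append(int(match.group("size")))
    return per_minute, sizes


sample = [
    "Oct 30 20:24:51 Alexandria nginx: 2023/10/30 20:24:51 [crit] 12347#12347: ngx_slab_alloc() failed: no memory",
    "Oct 30 20:24:51 Alexandria nginx: 2023/10/30 20:24:51 [error] 12347#12347: nchan: Out of shared memory while allocating message of size 27101. Increase nchan_max_reserved_memory.",
    "Oct 30 20:24:52 Alexandria nginx: 2023/10/30 20:24:52 [error] 12347#12347: nchan: Out of shared memory while allocating message of size 27101. Increase nchan_max_reserved_memory.",
    "Oct 30 20:24:56 Alexandria nginx: 2023/10/30 20:24:56 [error] 12347#12347: nchan: Out of shared memory while allocating message of size 27106. Increase nchan_max_reserved_memory.",
]

per_minute, sizes = summarize(sample)
for minute, count in sorted(per_minute.items()):
    print(minute, count)
print("largest message:", max(sizes), "bytes")
```

To run it on a real log, pass an open file instead of `sample`, for example `summarize(open(path))` with `path` pointing at wherever the syslog is kept on your system.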

 

 

alexandria-diagnostics-20231031-0844.zip


Hello, I was wondering if anyone would be able to help me solve my issue. I received an error stating that "Rootfs file is getting full (currently 99 % used)". I currently have 288 GB of RAM installed in my server, and the meters indicate that I am using 51% of that RAM. Attached is the diagnostics file. I also ran the memorystorage.plg plugin and am including that report:

In case it helps: I am currently moving files from NAS storage to Unraid using rsync, and I set up SMB to the NAS storage.

 

plugin: installing: memorystorage.plg
Executing hook script: pre_plugin_checks
plugin: downloading: memorystorage.plg ... done

Executing hook script: pre_plugin_checks


This script may take a few minutes to run, especially if you are manually mounting a remote share outside of /mnt/disks or /mnt/remotes

/usr/bin/du --exclude=/mnt/user --exclude=/mnt/user0 --exclude=/mnt/disks --exclude=/proc --exclude=/sys --exclude=/var/lib/docker --exclude=/boot --exclude=/mnt -h -d2 / 2>/dev/null | grep -v 0$' '
13M /tmp/fix.common.problems
4.0K /tmp/ca_notices
2.2M /tmp/pkg
16K /tmp/unassigned.devices
12M /tmp/community.applications
20K /tmp/notifications
596K /tmp/plugins
4.0K /tmp/emhttp
28M /tmp
439M /usr/bin
316M /usr/lib
682M /usr/lib64
25M /usr/libexec
217M /usr/local
46M /usr/sbin
177M /usr/share
4.1M /usr/src
1.1M /usr/man
20K /usr/info
956K /usr/doc
312K /usr/include
1.9G /usr
23M /sbin
52K /run/libvirt
8.0K /run/docker
4.0K /run/avahi-daemon
4.0K /run/elogind
4.0K /run/dbus
308K /run/udev
4.0K /run/blkid
468K /run
4.0K /lib64/xfsprogs
24K /lib64/pkgconfig
1.5M /lib64/elogind
4.0K /lib64/e2fsprogs
1.6M /lib64/security
36M /lib64
25K /lib/dhcpcd
189M /lib/firmware
18K /lib/modprobe.d
90M /lib/modules
512 /lib/systemd
8.0M /lib/udev
286M /lib
8.0K /etc/vulkan
4.0K /etc/nvidia-container-runtime
8.0K /etc/docker
4.0K /etc/OpenCL
4.0K /etc/rsyslog.d
4.0K /etc/netatalk
352K /etc/libvirt
332K /etc/libvirt-
284K /etc/fonts
12K /etc/gtk-3.0
44K /etc/zfs
4.0K /etc/pkcs11
144K /etc/lvm
8.0K /etc/libnl
8.0K /etc/ssmtp
20K /etc/samba
8.0K /etc/php.d
20K /etc/php-fpm.d
36K /etc/nginx
2.5M /etc/file
24K /etc/avahi
48K /etc/apcupsd
12K /etc/sysstat
48K /etc/security
248K /etc/ssl
596K /etc/ssh
48K /etc/mcelog
92K /etc/mc
48K /etc/logrotate.d
36K /etc/iproute2
9.1M /etc/udev
20K /etc/modprobe.d
104K /etc/pam.d
4.0K /etc/elogind
12K /etc/cron.daily
4.0K /etc/cron.d
8.0K /etc/dbus-1
4.0K /etc/sasl2
144K /etc/bash_completion.d
80K /etc/default
168K /etc/rc.d
8.0K /etc/acpi
84K /etc/profile.d
120K /etc/X11
16M /etc
12M /bin
1.1G /var/local
4.0K /var/kerberos
20K /var/state
2.1M /var/cache
16K /var/named
16K /var/tmp
16K /var/spool
1.8M /var/log
2.7M /var/lib
1.1G /var
93G /root/dest
4.0K /root/.config
93G /root
96G /
0 /mnt/rootshare
0 /mnt/addons
0 /mnt


Finished.
NOTE: If there is any subdirectory from /mnt appearing in this list, then that means that you have (most likely) a docker app which is directly referencing a non-existant disk or cache pool
script: memorystorage.plg executed
Executing hook script: post_plugin_checks

Please, if anyone can help: I am a noob with Unraid and learning as I go.
 

gizmosunraid-diagnostics-20231031-1540.zip
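
With a listing this long it is easy to miss the outlier, so here is a minimal Python sketch that ranks `du -h` lines by size. The `parse_size` and `largest` helpers are hypothetical names of mine, and the sample is a handful of lines copied from the report above.

```python
# du -h prints sizes in powers of 1024.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a du -h size such as '93G', '4.0K' or '512' to bytes."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def largest(lines, depth, count=3):
    """Return the `count` biggest paths at the given directory depth."""
    rows = []
    for line in lines:
        size, path = line.split(maxsplit=1)
        path = path.strip()
        if path != "/" and path.strip("/").count("/") == depth - 1:
            rows.append((parse_size(size), path))
    return [path for _, path in sorted(rows, reverse=True)[:count]]


sample = [
    "28M /tmp",
    "682M /usr/lib64",
    "1.9G /usr",
    "286M /lib",
    "1.1G /var/local",
    "1.1G /var",
    "93G /root/dest",
    "93G /root",
    "96G /",
]

print("top-level:", largest(sample, depth=1))
print("second level:", largest(sample, depth=2, count=1))
```

On the report above this singles out /root/dest at 93G, which accounts for nearly all of the 96G under /. Since the Unraid root filesystem lives in RAM, that directory might be what is triggering the "Rootfs file is getting full" warning, for instance if the rsync destination ended up under /root instead of under /mnt. That is a guess from the listing, not something confirmed here.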


I can't boot into safe mode as I need the services to be working, but I did shut down my browser before I put my laptop to sleep. Is there anything else that could cause that? Is this a bug? I've never had to close a browser to any system before to keep it from crashing. The server was still running hours after those logs, so I dismissed them as the root cause. I will see if I can identify any other system that might have a browser open to Unraid.


The server started a parity check and it is running at 1-5 MB/s, when a parity check typically runs at around 150 MB/s. I don't see anything in the logs outside of the below. I'm going to cancel the parity check and restart the server into safe mode today.

 

Nov  1 22:00:01 Alexandria kernel: mdcmd (39): check NOCORRECT
Nov  1 22:00:01 Alexandria kernel: 
Nov  1 22:00:01 Alexandria kernel: md: recovery thread: check P Q ...

 

alexandria-diagnostics-20231102-0641.zip
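
To put those speeds in perspective, here is a small Python sketch of the arithmetic. The 8 TB parity size is purely a placeholder, since the actual disk size isn't stated in this post, and `check_duration_hours` is a hypothetical helper.

```python
def check_duration_hours(disk_bytes, speed_mb_per_s):
    """Hours needed to read disk_bytes at a steady speed in MB/s (1 MB = 10**6 bytes)."""
    return disk_bytes / (speed_mb_per_s * 10**6) / 3600


# Placeholder parity disk size (assumption); substitute the real one.
PARITY_BYTES = 8 * 10**12

for speed in (150, 5, 1):
    hours = check_duration_hours(PARITY_BYTES, speed)
    print(f"{speed:>3} MB/s -> {hours:8.1f} h ({hours / 24:.1f} days)")
```

Whatever the real disk size is, the ratio is the same: a check at 1-5 MB/s takes 30 to 150 times longer than one at 150 MB/s.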


So far so good, but I saw the below in the log.

 

Nov  2 15:59:17 Alexandria kernel: mce: [Hardware Error]: Machine check events logged
Nov  2 15:59:17 Alexandria kernel: mce: [Hardware Error]: Machine check events logged

 

 

mcelog: failed to prefill DIMM database from DMI data
Kernel does not support page offline interface
mcelog: Cannot read sysfs field /sys/kernel/security/lockdown: No such file or directory
Kernel in lockdown. Cannot enable DIMM error location reporting
Fallback Socket memory error count 163 exceeded threshold: 163 in 24h
Location SOCKET:1 CHANNEL:? DIMM:? []
Running trigger `socket-memory-error-trigger' (reporter: sockdb_fallback)
Hardware event. This is not a software error.
MCE 0
not finished?
CPU 8 BANK 7 TSC 1e0037342d89a 
MISC 152561e86 ADDR 52d7e2940 
TIME 1698789886 Tue Oct 31 17:04:46 2023
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL2_ERR
Transaction: Memory read error
STATUS cc00290000010092 MCGSTATUS 0
MCGCAP 1000c19 APICID 20 SOCKETID 1 
MICROCODE 42e
CPUID Vendor Intel Family 6 Model 62 Step 4
Hardware event. This is not a software error.
MCE 0
CPU 8 BANK 7 TSC 189d3a0f42916 
MISC 30684286 ADDR 4ab0d1940 
TIME 1698958757 Thu Nov  2 15:59:17 2023
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL2_ERR
Transaction: Memory read error
STATUS cc00040000010092 MCGSTATUS 0
MCGCAP 1000c19 APICID 20 SOCKETID 1 
MICROCODE 42e
CPUID Vendor Intel Family 6 Model 62 Step 4
Hardware event. This is not a software error.
MCE 1
CPU 8 BANK 11 TSC 189d3a0f42916 
MISC 90000000000208c ADDR 742c52000 
TIME 1698958757 Thu Nov  2 15:59:17 2023
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER MS_CHANNEL2_ERR
Transaction: Memory scrubbing error
MemCtrl: Corrected patrol scrub error

STATUS 8c000050000800c2 MCGSTATUS 0
MCGCAP 1000c19 APICID 20 SOCKETID 1 
MICROCODE 42e
CPUID Vendor Intel Family 6 Model 62 Step 4
Hardware event. This is not a software error.
MCE 2
CPU 8 BANK 7 TSC 189d3a0f501de 
MISC 202ebe86 ADDR 4d4647700 
TIME 1698958757 Thu Nov  2 15:59:17 2023
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL2_ERR
Transaction: Memory read error
STATUS cc00044000010092 MCGSTATUS 0
MCGCAP 1000c19 APICID 20 SOCKETID 1 
MICROCODE 42e
CPUID Vendor Intel Family 6 Model 62 Step 4
Hardware event. This is not a software error.
MCE 3
CPU 8 BANK 7 TSC 189d3a0f571d2 
MISC 2076e086 ADDR 4ab0d5540 
TIME 1698958757 Thu Nov  2 15:59:17 2023
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL2_ERR
Transaction: Memory read error
STATUS cc00030000010092 MCGSTATUS 0
MCGCAP 1000c19 APICID 20 SOCKETID 1 
MICROCODE 42e
CPUID Vendor Intel Family 6 Model 62 Step 4
Hardware event. This is not a software error.
MCE 4
CPU 8 BANK 7 TSC 189d3a0f5e316 
MISC 205a9686 ADDR 4ab0d5d40 
TIME 1698958757 Thu Nov  2 15:59:17 2023
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL2_ERR
Transaction: Memory read error
STATUS cc0000c000010092 MCGSTATUS 0
MCGCAP 1000c19 APICID 20 SOCKETID 1 
MICROCODE 42e
CPUID Vendor Intel Family 6 Model 62 Step 4
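
The STATUS values in the mcelog output can be unpacked by hand to cross-check what mcelog printed. Below is a minimal Python sketch assuming the architectural IA32_MCi_STATUS bit layout from Intel's Software Developer's Manual; `decode_mci_status` and `MEM_OPS` are my own names, and the corrected-error count field is only meaningful on CPUs that report it.

```python
# Memory-controller operation field (MMM) of the MCA error code.
MEM_OPS = {
    0: "generic",
    1: "read",
    2: "write",
    3: "address/command",
    4: "scrub",
}


def decode_mci_status(status):
    """Split an IA32_MCi_STATUS value into the fields mcelog reports."""
    mca_code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "mca_code": mca_code,
    }
    # Memory-controller error codes have the form 0000_0000_1MMM_CCCC.
    if (mca_code & 0xFF80) == 0x0080:
        fields["mem_op"] = MEM_OPS.get((mca_code >> 4) & 0x7, "unknown")
        fields["channel"] = mca_code & 0xF
    return fields


# STATUS values copied from the BANK 7 and BANK 11 entries above.
read_error = decode_mci_status(0xCC00290000010092)
scrub_error = decode_mci_status(0x8C000050000800C2)
print(read_error)
print(scrub_error)
```

Both values decode to corrected (not uncorrected) errors on channel 2, one on a read and one on a patrol scrub, which matches the RD_CHANNEL2_ERR and MS_CHANNEL2_ERR lines mcelog printed for socket 1.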

 
