Hardware problems?


Hoopster


In the last three months, my server has started locking up every couple of weeks.  Before that, it had been rock solid for the couple of years since I installed the current hardware.

 

The IPMI event logs just record an OS Shutdown like this:

 

216   09/02/2021, 14:24:22  OS  OS Stop / Shutdown  Run-time Critical Stop - Asserted

 

Two or three times, there was a call trace before the lockup.  Most times, nothing interesting appears in the syslog immediately prior to a lockup.  The last lockup occurred on Sept. 2 and was preceded by this:

Sep  2 13:23:50 MediaNAS kernel: NETDEV WATCHDOG: eth0 (igb): transmit queue 1 timed out
Sep  2 13:23:50 MediaNAS kernel: WARNING: CPU: 4 PID: 0 at net/sched/sch_generic.c:442 dev_watchdog+0xcf/0x12b
Sep  2 13:23:50 MediaNAS kernel: Modules linked in: xt_mark macvlan veth xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle nf_tables vhost_net tun vhost vhost_iotlb tap xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter xfs md_mod ipmi_devintf nct6775 hwmon_vid corefreqk(O) wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libblake2s blake2s_x86_64 libblake2s_generic libchacha ip6table_filter ip6_tables iptable_filter ip_tables x_tables igb sr_mod cdrom i915 x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif iosf_mbi kvm drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd drm cryptd intel_gtt glue_helper mpt3sas wmi_bmof agpgart syscopyarea sysfillrect rapl intel_cstate raid_class sysimgblt intel_uncore nvme i2c_i801 acpi_ipmi
Sep  2 13:23:50 MediaNAS kernel: fb_sys_fops scsi_transport_sas ahci i2c_algo_bit i2c_smbus input_leds nvme_core wmi ipmi_si i2c_core video intel_pch_thermal led_class libahci ie31200_edac backlight thermal fan acpi_pad button acpi_power_meter [last unloaded: igb]
Sep  2 13:23:50 MediaNAS kernel: CPU: 4 PID: 0 Comm: swapper/4 Tainted: G           O      5.10.28-Unraid #1
Sep  2 13:23:50 MediaNAS kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./E3C246D4U, BIOS L2.34 12/23/2020
Sep  2 13:23:50 MediaNAS kernel: RIP: 0010:dev_watchdog+0xcf/0x12b
Sep  2 13:23:50 MediaNAS kernel: Code: 79 b7 00 00 75 38 48 89 ef c6 05 63 79 b7 00 01 e8 79 dd fc ff 44 89 e1 48 89 ee 48 c7 c7 ef 7f de 81 48 89 c2 e8 50 16 10 00 <0f> 0b eb 10 41 ff c4 48 05 40 01 00 00 41 39 f4 75 9d eb 16 48 8b
Sep  2 13:23:50 MediaNAS kernel: RSP: 0018:ffffc900001fced8 EFLAGS: 00010286
Sep  2 13:23:50 MediaNAS kernel: RAX: 0000000000000000 RBX: ffff888104184438 RCX: 0000000000000027
Sep  2 13:23:50 MediaNAS kernel: RDX: 00000000ffffefff RSI: 0000000000000001 RDI: ffff88903f518920
Sep  2 13:23:50 MediaNAS kernel: RBP: ffff888104184000 R08: 0000000000000000 R09: 00000000ffffefff
Sep  2 13:23:50 MediaNAS kernel: R10: ffffc900001fcd08 R11: ffffc900001fcd00 R12: 0000000000000001
Sep  2 13:23:50 MediaNAS kernel: R13: ffffc900001fcf10 R14: ffffc900001fcf18 R15: ffffffff820060c8
Sep  2 13:23:50 MediaNAS kernel: FS:  0000000000000000(0000) GS:ffff88903f500000(0000) knlGS:0000000000000000
Sep  2 13:23:50 MediaNAS kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep  2 13:23:50 MediaNAS kernel: CR2: 000014ff9e68f180 CR3: 000000000400a004 CR4: 00000000003706e0
Sep  2 13:23:50 MediaNAS kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep  2 13:23:50 MediaNAS kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Sep  2 13:23:50 MediaNAS kernel: Call Trace:
Sep  2 13:23:50 MediaNAS kernel: <IRQ>
Sep  2 13:23:50 MediaNAS kernel: call_timer_fn.isra.0+0x12/0x6f
Sep  2 13:23:50 MediaNAS kernel: ? netif_tx_lock+0x7a/0x7a
Sep  2 13:23:50 MediaNAS kernel: __run_timers.part.0+0x144/0x185
Sep  2 13:23:50 MediaNAS kernel: ? update_process_times+0x68/0x6e
Sep  2 13:23:50 MediaNAS kernel: ? hrtimer_forward+0x73/0x7b
Sep  2 13:23:50 MediaNAS kernel: ? tick_sched_timer+0x5a/0x64
Sep  2 13:23:50 MediaNAS kernel: ? timerqueue_add+0x62/0x68
Sep  2 13:23:50 MediaNAS kernel: run_timer_softirq+0x21/0x43
Sep  2 13:23:50 MediaNAS kernel: __do_softirq+0xc4/0x1c2
Sep  2 13:23:50 MediaNAS kernel: asm_call_irq_on_stack+0xf/0x20
Sep  2 13:23:50 MediaNAS kernel: </IRQ>
Sep  2 13:23:50 MediaNAS kernel: do_softirq_own_stack+0x2c/0x39
Sep  2 13:23:50 MediaNAS kernel: __irq_exit_rcu+0x45/0x80
Sep  2 13:23:50 MediaNAS kernel: sysvec_apic_timer_interrupt+0x87/0x95
Sep  2 13:23:50 MediaNAS kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
Sep  2 13:23:50 MediaNAS kernel: RIP: 0010:arch_local_irq_enable+0x7/0x8
Sep  2 13:23:50 MediaNAS kernel: Code: 00 48 83 c4 28 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 9c 58 0f 1f 44 00 00 c3 fa 66 0f 1f 44 00 00 c3 fb 66 0f 1f 44 00 00 <c3> 55 8b af 28 04 00 00 b8 01 00 00 00 45 31 c9 53 45 31 d2 39 c5
Sep  2 13:23:50 MediaNAS kernel: RSP: 0018:ffffc900000ebea0 EFLAGS: 00000246
Sep  2 13:23:50 MediaNAS kernel: RAX: ffff88903f522380 RBX: 0000000000000004 RCX: 000000000000001f
Sep  2 13:23:50 MediaNAS kernel: RDX: 0000000000000000 RSI: 0000000022a1e596 RDI: 0000000000000000
Sep  2 13:23:50 MediaNAS kernel: RBP: ffffe8ffffb27b00 R08: 0000fa73e4e5a96c R09: 00000000000001b7
Sep  2 13:23:50 MediaNAS kernel: R10: 000000000000020a R11: 071c71c71c71c71c R12: 0000fa73e4e5a96c
Sep  2 13:23:50 MediaNAS kernel: R13: ffffffff820c5dc0 R14: 0000000000000004 R15: 0000000000000000
Sep  2 13:23:50 MediaNAS kernel: cpuidle_enter_state+0x101/0x1c4
Sep  2 13:23:50 MediaNAS kernel: cpuidle_enter+0x25/0x31
Sep  2 13:23:50 MediaNAS kernel: do_idle+0x1a6/0x214
Sep  2 13:23:50 MediaNAS kernel: cpu_startup_entry+0x18/0x1a
Sep  2 13:23:50 MediaNAS kernel: secondary_startup_64_no_verify+0xb0/0xbb
Sep  2 13:23:50 MediaNAS kernel: ---[ end trace 8c7530a069c55ad4 ]---
Sep  2 13:23:50 MediaNAS kernel: igb 0000:05:00.0 eth0: Reset adapter
Sep  2 13:23:52 MediaNAS kernel: igb 0000:05:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
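For anyone wanting to check a long syslog for similar events, here is a minimal Python sketch that pulls out kernel lines containing the markers seen in the trace above. The function name, the marker list, and the regular expression are illustrative assumptions rather than any Unraid tool, and the pattern assumes the classic "Mon DD HH:MM:SS host kernel:" prefix shown above; a remote syslog file may use a different prefix.

```python
import re
from typing import Iterable, List, Tuple

# Markers taken from the trace above; extend the tuple as needed.
MARKERS = ("NETDEV WATCHDOG", "WARNING: CPU", "Call Trace:")

# Assumed prefix: "Sep  2 13:23:50 MediaNAS kernel: message"
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+kernel:\s?(.*)$"
)


def find_kernel_events(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (timestamp, message) for each kernel line containing a marker."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue  # not a kernel line in the assumed format
        stamp, _host, message = match.groups()
        if any(marker in message for marker in MARKERS):
            events.append((stamp, message))
    return events
```

To use it, open the log file and pass the file object in, for example `find_kernel_events(open("syslog-192.168.1.10.log", errors="replace"))`, then compare the returned timestamps with the times of the lockups.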

 

Attached is the syslog going back to January and the latest diagnostics.  Does anything jump out at anyone?

 

I am beginning to suspect hardware issues.

 

 

syslog-192.168.1.10.log medianas-diagnostics-20210905-1623.zip

1 hour ago, trurl said:

Thanks, trurl.  I missed that ipvlan nugget in the 6.10 release.  I am running 6.10 on my backup/test server, but that server has never had any issues.

 

I have been running macvlan/custom IP addresses on docker containers for quite a while across several 6.8.x and 6.9.x releases without issue.  I had to set up a VLAN on the router to do it without crashes, but it has been solid for a very long time.  My crashes just started on 6.9.2 three or so months ago.

 

I'll give the 6.10 RC a try.

  • 2 weeks later...

Locked up again today on 6.10 RC1 with the Docker custom network type set to ipvlan.  Evidently macvlan was not the issue, but it was worth trying.  The lockup came approximately two weeks after the last reboot, which has been the pattern since this started in roughly early July.

 

I'll keep digging.  This is a weird one with nothing very helpful in the syslog.

  • 3 weeks later...

Crashed again 5 days ago after about a week of uptime.  Locked up again today.  The frequency of lockups is increasing, and I can see nothing relevant in the syslog.

 

I have just swapped out the power supply as a test.  Perhaps the other PSU is failing although it was brand new 18 months ago.

  • 4 weeks later...

The server ran for 23 days without problems on the replacement PSU.  I put the original PSU back in and 4 days later the server locked up again.  I am pretty sure a failing PSU was the issue.  I have had it only 18 months and it has a 7-year warranty, so I made a warranty claim today.  The "old" replacement PSU is back in the server for now.


Well, so much for the idea that it was a bad PSU.  The server just locked up again after a little over 1 day on the "old" PSU that previously ran in the server for 23 days without issue.  That is the quickest it has ever locked up after a reboot since the freezing started.

 

Motherboard bad?  RAM?  I'll run a memtest on the ECC RAM.  Yes, I know the included memtest does not support ECC.


Bad RAM does not appear to be the cause of the lockups.  I downloaded the free MemTest86 from the PassMark site as it supports ECC RAM.  I ran the default test, which runs four passes of 13 tests.  This took 10+ hours and resulted in 0 errors.

 

There is not really a good way to test the motherboard other than replacing it. 

 

I had to use extender cables on the CPU power and motherboard power cables from the PSU as the cables included with the SFX PSU were too short.  Perhaps one of them is bad.

IMG_3340[1].JPG

IMG_3341[1].JPG

  • 3 weeks later...

I am now leaning more towards these lockups being caused by Linux kernel 5.x and/or i915 driver issues.  There are many reports in the forums of lockups starting with unRAID 6.9.2 and continuing with 6.10 (both use a Linux 5.x kernel), whereas unRAID 6.8.3 (4.19 kernel) causes no problems.

 

The IPMI logs report OS Stop/Shutdown when the system locks up.  Nothing useful in the syslog to my untrained eye.
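One way to double-check that conclusion is to line up each IPMI stop event with the syslog lines written shortly before it. The sketch below is only an illustration: the function names and the two-hour window are my own assumptions, the IPMI pattern is inferred from the single event line quoted in the first post, and month names are parsed with `%b`, which assumes an English locale. In that first post the IPMI event (14:24:22) and the syslog trace (13:23:50) are about an hour apart, so the default window is deliberately generous.

```python
import re
from datetime import datetime, timedelta

# Layout assumed from the one IPMI event line quoted in this thread:
# "<id>   MM/DD/YYYY, HH:MM:SS  <sensor and description>"
IPMI_RE = re.compile(
    r"^\s*(\d+)\s+(\d{2}/\d{2}/\d{4}),\s+(\d{2}:\d{2}:\d{2})\s+(.*)$"
)


def parse_ipmi_event(line):
    """Return (event_id, timestamp, text), or None if the line does not match."""
    match = IPMI_RE.match(line)
    if match is None:
        return None
    event_id, date, clock, text = match.groups()
    when = datetime.strptime(f"{date} {clock}", "%m/%d/%Y %H:%M:%S")
    return int(event_id), when, text


def parse_syslog_time(line, year):
    """Parse the 15-character 'Sep  2 13:23:50' prefix; syslog omits the year."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None  # line does not start with a classic syslog timestamp


def lines_before(event_time, syslog_lines, year, window=timedelta(hours=2)):
    """Return syslog lines stamped within `window` before `event_time`."""
    selected = []
    for line in syslog_lines:
        stamp = parse_syslog_time(line, year)
        if stamp is not None and event_time - window <= stamp <= event_time:
            selected.append(line.rstrip("\n"))
    return selected
```

Feeding each "OS Stop / Shutdown" line from the IPMI event log through `parse_ipmi_event` and then calling `lines_before` with the syslog contents shows what, if anything, was logged in the run-up to each lockup.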

 

Other users are reporting lockups with the i915 drivers, which I also use for Plex hardware transcoding, although lockups have never occurred for me during transcoding.
