Unraid crashed several times over the last few days



Unraid crashed yesterday and again last night. This has been happening fairly regularly, and I don't know why. I'm attaching diagnostics data and a screenshot from the IPMI of the crash. I had disabled the syslog (I have since re-enabled it), so I didn't capture those logs. Hopefully the diagnostics can show why this server keeps crashing on me.

Screen Shot 2021-05-01 at 10.11.09.png

unraid-diagnostics-20210501-1015.zip


Also, start a remote SSH session (if you can), run dmesg -w in it, and just leave it up. That keeps a monitor hooked on the kernel messages for you, so (if it has time before the system eats itself) it'll spit out some information on what happened.
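A rough sketch of what I mean, assuming you're watching from another machine on your LAN and the server answers at "unraid" (swap in your server's IP or hostname). Piping through tee keeps a copy on the machine you're watching from, so the output survives even if the server locks up:

# Follow kernel messages live (-w) with readable timestamps (-T); keep a local copy that survives the crash.
ssh root@unraid 'dmesg -wT' | tee ~/unraid-dmesg.log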

 

In the meantime, I agree that the logs show nothing remarkable. Unfortunately, the diagnostics.zip only captures one boot's worth of syslog, so this is one we might need to catch in the act.


It happened again sometime today after 1200 EDT.

 

Output from dmesg -w:

[180253.133938] general protection fault, probably for non-canonical address 0x995d43662ef36987: 0000 [#1] SMP NOPTI
[180253.134046] CPU: 12 PID: 26023 Comm: node Tainted: P        W  O      5.10.28-Unraid #1
[180253.134106] Hardware name: Supermicro Super Server/H11SSL-i, BIOS 2.1 02/21/2020
[180253.134170] RIP: 0010:nf_nat_setup_info+0x129/0x6aa [nf_nat]
[180253.134222] Code: ff 48 8b 15 ef 6a 00 00 89 c0 48 8d 04 c2 48 8b 10 48 85 d2 74 80 48 81 ea 98 00 00 00 48 85 d2 0f 84 70 ff ff ff 8a 44 24 46 <38> 42 46 74 09 48 8b 92 98 00 00 00 eb d9 48 8b 4a 20 48 8b 42 28
[180253.134381] RSP: 0018:ffffc900003a8700 EFLAGS: 00010286
[180253.134431] RAX: ffff88817aee7811 RBX: ffff88810da4ebc0 RCX: 0000000000000000
[180253.134488] RDX: 995d43662ef36987 RSI: 000000000e2c996a RDI: ffffc900003a8720
[180253.134544] RBP: ffffc900003a87c8 R08: 000000003bda62d3 R09: ffff888146ef46c0
[180253.134600] R10: 0000000000000348 R11: ffffffff815cbe4b R12: 0000000000000000
[180253.134656] R13: ffffc900003a8720 R14: ffffc900003a87dc R15: ffffffff8210b440
[180253.134713] FS:  000014648f1f8700(0000) GS:ffff8887fe800000(0000) knlGS:0000000000000000
[180253.134800] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[180253.134875] CR2: 000014fe97325000 CR3: 0000000552bb8000 CR4: 00000000003506e0
[180253.134967] Call Trace:
[180253.135019]  <IRQ>
[180253.135072]  ? dev_hard_start_xmit+0x75/0xab
[180253.135142]  ? bond_start_xmit+0x26e/0x292 [bonding]
[180253.135214]  ? __ksize+0x15/0x64
[180253.135270]  ? krealloc+0x26/0x7a
[180253.135327]  nf_nat_masquerade_ipv4+0x10b/0x131 [nf_nat]
[180253.135403]  masquerade_tg+0x44/0x5e [xt_MASQUERADE]
[180253.135475]  ipt_do_table+0x51a/0x5c0 [ip_tables]
[180253.135543]  ? ipt_do_table+0x570/0x5c0 [ip_tables]
[180253.135614]  ? fib_validate_source+0xb0/0xda
[180253.135678]  nf_nat_inet_fn+0xe9/0x183 [nf_nat]
[180253.135745]  nf_nat_ipv4_out+0xf/0x88 [nf_nat]
[180253.135797]  nf_hook_slow+0x39/0x8e
[180253.135842]  nf_hook+0xab/0xd3
[180253.135886]  ? __ip_finish_output+0x146/0x146
[180253.135933]  ip_output+0x7d/0x8a
[180253.135976]  ? __ip_finish_output+0x146/0x146
[180253.136023]  ip_forward+0x3f1/0x420
[180253.136067]  ? ip_check_defrag+0x18f/0x18f
[180253.136115]  ip_sabotage_in+0x43/0x4d [br_netfilter]
[180253.136164]  nf_hook_slow+0x39/0x8e
[180253.136208]  nf_hook.constprop.0+0xb1/0xd8
[180253.136254]  ? l3mdev_l3_rcv.constprop.0+0x50/0x50
[180253.136302]  ip_rcv+0x41/0x61
[180253.136345]  __netif_receive_skb_one_core+0x74/0x95
[180253.136395]  netif_receive_skb+0x79/0xa1
[180253.136442]  br_handle_frame_finish+0x30d/0x351
[180253.136491]  ? ipt_do_table+0x570/0x5c0 [ip_tables]
[180253.136539]  ? br_pass_frame_up+0xda/0xda
[180253.136585]  br_nf_hook_thresh+0xa3/0xc3 [br_netfilter]
[180253.136636]  ? br_pass_frame_up+0xda/0xda
[180253.136683]  br_nf_pre_routing_finish+0x23d/0x264 [br_netfilter]
[180253.136736]  ? br_pass_frame_up+0xda/0xda
[180253.136781]  ? br_handle_frame_finish+0x351/0x351
[180253.136830]  ? nf_nat_ipv4_pre_routing+0x1e/0x4a [nf_nat]
[180253.136881]  ? br_nf_forward_finish+0xd0/0xd0 [br_netfilter]
[180253.136932]  ? br_handle_frame_finish+0x351/0x351
[180253.136980]  NF_HOOK+0xd7/0xf7 [br_netfilter]
[180253.137027]  ? br_nf_forward_finish+0xd0/0xd0 [br_netfilter]
[180253.137079]  br_nf_pre_routing+0x229/0x239 [br_netfilter]
[180253.137131]  ? br_nf_forward_finish+0xd0/0xd0 [br_netfilter]
[180253.137182]  br_handle_frame+0x25e/0x2a6
[180253.137227]  ? br_pass_frame_up+0xda/0xda
[180253.137273]  __netif_receive_skb_core+0x335/0x4e7
[180253.137323]  __netif_receive_skb_one_core+0x3d/0x95
[180253.137374]  process_backlog+0xa3/0x13b
[180253.137419]  net_rx_action+0xf4/0x29d
[180253.147128]  __do_softirq+0xc4/0x1c2
[180253.147174]  asm_call_irq_on_stack+0x12/0x20
[180253.147220]  </IRQ>
[180253.147260]  do_softirq_own_stack+0x2c/0x39
[180253.147309]  do_softirq+0x3a/0x44
[180253.147353]  __local_bh_enable_ip+0x3b/0x43
[180253.147399]  ip_finish_output2+0x2ec/0x31f
[180253.147444]  ? ipv4_mtu+0x3d/0x64
[180253.147487]  ip_send_skb+0x10/0x32
[180253.147531]  udp_send_skb+0x24e/0x2b0
[180253.147575]  udp_sendmsg+0x60d/0x826
[180253.147621]  ? arch_local_irq_save+0x11/0x17
[180253.147667]  ? skb_zcopy_set+0x5e/0x5e
[180253.147715]  ? xfrm_lookup_with_ifid+0x430/0x4b7
[180253.147765]  ? sock_sendmsg_nosec+0x2b/0x3c
[180253.147811]  sock_sendmsg_nosec+0x2b/0x3c
[180253.147857]  __sys_sendto+0xce/0x109
[180253.147902]  ? __sys_connect+0x73/0x9d
[180253.147947]  __x64_sys_sendto+0x20/0x23
[180253.147993]  do_syscall_64+0x5d/0x6a
[180253.148039]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[180253.148089] RIP: 0033:0x14649472063f
[180253.148134] Code: 53 49 89 d5 89 fb 48 83 ec 10 e8 4c f8 ff ff 48 63 fb 89 c5 45 31 c9 45 31 c0 4d 63 d4 4c 89 ea 4c 89 f6 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 1e 89 ef 48 89 44 24 08 e8 7d f8 ff ff 48 8b
[180253.148291] RSP: 002b:000014648f1f6050 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[180253.148351] RAX: ffffffffffffffda RBX: 0000000000000014 RCX: 000014649472063f
[180253.148407] RDX: 0000000000000018 RSI: 000014648f1f62f0 RDI: 0000000000000014
[180253.148463] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[180253.148519] R10: 0000000000004000 R11: 0000000000000246 R12: 0000000000004000
[180253.148575] R13: 0000000000000018 R14: 000014648f1f62f0 R15: 000014648f1f8db8
[180253.148633] Modules linked in: nvidia_uvm(PO) xt_mark xt_nat veth macvlan nf_conntrack_netlink xt_addrtype br_netfilter xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun xfs nfsd lockd grace sunrpc md_mod nvidia_drm(PO) nvidia_modeset(PO) drm_kms_helper drm backlight agpgart syscopyarea sysfillrect sysimgblt fb_sys_fops nvidia(PO) ipmi_devintf ip6table_filter ip6_tables iptable_filter ip_tables x_tables bonding bnx2x mdio igb i2c_algo_bit amd64_edac_mod edac_mce_amd mpt3sas crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ipmi_ssif aesni_intel crypto_simd cryptd raid_class scsi_transport_sas ahci i2c_piix4 input_leds glue_helper i2c_core led_class k10temp rapl libahci acpi_ipmi button ipmi_si acpi_cpufreq [last unloaded: ccp]
[180253.150423] ---[ end trace 494269cf320fd729 ]---
[180253.150538] RIP: 0010:nf_nat_setup_info+0x129/0x6aa [nf_nat]
[180253.150645] Code: ff 48 8b 15 ef 6a 00 00 89 c0 48 8d 04 c2 48 8b 10 48 85 d2 74 80 48 81 ea 98 00 00 00 48 85 d2 0f 84 70 ff ff ff 8a 44 24 46 <38> 42 46 74 09 48 8b 92 98 00 00 00 eb d9 48 8b 4a 20 48 8b 42 28
[180253.150865] RSP: 0018:ffffc900003a8700 EFLAGS: 00010286

 

The two syslogs are from the same server. The 127.0.0.1 log is from the "Local syslog folder" setting and the 192.168.1.223 log is from the "Remote syslog server" setting.

 

It's possible this is the same issue JorgeB mentioned, but I'm not versed enough in syslog reading to know for sure.

 

 

 

syslog-127.0.0.1.log syslog-192.168.1.223.log


Okay -- one of those logs is pure junk: nothing but connection errors, closed connections, and timeout messages. There's not even enough context to piece together why, without more patience than I care to spend on it, haha.

 

The other log indicates that your UPS is detecting power faults every few minutes across a broad span of time. Unless it's a double-conversion UPS, switching between battery power and mains is not instant: there's a small delay where the power supply sees a brief drop in power, then a sudden, sharp return of rough, stepped-square-wave (in most cases) faux-AC mains -- which is acceptable, but not ideal -- and then the same thing in reverse later. This can lead to brief drops in power rail voltage and stability, which can cause crashes aplenty, disk errors, silent data corruption, the list goes on. I'm also seeing disk errors in that log, but we don't have a current diagnostics.zip (Tools > Diagnostics > Download) to cross-check against. The diagnostics contain syslogs, SMART data, hardware configuration data, and other things that deeply aid in chasing down unusual crashes just like this one. They're no silver bullet, but they're very important with systemic issues such as this.
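If the web GUI stops responding before you can grab them, I believe you can also generate the same archive from the console or an SSH session with the diagnostics command; on the versions I've used it drops the zip onto the flash drive (double-check the output path on yours):

# Build a fresh diagnostics archive from the CLI; the zip typically lands under /boot/logs/.
diagnostics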

 

If your power is that unstable, you will have a bad time without a double-conversion UPS. Learn about their caveats too.

 

If your UPS is faulty, replace it, because it's probably making things worse.

 

Check all of your disks' SMART data, and run tests on any indicating errors.
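A quick way to do a first pass from the console, as a sketch; the /dev/sd? glob and the attribute names below are the usual suspects, so adjust for your particular drives and controllers:

# Print overall health plus the attributes that most often flag a failing disk.
for d in /dev/sd?; do
  echo "== $d =="
  smartctl -H -A "$d" | grep -iE 'overall-health|Reallocated_Sector|Current_Pending|Offline_Uncorrectable'
done

# For any disk that looks suspect, queue an extended self-test and check back later with 'smartctl -a'.
smartctl -t long /dev/sdb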

 

JorgeB's suggestions are about people using specific Docker configurations. The information has been covered in those posts, and you don't have to read a syslog to compare your Docker container usage to what's being described as problematic. I'd suggest taking a more detail-oriented look.


Interesting, the UPS is a brand-new CyberPower OR1500PFCRT2U. According to Unraid, it's only 25-30% loaded. I have noticed Unraid loses connection to the UPS occasionally, but I thought that was due to the USB cable being loose; the cable is stretched about as far as it can go and still stay connected. I'm getting a server rack to put the UPS and the Unraid server in pretty soon. I don't think the power at my place is unstable; my old NAS ran fine for years on a pretty cheap UPS without ever crashing.

 

As for the networking concerns JorgeB pointed out: I do have the same type of configuration (some containers bridged, some containers on br0 with a static IP address), but the posts mention the syslog showing issues with macvlan, which I didn't see.

 

I'll let the parity check run. I had been stopping it after each crash, since a full parity check takes 2-3 days and the system wasn't staying up long enough between crashes to finish one.

 

That being said, the Unraid Dashboard does show the 12V rail being maxed out, so maybe there is a power issue in the case. I did add a 1070 for transcoding, so I'm going to take the video card out and see whether that's what's causing too much draw on the power supply. If that seems to fix the stability issue, I'll contact the people I purchased the server from (45drive) and see about getting a more capable power supply.

On 5/4/2021 at 6:11 AM, Kirizan said:

thought it was due to the USB cable being loose

May  2 06:02:34 unraid apcupsd[6284]: Power failure.
May  2 06:02:37 unraid apcupsd[6284]: Power is back. UPS running on mains.
May  2 06:24:25 unraid apcupsd[6284]: Power failure.
May  2 06:24:28 unraid apcupsd[6284]: Power is back. UPS running on mains.
May  2 06:57:22 unraid apcupsd[6284]: Power failure.
May  2 06:57:25 unraid apcupsd[6284]: Power is back. UPS running on mains.
May  2 07:08:25 unraid apcupsd[6284]: Power failure.
May  2 07:08:28 unraid apcupsd[6284]: Power is back. UPS running on mains.
May  2 08:00:01 unraid root: Starting Mover
May  2 08:00:01 unraid root: Forcing turbo write on
May  2 08:00:01 unraid kernel: mdcmd (56): set md_write_method 1
May  2 08:00:01 unraid kernel: 
May  2 08:00:01 unraid root: ionice -c 2 -n 0 nice -n 0 /usr/local/emhttp/plugins/ca.mover.tuning/age_mover start 0 0 0 '' .DS_Store '' '' ''
May  2 08:00:01 unraid root: Restoring original turbo write mode
May  2 08:00:01 unraid kernel: mdcmd (57): set md_write_method auto
May  2 08:00:01 unraid kernel: 
May  2 08:01:40 unraid apcupsd[6284]: Power failure.
May  2 08:01:43 unraid apcupsd[6284]: Power is back. UPS running on mains.
May  2 08:59:39 unraid apcupsd[6284]: Power failure.
May  2 08:59:42 unraid apcupsd[6284]: Power is back. UPS running on mains.
May  2 09:16:46 unraid apcupsd[6284]: Power failure.
May  2 09:16:48 unraid apcupsd[6284]: Power is back. UPS running on mains.
May  2 10:00:01 unraid root: Starting Mover

I'm specifically referencing these entries, where it says your power is failing and returning. To be clear, that's not a UPS connection indication (just in case you misunderstood it), but an indication that the UPS thinks you've had a power fault in your home. This can be caused by one of two things:

 

 

First -- An actual problem:

 A blackout, brownout, or blip. Given that each event only lasts about three seconds, it's likely incidental -- it could be your HVAC cycling, a water heater, someone running a large vacuum cleaner on the same branch circuit, or power delivery in your area. Whatever the case, your power isn't as good as your UPS wants it to be for a short moment, so the UPS kicks in, then switches back off. These events would result in your UPS actually triggering -- normally accompanied by loud clicks, sometimes a thunk or hum, beeping, typically large fans inside the unit spinning up, and some manner of blinking or bright indicator that there is a power fault.

 

Second -- A phantom problem:

 Either the UPS is reporting, very specifically, that the power is failing when it isn't, OR something in the pipe between the kernel and the UPS is asserting a "power failure" condition as a sort of failsafe assumption because of a communication issue.

 

If it's the first, the best case is continued instability, and the worst case is actual physical hardware damage and data loss.

If it's the second, the best case is log spam; the worst case is that these mechanisms fire and attempt to (more or less) gracefully power off your system on the assumption that it's about to run out of power on battery backup.
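One way to tell which of the two you're dealing with, sketched out and assuming the bundled apcupsd tooling (which Unraid uses for UPS monitoring) is what's talking to the CyberPower: leave a status poll running and see what it reports around the time of the next "Power failure" entry. A genuine fault shows STATUS flipping to ONBATT and NUMXFERS climbing; a communication problem tends to show up as COMMLOST instead:

# Poll the UPS state every 5 seconds; watch STATUS, line voltage, and the transfer counters.
watch -n 5 "apcaccess status | grep -E 'STATUS|LINEV|NUMXFERS|LASTXFER|XONBATT'"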

 

 

On 5/4/2021 at 6:11 AM, Kirizan said:

Dashboard, it does show the 12v rail being maxed out

I'm not familiar with the indication you're mentioning. Could you show me what you mean via screenshot?

 

 

 

On 5/4/2021 at 6:11 AM, Kirizan said:

As for the networking concerns JorgeB pointed out: I do have the same type of configuration (some containers bridged, some containers on br0 with a static IP address), but the posts mention the syslog showing issues with macvlan, which I didn't see.

You'll want to read it all again, then, as your understanding is fundamentally flawed.

 

'macvlan' is the Docker subsystem used to give containers a static IP address on a physical network. There will never be a literal 'macvlan' indication in any syslog using current software stacks. The posts mention the syslog showing issues caused by using macvlan, but the messages themselves do not contain 'macvlan' or any variation of it; instead they typically show some manner of nf_nat_setup and/or nf_nat_setup_info -- which is in your syslog, and probably why it was suggested as a potential thing to investigate.
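If you want to double-check which of your containers are actually riding on macvlan, here's a quick sketch (network names like br0 are whatever Unraid created for your custom network, so substitute yours):

# List Docker networks; custom br0-style networks show the 'macvlan' driver.
docker network ls

# Show which containers are attached to that network.
docker network inspect br0 --format '{{range .Containers}}{{.Name}} {{end}}'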

 

If you explore those options and don't come up with anything helpful, submit a new full diagnostics.zip and I'll check back in. If you need further clarification on anything I've covered, just ask and I'll do what I can.

