
General instability of Unraid for months.



I have been having some severe stability issues. The Unraid UI partially locks up, shows 502 gateway errors, and various other problems, and I can't get it to shut down. This has been repeating for months; sometimes Unraid won't last 48 hours, other times it will last a couple of weeks. My cache drive has gone multiple times, and I assume the corruption was due to the lockups. Why is Unraid so unstable right now?

 

  /etc/rc.d/rc.nginx restart

  /etc/rc.d/rc.php-fpm restart

This doesn't work.

 

Can't get an SSH shutdown to work either.
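
For context, this is roughly the standard sequence I would expect to work over SSH (generic Linux commands, nothing Unraid-specific):

  # flush pending writes first
  sync

  # request an orderly power-off
  shutdown -h now

  # last resort if that hangs (risks an unclean array stop and a parity check on reboot)
  poweroff -f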

 

 

 

unraid-diagnostics-20201201-2023.zip

 

 

 


Does this kernel stuff mean anything bad?  System has been up for 2 hours.


Dec 1 20:44:20 UNRAID kernel: R10: 0000000000000098 R11: ffff889818870000 R12: 000000000000cd45
Dec 1 20:44:20 UNRAID kernel: R13: ffffffff81e91080 R14: 0000000000000000 R15: 000000000000b64c
Dec 1 20:44:20 UNRAID kernel: FS: 0000000000000000(0000) GS:ffff888c4f600000(0000) knlGS:0000000000000000
Dec 1 20:44:20 UNRAID kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 1 20:44:20 UNRAID kernel: CR2: 000056490c7dff78 CR3: 0000000001e0a001 CR4: 00000000001606f0
Dec 1 20:44:20 UNRAID kernel: Call Trace:
Dec 1 20:44:20 UNRAID kernel: <IRQ>
Dec 1 20:44:20 UNRAID kernel: ipv4_confirm+0xaf/0xb9
Dec 1 20:44:20 UNRAID kernel: nf_hook_slow+0x3a/0x90
Dec 1 20:44:20 UNRAID kernel: ip_local_deliver+0xad/0xdc
Dec 1 20:44:20 UNRAID kernel: ? ip_sublist_rcv_finish+0x54/0x54
Dec 1 20:44:20 UNRAID kernel: ip_rcv+0xa0/0xbe
Dec 1 20:44:20 UNRAID kernel: ? ip_rcv_finish_core.isra.0+0x2e1/0x2e1
Dec 1 20:44:20 UNRAID kernel: __netif_receive_skb_one_core+0x53/0x6f
Dec 1 20:44:20 UNRAID kernel: process_backlog+0x77/0x10e
Dec 1 20:44:20 UNRAID kernel: net_rx_action+0x107/0x26c
Dec 1 20:44:20 UNRAID kernel: __do_softirq+0xc9/0x1d7
Dec 1 20:44:20 UNRAID kernel: do_softirq_own_stack+0x2a/0x40
Dec 1 20:44:20 UNRAID kernel: </IRQ>
Dec 1 20:44:20 UNRAID kernel: do_softirq+0x4d/0x5a
Dec 1 20:44:20 UNRAID kernel: netif_rx_ni+0x1c/0x22
Dec 1 20:44:20 UNRAID kernel: macvlan_broadcast+0x111/0x156 [macvlan]
Dec 1 20:44:20 UNRAID kernel: ? __switch_to_asm+0x41/0x70
Dec 1 20:44:20 UNRAID kernel: macvlan_process_broadcast+0xea/0x128 [macvlan]
Dec 1 20:44:20 UNRAID kernel: process_one_work+0x16e/0x24f
Dec 1 20:44:20 UNRAID kernel: worker_thread+0x1e2/0x2b8
Dec 1 20:44:20 UNRAID kernel: ? rescuer_thread+0x2a7/0x2a7
Dec 1 20:44:20 UNRAID kernel: kthread+0x10c/0x114
Dec 1 20:44:20 UNRAID kernel: ? kthread_park+0x89/0x89
Dec 1 20:44:20 UNRAID kernel: ret_from_fork+0x35/0x40
Dec 1 20:44:20 UNRAID kernel: ---[ end trace 716184adcfbc56ef ]---
Dec 1 22:36:54 UNRAID rpcbind[33908]: connect from 192.168.1.82 to getport/addr(mountd)
Dec 1 22:36:54 UNRAID rpcbind[33909]: connect from 192.168.1.82 to getport/addr(mountd)
Dec 1 22:36:54 UNRAID rpcbind[33910]: connect from 192.168.1.82 to getport/addr(mountd)
Dec 1 22:36:54 UNRAID rpcbind[33911]: connect from 192.168.1.82 to getport/addr(mountd)
Dec 1 22:36:59 UNRAID rpcbind[34030]: connect from 192.168.1.82 to getport/addr(mountd)
Dec 1 22:36:59 UNRAID rpcbind[34031]: connect from 192.168.1.82 to getport/addr(mountd)
Dec 1 22:36:59 UNRAID rpcbind[34032]: connect from 192.168.1.82 to getport/addr(mountd)
Dec 1 22:36:59 UNRAID rpcbind[34033]: connect from 192.168.1.82 to getport/addr(mountd)

unraid-diagnostics-20201201-2237.zip

On 12/1/2020 at 10:38 PM, [email protected] said:

Does this kernel stuff mean anything bad?  System has been up for 2 hours.

If you are getting macvlan/broadcast call traces, that is usually caused by docker containers to which you have assigned an IP address on br0. They do not always result in immediate server lockups, but after a few of them the server will eventually crash, anywhere from hours to days later.

 

Dec 1 20:44:20 UNRAID kernel: netif_rx_ni+0x1c/0x22
Dec 1 20:44:20 UNRAID kernel: macvlan_broadcast+0x111/0x156 [macvlan]
Dec 1 20:44:20 UNRAID kernel: ? __switch_to_asm+0x41/0x70
Dec 1 20:44:20 UNRAID kernel: macvlan_process_broadcast+0xea/0x128 [macvlan]
Dec 1 20:44:20 UNRAID kernel: process_one_work+0x16e/0x24f

 

See this thread for more info.
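
If you want to do the move from the command line, a rough sketch (the container name "myapp" is a placeholder, not a container from this thread; on Unraid you would normally just edit the container template and change its Network Type):

  # list the custom networks Docker knows about (br0, br5, etc. should appear here)
  docker network ls

  # move a placeholder container named "myapp" off br0 and onto br5
  docker network disconnect br0 myapp
  docker network connect br5 myapp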

3 hours ago, Hoopster said:

If you are getting macvlan/broadcast call traces, that is usually caused by docker containers to which you have assigned an IP address on br0. They do not always result in immediate server lockups, but after a few of them the server will eventually crash, anywhere from hours to days later.

 


Dec 1 20:44:20 UNRAID kernel: netif_rx_ni+0x1c/0x22
Dec 1 20:44:20 UNRAID kernel: macvlan_broadcast+0x111/0x156 [macvlan]
Dec 1 20:44:20 UNRAID kernel: ? __switch_to_asm+0x41/0x70
Dec 1 20:44:20 UNRAID kernel: macvlan_process_broadcast+0xea/0x128 [macvlan]
Dec 1 20:44:20 UNRAID kernel: process_one_work+0x16e/0x24f

 

See this thread for more info.

Not quite sure if this is the correct implementation, but I used a second port on my server and moved the dockers and VMs over. br0 is only for the server; it is 10GbE. br5 is only for dockers and VMs; it is redundant 1Gb/s and will be set up as a LAG later.
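
Quick sanity check that the containers actually ended up where I think they did (a rough sketch; assumes Docker created custom networks named br0 and br5, as on my setup):

  # list which containers are attached to each custom network
  docker network inspect --format '{{range .Containers}}{{.Name}} {{end}}' br0
  docker network inspect --format '{{range .Containers}}{{.Name}} {{end}}' br5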


11 hours ago, [email protected] said:

Not quite sure if this is the correct implementation.

If call traces were being generated on Custom:br0, moving the docker containers to another interface is a good first step to see if they can be eliminated. In my case, I went with a VLAN, but you have additional physical NICs, so br5 is a good place to start.
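
For reference, on a plain Linux box the VLAN route looks roughly like this (interface name and VLAN ID are made-up examples; on Unraid you would instead enable VLANs under Settings > Network Settings rather than typing commands):

  # illustrative only: a VLAN sub-interface for container traffic on a generic Linux host
  ip link add link eth0 name eth0.5 type vlan id 5
  ip link set eth0.5 up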

 

Run your server that way for a while and see if the macvlan call traces and associated server lockups are eliminated.

 

This is a difficult problem because no particular cause has been identified. It occurs on some hardware combinations and not on others.

  • 2 weeks later...

I am going to rip out these disks and figure out if they share a cable or something. It's an SC846; I doubt they all share the same SAS cable.

I will temporarily install them outside the enclosure on SAS-to-SATA breakout cables.

Disk 6: ST8000AS0002-1NA17Z_Z840NY2V - 8 TB (sdw)

Disk 11: ST8000DM004-2CX188_WCT0DSW9 - 8 TB (sdr)

ST14000NM001G-2KJ103_ZL22SBSY - 14 TB (sdu)

WDC_WD140EDFZ-11A0VA0_9KGV6TSL - 14 TB (sdt)

Disk 12: WDC_WD80EZAZ-11TDBA0_7HJTPT5F - 8 TB (sdv)

 

Any suggestions on how to recover from this? There are 3 disks with errors.
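
A quick way to pull SMART health for the flagged drives before moving them, roughly (the sdX names are from the list above and can change between boots):

  # health summary and attributes for each dropped drive
  # (add '-d sat' if the HBA hides the drive type)
  smartctl -H -A /dev/sdw
  smartctl -H -A /dev/sdr
  smartctl -H -A /dev/sdv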

 

 

unraid-diagnostics-20201221-2209.zip

 

new 1.txt



Rebuild in progress. It's not like the drives were actually bad. Swapped some spares in for Disk 6 and Disk 11 and mapped the old drives with Unassigned Devices, so there is no data loss if the rebuild fails; I would just have to move the data back to the array. Isolated to either the SAS backplane, the SAS card, or the SAS cables.

 

Drives with arms out are those that dropped out.


 

Isolated it to either the 2 SAS cables, the SAS card with only 2 ports, or the SC846 backplane. Removed the card, added another RES2SV240, and the entire setup is now running off the same 2-port LSI SAS card.

 

Got a pretty crazy setup for a while. Once I get everything stable, I will probably hook another computer up to these 2 SAS backplane ports and do some stress testing, roughly as sketched below.
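
Roughly the kind of stress test I have in mind, assuming non-destructive read-only passes against one drive at a time (sdX is a placeholder device name):

  # non-destructive read-only surface scan (many hours on an 8 TB drive; do NOT add -w, which is destructive)
  badblocks -sv /dev/sdX

  # or just a long sequential read to exercise the cable/backplane path
  dd if=/dev/sdX of=/dev/null bs=1M status=progress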

I think I might just give up on the SC846 and mount 24-30 drives on a piece of plywood on the wall. Any ideas?

 


 

 

 


unraid-diagnostics-20201221-2353.zip

12 hours ago, [email protected] said:

Got a pretty crazy setup for a while.

It seems you suspect a problem with the disks/HBA/cables/backplane and are troubleshooting in that direction.

 

But I think the problem is not there. When a call trace happens, I would first isolate (disconnect, but keep powered) most of the storage, leaving only the VM/docker storage in the system, and check the CPU and mainboard. If the problem persists, then it is a system problem rather than a storage problem. Especially since you run a multi-CPU platform with so many memory sticks.
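
One quick software-side check while you isolate hardware: look for machine-check or memory-controller errors in the kernel log (just a rough example):

  # any MCE/EDAC lines here would point at the CPU or memory rather than storage
  dmesg | grep -iE 'mce|edac|machine check'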

 

12 hours ago, [email protected] said:

 

I think I might just give up on the SC846 and mount 24-30 drives on a piece of plywood on the wall. Any ideas?

Why conclude so early that the SC846 is causing the problem? If you give it up, I see many, many more cables coming out instead.

On 12/22/2020 at 12:37 PM, Vr2Io said:

It seems you suspect a problem with the disks/HBA/cables/backplane and are troubleshooting in that direction.

But I think the problem is not there. When a call trace happens, I would first isolate (disconnect, but keep powered) most of the storage, leaving only the VM/docker storage in the system, and check the CPU and mainboard. If the problem persists, then it is a system problem rather than a storage problem. Especially since you run a multi-CPU platform with so many memory sticks.

Why conclude so early that the SC846 is causing the problem? If you give it up, I see many, many more cables coming out instead.

The memory is ECC corrected, so since IPMI isn't reporting anything in the log it should be OK. I can switch back to my old motherboard for the storage array and run a 30-day trial with the dockers on the Supermicro board.
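
What I'm looking at on the IPMI side, roughly (assuming ipmitool is installed; corrected ECC events normally land in the BMC event log):

  # list the BMC system event log; ECC correctable/uncorrectable events show up here
  ipmitool sel elist

  # optionally clear it afterwards so any new events stand out
  ipmitool sel clear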

  • 2 weeks later...

unraid-diagnostics-20210103-2337.zip



Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: [sds] tag#2942 CDB: opcode=0x88 88 00 00 00 00 00 02 65 f4 d0 00 00 04 00 00 00
Jan 3 23:38:12 UNRAID kernel: scsi target7:0:18: handle(0x001d), sas_address(0x5001e67464d08fee), phy(14)
Jan 3 23:38:12 UNRAID kernel: scsi target7:0:18: enclosure logical id(0x5001e67464d08fff), slot(14)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: task abort: SUCCESS scmd(00000000f342ddfc)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: attempting task abort! scmd(0000000036962349)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: [sds] tag#2941 CDB: opcode=0x88 88 00 00 00 00 00 02 65 f0 d0 00 00 04 00 00 00
Jan 3 23:38:12 UNRAID kernel: scsi target7:0:18: handle(0x001d), sas_address(0x5001e67464d08fee), phy(14)
Jan 3 23:38:12 UNRAID kernel: scsi target7:0:18: enclosure logical id(0x5001e67464d08fff), slot(14)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: task abort: SUCCESS scmd(0000000036962349)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: attempting task abort! scmd(000000006879ad81)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: [sds] tag#2940 CDB: opcode=0x88 88 00 00 00 00 00 02 65 ec d0 00 00 04 00 00 00
Jan 3 23:38:12 UNRAID kernel: scsi target7:0:18: handle(0x001d), sas_address(0x5001e67464d08fee), phy(14)
Jan 3 23:38:12 UNRAID kernel: scsi target7:0:18: enclosure logical id(0x5001e67464d08fff), slot(14)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: task abort: SUCCESS scmd(000000006879ad81)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: attempting task abort! scmd(000000004982ce18)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: [sds] tag#2939 CDB: opcode=0x88 88 00 00 00 00 00 02 65 e8 d0 00 00 04 00 00 00
Jan 3 23:38:12 UNRAID kernel: scsi target7:0:18: handle(0x001d), sas_address(0x5001e67464d08fee), phy(14)
Jan 3 23:38:12 UNRAID kernel: scsi target7:0:18: enclosure logical id(0x5001e67464d08fff), slot(14)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: task abort: SUCCESS scmd(000000004982ce18)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: attempting task abort! scmd(000000005e533f88)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: [sds] tag#2938 CDB: opcode=0x88 88 00 00 00 00 00 02 65 e4 d0 00 00 04 00 00 00
Jan 3 23:38:12 UNRAID kernel: scsi target7:0:18: handle(0x001d), sas_address(0x5001e67464d08fee), phy(14)
Jan 3 23:38:12 UNRAID kernel: scsi target7:0:18: enclosure logical id(0x5001e67464d08fff), slot(14)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: task abort: SUCCESS scmd(000000005e533f88)
Jan 3 23:38:13 UNRAID kernel: sd 7:0:18:0: Power-on or device reset occurred
Jan 3 23:38:13 UNRAID rc.diskinfo[8872]: SIGHUP received, forcing refresh of disks info.
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:41 UNRAID kernel: sd 7:0:18:0: Power-on or device reset occurred
Jan 3 23:38:41 UNRAID rc.diskinfo[8872]: SIGHUP received, forcing refresh of disks info.
