[email protected] Posted December 2, 2020

I have been having some severe stability issues. The Unraid UI partially locks up and shows the 502 gateway error, among various others. I can't get it to shut down. This has been repeating for months. Sometimes Unraid won't last 48 hours; other times it will last a couple of weeks. My cache drive has gone multiple times; I assume the corruption was due to the lockups. Why is Unraid so unstable right now?

/etc/rc.d/rc.nginx restart
/etc/rc.d/rc.php-fpm restart

This doesn't work. Can't get an SSH shutdown to work either.

unraid-diagnostics-20201201-2023.zip
[email protected] Posted December 2, 2020 (edited)

Does this kernel stuff mean anything bad? System has been up for 2 hours.

Dec 1 20:44:20 UNRAID kernel: R10: 0000000000000098 R11: ffff889818870000 R12: 000000000000cd45
Dec 1 20:44:20 UNRAID kernel: R13: ffffffff81e91080 R14: 0000000000000000 R15: 000000000000b64c
Dec 1 20:44:20 UNRAID kernel: FS: 0000000000000000(0000) GS:ffff888c4f600000(0000) knlGS:0000000000000000
Dec 1 20:44:20 UNRAID kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 1 20:44:20 UNRAID kernel: CR2: 000056490c7dff78 CR3: 0000000001e0a001 CR4: 00000000001606f0
Dec 1 20:44:20 UNRAID kernel: Call Trace:
Dec 1 20:44:20 UNRAID kernel: <IRQ>
Dec 1 20:44:20 UNRAID kernel: ipv4_confirm+0xaf/0xb9
Dec 1 20:44:20 UNRAID kernel: nf_hook_slow+0x3a/0x90
Dec 1 20:44:20 UNRAID kernel: ip_local_deliver+0xad/0xdc
Dec 1 20:44:20 UNRAID kernel: ? ip_sublist_rcv_finish+0x54/0x54
Dec 1 20:44:20 UNRAID kernel: ip_rcv+0xa0/0xbe
Dec 1 20:44:20 UNRAID kernel: ? ip_rcv_finish_core.isra.0+0x2e1/0x2e1
Dec 1 20:44:20 UNRAID kernel: __netif_receive_skb_one_core+0x53/0x6f
Dec 1 20:44:20 UNRAID kernel: process_backlog+0x77/0x10e
Dec 1 20:44:20 UNRAID kernel: net_rx_action+0x107/0x26c
Dec 1 20:44:20 UNRAID kernel: __do_softirq+0xc9/0x1d7
Dec 1 20:44:20 UNRAID kernel: do_softirq_own_stack+0x2a/0x40
Dec 1 20:44:20 UNRAID kernel: </IRQ>
Dec 1 20:44:20 UNRAID kernel: do_softirq+0x4d/0x5a
Dec 1 20:44:20 UNRAID kernel: netif_rx_ni+0x1c/0x22
Dec 1 20:44:20 UNRAID kernel: macvlan_broadcast+0x111/0x156 [macvlan]
Dec 1 20:44:20 UNRAID kernel: ? __switch_to_asm+0x41/0x70
Dec 1 20:44:20 UNRAID kernel: macvlan_process_broadcast+0xea/0x128 [macvlan]
Dec 1 20:44:20 UNRAID kernel: process_one_work+0x16e/0x24f
Dec 1 20:44:20 UNRAID kernel: worker_thread+0x1e2/0x2b8
Dec 1 20:44:20 UNRAID kernel: ? rescuer_thread+0x2a7/0x2a7
Dec 1 20:44:20 UNRAID kernel: kthread+0x10c/0x114
Dec 1 20:44:20 UNRAID kernel: ? kthread_park+0x89/0x89
Dec 1 20:44:20 UNRAID kernel: ret_from_fork+0x35/0x40
Dec 1 20:44:20 UNRAID kernel: ---[ end trace 716184adcfbc56ef ]---
Dec 1 22:36:54 UNRAID rpcbind[33908]: connect from 192.168.1.82 to getport/addr(mountd)
Dec 1 22:36:54 UNRAID rpcbind[33909]: connect from 192.168.1.82 to getport/addr(mountd)
Dec 1 22:36:54 UNRAID rpcbind[33910]: connect from 192.168.1.82 to getport/addr(mountd)
Dec 1 22:36:54 UNRAID rpcbind[33911]: connect from 192.168.1.82 to getport/addr(mountd)
Dec 1 22:36:59 UNRAID rpcbind[34030]: connect from 192.168.1.82 to getport/addr(mountd)
Dec 1 22:36:59 UNRAID rpcbind[34031]: connect from 192.168.1.82 to getport/addr(mountd)
Dec 1 22:36:59 UNRAID rpcbind[34032]: connect from 192.168.1.82 to getport/addr(mountd)
Dec 1 22:36:59 UNRAID rpcbind[34033]: connect from 192.168.1.82 to getport/addr(mountd)

unraid-diagnostics-20201201-2237.zip

Edited December 2, 2020 by [email protected]
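For anyone checking their own syslog for this pattern: a quick shell sketch that counts macvlan call-trace signatures in a log file. The `macvlan_broadcast` / `macvlan_process_broadcast` frames are the telltale lines in the trace above; the path you feed it (e.g. `/var/log/syslog`) is up to you.

```shell
#!/bin/sh
# Count macvlan call-trace signatures in a syslog capture.
# Usage: count_macvlan_traces /var/log/syslog
count_macvlan_traces() {
    # Matches the macvlan_broadcast / macvlan_process_broadcast stack frames
    # that appear in the kernel call traces shown above.
    grep -cE 'macvlan_(process_)?broadcast' "$1"
}
```

A nonzero count is a hint the macvlan/broadcast issue discussed below is in play.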
[email protected] Posted December 9, 2020

So it just happened again. Can't telnet in. IPMI is reporting the following events, which are weird. The screen has the following on it. The IPMI log event 509 roughly corresponds with the system becoming unresponsive.
Hoopster Posted December 9, 2020

On 12/1/2020 at 10:38 PM, [email protected] said:
Does this kernel stuff mean anything bad? System has been up for 2 hours.

If you are getting macvlan/broadcast call traces, that is usually caused by docker containers to which you have assigned an IP address on br0. They do not always result in immediate server lockups, but after a few, the server will eventually crash anywhere from hours to days later.

Dec 1 20:44:20 UNRAID kernel: netif_rx_ni+0x1c/0x22
Dec 1 20:44:20 UNRAID kernel: macvlan_broadcast+0x111/0x156 [macvlan]
Dec 1 20:44:20 UNRAID kernel: ? __switch_to_asm+0x41/0x70
Dec 1 20:44:20 UNRAID kernel: macvlan_process_broadcast+0xea/0x128 [macvlan]
Dec 1 20:44:20 UNRAID kernel: process_one_work+0x16e/0x24f

See this thread for more info.
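To see which containers are attached to the custom br0 network, something like the following can help. This is a sketch: it assumes the standard docker CLI and that the custom network is literally named `br0`. The offline helper parses a saved `docker network inspect` dump with plain grep/cut; note it will also print the network's own `"Name"` field along with the container names.

```shell
#!/bin/sh
# Live query (run on the server) - list members of the br0 custom network:
#   docker network inspect br0 --format '{{range .Containers}}{{.Name}}{{"\n"}}{{end}}'

# Offline helper: extract every "Name" value from a saved dump made with
#   docker network inspect br0 > br0.json
# (includes the network's own name, followed by the attached containers).
names_from_inspect_dump() {
    grep -oE '"Name": *"[^"]+"' "$1" | cut -d'"' -f4
}
```

Any container listed there with a static IPv4 address is a candidate for moving off br0.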
[email protected] Posted December 9, 2020

3 hours ago, Hoopster said:
If you are getting macvlan/broadcast call traces that is usually caused by docker containers to which you have assigned an IP address on br0.

Not quite sure if this is the correct implementation, but I used a second port on my server and moved the dockers and VMs over. br0 is only for the server; it is 10GbE. br5 is only for dockers and VMs; it is redundant 1GbE, and I will set this up as a LAG later.
Hoopster Posted December 9, 2020

11 hours ago, [email protected] said:
Not quite sure if this is the correct implementation.

If call traces were being generated on Custom:br0, moving the docker containers to another interface is a good first step to see if they can be eliminated. In my case, I went with a VLAN, but you have additional physical NICs, so br5 is a good place to start. Run your server that way for a while and see if the macvlan call traces and associated server lockups are eliminated. This is a difficult problem because no particular cause has been identified; it occurs on some hardware combinations and not on others.
[email protected] Posted December 22, 2020 (edited)

I am going to rip out these disks and figure out if they share a cable or something. It's an SC846; I doubt they all share the same SAS cable. I will temporarily install them outside the enclosure on SAS-to-SATA breakout cables.

Disk 6: ST8000AS0002-1NA17Z_Z840NY2V - 8 TB (sdw)
Disk 11: ST8000DM004-2CX188_WCT0DSW9 - 8 TB (sdr)
sdu: ST14000NM001G-2KJ103_ZL22SBSY - 14 TB (sdu)
sdt: WDC_WD140EDFZ-11A0VA0_9KGV6TSL - 14 TB (sdt)
Disk 12: WDC_WD80EZAZ-11TDBA0_7HJTPT5F - 8 TB (sdv)

Any suggestions on how to recover from this? There are 3 disks with errors.

unraid-diagnostics-20201221-2209.zip
new 1.txt

Edited December 22, 2020 by [email protected]
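Before concluding the disks themselves failed, it can be worth checking what SMART says about each one. A rough sketch (assumes `smartctl` from smartmontools is available; the helper just flags a nonzero raw Reallocated_Sector_Ct in saved output, which is only one of several attributes worth reviewing):

```shell
#!/bin/sh
# Capture SMART data for a drive first (run as root), e.g.:
#   smartctl -a /dev/sdw > sdw.smart.txt
# Then check whether the saved report shows any reallocated sectors.
has_reallocated_sectors() {
    # In smartctl's attribute table the raw value is the 10th column of the
    # Reallocated_Sector_Ct row.
    raw=$(awk '/Reallocated_Sector_Ct/ {print $10}' "$1")
    [ -n "$raw" ] && [ "$raw" -gt 0 ]
}
```

Drives that dropped out due to cabling/backplane trouble typically show clean SMART attributes, which matches the "drives were not actually bad" outcome in the next post.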
[email protected] Posted December 22, 2020

Rebuild in progress. Looks like the drives were not actually bad. Swapped some spares in for Disk 6 and 11 and mapped the originals with Unassigned Devices, so no data loss if the rebuild fails; I'd just have to move the data back to the array.

Isolated it to either the SAS backplane, the SAS card, or the SAS cables. The drives with arms out are those that dropped out, so it's narrowed down to 2 SAS cables, the SAS card with only 2 ports, or the SC846 backplane. Removed the card, added another RES2SV240, and the entire setup is now running off the same 2-port LSI SAS card. Got a pretty crazy setup for a while.

Once I get everything stable, I will probably hook another computer up to these 2 SAS backplane ports and do some stress testing. I think I might just give up on the SC846 and mount 24-30 drives on a piece of plywood on the wall. Any ideas?

unraid-diagnostics-20201221-2353.zip
Vr2Io Posted December 22, 2020 (edited)

12 hours ago, [email protected] said:
Got a pretty crazy setup for a while.

It seems you suspect a problem with the disks/HBA/cables/backplane and are troubleshooting in that direction, but I don't think the problem is there. When call traces happen, I would first isolate (disconnect, but keep powered) most of the storage and check the CPU, mainboard, and whatever VM/docker storage is left in the system. If the problem persists, then it is a system problem rather than a storage problem. Especially since you run a multi-CPU platform with so many memory sticks.

12 hours ago, [email protected] said:
I think i might just give up the sc846 and mount 24-30 drives on a piece of plywood on the wall. Any ideas?

Why conclude so early that the SC846 is causing the problem? If you give it up, I see many, many cables coming out first.

Edited December 22, 2020 by Vr2Io
[email protected] Posted December 24, 2020

On 12/22/2020 at 12:37 PM, Vr2Io said:
Seems you suspect problem on disk/HBA/cable/blackplane and troubshoot in this direction. But I think problem not there.

The memory is ECC, so as long as IPMI isn't reporting anything in the log it should be OK. I can switch back to my old motherboard for the storage array and run a 30-day trial with the dockers on the Supermicro board.
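Besides the IPMI SEL, the kernel's EDAC subsystem also exposes per-memory-controller ECC counters in sysfs, which can catch corrected errors that never reach the BMC log. A small sketch (the standard path is `/sys/devices/system/edac/mc`; the root is a parameter only so the function can be exercised elsewhere):

```shell
#!/bin/sh
# Print corrected (ce_count) and uncorrected (ue_count) ECC error counts per
# memory controller, as tracked by the kernel EDAC subsystem.
dump_ecc_counters() {
    root="${1:-/sys/devices/system/edac/mc}"
    for f in "$root"/mc*/ce_count "$root"/mc*/ue_count; do
        # Skip unexpanded globs when no EDAC controllers are registered.
        [ -f "$f" ] && printf '%s: %s\n' "$f" "$(cat "$f")"
    done
    return 0
}
```

A steadily climbing ce_count would point at a marginal DIMM even while IPMI stays quiet.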
[email protected] Posted January 4, 2021

unraid-diagnostics-20210103-2337.zip

Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: [sds] tag#2942 CDB: opcode=0x88 88 00 00 00 00 00 02 65 f4 d0 00 00 04 00 00 00
Jan 3 23:38:12 UNRAID kernel: scsi target7:0:18: handle(0x001d), sas_address(0x5001e67464d08fee), phy(14)
Jan 3 23:38:12 UNRAID kernel: scsi target7:0:18: enclosure logical id(0x5001e67464d08fff), slot(14)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: task abort: SUCCESS scmd(00000000f342ddfc)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: attempting task abort! scmd(0000000036962349)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: [sds] tag#2941 CDB: opcode=0x88 88 00 00 00 00 00 02 65 f0 d0 00 00 04 00 00 00
Jan 3 23:38:12 UNRAID kernel: scsi target7:0:18: handle(0x001d), sas_address(0x5001e67464d08fee), phy(14)
Jan 3 23:38:12 UNRAID kernel: scsi target7:0:18: enclosure logical id(0x5001e67464d08fff), slot(14)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: task abort: SUCCESS scmd(0000000036962349)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: attempting task abort! scmd(000000006879ad81)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: [sds] tag#2940 CDB: opcode=0x88 88 00 00 00 00 00 02 65 ec d0 00 00 04 00 00 00
Jan 3 23:38:12 UNRAID kernel: scsi target7:0:18: handle(0x001d), sas_address(0x5001e67464d08fee), phy(14)
Jan 3 23:38:12 UNRAID kernel: scsi target7:0:18: enclosure logical id(0x5001e67464d08fff), slot(14)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: task abort: SUCCESS scmd(000000006879ad81)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: attempting task abort! scmd(000000004982ce18)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: [sds] tag#2939 CDB: opcode=0x88 88 00 00 00 00 00 02 65 e8 d0 00 00 04 00 00 00
Jan 3 23:38:12 UNRAID kernel: scsi target7:0:18: handle(0x001d), sas_address(0x5001e67464d08fee), phy(14)
Jan 3 23:38:12 UNRAID kernel: scsi target7:0:18: enclosure logical id(0x5001e67464d08fff), slot(14)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: task abort: SUCCESS scmd(000000004982ce18)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: attempting task abort! scmd(000000005e533f88)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: [sds] tag#2938 CDB: opcode=0x88 88 00 00 00 00 00 02 65 e4 d0 00 00 04 00 00 00
Jan 3 23:38:12 UNRAID kernel: scsi target7:0:18: handle(0x001d), sas_address(0x5001e67464d08fee), phy(14)
Jan 3 23:38:12 UNRAID kernel: scsi target7:0:18: enclosure logical id(0x5001e67464d08fff), slot(14)
Jan 3 23:38:12 UNRAID kernel: sd 7:0:18:0: task abort: SUCCESS scmd(000000005e533f88)
Jan 3 23:38:13 UNRAID kernel: sd 7:0:18:0: Power-on or device reset occurred
Jan 3 23:38:13 UNRAID rc.diskinfo[8872]: SIGHUP received, forcing refresh of disks info.
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:28 UNRAID kernel: mpt2sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)
Jan 3 23:38:41 UNRAID kernel: sd 7:0:18:0: Power-on or device reset occurred
Jan 3 23:38:41 UNRAID rc.diskinfo[8872]: SIGHUP received, forcing refresh of disks info.
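When task aborts like these pile up, it helps to see whether they cluster on one device or spread across the backplane. A sketch that tallies "attempting task abort!" events per SCSI address (e.g. `7:0:18:0`) from a saved syslog, highest count first:

```shell
#!/bin/sh
# Tally "attempting task abort!" events per SCSI device address from a
# syslog capture, sorted with the noisiest device on top.
abort_counts() {
    grep 'attempting task abort!' "$1" \
        | grep -oE '[0-9]+:[0-9]+:[0-9]+:[0-9]+' \
        | sort | uniq -c | sort -rn
}
```

All the aborts landing on a single target (here 7:0:18:0, slot 14 of the enclosure) points at that drive, its slot, or its cable rather than the whole controller.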