• Server crash under 6.7RC7


    Helmonder
    • Urgent

Yesterday evening I stopped my array and assigned a new parity drive to the primary parity slot (a 10TB WD RED instead of an 8TB).

     

At the same time a preclear was also running for another 10TB WD RED.

     

This morning I woke up to the system being unresponsive: no webgui and no shares (I was plexing away fine yesterday evening).

     

PuTTY also would not connect.

     

I could still get in via IPMI and tried to salvage the syslog. The system was responding to my input, but an ls of /mnt/user caused the console to become unresponsive (I left it like this for half an hour hoping it would recover, but it hasn't).

     

I have been able to take a few screenshots, hoping they show something.

     

     

    unraid_crash.docx





    Recommended Comments

The system has come up again with invalid parity, and it also restarted the parity sync / data rebuild. The notification in the GUI told me that the parity sync was completed successfully (of course this wasn't the case; separate issue in the notification system).

     

The preclear was stopped but could be resumed (it restarted the post-read phase).

     

I do not know if it is valuable, but I included the current syslog. I also checked the log rotation, but the previous log is five days old, so that will not help.
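
One way to keep a syslog copy that survives a crash is to write it to the flash drive, which on Unraid is mounted at /boot (the destination path here is just an example):

mkdir -p /boot/logs
# copy the in-RAM syslog to flash; it will survive the next hard reset
cp /var/log/syslog /boot/logs/syslog-$(date +%Y%m%d-%H%M).txt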

     

Added: all dockers show an update as ready, which does not seem correct; this was not the case yesterday. I have now stopped all dockers to give the array some rest during the parity rebuild.

     

    syslog

    Edited by Helmonder

I am seeing call traces in the current log (so at this time):

     

    Apr 28 11:08:45 Tower kernel: Call Trace:
    Apr 28 11:08:45 Tower kernel: <IRQ>
    Apr 28 11:08:45 Tower kernel: ipv4_confirm+0xaf/0xb7
    Apr 28 11:08:45 Tower kernel: nf_hook_slow+0x37/0x96
    Apr 28 11:08:45 Tower kernel: ip_local_deliver+0xa7/0xd5
    Apr 28 11:08:45 Tower kernel: ? ip_sublist_rcv_finish+0x53/0x53
    Apr 28 11:08:45 Tower kernel: ip_rcv+0x9e/0xbc
    Apr 28 11:08:45 Tower kernel: ? ip_rcv_finish_core.isra.0+0x2e5/0x2e5
    Apr 28 11:08:45 Tower kernel: __netif_receive_skb_one_core+0x4d/0x69
    Apr 28 11:08:45 Tower kernel: process_backlog+0x7e/0x116
    Apr 28 11:08:45 Tower kernel: net_rx_action+0x10b/0x274
    Apr 28 11:08:45 Tower kernel: __do_softirq+0xce/0x1e2
    Apr 28 11:08:45 Tower kernel: do_softirq_own_stack+0x2a/0x40
    Apr 28 11:08:45 Tower kernel: </IRQ>
    Apr 28 11:08:45 Tower kernel: do_softirq+0x4d/0x59
    Apr 28 11:08:45 Tower kernel: netif_rx_ni+0x1c/0x22
    Apr 28 11:08:45 Tower kernel: macvlan_broadcast+0x10f/0x153 [macvlan]
    Apr 28 11:08:45 Tower kernel: macvlan_process_broadcast+0xd5/0x131 [macvlan]
    Apr 28 11:08:45 Tower kernel: process_one_work+0x16e/0x24f
    Apr 28 11:08:45 Tower kernel: ? pwq_unbound_release_workfn+0xb7/0xb7
    Apr 28 11:08:45 Tower kernel: worker_thread+0x1dc/0x2ac
    Apr 28 11:08:45 Tower kernel: kthread+0x10b/0x113
    Apr 28 11:08:45 Tower kernel: ? kthread_park+0x71/0x71
    Apr 28 11:08:45 Tower kernel: ret_from_fork+0x35/0x40
    Apr 28 11:08:45 Tower kernel: ---[ end trace c12044621539eec0 ]---

This seems to correspond with the "macvlan" issue discussed in the following post:

     

    https://forums.unraid.net/topic/75175-macvlan-call-traces/

     

Maybe tonight I experienced a kernel panic as a result of this macvlan issue? Weird though... the server has been stable for weeks... So maybe there is some combination going on with the amount of disk traffic the parity rebuild was causing?

     

In the days before the parity rebuild I had two 10TB WD Reds doing a preclear... That did not cause an issue... So the parity sync might be the thing that pushes something over the edge..

     

I had actually already closed down all of my dockers except for Pi-hole and the HA-Bridge for Domoticz... I have now turned those off as well... Since there is no docker with its own IP address running anymore, I would expect the errors to go away now..

    Edited by Helmonder
    Added info

    The last few lines in the log now show:

     

    Apr 28 11:27:35 Tower kernel: vetha56c284: renamed from eth0
    Apr 28 11:27:47 Tower kernel: device br0 left promiscuous mode
    Apr 28 11:27:47 Tower kernel: veth3808ad4: renamed from eth0

    The "renamed from eth0" point to the two Dockers stopping, the "device br0 left promiscuous mode"

     

I googled a bit and read that "promiscuous mode" is most likely activated when some kind of traffic monitoring / sniffing is going on... Is there something resembling that active in combination with dockers in Unraid?


    Mmm.... Did some more searching...:

     

    https://www.linuxquestions.org/questions/linux-security-4/kernel-device-eth0-entered-promiscuous-mode-756884/

     

    As explained before promiscuous mode means a packet sniffer instructed your ethernet device to listen to all traffic. This can be a benign or a malicious act, but usually you will know if you run an application that provides you with traffic statistics (say ntop or vnstat) or an IDS (say Snort, Prelude, tcpdump or wireshark) or Something Else (say a DHCP client which isn't promiscuous mode but could be identified as one). Reviewing your installed packages might turn up valid applications that fit the above categories. Else, if an interface is (still) in promiscuous mode (old or new style) then running 'ip link show' will show the "PROMISC" tag and when a sniffer is not hidden then running Chkrootkit or Rootkit Hunter (or both) should show details about applications. If none of the above returns satisfying results then a more thorough inspection of the system is warranted (regardless the time between promisc mode switching as posted above being ridiculously short).

     

In my case I do not expect that there is something bad going on... The question, however, is whether this "promiscuous mode" is triggered by an individual docker (and is that docker maybe "bad"), or triggered by Unraid in combination with the docker mechanism, and if so: why..

     

    So I checked...

     

Starting any docker on my system will immediately trigger "promiscuous mode"... It does not matter which docker it is... So that points to Unraid doing something there..
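
For anyone who wants to reproduce the check, something like this works (the container name is just an example):

# no PROMISC flag on br0 while no container is running
ip link show br0 | grep -c PROMISC     # prints 0
# start any container...
docker start pihole
# ...and the flag appears, together with a kernel log line
ip link show br0 | grep -c PROMISC     # prints 1
dmesg | tail -n 1                      # "device br0 entered promiscuous mode"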

     

    I checked my log file:

     

    Apr 28 07:07:27 Tower rc.inet1: ip link set bond0 promisc on master br0 up

    Apr 28 07:07:27 Tower kernel: device bond0 entered promiscuous mode

    Apr 28 07:07:30 Tower kernel: device eth1 entered promiscuous mode

    Apr 28 07:08:10 Tower kernel: device br0 entered promiscuous mode

Apr 28 07:08:12 Tower kernel: device virbr0-nic entered promiscuous mode

     

So promiscuous mode is related to network bonding, which is what I am using..
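
The bond itself can be inspected through procfs, and the flag is visible in ip link (interface names assumed to match the log above):

# bonding driver status: mode, slaves, link state
cat /proc/net/bonding/bond0
# the PROMISC flag shows up in the interface flags, e.g.
# <BROADCAST,MULTICAST,PROMISC,MASTER,UP,LOWER_UP>
ip link show bond0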

     

I found some info on it in combination with changing a VLAN's hardware address:

     

    https://wiki.linuxfoundation.org/networking/bonding

     

    Note that changing a VLAN interface's HW address would set the underlying device – i.e. the bonding interface – to promiscuous mode, which might not be what you want.

     

I am going to stop digging as I am completely out of my comfort zone... and into some kind of rabbit hole; maybe this promiscuous mode has nothing to do with the issue..

     

Anyone?

     

    Edited by Helmonder

Promiscuous mode is needed for Docker and VM operation; it allows an interface to respond to different MAC addresses.
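
For example, a container on a custom br0 network gets its own MAC address on the parent interface, which is exactly what promiscuous mode lets the host accept; a minimal sketch (subnet, gateway and network name are placeholders):

# create a macvlan network on top of br0
docker network create -d macvlan \
    --subnet=192.168.1.0/24 --gateway=192.168.1.1 \
    -o parent=br0 examplenet
# each container on it gets its own MAC, distinct from br0's
docker run --rm --network=examplenet alpine ip link show eth0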


Indeed, these call traces should not happen, but it is very difficult to pinpoint why they do.

It looks like an external event triggers these call traces; some people have no issues (like me, and I have a very extensive network setup) and others do.

Perhaps a future kernel or Docker update will address this, but nothing is guaranteed.

    Edited by bonienl



