  • [6.8.0] Removed Cache Disk + Cache Enabled -> Kernel Crash


    streaky81
    • Retest Minor

    In my Unraid box I used to have an SSD cache disk. Some months back I removed the cache disk from cache duty and let it be mounted as just a disk directly on the filesystem. Evidently at this point I left cache enabled, just with no disks, and mover still appears to be scheduled to run. Everything has been fine with the server for all these months. Then a week back I upgraded to 6.8.0 and it started randomly crashing about an hour after booting, completely unresponsive over the network. I didn't even really notice at first - I have been quite busy and haven't been using my server much.

     

    Router errors suggested to me that I had some sort of network issue. After tearing my hair out reconfiguring my network and buying a new NIC (I wanted a decent dual NIC for this server anyway, so no biggie), with nothing really improving, I started to really focus on Unraid.

     

    It turns out that what has been happening is that mover, still running, is causing an unhandled NULL pointer dereference (somehow); the CPU then stalls, the network goes down as a result, and the server never recovers.

     

    I don't know for sure that this issue was introduced specifically in 6.8.0, but I'm *fairly* sure I didn't have it before - it would have driven me bananas until I'd resolved it, as it did in this case. I never tested any pre-release versions, so I also don't know at what point it appeared.

     

    Basically the workaround (aka fix) for me was to set the number of cache disks to none (it was still set at 1) and disable cache in settings (I believe it was enabled in two places, weirdly), and the server has been running happily for over 24 hours since.

     

    Some sort of check to see whether there actually is a cache disk before executing mover would probably help, or maybe a warning when users have an incomplete cache setup? I don't know for sure, but for me at least it was definitely a thing.

    console log.txt





    Recommended Comments

    Since this seems to be a case of shooting yourself in the foot, I expect this report will be downgraded to "Minor" if not moved to General Support instead, but I will leave it for now. I'm pretty sure there is already a check for cache in mover, but maybe it doesn't work in your scenario.

     

    But your scenario is not entirely clear: you didn't post any diagnostics, that snippet of syslog is incomplete at best, and you didn't give any clear directions for how to reproduce the problem. For future reference, here are the guidelines on posting a bug report:

     

    https://forums.unraid.net/bug-reports/stable-releases/report-guidelines-r68/

     

     

    Quote

    In my Unraid box I used to have an SSD cache disk. Some months back I removed the cache disk from cache duty and let it be mounted as just a disk directly on the filesystem. Evidently at this point I left cache enabled, just with no disks

    This seems to be the most important part of your post, but it is also the part that is most unclear. I assume you mean you added the SSD to the parity array. SSDs aren't recommended in the parity array, but I will leave that for now. (The word "filesystem" usually means something else.)

     

    There isn't any specific place where cache is enabled, so I don't know what you mean by that part. If you mean there were user shares set to use cache, I don't think that would matter. I know cache-yes and cache-prefer would just overflow to the array; I'm not entirely sure what cache-only would do without cache, though. Is there something else you had in mind when you said cache was enabled?

     

    Could you give a more complete, step-by-step description of exactly what you did - the steps you only summarized in the part I quoted above?

     

    Link to comment

    The diagnostics aren't super relevant given the issue no longer exists for me; if I got them now, they wouldn't reflect the state the server was in when it was crashing. I could try to reproduce it, but I don't fancy intentionally making my live server kernel crash.

     

    I thought the reproduction steps were reasonably clear, but, y'know, sorry:

     

    1. Enable cache, assign a disk to cache, and start the array.
    2. Unassign the disk, but leave the cache disk count at 1 and cache enabled, then start the array again.

    For me that left mover scheduled, and it caused a crash. Setting the cache disk count to 0 and then disabling it all fixed the issue.

     

    The kernel crash may well be specific to my hardware, I get that, but if mover hadn't run it wouldn't have caused it. It was definitely doing *something*.

     

    As I said before, it's fixed for me. Even if this becomes a wontfix, hopefully it helps somebody who has a similar setup and whose server is seemingly randomly disappearing off their network.

    Edited by streaky81
    Link to comment

    I can't reproduce this; if I unassign all cache devices, leaving the slots as they were, I get this in the log:

    root: mover: cache not present, or only cache present

     

    mover is not executed

    Link to comment

    It might only happen under specific circumstances, but we'd need to know how to reproduce it, so I'm going to change the status for now until the OP or another user adds more info.

     

     

    Link to comment
    20 hours ago, johnnie.black said:

    I can't reproduce this; if I unassign all cache devices, leaving the slots as they were, I get this in the log:

    
    root: mover: cache not present, or only cache present

     

    mover is not executed

    Try this.

    After you unassign the physical cache devices, try creating a /mnt/cache folder, like what would happen if a container were misconfigured to use the disk path instead of /mnt/user.

     

    I suspect the OP was filling up RAM with some misconfiguration, causing the crash.
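     

    To illustrate (a rough sketch - the /mnt/cache folder is the one from my suggestion above, the rest is a generic example, since Unraid's root filesystem lives in RAM):

     

    # With no cache device mounted, /mnt/cache is just a folder on the
    # RAM-backed root filesystem, so anything a container writes "to cache"
    # actually consumes RAM.
    mkdir -p /mnt/cache
    mountpoint -q /mnt/cache || echo "not a mount point"   # a plain folder fails the test
    df -h /mnt/cache    # reports the RAM-backed rootfs, not a cache device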

    Link to comment
    1 hour ago, jonathanm said:

    After you unassign the physical cache devices, try creating a /mnt/cache folder, like what would happen if a container were misconfigured to use the disk path instead of /mnt/user.

    Good idea, but mover still doesn't run, with the same error. Looking at the mover script, it checks for the existence of the user0 mount point:

    if ! mountpoint -q /mnt/user0 ; then
        echo "mover: cache not present, or only cache present"
        exit 3
    fi

    So even if a /mnt/user0 folder were created manually, the mover script still wouldn't run, since a plain folder isn't a mount point. Also, looking more carefully at the OP's log snippet, you can see that mover exited because of the same check:

     

    Jan  5 18:00:01 unraid crond[1826]: exit status 3 from user root /usr/local/sbin/mover &> /dev/null

    Exit status 3 is because the /mnt/user0 mount point doesn't exist; the difference in how it was logged for me is just mover logging being enabled vs. disabled. So the mover script wasn't actually running for the OP either, and I can't see how it could have caused the errors, though it's a bit suspicious that the errors start 30 seconds after the mover script is called. Coding isn't really in my wheelhouse, so I'm not sure if it's related or not.
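     

    If anyone wants to verify that guard for themselves, here's a minimal sketch (the /tmp path and script name are just stand-ins; only the check itself comes from the mover script):

     

    #!/bin/bash
    # Save as e.g. /tmp/guard-test.sh and run: bash /tmp/guard-test.sh ; echo $?
    # It prints the message and returns 3, matching the cron log line above.
    mkdir -p /tmp/fake/user0            # a plain directory, not a mount point
    if ! mountpoint -q /tmp/fake/user0 ; then
        echo "mover: cache not present, or only cache present"
        exit 3
    fi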

     

    Link to comment
    1 hour ago, johnnie.black said:

    Looking at the mover script, it checks for the existence of the user0 mount point

    I thought user0 was deprecated and no longer used by Mover.

    Link to comment
    2 minutes ago, trurl said:

    I thought user0 was deprecated and no longer used by Mover.

    It's not used for the move operation as it was before with rsync, but it's apparently still used for that sanity check.

    Link to comment

