• SQLite DB Corruption testers needed


    limetech
    • Closed

9/17/2019 Update: we may have gotten to the bottom of this.  Please try 6.7.3-rc3, available on the next branch.

9/18/2019 Update: 6.7.3-rc4 is available to address the "Very Slow Array Concurrent Performance" issue.

     


     

Trying to get to the bottom of this...  First, we have not been able to reproduce the corruption, which is odd because it implies there may be some kind of hardware/driver dependency.  Nevertheless, I want to start a series of tests, which I know will be painful for some, since every time DB corruption occurs you have to go through a lengthy rebuild process.  That said, we would really appreciate anyone's input during this time.

     

The idea is that we are only going to change one thing at a time.  We can either start with 6.6.7 and update things until it breaks, or start with 6.7.2 and revert things until it's fixed.  Since my best guess at this point is that the issue lies with the Linux kernel, Docker, or something we have misconfigured (not one of the hundred other packages we updated), we are going to start with the 6.7.2 code base and see if we can make it work.

     

But actually, the first stab at this is not reverting anything; rather, it is updating the Linux kernel to the latest 4.19 patch release, which is 4.19.60 (6.7.2 uses kernel 4.19.55).  In skimming the kernel change logs, nothing jumps out as a possible fix; however, I want to start with the easiest and least impactful change: updating to the latest 4.19 kernel.

     

    If this does not solve the problem (which I expect it won't), then we have two choices:

     

1) update to the latest stable Linux kernel (5.2.2) - we are using the 5.2 kernel in Unraid 6.8-beta and so far no one has reported any SQLite DB corruption, though the sample set is pretty small.  The downside is that not all out-of-tree drivers build against the 5.2 kernel yet, so some functionality would be lost.

     

2) downgrade Docker from 18.09.06 (the version in 6.7.2) to 18.06.03-ce (the version in 6.6.7).  (A quick way to confirm which kernel and Docker versions a test build is actually running is sketched just after this list.)

    [BTW the latest Docker release 19.03.00 was just published today - people gripe about our release numbers, try making sense of Docker release numbers haha]
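Since the whole point is to change one variable at a time, it helps to record exactly which kernel and Docker versions a given test build is actually running when you report results. A minimal sketch, assuming the docker CLI is on the PATH as it is on a stock Unraid install:

```python
#!/usr/bin/env python3
"""Print the running kernel and Docker engine versions for a test report."""
import platform
import subprocess

# Kernel release string, e.g. "4.19.60-Unraid"
kernel = platform.release()

# Docker engine (server) version, e.g. "18.09.6"
result = subprocess.run(
    ["docker", "version", "--format", "{{.Server.Version}}"],
    capture_output=True, text=True,
)
docker = result.stdout.strip() if result.returncode == 0 else "not available"

print(f"kernel: {kernel}")
print(f"docker: {docker}")
```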

     

If neither of those steps succeeds then ... well, let's hope one of them does.

     

To get started, first make a backup of your flash via Main/Flash/Flash Backup, and then switch to the 'next' branch via the Tools/Upgrade OS page.  There you should see version 6.7.3-rc1.
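For anyone who wants a scriptable copy in addition to the Main/Flash/Flash Backup button, here is a minimal sketch; it assumes the flash device is mounted at /boot, as on a stock install, and the destination share is only a placeholder:

```python
#!/usr/bin/env python3
"""Zip the flash drive contents before switching branches (sketch only)."""
import shutil
from datetime import datetime
from pathlib import Path

FLASH = Path("/boot")                   # flash mount point on a stock Unraid install
DEST = Path("/mnt/user/backups")        # placeholder: any share or disk with free space
DEST.mkdir(parents=True, exist_ok=True)

stamp = datetime.now().strftime("%Y%m%d-%H%M")
archive = shutil.make_archive(str(DEST / f"flash-backup-{stamp}"), "zip", root_dir=str(FLASH))
print(f"flash backed up to {archive}")
```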

     

    As soon as a couple people report corruption I'll publish an -rc2, probably with reverted Docker.




    User Feedback

    Recommended Comments



    47 minutes ago, BRiT said:

Just to rule this out as the cause...

     

    Those with Plex corruption, are you using Graphics Hardware Acceleration for Transcoding? Anyone getting the corruption and not using GPU acceleration?

     

Also, has anyone with the corruption tried running with CPU Security Mitigations disabled via the additional plugin and still gotten corruption?

    I was getting corruption with NO GPU acceleration. 

     

    Thanks

    6 minutes ago, isrdude said:

    I just had a brain malfunction and haven't recovered....where am I going to get 6.7.3-rc3???

Tools/Update OS - select the 'next' branch.

    46 minutes ago, limetech said:

    We may have got to the bottom of this.  Please try new version 6.7.3-rc3 available on next branch.

     

Looking forward to trying this.  What was the issue in the end?

    18 minutes ago, TheBuz said:

     

Looking forward to trying this.  What was the issue in the end?

There was a kernel driver internal API change a few releases back that I missed, and md/unraid was doing something that is no longer valid.  I noticed this, put a fix in the upcoming 6.8, and gave it to someone who could reproduce the corruption.  It has been running far longer than it ever did before, so I think it is safe for wider testing.  I back-ported the change to 6.7.3-rc3 and also updated to the latest 4.19 kernel patch release, because, why not?

    1 hour ago, limetech said:

There was a kernel driver internal API change a few releases back that I missed, and md/unraid was doing something that is no longer valid.  I noticed this, put a fix in the upcoming 6.8, and gave it to someone who could reproduce the corruption.  It has been running far longer than it ever did before, so I think it is safe for wider testing.  I back-ported the change to 6.7.3-rc3 and also updated to the latest 4.19 kernel patch release, because, why not?

    Good news!  I hope that it fixes the issues, and I look forward to seeing how it tests out over the next few weeks. 

    7 hours ago, limetech said:

There was a kernel driver internal API change a few releases back that I missed, and md/unraid was doing something that is no longer valid.  I noticed this, put a fix in the upcoming 6.8, and gave it to someone who could reproduce the corruption.  It has been running far longer than it ever did before, so I think it is safe for wider testing.  I back-ported the change to 6.7.3-rc3 and also updated to the latest 4.19 kernel patch release, because, why not?

Is there any chance this change will also fix the slow reads from the array when writes happen?  If so, do you want that to be tested as well?

    10 hours ago, itimpi said:

Is there any chance this change will also fix the slow reads from the array when writes happen?  If so, do you want that to be tested as well?

No, but I just released 6.7.3-rc4, which should address this. Nope, fooled by caching.
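For anyone re-testing the slow-read issue, the "fooled by caching" trap is easy to fall into: re-reading a file that is already in the page cache looks fast no matter what the array is doing. A rough sketch of a fairer read test (it must run as root so the cache can be dropped, and the test file path is only a placeholder):

```python
#!/usr/bin/env python3
"""Time an array read with the page cache dropped first (sketch, run as root)."""
import os
import time

TEST_FILE = "/mnt/disk1/some-large-file.bin"   # placeholder: any multi-GB file on the array
CHUNK = 1 << 20                                # read in 1 MiB chunks

# Flush dirty pages, then drop clean caches so the read actually hits the disks.
os.sync()
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3\n")

total = 0
start = time.monotonic()
with open(TEST_FILE, "rb") as f:
    while True:
        chunk = f.read(CHUNK)
        if not chunk:
            break
        total += len(chunk)
elapsed = time.monotonic() - start

print(f"read {total / 1e6:.0f} MB in {elapsed:.1f} s ({total / 1e6 / elapsed:.1f} MB/s)")
```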


I updated to rc3 this morning and have been hitting Sonarr/Radarr/Plex hard all day loading new media as a sort of stress test.  Sad to report that both the Sonarr and Radarr databases have become corrupted at some point in the last hour or two.  I started the morning with uncorrupted (but not fresh) databases.  I am just going to restore from the backup I made earlier and hope it doesn't happen again.  If there is anything else I should do to prevent this, let me know.

    2 hours ago, rzeeman711 said:

I updated to rc3 this morning and have been hitting Sonarr/Radarr/Plex hard all day loading new media as a sort of stress test.  Sad to report that both the Sonarr and Radarr databases have become corrupted at some point in the last hour or two.  I started the morning with uncorrupted (but not fresh) databases.  I am just going to restore from the backup I made earlier and hope it doesn't happen again.  If there is anything else I should do to prevent this, let me know.

A diagnostics.zip would be nice: Tools/Diagnostics.

     

Also, if possible please repeat with fresh databases.  I know this is a PITA, but it's an important data point.  It's possible they appeared uncorrupted but in fact were not.
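On the "appeared uncorrupted but in fact were not" point: SQLite can report damage before the application ever trips over it. A minimal sketch testers could run against a stopped container's database; the default path below is only an example, so point it at your own Sonarr/Radarr/Plex database file:

```python
#!/usr/bin/env python3
"""Run PRAGMA integrity_check against an app's SQLite database (stop the container first)."""
import sqlite3
import sys

# Example path only -- adjust to wherever your appdata actually lives.
db_path = sys.argv[1] if len(sys.argv) > 1 else "/mnt/user/appdata/sonarr/sonarr.db"

try:
    con = sqlite3.connect(db_path)
    rows = con.execute("PRAGMA integrity_check;").fetchall()
    con.close()
except sqlite3.DatabaseError as exc:
    # Badly corrupted files raise here, e.g. "database disk image is malformed"
    print(f"{db_path}: FAILED ({exc})")
    sys.exit(1)

if rows == [("ok",)]:
    print(f"{db_path}: integrity_check passed")
else:
    print(f"{db_path}: integrity_check reported problems:")
    for (msg,) in rows:
        print("  ", msg)
    sys.exit(1)
```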

    19 hours ago, limetech said:

No, but I just released 6.7.3-rc4, which should address this. Nope, fooled by caching.

So I updated to rc4 earlier in the day - does it include the rc3 update as well, or should I just downgrade to rc3?

     

    thanks

     


Now that we presumably know the source of the corruption, do we know if any other type of data would have been affected?  Do we know what type of disk activity would have resulted in corruption?

    On 9/17/2019 at 5:33 PM, limetech said:

    We may have got to the bottom of this.  Please try new version 6.7.3-rc3 available on next branch.

     

Tom, any reason you are no longer posting that RCs are available in the Prerelease forum?  The last one I see is "Unraid OS version 6.7.0-rc8 available".

     

    Paul

    11 hours ago, Pauven said:

Tom, any reason you are no longer posting that RCs are available in the Prerelease forum?  The last one I see is "Unraid OS version 6.7.0-rc8 available".

No, but you're right, I should have done it that way.


    Posted the same thing on the unRAID Users & Help page on FB, but here's my status:

    Upgraded to the new release 2 days ago, on 9/19/2019.

Have a Sonarr container. Rebuilt the DB from scratch last night, using the path /mnt/user/... The DB is already corrupt, with the Sonarr logs showing "yesterday" as the time.

Have a Radarr container. Using the last good backup I could find, with the path /mnt/disk*/... The DB is also corrupt; the first error shows at 3:51 PM today (CST), so it made it a little longer.


    If there's anything else I can do to help or test... *please* let me know. I'd love to get this resolved.

    Attached diagnostics file...

    drogon-diagnostics-20190920-2109.zip

    1 hour ago, digitalhigh said:

    Posted the same thing on the unRAID Users & Help page on FB, but here's my status:

    Upgraded to the new release 2 days ago, on 9/19/2019.

Have a Sonarr container. Rebuilt the DB from scratch last night, using the path /mnt/user/... The DB is already corrupt, with the Sonarr logs showing "yesterday" as the time.

Have a Radarr container. Using the last good backup I could find, with the path /mnt/disk*/... The DB is also corrupt; the first error shows at 3:51 PM today (CST), so it made it a little longer.


    If there's anything else I can do to help or test... *please* let me know. I'd love to get this resolved.

    Attached diagnostics file...

drogon-diagnostics-20190920-2109.zip (77.73 kB)

     

    Thank you for the report.

    When were diags captured, I'm guessing after you discovered corruption?

    Can you post the text of the error you see that reports the corruption?

    3 hours ago, digitalhigh said:

    If there's anything else I can do to help or test... *please* let me know. I'd love to get this resolved.

Also, please start the array in Maintenance mode, then click on each of your data disks and run the Check File System Status utility, and let me know if any errors are reported.
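For reference, my understanding is that the GUI check is a front end for the filesystem's own checker run in no-modify mode; for XFS data disks the manual equivalent is roughly xfs_repair -n against the parity-protected /dev/mdN devices while the array is started in Maintenance mode. A sketch under that assumption (read-only, but still only run it with the array in Maintenance mode):

```python
#!/usr/bin/env python3
"""Read-only XFS check of the array's md devices (sketch; array in Maintenance mode)."""
import glob
import subprocess

# Assumption: data disks are XFS and appear as /dev/md1, /dev/md2, ... once the
# array is started.  "-n" means no-modify: report problems, change nothing.
for dev in sorted(glob.glob("/dev/md[0-9]*")):
    print(f"=== {dev} ===")
    result = subprocess.run(["xfs_repair", "-n", dev], capture_output=True, text=True)
    output = (result.stdout + result.stderr).strip()
    print(output)
    print("-> no problems detected" if result.returncode == 0 else f"-> exit code {result.returncode}")
```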

    5 hours ago, limetech said:

Also, please start the array in Maintenance mode, then click on each of your data disks and run the Check File System Status utility, and let me know if any errors are reported.

Have any test builds been tried internally with the fuse3-3.6.2 package just released by Slackware?

     

    The 2.xx series is now EOL

    7 hours ago, limetech said:

     

    Thank you for the report.

    When were diags captured, I'm guessing after you discovered corruption?

    Can you post the text of the error you see that reports the corruption?

    Diags were captured today, as soon as I discovered corruption.

The text of the error is the usual Radarr/Sonarr message: "Database image is malformed".

     

    5 hours ago, limetech said:

Also, please start the array in Maintenance mode, then click on each of your data disks and run the Check File System Status utility, and let me know if any errors are reported.

My scheduled Parity check is running presently; I will do this and report back tomorrow.

    8 hours ago, Dazog said:

Have any test builds been tried internally with the fuse3-3.6.2 package just released by Slackware?

     

    The 2.xx series is now EOL

Unraid OS 6.8 uses FUSE 3.6.2; however, I don't think FUSE has anything to do with this issue.
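On the FUSE question more generally: /mnt/user paths go through the shfs (FUSE) user-share layer, while /mnt/diskN paths hit the disk filesystem directly, and the reports above show corruption on both, which is consistent with FUSE not being the culprit. A small sketch to confirm which kind of mount a given appdata path actually sits on, assuming the usual layout where /proc/mounts lists the user shares with a fuse-type filesystem:

```python
#!/usr/bin/env python3
"""Report which mount and filesystem type a given path lives on (FUSE share vs direct disk)."""
import os
import sys

def mount_for(path):
    """Return (mount_point, fs_type) of the longest mount point containing *path*."""
    real = os.path.realpath(path)
    best = ("", "unknown")
    with open("/proc/mounts") as f:
        for line in f:
            _, mount_point, fs_type = line.split()[:3]
            prefix = mount_point.rstrip("/") + "/"
            if (real == mount_point or real.startswith(prefix)) and len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best

path = sys.argv[1] if len(sys.argv) > 1 else "/mnt/user/appdata"   # example path only
mount_point, fs_type = mount_for(path)
layer = "FUSE user share" if fs_type.startswith("fuse") else "direct disk / other"
print(f"{path} -> mounted at {mount_point} ({fs_type}): {layer}")
```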




    This is now closed for further comments
