• SQLite Data Corruption testing


    limetech

    tldr: Starting with 6.8.0-rc2 please visit Settings/Disk Settings and change the 'Tunable (scheduler)' to 'none'.  Then run with SQLite DB files located on array disk shares and report whether your databases still become corrupted.
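For anyone who wants to confirm from a terminal which scheduler is actually active after changing the setting, here is a minimal sketch. It assumes the standard Linux sysfs layout, where `/sys/block/<device>/queue/scheduler` lists the available schedulers with the active one in square brackets. The function names are illustrative and are not part of Unraid.

```python
from pathlib import Path


def active_scheduler(line: str) -> str:
    """Return the bracketed (active) entry from a sysfs scheduler line."""
    for token in line.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    # Assumption: a device with no selectable scheduler shows a bare word
    # (for example "none"), so fall back to the stripped line.
    return line.strip()


def schedulers_by_device(sys_block: str = "/sys/block") -> dict:
    """Map each block device name to its active I/O scheduler."""
    result = {}
    for path in sorted(Path(sys_block).glob("*/queue/scheduler")):
        # path is <sys_block>/<device>/queue/scheduler
        result[path.parent.parent.name] = active_scheduler(path.read_text())
    return result


if __name__ == "__main__":
    for device, scheduler in schedulers_by_device().items():
        print(f"{device}: {scheduler}")
```

After setting 'Tunable (scheduler)' to 'none', the array devices would be expected to report `none` here.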

     

    When we first started looking into this issue one of the first things I ran across was this monster topic:
    https://bugzilla.kernel.org/show_bug.cgi?id=201685

    and related patch discussion:
    https://patchwork.kernel.org/patch/10712695/


    This bug is very similar to what we're seeing.  In addition, Unraid 6.6.7 is on the last of the 4.18 kernels (4.18.20), Unraid 6.7 is on the 4.19 kernel, and 6.8 is currently on 5.3.  The SQLite DB corruption bug also only started happening with 4.19, so I don't think this is a coincidence.

    Looking at the 5.3 code, the patch above is not there; however, I ran across a later commit that reverted that patch and solved the bug a different way:
    https://www.spinics.net/lists/linux-block/msg34445.html

    That set of changes is in the 5.3 code.

    I'm thinking perhaps their "fix" is not properly handling some I/O pattern that SQLite via md/unraid is generating.

     

    Before I go off and revert the kernel to 4.18.20, please test if setting the scheduler to 'none' makes any difference in whether databases become corrupted.
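To make "corrupted or not" reports comparable between testers, one option is to ask SQLite itself. The sketch below is only an illustration (the function name is hypothetical, not part of Unraid or any app mentioned here): it opens a database read-only and runs SQLite's built-in `PRAGMA integrity_check`, which returns a single row `ok` when no problems are found. It is probably safest to stop the container that owns the database before running it.

```python
import sqlite3
from pathlib import Path


def integrity_problems(db_path: str) -> list:
    """Run PRAGMA integrity_check; return [] when SQLite reports 'ok'."""
    # Build a file: URI so the database is opened read-only and the check
    # can never create or modify the file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = [row[0] for row in conn.execute("PRAGMA integrity_check")]
    except sqlite3.DatabaseError as exc:
        # Badly damaged files can fail before the pragma returns any rows.
        return [str(exc)]
    finally:
        conn.close()
    return [] if rows == ["ok"] else rows
```

An empty list means SQLite found nothing wrong; anything else is the list of problems it reported, which is worth pasting into a reply along with the settings in use.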




    User Feedback




    4 minutes ago, limetech said:

    Thank you for your testing, very much appreciated.

     

    If you think this is further than ever before, please Stop array and set md_restrict to 1 and let 'er run.

    Actually, I have to wait until I get home.  I'm using the openvpn docker. 

     

     

    Link to comment

    I am having the same problem. I set md_restrict to 0 and have been stable for the last 4 hours. The longest I lasted in the past with the other options was 20 minutes.

    Link to comment
    15 minutes ago, Fitz1015 said:

    I am having the same problem. I set md_restrict to 0 and have been stable for the last 4 hours. The longest I lasted in the past with the other options was 20 minutes.

    If you don't mind, please try setting to 1.  This will tell me which of the two is the culprit.

    Link to comment

    Updated to rc4, been on it for about 5 hours, no corruption.  Set "mdcmd set md_restrict 2"

     

    Will stress it more later.

    Link to comment
    7 minutes ago, mi5key said:

    Updated to rc4, been on it for about 5 hours, no corruption.  Set "mdcmd set md_restrict 2"

    Do you mean you ran it without any changes and it ran for 5 hours, and then you set to 2,

    OR

    You set to 2 at the beginning?

    Link to comment
    16 hours ago, limetech said:

    I suggest trying first:

    
    mdcmd set md_restrict 2

     

    If still corruption, let's try:

    
    mdcmd set md_restrict 0

    @limetech I followed this advice.  Should I set to something else?

    Edited by mi5key
    Link to comment
    9 minutes ago, mi5key said:

    I followed this advice.  Should I set to something else?

    The instructions said to run first without using 'mdcmd set md_restrict' at all (because there was a small change which I didn't think would make a difference but wanted to test nevertheless).  THEN if corruption noticed try with value set to 2, and then if still corruption try with value 1.  Sorry my instructions were written late at night and I was probably not very clear.

     

    I'm looking for consistency.  A few posts up Rich had set value to 2 and reported corruption but you have set value to 2 and do not see corruption - which is what I think you are saying, I was just wanting to make sure.  I think with value 2 you will indeed eventually see corruption - this issue is very much timing dependent, meaning it's determined by your combination of CPU speed, memory speed, disk controller speed, hard disk characteristics, and current phase of moon apparently.

     

    The last test, setting the value to 1 with no corruption, tells me readahead handling is the culprit.

    Link to comment
    8 hours ago, limetech said:

    Thank you for your testing, very much appreciated.

     

    If you think this is further than ever before, please Stop array and set md_restrict to 1 and let 'er run.

    I just made the change to md_restrict to 1.  7:25pm.    We will see how it goes. 

    Link to comment
    2 hours ago, limetech said:

    If you don't mind, please try setting to 1.  This will tell me which of the two is the culprit.

    I have just made the change. I will note that I did try the 2 setting, and it showed corruption within 20 minutes.

    I also want to note my current settings, as they are easy to lose in a chain of posts :)

    appdata: /mnt/disk1/appdata

    Tunable (scheduler): kyber

    mdcmd set md_restrict 1

    I will report back in a couple of hours and let you know how it's going.

    Link to comment
    1 minute ago, Fitz1015 said:

    I have just made the change. I will note that I did try the 2 setting, and it showed corruption within 20 minutes.

    I also want to note my current settings, as they are easy to lose in a chain of posts :)

    appdata: /mnt/disk1/appdata

    Tunable (scheduler): kyber

    mdcmd set md_restrict 1

    I will report back in a couple of hours and let you know how it's going.

    Well, that didn't take long. I have corruption again. I am going to go back to the 0 setting and let it run again.

     

    System.Data.SQLite.SQLiteException (0x80004005): database is locked
    database is locked
      at System.Data.SQLite.SQLite3.Step (System.Data.SQLite.SQLiteStatement stmt) [0x00088] in <61a20cde294d4a3eb43b9d9f6284613b>:0
      at System.Data.SQLite.SQLiteDataReader.NextResult () [0x0016b] in <61a20cde294d4a3eb43b9d9f6284613b>:0
      at System.Data.SQLite.SQLiteDataReader..ctor (System.Data.SQLite.SQLiteCommand cmd, System.Data.CommandBehavior behave) [0x00090] in <61a20cde294d4a3eb43b9d9f6284613b>:0
      at (wrapper remoting-invoke-with-check) System.Data.SQLite.SQLiteDataReader..ctor(System.Data.SQLite.SQLiteCommand,System.Data.CommandBehavior)
      at System.Data.SQLite.SQLiteCommand.ExecuteReader (System.Data.CommandBehavior behavior) [0x0000c] in <61a20cde294d4a3eb43b9d9f6284613b>:0
      at System.Data.SQLite.SQLiteCommand.ExecuteNonQuery (System.Data.CommandBehavior behavior) [0x00006] in <61a20cde294d4a3eb43b9d9f6284613b>:0
      at System.Data.SQLite.SQLiteCommand.ExecuteNonQuery () [0x00006] in <61a20cde294d4a3eb43b9d9f6284613b>:0
      at Marr.Data.QGen.UpdateQueryBuilder`1[T].Execute () [0x0003b] in M:\BuildAgent\work\5d7581516c0ee5b3\src\Marr.Data\QGen\UpdateQueryBuilder.cs:157
      at Marr.Data.DataMapper.Update[T] (T entity, System.Linq.Expressions.Expression`1[TDelegate] filter) [0x00000] in M:\BuildAgent\work\5d7581516c0ee5b3\src\Marr.Data\DataMapper.cs:674
      at NzbDrone.Core.Datastore.BasicRepository`1[TModel].Update (TModel model) [0x0002a] in M:\BuildAgent\work\5d7581516c0ee5b3\src\NzbDrone.Core\Datastore\BasicRepository.cs:125
      at NzbDrone.Core.Tv.SeriesService.UpdateSeries (NzbDrone.Core.Tv.Series series, System.Boolean updateEpisodesToMatchSeason) [0x000a9] in M:\BuildAgent\work\5d7581516c0ee5b3\src\NzbDrone.Core\Tv\SeriesService.cs:160
      at NzbDrone.Core.Tv.RefreshSeriesService.RefreshSeriesInfo (NzbDrone.Core.Tv.Series series) [0x00213] in M:\BuildAgent\work\5d7581516c0ee5b3\src\NzbDrone.Core\Tv\RefreshSeriesService.cs:110
      at NzbDrone.Core.Tv.RefreshSeriesService.Execute (NzbDrone.Core.Tv.Commands.RefreshSeriesCommand message) [0x000d2] in M:\BuildAgent\work\5d7581516c0ee5b3\src\NzbDrone.Core\Tv\RefreshSeriesService.cs:188

    Link to comment
    2 hours ago, limetech said:

    The instructions said to run first without using 'mdcmd set md_restrict'

    Alright then, testing with default.

    Link to comment
    11 hours ago, Rich Minear said:

    I just made the change to md_restrict to 1.  7:25pm.    We will see how it goes. 

    Good morning.  With md_restrict set to 1, I ran all night long with no corruption.  I started a few jobs before I went to bed to put some of the apps through some work.  They all finished without problem.  I checked every app this morning that has a sqlite db, and there was no corruption. 

     

    I plan on trying to stress the machine a bit this morning.  I will let you know the results.  🙂
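Checking every app by hand means first knowing which files under appdata are SQLite databases. As a rough sketch (the function name is hypothetical, and the root path in the usage note is only an example taken from this thread), the files can be located by their header: every SQLite 3 database begins with the 16-byte string `SQLite format 3` followed by a NUL byte. A file whose header itself has been damaged would be missed by this scan.

```python
import os

# First 16 bytes of every SQLite 3 database file.
SQLITE_MAGIC = b"SQLite format 3\x00"


def find_sqlite_files(root: str) -> list:
    """Return sorted paths under root whose header matches SQLite 3."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    if handle.read(16) == SQLITE_MAGIC:
                        found.append(path)
            except OSError:
                # Unreadable files (permissions, sockets) are skipped.
                continue
    return sorted(found)
```

For example, `find_sqlite_files("/mnt/disk1/appdata")` would list candidate databases on a system laid out like the one described earlier in the thread.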

    Link to comment
    3 hours ago, Rich Minear said:

    Good morning.  With md_restrict set to 1, I ran all night long with no corruption.  I started a few jobs before I went to bed to put some of the apps through some work.  They all finished without problem.  I checked every app this morning that has a sqlite db, and there was no corruption. 

     

    I plan on trying to stress the machine a bit this morning.  I will let you know the results.  🙂

    It has been about 3 hours, and there is no corruption with md_restrict set to 1.  I ran a couple of movies through the dockers and file system, and performed some maintenance within Plex to try to stress the DB a bit.  So far there are no hiccups in Plex (my main culprit), and the rest of the dockers with sqlite look good as well.

     

    @limetech:  Is there  anything you want or need here?  Diags? Captures?   It will be good to let it run through the weekend with the normal usage and see how things go. 

    Link to comment

    Just a note for others testing this: as Limetech said, you must stop the array before issuing the command for it to make a difference.

     

    To make this effective you must Stop array, type the command in terminal window, then Start array (with no browser refresh in-between).

    Link to comment
    3 minutes ago, limetech said:

    How has the testing gone over the weekend?

    Is this the magic setting?

    
    mdcmd set md_restrict 1

     

    I just connected to the server, and checked it again.  No corruption of the databases.  I had the server up all weekend (minus a short period where I shut down the entire system to move it to another room).  No issues at all.  We watched a couple of TV shows from Plex, and I saw that several new shows and files were added over the weekend.

     

    Yes....md_restrict 1 is where it is set right now. 

     

    The system has not stayed stable for this long since the update from 6.6.7. 

     

     

    Link to comment

    Wanted to throw my hat in this ring. I have a brand new server (as of Thursday) running on 6.8 RC4, and after migrating the Plex server/data over to it, I have been experiencing a corrupted database after about 24 hours (meaning I've restored the database 3 times so far, and that's how long it has taken me to notice the corruption). I restored the Plex database to its original version (repair attempts didn't work for me). This morning I took the array offline and issued the 'mdcmd set md_restrict 1' command, then restarted the array/Plex. Crossing my fingers for the next 24 hours.

    Link to comment

    I was just looking at it, and everything is stable so far! No hint of a corrupted database (I could tell in the past with episodes not getting marked watched properly, and generally the interface would load slower over time).

    Link to comment
    2 minutes ago, limetech said:

    To not ever fail read aheads.

    Good job on coming out the other side of this rabbit hole btw. Didn't affect me as I use cache drives, but was amazing to watch the gears turning through the RC's and 6.7.* releases.

    Edited by cybrnook
    Link to comment


