• SQLite Data Corruption testing


    limetech

    tldr: Starting with 6.8.0-rc2 please visit Settings/Disk Settings and change the 'Tunable (scheduler)' to 'none'.  Then run with SQLite DB files located on array disk shares and report whether your databases still become corrupted.
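For anyone who'd rather verify the change from a terminal, here is a minimal sketch. The sysfs layout is standard Linux, but the device names and the example scheduler line are assumptions; adjust `sd*` to your own disks:

```shell
# Show each disk's scheduler line; the active scheduler is the [bracketed] entry.
for f in /sys/block/sd*/queue/scheduler; do
    [ -e "$f" ] && printf '%s: %s\n' "$f" "$(cat "$f")"
done

# Helper: extract the active (bracketed) scheduler from such a line.
active_sched() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# The webGui setting boils down to a sysfs write, e.g. (as root):
#   echo none > /sys/block/sdb/queue/scheduler
echo 'mq-deadline kyber bfq [none]' | active_sched    # prints: none
```

With the setting applied, each array data disk's scheduler line should show [none] as the active entry.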

     

    When we first started looking into this issue one of the first things I ran across was this monster topic:
    https://bugzilla.kernel.org/show_bug.cgi?id=201685

    and related patch discussion:
    https://patchwork.kernel.org/patch/10712695/


This bug is very similar to what we're seeing.  In addition, Unraid 6.6.7 is on the last of the 4.18 kernels (4.18.20), Unraid 6.7 is on the 4.19 kernel, and 6.8 is currently on 5.3.  The SQLite DB corruption bug also only started happening with 4.19, so I don't think this is a coincidence.

Looking at the 5.3 code, the patch above is not present; however, I ran across a later commit series that reverted that patch and solved the bug a different way:
    https://www.spinics.net/lists/linux-block/msg34445.html

    That set of changes is in 5.3 code.

    I'm thinking perhaps their "fix" is not properly handling some I/O pattern that SQLite via md/unraid is generating.

     

Before I go off and revert the kernel to 4.18.20, please test whether setting the scheduler to 'none' makes any difference in whether databases become corrupted.




    User Feedback

    Recommended Comments



    Hi all

    I upgraded to 6.8.0 rc3 and set Disk Settings->Tunable (scheduler) to None.

The Sonarr docker was clean-installed, removing every existing trace from appdata.

Then I started my normal Sonarr manual-import routine to put load on the system.

At the same time I started copying data from my other unRaid server to the new one, via my desktop.  Total data to be copied: around 300 GB in 800 files.

     

    Two things I noticed.

a. The first time around, I started the manual import of a season, and then started the file transfer.  The file transfer in this case did not seem to start, but rather sat in a "paused" state.  I did not pay much attention and stopped the copy, thinking that the initial read from my prod unRaid was just taking too long.  The import finished successfully, and Plex was actually building thumbnails in the background without any issue.

b. Then I started the manual import of another season.  However, this time I started the copy from the prod server before I started the manual import in Sonarr.  The copy seemed to be running at 58 MB/s to 80 MB/s, which, considering data was read from server 1 to my desktop and then sent to server 2 through a single gigabit LAN connection, was very good.  I would think the network connection was actually saturated.  This time around, when the import started after the file transfer, the series import seemed "paused" or progressed very slowly instead.  If I manually paused the file transfer, the season import would pick up speed.  When I unpaused the file transfer, the season import would slow down to almost a halt.

     

The array consists of four 8 TB drives; these are all WD 5400 RPM drives except for the parity disk, which is a 7200 RPM disk.

Since all disks are empty, only disk1 is getting any activity for now, even though the media library path, where Sonarr moves all the episodes, is situated on a user share spanning all three data drives.

     

Setting the scheduler to 'none' seems to have a performance impact as far as concurrent block-write activity is concerned.  The default block device scheduler mechanism seems not to split I/O cycles equally among all processes (or even try to), but rather to somehow prefer one process over another.

     

So far no DB corruption; however, only 45 GB have been imported so far.  I will keep the import going until I have around 500 GB worth of media in, and report back.

     

     


    Good morning.  I upgraded my system from 6.6.7 to 6.8.0 rc3 this morning.  Before I upgraded, I stopped all of my docker containers, made sure that each database was clean, no errors.  And then I made backup copies of each.  My docker containers are:

    Plex

    Sonarr

    Radarr

    Ombi

    Sabnzbd

    tautulli
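One quick way to do that clean check (and to spot corruption between runs) is SQLite's built-in integrity check. A sketch using the sqlite3 CLI; the example path is hypothetical, so point it at the actual appdata databases:

```shell
# Print "clean" if PRAGMA integrity_check returns ok, else dump the errors.
check_db() {
    result=$(sqlite3 "$1" 'PRAGMA integrity_check;')
    if [ "$result" = "ok" ]; then
        echo "$1: clean"
    else
        echo "$1: CORRUPT"
        printf '%s\n' "$result"
    fi
}

# Hypothetical example path:
# check_db /mnt/user/appdata/plex/Library.db
```

Run it against each container's database while the container is stopped, so the file is not being written mid-check.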

     

    I made the change in disk settings to set the tunable scheduler to NONE. 

     

I then kicked off some activity in Sonarr, Radarr, and Plex to put a bit of stress on the system.  I will check it every hour today to see what I can find.

     

    Attached are the diagnostics immediately after starting the system, setting tunable to none, and starting my dockers.  No activity started in apps yet.  Just a starting point to look at. 

     

    Thanks

     

     

    swissarmy-diagnostics-20191018-1235.zip


    I just got to work, and checked the system via vpn:  The database for the Plex server corrupted at 9:32am.  That is roughly 1 hour and 45 minutes after upgrading the OS, making the tuning changes, and starting the docker containers. 

     

    The Sonarr database does not appear to have any corruption. 

    The Radarr database does not appear to have any corruption.

    The Sabnzbd database does not appear to have any corruption.

    The Ombi database does not appear to have any corruption.  The logs show errors, but appear to be related to its connection to Plex (failing because of the db corruption with Plex).

    The tautulli database does not appear to have any corruption.

     

    Attached are the diagnostics from 9:49am...roughly two hours from the upgrade.

     

I will wait an hour to see if anyone wants me to make changes.  If not, I will revert the Plex database to the known good db, restart, and let it catch back up.

     

    Thanks

    swissarmy-diagnostics-20191018-1448.zip


    After waiting an hour with all of the dockers down, I did the following:

    1.  I kept a copy of the corrupt database (Plex), but renamed. 

    2.  I copied the known good database (Plex) that had been running on  6.6.7 (stable for over 3 weeks) back to the default database for the Plex docker.

    3.  I restarted Plex, and started to let Plex catch back up with the tv shows and movies that I had placed into the file system.  (10:56am Central Time).

    4.  I restarted all of the rest of the docker containers. 
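Steps 1 and 2 above can be sketched as a small shell routine (the paths in the usage line are placeholders, not my actual layout):

```shell
# Keep the corrupt live DB under a timestamped name, then restore the
# known-good backup in its place.
reset_db() {
    live=$1 good=$2
    mv "$live" "$live.corrupt-$(date +%Y%m%d-%H%M)"   # step 1: keep it, renamed
    cp "$good" "$live"                                # step 2: restore the backup
}

# Hypothetical usage:
# reset_db /mnt/user/appdata/plex/Library.db /mnt/user/backups/plex-good.db
```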

     

    No changes were made to Unraid settings.

     

    At this point, I am back up and running.  I will continue to stress the system, and check it hourly. 

     

    Thanks.


    I happened to have the VPN alive, and was watching the log file for Plex with tail....

     

    The database corrupted at 11:39, which is roughly 40 minutes after moving back to a known good, clean database.  Plex had caught up with the new TV shows that were on the file system, and was in the process of analyzing the metadata for the TV shows. 

     

    I doubt that anything has changed on the unraid system, but updated diagnostics are attached.

     

    This is now a repeatable corruption...I will reset the database again, and restart. 

     

    @limetech Is there something that you would like for me to change in the Unraid system?  Otherwise, my belief is that I will just be repeating the issues that I have seen so far this morning. 

     

     

    swissarmy-diagnostics-20191018-1643.zip


    12:10 Central Time:  The Plex database corrupted again.  This was 22 minutes after Plex was restarted with a known good database file. 

     

    I see no reason to restart the docker containers at this point, as it will continue to corrupt a database without any changes being made to the Unraid system. 

     

    Diagnostics are attached....

     

    I will leave the system down for a while...until I receive some direction.  I'm willing to try what you want tested....

     

     

    swissarmy-diagnostics-20191018-1712.zip

    26 minutes ago, Rich Minear said:

12:10 Central Time:  The Plex database corrupted again.  This was 22 minutes after Plex was restarted with a known good database file. …

    1. will plex corrupt if it is the only docker running?

    2. what's your plex path mapping?

    39 minutes ago, Rich Minear said:

    I will leave the system down for a while...until I receive some direction.  I'm willing to try what you want tested....

    Sure, please do this:  First get everything ready for a "run" (array started, etc).  Then just before starting Plex, type this in terminal or console:

    mdcmd set md_trace 3

    This will result in logging all the I/O operations seen by md/unraid driver and will likely quickly fill your syslog.  But let it run for a while and then post diagnostics.
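If you want to watch the trace scroll by while the test runs, something like this works (the 'md:' prefix and the /var/log/syslog path are assumptions about how these entries are tagged, so adjust the pattern to what you actually see):

```shell
# Pass through only md driver lines, unbuffered so they appear live.
md_lines() {
    grep --line-buffered 'md:'
}

# Live view:  tail -f /var/log/syslog | md_lines
printf 'kernel: md: write stripe 123\nkernel: something else\n' | md_lines    # prints only the md: line
```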

     

    To turn this off either Stop array or type:

    mdcmd set md_trace 1

     


    Ok. Let's find out. 

     

I set Plex to be the only docker running.  All others (even non-SQL apps) are stopped.  I have attached a picture of the Plex path mapping.

     

    I ran the cmd as requested:  mdcmd set md_trace 3

     

    Then started Plex.  I will let Plex catch up the database with the file system, and then start a job in plex to refresh the metadata. 

     

    Started at 1:23pm Central Time.

     

    Just now, Rich Minear said:

Ok. Let's find out. …

    +picture of path for Plex

    Screen Shot 2019-10-18 at 1.21.21 PM.png


    Sorry to report that DB corruption appeared in Sonarr.

     

As I mentioned before, with the proposed setup the manual import process was taking too long if at the same time I was doing another file transfer.

So the first two rounds of manual import (around 300 GB) were actually done with no other load on the server.

In the third round I started a manual import of another 80 GB of media, a concurrent file transfer of another 600 GB to the server, and a series refresh in Sonarr.

     

File transfer speed was as before, ranging from 50 to 70 MB/s.

The actual import speed was a nightmare: in one hour fewer than 7 files were finished, for a total of 10 GB.

    And then the Database malformed error appeared in the Sonarr logs.

     

    Attaching the diagnostics

    alexandria-diagnostics-20191018-1837.zip

     


    Has anyone tried it keeping the Scheduler set to the default? Does that still cause corruption?

     

The reason I ask is that the kernel discussion thread is dense with (and at times conflicting) info, so the part where they talked about the scheduler didn't seem as straightforward as it should have been.  It was tough to see a consensus on the scheduler being set to any particular value versus being set to "none".

    14 minutes ago, BRiT said:

    Has anyone tried it keeping the Scheduler set to the default? Does that still cause corruption?

The default is 'mq-deadline', which is what all releases since 6.6.7 have used.  In Unraid 6.6 and earlier, the Linux 'mq' (multi-queue) code was optional (that is, in Linux kernels prior to 4.19; note Unraid 6.6 uses kernel 4.18).  Starting with 4.19 they got rid of the old scheduling code and replaced it with the 'mq' code.  With more people now using that code (because it is now mandatory), more bugs were shaken out, as detailed in the OP.  They got to work fixing those bugs, but my theory is we are doing something that reveals a latent bug still remaining.  If this is the case it will eventually turn up in other kernel bug reports, e.g., if someone puts a SQLite db onto a traditional Linux md/raid5 array it's very possible the bug will show up there (probably not a lot of people out there doing this), but can we wait that long?


I should add: we are not trying to shift blame onto some theoretical bug in the Linux scheduler.  But during the course of investigating this, I ruled out shfs, and am 99% sure I have ruled out md/unraid.  It's still possible the bug is there.


    It has been just over an hour with only Plex running and doing its own background work.  I've been watching the logs (plex), and I see work happening, and the normal errors/warnings about metadata, but so far...no database corruption.

     

    I still have the md trace running. 

     

    I will continue to watch it for a while. 


It has been about 2 hours and 45 minutes, and I cannot get the Plex db to corrupt.  I tried.  I forced it to do all kinds of work: database cleanup, metadata cleanup, etc.  But it would not fail.  (I guess this is a good thing!)

     

    I stopped the trace, and have attached the diags. 

     

Since Plex was the only docker running, I'm assuming we should turn on additional dockers one at a time.  I will start with Sabnzbd.  It was in the process of downloading some files, and will generate some additional disk work.  I will turn the trace back on.

     

     

    swissarmy-diagnostics-20191018-2100.zip

    33 minutes ago, Rich Minear said:

It has been about 2 hours and 45 minutes, and I cannot get the Plex db to corrupt. …

    Thank you!  I will analyze.  Maybe generating the syslog entries disturbs the timing enough that the error case does not happen :(   That would be a bummer.


Recap: Plex running in conjunction with Sabnzbd.  The latter was downloading a fairly large file, but had not finished it yet.  So it was writing to disk, but not moving a large number of .mkv files around.

     

    It took about 35 to 40 minutes for it to blow up.  The first section where I saw errors in the Plex log file was:

     

    Oct 18, 2019 16:25:40.852 [0x14710c9c4700] INFO - Library section 1 (TV Shows) will be updated because of a change in '"/media/Video/TV Shows/NCIS- Los Angeles/Season 11"/NCIS- Los Angeles - S11E01 - Let Fate Decide WEBDL-1080p Proper.en.srt'
    Oct 18, 2019 16:25:45.904 [0x1470a11e8700] ERROR - Unknown metadata type: folder
    Oct 18, 2019 16:25:46.058 [0x1470a1bed700] ERROR - Unknown metadata type: folder
    Oct 18, 2019 16:26:01.351 [0x1470a1dee700] ERROR - Error issuing curl_easy_perform(handle): 28
    Oct 18, 2019 16:26:01.351 [0x1470a1dee700] ERROR - Error parsing content.
    Oct 18, 2019 16:26:01.351 [0x1470a1dee700] ERROR - Error parsing XML response for update.
    Oct 18, 2019 16:26:04.620 [0x14710dbcd700] WARN - Unable to load episode file ["seasons/11/episodes/3.xml"]
    Oct 18, 2019 16:26:06.571 [0x1470a31f8700] ERROR - Unknown metadata type: folder
    Oct 18, 2019 16:26:06.679 [0x1470a17eb700] ERROR - Unknown metadata type: folder
    Oct 18, 2019 16:34:33.813 [0x14709b3f9700] ERROR - SQLITE3:(nil), 11, database corruption at line 64883 of [bf8c1b2b7a]
    Oct 18, 2019 16:34:33.813 [0x14709b3f9700] ERROR - SQLITE3:(nil), 11, statement aborts at 209: [update metadata_items set library_section_id=:metadata_items_library_section_id,parent_id=:metadata_items_parent_id,metadata_type=:metadata_items_metadata_type,guid=:metadata_items_gu
    Oct 18, 2019 16:34:33.897 [0x14709b3f9700] ERROR - Exception inside transaction (inside=1) (../Library/MetadataItem.cpp:987): sqlite3_statement_backend::loadOne: database disk image is malformed

     

    It was moving around a .srt file (captions). 

     

    I stopped both Plex, and Sabnzbd, and stopped the trace.  New diagnostics are attached.

     

     

     

     

    swissarmy-diagnostics-20191018-2137.zip

    9 minutes ago, Rich Minear said:

    Boom. It just blew up.  I will have new diags in a few min. 

Ok, thanks.  Another thing to try while I look this over: change the scheduler to 'kyber' and repeat the test - no need to set up the trace, just see if it corrupts.


    I just made the change to the scheduler, and reset the database for Plex.  FYI...the copy of the db file was painfully slow after making that change. 

     

    Plex and Sabnzbd were restarted at 4:52pm


    With "kyber" set for the scheduler, the Plex database corrupted within minutes.  11 to be exact.  Plex was updating the metadata when it blew. 

     

    Diags attached. 

     

    I will reset the Plex database, but leave everything off for right now. 

     

    @limetech I will wait to change the scheduler until you specify what you would like to try next.

     

    swissarmy-diagnostics-20191018-2206.zip


    Weird...copying databases is back to normal speed.  I did not change the scheduler.  It's possible that maybe I was waiting on a disk to spin up last time.  Unsure. 

    2 hours ago, limetech said:

    Thank you!  I will analyze.  Maybe generating the syslog entries disturbs the timing enough that the error case does not happen :(   That would be a bummer.

     

    I certainly hope this isn't a HeisenBug.  I was getting that feeling when reading all the progress posts that were not corrupting. 😟





