SQLite Data Corruption testing


    limetech

    tldr: Starting with 6.8.0-rc2, please visit Settings/Disk Settings and change 'Tunable (scheduler)' to 'none'.  Then run with SQLite DB files located on array disk shares and report whether your databases still become corrupted.
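
    For anyone who wants to verify the setting from a terminal, the active scheduler for each disk can be read from sysfs.  A quick check, assuming your array members show up as /dev/sd* devices:

    # the scheduler shown in [brackets] is the active one for each device
    for f in /sys/block/sd*/queue/scheduler; do echo "$f: $(cat "$f")"; done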

     

    When we first started looking into this issue one of the first things I ran across was this monster topic:
    https://bugzilla.kernel.org/show_bug.cgi?id=201685

    and related patch discussion:
    https://patchwork.kernel.org/patch/10712695/


    This bug is very similar to what we're seeing.  In addition, Unraid 6.6.7 is on the last of the 4.18 kernels (4.18.20), Unraid 6.7 is on the 4.19 kernel, and of course 6.8 is currently on 5.3.  The SQLite DB corruption bug also only started happening with 4.19, so I don't think this is a coincidence.

    In looking at the 5.3 code, the patch above is not present; however, I ran across a later commit that reverted that patch and solved the bug a different way:
    https://www.spinics.net/lists/linux-block/msg34445.html

    That set of changes is in 5.3 code.

    I'm thinking perhaps their "fix" is not properly handling some I/O pattern that SQLite via md/unraid is generating.

     

    Before I go off and revert the kernel to 4.18.20, please test whether setting the scheduler to 'none' makes any difference to database corruption.
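
    If you want a faster way to reproduce than waiting for Plex, here is a minimal stress-loop sketch.  The test DB path is hypothetical, and it assumes the sqlite3 command-line tool is available on the server (it is not part of stock Unraid):

    DB=/mnt/user/appdata/test/corruption-test.db   # hypothetical path on an array share
    sqlite3 "$DB" 'CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, payload BLOB);'
    # keep inserting 4K blobs until integrity_check stops returning "ok"
    while sqlite3 "$DB" 'PRAGMA integrity_check;' | grep -qx ok; do
        sqlite3 "$DB" 'INSERT INTO t (payload) VALUES (randomblob(4096));'
    done
    echo "integrity_check failed at $(date)"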




    User Feedback

    Recommended Comments



    Upgraded to 6.8.0-rc3 two days ago.  During normal use (periodic Plex scans, a couple of movies added), no corruption on linuxserver/Plex or binhex/Sonarr during that time.

     

    Tunable (scheduler) = none

     

    I just moved a 60G directory of TV episodes out of the TV directory; Plex rescanned, no corruption.

    I moved that 60G TV show back to the TV watch folder, Plex started scanning, and I started a copy of large files in another window; corruption was near immediate.

     

    Diag attached.

    r410-diagnostics-20191020-1654.zip

    Link to comment
    24 minutes ago, mi5key said:

    started a copy of large files

    Please expand on this - do you mean you started a copy of large files?  What are the source and destination?

    Link to comment
    1 hour ago, limetech said:

    Please expand on this - do you mean you started a copy of large files?  What are the source and destination?

    After moving the directory of TV episodes back to the TV watch folder, I started to copy those same exact files to another location... /mnt/user/stuff/tv/Star Trek DS9 --> /mnt/user/stuff/holding

     

    So, while Plex was scanning those files, they were being copied.

    Link to comment

    Sorry, still confused ... did you do this via the command line?  If so, please show the commands you used.  Also, when you say "moved", to me that means a 'mv' operation where, file by file, each file is copied and then the source is deleted - is that what happened?

     

    I think what you're saying is this: there is a directory with a bunch of files in it.  While Plex is scanning that directory, you also initiated a copy (not move) of that same directory to another directory.  Correct?

    Link to comment
    47 minutes ago, limetech said:

    Sorry, still confused ... did you do this via the command line?  If so, please show the commands you used.  Also, when you say "moved", to me that means a 'mv' operation where, file by file, each file is copied and then the source is deleted - is that what happened?

     

    I think what you're saying is this: there is a directory with a bunch of files in it.  While Plex is scanning that directory, you also initiated a copy (not move) of that same directory to another directory.  Correct?

    In this case the mv command doesn't copy; within the same filesystem it just changes the location pointer (a rename).  From the command line, here is what I did:

    1) CLI:  mv /mnt/user/stuff/tv/Star\ Trek\ Deep\ Space\ Nine  /mnt/user/stuff/holding/

    2) Plex GUI: Scan library files in Plex

    3) CLI: less /mnt/user/appdata/PlexMediaServer/Library/Application\ Support/Plex\ Media\ Server/Logs/Plex\ Media\ Server.log - observed no corruption

    4) CLI: mv /mnt/user/stuff/holding/Star\ Trek\ Deep\ Space\ Nine /mnt/user/stuff/tv

    5) Plex GUI: Initiated scan; Plex picked up the change and began processing files.

    6) CLI: cp -rv /mnt/user/stuff/tv/Star\ Trek\ Deep\ Space\ Nine  /mnt/user/stuff/tmp/

    7) CLI: less /mnt/user/appdata/PlexMediaServer/Library/Application\ Support/Plex\ Media\ Server/Logs/Plex\ Media\ Server.log - observed corruption within 1 minute.
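
    For what it's worth, corruption can also be confirmed directly instead of watching the log.  A sketch, assuming the standard Plex database location under appdata and an available sqlite3 binary:

    sqlite3 "/mnt/user/appdata/PlexMediaServer/Library/Application Support/Plex Media Server/Plug-in Support/Databases/com.plexapp.plugins.library.db" 'PRAGMA integrity_check;'
    # prints "ok" for a healthy database; anything else means corruption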

    Link to comment
    4 hours ago, mi5key said:

    Diag attached.

    Perhaps related, probably not: it appears you have some kind of network loop going on; your logs are filling up with:

     

    Oct 20 04:40:10 r410 kernel: br0: port 1(bond0) received tcn bpdu
    Oct 20 04:40:10 r410 kernel: br0: topology change detected, propagating
    Oct 20 04:40:12 r410 kernel: br0: port 1(bond0) received tcn bpdu
    Oct 20 04:40:12 r410 kernel: br0: topology change detected, propagating
    Oct 20 04:40:14 r410 kernel: br0: port 1(bond0) received tcn bpdu
    Oct 20 04:40:14 r410 kernel: br0: topology change detected, propagating
    Oct 20 04:40:16 r410 kernel: br0: port 1(bond0) received tcn bpdu
    Oct 20 04:40:16 r410 kernel: br0: topology change detected, propagating
    Oct 20 04:40:18 r410 kernel: br0: port 1(bond0) received tcn bpdu
    Oct 20 04:40:18 r410 kernel: br0: topology change detected, propagating
    Oct 20 04:40:20 r410 kernel: br0: port 1(bond0) received tcn bpdu
    Oct 20 04:40:20 r410 kernel: br0: topology change detected, propagating
    Oct 20 04:40:22 r410 kernel: br0: port 1(bond0) received tcn bpdu
    Oct 20 04:40:22 r410 kernel: br0: topology change detected, propagating
    Oct 20 04:40:24 r410 kernel: br0: port 1(bond0) received tcn bpdu
    Oct 20 04:40:24 r410 kernel: br0: topology change detected, propagating
    Oct 20 04:40:26 r410 kernel: br0: port 1(bond0) received tcn bpdu
    Oct 20 04:40:26 r410 kernel: br0: topology change detected, propagating
    Oct 20 04:40:28 r410 kernel: br0: port 1(bond0) received tcn bpdu
    Oct 20 04:40:28 r410 kernel: br0: topology change detected, propagating
    Oct 20 04:41:28 r410 kernel: br0: received packet on bond0 with own address as source address (addr:78:2b:cb:63:a3:af, vlan:0)
    ### [PREVIOUS LINE REPEATED 1 TIMES] ###
    Oct 20 04:42:13 r410 kernel: br0: port 1(bond0) received tcn bpdu
    Oct 20 04:42:13 r410 kernel: br0: topology change detected, propagating
    Oct 20 04:42:15 r410 kernel: br0: port 1(bond0) received tcn bpdu
    Oct 20 04:42:15 r410 kernel: br0: topology change detected, propagating
    Oct 20 04:42:17 r410 kernel: br0: port 1(bond0) received tcn bpdu
    Oct 20 04:42:17 r410 kernel: br0: topology change detected, propagating
    Oct 20 04:42:19 r410 kernel: br0: port 1(bond0) received tcn bpdu
    Oct 20 04:42:19 r410 kernel: br0: topology change detected, propagating
    Oct 20 04:42:21 r410 kernel: br0: port 1(bond0) received tcn bpdu
    Oct 20 04:42:21 r410 kernel: br0: topology change detected, propagating
    Oct 20 04:42:22 r410 kernel: br0: received packet on bond0 with own address as source address (addr:78:2b:cb:63:a3:af, vlan:0)
    Oct 20 04:42:23 r410 kernel: br0: port 1(bond0) received tcn bpdu
    Oct 20 04:42:23 r410 kernel: br0: topology change detected, propagating
    Oct 20 04:42:25 r410 kernel: br0: port 1(bond0) received tcn bpdu
    Oct 20 04:42:25 r410 kernel: br0: topology change detected, propagating
    Oct 20 04:42:27 r410 kernel: br0: port 1(bond0) received tcn bpdu
    Oct 20 04:42:27 r410 kernel: br0: topology change detected, propagating
    Oct 20 04:42:29 r410 kernel: br0: port 1(bond0) received tcn bpdu
    Oct 20 04:42:29 r410 kernel: br0: topology change detected, propagating
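
    To chase that down, the bridge's spanning-tree state can be inspected directly; a quick look, assuming bridge-utils' brctl is present (it normally is on Unraid):

    brctl showstp br0     # per-port STP state and topology-change counters
    brctl showmacs br0    # check whether your own MAC is being learned back on bond0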

     

    Link to comment

    @limetech  Is there anything else you want me to try?  Right now I have all my dockers down except for Plex so we can watch some media.  It has been up all day, but no one has watched anything, so no stress.

     

    Also, my scheduler is still set to "kyber".  Should I leave it like that?

     

    Thanks

     

    rm

    Link to comment
    1 hour ago, limetech said:

    Oct 20 04:42:21 r410 kernel: br0: topology change detected, propagating
    Oct 20 04:42:22 r410 kernel: br0: received packet on bond0 with own address as source address (addr:78:2b:cb:63:a3:af, vlan:0)
    Oct 20 04:42:23 r410 kernel: br0: port 1(bond0) received tcn bpdu

    @limetech I disabled the bonding since I only have one network cable at the moment anyhow.  Messages have disappeared.

     

    Anything else I can try to assist?  I have time today to try other scenarios.

    Link to comment

    Good morning.  After Plex ran all day yesterday with no issues (no other dockers on, and very little usage), I started turning on other dockers this morning:

    10:00 Sabnzbd

    12:15  Sonarr

    12:28  Radarr

    12:32  Plex database corrupted. 

     

    I'm going to let them all run for a while, as I need to catch up on a couple of things.  Plex is not usable in this state.  I will revert to the last known good database. 

     

    @limetech  Is there anything you would like tried?  I'm at home all day today.

     

    Diags attached.

     

     

    swissarmy-diagnostics-20191021-1744.zip

    Link to comment
    On 10/21/2019 at 12:53 PM, limetech said:

    Preparing another release.  Mondays are always very busy but will try to get it out asap.

    Any word on the new update to try??

    Link to comment

    I'm gonna roll back.  I'm having random usage spikes to 100% on every core even though the server isn't doing much.  I caught mariadb (used exclusively for Nextcloud) using like 27% of resources, which, along with the massive iowait I'm experiencing, means at least once it was the culprit.  I don't know if the database actually got corrupted, but I never had any issue with it on 6.7.2.  On 6.7.2 I had a lot of database locked errors in Plex and failures to retain watched status.  I love some of the new features, but honestly I'd rather have the server be stable, so I'm gonna roll back to 6.6.7 and hopefully that fixes my issues.

     

    I've attached diagnostics taken just after a 100% usage event in which no docker was reporting CPU usage statistics.

    tower-diagnostics-20191024-0117.zip

    Link to comment

    A couple of minor md/unraid driver changes were made in 6.8.0-rc4; please retest before trying the tweaks below.

     

    If DB corruption persists, then there are two tweaks we can try that influence how I/O is generated.  Both of these make use of

    mdcmd set md_restrict <value>

    The default <value> is 3, that is, both bits 0 and 1 set.

    Bit 0 determines how incoming I/O requests are broken up.  If set, they are broken into 128K requests.  If not set, they are processed as-is (meaning a really large request will be scheduled entirely by the md/unraid driver).

    Bit 1 determines how to handle read-ahead requests when pool resources are limited.  If set, then if a stripe cannot be allocated the read-ahead is terminated (this is valid).  If not set, then read-ahead will compete with all other I/O until completion.
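
    Putting the two bits together, the four possible values are:

    3 = split into 128K requests, terminate read-ahead under pressure (default)
    2 = process requests as-is, terminate read-ahead under pressure
    1 = split into 128K requests, let read-ahead compete
    0 = process requests as-is, let read-ahead compete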

     

    To make this effective you must Stop the array, type the command in a terminal window, then Start the array (with no browser refresh in between).

     

    I suggest trying first:

    mdcmd set md_restrict 2

     

    If corruption still occurs, let's try:

    mdcmd set md_restrict 0

     

    Link to comment
    5 hours ago, crazygambit said:

    I'm having random usage spikes to 100% on every core even though the server isn't doing much.

    Please open separate bug report for this.

    Link to comment

    When I attempt to run that cmd (with the array stopped), I receive an error message: 

    /usr/local/sbin/mdcmd: line 11: echo: write error: Invalid argument

     

    I found the Plex DB corrupt again this morning.  I fell back to a former (known good) DB. 

    When I attempted to stop the array, it would not stop.  It started the process, but 30 minutes later it had not stopped.  So I rebooted.  Then stopped all my "auto start" dockers, stopped the array, and attempted the cmd.  It failed with the error message above. 

     

    Right now, on 6.8.0-rc3, I'm left with an unusable system for my dockers.  Corruption occurs within an hour or less.  Any more changes you would like to try??

     

    Diags attached. 

    swissarmy-diagnostics-20191024-0922.zip

    Link to comment
    Just now, Rich Minear said:

    When I attempt to run that cmd (with the array stopped), I receive an error message: 

    /usr/local/sbin/mdcmd: line 11: echo: write error: Invalid argument

     

    Screen Shot 2019-10-24 at 4.14.39 AM.png

    Link to comment
    4 minutes ago, Rich Minear said:

    When I attempt to run that cmd (with the array stopped), I receive an error message: 

    /usr/local/sbin/mdcmd: line 11: echo: write error: Invalid argument

    Were you on rc4 when you tried this?  I have a feeling that is a new option in rc4.
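
    If in doubt, the running version can be checked from the terminal (assuming the usual Unraid version file):

    cat /etc/unraid-version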

    Link to comment

    I have been asking for a couple of days if there was a new version out... haven't received any info.  No, I'm on rc3. 

     

    How can I get to rc4? 

    Link to comment

    Ok.  I upgraded to rc4.  No issues with the upgrade.  I started my Plex docker, made sure that everything was working, and saw no DB corruption.  Then I started all the rest of the dockers.  4:40am.

     

    At 5:20am the Plex DB corrupted.  So I stopped all the dockers and pulled diagnostics (attached). 

    Stopped the array. 

    5:27 am:  Ran the cmd:   mdcmd set md_restrict 2

    Restarted the array

    Restarted the Plex docker to ensure that it came up clean.

    Restarted the rest of the dockers.

    Here we go.....

    swissarmy-diagnostics-20191024-1022.zip

    Link to comment

    5:53 AM:  Plex DB corrupted.  It ran for about 24 minutes before corrupting.  It appears to have occurred when a new TV show was being written to the disk, alerting Plex. 

     

    Stopped the dockers.  Restored a known good Plex DB.  Pulled a copy of the diags (attached). 

     

    Stopped the array.  Made the change with the cmd:   mdcmd set md_restrict 0

     

    Started the array.  Started Plex and made sure it came up clean. 

     

    Restarted all dockers.

     

    6:09  Here we go....

     

     

    swissarmy-diagnostics-20191024-1057.zip

    Link to comment

    After making the change:  mdcmd set md_restrict 0

     

    It's been about an hour and fifteen minutes, and no corruption.  But the system appears "slow", and my processor has been at or near 100% most of the time.  File writes/moves appear to be slow... but that is just my observation.  No data. 
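
    One way to put numbers on the slowdown instead of eyeballing it (vmstat is part of standard procps):

    vmstat 5 12   # sample for a minute; watch the 'wa' (iowait) and 'id' (idle) columns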

     

    I'm headed to work...and will check on it sometime this morning. 

     

     

    Link to comment

    It's almost 3 hours in since making the mdcmd change to 0.  So far, no corruption in the databases.  This is better than it has been for a while. 

     

    I will check it again in a couple of hours. 

    Link to comment
    7 hours ago, limetech said:

    Please open separate bug report for this.

    I think it's definitely related to the SQLite issue: when I went into the terminal to look at top, I saw sqlite up there using lots of CPU when there's no reason it should.  Anyhow, I rolled back to 6.6.7, and while it's a little uglier, it's been running great.

    Link to comment
    2 hours ago, Rich Minear said:

    I will check it again in a couple of hours. 

    Thank you for your testing, very much appreciated.

     

    If you think this has run further than ever before, please Stop the array, set md_restrict to 1, and let 'er run.

    Link to comment




