• SQLite DB Corruption testers needed


    limetech
    • Closed

    9/17/2019 Update: we may have gotten to the bottom of this.  Please try 6.7.3-rc3, available on the next branch.

    9/18/2019 Update: 6.7.3-rc4 is available to address Very Slow Array Concurrent Performance.

     

    re:

     

    Trying to get to the bottom of this...  First, we have not been able to reproduce it, which is odd and implies there may be some kind of hardware/driver dependency to this issue.  Nevertheless, I want to start a series of tests, which I know will be painful for some, since every time DB corruption occurs you have to go through a lengthy rebuild process.  That said, we would really appreciate everyone's input during this time.

     

    The idea is that we are only going to change one thing at a time.  We can either start with 6.6.7 and keep updating things until it breaks, or start with 6.7.2 and revert things until it's fixed.  Since my best guess at this point is that the issue lies with the Linux kernel, Docker, or something we have misconfigured (not one of the hundred other packages we updated), we are going to start with the 6.7.2 code base and see if we can make it work.

     

    But actually, the first stab at this is not reverting anything, but rather updating the Linux kernel to the latest 4.19 patch release, which is 4.19.60 (6.7.2 uses kernel 4.19.55).  In skimming the kernel change logs nothing jumps out as a possible fix; however, I want to first try the easiest and least impactful change: updating to the latest 4.19 kernel.

     

    If this does not solve the problem (which I expect it won't), then we have two choices:

     

    1) Update to the latest stable Linux kernel (5.2.2).  We are using the 5.2 kernel in the Unraid 6.8-beta and so far no one has reported any sqlite DB corruption, though the sample set is pretty small.  The downside is that not all out-of-tree drivers build with the 5.2 kernel yet, so some functionality would be lost.

     

    2) Downgrade Docker from 18.09.06 (the version in 6.7.2) to 18.06.03-ce (the version in 6.6.7).

    [BTW the latest Docker release 19.03.00 was just published today - people gripe about our release numbers, try making sense of Docker release numbers haha]

     

    If neither of those steps succeeds then ... well, let's hope one of them does.

     

    To get started, first make a backup of your flash via Main/Flash/Flash Backup, and then switch to the 'next' branch via the Tools/Upgrade OS page.  There you should see version 6.7.3-rc1.

     

    As soon as a couple of people report corruption I'll publish an -rc2, probably with Docker reverted.

    Edited by limetech

    • Upvote 5



    User Feedback

    Recommended Comments



    Just upgraded to 6.8.0-rc1.

    Have been running without any corruption for the last 5 hours now.

    I have not had the opportunity to overload the unRaid server yet. Will try this tomorrow.

    Keeping my fingers crossed.

    • Thanks 1
    Link to comment

    I am seeing more corruption after resetting everything from scratch with the latest 6.8.0-rc1 build. I repaired the Plex database after upgrading Unraid and was able to get my libraries back online, but after some basic testing and some more importing of backed-up shows I seem to be right back where I started. Any ideas? It is really frustrating to see it all just undo my settings after I put them all back manually.

    Screenshot 2019-10-13 16.16.30.png

    glacier-diagnostics-20191013-2016.zip

    Link to comment

    Hello all

     

    Sorry to report that I just witnessed the DB corruption on my first attempt to stress the server in 6.8.0rc1

     

    The test was as follows:

     

    Plex was scanning the Library

    Sonarr was importing around 10GB worth of media, and is set to notify Plex upon successful import of a library item.

    At the same time I was moving 180GB of data through the network to a single disk of the array via a disk mount.

     

    The performance was good while I was monitoring the data transfer. It only slowed noticeably at one point, where I am guessing some of the disks had to spin up because of the library scan.

     

    At the end of the import everything seemed to be fine. The new episodes were showing up in Plex, and the transfer speed to the array was ranging around 65-75MB/s. All in all everything seemed fine, until I checked the Sonarr logs, where I saw the first messages about a malformed database.

     

    Plex DB did not show any signs of corruption yet.

     

    I noticed a weird thing I would like to report. I changed the /config path in all dockers to /mnt/disk1/appdata from /mnt/user/appdata as part of the proposed workaround back in July, and still have that setup, as all dockers perform much better if /config is not on a user share. However, today I noticed that the reported available space for /config in the Sonarr WebUI was 7.2TB. To get to 7.2TB you need to add up the free space from several, but not all, drives of the array. The / path, which is actually inside docker.img, correctly reported 12.3 GB free.

    It would seem that even though I specifically said /config is on disk1, unRaid is ignoring that and treating it as a user share.

    In the user shares tab, /appdata shows as having 441 GB available, which is what is actually available on disk1. This might be important, as it indicates that access to /config by Sonarr is somehow still happening through fuse and a user share, and not a disk share.

     

    Anyway, I will be downgrading back to version 6.6.7 today, and will check the above inconsistency with the /config path in Sonarr.

     

     

    Link to comment
    7 minutes ago, simalex said:

    Sorry to report that I just witnessed the DB corruption on my first attempt to stress the server in 6.8.0rc1

    Thanks for the report.

     

    8 minutes ago, simalex said:

    It would seem that even though I specifically said /config is in disk1

    You're saying AppData config path is set to: /mnt/disk1/appdata/sonarr  is that correct?

    Also, which sonarr, binhex-sonarr or linuxserver sonarr?

     

    11 minutes ago, simalex said:

    This might be important, as it indicates that somehow access to /config by Sonarr is actually still happening through fuse and a user share, and not a disk share.

    Indeed, if there are two or more containers accessing the same DB file, one via user share, another via disk share, that can cause problems.
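    A quick way to rule that case in or out (container names below are just examples - use whichever containers touch the same appdata) is to dump the mount mappings of each one and confirm every /config points at the same kind of path, i.e. all /mnt/diskX or all /mnt/user, never a mix:

    ```
    docker inspect --format '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{println}}{{end}}' sonarr
    docker inspect --format '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{println}}{{end}}' binhex-sonarr
    ```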

    Link to comment
    3 hours ago, limetech said:

    You're saying AppData config path is set to: /mnt/disk1/appdata/sonarr  is that correct?

    Also, which sonarr, binhex-sonarr or linuxserver sonarr?

    I was using linuxserver.

    I have both dockers installed with different paths for their config files

    binhex has /config in /mnt/disk1/appdata/binhex-sonarr

    linuxserver has /config in /mnt/disk1/appdata/sonarr

     

    I will do the same load test with binhex before downgrading to see if the corruption happens there also.

     

    The reason I moved back to linuxserver is that the backup in binhex is somehow broken and access to /tmp/nzbdrone_backup is for some reason denied. I don't know the reason for this; maybe it is related to the fact that I restored the DB to the binhex docker from an old linuxserver backup, and there are some parameters set in the database that don't match the existing paths in the docker. Anyway, I wanted to have a working backup process before going to "test land" again, so that I could avoid doing rebuilds from older backups.

     

    3 hours ago, limetech said:

    Indeed, if there are two or more containers accessing the same DB file, one via user share, another via disk share, that can cause problems.

     

    I noticed a strange behavior a while back (I don't even remember the unRaid version, but it was definitely before 6.6.7) when I changed the setup of a user share.

    Initially that share was set up to include all disks, but then I limited it to only one disk (disk3, I think). I moved all data of that share from the rest of the disks to disk3, and also deleted the top-level directory from the rest of the disk shares.

    Everything seemed to work properly at the time.

    However, once I added a new disk (disk8) to the array, I noticed that a directory for that user share was created on disk8 as well, and unRaid was actually using disk8 for new writes, as if it was ignoring the setup. I again moved all files of that user share from disk8 to disk3 and deleted the top-level directory from disk8, and have not seen the same issue since, even though disk3 still has nearly 1TB free and disk8 is an 8TB disk that is gradually filling up.

     

    Also, another thing to note, Tom:

    disk1 has appdata for all dockers; however, it also stores part of my media library, which is actually accessed through a user share.

     

    So in my Sonarr docker configuration I have

    /config pointing to /mnt/disk1/appdata/sonarr, and

    /tv pointing to /mnt/user/TV-Shows

     

    The setup of the TV-Shows user share has disk1 in the included disks list.


    Could this be causing the corruption, because we have the same docker app accessing files on the same physical disk through two different file systems? Could it be that there is some kind of inode mix-up happening in this specific scenario, where the same docker application is attempting to write to a user share and the inode for the fuse file system matches an inode that actually exists on the disk itself, which it is already accessing directly through the disk share?

     

    It would be unlikely to find the above setup if you have appdata stored on a cache drive, as I would assume the media library would be on the protected storage of the array. That would explain why the corruption is not happening when appdata is stored on cache or unassigned devices. It would also explain why there are users not experiencing the corruption at all, even though their appdata is located on the protected storage.

     

     

    Link to comment
    10 hours ago, simalex said:

    Plex was scanning the Library

    Sonarr was importing around 10GB worth of media, and is set to notify Plex upon successful import of a library item.

    Sonarr importing to the Plex library, or are you referring simply to separate media storage directories where media files are written by Sonarr and then picked up by Plex?

     

    10 hours ago, simalex said:

    today I noticed that the reported available space for /config in Sonarr WebUI was 7.2TB. In order to get to the 7.2TB you need to add free space from several, but not all, drives of the array.

    This is the part that is confusing.  Please double check.  If /config is mapped to /mnt/disk1 then it should not report free space beyond what's available on /mnt/disk1.
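    One way to double check (container name below is just an example) is to compare what the container itself reports for /config against what the host reports for disk1 and for the appdata user share:

    ```
    # on the host
    df -h /mnt/disk1 /mnt/user/appdata
    # inside the running container
    docker exec sonarr df -h /config
    ```

    If /config shows the multi-TB figure while /mnt/disk1 shows only the 441GB, the container path is effectively being served by shfs rather than by the disk.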

     

    6 hours ago, simalex said:

    So in my Sonarr docker configuration I have

    /config pointing to /mnt/disk1/appdata/sonarr, and

    /tv pointing to /mnt/user/TV-Shows

    That should be ok.

    Link to comment
    21 minutes ago, limetech said:

    Sonarr importing to Plex library or are you referring simply to separate media storage directories where media files are written by sonarr and then picked up by Plex?

    Sonarr is importing from the download area, where deluge or nzbget placed the files, and moving the media into the same directories where the Plex media library resides. This is a common setup.

     

    Plex, as far as I understand, gets notified that there was a change in two ways:

     

    a. Sonarr connects to Plex and notifies it of any changes it made to the media library (additions, deletions, replacements).

    b. Plex also monitors the media directories, and when it detects a change it will internally trigger a library re-scan.

     

    So when you import a new media file through Sonarr, this will trigger a re-scan of the library by Plex. In addition, Plex will start downloading metadata and subtitles and (depending on the setup) even start generating thumbnails etc. for the newly detected media, which means further taxing of the same disks.

     

    In order to push the server to the limit, I will take a directory of multiple media files, each larger than 1GB, and do a manual import in Sonarr of all of them. This in turn will copy the media files, one at a time, to the destination area. As each file finishes copying, Sonarr will notify Plex of the update. So I get large files copied across different array disks, and at the same time read access to other parts of usually the same drive that is being written.
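    If it helps anyone else trying to reproduce this, a rough way to generate similar load without waiting for real downloads would be something like the following (the scratch path is just an example, and I have not tried exactly this):

    ```
    # write several large dummy files to the same array disk that holds appdata,
    # while Plex/Sonarr are busy scanning and importing
    mkdir -p /mnt/disk1/scratch
    for i in 1 2 3 4 5; do
        dd if=/dev/urandom of=/mnt/disk1/scratch/dummy_$i.bin bs=1M count=2048 status=progress
    done
    ```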

     

    21 minutes ago, limetech said:

    today I noticed that the reported available space for /config in Sonarr WebUI was 7.2TB. In order to get to the 7.2TB you need to add free space from several, but not all, drives of the array.

    Will check this when I get back and let you know

     

     

    Edited by simalex
    Link to comment

    In order to get DB corruption, one (or more) of these has to be true:

    1. Multiple processes of the same application, or multiple threads of the same process, are accessing the same DB file and the mutual-exclusion method being used (locking) is not working correctly.
    2. Multiple processes of different apps are accessing the same DB file and stepping on each other.
    3. There is a bug in the underlying block-I/O driver, in our case md/unraid if the DB is on an array disk.
    4. There is a bug in an underlying kernel device driver.
    5. There is a bug in SQLite.

     

    #1 is what I looked at initially, and I don't think this is the case.

    #2 is what @simalex might be causing to happen?

    #3 I don't think has ever been the source of the problem; for 6.8 I pretty thoroughly reviewed and tested it (again).

    #4 is possible, since we cannot reproduce this problem with the servers we have.

    #5 is also possible, since sqlite DB corruption has been common over the years.

     

    Why does 6.6.x work while later releases don't?  I have no idea.  We are willing to go back and start bisecting changes, but that is nearly impossible to do if we can't reproduce the issue to begin with.

     

    The other thing that makes this issue difficult is that the corruption is detected only long after the event that caused it, and sqlite doesn't exactly make it clear what the nature of the corruption is.
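    For anyone willing to help, a rough sketch of a synthetic test for the #1/#2 style of contention would be to hammer a single DB on an array disk from several shells at once and then check its integrity (paths below are only examples, not a procedure we have validated):

    ```
    DB=/mnt/disk1/appdata/test/stress.db
    mkdir -p "$(dirname "$DB")"
    sqlite3 "$DB" 'PRAGMA journal_mode=WAL; CREATE TABLE IF NOT EXISTS t(id INTEGER PRIMARY KEY, payload TEXT);'

    # four concurrent writers, each doing 10,000 small transactions
    for w in 1 2 3 4; do
        (
            for i in $(seq 1 10000); do
                sqlite3 "$DB" 'PRAGMA busy_timeout=5000; INSERT INTO t(payload) VALUES (hex(randomblob(256)));'
            done
        ) &
    done
    wait

    sqlite3 "$DB" 'PRAGMA integrity_check;'
    ```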

    • Thanks 2
    Link to comment

    If it's an sqlite bug, it's certainly not an sqlite _regression_, since it happens on Plex (we ship our own sqlite) and in Docker containers.

    Here's a question: has anyone seen this on a standalone PMS install, or only on Docker? I haven't run into this issue since updating to the affected versions, and I run PMS directly on the host.

    I've seen rare cases of sqlite corruption on my system in the past, but they were attributable to the cache/mover bug (which can be ruled out as the cause of this issue, since it reproduces on systems without cache drives, and even when bypassing shfs altogether by using an individual disk md mount).

    I'm not especially familiar with how Docker volumes are implemented, but has anyone tried using a bind mount instead? Could be interesting to see if that gives different results.

    Otherwise, I'd suggest having someone who sees this frequently attach strace to an affected app, watching all filesystem I/O syscalls, and see exactly what's going on when the corruption occurs. Just make sure to use `-f` to get all threads and spawned children, and direct the output to somewhere with a lot of free space.
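    Something along these lines should do it (the PID and the output location are just examples - point -o at a disk with plenty of room):

    ```
    # attach to the running app, follow all threads/children,
    # and log file and fd syscalls with timestamps
    strace -f -tt -e trace=file,desc -o /mnt/disks/scratch/plex.strace -p 12345
    ```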

    Link to comment
    2 hours ago, 11rcombs said:

    I'm not especially familiar with how Docker volumes are implemented, but has anyone tried using a bind mount instead? Could be interesting to see if that gives different results.

    Yes maybe docker related - using a bind mount of what?

     

    2 hours ago, 11rcombs said:

    Otherwise, I'd suggest having someone who sees this frequently attach strace to an affected app,

    This is going to generate an absolute mountain of data because often you have to run days or weeks before you see corruption.

    Link to comment

    >Yes maybe docker related - using a bind mount of what?

     

    Of whatever the affected sqlite database is on, just as opposed to using a volume. Again, I don't know how those features are implemented, but there's a chance it might hit on a different code path.
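    For example, the same mapping expressed both ways (image name and host path are placeholders; as far as I know the -v host-path form is itself a bind mount under the hood, so this may well behave identically - which would be useful to know either way):

    ```
    # the usual unRaid-style mapping
    docker run -d --name sonarr-v -v /mnt/disk1/appdata/sonarr:/config linuxserver/sonarr

    # the explicit bind-mount syntax, for comparison (run one at a time against the same /config)
    docker run -d --name sonarr-bind --mount type=bind,source=/mnt/disk1/appdata/sonarr,target=/config linuxserver/sonarr
    ```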

     

    >This is going to generate an absolute mountain of data because often you have to run days or weeks before you see corruption.

     

    Could maybe do something clever with `logrotate` + something to kill the affected service soon after it detects db corruption? But otherwise yeah sometimes you just need to generate several gigabytes of logs and then sift through to find the useful couple megs.
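    Something simple like this could serve as the kill-switch (log path and container name are just examples):

    ```
    # stop the container as soon as the first "malformed" message appears in its log
    tail -F /mnt/disk1/appdata/sonarr/logs/sonarr.txt | grep -m1 -i 'malformed' && docker stop sonarr
    ```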

    Link to comment
    2 hours ago, 11rcombs said:

    Here's a question: has anyone seen this on a standalone PMS install, or only on Docker?

    There is a similar report of this happening on Ubuntu:

    https://forums.plex.tv/t/error-sqlite3-nil-database-corruption-docker/403032

     

    The title of the post suggests it is running in a Docker container, but the OP doesn't explicitly specify.

     

    One of the first things we tried when we first started looking at the issue was to revert Docker to the same version used in 6.6.7, and the DB corruption reports continued.

    Link to comment

    Well, it's been 2 days since I upgraded to the 6.8.0 rc and I can safely report there has been no corruption. I've downloaded over 100GB using Sonarr and I've been recording TV in Plex, and no apps have shown any indication of corruption (nor do the logs). I am keeping my fingers crossed and hoping this keeps up.

    Link to comment
    21 hours ago, simalex said:

    Sorry to report that I just witnessed the DB corruption on my first attempt to stress the server in 6.8.0rc1

    Looking back at the original report: is it possible there was corruption already, and it just so happened that a DB operation needed to operate on the corrupted record?

     

    I am wondering: when you upgraded to 6.8.0-rc, were the databases recreated from scratch or checked for corruption first?
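    For reference, a quick way to check a DB before trusting it would be something like this (container name and DB path are examples - the file name differs per app - and the app should be stopped first so nothing is writing):

    ```
    docker stop sonarr
    sqlite3 /mnt/disk1/appdata/sonarr/nzbdrone.db 'PRAGMA integrity_check;'
    docker start sonarr
    ```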

    Link to comment
    15 hours ago, limetech said:

    Why does 6.6.x work while later releases don't?  I have no idea.  We are willing to go back and start bisecting changes, but that is nearly impossible to do if we can't reproduce the issue to begin with.

    The bigger question is why some people don't have problems and others do.

     

    I never had any problems with Plex, Radarr, or Sonarr, but I did with the storjv3 container.

     

    Maybe updating Docker would be a good idea?

    Edited by nuhll
    Link to comment
    1 hour ago, limetech said:

    Looking back at the original report: is it possible there was corruption already, and it just so happened that a DB operation needed to operate on the corrupted record?

     

    I am wondering: when you upgraded to 6.8.0-rc, were the databases recreated from scratch or checked for corruption first?

    Don't see how this could have happened.

    This was the same DB that had been working, restored from a backup taken before I upgraded to 6.7.2.

     

    The Sonarr logs had not shown any corruption messages for at least 2 months.

    Also, the malformed database messages were actually linked to some series episodes that literally aired Sunday night, after I had upgraded to 6.8, while Sonarr was trying to decide whether to put them in the download list.

     

    You did get me wondering though, so I did the following:

    a. checked the DB backup from before the upgrade to 6.8 for corruption -- no errors came up

    b. checked the DB after the malformed message for corruption -- no errors came up (kind of weird)

    c. started the linuxserver Sonarr docker and tried some activities like refreshing series and adding new series, while checking the logs in parallel for any new errors.

     

    So after adding a few new series in Sonarr, I had these in the Sonarr error logs.

     

    [attached screenshot: Sonarr error log entries]

     

    The last error is from yesterday when I got the first malformed DB message.

    Immediately above that are 2 errors that happen on Series Refresh, apparently because more than one poster/banner is available for some series.

    What is a problem for me are the top 2 errors. The 2 directories mentioned as missing are actually for newly added series. I checked, and the directories were created successfully. I added 8 new series in total. So Sonarr created the directories and then tried to immediately access them to put covers and banners in, but apparently unRaid had not yet finished creating the actual directory structure?

     

    I also did a bit of reading on the SQLite locking mechanism yesterday. After the results of today's DB integrity checks, I am thinking that yesterday I actually managed to somehow end up with a "corrupted" -wal (write-ahead log) file, but because I caught it and stopped Sonarr almost immediately, the corruption did not get pushed to the actual DB. Apparently the DB itself can be read simultaneously, but only one process can write. Everyone appends changes to the -wal file, and some kind of file semaphore indicates whether a process is blocking the actual DB for updating. My theory is that the -wal file or the shared-memory file lock (the -shm file) somehow becomes inconsistent, and then "garbage" ends up getting written to the actual .db file. This would make the issue much more difficult to track.
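    For what it's worth, these are the kinds of checks I mean, run with the app stopped (the DB path is just an example):

    ```
    # confirm the DB is in WAL mode, then force any pending -wal content into the main .db file
    sqlite3 /mnt/disk1/appdata/sonarr/nzbdrone.db 'PRAGMA journal_mode;'
    sqlite3 /mnt/disk1/appdata/sonarr/nzbdrone.db 'PRAGMA wal_checkpoint(TRUNCATE);'
    ls -la /mnt/disk1/appdata/sonarr/nzbdrone.db*
    ```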

     

    Anyway, I will be back on my weekly travelling schedule from tomorrow, so I will not have much time for testing.

    I am in the process of setting up a second unRaid server with actual "server"-grade parts. I plan to use it initially for testing this corruption issue. At some point I will move my data there and keep the old server up for testing, if the issue has not been resolved by then.

     

     

    Link to comment

    Is it possible to get an idea of what has changed in the new version (6.8.0-rc1) that we believe may fix the issue?  I'm more than happy to try it and see... if we have something that we believe was changed to fix it.  My machine can usually recreate the problem in less than a day.

     

    If it's just a *hope*, then I will just wait to see what others find...

    15 hours ago, limetech said:

    Yes maybe docker related - using a bind mount of what?

     

    This is going to generate an absolute mountain of data because often you have to run days or weeks before you see corruption.

    Some of us had corruption within hours or minutes....

    Link to comment
    46 minutes ago, limetech said:

    Repeatable?

    The last 6 or 7 posts that I did about corruption all occurred within hours.  None lasted a day.   The diags were attached.  6.6.7 is rock solid.  No corruption for weeks.

    Link to comment

    Did you start off with an entirely new/fresh database? Or was it restored from an earlier backup?

    Link to comment
    2 hours ago, BRiT said:

    Did you start off with an entirely new/fresh database? Or was it restored from an earlier backup?

    So that depends on what you mean by "new/fresh".  Each database that I fell back to was "known good", and had been cleaned after prior corruption.  I had an earlier post where I showed how to clean a database so that it contained no corruption.  But two of the times, I started over from scratch, letting Plex build it.
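    For anyone who missed that post, the usual dump-and-reload clean looks roughly like this (file names are examples, the app must be stopped first, and newer sqlite3 builds also have a .recover command for badly damaged files):

    ```
    # rebuild a fresh DB from whatever the dump can read out of the old one
    sqlite3 old.db '.dump' | sqlite3 rebuilt.db
    sqlite3 rebuilt.db 'PRAGMA integrity_check;'
    ```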

     

    None of the databases that I restored to had any corruption. 

    • Thanks 1
    Link to comment

    Well I'm sorry to report I did not make it to day 3. Database corruption happened either last night or this morning. In any case, back to 6.6.7.

    Link to comment

    Honestly, don't bother posting if you can't include at least a diagnostics.zip file.  How can we possibly determine where this issue lies with "sorry, doesn't work"?  If this is all you have to say, please don't run -rc releases.  This is why we have closed betas now, btw.

    Link to comment
    4 hours ago, Rich Minear said:

    So that depends on what you mean by "new/fresh".  Each database that I fell back to was "known good", and had been cleaned after prior corruption.  I had an earlier post where I showed how to clean a database so that it contained no corruption.  But two of the times, I started over from scratch, letting Plex build it.

     

    None of the databases that I restored to had any corruption. 

     

    Rich,

     

    In your previous post:

    https://forums.unraid.net/bug-reports/stable-releases/sqlite-db-corruption-testers-needed-r598/page/8/?tab=comments#comment-5518

     

    How do you have /config mapped for the sonarr container?

    Link to comment



    This is now closed for further comments
