SQLite DB Corruption testers needed


    limetech
    • Closed

9/17/2019 Update: we may have gotten to the bottom of this. Please try 6.7.3-rc3, available on the next branch.

    9/18/2019 Update: 6.7.3-rc4 is available to address Very Slow Array Concurrent Performance.

     


     

Trying to get to the bottom of this... First, we have not been able to reproduce the problem, which is odd because it implies there may be some kind of hardware/driver dependency. Nevertheless, I want to start a series of tests, which I know will be painful for some, since every time DB corruption occurs you have to go through a lengthy rebuild process. That said, we would really appreciate anyone's input during this time.

     

The idea is that we are only going to change one thing at a time. We can either start with 6.6.7 and update components until it breaks, or start with 6.7.2 and revert components until it's fixed. Since my best guess at this point is that the issue lies with the Linux kernel, Docker, or something we have misconfigured (not one of the hundred other packages we updated), we are going to start with the 6.7.2 code base and see if we can make it work.

     

But actually, the first stab at this is not reverting anything, but rather updating the Linux kernel to the latest 4.19 patch release, which is 4.19.60 (6.7.2 uses kernel 4.19.55). Skimming the kernel change logs, nothing jumps out as a possible fix; however, I want to first try the easiest and least impactful change: updating to the latest 4.19 kernel.

     

    If this does not solve the problem (which I expect it won't), then we have two choices:

     

1) Update to the latest Linux stable kernel (5.2.2). We are using the 5.2 kernel in Unraid 6.8-beta and so far no one has reported any SQLite DB corruption, though the sample set is pretty small. The downside is that not all out-of-tree drivers build with the 5.2 kernel yet, so some functionality would be lost.

     

2) Downgrade Docker from 18.09.6 (the version in 6.7.2) to 18.06.3-ce (the version in 6.6.7).

[BTW, the latest Docker release, 19.03.0, was just published today - people gripe about our release numbers; try making sense of Docker release numbers haha]

     

    If neither of those steps succeed then ... well let's hope one of them does succeed.

     

To get started, first make a backup of your flash via Main/Flash/Flash Backup, and then switch to the 'next' branch via the Tools/Upgrade OS page. There you should see version 6.7.3-rc1.
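If you'd rather script the flash backup than click through the UI, here is a minimal sketch of the same idea in Python (/boot is where Unraid mounts the flash device; the destination path is just an example):

```python
import shutil
from datetime import datetime

# /boot is where Unraid mounts the USB flash device.
SRC = "/boot"
# Destination under a user share is just an example; adjust to taste.
DEST = f"/mnt/user/backups/flash-{datetime.now():%Y%m%d-%H%M%S}"

# make_archive writes DEST + ".zip" containing everything under SRC.
archive = shutil.make_archive(DEST, "zip", SRC)
print(f"flash backup written to {archive}")
```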

     

As soon as a couple of people report corruption I'll publish an -rc2, probably with Docker reverted.

    Edited by limetech

    • Upvote 5



    User Feedback

    Recommended Comments



Sonarr decided to shit the bed now too... -_- Restoring from backup now. What further testing can I do? I'm tempted to roll back, but I would love to help fix this bug.

Also be advised, I'm having the same issues as described in this thread as well:

     

     

     

Let me know how to proceed.

     


    Link to comment

Just a quick update: I rolled back to Unraid 6.6.7 and turned my parity drive into a data drive (my thought process is that I have been getting such terrible IO speeds that maybe the parity is my issue...). Gonna reload all these Docker images one by one, rebuild the DBs, and let you guys know how it goes.

     

    -DR

    Link to comment

I'm about ready to roll back to it also. So far it doesn't seem that we are making any headway with the SQLite issue, and I'm tired of explaining to the family why I have to spend time rebuilding the database so they can watch things.

    Link to comment

I have had my dockers on the cache since the 9th (of August) and no corruption yet; it used to happen at least once a day. I have CA Backup and Restore running every 12 hours, so if a DB corrupts it's a two-minute fix to restore.
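For anyone without the CA plugin, the same idea can be sketched in a few lines of Python (the paths are examples, and unlike CA Backup and Restore this naive version does not stop the containers before copying):

```python
import shutil
import time
from datetime import datetime

APPDATA = "/mnt/cache/appdata"     # example source; adjust to your setup
BACKUP_ROOT = "/mnt/user/backups"  # example destination on the array

def backup_appdata():
    dest = f"{BACKUP_ROOT}/appdata-{datetime.now():%Y%m%d-%H%M}"
    shutil.copytree(APPDATA, dest)  # straight copy; containers keep running
    print(f"appdata copied to {dest}")

while True:
    backup_appdata()
    time.sleep(12 * 60 * 60)  # repeat every 12 hours, like the plugin schedule
```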

     

FYI, SABnzbd has never corrupted in any configuration, ever, for me.

     

    Edited by TheBuz
    Link to comment
    Just now, Rich Minear said:

    Not everyone has a cache drive...or even the need for one.  

True, I don't really need a cache drive either, but I had a couple of 240GB SSDs gathering dust from an old project.

     

But if there is a difference in the way data is handled on cache vs the array, there might be some clues in there as to why this is happening.

    Link to comment

Yeah, I rolled back AND removed my parity drive (for better performance / more space...), so I'm not really sure how much help I will be to you guys. But I'm going to keep a REAL close eye on my SQLite DBs for Radarr/Sonarr and will let you guys know if I see corruption on the 6.6.x branch as well.

     

    thanks

     

    -DCR

    Link to comment

    Yikes!  I'm behind the curve here.  I updated to 6.7.2 from 6.6.6 a few days ago, before seeing this information.

     

    I've not yet had any corruption, but I'm a little worried.   Glancing over this and related threads, so far it seems like only those storing SQLite DBs directly on disk rather than on SSD/Cache are seeing this issue.  Is that correct, or are there reports of people storing appdata on cache with this issue?  

     

Having all my dockers in appdata, and having appdata set to cache-only, am I immune to this issue, or should I roll back to 6.6.7?

     

    Link to comment

Alright, so I did have 6.7.3-rc2 installed and still had the same issues... So I've rolled back to 6.6.7.

     

    How can I help? What would you like me to do? :)

    Link to comment

Since I've had 6.7 installed, and now 6.7.3-rc2, I don't have the option in my GUI to roll back to 6.6.7.

     

    How do I do it manually?

    Link to comment
    16 minutes ago, Rich Minear said:

    How do I do it manually?

Backup your flash (just in case). Download 6.6.7. Unpack the zip. Replace the bz files and syslinux on your flash with the ones in the downloaded folder. Reboot.
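If you'd rather script that file swap, here is a rough sketch of the idea (the extracted path is an example, and the bz* filenames come from the release zip, so double-check them against what you actually unpacked):

```python
import glob
import shutil

EXTRACTED = "/tmp/unraid-6.6.7"  # example: where you unpacked the release zip
FLASH = "/boot"                  # the Unraid USB flash mount

# Replace the kernel/rootfs images (bzimage, bzroot, ...) on the flash.
for f in glob.glob(f"{EXTRACTED}/bz*"):
    shutil.copy2(f, FLASH)
    print(f"replaced {f.rsplit('/', 1)[-1]}")

# The syslinux folder holds the boot config; replace it as well.
shutil.copytree(f"{EXTRACTED}/syslinux", f"{FLASH}/syslinux", dirs_exist_ok=True)
print("done - reboot into 6.6.7")
```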

    Edited by wgstarks
    • Upvote 1
    Link to comment

So I hadn't actually had a problem whatsoever with 6.7.1 or 6.7.2 until the other day, but I'm unsure whether my problem is the same as everyone else's. I was just doing a little "aesthetic maintenance" on Plex - literally just adding a couple of album covers to items missing them - and maybe 10 minutes later I went to put on a movie and the tower was unavailable. Went to my laptop and it confirmed it was unavailable. Went to my Unraid tab, and the Docker tab said Plex was "unhealthy", so I restarted it and it never started again. In the log it just keeps repeating "Starting Plex Media Server" over and over and over.

     

Are those the same symptoms others have had?

     

My appdata is in /mnt/user/appdata if that helps. I haven't been able to do anything else since, so I haven't reverted to 6.6.7 or anything, and I don't know when or if I will have the time to do so.

    Link to comment
    On 8/13/2019 at 8:21 PM, dustinr said:

Yeah, I rolled back AND removed my parity drive (for better performance / more space...), so I'm not really sure how much help I will be to you guys. But I'm going to keep a REAL close eye on my SQLite DBs for Radarr/Sonarr and will let you guys know if I see corruption on the 6.6.x branch as well.

     

    thanks

     

    -DCR

Did you see a performance increase in read speeds after rolling back?

     

Some people have reported much faster read/write speeds after rolling back, and I would probably do it for this reason alone.

     

    Are Dockers, VMs and Community Apps affected by downgrading?

    Link to comment

Where I noticed the performance gain was in the metadata matching! I switched to mounting dockers on the cache drive first, before I rolled back. It took almost a full day for my movie collection to fetch all the metadata. After another crash, I kept dockers on the cache drive and rolled back to Unraid 6.6.7. When I started rebuilding the libraries, metadata downloading was a lot faster. I had the same config as always in Plex, but I was then able to tag my movies, music, and TV shows all in under 12 hours. NEVER had this happen before. We're talking a 16TB library here. I know the cache drive helped quite a bit, but until the rollback I never saw rebuilds this fast.

    Link to comment

OK,

So I switched to the binhex docker for Sonarr (instead of linuxserver) and the plexinc docker for Plex (instead of limetech, which was deprecated anyway).

Then I upgraded again to unRaid 6.7.2.

I did not rebuild the databases; instead I backed up appdata and pointed the new dockers to the old paths.

The system ran for almost 3 days straight without any SQLite corruption. However, for those first 3 days I did not do any heavy lifting; only a few new TV episodes were added, and those sporadically.

     

Then I decided to force a heavy load on both Sonarr & Plex by manually importing a full season.

     

So I imported ten 3.3GB episodes through Sonarr. What this effectively does is:

i. Sonarr creates a local copy of the file to be imported, named .backup, in the source dir

ii. Sonarr copies the file to the destination directory

iii. Once finished, Sonarr deletes both the original and the .backup from the source dir (my setup was to move the files)

iv. Sonarr notifies Plex of the change

v. Plex starts its own analysis of the new media file and processes it to create thumbnails etc.

     

To further load the system, at the same time I forced Sonarr to do a Series Refresh, which, since my library is huge, triggered reads on at least three 8TB disks at the same time.
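For anyone wanting to approximate this load pattern on a test box, here is a rough sketch: one thread doing large sequential copies from disk1 to another array disk while a second thread hammers an SQLite DB on disk1. The paths, sizes, and schema are illustrative assumptions, not Sonarr's actual behaviour:

```python
import shutil
import sqlite3
import threading

SRC_FILE = "/mnt/disk1/test/big.mkv"   # example: a multi-GB file on disk1
DEST_DIR = "/mnt/disk8/test"           # example: destination on another array disk
DB_PATH = "/mnt/disk1/test/stress.db"  # SQLite DB on the same disk as the source

def copy_loop(n=10):
    # Mimic importing ten large episodes: repeated big sequential copies.
    for i in range(n):
        shutil.copy(SRC_FILE, f"{DEST_DIR}/episode{i}.mkv")

def db_loop(n=10_000):
    # Mimic Sonarr/Plex updating their DBs while the copies run.
    con = sqlite3.connect(DB_PATH, timeout=5)
    con.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, v TEXT)")
    for _ in range(n):
        con.execute("INSERT INTO t (v) VALUES (?)", ("x" * 256,))
        con.commit()
    con.close()

t1 = threading.Thread(target=copy_loop)
t2 = threading.Thread(target=db_loop)
t1.start(); t2.start()
t1.join(); t2.join()

# A healthy database reports "ok" here; anything else means corruption.
print(sqlite3.connect(DB_PATH).execute("PRAGMA integrity_check").fetchone()[0])
```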

     

Results:

The binhex Sonarr docker (instead of linuxserver) is still OK, no corruption.

The Plex database (plexinc docker) was corrupted at some point, when Plex detected a change in a directory timestamp and started re-scanning the library while at the same time analyzing files to generate new thumbnails.

     

It is apparent that the corruption issue only manifests when unRaid or the dockers are under load.

     

All my dockers have been set up for more than a month with the appdata directory in /mnt/disk1, as initially suggested (and that by itself resulted in a significant performance increase for the containers).

One other thing to note: since I don't have a cache drive, all my media is first placed on disk1 (the same disk where the SQLite DBs reside) and then transferred from there to the target locations, which are in various /mnt/user user shares. This puts additional stress on disk1 during the import, as it is also a) used as storage during downloads and b) part of some of the user shares.

     

I will now downgrade to 6.6.x and then upgrade to 6.7.3-rc2 (so that I have an easy fallback point) and try the above again.

     

     

     

    Link to comment

    Same results with 6.7.3-rc2.

After the upgrade everything worked properly for a while, with no SQLite corruption.

Almost one hour after starting the manual import in Sonarr, again using a set of 3-4GB media files, the corruption issue appeared. The only difference is that this time both Plex and Sonarr had database corruption: first the Sonarr DB was corrupted, and several minutes later so was Plex's.

     

As far as I understand, the corruption happens when there is a heavy load on the unRaid server, e.g. copying large files from one disk of the array to another.

     

As I mentioned in my previous post, I am moving media files from disk1 to other disks in the array using Sonarr's manual import. My TV Shows library is in a user share that spans several disks, including disk1, and uses the high-water allocation method. Media files are currently getting copied to disk8, as that one has 3TB free. So effectively I have heavy file copying from disk1 to disk8, and at the same time Plex and Sonarr are updating their databases on disk1.

     

I am inclined to think that this puts a strain on the parity drive, because the heads are forced to do a lot of flying around for all the updates to be processed correctly.


When you start a manual import in Sonarr, it seems to run in the background, probably on a different thread. At the same time, other scheduled tasks (e.g. RSS scans, Series Refresh, etc.) will still start in the background at their predefined times.

Similarly for Plex: Sonarr notifies Plex that a new episode was uploaded, and Plex starts a library re-scan. At the same time it will still run any other scheduled tasks (e.g. creating thumbnails).

     

So if there are performance issues, it is possible that some kind of timeout is being mishandled by SQLite, with the result that a) the threads have a different "image" of the DB, and any successful write after that could corrupt the actual DB file, or b) the on-disk copy of the DB is inconsistent with the in-memory cached parts of the DB, so again any write after that could end up corrupting the DB.

     

As for why this problem only manifests in the latest version of unRaid, I can only speculate that even a slight change in some threshold value that got missed might increase the sensitivity of SQLite to any kind of timeout.
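For what it's worth, the timeout failure mode by itself is easy to demonstrate, and on healthy storage SQLite's normal response to lock contention is an error, not a corrupt file. A minimal sketch with two connections fighting over one DB (the path is just an example):

```python
import sqlite3

DB = "/tmp/contention.db"  # example path

a = sqlite3.connect(DB, timeout=0.5)
b = sqlite3.connect(DB, timeout=0.5)
a.execute("CREATE TABLE IF NOT EXISTS t (v TEXT)")
a.commit()

# Connection a takes the write lock and holds it...
a.execute("BEGIN IMMEDIATE")
a.execute("INSERT INTO t VALUES ('held')")

# ...so connection b's write gives up after its 0.5s busy timeout.
try:
    b.execute("INSERT INTO t VALUES ('blocked')")
except sqlite3.OperationalError as e:
    print(e)  # "database is locked"

a.commit()  # releasing the lock lets later writes succeed
```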

     

If that is the case, then people using a cache drive for appdata should not see a similar strain on the parity drive, since writes to the cache don't update parity, so a DB on the cache drive should be much more robust against this type of failure.

     

Do any of you who have the SQLite corruption issue have appdata on a cache drive?

     

     

     

    Link to comment
    42 minutes ago, simalex said:

     

Do any of you who have the SQLite corruption issue have appdata on a cache drive?

     

     

     

My dockers have been on 24/7. I moved appdata to the cache on the 9th of August. No corruption yet.

     

    Sonarr, Radarr, Plex, SAB

    Edited by TheBuz
    Link to comment
    On 8/15/2019 at 1:59 AM, TheBuz said:

Did you see a performance increase in read speeds after rolling back?

Some people have reported much faster read/write speeds after rolling back, and I would probably do it for this reason alone.

      

    Are Dockers, VMs and Community Apps affected by downgrading?

WELL, I did see increased IO, but I think that's PRIMARILY because I deleted my PARITY drive and just made it part of my storage array. unRaid is performing GREAT now, and I haven't had any corruption in three days. I think all of my issues stem from the parity drive. Is there anything on the roadmap for SnapRAID? (or something similar...) I think my big bottleneck is writing parity data synchronously on OLD HARDWARE / OLD HARD DRIVES (2010-2018).

     

EDIT: In the interest of SCIENCE, I am upgrading my unRaid from 6.6.7 to the new 6.7.3-rc2 and continuing to run without parity and without cache.

     

    Edited by dustinr
    Link to comment
    2 hours ago, dustinr said:

Is there anything on the roadmap for SnapRAID?

I truly hope not. It would remove a very key feature of unRaid vs SnapRAID: unRaid can emulate missing/dead drives seamlessly, and SnapRAID cannot do that at all.

    • Upvote 1
    Link to comment

    Got corruption again, this time in Sonarr, Radarr, and OpenVPN-AS.

     

As has been mentioned, it seems to occur during periods of high disk activity. I believe Sonarr was importing some media while Plex was streaming/transcoding.

At this point I'm very tempted to revert to 6.6.7, as it was rock stable. Are there any other tests we can do to help resolve this?

    tower-diagnostics-20190818-1657.zip

    Edited by mdeabreu
    Link to comment
    1 hour ago, mdeabreu said:

    Got corruption again, this time in Sonarr, Radarr, and OpenVPN-AS.

     

As has been mentioned, it seems to occur during periods of high disk activity. I believe Sonarr was importing some media while Plex was streaming/transcoding.

At this point I'm very tempted to revert to 6.6.7, as it was rock stable. Are there any other tests we can do to help resolve this?

tower-diagnostics-20190818-1657.zip

I had to do the same thing: 6.6.7. I've been fighting corruption since mid-May on the new platform, and nothing seems to work. I tried all the things they asked... but nothing seemed to make any difference. And I couldn't keep rebuilding the Plex database every day. 😞

    Link to comment

In the previous thread I said it seemed to be fixed by changing to /mnt/disk1; well, it's not.

I'm still on 6.7.2. Please let me know if there's an rc3 and I'll help test. Maybe downgrade the kernel?

    Link to comment
    7 hours ago, Rich Minear said:

I had to do the same thing: 6.6.7. I've been fighting corruption since mid-May on the new platform, and nothing seems to work. I tried all the things they asked... but nothing seemed to make any difference. And I couldn't keep rebuilding the Plex database every day. 😞

_IF_ you're feeling adventurous, remove your parity drive from your array and see if corruption occurs on RC2. I've been running perfectly since I deleted my parity drive. If nothing else, it would be a good test to correlate the issue.

    Link to comment

Maybe related, maybe not. I had Plex DB issues as well with appdata on the cache drive. I flipped the cache from btrfs to XFS, since I wasn't using the pool feature, and haven't had issues since. I also noticed cache corruption with my VMs, which became unable to back up. Just my 2 cents.

    Link to comment



    This is now closed for further comments
