• SQLite DB Corruption testers needed


    limetech
    • Closed

9/17/2019 Update: we may have gotten to the bottom of this. Please try 6.7.3-rc3, available on the next branch.

9/18/2019 Update: 6.7.3-rc4 is available to address the Very Slow Array Concurrent Performance issue.

     

    re:

     

Trying to get to the bottom of this... First, we have not been able to reproduce the corruption, which is odd because it implies there may be some kind of hardware/driver dependency to this issue. Nevertheless, I want to start a series of tests, which I know will be painful for some, since every time DB corruption occurs you have to go through a lengthy rebuild process. That said, we would really appreciate everyone's input during this time.

     

The idea is to change only one thing at a time. We can either start with 6.6.7 and update components until it breaks, or start with 6.7.2 and revert components until it's fixed. Since my best guess at this point is that the issue lies with the Linux kernel, Docker, or something we have misconfigured (not one of the hundred other packages we updated), we are going to start with the 6.7.2 code base and see if we can make it work.

     

The first stab at this, though, doesn't revert anything; instead we're updating the Linux kernel to the latest 4.19 patch release, 4.19.60 (6.7.2 uses kernel 4.19.55). Skimming the kernel change logs, nothing jumps out as a likely fix, but I want to start with the easiest and least impactful change: updating to the latest 4.19 kernel.

     

    If this does not solve the problem (which I expect it won't), then we have two choices:

     

1) Update to the latest stable Linux kernel (5.2.2). We are using the 5.2 kernel in Unraid 6.8-beta and so far no one has reported any sqlite DB corruption there, though the sample set is pretty small. The downside is that not all out-of-tree drivers build with the 5.2 kernel yet, so some functionality would be lost.

     

2) Downgrade Docker from 18.09.06 (the version in 6.7.2) to 18.06.03-ce (the version in 6.6.7).

[BTW, the latest Docker release, 19.03.00, was published just today - people gripe about our release numbers; try making sense of Docker release numbers haha]
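Whichever Docker a given -rc ends up with, you can confirm the version actually running from a terminal; this is the standard Docker CLI, nothing Unraid-specific:

docker version --format '{{.Server.Version}}'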

     

If neither of those steps succeeds then ... well, let's hope one of them does.

     

To get started, first make a backup of your flash via Main/Flash/Flash Backup, then switch to the 'next' branch via the Tools/Upgrade OS page. There you should see version 6.7.3-rc1.
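After upgrading and rebooting, you can confirm the kernel bump from a terminal; it should report 4.19.60 (the exact version-string suffix may vary by build):

uname -r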

     

As soon as a couple of people report corruption, I'll publish an -rc2, probably with Docker reverted.

    Edited by limetech

    • Upvote 5



    User Feedback

    Recommended Comments



    1 hour ago, trott said:

Hi guys, just want to confirm: will this unknown bug also impact normal files written to the array?

Well, it should not.

This corruption is limited to SQLite database files. It seems to manifest when there is other heavy I/O load on the server while, at the same time, a Docker application (Plex, Sonarr/Radarr) is running parallel tasks that update the database concurrently.

For other use cases like reading, copying, moving, or updating files, I have not had any problems, and I have been doing massive copying and moving - above the 200-250GB-per-day mark - while reorganizing my backups. My backup .pst file alone is in excess of 25GB, and I have also converted many of my previous computers to VMs and keep archived copies on my unRaid server.

     

    Link to comment
    12 hours ago, toonamo said:

Just created a script that checks PRAGMA integrity and, if it passes, backs up the database. If it fails, it stops the Docker container, restores the latest backup, and restarts the container.

 

I need to make one for Sonarr now, and then I'm going to borrow your code for deleting backups older than 10 days.

Any chance of sharing your script? Sounds really helpful.
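Until the original gets posted, here is a minimal sketch of that approach. Everything in it is an assumption to adjust for your own setup: the container name ('plex'), the database path, and the backup location are all placeholders.

#!/bin/bash
# Sketch: check a SQLite DB; back it up if healthy, restore the newest backup if not.
# Container name, DB path, and backup dir are placeholders -- adjust for your setup.
DB="/mnt/cache/appdata/plex/Library/Application Support/Plex Media Server/Plug-in Support/Databases/com.plexapp.plugins.library.db"
BACKUP_DIR="/mnt/user/backups/plex-db"
CONTAINER="plex"

mkdir -p "$BACKUP_DIR"

if [ "$(sqlite3 "$DB" 'PRAGMA integrity_check;')" = "ok" ]; then
    # Healthy: take a dated copy using SQLite's online-backup command,
    # which is safe even while the app holds the DB open.
    sqlite3 "$DB" ".backup '$BACKUP_DIR/library.db.$(date +%Y%m%d-%H%M%S)'"
    # Prune backups older than 10 days.
    find "$BACKUP_DIR" -type f -mtime +10 -delete
else
    # Malformed: stop the container, restore the newest backup, restart.
    docker stop "$CONTAINER"
    LATEST=$(ls -t "$BACKUP_DIR" | head -n 1)
    [ -n "$LATEST" ] && cp "$BACKUP_DIR/$LATEST" "$DB"
    docker start "$CONTAINER"
fi

Run it on a cron or User Scripts schedule. The .backup dot-command is preferable to a plain cp for the healthy case because it uses SQLite's backup API rather than copying a file that may be mid-write.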

    Link to comment

Well, Sonarr finally corrupted running the latest RC2:

* no parity drive

* no cache drive

 

Sonarr corrupted after about a week. Restoring the appdata backup from a few days ago and rolling back to the previous Unraid; I can't beta test anymore lol

     

    tower-diagnostics-20190824-1805.zip

    Link to comment
    6 hours ago, dustinr said:

Well, Sonarr finally corrupted running the latest RC2:

* no parity drive

* no cache drive

 

Sonarr corrupted after about a week. Restoring the appdata backup from a few days ago and rolling back to the previous Unraid; I can't beta test anymore lol

     

    tower-diagnostics-20190824-1805.zip 97.47 kB · 1 download

     

Yeah, I haven't seen any movement from Limetech on this in quite a while. I may be downgrading also.

    Link to comment
    13 hours ago, mi5key said:

     

Yeah, I haven't seen any movement from Limetech on this in quite a while. I may be downgrading also.

I had to do the same thing. The corruption was an everyday occurrence... and I couldn't keep rebuilding or restoring the databases. Since I moved back, I have seen ZERO corruption.

 

I wanted to help, and tried for several weeks. But now I'm just anxiously watching to see if anything changes.

    Link to comment

As a stopgap, I put in a 5400rpm laptop spinning-rust drive and used it as an unassigned disk for my appdata for about a week. During that week Plex was slow to load items and felt a little sluggish (expected), but at the end of the week I still had no corruption. I rebuilt my Docker twice in that time and let it auto-import my entire library both times. I stopped the second import after about 18 hours, and it still hadn't finished importing everything. I hit it with more than 550 movies, more than 4,000 TV episodes, just under 1,000 songs, and well over 6,000 pictures.

While importing, the console showed a bunch of "took too long" and "waited one whole second for busy database" messages, but after hours of that, chugging through my files, the database still had no corruption. I think the leading theory already had to do with the Unraid fs, but I guess here's some more confirmation.

     

Obviously, with Limetech's replies on the matter fairly sparse and no idea when a fix is coming, I didn't want to live with 5400rpm slim-SATA speeds for my database, so I am now rebuilding Plex from scratch on an NVMe drive. The console is showing a few "held transaction for too long" warnings, but the import is going much faster.

    Link to comment
    On 8/25/2019 at 7:00 AM, Rich Minear said:

I had to do the same thing. The corruption was an everyday occurrence... and I couldn't keep rebuilding or restoring the databases. Since I moved back, I have seen ZERO corruption.

 

I wanted to help, and tried for several weeks. But now I'm just anxiously watching to see if anything changes.

     

I've moved back to 6.6.7 today because I was dealing with regular corruption. Limetech's near-silence on the matter is troubling, too.

    Edited by mi5key
    Link to comment
    19 hours ago, -Daedalus said:

I'm assuming this database corruption presents as a borked container? Because if that's the case, I'm on 6.7.2 with a cache pool and zero issues. I've never experienced this (at least, as far as I can tell).

    server-diagnostics-20190903-0955.zip 280.64 kB · 2 downloads

No, it doesn't result in a borked container... just a borked SQLite db file, which breaks the software (e.g. Sonarr, Radarr). After restoring a working database (or deleting it so that the app rebuilds it), the container works just fine.

 

As outlined in the many threads, I'll summarize here: in my experience - like many others' - I was getting a corrupted DB every day. The Sonarr/Radarr UI would load, but many of the menu items would fail to work, and of course the software would not process any configured movies/shows, etc. The real errors are clearly visible in the logs, which state that the SQLite DB is malformed.
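If you want to check a database by hand, the same test the apps effectively fail on can be run directly from the sqlite3 CLI (stop the container first; the path below is just an example for Sonarr):

sqlite3 /mnt/user/appdata/sonarr/nzbdrone.db "PRAGMA integrity_check;"

A healthy database prints "ok"; a corrupted one prints a list of problems or a "database disk image is malformed" error.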

    Link to comment
    4 hours ago, TheBuz said:

    @limetech may we have an update please?

I am also getting worried by the lack of response. Does this mean it is less urgent than it used to be? I currently cannot upgrade Unraid and benefit from the latest features and fixes.

    Link to comment

I saw somewhere that version 6.8 is coming soon... but I cannot remember where (email, blog, somewhere), and I couldn't find anything about it here in the forums.

     

    I'm stuck at 6.6.7 until the sqlite issue is fixed.  I cannot go back to fighting databases daily.  

     

If fixing this has been moved to less urgent... then I guess I will stay where I am (6.6.7). My needs are very small, and it does everything I need.

     

    I do think it sucks that we went from lots of discussion, asking for diagnostics, etc...and then nothing.  Zip.  Zilch.  

    • Like 1
    Link to comment
    4 minutes ago, Rich Minear said:

I saw somewhere that version 6.8 is coming soon... but I cannot remember where (email, blog, somewhere), and I couldn't find anything about it here in the forums.

 

It was in the monthly newsletter.

    
6.8 Release Coming Soon

A peek at some new features coming in Unraid OS version 6.8:

* Forms-based webGUI authentication: now compatible with most password managers.

* WS-Discovery support: reliable Windows Network explorer listing of your server, eliminating the need for SMBv1 on your network.

* WireGuard support: easy configuration of VPN tunnels (experimental).

* Numerous bug fixes and package updates.

Introducing Unraid.net, a set of web-based services:

* Server status such as online/offline, storage used/available, etc.

* Links for local and remote access to your server webGUI.

* Backup and Restore of your USB Flash boot device.

Much more to come - stay tuned!

    Link to comment

FWIW, I am still running corruption-free (knock on wood) ever since I moved my appdata and system folders to the cache drive several weeks back. I also set up auto-backups for my Plex db and appdata folders, but haven't had to use them yet, thankfully.

 

Prior to moving these to the cache drive, I was getting Plex db corruption almost immediately after rebuilding the db. It was so frustrating that I was about to ditch unRaid. It would be nice for them to provide an update (and, obviously, implement a fix).

    Link to comment
    2 hours ago, BBLV said:

FWIW, I am still running corruption-free (knock on wood) ever since I moved my appdata and system folders to the cache drive several weeks back. I also set up auto-backups for my Plex db and appdata folders, but haven't had to use them yet, thankfully.

 

Prior to moving these to the cache drive, I was getting Plex db corruption almost immediately after rebuilding the db. It was so frustrating that I was about to ditch unRaid. It would be nice for them to provide an update (and, obviously, implement a fix).

     

The update is that we are still working on this problem and have looked at a lot of code. But what we can't do at the present time is go back to the 6.6.7 build (which does not exhibit this issue) and add components one by one along the path to 6.7.x until we hit something. Then, if we determine it's a kernel change, we face the task of bisecting the kernel. The problem with this approach is that it's extremely time-consuming, especially when each 'run' might take several hours or even days to exhibit the problem.

     

I'm not trying to make excuses, just laying out where we are with this. We have access to a server which seems to exhibit the problem readily (meaning it will happen eventually, though not "on demand"), and we are trying out some theories.

     

One thing that will be useful to know - and why I quoted this particular post from @BBLV - is whether this issue only happens when some or all of the sqlite database files exist on the array, as opposed to the cache pool. Put another way: if the entire appdata share, with all sqlite db files underneath, exists only on a cache disk or cache pool, does anyone see this problem? When it was first reported, I thought some were seeing the issue no matter which storage volumes were being used, and independent of whether access was mapped through user shares or not. That led us down the path of suspecting some common h/w whose driver might have had a change. This may still end up being part of the issue, but now I'm seeing more and more reports that with appdata entirely on cache the problem does not appear.
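For anyone willing to help answer that question, a rough harness along the following lines exercises exactly that difference. It is only a sketch: the default path is a placeholder, and the idea is to run it once against a user-share path (e.g. /mnt/user/...) and once against a cache path (e.g. /mnt/cache/...), then compare.

#!/bin/bash
# Sketch: concurrent-write stress test against a throwaway SQLite DB.
# Pass the DB path as $1; the default below is a placeholder.
DB="${1:-/mnt/user/appdata/sqlite-test/stress.db}"
mkdir -p "$(dirname "$DB")"
sqlite3 "$DB" "CREATE TABLE IF NOT EXISTS t(id INTEGER PRIMARY KEY, v TEXT);"

writer() {
    for i in $(seq 1 1000); do
        sqlite3 "$DB" "PRAGMA busy_timeout=5000; INSERT INTO t(v) VALUES('$(date +%s%N)');"
    done
}

# Four writers in parallel to force lock contention, roughly mimicking
# Sonarr/Radarr-style concurrent updates.
for w in 1 2 3 4; do writer & done
wait

# "ok" means the file survived; anything else reproduces the report.
sqlite3 "$DB" "PRAGMA integrity_check;"

Running other heavy I/O against the array at the same time would match the conditions people describe even more closely.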

    Link to comment

I have not experienced any SQLite DB corruption running version 6.7.2 since its release.

Currently running Radarr, Sonarr, Plex, Emby, etc.

 

All of my appdata resides on the cache, with Plex and Emby being the exceptions.

Plex and Emby are currently on an unassigned-device SSD (/mnt/disks/samsungssd/plex/ - /mnt/disks/samsungssd/emby/) mounted as RW/Slave.

    Edited by FalconX
    Link to comment

I've not had any DB corruption. I'm new to Unraid and loaded 6.7.2 fresh. All my appdata has lived on an unassigned SSD since the initial setup. Cache is a separate SSD.

Just adding input as a data point.

    Link to comment
    1 hour ago, limetech said:

    Put another way, if the entire appdata share, with all sqlite db files underneath all exist on a cache disk or cache pool only, does anyone see this problem?

    I am not having data corruption issues running a single cache drive using /mnt/cache/appdata for all dockers. All SQLite DBs are on the cache drive. Specs are in my signature as follows:

     

    unRAID Server Pro v6.7.2 | Array: 112TB | Parity: 10TBx2 | Cache: Samsung 970 EVO Plus 500GB NVMe | Flash Drive: SanDisk Cruzer Fit 16GB

    Case: Norco-4220 | MB: ASRock EP2C602-4L/D16 | PSU: Corsair RM1000i | RAM: IBM RDIMM 256GB (16GBx16) DDR3 1.5v

    Controllers: ASUS HYPER M.2 X16 | LSI 9201-16i | LSI 9210-8i

    Docker Containers: Guacamole | Jackett | Lidarr | Let's Encrypt | NetData | NZBGet | Ombi | Plex | Radarr | rTorrentVPN | Sonarr | Tautulli | Unifi Controller

    Plugins: Dynamix System Statistics | CA Auto Update Applications | Community Applications | Dev Tools | Dynamix SSD TRIM | Dynamix System Information | Fix Common Problems | Nerd Tools | Preclear Disks | rclone | Recycle Bin | Tips and Tweaks | Unassigned Devices | unBALANCE | User Scripts

Edited by GroxyPod (sigs are not visible here)
    Link to comment

Again, are you guys 100% sure this only impacts the sqlite DB? I don't have a cache drive now, so I download straight to the array using qBittorrent. Recently I found MakeMKV failing to remux some movies. I thought it might be a problem with the movies themselves, but I ran a force-recheck on those torrents today, and it turned out they were not 100% complete.

 

I have no proof that it is an Unraid issue, but it is not one torrent - it's several, and I have never had this issue before. I'm not happy about this, because I don't know whether any other files were also corrupted during the move to Unraid, and I have no way to check without checksums.

 

Frankly speaking, I think Unraid should pull back 6.7.2 until they've fixed this issue.
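For anyone with the same worry, checksums are easy to add going forward: record them on the source before a transfer and verify after. A minimal example (both paths are placeholders):

# On the source, record checksums relative to the folder being copied.
cd /source/movies && find . -type f -exec md5sum {} + > /tmp/movies.md5

# After copying to the array, verify everything matches.
cd /mnt/user/movies && md5sum -c /tmp/movies.md5

This won't rescue files that are already damaged, but it will tell you exactly which ones they are.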

    Link to comment

Can confirm: on 6.7.2 I had appdata on the array and it corrupted immediately and repeatedly. Switched to a cache disk and no issues.

    Edited by spyd4r
    Link to comment

All my appdata is on cache drives; I've never had any corruption through all the builds.

 

Would it be possible to set up a survey for us all to fill out, to generate data for you?

 

And on the server you are testing with, have you tried a 6.8 build with the new kernel, to rule out a 4.19 regression?

     

    Edited by Dazog
    Link to comment
    8 minutes ago, Rick Gillyon said:

    All my appdata is on cache pool, no corruption on 6.7.2 since release.

    +1

    Link to comment



This is now closed for further comments
