SQLite DB Corruption testers needed


    limetech
    • Closed

    9/17/2019 Update: we may have gotten to the bottom of this. Please try 6.7.3-rc3, available on the next branch.

    9/18/2019 Update: 6.7.3-rc4 is available to address Very Slow Array Concurrent Performance.

     


     

    Trying to get to the bottom of this... First, we have not been able to reproduce the corruption, which is odd because it implies there may be some kind of hardware/driver dependency to this issue. Nevertheless, I want to start a series of tests, which I know will be painful for some, since every time DB corruption occurs you have to go through a lengthy rebuild process. That said, we would really appreciate anyone's input during this time.

     

    The idea is that we are only going to change one thing at a time. We can either start with 6.6.7 and update things until it breaks, or start with 6.7.2 and revert things until it's fixed. Since my best guess at this point is that the issue is either with the Linux kernel, Docker, or something we have misconfigured (not one of the hundred other packages we updated), we are going to start with the 6.7.2 code base and see if we can make it work.

     

    The first stab at this is not reverting anything, but rather updating the Linux kernel to the latest 4.19 patch release, which is 4.19.60 (6.7.2 uses kernel 4.19.55). In skimming the kernel change logs, nothing jumps out as a possible fix; however, I want to first try the easiest and least impactful change: updating to the latest 4.19 kernel.

     

    If this does not solve the problem (which I expect it won't), then we have two choices:

     

    1) Update to the latest stable Linux kernel (5.2.2). We are using the 5.2 kernel in Unraid 6.8-beta and so far no one has reported any SQLite DB corruption, though the sample set is pretty small. The downside is that not all out-of-tree drivers build against the 5.2 kernel yet, so some functionality would be lost.

     

    2) Downgrade Docker from 18.09.06 (the version in 6.7.2) to 18.06.03-ce (the version in 6.6.7).

    [BTW the latest Docker release 19.03.00 was just published today - people gripe about our release numbers, try making sense of Docker release numbers haha]

     

    If neither of those steps succeed then ... well let's hope one of them does succeed.

     

    To get started, first make a backup of your flash via Main/Flash/Flash Backup, then switch to the 'next' branch via the Tools/Upgrade OS page. There you should see version 6.7.3-rc1.
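
    (A quick way for testers to confirm what they ended up with after the upgrade -- these are standard commands, nothing specific to the rc:)

    uname -r                                        # shows the running kernel version, e.g. 4.19.60-Unraid
    docker version --format '{{.Server.Version}}'   # shows the Docker engine version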

     

    As soon as a couple of people report corruption I'll publish an -rc2, probably with Docker reverted.




    User Feedback

    Recommended Comments



    Unfortunately, the Plex database got corrupted while I was trying to rebuild the Radarr and Sonarr databases. I decided to revert to 6.6.7. I will keep an eye on this thread in case there is any more testing that I might be able to help with. Fingers crossed we can get to the root of this soon.

    Link to comment

    I've been struggling with Radarr & Sonarr "malformed" database errors for weeks now too -- ever since trying to move them from my HTPC to my Unraid server (seemed like a great idea, now not so much).

     

    Unraid Setup:

    • I'm running a brand new Unraid server build (using my old HTPC hardware) so my first and only version I've used is Unraid v6.7.2.
    • I'm using binhex-radarr & binhex-sonarr docker containers.
    • All appdata mappings have been updated to /mnt/disk2.
    • I do not have a cache drive (nor do I need one for my purposes) so I'm currently using the disk array directly via /mnt/disk2.

     

    Hardware Info:

    • AMD Athlon II X4 620 
    • ASUS M4A785-M
    • 8GB DDR2 RAM
    • LSI SAS 9207-8i HBA card -- all disk drives are connected to the HBA; no drives are connected to the motherboard (to avoid various known BIOS booting issues with the MB)
    • Running 5 Data Disks with 1 Parity Disk (brand new Toshiba 4TB NAS 7200 RPM HDD).

     

    I haven't been able to get Sonarr stable on Unraid v6.7.2... I also keep getting SQLite malformed database errors every couple of days.

     

    Initially Sonarr seemed to be stable for a week before I installed Radarr... but I wasn't paying too much attention prior to installing Radarr. Lately, however, Radarr has been working without issue, but Sonarr ends up with a corrupted/malformed SQLite database just about every couple of days.

     

    I may try adding a small HDD I have in an external enclosure and moving the appdata mappings to that via Unassigned Devices as a last attempt...

    Link to comment

    I had downgraded to 6.6.7 after I spent days with corrupted databases in Plex, Sonarr, and Radarr, but I needed to upgrade to 6.7.x for some new hardware I got and am now having the same corruption issues.

     

    Following this thread, I moved my appdata to the cache disk, but then lost everything due to a restart. I then rebuilt from backups. I ran appdata from cache for 2 days with no corruption. I rebuilt Plex from the ground up twice. Usually, importing all my video and audio would cause a corruption error, but running from cache it did not corrupt. However, due to losing data on power loss, I decided to put an old hard drive into the server as an unassigned drive and have mounted that to use exclusively for appdata. I'm hoping that since it isn't included in the array at all and the array filesystem doesn't touch this drive, my DBs will be safe. I'll update if I find any corruption.

    Link to comment
    4 hours ago, Kosslyn said:

    Following this thread, I moved my appdata to the cache disk, but then lost everything due to a restart. [...] I decided to put an old hard drive into the server as an unassigned drive and have mounted that to use exclusively for appdata.

    I guess you haven't assigned a cache disk? In that case the cache path is just written to RAM. You could assign the extra disk as a cache disk and that would solve the problem.

    Link to comment

    For some reason my google-fu was failing me, so I decided not to risk it. If I write to /mnt/cache without a cache disk then it writes to RAM, correct? If I write there with a cache disk, then it writes to the disk and the data persists even through restarts and power loss?

     

    And so far there still isn't anyone who has had an issue with the DBs while running from /mnt/cache, correct? I figured another reason for using Unassigned Devices was that Unraid doesn't touch the disk at all, so it would be safe from this and most future bugs. I don't mind getting the smallest NVMe I can find and using that as an unassigned device for appdata so that I know everything is safe. Is this a bad idea? Anything I'm missing?

     

    I've also hit 2 other issues with 6.7.x so I'm trying to get those figured out, but I'm going to add back a cache disk at some point.

    Link to comment
    1 hour ago, Kosslyn said:

    If I write to /mnt/cache without a cache disk then it writes to ram, correct?

    Yes

    1 hour ago, Kosslyn said:

    If I write there with a cache disk, then it writes to the disk and the data persists even through restarts and power loss?

    Yes

    Link to comment
    1 hour ago, Kosslyn said:

    If I write to /mnt/cache without a cache disk then it writes to ram, correct?

     

    8 minutes ago, Squid said:

    Yes

     

    Does this mean that even without a physical cache disk I could set my docker containers to /mnt/cache/appdata and then periodically run the mover to get the appdata out of ram and into the array? Could this be a potential solution for those of us without physical cache disks? (fully understanding that power loss means data loss)

    Link to comment
    5 minutes ago, mdeabreu said:

    Does this mean that even without a physical cache disk I could set my docker containers to /mnt/cache/appdata and then periodically run the mover to get the appdata out of ram and into the array?

    If you set an appdata path of /mnt/cache and then move appdata to the array your appdata won’t be at /mnt/cache anymore and the dockers won’t be able to connect to it.

    Link to comment
    1 minute ago, wgstarks said:

    If you set an appdata path of /mnt/cache and then move appdata to the array your appdata won’t be at /mnt/cache anymore and the dockers won’t be able to connect to it.

     

    Sorry, I was unclear: the appdata share points to the array (say disk1 only); the containers themselves point directly to /mnt/cache/appdata instead of /mnt/user/appdata or /mnt/disk1/appdata.

     

    Then, if I understand correctly, the containers will write directly to /mnt/cache/appdata, which will be in RAM; then the mover should grab the RAM-only contents and move them back into the array.

    Link to comment

    If you point your containers to /mnt/cache/appdata they will only see what is contained in /mnt/cache/appdata. As soon as mover runs you’ll lose your appdata.

    Link to comment

    Thought I'd share my experience. I changed my drive configuration a couple of weeks ago: went from 7x 2TB drives with a BTRFS cache pool and 2 parity drives to 2x 14TB drives with no cache and 1 parity drive. I too ran into the SQLite database corruption with Plex. I thought it was something I did when I moved all the files to the new drives. I performed the Plex database fix procedures a couple of times after the drive change and then came across this thread. Since then I have downgraded to Unraid 6.6.7 and it's been running fine for 5 days. None of my shares are set to use a cache drive since moving to the new drives.

    Link to comment
    2 hours ago, Squid said:

    Yes

    Yes

    Thank you!

     

    1 hour ago, wgstarks said:

    If you point your containers to /mnt/cache/appdata they will only see what is contained in /mnt/cache/appdata. As soon as mover runs you’ll lose your appdata.

    As far as I understand it, there is no way to have files on the cache and on the array at the same time (i.e. duplicated), correct? The cache is a write cache only, and there is no way to set it up as a tiered file system, or to have files live on the cache but keep basically a snapshot of the cache stored on the array without removing the files from the cache, correct?

     

    Using "yes" for cache would never move the files to the array and using "prefer" would also not move the files from cache onto the array (unless the cache fills up). And if we use "prefer" and cache fills up, then once it is cleared, the files move back to cache but are then removed from the array, correct?

    Link to comment
    53 minutes ago, Kosslyn said:

    Using "yes" for cache would never move the files to the array

    Incorrect. "Yes" moves them to the array.

    Link to comment
    3 hours ago, mdeabreu said:

    Does this mean that even without a physical cache disk I could set my docker containers to /mnt/cache/appdata and then periodically run the mover to get the appdata out of ram and into the array? Could this be a potential solution for those of us without physical cache disks? (fully understanding that power loss means data loss)

    If you really want to, you could do something like you said if you have enough RAM. However... if you have a power outage, you will need to manually intervene and have a long enough UPS runtime to stop the Docker service, move the appdata and system shares to an array disk, then shut down after all data is safely back on the array. I'd guesstimate you'd need around an hour of runtime to get all that accomplished, so either not a consumer-grade UPS, or a backup generator that can provide seamless power through the UPS.

     

    Then, when the coast is clear, start the array, manually move the appdata and system shares back to /mnt/cache (the mover won't work if there is no real cache drive), enable the Docker service, and you're back up and running.

     

    If at any point the box shuts down before you get your data moved out of RAM, it's all gone.

     

    So, theoretically given enough resources (RAM, UPS runtime) you could make it work.

     

    However, it would seem to me that sourcing an SSD cache drive would be much cheaper and less stressful.
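
    For anyone attempting that juggling act anyway, a rough sketch of the manual shuffle might look like the lines below. The share paths and the use of rsync are my assumptions here, not a tested procedure -- adjust to your own disk layout, and stop/start the Docker service from Settings/Docker in the GUI.

    # Before shutting down (while still on UPS power): get appdata out of RAM onto an array disk
    docker stop $(docker ps -q)                        # stop all running containers first
    rsync -a /mnt/cache/appdata/ /mnt/disk1/appdata/   # copy the RAM-backed appdata to disk1
    # ...then shut down cleanly from the GUI

    # After power returns and the array is started: put appdata back into the RAM-backed path
    rsync -a /mnt/disk1/appdata/ /mnt/cache/appdata/
    # ...then re-enable the Docker service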

    Link to comment

    I'm new to Unraid; I have tried it for about 15 days, running Emby and Sonarr without issue. But I do not use the array or cache for appdata; I put appdata on an SSD mounted with UD (Unassigned Devices).

    Link to comment

    So I'm a glutton for punishment and am still using 6.7.2 and getting the corruption issue, but I mean, duh, right?

    Info about system if it helps towards finding the common denominator.

    eVGA X58 121-BL-E756 w/ i7-950 @ 3066 MHz; HVM: Enabled; IOMMU: Disabled (not available for this board); Memory: 12 GiB DDR2; Kernel: Linux 4.19.56-Unraid x86_64; 2x 10TB WD Reds; no cache; Docker appdata pointed to disk1, not user.

     

    I deleted the DB and slowly added everything back one by one to see if a particular movie/TV show was causing the issue. Doing it this way I was able to rebuild the database without corruption. If I just added all my files and let it rebuild all at once, it was almost guaranteed to cause corruption.

    After doing this and getting Plex to work, I noticed Sonarr was now corrupt.

    But since I was able to get it going and have a backup to go to, I'm wondering if anyone out there knows of some commands I could put into a script to run every so often to look for corruption: if none is found, back up the database; if corruption is found, delete the database and restore from backup.

    I'm not sure what command I could use to detect a malformed database.

    Can I back up the database while Plex is running? I know the WAL and SHM files are only there while Plex is running; I'm wondering what would happen if I back up the main files while those exist, or whether I have to stop Plex.

    Also, would anyone know how to do this with Sonarr as well?
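
    (For what it's worth, SQLite itself can answer the detection question. If the sqlite3 command-line tool is available on the host -- it isn't part of stock Unraid, so that's an assumption -- something like the lines below will report a malformed database, and can also take a consistent copy through SQLite's backup API, which is generally safer than copying the raw .db/.db-wal/.db-shm files of a live database, though stopping the container first is still the most conservative option. The paths are just examples.)

    DB=/mnt/user/appdata/binhex-sonarr/nzbdrone.db     # example; point this at the Plex or Sonarr DB file

    sqlite3 "$DB" "PRAGMA integrity_check;"            # prints "ok" if the database is healthy

    sqlite3 "$DB" ".backup '/mnt/user/appdata/backups/nzbdrone_`date +%m-%d-%Y_%H%M`.db'"   # consistent online copy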

    Link to comment

    I back up my Plex database nightly via cron because of this corruption issue. I have my library auto-update turned off and corruptions are rare, but this is a super duper workaround. I'd much rather have it auto-scan, but until Limetech can fix this, this is how I roll. Sonarr corrupts more frequently due to the way it runs.

     

    docker stop Plex     # stop the container so the DB files are quiescent
    cd /mnt/user/appdata/PlexMediaServer/
    tar zcf /mnt/user/appdata/PlexMediaServer/backups/Library_`date +%m-%d-%Y_%H%M`.tar.gz ./Library
    docker start Plex    # bring Plex back up

     

    Same with Sonarr

     

    docker stop binhex-sonarr
    cd /mnt/user/appdata/binhex-sonarr
    tar zcf /mnt/user/appdata/binhex-sonarr/Backups/cron-based/sonarrdb_`date +%m-%d-%Y_%H%M`.tar.gz config.xml nzbdrone.db nzbdrone.db-journal
    docker start binhex-sonarr

     

    Then I have a cron job sweep both locations for any backups more than 10 days old.

     

    find /mnt/user/appdata/PlexMediaServer/backups -mtime +9 -type f -delete
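
    The Sonarr backup location presumably gets the same sweep; assuming the path from the tar command above, that would be something like:

    find /mnt/user/appdata/binhex-sonarr/Backups/cron-based -mtime +9 -type f -delete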

    Link to comment

    I just created a script that runs PRAGMA integrity_check and, if it passes, backs up the database. If it fails, it stops the Docker container, restores the latest backup, and restarts the container.

     

    I need to make one for Sonarr now, and then I'm going to borrow your code for deleting backups older than 10 days.
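
    For anyone who wants to roll their own, a minimal sketch of that check-then-backup-or-restore logic -- assuming the binhex-sonarr paths used earlier in this thread, a directory of plain .db copies, and sqlite3 being available on the host; this is not the poster's actual script -- could look like:

    #!/bin/bash
    DB=/mnt/user/appdata/binhex-sonarr/nzbdrone.db
    BACKUPS=/mnt/user/appdata/binhex-sonarr/Backups/cron-based
    CONTAINER=binhex-sonarr

    if [ "$(sqlite3 "$DB" 'PRAGMA integrity_check;')" = "ok" ]; then
        # Healthy: take a consistent online copy via SQLite's backup API
        sqlite3 "$DB" ".backup '$BACKUPS/nzbdrone_`date +%m-%d-%Y_%H%M`.db'"
    else
        # Malformed: stop the container, restore the newest backup, restart
        docker stop "$CONTAINER"
        LATEST=$(ls -t "$BACKUPS"/nzbdrone_*.db | head -n 1)
        cp "$LATEST" "$DB"
        rm -f "${DB}-wal" "${DB}-shm"    # drop stale WAL/SHM files left over from the corrupted copy
        docker start "$CONTAINER"
    fi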

    Link to comment

    Still no hard fix?? I moved appdata and system to cache 2 nights ago and so far so good. I'll report back if I see corruption. If I do, I might toss this server in the can!

    Link to comment

    Downgraded again to 6.6.x since, at least for now, my use case involves periods of heavy I/O load on the server.

     

    The problem, as far as I can pinpoint it, is related to concurrent writes to the same file by more than one thread/process under heavy I/O load. It is obvious that under certain load circumstances the updates are not being applied to the file in the proper sequence, meaning that a disk section that should have been updated by process A and then by process B is actually written to disk by process B first and then by process A, leaving the file in an inconsistent state.

    This could be a bug or an improperly handled exception case in either Unraid or SQLite. In the case of SQLite on Unraid, the chances of this "heavy load" issue manifesting are multiplied because of the way Unraid works: (a) the parity disk is a bottleneck, and (b) unused disks are spun down, which causes the system to "freeze" I/O operations when one of the disks needs to be spun up again (at least that is the case on my H310).

     

    What is most concerning for me, however, is the following post in this thread:

    On 8/19/2019 at 6:26 AM, phbigred said:

    Noticed cache corruption with my VMs too becoming unable to backup

    which of course might be completely unrelated. To me this indicates that even if I choose to go the VM route instead of Docker for my Plex & Sonarr, I might still have issues once I upgrade to the latest Unraid version.

     

    In any case, I think it would be great if we could get an update from the development team, just to understand what the status is.

     

     

    Link to comment

    I've noticed a new and scary problem that happened specifically to Sonarr: for about 45 minutes, Sonarr would nearly instantly corrupt its database on 6.7.2. I do have what I'd consider a decent Docker load running at all times (see the image below). On Sunday, Sonarr, Lidarr, Radarr and Plex all corrupted their DBs during a move. Radarr had been backed up that Saturday and was easy to restore; for Plex, I ran my script (I'll talk more about it below); and then came Sonarr. I had 3 backups in the Sonarr scheduled backup directory and 2 in the manual one, and I have 2 months of backups of those sets, mostly from the original CA Appdata Backup/Restore, with my last 3 weeks being on v2. The first 2 databases didn't fix Sonarr's malformed DB (which was weird, because I used the manual backups that I KNEW were fine when I triggered the backup). Anyway, after trying about half of my backups, none of them fixed Sonarr, and then all of a sudden one worked fine. I then tested some of the backups I'd already tried (because there was no way that ALL of my backups had been corrupted for this long), and with the exception of my latest auto backup and a rogue DB from 3 weeks ago, they all worked.

     

    Now, Sonarr had corrupted these DB files. I had backups upon backups, so many of them were stored redundantly, but every time Sonarr started up on one of the backups it would corrupt it, and even after Sonarr stopped doing that, those DBs were still corrupt; the overlap in backups saved me.

     

    [Image: screenshot of the Docker containers running on the server]

     

     

    I wish I had documented this better, as it just happened on Sunday, but I was tired, and after everything went back to normal I just deleted all of the corrupted databases.

     

    Now, as for the script I mentioned: I made this a few months ago to check and manage my Plex database files. It checks them, backs up if needed, attempts a repair if needed, etc. You shouldn't run it headless, as it requires user input, although anyone who's written a basic script before can change that with two "#"s.

     

     

    I do apologise if this is the wrong place to post this comment; I just thought it'd be a good place to get this out to other users who need to work on their DBs a bit during this issue.

    Link to comment

    What I do for checking the Sonarr DB is periodically go through the logs, filtering out everything but the errors.

    If there is corruption in the DB you will see the "malformed" message there. Once I have gone through a log set, I just clear the logs as well.

    Sonarr initially still seems to work properly when the DB has only a few corruptions. Once the number of corruptions increases, Sonarr starts showing slow responsiveness, until it reaches a point where you can't even get to the landing page.
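
    If you'd rather not click through the UI, the same check can be done against the log files on disk. A rough sketch, assuming the binhex-sonarr appdata layout (the logs directory name is a guess on my part):

    grep -ril "database disk image is malformed" /mnt/user/appdata/binhex-sonarr/logs/

    Any file it lists contains at least one of the SQLite malformed-database errors.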

     

    Anyway, when doing manual backups, unless you do them from inside Sonarr (where I assume the DB is paused for the process), I think it is better to stop the Docker container altogether.

     

    I also started going through the SQLite site for additional information, and I would suggest, before restoring from a manual backup, deleting any existing .db-wal files, as they contain pending transactions. Unless you back everything up together, so that you can overwrite the .db-wal files with the exact set that existed when the DB itself was backed up, these files might cause a problem when restarting the DB, as SQLite will probably try to apply the pending changes -- more so if the .db-wal files are corrupt or have already been partially applied.
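
    In practice, a restore along those lines might look like the following, again assuming the binhex-sonarr paths from earlier in the thread and one of the tar.gz backups shown above:

    docker stop binhex-sonarr
    cd /mnt/user/appdata/binhex-sonarr
    rm -f nzbdrone.db nzbdrone.db-wal nzbdrone.db-shm    # remove the corrupt DB plus any stale WAL/SHM files
    tar zxf Backups/cron-based/sonarrdb_<date>.tar.gz    # restores config.xml and nzbdrone.db into this directory
    docker start binhex-sonarr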

     

     

    Link to comment
    Quote

    Anyway, when doing manual backups, unless you do them from inside Sonarr (where I assume the DB is paused for the process), I think it is better to stop the Docker container altogether.

    That's what I do; I use the in-app backup. I didn't know about the .db-wal files, though -- that makes total sense, thank you. I'll make sure to delete those the next time something pops up.

    Link to comment

    Hi guys, I just want to confirm: will this unknown bug also impact normal files written to the array?

    Link to comment



    This is now closed for further comments
