SQLite DB Corruption testers needed

simalex · October 13, 2019

Just upgraded to 6.8.0 rc1

Have been running without any corruption for the last 5 hours now.

I did not have the opportunity to overload the unRaid server yet. Will try this tomorrow.

Keeping my fingers crossed.

midasisking · October 13, 2019

I am seeing more corruptions after resetting everything from scratch with the latest 6.8.0-rc1 build. Repaired the Plex database after upgrading Unraid and was able to get my libraries back online but after some basic testing and some more importing of backed up shows I seem to be right back where I started. Any ideas? Really frustrating to see it all just undo my settings after I put them all back manually.

glacier-diagnostics-20191013-2016.zip

simalex · October 14, 2019

Hello all

Sorry to report that I just witnessed the DB corruption on my first attempt to stress the server in 6.8.0rc1

The test was as follows

Plex was scanning the Library

Sonarr was importing around 10GB worth of media, and is set to notify Plex upon successful import of a library item.

At the same time I was moving 180GB of data through the network to a single disk of the array via a disk mount.

The performance was good as I was monitoring the data transfer. It only seemed to slow very much at one point, where I am guessing some of the disks would have to spin up because of the library scan.

At the end of the import everything seemed to be fine. The new episodes where showing up in Plex, the transfer speed to the array was ranging around 65-75MB/s. All in all everything seemed fine, until I checked the Sonarr logs, where I saw the first messages about malformed database

Plex DB did not show any signs of corruption yet.

I noticed a weird thing I would like to report. I changed the setup the /config path in all dockers to /mnt/disk1/appdata from /mnt/user/appdata, as part of the proposed workaround back in July and still have the same setup as all dockers perform much better if /config is not on a user share. However, today I noticed that the reported available space for /config in Sonarr WebUI was 7.2TB. In order to get to the 7.2TB you need to add free space from several, but not all, drives of the array. The / path which is actually inside docker.img reported properly 12.3 GBs free.

It would seem that even though I specifically said /config is in disk1, unRaid is ignoring and treating it as a user share.

In the user shares tab /appdata shows as having 441 GBs available which is what is actually available in disk1. This might be important, as it indicates that somehow access to /config by Sonarr is actually still happening through fuse and a user share, and not a disk share.

Anyway I will be downgrading back again to version 6.6.7 today, and will check above inconsistency with /config path in Sonarr.

limetech · October 14, 2019

7 minutes ago, simalex said:

Sorry to report that I just witnessed the DB corruption on my first attempt to stress the server in 6.8.0rc1

Thanks for the report.

8 minutes ago, simalex said:

It would seem that even though I specifically said /config is in disk1

You're saying AppData config path is set to: /mnt/disk1/appdata/sonarr is that correct?

Also, which sonarr, binhex-sonarr or linuxserver sonarr?

11 minutes ago, simalex said:

This might be important, as it indicates that somehow access to /config by Sonarr is actually still happening through fuse and a user share, and not a disk share.

Indeed, if there are two or more containers accessing the same DB file, one via user share, another via disk share, that can cause problems.

simalex · October 14, 2019

3 hours ago, limetech said:

You're saying AppData config path is set to: /mnt/disk1/appdata/sonarr is that correct?

Also, which sonarr, binhex-sonarr or linuxserver sonarr?

was using linuxserver

I have both dockers installed with different paths for their config files

binhex has /config in /mnt/disk1/appdata/binhex-sonarr

linuxserver has /config in /mnt/disk1/appdata/sonarr

I will do the same load test with binhex before downgrading to see if the corruption happens there also.

The reason I moved back to linuxserver is that the backup in binhex is somehow broken and access to /tmp/nzbdrone_backup is for some reason denied. I don't know the reason for this, maybe it is related to the fact that I restored the DB to binhex docker from an old linuxserver backup, and there are some parameters set in the database that don't match the existing paths in the docker. Anyway I wanted to have a working backup process before going to "test land" again, so that I could avoid doing rebuilds from older backups.

3 hours ago, limetech said:

Indeed, if there are two or more containers accessing the same DB file, one via user share, another via disk share, that can cause problems.

I noticed a strange behavior a while back (don't even remember unRaid version but was definitely before 6.6.7), when I changed the setup of a user share.

Initially that share was setup to include all disks but then I limited it to only one disk (disk3 I think). I moved all data of that share from the rest of the disks to disk3, and also deleted the top level directory from the rest of the disk shares.

Everything seemed to work properly at the time.

However, once I added a new disk (disk8) in the array, I noticed that a directory for that user share was created in disk8 as well, and unRaid was now actually using disk8 for new writes, as if it was ignoring the setup. I moved again all files of that user share from disk8 to disk3 and deleted the top level directory from disk8, and have not had the same issue appear since even though disk3 has still near 1TB free, and disk8 is an 8TB disk filling up gradually.

Also another thing to note Tom

disk1 has appdata for all dockers, however it also stores part of my Media Library, which is actually accessed through a user share

So in my Sonarr docker configuration I have

/config pointing to /mnt/disk1/appdata/sonarr, and

/tv pointing to /mnt/user/TV-Shows

The setup of TV-Shows user share, has disk1 as part of the included disks list.

Could this be causing the corruption, because we have the same docker app accessing files on the same physical disk through a different file system. Could it be that there is some kind of i-node mix-up happening in this specific scenario, when the same docker application is attempting to write to a user share and the i-node for the fuse file system matches an actually existing i-node on the disk itself which it is already accessing directly through the disk share?

It would be unlikely to find the above setup if you have appdata stored in a cache drive, as I would assume that the media library would be on the protected storage of the array. That would explain why the corruption is not happening when appdata is stored in cache or unassigned devices. It would also explain why there are users not experiencing the corruption at all, even though the appdata is located in the protected storage.

limetech · October 14, 2019

10 hours ago, simalex said:

Plex was scanning the Library

Sonarr was importing around 10GB worth of media, and is set to notify Plex upon successful import of a library item.

Sonarr importing to Plex library or are you referring simply to separate media storage directories where media files are written by sonarr and then picked up by Plex?

10 hours ago, simalex said:

today I noticed that the reported available space for /config in Sonarr WebUI was 7.2TB. In order to get to the 7.2TB you need to add free space from several, but not all, drives of the array.

This is the part that is confusing. Please double check. If /config is mapped to /mnt/disk1 then it should not report free space beyond what's available on /mnt/disk1.

6 hours ago, simalex said:

So in my Sonarr docker configuration I have

/config pointing to /mnt/disk1/appdata/sonarr, and

/tv pointing to /mnt/user/TV-Shows

That should be ok.

simalex · October 14, 2019

21 minutes ago, limetech said:

Sonarr importing to Plex library or are you referring simply to separate media storage directories where media files are written by sonarr and then picked up by Plex?

Sonarr is importing from the download area where deluge or nzbget placed the files and moving media to the same directories that Plex Media library resides. This is a common setup.

Plex as far as I understand gets notified that there was a change in 2 ways.

a. Sonarr is connecting to Plex and notifying Plex of any changes it made to the media library (additions, deletions, replacement)

b. Plex is also monitoring the media directories and when it detects a change will trigger internally a library re-scan.

So when you import a new media file through Sonarr, this will trigger a re-scan of the library by Plex. In addition to this Plex will start downloading metadata, subtitles and (based on the setup) even start generating thumbnails etc for the newly detected media, which means further taxing of the same disks.

In order to push the server to the limit, I will have a directory of multiple media files, larger then 1GB each and do a manual import in Sonarr of all the media files. This in turn will copy the media files, one at a time, to the destination area. As each file is finished copying Sonarr will notify Plex of the update. So I get large files copied across different array disks, and at the same time usually read access of other parts of usually the same drive that is getting written.

21 minutes ago, limetech said:

today I noticed that the reported available space for /config in Sonarr WebUI was 7.2TB. In order to get to the 7.2TB you need to add free space from several, but not all, drives of the array.

Will check this when I get back and let you know

Edited October 14, 2019 by simalex

limetech · October 14, 2019

In order to get DB corruption one (or more) of these have to be true:

Multiple processes of same application or multiple threads of same process are accessing same DB file and mutual-exclusion method being used (locking) is not working correctly.
Multiple processes of different apps are accessing same DB file and stepping on each other.
There is a bug in the underlying block-I/O driver, in our case md/unraid if DB on array disk.
There is a bug in an underlying kernel device driver.
There is a bug in SQLite.

#1 is what I looked at initially and don't think this is the case.

#2 is what @simalex might be causing to happen?

#3 I don't think has ever been source of problem, where for 6.8 I pretty thoroughly reviewed and tested (again).

#4 Is possible since we cannot reproduce this problem with servers we have.

#5 Also possible since sqlite db corruption has been common over the years.

Why does 6.7.x work and later releases don't? I have no idea. We are willing to go back and start bisecting changes but this is nearly impossible to do if we can't reproduce the issue to begin with.

The other thing that makes this issue difficult is that corruption is detected only long after the event which caused the corruption and sqlite doesn't exactly make it clear what the nature of the corruption is.

11rcombs · October 14, 2019

If it's an sqlite bug, it's certainly not an sqlite _regression_, since it happens on Plex (we ship our own sqlite) and in Docker containers.

Here's a question: has anyone seen this on a standalone PMS install, or only on Docker? I haven't run into this issue since updating to the affected versions, and I run PMS directly on the host.

I've seen rare cases of sqlite corruption on my system in the past, but they were attributable to the cache/mover bug (which can be ruled out as the cause of this issue, since it reproduces on systems without cache drives, and even when bypassing shfs altogether by using an individual disk md mount).

I'm not especially familiar with how Docker volumes are implemented, but has anyone tried using a bind mount instead? Could be interesting to see if that gives different results.

Otherwise, I'd suggest having someone who sees this frequently attach strace to an affected app, watching all filesystem I/O syscalls, and see exactly what's going on when the corruption occurs. Just make sure to use `-f` to get all threads and spawned children, and direct the output to somewhere with a lot of free space.

limetech · October 14, 2019

2 hours ago, 11rcombs said:

I'm not especially familiar with how Docker volumes are implemented, but has anyone tried using a bind mount instead? Could be interesting to see if that gives different results.

Yes maybe docker related - using a bind mount of what?

2 hours ago, 11rcombs said:

Otherwise, I'd suggest having someone who sees this frequently attach strace to an affected app,

This is going to generate an absolute mountain of data because often you have to run days or weeks before you see corruption.

11rcombs · October 14, 2019

>Yes maybe docker related - using a bind mount of what?

Of whatever the affected sqlite database is on, just as opposed to using a volume. Again, I don't know how those features are implemented, but there's a chance it might hit on a different code path.

>This is going to generate an absolute mountain of data because often you have to run days or weeks before you see corruption.

Could maybe do something clever with `logrotate` + something to kill the affected service soon after it detects db corruption? But otherwise yeah sometimes you just need to generate several gigabytes of logs and then sift through to find the useful couple megs.

limetech · October 14, 2019

2 hours ago, 11rcombs said:

Here's a question: has anyone seen this on a standalone PMS install, or only on Docker?

There is similar report of happening on Ubuntu:

https://forums.plex.tv/t/error-sqlite3-nil-database-corruption-docker/403032

Title of post suggests running in a Docker container, but OP doesn't explicitly specify.

One of the first things we tried when we first started looking at the issue was to revert Docker to same version used in 6.6.7 and the db corruption reports continued.

AHappyCloud · October 15, 2019

Well its been 2 days since I've upgraded to the 6.8.0 rc and I can safely report there has been no corruption. I've downloaded over 100GB using sonarr and I've been recording TV on Plex and no apps have presented any indication of corruption (nor do the logs show). I am keeping my fingers crossed and hoping this keeps up.

limetech · October 15, 2019

21 hours ago, simalex said:

Sorry to report that I just witnessed the DB corruption on my first attempt to stress the server in 6.8.0rc1

Looking back at original report: is it possible there was corruption already and so happens a DB operation need to operate on the corrupted record?

I wondering, if when you upgraded to 6.8.0-rc were the databases recreated from scratch or checked for corruption first?

NewDisplayName · October 15, 2019

15 hours ago, limetech said:

In order to get DB corruption one (or more) of these have to be true:

Multiple processes of same application or multiple threads of same process are accessing same DB file and mutual-exclusion method being used (locking) is not working correctly.

Multiple processes of different apps are accessing same DB file and stepping on each other.

There is a bug in the underlying block-I/O driver, in our case md/unraid if DB on array disk.

There is a bug in an underlying kernel device driver.

There is a bug in SQLite.

#1 is what I looked at initially and don't think this is the case.

#2 is what @simalex might be causing to happen?

#3 I don't think has ever been source of problem, where for 6.8 I pretty thoroughly reviewed and tested (again).

#4 Is possible since we cannot reproduce this problem with servers we have.

#5 Also possible since sqlite db corruption has been common over the years.

Why does 6.7.x work and later releases don't? I have no idea. We are willing to go back and start bisecting changes but this is nearly impossible to do if we can't reproduce the issue to begin with.

The other thing that makes this issue difficult is that corruption is detected only long after the event which caused the corruption and sqlite doesn't exactly make it clear what the nature of the corruption is.

The bigger question is, why some people dont have problems, and other have.

I never had any problem with plex, radarr, sonarr, but with storjv3 container.

Maybe updating docker would be a good idea?

Edited October 15, 2019 by nuhll

simalex · October 15, 2019

1 hour ago, limetech said:

Looking back at original report: is it possible there was corruption already and so happens a DB operation need to operate on the corrupted record?

I wondering, if when you upgraded to 6.8.0-rc were the databases recreated from scratch or checked for corruption first?

Don't see how this could have happened.

This was the same DB that had been working from a restore taken since before upgrading to 6.7.2.

The Sonarr logs had not had any corruption message for at least 2 months.

Also the malformed database messages where actually linked to some series episodes that literally aired Sunday night after I had upgraded to 6.8 and Sonarr was trying to decide to put them in the download list.

You did get me to wondering though so I did the following

a. check the DB backup before upgrading to 6.8 for corruptions -- no errors came up

b. check the DB after the malformed message for corruptions -- no errors came up (kind of weird)

c. start the linuxserver Sonarr docker and tried some activities like refresh series, add new series and check in parallel the logs for any new errors.

So after adding a few new series in Sonarr I had these in the sonarr error logs.

The last error is from yesterday when I got the first malformed DB message.

Immediately above that are 2 errors that are happening on Series Refresh because there are apparently more than one posters/banners in some series available.

What is a problem for me are the top 2 errors. The 2 directories that are mentioned as missing are actually for new series just added. I checked and the directories where created successfully. I added in total 8 new series. So Sonarr created the directories and then tried to immediately access them to put covers and banners, but apparently unRaid had not yet finished creating the actual directory structure?

I also did a bit of reading on SQLite locking mechanism yesterday. After the results of today's DB integrity checks, I am thinking that yesterday I actually managed to somehow have a "corrupted" -wal (write ahead log) file but because I caught it and stopped Sonarr almost immediately the corruption did not get pushed to the actual DB. Apparently the DB itself is accessed simultaneously for read only, and only one process can write. Everyone is appending changes to -wal file and they are using some kind of file semaphore to indicate if a process is blocking the actual DB for updating. My theory is that the -wal file or the file lock share memory (-shm file) somehow become inconsistent and then "garbage" is ends up getting updated to the actual .db file. This would make the issue much more difficult to track.

Anyway

I will be back on my weekly travelling schedule again from tomorrow so I will not have much time for testing.

I am on the process of setting up a second unRaid server with actual "server" grade parts. I plan to use it for testing for this corruption issue initially. At some point I will move my data there and keep the old server up for testing if the issue has not been resolved.

Rich Minear · October 15, 2019

Is it possible to get an idea of what has changed in the new version (6.8.0rc1) that we believe may fix the issue? I'm more than happy to try it and see...if we have something that we believe was changed to fix it. My machine can usually recreate the problem in less than a day.

If its just a *hope*, then I will just wait to see what others find.....

15 hours ago, limetech said:

Yes maybe docker related - using a bind mount of what?

This is going to generate an absolute mountain of data because often you have to run days or weeks before you see corruption.

Some of us had corruption within hours or minutes....

limetech · October 15, 2019

4 hours ago, Rich Minear said:

Some of us had corruption within hours or minutes....

Repeatable?

Rich Minear · October 15, 2019

46 minutes ago, limetech said:

Repeatable?

The last 6 or 7 posts that I did about corruption all occurred within hours. None lasted a day. The diags were attached. 6.6.7 is rock solid. No corruption for weeks.

BRiT · October 15, 2019

Did you start off with an entirely new/fresh database? Or was it restored from an earlier backup?

Rich Minear · October 16, 2019

2 hours ago, BRiT said:

Did you start off with an entirely new/fresh database? Or was it restored from an earlier backup?

So that depends on what you mean by "new/fresh". Each database that I fell back to was "known good", and had been cleaned after prior corruption. I had an earlier post where I showed how to clean a database so that it contained no corruption. But two of the times, I started over from scratching, letting Plex build it.

None of the databases that I restored to had any corruption.

AHappyCloud · October 16, 2019

Well I'm sorry to report I did not make it to day 3. Database corruption happened either last night or this morning. In any case, back to 6.6.7.

limetech · October 16, 2019

Honestly, don't bother posting if you can't post at least diagnostics.zip file. How can we possibly determine where this issue lies with "sorry doesn't work". If this is all you have to say, please don't run -rc releases. This is why we have closed beta's now btw.

limetech · October 16, 2019

4 hours ago, Rich Minear said:

So that depends on what you mean by "new/fresh". Each database that I fell back to was "known good", and had been cleaned after prior corruption. I had an earlier post where I showed how to clean a database so that it contained no corruption. But two of the times, I started over from scratching, letting Plex build it.

None of the databases that I restored to had any corruption.

Rich,

In your previous post:

https://forums.unraid.net/bug-reports/stable-releases/sqlite-db-corruption-testers-needed-r598/page/8/?tab=comments#comment-5518

How do you have /config mapped for the sonarr container?

Rich Minear · October 16, 2019

2 hours ago, limetech said:

Rich,

In your previous post:

https://forums.unraid.net/bug-reports/stable-releases/sqlite-db-corruption-testers-needed-r598/page/8/?tab=comments#comment-5518

How do you have /config mapped for the sonarr container?

389995153_ScreenShot2019-10-16at2_08_27AM.png.7856199d57217586999fb3d26b5af017.png

SQLite DB Corruption testers needed

User Feedback

Recommended Comments

simalex 2

Link to comment

midasisking 0

Link to comment

simalex 2

Link to comment

limetech 3326

Link to comment

simalex 2

Link to comment

limetech 3326

Link to comment

simalex 2

Link to comment

limetech 3326

Link to comment

11rcombs 2

Link to comment

limetech 3326

Link to comment

11rcombs 2

Link to comment

limetech 3326

Link to comment

AHappyCloud 0

Link to comment

limetech 3326

Link to comment

NewDisplayName 117

Link to comment

simalex 2

Link to comment

Rich Minear 33

Link to comment

limetech 3326

Link to comment

Rich Minear 33

Link to comment

BRiT 408

Link to comment

Rich Minear 33

Link to comment

AHappyCloud 0

Link to comment

limetech 3326

Link to comment

limetech 3326

Link to comment

Rich Minear 33

Link to comment