[Support] Paperless Docker


T0a


Is there a way to back up just the Paperless database? I see you can run a full backup that dumps the files and the database into a folder. However, since most of us are set up on Unraid parity, the one thing I need, and can't figure out, is how to run a dump of the database every so often.

Link to comment

FYI you can run both the web server and consumer in a single docker container by using a bash script:

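#!/bin/bash

# Start the document consumer and the web server as background jobs,
# then wait so the container keeps running while both are alive.
/sbin/docker-entrypoint.sh document_consumer &
/sbin/docker-entrypoint.sh runserver 0.0.0.0:8000 --insecure --noreload &
wait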

Save this file into a volume that's mounted in the container. I just put it in the appdata directory.

Then turn on advanced view and override the entry point, e.g.

--entrypoint /usr/src/paperless/data/entry.sh

Clear out the 'post arguments', since the bash script handles them now.
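The steps above can be sketched as a small shell function. This is only an illustration: it assumes the host folder you pass in is the one mapped to /usr/src/paperless/data in the container, and the example path in the usage comment is hypothetical. It writes the script with Unix line endings and marks it executable, which a later reply in this thread found was required.

```shell
# Write entry.sh into the given directory and make it executable.
# The directory should be the host folder mapped to
# /usr/src/paperless/data (an assumption; check your own template).
install_entry_script() {
    target_dir="$1"
    mkdir -p "$target_dir" || return 1
    # printf with explicit \n guarantees Unix (LF) line endings.
    printf '%s\n' \
        '#!/bin/bash' \
        '/sbin/docker-entrypoint.sh document_consumer &' \
        '/sbin/docker-entrypoint.sh runserver 0.0.0.0:8000 --insecure --noreload &' \
        'wait' > "$target_dir/entry.sh"
    chmod +x "$target_dir/entry.sh"
}

# Usage (hypothetical host path):
#   install_entry_script /mnt/user/appdata/paperless
```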

Link to comment

I have been running Paperless for a few days and I am absolutely in love with it.

The problem I ran into yesterday is poor performance when a PDF file has more than one page. I uploaded a 2 MB, 8-page file (not that much, actually) and the OCR process took over 30 minutes while using 100% CPU on all 4 cores of my Xeon E3-1225 v3. Maybe that has something to do with this issue: https://github.com/the-paperless-project/paperless/issues/438 ?

Does anyone have an idea how to optimize that process?

paperless-consumer docker log:

Consuming /consume/03.2020.pdf
** Processing: /tmp/paperless/paperless-up38twsl/convert.png
500x700 pixels, 3x16 bits/pixel, RGB
Input IDAT size = 575331 bytes
Input file size = 575592 bytes

Trying:
zc = 9 zm = 9 zs = 0 f = 0 IDAT size = 545251
zc = 9 zm = 8 zs = 0 f = 0 IDAT size = 545208
Selecting parameters:
zc = 9 zm = 9 zs = 0 f = 1 IDAT size = 494809

Output file: /tmp/paperless/paperless-up38twsl/optipng.png

Output IDAT size = 494809 bytes (80522 bytes decrease)
Output file size = 494866 bytes (80726 bytes = 14.02% decrease)

Processing sheet #1: /tmp/paperless/paperless-up38twsl/convert-0002.pnm -> /tmp/paperless/paperless-up38twsl/convert-0002.unpaper.pnm
Processing sheet #1: /tmp/paperless/paperless-up38twsl/convert-0000.pnm -> /tmp/paperless/paperless-up38twsl/convert-0000.unpaper.pnm
Processing sheet #1: /tmp/paperless/paperless-up38twsl/convert-0001.pnm -> /tmp/paperless/paperless-up38twsl/convert-0001.unpaper.pnm
Processing sheet #1: /tmp/paperless/paperless-up38twsl/convert-0003.pnm -> /tmp/paperless/paperless-up38twsl/convert-0003.unpaper.pnm
[pgm_pipe @ 0x55698b596f80] [pgm_pipe @ 0x56315b5eaf80] [pgm_pipe @ 0x55b79cc53f80] Stream #0: not enough frames to estimate rate; consider increasing probesize
Stream #0: not enough frames to estimate rate; consider increasing probesize
[pgm_pipe @ 0x55d75f3f5f80] Stream #0: not enough frames to estimate rate; consider increasing probesize
Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x55b79cc55600] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x55b79cc55600] Encoder did not produce proper pts, making some up.
Processing sheet #1: /tmp/paperless/paperless-up38twsl/convert-0004.pnm -> /tmp/paperless/paperless-up38twsl/convert-0004.unpaper.pnm
[pgm_pipe @ 0x55a4ad8d8f80] Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x55698b598600] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x55698b598600] Encoder did not produce proper pts, making some up.
out of deviation range - NO ROTATING
[image2 @ 0x55d75f3f7600] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
Processing sheet #1: /tmp/paperless/paperless-up38twsl/convert-0005.pnm -> /tmp/paperless/paperless-up38twsl/convert-0005.unpaper.pnm
[image2 @ 0x55d75f3f7600] Encoder did not produce proper pts, making some up.
[pgm_pipe @ 0x564bda956f80] Stream #0: not enough frames to estimate rate; consider increasing probesize
Processing sheet #1: /tmp/paperless/paperless-up38twsl/convert-0006.pnm -> /tmp/paperless/paperless-up38twsl/convert-0006.unpaper.pnm
[pgm_pipe @ 0x5610d26a6f80] Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x56315b5ec600] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x56315b5ec600] Encoder did not produce proper pts, making some up.
Processing sheet #1: /tmp/paperless/paperless-up38twsl/convert-0007.pnm -> /tmp/paperless/paperless-up38twsl/convert-0007.unpaper.pnm
[pgm_pipe @ 0x56090cae1f80] Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x55a4ad8da600] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x55a4ad8da600] Encoder did not produce proper pts, making some up.
[image2 @ 0x564bda958600] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x564bda958600] Encoder did not produce proper pts, making some up.
[image2 @ 0x5610d26a8600] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x5610d26a8600] Encoder did not produce proper pts, making some up.
[image2 @ 0x56090cae3600] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x56090cae3600] Encoder did not produce proper pts, making some up.
OCRing the document
Parsing for deu
Parsing for deu
Parsing for deu
Detected document date 2014-01-20T00:00:00+01:00 based on string 20.01.2014

d
Document 20140120000000: 03.2020 consumption finished

 

Link to comment
On 2/14/2020 at 4:23 PM, bling said:

FYI you can run both the web server and consumer in a single docker container by using a bash script: [...]

Thanks Bling, can you elaborate on how you override the entrypoint?  I don't see entrypoint as a variable in Unraid.
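For orientation, here is a sketch of what the override looks like as a plain docker command. It assumes that the advanced view of the Unraid template exposes an "Extra Parameters" field, which is where a flag such as --entrypoint would go; the image name and host path are hypothetical placeholders. The function only prints the command, it does not run anything.

```shell
# Hypothetical placeholders; replace with your own image and appdata path.
IMAGE="${IMAGE:-your/paperless-image}"
HOST_APPDATA="${HOST_APPDATA:-/mnt/user/appdata/paperless}"

# Print (not run) a docker command equivalent to the template settings.
# Nothing follows the image name because the post arguments are cleared.
print_docker_command() {
    cat <<EOF
docker run -d --name paperless \\
  -v "$HOST_APPDATA:/usr/src/paperless/data" \\
  --entrypoint /usr/src/paperless/data/entry.sh \\
  "$IMAGE"
EOF
}

print_docker_command
```

The path given to --entrypoint is the path inside the container, so the script has to live in a folder that is mounted there.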

Link to comment
On 2/4/2020 at 1:01 PM, Nickproof said:

Yeah, at first I created this folder on Unraid 🤦‍♂️. I'm new to Docker and how it works inside, so don't blame me ;)

Then I thought about it, and it seemed logical to me that it affects the container itself, so I put the file in the container. After a few attempts, it now works. My approach: while the container starts, quickly upload the traineddata file via SSH. (This is necessary because the container shuts down after about 30 seconds if it hits an error. That only happens if there is a file in the /consumption folder, so make sure it's empty.)

Paperless finds the file and performs OCR without errors (and it works great!).

 

So the real question is, why does Paperless not download other languages after configuration?

 


Having the same issue right now

 

I went into /usr/share/tessdata and downloaded deu.traineddata via wget https://github.com/tesseract-ocr/tessdata/blob/master/deu.traineddata

chmod +x deu.traineddata

 

See

 

bash-5.0# cd /usr/share/tessdata/
bash-5.0# ls -l
total 35504
drwxr-xr-x    1 root     root           360 Mar  1 19:05 configs
-rwxr-xr-x    1 root     root         64820 Mar  7 18:14 deu.traineddata
-rwxr-xr-x    1 root     root      23466654 Jul  9  2019 eng.traineddata
-rwxr-xr-x    1 root     root       2251950 Jul  9  2019 equ.traineddata
-rwxr-xr-x    1 root     root      10562874 Jul  9  2019 osd.traineddata
-rw-r--r--    1 root     root           572 Jul  9  2019 pdf.ttf
drwxr-xr-x    1 root     root            88 Mar  1 19:05 tessconfigs

 

Still getting:

 

pyocr.error.TesseractError: (1, b'Error opening data file /usr/share/tessdata/deu.traineddata\nPlease make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.\nFailed loading language \'deu\'\nTesseract couldn\'t load any languages!\nCould not initialize tesseract.\n')

 

and the paperless_consumer container crashes.

Everything worked for some weeks, and now this is happening.

Any idea why?
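One possible cause, judging only from the listing above: deu.traineddata is 64820 bytes, while eng.traineddata is about 23 MB. A GitHub URL containing /blob/ returns the web page that displays the file, not the file itself, so wget may have saved HTML. The sketch below checks for that; the function name is made up for illustration, and the /raw/ URL in the comment keeps the branch name from the post above.

```shell
# Return 0 if the file starts like an HTML page rather than binary data.
looks_like_html() {
    head -c 512 "$1" | grep -qiE '<!doctype html|<html'
}

# Usage inside the container:
#   looks_like_html /usr/share/tessdata/deu.traineddata && echo "HTML, not traineddata"
#
# Download the raw file instead (note /raw/ in place of /blob/):
#   wget -O /usr/share/tessdata/deu.traineddata \
#     https://github.com/tesseract-ocr/tessdata/raw/master/deu.traineddata
```

The chmod +x step should not be needed, since Tesseract only reads the file.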

Link to comment

I don't know where, when, or to whom this happened, but there are two problems and a kind of workaround.

First, add a new variable named "PAPERLESS_OCR_LANGUAGES" to your container. Note the additional 'S' at the end of the name.

Set both 'PAPERLESS_OCR_LANGUAGES' and 'PAPERLESS_OCR_LANGUAGE' to only one language. In your case that should be 'deu' without the quotes.

I didn't have enough time to test other combinations, but this worked for me.

I hope this helps.

 

Michael
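As a sketch, the workaround above amounts to giving the container two environment variables with the same single language code. In the Unraid template these would be two container variables; the docker flags in the comment are the plain-docker equivalent.

```shell
# Both variables set to one language code, as described above.
export PAPERLESS_OCR_LANGUAGE=deu
export PAPERLESS_OCR_LANGUAGES=deu

# Plain docker equivalent:
#   docker run -e PAPERLESS_OCR_LANGUAGE=deu -e PAPERLESS_OCR_LANGUAGES=deu ...
```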

Link to comment

Hi there.
Can the GUI also be changed to German?
Can I keep the original file name? I save my documents according to this scheme: YYYY-MM-DD - template.
If I see it correctly, the documents in the originals folder are simply numbered consecutively, without any meaning.

Link to comment

Does the consumer descend into subdirectories of the consume directory, or does it only consume files in the root (/consume)?

ScannerPro added a /ScannerPro directory inside my /consume directory and I can't figure out how to remove it.

And Paperless hasn't consumed it yet; I assume that's why.

 

Link to comment
  • 2 weeks later...

If I change the time zone to anything other than UTC, the web UI stops working.

The suggestion below worked for me, thanks!

But I think this may be the reason the UI stops working if I change any other settings.

 

On 2/14/2020 at 6:23 PM, bling said:

FYI you can run both the web server and consumer in a single docker container by using a bash script: [...]

 

Link to comment
On 3/26/2020 at 5:32 AM, nextgenpotato said:

If I change the time zone to anything other than UTC, the web UI stops working.

The suggestion below worked for me, thanks!

But I think this may be the reason the UI stops working if I change any other settings.

 

 

The image doesn't include any time zone data, so you need to add a volume mount for /usr/share/zoneinfo.
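A quick way to check this from a shell inside the container is to look for the zone file itself. This is a minimal sketch: the function name and the example zone are made up for illustration, and the mount shown in the comment assumes the host keeps its zone data in the same standard location.

```shell
# Return 0 if the named zone exists in the zoneinfo directory.
# ZONEINFO_DIR defaults to the standard location mentioned above.
zone_available() {
    [ -f "${ZONEINFO_DIR:-/usr/share/zoneinfo}/$1" ]
}

if zone_available "Europe/Berlin"; then
    echo "zone data found"
else
    echo "zone data missing: mount /usr/share/zoneinfo into the container"
fi

# Plain docker equivalent of the volume mount (read-only):
#   docker run -v /usr/share/zoneinfo:/usr/share/zoneinfo:ro ...
```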

Link to comment
On 2/15/2020 at 1:23 AM, bling said:

FYI you can run both the web server and consumer in a single docker container by using a bash script: [...]

I tried this, but then the container doesn't start. I only get this error in the log:

standard_init_linux.go:211: exec user process caused "no such file or directory"

Any idea what went wrong?
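This error generally means that either the file named by --entrypoint or the interpreter on its first line could not be found inside the container. Two things worth checking, neither of which is confirmed in this thread: that the path exists inside the container (not just on the host), and that the script was not saved with Windows line endings, which makes the kernel look for an interpreter named "bash" followed by a carriage return. The helper names below are made up for illustration.

```shell
# Return 0 if the first line of the script ends with a carriage return.
has_crlf_shebang() {
    head -n 1 "$1" | grep -q "$(printf '\r')\$"
}

# Strip carriage returns in place. Writing back with cat (instead of mv)
# keeps the original file's permissions, including the executable bit.
strip_crlf() {
    tr -d '\r' < "$1" > "$1.tmp" && cat "$1.tmp" > "$1" && rm -f "$1.tmp"
}

# Usage on the host (hypothetical path):
#   has_crlf_shebang /mnt/user/appdata/paperless/entry.sh && \
#     strip_crlf /mnt/user/appdata/paperless/entry.sh
```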

Link to comment
On 3/10/2020 at 8:17 AM, Michael Baecker said:

I don't know where, when, or to whom this happened, but there are two problems and a kind of workaround.

First, add a new variable named "PAPERLESS_OCR_LANGUAGES" to your container. Note the additional 'S' at the end of the name.

Set both 'PAPERLESS_OCR_LANGUAGES' and 'PAPERLESS_OCR_LANGUAGE' to only one language. In your case that should be 'deu' without the quotes.

I didn't have enough time to test other combinations, but this worked for me.

I hope this helps.

 

Michael

This is great! Now the main Paperless Docker starts downloading the proper Tesseract data. I think it is a problem with the Unraid template for the Docker, and everybody using an OCR language other than English will run into this problem. How can the Docker be updated with this other variable?

Link to comment
On 2/5/2020 at 9:28 PM, Nickfmc said:

Is there a way to back up just the Paperless database? I see you can run a full backup that dumps the files and the database into a folder. However, since most of us are set up on Unraid parity, the one thing I need, and can't figure out, is how to run a dump of the database every so often.

 

@Nickfmc I use the "CA Backup / Restore Appdata" plugin to back up the Paperless appdata folder (including the Paperless data directory) to a backup share on my array. The uploaded documents reside in another share on my array. I then use the "Unassigned Devices" plugin with a custom script to back up both shares from my array to an external hard disk. Does this answer your question?
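For a database-only dump, one option is a small script run on a schedule. The sketch below assumes the database is a single SQLite file (in a default setup this would be db.sqlite3 inside the Paperless data directory; check your own data folder). The function name and the paths in the usage comment are hypothetical.

```shell
# Copy the SQLite database to a timestamped file in the backup directory
# and print the path of the copy.
backup_paperless_db() {
    db="$1"
    dest_dir="$2"
    out="$dest_dir/db-$(date +%Y%m%d-%H%M%S).sqlite3"
    mkdir -p "$dest_dir" || return 1
    if command -v sqlite3 >/dev/null 2>&1; then
        # .backup takes a consistent snapshot even if the DB is in use.
        sqlite3 "$db" ".backup '$out'" || return 1
    else
        # Fallback: plain copy. Stop the container first so the file
        # is not being written while it is copied.
        cp "$db" "$out" || return 1
    fi
    echo "$out"
}

# Usage (hypothetical paths):
#   backup_paperless_db /mnt/user/appdata/paperless/db.sqlite3 /mnt/user/backups/paperless
```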

 

On 3/17/2020 at 4:03 PM, OOmatrixOO said:

Hi there.
Can the GUI also be changed to German?
Can I keep the original file name? I save my documents according to this scheme: YYYY-MM-DD - template.
If I see it correctly, the documents in the originals folder are simply numbered consecutively, without any meaning.

 

@OOmatrixOO As long as the Paperless metadata contains the original file name, you should be safe. If you decide to move your documents to another management system, you can use the Paperless exporter to export the files with their original names. See the Exporter documentation.

If you access the documents outside the Paperless web UI (e.g. via the share), the following pull request might solve your problem. However, I can't estimate when the feature will be merged.

 

On 4/1/2020 at 4:44 PM, pietjebell said:

This is great! Now the main Paperless Docker starts downloading the proper Tesseract data. I think it is a problem with the Unraid template for the Docker, and everybody using an OCR language other than English will run into this problem. How can the Docker be updated with this other variable?

 

@pietjebell Sorry for the inconvenience. I created a PR including your request; it should be available soon. Update: the template change is now available.

 

On 3/28/2020 at 6:14 PM, bling said:

The image doesn't include any time zone data, so you need to add a volume mount for /usr/share/zoneinfo.

 

@nextgenpotato @bling I have good news. The newest version of Paperless adds the environment variable "TZ" (e.g. TZ=America/Los_Angeles), and Unraid now passes your server's time zone to the container automatically. You need to update your container to use this feature, though. I will also remove the PAPERLESS_TIME_ZONE variable from the template, as time zones work out of the box now.

 

BTW, the new paperless version also ships with a preview window of your documents in edit mode.

Link to comment
On 2/14/2020 at 6:23 PM, bling said:

FYI you can run both the web server and consumer in a single docker container by using a bash script: [...]

I did this and the document_consumer would run, but the webserver wasn't running.  There was an error in the log about /etc/passwd being locked, not sure if that was the problem.

I switched the two lines in the entry.sh (listing the webserver first, then the document_consumer second, as below) and it works now. 

#! /bin/bash

/sbin/docker-entrypoint.sh runserver 0.0.0.0:8000 --insecure --noreload &
/sbin/docker-entrypoint.sh document_consumer &
wait

And I also had to make the file executable (chmod +x).

Link to comment

Hi, great software btw :) I'm just struggling to get it set up properly. I often hit a problem where it does not recognise the language as English when I scan letters, and therefore tries to parse in another language:

PARSE FAILURE for /consume/scan0002.pdf: Language detection failed. Set PAPERLESS_FORGIVING_OCR in config file to continue anyway.

I think part of the issue is that I live in Wales: letters are typically in English but have a bilingual header, with part of the title in both English and Welsh. I think this is causing problems with the OCR. I have no desire to have anything read in Welsh, as I cannot speak or read it, ha. Do you have any suggestions on how to overcome this issue?
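The log message itself names a setting to try. In a container setup, Paperless settings are passed as environment variables (as with PAPERLESS_OCR_LANGUAGE earlier in this thread), so a sketch of that would be a single extra container variable. The value true is an assumption about what the setting accepts as "on".

```shell
# Tell Paperless to continue when language detection fails.
# The value "true" is assumed to be read as a boolean "on".
export PAPERLESS_FORGIVING_OCR=true

# Plain docker equivalent:
#   docker run -e PAPERLESS_FORGIVING_OCR=true ...
```

With this set, a document whose language cannot be detected should still be consumed instead of being rejected with a parse failure.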

Link to comment

Anyone managed to get email checking to work with this? Running in a single container with the bash script and passing the relevant email variables, but it's not working. Seeing this in the log:

 

Starting document consumer at /consume with inotify
Traceback (most recent call last):
  File "/usr/src/paperless/src/manage.py", line 11, in <module>
    execute_from_command_line(sys.argv)
  File "/usr/lib/python3.8/site-packages/django/core/management/__init__.py", line 371, in execute_from_command_line
    utility.execute()
  File "/usr/lib/python3.8/site-packages/django/core/management/__init__.py", line 365, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/lib/python3.8/site-packages/django/core/management/base.py", line 288, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/lib/python3.8/site-packages/django/core/management/base.py", line 335, in execute
    output = self.handle(*args, **options)
  File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 97, in handle
    self.loop_inotify(mail_delta)
  File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 130, in loop_inotify
    self.loop_step(mail_delta)
  File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 120, in loop_step
    self.mail_fetcher.pull()
  File "/usr/src/paperless/src/documents/mail.py", line 185, in pull
    for message in self._get_messages():
  File "/usr/src/paperless/src/documents/mail.py", line 203, in _get_messages
    self._login()
  File "/usr/src/paperless/src/documents/mail.py", line 227, in _login
    login = self._connection.login(self._username, self._password)
  File "/usr/lib/python3.8/imaplib.py", line 601, in login
    typ, dat = self._simple_command('LOGIN', user, self._quote(password))
  File "/usr/lib/python3.8/imaplib.py", line 1197, in _quote
    arg = arg.replace('\\', '\\\\')
AttributeError: 'NoneType' object has no attribute 'replace'

 

Anyone have any idea what I'm doing wrong?

 

Thanks
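The traceback ends in imaplib's _quote(password) failing on None, which means the password handed to login() was never set. That suggests the mail password setting did not reach the consumer process. A minimal check, run from a shell inside the container, is sketched below; the variable names are assumed from the Paperless example configuration and should be verified against your version.

```shell
# Print the names of any mail settings that are unset or empty.
# Variable names are an assumption; check them against your Paperless docs.
missing_mail_vars() {
    for name in PAPERLESS_CONSUME_MAIL_HOST PAPERLESS_CONSUME_MAIL_PORT \
                PAPERLESS_CONSUME_MAIL_USER PAPERLESS_CONSUME_MAIL_PASS; do
        # Indirect lookup of the variable whose name is in $name.
        eval "value=\${$name:-}"
        [ -n "$value" ] || echo "$name"
    done
}

missing_mail_vars
```

If a name is printed, that variable is not visible to processes in the container, however it was entered in the template.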

Link to comment
  • trurl locked this topic
This topic is now closed to further replies.