[Support] Paperless Docker


T0a


------------------------------

Dear paperless user,

Paperless hasn't received many updates or bug fixes lately, and even pull requests have not been merged for some time. That said, Paperless runs rock solid and gets the job done! For some time now, there has been a well-maintained fork of Paperless called paperless-ng, and I'm happy to announce that paperless-ng is officially available via Unraid's Community Applications store (CA store). Go check it out!

----------------------------

 

Overview: Support for the Paperless Docker template in the selfhosters/unRAID-CA-templates repository.

Docker Hub: https://hub.docker.com/r/thepaperlessproject/paperless/

Documentation: https://paperless.readthedocs.io/en/latest/

 

This is the official Paperless Docker support thread. Feel free to ask questions and share information or your experience. I try to update this main post regularly and incorporate what you share. I have also started contributing features to the Paperless project. You are welcome to help improve Paperless too, as it is a community-driven project.

 

You might also find this old question about Paperless helpful: https://forums.unraid.net/topic/71733-help-with-paperless-dockerhub-unraid/

 

1. How to Install

 

Paperless uses a two-container setup: (1) a webserver that serves your files via the browser and (2) a consumer service that checks for new files in the input directory, parses them, and imports them into the database. Because Unraid does not support docker-compose, you need to create both containers from the same template with manual adjustments:

 

1. Create a "Paperless" share on your array with subfolders for media, consume, and export

2. Install the Paperless webserver

2.1 From the Apps tab, search for "Paperless"

2.2 Configure the Media, Consumption, and Export paths to point at the folders you created in step 1

2.3 Accept the defaults for the remaining variables or adjust as needed. As you get more familiar with Paperless, you may wish to add additional variables that are defined here: https://github.com/the-paperless-project/paperless/blob/master/paperless.conf.example

2.4 Hit Apply to start the container.

3. Create the Paperless superuser

3.1 From the Unraid Docker UI, click the Paperless icon and choose Console. At the prompt, type "./manage.py createsuperuser". Follow the instructions to create the Paperless user.

4. Install Paperless as a consumer service to process documents in your /consume folder

4.1 From the Unraid Docker UI, click Add Container and select the paperless template from the [ User templates ] section

4.2 Rename the container to "paperless-consumer"

4.3 Remove the port to avoid port conflicts with the webserver

4.4 Switch to Advanced mode and change the "Post Arguments" parameter to "document_consumer". If you are using NFS, also add "--loop-time 60 --no-inotify" (See FAQ)

 

Now you should be able to place a document in the /consume folder of your "Paperless" share and see it being imported into Paperless.
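Step 1 of the list above (the share subfolders) can also be scripted. The following is a minimal sketch, not part of the template: the function name `create_paperless_layout` is made up for illustration, and the share path in the usage note is an assumption about where your share is mounted.

```python
from pathlib import Path

# Subfolder names taken from step 1 of the install instructions.
SUBFOLDERS = ("media", "consume", "export")


def create_paperless_layout(share_root):
    """Create the media, consume, and export subfolders under share_root.

    Existing folders are left untouched. Returns the list of folder paths.
    """
    root = Path(share_root)
    created = []
    for name in SUBFOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

For example, `create_paperless_layout("/mnt/user/Paperless")` would prepare the three folders if your share lives at that (assumed) path. The Media, Consumption, and Export paths of the template then point at these folders.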

 

2. Paperless Scan Workflows

 

2.1 WebDav Scan App with Nextcloud

 

I also want to share my scanning workflow with Paperless and Nextcloud (see https://blog.kilian.io/paperless/ for reference). With the ScannerPro app, I can upload my scanned files from my mobile device to a Nextcloud folder via WebDAV. This folder doubles as the Paperless consume folder, so the consumer picks up the files and imports them into Paperless. The following steps are required for the setup:

 

1. Mount the Paperless /consume folder in the Nextcloud Docker container via Unraid's Docker template editor

2. Enable the External Storage app as Nextcloud admin. The app can be found in the apps dropdown, hidden among the disabled apps.

3. Add the mounted consume/ folder as local storage for your Nextcloud user with the name paperless-consume (Settings -> External Storage)

4. Configure the Nextcloud paperless-consume folder as the WebDAV target in your scan app

 

Quote

An added bonus is that the Nextcloud desktop client also syncs the consume directory to my computer, so I also have a directory there where I can drop PDF files to be added to paperless (https://blog.kilian.io/paperless/)

 

3. FAQ

 

3.1 Why does the consumer not pick up my files?

 

  • The consumer service uses `inotify` to detect new documents in the consume folder. This subsystem, however, does not support NFS shares. Thus, you need to start your consumer with "Post Arguments" defined as `document_consumer --no-inotify --loop-time 60`.
  • The document file type might not be supported. Check the consumer logs in the Docker section of the Unraid UI for warnings and errors.
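To illustrate the idea behind `--no-inotify --loop-time 60`: instead of waiting for kernel file events, the folder is listed at a fixed interval and anything not seen before is handed to the importer. The sketch below is only a conceptual model and not Paperless's actual consumer code; `find_new_files`, `poll`, and the `max_rounds` parameter are names invented for this example.

```python
import os
import time


def find_new_files(directory, seen):
    """Return files in directory that are not in seen, and add them to seen."""
    new_files = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and path not in seen:
            seen.add(path)
            new_files.append(path)
    return new_files


def poll(directory, handle, loop_time=60, max_rounds=None):
    """List directory every loop_time seconds and pass new files to handle.

    max_rounds limits the number of checks (None means run forever).
    """
    seen = set()
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for path in find_new_files(directory, seen):
            handle(path)
        rounds += 1
        # Only sleep if another round is going to follow.
        if max_rounds is None or rounds < max_rounds:
            time.sleep(loop_time)
```

Because this approach only relies on directory listings, it works the same on NFS as on local storage. The price is a delay of up to one loop interval before a new file is noticed.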

 

3.2 How to customize Paperless?

 

All variables from paperless.conf.example can be passed to the container as Docker environment variables.
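As a rough sketch of what that means in practice: each uncommented `KEY=value` line of a conf file corresponds to one environment variable, which on the Docker command line is an `-e KEY=value` flag (in the Unraid template editor you would add a Variable instead). The parser below assumes plain `KEY=value` lines with `#` comments, and the helper names `parse_conf` and `to_docker_env_args` are made up for this example.

```python
def parse_conf(text):
    """Parse KEY=value lines into a dict, skipping blanks and # comments."""
    variables = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Drop surrounding whitespace and optional quotes around the value.
        variables[key.strip()] = value.strip().strip("\"'")
    return variables


def to_docker_env_args(variables):
    """Turn a dict of variables into a list of docker '-e KEY=value' arguments."""
    args = []
    for key, value in variables.items():
        args.extend(["-e", f"{key}={value}"])
    return args
```

For instance, a conf snippet containing `PAPERLESS_OCR_LANGUAGE="deu"` would yield the two arguments `-e` and `PAPERLESS_OCR_LANGUAGE=deu`.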

 

3.3 What scanner do you use for your paperless home?

 

 

3.4 Can I use Paperless on a mobile device?

 

  • There is a mobile app at a pretty early stage of development

Hey T0a,

 

Thanks for the guide on getting this set up. I managed to get my system up and running last week after much frustration. However, I still have an issue getting document_consumer to run on startup. From reading your guide, it sounds like you are saying that you need two instances of Paperless, one specifically for the consumer. Is that correct?

 

Secondly I have configured a couple of extra variables for my docker container which others might find useful:


  • PAPERLESS_PASSPHRASE - Enables encryption of your files.
  • PAPERLESS_FORGIVING_OCR - If Paperless can't complete OCR on a document for some reason, it will still consume it into the system. By default, Paperless will not consume the file.
  • PAPERLESS_INLINE_DOC - Allows you to view Paperless documents in your web browser! By default, Paperless will trigger a download.

Thanks again! 

Quote

From reading your guide, it sounds like you are saying that you need two instances of Paperless, one specifically for the consumer. Is that correct?

You are right. The first container runs `runserver 0.0.0.0:8000 --insecure --noreload` (webserver) and the second one runs `document_consumer --no-inotify --loop-time 60` (consumer) in „Post Arguments“. Both containers share the /consume folder. I will update the install instructions with pictures and make them more precise once I find the time. If you stick to SMB shares, you might omit the `--no-inotify --loop-time 60`.
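The two "Post Arguments" values can be summarized in a few lines. This is just an illustrative sketch of the rule described in this thread: the container name "paperless" for the webserver and the helper names `consumer_args` and `post_arguments` are assumptions, while the argument strings themselves are the ones quoted above.

```python
# Post Arguments of the webserver container, as quoted in this thread.
WEBSERVER_ARGS = "runserver 0.0.0.0:8000 --insecure --noreload"


def consumer_args(polling=False, loop_time=60):
    """Build the Post Arguments for the consumer container.

    polling=True adds the flags needed when inotify is unavailable (e.g. NFS).
    """
    args = ["document_consumer"]
    if polling:
        args += ["--no-inotify", "--loop-time", str(loop_time)]
    return " ".join(args)


def post_arguments(polling=False):
    """Map each (assumed) container name to its Post Arguments string."""
    return {
        "paperless": WEBSERVER_ARGS,
        "paperless-consumer": consumer_args(polling),
    }
```

Calling `post_arguments(polling=True)` gives the NFS variant, while the default leaves the consumer on inotify.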

 

Thanks for the additional configuration variables. We plan to update the template and I will consider these too.

10 hours ago, 702Pilgrim said:

Thank you @T0a

I'm sorry if someone already posted this question.

Is "./manage.py createsuperuser" the only way to create users? Can users have their own documents?

 

This is the only way as far as I know. Is this a problem for you? There is also no multi-user feature. Can you describe your use case for multi-user support? As a workaround you can start multiple Paperless instances - one per user.


Sort of adjacent question: What scanners are people using with Paperless and, importantly, which are the least nonsense to use day to day? I'd like something that I can carelessly throw paper into and press one button to scan and upload, if I can. Home budget.

I do see the recommendations in the Paperless docs and in the OP, but they are a bit dated now. That said, if those are still the reliable working options, I'd happily consider them.

1 hour ago, hpka said:

Sort of adjacent question: What scanners are people using with Paperless and, importantly, which are the least nonsense to use day to day? I'd like something that I can carelessly throw paper into and press one button to scan and upload, if I can. Home budget.

I do see the recommendations in the Paperless docs and in the OP, but they are a bit dated now. That said, if those are still the reliable working options, I'd happily consider them.

Maybe look here?

 

https://thewirecutter.com/office/printers-scanners/

 

I haven't had time to set this Docker container up yet, but I use a Fujitsu ScanSnap. It was about $400 when I purchased it, though.

On 1/14/2020 at 2:42 AM, 702Pilgrim said:

Just so different users can have their own files but it's fine if not. 

So I'm guessing there's no way to send an email for the users to sign up?
 

 

I'm sorry, I think this is not possible at the moment. However, you can search the project's issue tracker for this. If you don't find anything similar, feel free to open a feature request and describe your use case. Maybe someone from the community will pick it up and implement it.

 

14 hours ago, hpka said:

Sort of adjacent question: What scanners are people using with Paperless and, importantly, which are the least nonsense to use day to day? I'd like something that I can carelessly throw paper into and press one button to scan and upload, if I can. Home budget.

I do see the recommendations in the Paperless docs and in the OP, but they are a bit dated now. That said, if those are still the reliable working options, I'd happily consider them.

 

If you are not aiming to implement this workflow for your business, you might do well with a phone app like ScanBot or ScannerPro (make sure to disable the built-in OCR, since it is either bad or runs in the cloud). Otherwise, as already pointed out by @ice pube, the Fujitsu ScanSnap is really popular in business-like environments.


I hadn't seen this project before, but after finding it in the Community Applications store, I wanted to try it out. Deploying the Docker container took no time at all. I've got a few documents uploaded now and like this a lot. It is much easier to organize than documents in Google Drive, for example. The OCR and tagging are awesome. I'm now considering purchasing a used scanner with the scan-to-FTP feature.

Thank you T0a!!


I followed the instructions at the top of this thread and Paperless was working for about a week. Now, all of a sudden, the "paperless-consumer" container has stopped working: it stops running after about 30 seconds. Looking at the logs, this error shows up a few times, and I'm wondering if this is what is causing it to halt?

 

pyocr.error.TesseractError: (-11, b'Tesseract Open Source OCR Engine v4.0.0 with Leptonica\nDetected 116 diacritics\ncontains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 502\n')

Full log is:

 

Mapping UID and GID for paperless:paperless to 99:1000
Operations to perform:
Apply all migrations: admin, auth, contenttypes, documents, reminders, sessions
Running migrations:
No migrations to apply.
Starting document consumer at /consume with inotify
Parsers available: RasterisedDocumentParser
Consuming /consume/2020-01-27-20-45-10.pdf
** Processing: /tmp/paperless/paperless-pg3qi59k/convert.png
500x647 pixels, 3x16 bits/pixel, RGB
Input IDAT size = 418512 bytes
Input file size = 418713 bytes

Trying:
zc = 9 zm = 9 zs = 0 f = 0 IDAT size = 233906
zc = 9 zm = 8 zs = 0 f = 0 IDAT size = 233890
Selecting parameters:
zc = 9 zm = 8 zs = 0 f = 0 IDAT size = 233890

Output file: /tmp/paperless/paperless-pg3qi59k/optipng.png

Output IDAT size = 233890 bytes (184622 bytes decrease)
Output file size = 233947 bytes (184766 bytes = 44.13% decrease)

Processing sheet #1: /tmp/paperless/paperless-pg3qi59k/convert-0000.pnm -> /tmp/paperless/paperless-pg3qi59k/convert-0000.unpaper.pnm
Processing sheet #1: /tmp/paperless/paperless-pg3qi59k/convert-0001.pnm -> /tmp/paperless/paperless-pg3qi59k/convert-0001.unpaper.pnm
[pgm_pipe @ 0x555930bc4f80] Stream #0: not enough frames to estimate rate; consider increasing probesize
[pgm_pipe @ 0x559b9f1f0f80] Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x559b9f1f2600] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x559b9f1f2600] Encoder did not produce proper pts, making some up.
out of deviation range - NO ROTATING
[image2 @ 0x555930bc6600] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x555930bc6600] Encoder did not produce proper pts, making some up.
OCRing the document
Parsing for eng
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 277, in image_to_string
return ocr.image_to_string(f, lang=lang)
File "/usr/lib/python3.7/site-packages/pyocr/tesseract.py", line 373, in image_to_string
raise TesseractError(status, errors)
pyocr.error.TesseractError: (-11, b'Tesseract Open Source OCR Engine v4.0.0 with Leptonica\nDetected 116 diacritics\ncontains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 502\n')
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/src/paperless/src/manage.py", line 11, in <module>
execute_from_command_line(sys.argv)
File "/usr/lib/python3.7/site-packages/django/core/management/__init__.py", line 371, in execute_from_command_line
utility.execute()
File "/usr/lib/python3.7/site-packages/django/core/management/__init__.py", line 365, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/usr/lib/python3.7/site-packages/django/core/management/base.py", line 288, in run_from_argv
self.execute(*args, **cmd_options)
File "/usr/lib/python3.7/site-packages/django/core/management/base.py", line 335, in execute
output = self.handle(*args, **options)
File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 97, in handle
self.loop_inotify(mail_delta)
File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 130, in loop_inotify
self.loop_step(mail_delta)
File "/usr/src/paperless/src/documents/management/commands/document_consumer.py", line 122, in loop_step
self.file_consumer.consume_new_files()
File "/usr/src/paperless/src/documents/consumer.py", line 117, in consume_new_files
if not self.try_consume_file(file):
File "/usr/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/usr/src/paperless/src/documents/consumer.py", line 158, in try_consume_file
date = parsed_document.get_date()
File "/usr/src/paperless/src/documents/parsers.py", line 127, in get_date
text = self.get_text()
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 110, in get_text
self._text = self._get_ocr(images)
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 170, in _get_ocr
raw_text = self._ocr([imgs[middle]], self.DEFAULT_OCR_LANGUAGE)
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 222, in _ocr
r = pool.map(image_to_string, itertools.product(imgs, [lang]))
File "/usr/lib/python3.7/multiprocessing/pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
pyocr.error.TesseractError: (-11, b'Tesseract Open Source OCR Engine v4.0.0 with Leptonica\nDetected 116 diacritics\ncontains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 502\n')

 


Hello T0a, is there an equivalent container for the exporter function? The consumer service is working great, but what if we want to export documents? In the WebUI, there doesn't seem to be any option to export a document. I tried creating another container called paperless-exporter and changing the post argument to document_exporter, but I really don't know what I'm doing. What's the workflow to get documents exported? Thank you.

On 1/25/2020 at 1:11 PM, ithelpme said:

Hello all, Paperless is not consuming files with the .doc extension, but it consumes PDF and JPG files just fine. Is this normal behavior?

To follow up on my own question: I wanted to throw all my documents at Paperless to leverage its ability to search metadata, etc. I found a utility called FileConver-1.2.3.x64.exe that adds an Explorer extension so that you can convert .doc or any other file to different types. Conversions can be made in bulk. For those who may be in my situation, I just thought I'd share.


I have no issue with OCR parsing English, but when I try to scan a Greek document, it never detects that it is in Greek. Instead, it seems to detect Slovenian. I do have the Greek language loaded via the OCR languages parameter. If I override the default language to Greek, it parses the document in Greek without issue, but it is no longer able to parse English documents. Am I doing something wrong? Is this behavior expected? What if I have a document that contains both Greek and English? Some words, particularly proper nouns, are not always translated into Greek. For those documents, it tries to parse the English words as Greek (when Greek is forced as the default) and does so horribly. I understand OCR is not bulletproof, but it would be great to do as little manual work as possible when scanning in hundreds of documents. Thank you to whoever takes the time to read this and possibly offers help.


Hello,

I have an issue with German OCR (deu).

I've installed Paperless and it works fine with the eng OCR, even with German files that have already been OCRed. But it won't work with the deu language. I've reinstalled both Docker containers with PAPERLESS_OCR_LANGUAGE: deu, and I'm getting the following error, after which the paperless-consumer container stops working:

pyocr.error.TesseractError: (1, b'Error opening data file /usr/share/tessdata/deu.traineddata\nPlease make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.\nFailed loading language \'deu\'\nTesseract couldn\'t load any languages!\nCould not initialize tesseract.\n')

I've even created a folder /usr/share/tessdata/, downloaded the deu.traineddata file and put it there, and tried adding the TESSDATA_PREFIX environment variable pointing to that folder, but nothing helps.

 

Any ideas?

 

Full log:

tesseract.TXT


Yeah, at first I created this folder on Unraid itself 🤦‍♂️. I'm new to Docker and to how it works internally, so don't blame me ;)

Then I thought about it, and it seemed logical that the file has to be inside the container itself. So I put the file in the container, and after a few attempts it now works. My approach: while the container is starting, quickly upload the traineddata file via SSH. (You need to be quick because the container shuts down after about 30 seconds if it hits an error. That only happens if there is a file in the /consumption folder, so make sure it is empty.)

Paperless finds the file and performs OCR without errors (and it works great!).

 

So the real question is, why does Paperless not download other languages after configuration?
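The "Error opening data file /usr/share/tessdata/deu.traineddata" message quoted earlier in the thread comes down to a missing file, so it can be checked for before starting the consumer. The following is a small sketch under the assumption that each language code maps to a `<code>.traineddata` file in the tessdata directory, as that error message suggests. The function name `missing_traineddata` is invented for this example, and the check has to run inside the container, not on the Unraid host.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, languages):
    """Return the language codes that have no <code>.traineddata file."""
    directory = Path(tessdata_dir)
    return [
        lang
        for lang in languages
        if not (directory / f"{lang}.traineddata").is_file()
    ]
```

For example, `missing_traineddata("/usr/share/tessdata", ["eng", "deu"])` returns the codes whose files are absent from that directory. An empty list means every requested language file is in place.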

 
