[Support] Paperless Docker


T0a


On 4/4/2020 at 4:33 PM, T0a said:

 

 

 

 

@OOmatrixOO As long as the paperless metadata contains the original file name, you should be safe. In case you decide to move your documents to another management system, you can use the paperless Exporter to export the files with their original names. See the Exporter documentation.

 

In case you access the documents outside the paperless web UI (e.g. via the share), the following pull request might solve your problem. However, I can't estimate when the feature will be merged.

 

 

Hi.

This pull request is exactly what I'm looking for. That would be great.

How will this work once it's ready?

Will it only work for new files?

Link to comment

To all Android paperless users who are keen to experiment: there is now a mobile app in a pretty early stage of development. Feel free to give the developer some feedback to help improve the app.

 

On 4/29/2020 at 8:11 PM, OOmatrixOO said:

Hi.

This pull request is exactly what I'm looking for. That would be great.

How will this work once it's ready?

Will it only work for new files?

 

@OOmatrixOO seems like the PR got merged :) You should have access to the new feature after a container update via the UI. For now, you need to add the env variable `PAPERLESS_FILENAME_FORMAT` to the Docker template yourself. I will update the template soon.

 

I've only had a glance at the code so far. There is a migration script that might rename already imported documents as well. Give it a try.

Edited by T0a
Link to comment
  • 2 weeks later...

@T0a

 

Hi.

I've just tested it.

 

PAPERLESS_FILENAME_FORMAT: {created} - {title}

 

It works quite well.
There are two things I don't really like yet about how Paperless saves the documents: 2020-04-28-0000000000 - test-0000007.pdf

Better would be 2020-04-28 - test
Can I remove the time (0000000000) and the counter (0000007) at the end?
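
Whether the format itself can drop the time and the counter is not answered in this thread. As a rough sketch only, a shorter name could be derived from such a filename outside of Paperless, for example for a copy of the file. The helper `clean_name` and its regular expression are assumptions based on the single filename shown above; this is not a Paperless option, and it is not meant for renaming files inside the Paperless media folder.

```python
import re

# Assumed pattern, derived only from the example filename in this post:
# <date>-<digits> - <title>-<digits>.<ext>
_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})-\d+ - (?P<title>.+)-\d+\.[^.]+$"
)


def clean_name(filename: str) -> str:
    """Return '<date> - <title>', or the input unchanged if it does not match."""
    match = _PATTERN.match(filename)
    if match is None:
        return filename
    return f"{match.group('date')} - {match.group('title')}"


if __name__ == "__main__":
    print(clean_name("2020-04-28-0000000000 - test-0000007.pdf"))
```

Names that do not follow the assumed pattern are returned untouched, so the helper never guesses.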

 

Edited by OOmatrixOO
Link to comment
  • 4 weeks later...

Hello all,

 

I have a question about the PAPERLESS_FILENAME_FORMAT parameter. 

 

The description of the parameter states "Specify a filename format for the document (directories are supported)." 

 

Does this mean that directories will be auto created if they don't exist?

 

For example I want to have directories for each correspondent so I set the parameter to: 

{correspondent}/{correspondent}-{created}-{title}

When I did a test the consume failed with "FileNotFoundError: [Errno 2] No such file or directory:" error in the log file, the document doesn't appear in Documents, and the original file is left in the consume directory.

 

Thanks!

Link to comment

Hello, I was hoping to use this to replace ocrmypdf-auto. But it looks like the OCR output is put into a database rather than re-encoded(?) inline within the document. Is there a switch or flag I can set to change this? I would like my PDF files to also have selectable text after OCR finishes, and other previous options I have used let me do this. Thanks!

Link to comment

Loaded this up today and it's running great.

 

One slight stumbling block I didn't see mentioned: I don't use "Bridge" for my containers; instead I give them their own IP on the br2 network. The paperless server ran fine. When I added the consumer, it complained that the IP was already in use (it was, by the server container). In the end I used Bridge for the consumer and it started.

But when I click on WebUI from the Docker page, it goes to the wrong URL, trying the server IP address. I manually typed in the consumer address and it worked. Not a big deal, but here is some info in case others have the same problem.

 

I haven't tried the script that was posted to mount server and consumer from the same container.  Is that something that will be incorporated into this solution or is it something that I would have to try on my own if I want to give it a shot?

Link to comment

Actually, I realise my TZ is set incorrectly on Paperless. It's showing UTC. Above it says the TZ variable has been removed because it uses the server's TZ, but that doesn't seem to be happening in my case.

 

Any idea how I can fix this?  My server's TZ is set correctly ( (UTC+08:00) Kuala Lumpur, Singapore)

 

UPDATE: Looked through the documentation and saw that the PAPERLESS_TIME_ZONE variable still works.  Added it to the startup script and all is good.

Edited by dalben
Link to comment
  • 3 weeks later...

This software is great. I've got it set up as per the OP and it works fine.

 

My goal is a 100% paperless household. Most of my paper comes from mail and doctor visits.

 

Any suggestions on:

  1. What is the best budget home scanner to use for duplex scanning?
  2. How best to handle the workflow after the scan is consumed:
    1. Do you just leave it in there and use search?
    2. Do you sort it by tag/correspondent to automatically move it to Nextcloud?
    3. Other ideas?

Just looking for practical advice from other users that may have dialed in the workflow and are really happy with it.

Link to comment
On 4/8/2020 at 3:26 AM, lewispm said:

I did this and the document_consumer would run, but the webserver wasn't running.  There was an error in the log about /etc/passwd being locked, not sure if that was the problem.

I switched the two lines in the entry.sh (listing the webserver first, then the document_consumer second, as below) and it works now. 


#! /bin/bash

/sbin/docker-entrypoint.sh runserver 0.0.0.0:8000 --insecure --noreload &
/sbin/docker-entrypoint.sh document_consumer &
wait

And I also had to make the file executable (chmod +x).

I was getting frustrated trying to get it to run as one Docker container... until I read this post. Once I reversed the order, ran chmod +x on the entry.sh file, and put the commands on a single line as below, stopping and starting the container became stable and worked every time. I was getting mixed results until I made it a one-liner.

 

#! /bin/bash

/sbin/docker-entrypoint.sh runserver 0.0.0.0:8000 --insecure --noreload & /sbin/docker-entrypoint.sh document_consumer & wait

 

Now everything works as a single container. Thanks lewispm, Bling and T0a for making paperless a great tool. Perhaps this script or an equivalent can be embedded into the next Docker update.

 

Edited by dcoens
mixed results as posted
Link to comment

I'm having an issue getting the container to run with both webserver and consumer in one container using the script.  This is what shows in the container log:

 

Mapping UID and GID for paperless:paperless to 99:100
Operations to perform:
Apply all migrations: admin, auth, contenttypes, documents, reminders, sessions
Running migrations:
No migrations to apply.
Unknown command: '--entrypoint'
Type 'manage.py help' for usage.

I've got the script in Post Arguments:

 

[Screenshot: the Post Arguments field of the container template]

 

I've also tried it in Extra Parameters to no avail.  Tried quotes also.  I've changed the script like @dcoens mentions above as well.  The script is executable as well.  Any ideas what I'm doing wrong here?

Link to comment

Looks like I figured it out and I don't get the error any longer.  It had to do with line endings.  For anyone else, the entrypoint needs to go into Extra Parameters.  I was getting the following error when I had it in Extra Parameters:

 

standard_init_linux.go:211: exec user process caused "no such file or directory"

I found this post:

 

https://forums.docker.com/t/standard-init-linux-go-175-exec-user-process-caused-no-such-file/20025/9

 

I had created the file in VS Code, where I needed to change the End of Line Sequence from CRLF to LF.
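
With CRLF endings, the first line of the script ends in a carriage return, so the kernel looks for an interpreter literally named `/bin/bash\r` and reports "no such file or directory". For anyone without an editor that can switch line endings, the conversion can be sketched in a few lines of Python. The function names and the `entry.sh` path in the usage note are only illustrative.

```python
from pathlib import Path


def crlf_to_lf(data: bytes) -> bytes:
    """Replace Windows line endings (CRLF) with Unix line endings (LF)."""
    return data.replace(b"\r\n", b"\n")


def fix_script(path: str) -> bool:
    """Rewrite the file with LF endings; return True if anything changed."""
    script = Path(path)
    original = script.read_bytes()
    converted = crlf_to_lf(original)
    if converted != original:
        script.write_bytes(converted)
        return True
    return False
```

Calling `fix_script("entry.sh")` rewrites the file in place and leaves it untouched if it already uses LF.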

Link to comment
  • 2 weeks later...
17 hours ago, dcpdad said:

Hello all-

AWESOME software! Thanks for creating Paperless.

Has anyone gotten it to work via reverse proxy, specifically with LetsEncrypt? I'd love to see a sample subdomain .conf if possible.

TIA!

I have mine working with the following subdomain.conf. I hope it helps:

 

# make sure that your dns has a cname set for paperless and that your paperless container is not using a base url

server {
    listen 443 ssl;
    listen [::]:443 ssl;

    server_name paperless.*;

    include /config/nginx/ssl.conf;

    client_max_body_size 0;

    # enable for ldap auth, fill in ldap details in ldap.conf
    #include /config/nginx/ldap.conf;

    # enable for Authelia
    #include /config/nginx/authelia-server.conf;

    location / {
        # enable the next two lines for http auth
        #auth_basic "Restricted";
        #auth_basic_user_file /config/nginx/.htpasswd;

        # enable the next two lines for ldap auth
        #auth_request /auth;
        #error_page 401 =200 /ldaplogin;

        # enable for Authelia
        #include /config/nginx/authelia-location.conf;

        include /config/nginx/proxy.conf;
        resolver 127.0.0.11 valid=30s;
        set $upstream_app paperless;
        set $upstream_port 8000;
        set $upstream_proto http;
        proxy_pass $upstream_proto://$upstream_app:$upstream_port;

        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $http_connection;
    }

    location ~ (/paperless)?/api {
        include /config/nginx/proxy.conf;
        resolver 127.0.0.11 valid=30s;
        set $upstream_app paperless;
        set $upstream_port 8000;
        set $upstream_proto http;
        proxy_pass $upstream_proto://$upstream_app:$upstream_port;

        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $http_connection;
    }
}
 

Link to comment
On 6/6/2020 at 5:40 PM, KeithG said:

Hello all,

 

I have a question about the PAPERLESS_FILENAME_FORMAT parameter. 

 

The description of the parameter states "Specify a filename format for the document (directories are supported)." 

 

Does this mean that directories will be auto created if they don't exist?

 

For example I want to have directories for each correspondent so I set the parameter to: 


{correspondent}/{correspondent}-{created}-{title}

When I did a test the consume failed with "FileNotFoundError: [Errno 2] No such file or directory:" error in the log file, the document doesn't appear in Documents, and the original file is left in the consume directory.

 

Thanks!

Not sure if you saw, but there is an issue for this in the paperless github: https://github.com/the-paperless-project/paperless/issues/651.  Looks like there is a pull request to fix it but has not yet been merged.
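
For context, `FileNotFoundError` is what Python raises when a file is opened for writing inside a directory that does not exist yet. Below is a minimal sketch of the general remedy, not the code from the pull request: `store_document`, `media_root` and the example values are hypothetical, and only the format string is taken from the question.

```python
import os
import tempfile


def store_document(media_root: str, filename_format: str, data: bytes, **fields: str) -> str:
    """Write data below media_root, creating sub-directories named in the format."""
    relative = filename_format.format(**fields) + ".pdf"
    target = os.path.join(media_root, relative)
    # Without this call, open() fails with FileNotFoundError (Errno 2)
    # whenever the correspondent directory is missing.
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as handle:
        handle.write(data)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        path = store_document(
            root,
            "{correspondent}/{correspondent}-{created}-{title}",
            b"%PDF-1.4",
            correspondent="acme",
            created="2020-06-06",
            title="invoice",
        )
        print(os.path.relpath(path, root))
```

Using `exist_ok=True` keeps the call safe when the directory for a correspondent already exists from an earlier document.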

Link to comment
  • 1 month later...

Whenever I scan a bunch of papers into a large PDF document, paperless always skips OCR. Any thoughts?

 

Consuming /consume/Scan1232131.pdf
** Processing: /tmp/paperless/paperless-8qv7_pge/convert.png
494x643 pixels, 3x16 bits/pixel, RGB
Input IDAT size = 70480 bytes
Input file size = 70561 bytes

Trying:
zc = 9 zm = 9 zs = 0 f = 0 IDAT size = 52581
zc = 9 zm = 8 zs = 0 f = 0 IDAT size = 52451
Selecting parameters:
zc = 9 zm = 8 zs = 0 f = 0 IDAT size = 52451

Output file: /tmp/paperless/paperless-8qv7_pge/optipng.png

Output IDAT size = 52451 bytes (18029 bytes decrease)
Output file size = 52508 bytes (18053 bytes = 25.58% decrease)

Skipping OCR, using Text from PDF
Unable to detect date for document

d
Document 20200903202020: Scan1232131 consumption finished

 

paperless-consumer

root@localhost:# /usr/local/emhttp/plugins/dynamix.docker.manager/scripts/docker run -d --name='paperless-consumer' --net='proxynet' --cpuset-cpus='1,3,4,5,13,15,16,17' -e TZ="America/New_York" -e HOST_OS="Unraid" -e 'PAPERLESS_OCR_LANGUAGE'='eng' -e 'PAPERLESS_OCR_LANGUAGES'='eng' -e 'PAPERLESS_FORGIVING_OCR'='true' -e 'PAPERLESS_INLINE_DOC'='true' -e 'PAPERLESS_FILENAME_FORMAT'='{added}_{title}_{created}' -e 'USERMAP_UID'='99' -e 'USERMAP_GID'='100' -p '8001:8000/tcp' -v '/mnt/user/appdata/paperless/data':'/usr/src/paperless/data':'rw' -v '/mnt/user/paperless/Media/':'/usr/src/paperless/media':'rw' -v '/mnt/user/paperless/Consume/':'/consume':'rw' -v '/mnt/user/paperless/Export/':'/export':'rw' 'thepaperlessproject/paperless' document_consumer

 

 

Link to comment

Just installed Paperless on my Unraid server today. I love the idea of this application!

 

The installation post says, "4.3 Remove the port to avoid port conflicts with the webserver", and I wasn't sure if this meant deleting the port number or using the "Remove" button for the variable. I used the "Remove" button and things seem to work. Maybe someone else will also be unsure and see this post.

 

The time zone in my logs is wrong, so my server's time zone is not being passed through. Reading the posts here, I'm confused about this. For now I've added:

 

PAPERLESS_TIME_ZONE: US

 

as a variable to the consumer docker setup. The consumer docker failed because it didn't like "US". I need to figure this one out.

 

UPDATE: Set PAPERLESS_TIME_ZONE to "America/New_York" without quotes for the Paperless docker, not the consumer docker. This seemed to fix the logs.
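
A value such as `America/New_York` is a key in the IANA time zone database, while `US` on its own is not a zone name. A candidate value can be checked before it goes into the template with a small sketch like the one below. It assumes Python 3.9+ with time zone data available on the machine; `is_valid_timezone` is an illustrative helper, and whether Paperless validates the value the same way is not stated in this thread.

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def is_valid_timezone(name: str) -> bool:
    """Return True if name is a key in the IANA time zone database."""
    try:
        ZoneInfo(name)
    except (ZoneInfoNotFoundError, ValueError, OSError):
        # Unknown key, malformed key, or a key that is only a directory.
        return False
    return True


if __name__ == "__main__":
    for candidate in ("US", "America/New_York", "Asia/Singapore"):
        print(candidate, is_valid_timezone(candidate))
```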

 

I'm a little surprised by the OCR situation so far. The first scan I processed was a 1-page PDF of instructions for one of those "Kill A Watt" devices. I would have thought the OCR would do well with it, but initially it failed because it didn't detect English. I changed PAPERLESS_FORGIVING_OCR to true so it would process the PDF, which it did, but the OCR result was not good. I also tried a scan of a 1-page magazine article and the results were also not good. Other PDFs and JPGs seemed better.

 

UPDATE: I had previously scanned an obituary which was a newspaper clipping. I had scanned it as both a JPG and a PDF. No OCR for the PDF, but OCR for the JPG was pretty good. Puzzling to me.

Edited by Abe677
Link to comment
17 hours ago, Abe677 said:

UPDATE: I had previously scanned an obituary which was a newspaper clipping. I had scanned it as both a JPG and a PDF. No OCR for the PDF, but OCR for the JPG was pretty good. Puzzling to me.

Add a container variable 'PAPERLESS_OCR_ALWAYS' and set it to 'true'.

 

See section 3.2 of the original post for how to customize paperless.
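
The log line "Skipping OCR, using Text from PDF" suggests that the consumer reuses text already embedded in a PDF instead of running OCR, and that `PAPERLESS_OCR_ALWAYS` overrides this. A minimal sketch of that decision, under assumptions: `should_skip_ocr` and the `min_chars` threshold are hypothetical and not taken from the Paperless source.

```python
def should_skip_ocr(embedded_text: str, ocr_always: bool, min_chars: int = 50) -> bool:
    """Decide whether existing PDF text can be used instead of running OCR."""
    if ocr_always:
        # Corresponds to setting PAPERLESS_OCR_ALWAYS to true.
        return False
    # min_chars is an assumed cut-off for "the PDF already contains text".
    return len(embedded_text.strip()) > min_chars
```

Under this reading, a scanner that embeds its own text layer would cause OCR to be skipped unless the variable is set, while an image-only PDF would be OCRed either way.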

Link to comment
5 hours ago, bigbangus said:

Add a container variable 'PAPERLESS_OCR_ALWAYS' and set it to 'true'.

 

See section 3.2 of the original post for how to customize paperless.

Didn't make a difference. The document in question was a paper document that I scanned on my flatbed scanner, so there's no embedded text in this PDF. It's a PDF with an image in it.

Link to comment
18 minutes ago, bigbangus said:


What does the container log say?
 

Consuming /consume/P4400 Kill A Watt Operation Manual.pdf
** Processing: /tmp/paperless/paperless-x6exs9yt/convert.png
439x571 pixels, 16 bits/pixel, grayscale
Input IDAT size = 436899 bytes
Input file size = 437112 bytes

Trying:
Selecting parameters:
zc = 9 zm = 9 zs = 3 f = 4 IDAT size = 435698

Output file: /tmp/paperless/paperless-x6exs9yt/optipng.png

Output IDAT size = 435698 bytes (1201 bytes decrease)
Output file size = 435755 bytes (1357 bytes = 0.31% decrease)

Processing sheet #1: /tmp/paperless/paperless-x6exs9yt/convert-0000.pnm -> /tmp/paperless/paperless-x6exs9yt/convert-0000.unpaper.pnm
[pgm_pipe @ 0x564da79becc0] Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x564da79c0100] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x564da79c0100] Encoder did not produce proper pts, making some up.
OCRing the document
Parsing for eng
Language detection error: No features in text.
Language detection failed!
As FORGIVING_OCR is enabled, we're going to make the best with what we have.
Unable to detect date for document

d
Document 20200906082154: P4400 Kill A Watt Operation Manual consumption finished

Link to comment
On 6/10/2020 at 1:24 AM, Chad Kunsman said:

Hello, I was hoping to use this to replace ocrmypdf-auto. But it looks like the OCR output is put into a database rather than re-encoded(?) inline within the document. Is there a switch or flag I can set to change this? I would like my PDF files to also have selectable text after OCR finishes, and other previous options I have used let me do this. Thanks!

I know and have used ocrmypdf-auto, and now I'm testing Paperless. It works well and has a nice GUI.

After your post I checked the pdf files (because I couldn't believe it) and yes, these seem to be the renamed original files without OCR text merged.

But I really do need the OCR output "merged" into the PDF file. It's absolutely essential!

But I can't find any information about this. Did you find anything?
 

Ok, I've searched and found this issue on github: https://github.com/the-paperless-project/paperless/issues/681

In short: "...it looks like embedding the OCR'd text back into the PDF is not in scope for this project..."

Edited by vakilando
correction
Link to comment
  • trurl locked this topic