[Support] cmccambridge - ocrmypdf-auto


Application: ocrmypdf-auto Overview: Automatic OCR of image PDFs from an input directory to an output directory using ocrmypdf and the latest tesseract. Docker: https://quay.io/repository/cm


Hi @Toskache, @nik82, @sunpower,

 

Apologies for being absent this past week, and thanks for your patience!

 

Thank you for letting me know about this issue. From your description and screenshots, I believe I understand the problem, and have deployed a fix to resolve the "Not Available" message under the Version column on the Docker tab. However, since the bug is related to checking for and applying updates... we may need to be a little more hands-on to repair your individual Docker containers.


Details of the bug are at the bottom, but first, let's get your Docker containers fixed. This is a one-time process to get past this update-related issue.

  1. From your Docker tab, find the ocrmypdf-auto Docker container that you have configured. It will be shown as in @nik82's screenshot, with the "Not Available" status in the Version column.
  2. Click on your ocrmypdf-auto Docker container and select Edit.
  3. From the editor, find the field named Repository and edit the value to be exactly: cmccambridge/ocrmypdf-auto
    Please note that you will likely be deleting the text "quay.io/" from the beginning of this field.
  4. At the bottom of the page, click Apply. Unraid will restart your container and update it to point to the Docker Hub repository instead of quay.io. This will also update the image if your local container was not previously up to date.
  5. After returning to the main Unraid Docker tab, click the "Check for Updates" button at the bottom of the page. This process will take a few moments.
  6. You should now see proper version information for ocrmypdf-auto.
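
For anyone who wants to double-check the new Repository value before pasting it in, step 3 boils down to dropping the leading "quay.io/" from the old value. Here is a minimal Python sketch of that string edit. The helper name to_docker_hub_repository is hypothetical and is not part of Unraid or ocrmypdf-auto; it only illustrates the edit you make by hand in the template editor.

```python
QUAY_PREFIX = "quay.io/"


def to_docker_hub_repository(repository: str) -> str:
    """Return the Repository value with a leading 'quay.io/' removed.

    Values that do not start with 'quay.io/' are returned unchanged
    (apart from surrounding whitespace being trimmed).
    """
    value = repository.strip()
    if value.startswith(QUAY_PREFIX):
        value = value[len(QUAY_PREFIX):]
    return value


print(to_docker_hub_repository("quay.io/cmccambridge/ocrmypdf-auto"))
```

If the field already reads cmccambridge/ocrmypdf-auto, the value is left alone, which matches the "you will likely be deleting" wording in step 3.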

That's it! Let me know if you continue to experience any issues.

 

More technical details, for the curious:

  • The container itself was not experiencing any problems. You should have noticed that it was still shown as ▶️ started and was operating correctly.
  • The trouble lies in Unraid's docker manager page, which is designed for the standard/first-party container repository Docker Hub. Although Unraid's docker manager does support installing and running a container from a third-party container repository like quay.io, it does not properly handle version and update checks for such a third-party repository.
  • I learned about this rather too late, and had already deployed a few Community Applications using quay.io 🤦‍♂️
  • ocrmypdf-auto has not had any feature changes since I learned of the version reporting issue, so I had never gone through the migration process to Docker Hub, and Unraid's docker page simply reported "Update available" in many cases when no update was actually needed.
  • For better or worse, the latest Unraid version now detects the failure to query version information from third-party repositories, and shows this failure as the "Not Available" message we've all been seeing on ocrmypdf-auto lately.
  • The Fix:
    1. I've migrated the ocrmypdf-auto repository from quay.io to be mirrored at Docker Hub.
    2. I've updated my Community App template for ocrmypdf-auto to refer to the Docker Hub version of the container, so all future users will never experience this issue.
    3. I've documented the manual remediation steps for you all, as I'm not confident that the update to the Community App template will necessarily be automatically applied for you, given that the bug is related to checking for updates, and I don't want anyone stuck with a broken container!
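
To make the Docker Hub versus third-party distinction above concrete, here is a small sketch of the usual Docker naming convention for telling which registry an image reference points at: the part before the first "/" is treated as a registry host only if it contains a "." or ":" or is "localhost"; otherwise the image is assumed to live on Docker Hub (docker.io). This is an illustration of that convention only. It is not Unraid's actual update-check code, and the function name registry_of is made up for the example.

```python
def registry_of(image: str) -> str:
    """Guess the registry host of an image reference.

    Follows the common Docker convention: a first path component that
    looks like a host name (contains '.' or ':', or is 'localhost')
    is the registry; anything else defaults to Docker Hub.
    """
    first, slash, _rest = image.partition("/")
    if slash and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


for ref in ("quay.io/cmccambridge/ocrmypdf-auto",
            "cmccambridge/ocrmypdf-auto:latest"):
    print(ref, "->", registry_of(ref))
```

The first reference resolves to quay.io (a third-party registry), the second to Docker Hub, which is the case Unraid's version check is designed for.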

Thanks again for reporting this issue,

- @cmccambridge

 

Link to post

Many thanks @cmccambridge. 👏

 

Everything is fine now on my server with your app. The ocrmypdf-auto Docker container is now up to date after correcting the repository location. It was very easy to change (even for me as a noob) thanks to your detailed, illustrated description and explanation. I'm happy now. 🙂

Link to post
6 hours ago, cmccambridge said:

Thank you for the update. For some reason I am still getting an "Access denied" error on all processed documents: the originals open fine, but anything in the output folder gives an "Access denied" error when I try to open it.

 


Link to post
6 hours ago, cmccambridge said:


My apologies, I solved the Access denied issue by running the Docker Safe New Perms tool under the Tools section :)

 


Link to post

Great to hear @sunpower, @nik82! Thanks for letting me know that it's working for you - now I can be confident that I correctly understood the problem :) 

 

I will file an issue in my GitHub backlog here to try to provide an explicit warning (or maybe even an error) in the container logs if it detects permissions that are unlikely to work properly...

 

Since the /output directory is mapped to another location (e.g. appdata or a share), the permissions are set and controlled from outside ocrmypdf-auto... but I can try and at least observe them and provide a recommendation for folks to go use the "Docker Safe New Perms" tool. Nice detective work!
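
As a rough illustration of what such a permissions observation could look like, here is a minimal Python sketch. It is not the container's actual code: the function name mode_allows_write is hypothetical, and the check only looks at the classic owner/group/other mode bits, ignoring ACLs, supplementary groups, and root. The 99 and 100 values are the USERMAP_UID and USERMAP_GID settings that appear in a docker run command elsewhere in this thread.

```python
import os
import stat


def mode_allows_write(path: str, uid: int, gid: int) -> bool:
    """Rough check: do the mode bits of `path` let uid/gid create files in it?

    Creating a file in a directory needs both write and execute (search)
    permission on that directory. Only the plain mode bits are examined.
    """
    info = os.stat(path)
    if info.st_uid == uid:
        needed = stat.S_IWUSR | stat.S_IXUSR
    elif info.st_gid == gid:
        needed = stat.S_IWGRP | stat.S_IXGRP
    else:
        needed = stat.S_IWOTH | stat.S_IXOTH
    return (info.st_mode & needed) == needed


# Only meaningful inside a container where /output is mapped.
if os.path.isdir("/output") and not mode_allows_write("/output", uid=99, gid=100):
    print('WARNING: /output may not be writable; try "Docker Safe New Perms".')
```

A check like this can only warn. As noted above, the permissions themselves are controlled from outside the container.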

 

- @cmccambridge

 

Link to post

@cmccambridge  Thanks for your effort. It works fine now.
I have a new Brother ADS-2700w and it works very well with your ocrmypdf Docker container. For scanning a stack of different documents, it would be great if ocrmypdf could split documents using separator pages (with a special barcode or special text...) that are inserted into the stack before scanning. Is something like that possible? That would take workflows and productivity to the next level.

In one sentence: if the content of a page is just "SEPERATOR-PAGE", then discard that page, save all previous pages, and start a new document.
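
The one-sentence rule above can be sketched in a few lines of Python. This is only an illustration of the requested splitting logic, not an existing ocrmypdf or ocrmypdf-auto feature. It assumes the OCR text of each page is already available as a list of strings, and the function name split_on_separators is made up for the example. The marker text is spelled as in the post.

```python
from typing import List

SEPARATOR_TEXT = "SEPERATOR-PAGE"  # marker text, spelled as in the post above


def split_on_separators(page_texts: List[str],
                        separator: str = SEPARATOR_TEXT) -> List[List[int]]:
    """Group page indices into documents, discarding separator pages.

    A page counts as a separator when its text, ignoring surrounding
    whitespace, is exactly the marker. Empty groups are never emitted.
    """
    documents: List[List[int]] = []
    current: List[int] = []
    for index, text in enumerate(page_texts):
        if text.strip() == separator:
            if current:
                documents.append(current)
            current = []
        else:
            current.append(index)
    if current:
        documents.append(current)
    return documents


pages = ["invoice p1", "invoice p2", "SEPERATOR-PAGE", "letter", "SEPERATOR-PAGE\n", "contract"]
print(split_on_separators(pages))
```

Each inner list holds the zero-based page numbers of one output document. In practice OCR output is rarely an exact match for the marker, so a real implementation would probably need fuzzier matching or the barcode approach mentioned above.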

Link to post
14 minutes ago, Toskache said:

Give Devonthink a try!

Thanks @Toskache for the hint. I'm also looking for a DMS tool.

 

How can I use it with Unraid? Sorry for my stupid question, but I'm a noob. I can't find "devonthink" in the apps catalogue, and I don't know how to install a Devonthink Docker container in Unraid.

 

Link to post
24 minutes ago, sunpower said:

Sorry, I thought you were looking for a desktop solution. "Paperless" is the name of an application for macOS. Devonthink is also a desktop solution.
I was also looking for a DMS running as a Docker container on the Unraid system, but I found nothing that runs out of the box. "EcoDMS" seems to be a good system and is available for Windows, macOS, and Linux. There is also a Docker version:  https://hub.docker.com/r/ecodms/allinone-18.09/
But I am not sure what is needed to run it on Unraid. There used to be a version in the Unraid apps catalogue, but not at the moment.

Link to post

Hello, thank you for making this app.

I'm a new Unraid user, and I found ocrmypdf-auto in the apps and installed it.

When I put a PDF file into the input folder, I can see in the log that it was processed.

But there are no files in the output folder.

Sorry for my stupid question (I'm a noob) and my bad (school) English (it's not my native language).

 

Quote

root@localhost:# /usr/local/emhttp/plugins/dynamix.docker.manager/scripts/docker run -d --name='ocrmypdf-auto' --net='bridge' -e TZ="Asia/Shanghai" -e HOST_OS="Unraid" -e 'OCR_OUTPUT_MODE'='MIRROR_TREE' -e 'OCR_ACTION_ON_SUCCESS'='NOTHING' -e 'OCR_LANGUAGES'='chi-sim' -e 'OCR_NOTIFY_URL'='' -e 'OCR_PROCESS_EXISTING_ON_START'='0' -e 'OCR_VERBOSITY'='warn' -e 'USERMAP_UID'='99' -e 'USERMAP_GID'='100' -v '/mnt/user/Download/pdf/input':'/input':'rw' -v '/mnt/user/Download/pdf/output':'/output':'rw' -v '/mnt/user/appdata/ocrmypdf-auto':'/config':'rw' 'cmccambridge/ocrmypdf-auto:latest' 
5ae0a6bd84857e48fcd0a475f6642ff9424bba16bb22095e669e4c1ec60d4069

The command finished successfully!

 

Quote

Mapping UID and GID for docker:docker to 99:100
---- Updating apt cache for Tessearct Language installation
Get:1 http://archive.ubuntu.com/ubuntu bionic InRelease [242 kB]
Get:2 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:3 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [855 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:5 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:6 http://archive.ubuntu.com/ubuntu bionic/multiverse amd64 Packages [186 kB]
Get:7 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [831 kB]
Get:8 http://archive.ubuntu.com/ubuntu bionic/universe amd64 Packages [11.3 MB]
Get:9 http://security.ubuntu.com/ubuntu bionic-security/restricted amd64 Packages [34.6 kB]
Get:10 http://security.ubuntu.com/ubuntu bionic-security/multiverse amd64 Packages [7904 B]
Get:11 http://archive.ubuntu.com/ubuntu bionic/main amd64 Packages [1344 kB]
Get:12 http://archive.ubuntu.com/ubuntu bionic/restricted amd64 Packages [13.5 kB]
Get:13 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages [1360 kB]
Get:14 http://archive.ubuntu.com/ubuntu bionic-updates/multiverse amd64 Packages [11.9 kB]
Get:15 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages [1149 kB]
Get:16 http://archive.ubuntu.com/ubuntu bionic-updates/restricted amd64 Packages [48.1 kB]
Get:17 http://archive.ubuntu.com/ubuntu bionic-backports/main amd64 Packages [2496 B]
Get:18 http://archive.ubuntu.com/ubuntu bionic-backports/universe amd64 Packages [4245 B]
Fetched 17.7 MB in 8s (2143 kB/s)
Reading package lists...
---- Installing Tesseract Langauge: chi-sim via tesseract-ocr-chi-sim
Reading package lists...
Building dependency tree...
Reading state information...
The following NEW packages will be installed:
tesseract-ocr-chi-sim
0 upgraded, 1 newly installed, 0 to remove and 87 not upgraded.
Need to get 1636 kB of archives.
After this operation, 2484 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tesseract-ocr-chi-sim all 4.00~git24-0e00fe6-1.2 [1636 kB]
debconf: delaying package configuration, since apt-utils is not installed
Fetched 1636 kB in 4s (451 kB/s)
Selecting previously unselected package tesseract-ocr-chi-sim.
Preparing to unpack .../tesseract-ocr-chi-sim_4.00~git24-0e00fe6-1.2_all.deb ...
Unpacking tesseract-ocr-chi-sim (4.00~git24-0e00fe6-1.2) ...
Setting up tesseract-ocr-chi-sim (4.00~git24-0e00fe6-1.2) ...
---- Cleaning apt cache
2020-03-18 02:41:15 - Watching /input
2020-03-18 02:41:46 - Processing: /input/test.pdf -> /output/test.pdf
2020-03-18 02:41:50 - Processing complete in 3.430000 seconds with status 3: /input/test.pdf
TESTOCR_PROCESS_RESULT/input/test.pdf/output/test.pdf.430000

 

Link to post
10 minutes ago, fkeven said:

But there are no files in the output folder

Are you sure you are looking in the right place?

11 minutes ago, fkeven said:

/mnt/user/Download/pdf/output

This would be in your Download share in the pdf/output folder.

 

Maybe that is where you looked, but since you are new I thought maybe you don't understand how to use dockers.

Link to post
38 minutes ago, trurl said:

Thank you for your reply.

I'm a noob and have just started an Unraid trial.

I think the output file should be in /mnt/user/Download/pdf/output.

But it isn't there.

Link to post
  • 2 months later...

Hi, thanks for this Docker container, it's a pleasure to use. I notice some differences in quality between the source file and the output file, which seem to be related to the image compression level. Is there a setting to change the quality/compression level of the output PDF file?

Thanks in advance.

Link to post
  • 1 month later...
  • 3 months later...

The app is great, but to get my paperless setup working I would need a feature to specify a unique file name for the output file, e.g. something like:

SCAN_YEAR_MONTH_DAY_TIME_ID.pdf

 

The problem is that my scanner does not provide unique file names with, e.g., an ever-increasing index number; instead it restarts counting from 1 as soon as there are no more files in the folder.

This means that once the files have been processed and deleted from the incoming scans folder, the scanner restarts indexing and reuses an earlier file name, e.g. SCN_0001.pdf, which then causes the output file to be overwritten.

 

Keeping the input files is not an option either, as the scanner's index is limited to 2000 (SCN_2000.pdf), which would limit the number of possible scans.

 

Is there a way to make a small modification to the ocrmypdf-auto.py Python script to give the output file a unique name?
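
One possible shape for such a naming helper, sketched in Python. This is not taken from ocrmypdf-auto.py and has not been tested against it. The function name unique_output_path, the "SCAN" prefix, and the four-digit counter are all assumptions chosen to resemble the SCAN_YEAR_MONTH_DAY_TIME_ID.pdf pattern requested above.

```python
from datetime import datetime
from pathlib import Path


def unique_output_path(output_dir: str, when: datetime, prefix: str = "SCAN") -> Path:
    """Build an output path like SCAN_2020_09_01_123000_0001.pdf.

    The trailing counter is increased until the name is not already
    taken in output_dir, so an existing file is never overwritten.
    """
    stamp = when.strftime("%Y_%m_%d_%H%M%S")
    counter = 1
    while True:
        candidate = Path(output_dir) / f"{prefix}_{stamp}_{counter:04d}.pdf"
        if not candidate.exists():
            return candidate
        counter += 1


print(unique_output_path("/output", datetime.now()).name)
```

The exists-then-write check is not atomic, so this is only adequate when a single process writes to the output folder.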

 

 

Link to post
  • 2 weeks later...

@cmccambridge - thank you so much for this!!! At first I was searching the web for native macOS apps...
Almost none of them can do batch processing; only the big players like Abbyy can, and they would force me to spend 200€ on their tool. 🤯
 
I have noticed both an increase and a decrease in file size.

Does your Docker container do any resizing or optimization?
As far as I can see:

  • PDFs scanned in black/white or grayscale get a slightly larger file size (5-10%), maybe just because of the added text layer. That's fine.
  • PDFs scanned in color get a much smaller file size, sometimes nearly 50% smaller!
    • Is there any way to control this "optimization"? A 40-50% decrease is not possible without a high loss of information when compressing.
    • Could you provide an option in the unRAID template to set a parameter like "CompressionLevel=veryhigh,high,normal,low,verylow,none"?
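
For context on what a "CompressionLevel"-style setting might map to: upstream ocrmypdf documents an --optimize option with levels 0 to 3 (0 disables the optimization pass) and a --jpeg-quality option. The Python sketch below only assembles such a command line. Whether the ocrmypdf version bundled in this container supports these options, how the container would pass them through, and whether the optimization pass is actually what shrinks the color scans are all unconfirmed in this thread. The helper name build_ocrmypdf_args is hypothetical.

```python
from typing import List, Optional


def build_ocrmypdf_args(input_pdf: str, output_pdf: str,
                        optimize: int = 0,
                        jpeg_quality: Optional[int] = None) -> List[str]:
    """Assemble an ocrmypdf command line with an explicit optimization level.

    optimize: 0 (no optimization) through 3 (most aggressive).
    jpeg_quality: optional JPEG quality value, passed only when given.
    """
    if optimize not in (0, 1, 2, 3):
        raise ValueError("optimize must be 0, 1, 2 or 3")
    args = ["ocrmypdf", "--optimize", str(optimize)]
    if jpeg_quality is not None:
        args += ["--jpeg-quality", str(jpeg_quality)]
    args += [input_pdf, output_pdf]
    return args


print(" ".join(build_ocrmypdf_args("/input/test.pdf", "/output/test.pdf")))
```

The sketch deliberately stops at building the argument list and does not run anything.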

 
I highly appreciate your work and want to buy you a beer, but I found no link either here or on GitHub.
So just send me a link and the next drinks are on me.👍

Link to post
  • 2 months later...

This looks like exactly what I was looking for.

One question: Is it possible to have multiple incoming directories -> multiple output directories?

I ask because I have configured my Brother document scanner with multiple profiles -> multiple directories to scan to...

 

Any suggestions?

Link to post
7 hours ago, Calimero said:

Is it possible to have multiple incoming directories -> multiple output directories?

Because i have configured my Brother Document Scanner with multiple Profiles -> multiple Directories to Scan...

 

@Calimero It is, yes! I use my scanner in similar fashion to separate gray versus color documents.

 

Two ideas to try:

  1. If there's a shared parent directory that includes all of your scanner profiles, you can just use that shared parent as /input or /output  (I run my scanner this way)
  2. If there is no shared parent in the "real world" (meaning the real paths that your profiles point to), then you can create one within the container by mapping each input (and each output) directory to a sub-directory within the container, like this:
    1. For each input folder, add an extra "Path" mapping like:
      1. /mnt/user/one_share/path  to  /input/one
      2. /mnt/user/another_share/path  to  /input/two
    2. Likewise for each output folder (if you want the outputs separated), you can add an extra "Path" mapping like:
      1. /mnt/user/out_share/path  to  /output/one
      2. /mnt/user/other_out_share/path  to  /output/two
    3. The container will match your /input folder hierarchy into the /output location as the container sees it, as long as you are using the default OCR_OUTPUT_MODE = MIRROR_TREE setting... so /input/one/doc.pdf --> /output/one/doc.pdf, or /input/two/foo.pdf --> /output/two/foo.pdf.  You just need to map your shares into the container so that all the inputs end up under /input, and the corresponding /output locations map back to where you would like them to land.
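
The MIRROR_TREE behaviour described in the last step can be pictured with a few lines of Python. This is a sketch of the path mapping as described in this post, not the container's actual implementation. The function name mirror_output_path is made up, and /input and /output are the container-side paths used above.

```python
from pathlib import PurePosixPath


def mirror_output_path(input_file: str,
                       input_root: str = "/input",
                       output_root: str = "/output") -> PurePosixPath:
    """Map a file under input_root to the same relative path under output_root.

    Raises ValueError if input_file is not located under input_root.
    """
    relative = PurePosixPath(input_file).relative_to(input_root)
    return PurePosixPath(output_root) / relative


for path in ("/input/one/doc.pdf", "/input/two/foo.pdf"):
    print(path, "-->", mirror_output_path(path))
```

Because the mapping only depends on the path relative to /input, any share you mount at /input/one needs its counterpart mounted at /output/one for the results to land where you expect.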

Good luck! Let me know how it goes :)

Link to post
  • 1 month later...

@Chad Kunsman Can you try the steps from this post, and see whether you are impacted by the same issue? 

 

If that's not it, i.e. if the repository is already correct, please let me know: I'm not on the 6.9 series yet as I haven't hit an opportunity where I could do the upgrade and have time to sort out what breaks... 🤞 Hoping my own container isn't one of the breaking changes.

Link to post
