[Support] cmccambridge - ocrmypdf-auto

Application: ocrmypdf-auto

Overview: Automatic OCR of image PDFs from an input directory to an output directory using ocrmypdf and the latest tesseract.

Docker: https://quay.io/repository/cmccambridge/ocrmypdf-auto

Application GitHub: https://github.com/cmccambridge/ocrmypdf-auto

 

This container automates one stage in a "paperless" document processing pipeline: Take all the PDFs in some input folder, run OCR on them, and save the output to an output folder.  It combines the excellent tools ocrmypdf and tesseract with file-monitoring and some new configurability.

 

For example, you could configure a wireless document scanner to save all images to one share on your unRAID server, and use this container to monitor all new incoming files, OCR them, and write the finished (searchable!) PDFs to another share:

flow-unraid.png
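
For anyone curious what that input-to-output stage amounts to, here is a minimal sketch in Python. It is not the container's actual code: the function names `plan_jobs`, `ocr_command`, and `run_once` are made up for illustration, and the only thing assumed about ocrmypdf is its basic `ocrmypdf [options] input.pdf output.pdf` command line. The real container also watches the folder for new files, which this sketch leaves out.

```python
import subprocess
from pathlib import Path


def plan_jobs(input_dir: Path, output_dir: Path) -> list[tuple[Path, Path]]:
    """Pair every PDF under input_dir with a mirrored path under output_dir."""
    jobs = []
    for src in sorted(input_dir.rglob("*.pdf")):
        # Keep the sub-folder layout of the input share in the output share.
        dst = output_dir / src.relative_to(input_dir)
        jobs.append((src, dst))
    return jobs


def ocr_command(src: Path, dst: Path, extra_args: tuple[str, ...] = ()) -> list[str]:
    """Build the command line: ocrmypdf [options] input output."""
    return ["ocrmypdf", *extra_args, str(src), str(dst)]


def run_once(input_dir: Path, output_dir: Path, extra_args: tuple[str, ...] = ()) -> dict[Path, int]:
    """Run OCR on every planned job and return the exit status per input file."""
    results = {}
    for src, dst in plan_jobs(input_dir, output_dir):
        dst.parent.mkdir(parents=True, exist_ok=True)
        # Requires ocrmypdf to be installed and on PATH.
        completed = subprocess.run(ocr_command(src, dst, extra_args))
        results[src] = completed.returncode
    return results
```

Calling `run_once(Path("/input"), Path("/output"), ("--deskew",))` would process whatever PDFs are currently in `/input`; the per-folder options described below are what decide the extra arguments in the real container.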

 

For details on how to configure the container and ocrmypdf to tweak OCR behavior, please see the README on GitHub! You can configure:

  • What options (per-folder) to pass to ocrmypdf
    • e.g. one folder for clean, normal page size grayscale scans from the document scanner
    • e.g. a separate folder for skewed, poor contrast receipts from a phone app
    • e.g. a separate folder for multi-language scans
  • What to do with original files after OCR
    • Archive them to a 2nd output folder?
    • Delete them?
  • Where to store temporary files
    • By default, within the container
    • Or: configure your own high-speed temporary path (cache disk, ramdisk, etc.)

 

Questions?

Post any other questions or issues relating to this Docker container in this thread.

 

 

I see at your GitHub that the unRAID template is "TODO", so I am going to approve this new user first post, but will not move it to Docker Containers until that is "DONE".

 

 


Great catch @trurl, thanks! The unRAID template is already up and running here, but I had forgotten to go back and tidy up my TODO list.

 

There is now unRAID-specific documentation in the project's README file that describes the recommended container settings for anybody not installing via the defaults in the unRAID template directly. https://github.com/cmccambridge/ocrmypdf-auto/blob/master/README.md#unraid-integration

 

 

(Note: At the moment, I've still got a few open questions to @Squid about that template in a DM, so I don't believe that it is live in CA just yet... feel free to wait on moving this thread until the template is live.)

23 hours ago, cmccambridge said:
Great catch @trurl, thanks! The unRAID template is already up and running here, but I had forgotten to go back and tidy up my TODO list.
 
There is now unRAID-specific documentation in the project's README file that describes the recommended container settings for anybody not installing via the defaults in the unRAID template directly. https://github.com/cmccambridge/ocrmypdf-auto/blob/master/README.md#unraid-integration
 
 
(Note: At the moment, I've still got a few open questions to @Squid about that template in a DM, so I don't believe that it is live in CA just yet... feel free to wait on moving this thread until the template is live.)

Working out of town. I'll be looking at everything in ~12 hours. EDIT: It's in there now.


2019-01-08 16:27:42 - Processing complete in 3.820000 seconds with status 3: /input/test.pdf

There is no output file.

What is the problem?
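
A note on reading that log line: if the status printed there is ocrmypdf's own exit code (an assumption; the wrapper could remap it), then 3 is what ocrmypdf's documented exit codes call a missing dependency, which would fit a missing tesseract language pack. The sketch below only decodes the line; the helper name `decode_status` and the regular expression are illustrative, and the table is copied from the ocrmypdf documentation as best I know it, so check it against the version you run.

```python
import re

# Exit codes as documented by ocrmypdf (verify against your installed version).
OCRMYPDF_EXIT_CODES = {
    0: "ok",
    1: "bad_args",
    2: "input_file",
    3: "missing_dependency",
    4: "invalid_output_pdf",
    5: "file_access_error",
    6: "already_done_ocr",
    7: "child_process_error",
    8: "encrypted_pdf",
    9: "invalid_config",
    10: "pdfa_conversion_failed",
    15: "other_error",
    130: "ctrl_c",
}

# Matches the "Processing complete ... with status N: path" line from the container log.
LOG_PATTERN = re.compile(
    r"Processing complete in (?P<seconds>[\d.]+) seconds "
    r"with status (?P<status>\d+): (?P<path>.+)$"
)


def decode_status(line: str):
    """Return (status, name) for a matching log line, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    status = int(match.group("status"))
    return status, OCRMYPDF_EXIT_CODES.get(status, "unknown")
```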

10 minutes ago, Abigel said:

2019-01-08 16:27:42 - Processing complete in 3.820000 seconds with status 3: /input/test.pdf

There is no output file.

What is the problem?

Any other information you can give us? Your docker run command, for example?

 

1 hour ago, trurl said:

Any other information you can give us? Your docker run command, for example?

 

I didn't use a command; I only installed the plugin.

3 minutes ago, Abigel said:

I didn't use a command; I only installed the plugin.

Did you even bother to read the link I gave you? It explains exactly how to get the information I was asking for, the docker run command.


Sorry, I hadn't seen it on the smartphone, looked like a signature.

 

root@localhost:# /usr/local/emhttp/plugins/dynamix.docker.manager/scripts/docker run -d --name='ocrmypdf-auto' --net='bridge' -e TZ="Europe/Berlin" -e HOST_OS="Unraid" -e 'OCR_OUTPUT_MODE'='MIRROR_TREE' -e 'OCR_ACTION_ON_SUCCESS'='NOTHING' -e 'OCR_PROCESS_EXISTING_ON_START'='1' -e 'OCR_LANGUAGES'='' -e 'OCR_NOTIFY_URL'='' -e 'OCR_VERBOSITY'='' -e 'USERMAP_UID'='99' -e 'USERMAP_GID'='100' -v '/mnt/user/pre/':'/input':'rw' -v '/mnt/user/after/':'/output':'rw' -v '/mnt/user/appdata/ocrmypdf-auto':'/config':'rw' 'quay.io/cmccambridge/ocrmypdf-auto'

0d96f4947c0a21a2527ecf4c484b993859d7d9ca73c1c437989167d580d50da6
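
A long `docker run` line like this is easier to check once the `-e` settings are pulled out, since the template writes them with shell quoting such as `'OCR_LANGUAGES'=''`. Here is a small sketch using Python's standard `shlex` module; the helper name `container_env` is made up for illustration, and it only handles the separate `-e NAME=value` form used above.

```python
import shlex


def container_env(docker_run: str) -> dict[str, str]:
    """Collect the NAME=value pairs passed with -e in a docker run command line."""
    # shlex applies shell quoting rules, so 'NAME'='' becomes the token "NAME=".
    tokens = shlex.split(docker_run)
    env = {}
    for flag, value in zip(tokens, tokens[1:]):
        if flag == "-e" and "=" in value:
            name, _, val = value.partition("=")
            env[name] = val
    return env


def empty_settings(env: dict[str, str]) -> list[str]:
    """Names of variables that are present but set to an empty string."""
    return [name for name, val in env.items() if val == ""]
```

Run against the command above, `empty_settings` would list `OCR_LANGUAGES`, `OCR_NOTIFY_URL`, and `OCR_VERBOSITY`, i.e. the settings that were left blank in the template.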

 


Hi @Abigel, sorry to hear you're having issues... I'm away on vacation at the moment and so don't have access to a computer to debug, but one thing comes to mind to try. The most recent change made to the code was regarding support for multiple languages. Perhaps we introduced a bug there that didn't surface until now.

 

Could you try explicitly setting OCR_LANGUAGES="enu" (or your language of choice) even though it's supposed to work correctly without?

 

Let me know if that changes anything...


Hi,

I set OCR_LANGUAGES to en, but this is what I get:

 

Reading package lists...
---- Installing Tesseract Langauge: en via tesseract-ocr-en
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package tesseract-ocr-en

When I choose

OCR_LANGUAGES="enu"

it works,

but in the docker run command it is written as 'OCR_LANGUAGES'=''.

 

Thanks for your help!

 

Can the program recognize handwriting?

3 hours ago, Abigel said:

Can the program recognize handwriting?

Depends on the penmanship.


I have tested it with block capitals, but it doesn't work.

Nothing is detected as a word.
