Overview: Automatic OCR of image PDFs from an input directory to an output directory using ocrmypdf and the latest tesseract.
Application GitHub: https://github.com/cmccambridge/ocrmypdf-auto
This container automates one stage in a "paperless" document processing pipeline: take all the PDFs in an input folder, run OCR on them, and save the results to an output folder. It combines the excellent tools ocrmypdf and tesseract with file monitoring and some additional configurability.
For example, you could configure a wireless document scanner to save all images to one share on your unRAID server, and use this container to monitor all new incoming files, OCR them, and write the finished (searchable!) PDFs to another share.
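To make the input-to-output flow concrete, here is a minimal Python sketch of the idea. It is not the container's actual code: the function names `plan_jobs` and `run_ocr` are made up for illustration, and mirroring subfolders from the input share into the output share is an assumption. The only real interface used is the basic ocrmypdf command-line form, `ocrmypdf input.pdf output.pdf`.

```python
from pathlib import Path
import subprocess


def plan_jobs(input_dir: Path, output_dir: Path) -> list[tuple[Path, Path]]:
    """Pair every PDF under input_dir with a mirrored path under output_dir.

    Hypothetical helper: mirroring the folder layout is an assumption,
    not a documented behavior of the container.
    """
    jobs = []
    for src in sorted(input_dir.rglob("*.pdf")):
        jobs.append((src, output_dir / src.relative_to(input_dir)))
    return jobs


def run_ocr(src: Path, dst: Path) -> None:
    """OCR a single file with the ocrmypdf command-line tool."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Basic CLI form: ocrmypdf <input.pdf> <output.pdf>
    subprocess.run(["ocrmypdf", str(src), str(dst)], check=True)
```

Calling `run_ocr(src, dst)` for each pair returned by `plan_jobs` would process everything currently in the input folder once. The container additionally watches the folder for new files, which this sketch leaves out.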
For details on how to configure the container and ocrmypdf to tweak OCR behavior, please see the README on GitHub! You can configure:
- What options (per-folder) to pass to ocrmypdf
  - e.g. one folder for clean, normal page size grayscale scans from the document scanner
  - e.g. a separate folder for skewed, poor contrast receipts from a phone app
  - e.g. a separate folder for multi-language scans
- What to do with original files after OCR
  - Archive them to a 2nd output folder?
- Where to store temporary files
  - By default, within the container
  - Or: configure your own high-speed temporary path (cache disk, ramdisk, etc.)
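As a rough illustration of the per-folder idea, the sketch below picks different ocrmypdf options depending on which top-level input folder a file came from. The folder names (`scanner`, `receipts`, `multilang`), the `FOLDER_OPTIONS` table, and `build_command` are all hypothetical; the container's real configuration mechanism is described in the README. The flags themselves (`--deskew`, `--clean`, `--language`) are standard ocrmypdf command-line options, and `--clean` needs unpaper to be available.

```python
from pathlib import Path

# Hypothetical folder names mapped to standard ocrmypdf CLI flags.
FOLDER_OPTIONS = {
    "scanner": [],                             # clean scans: defaults are fine
    "receipts": ["--deskew", "--clean"],       # skewed, poor contrast phone photos
    "multilang": ["--language", "eng+deu"],    # documents mixing two languages
}


def build_command(src: Path, dst: Path, input_dir: Path) -> list[str]:
    """Build the ocrmypdf command line for src based on its top-level folder."""
    rel = src.relative_to(input_dir)
    # Files sitting directly in input_dir have no folder and get no extra flags.
    top = rel.parts[0] if len(rel.parts) > 1 else ""
    return ["ocrmypdf", *FOLDER_OPTIONS.get(top, []), str(src), str(dst)]
```

With this table, a file saved under `receipts/` would be deskewed and cleaned before OCR, while a file from `scanner/` would be processed with ocrmypdf's defaults.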
Post any other questions or issues relating to this Docker container in this thread.