cmccambridge, posted July 6, 2018 (edited July 9, 2018: Better container docs)

Application: ocrmypdf-auto
Overview: Automatic OCR of image PDFs from an input directory to an output directory using ocrmypdf and the latest tesseract.
Docker: https://quay.io/repository/cmccambridge/ocrmypdf-auto
Application GitHub: https://github.com/cmccambridge/ocrmypdf-auto

This container automates one stage in a "paperless" document processing pipeline: Take all the PDFs in some input folder, run OCR on them, and save the output to an output folder. It combines the excellent tools ocrmypdf and tesseract with file monitoring and some new configurability.

For example, you could configure a wireless document scanner to save all images to one share on your unRAID server, and use this container to monitor all new incoming files, OCR them, and write the finished (searchable!) PDFs to another share.

For details on how to configure the container and ocrmypdf to tweak OCR behavior, please see the README on GitHub! You can configure:

- What options (per folder) to pass to ocrmypdf
  - e.g. one folder for clean, normal page size grayscale scans from the document scanner
  - e.g. a separate folder for skewed, poor contrast receipts from a phone app
  - e.g. a separate folder for multi-language scans
- What to do with original files after OCR
  - Archive them to a 2nd output folder?
  - Delete them?
- Where to store temporary files
  - By default, within the container
  - Or: configure your own high-speed temporary path (cache disk, ramdisk, etc.)

Questions? Post any other questions or issues relating to this Docker container in this thread.
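The stage described above (take PDFs from an input folder, OCR them, write them to an output folder) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the container's actual code: the function names are hypothetical, it makes a single pass instead of monitoring for new files, and it assumes the `ocrmypdf` command-line tool is installed and is called as `ocrmypdf [options] input.pdf output.pdf`.

```python
import subprocess
from pathlib import Path


def mirrored_output_path(pdf: Path, input_root: Path, output_root: Path) -> Path:
    """Map an input PDF to the same relative location under the output root."""
    return output_root / pdf.relative_to(input_root)


def build_ocr_command(src: Path, dst: Path, extra_args: list[str]) -> list[str]:
    """Assemble an ocrmypdf command line: options first, then input and output."""
    return ["ocrmypdf", *extra_args, str(src), str(dst)]


def process_tree(input_root: Path, output_root: Path, extra_args: list[str]) -> None:
    """Run OCR once over every PDF under input_root (hypothetical helper)."""
    for pdf in sorted(input_root.rglob("*.pdf")):
        dst = mirrored_output_path(pdf, input_root, output_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        # check=False: one failed document should not stop the rest of the batch.
        subprocess.run(build_ocr_command(pdf, dst, extra_args), check=False)
```

Calling `process_tree(Path("/input"), Path("/output"), ["--deskew"])` would write each result to the matching subfolder of `/output`; per-folder options, as mentioned in the post, would amount to choosing a different `extra_args` list per directory.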
trurl, posted July 6, 2018

I see at your GitHub that the unRAID template is "TODO", so I am going to approve this new user first post, but will not move it to Docker Containers until that is "DONE".
cmccambridge, posted July 6, 2018

Great catch @trurl, thanks! The unRAID template is already up and running here, but I had forgotten to go back and tidy up my TODO list. There is now unRAID-specific documentation in the project's README file that describes the recommended container settings for anybody not installing via the defaults in the unRAID template directly.

https://github.com/cmccambridge/ocrmypdf-auto/blob/master/README.md#unraid-integration

(Note: At the moment, I've still got a few open questions to @Squid about that template in a DM, so I don't believe that it is live in CA just yet... feel free to wait on moving this thread until the template is live.)
Squid, posted July 6, 2018 (edited July 7, 2018)

In reply to cmccambridge's post above:

Working out of town. I'll be looking at everything in ~12 hours.

EDIT: It's in there now.

Sent via Tapatalk because I'm either at work or enjoying the summer
cmccambridge, posted July 7, 2018

On 7/6/2018 at 10:37 AM, Squid said: "It's in there now"

Excellent, thanks very much for your help!
Abigel, posted January 8, 2019 (edited January 8, 2019)

2019-01-08 16:27:42 - Processing complete in 3.820000 seconds with status 3: /input/test.pdf

There is no file. What is the problem?
trurl, posted January 8, 2019

Abigel said: "2019-01-08 16:27:42 - Processing complete in 3.820000 seconds with status 3: /input/test.pdf There is no File What is the problem ?"

Any other information you can give us? Your docker run command, for example?
Abigel, posted January 8, 2019

trurl said: "Any other information you can give us? Your docker run command, for example?"

I didn't use a command; I only installed the plugin.
trurl, posted January 8, 2019

Abigel said: "I use no command, only install the plugin"

Did you even bother to read the link I gave you? It explains exactly how to get the information I was asking for, the docker run command.
Abigel, posted January 9, 2019

Sorry, I hadn't seen it on the smartphone; it looked like a signature.

root@localhost:# /usr/local/emhttp/plugins/dynamix.docker.manager/scripts/docker run -d --name='ocrmypdf-auto' --net='bridge' -e TZ="Europe/Berlin" -e HOST_OS="Unraid" -e 'OCR_OUTPUT_MODE'='MIRROR_TREE' -e 'OCR_ACTION_ON_SUCCESS'='NOTHING' -e 'OCR_PROCESS_EXISTING_ON_START'='1' -e 'OCR_LANGUAGES'='' -e 'OCR_NOTIFY_URL'='' -e 'OCR_VERBOSITY'='' -e 'USERMAP_UID'='99' -e 'USERMAP_GID'='100' -v '/mnt/user/pre/':'/input':'rw' -v '/mnt/user/after/':'/output':'rw' -v '/mnt/user/appdata/ocrmypdf-auto':'/config':'rw' 'quay.io/cmccambridge/ocrmypdf-auto'
0d96f4947c0a21a2527ecf4c484b993859d7d9ca73c1c437989167d580d50da6
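The docker run command above passes its settings as `-e KEY=VALUE` pairs, and several of them are empty. A small sketch for listing the empty ones when reading such a command, using only Python's standard `shlex` module; the function names are hypothetical and the sample string is shortened from the command in the post:

```python
import shlex


def env_from_docker_run(command: str) -> dict[str, str]:
    """Collect the KEY=VALUE pair that follows each -e flag in a docker run command."""
    tokens = shlex.split(command)  # shell-style quoting is removed here
    env = {}
    for flag, value in zip(tokens, tokens[1:]):
        if flag == "-e" and "=" in value:
            key, _, val = value.partition("=")
            env[key] = val
    return env


def empty_settings(env: dict[str, str]) -> list[str]:
    """Names of the variables that were passed with an empty value."""
    return sorted(key for key, val in env.items() if val == "")


sample = (
    "docker run -d -e TZ=\"Europe/Berlin\" -e 'OCR_OUTPUT_MODE'='MIRROR_TREE' "
    "-e 'OCR_LANGUAGES'='' -e 'OCR_NOTIFY_URL'='' 'quay.io/cmccambridge/ocrmypdf-auto'"
)
print(empty_settings(env_from_docker_run(sample)))  # ['OCR_LANGUAGES', 'OCR_NOTIFY_URL']
```

An empty value is not necessarily a problem, since it may simply mean "use the default"; the listing only shows which settings were left unset.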
cmccambridge, posted January 9, 2019

Hi @Abigel, sorry to hear you're having issues... I'm away on vacation at the moment and so don't have access to a computer to debug, but one thing comes to mind to try.

The most recent change made to the code was regarding support for multiple languages. Perhaps we introduced a bug there that didn't surface until now. Could you try explicitly setting OCR_LANGUAGES="enu" (or your language of choice) even though it's supposed to work correctly without? Let me know if that changes anything...
Abigel, posted January 9, 2019 (edited January 9, 2019)

Hi, I chose en (OCR_LANGUAGES'='enu'), but this comes up:

Reading package lists...
---- Installing Tesseract Langauge: en via tesseract-ocr-en
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package tesseract-ocr-en

When I choose OCR_LANGUAGES="enu" it works, but in the docker run command it is written as OCR_LANGUAGES'=''.

Thanks for your help! Can the program recognize handwriting?
JonathanM, posted January 10, 2019

Abigel said: "Can the program recognize handwriting?"

Depends on the penmanship.
Abigel, posted January 10, 2019

I have tested it with block capitals, but it doesn't work. Nothing is detected as a word.
cmccambridge, posted January 24, 2019

I'm glad that it's working now for you, @Abigel - I believe I know what the problem was there, and will get an update posted so that other folks don't run into the same problem down the road. Thanks for reporting this!

Re: handwriting recognition... This isn't really the intended purpose for tesseract, the OCR program that ocrmypdf-auto is using internally. I have limited success with recognizing block letter handwriting, such as the attached example... you can see that it mostly recognized the block letters (mistook "IS" for "1S"), did similarly OK on mixed upper and lowercase printed letters (mistook "Hello" for "Yello" and got some capitalization wrong), and did poorly on cursive lettering.

If you want to research this further, here's a link I had found regarding academic research into customizing tesseract for handwriting recognition... it sounds like the accuracy is not very good: https://stackoverflow.com/questions/39556443/using-tesseract-for-handwriting-recognition

Note: If it wasn't clear from the documentation or your experience with ocrmypdf-auto, there's one thing I should clarify: The program intentionally does not change the input image of the PDF itself, other than some minor quality enhancements like deskewing, etc. Instead, the program only adds an extra invisible "text layer" to the output PDF that lets you search for and highlight recognized text. For example, if you highlight all the handwriting in the output sample here, you can copy and paste the following "recognized" text:

HELLO, THIS 1S AN OCR TEST.
Yello, this is an OcR test.
Alle, thin ko am OCR et.

Attachments: OCR Test Input.pdf, OCR Test Output.pdf
Abigel, posted January 24, 2019

Okay. My goal is to have my handwritten notes recognized by the search in the PDF document. I am a student and take a lot of notes. I took photos of my notes and had hoped that I could search for words in the document when I only want to find a certain word.

Unfortunately, it doesn't work very well for me. Maybe it's my handwriting. Is it possible to train the program on my handwriting? Could the problem be that I write in German? See, for example, the documents in the attachment.

Is there any way to reach my goal? If necessary, also with other software?

Attachments: input.pdf, output.pdf
cmccambridge, posted January 25, 2019

Hi @Abigel - a couple of thoughts...

First and most important: the tesseract OCR engine that is used by ocrmypdf-auto really isn't optimized for handwriting. It's designed for typeset / printed text, which has properties that make it "easier to read" like consistent letter shapes, letter spacing, word spacing, line breaks, etc. You can read all the gory details on the tesseract homepage, or explore some of the academic research efforts to extend the engine to handwriting, but the short version, to my understanding, is: handwriting recognition is a lot harder than recognizing typeset text. Sorry that I don't have any solution to this problem...

That said, here are my best tips:

- Your example image appears to be a cellphone photo of a page of text. This should work, but you will probably get better results the closer your image looks to a black-and-white piece of paper. Tesseract has tips on improving recognition by improving image quality. For example, in your files the handwriting is blue on a tan background... this is clear to a human, but not as obvious to a computer. It will be easier for the computer to understand if all text is black on white.
  - If you have access to a scanner, I would try that instead of a phone camera, since the scanner will remove some of the artificial "room coloring" that a phone camera sees.
  - Or, convert your phone image to black and white and increase the contrast before trying to run OCR on it, so that the text versus background are very clear for the computer.
- Since the OCR engine does try to recognize full words, not just individual letters, it's important to tell it what language(s) to expect. This is what the OCR_LANGUAGES variable is for. In your case, since you're writing in German, I would try setting OCR_LANGUAGES="deu" to install the German language data, rather than the default of English.

And as a side tip... the best program I've ever seen for recognizing actual handwriting is a somewhat unexpected one: Microsoft OneNote. This may not be helpful to you at all unless you have a Windows computer, but it could be worth a try :). I am not sure whether it will do as good a job recognizing handwriting in a photo as it does recognizing direct pen input on a tablet, though...

I did a quick experiment with some of the tips above, and got slightly better results... enough that it might be worth it for you to keep experimenting? Up to you.

1. Converted your image to black and white.
2. Increased the contrast until the handwriting was very black and the background was very white.
3. Ran ocrmypdf-auto with OCR_LANGUAGES=deu

The result was partial recognition: Das Haus ı5+ gem.

Best of luck!

Attachments: input_black_white.pdf, output_black_white.pdf
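The "black and white, high contrast" tip can be illustrated with a few lines of plain Python. This is only a sketch of the idea on a list of RGB pixels, not something ocrmypdf-auto does and not a replacement for an image editor or imaging library; the function names and the sample colors (a tan page, blue ink) are assumptions chosen to resemble the photo described above.

```python
def luminance(rgb: tuple[int, int, int]) -> float:
    """Perceived brightness of an RGB pixel (ITU-R BT.601 weights), 0 to 255."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def binarize(pixels: list[tuple[int, int, int]]) -> list[int]:
    """Map each pixel to 0 (black) or 255 (white).

    The threshold is the midpoint between the darkest and brightest luminance
    found in the image, a deliberately simple stand-in for real thresholding.
    """
    grey = [luminance(p) for p in pixels]
    threshold = (min(grey) + max(grey)) / 2
    return [0 if value < threshold else 255 for value in grey]


TAN_PAPER = (210, 180, 140)  # hypothetical background color
BLUE_INK = (40, 60, 160)     # hypothetical ink color

print(binarize([TAN_PAPER, BLUE_INK, TAN_PAPER]))  # [255, 0, 255]
```

After this step the ink is pure black and the page pure white, which is the kind of input the tips above aim for. Real photos have shadows and uneven lighting, so a single global threshold like this one will often be too crude.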
Abigel, posted January 25, 2019

Many thanks! It hasn't completely solved my problem, because not everything is recognized, and OneNote is not a solution for me. But I have come much closer to a solution, thank you very much.
sunpower, posted January 27, 2020 (edited January 28, 2020: Found (built-in) solution on my own by "studying" container settings)

Hello, I'm a new, happy unRAID user and installed this plugin. The container runs and I put a PDF file in the input folder. Then I restarted the container and the file now also appears in the output folder. That's fine.

But I don't understand how to make ocrmypdf-auto run automatically, e.g. at a fixed interval (once an hour or once a day...) or every time a new file is detected in the input folder. Or do I have to restart the container manually every time there are new PDF files?

Sorry for my stupid question (because I'm a noob) and bad (school) English (it's not my mother language).

Update 1 day later: I'm really stupid, because the (very easy) solution is already built into the container. I changed "Process Existing on Startup" from "0" (= default) to "1".
cmccambridge, posted January 29, 2020

@sunpower - It sounds like you got it sorted out, but let me know if you are still having any issues!
sunpower, posted January 30, 2020 (edited January 30, 2020: Could -> could)

@cmccambridge Yes, I could solve it. Thanks for your ocrpdf tool/app. I love it.

Another question I have: where do I have to create (or find) the "archive" if I use "Archive_Input_Files" for "Action On Success"?
schilling3003, posted February 18, 2020 (edited February 18, 2020)

On 1/30/2020 at 1:20 AM, sunpower said: "Another question I have is where I have to create (or find) the "archive" if i use "Archive_Input_Files" at "Action On Success"?"

You have to define a custom path. You are supposed to define the container path as /archive and then map the host path to a folder on a share.

Unfortunately, when I tried this I assigned it to a folder like \mnt\usr\Receipts\Archived\, but it started creating folders within folders within folders, all with whatever folder name you give it. I kept clicking through the folders about 50 times and never did find the end, so I have no clue how many folders it created.
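The mapping described above (container path /archive, host path a folder on a share) takes the same host:container:mode form as the -v arguments in the docker run command posted earlier in the thread. A minimal sketch of building such an argument, with hypothetical function names and host paths. The thread does not establish why the nested folders appear; the overlap check below only covers one thing that might be worth ruling out, and treating an archive folder inside the watched input folder as a possible cause is purely an assumption.

```python
from pathlib import PurePosixPath


def volume_flag(host_path: str, container_path: str, mode: str = "rw") -> list[str]:
    """Build one docker -v argument pair in host:container:mode form."""
    return ["-v", f"{host_path}:{container_path}:{mode}"]


def is_inside(child: str, parent: str) -> bool:
    """True if child is the same directory as parent or lies somewhere below it."""
    child_path, parent_path = PurePosixPath(child), PurePosixPath(parent)
    return child_path == parent_path or parent_path in child_path.parents


# Hypothetical host paths, chosen only for illustration.
input_host = "/mnt/user/scans/incoming"
archive_host = "/mnt/user/scans/archive"

print(volume_flag(archive_host, "/archive"))  # ['-v', '/mnt/user/scans/archive:/archive:rw']
print(is_inside(archive_host, input_host))    # False
```

With these sample paths the archive folder sits next to the input folder rather than inside it, so the check returns False.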
Toskache, posted March 10, 2020

Thanks for the great Docker container. I installed it via "Community Applications". It works very well. I also added an /archive volume and everything works.

But in the list of all my unRAID Docker containers, the version column shows "not available". Any ideas why?
nik82, posted March 11, 2020 (edited March 11, 2020)

Toskache said: "But in the list of all my unraid docker-containers there is a "not available" in the version column. Any ideas why?"

Same problem. It has been working since it was first set up, but after an unRAID update and a Community Applications update today we noticed that the version is no longer available in the Docker section. Processed files can also no longer be accessed: trying to open them comes back with "access denied".
sunpower, posted March 11, 2020

nik82 said: "... but after unraid update and community app update ..."

Same on my system: ocrmypdf_auto -> Link is not available