Automatic OCR with Hazel: The Easy Way

I have previously written about how to run OCR (Optical Character Recognition) on a PDF using Hazel and… a complicated pile of Python scripts and other software. Since I wrote that post, several of those pieces of software have been updated, and the core component has been, apparently, entirely abandoned.
Recently, while I was waiting for yet another keyboard replacement on my MacBook, I took another look at the OCR thing and found that there’s a much easier way available: OCRmyPDF.
It’s easy to install, assuming you’ve already got Homebrew: brew install ocrmypdf
From there, it’s just a single action in Hazel: “Run embedded shell script” with ocrmypdf $1
You can also use some of its many settings to get something a bit nicer than plain OCR; personally, I’m using --rotate-pages --deskew --mask-barcodes: the first two to help with variations in the input, because I sometimes use a bed scanner, and the last to help Tesseract, which can have issues with barcodes.
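For reference, here is a minimal sketch of how the embedded script might look with those options spelled out. It assumes Hazel hands the matched file to the script as $1. The path is passed twice because ocrmypdf expects both an input and an output file; giving the same path for both replaces the original with the OCRed version. The PATH line is an assumption about where Homebrew installs things and about Hazel running scripts with a minimal environment, and OCRMYPDF_BIN is a hypothetical override of my own, not an ocrmypdf or Hazel setting. I've left --mask-barcodes out of the sketch because it isn't available in every OCRmyPDF release; add it back if ocrmypdf --help lists it.

```shell
# Sketch of a Hazel "Run embedded shell script" body (not a definitive setup).
# Assumption: Hazel passes the matched file's path as $1.

# Assumption: Homebrew's bin directory may not be on the PATH Hazel uses.
PATH="/opt/homebrew/bin:/usr/local/bin:$PATH"
export PATH

ocr_in_place() {
  pdf="$1"
  # OCRMYPDF_BIN is a hypothetical override (useful for dry runs);
  # by default the ocrmypdf found on PATH is used.
  # Quoting "$pdf" keeps file names with spaces intact.
  "${OCRMYPDF_BIN:-ocrmypdf}" --rotate-pages --deskew "$pdf" "$pdf"
}

# Only run when a file path was actually passed in.
if [ -n "${1:-}" ]; then
  ocr_in_place "$1"
fi
```

If the script fails, Hazel treats the rule as not matched, so running the same command in Terminal against a test PDF is a quick way to see the actual error message.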
I’ve also paired it with a couple of additional actions, just to keep everything organized.

I also took the time to stop using Dropbox as the go-between for my scanner and the Mac running Hazel; I’d forgotten that the scanner has a USB port. Plug in a cheap flash drive, and the scanner makes it available as a (very slow) file server. Mount the drive, add it as a Login Item so it’ll auto-mount on boot, and you can set Hazel automations to run right there. I’m not OCRing them there, though; like I said, it’s a very slow server, so Hazel tags them ‘for OCR’ and moves them to my desktop.[1]
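Hazel does the moving for me, but the same idea (get files off the slow share first, then OCR them locally) can be sketched in plain shell. This is only an illustration: move_scans is a made-up helper, both paths in the example are hypothetical placeholders, and the tagging step is left to Hazel.

```shell
# Sketch: pull new scans off a slow share into a local intake folder,
# so the heavy OCR work happens on the local disk.
move_scans() {
  src="$1"    # mounted scanner volume (hypothetical path in the example below)
  dest="$2"   # local intake folder
  for f in "$src"/*.pdf; do
    [ -e "$f" ] || continue   # the glob matched nothing
    mv -n "$f" "$dest"/       # -n: never overwrite an existing file
  done
}

# Example call (hypothetical paths):
# move_scans "/Volumes/SCANNER" "$HOME/Desktop"
```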


  1. With iCloud Drive handling my desktop, I’ve found it to be a pretty great ‘intake’ folder for all of my Hazel automations. It’s quite nice to be able to save a PDF from my phone, add a tag, and watch it disappear again as it’s auto-sorted, or throw a PDF on my desktop with a tag and see it pop in and out as the OCR runs. 

5 replies on “Automatic OCR with Hazel: The Easy Way”

Hi there

I’ve tested your solution but somehow your script does not work. It returns a “shell script exited with a non successful error”. Do you know why? I have macOS Catalina.

Thank you very much for your help!

Best regards
Steve

Hi, thanks for this guide.

I’ve set this rule up but am running into errors that I have no idea how to troubleshoot.

Hazel is just saying ‘error processing shell script on XXXX’. When I run the script in Terminal it’s giving different errors.

error: the following arguments are required: output_pdf

Do you have any suggestions?
Thanks,

I ran ocrmypdf from the command line, and the issue is that you need to specify both an input filename and an output filename:

ocrmypdf input.pdf output.pdf

The error states that it is missing a required argument. When I read through the instructions I was surprised that only a single argument was given.

I have and use Adobe Acrobat Pro for OCR. One file of about 270 pages took 21 minutes on this 2020 MacBook Pro (16 GB RAM, 2.6 GHz 6-Core Intel Core i7). It was slow because Acrobat “Pro” is only single-threaded.

On ocrmypdf the same source file was completed in just over 9 minutes. It gave warnings about some pages but it kept going. Half of the time was spent running the OCR on the individual page images. Processor meters and the fans showed that the resources were being fully utilized (as it would on HandBrake). The other half was rebuilding the output PDF. That process used a single thread and processor core. Still, there was a significant time savings on this file so I will keep it in my toolbox.

Isn’t it amazing that Adobe, which gets so much from so many users for subscriptions to its programs, hasn’t managed to make a multi-threaded program in the past decade, when multi-core machines have become ubiquitous, yet a free program can outperform it?

The problem seems to be that his code misses two things: the output file and quotation marks around the $1. I got the script to work when it looked like this (in the most basic form):

ocrmypdf "$1" "$1"

Thank you so much Grey. I’ve been googling for an OCR option and finally stumbled onto this post.

As James mentioned the script requires input.pdf output.pdf so I just added another `$1` to the script to work in Hazel:

`ocrmypdf $1 $1`
