
Álvaro Ramírez

25 December 2018

Using OCR to create searchable pdfs from images

Used my phone to take a handful of photos of an article from a magazine. Wanted to convert the images to a searchable pdf on macOS.

This was straightforward, having already installed tesseract.

for i in IMG_3*.jpg; do echo "$i"; tesseract "$i" "$i" pdf; done

Should now have a handful of OCR'd pdfs:

ls *.jpg.pdf
IMG_3104.jpg.pdf
IMG_3105.jpg.pdf
IMG_3106.jpg.pdf
IMG_3107.jpg.pdf
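
The listing above keeps the .jpg in each output name, since tesseract just appends .pdf to whatever output base it is given. If plain IMG_3104.pdf names are preferred, a minimal sketch could strip the extension with shell parameter expansion first. output_base and ocr_images are hypothetical helper names, not part of tesseract.

```shell
# Hypothetical helpers; only the `tesseract INPUT OUTPUTBASE pdf` call comes from the post.
output_base() {
  # Strip the directory and the last extension: photos/IMG_3104.jpg -> IMG_3104
  name=${1##*/}
  printf '%s\n' "${name%.*}"
}

ocr_images() {
  # usage: ocr_images IMG_3*.jpg
  for img in "$@"; do
    # tesseract appends .pdf to the output base, so IMG_3104 becomes IMG_3104.pdf
    # (written to the current directory, because the directory part was stripped)
    tesseract "$img" "$(output_base "$img")" pdf || return 1
  done
}
```

Calling ocr_images IMG_3*.jpg should then leave IMG_3104.pdf and friends in the current directory, which the IMG_*pdf glob used for joining still matches.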

Finally, join all pdfs into one. Turns out macOS has a handy python script already installed. We can use it as:

/usr/bin/python "/System/Library/Automator/Combine PDF Pages.action/Contents/Resources/join.py" -o joined.pdf IMG_*pdf
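
The command above depends on both /usr/bin/python and the Automator script being present, and newer macOS releases may no longer ship one or the other. A small sketch that checks for them before joining, so a missing file gives a clear message rather than a cryptic failure. join_pdfs, JOIN_SCRIPT and PYTHON_BIN are hypothetical names introduced here; the default paths are the ones from the command above.

```shell
# Defaults are the paths from the post; both variables are hypothetical overrides.
JOIN_SCRIPT="${JOIN_SCRIPT:-/System/Library/Automator/Combine PDF Pages.action/Contents/Resources/join.py}"
PYTHON_BIN="${PYTHON_BIN:-/usr/bin/python}"

join_pdfs() {
  # usage: join_pdfs output.pdf input1.pdf [input2.pdf ...]
  if [ "$#" -lt 2 ]; then
    echo "usage: join_pdfs output.pdf input.pdf..." >&2
    return 2
  fi
  # Bail out early if the interpreter or the script is missing
  if [ ! -x "$PYTHON_BIN" ] || [ ! -f "$JOIN_SCRIPT" ]; then
    echo "join_pdfs: $PYTHON_BIN or $JOIN_SCRIPT not found" >&2
    return 1
  fi
  out=$1
  shift
  # Same invocation as above: -o names the joined pdf, the rest are inputs
  "$PYTHON_BIN" "$JOIN_SCRIPT" -o "$out" "$@"
}
```

Usage would look like join_pdfs joined.pdf IMG_*pdf. The shell expands the glob in sorted order, so pages come out in file-name order.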

ps. pdfgrep is great for searching pdfs.
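
For instance, a tiny sketch for searching the joined pdf, ignoring case and printing the page number of each match. search_pdf is a hypothetical wrapper name; -n (page number) and -i (ignore case) are pdfgrep flags, and pdfgrep itself is assumed to be installed.

```shell
search_pdf() {
  # usage: search_pdf PATTERN FILE.pdf
  if [ "$#" -ne 2 ]; then
    echo "usage: search_pdf pattern file.pdf" >&2
    return 2
  fi
  # -n prefixes each match with its page number, -i ignores case
  pdfgrep -n -i "$1" "$2"
}
```

Something like search_pdf tesseract joined.pdf would then list each matching line with its page.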