Álvaro Ramírez
Using OCR to create searchable pdfs from images
Used my phone to take a handful of photos of an article from a magazine. Wanted to convert the images to a searchable pdf on macOS.
This was straightforward, since I already had tesseract installed.
for i in IMG_3*.jpg; do echo "$i"; tesseract "$i" "$i" pdf; done
Should now have a handful of OCR'd pdfs:
ls *.jpg.pdf
IMG_3104.jpg.pdf IMG_3105.jpg.pdf IMG_3106.jpg.pdf IMG_3107.jpg.pdf
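If the doubled .jpg.pdf extension is unwanted, a variation along these lines strips the .jpg suffix before handing the name to tesseract. This is only a sketch: pdf_base is a made-up helper name, and it assumes the photos all end in lowercase .jpg.

```shell
# Hypothetical helper (not part of tesseract): drop the directory and the
# .jpg suffix so that tesseract writes IMG_3104.pdf instead of IMG_3104.jpg.pdf.
pdf_base() {
  basename "$1" .jpg
}

for i in IMG_3*.jpg; do
  [ -e "$i" ] || continue                 # skip if the glob matched nothing
  echo "$i"
  tesseract "$i" "$(pdf_base "$i")" pdf   # tesseract appends .pdf itself
done
```

The resulting files still match the IMG_*pdf glob used when joining.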
Finally, join all pdfs into one. Turns out macOS has a handy python script already installed. We can use it as:
/usr/bin/python "/System/Library/Automator/Combine PDF Pages.action/Contents/Resources/join.py" -o joined.pdf IMG_*pdf
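Newer macOS releases no longer ship /usr/bin/python, so the command above may not work everywhere. A cautious version could check that both pieces exist before running. This is a sketch only; the join_script variable name is mine, and the path is the one used above.

```shell
# Path to the Automator helper script, as used in the command above.
join_script="/System/Library/Automator/Combine PDF Pages.action/Contents/Resources/join.py"

# Only attempt the join when the system Python and the script are both present.
if [ -x /usr/bin/python ] && [ -f "$join_script" ]; then
  /usr/bin/python "$join_script" -o joined.pdf IMG_*pdf
else
  echo "join.py or /usr/bin/python not found on this system" >&2
fi
```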
ps. pdfgrep is great for searching pdfs.
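For example, something like the following searches the joined pdf case-insensitively and prints page numbers. It is a sketch: search_pdf is a hypothetical wrapper name, and it assumes pdfgrep is installed and that joined.pdf is the file produced above.

```shell
# Hypothetical wrapper around pdfgrep:
#   -n prints the page number of each match, -i ignores case.
# The second argument is optional and defaults to joined.pdf.
search_pdf() {
  pdfgrep -n -i "$1" "${2:-joined.pdf}"
}
```

Usage would be along the lines of: search_pdf "some phrase"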