PDF is often used to publish valuable data. Sometimes it is a piece of cake to extract the human-readable data and convert it to CSV or Excel, or to load it into a database. Sometimes, however, the data is published in a hard-to-process form, e.g. as embedded images. The open-source ecosystem provides fantastic tools for getting data out of PDF files.

Machine-generated PDF files are a developer's dream. Most word processors have a “Save as PDF” option, billing platforms usually issue PDF invoices, many frameworks generate reports in PDF, and so on. These files are usually easy to process. pdfplumber is a superb Python package for getting text out of PDF documents. We love using pdfplumber in a Jupyter Notebook because of its visual debugging capabilities, which come in very handy when we have to deal with tables.
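
As a minimal sketch of this workflow, assuming pdfplumber is installed, the snippet below reads the text and the tables of the first page and serializes a table as CSV. The helper names `extract_first_page` and `table_to_csv` and the file name `invoice.pdf` are our own illustration, not part of pdfplumber.

```python
import csv
import io


def table_to_csv(rows):
    """Serialize a table (a list of rows; cells may be None) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Empty cells can come back as None; write them as empty strings.
        writer.writerow("" if cell is None else cell for cell in row)
    return buffer.getvalue()


def extract_first_page(pdf_path):
    """Return (text, tables) for the first page of a machine-generated PDF."""
    # Imported here so table_to_csv stays usable without pdfplumber installed.
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]
        text = page.extract_text() or ""  # nothing to extract on image-only pages
        tables = page.extract_tables()    # list of tables, each a list of rows
    return text, tables


# Example usage ("invoice.pdf" is a hypothetical file):
# text, tables = extract_first_page("invoice.pdf")
# for table in tables:
#     print(table_to_csv(table))
```

In a Jupyter Notebook, `page.to_image().debug_tablefinder()` draws the detected table lines and cells over the page, which helps when the extracted tables do not look right.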
Sadly, PDF files are often published in a terrible state. For example, government institutions publish badly formatted PDF files that contain images of scanned documents, have missing or corrupted metadata, etc. pikepdf is a wonderful package for dealing with such PDFs. And if you have to edit, merge, mix, or otherwise manipulate your PDF files, you will love this tool!
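
For instance, merging every PDF in a folder might look like the sketch below. It assumes pikepdf is installed; `list_pdfs`, `merge_pdfs`, and the folder and file names in the usage comment are hypothetical names of our own.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files in `folder`, sorted by file name."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() == ".pdf")


def merge_pdfs(input_paths, output_path):
    """Concatenate the pages of several PDFs into one file; return the page count."""
    # Imported here so list_pdfs stays usable without pikepdf installed.
    import pikepdf  # third-party: pip install pikepdf

    merged = pikepdf.Pdf.new()
    for path in input_paths:
        src = pikepdf.Pdf.open(path)
        merged.pages.extend(src.pages)  # append all pages of this source file
    merged.save(output_path)
    return len(merged.pages)


# Example usage ("scans" and "merged.pdf" are hypothetical names):
# merge_pdfs(list_pdfs("scans"), "merged.pdf")
```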

Sometimes PDF files contain images of scanned pages. If you are lucky, you can use Tesseract OCR out of the box on such data to turn the images into text.
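
A minimal out-of-the-box sketch could look like this. It assumes the Tesseract binary and the `pytesseract` wrapper are installed, and it uses `pdf2image` (which needs Poppler) to render the pages; that rendering step is our own choice, not something the tools above require. The function names and `scan.pdf` are hypothetical.

```python
def clean_ocr_text(raw):
    """Trim every line and drop blank lines and form feed characters."""
    # Tesseract may append a form feed ("\x0c") as a page separator.
    lines = (line.strip() for line in raw.replace("\x0c", "").splitlines())
    return "\n".join(line for line in lines if line)


def ocr_scanned_pdf(pdf_path, lang="eng", dpi=300):
    """Render each page of a scanned PDF to an image and OCR it with Tesseract."""
    # Imported here so clean_ocr_text stays usable without the OCR stack installed.
    from pdf2image import convert_from_path  # third-party, needs Poppler
    import pytesseract                       # third-party, needs the tesseract binary

    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    return [clean_ocr_text(pytesseract.image_to_string(img, lang=lang)) for img in pages]


# Example usage ("scan.pdf" is a hypothetical file):
# for number, text in enumerate(ocr_scanned_pdf("scan.pdf"), start=1):
#     print(f"--- page {number} ---")
#     print(text)
```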
Sometimes the open OCR models cannot recognize the text in the images, e.g. if you have to work with historical texts or the document uses a custom font. The video tutorial above and this post are great sources for learning how to train your own OCR model for such cases.

Things can get really complicated in the realm of PDF documents. If the data is structured or semi-structured and it is embedded as an image, you are in trouble. You can train your own model to identify the parts of the document, and you will also need OCR to extract the data. This post and the accompanying GitHub repo might be useful to you.
Our favorite PDF files contain scanned images of questionnaires or forms filled out by hand. This is where you have to consider handwriting recognition. You can use Tesseract to train a model for handwriting recognition, or you can use your favorite deep learning framework.
If you cannot extract the data using rules, you have to use some machine learning magic. Sometimes out-of-the-box models work fine. Sometimes you have to train your own model(s), and you have to annotate lots of data. Sadly, this sometimes requires so much time and so many resources that it is not worth doing. However, if you are about to build your own models, keep in mind that annotation is king. The more high-quality data you have, the better your chances of training a high-quality model.
OCR and handwriting recognition work best with a spelling corrector because it can improve the output a lot. Peter Norvig’s classic essay, How to Write a Spelling Corrector, is the best place to learn about the topic. These days, neural spell checkers can yield better results. NeuSpell is a fantastic library that lets you use pre-trained models for English, or you can train your own model from scratch for other languages.
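
To give a feel for the classic approach, here is a heavily simplified sketch in the spirit of Norvig's essay. It only considers candidates one edit away from the input and picks the most frequent known word; the tiny word list is a made-up stand-in for a real corpus, and the function names are our own.

```python
import string
from collections import Counter


def edits1(word, alphabet=string.ascii_lowercase):
    """Return all strings that are one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Return `word` if it is known, else the most frequent known word one edit away."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word  # nothing better found; keep the OCR output as it is
    return max(candidates, key=lambda w: counts[w])


# Toy vocabulary; a real corrector would count words in a large corpus.
counts = Counter("the invoice total is due the total".split())
print(correct("tota1", counts))   # the digit 1 misread for the letter l
print(correct("invoce", counts))  # a dropped letter
```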
The source of the cover image can be found here.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
