In this tutorial, you will learn how to extract tables from PDF files using the Camelot library in Python. Camelot is a Python library and command-line tool that makes it easy to extract data tables trapped inside PDF files. Check its official documentation and GitHub repository for details. Let's dive in!
Related tutorial: How to Extract Images from PDF in Python.
pip3 install camelot-py[cv]
Note that you need to make sure that Tkinter and Ghostscript, which Camelot requires, are installed properly on your computer.
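A quick way to confirm both dependencies before going further is a small check script. This is only a sketch using the standard library: check_dependencies and GHOSTSCRIPT_NAMES are names made up for this example, and the list of Ghostscript executable names (gs on Linux and macOS, gswin64c or gswin32c on Windows) is an assumption about a typical installation.

```python
import importlib.util
import shutil

# Assumed executable names: "gs" on Linux/macOS, "gswin64c"/"gswin32c" on Windows.
GHOSTSCRIPT_NAMES = ("gs", "gswin64c", "gswin32c")


def check_dependencies():
    """Report whether Tkinter is importable and a Ghostscript executable is on PATH."""
    return {
        "tkinter": importlib.util.find_spec("tkinter") is not None,
        "ghostscript": any(shutil.which(name) is not None for name in GHOSTSCRIPT_NAMES),
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If either entry is reported as missing, install it with your operating system's package manager before running the rest of the tutorial.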
Now that you have installed all requirements for this tutorial, open up a new Python file and follow along:
import camelot

# PDF file to extract tables from
file = "foo.pdf"
I have a PDF file in the current directory called "foo.pdf". It is a normal page that contains one table, shown in the following image:
It's just a random table. Let's extract it in Python:
# extract all the tables in the PDF file
tables = camelot.read_pdf(file)
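The read_pdf() function extracts the tables from a PDF file. By default it only processes the first page, so pass the pages argument (for example pages="all") for longer documents. Let's print the number of tables extracted:
# number of tables extracted
print("Total tables extracted:", tables.n)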
Total tables extracted: 1
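For PDFs with more than one page, or with tables that have no ruling lines, read_pdf() accepts pages and flavor arguments. The sketch below wraps the call in two helpers; extract_tables and parsing_reports are names made up for this example, and the defaults simply mirror Camelot's own (pages="1", flavor="lattice"). Camelot is imported inside the function so that the snippet can be loaded even where the library is not installed.

```python
def extract_tables(path, pages="1", flavor="lattice"):
    """Run Camelot on `path` and return the resulting list of tables.

    pages:  comma-separated page numbers or ranges, e.g. "1,3,4", "1,4-end" or "all".
    flavor: "lattice" for tables drawn with ruling lines,
            "stream" for tables separated only by whitespace.
    """
    import camelot  # imported lazily; requires camelot-py to be installed

    return camelot.read_pdf(path, pages=pages, flavor=flavor)


def parsing_reports(tables):
    """Collect the parsing report that Camelot attaches to each extracted table."""
    return [table.parsing_report for table in tables]
```

For example, extract_tables("foo.pdf", pages="all", flavor="stream") would scan every page with the whitespace-based parser. Each parsing report is a dictionary with keys such as accuracy, whitespace, order and page, which helps spot tables that were detected poorly.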
As expected, the file contains only one table. Let's print it as a pandas DataFrame:
# print the first table as Pandas DataFrame
print(tables[0].df)
              0            1                2                     3                  4                  5                 6
0  Cycle \nName  KI \n(1/km)  Distance \n(mi)  Percent Fuel Savings
1                                                  Improved \nSpeed  Decreased \nAccel  Eliminate \nStops  Decreased \nIdle
2        2012_2         3.30              1.3                  5.9%               9.5%              29.2%             17.4%
3        2145_1         0.68             11.2                  2.4%               0.1%               9.5%              2.7%
4        4234_1         0.59             58.7                  8.5%               1.3%               8.5%              3.3%
5        2032_2         0.17             57.8                 21.7%               0.3%               2.7%              1.2%
6        4171_1         0.07            173.9                 58.1%               1.6%               2.1%              0.5%
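Notice that the header is spread over the first two rows and that some cells still contain line breaks (\n). A possible cleanup step is sketched below on plain lists of strings, which you can get from the DataFrame with tables[0].df.values.tolist(). The helper names clean_cell and merge_header_rows are made up for this example, and joining the two header rows column by column is just one reasonable choice, not something Camelot does for you.

```python
def clean_cell(text):
    """Collapse embedded line breaks and repeated spaces inside one cell."""
    return " ".join(text.split())


def merge_header_rows(rows, header_rows=2):
    """Return (header, body): the first `header_rows` rows joined per column, then the rest."""
    cleaned = [[clean_cell(cell) for cell in row] for row in rows]
    header = [
        " ".join(part for part in column if part)
        for column in zip(*cleaned[:header_rows])
    ]
    return header, cleaned[header_rows:]


if __name__ == "__main__":
    # The two header rows and the first data row of the table printed above.
    sample = [
        ["Cycle \nName", "KI \n(1/km)", "Distance \n(mi)", "Percent Fuel Savings", "", "", ""],
        ["", "", "", "Improved \nSpeed", "Decreased \nAccel", "Eliminate \nStops", "Decreased \nIdle"],
        ["2012_2", "3.30", "1.3", "5.9%", "9.5%", "29.2%", "17.4%"],
    ]
    header, body = merge_header_rows(sample)
    print(header)
    print(body)
```

The header and body lists can then be passed to pandas.DataFrame(body, columns=header) if you prefer to keep working with a DataFrame.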
The extraction is accurate. Let's export the table to a CSV file:
# export the first table individually
tables[0].to_csv("foo.csv")
Or if you want to export all tables in one go:
# or export all in a zip
tables.export("foo.csv", f="csv", compress=True)
The f parameter sets the output format, in this case "csv". Setting the compress parameter to True creates a ZIP file that contains all the tables in CSV format.
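To check what ended up in the archive, you can read the CSV files straight from the ZIP with the standard library. This is a sketch only: read_csvs_from_zip is a name made up for this example, and it assumes the archive is written next to the requested path as foo.zip and that the CSV files are UTF-8 encoded.

```python
import csv
import io
import zipfile


def read_csvs_from_zip(zip_path):
    """Map each CSV file name inside the archive to its rows (lists of strings)."""
    contents = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip anything that is not a CSV file
            with archive.open(name) as raw:
                # Wrap the binary stream so that csv.reader receives text.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                contents[name] = list(csv.reader(text))
    return contents
```

Calling read_csvs_from_zip("foo.zip") returns one entry per exported table, so len() of the result should match tables.n.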
You can also export the tables to HTML format:
# export to HTML
tables.export("foo.html", f="html")
You can export to other formats too, such as JSON and Excel.
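If you want several formats at once, the same export() call can be repeated with a different f value each time. The helper below is a sketch: export_all and EXTENSIONS are names made up for this example, the "xlsx" extension for Excel output is an assumption, and Excel export additionally needs an Excel writer such as openpyxl to be installed.

```python
# Camelot format name -> file extension used for the output path (assumed mapping).
EXTENSIONS = {"csv": "csv", "json": "json", "html": "html", "excel": "xlsx"}


def export_all(tables, stem, formats=("csv", "json", "html")):
    """Export every table once per format and return the paths handed to Camelot."""
    paths = []
    for fmt in formats:
        path = f"{stem}.{EXTENSIONS[fmt]}"
        tables.export(path, f=fmt)  # same call as above, with a different format
        paths.append(path)
    return paths
```

For example, export_all(tables, "foo", formats=("json", "excel")) asks Camelot for JSON and Excel output based on the names foo.json and foo.xlsx.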
It is worth noting that Camelot only works with text-based PDFs, not scanned documents. If you can click and drag to select the text of a table in a PDF viewer, then it is a text-based PDF, so Camelot will work on papers, books, documents and much more.
Camelot won't convert characters in images to digital text. If you need that, you can use OCR techniques to convert the characters in an image to actual text that can be manipulated in Python.
Alright, that's it for this tutorial. Check Camelot's official documentation for more detailed information.
Learn also: How to Extract All PDF Links in Python.
Happy Coding ♥