How to Extract PDF Tables in Python

Learn how to extract tables from PDF files in Python using the Camelot library and export them to several formats such as CSV, Excel, pandas DataFrame and HTML.
Abdou Rockikz · 4 min read · Updated May 2020 · General Python Topics


In this tutorial, you will learn how to extract tables from PDF files using the Camelot library in Python. Camelot is a Python library and command-line tool that makes it easy for anyone to extract data tables trapped inside PDF files. Check its official documentation and GitHub repository for details. Let's dive in!

Related tutorial: How to Convert HTML Tables into CSV Files in Python.

First, you need to install the dependencies this library requires to work properly, and then you can install the library itself from the command line:

pip3 install camelot-py[cv]

Note that you need to make sure Tkinter and Ghostscript (the required dependencies) are properly installed on your computer.
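If you are not sure whether these dependencies are present, here is a minimal sketch of a check using only the standard library. The function name check_camelot_dependencies() is my own, and the Ghostscript executable names ("gs", "gswin64c", "gswin32c") are assumptions about typical installations. Finding an executable on your PATH does not guarantee Camelot will load it, so treat the result as a hint only:

```python
import importlib.util
import shutil


def check_camelot_dependencies():
    """Return a dict telling whether each dependency appears to be available."""
    # Assumed executable names: "gs" on Linux/macOS, "gswin64c"/"gswin32c" on Windows.
    ghostscript = any(shutil.which(name) for name in ("gs", "gswin64c", "gswin32c"))
    # find_spec() returns None when the module cannot be located.
    tkinter = importlib.util.find_spec("tkinter") is not None
    return {"ghostscript": ghostscript, "tkinter": tkinter}


for name, found in check_camelot_dependencies().items():
    print(f"{name}: {'found' if found else 'missing'}")
```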

Now that you have installed all requirements for this tutorial, open up a new Python file and follow along:

import camelot

# PDF file to extract tables from
file = "foo.pdf"

I have a PDF file in the current directory called "foo.pdf", a single ordinary page that contains one table, shown in the following image:

[Image: Table in PDF to extract in Python]

It's just a random table. Let's extract it in Python:

# extract the tables from the PDF file (only the first page is read by default)
tables = camelot.read_pdf(file)

The read_pdf() function extracts the tables it finds in a PDF file. By default it only reads the first page; pass pages="all" to process every page. Let's print the number of tables extracted:

# number of tables extracted
print("Total tables extracted:", tables.n)

This outputs:

Total tables extracted: 1 
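If your PDF has more than one page, the pages argument of read_pdf() takes a string such as "1,3-5" or "all". Here is a minimal sketch that builds such a string from a list of page numbers. Note that page_spec() and extract_tables() are hypothetical helper names of mine, not part of Camelot:

```python
def page_spec(pages):
    """Turn page numbers like [1, 3, 4, 5] into a string such as "1,3-5"."""
    numbers = sorted(set(pages))
    if not numbers:
        raise ValueError("at least one page number is required")
    ranges = []
    start = prev = numbers[0]
    for n in numbers[1:]:
        if n == prev + 1:
            # still inside a run of consecutive pages
            prev = n
            continue
        ranges.append((start, prev))
        start = prev = n
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def extract_tables(path, pages="all"):
    """Read tables from the given pages (a list of ints or a ready-made string)."""
    import camelot  # imported here so page_spec() works without camelot installed

    if not isinstance(pages, str):
        pages = page_spec(pages)
    return camelot.read_pdf(path, pages=pages)


print(page_spec([1, 3, 4, 5]))  # 1,3-5
```

With this in place, extract_tables("foo.pdf", [1, 3, 4, 5]) would ask Camelot to read pages 1 and 3 to 5 only.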

As expected, the file contains only one table. Let's print it as a pandas DataFrame:

# print the first table as Pandas DataFrame
print(tables[0].df)

Output:

              0            1                2                     3                  4                  5                 6
0  Cycle \nName  KI \n(1/km)  Distance \n(mi)  Percent Fuel Savings
1                                                  Improved \nSpeed  Decreased \nAccel  Eliminate \nStops  Decreased \nIdle
2        2012_2         3.30              1.3                  5.9%               9.5%              29.2%             17.4%
3        2145_1         0.68             11.2                  2.4%               0.1%               9.5%              2.7%
4        4234_1         0.59             58.7                  8.5%               1.3%               8.5%              3.3%
5        2032_2         0.17             57.8                 21.7%               0.3%               2.7%              1.2%
6        4171_1         0.07            173.9                 58.1%               1.6%               2.1%              0.5%
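Notice that some header cells contain line breaks carried over from the PDF (such as "Cycle \nName"). If you want to tidy them up before exporting, here is a small sketch. The helper names clean_cell() and clean_table() are my own, and clean_table() assumes it receives a pandas DataFrame such as tables[0].df:

```python
import re


def clean_cell(text):
    """Collapse embedded newlines and repeated whitespace into single spaces."""
    return re.sub(r"\s+", " ", str(text)).strip()


def clean_table(df):
    """Return a new DataFrame with every cell passed through clean_cell()."""
    return df.apply(lambda column: column.map(clean_cell))


print(clean_cell("Cycle \nName"))  # Cycle Name
```

You would then call clean_table(tables[0].df) to get a cleaned copy, leaving the original table untouched.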

The extracted data matches the table in the PDF. Now let's export the table to a CSV file:

# export individually
tables[0].to_csv("foo.csv")

Or if you want to export all tables in one go:

# or export all in a zip
tables.export("foo.csv", f="csv", compress=True)

The f parameter indicates the file format, in this case "csv". Setting the compress parameter to True creates a ZIP file that contains all the tables in CSV format.
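To see what ended up in the archive, you can list its contents with the standard zipfile module. This is only a sketch: list_exported_files() is a helper name I made up, and I am assuming the archive is written next to the given path as "foo.zip", with members named along the lines of "foo-page-1-table-1.csv":

```python
import zipfile


def list_exported_files(zip_path):
    """Return the names of the files stored in a ZIP archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.namelist()


# After tables.export("foo.csv", f="csv", compress=True) you could run:
# print(list_exported_files("foo.zip"))
```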

You can also export the tables to HTML format:

# export to HTML
tables.export("foo.html", f="html")

You can export to other formats such as JSON and Excel too.
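Each table also has to_json(), to_excel() and to_html() methods alongside to_csv(). As a rough sketch, you could pick the right one from the file extension. The export_table() helper and the EXPORTERS mapping are my own naming, and writing Excel files assumes you have an Excel writer package such as openpyxl installed for pandas:

```python
from pathlib import Path

# Camelot table export methods, keyed by file extension.
EXPORTERS = {
    ".csv": "to_csv",
    ".json": "to_json",
    ".xlsx": "to_excel",
    ".html": "to_html",
}


def export_table(table, path):
    """Export one table, choosing the method from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXPORTERS:
        raise ValueError(f"unsupported format: {suffix!r}")
    method_name = EXPORTERS[suffix]
    getattr(table, method_name)(path)
    return method_name


# Example usage with a table extracted earlier:
# export_table(tables[0], "foo.json")
# export_table(tables[0], "foo.xlsx")
```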

It is worth noting that Camelot only works with text-based PDFs, not scanned documents. If you can click and drag to select text in your table in a PDF viewer, then it is a text-based PDF, so this will work on papers, books, documents and much more!

This means Camelot won't convert characters in images to digital text. If you need that, you can use OCR techniques to convert the optical characters in an image into actual text that can be manipulated in Python.

Alright, that's it for this tutorial. Check the official Camelot documentation for more information.

Finally, if you feel you need to learn Python, I highly suggest you get the Automate the Boring Stuff with Python book. It is a hands-on, project-based book in which you'll get to work with Excel spreadsheets, Google spreadsheets, PDF and Word documents and more.

Read also: How to Convert Speech to Text in Python.

Happy Coding ♥
