How to Extract PDF Metadata in Python

Learn how to use the pikepdf library to extract useful information from PDF files in Python.
4 min read · Updated Oct 2021 · PDF File Handling


PDF metadata is useful information about the document itself: the title, the author, the creation date, the last modification date, the subject, and more. Some PDF files carry more metadata than others. In this tutorial, you will learn how to extract PDF metadata in Python.

There are several libraries and utilities in Python that accomplish the same thing, but I like using pikepdf, as it's an actively maintained library. Let's install it:

$ pip install pikepdf

Pikepdf is a Pythonic wrapper around the C++ QPDF library. Let's import it in our script:

import pikepdf
import sys

We'll also use the sys module to get the filename from the command-line arguments:

# get the target pdf file from the command-line arguments
pdf_filename = sys.argv[1]
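
Note that sys.argv[1] raises an IndexError when the script is run without any argument. A minimal sketch of a guard could look like this; the helper name get_pdf_filename is hypothetical and not part of the original script:

```python
import sys


def get_pdf_filename(argv):
    """Return the PDF path from the argument list, or exit with a usage message.

    Hypothetical helper for illustration; it is not part of the original script.
    """
    if len(argv) < 2:
        # sys.exit() with a string prints it to stderr and exits with status 1
        sys.exit(f"Usage: python {argv[0]} <file.pdf>")
    return argv[1]
```

In the script, you would then write pdf_filename = get_pdf_filename(sys.argv) instead of indexing sys.argv directly.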

Let's load the PDF file using the library, and get the metadata:

# read the pdf file
pdf = pikepdf.Pdf.open(pdf_filename)
docinfo = pdf.docinfo
for key, value in docinfo.items():
    print(key, ":", value)

The docinfo attribute contains a dictionary of the document's metadata. Here is an example execution:

$ python extract_pdf_metadata_simple.py bert-paper.pdf

Output:

/Author : 
/CreationDate : D:20190528000751Z
/Creator : LaTeX with hyperref package
/Keywords :
/ModDate : D:20190528000751Z
/PTEX.Fullbanner : This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2
/Producer : pdfTeX-1.40.17
/Subject :
/Title :
/Trapped : /False

Here is another PDF file:

$ python extract_pdf_metadata_simple.py python_cheat_sheet.pdf

Output:

/CreationDate : D:20201002181301Z
/Creator : wkhtmltopdf 0.12.5
/Producer : Qt 4.8.7
/Title : Markdown To PDF

As you can see, not all documents have the same fields; some contain much less information.
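
Because the available fields vary from file to file, it can help to copy the metadata into a plain dictionary and look fields up with a default. The sketch below assumes only that docinfo behaves like a mapping with an items() method, as used above. The helper name metadata_to_dict is made up for illustration, and a plain dict stands in for pdf.docinfo so the example runs without a PDF file:

```python
def metadata_to_dict(docinfo):
    """Copy a docinfo-like mapping into a plain dict of strings.

    Keys lose their leading "/" (so "/Title" becomes "Title") and
    values are converted with str().
    """
    return {str(key).lstrip("/"): str(value) for key, value in docinfo.items()}


# A plain dict stands in for pdf.docinfo here (values taken from the second output above).
sample = {"/Creator": "wkhtmltopdf 0.12.5", "/Title": "Markdown To PDF"}
info = metadata_to_dict(sample)

print(info.get("Title", "(no title)"))
print(info.get("Author", "(no author)"))  # missing field falls back to the default
```

With a real file, you would pass pdf.docinfo instead of the sample dictionary.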

Notice that /ModDate and /CreationDate are the last modification date and the creation date, respectively, in the PDF datetime format. To convert this format into a Python datetime object, I copied the code below from StackOverflow and edited it a little to run on Python 3. It relies on the dateutil package, which you can install with pip install python-dateutil:

import pikepdf
import datetime
import re
from dateutil.tz import tzutc, tzoffset
import sys

pdf_date_pattern = re.compile(''.join([
    r"(D:)?",
    r"(?P<year>\d\d\d\d)",
    r"(?P<month>\d\d)",
    r"(?P<day>\d\d)",
    r"(?P<hour>\d\d)",
    r"(?P<minute>\d\d)",
    r"(?P<second>\d\d)",
    r"(?P<tz_offset>[+\-zZ])?",  # escape "-" so it is a literal, not a "+" to "z" range
    r"(?P<tz_hour>\d\d)?",
    r"'?(?P<tz_minute>\d\d)?'?"]))

def transform_date(date_str):
    """
    Convert a pdf date such as "D:20120321183444+07'00'" into a usable datetime
    http://www.verypdf.com/pdfinfoeditor/pdf-date-format.htm
    (D:YYYYMMDDHHmmSSOHH'mm')
    :param date_str: pdf date string
    :return: datetime object
    """
    # the pattern is compiled at module level; reading it needs no global statement
    match = pdf_date_pattern.match(date_str)
    if match:
        date_info = match.groupdict()

        for k, v in date_info.items():  # transform values
            if v is None:
                pass
            elif k == 'tz_offset':
                date_info[k] = v.lower()  # so we can treat Z as z
            else:
                date_info[k] = int(v)

        if date_info['tz_offset'] in ('z', None):  # UTC
            date_info['tzinfo'] = tzutc()
        else:
            multiplier = 1 if date_info['tz_offset'] == '+' else -1
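            # the hour and minute parts of the offset are optional, so default them to 0
            tz_seconds = 3600 * (date_info['tz_hour'] or 0) + 60 * (date_info['tz_minute'] or 0)
            date_info['tzinfo'] = tzoffset(None, multiplier * tz_seconds)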

        for k in ('tz_offset', 'tz_hour', 'tz_minute'):  # no longer needed
            del date_info[k]

        return datetime.datetime(**date_info)

# get the target pdf file from the command-line arguments
pdf_filename = sys.argv[1]
# read the pdf file
pdf = pikepdf.Pdf.open(pdf_filename)
docinfo = pdf.docinfo
for key, value in docinfo.items():
    if str(value).startswith("D:"):
        # pdf datetime format, convert to python datetime
        value = transform_date(str(value))
    print(key, ":", value)

Here is the same output as before, but with the PDF dates converted to Python datetime objects:

/Author : 
/CreationDate : 2019-05-28 00:07:51+00:00
/Creator : LaTeX with hyperref package
/Keywords :
/ModDate : 2019-05-28 00:07:51+00:00
/PTEX.Fullbanner : This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2
/Producer : pdfTeX-1.40.17
/Subject :
/Title :
/Trapped : /False

Much better. I hope this quick tutorial helped you extract the metadata of PDF documents with Python.
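
If you would rather not depend on dateutil, the same conversion can be sketched with the standard library's datetime.timezone. This is a simplified variant under the same assumptions as transform_date (a full date and time, followed by an optional Z or a +HH'mm' style offset); the name parse_pdf_date is hypothetical:

```python
import re
from datetime import datetime, timedelta, timezone

# Same layout as above: D:YYYYMMDDHHmmSS followed by an optional Z or +HH'mm' offset
PDF_DATE = re.compile(
    r"(?:D:)?(?P<stamp>\d{14})"
    r"(?P<sign>[+\-zZ])?(?P<tz_hour>\d\d)?'?(?P<tz_minute>\d\d)?'?"
)


def parse_pdf_date(date_str):
    """Return a timezone-aware datetime, or None if the string does not match."""
    match = PDF_DATE.match(date_str)
    if match is None:
        return None
    naive = datetime.strptime(match["stamp"], "%Y%m%d%H%M%S")
    sign = match["sign"]
    if sign in ("+", "-"):
        # the hour and minute parts of the offset are optional, so default them to 0
        offset = timedelta(hours=int(match["tz_hour"] or 0),
                           minutes=int(match["tz_minute"] or 0))
        tz = timezone(offset if sign == "+" else -offset)
    else:
        tz = timezone.utc  # "Z" or no offset at all is treated as UTC
    return naive.replace(tzinfo=tz)


print(parse_pdf_date("D:20190528000751Z"))
print(parse_pdf_date("D:20120321183444+07'00'"))
```

Like transform_date, this sketch treats a date without any offset as UTC, which is a simplifying choice rather than something the PDF file guarantees.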

Check the full code here.

Learn also: How to Extract Image Metadata in Python

Happy coding ♥
