How to Extract Images from PDF in Python

Learn how to extract and save images from PDF files in Python using the PyMuPDF and Pillow libraries.
3 min read · Updated Sep 2022 · PDF File Handling

In this tutorial, we will write Python code to extract images from PDF files and save them to the local disk using the PyMuPDF and Pillow libraries.

With PyMuPDF, you can access PDF, XPS, OpenXPS, EPUB, and many other document formats. It should run on all platforms, including Windows, macOS, and Linux.

Let's install it along with Pillow:

pip3 install PyMuPDF Pillow

Open up a new Python file and let's get started. First, let's import the libraries:

import fitz # PyMuPDF
import io
from PIL import Image

I'm going to test this with this PDF file, but you're free to bring any PDF file and put it in your current working directory. Let's load it into the library:

# file path you want to extract images from
file = "1710.05006.pdf"
# open the file
pdf_file = fitz.open(file)

Since we want to extract images from all pages, we need to iterate over all the pages available and get all image objects on each page. The following code does that:

# iterate over PDF pages
for page_index in range(len(pdf_file)):
    # get the page itself
    page = pdf_file[page_index]
    # get image list
    image_list = page.get_images()
    # printing number of images found in this page
    if image_list:
        print(f"[+] Found a total of {len(image_list)} images in page {page_index}")
    else:
        print("[!] No images found on page", page_index)
    for image_index, img in enumerate(image_list, start=1):
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes
        base_image = pdf_file.extract_image(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it to PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk, closing the file handle when done
        with open(f"image{page_index+1}_{image_index}.{image_ext}", "wb") as f:
            image.save(f)

Related: How to Convert PDF to Images in Python.

We're using the get_images() method to list all available image objects on that particular page as a list of tuples. To get the image's cross-reference number (xref), we simply take the first element of each tuple.
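A PDF can reference the same image object from more than one page, so the xref values collected page by page may repeat. Here is a minimal sketch of collecting each xref only once, assuming every entry is a tuple whose first element is the xref, as described above. The helper name `unique_xrefs` is hypothetical and the sample tuples are made up for illustration, not taken from the test PDF:

```python
def unique_xrefs(image_lists):
    """Return each image xref once, in first-seen order.

    image_lists: one list per page, shaped like the result of
    page.get_images() (tuples whose first element is the xref).
    """
    seen = set()
    ordered = []
    for image_list in image_lists:
        for img in image_list:
            xref = img[0]
            if xref not in seen:
                seen.add(xref)
                ordered.append(xref)
    return ordered


# made-up sample data: two pages that share the image with xref 12
pages = [
    [(12, 0, 640, 480), (15, 0, 100, 100)],
    [(12, 0, 640, 480), (21, 0, 320, 200)],
]
print(unique_xrefs(pages))  # [12, 15, 21]
```

In the loop above, you could pass the resulting xrefs to extract_image() instead of extracting per page, if you don't need the page number in the file name.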

After that, we use the extract_image() method, which returns the image in bytes along with additional information, such as the image extension.
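The additional information can be used to skip tiny images such as icons or bullets. The sketch below assumes the returned dictionary also carries "width" and "height" keys next to "image" and "ext"; check the PyMuPDF documentation for your version. The function `is_large_enough` and its thresholds are hypothetical, and the dictionaries are made-up stand-ins rather than real results:

```python
def is_large_enough(base_image, min_width=100, min_height=100):
    """Return True if the extracted image meets the minimum dimensions."""
    return (
        base_image.get("width", 0) >= min_width
        and base_image.get("height", 0) >= min_height
    )


# made-up dictionaries standing in for extract_image() results
icon = {"ext": "png", "width": 16, "height": 16, "image": b""}
figure = {"ext": "jpeg", "width": 800, "height": 600, "image": b""}

print(is_large_enough(icon))    # False
print(is_large_enough(figure))  # True
```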

Finally, we convert the image bytes to a PIL image instance and save it to the local disk using the save() method which accepts a file pointer as an argument; we're simply naming the images with their corresponding page and image indices.
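Note that Pillow encodes the picture again when it saves it. If you only need the files exactly as they are embedded in the PDF, one alternative (not what the script above does) is to write the extracted bytes straight to disk. A minimal sketch, where `save_image_bytes` and `out_dir` are hypothetical names and the file naming follows the same page and image index convention:

```python
from pathlib import Path


def save_image_bytes(image_bytes, image_ext, page_index, image_index, out_dir="."):
    """Write raw image bytes to out_dir and return the resulting path."""
    out_path = Path(out_dir) / f"image{page_index + 1}_{image_index}.{image_ext}"
    out_path.write_bytes(image_bytes)
    return out_path


# Inside the extraction loop this could replace the Pillow step, e.g.:
# save_image_bytes(base_image["image"], base_image["ext"], page_index, image_index)
```

Going through Pillow is still useful when you want to convert formats or inspect the image before saving.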

After I ran the script, I got the following output:

[!] No images found on page 0
[+] Found a total of 3 images in page 1
[+] Found a total of 3 images in page 2
[!] No images found on page 3
[!] No images found on page 4

The images are saved as well in the current directory:

[Image: Extracted images using Python]

Conclusion

Alright, we have successfully extracted the images from that PDF file at their original embedded resolution, rather than from a re-rendered page. For more information on how the library works, I suggest you take a look at the PyMuPDF documentation.

You can get the full code here.

Happy Coding ♥
