How to Download All Images from a Web Page in Python

Abdou Rockikz · 28 Sep 2019 · Updated Oct 2019 · Web Scraping

Have you ever wanted to download all the images on a web page? In this tutorial, you will learn how to retrieve all image URLs from a web page and download them in Python using requests and BeautifulSoup.

We need a few dependencies; let's install them:

pip3 install requests bs4 tqdm

Open up a new Python file and import the necessary modules:

import requests
import os
from tqdm import tqdm
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin, urlparse

First, when you extract image URLs from a web page, many of them are relative, which means they do not contain the full absolute URL with the scheme and domain. So we need a way to check whether a URL is absolute:

def is_absolute(url):
    """
    Determines whether a `url` is absolute.
    """
    return bool(urlparse(url).netloc)

The urlparse() function parses a URL into six components; we only need to check whether the netloc (the domain name) is present.
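To make the six components concrete, here is a small sketch that parses one absolute and one relative URL. Both URLs are made up for illustration and are not taken from any real page.

```python
from urllib.parse import urlparse

# Hypothetical example URLs, used only to show what urlparse() returns.
absolute = urlparse("https://example.com/static/logo.png?v=2#top")
relative = urlparse("/static/logo.png")

# The result is a named tuple with six fields:
# scheme, netloc, path, params, query, fragment
print(absolute.scheme, absolute.netloc, absolute.path, absolute.query)

# A relative URL has an empty netloc, which is what is_absolute() tests for.
print(repr(relative.netloc))
```

Because the relative URL has an empty netloc, bool() of it is False, so is_absolute() would report it as relative.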

Second, some websites put encoded data (such as data URIs) in place of a URL, and we need to skip those. Let's implement a function that validates every URL passed:

def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)
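As a quick check of what this validation accepts, the sketch below runs is_valid() on three made-up inputs: a normal absolute URL, a data URI (inline encoded data), and a relative path. The function is repeated here only so the snippet runs on its own.

```python
from urllib.parse import urlparse

def is_valid(url):
    """Same check as above: both a scheme and a netloc must be present."""
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

# Hypothetical inputs for illustration.
candidates = [
    "https://example.com/cat.png",         # scheme and netloc present
    "data:image/png;base64,iVBORw0KGgo=",  # has a scheme but no netloc
    "/static/cat.png",                     # neither scheme nor netloc
]

results = {url: is_valid(url) for url in candidates}
for url, ok in results.items():
    print(ok, url)
```

Only the first entry passes. The data URI is rejected because it has a scheme (data) but no netloc, which is how inline encoded images get skipped.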

Third, I'm going to write a core function that grabs all image URLs of a web page:

def get_all_images(url):
    """
    Returns all image URLs on a single `url`
    """
    soup = bs(requests.get(url).content, "html.parser")
    # collect the image URLs found on the page
    urls = []

The HTML content of the web page is now in the soup object. To extract all img tags from the HTML, we use the soup.find_all("img") method. Let's see it in action:

    for img in tqdm(soup.find_all("img"), "Extracting images"):
        img_url = img.attrs.get("src")

        if not img_url:
            # if img does not contain src attribute, just skip
            continue

soup.find_all("img") retrieves all img elements as a Python list.

I've wrapped it in a tqdm object just to print a progress bar. The URL of an img tag is stored in its src attribute. However, some tags do not have a src attribute; we skip those with the continue statement above.

Next, let's check whether the URL is relative; if so, we join it with the URL of the page itself:

        if not is_absolute(img_url):
            # if img has relative URL, make it absolute by joining
            img_url = urljoin(url, img_url)
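To see what urljoin() produces for the different kinds of src values you may run into, here is a small sketch. The page URL and the src values are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical page URL and src attribute values.
page_url = "https://example.com/blog/post.html"
srcs = [
    "images/a.png",                     # relative to the page's folder
    "/static/a.png",                    # relative to the site root
    "../up.png",                        # one folder up
    "//cdn.example.com/a.png",          # protocol-relative (no scheme)
    "https://other.example.org/a.png",  # already absolute
]

joined = {src: urljoin(page_url, src) for src in srcs}
for src, full in joined.items():
    print(f"{src!r:40} -> {full}")
```

Note that urljoin() returns an already absolute URL unchanged, and that it fills in the scheme for protocol-relative URLs such as "//cdn.example.com/a.png". Since is_absolute() only looks at the netloc, such a URL is treated as absolute and left without a scheme, so is_valid() later rejects it. Calling urljoin() on every src would be one way to keep those images, if you want them.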

Some URLs contain HTTP GET key-value pairs that we don't want (they end with something like "/image.png?c=3.2.5"). Let's remove them:

        try:
            pos = img_url.index("?")
            img_url = img_url[:pos]
        except ValueError:
            pass

We get the position of the '?' character and remove it along with everything after it. If there is no '?' in the URL, index() raises ValueError, which is why I wrapped it in a try/except block.
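An alternative that avoids the try/except is to let urllib.parse do the splitting. The sketch below uses a hypothetical helper named strip_query() (not part of the tutorial's code) that drops the query string and also any fragment.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Hypothetical helper: return `url` without its query string and fragment."""
    parts = urlsplit(url)
    # SplitResult is a named tuple, so _replace() returns a modified copy
    return urlunsplit(parts._replace(query="", fragment=""))

with_query = strip_query("https://example.com/image.png?c=3.2.5")
without_query = strip_query("https://example.com/image.png")
print(with_query)
print(without_query)
```

Keep in mind that some sites serve images through the query string itself (for example a script that takes an id parameter). For those, stripping the query would point to a different resource, so treat this step as optional.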

Now let's make sure that every URL is valid and return all the image URLs:

        # finally, if the url is valid
        if is_valid(img_url):
            urls.append(img_url)
    return urls

Now that we have a function that grabs all image URLs, we need a function to download files from the web with Python. I took the following function from this tutorial:

def download(url, pathname):
    """
    Downloads a file given a URL and puts it in the folder `pathname`
    """
    # if path doesn't exist, make that path dir
    if not os.path.isdir(pathname):
        os.makedirs(pathname)
    # download the body of response by chunk, not immediately
    response = requests.get(url, stream=True)
    # get the total file size
    file_size = int(response.headers.get("Content-Length", 0))
    # get the file name
    filename = os.path.join(pathname, url.split("/")[-1])
    # read the response body 1024 bytes at a time
    buffer_size = 1024
    # progress bar, changing the unit to bytes instead of iteration (default by tqdm)
    progress = tqdm(response.iter_content(buffer_size), f"Downloading {filename}", total=file_size, unit="B", unit_scale=True, unit_divisor=1024)
    with open(filename, "wb") as f:
        # loop over the underlying iterable so tqdm does not also count one unit per chunk
        for data in progress.iterable:
            # write data read to the file
            f.write(data)
            # update the progress bar manually
            progress.update(len(data))

The pathname is the folder in which we are going to store the images.
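The file name above comes from url.split("/")[-1], which yields an empty string when the URL ends with a slash. As a rough sketch of a more defensive approach, the hypothetical helper filename_from_url() below takes the name from the parsed path and falls back to a default. The helper name and the default value are assumptions for illustration.

```python
import os
from urllib.parse import unquote, urlparse

def filename_from_url(url, default="image"):
    """Hypothetical helper: derive a local file name from the path of `url`."""
    # use only the path component, so any query string is ignored
    path = urlparse(url).path
    # decode percent-escapes such as %20, then keep the last path segment
    name = os.path.basename(unquote(path))
    # fall back to `default` when the path ends with a slash
    return name or default

print(filename_from_url("https://example.com/img/cat.png?size=2"))
print(filename_from_url("https://example.com/my%20cat.png"))
print(filename_from_url("https://example.com/"))
```

Inside download(), the result could be passed to os.path.join(pathname, ...) in place of url.split("/")[-1].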

Finally, here is the main function:

def main(url, path):
    # get all images
    imgs = get_all_images(url)
    for img in imgs:
        # for each image, download it
        download(img, path)

Let's test this:

main("https://www.thepythoncode.com/topic/web-scraping", "web-scraping")

This will download all images from that URL and store them in the folder "web-scraping", which will be created automatically.

Alright, that's it! I hope this was a helpful tutorial for you to get your hands dirty with web scraping.

Here are some ideas you can implement:

Happy Scraping ♥
