How to Extract YouTube Data in Python

Scrape YouTube video pages and extract useful information such as the title, total views, publish date, video duration, tags, likes, dislikes and more in Python using the requests_html and Beautiful Soup libraries.
7 min read · Updated Aug 2020 · Web Scraping

Web scraping is extracting data from websites. It is a form of copying, in which specific data is gathered and copied from the web into a central local database or spreadsheet for later analysis or retrieval.

Since YouTube is the biggest video sharing website on the internet, extracting data from it can be very helpful: you can find the most popular channels, keep track of the popularity of channels, record likes, dislikes and views on videos, and much more. In this tutorial, you will learn how to extract data from YouTube videos using requests_html and BeautifulSoup in Python.

Related: How to Extract YouTube Comments in Python.

Installing required dependencies:

pip3 install requests_html bs4

Before we dive into the script, we need to experiment with how to extract such data from websites using BeautifulSoup. Open up a Python interactive shell and write these lines of code:

from requests_html import HTMLSession 
from bs4 import BeautifulSoup as bs # importing BeautifulSoup

# sample youtube video url
video_url = ""
# init an HTML Session
session = HTMLSession()
# get the html content
response = session.get(video_url)
# execute Java-script
response.html.render(sleep=1)
# create bs object to parse HTML
soup = bs(response.html.html, "html.parser")
# write all HTML code into a file
with open("video.html", "w", encoding="utf8") as f:
    f.write(response.html.html)

This will create a new HTML file in the current directory. Open it in a browser to see how requests_html and BeautifulSoup see the YouTube video web page.

Once you open it in your browser, right-click the elements you want to extract and choose Inspect. For instance, when I did that for the video title, I saw that the title is under an h1 tag:

[Image: YouTube Video Title HTML Tag]

Similarly, the number of video views is under a span tag that has the class view-count:

[Image: YouTube Video Views HTML Tag]

Great, now let's try to extract the title and the number of views in Python:

In [10]: soup.find("h1").text
Out[10]: 'Me at the zoo'

It's as easy as that. Now the number of views:

In [11]: soup.find("span", attrs={"class": "view-count"}).text
Out[11]: '106,600,098 views'

Awesome, now let's convert this text to an integer in Python by keeping only the digit characters:

In [12]: int(''.join([ c for c in soup.find("span", attrs={"class": "view-count"}).text if c.isdigit() ]))
Out[12]: 106600098
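
The digit-filtering step above can be wrapped in a small helper so it is reusable for other counters. This is a minimal sketch: the name parse_count is hypothetical and not part of the original script, and it assumes the text holds a plain number with thousands separators (such as '106,600,098 views') rather than an abbreviated form.

```python
def parse_count(text):
    """Return the integer formed by the digit characters in text.

    Hypothetical helper, assuming text looks like '106,600,098 views'.
    Returns 0 when the text contains no digits at all.
    """
    digits = "".join(c for c in text if c.isdigit())
    return int(digits) if digits else 0


print(parse_count("106,600,098 views"))
```

Returning 0 for text without digits is a design choice for this sketch; raising an exception instead may be preferable if a missing count should not go unnoticed.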

This way, you will be able to extract everything you want from that web page. Now let's write a script that extracts all the information we can get from a YouTube video page. Open up a new Python file and follow along:

Importing necessary modules:

from requests_html import HTMLSession
from bs4 import BeautifulSoup as bs

Before we make our function that extracts all the video data, let's initialize our HTTP session:

# init session
session = HTMLSession()

Let's make a function that, given the URL of a YouTube video, returns all the data in a dictionary:

def get_video_info(url):
    # download HTML code
    response = session.get(url)
    # execute Javascript
    response.html.render(sleep=1)
    # create beautiful soup object to parse HTML
    soup = bs(response.html.html, "html.parser")
    # open("index.html", "w").write(response.html.html)
    # initialize the result
    result = {}

Notice that after downloading the HTML content of the web page, we call the render() method to execute JavaScript, so that the data we're looking for is rendered in the HTML.

Retrieving the video title:

    # video title
    result["title"] = soup.find("h1").text.strip()

Number of views converted to an integer:

    # video views (converted to integer)
    result["views"] = int(''.join([ c for c in soup.find("span", attrs={"class": "view-count"}).text if c.isdigit() ]))

Get the video description:

    # video description
    result["description"] = soup.find("yt-formatted-string", {"class": "content"}).text

The video description is located in the yt-formatted-string HTML tag that has the class attribute content, so the code above should be able to extract it.
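
Because find() returns None when nothing matches, calling .text on a missing element raises an AttributeError. The sketch below shows one way to guard against that. The helper name find_text is hypothetical, and the sample markup is a tiny stand-in rather than real YouTube HTML.

```python
from bs4 import BeautifulSoup


def find_text(soup, name, attrs, default=None):
    """Return stripped text of the first matching tag, or default if absent."""
    tag = soup.find(name, attrs)
    return tag.text.strip() if tag is not None else default


# tiny stand-in for a rendered page (not real YouTube markup)
sample_html = '<yt-formatted-string class="content"> Hello </yt-formatted-string>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

# present element: its text is returned
description = find_text(sample_soup, "yt-formatted-string", {"class": "content"})
# absent element: the default is returned instead of raising an error
missing = find_text(sample_soup, "span", {"class": "view-count"}, default="")
print(repr(description), repr(missing))
```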

The date when the video was published:

    # date published
    result["date_published"] = soup.find("div", {"id": "date"}).text[1:]
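
The value stored above is still text. If a real date object is needed, it can be parsed with the standard library. This is only a sketch: parse_published_date is a hypothetical helper, and the %d/%m/%Y format is an assumption based on the sample output shown later in this tutorial (23/04/2005). The page may show dates in a different format depending on locale.

```python
from datetime import datetime


def parse_published_date(text, fmt="%d/%m/%Y"):
    """Parse a date string such as '23/04/2005' into a date object.

    The default format is an assumption; pass another fmt if the
    page shows dates differently.
    """
    return datetime.strptime(text.strip(), fmt).date()


published = parse_published_date("23/04/2005")
print(published.isoformat())
```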

The duration of the video:

    # get the duration of the video
    result["duration"] = soup.find("span", {"class": "ytp-time-duration"}).text
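
The duration is extracted as text such as '0:18'. For sorting or arithmetic, it may be handier as a number of seconds. The following is a minimal sketch, assuming the text uses the M:SS or H:MM:SS layout; duration_to_seconds is a hypothetical helper, not part of the original script.

```python
def duration_to_seconds(duration):
    """Convert 'M:SS' or 'H:MM:SS' text into a total number of seconds."""
    seconds = 0
    # each colon-separated part is worth 60 times the part after it
    for part in duration.strip().split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


print(duration_to_seconds("0:18"))
print(duration_to_seconds("1:02:03"))
```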

We can also extract the video tags:

    # get the video tags
    result["tags"] = ', '.join([ meta.attrs.get("content") for meta in soup.find_all("meta", {"property": "og:video:tag"}) ])

The number of likes and dislikes, read from the text of the two toggle buttons (the values are kept as strings here):

    # number of likes
    text_yt_formatted_strings = soup.find_all("yt-formatted-string", {"id": "text", "class": "ytd-toggle-button-renderer"})
    result["likes"] = text_yt_formatted_strings[0].text
    # number of dislikes
    result["dislikes"] = text_yt_formatted_strings[1].text

Since a YouTube video page also shows channel details, such as the name and the number of subscribers, let's grab those as well:

    # channel details
    channel_tag = soup.find("yt-formatted-string", {"class": "ytd-channel-name"}).find("a")
    # channel name
    channel_name = channel_tag.text
    # channel URL
    channel_url = f"{channel_tag['href']}"
    # number of subscribers as str
    channel_subscribers = soup.find("yt-formatted-string", {"id": "owner-sub-count"}).text.strip()
    result['channel'] = {'name': channel_name, 'url': channel_url, 'subscribers': channel_subscribers}
    return result

Since the soup.find() function returns a Tag object, you can still find HTML tags within other tags. As a result, it is common practice to call find() more than once.

This function now returns a lot of video information in a dictionary. Let's finish up our script:
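
To illustrate chaining find() calls in isolation, here is a small sketch on simplified stand-in markup. The HTML and the /channel/example href are made up for the example and do not reflect the real YouTube page structure.

```python
from bs4 import BeautifulSoup

# simplified stand-in markup, not the real YouTube page structure
sample_html = """
<div id="owner">
  <yt-formatted-string class="ytd-channel-name">
    <a href="/channel/example">jawed</a>
  </yt-formatted-string>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# the first find() returns a Tag, so find() can be called on it again
name_tag = soup.find("yt-formatted-string", {"class": "ytd-channel-name"})
link_tag = name_tag.find("a")

print(link_tag.text, link_tag["href"])
```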

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description="YouTube Video Data Extractor")
    parser.add_argument("url", help="URL of the YouTube video")
    args = parser.parse_args()
    url = args.url
    # get the data
    data = get_video_info(url)
    # print in nice format
    print(f"Title: {data['title']}")
    print(f"Views: {data['views']}")
    print(f"Published at: {data['date_published']}")
    print(f"Video Duration: {data['duration']}")
    print(f"Video tags: {data['tags']}")
    print(f"Likes: {data['likes']}")
    print(f"Dislikes: {data['dislikes']}")
    print(f"\nDescription: {data['description']}\n")
    print(f"\nChannel Name: {data['channel']['name']}")
    print(f"Channel URL: {data['channel']['url']}")
    print(f"Channel Subscribers: {data['channel']['subscribers']}")

Nothing special here: we need a way to retrieve the video URL from the command line, and the code above does just that, then prints the extracted data in a readable format. Here is my output when running the script:

Title: Me at the zoo
Views: 106602383
Published at: 23/04/2005
Video Duration: 0:18
Video tags: me at the zoo, jawed karim, first youtube video
Likes: 3825489
Dislikes: 111818

Description: The first video on YouTube. Maybe it's time to go back to the zoo?


== Ok, new video as soon as 10M subscriberz! ==

Channel Name: jawed
Channel URL:
Channel Subscribers: 1.03M
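
Since the function returns a plain dictionary, the result can also be stored for later analysis instead of only being printed. The sketch below serializes a dictionary of that shape to JSON with the standard library. The sample values are copied from the output above, and the file name video_info.json is a hypothetical choice.

```python
import json

# sample values copied from the output above; a real run would use
# the dictionary returned by get_video_info(url)
data = {
    "title": "Me at the zoo",
    "views": 106602383,
    "duration": "0:18",
    "channel": {"name": "jawed", "subscribers": "1.03M"},
}

# serialize to a JSON string; ensure_ascii=False keeps non-ASCII text readable
serialized = json.dumps(data, indent=4, ensure_ascii=False)
print(serialized)

# the string can be written to disk (hypothetical file name) with:
# with open("video_info.json", "w", encoding="utf8") as f:
#     f.write(serialized)
```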

That's it! If you want to extract YouTube comments, there is a lot more to do besides this; there is a separate tutorial for that.

You can now not only extract YouTube video details but also apply this skill to any website you want. If you want to extract Wikipedia pages, there is a tutorial for that! Or maybe you want to scrape weather data from Google? There is a tutorial for that as well.

Learn also: How to Convert HTML Tables into CSV Files in Python.

Happy Scraping ♥
