Web scraping is extracting data from websites. It is a form of copying, in which specific data is gathered and copied from the web into a central local database or spreadsheet for later analysis or retrieval.
Since YouTube is the biggest video-sharing website on the internet, extracting data from it can be very helpful: you can find the most popular channels, keep track of channel popularity, record likes, dislikes, and views on videos, and much more. In this tutorial, you will learn how to extract data from YouTube videos using requests_html and BeautifulSoup in Python.
Note that this method isn't a reliable way to extract YouTube data: YouTube keeps changing its page code, so the code here may stop working at any time. For more reliable use, I suggest you use the YouTube API instead.
Related: How to Extract YouTube Comments in Python.
Installing required dependencies:
pip3 install requests_html bs4
Before we dive into the script, we need to experiment with how to extract such data from websites using BeautifulSoup. Open up a Python interactive shell and write these lines of code:
from requests_html import HTMLSession
from bs4 import BeautifulSoup as bs # importing BeautifulSoup
# sample youtube video url
video_url = "https://www.youtube.com/watch?v=jNQXAC9IVRw"
# init an HTML Session
session = HTMLSession()
# get the html content
response = session.get(video_url)
# execute JavaScript
response.html.render(sleep=1)
# create bs object to parse HTML
soup = bs(response.html.html, "html.parser")
# write all HTML code into a file (the with block closes the file for us)
with open("video.html", "w", encoding="utf8") as f:
    f.write(response.html.html)
This will create a new HTML file in the current directory. Open it in a browser to see how requests_html and BeautifulSoup see the YouTube video web page.
Once you open it in your browser, right-click the element you want to extract and choose Inspect. For instance, when I did that for the video title, I saw that the title is inside an h1 tag.
The number of video views is inside a span tag that has the class view-count.
Great, now let's try to extract title and number of views in Python:
In [10]: soup.find("h1").text
Out[10]: 'Me at the zoo'
It's as easy as that. Now the number of views:
In [11]: soup.find("span", attrs={"class": "view-count"}).text
Out[11]: '106,600,098 views'
Awesome, now let's convert this text to an integer in Python:
In [12]: int(''.join([ c for c in soup.find("span", attrs={"class": "view-count"}).text if c.isdigit() ]))
Out[12]: 106600098
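The digit filter above can be wrapped in a small helper so it is easy to reuse. The following is only a sketch: the name parse_count is hypothetical and not part of the original script, and it assumes the count text contains nothing but digits, separators such as commas, and a label like "views".

```python
import re


def parse_count(text):
    """Return the digits found in `text` as an int."""
    # parse_count is a hypothetical helper name, not a library function
    digits = re.sub(r"\D", "", text)  # drop every character that is not a digit
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


print(parse_count("106,600,098 views"))
```

Raising an error when no digits are found makes a markup change on YouTube's side show up immediately instead of silently producing a wrong number.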
This way, you can extract everything you want from that web page. Now let's write a script that extracts all the information we can get from a YouTube video page. Open up a new Python file and follow along:
Importing necessary modules:
from requests_html import HTMLSession
from bs4 import BeautifulSoup as bs
Before we write the function that extracts all the video data, let's initialize our HTTP session:
# init session
session = HTMLSession()
Let's make a function that, given the URL of a YouTube video, returns all the data in a dictionary:
def get_video_info(url):
    # download HTML code
    response = session.get(url)
    # execute JavaScript
    response.html.render(sleep=1)
    # create beautiful soup object to parse HTML
    soup = bs(response.html.html, "html.parser")
    # open("index.html", "w").write(response.html.html)
    # initialize the result
    result = {}
Notice that after downloading the HTML content of the web page, we call the render() method to execute JavaScript, so that the data we're looking for is rendered into the HTML.
Retrieving the video title:
    # video title
    result["title"] = soup.find("h1").text.strip()
Number of views converted to an integer:
    # video views (converted to integer)
    result["views"] = int(''.join([ c for c in soup.find("span", attrs={"class": "view-count"}).text if c.isdigit() ]))
Get the video description:
    # video description
    result["description"] = soup.find("yt-formatted-string", {"class": "content"}).text
The video description is located in the yt-formatted-string HTML tag that has the class attribute content, so hopefully we'll be able to extract it using the above code.
The date when the video was published:
    # date published
    result["date_published"] = soup.find("div", {"id": "date"}).text[1:]
The duration of the video:
    # get the duration of the video
    result["duration"] = soup.find("span", {"class": "ytp-time-duration"}).text
We can also extract the video tags:
    # get the video tags
    result["tags"] = ', '.join([ meta.attrs.get("content") for meta in soup.find_all("meta", {"property": "og:video:tag"}) ])
The number of likes and dislikes, as the text shown on the page:
    # number of likes
    text_yt_formatted_strings = soup.find_all("yt-formatted-string", {"id": "text", "class": "ytd-toggle-button-renderer"})
    result["likes"] = text_yt_formatted_strings[0].text
    # number of dislikes
    result["dislikes"] = text_yt_formatted_strings[1].text
Since in a YouTube video, you can see the channel details, such as the name, and number of subscribers, let's grab that as well:
    # channel details
    channel_tag = soup.find("yt-formatted-string", {"class": "ytd-channel-name"}).find("a")
    # channel name
    channel_name = channel_tag.text
    # channel URL
    channel_url = f"https://www.youtube.com{channel_tag['href']}"
    # number of subscribers as str
    channel_subscribers = soup.find("yt-formatted-string", {"id": "owner-sub-count"}).text.strip()
    result['channel'] = {'name': channel_name, 'url': channel_url, 'subscribers': channel_subscribers}
    return result
Since soup.find() returns a Tag object, you can still search for HTML tags within other tags. As a result, it is common practice to call find() more than once.
This function now returns a lot of video information in a dictionary. Let's finish up our script:
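To see chained find() calls in isolation, here is a small sketch that runs on a hand-written HTML snippet instead of a live page. The snippet is only a stand-in for the real markup, which is larger and may differ, and find_text is a hypothetical helper (not a BeautifulSoup function) that returns a default value when a tag is missing instead of raising an AttributeError.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for part of a video page; the real markup is an assumption
SAMPLE_HTML = """
<yt-formatted-string class="ytd-channel-name">
  <a href="/channel/UC4QobU6STFB0P71PMvOGN5A">jawed</a>
</yt-formatted-string>
"""


def find_text(parent, name, attrs=None, default=None):
    """Return the stripped text of the first match, or `default` if none."""
    # find_text is a hypothetical helper, not part of BeautifulSoup
    tag = parent.find(name, attrs or {})
    return tag.text.strip() if tag is not None else default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# The first find() returns a Tag, so a second find() can search inside it
channel = soup.find("yt-formatted-string", {"class": "ytd-channel-name"})
print(find_text(channel, "a"))
# This span does not exist in the snippet, so the default is returned
print(find_text(soup, "span", {"class": "view-count"}, default="n/a"))
```

Because YouTube changes its markup often, a fallback like this can keep one missing element from crashing the whole extraction.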
if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description="YouTube Video Data Extractor")
    parser.add_argument("url", help="URL of the YouTube video")
    args = parser.parse_args()
    url = args.url
    # get the data
    data = get_video_info(url)
    # print in nice format
    print(f"Title: {data['title']}")
    print(f"Views: {data['views']}")
    print(f"Published at: {data['date_published']}")
    print(f"Video Duration: {data['duration']}")
    print(f"Video tags: {data['tags']}")
    print(f"Likes: {data['likes']}")
    print(f"Dislikes: {data['dislikes']}")
    print(f"\nDescription: {data['description']}\n")
    print(f"\nChannel Name: {data['channel']['name']}")
    print(f"Channel URL: {data['channel']['url']}")
    print(f"Channel Subscribers: {data['channel']['subscribers']}")
Nothing special here. Since we need a way to retrieve the video URL from the command line, the code above does just that and then prints the extracted data in a readable format. Here is my output when running the script:
C:\youtube-extractor>python extract_video_info.py https://www.youtube.com/watch?v=jNQXAC9IVRw
Title: Me at the zoo
Views: 106602383
Published at: 23/04/2005
Video Duration: 0:18
Video tags: me at the zoo, jawed karim, first youtube video
Likes: 3825489
Dislikes: 111818
Description: The first video on YouTube. Maybe it's time to go back to the zoo?
NEW VIDEO LIVE! https://www.youtube.com/watch?v=dQw4w...
== Ok, new video as soon as 10M subscriberz! ==
Channel Name: jawed
Channel URL: https://www.youtube.com/channel/UC4QobU6STFB0P71PMvOGN5A
Channel Subscribers: 1.03M
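In the output above, the subscriber count is still the abbreviated text from the page (1.03M). If you need a number instead, one possible approach is sketched below. The helper name parse_abbreviated_count is hypothetical, and the sketch assumes English-style K, M, and B suffixes. Since the page itself rounds the figure, the result is only an approximation of the real count.

```python
from decimal import Decimal

# Suffix multipliers; assumes English-style abbreviations such as "1.03M"
_MULTIPLIERS = {"K": 10**3, "M": 10**6, "B": 10**9}


def parse_abbreviated_count(text):
    """Convert strings like '1.03M' or '1.03M subscribers' to an int."""
    # hypothetical helper; not part of the original script
    token = text.strip().split()[0].replace(",", "")
    suffix = token[-1].upper()
    if suffix in _MULTIPLIERS:
        # Decimal avoids binary floating-point rounding surprises
        return int(Decimal(token[:-1]) * _MULTIPLIERS[suffix])
    return int(Decimal(token))


print(parse_abbreviated_count("1.03M"))
```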
That's it! If you want to extract YouTube comments as well, there is more work involved, and there is a separate tutorial for that.
Now you can extract not only YouTube video details; you can apply this skill to any website you want. If you want to extract Wikipedia pages, there is a tutorial for that! Or maybe you want to scrape weather data from Google? There is a tutorial for that as well.
Note: If the code in this tutorial doesn't work for you, please check out the tutorial on using the YouTube API instead.
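To give a rough idea of what the API route looks like, the sketch below only builds the request URL for the videos endpoint of the YouTube Data API v3; it does not send the request. The function names are hypothetical, YOUR_API_KEY is a placeholder for a key you would create yourself, and the choice of the snippet and statistics parts is an assumption about which fields you want.

```python
from urllib.parse import parse_qs, urlencode, urlparse

API_ENDPOINT = "https://www.googleapis.com/youtube/v3/videos"


def video_id_from_url(url):
    """Return the `v` query parameter of a watch URL."""
    # hypothetical helper; only handles URLs of the form .../watch?v=ID
    return parse_qs(urlparse(url).query)["v"][0]


def build_videos_request(url, api_key):
    """Build a request URL asking for a video's snippet and statistics."""
    params = {
        "part": "snippet,statistics",
        "id": video_id_from_url(url),
        "key": api_key,  # placeholder; use your own API key
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


print(build_videos_request("https://www.youtube.com/watch?v=jNQXAC9IVRw", "YOUR_API_KEY"))
```

Sending a GET request to the resulting URL with a valid key should return JSON rather than HTML, so there is no markup to break when the page design changes.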
Learn also: How to Convert HTML Tables into CSV Files in Python.
Happy Scraping ♥