How to Extract YouTube Data in Python

Scraping YouTube videos and extracting useful video information such as title, total views, publish date, video duration, tags, likes and dislikes and more in Python using requests_html and Beautiful Soup libraries.
  · 12 min read · Updated sep 2021 · Web Scraping


Web scraping is extracting data from websites. It is a form of copying, in which specific data is gathered and copied from the web into a central local database or spreadsheet for later analysis or retrieval.

Since YouTube is the biggest video sharing website in the internet, extracting data from it can be very helpful, you can find the most popular channels, keeping track on the popularity of channels, recording likes, dislikes and views on videos and much more. In this tutorial, you will learn how to extract data from YouTube videos using requests_html and BeautifulSoup in Python.

Note that it isn't reliable to use this method to extract YouTube data, as YouTube keeps changing their code, the code can not work in any time. Therefore, for more reliable use, I suggest you use YouTube API for extracting data instead.

Related: How to Extract YouTube Comments in Python.

Installing required dependencies:

pip3 install requests_html bs4

Before we dive into the quick script, we gonna need to experiment on how to extract such data from websites using BeautifulSoup, open up a Python interactive shell and write these lines of code:

from requests_html import HTMLSession 
from bs4 import BeautifulSoup as bs # importing BeautifulSoup

# sample youtube video url
video_url = "https://www.youtube.com/watch?v=jNQXAC9IVRw"
# init an HTML Session
session = HTMLSession()
# get the html content
response = session.get(video_url)
# execute Java-script
response.html.render(sleep=1)
# create bs object to parse HTML
soup = bs(response.html.html, "html.parser")

This makes a request to that YouTube video URL, renders the Javascript and finally creates the BeatifulSoup object wrapping the resulting HTML.

Great, now let's try to find all meta tags in the page:

In [10]: soup.find_all("meta")
Out[10]: 
[<meta content="IE=edge" http-equiv="X-UA-Compatible"/>,
 <meta content="rgba(255,255,255,0.98)" name="theme-color"/>,
 <meta content="Me at the zoo" name="title"/>,
 <meta content="The first video on YouTube. While you wait for Part 2, listen to this great song: https://www.youtube.com/watch?v=zj82_v2R6ts" name="description"/>,
 <meta content="me at the zoo, jawed karim, first youtube video" name="keywords"/>,
 <meta content="YouTube" property="og:site_name"/>,
 <meta content="https://www.youtube.com/watch?v=jNQXAC9IVRw" property="og:url"/>,
 <meta content="Me at the zoo" property="og:title"/>,
 <meta content="https://i.ytimg.com/vi/jNQXAC9IVRw/hqdefault.jpg" property="og:image"/>,
 <meta content="480" property="og:image:width"/>,
 <meta content="360" property="og:image:height"/>,
 <meta content="The first video on YouTube. While you wait for Part 2, listen to this great song: https://www.youtube.com/watch?v=zj82_v2R6ts" property="og:description"/>,
 <meta content="544007664" property="al:ios:app_store_id"/>,
 <meta content="YouTube" property="al:ios:app_name"/>,
 <meta content="vnd.youtube://www.youtube.com/watch?v=jNQXAC9IVRw&amp;feature=applinks" property="al:ios:url"/>,
 <meta content="vnd.youtube://www.youtube.com/watch?v=jNQXAC9IVRw&amp;feature=applinks" property="al:android:url"/>,
 <meta content="http://www.youtube.com/watch?v=jNQXAC9IVRw&amp;feature=applinks" property="al:web:url"/>,
 <meta content="video.other" property="og:type"/>,
 <meta content="https://www.youtube.com/embed/jNQXAC9IVRw" property="og:video:url"/>,
 <meta content="https://www.youtube.com/embed/jNQXAC9IVRw" property="og:video:secure_url"/>,
 <meta content="text/html" property="og:video:type"/>,
 <meta content="480" property="og:video:width"/>,
 <meta content="360" property="og:video:height"/>,
 <meta content="YouTube" property="al:android:app_name"/>,
 <meta content="com.google.android.youtube" property="al:android:package"/>,
 <meta content="me at the zoo" property="og:video:tag"/>,
 <meta content="jawed karim" property="og:video:tag"/>,
 <meta content="first youtube video" property="og:video:tag"/>,
 <meta content="87741124305" property="fb:app_id"/>,
 <meta content="player" name="twitter:card"/>,
 <meta content="@youtube" name="twitter:site"/>,
 <meta content="https://www.youtube.com/watch?v=jNQXAC9IVRw" name="twitter:url"/>,
 <meta content="Me at the zoo" name="twitter:title"/>,
 <meta content="The first video on YouTube. While you wait for Part 2, listen to this great song: https://www.youtube.com/watch?v=zj82_v2R6ts" name="twitter:description"/>,
 <meta content="https://i.ytimg.com/vi/jNQXAC9IVRw/hqdefault.jpg" name="twitter:image"/>,
 <meta content="YouTube" name="twitter:app:name:iphone"/>,
 <meta content="544007664" name="twitter:app:id:iphone"/>,
 <meta content="YouTube" name="twitter:app:name:ipad"/>,
 <meta content="544007664" name="twitter:app:id:ipad"/>,
 <meta content="vnd.youtube://www.youtube.com/watch?v=jNQXAC9IVRw&amp;feature=applinks" name="twitter:app:url:iphone"/>,
 <meta content="vnd.youtube://www.youtube.com/watch?v=jNQXAC9IVRw&amp;feature=applinks" name="twitter:app:url:ipad"/>,
 <meta content="YouTube" name="twitter:app:name:googleplay"/>,
 <meta content="com.google.android.youtube" name="twitter:app:id:googleplay"/>,
 <meta content="https://www.youtube.com/watch?v=jNQXAC9IVRw" name="twitter:app:url:googleplay"/>,
 <meta content="https://www.youtube.com/embed/jNQXAC9IVRw" name="twitter:player"/>,
 <meta content="480" name="twitter:player:width"/>,
 <meta content="360" name="twitter:player:height"/>,
 <meta content="Me at the zoo" itemprop="name"/>,
 <meta content="The first video on YouTube. While you wait for Part 2, listen to this great song: https://www.youtube.com/watch?v=zj82_v2R6ts" itemprop="description"/>,
 <meta content="False" itemprop="paid"/>,
 <meta content="UC4QobU6STFB0P71PMvOGN5A" itemprop="channelId"/>,
 <meta content="jNQXAC9IVRw" itemprop="videoId"/>,
 <meta content="PT0M19S" itemprop="duration"/>,
 <meta content="False" itemprop="unlisted"/>,
 <meta content="480" itemprop="width"/>,
 <meta content="360" itemprop="height"/>,
 <meta content="HTML5 Flash" itemprop="playerType"/>,
 <meta content="480" itemprop="width"/>,
 <meta content="360" itemprop="height"/>,
 <meta content="true" itemprop="isFamilyFriendly"/>,
 <meta content="AD,AE,AF,AG,AI,AL,AM,AO,AQ,AR,AS,AT,AU,AW,AX,AZ,BA,BB,BD,BE,BF,BG,BH,BI,BJ,BL,BM,BN,BO,BQ,BR,BS,BT,BV,BW,BY,BZ,CA,CC,CD,CF,CG,CH,CI,CK,CL,CM,CN,CO,CR,CU,CV,CW,CX,CY,CZ,DE,DJ,DK,DM,DO,DZ,EC,EE,EG,EH,ER,ES,ET,FI,FJ,FK,FM,FO,FR,GA,GB,GD,GE,GF,GG,GH,GI,GL,GM,GN,GP,GQ,GR,GS,GT,GU,GW,GY,HK,HM,HN,HR,HT,HU,ID,IE,IL,IM,IN,IO,IQ,IR,IS,IT,JE,JM,JO,JP,KE,KG,KH,KI,KM,KN,KP,KR,KW,KY,KZ,LA,LB,LC,LI,LK,LR,LS,LT,LU,LV,LY,MA,MC,MD,ME,MF,MG,MH,MK,ML,MM,MN,MO,MP,MQ,MR,MS,MT,MU,MV,MW,MX,MY,MZ,NA,NC,NE,NF,NG,NI,NL,NO,NP,NR,NU,NZ,OM,PA,PE,PF,PG,PH,PK,PL,PM,PN,PR,PS,PT,PW,PY,QA,RE,RO,RS,RU,RW,SA,SB,SC,SD,SE,SG,SH,SI,SJ,SK,SL,SM,SN,SO,SR,SS,ST,SV,SX,SY,SZ,TC,TD,TF,TG,TH,TJ,TK,TL,TM,TN,TO,TR,TT,TV,TW,TZ,UA,UG,UM,US,UY,UZ,VA,VC,VE,VG,VI,VN,VU,WF,WS,YE,YT,ZA,ZM,ZW" itemprop="regionsAllowed"/>,
 <meta content="172826227" itemprop="interactionCount"/>,
 <meta content="2005-04-23" itemprop="datePublished"/>,
 <meta content="2005-04-23" itemprop="uploadDate"/>,
 <meta content="Film &amp; Animation" itemprop="genre"/>]

Easy as that, a lot of useful data here. For example, we can get the video title by:

In [11]: soup.find("meta", itemprop="name")["content"]
Out[11]: 'Me at the zoo'

Or the number of views:

In [12]: soup.find("meta", itemprop="interactionCount")['content']
Out[12]: '172826227'

This way, you will be able to extract everything you want from that web page. Now let's make our script that extracts some useful information we can get from a YouTube video page, open up a new Python file and follow along:

Importing necessary modules:

from requests_html import HTMLSession
from bs4 import BeautifulSoup as bs

Before we make our function that extract all video data, let's initialize our HTTP session:

# init session
session = HTMLSession()

Let's make a function, given an URL of a YouTube video, it will return all the data in a dictionary:

def get_video_info(url):
    # download HTML code
    response = session.get(url)
    # execute Javascript
    response.html.render(sleep=1)
    # create beautiful soup object to parse HTML
    soup = bs(response.html.html, "html.parser")
    # open("index.html", "w").write(response.html.html)
    # initialize the result
    result = {}

Notice after we downloaded the HTML content of the web page, we ran render() method to execute Javascript, so that the data we're looking for, is rendered in the HTML.

Note that if you get timeout error, then you can simply add timeout parameter and set it to 60 seconds (default is 8 seconds) or something, like so:

response.html.render(sleep=1, timeout=60)

Retrieving the video title:

    # video title
    result["title"] = soup.find("meta", itemprop="name")['content']

Number of views converted to an integer:

    # video views (converted to integer)
    result["views"] = result["views"] = soup.find("meta", itemprop="interactionCount")['content']

Get the video description:

    # video description
    result["description"] = soup.find("meta", itemprop="description")['content']

The date when the video was published:

    # date published
    result["date_published"] = soup.find("meta", itemprop="datePublished")['content']

The duration of the video:

    # get the duration of the video
    result["duration"] = soup.find("span", {"class": "ytp-time-duration"}).text

We could get the duration from meta tag as previous fields, but it'll be in another format, such as PT0M19S which translates to 19 seconds or 00:19 in the format that is in the ytp-time-duration span tag.

We can also extract the video tags:

    # get the video tags
    result["tags"] = ', '.join([ meta.attrs.get("content") for meta in soup.find_all("meta", {"property": "og:video:tag"}) ])

The number of likes and dislikes as integers:

    # number of likes
    text_yt_formatted_strings = soup.find_all("yt-formatted-string", {"id": "text", "class": "ytd-toggle-button-renderer"})
    result["likes"] = ''.join([ c for c in text_yt_formatted_strings[0].attrs.get("aria-label") if c.isdigit() ])
    result["likes"] = 0 if result['likes'] == '' else int(result['likes'])
    # number of dislikes
    result["dislikes"] = ''.join([ c for c in text_yt_formatted_strings[1].attrs.get("aria-label") if c.isdigit() ])
    result['dislikes'] = 0 if result['dislikes'] == '' else int(result['dislikes'])

Since in a YouTube video, you can see the channel details, such as the name, and number of subscribers, let's grab that as well:

    # channel details
    channel_tag = soup.find("yt-formatted-string", {"class": "ytd-channel-name"}).find("a")
    # channel name
    channel_name = channel_tag.text
    # channel URL
    channel_url = f"https://www.youtube.com{channel_tag['href']}"
    # number of subscribers as str
    channel_subscribers = soup.find("yt-formatted-string", {"id": "owner-sub-count"}).text.strip()
    result['channel'] = {'name': channel_name, 'url': channel_url, 'subscribers': channel_subscribers}
    return result

Since soup.find() function returns a Tag object, you can still find HTML tags within other tags. As a result, It is a common practice to call find() more than once.

Now this function returns a lot of video information in a dictionary, let's finish up our script:

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description="YouTube Video Data Extractor")
    parser.add_argument("url", help="URL of the YouTube video")
    args = parser.parse_args()
    url = args.url
    # get the data
    data = get_video_info(url)
    # print in nice format
    print(f"Title: {data['title']}")
    print(f"Views: {data['views']}")
    print(f"Published at: {data['date_published']}")
    print(f"Video Duration: {data['duration']}")
    print(f"Video tags: {data['tags']}")
    print(f"Likes: {data['likes']}")
    print(f"Dislikes: {data['dislikes']}")
    print(f"\nDescription: {data['description']}\n")
    print(f"\nChannel Name: {data['channel']['name']}")
    print(f"Channel URL: {data['channel']['url']}")
    print(f"Channel Subscribers: {data['channel']['subscribers']}")

Nothing special here, since we need a way to retrieve the video URL from the command line, the above does just that, and then print it in a format, here is my output when running the script:

C:\youtube-extractor>python extract_video_info.py https://www.youtube.com/watch?v=jNQXAC9IVRw
Title: Me at the zoo
Views: 172639597
Published at: 2005-04-23
Video Duration: 0:18
Video tags: me at the zoo, jawed karim, first youtube video
Likes: 8188077
Dislikes: 191986

Description: The first video on YouTube. While you wait for Part 2, listen to this great song: https://www.youtube.com/watch?v=zj82_v2R6ts


Channel Name: jawed
Channel URL: https://www.youtube.com/channel/UC4QobU6STFB0P71PMvOGN5A
Channel Subscribers: 1.98M subscribers

Conclusion

This is it! You know how to extract data from HTML tags, then go on and add other fields such as video quality and others.

Now If you want to extract YouTube comments, there are a lot of things to do besides this, there is a separate tutorial for this.

You can not only extract YouTube video details, but you can also apply this skill to any website you want. If you want to extract Wikipedia pages, there is a tutorial for that! Or maybe you want to scrape weather data from Google? There is a tutorial for that as well.

Check the full code of this tutorial here.

Note: YouTube constantly changes the HTML structure of video pages. If the code of this tutorial doesn't work for you, please check out using YouTube API tutorial instead.

Want to Learn More about Web Scraping?

Finally, if you want to dig more into web scraping with different Python libraries, not just BeautifulSoup, the below courses will definitely be valuable for you:

Learn also: How to Convert HTML Tables into CSV Files in Python.

Happy Scraping ♥

View Full Code
Sharing is caring!



Read Also




Comment panel