How to Make an Email Extractor in Python

Abdou Rockikz · 03 dec 2019


An email extractor, or harvester, is software that extracts email addresses from online and offline sources to build a large list of addresses. Although these extractors can serve legitimate purposes such as marketing campaigns, they are unfortunately used mainly to send spam and phishing emails.

Since the web is nowadays the major source of information on the Internet, in this tutorial you will learn how to build such a tool in Python that extracts email addresses from web pages using the requests-html library.

Because many websites load their data using JavaScript instead of directly rendering HTML code, I chose the requests-html library, as it supports JavaScript-driven websites.

Related: How to Send Emails in Python using smtplib Module.

Alright, let's get started. First, we need to install requests-html:

pip3 install requests-html

Let's start coding:

import re
from requests_html import HTMLSession

We need the re module here because we will be extracting emails from HTML content using regular expressions. If you're not sure what a regular expression is, it is basically a sequence of characters that defines a search pattern.
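
To make the idea of a search pattern concrete, here is a minimal sketch with a deliberately tiny pattern (not the email pattern used in this tutorial). The sample text is made up for illustration:

```python
import re

# Hypothetical sample text, made up for this illustration.
text = "Order 66 shipped on day 12, order 7 is pending."

# \d+ is a pattern meaning "one or more digits in a row".
numbers = re.findall(r"\d+", text)
print(numbers)
```

The same mechanism scales up: an email pattern is just a longer description of which characters may appear before and after the @ sign.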

I've grabbed a widely used and fairly accurate regular expression for email addresses from this Stack Overflow answer:

url = "https://www.randomlists.com/email-addresses"
EMAIL_REGEX = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""

I know, it is very long, but it is the best pattern I have found so far for describing email addresses in a general way.

The url string is the URL we want to grab email addresses from. I'm using a website that generates random email addresses (and loads them using JavaScript).

Let's initiate the HTML session, which is a consumable session for cookie persistence and connection pooling:

# initiate an HTTP session
session = HTMLSession()

Now let's send the GET request to the URL:

# get the HTTP Response
r = session.get(url)

If you're sure that the website you're grabbing email addresses from uses JavaScript to load most of its data, then you need to execute the line of code below:

# for JavaScript-driven websites
r.html.render()

This reloads the page in Chromium and replaces the HTML content with an updated version in which the JavaScript has been executed. Of course, that takes some time, which is why you should only execute it if the website loads its data using JavaScript.

Note: Executing the render() method for the first time will automatically download Chromium for you, so the first run will take a while.

Now that we have the HTML content and our email address regular expression, let's extract the emails:

for re_match in re.finditer(EMAIL_REGEX, r.html.raw_html.decode()):
    print(re_match.group())

The re.finditer() function returns an iterator over all non-overlapping matches in the string. For each match, the iterator yields a match object, which is why we access the matched string (the email address) using the group() method.
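
The same address can appear several times on a page (for example, once in a mailto: link and once in the visible text), so you may want to drop duplicates while looping. Below is a rough sketch of that idea. It assumes a simplified pattern, SIMPLE_EMAIL_REGEX, and a made-up HTML snippet, neither of which comes from the code above. It also passes re.IGNORECASE, since the pattern only lists lowercase letters:

```python
import re

# Simplified pattern, an assumption for this sketch only; the EMAIL_REGEX
# used in this tutorial is far more thorough.
SIMPLE_EMAIL_REGEX = r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+"

# Made-up HTML snippet standing in for r.html.raw_html.decode().
html = (
    '<p>Contact: <a href="mailto:Alice@example.com">Alice@example.com</a> '
    "or bob@example.org</p>"
)

seen = set()   # addresses already collected
emails = []    # unique addresses, in order of first appearance
for re_match in re.finditer(SIMPLE_EMAIL_REGEX, html, flags=re.IGNORECASE):
    email = re_match.group().lower()  # normalize case before comparing
    if email not in seen:
        seen.add(email)
        emails.append(email)

print(emails)
```

Keeping a set next to the list preserves the order in which addresses first appear while still making the duplicate check cheap.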

Here is the result of my execution:

msherr@comcast.net
miyop@yahoo.ca
ardagna@yahoo.ca
tokuhirom@att.net
atmarks@comcast.net
isotopian@live.com
hoyer@msn.com
ozawa@yahoo.com
mchugh@outlook.com
sriha@outlook.com
monopole@sbcglobal.net
monopole@sbcglobal.net

Awesome, with only a few lines of code, we are able to grab email addresses from any web page we want!

You can extend this code to build a crawler that extracts all of a website's URLs, runs this on every page it finds, and saves the results to a file. Let us know what you did with this in the comments below!
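
As a starting point for that extension, here is one possible sketch of a small same-domain crawler. Everything in it is an assumption made for illustration rather than part of the code above: the names crawl_emails, LinkParser, and SIMPLE_EMAIL_REGEX are hypothetical, the pattern is a simplified stand-in for EMAIL_REGEX, and fetch is any function you supply that takes a URL and returns its HTML (or None on failure). The demo uses made-up in-memory pages so it runs without network access:

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Simplified pattern (an assumption); swap in the tutorial's EMAIL_REGEX.
SIMPLE_EMAIL_REGEX = re.compile(
    r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+", re.IGNORECASE
)


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_emails(start_url, fetch, max_pages=10):
    """Breadth-first crawl of same-domain pages; returns a set of emails.

    fetch is a hypothetical callable: URL -> HTML string, or None on failure.
    """
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    visited = set()
    emails = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        if not html:
            continue
        # Collect every address found on this page.
        emails.update(m.group().lower() for m in SIMPLE_EMAIL_REGEX.finditer(html))
        # Queue links that stay on the starting domain.
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href).split("#")[0]
            if urlparse(absolute).netloc == domain and absolute not in visited:
                queue.append(absolute)
    return emails


# Offline demo with made-up pages instead of real HTTP requests.
pages = {
    "https://example.com/": '<a href="/about">About</a> '
    '<a href="https://other.test/">Elsewhere</a> info@example.com',
    "https://example.com/about": '<a href="/">Home</a> '
    '<a href="mailto:team@example.com">Mail us</a>',
}
found = crawl_emails("https://example.com/", pages.get)
print(sorted(found))
```

In a real run, fetch could be a small wrapper around session.get(url) that returns the page's HTML, and the resulting set could be written to a file one address per line. The max_pages limit is there so the crawl stops even on very large sites.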

Read Also: How to Build an XSS Vulnerability Scanner in Python.

Happy Scraping ♥
