How to Use Regular Expressions in Python

Learn how to use Python's built-in re module to apply several string matching techniques with functions like match, search, finditer and sub.
7 min read · Updated Jul 2020 · Python Standard Library


A regular expression is a special sequence of characters that forms a search pattern. It can be used to check whether a string contains a specified pattern, to extract all occurrences of that pattern, and much more.

Regular expressions are everywhere, from validating email addresses, passwords and date formats to powering search engines, so they are an essential skill for any developer, and most programming languages provide regex capabilities. In this tutorial, we will be using the re module in Python.

Here are the techniques we are going to cover:

  • Matching Strings
  • Search Method
  • Finding Multiple Matches
  • Replacing Matches

We won't be covering the basics of constructing regular expressions from scratch in this tutorial; instead, we'll focus on how you can use regex in Python effectively.

Matching Strings

To demonstrate the re.match() function, say you want to validate user passwords. For instance, you want to make sure the password they enter is at least 8 characters long and contains at least one digit. The following code does that:

import re # stands for regular expression 
# a regular expression for validating a password
match_regex = r"^(?=.*[0-9]).{8,}$"
# a list of example passwords
passwords = ["pwd", "password", "password1"]
for pwd in passwords:
    m = re.match(match_regex, pwd)
    print(f"Password: {pwd}, validate password strength: {bool(m)}")

match_regex is the regular expression responsible for validating the password criteria we mentioned earlier:

  • ^: Matches the start of the string.
  • (?=.*[0-9]): A lookahead that ensures the string contains at least one digit.
  • .{8,}: Ensures the string has at least 8 characters.
  • $: Matches the end of the string.

We then matched a list of passwords against it. Here is the output:

Password: pwd, validate password strength: False
Password: password, validate password strength: False
Password: password1, validate password strength: True

As expected, the check failed for the first two and succeeded for the last. The first password (pwd) has fewer than 8 characters and the second doesn't include a digit, whereas the third has at least 8 characters and contains a digit.

Note that we wrapped the result of re.match() in the built-in bool() function to get a boolean that indicates whether the string matches the pattern.
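As a minimal sketch of the same check packaged for reuse, the snippet below compiles the pattern once and compares the result against None instead of calling bool(). The names password_regex and is_strong are hypothetical, chosen only for this illustration.

```python
import re

# same pattern as above, compiled once so it can be reused (name is illustrative)
password_regex = re.compile(r"^(?=.*[0-9]).{8,}$")

def is_strong(password):
    """Return True if the password has at least 8 characters and one digit."""
    # match() returns a match object on success and None otherwise
    return password_regex.match(password) is not None

results = {pwd: is_strong(pwd) for pwd in ["pwd", "password", "password1"]}
print(results)
```

Comparing against None makes it explicit that a failed match is signalled by None rather than by an exception.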

Search Method

A good way to demonstrate the re.search() function is to search for a specific pattern in a string. In this section, we'll try to extract an IPv4 address from part of the output of the ipconfig command on Windows:

import re

# part of ipconfig output
example_text = """
Wireless LAN adapter Wi-Fi:
   Connection-specific DNS Suffix  . :
   Link-local IPv6 Address . . . . . : fe80::380e:9710:5172:caee%2
   IPv4 Address. . . . . . . . . . . : 192.168.1.100
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : 192.168.1.1
"""
# regex for IPv4 address
ip_address_regex = r"((25[0-5]|(2[0-4]|1[0-9]|[1-9]|)[0-9])(\.(?!$)|$)){4}"
# use re.search() method to get the match object
match = re.search(ip_address_regex, example_text)
print(match)

Don't worry too much about the ip_address_regex expression; it basically validates an IPv4 address (making sure that none of the four numbers exceeds 255).
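To get a feel for what the pattern accepts, here is a small sketch that runs it against a few standalone strings with re.match(), which anchors the check at the start of the string. The candidate strings are made up for illustration.

```python
import re

ip_address_regex = r"((25[0-5]|(2[0-4]|1[0-9]|[1-9]|)[0-9])(\.(?!$)|$)){4}"

# hypothetical test strings: one valid address, one with a number above 255,
# and one with only three numbers
candidates = ["192.168.1.1", "256.1.1.1", "192.168.1"]
checks = {c: bool(re.match(ip_address_regex, c)) for c in candidates}
print(checks)
```

Only the first candidate passes; the other two fail the range check and the four-number requirement respectively.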

We used re.search() in this case to search for a valid IPv4 address. Here is the output:

<_sre.SRE_Match object; span=(281, 292), match='192.168.1.1'>

re.search() returns a match object that holds the start and end indices of the match along with the matched string. In this case, it returned '192.168.1.1' as the matched string. You can use:

  • match.start() to get the index of the first character of the found pattern.
  • match.end() to get the index just after the last character of the found pattern.
  • match.span() to get both start and end as a tuple (start, end).
  • match.group() to get the actual string found.
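The methods above can be tried on a short, made-up string. This sketch uses a deliberately simplified pattern (digits separated by dots, with no range check), which is an assumption for illustration rather than the validating regex used earlier.

```python
import re

text = "Default Gateway: 192.168.1.1"
# simplified pattern: four dot-separated groups of digits, no 0-255 check
m = re.search(r"\d+\.\d+\.\d+\.\d+", text)

print(m.start())  # index of the first character of the match
print(m.end())    # index just after the last character of the match
print(m.span())   # (start, end) tuple
print(m.group())  # the matched text

# slicing with start() and end() gives back the matched string
print(text[m.start():m.end()] == m.group())
```

Because end() points just past the match, text[m.start():m.end()] is exactly the matched string.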

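Keep in mind that re.search() returns only the first match it finds. Here that happens to be the last address in the text: the $ in the pattern only lets an address match at the end of the string, so 192.168.1.100 and 255.255.255.0 are skipped. In the next section, we'll see how to extract multiple matches in a string.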

Finding Multiple Matches

We'll use the output of the same command (ipconfig), but this time we'll use a regular expression to match MAC addresses:

import re

# fake ipconfig output
example_text = """
Ethernet adapter Ethernet:
   Media State . . . . . . . . . . . : Media disconnected
   Physical Address. . . . . . . . . : 88-90-E6-28-35-FA
Ethernet adapter Ethernet 2:
   Physical Address. . . . . . . . . : 04-00-4C-4F-4F-60
   Autoconfiguration IPv4 Address. . : 169.254.204.56(Preferred)
Wireless LAN adapter Local Area Connection* 2:
   Media State . . . . . . . . . . . : Media disconnected
   Physical Address. . . . . . . . . : B8-21-5E-D3-66-98
Wireless LAN adapter Wi-Fi:
   Physical Address. . . . . . . . . : A0-00-79-AA-62-74
   IPv4 Address. . . . . . . . . . . : 192.168.1.101(Preferred)
   Default Gateway . . . . . . . . . : 192.168.1.1
"""
# regex for MAC address
mac_address_regex = r"([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})"
# iterate over matches and extract MAC addresses
extracted_mac_addresses = [m.group(0) for m in re.finditer(mac_address_regex, example_text)]
print(extracted_mac_addresses)

After defining the regular expression, we used the re.finditer() function to find all occurrences of MAC addresses in the given string.

Since finditer() returns an iterator of match objects, we used a list comprehension to extract only the found MAC addresses using group(0) (the entire match). Check out the output:

['88-90-E6-28-35-FA', '04-00-4C-4F-4F-60', 'B8-21-5E-D3-66-98', 'A0-00-79-AA-62-74']

Awesome, we have successfully extracted all MAC addresses in that string. In the next section, we'll see how to use regex to replace occurrences of the pattern in strings.
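Before moving on, here is one reason to prefer finditer() with group(0) for this pattern: because the regex contains capturing groups, re.findall() returns tuples of those groups instead of the whole match. The sketch below shows the difference on a single line taken from the example text.

```python
import re

mac_address_regex = r"([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})"
line = "Physical Address. . . . . . . . . : 88-90-E6-28-35-FA"

# findall() returns the captured groups, not the full match
groups_only = re.findall(mac_address_regex, line)
print(groups_only)

# finditer() yields match objects, so the full text and its span are available
found = [(m.group(0), m.span()) for m in re.finditer(mac_address_regex, line)]
print(found)
```

With findall(), only the last repetition of the first group and the final pair survive, so the full address is lost unless the groups are made non-capturing.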

Replacing Matches

If you have experience with web scraping, you may have come across a website that uses a service like Cloudflare to hide email addresses from email harvesting tools. In this section, we will do exactly that: given a string that contains email addresses, we will replace each address with a '[email protected]' token:

import re

# a basic regular expression for email matching
email_regex = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
# example text to test with
example_text = """
Subject: This is a text email!
From: John Doe <[email protected]>
Some text here!
===============================
Subject: This is another email!
From: Abdou Rockikz <[email protected]>
Some other text!
"""
# substitute any email found with [email protected]
print(re.sub(email_regex, "[email protected]", example_text))

We used the re.sub() function, which takes three arguments: the first is the regular expression (the pattern), the second is the replacement for every match found, and the third is the target string. Here is the output:

Subject: This is a text email!
From: John Doe <[email protected]>
Some text here!
===============================
Subject: This is another email!
From: Abdou Rockikz <[email protected]>
Some other text!

Great, as we expected, the re.sub() function returns the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in the string with the specified replacement (the second argument).
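Here is a self-contained sketch of this kind of substitution on a short string. The addresses are hypothetical placeholders on example domains, the pattern is a basic email regex assumed for this sketch, and the optional count argument of re.sub() limits how many occurrences are replaced.

```python
import re

# a basic email pattern, assumed for this sketch
email_regex = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
# hypothetical addresses used only for this illustration
text = "From: John <john@example.com>, Cc: jane.doe@example.org"

# replace every address
masked_all = re.sub(email_regex, "[email protected]", text)
print(masked_all)

# replace only the first (leftmost) address
masked_first = re.sub(email_regex, "[email protected]", text, count=1)
print(masked_first)
```

Leaving count at its default of 0 replaces all occurrences, which is the behavior used in the example above.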

Conclusion

Now you have the skills to use regular expressions in Python. Note that we didn't cover all the functions provided by the re module; there are other handy ones like split() and fullmatch(), so I highly encourage you to check Python's official documentation.
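As a brief sketch of the two functions just mentioned, re.split() splits a string wherever the pattern matches, and re.fullmatch() succeeds only if the entire string matches. The sample strings are made up for illustration.

```python
import re

# split on commas or semicolons, along with any whitespace that follows
parts = re.split(r"[,;]\s*", "match, search;finditer,  sub")
print(parts)

# fullmatch() requires the whole string to match the pattern
whole = re.fullmatch(r"[0-9]{4}", "2020")
partial = re.fullmatch(r"[0-9]{4}", "2020-07")
print(bool(whole), bool(partial))
```

Unlike re.match(), which only anchors at the start, re.fullmatch() rejects the second string because of the trailing characters.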

If you aren't sure how to build and construct regular expressions for your needs, you can either check the official documentation or this tutorial.


Learn also: How to Make an Email Extractor in Python.

Happy Coding ♥
