How to Extract and Submit Web Forms from a URL using Python

Learn how you can scrape forms from web pages, as well as fill and submit them, using requests_html and BeautifulSoup in Python.


One of the most challenging tasks in web scraping is logging in automatically and extracting data from within your account on a website. In this tutorial, you will learn how to extract all forms from web pages, as well as fill and submit them, using the requests_html and BeautifulSoup libraries.

To get started, let's install them:

pip3 install requests_html bs4

Extracting Forms from Web Pages

Open up a new file, I'm calling it form_extractor.py:

# urljoin is used later to build absolute URLs for form submission
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from requests_html import HTMLSession

To start off, we need a way to make sure that after making requests to the target website, we're storing the cookies provided by that website, so we can persist the session:

# initialize an HTTP session
session = HTMLSession()
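
If you want to quickly confirm that cookies are being kept between requests, here is a small illustrative check (not part of the final script; the URL is just an example):

# make a request and inspect the cookies stored on the session
session.get("https://wikipedia.org")
print(session.cookies.get_dict())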

The session variable is now a consumable session with cookie persistence; we will use it everywhere in our code. Let's write a function that, given a URL, requests that page, extracts all HTML form tags from it, and returns them as a list:

def get_all_forms(url):
    """Returns all form tags found on a web page's `url` """
    # GET request
    res = session.get(url)
    # for javascript driven website
    # res.html.render()
    soup = BeautifulSoup(res.html.html, "html.parser")
    return soup.find_all("form")

You may notice that I commented out the res.html.render() line. It executes JavaScript before anything is extracted, since some websites load their content dynamically with JavaScript; uncomment it if you suspect the target website loads its forms that way.
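
As a rough sketch, enabling JavaScript rendering inside get_all_forms() could look like this (the sleep and timeout values are just illustrative):

    # GET request
    res = session.get(url)
    # execute JavaScript first; this downloads Chromium on the first run
    res.html.render(sleep=1, timeout=20)
    soup = BeautifulSoup(res.html.html, "html.parser")
    return soup.find_all("form")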

So the above function can extract all forms from a web page, but we also need a way to extract each form's details, such as its inputs, its method (GET, POST, DELETE, etc.), and its action (the target URL for form submission). The function below does that:

def get_form_details(form):
    """Returns the HTML details of a form,
    including action, method and list of form controls (inputs, etc)"""
    details = {}
    # get the form action (requested URL); default to an empty string if missing
    action = form.attrs.get("action", "").lower()
    # get the form method (POST, GET, DELETE, etc)
    # if not specified, GET is the default in HTML
    method = form.attrs.get("method", "get").lower()
    # get all form inputs
    inputs = []
    for input_tag in form.find_all("input"):
        # get type of input form control
        input_type = input_tag.attrs.get("type", "text")
        # get name attribute
        input_name = input_tag.attrs.get("name")
        # get the default value of that input tag
        input_value = input_tag.attrs.get("value", "")
        # add everything to that list
        inputs.append({"type": input_type, "name": input_name, "value": input_value})
    # put everything to the resulting dictionary
    details["action"] = action
    details["method"] = method
    details["inputs"] = inputs
    return details
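
Note that this function only looks at <input> tags. If a form you're targeting also uses <select> or <textarea> controls, a hedged extension (not part of the original function) could be added right after the input loop inside get_form_details():

    # hedged extension: also capture <select> controls and their options
    for select_tag in form.find_all("select"):
        options = [option.attrs.get("value", option.text) for option in select_tag.find_all("option")]
        inputs.append({"type": "select", "name": select_tag.attrs.get("name"), "values": options})
    # hedged extension: also capture <textarea> controls and their default text
    for textarea_tag in form.find_all("textarea"):
        inputs.append({"type": "textarea", "name": textarea_tag.attrs.get("name"), "value": textarea_tag.text})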

Now let's try out these functions before we dive into submitting forms:

url = "https://wikipedia.org"
# get all form tags
forms = get_all_forms(url)
# iterate over forms
for i, form in enumerate(forms, start=1):
    form_details = get_form_details(form)
    print("="*50, f"form #{i}", "="*50)
    print(form_details)

I've used enumerate() just to number the extracted forms. Here is the output for the Wikipedia home page:

================================================== form #1 ==================================================
{'action': '//www.wikipedia.org/search-redirect.php',
 'inputs': [{'name': 'family', 'type': 'hidden', 'value': 'wikipedia'},
            {'name': 'language', 'type': 'hidden', 'value': 'en'},
            {'name': 'search', 'type': 'search', 'value': ''},
            {'name': 'go', 'type': 'hidden', 'value': 'Go'}],
 'method': 'get'}

As you can see, if you visit that page in your browser, you'll see a simple Wikipedia search box; that's why only one form appears here.

Learn also: How to Download All Images from a Web Page in Python.

Submitting Web Forms

You may also notice that most of the input fields extracted earlier have the hidden type; we're not interested in those. Instead, we need to fill the input whose name is "search" and whose type is "search", as that's the only field visible to a normal user. More generally, we look for any input field that is not hidden from the user.
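
As a quick, purely illustrative sanity check, you could print just the visible fields of the first form like this:

# list only the fields a user would actually see and fill in
for input_field in get_form_details(get_all_forms(url)[0])["inputs"]:
    if input_field["type"] not in ("hidden", "submit"):
        print("visible field:", input_field["name"], "of type", input_field["type"])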

First, since it's a single form, let's get it into a variable:

# get the first form
first_form = get_all_forms(url)[0]

Let's once again parse all form details as seen earlier:

# extract all form details
form_details = get_form_details(first_form)

Now, to make our code as flexible as possible (so it can run on any website), let's prompt the user of the script for the value to submit for each non-hidden input field:

# the data body we want to submit
data = {}
for input_tag in form_details["inputs"]:
    if input_tag["type"] == "hidden":
        # if it's hidden, use the default value
        data[input_tag["name"]] = input_tag["value"]
    elif input_tag["type"] != "submit":
        # all others except submit, prompt the user to set it
        value = input(f"Enter the value of the field '{input_tag['name']}' (type: {input_tag['type']}): ")
        data[input_tag["name"]] = value

So the above code uses the default value for hidden fields (such as a CSRF token) and prompts the user for the other input fields (such as search, email, text, and so on).
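
For the Wikipedia form shown earlier, the resulting data dictionary ends up looking roughly like this (the search value is whatever you type at the prompt; the rest come from the hidden defaults):

# purely illustrative; do not paste this into the script
# data = {'family': 'wikipedia', 'language': 'en', 'search': 'python programming language', 'go': 'Go'}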

Let's see how we can submit it based on the method:

# join the url with the action (form request URL)
url = urljoin(url, form_details["action"])

if form_details["method"] == "post":
    res = session.post(url, data=data)
elif form_details["method"] == "get":
    res = session.get(url, params=data)

I handled only GET and POST here, but you can extend this to other HTTP methods such as PUT and DELETE (using the session.put() and session.delete() methods, respectively).
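
For example, a minimal sketch of that extension (the fallback to GET is my own choice, not something the form itself specifies) could look like this:

# dispatch on the form method; anything unrecognized falls back to GET
method = form_details["method"]
if method == "post":
    res = session.post(url, data=data)
elif method == "put":
    res = session.put(url, data=data)
elif method == "delete":
    res = session.delete(url, data=data)
else:
    res = session.get(url, params=data)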

Alright, the res variable now holds the HTTP response; it should contain the web page the server sent back after form submission. Let's make sure it worked. The code below prepares the page's HTML content so we can save it to our local machine:

# the below code is only for replacing relative URLs with absolute ones
soup = BeautifulSoup(res.content, "html.parser")
# rewrite the URL-carrying attribute of each tag type (link/script/img/a) to an absolute URL
for tag_name, attribute in [("link", "href"), ("script", "src"), ("img", "src"), ("a", "href")]:
    for tag in soup.find_all(tag_name):
        if attribute in tag.attrs:
            tag.attrs[attribute] = urljoin(url, tag.attrs[attribute])

# write the page content to a file
with open("page.html", "w", encoding="utf-8") as f:
    f.write(str(soup))

All this does is replace relative URLs (such as /wiki/Programming_language) with absolute ones (such as https://www.wikipedia.org/wiki/Programming_language) so we can browse the page properly on our local machine. I've saved all the content into a local file, "page.html"; let's open it in the browser:

import webbrowser
# open the page on the default browser
webbrowser.open("page.html")  
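
Depending on your operating system and default browser, webbrowser.open() may be more reliable with an absolute file:// URL than with a relative path; a hedged alternative:

import os
import webbrowser
# build an absolute file:// URL so the browser can locate the saved page reliably
webbrowser.open("file://" + os.path.abspath("page.html"))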

Alright, the code is done. Here is how I executed it:

{'action': '//www.wikipedia.org/search-redirect.php',
 'inputs': [{'name': 'family', 'type': 'hidden', 'value': 'wikipedia'},
            {'name': 'language', 'type': 'hidden', 'value': 'en'},
            {'name': 'search', 'type': 'search', 'value': ''},
            {'name': 'go', 'type': 'hidden', 'value': 'Go'}],
 'method': 'get'}
Enter the value of the field 'search' (type: search): python programming language

This is basically the same as manually filling the form in the web browser:

Manually filling the form using the browser

After I hit enter in my code execution, this will submit the form, save the result page locally and automatically open it in the default web browser:

Resulted web page after form submission with Python

This is how Python saw the result, so we successfully submitted the search form and loaded the result page automatically, all with the help of Python!

Conclusion

Alright, that's it. In this tutorial, we performed a search on Wikipedia, but as mentioned earlier, you can use this on any form you want, especially login forms, where you can log in and then extract data that requires user authentication.

See how you can extend this. For instance, you can try to make a submitter for all forms (since we used only the first form here), or you can make a sophisticated crawler that extracts all website links and tries to find all forms of a particular website. However, keep in mind that a website can ban your IP address if you request a lot of pages within a short period of time. In that case, you can slow down your crawler or use a proxy.

If you have any other ideas of how you can extend this, don't hesitate to share them with us in the comments below!

Further learning: How to Convert HTML Tables into CSV Files in Python.

Happy Scraping ♥
