How to Access Wikipedia in Python

Abdou Rockikz · 11 Sep 2019 · 3 min read · Updated Oct 2019 · Web Scraping

Wikipedia is without a doubt the largest and most popular general reference work on the internet, and one of the most visited websites. All of its content is free. As a result, being able to access this huge amount of information from Python is very handy. In this tutorial, you will learn how to extract information from Wikipedia easily, without any hard work.

RELATED: How to Extract Weather Data in Python.

Note that we are not going to scrape Wikipedia pages manually; the wikipedia module has already done the tough work for us. Let's install it:

pip3 install wikipedia

Open up a Python interactive shell or an empty file and follow along.

Let's get the summary of what Python programming language is:

import wikipedia
# print the summary of what python is
print(wikipedia.summary("Python Programming Language"))

This will print the first few sentences; we can also specify the number of sentences to extract:

In [2]: wikipedia.summary("Python programming languag", sentences=2)
Out[2]: "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace."

Notice that I misspelled the query intentionally, yet it still returns an accurate result.
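
If you are curious what the library would correct a misspelled query to, it also provides a suggest() function. Here is a quick sketch; the exact suggestion returned depends on Wikipedia's search backend:

import wikipedia

# ask Wikipedia for a suggested title for the misspelled query
print(wikipedia.suggest("Python programming languag"))
# expected output: something like 'python programming language'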

Let's search for a term in Wikipedia:

In [3]: result = wikipedia.search("Neural networks")
In [4]: print(result)
['Neural network', 'Artificial neural network', 'Convolutional neural network', 'Recurrent neural network', 'Rectifier (neural networks)', 'Feedforward neural network', 'Neural circuit', 'Quantum neural network', 'Dropout (neural networks)', 'Types of artificial neural networks']
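
search() returns a list of matching page titles. If you only want the top few, you can limit how many come back; here is a small sketch assuming the module's results parameter:

import wikipedia

# only fetch the top 3 matching page titles
top_matches = wikipedia.search("Neural networks", results=3)
print(top_matches)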

Let's get the whole page for "Neural network", which is result[0]:

# get the page: Neural network
page = wikipedia.page(result[0])
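
Keep in mind that wikipedia.page() can raise an exception when the query is ambiguous or no page matches it. Here is a minimal, defensive sketch using the exception classes from wikipedia.exceptions:

import wikipedia

try:
    page = wikipedia.page(result[0])
except wikipedia.exceptions.DisambiguationError as e:
    # the query matched several pages; fall back to the first suggested option
    print("Ambiguous query, some options:", e.options[:5])
    page = wikipedia.page(e.options[0])
except wikipedia.exceptions.PageError:
    # no Wikipedia page matched the query at all
    print("Page not found for:", result[0])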

Getting the title of the page:

# get the title of the page
title = page.title

Getting all the categories of that wikipedia page:

# get the categories of the page
categories = page.categories

Extracting the text after removing all HTML tags (this is done automatically):

# get the whole wikipedia page text (content)
content = page.content

All links:

# get all the links in the page
links = page.links

The references:

# get the page references
references = page.references

Finally, the summary:

# summary
summary = page.summary

Let's print them out:

# print info
print("Page content:\n", content, "\n")
print("Page title:", title, "\n")
print("Categories:", categories, "\n")
print("Links:", links, "\n")
print("References:", references, "\n")
print("Summary:", summary, "\n")

Try it out!
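
As a bonus, here is a small sketch that bundles the above steps into one reusable function (the fetch_wiki_info name is just an example, not part of the library) and saves the article text to a file:

import wikipedia

def fetch_wiki_info(query):
    # search for the query and fetch the best-matching page
    best_match = wikipedia.search(query)[0]
    page = wikipedia.page(best_match)
    return {
        "title": page.title,
        "summary": page.summary,
        "categories": page.categories,
        "links": page.links,
        "references": page.references,
        "content": page.content,
    }

info = fetch_wiki_info("Neural networks")
# save the full article text to a text file named after the page title
with open(info["title"] + ".txt", "w", encoding="utf-8") as f:
    f.write(info["content"])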

Alright, we are done. I hope this was helpful and that you now know how to extract information from Wikipedia in Python. This can be useful if you want to automatically collect data for language models, build a question-answering chatbot, make a wrapper application around this library, and much more! The possibilities are endless.

If you are interested in extracting data from YouTube videos, here is a tutorial for that: How to Extract YouTube Data in Python.

Check the full code here, as well as the official documentation for this library.

Happy Coding ♥
