Wikipedia is without doubt the largest and most popular general reference work on the internet, and one of the most visited websites overall. It features exclusively free content. Being able to access this vast amount of information from Python is therefore very handy. In this tutorial, you will learn how to extract information from Wikipedia with very little effort.
RELATED: How to Extract Weather Data in Python.
Note that we are not going to web scrape Wikipedia pages manually; the wikipedia module has already done the tough work for us. Let's install it:
pip3 install wikipedia
Open up a Python interactive shell or an empty file and follow along.
Let's get the summary of what the Python programming language is:
import wikipedia

# print the summary of what python is
print(wikipedia.summary("Python Programming Language"))
This will print the first few sentences of the summary. We can also specify the number of sentences to extract:
In : wikipedia.summary("Python programming languag", sentences=2)
Out: "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace."
Notice that I misspelled the query intentionally; it still gives me an accurate result.
Let's search for a term using Wikipedia's search:
In : result = wikipedia.search("Neural networks")
In : print(result)
['Neural network', 'Artificial neural network', 'Convolutional neural network', 'Recurrent neural network', 'Rectifier (neural networks)', 'Feedforward neural network', 'Neural circuit', 'Quantum neural network', 'Dropout (neural networks)', 'Types of artificial neural networks']
Let's get the whole page for "Neural network", the first item in the result list:
# get the page: Neural network
page = wikipedia.page(result[0])
Getting the title of the page:
# get the title of the page
title = page.title
Getting all the categories of that Wikipedia page:
# get the categories of the page
categories = page.categories
Extracting the text after removing all HTML tags (this is done automatically):
# get the whole wikipedia page text (content)
content = page.content

# get all the links in the page
links = page.links

# get the page references
references = page.references
Finally, the summary:
# summary
summary = page.summary
Let's print them out:
# print info
print("Page content:\n", content, "\n")
print("Page title:", title, "\n")
print("Categories:", categories, "\n")
print("Links:", links, "\n")
print("References:", references, "\n")
print("Summary:", summary, "\n")
Try it out!
Alright, we are done! This was a brief introduction to extracting information from Wikipedia in Python. It can be helpful if you want to automatically collect data for language models, build a question-answering chatbot, make a wrapper application around this module, and much more. The possibilities are endless!
If you are interested in extracting data from YouTube videos, here is a tutorial for that: How to Extract YouTube Data in Python.
Happy Coding ♥