How to Extract All PDF Links in Python

Learn how you can extract links and URLs from PDF files with Python using the pikepdf and PyMuPDF libraries.
4 min read · Updated Aug 2020 · Web Scraping


Do you want to extract the URLs in a specific PDF file? If so, you're in the right place. In this tutorial, we will use the pikepdf and PyMuPDF libraries in Python to extract all links from PDF files.

We will use two methods to get links from a PDF file. The first extracts annotations: markups, notes and comments that you can click in a regular PDF reader to open the link in your browser. The second extracts all raw text and uses regular expressions to parse URLs.

To get started, let's install these libraries:

pip3 install pikepdf PyMuPDF

Method 1: Extracting URLs using Annotations

In this technique, we will use the pikepdf library to open a PDF file, iterate over the annotations of each page, and check whether each one carries a URL:

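import pikepdf # pip3 install pikepdf

file = "1810.04805.pdf"
# file = "1710.05006.pdf"
pdf_file = pikepdf.Pdf.open(file)
urls = []
# iterate over PDF pages
for page in pdf_file.pages:
    annots = page.get("/Annots")
    # pages without annotations have no /Annots entry
    if annots is None:
        continue
    for annot in annots:
        # not every annotation is a link, so the /A (action) entry may be missing
        action = annot.get("/A")
        if action is None:
            continue
        uri = action.get("/URI")
        if uri is not None:
            print("[+] URL Found:", uri)
            urls.append(uri)

print("[*] Total URLs extracted:", len(urls))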

I'm testing on this PDF file (1810.04805.pdf in the code above), but feel free to use any PDF file of your choice. Just make sure it has some clickable links.

After running that code, I get this output:

[+] URL Found: https://github.com/google-research/bert
[+] URL Found: https://github.com/google-research/bert
[+] URL Found: https://gluebenchmark.com/faq
[+] URL Found: https://gluebenchmark.com/leaderboard
...<SNIPPED>...
[+] URL Found: https://gluebenchmark.com/faq
[*] Total URLs extracted: 30

Awesome, we have successfully extracted 30 URLs from that PDF paper.
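
Note that the output above lists some links more than once, since the same URL can be attached to several annotations. If you only need each link once, a minimal sketch along these lines may help. The unique_urls helper and the sample list are hypothetical names used for illustration, and str() is applied on the assumption that the collected values may be pikepdf objects rather than plain Python strings:

```python
def unique_urls(urls):
    """Return the URLs in first-seen order with duplicates removed."""
    # str() is used in case the values are library objects, not plain strings
    return list(dict.fromkeys(str(u) for u in urls))

# sample values copied from the output above
found = [
    "https://github.com/google-research/bert",
    "https://github.com/google-research/bert",
    "https://gluebenchmark.com/faq",
    "https://gluebenchmark.com/leaderboard",
    "https://gluebenchmark.com/faq",
]
for url in unique_urls(found):
    print("[+] Unique URL:", url)
```

A dict is used instead of a set so that the links keep the order in which they appear in the document.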

Related: How to Extract All Website Links in Python.

Method 2: Extracting URLs using Regular Expressions

In this section, we will extract all raw text from our PDF file and then use regular expressions to parse URLs. First, let's get the text version of the PDF:

import fitz # pip install PyMuPDF
import re

# a regular expression of URLs
url_regex = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=\n]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
# extract raw text from pdf
file = "1810.04805.pdf"
# file = "1710.05006.pdf"
# open the PDF file
with fitz.open(file) as pdf:
    text = ""
    for page in pdf:
        # extract text of each PDF page
        # (get_text() is the current name; older PyMuPDF releases used getText())
        text += page.get_text()

Now text holds the string we want to search for URLs. Let's use the re module to parse them:

urls = []
# extract all urls using the regular expression
for match in re.finditer(url_regex, text):
    url = match.group()
    print("[+] URL Found:", url)
    urls.append(url)
print("[*] Total URLs extracted:", len(urls))

Output:

[+] URL Found: https://github.com/
[+] URL Found: https://github.com/tensor
[+] URL Found: http://nlp.seas.harvard.edu/2018/04/03/attention.html
[+] URL Found: https://gluebenchmark.com/faq.
[+] URL Found: https://gluebenchmark.com/leaderboard).
[+] URL Found: https://gluebenchmark.com/leaderboard
[+] URL Found: https://cloudplatform.googleblog.com/2018/06/Cloud-
[+] URL Found: https://gluebenchmark.com/
[+] URL Found: https://gluebenchmark.com/faq
[*] Total URLs extracted: 9
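
Two of the matches above end with sentence punctuation that the regular expression swallowed ("faq." and "leaderboard)."). A small cleanup step can trim that. The following is only a sketch under that assumption: clean_url is a hypothetical helper, and it will also strip a trailing dot or question mark from the rare URL that really ends with one.

```python
def clean_url(url):
    """Trim trailing punctuation that likely belongs to the sentence, not the URL."""
    trailing = ".,;:!?"
    url = url.rstrip(trailing)
    # drop a closing parenthesis only when it has no opening partner in the URL
    while url.endswith(")") and url.count(")") > url.count("("):
        url = url[:-1].rstrip(trailing)
    return url

# sample values copied from the output above
for raw in ["https://gluebenchmark.com/faq.", "https://gluebenchmark.com/leaderboard)."]:
    print("[+] Cleaned URL:", clean_url(raw))
```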

Conclusion

This time we extract only 9 URLs from that same PDF file. This doesn't mean the second method is inaccurate: it parses only the URLs that appear in text form, whether or not they are clickable.

However, this method has a problem: a URL that wraps onto a new line (\n) gets cut off at the line break, as with https://github.com/tensor in the output above, so you may want to allow \n in the path part of the url_regex expression.
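
One way to do that is sketched below. It uses a variant of the pattern that also accepts \n in the path portion and then removes the line breaks from each match. The names url_regex_multiline and extract_wrapped_urls, as well as the sample text, are made up for illustration. Treat it as an experiment rather than a fix: when a URL ends exactly at a line break, this variant will glue the first word of the next line onto it.

```python
import re

# variant of the URL pattern that also allows "\n" in the path portion (assumption:
# a line break inside a URL is a wrap, not the end of the URL)
url_regex_multiline = re.compile(
    r"https?://(www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b"
    r"([-a-zA-Z0-9()@:%_+.~#?&/=\n]*)"
)

def extract_wrapped_urls(text):
    """Find URLs that may span a line break and join the pieces back together."""
    return [m.group().replace("\n", "") for m in url_regex_multiline.finditer(text)]

# made-up sample text with a URL wrapped across two lines
sample = "Code is at https://github.com/tensor\nflow/tensor2tensor and more."
print(extract_wrapped_urls(sample))
```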

To conclude, if you want the clickable URLs, use the first method, which is preferable. If you want the URLs that appear in text form, the second method may help you do that!

If you want to extract tables or images from PDF files, there are tutorials for that as well.

Learn also: How to Make an Email Extractor in Python.

Happy Coding ♥
