How to Watermark PDF Files in Python

Learn how to add and remove watermarks to/from PDF files with PyPDF4 and reportlab libraries in Python.
  · 13 min read · Updated jun 2021 · PDF File Handling


Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1993 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems.

Based on the PostScript language, each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, vector graphics, raster images, and other information needed to display it.

The familiarity of PDF led to its fast and widespread adoption as a solution in the field of digital archiving. Because PDFs are more versatile than other file formats, the information they display is easily viewable from almost any operating system or device.

In this tutorial, you will learn how to watermark a PDF file or a folder containing a collection of PDF files using PyPDF4 and reportlab in Python.

First, let's install the necessary libraries:

$ pip install PyPDF4==1.27.0 reportlab==3.5.59

Starting with the code, let's import the libraries and define some configuration we'll need:

from PyPDF4 import PdfFileReader, PdfFileWriter
from PyPDF4.pdf import ContentStream
from PyPDF4.generic import TextStringObject, NameObject
from PyPDF4.utils import b_
import os
import argparse
from io import BytesIO
from typing import Tuple
# Import the reportlab library
from reportlab.pdfgen import canvas
# The size of the page supposedly A4
from reportlab.lib.pagesizes import A4
# The color of the watermark
from reportlab.lib import colors

PAGESIZE = A4
FONTNAME = 'Helvetica-Bold'
FONTSIZE = 40
# using colors module
# COLOR = colors.lightgrey
# or simply RGB
# COLOR = (190, 190, 190)
COLOR = colors.red
# The position attributes of the watermark
X = 250
Y = 10
# The rotation angle in order to display the watermark diagonally if needed
ROTATION_ANGLE = 45

Next, defining our first utility function:

def get_info(input_file: str):
    """
    Extracting the file info
    """
    # If PDF is encrypted the file metadata cannot be extracted
    with open(input_file, 'rb') as pdf_file:
        pdf_reader = PdfFileReader(pdf_file, strict=False)
        output = {
            "File": input_file, "Encrypted": ("True" if pdf_reader.isEncrypted else "False")
        }
        if not pdf_reader.isEncrypted:
            info = pdf_reader.getDocumentInfo()
            num_pages = pdf_reader.getNumPages()
            output["Author"] = info.author
            output["Creator"] = info.creator
            output["Producer"] = info.producer
            output["Subject"] = info.subject
            output["Title"] = info.title
            output["Number of pages"] = num_pages
    # To Display collected metadata
    print("## File Information ##################################################")
    print("\n".join("{}:{}".format(i, j) for i, j in output.items()))
    print("######################################################################")
    return True, output

The get_info() function collects the metadata of an input PDF file, the following attributes can be extracted: author, creator, producer, subject, title, and the number of pages.

It is worth noting that you cannot extract these attributes for an encrypted PDF file.

def get_output_file(input_file: str, output_file: str):
    """
    Check whether a temporary output file is needed or not
    """
    input_path = os.path.dirname(input_file)
    input_filename = os.path.basename(input_file)
    # If output file is empty -> generate a temporary output file
    # If output file is equal to input_file -> generate a temporary output file
    if not output_file or input_file == output_file:
        tmp_file = os.path.join(input_path, 'tmp_' + input_filename)
        return True, tmp_file
    return False, output_file

The above function returns a path for a temporary output file when no output file is specified, or in case the paths of the input and output files are equal.

def create_watermark(wm_text: str):
    """
    Creates a watermark template.
    """
    if wm_text:
        # Generate the output to a memory buffer
        output_buffer = BytesIO()
        # Default Page Size = A4
        c = canvas.Canvas(output_buffer, pagesize=PAGESIZE)
        # you can also add image instead of text
        # c.drawImage("logo.png", X, Y, 160, 160)
        # Set the size and type of the font
        c.setFont(FONTNAME, FONTSIZE)
        # Set the color
        if isinstance(COLOR, tuple):
            color = (c/255 for c in COLOR)
            c.setFillColorRGB(*color)
        else:
            c.setFillColor(COLOR)
        # Rotate according to the configured parameter
        c.rotate(ROTATION_ANGLE)
        # Position according to the configured parameter
        c.drawString(X, Y, wm_text)
        c.save()
        return True, output_buffer
    return False, None

This function performs the following:

  • Creates a watermark file and stores it in memory.
  • Apply the parameters defined earlier on our created canvas using reportlab.

Note that you can instead of using the drawString() method to write text, you can use drawImage() to draw an image, as written as a comment in the above function.

def save_watermark(wm_buffer, output_file):
    """
    Saves the generated watermark template to disk
    """
    with open(output_file, mode='wb') as f:
        f.write(wm_buffer.getbuffer())
    f.close()
    return True

save_watermark() saves the generated watermark template to a physical file in case you need to visualize it.

Now let's write the function that's responsible for adding a watermark to a given PDF file:

def watermark_pdf(input_file: str, wm_text: str, pages: Tuple = None):
    """
    Adds watermark to a pdf file.
    """
    result, wm_buffer = create_watermark(wm_text)
    if result:
        wm_reader = PdfFileReader(wm_buffer)
        pdf_reader = PdfFileReader(open(input_file, 'rb'), strict=False)
        pdf_writer = PdfFileWriter()
        try:
            for page in range(pdf_reader.getNumPages()):
                # If required to watermark specific pages not all the document pages
                if pages:
                    if str(page) not in pages:
                        continue
                page = pdf_reader.getPage(page)
                page.mergePage(wm_reader.getPage(0))
                pdf_writer.addPage(page)
        except Exception as e:
            print("Exception = ", e)
            return False, None, None
        return True, pdf_reader, pdf_writer

This function aims to merge the inputted PDF file with the generated watermark. It accepts the following parameters:

  • input_file: The path of the PDF file to watermark.
  • wm_text: The text to set as a watermark.
  • pages: The pages to watermark.

It performs the following:

  • Creates a watermark and stores it in the memory buffer.
  • Iterates throughout the pages of the input file and merge each of the selected pages with the watermark that is previously generated. The watermark acts like an overlay on top of the page.
  • Adds the resulting page to the pdf_writer object.
def unwatermark_pdf(input_file: str, wm_text: str, pages: Tuple = None):
    """
    Removes watermark from the pdf file.
    """
    pdf_reader = PdfFileReader(open(input_file, 'rb'), strict=False)
    pdf_writer = PdfFileWriter()
    for page in range(pdf_reader.getNumPages()):
        # If required for specific pages
        if pages:
            if str(page) not in pages:
                continue
        page = pdf_reader.getPage(page)
        # Get the page content
        content_object = page["/Contents"].getObject()
        content = ContentStream(content_object, pdf_reader)
        # Loop through all the elements page elements
        for operands, operator in content.operations:
            # Checks the TJ operator and replaces the corresponding string operand (Watermark text) with ''
            if operator == b_("Tj"):
                text = operands[0]
                if isinstance(text, str) and text.startswith(wm_text):
                    operands[0] = TextStringObject('')
        page.__setitem__(NameObject('/Contents'), content)
        pdf_writer.addPage(page)
    return True, pdf_reader, pdf_writer

The purpose of this function is to remove the watermark text from the PDF file. It accepts the following parameters:

  • input_file: The path of the PDF file to watermark.
  • wm_text: The text to set as a watermark.
  • pages: The pages to watermark.

It performs the following:

  • Iterates throughout the pages of the input file and grab the content of each page.
  • Using the grabbed content, it finds the operator TJ and replaces the string (watermark text) following this operator.
  • Adds the resulting page following the merge to the pdf_writer object.
def watermark_unwatermark_file(**kwargs):
    input_file = kwargs.get('input_file')
    wm_text = kwargs.get('wm_text')
    # watermark   -> Watermark
    # unwatermark -> Unwatermark
    action = kwargs.get('action')
    # HDD -> Temporary files are saved on the Hard Disk Drive and then deleted
    # RAM -> Temporary files are saved in memory and then deleted.
    mode = kwargs.get('mode')
    pages = kwargs.get('pages')
    temporary, output_file = get_output_file(
        input_file, kwargs.get('output_file'))
    if action == "watermark":
        result, pdf_reader, pdf_writer = watermark_pdf(
            input_file=input_file, wm_text=wm_text, pages=pages)
    elif action == "unwatermark":
        result, pdf_reader, pdf_writer = unwatermark_pdf(
            input_file=input_file, wm_text=wm_text, pages=pages)
    # Completed successfully
    if result:
        # Generate to memory
        if mode == "RAM":
            output_buffer = BytesIO()
            pdf_writer.write(output_buffer)
            pdf_reader.stream.close()
            # No need to create a temporary file in RAM Mode
            if temporary:
                output_file = input_file
            with open(output_file, mode='wb') as f:
                f.write(output_buffer.getbuffer())
            f.close()
        elif mode == "HDD":
            # Generate to a new file on the hard disk
            with open(output_file, 'wb') as pdf_output_file:
                pdf_writer.write(pdf_output_file)
            pdf_output_file.close()
            pdf_reader.stream.close()
            if temporary:
                if os.path.isfile(input_file):
                    os.replace(output_file, input_file)
                output_file = input_file

The above function accepts several parameters:

  • input_file: The path of the PDF file to watermark.
  • wm_text: The text to set as a watermark.
  • action: The action to perform whether to watermark or to un-watermark file.
  • mode: The location of the temporary file whether to memory or hard disk.
  • pages: The pages to watermark.

watermark_unwatermark_file() function calls the previously defined functions watermark_pdf() or unwatermark_pdf() depending on the chosen action.

Based on the selected mode, and if the output file has a similar path as the input file or no output file is specified, then a temporary file will be created in case the selected mode is HDD (hard disk drive).

Next, let's add the ability to add or remove watermark from a folder containing multiple PDF files:

def watermark_unwatermark_folder(**kwargs):
    """
    Watermarks all PDF Files within a specified path
    Unwatermarks all PDF Files within a specified path
    """
    input_folder = kwargs.get('input_folder')
    wm_text = kwargs.get('wm_text')
    # Run in recursive mode
    recursive = kwargs.get('recursive')
    # watermark   -> Watermark
    # unwatermark -> Unwatermark
    action = kwargs.get('action')
    # HDD -> Temporary files are saved on the Hard Disk Drive and then deleted
    # RAM -> Temporary files are saved in memory and then deleted.
    mode = kwargs.get('mode')
    pages = kwargs.get('pages')
    # Loop though the files within the input folder.
    for foldername, dirs, filenames in os.walk(input_folder):
        for filename in filenames:
            # Check if pdf file
            if not filename.endswith('.pdf'):
                continue
            # PDF File found
            inp_pdf_file = os.path.join(foldername, filename)
            print("Processing file:", inp_pdf_file)
            watermark_unwatermark_file(input_file=inp_pdf_file, output_file=None,
                                       wm_text=wm_text, action=action, mode=mode, pages=pages)
        if not recursive:
            break

This function loops throughout the files of the specified folder either recursively or not depending on the value of the recursive parameter, and processes these files one by one.

Next, let's make a utility function to check whether a path is a file path or a directory path:

def is_valid_path(path):
    """
    Validates the path inputted and checks whether it is a file path or a folder path
    """
    if not path:
        raise ValueError(f"Invalid Path")
    if os.path.isfile(path):
        return path
    elif os.path.isdir(path):
        return path
    else:
        raise ValueError(f"Invalid Path {path}")

Now that we have all the functions we need for this tutorial, let's make a final one for parsing command-line arguments:

def parse_args():
    """
    Get user command line parameters
    """
    parser = argparse.ArgumentParser(description="Available Options")
    parser.add_argument('-i', '--input_path', dest='input_path', type=is_valid_path,
                        required=True, help="Enter the path of the file or the folder to process")
    parser.add_argument('-a', '--action', dest='action', choices=[
                        'watermark', 'unwatermark'], type=str, default='watermark',
                        help="Choose whether to watermark or to unwatermark")
    parser.add_argument('-m', '--mode', dest='mode', choices=['RAM', 'HDD'], type=str,
                        default='RAM', help="Choose whether to process on the hard disk drive or in memory")
    parser.add_argument('-w', '--watermark_text', dest='watermark_text',
                        type=str, required=True, help="Enter a valid watermark text")
    parser.add_argument('-p', '--pages', dest='pages', type=tuple,
                        help="Enter the pages to consider e.g.: [2,4]")
    path = parser.parse_known_args()[0].input_path
    if os.path.isfile(path):
        parser.add_argument('-o', '--output_file', dest='output_file',
                            type=str, help="Enter a valid output file")
    if os.path.isdir(path):
        parser.add_argument('-r', '--recursive', dest='recursive', default=False, type=lambda x: (
            str(x).lower() in ['true', '1', 'yes']), help="Process Recursively or Non-Recursively")
    # To Porse The Command Line Arguments
    args = vars(parser.parse_args())
    # To Display The Command Line Arguments
    print("## Command Arguments #################################################")
    print("\n".join("{}:{}".format(i, j) for i, j in args.items()))
    print("######################################################################")
    return args

Below are the arguments defined:

  • input_path: A required parameter to input the path of the file or the folder to process, this parameter is associated with the is_valid_path() function previously defined.
  • action: The action to perform, which is either watermark or unwatermark the PDF file, default is to watermark.
  • mode: To specify the destination of the generated temporary file, whether to memory or hard disk.
  • watermark_text: The string to set as a watermark.
  • pages: The pages to watermark (e.g first page is [0], the second page and the fourth page is [1, 3], etc.). If not specified, then all pages.
  • output_file: The path of the output file.
  • recursive: Whether to process a folder recursively.

Now that we have everything, let's write the main code to execute based on parameters passed:

if __name__ == '__main__':
    # Parsing command line arguments entered by user
    args = parse_args()
    # If File Path
    if os.path.isfile(args['input_path']):
        # Extracting File Info
        get_info(input_file=args['input_path'])
        # Encrypting or Decrypting a File
        watermark_unwatermark_file(
            input_file=args['input_path'], wm_text=args['watermark_text'], action=args[
                'action'], mode=args['mode'], output_file=args['output_file'], pages=args['pages']
        )
    # If Folder Path
    elif os.path.isdir(args['input_path']):
        # Encrypting or Decrypting a Folder
        watermark_unwatermark_folder(
            input_folder=args['input_path'], wm_text=args['watermark_text'],
            action=args['action'], mode=args['mode'], recursive=args['recursive'], pages=args['pages']
        )

Related: How to Extract Images from PDF in Python.

Now let's test our program, if you open up a terminal window and type:

$ python pdf_watermarker.py --help

It'll show the parameters defined and their respective description:

usage: pdf_watermarker.py [-h] -i INPUT_PATH [-a {watermark,unwatermark}] [-m {RAM,HDD}] -w WATERMARK_TEXT [-p PAGES]

Available Options

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_PATH, --input_path INPUT_PATH
                        Enter the path of the file or the folder to process
  -a {watermark,unwatermark}, --action {watermark,unwatermark}
                        Choose whether to watermark or to unwatermark
  -m {RAM,HDD}, --mode {RAM,HDD}
                        Choose whether to process on the hard disk drive or in memory
  -w WATERMARK_TEXT, --watermark_text WATERMARK_TEXT
                        Enter a valid watermark text
  -p PAGES, --pages PAGES
                        Enter the pages to consider e.g.: [2,4]

Now let's watermark my CV as an example:

$ python pdf_watermarker.py -a watermark -i "CV Bassem Marji_Eng.pdf" -w "CONFIDENTIAL" -o CV_watermark.pdf

The following output will be produced:

## Command Arguments #################################################
input_path:CV Bassem Marji_Eng.pdf
action:watermark
mode:RAM
watermark_text:CONFIDENTIAL
pages:None
output_file:CV_watermark.pdf
######################################################################
## File Information ##################################################
File:CV Bassem Marji_Eng.pdf
Encrypted:False
Author:TMS User
Creator:Microsoft® Word 2013
Producer:Microsoft® Word 2013
Subject:None
Title:None
Number of pages:1
######################################################################

And that's how the output CV_watermark.pdf file looks like:

Watermarked PDF with PythonNow let's remove the watermark added:

$ python pdf_watermarker.py -a unwatermark -i "CV_watermark.pdf" -w "CONFIDENTIAL" -o CV.pdf

This time the watermark will be removed and saved into a new PDF file CV.pdf.

You can also set the -m and -p for mode and pages respectively. You can also watermark a list of PDF files located under a specific path:

$ python pdf_watermarker.py -i "C:\Scripts\Test" -a "watermark" -w "CONFIDENTIAL" -r False

Or removing the watermark from a bunch of PDF files:

$ python pdf_watermarker.py -i "C:\Scripts\Test" -a "unwatermark" -w "CONFIDENTIAL" -m HDD -p[0] -r False

Conclusion

In this tutorial, you've learned how to add and remove watermark from PDF files using reportlab and PyPDF4 libraries in Python. I hope this article helped you assimilate this cool feature.

Learn also: How to Extract All PDF Links in Python.

Happy coding ♥

View Full Code
Sharing is caring!



Read Also




Comment panel