How to Perform YOLO Object Detection using OpenCV and PyTorch in Python

Using the state-of-the-art YOLOv3 model for real-time object detection, recognition and localization in Python with OpenCV and PyTorch.
Abdou Rockikz · 13 min read · Updated feb 2020 · Machine Learning · Computer Vision


Object detection is a task in computer vision and image processing that deals with detecting objects in images or videos. It is used in a wide variety of real-world applications, including video surveillance, self-driving cars, object tracking, etc.

For instance, for a car to be truly autonomous, it must identify and keep track of surrounding objects (such as cars, pedestrians and traffic lights). One of its main sources of information is the camera, whose images are processed with object detection. On top of that, the detection must run in real time, which requires a relatively fast method, so that the car can safely navigate the street.

In this tutorial, you will learn how you can perform object detection using the state-of-the-art technique YOLOv3 with OpenCV or PyTorch in Python.

YOLO (You Only Look Once) is a real-time object detection algorithm. It is a single deep convolutional neural network that splits the input image into a set of grid cells. Unlike image classification or face detection, each grid cell in the YOLO algorithm has an associated vector in the output that tells us:

  • If an object exists in that grid cell.
  • The class of that object (i.e. its label).
  • The predicted bounding box for that object (location).
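
To make the grid idea concrete, here is a small sketch that counts how many of these output vectors YOLOv3 produces for one image. It assumes the layout of the standard yolov3.cfg, which this tutorial does not spell out: three detection scales with strides 32, 16 and 8, and 3 predicted boxes per grid cell. The 416x416 input size is the one used later in this tutorial.

```python
# Sketch: how many prediction vectors YOLOv3 emits for one image.
# Assumptions (standard yolov3.cfg, not stated in this tutorial):
# three detection scales with strides 32, 16 and 8, and 3 boxes per grid cell.
INPUT_SIZE = 416        # network input width/height used later in this tutorial
STRIDES = (32, 16, 8)   # assumed downsampling factor of each detection scale
BOXES_PER_CELL = 3      # assumed number of boxes predicted per grid cell
NUM_CLASSES = 80        # number of COCO classes

def count_predictions(input_size=INPUT_SIZE):
    """Return (per-scale box counts, total box count)."""
    per_scale = []
    for stride in STRIDES:
        grid = input_size // stride                 # grid cells along one side
        per_scale.append(grid * grid * BOXES_PER_CELL)
    return per_scale, sum(per_scale)

per_scale, total = count_predictions()
# each vector holds 4 box values, 1 objectness score and one score per class
vector_length = 4 + 1 + NUM_CLASSES
print(per_scale, total, vector_length)   # [507, 2028, 8112] 10647 85
```

Under these assumptions, a single forward pass yields 10647 candidate boxes, which is why the confidence filtering and suppression steps later in the tutorial are needed.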

Other approaches, such as the R-CNN family (R-CNN, Fast R-CNN, Faster R-CNN), evaluate many candidate regions or windows of the image, which can require thousands of predictions for a single image. YOLO instead looks at the whole image in a single pass, and according to its authors this makes YOLOv3 about 1000x faster than R-CNN and 100x faster than Fast R-CNN.

At the time of writing, YOLO version 3 is the latest version of YOLO. It uses a few tricks to improve training and increase performance; check the full details in the YOLOv3 paper.

Getting Started

Before we dive into the code, let's install the required libraries for this tutorial (if you want to use the PyTorch code, head to this page for installation instructions):

pip3 install opencv-python numpy matplotlib

It is quite challenging to build the whole YOLOv3 system (the model and the techniques used) from scratch. Open source libraries such as Darknet and OpenCV have already built that for you, and there are also third-party projects for YOLOv3 (check this for a TensorFlow 2 implementation).

Importing required modules:

import cv2
import numpy as np

import time
import sys
import os

Let's define some variables and parameters that we are going to need:

CONFIDENCE = 0.5
SCORE_THRESHOLD = 0.5
IOU_THRESHOLD = 0.5

# the neural network configuration
config_path = "cfg/yolov3.cfg"
# the YOLO net weights file
weights_path = "weights/yolov3.weights"
# weights_path = "weights/yolov3-tiny.weights"

# loading all the class labels (objects)
labels = open("data/coco.names").read().strip().split("\n")
# generating colors for each object for later plotting
colors = np.random.randint(0, 255, size=(len(labels), 3), dtype="uint8")

We have initialized our parameters; we will talk about them later on. config_path and weights_path represent the model configuration (which is yolov3) and the corresponding pre-trained model weights, respectively. labels is the list of all class labels for the different objects to detect. We will draw each object class with a unique color, which is why we generated random colors.

Please refer to this repository for the required files. Since the weights file is huge (about 240MB), it isn't in the repository; please download it here.

The following code loads the model:

# load the YOLO network
net = cv2.dnn.readNetFromDarknet(config_path, weights_path)

Preparing the Image

Let's load an example image (the image is in the repository):

path_name = "images/street.jpg"
image = cv2.imread(path_name)
file_name = os.path.basename(path_name)
# split on the last dot only, so names containing several dots still unpack into two parts
filename, ext = file_name.rsplit(".", 1)

Next, we need to normalize, scale and reshape this image to be suitable as an input to the neural network:

h, w = image.shape[:2]
# create 4D blob
blob = cv2.dnn.blobFromImage(image, 1/255.0, (416, 416), swapRB=True, crop=False)

This will normalize the pixel values to the range 0 to 1, resize the image to (416, 416) and reshape it. Let's see:

print("image.shape:", image.shape)
print("blob.shape:", blob.shape)

Output:

image.shape: (1200, 1800, 3)
blob.shape: (1, 3, 416, 416)
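
To see where that 4D shape comes from, the sketch below reproduces the scaling, channel swap and reshaping steps with plain NumPy. It is only an approximation of what cv2.dnn.blobFromImage does, not its actual implementation: resizing is left out, and the helper name to_blob and the all-blue test image are made up for illustration.

```python
import numpy as np

def to_blob(image_bgr, scale=1 / 255.0):
    """Mimic the scaling, channel swap and reshape steps (resizing is omitted)."""
    rgb = image_bgr[:, :, ::-1]                  # BGR -> RGB, like swapRB=True
    scaled = rgb.astype(np.float32) * scale      # pixel values 0..255 -> 0..1
    chw = np.transpose(scaled, (2, 0, 1))        # (height, width, channels) -> (channels, height, width)
    return chw[np.newaxis, ...]                  # add the batch dimension in front

# hypothetical 416x416 image that is pure blue in OpenCV's BGR channel order
fake_image = np.zeros((416, 416, 3), dtype=np.uint8)
fake_image[:, :, 0] = 255

blob_sketch = to_blob(fake_image)
print(blob_sketch.shape)   # (1, 3, 416, 416)
```

The dimensions read as (batch size, channels, height, width). After the swap, the blue values end up in the last channel of the blob instead of the first.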

Making Predictions

Now let's feed this image into the neural network to get the output predictions:

# sets the blob as the input of the network
net.setInput(blob)
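# get the names of the output layers
# (flatten() copes with both the nested and the flat index format,
# depending on the installed OpenCV version)
ln = net.getLayerNames()
ln = [ln[i - 1] for i in np.array(net.getUnconnectedOutLayers()).flatten()]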
# feed forward (inference) and get the network output
# measure how much it took in seconds
start = time.perf_counter()
layer_outputs = net.forward(ln)
time_took = time.perf_counter() - start
print(f"Time took: {time_took:.2f}s")

This will extract the neural network output and print the total time the inference took:

Time took: 1.54s

Now you may be wondering why it isn't that fast. Isn't 1.5 seconds pretty slow? Well, we're using only the CPU for inference, which is not ideal for real-world problems; that's why we'll jump into PyTorch later in this tutorial. On the other hand, 1.5 seconds is relatively good compared to other techniques such as R-CNN.

You can also use the tiny version of YOLOv3, which is much faster but less accurate. You can download it here.

Now we need to iterate over the neural network outputs and discard any object whose confidence is lower than the CONFIDENCE parameter we specified earlier (i.e. 0.5 or 50%).

font_scale = 1
thickness = 1
boxes, confidences, class_ids = [], [], []
# loop over each of the layer outputs
for output in layer_outputs:
    # loop over each of the object detections
    for detection in output:
        # extract the class id (label) and confidence (as a probability) of
        # the current object detection
        scores = detection[5:]
        class_id = np.argmax(scores)
        confidence = scores[class_id]
        # discard out weak predictions by ensuring the detected
        # probability is greater than the minimum probability
        if confidence > CONFIDENCE:
            # scale the bounding box coordinates back relative to the
            # size of the image, keeping in mind that YOLO actually
            # returns the center (x, y)-coordinates of the bounding
            # box followed by the boxes' width and height
            box = detection[:4] * np.array([w, h, w, h])
            (centerX, centerY, width, height) = box.astype("int")
            # use the center (x, y)-coordinates to derive the top
            # and left corner of the bounding box
            x = int(centerX - (width / 2))
            y = int(centerY - (height / 2))
            # update our list of bounding box coordinates, confidences,
            # and class IDs
            boxes.append([x, y, int(width), int(height)])
            confidences.append(float(confidence))
            class_ids.append(class_id)

This will loop over all the predictions and only save the objects with high confidence. Let's see what the detection vector represents:

print(detection.shape)

Output:

(85,)

Each object prediction is a vector of 85 values. The first 4 values represent the location of the object: the (x, y) coordinates of the center point, followed by the width and height of the bounding box. The 5th value is the objectness score (how confident the network is that the box contains any object at all), which is why the code above skips it with detection[5:]. The remaining 80 values are the class scores, one for each label of the COCO dataset.

For instance, if the detected object is a person, the first value in that 80-length vector should be close to 1 and the remaining values close to 0; the 2nd value is for bicycle, the 3rd for car, all the way to the 80th class. That's why we're using the np.argmax() function to get the class id: it returns the index of the maximum value in that 80-length vector.
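
The layout described above can be illustrated on a toy vector. The following is a minimal sketch in plain Python that mirrors the decoding steps of the loop shown earlier. The helper name decode_detection and the shortened vector (3 classes instead of 80) are made up for this example.

```python
def decode_detection(detection, image_w, image_h):
    """Turn one YOLO output vector into (class_id, confidence, [x, y, w, h])."""
    cx, cy, bw, bh = detection[:4]        # box center and size, relative to the image (0..1)
    scores = detection[5:]                # skip index 4 (objectness), keep the class scores
    class_id = max(range(len(scores)), key=lambda i: scores[i])   # same idea as np.argmax
    confidence = scores[class_id]
    # scale the relative values back to pixels
    center_x, center_y = int(cx * image_w), int(cy * image_h)
    width, height = int(bw * image_w), int(bh * image_h)
    # derive the top-left corner from the center
    x = int(center_x - width / 2)
    y = int(center_y - height / 2)
    return class_id, confidence, [x, y, width, height]

# hypothetical vector: 4 box values, 1 objectness score, then only 3 class scores
toy_detection = [0.5, 0.5, 0.25, 0.5, 0.9, 0.05, 0.1, 0.85]
print(decode_detection(toy_detection, image_w=1800, image_h=1200))
# (2, 0.85, [675, 300, 450, 600])
```

Here the third class wins with a score of 0.85, and the box centered in a 1800x1200 image is converted to the top-left format that the drawing code expects.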

Drawing Detected Objects

Now that we have all we need, let's draw the object rectangles and labels and see the result:

# loop over the indexes we are keeping
for i in range(len(boxes)):
    # extract the bounding box coordinates
    x, y = boxes[i][0], boxes[i][1]
    w, h = boxes[i][2], boxes[i][3]
    # draw a bounding box rectangle and label on the image
    color = [int(c) for c in colors[class_ids[i]]]
    cv2.rectangle(image, (x, y), (x + w, y + h), color=color, thickness=thickness)
    text = f"{labels[class_ids[i]]}: {confidences[i]:.2f}"
    # calculate text width & height to draw the transparent boxes as background of the text
    (text_width, text_height) = cv2.getTextSize(text, cv2.FONT_HERSHEY_SIMPLEX, fontScale=font_scale, thickness=thickness)[0]
    text_offset_x = x
    text_offset_y = y - 5
    box_coords = ((text_offset_x, text_offset_y), (text_offset_x + text_width + 2, text_offset_y - text_height))
    overlay = image.copy()
    cv2.rectangle(overlay, box_coords[0], box_coords[1], color=color, thickness=cv2.FILLED)
    # add opacity (transparency to the box)
    image = cv2.addWeighted(overlay, 0.6, image, 0.4, 0)
    # now put the text (label: confidence %)
    cv2.putText(image, text, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX,
        fontScale=font_scale, color=(0, 0, 0), thickness=thickness)
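
The semi-transparent label background comes from the cv2.addWeighted() call, which computes a weighted sum of the two images. As a rough per-pixel sketch (not OpenCV's implementation; its exact rounding may differ, and the pixel values here are made up), the blend with weights 0.6 and 0.4 looks like this:

```python
def blend_pixel(overlay_px, image_px, alpha=0.6, beta=0.4, gamma=0):
    """Weighted sum of two pixels, clamped to the valid 0..255 range."""
    return tuple(
        min(255, max(0, round(o * alpha + i * beta + gamma)))
        for o, i in zip(overlay_px, image_px)
    )

filled_box_px = (255, 0, 0)      # hypothetical color of the filled label background
original_px = (100, 100, 100)    # hypothetical pixel of the photo underneath
print(blend_pixel(filled_box_px, original_px))   # (193, 40, 40)
```

Pixels outside the filled rectangle are identical in both images, so blending leaves them unchanged.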

Let's write the image:

cv2.imwrite(filename + "_yolo3." + ext, image)

A new image will appear in the current directory that labels each object detected with the confidence. However, look at this part of the image:

two bounding boxes for a single object

You guessed it: two bounding boxes for a single object. This is a problem, isn't it? Well, the creators of YOLO used a technique called Non-Maximal Suppression to eliminate this.

Non-Maximal Suppression

Non-Maximal Suppression is a technique that suppresses overlapping bounding boxes that do not have the maximum probability for object detection. It is mainly achieved in two phases:

  • It selects the bounding box with the highest confidence (i.e. probability).
  • It then compares all other bounding boxes with this selected bounding box and eliminates the ones that have a high IoU with it.

What is IoU

IoU (Intersection over Union) is a measure used in Non-Maximal Suppression to compare how close two different bounding boxes are. It is the area where the two boxes overlap divided by the total area they cover together, as demonstrated in the following figure:

Intersection Over Union

The higher the IoU, the closer the bounding boxes are. An IoU of 1 means that the two bounding boxes are identical, while an IoU of 0 means that they do not intersect at all.

As a result, we will be using an IoU threshold value of 0.5 (which we initialized at the beginning of this tutorial). It means that we eliminate any bounding box whose IoU with the maximal-probability bounding box is above this value.
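
As a minimal sketch, IoU can be computed in a few lines of plain Python. The function below assumes boxes in the same [x, y, width, height] format used for the boxes list in this tutorial. The function name and the sample boxes are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [x, y, width, height]."""
    ax1, ay1 = box_a[0], box_a[1]
    ax2, ay2 = ax1 + box_a[2], ay1 + box_a[3]
    bx1, by1 = box_b[0], box_b[1]
    bx2, by2 = bx1 + box_b[2], by1 + box_b[3]
    # width and height of the overlapping rectangle (0 if the boxes do not touch)
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union if union > 0 else 0.0

print(iou([0, 0, 10, 10], [0, 0, 10, 10]))     # 1.0 (identical boxes)
print(iou([0, 0, 10, 10], [20, 20, 10, 10]))   # 0.0 (no overlap)
print(round(iou([0, 0, 10, 10], [5, 0, 10, 10]), 3))   # 0.333 (half of each box overlaps)
```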

The SCORE_THRESHOLD will eliminate any bounding box that has the confidence below that value:

# perform the non maximum suppression given the scores defined before
idxs = cv2.dnn.NMSBoxes(boxes, confidences, SCORE_THRESHOLD, IOU_THRESHOLD)
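
To show what happens conceptually inside that call, here is a simplified greedy NMS written in plain Python. It is a sketch of the two phases described above, not OpenCV's actual implementation, and details such as how ties and boundary values are handled may differ. The function names and the sample boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two [x, y, width, height] boxes."""
    inter_w = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - intersection
    return intersection / union if union > 0 else 0.0

def greedy_nms(boxes, scores, score_threshold, iou_threshold):
    """Return the indexes of the boxes that survive suppression."""
    # drop low-score boxes, then visit the rest from most to least confident
    order = sorted(
        (i for i, s in enumerate(scores) if s > score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # keep a box only if it does not overlap too much with a box already kept
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

# hypothetical detections: two boxes on the same object, one elsewhere, one weak
sample_boxes = [[100, 100, 200, 200], [110, 105, 200, 200], [400, 400, 50, 50], [0, 0, 10, 10]]
sample_scores = [0.9, 0.75, 0.6, 0.3]
print(greedy_nms(sample_boxes, sample_scores, score_threshold=0.5, iou_threshold=0.5))
# [0, 2]
```

The second box overlaps the first one with an IoU of about 0.86, so it is suppressed, and the last box is removed by the score threshold before any overlap is checked.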

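Now let's draw the boxes again, this time only the ones that survived. Start from a fresh copy of the image (for example, by calling cv2.imread(path_name) again), since the previous loop already drew every box on it: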

# ensure at least one detection exists
if len(idxs) > 0:
    # loop over the indexes we are keeping
    for i in idxs.flatten():
        # extract the bounding box coordinates
        x, y = boxes[i][0], boxes[i][1]
        w, h = boxes[i][2], boxes[i][3]
        # draw a bounding box rectangle and label on the image
        color = [int(c) for c in colors[class_ids[i]]]
        cv2.rectangle(image, (x, y), (x + w, y + h), color=color, thickness=thickness)
        text = f"{labels[class_ids[i]]}: {confidences[i]:.2f}"
        # calculate text width & height to draw the transparent boxes as background of the text
        (text_width, text_height) = cv2.getTextSize(text, cv2.FONT_HERSHEY_SIMPLEX, fontScale=font_scale, thickness=thickness)[0]
        text_offset_x = x
        text_offset_y = y - 5
        box_coords = ((text_offset_x, text_offset_y), (text_offset_x + text_width + 2, text_offset_y - text_height))
        overlay = image.copy()
        cv2.rectangle(overlay, box_coords[0], box_coords[1], color=color, thickness=cv2.FILLED)
        # add opacity (transparency to the box)
        image = cv2.addWeighted(overlay, 0.6, image, 0.4, 0)
        # now put the text (label: confidence %)
        cv2.putText(image, text, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX,
            fontScale=font_scale, color=(0, 0, 0), thickness=thickness)

You can use cv2.imshow("image", image) to show the image, but we are just going to save it to disk:

cv2.imwrite(filename + "_yolo3." + ext, image)

Check this out:

YOLO Object detection on a city scene image

Here is another sample image:

YOLO Object detection on horses and persons

Or this:

YOLO Object detection on food image

Awesome! Use your own images, tweak those parameters, and see which values work best!

Also, if the image has a high resolution, make sure you increase the font_scale parameter so you can see the bounding boxes and their corresponding labels.

PyTorch Code

As mentioned earlier, if you want to use a GPU (which is much faster than a CPU) for inference, you can use the PyTorch library, which supports CUDA computing. Here is the code for that (get darknet.py and utils.py from that repository):

import cv2
import matplotlib.pyplot as plt
from utils import *
from darknet import Darknet

# Set the NMS Threshold
nms_threshold = 0.6
# Set the IoU threshold
iou_threshold = 0.4
cfg_file = "cfg/yolov3.cfg"
weight_file = "weights/yolov3.weights"
namesfile = "data/coco.names"
m = Darknet(cfg_file)
m.load_weights(weight_file)
class_names = load_class_names(namesfile)
# m.print_network()
original_image = cv2.imread("images/city_scene.jpg")
original_image = cv2.cvtColor(original_image, cv2.COLOR_BGR2RGB)
img = cv2.resize(original_image, (m.width, m.height))
# detect the objects
boxes = detect_objects(m, img, iou_threshold, nms_threshold)
# plot the image with the bounding boxes and corresponding object class labels
plot_boxes(original_image, boxes, class_names, plot_labels=True)

Note: The above code requires the darknet.py and utils.py files in the current directory. Also, PyTorch must be installed (a GPU-accelerated build is recommended).

Conclusion

I have prepared code that lets you use your live camera for real-time object detection; check it here. Also, if you want to read a video file and perform object detection on it, this code can help you. Here is an example output:

YOLOv3 Real-Time Object Detection using OpenCV and PyTorch in Python

Note that the YOLO object detector has some drawbacks. A main one is that YOLO struggles to detect objects grouped close together, especially small ones. There are also SSDs (Single Shot Detectors), which often offer a different tradeoff between speed and accuracy.

This tutorial's code depends on these sources:

If you wish to use TensorFlow 2 instead, there are a few projects and repositories built by people out there; I suggest you check this one.

Check the official YOLO tutorial here.

Happy Learning ♥
