Upskill Yourself in Minutes with YOLO: Learn AI Object Detection with Examples

Implementing AI for object detection isn't hard. Using YOLO, we can learn the basics of AI and set up object detection with ease. In this article, we will set up table detection using the new YOLOv8 model. Follow the tutorial at the end to get it working in practice.


Breaking the Myth: Object Detection Isn't as Hard as You Think

Hearing about AI and object detection can create an illusion among developers that doing such things is far beyond the reach of traditionally trained programmers.

But that's not the case. Object detection is easy to set up and only requires a few minutes of your time.

It's a computer vision technique that works to identify and locate objects within an image or video. For example, traffic surveillance systems, self-driving cars, and facial recognition systems all employ this technology to track down vehicles, faces, and other objects of interest.

This article uses YOLO (You Only Look Once) to perform object detection tasks.

Object Detection with YOLO

Introduction to YOLOv8

YOLOv8 (You Only Look Once, version 8) is an open-source computer vision AI model released on January 10th, 2023. It’s called YOLO because it detects everything inside an image in a single pass. The new version can perform object detection, classification, instance segmentation, tracking, and pose estimation tasks.

The new v8 offers better performance and flexibility. Its detection and segmentation models are pre-trained on the COCO (Common Objects in Context) dataset, and its classification models are pre-trained on ImageNet.

Using YOLO: An Example

YOLO can be used for a wide variety of applications and use cases. Here is an example of borderless table detection. A detailed section on implementation is presented at the end of the article.

Result of borderless table detection

The Evolution of YOLO


YOLO has eight versions in total, with each subsequent version improving upon the previous one. Initially introduced in 2015, it has since become one of the most popular object detection algorithms worldwide.

YOLO V1

Released: 2015

Paper title: You Only Look Once: Unified, Real-Time Object Detection

Authors: Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi

YOLO V1 used a convolutional neural network (CNN) to predict bounding boxes for objects in an image. It was very fast compared to other models of its time.

Drawbacks:

  • Not as accurate as other object detection algorithms at the time; it produced a large number of false positive detections.

YOLO V2

Released: 2016

Paper title: YOLO9000: Better, Faster, Stronger

Authors: Joseph Redmon, Ali Farhadi

It introduced several improvements, such as batch normalization, anchor boxes, and higher-resolution input, which made YOLO V2 more accurate and faster than YOLO V1.

Drawbacks:

  • It still had some drawbacks like difficulty detecting smaller objects.

YOLO V3

Released: 2018

Paper title: YOLOv3: An Incremental Improvement

Authors: Joseph Redmon, Ali Farhadi

It introduced several more improvements over YOLO v2, including multi-scale predictions and the deeper Darknet-53 backbone, which reduced false positive detections and improved accuracy.

Drawbacks:

  • May not be ideal for niche applications, where the large datasets needed for training can be hard to obtain.

YOLO V4

Released: 2020

Paper title: YOLOv4: Optimal Speed and Accuracy of Object Detection

Authors: Alexey Bochkovskiy, Chien-Yao Wang, Hong-Yuan Mark Liao

It introduced several new features, such as the CSPDarknet53 backbone and mosaic data augmentation, which made YOLO v4 more accurate and faster than YOLO v3.

Drawbacks:

  • YOLOv4 models are generally larger. This can lead to higher memory consumption and slower inference speeds.

YOLO V5

Released: 2020

Released by Ultralytics, YOLOv5 was specifically designed for high scalability, making it well suited for deployment on diverse devices, ranging from powerful GPUs to low-power mobile devices.

Drawbacks:

  • While YOLOv5 offers different sized models with varying accuracy levels, the most accurate models (e.g., YOLOv5x) can be computationally expensive and require powerful hardware for real-time inference.

YOLO V6

Released: 2022

Paper Title: YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications

Authors: Chuyi Li, Lulu Li, Hongliang Jiang, Kaiheng Weng, Yifei Geng, Liang Li, Zaidan Ke, Qingyuan Li, Meng Cheng, Weiqiang Nie, Yiduo Li, Bo Zhang, Yufei Liang, Linyuan Zhou, Xiaoming Xu, Xiangxiang Chu, Xiaoming Wei, Xiaolin Wei

YOLOv6 was released by Meituan in 2022 and focused on optimizing the architecture for efficient hardware deployment in industrial applications.

Drawbacks:

  • Less flexibility for customization, making it harder to fine-tune for specific tasks.

YOLO V7

Released: 2022

Paper title: YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

Authors: Chien-Yao Wang, Alexey Bochkovskiy, Hong-Yuan Mark Liao

YOLO V7 introduced support for pose estimation, trained on the COCO keypoints dataset, making it more versatile than YOLO v6.

Drawbacks:

  • Slower compared to YOLOv8.

YOLO V8

Released: 2023

YOLO V8 is a cutting-edge model that builds upon the success of previous YOLO versions and introduces new features and improvements for enhanced performance, flexibility, and efficiency. YOLOv8 supports a full range of vision AI tasks, including detection, segmentation, pose estimation, tracking, and classification.

More on YOLOv8: Exploring Its Internals and Features

Inner Workings of YOLOv8

Anchor-Free Detection

One of the key features of YOLOv8 is its use of anchor-free detection.

Anchor boxes are predefined boxes that are used to represent objects in an image.

YOLOv8 does not use anchor boxes; instead, it predicts the center, dimensions, and class of an object directly from the image. This makes YOLOv8 more accurate and faster than previous YOLO versions.
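To get a feel for these direct predictions, here is a minimal sketch using the ultralytics package; it assumes a recent release, the pretrained yolov8n.pt checkpoint, and a placeholder image called sample.jpg:

from ultralytics import YOLO

# Load a small pretrained detection model (yolov8n.pt downloads on first use)
model = YOLO('yolov8n.pt')

# Run a single prediction on a placeholder image
results = model('sample.jpg')

# Each detection is reported directly as a box center, dimensions, class, and confidence
for box in results[0].boxes:
    print('xywh:', box.xywh.tolist())       # center x, center y, width, height
    print('class:', int(box.cls))           # predicted class index
    print('confidence:', float(box.conf))   # confidence score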

Mosaic Data Augmentation

Mosaic data augmentation is a technique used to increase the size and diversity of a training dataset for object detection models. It involves randomly selecting four images from the training set and stitching them together into a single composite image. This composite image is used to train the object detection model.
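As an illustration of the stitching step, here is a minimal sketch using Pillow; the four image paths are hypothetical placeholders, and a real implementation would also remap the bounding-box labels into the composite image:

from PIL import Image

# Hypothetical training images; in practice these are sampled randomly from the dataset
paths = ['img1.jpg', 'img2.jpg', 'img3.jpg', 'img4.jpg']
tile = 320  # side length of each tile in the 2x2 mosaic

# Resize each image to the tile size and paste it into one composite image
mosaic = Image.new('RGB', (2 * tile, 2 * tile))
offsets = [(0, 0), (tile, 0), (0, tile), (tile, tile)]
for path, (x, y) in zip(paths, offsets):
    mosaic.paste(Image.open(path).resize((tile, tile)), (x, y))

mosaic.save('mosaic.jpg')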

Modern object detectors consist of the following components:

Backbone: The backbone is a convolutional neural network (CNN) that extracts features from an input image. YOLOv8 uses a CSP-based backbone that comes in several model sizes (n, s, m, l, x).

Neck: The neck is a series of layers that combine features from different levels of the backbone. This allows YOLOv8 to detect objects of different sizes at different locations in the image.

Head: The head is a series of layers that predict bounding boxes for objects in the image. YOLOv8 uses a new anchor-free detection head, which eliminates the need for anchor boxes and improves the accuracy of object detection.
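If you want to look at these three components yourself, the ultralytics package lets you print the underlying PyTorch module. Here is a minimal sketch, assuming a recent ultralytics release and the pretrained yolov8n.pt checkpoint:

from ultralytics import YOLO

# Load a small pretrained model and inspect its architecture
model = YOLO('yolov8n.pt')

# Prints the backbone, neck, and head layers as a PyTorch module
print(model.model)

# Prints a short summary: layer count, parameters, and GFLOPs
model.info()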

YOLOv8 Capabilities

Alt text

Classify: Image classification involves classifying an entire image into one of a set of predefined classes. For example, an image can belong to the classes "Person", "Tripod" and "Safety Vest".

Detection: Detection is a task that involves identifying the location and class of objects in an image or video stream.

Segment: Segmentation goes a step further than object detection and involves identifying individual objects in an image and segmenting them from the rest of the image.

Track: Tracking the movement of objects over time in a sequence of images or video frames.

Pose: Identifying the location of keypoints, or landmarks, on an object. Imagine you have a picture of a person striking a pose: pose estimation can analyze this image and identify the precise location of keypoints like the elbows, knees, and wrists.
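Each of these tasks has its own family of pretrained checkpoints in the ultralytics package. Here is a minimal sketch of loading and running them; it assumes a recent ultralytics release (newer than the version pinned in the tutorial below), and sample.jpg / sample_video.mp4 are placeholder paths:

from ultralytics import YOLO

# Task-specific pretrained checkpoints (downloaded automatically on first use)
detect = YOLO('yolov8n.pt')         # object detection
classify = YOLO('yolov8n-cls.pt')   # image classification
segment = YOLO('yolov8n-seg.pt')    # instance segmentation
pose = YOLO('yolov8n-pose.pt')      # pose estimation

# Run each task on a placeholder image
detect('sample.jpg')
classify('sample.jpg')
segment('sample.jpg')
pose('sample.jpg')

# Tracking follows objects across video frames (may require extra tracker dependencies)
detect.track('sample_video.mp4')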

Next, we can dive into a detection example. Follow the tutorial locally or in Google Colab (recommended) to try it quickly.

Set Up Table Detection Using YOLOv8: A Detailed Guide

To extract the tabular data, you must first determine the table’s location within the document. Table detection is facilitated by the Python ultralytics library, which can download lightweight YOLO models offering a range of advanced features such as object detection, classification, and segmentation.

Install Dependencies

Install pytesseract, a Python library for OCR (Optical Character Recognition), along with its dependencies. We will use it to extract text from an image.

!sudo apt install tesseract-ocr  # install the Tesseract OCR engine
!pip install pytesseract transformers ultralyticsplus==0.0.23 ultralytics==8.0.21  # Python wrappers for OCR and YOLO

Initialize the Imports

We use numpy arrays to represent the table coordinates, pytesseract for OCR, ultralyticsplus for calling YOLO, and PIL for loading the image.

import numpy as np
import pytesseract
from pytesseract import Output
from ultralyticsplus import YOLO, render_result
from PIL import Image

# Load the initial image
image = './borderless_table.jpg'
img = Image.open(image)
img

This will display the initial document that we will be using for table detection.

Initial Document

Load the YOLOv8-table-extraction model

This model was taken from awesome-yolov8-models, authored by @_keremberke; big thanks for making it open-source. It was created by fine-tuning YOLOv8 on a custom dataset of similar tables.

model = YOLO('keremberke/yolov8m-table-extraction')

# set model parameters
model.overrides['conf'] = 0.25  # NMS confidence threshold
model.overrides['iou'] = 0.45  # NMS IoU threshold
model.overrides['agnostic_nms'] = False  # NMS class-agnostic
model.overrides['max_det'] = 1000  # maximum number of detections per image

Detect the Table

Call the predict method to get the borderless table marked using a bounding box.

results = model.predict(img)
# observe results
print('Boxes: ', results[0].boxes)
render = render_result(model=model, image=img, result=results[0])
render

This code renders the result obtained after table detection.

Result of Table Detection

Next, we can try to get the table contents as text using OCR.

Cropping the image and performing OCR

# Bounding boxes are returned as [x1, y1, x2, y2, confidence, class]
boxes_data = results[0].boxes.data.cpu().numpy()
# Take the corner coordinates of the first detected table
x1, y1, x2, y2, _, _ = tuple(int(item) for item in boxes_data[0])
# Reload the original image as a numpy array and crop out the table region
img = np.array(Image.open(image))
cropped_image = img[y1:y2, x1:x2]
cropped_image = Image.fromarray(cropped_image)
cropped_image

This is the cropped version of the detected table.

Cropped Table

# conversion to text
text = pytesseract.image_to_string(cropped_image)
print(text)

OCR will return the following results.

Results of OCR
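If you need more than the raw text, for example word-level bounding boxes and confidences, pytesseract's image_to_data can be used together with the Output helper imported earlier. Here is a minimal sketch:

# Word-level OCR data: text, bounding box, and confidence for each detected word
data = pytesseract.image_to_data(cropped_image, output_type=Output.DICT)

for i, word in enumerate(data['text']):
    if word.strip():  # skip empty entries
        box = (data['left'][i], data['top'][i], data['width'][i], data['height'][i])
        print(word, box, 'conf:', data['conf'][i])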

Conclusion

The world of AI is vast, but you can easily get started and push your boundaries, as we have shown in this article.

Getting started takes only a few minutes, but the insights gained will fuel your desire for more knowledge.

YOLO isn't the only option, but it's a good starting point for developers who want to get involved in this technology. Bigger and more complex projects are possible; the demo here is just the beginning.
