YOLO models for Object Detection Explained [YOLOv8 Updated]
What is YOLO (You Only Look Once)?
One of the most well-known model series in artificial intelligence (AI), if not the most well-known, is "YOLO". When I was in school, YOLO used to mean something else. And yet, here I am 15 years later writing an article about it. Who would have thought?
Okay, strap in.
YOLO (You Only Look Once) is a popular set of object detection models used for real-time object detection and classification in computer vision.
Originally developed by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, YOLO aims to achieve high accuracy in object detection at real-time speed. The model family belongs to the one-stage object detection models, which process an entire image in a single forward pass of a convolutional neural network (CNN).
The key feature of YOLO is its single-stage detection approach, which is designed to detect objects in real time and with high accuracy. Unlike two-stage detection models, such as R-CNN, that first propose regions of interest and then classify these regions, YOLO processes the entire image in a single pass, making it faster and more efficient.
Source: Pjreddie. The YOLO Detection System. Processing images with YOLO is simple and straightforward. Our system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model’s confidence.
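Because step (1) resizes every input to 448 × 448, predicted boxes live in that resized coordinate system and must be scaled back to the original image before they are drawn or saved. The sketch below assumes a plain resize with no padding, as described in the caption; many later YOLO versions letterbox (pad) the image instead, which needs an extra offset correction that is not shown here.

```python
# Minimal sketch: map a box predicted on the resized 448x448 network input
# back to the original image. Assumes a plain resize (no letterbox padding).
NET_SIZE = 448  # input resolution quoted for the original YOLO


def rescale_box(box, orig_w, orig_h, net_size=NET_SIZE):
    """Convert an (x_min, y_min, x_max, y_max) box from net_size x net_size
    pixel space to the original orig_w x orig_h pixel space."""
    x_min, y_min, x_max, y_max = box
    # x values scale with the width ratio, y values with the height ratio
    return (
        x_min * orig_w / net_size,
        y_min * orig_h / net_size,
        x_max * orig_w / net_size,
        y_max * orig_h / net_size,
    )


# A box covering the central region of the resized image, mapped onto a 1280x720 photo
print(rescale_box((112, 112, 336, 336), 1280, 720))  # (320.0, 180.0, 960.0, 540.0)
```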
In this article, we will be focusing on YOLOv8, the latest version of the YOLO system developed by Ultralytics. We will discuss its evolution from YOLO to YOLOv8, its network architecture, new features, and applications. Additionally, we will provide a step-by-step guide on how to use YOLOv8, and lastly how to use it to create model-assisted annotations with Encord Annotate.
Whether you're a seasoned machine learning engineer or just starting out, this guide will provide you with all the knowledge and tools you need to get started with YOLOv8.
So buckle up and let's dive in!
Evolution from YOLO to YOLOv8
If you're not interested in a quick recap of the timeline of YOLO models, please skip this section.
YOLOv1 was the first official YOLO model. It used a single convolutional neural network (CNN) to detect objects in an image and was relatively fast compared to other object detection models. However, it was not as accurate as some of the two-stage models at that time.
YOLOv2 was released in 2016 and made several improvements over YOLOv1. It used anchor boxes and batch normalization to improve detection accuracy, and introduced a passthrough layer that brings in higher-resolution features from an earlier layer, which helps with detecting small objects.
YOLOv3 was introduced in 2018 with the goal of increasing the accuracy and speed of the algorithm. The primary improvement in YOLOv3 over its predecessors was the Darknet-53 backbone, a 53-layer convolutional network that borrows residual connections from ResNet.
YOLOv3 also refined the anchor boxes, using nine anchors of different scales and aspect ratios (three per output scale) to better match the size and shape of the detected objects. Other hallmarks of YOLOv3 were predictions at three scales using a Feature Pyramid Network (FPN)-like design, which handles a wider range of object sizes, and independent logistic classifiers trained with binary cross-entropy instead of a softmax over classes.
YOLOv4, released in 2020 by Bochkovskiy et al., introduced a number of improvements over YOLOv3, including a new backbone network (CSPDarknet53), improvements to the training process, and increased model capacity. YOLOv4 also introduced Cross mini-Batch Normalization, a normalization method designed to increase the stability of the training process.
YOLOv5, introduced in 2020, builds upon the success of previous versions and was released as an open-source project by Ultralytics. Implemented in PyTorch, YOLOv5 combines a CSPDarknet backbone with a PANet neck, along with several new features and training improvements, to achieve improved object detection performance. Its flexible, Pythonic structure made it one of the most popular object detection repositories back in 2020, and it was also the first model we incorporated for model-assisted labeling at Encord.
YOLOv6, released by Meituan in 2022, focused on making the system more efficient for industrial deployment and reducing its memory footprint. It uses a hardware-friendly backbone called EfficientRep, built from RepVGG-style re-parameterizable blocks, together with a Rep-PAN neck and an efficient decoupled head, to handle objects of different sizes and aspect ratios while keeping inference fast.
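YOLOv7 was introduced in 2022 by Wang, Bochkovskiy, and Liao. One of its key improvements is a new architecture block called E-ELAN (Extended Efficient Layer Aggregation Network), combined with a compound scaling method for concatenation-based models and planned re-parameterized convolutions.
YOLOv7 also adds an auxiliary training head with a coarse-to-fine, lead-guided label assignment strategy, which improves training without adding inference cost. Multi-scale training (training the model on images at several resolutions so it handles objects of different sizes and shapes more effectively) is supported as well, although the idea dates back to YOLOv2.
Finally, the YOLOv7 codebase can be trained with "Focal Loss", a technique introduced with RetinaNet in 2017 to address the class imbalance problem that often arises in object detection tasks. The Focal Loss function gives more weight to hard examples and reduces the influence of easy examples.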
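To make "more weight to hard examples" concrete, here is a minimal single-prediction sketch of the binary Focal Loss as defined in the RetinaNet paper, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), using that paper's defaults α = 0.25 and γ = 2. It is an illustration of the formula only, not the loss code from any YOLO repository.

```python
import math


def binary_cross_entropy(p, y, eps=1e-7):
    """Standard BCE for one prediction: p = P(positive), y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(p if y == 1 else 1.0 - p)


def binary_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal Loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)**gamma shrinks toward 0 for confident (easy) predictions
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy positive (p = 0.95) versus a hard positive (p = 0.30)
for p in (0.95, 0.30):
    bce, fl = binary_cross_entropy(p, 1), binary_focal_loss(p, 1)
    print(f"p={p:.2f}  BCE={bce:.4f}  FL={fl:.4f}  FL/BCE={fl / bce:.4f}")
```

Relative to plain cross-entropy, the easy example is scaled by 0.25 × 0.05² while the hard one is scaled by 0.25 × 0.7², so the hard example dominates the gradient signal. Setting γ = 0 recovers α-weighted cross-entropy.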
A fun side note and notable version is YOLO9000. In the paper "YOLO9000: Better, Faster, Stronger", Joseph Redmon and Ali Farhadi improved the YOLOv2 model to detect over 9000 object categories.
Over the last year, Ultralytics worked hard on researching and building the newest version of YOLO, YOLOv8. YOLOv8 was launched on January 10th, 2023.
Why should I use YOLOv8?
A few of the main reasons you should consider using YOLOv8 in your next computer vision project are:
- According to Ultralytics' COCO benchmarks, YOLOv8 is more accurate than previous YOLO models.
- The latest YOLOv8 implementation comes with a lot of new features; we especially like the user-friendly CLI and GitHub repo.
- It supports object detection, instance segmentation, and image classification.
- The community around YOLO is incredible: just search for any edition of the YOLO model and you'll find hundreds of tutorials, videos, and articles. Furthermore, you can always find the help you need in communities such as MLOps Community, DCAI, and others.
- Training YOLOv8 is likely to be faster than training two-stage object detection models.
One reason not to use YOLOv8:
- At the time of writing, YOLOv8 does not provide models trained at 1280-pixel resolution, so if you're looking to run inference at high resolution, YOLOv8 is not recommended.
How does YOLOv8 compare to previous models?
The Ultralytics team has once again benchmarked YOLOv8 on the COCO dataset and achieved impressive results compared to previous YOLO versions across all five model sizes.
When comparing the different YOLO lineages and model sizes on the COCO dataset, we look at three metrics:
- Performance: mean average precision (mAP)
- Speed: inference speed (in fps)
- Compute (cost): the model's FLOPs and number of parameters
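The COCO mAP reported for YOLO models is averaged over Intersection over Union (IoU) thresholds from 0.50 to 0.95 in steps of 0.05, so a single prediction can count as correct at some thresholds and wrong at others. The sketch below computes IoU for two axis-aligned boxes and checks which COCO thresholds a match would pass; it covers only this matching step, not the full precision-recall and averaging procedure of the official evaluator.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width/height of the overlap; zero if the boxes do not intersect
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

pred, gt = (0, 0, 10, 10), (1, 0, 11, 10)  # prediction shifted by 1 pixel
score = iou(pred, gt)  # 90 / 110
matched = [t for t in COCO_IOU_THRESHOLDS if score >= t]
print(f"IoU={score:.3f}, true positive at {len(matched)} of 10 thresholds")
```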
Among the five object detection model sizes, the YOLOv8m model achieved an mAP of 50.2% on the COCO dataset, whereas the largest model, YOLOv8x, achieved 53.9% with more than double the number of parameters.
Overall, YOLOv8's high accuracy and performance make it a strong contender for your next computer vision project.
Whether you are looking to implement object detection in a commercial product, or simply want to experiment with the latest computer vision technologies, YOLOv8 is a state-of-the-art model that you should consider.
If you would like to try a short YOLOv8 tutorial from Ultralytics, check out their Colab tutorial.
Next, we will analyze the architecture and design of the model.
YOLOv8 network architecture and design
At the time of writing, the paper has not been published yet, but the creators of YOLOv8 have promised that it will come out soon (to avoid the drama around YOLOv5).
Thus, we do not have a good overview of the methodologies used during creation, nor do we have access to the ablation studies conducted by the team. We will update this article as soon as the paper is published.
Luckily we can find plenty of information online and in the GitHub repository.
The layout of YOLOv8
We won’t go too much into detail about the YOLOv8 architecture, but we will cover some of the major differences from previous iterations.
The following layout was made by RangeKing on GitHub and is a great way of visualizing the architecture.
New convolutions in YOLOv8
There are several updates and new convolutions in the YOLOv8 architecture, according to the introductory post from Ultralytics:
- The backbone of the system changed with the introduction of C2f, which replaces C3, and the first 6x6 convolution in the stem was switched to a 3x3 convolution. In C2f, the outputs of all the Bottleneck blocks (each a combination of two 3x3 convs with a residual connection) are concatenated, whereas C3 used only the output of the last Bottleneck.
- Two convolutions (#10 and #14 in the YOLOv5 config) were removed.
- The Bottleneck in YOLOv8 remains the same as in YOLOv5, except the first convolution's kernel size was changed from 1x1 to 3x3. This change indicates a shift towards the ResNet block defined in 2015.
Anchor-free detection is when an object detection model directly predicts the center of an object instead of the offset from a known anchor box.
Anchor boxes are a pre-defined set of boxes with specific heights and widths, used to detect object classes with the desired scale and aspect ratio. They are chosen based on the size of objects in the training dataset and are tiled across the image during detection.
The network outputs probability and attributes like background, IoU, and offsets for each tiled box, which are used to adjust the anchor boxes. Multiple anchor boxes can be defined for different object sizes, serving as fixed starting points for boundary box guesses.
The advantage of anchor-free detection is that it is more flexible and efficient, as it does not require the manual specification of anchor boxes, which can be difficult to choose and can lead to suboptimal results in previous anchor-based YOLO models such as v2 and v3.
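The difference between the two approaches shows up in how raw network outputs are decoded into a box. The sketch below contrasts the anchor-based parameterization used by YOLOv2/YOLOv3 (a sigmoid offset inside the grid cell plus an exponential scaling of a predefined anchor) with a simplified anchor-free decoding that predicts distances from a grid point to the four box sides. The anchor-free function is an illustrative assumption in the spirit of anchor-free heads, not Ultralytics' exact YOLOv8 implementation; all coordinates are in grid-cell units.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_anchor_based(t, cell_x, cell_y, anchor_w, anchor_h):
    """YOLOv2/v3-style decoding. Raw outputs t = (tx, ty, tw, th) adjust a
    predefined anchor box. Returns (center_x, center_y, width, height)."""
    tx, ty, tw, th = t
    cx = cell_x + sigmoid(tx)  # the center stays inside the responsible cell
    cy = cell_y + sigmoid(ty)
    w = anchor_w * math.exp(tw)  # the size is a rescaled anchor prior
    h = anchor_h * math.exp(th)
    return cx, cy, w, h


def decode_anchor_free(d, point_x, point_y):
    """Simplified anchor-free decoding (illustrative assumption). Raw outputs
    d = (left, top, right, bottom) are distances from a grid point to the box
    sides, so no anchor prior is needed. Returns (center_x, center_y, width, height)."""
    left, top, right, bottom = d
    x_min, y_min = point_x - left, point_y - top
    x_max, y_max = point_x + right, point_y + bottom
    return (x_min + x_max) / 2, (y_min + y_max) / 2, x_max - x_min, y_max - y_min


# With all-zero raw outputs, the anchor-based box is just the anchor centered in cell (3, 4)
print(decode_anchor_based((0.0, 0.0, 0.0, 0.0), 3, 4, anchor_w=2.0, anchor_h=1.5))
# The anchor-free box is fully determined by the predicted distances
print(decode_anchor_free((1.0, 0.5, 1.0, 1.5), point_x=3.5, point_y=4.5))
```

In the anchor-based version, a poorly chosen `anchor_w`/`anchor_h` forces the network to learn large corrections through `tw`/`th`; the anchor-free version has no such prior to tune.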
Luckily, we do not need them anymore, but if you're interested in knowing how to generate the anchor boxes yourself, look at this post.
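For reference, the YOLOv2 paper chose its anchor boxes by running k-means on the widths and heights of the training boxes, using d = 1 - IoU as the distance instead of Euclidean distance so that large boxes do not dominate. The sketch below implements that idea in plain Python; the deterministic initialization and the sample sizes are illustrative assumptions, and the post linked above may describe a different method.

```python
def wh_iou(a, b):
    """IoU of two boxes given only as (width, height), aligned at a shared corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iters=100):
    """Cluster (width, height) pairs into k anchor boxes with distance 1 - IoU.

    Initialization picks k evenly spaced boxes from the list sorted by area
    (an illustrative, deterministic choice; random init is more common).
    """
    boxes = sorted(boxes, key=lambda b: b[0] * b[1])
    step = len(boxes) / k
    anchors = [boxes[int(i * step)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            # Highest IoU == smallest 1 - IoU distance
            best = max(range(k), key=lambda i: wh_iou(b, anchors[i]))
            clusters[best].append(b)
        # New anchor = mean width/height of its cluster (keep the old one if empty)
        new_anchors = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c)) if c else anchors[i]
            for i, c in enumerate(clusters)
        ]
        if new_anchors == anchors:
            break  # assignments no longer change
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])


# Hypothetical (width, height) pairs in pixels: three small and three large objects
sizes = [(10, 10), (12, 11), (11, 12), (100, 50), (110, 55), (105, 52)]
print(kmeans_anchors(sizes, k=2))
```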
Let's look at how to use YOLOv8 in your workflows. The model comes bundled with the following pre-trained models that can be used off-the-shelf in your computer vision projects to achieve better model performance:
- Instance segmentation models trained on the COCO segmentation dataset with an image resolution of 640.
- Image classification models pre-trained on the ImageNet dataset with an image resolution of 224.
- Object Detection models trained on the COCO detection dataset with an image resolution of 640.
Source: Ultralytics. Example of Classification, Object Detection, and Segmentation.
In the next section, we will cover how to access YOLOv8 via the CLI, in a Python environment, and lastly in Encord's platform.
How do I use the YOLOv8 CLI?
YOLOv8 can be accessed easily via the CLI and used on any type of dataset.
!yolo task=detect \
  mode=predict \
  model=yolov8n.pt \
  source="image.jpg"
The main arguments are:
- task in [detect, classify, segment]
- mode in [train, predict, val, export]
- model: either an uninitialized .yaml config or a previously trained .pt file
- source: the path/to/data
Can I pip install YOLOv8?
Complementary to the CLI, YOLOv8 is also distributed as a pip package, perfect for all Python environments. This makes local development a little harder, but it unlocks all of the possibilities of weaving YOLOv8 into your Python code.
You can clone it from GitHub:
git clone https://github.com/ultralytics/ultralytics.git
Or install it with pip:
pip install ultralytics
After installing with pip, you can import a model and use it in your favorite Python environment:
from ultralytics import YOLO

# Load a model
model = YOLO("yolov8n.pt")  # load a pretrained model

# Use the model
results = model.train(data="coco128.yaml", epochs=5)  # train the model
results = model.val()  # evaluate model performance on the validation data set
results = model("https://ultralytics.com/images/cat.jpg")  # predict on an image
success = YOLO("yolov8n.pt").export(format="onnx")  # export a model to ONNX
YOLO models can also be accessed through OpenCV, the Ultralytics Google Colab notebook, and the Keras API with TensorFlow 2.
What is the Annotation Format of YOLOv8?
YOLOv8 has a simple annotation format which is the same as the YOLOv5 PyTorch TXT annotation format, a modified version of the Darknet annotation format.
Every image sample has one .txt file with one line for each bounding box. The format of each row is presented as follows:
class_id center_x center_y width height
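Notice that each field is space-delimited and the coordinates are normalized to the range 0 to 1: x values and widths are divided by the image width, and y values and heights by the image height. The class_id is a zero-based index into the list of class names.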
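When converting existing annotations to this format, boxes usually start as pixel corners. The sketch below turns a pixel-space (x_min, y_min, x_max, y_max) box into one YOLO row by taking the box center and size and dividing by the image dimensions; the six-decimal precision is an arbitrary choice for illustration.

```python
def to_yolo_line(class_id, box, img_w, img_h):
    """Convert a pixel-space (x_min, y_min, x_max, y_max) box into a YOLO row:
    'class_id center_x center_y width height', normalized to [0, 1]."""
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2 / img_w  # box center, relative to image width
    cy = (y_min + y_max) / 2 / img_h  # box center, relative to image height
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# A 200x200 pixel box on a 640x480 image
print(to_yolo_line(0, (100, 50, 300, 250), img_w=640, img_h=480))
# 0 0.312500 0.312500 0.312500 0.416667
```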
YOLOv8 annotation format example:
1 0.317 0.30354206008 0.114 0.173819742489
1 0.694 0.33726094420 0.156 0.23605150214
1 0.395 0.32257467811 0.13 0.195278969957
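Reading these files back requires the image dimensions, because the stored values are relative. The sketch below parses the example rows above into pixel-space corner boxes; the 1000 × 1000 image size is a hypothetical value chosen for illustration, so substitute each image's real width and height.

```python
def parse_yolo_label(text, img_w, img_h):
    """Parse the contents of a YOLO .txt label file into
    (class_id, x_min, y_min, x_max, y_max) tuples in pixel coordinates."""
    boxes = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        class_id, cx, cy, w, h = line.split()
        # Undo the normalization: x values use the width, y values the height
        cx, w = float(cx) * img_w, float(w) * img_w
        cy, h = float(cy) * img_h, float(h) * img_h
        boxes.append((int(class_id), cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


label_text = """1 0.317 0.30354206008 0.114 0.173819742489
1 0.694 0.33726094420 0.156 0.23605150214
1 0.395 0.32257467811 0.13 0.195278969957"""

# Hypothetical image size; use the real dimensions of the labeled image
for box in parse_yolo_label(label_text, img_w=1000, img_h=1000):
    print(box)
```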
The data.yaml file contains information used by the model to locate images and map class names to class IDs.
train: ../train/images
test: ../test/images
val: ../valid/images
nc: 5
names: ['fish', 'cat', 'person', 'dog', 'shark']
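A common mistake is letting nc drift out of sync with the names list, or reordering names so that class IDs no longer match the label files. The sketch below renders the data.yaml shown above from a single list of class names, deriving nc from the list length; the default paths mirror the example, and the helper name is hypothetical. It builds the string by hand to stay dependency-free, escaping single quotes the way YAML single-quoted strings require.

```python
def make_data_yaml(names, train="../train/images", val="../valid/images", test="../test/images"):
    """Render a YOLO data.yaml string. nc is derived from names so the two
    cannot disagree; class IDs are the list indices (0, 1, 2, ...)."""
    # In YAML single-quoted strings, a literal ' is written as ''
    quoted = ", ".join("'" + n.replace("'", "''") + "'" for n in names)
    return (
        f"train: {train}\n"
        f"test: {test}\n"
        f"val: {val}\n"
        f"nc: {len(names)}\n"
        f"names: [{quoted}]\n"
    )


print(make_data_yaml(["fish", "cat", "person", "dog", "shark"]))
```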
Using YOLOv8 for model-assisted labeling
Another interesting application of YOLOv8 is as an object detector to speed up your labeling workflow. Just as ChatGPT has sped up the time it takes to write emails or code, YOLOv8 is perfect as a backbone for AI-assisted labeling.
Encord Annotate supports a novel approach based on Micro-models, which are purposefully overfitted models trained on just a few labels for targeted use cases.
We can do this in a few steps:
Upload your input images that you’d like to annotate into Encord’s platform via the SDK from your cloud bucket (e.g. S3, Azure, GCP) or via the GUI.
Label 20 samples of any custom object you have defined in your ontology (in this example, we use airplanes from the Airbus Aircraft Detection dataset).
Next, move on to model training. Within the platform, navigate to the model tab and initiate the training of a Micro-model with a YOLOv8 backbone (an object detection model to overfit).
Hint: You can also train Micro-models for image segmentation and classification.
Wait a few minutes while the model is being trained on your initial samples.
Run inference on your next training data samples. We select a confidence threshold of 60%. Once the model has run its predictions, they are automatically turned into labels.
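Conceptually, the 60% confidence threshold decides which predictions become labels. The sketch below shows that filtering step on a list of prediction dictionaries; the dictionary keys and values are hypothetical and do not reflect Encord's or Ultralytics' actual output format.

```python
CONF_THRESHOLD = 0.60  # the 60% threshold used in this walkthrough


def filter_by_confidence(predictions, threshold=CONF_THRESHOLD):
    """Keep predictions with confidence >= threshold, highest confidence first.
    Each prediction is a dict with hypothetical keys 'label', 'confidence', 'box'."""
    kept = [p for p in predictions if p["confidence"] >= threshold]
    return sorted(kept, key=lambda p: p["confidence"], reverse=True)


# Hypothetical predictions for one image
preds = [
    {"label": "airplane", "confidence": 0.92, "box": (34, 50, 120, 98)},
    {"label": "airplane", "confidence": 0.55, "box": (300, 40, 360, 90)},
    {"label": "airplane", "confidence": 0.61, "box": (200, 210, 260, 270)},
]
labels = filter_by_confidence(preds)
print([p["confidence"] for p in labels])  # [0.92, 0.61]
```

A higher threshold produces fewer but cleaner labels to review; a lower one catches more objects at the cost of more corrections in the editor.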
After a few seconds you can see the new labels in the editor:
Repeatedly retrain the YOLOv8 algorithm to improve the Micro-model one iteration at a time. With this strategy, you will build high-quality datasets for your computer vision models in no time.
Micro-models are best for specific use cases where the objects you’re annotating are semantically similar. Get in touch with us if you’d like to see if they are useful for your use case.
If you want to learn more about applying models to label your data check out this guide.
That was YOLOv8!
In this article, we have provided an overview of the evolution of YOLO, from YOLOv1 to YOLOv8, and have discussed its network architecture, new features, and applications. Additionally, we have provided a step-by-step guide on how to use YOLOv8 for object detection and how to create model-assisted annotations with Encord Annotate.
At Encord, we help computer vision companies build better models and training datasets. We have built an end-to-end Active Learning Platform for
- AI-assisted annotation workflows
- Evaluating your training data
- Orchestrating active learning pipelines
- Fixing data and label errors
- Diagnosing model errors & biases
Encord integrates the new YOLOv8 state-of-the-art model and allows you to train Micro-models on a backbone of YOLOv8 models to support your AI-assisted annotation work.
Get in touch with us if you'd like to see it in action on your data.
What is YOLOv8 and how does it differ from previous versions of YOLO?
YOLOv8 is the latest iteration of the YOLO object detection model, aimed at delivering improved accuracy and efficiency over previous versions. Key updates include a more optimized network architecture, an anchor-free detection head in place of the anchor boxes used by earlier versions, and a modified loss function for increased accuracy.
How does YOLOv8 compare in terms of accuracy to other object detection models?
YOLOv8 has demonstrated improved accuracy compared to earlier versions of YOLO and is competitive with state-of-the-art object detection models.
Is YOLOv8 suitable for real-time object detection applications?
YOLOv8 is designed to run efficiently on standard hardware, making it a viable solution for real-time object detection tasks, including on edge devices.
What is the role of anchor boxes in YOLOv8?
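YOLOv8 is anchor-free: instead of adjusting predefined anchor boxes, it predicts object locations directly, which removes the need to hand-tune anchor sizes for your dataset. Many earlier versions, such as YOLOv2 through YOLOv5, relied on anchor boxes as starting points for their bounding box predictions.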
Is it possible to fine-tune YOLOv8 on custom datasets?
Yes, YOLOv8 can be fine-tuned on custom datasets to increase its accuracy for specific object detection tasks.
Is the YOLOv8 codebase open source?
Yes, the YOLOv8 codebase is open source and available on GitHub for research and development purposes.
What are the technical requirements for using YOLOv8?
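To use YOLOv8, you will need a Python environment with PyTorch, on which the Ultralytics implementation is built, and access to the YOLOv8 GitHub repository or the ultralytics pip package. A GPU is strongly recommended for training, although inference can also run on a CPU.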
Where can I find additional information and resources on YOLOv8?
There are many resources available for learning about YOLOv8, including research papers, online tutorials, and educational courses. I would recommend checking out YouTube!