Dominic Tarn • October 21, 2022
Action Classifications In Video Annotation: Why Does It Matter?
In almost every video some objects move. A car could be moving from frame to frame, but static annotations limit the amount of data machine learning teams can train a model on. Hence the need for action classifications in video annotation projects.
With action, dynamic, or event-based classification, video annotation teams can add a richer layer of data for computer vision machine learning models.
Annotators can label whether a car is accelerating or decelerating, turning, stopping, starting, or reversing, and apply numerous other labels to a dynamic object.
In this post, we will explain action classifications (also known as dynamic or event-based classifications) in video annotation in more detail: why they are difficult to implement, how they work, best practices, and use cases.
What are Action Classifications in Video Annotation?
Annotators need to apply action classifications to say what an object is doing and over what timescale those actions are taking place. With the right video annotation tool, you can apply these annotation labels so that an algorithm-generated machine-learning model has more data to learn from. This helps improve the overall quality of the dataset, and therefore, the outputs the model generates.
For example, a car could be accelerating in frames 100 to 150, decelerating in frames 300 to 350, and then turning left in frames 351 to 420. Dynamic classifiers contribute to the ground truth of a video annotation, and to the video data a machine learning model learns from.
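The frame-range labels in the car example above can be sketched as a simple data structure. This is a minimal, hypothetical schema (the field names and `actions_at` helper are illustrative, not any specific platform's format):

```python
from dataclasses import dataclass


@dataclass
class ActionClassification:
    """A dynamic classification applied over a range of frames (hypothetical schema)."""
    object_id: str
    action: str
    start_frame: int
    end_frame: int

    def is_active(self, frame: int) -> bool:
        """Return True if this action applies at the given frame."""
        return self.start_frame <= frame <= self.end_frame


# The car example from the text, expressed as frame-range labels
labels = [
    ActionClassification("car_1", "accelerating", 100, 150),
    ActionClassification("car_1", "decelerating", 300, 350),
    ActionClassification("car_1", "turning_left", 351, 420),
]


def actions_at(frame: int) -> list[str]:
    """Look up which labeled actions are active at a given frame."""
    return [label.action for label in labels if label.is_active(frame)]


print(actions_at(120))  # ['accelerating']
print(actions_at(200))  # []
```

Representing actions as frame ranges, rather than per-frame tags, keeps annotations compact and makes the start and end of each event explicit.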
Action or dynamic classifications are incredibly useful annotation and labeling tools, acting as an integral classifier in the annotation process. However, dynamic classifications and labels are difficult to implement successfully. Very few video annotation platforms come with this feature. Encord does, and that’s why we’re going into more detail as to why dynamic or event-based classifications matter, how they work, best practices, and use cases.
Action Classification vs. Static Classification: What’s the Difference?
Before we do, let’s compare action with static classifications.
With static classifications, annotators use an annotation tool to define and label the global properties of an object (e.g. the car is blue, has four wheels, and slight damage to the driver’s-side door), contributing to the ground truth of the video data an ML model is trained on. You can apply as much or as little detail as you need to train your computer vision model using static classifications and labels.
On the other hand, action, or dynamic, classifications describe what an object is doing and when those actions take place. Action classifications are inherently time- and action-oriented labels. An object needs to be in motion, whether that’s a person, car, plane, train, or anything else that moves from frame to frame.
An object’s behavior — whether that’s a person running, jumping, walking; a vehicle in motion, or anything else — defines and informs the labels and annotations applied during video annotation work, and the object detection process. When annotated training datasets are fed into a computer vision or machine learning model, those dynamic labels and classifications influence the model’s outputs.
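The static/dynamic distinction above can be made concrete with two hypothetical annotation records (the field names are illustrative, not a real platform's schema): a static label describes properties that hold for the whole video, while a dynamic label is bound to a frame range.

```python
# Hypothetical annotation records contrasting static vs. dynamic classification.

# Static: global properties of the object, true across the whole clip.
static_label = {
    "object_id": "car_1",
    "type": "static",
    "properties": {"color": "blue", "wheels": 4, "damage": "driver-side door"},
}

# Dynamic: what the object is doing, and over which frames it is doing it.
dynamic_label = {
    "object_id": "car_1",
    "type": "dynamic",
    "action": "accelerating",
    "frame_range": (100, 150),  # dynamic labels are inherently time-bound
}

print(dynamic_label["frame_range"])  # (100, 150)
```

The key structural difference is the `frame_range` field: remove it and a dynamic label loses its meaning, whereas a static label never needs one.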
Why are Action Classifications in Video Datasets Difficult to Implement?
Action classifications are a truly innovative engineering achievement.
Despite decades of work, academic research, and countless millions in funding for computer vision, machine learning, artificial intelligence (AI), and video annotation companies, most platforms don’t offer dynamic classification in an easy-to-implement format.
Static classifications and labels are easier to do. Every video annotation tool and platform comes with static labeling features. Dynamic classification features are less common. Hence the advantage of finding an annotation tool that does static and dynamic, such as Encord.
Action classifications require special features that apply dynamic data structures to object descriptions, ensuring a computer vision model interprets this data accurately, so that a moving car in one frame can still be tracked hundreds of frames later in the same video.
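Tracking the same object across frames, as described above, is commonly approached by matching bounding boxes between consecutive frames. Below is a minimal sketch of one standard technique, intersection-over-union (IoU) matching; the function names and the 0.3 threshold are illustrative assumptions, not any platform's implementation:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def track(prev_boxes, new_boxes, threshold=0.3):
    """Carry object IDs forward by assigning each new box to the
    best-overlapping box from the previous frame."""
    assigned = {}
    for box in new_boxes:
        best_id, best_score = None, threshold
        for obj_id, prev in prev_boxes.items():
            score = iou(prev, box)
            if score > best_score:
                best_id, best_score = obj_id, score
        if best_id is not None:
            assigned[best_id] = box
    return assigned


prev = {"car_1": (0, 0, 10, 10)}
print(track(prev, [(1, 1, 11, 11)]))  # {'car_1': (1, 1, 11, 11)}
```

Production trackers add motion models and re-identification on top of this, but the core idea is the same: identity must persist from frame to frame before any dynamic classification can be attached to it.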
How Does Action Classification for Video Data Work?
Annotating and labeling movement isn’t easy. When an object is static, annotators give objects descriptive labels. Object detection is fairly simple for annotation tools. Static labels can be as simple as “red car”, or as complicated as describing the particular features of cancerous cells.
On the other hand, dynamic labels and classifications can cover everything from simple movement descriptors to extremely detailed and granular descriptions. When we think about how people move, so many parts of the body are in motion at any one time. Hence the advantage of using keypoints and primitives (skeleton templates) when implementing human pose estimation (HPE) annotations; this is another form of dynamic classification, when the movements themselves are dynamic.
Therefore, annotations of human movement might need an even higher level of granular detail. In a video of tennis players, notice the number of joints and muscles in action as a player hits a serve. In this one example, we can see that the player’s feet, legs, arms, neck, and head are all in motion. Every limb moves, and depending on what you’re training a computer vision model to understand, it means ensuring annotations cover as much detail as possible.
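The keypoints and primitives mentioned above can be sketched as a skeleton template: a list of named joints plus the limb connections between them. This is a hypothetical 14-point layout for illustration (real HPE schemas such as COCO's 17-keypoint format differ in joint names and count):

```python
# A minimal, hypothetical skeleton template for human pose estimation:
# named keypoints plus the limb connections ("primitives") between them.
KEYPOINTS = [
    "head", "neck",
    "l_shoulder", "r_shoulder", "l_elbow", "r_elbow", "l_wrist", "r_wrist",
    "l_hip", "r_hip", "l_knee", "r_knee", "l_ankle", "r_ankle",
]

SKELETON = [
    ("head", "neck"),
    ("neck", "l_shoulder"), ("l_shoulder", "l_elbow"), ("l_elbow", "l_wrist"),
    ("neck", "r_shoulder"), ("r_shoulder", "r_elbow"), ("r_elbow", "r_wrist"),
    ("neck", "l_hip"), ("l_hip", "l_knee"), ("l_knee", "l_ankle"),
    ("neck", "r_hip"), ("r_hip", "r_knee"), ("r_knee", "r_ankle"),
]

# A per-frame pose annotation is then just (x, y) pixel coordinates per
# keypoint, repeated over the frame range in which the movement occurs.
frame_pose = {"frame": 210, "keypoints": {"head": (412, 96), "neck": (410, 130)}}
```

Annotating the template once and moving its keypoints frame by frame is what makes pose-level dynamic classification tractable: the skeleton carries the structure, and the motion of its joints carries the action.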
How to Train Computer Vision Models on Action Classification Annotations?
Answering this question comes down to understanding how much data a computer vision model needs, and whether any AI/ML-based model needs more data when the video annotations are dynamic.
Unfortunately, there’s no clear answer to that question. It always depends on a number of factors, such as the model’s objectives and project outcomes, the interpolation applied, the volume and quality of the training datasets, and the granularity of the dynamic labels and annotations applied.
Any model is only as accurate as the data provided. The quality, detail, number of segmentations, and granularity of the labels and annotations applied during the annotation stage influence how well and how fast computer vision models learn, and crucially, how accurate a model is before more data and further iterations of that data need to be fed into it.
As with any computer vision model, the more data you feed it, the more accurate it becomes. Providing a model with different versions of similar data (e.g. a red car moving fast in shadows, compared to a red car moving slowly in evening or morning light) increases the accuracy the model can reach from its training data.
With the right video annotation tool, you can apply any object annotation type and label to an object that’s in motion — bounding boxes, polygons, polylines, keypoints, and primitives.
Using Encord, you can annotate the localized version of any object — static and dynamic — regardless of the annotation type you deploy. Everything is conveniently accessible in one easy-to-use interface for annotators, and Encord tools can also be used through APIs and SDKs.
Now let’s take a look at the best practices and use cases for action classifications in video annotation projects.
Best Practices for Action Classifications in Video
Use clean (raw) data
Before starting any video-based annotation project, you need to ensure you’ve got a large enough quantity and quality of raw data (videos). Data cleansing is integral and essential to this process. Ensure low-quality or duplicate frames, such as ghost frames, are removed.
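One common way to catch exact duplicate or ghost frames during data cleansing is to hash each frame's raw bytes and drop consecutive repeats. This is a minimal sketch under the assumption that frames arrive as raw byte buffers (in practice you would decode them with a video library first, and use perceptual hashing for near-duplicates):

```python
import hashlib


def frame_hash(frame_bytes: bytes) -> str:
    """Hash a frame's raw bytes so exact duplicates can be detected."""
    return hashlib.md5(frame_bytes).hexdigest()


def drop_duplicate_frames(frames):
    """Remove consecutive duplicate ("ghost") frames, keeping the first of each run."""
    cleaned, last_hash = [], None
    for frame in frames:
        h = frame_hash(frame)
        if h != last_hash:
            cleaned.append(frame)
        last_hash = h
    return cleaned


# Two identical consecutive frames collapse to one; the later repeat of
# frame_a is kept because it is not consecutive.
frames = [b"frame_a", b"frame_a", b"frame_b", b"frame_a"]
print(len(drop_duplicate_frames(frames)))  # 3
```

Only consecutive repeats are removed here, since a legitimately recurring scene later in the video is real data, not a ghost frame.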
Understand the dynamic properties video dataset annotations are trying to explain
Once the videos are ready, annotation and ML teams need to be clear on what dynamic classification annotations are trying to explain. What are the outcomes you want to train a computer vision model for? How much detail should you include?
Answering these questions will influence the granular level of detail annotators should apply to the training data, and subsequent requests ML teams make when more data is needed. Annotators might need to apply more segmentation to the videos, or classify the pixels more accurately, especially when comparing against benchmark datasets.
Align labels and annotations with the problem you’re trying to solve
Next, you need to ensure the labels and annotations being used align with the problem the project is trying to solve. Remember, the quality of the data — from the localized version of any object, to the static or dynamic classifications applied — has a massive impact on the quality of the computer vision model outcomes.
Projects often involve comparing model outcomes with benchmark video classification datasets. This way, machine learning team leaders can compare semantic metrics against benchmark models and machine learning algorithm outcomes.
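Benchmark comparison as described above often comes down to a simple per-frame metric: how often the model's predicted action matches the benchmark label. A minimal sketch (the labels and the plain accuracy metric are illustrative; real evaluations typically also report per-class precision and recall):

```python
def frame_accuracy(predicted, benchmark):
    """Fraction of frames where the predicted action matches the benchmark label."""
    if len(predicted) != len(benchmark):
        raise ValueError("prediction and benchmark must cover the same frames")
    matches = sum(p == b for p, b in zip(predicted, benchmark))
    return matches / len(predicted)


# Hypothetical per-frame action labels for a short clip
pred = ["accelerating", "accelerating", "turning_left", "stopping"]
truth = ["accelerating", "decelerating", "turning_left", "stopping"]

print(frame_accuracy(pred, truth))  # 0.75
```

Tracking this metric across model iterations shows whether more granular dynamic labels are actually improving outcomes against the benchmark.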
Go granular with annotation details, especially with interpolation, object detection, and segmentation
Detail and context are crucial. Start with the simplest labels, and then go as granular as you need with the labels, annotations, specifications, segmentations, protocols, and metadata, right down to classifying individual pixels. This could involve as much detail as saying a car went from 25 km/h to 30 km/h in the space of 10 seconds.
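A granular speed label like the one above can be derived rather than eyeballed, if per-frame positions are available. This is a sketch under the assumption that object positions have already been converted to metres along the direction of travel (a calibration step the example glosses over):

```python
def speed_kmh(positions_m, fps):
    """Average speed in km/h from per-frame positions in metres (illustrative).

    positions_m: one position per frame, in metres along the path of travel.
    fps: frames per second of the source video.
    """
    if len(positions_m) < 2:
        return 0.0
    distance = sum(abs(b - a) for a, b in zip(positions_m, positions_m[1:]))
    seconds = (len(positions_m) - 1) / fps
    return distance / seconds * 3.6  # m/s -> km/h


# An object moving 10 m over 1 s (two samples at 1 fps) is doing 36 km/h.
print(speed_kmh([0, 10], fps=1))  # 36.0
```

Computing speed from annotated positions keeps granular labels like "25 km/h to 30 km/h over 10 seconds" consistent across annotators instead of depending on individual judgment.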
What Are The Use Cases for Action Classification in Video Annotation?
Action classification in video annotation is useful across dozens of sectors, with countless practical applications already in use. In our experience, some of the most common right now include computational models for autonomous driving, sports analytics, manufacturing, and smart cities.
Key Takeaways for Using Action Classification in Video Annotation
Any sector where movement is integral to video annotation and computer vision model projects can benefit from dynamic or event-based classifications.
Action classifications give annotators and ML teams a valuable tool for classifying moving and time-based objects. Movement is one of the most difficult things to annotate and label. A powerful video annotation tool is needed, with dynamic classification features, to support annotators when events/time-based action needs to be accurately labeled.
At Encord, our active learning platform for computer vision is used by a wide range of sectors - including healthcare, manufacturing, utilities, and smart cities - to annotate thousands of videos and accelerate their computer vision model development. Speak to sales to request a trial of Encord.
Dominic has over 10 years' experience writing content for high growth AI and SaaS startups. His writing covers a wide range of topics, including machine learning, artificial intelligence and computer vision. Dominic is the founder & CEO of Inbound Sales Content (ISC), an SEO growth-focused B2B content marketing agency. He has a History BA from UCL, has lived in three countries in the last decade, and is now happily settled with a family and cat in the North East of England. https://www.linkedin.com/in/dominicntarn-inboundsalescontent/ https://www.inboundsalescontent.com/