Image and Video Understanding: A Roadmap For Implementation


In the previous two articles of our series, we provided an introduction to computer vision and did a deep dive into some of its current applications. As with any new disruptive technology, it is imperative to develop a roadmap for implementation to generate the most ROI. Poorly thought-out applications will expend resources in areas that don’t provide value, and will reduce confidence across your organization.

Since computer vision is such an expansive domain, for this article we will use the example of a deep learning solution in agriculture. Please keep in mind that this is only an example and that specifics will depend on the problem at hand.

What Problem Am I Trying to Solve?

Having a clearly defined problem is the first and most important step of the process.

Figure 1: Image and Video Understanding: A Roadmap for Implementation (infographic)

What do I want my model to be able to do?

There are many types of problems computer vision can address, such as classification, semantic segmentation, instance segmentation, and pose estimation, just to name a few. It is crucial to know which applies to your situation.

Let’s say you are building a robot to pick strawberries. To solve this problem, you would need a model that can tell whether a strawberry is ripe or not. This would be called binary classification, since you are using computer vision to answer a yes or no question.

On the other hand, if you would like to build a robot that can deal with multiple kinds of fruit, the robot would need to know what kind of crop it is looking at. This would be called a multiclass classification, since you are applying computer vision to categorize different types of crops.

Furthermore, you would want to identify the exact location of your fruits, so the robot can accurately harvest them. This would require semantic segmentation, which assigns a class to every pixel and so allows the model to localize the target object within the image.

The type of problem you want to address will determine how the data is annotated, as well as how feasible the solution will be.

Does my answer present itself in data?

In machine learning, the term “ground truth” refers to the gold standard that the model is tested and trained against. It is important that the ground truth in your dataset is relevant to the problem you want to solve.

This often means that you will need your data to be annotated with either labels, bounding boxes, or masks, depending on the problem at hand. In the three examples presented above, each requires different types of data and different levels of annotation.
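
As a rough illustration of how the annotation effort grows across the three strawberry examples, the sketch below models an image-level label, a bounding box, and a pixel mask as plain Python types. The class names, field names, and sample values are hypothetical and are not tied to any particular annotation tool or dataset format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClassificationLabel:
    """One label per image (binary or multiclass classification)."""
    image_id: str
    label: str  # e.g. "ripe" or "unripe"


@dataclass
class BoundingBox:
    """A rectangle around one object, in pixel coordinates."""
    image_id: str
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        width = max(0, self.x_max - self.x_min)
        height = max(0, self.y_max - self.y_min)
        return width * height


@dataclass
class SegmentationMask:
    """One value per pixel: 1 where the pixel belongs to the object, else 0."""
    image_id: str
    mask: List[List[int]]

    def pixel_count(self) -> int:
        return sum(sum(row) for row in self.mask)


# Hypothetical annotations for the same image at three levels of detail.
example_label = ClassificationLabel("img_001", "ripe")
example_box = BoundingBox("img_001", "strawberry", 10, 20, 50, 80)
example_mask = SegmentationMask("img_001", [[0, 1, 1], [0, 1, 0]])
```

Each step down the list asks more of the annotator: a single word per image, then four coordinates per object, then a decision for every pixel.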

It is also important that the dataset used to train the AI model is similar to the environment that the model will be deployed in. This ensures that results achieved in the lab will extend to the real world.

What Does My Data Look Like?

The quality of a machine learning model is entirely dependent on the quality of the data.

Do I have access to large datasets with appropriate labels for training?

These can either be public datasets that have been curated by other researchers, or proprietary datasets specific to your problem that you will need to build out yourself.

It is important to make sure your labels are appropriate. For example, if you are performing a segmentation task, your images will need to be labeled at the pixel level using masks. Bounding boxes are coarser and are typically enough only when you need to know roughly where an object is.
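
One way to see why the choice of label matters: a pixel mask carries more information than a bounding box, since a box can be derived from a mask but not the other way around. The sketch below shows that derivation on a tiny hand-written mask. The function name and the inclusive coordinate convention are our own assumptions for illustration.

```python
def mask_to_bounding_box(mask):
    """Return (x_min, y_min, x_max, y_max), inclusive, for a binary mask.

    `mask` is a list of rows, each a list of 0/1 values.
    Returns None if the mask contains no object pixels.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))


# A 4x4 image in which three pixels belong to the strawberry.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]

box = mask_to_bounding_box(mask)
object_pixels = sum(sum(row) for row in mask)
box_pixels = (box[2] - box[0] + 1) * (box[3] - box[1] + 1)
```

Here the box covers four pixels while the object occupies only three, which is the detail a box-only annotation would lose.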

Computer vision benefits greatly from transfer learning, where parts of an AI model that has already been trained on a general dataset, like ImageNet, are repurposed to solve a more specific problem. You can also fine-tune existing models by retraining with data specific to your problem. This enables a more agile research process where models can be trained more quickly and with less data.

Is my dataset balanced?

Having a balanced dataset is important for building a robust model. A dataset is balanced if there are roughly the same number of training examples for each class you are trying to detect.

For example, if we are creating a binary classifier to identify if a strawberry is ripe or not, we would not want a dataset where 90% of the images are labelled “ripe”. If this were the case, the model would be able to achieve a 90% accuracy rate by classifying all the images in the dataset as “ripe”, without ever learning to recognize an unripe strawberry. This model would perform poorly in the real world, and would result in the harvest of a lot of sour strawberries.
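
The arithmetic behind this example can be checked in a few lines. The sketch below builds a made-up list of 100 labels with the 90/10 split described above and scores a "model" that always predicts the majority class. The function name is hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Made-up dataset: 90 ripe images and 10 unripe images.
labels = ["ripe"] * 90 + ["unripe"] * 10

majority_label, accuracy = majority_baseline_accuracy(labels)

# The same "model" never finds a single unripe strawberry.
predictions = [majority_label] * len(labels)
unripe_found = sum(
    1 for truth, pred in zip(labels, predictions)
    if truth == "unripe" and pred == "unripe"
)
unripe_recall = unripe_found / labels.count("unripe")

print(majority_label, accuracy, unripe_recall)  # prints: ripe 0.9 0.0
```

Comparing a trained model against this kind of baseline is a quick way to tell whether a headline accuracy figure reflects anything the model has actually learned.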

It is also important to have appropriate intraclass variance. This refers to the degree of dissimilarity that instances of the same class have from each other. If you are classifying types of fruits, you may want your model to detect fruits at different stages of the growing season, or in different weather and soil conditions. You should attempt to represent the variability and complexity of the real world in the dataset.

Now, if you detect that your dataset is imbalanced, it isn’t the end of the world. You may still be able to solve the problem using specific techniques such as focal loss to mitigate the effect of the data imbalance.
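
For readers curious what focal loss does, here is a minimal sketch of the binary form for a single prediction, written in plain Python rather than a deep learning framework. The defaults gamma=2.0 and alpha=0.25 are commonly used starting points and should be treated as assumptions to tune, not recommendations for your data.

```python
import math


def binary_cross_entropy(p, y, eps=1e-7):
    """Standard cross-entropy for one prediction p = P(y == 1)."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one prediction: -alpha_t * (1 - p_t)**gamma * log(p_t).

    p_t is the probability the model assigned to the true class, so
    (1 - p_t)**gamma shrinks the loss of examples that are already easy.
    """
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy positive (confidently correct) and a hard positive (mostly wrong).
easy_focal = binary_focal_loss(0.95, 1)
hard_focal = binary_focal_loss(0.10, 1)
easy_bce = binary_cross_entropy(0.95, 1)
hard_bce = binary_cross_entropy(0.10, 1)
```

Relative to cross-entropy, the easy example contributes far less to the focal loss than the hard one does, so the many easy majority-class images no longer dominate training.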

Am I able to generate new labeled data if necessary?

You may need to create a dataset from scratch, generate additional data to balance out an already existing dataset, or label new data as it is created. All of these require a data annotation process, which can be costly and time intensive to implement. Services like Amazon Mechanical Turk and Figure-Eight can help outsource this task and accelerate the process.

Make sure that the people annotating your work have the appropriate expertise! It may be worthwhile to invest in training internal staff or building proprietary annotation software if your task requires domain-specific expert knowledge.

What Are My Expectations?

Machine learning is not magic. It is important to remain realistic about performance expectations, and evaluate what would count as a successful deployment. The results you want to achieve will determine the model you employ, the way you manage your data, and the feasibility of the solution as a whole.


Often, it is not sufficient to use accuracy on its own to understand how well a model is performing. It is therefore important to gain a strong grasp of precision, recall, and F1 scores. These give a better indication of what areas the model is performing well in, and where it may need improvement. Precision can be thought of as how exact your predictions are and recall can be thought of as how complete the model was in capturing all cases.

Figure 2: Understanding Precision and Recall (infographic)

Let’s take the ripe strawberry binary classifier as an example.

A model with high precision would return very few false positives, meaning that most strawberries classified as ripe would indeed be ready to pick. However, if the model had low recall, it would only detect a small fraction of all the ripe strawberries in the scene. This is the upper left quadrant in Figure 2.

A model with high recall would return very few false negatives, meaning it would be able to detect and accurately classify almost all of the ripe strawberries. However, if the model had low precision, many of the strawberries identified as ripe may be premature or spoiled. This is the bottom right quadrant in Figure 2.

In simple terms, a high precision algorithm would return mostly relevant results and few irrelevant ones, while a high recall algorithm would return most of the relevant results, although it may also return some irrelevant ones along the way.
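
To make these definitions concrete, the sketch below computes precision, recall, and F1 for two imaginary strawberry classifiers: a cautious one that flags only a couple of strawberries, and an eager one that flags everything. The data and function name are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive="ripe"):
    """Compute precision, recall and F1 for the `positive` class."""
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # correctly flagged as ripe
        elif pred == positive and truth != positive:
            fp += 1  # flagged as ripe, but was not
        elif pred != positive and truth == positive:
            fn += 1  # ripe strawberry that was missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# A scene with 10 ripe and 10 unripe strawberries.
y_true = ["ripe"] * 10 + ["unripe"] * 10

# Cautious model: flags only 2 strawberries, both truly ripe.
cautious = ["ripe"] * 2 + ["unripe"] * 18
# Eager model: flags every strawberry as ripe.
eager = ["ripe"] * 20

cautious_scores = precision_recall_f1(y_true, cautious)
eager_scores = precision_recall_f1(y_true, eager)
```

The cautious model has perfect precision but finds only a fifth of the ripe fruit, while the eager model finds all of it at the cost of being wrong half the time. F1 combines the two into a single number.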

The accuracy, precision, and recall outcomes required will depend on the problem you are trying to solve. Ask yourself what outcomes are currently achievable by human experts, and use their success as a benchmark.


The speed at which you want your algorithm to return an output depends on the problem you are trying to solve. If you are deploying a multiclass classifier to identify a type of weed so you can spray it with an appropriate herbicide, you will want to return results within seconds. On the other hand, if you are using semantic segmentation on drone photography to gain insights on your crop yield, you do not need an immediate result.
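
A simple way to turn a speed requirement into something testable is to time the model against an explicit budget. The sketch below does this with a stand-in function in place of a real model. The names dummy_predict and BUDGET_SECONDS, and the one-second budget itself, are assumptions for illustration only.

```python
import statistics
import time


def measure_latency(predict, sample, runs=20):
    """Call predict(sample) `runs` times; return (median, worst) in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), max(timings)


def dummy_predict(image):
    """Hypothetical stand-in for a real model call."""
    return sum(image) % 3


# Assumed budget: the weed sprayer must get an answer within one second.
BUDGET_SECONDS = 1.0

fake_image = list(range(1000))
median_s, worst_s = measure_latency(dummy_predict, fake_image)
meets_budget = worst_s <= BUDGET_SECONDS
```

Checking the worst case rather than the average is a deliberate choice here, since a sprayer that is occasionally too slow will still miss weeds.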

The speed and accuracy required for your solution will determine the methods you employ.


With recent legislation passed in California, as well as GDPR in Europe, data security is top of mind for many public officials and executives. You will need to make a calculated decision about where you want to host your data, how to stream it, and where to perform your computation.

Cloud-based models may be cheaper and more convenient, but hosting your model on-site will give you full control and visibility into your data.

Transparency and Bias

It is critical to be mindful of bias that may creep into your model as a result of imbalances in your training data. Addressing it is not only the ethical choice, but will also result in more accurate outcomes. There is a growing body of research in areas such as explainability and transparency, and big companies like IBM and Microsoft have taken public stances on the issue.

Key Takeaways

By some estimates, AI will have an annual impact of up to $15.4 trillion in economic value across all verticals. The specifics of implementing these use cases will differ depending on your industry, market, and internal processes. We believe that the underlying business fundamentals are the same across all applications and that having a rigorous and structured approach to roadmapping your AI journey is imperative to success.

At the end of the day, picking the right partner to define and deploy your AI solution will improve your chances of a successful project and accelerate your transformation.

Interested in starting your AI journey? Contact us today.
