So much of what we do in our day-to-day lives depends on vision. We use it to get from point A to point B, read the news, and understand each other’s emotions. The ability to see is such an important aspect of navigating our physical world that we even use it to describe the metaphysical. For example, when we finally understand a hard-to-grasp topic, we proclaim “oh, I see!”
We know a bit about how human vision works, but what about computer vision? The challenge of building machines capable of perceiving the world as effectively as we can has plagued researchers for many years, but with recent advancements in deep learning and artificial intelligence, we have finally achieved major scientific breakthroughs.
In 2015, researchers developed models that were able to outperform humans in recognizing a variety of objects in images, and computer vision models are already being leveraged to power everything from the intelligent image search on your iPhone to the perception engine behind Tesla’s Autopilot.
This will be the first in a series of deep dive articles on image and video understanding in the context of AI. We will treat images and videos as one and the same, since a video is just a succession of images. This article will serve as an introduction, part two will discuss current applications, and part three will provide a framework for implementing image and video understanding in your own company or product.
Foundations of Vision
To understand why vision has historically been such a hard task for computers to manage, we should first touch on how human vision works. The modern understanding of visual perception comes from research done with cats back in the 1960s. The researchers, David Hubel and Torsten Wiesel, found that certain neurons in the brain reacted only to a particular kind of stimulus, such as a line in a specific place and orientation in the visual field. Building on these findings, we have since discovered that human vision is hierarchical. Neurons that detect simple features, such as lines and edges, feed into those that detect more complex features, such as shapes and textures, which eventually feed into complex representations of objects. These findings would later serve as a foundation for building computer vision systems.
Armed with this knowledge, computer scientists in the 1990s attempted to extract and hand-code features of the visual world into rules that computers could understand. This was a laborious process: it involved collaborating with human experts in visual perception to extract the important features of an object and then developing a set of rules for classifying it. But researchers quickly hit a wall: combinatorial explosion. The number of rules an engineer would need to write for an algorithm to handle the many, many possible states of the world is potentially limitless.
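To make the combinatorial explosion concrete, here is a toy sketch of the rule-based approach. Every attribute, breed, and threshold in it is hypothetical and chosen for illustration; the point is how quickly exceptions pile up, not the rules themselves.

```python
# A toy, hand-coded rule set for telling cats from dogs.
# All attributes and thresholds are made up for illustration.

def classify(animal: dict) -> str:
    # Rule 1: cats usually have pointed ears...
    if animal.get("ears") == "pointed":
        # ...but so do some dog breeds, so we need an exception.
        if animal.get("breed") in {"german shepherd", "husky"}:
            return "dog"
        return "cat"
    # Rule 2: fall back on size -- but small dogs break this,
    # so yet another exception would be needed, and so on.
    if animal.get("weight_kg", 0) < 6:
        return "cat"
    return "dog"

print(classify({"ears": "pointed", "breed": "tabby"}))  # cat
print(classify({"ears": "pointed", "breed": "husky"}))  # dog
print(classify({"ears": "floppy", "weight_kg": 30}))    # dog
```

Each new exception invites further exceptions (hairless cats, big cats, puppies), which is exactly the wall researchers ran into.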
Take the example of a cat. What exact features distinguish a cat from a dog? What exact features do all cats have in common? Are there any exceptions, and if so, when, where, and how do they apply? If coming up with an exhaustive list of these rules seems difficult to you, you are not alone. It wasn’t until recent innovations in deep learning that researchers were able to build robust computer vision models.
The Rise Of Machine Learning
In 2007, Fei-Fei Li, now director of the Stanford AI Lab, realized that computer vision researchers would never be able to manually encode all the complexity and variability of the natural world. She set out on a mission to build a massive dataset of labelled images, and in 2010 the first ImageNet competition was held. Researchers from around the world compete to develop models that accurately categorize a wide variety of images, and today ImageNet contains 15 million images across more than 20,000 categories.
Although results on ImageNet were acceptable for the first few years of the competition, a breakthrough came in 2012. Using deep convolutional neural networks (CNNs), Geoffrey Hinton’s group from the University of Toronto was able to achieve an error rate roughly half that of the runner-up. CNNs and humans perceive the world in a similar way. Initial layers in the CNN encode simple features, such as lines and edges, which feed into later layers that detect more complex features, such as shapes and textures. These are then used to build complex representations of objects. Sounds pretty similar to the research by Hubel and Wiesel we described earlier!
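A minimal sketch can show what those initial layers do. This is plain Python, not the actual 2012 architecture: it slides a small hand-written filter over a tiny image and responds strongly wherever a vertical edge appears. In a real CNN the filter values are learned from data rather than written by hand.

```python
# Slide a small filter (kernel) over an image and sum the
# element-wise products at each position -- a 2D convolution,
# the core operation of a CNN's layers.

def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw)
            ))
        out.append(row)
    return out

# A tiny "image": dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

# A hand-written vertical-edge detector (a Prewitt-style filter).
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

for row in conv2d(image, kernel):
    print(row)  # each row is [0, 3, 3, 0]: the filter fires only at the edge
```

Stacking layers of such filters, with the outputs of edge detectors feeding shape detectors and so on, gives the feature hierarchy described above.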
As it turns out, building machine learning models that learn from the data directly, as opposed to researchers hand-coding features, allowed the algorithm to encode all the many embodiments of cats, dogs, and every other object in the dataset. This resulted in a paradigm shift in machine learning research and put a focus on building large and robust datasets to train on. These large datasets are crucial to any AI project. Last year, the winning team of the ImageNet competition used CNNs to achieve an error rate of less than 3%. This was incredibly impressive, as even expert human annotators have an error rate of around 5%.
There are three main reasons recent advances in computer vision specifically, and artificial intelligence broadly, have been possible: more data, faster computing, and better algorithms. All three of these were critical for the groundbreaking results on ImageNet in 2012.
Without the millions of images in the ImageNet dataset, the model may not have been able to distinguish between similar categories. Without the use of graphics processing units (GPUs) running in parallel, the model may have taken too long to train. Without novel algorithms, the model may never have learned the patterns inherent in the data at all.
Although we have made incredible strides in the field of image and video understanding, there is still a long way to go before computers will be able to interpret images better than humans. Context, meaning, and reasoning still elude our silicon counterparts. These are just some of the exciting scientific hurdles that we are currently tackling in computer vision as one of our core areas of expertise.
It is important to stay realistic about the applications of computer vision, which is why the next article in this series will touch on the state of the art in research and what kinds of tasks are currently possible.