A picture is worth a thousand words. While this common English adage describes how a single image conveys complex ideas, the computer vision community interprets it to mean how images and videos can enrich the digital world. Whether it is finding our friends in our photo libraries, enabling self-driving cars, or creating new forms of device interaction, the ability of machines to “see” is radically changing the world we live in today. In this first blog of our computer vision series, I will give you an introduction to the field, explain how platforms tag your images, and discuss some of the tasks we solve today.
What is Computer Vision?
The field of computer vision is concerned with how machines can extract interpretable information from images and videos (loosely speaking, how machines can describe a picture in a thousand words). The field has recently become a scientific focus thanks to advances in deep learning, where computers “learn” patterns in order to perform tasks like image classification, description, refinement, object detection, instance segmentation, and more.
Consider cases of medical imagery. A doctor’s time is incredibly valuable, and a lot of it is spent analysing MRI or CAT scans to diagnose a patient’s condition. In an ideal world, the doctor would be spending more time treating patients rather than diagnosing them. Recent advances in deep learning have made it possible for computers to recognize biological patterns in a matter of hours, whereas a doctor would require years of training. In some cases, machines even surpass doctors in accuracy, which is why companies like PathAI are helping doctors speed up and improve pathology diagnostics.
Example: using AI, machines see more than humans do
How do machines “see”?
While our eyes read visual input in the form of light waves, devices read images in a simpler way: through a combination of red, green, and blue pixel values. Think of the days of Nintendo 64, when you scrambled to plug RGB cables into your TV.
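To make this concrete, here is a minimal sketch of what an image looks like to a machine: a grid of pixels, each an (R, G, B) triple. The tiny 2×2 “image” below is illustrative only; real images have millions of such pixels.

```python
# A tiny 2x2 "image": each pixel is an (R, G, B) triple of values in 0-255.
image = [
    [(255, 0, 0), (0, 255, 0)],      # red pixel, green pixel
    [(0, 0, 255), (255, 255, 255)],  # blue pixel, white pixel
]

# Split the image into the three colour channels a machine works with.
red_channel   = [[pixel[0] for pixel in row] for row in image]
green_channel = [[pixel[1] for pixel in row] for row in image]
blue_channel  = [[pixel[2] for pixel in row] for row in image]

print(red_channel)  # [[255, 0], [0, 255]]
```

Deep learning models operate directly on these numeric channel values rather than on anything resembling human vision.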
Image converted to RGB layers
Although people recognize pictures instantly, devices must process millions of data points to do the same. In the past, it was incredibly difficult to read and model all of those data points; today, advances in latency and computation have made it possible for machines to interpret visual input in near real time. Combined with Convolutional Neural Networks (CNNs), pioneered by Yann LeCun, now Chief AI Scientist at Facebook, problems like the medical case described above are much easier to tackle.
Building an Image Recognition model
Let’s describe the process of solving an image recognition task. Consider that you host a platform for sharing travel images. You want to make it as easy as possible for users to find images of interest by automatically tagging specific sceneries and objects visible in an image. To accomplish this, we can build a deep learning model that learns from examples. Once we feel our model has learned enough, we put it into practice by having it tag newly uploaded images.
Different phases of image recognition
The training phase begins by collecting a training set of images with their associated label set (i.e. scenery and object tags). The collected images undergo processing, which consists of scaling the images down and converting them into machine-readable RGB values. We might also augment the data by distorting, scaling, and cropping images so that our model learns to generalize beyond the examples that we show it. Lastly, we train a CNN model that tries to minimize its mistakes (measured by a loss function). This phase continues until the model performs well on a validation subset of our data that it has not “seen” before.
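The preprocessing and augmentation steps above can be sketched in a few lines. This is a simplified illustration, assuming images have already been decoded into nested lists of RGB pixels; a real pipeline would use a library such as Pillow or TensorFlow for resizing and richer augmentations.

```python
def normalize(image):
    """Convert 0-255 RGB values to the 0.0-1.0 range a model expects."""
    return [[tuple(value / 255.0 for value in pixel) for pixel in row]
            for row in image]

def horizontal_flip(image):
    """A simple augmentation: mirror the image left-to-right."""
    return [row[::-1] for row in image]

def augment(image):
    """Yield the original image plus its flipped variant as extra training data."""
    yield image
    yield horizontal_flip(image)

# Toy 1x2 image: one red pixel, one blue pixel.
image = [[(255, 0, 0), (0, 0, 255)]]
print(horizontal_flip(image))  # [[(0, 0, 255), (255, 0, 0)]]
```

Each augmented variant is treated as an additional labelled example, which is what helps the model learn beyond the exact pictures we show it.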
Once we are confident in the model’s ability to make predictions, we can use it on new images shared by users. The images undergo the same processing as in the training phase and are fed into the trained model. Model serving refers to taking a trained model into production so that it can fulfil requests (often made as API calls). We may add post-processing steps, like filtering out predictions below a confidence threshold or applying other heuristics, depending on how our users interact with the predictions. Lastly, we expose the tags (like nature scene, mountain, etc.) on our platform, allowing users to search for new images of interest.
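A minimal sketch of such a post-processing step is shown below. The tag names, scores, and the 0.6 threshold are all hypothetical; the model is assumed to return a mapping from tag to confidence score.

```python
CONFIDENCE_THRESHOLD = 0.6  # tuned to how users interact with predictions

def postprocess(predictions, threshold=CONFIDENCE_THRESHOLD):
    """Keep only the tags the model is sufficiently confident about,
    ordered from most to least confident."""
    return sorted(
        (tag for tag, score in predictions.items() if score >= threshold),
        key=lambda tag: -predictions[tag],
    )

# Hypothetical raw model output for one uploaded travel photo.
raw = {"nature scene": 0.92, "mountain": 0.81, "beach": 0.12}
print(postprocess(raw))  # ['nature scene', 'mountain']
```

Only the surviving tags are exposed to the platform’s search, so a stricter threshold trades recall for precision in what users see.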
How we use it
While building a travel image tagger is an easily recognizable example, computer vision today is shaping the economy’s largest sectors. Whether it’s optimizing print media’s advertisement placement, adding new search queries for housing platforms, or predictively maintaining power plants, we at Cognizant are actively taking a role in bringing computer vision to industry. By applying our theoretical and practical know-how, we are enabling our customers to see the value they can derive from images. In the next post, we will describe one of those cases in detail.
We can help you to discover the potential of computer vision for your business through one or more interactive ideation sessions. If you are interested in this topic or believe computer vision can add value to your business, please contact Sebastian Vermaas for an inside look into how we can design your solution.