Based on previous experiments with custom object classification and image recognition, we came up with the idea of automatically detecting beer and soda bottles. So... holding a bottle in front of a camera should be enough to recognize the brand, no matter the angle or position at which the bottle is held.
The picture below illustrates our concept:
One way to detect your own objects is to train a custom model based on deep learning methods. Deep learning is a type of machine learning that produces promising results for image classification tasks.
A great tool for this kind of challenge is TensorFlow, an open source software library for numerical computation, originally developed by researchers from the Google Brain Team. TensorFlow is general enough to be applicable in a wide variety of domains, including our bottle recognition challenge. Google itself, for example, uses TensorFlow for Google Translate, image recognition and text-to-speech services.
How does it work?
Training in TensorFlow is an iterative process: for the training images, the network compares its output with the expected result and uses gradient descent to adjust its internal weights so that the error shrinks a little with every step. In other words, it gradually learns a function that transforms the given input into the expected output. In our case, the input is an array of bytes from a captured video frame. The output is an array as well: the probabilities of each known class in our classifier. A class can be anything you want to recognize, for example a chair, an airplane or a door. This result is called the output layer. So, if we can match 10 categories, our output layer will consist of 10 nodes. After processing an image, each of the nodes will contain a value between 0 and 1, indicating the likelihood that the specific category was matched. A little bit confused? A great TensorFlow starter’s guide can be found here.
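To make the output layer a bit more concrete, here is a minimal sketch in plain Python (no TensorFlow needed). It assumes the final layer uses a softmax, which is a common way to turn raw scores into values between 0 and 1 that add up to 1. The class names and scores are made up for illustration and are not taken from our actual model.

```python
import math


def softmax(scores):
    """Turn raw output-layer scores into probabilities that sum to 1."""
    # Subtracting the highest score keeps math.exp from overflowing.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(labels, scores):
    """Return the most likely label together with its probability."""
    probabilities = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best], probabilities[best]


# Hypothetical classes and raw scores for a single video frame.
LABELS = ["cola", "lemonade", "lager"]
raw_scores = [2.0, 0.5, 0.1]

label, confidence = top_prediction(LABELS, raw_scores)
print(f"{label}: {confidence:.2f}")
```

With 10 categories, the two lists would simply have 10 entries each, one per output node.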
We took about 400 images per class (bottle) and retrained a new classifier. During our experiment, we were able to train our network to an accuracy of 89.5%. It's all about finding the right learning rate and number of training steps. For training the model, we used the standard retraining script that is included with the TensorFlow distribution. This allowed us to test quickly and iterate without having to deal too much with the specifics of building the neural net. Next, we connected a simple Java application that uses the TensorFlow library to a stream of images coming from the webcam. This allowed us to perform real-time analysis on the video stream from the webcam. In total, we processed 4,500 images from different angles and lighting conditions.
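Why the learning rate and the number of training steps matter so much is easiest to see on a toy problem. The sketch below is not the retraining script and has nothing to do with our bottle images: it fits a single weight with plain gradient descent, and the function name `train` and the data are hypothetical. It only illustrates the trade-off we ran into.

```python
def train(learning_rate, steps):
    """Fit one weight w so that w * x approximates y, using gradient descent."""
    # Toy data generated by y = 3 * x, so the ideal weight is 3.
    xs = [1.0, 2.0, 3.0]
    ys = [3.0, 6.0, 9.0]
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Take one step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return w, loss


# Same number of steps, three different learning rates.
for lr in (0.001, 0.05, 0.5):
    w, loss = train(lr, steps=50)
    print(f"learning rate {lr}: w = {w:.4g}, loss = {loss:.4g}")
```

A learning rate that is too small does not get close to the ideal weight within 50 steps, a reasonable one converges quickly, and one that is too large overshoots further with every step. Giving the small learning rate many more steps also gets it there, which is why the two parameters have to be tuned together.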
We learned a lot about neural networks: how they are created and how they are trained. We know more about the parameters you can tweak when training a network like this, such as the learning rate and the number of training steps. Another important lesson is that being able to run the net locally allows you to perform a very quick and accurate analysis of a video stream. The quality of the photos is also a very important factor in how well your net performs. We concluded that we were not able to train the network any further because of the quality of our data set. Next time we need to take more time to clean up the data set and take even more pictures.
*Testing our model: it can detect different brands with high confidence.*
This time we chose to build our network on top of a network pre-trained on ImageNet. Although this leads to good results, it also results in a net that is about 80 MB in size, which is far from ideal if we wanted to use it on a mobile platform. A next step could be to train on top of MobileNet, which would result in a network somewhere around 1 to 10 MB in size, depending on the source net we train on top of. Going one step further than object recognition, we would not only want to classify the object, but classify and localize all objects in an image. Luckily for us, TensorFlow is just the right tool for the job ;)