During the AWS re:Invent event in Las Vegas last year, AWS introduced the Deeplens. This pretty-looking AI-driven camera was announced as the "world's first deep-learning enabled video camera for developers of all skill levels to grow their machine learning skills through hands-on computer vision tutorials, example code, and pre-built models".
As part of the AWS Amsterdam Meetup earlier this year, I took on the challenge to deep-dive into the hardware, built a nice solution, and presented my experiences to the audience. In this blog I will tell you more about my journey with the Deeplens.
AWS Deeplens AI camera
Together with my colleague Lotte and Thijs from Binx.io, we started with a brainstorming session to discuss the characteristics of the Deeplens. One of its nice features is the ability to run inference Lambda functions locally, without needing to upload the video stream. This means the Deeplens can act without an internet connection: you could, for example, detect the presence of a visitor in front of your door and trigger the door to open locally.
Various ideas were brought up, such as having people register with their phone and, combined with Deeplens face detection, being able to recognize them. Because we wanted to have some results within a few weeks, we settled on a game of Tic-Tac-Toe.
The concept was relatively simple: using the Alexa Echo Spot, players can start a game with their voice and register themselves. To make a move, the player would hold their open hand in front of the Deeplens, with the hand's position relative to either their face or the camera frame deciding which Tic-Tac-Toe square would be marked. Next, the player would call out to Alexa to register the move.
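The position-to-square mapping boils down to dividing the frame into a 3x3 grid. A minimal sketch, assuming we take the centre of the hand's bounding box relative to the camera frame (the function and parameter names here are ours, not from the actual project code):

```python
def square_for_position(cx, cy, frame_w, frame_h):
    """Map the centre (cx, cy) of a detected hand to one of the nine
    Tic-Tac-Toe squares, numbered 0-8 row-major (top-left = 0)."""
    # Scale the coordinate into thirds of the frame; clamp to 2 so a
    # point exactly on the right/bottom edge still lands in the grid.
    col = min(int(cx / frame_w * 3), 2)
    row = min(int(cy / frame_h * 3), 2)
    return row * 3 + col
```

For example, a hand centred in the frame lands in square 4, the middle of the board.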
Initially we thought this would be a relatively simple challenge: train a machine-learning model to detect a hand, use the output to update the game logic state, and have Alexa query the game state and display it. Unfortunately, it turned out to be a bit more complex, partly because we had little experience in the world of Deep Learning.
AWS SageMaker provides tooling to make this training process relatively straightforward. After grabbing a series of images from Google Image Search and a few local pictures to train on, we thought: "Now simply train the model and let it run!" But how do you train a model? In fact, what "model" can we even use?
Into the Deep
After diving into the matter and learning a ton of new things, starting with the different Deep Learning frameworks such as TensorFlow, MXNet, Caffe, etc., we quickly found out that the "simple" problem was not nearly that simple. For example, since we needed the position of the hand in addition to the classification, we required a model that does localization as well.
In order to train a model for localization, all training images needed to be "labeled" - in other words, every hand in every image needed to be defined with coordinates. There are several tools for doing this, but all involve going through the images and manually drawing a lot of rectangles. One video that helped our understanding of the topic a lot is TensorFlow and Deep Learning without a PhD, a real recommendation to view when you're not a Data Scientist like us but still want to get some understanding of how these models work.
Next, we wrestled with the examples and tried to incorporate our own training set into them. MXNet has a demo showing how to train SSD (Single Shot Detector) models; those examples were built on the PASCAL VOC image set.
Finally, we did manage to get a working trained model out of it, but not only did it take quite a while to train, its accuracy was also terrible. This was probably due to the low quality and wild variation of our training images, combined with not knowing exactly what to do with the model.
Testing with bottles
Due to the lack of time, we decided to use the bottle object-detection example from AWS for our demo. Whenever the model detected a bottle with high enough certainty, the Python script would calculate the Tic-Tac-Toe square of the object and write an output frame for local viewing. Whenever that square differed from the previous frame, it would also push an MQTT message with the details to AWS IoT.
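The "only publish on change" behaviour is plain de-duplication against the previous frame. A sketch of that filtering step (the actual MQTT publish to AWS IoT would happen where the comment indicates, via the SDK available on the device, which we leave out here):

```python
def changed_squares(detections):
    """Given an iterable of detected squares (one per frame), yield a
    square only when it differs from the previous frame, so AWS IoT is
    not flooded with identical MQTT messages."""
    previous = None
    for square in detections:
        if square != previous:
            previous = square
            # In the real script: publish an MQTT message to AWS IoT here.
            yield square
```

With per-frame detections like `[4, 4, 4, 5, 5, 4]`, only the transitions `4, 5, 4` would result in messages.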
Detecting objects with AWS Deeplens
Next, a rule in AWS IoT parsed this message and stored the result for later access by the game logic Lambda. We picked the Systems Manager Parameter Store for this because it was the quickest to code, but in retrospect that was a poor choice: the frequent updates (in some cases several times per second) caused consistency issues, and reads from SSM often returned already-outdated values.
Finally, the Alexa skill function would call a game logic Lambda that read the SSM parameter holding the last bottle location and updated the game state (also stored in SSM, which worked fine there because it was updated far less often). When a player asked "Alexa, register my move", the game logic would update the game state, render a new board image to show on the Echo Spot, and Alexa would give the player feedback on the move.
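The game logic itself is ordinary Tic-Tac-Toe bookkeeping. A minimal sketch of the move-registration step (our actual Lambda also persisted state to SSM and rendered the board image, which we leave out; the names here are illustrative):

```python
# The eight possible winning lines on a 3x3 board (squares 0-8, row-major).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def register_move(board, square, player):
    """Place `player` ('X' or 'O') on `square` (0-8) of `board`, a list
    of nine cells holding 'X', 'O', or None. Returns the winning player,
    'draw' when the board is full, or None if the game continues."""
    if board[square] is not None:
        raise ValueError("square already taken")
    board[square] = player
    for a, b, c in LINES:
        if board[a] == board[b] == board[c] is not None:
            return board[a]
    return "draw" if all(cell is not None for cell in board) else None
```

In the demo flow, the `square` argument would come from the last bottle location stored in SSM, and the return value would drive Alexa's spoken feedback.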
Here is an overview of the architecture:
During the AWS Cloud Meetup, we gave our talk about our Deeplens adventure and showed the demo. Luckily this went quite well, and it was well received by the audience.
All in all it was a great learning experience and we had a lot of fun getting it to work!
AWS Amsterdam Meetup @ Mirabeau - a dive into AWS Deeplens