Because we believe people will interact differently with digital touchpoints in the future, it is important to experiment and find out what works and what doesn't. For example: what if you could interact with a mirror without touching it or pressing specific buttons? Within our innovation lab we are exploring several ways to detect gestures using a simple webcam.
From a technical point of view the challenge is clear: what kind of technology or AI model do we need for proper gesture detection? From a more human point of view, the challenge is that people move very differently. So we defined three approaches to find out what works best:
- Pose estimation data plus some math to calculate the performed gesture
- Pose estimation data plus deep learning to train a model that detects specific gestures
- Action recognition: machine learning that detects specific actions by analyzing video frames
Approach 1: Pose estimation with some math
Using tf-pose-estimation, a TensorFlow-based implementation of OpenPose, we are able to detect human body, hand, face and foot keypoints in a single image or frame. When processing a video stream, tf-pose-estimation constantly provides you with the coordinates of the different body parts, resulting in a stream of movement data.
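That stream of coordinates can be pictured as follows – a minimal sketch, assuming each frame yields a dict of (x, y) positions per body part. The part names and the `frame_to_vector` helper are illustrative, not the actual tf-pose-estimation API:

```python
# Flatten one frame of pose keypoints into a fixed-length feature vector.
# Part names are illustrative; tf-pose-estimation exposes its own part enum.
PARTS = ["nose", "neck", "r_shoulder", "r_elbow", "r_wrist",
         "l_shoulder", "l_elbow", "l_wrist"]

def frame_to_vector(keypoints):
    """keypoints: dict mapping part name -> (x, y) in [0, 1] image coords.
    Parts that were not detected in this frame are filled with (0.0, 0.0)."""
    vec = []
    for part in PARTS:
        x, y = keypoints.get(part, (0.0, 0.0))
        vec.extend([x, y])
    return vec

frame = {"nose": (0.5, 0.2), "r_wrist": (0.8, 0.6)}
print(len(frame_to_vector(frame)))  # 16 features: 8 parts * (x, y)
```

Feeding every frame through such a step turns the video into a time series of fixed-length vectors, which is what the approaches below operate on.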
To detect gestures – like a swipe movement – the idea is to process the OpenPose data and detect specific gestures using a few algorithms.
To do this, we simply compare all the x- and y-coordinates of the ‘swipe hand’ with pre-stored swipe gesture data. Of course, it is a bit more complicated in practice, because timestamps are also needed to detect the speed of movements, for example.
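A naive version of that comparison can be sketched like this – the template values and tolerance are made up for illustration, and a real implementation would also handle timing and resampling:

```python
import numpy as np

# Pre-stored x-coordinates of the wrist during a reference right swipe,
# sampled at a fixed frame rate (values are illustrative).
SWIPE_TEMPLATE = np.linspace(0.2, 0.8, 10)
TOLERANCE = 0.1  # maximum mean deviation we still accept

def matches_swipe(wrist_xs):
    """Compare an observed wrist trajectory with the stored template.
    Both must contain the same number of samples."""
    observed = np.asarray(wrist_xs, dtype=float)
    if observed.shape != SWIPE_TEMPLATE.shape:
        return False
    return float(np.mean(np.abs(observed - SWIPE_TEMPLATE))) < TOLERANCE

print(matches_swipe(SWIPE_TEMPLATE + 0.05))  # close to the template -> True
print(matches_swipe(SWIPE_TEMPLATE[::-1]))   # reversed movement -> False
```

The second call also shows the weakness discussed next: any trajectory that strays from the stored path, even a perfectly intentional swipe, is rejected outright.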
However, the disadvantage of this approach is that it is difficult to handle deviations. A specific gesture has to generate more or less the same data points our math model expects, otherwise the gesture won’t be detected at all. This method therefore won’t support very natural movements, only fixed movements that follow an exact path.
Approach 2: Deep learning with pose estimation data
So we decided to explore deep learning approaches. The idea behind deep learning is that a model can learn by itself by looking at large amounts of data. In the best case, the gesture model is both accurate and fast, running in real time. In practice, however, deeper models that use richer data perform better, which comes at the cost of speed. That is why we started with a simple version of the model that uses only pose estimation data.
We went for a well-known recurrent LSTM model and trained it on pre-recorded swipe movement data. We asked a few volunteers in the office to perform the swipes, which were stored as skeletons (much lighter than RGB input). To improve the model, we also recorded movements that could be confused with a swipe (e.g. picking up a phone, raising a hand or waving to someone). Altogether, our training set consisted of labeled data with three gesture classes (left swipe, right swipe and no swipe).
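Shaping the recorded skeletons into training samples can be sketched as follows – a simplified version, assuming each frame has already been flattened into a fixed-length feature vector. An LSTM consumes sequences of shape (timesteps, features), so each clip is cut into fixed-length windows; the window length, feature count and step size below are illustrative:

```python
import numpy as np

SEQ_LEN = 30       # frames per training sample (illustrative)
NUM_FEATURES = 36  # e.g. 18 keypoints * (x, y)

def make_windows(frames, label, step=10):
    """Cut one recorded clip (num_frames, num_features) into overlapping
    fixed-length windows, each labeled with the clip's gesture class."""
    xs, ys = [], []
    for start in range(0, len(frames) - SEQ_LEN + 1, step):
        xs.append(frames[start:start + SEQ_LEN])
        ys.append(label)
    return np.array(xs), np.array(ys)

# One fake 60-frame recording labeled as class 0 ("left swipe").
clip = np.random.rand(60, NUM_FEATURES)
X, y = make_windows(clip, label=0)
print(X.shape)  # (4, 30, 36): four windows, ready to feed into an LSTM
```

Overlapping windows also act as cheap data augmentation, which helps when the number of recorded volunteers is small.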
The results of the deep learning approach are promising: the model achieves up to 90% validation accuracy on the pre-recorded clips of gestures. Before launching the model on a real-life stream, however, we want to experiment with more input data types and architectures. Here it looks promising to combine recurrent models with convolutional models running on video frame data, which may still achieve real-time performance.
Approach 3: Deep learning with video data
Another promising approach is action recognition, which is about recognizing different actions from a video file or stream. By analyzing sequences of frames, the model decides whether a specific action – like a swipe gesture – is or isn’t being performed. Read more about it in this great blog article.
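Whatever model does the per-frame classification, running it on a live stream usually needs some temporal smoothing so that single-frame glitches don't trigger false swipes. A minimal sketch, with made-up label names and a majority vote over a short window:

```python
from collections import Counter, deque

def smooth_predictions(per_frame_labels, window=5):
    """Stabilize a noisy stream of per-frame action labels by taking
    a majority vote over the last `window` predictions."""
    recent = deque(maxlen=window)
    smoothed = []
    for label in per_frame_labels:
        recent.append(label)
        smoothed.append(Counter(recent).most_common(1)[0][0])
    return smoothed

# A noisy stream: the stray early "swipe" frame gets voted away,
# while the sustained swipe at the end survives.
stream = ["none", "none", "swipe", "none", "none", "swipe", "swipe", "swipe"]
print(smooth_predictions(stream))
```

The window size trades responsiveness for stability: a larger window suppresses more noise but delays the moment a genuine gesture is reported.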
Although this approach costs a lot more in terms of computing power, the results might be more accurate. So we decided to continue our gesture detection journey using this approach. Please stay tuned for updates!