Action Recognition, the science of detecting human interactions

Daniël Jrifat

With the increasing popularity of and demand for deep learning applications, many interesting new use cases have been developed. Think of applications that offer emotion detection, age prediction, object recognition and much more. A field you might not have heard about yet is (human) action recognition: the art of detecting specific actions or gestures. In this blog I take a closer look at the different detection models available for action recognition and talk about my smart shelf experiment.


Action recognition

The human body is capable of making many different movements and gestures, and detecting them all with computers is still a challenge. The field of study concerned with classifying these movements is called human action recognition. The classifications are based on data obtained from sensors or cameras. The difficulty lies in the variety of movements a person can make and in the environment in which those movements are recorded. For example, too much light or a noisy background can wash out or occlude parts of the image, which leads to faulty classifications.

This variety makes classification more difficult, and sometimes very complex! Variety and occlusion are not the only challenges: a lack of computational power can also be a problem, especially if you want to process every frame of a video.

Deep learning networks

Luckily there are solutions for that, in the form of pre-trained deep learning networks and object detection models. Many of these ready-to-use networks are optimized for computers with limited compute power and are easily accessible to everyone. Thanks to these models we don’t have to take on the complex challenge of training our own deep learning network.

Below is an overview of some of the available deep learning networks. The models are placed in the chart based on their accuracy (visible on the x-axis) and number of operations. For now, just focus on the accuracy.

Picture 1 Available deep learning models and their accuracy scores (source)

Detection models

The table below contains an overview of some of the most commonly used object detection models. The first column displays the model name and the top row indicates the dataset the model was trained and tested on. COCO and PASCAL VOC are the two most commonly used datasets for training.

Picture 2 Overview of the mAP scores on the 2007, 2010, 2012 PASCAL VOC dataset and 2015, 2016 COCO datasets (source).

For my experiment I chose to use the MobileNet model in combination with an SSD detection model (highlighted blue in the table above). I found that both models strike a workable balance between accuracy and speed. Besides, MobileNet is designed for edge devices such as Raspberry Pis, smartphones, or any other device with limited computing power. That is a great advantage for low-budget experiments.
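To make the detection step a bit more concrete, here is a minimal sketch of how raw SSD output could be filtered before it is used any further. It assumes the row layout that OpenCV's dnn module returns for SSD-style models (image id, class id, confidence, then a box normalized to 0..1). The function name, the 0.5 threshold, the class id and the sample rows are hypothetical and only serve as illustration.

```python
from typing import Iterable, List, Sequence, Tuple

# (class id, confidence, (x_min, y_min, x_max, y_max) in pixels)
Detection = Tuple[int, float, Tuple[int, int, int, int]]


def filter_detections(
    rows: Iterable[Sequence[float]],
    frame_width: int,
    frame_height: int,
    min_confidence: float = 0.5,
) -> List[Detection]:
    """Keep confident detections and convert their boxes to pixels.

    Assumed row layout (an assumption, see the text above):
    [image_id, class_id, confidence, x_min, y_min, x_max, y_max]
    with the box coordinates normalized to the range 0..1.
    """
    kept: List[Detection] = []
    for row in rows:
        class_id = int(row[1])
        confidence = float(row[2])
        if confidence < min_confidence:
            continue  # drop weak detections
        box = (
            round(row[3] * frame_width),
            round(row[4] * frame_height),
            round(row[5] * frame_width),
            round(row[6] * frame_height),
        )
        kept.append((class_id, confidence, box))
    return kept


# Hypothetical output for one 300x300 frame: one confident hit, one weak one.
sample_rows = [
    [0, 15, 0.92, 0.25, 0.25, 0.50, 0.75],
    [0, 15, 0.30, 0.60, 0.10, 0.80, 0.40],
]
detections = filter_detections(sample_rows, frame_width=300, frame_height=300)
print(detections)
```

A lower threshold keeps more detections at the cost of more false positives, which is the same accuracy versus speed trade-off in miniature.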

Experiment – tracking consumer behavior

I originally set out to create a smart shelf that can be used to monitor consumer actions inside a store, giving retail store managers new actionable insights. To achieve this, I try to gain insight into the customers in front of the shelf, in particular their facial expressions. Think of the expressions you show when looking at a product or commercial.

Below is a small overview of the data that’s being captured:

  • Face tracking (follow the face through the video stream)
  • Hand movement
  • Face metrics (emotions, such as joy, surprise, anger, sadness…)
  • Face landmarks (important facial structures, jaw, eyes, nose…)
  • Number of people in front of the shelf
  • Dwell time (how long a person spends in front of the shelf)
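
The last two items in this list can be derived from tracking output alone. The sketch below assumes the face tracker already assigns a stable id to every face and reports which ids are visible at each timestamp. That data layout and both function names are hypothetical; this is not the code of the experiment itself.

```python
from typing import Dict, List, Set, Tuple

# (timestamp in seconds, ids of the faces visible at that moment)
Observation = Tuple[float, Set[int]]


def dwell_times(observations: List[Observation]) -> Dict[int, float]:
    """Seconds between the first and last sighting of every tracked id."""
    first_seen: Dict[int, float] = {}
    last_seen: Dict[int, float] = {}
    for timestamp, track_ids in observations:
        for track_id in track_ids:
            first_seen.setdefault(track_id, timestamp)  # only set once
            last_seen[track_id] = timestamp             # always updated
    return {tid: last_seen[tid] - first_seen[tid] for tid in first_seen}


def peak_people_count(observations: List[Observation]) -> int:
    """Largest number of people visible in front of the shelf at once."""
    return max((len(ids) for _, ids in observations), default=0)


# Hypothetical tracker output: person 1 arrives first, person 2 stays longer.
sample = [
    (0.0, {1}),
    (1.0, {1, 2}),
    (2.0, {1, 2}),
    (3.0, {2}),
    (4.0, {2}),
]
print(dwell_times(sample))
print(peak_people_count(sample))
```

Note that this simple version counts the time between first and last sighting, so a person who walks away and returns is treated as having stayed the whole time.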

Take a look at the video below, which shows what the action recognition model is capable of at this stage of the project. Thoroughly testing with different people is a challenge, however, due to the current #STAYATHOME situation. This gives our creative minds the opportunity to find alternative ways of testing. I simply play a video containing multiple people on my phone and hold the device in front of the camera. It is simple, but it works ;)

Action Recognition

New opportunities

With society staying at home indefinitely, the original purpose of the smart shelf experiment slowly fades away. This absolutely does not mean that the experiment is a waste or has failed. This “problem” brings the opportunity to find new uses and implementations for the smart shelf. The different components used for my smart shelf experiment are designed in such a generic way that they can easily be reused for other use cases. After all, the shelf just captures data, and you can do whatever you want with that data. Here are some ideas I have for implementations outside retail:

  • Mood detector. Detect your mood at home and change the lights or music accordingly.
  • Self-serving restaurants. The shelf would keep track of what a customer puts on his/her tray.
  • Smart storage. Keep track of products on storage shelves (at home or in restaurants) and automatically add a product to the shopping list when its stock drops below a minimum threshold.
  • Social distance checker. Detect if people follow the 1.5m distance guideline.
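
As an illustration of the last idea, a social distance check boils down to comparing pairwise distances. The sketch below assumes the positions of the detected people have already been converted to floor coordinates in meters, which would require a camera calibration step that is not covered here. The function name and sample positions are hypothetical.

```python
from itertools import combinations
from math import hypot
from typing import Dict, List, Tuple

Position = Tuple[float, float]  # (x, y) on the floor, in meters


def too_close_pairs(
    positions: Dict[str, Position],
    min_distance: float = 1.5,
) -> List[Tuple[str, str]]:
    """Return every pair of people standing closer than min_distance."""
    violations = []
    for (name_a, pos_a), (name_b, pos_b) in combinations(sorted(positions.items()), 2):
        distance = hypot(pos_a[0] - pos_b[0], pos_a[1] - pos_b[1])
        if distance < min_distance:
            violations.append((name_a, name_b))
    return violations


# Hypothetical floor positions of three detected people.
people = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (3.0, 0.0)}
print(too_close_pairs(people))
```

Comparing every pair is fine for the handful of people that fit in front of one camera; a larger crowd would call for a smarter spatial lookup.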

Next steps

The next step would be to incorporate action recognition. This could lead to a function with which the smart shelf can detect whether a customer intends to buy a product or is trying to steal something. Additional features such as face mask detection could also be implemented, which could help check whether mandatory face masks are actually being worn.