Using data science to predict lunch traffic - Part 1
Using data science to predict how many people will have lunch at our office is one of the appealing learning exercises we are working on at our Digital Studio. In this blog post I would like to share some of the latest results: how we analyze the data and which approaches we use to gather insights.
In the previous blog post you can read all about how the lunch data is collected using sensors and smart scales. With this data, my goal is to predict how many people will have lunch at the office each day. This is a valuable insight for the food ordering process, and it also gives an indication of the number of people working in our Digital Studio.
Besides the data collected by the scales, I use other internal and external data, for example weather data, IoT sensor data from the office (temperature, noise level and presence) and holiday calendar data, to see if there is any correlation.
It is all about analyzing
Making a prediction starts with analyzing the data. The dataset collected from the scales consists of a timestamp and the weight on the scale at that time. One of the first tasks was removing outliers, caused for example by people pressing on the scales to see if they were working. After removing the outliers, the next step was to make some visualizations.
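One simple way to drop such outliers is Tukey's interquartile-range rule. The sketch below is a minimal illustration, not the studio's actual pipeline; the column names and example weights are made up.

```python
import pandas as pd

def remove_outliers(df, column="weight", k=1.5):
    """Drop rows whose value falls outside the IQR fences (Tukey's rule)."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] >= lower) & (df[column] <= upper)]

# Example: a brief press on the scale shows up as an implausible spike
readings = pd.DataFrame({
    "timestamp": pd.date_range("2019-01-01 11:00", periods=6, freq="min"),
    "weight": [5.1, 5.3, 5.2, 48.0, 5.0, 5.2],  # 48.0 = someone pressing the scale
})
clean = remove_outliers(readings)
```

With these example readings, the 48.0 kg spike falls far outside the fences and is removed, while the normal plate-stack weights around 5 kg are kept.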
Consider, for example, the average weight on each of the separate scales plotted over time. The plot shows a clear peak at 12 PM: that is when lunch starts and when most people are having lunch.
First, we wanted to see how many plates were used per day. To get the number of plates, I divided the total weight decrease by the average weight of a plate. Fun fact: each plate has a slightly different weight, so that was another challenge I had to face. The outcome is the number of plates used per scale per day: on an average day, 127 plates are used. The plot below shows the average number of plates used per weekday.
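The computation itself is straightforward: sum every weight decrease between consecutive readings and divide by the average plate weight. Below is a minimal sketch; the average plate weight and the example readings are assumptions for illustration, not the studio's measured values.

```python
AVG_PLATE_WEIGHT_KG = 0.45  # hypothetical average; the real plates varied slightly

def plates_used(weights):
    """Estimate plates taken from a scale's weight trace: sum every
    decrease between consecutive readings, divide by the plate weight."""
    total_decrease = sum(
        prev - cur for prev, cur in zip(weights, weights[1:]) if cur < prev
    )
    return round(total_decrease / AVG_PLATE_WEIGHT_KG)

# e.g. a stack going 9.0 -> 8.1 -> 8.1 -> 6.75 kg over lunch:
# decreases of 0.9 + 1.35 = 2.25 kg, i.e. 5 plates at 0.45 kg each
print(plates_used([9.0, 8.1, 8.1, 6.75]))  # → 5
```

Ignoring the increases (plates being returned or the stack being refilled) and only summing the decreases keeps refills from cancelling out plates that were actually taken.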
Now we have a dataset consisting of a date and the number of plates used that day. However, the data contained gaps because the scales didn't work every day. Before filling in the missing data, I first tested whether the data collected by the scales is correct. Luckily, another dataset is available: the dishwasher! This dataset contains the number of plates counted by the dishwasher over more than a year. Before I could use this data, I had to test statistically whether the data points from the two datasets were equal. The conclusion of this test is that the plate counts from the scales are statistically equal to the counts from the dishwasher. Therefore, I filled in the missing scale data with the known data from the dishwasher.
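One common way to test whether two matched series of counts agree is a paired t-test on the day-by-day differences. The blog post does not say which test was used, so the sketch below is only one plausible approach, with made-up daily counts.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic for the paired differences a_i - b_i (H0: mean difference = 0)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Hypothetical daily plate counts from the scales vs. the dishwasher
scales = [120, 131, 118, 140, 125]
dishwasher = [122, 129, 119, 141, 124]
t = paired_t_statistic(scales, dishwasher)
# |t| well below the 5% critical value (about 2.78 for 4 degrees of
# freedom) → no evidence that the two counts differ
```

When the test does not reject equality, it is reasonable to treat the dishwasher counts as a stand-in for the missing scale days, which is exactly how the gaps were filled.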
To analyze the time series, we use an Autoregressive Integrated Moving Average model (Hyndman &amp; Athanasopoulos, 2018), or ARIMA. With this method it should be possible to forecast, from the historical data, the number of people who will enjoy the delicious lunch provided by my colleagues Martin and Jacqueline. Combined with the correlation of other variables such as weather, day of the week and holidays, my next goal is to build a prediction model.
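To give a feel for what ARIMA does, here is a toy ARIMA(1,1,0) forecast built from scratch: difference the series once, fit a single autoregressive coefficient by least squares, then forecast and integrate back. This is a didactic sketch with made-up plate counts; a real analysis would use a library such as statsmodels, which also handles order selection and seasonality.

```python
import numpy as np

def arima_110_forecast(series, steps=5):
    """Toy ARIMA(1,1,0): difference once, fit an AR(1) coefficient by
    least squares, then forecast and undifference back to counts."""
    diffs = np.diff(np.asarray(series, dtype=float))
    x, y = diffs[:-1], diffs[1:]
    phi = (x @ y) / (x @ x)       # AR(1) coefficient on the differences
    forecasts, level, d = [], series[-1], diffs[-1]
    for _ in range(steps):
        d = phi * d               # next predicted differenced value
        level = level + d         # integrate back to the original scale
        forecasts.append(level)
    return forecasts

# Hypothetical daily plate counts (a proxy for lunch guests)
history = [120, 126, 123, 130, 127, 133, 129, 135]
print(arima_110_forecast(history, steps=3))
```

The fitted coefficient comes out negative here, so the forecast oscillates around the recent level, mirroring the alternating up-and-down pattern in the example history.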
My final step is to build a dashboard that shows the predicted number of lunch guests for the upcoming week. Besides the prediction, it will also show the live data from the scales, so we can see how accurate the prediction is.
To be continued in Part 2