Within Mirabeau we serve many e-commerce clients, each with their own custom environment and corresponding challenges. Every client wants to be successful online, and the Black Friday and Cyber Monday sales offer great opportunities to boost online sales in a short period of time. However, these sales do present challenges at the infrastructure and application level, which I will discuss in this blog article.
Normal day-to-day traffic is very predictable and spread out over the day, with an occasional peak after an e-mail newsletter. But with sales like Black Friday and Cyber Monday, internet traffic becomes unpredictable to a certain degree. You have to deal with sudden surges in traffic after a newsletter has been sent out. Depending on the regions in which the client does business, you have to deal with these surges a couple of times per day. In a matter of minutes, you can face a 500% traffic increase. The application also has to process far more than usual: there will be many more user sessions, carts and orders to handle. Each step in the user experience needs to be flawless or it will negatively impact conversion.
Understanding the bottlenecks
Weeks before the actual sales start, preparations are in full swing. Before we can make any improvements, we have to understand the possible performance bottlenecks. The best way to do this is to perform load tests on the environment that simulate heavy customer load. Our team aligns with all the stakeholders and looks at historical data to make an estimate of the expected traffic. This number is then doubled to make sure the infrastructure and application can handle it.
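To make the doubling rule concrete, here is a minimal sketch of how a load test target could be derived from historical peaks. The function name, the use of the highest historical peak as the estimate, and the optional growth factor are assumptions for illustration, not a description of our actual tooling.

```python
def load_test_target(historical_peaks_rps, growth_factor=1.0, safety_factor=2.0):
    """Return the request rate (requests per second) a load test should reach.

    historical_peaks_rps: peak request rates observed during earlier sales.
    growth_factor: hypothetical multiplier for expected growth (assumption).
    safety_factor: the estimate is doubled by default, as described above.
    """
    if not historical_peaks_rps:
        raise ValueError("need at least one historical peak")
    # Assumption: the highest peak seen so far is the baseline estimate.
    expected_peak = max(historical_peaks_rps) * growth_factor
    return expected_peak * safety_factor


# Example with made-up numbers: three earlier peaks, no extra growth expected.
peaks = [1200, 1500, 1350]
target = load_test_target(peaks)
print(f"Load test should reach {target:.0f} requests per second")
```

In practice the estimate comes out of the discussion with the stakeholders, so the baseline will usually be more nuanced than a single maximum.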
Getting rid of the bottlenecks
Improving the performance is an iterative process. A load test is performed, and its results are interpreted. Changes to the infrastructure and applications are made to remove the bottlenecks found, followed by another load test. This process is repeated until all the stakeholders are satisfied with the results.
Most of the platforms we manage consist of a Content Delivery Network such as CloudFront or Akamai, internal and external load balancers, web and application servers running on EC2 in Auto Scaling groups (ASG) or containerized in ECS, RDS Aurora DB clusters or other RDS engines, and more. The platforms make use of the features Amazon Web Services (AWS) offers, like auto scaling and other cloud native services. The infrastructures built by our cloud engineering teams are fully immutable and built using Infrastructure as Code concepts. This makes it very easy to make changes with no (or very limited) downtime and to replicate these changes from test and acceptance environments to production.
The following improvements, among others, can be considered:
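One way to interpret the results of each iteration is to compare a high percentile of the response times against an agreed target. The sketch below assumes a nearest-rank 95th percentile and a hypothetical 500 ms target; the sample values are invented, and the real acceptance criteria are whatever the stakeholders agree on.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile of a list of response times in milliseconds."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_target(samples_ms, p95_target_ms):
    """True when the 95th percentile is within the agreed target."""
    return percentile(samples_ms, 95) <= p95_target_ms


# Invented response times from two load test runs.
before = [120, 130, 150, 180, 200, 240, 300, 450, 900, 1500]
after = [90, 95, 100, 110, 120, 125, 140, 160, 210, 380]

for label, run in (("before", before), ("after", after)):
    print(label, percentile(run, 95), meets_target(run, p95_target_ms=500))
```

If the target is not met, another round of changes and another load test follow.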
- Stricter Auto Scaling Target Tracking rules for faster scaling
- Horizontally and/or vertically scale EC2 instances
- Use Provisioned IOPS EBS and Provisioned Throughput EFS where applicable and needed
- Code and/or configuration optimizations to reduce the response times and load on application servers
- More Aurora Read Replicas when using Aurora, or bigger database instances when using a different database engine
- Improve caching to reduce load on the application servers and databases
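As an illustration of the first item in the list above, the sketch below builds the parameters for a CPU-based Target Tracking policy with a lower target value and a shorter warm-up during peak periods. The group name, target values and warm-up times are hypothetical. The dictionary keys follow the Auto Scaling `PutScalingPolicy` API, so the result could be passed as keyword arguments to `put_scaling_policy` on a boto3 `autoscaling` client; the call itself is left out to keep the example self-contained.

```python
def target_tracking_policy(asg_name, cpu_target, warmup_seconds):
    """Build PutScalingPolicy parameters for a CPU-based Target Tracking policy."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": f"{asg_name}-cpu-target-tracking",  # hypothetical naming scheme
        "PolicyType": "TargetTrackingScaling",
        # Shorter warm-up lets new instances count towards the metrics sooner.
        "EstimatedInstanceWarmup": warmup_seconds,
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            # A lower target makes the group scale out earlier.
            "TargetValue": cpu_target,
            "DisableScaleIn": False,
        },
    }


# Hypothetical values: a relaxed policy for normal days, a stricter one for peaks.
normal = target_tracking_policy("web-asg", cpu_target=70.0, warmup_seconds=300)
peak = target_tracking_policy("web-asg", cpu_target=40.0, warmup_seconds=120)

print(peak["TargetTrackingConfiguration"]["TargetValue"])
```

Which values work depends on how quickly new instances become healthy, which is something the load tests should reveal.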
Response times before and after the optimizations: the blue line is before, the yellow line is after.
Monitoring the infrastructure
We have dedicated engineers available during the peak moments. These engineers continuously monitor dashboards with all important infrastructure and application performance metrics. The client also has a direct line of communication with the engineer, so that issues can be raised quickly when they arise. When something does happen, the engineer can intervene quickly to limit the impact. In addition, the engineer sends a periodic update and report to the stakeholders about all important metrics and the overall health and performance of the platform.
Expect the unexpected
Even though our preparations are very thorough, there will always be something you have not thought about or run into before. A great example of this is the following:
One of our clients had sent out a scheduled newsletter. The dedicated engineer was already expecting a surge in traffic and was closely monitoring the platform. When the traffic surge occurred, the search back-end was unable to keep up with the huge increase in traffic. This issue did not occur during the load testing before the sale period. Although search had not been a bottleneck in those tests, the search back-ends had already been scaled to double their normal size just in case, but this was not enough. The engineer quickly decided to add more instances to the search cluster. Because the infrastructure is immutable and fully automated, the additional instances were up and running and fully synced 30 minutes after the start of the issue. This would have been a lot harder in a classic on-premises infrastructure.
Response times of the search cluster before, during and after the incident.
Black Friday and Cyber Monday were a huge success. Thanks to the close cooperation between the Mirabeau operations teams and our clients, the vigilance of the engineers and the thorough planning, no major incidents occurred. All of last year's sales records were broken! A great success for everyone involved!
This would not have been possible if we had not followed these guidelines:
- Start planning well in advance so you have enough time to test and push through improvements
- Know the breaking point and weak spots of your environment by doing stress testing
- Be in close contact with the business and know the exact planning with regard to newsletters and other campaigns
- Have engineers with outstanding knowledge of the environment ready to intervene if needed
Would you like to know more about our daily e-commerce operations? Please drop me a line; I am happy to answer your questions.