A case of migrating towards SignalFX streaming monitoring

Author
Ioannis Petrousov
Date

At Mirabeau we maintain and monitor business-critical platforms on a 24/7 basis. Monitoring is key to offering excellent service levels. This is especially true for microservices architectures, where it is very important to keep track of all services and to scale up or down based on data-driven decisions. This blog post is about my experience with selecting and implementing a promising new monitoring platform: SignalFX.

For one of our retail customers we wanted to upgrade the monitoring solution to better support microservices-based architectures. The marketplace does not lack application and server monitoring products to choose from. Dynatrace, AppDynamics, New Relic and others all compete to enhance their presence and promote their solution as the best fit for all. SignalFX, however, is a new player in the market that is being rapidly adopted by big companies such as Yelp. This got our attention, and we decided to select SignalFX after a successful proof of concept (POC).

Reasons for migrating to SignalFX

What SignalFX brings to the game is a set of features which, on other platforms, either require a custom solution or are not (yet) available at all. In our case, the following key points played an important role in adopting SignalFX:

  • A single point of reference for all infrastructure monitoring
  • Cost reduction
  • Enhanced level of alerting
  • Ability to do tracing
  • Terraform code for provisioning detectors/alerts

Streaming architecture

SignalFX is a time series processing engine built on a streaming data architecture. In practice, this means that metrics received by the platform can be processed on the fly, with results returned immediately. Naturally, this processing requires several components, namely a broker, an analytics engine and an aggregator. However, because SignalFX is offered as a Platform as a Service (PaaS) solution, the underlying logic and implementation are not a concern for the end user.
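To make "processed on the fly" concrete, here is a minimal sketch of a streaming computation: a rolling mean that emits a result as each data point arrives, rather than waiting for a batch to complete. It only illustrates the idea and says nothing about how SignalFX implements it internally; the function name and the sample readings are made up for this example.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def rolling_mean(
    points: Iterable[Tuple[int, float]], window: int
) -> Iterator[Tuple[int, float]]:
    """Yield (timestamp, mean of the last `window` values) as each point arrives."""
    recent = deque(maxlen=window)  # old values fall out automatically
    for timestamp, value in points:
        recent.append(value)
        yield timestamp, sum(recent) / len(recent)


# Hypothetical CPU readings as (timestamp in seconds, percent) pairs.
cpu_stream = [(0, 40.0), (10, 60.0), (20, 80.0), (30, 100.0)]
results = list(rolling_mean(cpu_stream, window=2))
for timestamp, mean in results:
    print(timestamp, mean)
```

Because the function is a generator, every incoming point produces an updated value straight away, which is the property that makes streaming analytics suitable for alerting.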

Tracing

Using microservices to decouple functionality and bring elasticity to your platform is a very good way to make your platform scalable based on metrics. As the number of microservices grows it becomes more and more difficult to keep track of their functional purpose and most importantly the interconnections between them. Our customer’s platform runs over 30 different microservices with new ones being 'born' every 2-3 months. When choosing a monitoring solution, it was important to be able to visualise the microservices with their interconnections and any connections to external services, such as payment systems.

SignalFX offers the end user a visual map of all the microservices, their interconnections and any external connections. Distributed tracing is achieved using OpenTracing, an API specification along with frameworks and libraries. This allows teams to rapidly pinpoint any failing links between microservices and remediate them.
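The idea behind such a map can be sketched in a few lines: every span in a trace records which service produced it and which span called it, so the edges between services follow from the parent/child relationships. The span layout below (span_id, parent_id, service) and the service names are simplified assumptions for illustration, not the OpenTracing data model or the SignalFX format.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_service_map(spans: List[dict]) -> Dict[str, Set[str]]:
    """Return {caller service: {callee services}} derived from parent/child spans."""
    service_of = {span["span_id"]: span["service"] for span in spans}
    edges: Dict[str, Set[str]] = defaultdict(set)
    for span in spans:
        parent_id = span.get("parent_id")
        if parent_id is None or parent_id not in service_of:
            continue  # root span, or parent not part of this sample
        caller = service_of[parent_id]
        callee = span["service"]
        if caller != callee:  # ignore calls that stay inside one service
            edges[caller].add(callee)
    return dict(edges)


# Hypothetical trace of one checkout request.
spans = [
    {"span_id": "a", "parent_id": None, "service": "webshop"},
    {"span_id": "b", "parent_id": "a", "service": "basket"},
    {"span_id": "c", "parent_id": "b", "service": "payment-gateway"},
    {"span_id": "d", "parent_id": "a", "service": "stock"},
]
service_map = build_service_map(spans)
for caller, callees in sorted(service_map.items()):
    print(caller, "->", sorted(callees))
```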

Metrics and dashboards

Another feature our customer requested was the ability to present useful metrics not only to its development teams but also to the business. The number of sales, the number of API calls and the number of pickups from the store are all very useful metrics. Some of these metrics are available through the integration with AWS and some need to be sent by the applications themselves.

As mentioned above, SignalFX is able to ingest and process not only infrastructure metrics but also business KPIs transmitted by the applications themselves. This allows you to create dashboards, correlate metrics and define behavioural patterns (e.g. when this then that). Moreover, SignalFX allows you to group dashboards and link them to teams depending on the content of interest.
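As an illustration of what "transmitted by the applications themselves" could look like, the sketch below builds a data point for a sales counter and wraps it in an HTTP request. The endpoint URL, the X-SF-Token header and the payload layout are written from memory of the SignalFX ingest API and should be checked against the official documentation; the metric name, dimensions, token and realm are hypothetical. The request is only constructed, not sent.

```python
import json
import urllib.request

# Assumed ingest endpoint; the realm (e.g. "eu0") depends on your account.
INGEST_URL = "https://ingest.{realm}.signalfx.com/v2/datapoint"


def build_payload(metric, value, dimensions, metric_type="counter"):
    """Build the JSON body for one data point, keyed by its metric type."""
    return {metric_type: [{"metric": metric, "value": value, "dimensions": dimensions}]}


def build_request(payload, token, realm):
    """Wrap the payload in a POST request without sending it."""
    return urllib.request.Request(
        INGEST_URL.format(realm=realm),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-SF-Token": token},
        method="POST",
    )


# Hypothetical business KPI: three sales from the web channel of one store.
payload = build_payload("shop.sales.count", 3, {"store": "amsterdam", "channel": "web"})
request = build_request(payload, token="YOUR_TOKEN", realm="eu0")
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
print(request.full_url)
print(json.dumps(payload))
```

Dimensions such as the store or channel are what later make it possible to slice a dashboard per team or per business unit.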

[Image: SignalFX dashboard]

50% cost reduction

SignalFX's billing model is based on the number of VMs and the number of containers. With the current implementation, we achieved a cost reduction for our customer of 50% compared to the previous monitoring implementation.

Planning and execution

After creating an overview of the existing alerts, we defined additional metrics that needed to be monitored, given the native integration with AWS. We then established a plan describing the steps of the migration process. Since this was a new platform, the Mirabeau Cloud team had no prior experience with it. However, after some self-study and a 2-day professional training from SignalFX, we were ready to proceed with the migration.

The new monitoring platform was a "green field", so we proposed to provision all new detectors using Terraform. That decision created a "snowball effect" of developing in-house re-usable Terraform modules, something the customer had wanted to do for a long time but had never had the time or capacity for.

Lessons learned

Working on this project was a valuable experience for our team and our customer. The lessons we learned will definitely affect the future work we do.

When defining detectors inside SignalFX, the max_delay parameter defines how long the detector will wait for data to arrive. If data arrives after that time has expired, the detector will never trigger an alert, because it treats the data as a past event. We noticed this behaviour during a small outage in which none of the alerts fired. We were able to troubleshoot the non-functional alerts with the awesome support from SignalFX.
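The effect of max_delay can be illustrated with a small simulation. This is a deliberately simplified model of the behaviour described above, not SignalFX's actual evaluation logic: a point is only considered if it arrives within max_delay seconds of its own timestamp, and the detector fires when any considered value exceeds a threshold. All function names and numbers are hypothetical.

```python
from typing import List, Tuple

# Each point is (event_time, arrival_time, value), with times in seconds.
Point = Tuple[int, int, float]


def accepted_points(points: List[Point], max_delay: int) -> List[Tuple[int, float]]:
    """Keep only points that arrived within `max_delay` seconds of their timestamp."""
    return [
        (event, value)
        for event, arrival, value in points
        if arrival - event <= max_delay
    ]


def detector_fires(points: List[Point], max_delay: int, threshold: float) -> bool:
    """Fire if any point that arrived in time exceeds the threshold."""
    return any(value > threshold for _, value in accepted_points(points, max_delay))


# During the simulated outage the spike (95.0) is reported 140 seconds late.
points = [(0, 2, 40.0), (60, 200, 95.0), (120, 123, 45.0)]

print(detector_fires(points, max_delay=30, threshold=90.0))   # late spike is ignored
print(detector_fires(points, max_delay=180, threshold=90.0))  # late spike is seen
```

The trade-off is that a larger max_delay catches late data but also postpones every evaluation, so the value should reflect how late your slowest data source can realistically be.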

A software solution for configuration is a future investment

Migrating the alerts manually would probably have saved us the time needed to design and implement a solution. However, developing a re-usable Terraform module proved to be a worthwhile investment, since the code offers visibility into what is monitored and can now be used by feature and development teams that require custom detectors.
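The principle of a re-usable detector definition can be sketched as a function that takes a few parameters and renders a detector in Terraform's JSON syntax. This is not the module we built. The resource type signalfx_detector, its attributes (name, program_text, max_delay, rule, detect_label, severity) and the SignalFlow program are written from memory of the SignalFX Terraform provider and should be verified against its documentation. The service name, metric and threshold are hypothetical.

```python
import json


def detector_resource(service: str, metric: str, threshold: float, max_delay: int = 0) -> dict:
    """Render one threshold detector as a Terraform JSON resource (assumed schema)."""
    label = f"{service} {metric} above {threshold}"
    # Assumed SignalFlow program: alert when the filtered signal exceeds the threshold.
    program_text = (
        f"signal = data('{metric}', filter=filter('service', '{service}'))\n"
        f"detect(when(signal > {threshold})).publish('{label}')"
    )
    resource_name = f"{service}_{metric}".replace(".", "_").replace("-", "_")
    return {
        "resource": {
            "signalfx_detector": {
                resource_name: {
                    "name": label,
                    "program_text": program_text,
                    "max_delay": max_delay,
                    "rule": [{"detect_label": label, "severity": "Critical"}],
                }
            }
        }
    }


# Hypothetical detector for one microservice.
definition = detector_resource("basket", "cpu.utilization", 90, max_delay=60)
rendered = json.dumps(definition, indent=2)
print(rendered)
```

Keeping the parameters in code means every detector can be reviewed, versioned and reproduced, which is where the visibility mentioned above comes from.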

Feature requests on our backlog:

  • Create a Terraform module for dashboards
  • Add more customised dashboards with infra/business metrics

Conclusion

Adopting a new product with no prior experience of using it can be a challenging and time-consuming task. Moreover, visibility can be hindered by overuse and manual work. To enable our customer to maintain control over the new platform and gain visibility into what is actively being monitored, we developed a re-usable Terraform module to migrate existing alerts from the previous platform. This decision sparked our customer's interest in developing more Terraform modules for AWS.

Working with SignalFX was and still is fun. Although the platform still has some obscure corners, it is relatively easy to find documentation or open a support ticket. At Mirabeau we believe there is no right or wrong when choosing a platform to work with. Every solution has its trade-offs, and the question is which ones you are able to work with. If you are looking to try a new monitoring platform, such as SignalFX or one of the others out there, feel free to contact me and together we can define a solution for your needs.


Tags

DevOps Monitoring Automation SLA Service Management Web analytics