Why operational excellence?
‘Operational excellence’ is a term that originates in business management. One of the classics, ‘The Discipline of Market Leaders’ by Treacy & Wiersema, established a framework for competitive advantage: at its core, every business needs to decide which of the following strategies to pursue (while remaining at least decent at the other two).
- Product leadership: become the leader of the pack in your market segment by providing extremely desirable products. Then slap huge profit margins on top - the Apples of this world.
- Customer intimacy: offer great experiences that keep bringing your customers back to the brand, and build strong loyalty. Then wrap everything in a recurrent revenue bundle and call it Amazon Prime.
- Operational excellence: deliver a combination of quality, price, and ease of purchase. Iterate continuously on efficiency and build at scale.
Operational excellence is the default strongest option of the three for every cloud service provider (AWS, Azure, and GCP, to name the largest). It shapes how they compete, and takes specific flavors from the engineering challenges of distributed systems at scale. A few examples: the “race to the bottom” for storage costs, the elastic scalability cloud platforms provide to customers, and the high quality bar they set for service availability.
Much of this applies to any SaaS company, including Mambu. In this article we will look at some of the ways we implement a culture of operational excellence, with emphasis on engineering best practices and processes.
“A highly effective and skilled team, provided with a robust self-service platform, clear objectives and principles of operation, can act fully autonomously to address important customer needs and solve real-life problems, with high throughput and great stability” - Ciprian Diaconasu, Mambu’s VP Product Engineering.
Our customers interact with Mambu via REST APIs, and as such, most of what we do in engineering translates into how successful they are at composing our services to build great end-customer experiences. These APIs are the entry point to a large distributed system built on top of native cloud infrastructure. Regardless of how many layers of complexity our internal architecture has, our goal at the end of the day is to happily respond to API requests. What we’ve put in place is a mix of processes and best practices, held together with a good deal of automation and feedback loops for continuous improvement. Let’s dive into a few of these.
From monitoring to observability.
The classic approach to monitoring distributed systems involves working with logs and standard metrics based on carefully instrumented business logic. You then build custom dashboards to investigate and track the gathered data. You figure out healthy thresholds and define monitors and alarms. Finally, you bind it all together with a metric ton of documentation covering every known failure mode in an operational runbook.
In our journey so far, Mambu has applied the monitoring approach to infrastructure health, i.e. we work with metrics and alarms in AWS CloudWatch to track the health of our compute (EC2 fleets, Kubernetes clusters) or our RDS instances, among other resources. Depending on the severity of the issue, alarms may end up paging our SRE engineers for intervention.
At the application layer, monitoring looks at logs (we use Loggly for storage and analysis) and API metrics. Since our customers interact with Mambu through REST APIs, we are able to apply industry best practices around defining and tracking at least two application health metrics:
- Availability - returning a healthy HTTP status (basically, not server-side errors).
- Latency - responding within a certain timeframe, e.g. below 500ms.
For each of the two metrics, we define SLIs, SLAs and SLOs.
Primer: SLIs, SLAs, SLOs.
- A Service Level Indicator is a measurable property used to assess system health, e.g. latency of API responses, or the types of HTTP responses returned. If your monitoring system aggregates metrics every 1 minute, that’s your SLI datapoint.
- A Service Level Agreement adds bounds to SLIs. A latency SLA could be to respond within 500ms (at P99 of all traffic to skip outliers). An availability SLA could be to return less than 1% HTTP server-side errors. Any datapoint outside of the defined bounds is in breach of the SLA. Past a certain number of consecutive datapoints in breach, engineering needs to intervene: there’s an operational incident.
- A Service Level Objective frames SLAs against a whole calendar year. These come from well-established practices in telecommunications, where for instance “4 nines” means staying within SLA 99.99% of the year, resulting in a maximum budget of about 52 minutes of datapoints in breach.
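The primer above can be sketched in a few lines of code. This is a minimal illustration, not Mambu's actual tooling: the thresholds mirror the examples given (P99 latency under 500ms, less than 1% server-side errors), and the function and variable names are invented for the sketch.

```python
# SLA thresholds from the examples above (illustrative, not Mambu's real bounds).
LATENCY_SLA_MS = 500      # P99 latency must stay below 500 ms
AVAILABILITY_SLA = 0.01   # less than 1% HTTP server-side errors

def datapoint_in_breach(p99_latency_ms, error_rate):
    """A 1-minute SLI datapoint breaches the SLA if either bound is violated."""
    return p99_latency_ms >= LATENCY_SLA_MS or error_rate >= AVAILABILITY_SLA

def yearly_error_budget_minutes(slo=0.9999):
    """'Four nines' over a calendar year leaves roughly 52 minutes of breach."""
    minutes_per_year = 365 * 24 * 60
    return minutes_per_year * (1 - slo)

# One SLI datapoint per minute: (P99 latency in ms, server-side error rate).
datapoints = [(120, 0.001), (480, 0.0), (650, 0.002), (300, 0.02)]
breaches = sum(datapoint_in_breach(lat, err) for lat, err in datapoints)
print(breaches)                                  # 2 datapoints breach the SLA
print(round(yearly_error_budget_minutes(), 1))   # 52.6 minutes of budget
```

Note that the third datapoint breaches on latency and the fourth on availability; consecutive breaches beyond a defined count would open an operational incident.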
So far so good - Mambu has a decent amount of monitoring in place - so where’s the observability? Here, things are different. While monitoring is all about a thorough, structured approach where you generally know how things work and break, observability is less prescriptive and more exploratory. It’s about large systems that generate high-cardinality metrics at high volume, making structured approaches less effective. Quick navigation through large datasets of metrics becomes more useful, as does connecting them across multiple distributed subsystems. This also involves instrumentation that uses aggressive introspection to generate distributed traces, which can in turn be explored at scale to identify anomalies and near-failure modes. We made some inroads into observability in 2019 by working with three vendors (DataDog, Dynatrace and HoneyComb), before deciding to venture into a home-baked solution in 2020.
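To make the instrumentation side concrete, here is a hedged sketch of the “wide event” style common in observability tooling: one structured, high-cardinality event per request, carrying a trace id that lets events from different services be correlated later. The service name, field names, and schema are assumptions for illustration, not Mambu's actual instrumentation.

```python
import json
import time
import uuid

def handle_request(customer_id, endpoint, trace_id=None):
    """Emit one rich, structured event per request (hypothetical schema)."""
    event = {
        "trace_id": trace_id or uuid.uuid4().hex,  # propagated across services
        "service": "loans-api",                    # invented service name
        "endpoint": endpoint,
        "customer_id": customer_id,  # high-cardinality: one value per customer
        "started_at": time.time(),
    }
    # ... business logic would run here ...
    event["duration_ms"] = (time.time() - event["started_at"]) * 1000
    event["status"] = 200
    print(json.dumps(event))  # a real system ships this to the event store
    return event

# A downstream service reuses the caller's trace_id, so both events can be
# joined later when exploring a latency anomaly.
first = handle_request("cust-4711", "GET /loans")
handle_request("cust-4711", "POST /payments", trace_id=first["trace_id"])
```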
Beyond application health, we’ve also started looking at customer experience and custom business metrics - because nines don’t matter when users aren’t happy. This part gets very specific for each product team across the overall Mambu services platform. For example, one business metric we’re looking at for our payments service is how long it takes to process a payment transaction between multiple microservices until it hits the payment gateway.
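As a sketch of how such a business metric could be derived, the snippet below computes the time from a payment's first recorded event to the moment it reaches the payment gateway, from timestamped events correlated by payment id. The event shape, service names, and timestamps are invented for illustration.

```python
def payment_processing_seconds(events):
    """events: list of (payment_id, service, unix_timestamp) tuples.
    Returns seconds from each payment's first event to the payment gateway."""
    per_payment = {}
    for payment_id, service, ts in events:
        first, gateway = per_payment.get(payment_id, (None, None))
        first = ts if first is None else min(first, ts)
        if service == "payment-gateway":
            gateway = ts if gateway is None else max(gateway, ts)
        per_payment[payment_id] = (first, gateway)
    # Only payments that actually reached the gateway get a duration.
    return {pid: gw - first
            for pid, (first, gw) in per_payment.items() if gw is not None}

events = [
    ("pay-1", "orders-service", 100.0),
    ("pay-1", "risk-service", 100.4),
    ("pay-1", "payment-gateway", 101.2),
]
print(payment_processing_seconds(events))  # pay-1 took ~1.2 seconds
```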
Mambu relies on automated software testing of multiple flavors (e.g., unit, integration, contract, UI, load, security) to provide quality gates throughout our integration and deployment pipelines. Our focus moving forward is on four key success metrics:
- Lead time for change: how long it takes for an implemented change to reach production.
- Deployment frequency: how often a deployment cycle is completed.
- Deployment failure rates: what percentage of deployments result in incidents.
- Mean time to recovery: how long it takes to recover from a failed deployment.
Improving on these metrics is part of our 2020 roadmap, and will increase how quickly and how often we are able to successfully deploy value to our customers.
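The four metrics above can be computed mechanically from deployment records. Below is a minimal sketch; the record fields (`merged_at`, `deployed_at`, `recovered_at`) are assumptions for illustration, not Mambu's actual pipeline data, and a `recovered_at` value marks a failed deployment.

```python
from datetime import datetime

def _hours(start, end):
    """Elapsed hours between two 'YYYY-MM-DD HH:MM' timestamps."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 3600

def delivery_metrics(deploys, window_weeks):
    failed = [d for d in deploys if d.get("recovered_at")]
    return {
        # Lead time for change: merge to production, averaged, in hours.
        "lead_time_h": sum(_hours(d["merged_at"], d["deployed_at"])
                           for d in deploys) / len(deploys),
        # Deployment frequency: completed deployments per week.
        "deploys_per_week": len(deploys) / window_weeks,
        # Deployment failure rate: share of deployments causing incidents.
        "failure_rate": len(failed) / len(deploys),
        # Mean time to recovery, in hours (0 if nothing failed).
        "mttr_h": (sum(_hours(d["deployed_at"], d["recovered_at"])
                       for d in failed) / len(failed)) if failed else 0.0,
    }

deploys = [
    {"merged_at": "2020-03-02 09:00", "deployed_at": "2020-03-02 13:00"},
    {"merged_at": "2020-03-09 10:00", "deployed_at": "2020-03-09 12:00",
     "recovered_at": "2020-03-09 13:30"},
]
print(delivery_metrics(deploys, window_weeks=2))
```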
Part of our maturity model for continuous delivery is to build on top of our existing delivery process with new constructs. Below are just some of the upcoming improvements:
- Auto-rollback requires a feedback loop between production systems’ health metrics and the deployment pipeline: if an anomaly is detected during deployment, e.g. sudden spike in latency or server-side errors, then the running deployment is halted and rolled-back automatically, minimizing impact to customers.
- Synthetic monitoring (aka canary testing) involves running continuous testing on production stacks. The key here is to have simple tests that run through all layers of the application, cover at least some of the critical customer-facing functionality, and run with strong idempotency fail-safes, i.e. there’s no risk of overloading a production database with random test data. Synthetic monitoring has strong synergy with auto-rollback, especially for systems with sparse or uneven customer traffic, where some minimal amount of traffic, even synthetic, ensures continuous tracking of system health during deployments.
- Canary releases are a fine-grained deployment technique, where new software is initially deployed to a subset of the overall customer-facing fleet. While the new release bakes on a single box or a small percentage of servers, the health metrics gathered provide a good indication of whether it is safe to deploy to the rest of the fleet. Contrast this to blue-green deployments, where two identical fleets are used to switch between software versions.
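The three constructs above fit together naturally, and the sketch below combines them: deploy to a small canary slice, drive synthetic probes against it, and auto-roll back if the observed error rate breaches a threshold. The fleet representation, probe interface, and thresholds are all invented for illustration, not Mambu's deployment tooling.

```python
def canary_deploy(fleet, new_version, probe,
                  canary_fraction=0.1, max_error_rate=0.01):
    """Bake new_version on a canary slice; roll back or roll out based on
    synthetic probe results. probe(host) returns True if the host is healthy."""
    canary_size = max(1, int(len(fleet) * canary_fraction))
    canary, rest = fleet[:canary_size], fleet[canary_size:]

    for host in canary:  # remember the old version in case we must roll back
        host["previous_version"], host["version"] = host["version"], new_version

    # Synthetic monitoring: repeated, idempotent probes through the canary.
    results = [probe(host) for host in canary for _ in range(20)]
    error_rate = results.count(False) / len(results)

    if error_rate >= max_error_rate:
        for host in canary:  # auto-rollback: restore the previous version
            host["version"] = host["previous_version"]
        return "rolled-back"

    for host in rest:        # healthy canary: roll out fleet-wide
        host["version"] = new_version
    return "deployed"

fleet = [{"version": "v1"} for _ in range(10)]
healthy_probe = lambda host: True  # a real probe would exercise the live API
print(canary_deploy(fleet, "v2", healthy_probe))  # prints "deployed"
```

Because the synthetic probes supply traffic even when real customer traffic is sparse, the rollback decision never has to wait for organic requests, which is exactly the synergy described above.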
Game-days: towards Chaos Engineering.
While failure in a production system is always a good opportunity to learn and improve, there’s probably a better way than learning from root cause analysis after customers have been impacted by an outage. Industry leaders have innovated around the concept of Chaos Engineering - the practice of inducing failures into production environments to continuously improve resilience to failures.
As a step towards a similar approach at Mambu, we run game-days where we intentionally introduce failure modes into pre-production systems. This helps us identify bottlenecks and bugs, and make general improvements to system robustness as well as our observability stack. In 2019, one of our microservices teams introduced game-days running every two weeks, and we intend to spread the practice to more teams in 2020.
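One simple family of failure modes a game-day might inject is added latency and intermittent errors on a downstream dependency. The sketch below wraps a call with configurable faults; the wrapped function, fault rates, and error type are invented for illustration, and something like this would only ever run against pre-production stacks.

```python
import functools
import random
import time

def inject_faults(failure_rate=0.2, extra_latency_s=0.5, seed=None):
    """Decorator that adds latency and random timeouts to a dependency call."""
    rng = random.Random(seed)  # seedable, so a game-day run is reproducible
    def decorator(call):
        @functools.wraps(call)
        def wrapper(*args, **kwargs):
            time.sleep(extra_latency_s)  # simulate a slow dependency
            if rng.random() < failure_rate:
                raise TimeoutError("injected fault: dependency timed out")
            return call(*args, **kwargs)
        return wrapper
    return decorator

# Hypothetical downstream call, degraded for the duration of the game-day.
@inject_faults(failure_rate=0.2, extra_latency_s=0.0, seed=7)
def fetch_balance(account_id):
    return {"account_id": account_id, "balance": 100}
```

Watching how dashboards, alarms, and retry logic behave while such a wrapper is active is what surfaces the bottlenecks and observability gaps mentioned above.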