Here are the Metrics you Need to Understand Operational Health

In recent polls we’ve conducted with engineers and leaders, we’ve found that around 70% of participants used MTTA and MTTR as one of their main metrics. 20% of participants cited looking at planned versus unplanned work, and 10% said they currently look at no metrics. While MTTA and MTTR are good starting points, they’re no longer enough. With the rise in complexity, it can be difficult to gain insights into your services’ operational health.

In this blog post, we’ll walk you through holistic measures and best practices that you can employ starting today. These will include challenges and pain points in gaining insight as well as key metrics and how they evolve as organizations mature.

Pain points for creating useful metrics

It’s easy to fall into the trap of being data rich but information poor. Building metrics and dashboards with the right context is crucial to understanding operational health, but where do you start? It’s important to look at roadblocks to adoption thus far in your organization. Perhaps other teams (or even your team) have looked into the way you measure success before. What halted their progress? If metrics haven’t undergone any change recently, why is that?

Below are some of the top customer pain points and challenges that we typically see software and infrastructure teams encounter.

With these pain points in mind, let’s look at some key metrics other organizations we’ve spoken to have found success with.

Key metrics for operational health

For all metrics, the incentives across cross-functional software teams, development, operations, and product all need to be aligned. Metrics aren’t there to crack the whip; they’re simply a tool to help with planning. We like to think of them as the compass to help teams go down the right maturity path. If they’re not being met, there should be an effort to understand why without blame. It’s highly likely that they’re not being set in an achievable manner. It takes everyone on board in order to achieve a deeper understanding.

When we look at the metrics, we want to make sure that we have the right context, as metrics without context can lead to the wrong behaviors (such as fear-based management). We are not measuring just for measuring, and we don’t need all the metrics in the world. We need to choose those metrics that help us get closer to doing what’s best for customers and the team.

Here are some common metrics that, with the right context, we’ve seen organizations have success with:

Image for post
Image for post

Each of these metrics on their own are not likely to be indicative of operational heath. For example, teams with higher deploy rates might seem more successful than teams with lower deploy rates. However, if the team with higher deploy rates also has a higher rate of failures or regressions, it’s no longer clear which team is the most successful.

Teams will need to figure out the context relevant to each metric. This might take time. After all, each team and organization will be at a different level of the metrics maturity model.

Metrics maturity model

Image for post
Image for post

Based on our customer interviews, we’ve mapped out the above maturity model which depicts how metrics evolve with ops maturity, ranging from fragile to leaders in the space. We’ve outlined some key characteristics of each.

While achieving the Leader stage may seem nearly impossible, it’s not out of reach for dedicated teams. Rather than try to reach this level all at once, however, teams should look at small, incremental improvements that allow them to gradually build their metrics base, as illustrated in the case study below.

A metrics case study

At the beginning of 2019, one of our customers — a global eCommerce company — wasn’t tracking any of its incidents. We helped the team hone in on a few basic metrics to start with, such as the cost of time spent on incidents, incidents by severity levels, and percentage of time spent on planned work (ie features) vs. unplanned work (ie incidents, bugs). During the first half of the year, the team worked hard to build those metrics.

In the second half of the year, the team had a solid baseline from which to understand trends and improvement opportunities from the metrics. At the checkpoint, the team identified that 45% of total engineering time was spent on unplanned work, which was equivalent to costing $200K per month.

The majority of the incidents happened on the product page in regards to load time and checkout process failures. With this data, the team began doing deeper triage to understand what caused incidents for these points in the user journey. Further investigation showed that the failures were associated with database tag and API to 3rd party fraud services and payment integrators.

During Q1 in 2020, we helped the team build a focused improvement that consisted of the following:

After Q1 of 2020, the team ran the reports again and saw a 76% reduction of incidents on the key user journey flows (product pages and checkout process). The percentage of total engineering time spent on unplanned work dropped by 40%. While this isn’t the end of their journey to understand operational heath, it’s a great starting point.

Image for post
Image for post

How Blameless can help identify and establish useful metrics

If you are looking to establish metrics that allow you to understand your service’s operational health, Blameless’ Reliability Insights features can help. We’ll show you how to explore and analyze your DevOps data, and share insights with other stakeholders such as executives and customer success. Our dashboards facilitate proactive identification of patterns and signals from the noise. Custom and prescriptive dashboards provide some of the more common metrics around your reliability so you can hit the ground running with informed data. This data is compiled in reports that cover business impacting metrics that pertain to all levels of stakeholders. These out-of-the-box reports include:

If you want to learn more about metrics to help your business and other SRE best practices, reach out to us for a demo.

If you enjoyed this post, check out these resources:

Originally published at https://www.blameless.com.

Written by

Giving you all you need to know about Site Reliability Engineering. https://www.blameless.com/blog/

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store