Here Are the Metrics You Need to Understand Operational Health

Pain points for creating useful metrics

  • Lack of data: Your data is fragmented across your APM, ticketing, chatops, and other tools. Even worse, it’s typically also siloed across teams that run at different speeds. A lot of it is tribal knowledge, or it simply doesn’t exist.
  • No feedback loop: There’s little or no integration between incidents, retrospectives, follow-up action items, planned work, and the customer experience. That makes it hard to understand how it all ties together, and hard to pinpoint how to actually improve customer experience. You’re constantly being redirected by unplanned work and incidents.
  • Blank slate: Traditional APM and analytics tools are really great for insights, but without a baseline of metrics that are prescriptive and based on operational best practices, it’s hard to know where to start.
  • One-size-fits-all: What works for one team won’t necessarily work for another. Everything needs to be put in the right context in order to provide insights that are truly relevant.

Key metrics for operational health

  • Velocity: Arguably one of the most frequently used dimensions of measurement. Key measures include sprint capacity planning and how quickly the team pushes new features to production.
  • Availability: The probability that a system is operational at a given time. Key measures include the ability of the system and the team to recover from incidents and to handle interrupt-driven work.
  • Engineering Toil: How much of the team’s time is spent on thrash, and how much operational inefficiency there is in the system. Key measures include the level of automation and the reduction of cognitive overhead.
  • Product Quality & Customer Happiness: Understanding customer happiness level. Key measures include understanding the status of key user journeys (SLOs), reactive incident response, and more.
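
As a rough illustration of the availability definition above, the sketch below estimates availability over a reporting window from a list of downtime durations. The function name, the 30-day window, and the sample downtimes are assumptions made for this example and do not come from any particular tool.

```python
from datetime import timedelta


def availability(window: timedelta, downtimes: list[timedelta]) -> float:
    """Return the fraction of the window during which the system was up.

    Assumes the downtime intervals do not overlap (illustrative simplification).
    """
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    total_down = sum(downtimes, timedelta())
    if total_down > window:
        raise ValueError("downtime exceeds the reporting window")
    # Dividing one timedelta by another yields a plain float ratio.
    return 1 - total_down / window


# Hypothetical data: two outages in a 30-day window, 43.2 minutes in total.
window = timedelta(days=30)
outages = [timedelta(minutes=30), timedelta(minutes=13, seconds=12)]
print(f"{availability(window, outages):.3%}")  # prints 99.900%
```

A number like this is most useful next to an SLO target: 43.2 minutes of downtime in 30 days is exactly the budget a 99.9% availability target allows.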

Metrics maturity model

  • Fragile: This level of maturity is where many organizations find themselves. It’s reactive and stressful. At this stage, teams are likely measuring one thing: the number of incidents or customer tickets. This is a context-less stage where only the number is relevant. A quarter with 53 incidents may be seen as less successful than a quarter with 45 incidents, even if the 53 incidents were quick fixes that caused little customer impact. Or perhaps many of the incidents are labeled as either Sev0 or Sev1, either because teams aren’t sure how to classify incidents properly, or because each incident is really that dire. As a result, the majority of the time is spent on unplanned work.
  • Unified: At this level, teams are likely sorting incidents by types and tags, gaining some insight into problem areas that had previously gone undiscovered. With this increased visibility, incidents will likely be spread between severe and less severe due to improved incident classification and mitigation capabilities. However, 30–50% of time is still spent on unplanned work.
  • Advantage: Teams at this maturity level have more advanced metrics as well as SLOs to help pre-empt customer impact. This allows for data-driven tradeoffs in prioritizing reliability work alongside feature work. More mature teams make smaller, faster changes and can better localize the blast radius of incidents, so the majority of incidents typically fall in the Sev2 or Sev3 categories. At this point, less than 30% of time is spent on unplanned work.
  • Leader: This level of maturity is one that less than 1% of enterprise companies have achieved today. It is characterized by advanced practices such as graceful service degradation and fault tolerance in order to minimize customer impact even during the most unexpected events (e.g., massive, sudden changes in scale). Customers will see less than 1% of incidents, and 20% or less of the team’s time is spent on unplanned or reactive work.
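
The unplanned-work thresholds in the model above can be turned into a quick self-check. This is only a sketch: the function name is made up for illustration, the cutoffs are read directly from the four levels described here, and unplanned work is just one of several signals the model uses (SLOs, severity mix, and customer-visible incidents also matter).

```python
def maturity_level(unplanned_fraction: float) -> str:
    """Map the share of time spent on unplanned work to a maturity level.

    Cutoffs follow the model's description: a majority of time (Fragile),
    30-50% (Unified), under 30% (Advantage), 20% or less (Leader).
    """
    if not 0.0 <= unplanned_fraction <= 1.0:
        raise ValueError("unplanned_fraction must be between 0 and 1")
    if unplanned_fraction > 0.5:
        return "Fragile"
    if unplanned_fraction >= 0.3:
        return "Unified"
    if unplanned_fraction > 0.2:
        return "Advantage"
    return "Leader"


print(maturity_level(0.4))  # prints Unified
```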

A metrics case study

  • Rewrite the database queries and indexes to improve quality and performance.
  • Improve the API connection handling and error handling.
  • Replace one of the fraud service providers.
  • Make changes on the CDN provider to improve the speed of dynamic objects and increase the TTL of static objects.

How Blameless can help identify and establish useful metrics

  • Incidents by type
  • Incidents by severity
  • MTT* metrics and incident duration
  • Planned versus unplanned work
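
To make the four metrics above concrete, here is a minimal sketch of how they could be computed from exported incident records. It does not use the Blameless API; the Incident fields, the helper names, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record shape; real exports will differ.
    kind: str
    severity: str
    detected_at: datetime
    resolved_at: datetime


def incidents_by(incidents: list[Incident], field: str) -> Counter:
    """Count incidents grouped by an attribute such as 'kind' or 'severity'."""
    return Counter(getattr(incident, field) for incident in incidents)


def mean_time_to_resolve(incidents: list[Incident]) -> timedelta:
    """Average duration from detection to resolution (one of the MTT* family)."""
    if not incidents:
        raise ValueError("need at least one incident")
    durations = [i.resolved_at - i.detected_at for i in incidents]
    return sum(durations, timedelta()) / len(durations)


def unplanned_share(planned_hours: float, unplanned_hours: float) -> float:
    """Fraction of total working hours spent on unplanned work."""
    total = planned_hours + unplanned_hours
    return unplanned_hours / total if total else 0.0


# Made-up sample data for illustration only.
incidents = [
    Incident("database", "Sev1", datetime(2021, 3, 1, 10, 0), datetime(2021, 3, 1, 11, 30)),
    Incident("database", "Sev3", datetime(2021, 3, 4, 9, 0), datetime(2021, 3, 4, 9, 20)),
    Incident("cdn", "Sev3", datetime(2021, 3, 9, 14, 0), datetime(2021, 3, 9, 14, 40)),
]

print(incidents_by(incidents, "kind"))      # database: 2, cdn: 1
print(incidents_by(incidents, "severity"))  # Sev3: 2, Sev1: 1
print(mean_time_to_resolve(incidents))      # prints 0:50:00
print(unplanned_share(planned_hours=120, unplanned_hours=80))  # prints 0.4
```

Grouping by type and severity adds the context that a raw incident count lacks, and the unplanned share ties incident data back to how the team actually spends its time.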

Giving you all you need to know about Site Reliability Engineering. https://www.blameless.com/blog/
