By: Emily Arnott

Data helps best-in-class teams make the right decisions. Analyzing your system’s metrics shows you where to invest time and resources. A common type of metric is Mean Time to X, or MTTx. These metrics detail the average time it takes for something to happen. The “x” can represent events or stages in a system’s incident response process.
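As a rough illustration of "the average time it takes for something to happen," here is a minimal sketch that computes two MTTx values from incident timestamps. The incident records, the tuple layout, and the `mean_time_to` helper are all hypothetical and invented for this example; they do not come from any particular tool.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, acknowledged_at, resolved_at).
# These timestamps are made up purely for illustration.
INCIDENTS = [
    (datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 1, 9, 5), datetime(2021, 3, 1, 10, 0)),
    (datetime(2021, 3, 8, 14, 0), datetime(2021, 3, 8, 14, 15), datetime(2021, 3, 8, 14, 30)),
    (datetime(2021, 3, 20, 2, 0), datetime(2021, 3, 20, 2, 10), datetime(2021, 3, 20, 5, 0)),
]


def mean_time_to(incidents, start_index, end_index):
    """Average the elapsed time between two stages across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident to compute a mean")
    durations = [incident[end_index] - incident[start_index] for incident in incidents]
    # sum() needs a timedelta start value, since the default start of 0 is an int.
    return sum(durations, timedelta()) / len(durations)


# Here the "x" is "acknowledge" (detected -> acknowledged).
mtta = mean_time_to(INCIDENTS, 0, 1)
# Here the "x" is "resolve" (detected -> resolved).
mttr = mean_time_to(INCIDENTS, 0, 2)

print(f"MTTA: {mtta}")
print(f"MTTR: {mttr}")
```

In this sample data the individual resolution times are 60, 30, and 180 minutes, so a single long incident pulls the average well above the other two.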

Yet, MTTx metrics rarely tell the whole story of a system’s reliability. To understand what MTTx metrics are really telling you, you’ll need to combine them with other data. In this blog post, we’ll cover:

  • What are common MTTx metrics…


By: Harry Hull

“Having On-call Nightmares? Runbooks can Help you Wake Up.” with a graph of week-over-week checkouts below.

The nightmare

You aren’t sure how long you’ve been here, but the view outside the window sure is soothing. Before you can fully take in your surroundings, a siren rips you back into the conscious world. Slowly, you begin to piece together that you exist, and you are on call.

The ringing, much louder now, pierces through your skull as you begin to open your bleary eyes. You turn over your pillow, grab your phone, and click through the PagerDuty notification. After quickly ACKing, you start to read the alert:

alertname = CartService5xxError

As fate would have it, you…


Blameless recently had the privilege of hosting three SRE leaders: Kurt Andersen, SRE Architect at Blameless; Vanessa Yiu, Executive Director, Enterprise Architecture at Goldman Sachs; and Tony Hansmann, former Global CTO at Pivotal Software, Inc. They discussed how to drive SRE adoption within an organization, including the processes teams should put in place, how to change minds and behaviors, how to get the right message to the right people, and how to garner internal support from both individual contributors and leaders.

The transcript below has been lightly edited, and if you’re interested in watching the full panel, you can do so…


“So you Want an SRE Tool. Do you Build, Buy, or Open Source?” on blue checkered background with the Blameless logo.

By: Emily Arnott

As your organization’s reliability needs grow, you may consider investing in SRE tools. Tooling can make many processes more efficient, consistent, and repeatable. When you decide to invest in tooling, one of the major decisions is how you’ll source your tools. Will you buy an out-of-the-box tool, build one in-house, or work with an open source project?

This is a big decision. Switching methods halfway through adoption is costly and can cause thrash. You’ll want to determine which method is the best fit before taking action. Each choice requires a different type of investment and offers different…


By: Emily Arnott

An important SRE best practice is analyzing and learning from incidents. When an incident occurs, you shouldn’t think of it as a setback, but as an opportunity to grow. Good incident analysis involves building an incident retrospective. This document will contain everything from incident metrics to the narrative of those involved. These metrics aren’t the whole story, but they can help teams make data-driven decisions.

But choosing which metrics are best to analyze can be difficult. You need to find the valuable signals among the noise. You’ll want your metrics to reflect how the incident impacted your…


Is it spring yet? Or spring still? Time sure is strange nowadays. At least we have a ton to look forward to in the next few weeks! Here are some of the most exciting Tweets, content, and events happening in the SRE and resilience engineering community this month.

Tweets that have us twittering

SREading

SRE2AUX: How Flight Controllers were the first SREs: Geoff White writes about what vintage space lore has to do with site reliability engineering in the 21st century.

The Netflix Cosmos Platform: This article explains why the Netflix team built Cosmos, how it works, and shares some of the things…


“How to Scale for Reliability and Trust” white text on blue background.

By: Emily Arnott

As more people depend on your product, reliability expectations tend to grow. For a service to continue succeeding, it has to be one customers can rely upon. At the same time, as you bring on more customers, the technical demands put on your service increase as well.

Dealing with both the increased expectations and challenges of reliability as you scale is difficult. You’ll need to maintain your development velocity and build customer trust through transparency. It isn’t a problem that you can solve by throwing resources at it. Your organization will have to adapt its way of…


“How to Analyze Contributing Factors Blamelessly” in white text on blue geometric background.

SRE advocates addressing problems blamelessly. When something goes wrong, don’t try to determine who is at fault. Instead, look for systemic causes. Adopting this approach has many benefits, from the practical to the cultural. Your system will become more resilient as you learn from each failure. Your team will also feel safer when they don’t fear blame, leading to more initiative and innovation.

Learning everything you can from incidents is a challenge. Understanding the benefits and best practices of analyzing contributing factors can help. In this blog post, we’ll look at:

  • A definition for root cause analysis
  • A definition for…


By: Emily Arnott

Chaos engineering is a practice where engineers simulate failure to see how systems respond. This helps teams proactively identify and fix preventable issues. It also helps teams prepare responses to the types of issues they cannot prevent, such as sudden hardware failure. The goal of chaos engineering is to improve the reliability and resilience of a system. As such, it is an essential part of a mature SRE solution.

But integrating chaos engineering with other SRE tools and practices can be challenging. To get the most from your experiments, you’ll need to tie in learnings across all…


How to Build an SRE Team with a Growth Mindset in white text on blue abstract background

The biggest benefit of SRE isn’t always the processes or tools, but the cultural shift. Building a blameless culture can profoundly change how your organization functions. Your SRE team should be your champions for cultural development. To drive change, SREs need to embody a growth mindset. They need to believe that their own abilities and perspectives can always grow and encourage this mindset across the organization.

In this blog post, we’ll cover:

  • What a growth mindset is and why it helps your SRE team
  • How to hire for a growth mindset
  • How to develop people into SREs with a growth…

Blameless

Giving you all you need to know about Site Reliability Engineering. https://www.blameless.com/blog/
