“I’m Just Doing my Job,” An SRE Myth

Align incentives through a blameless culture

  • Allocate time for analysis in the first place. If your engineers don’t feel supported in taking their time with an analysis, it may not be done well, or at all.
  • Ask what is responsible for an outcome, not who. This helps move blame off individuals and ask hard questions about the system. Human error is not a sufficient analysis. Just like root cause, it is only a starting point for investigation. See The Field Guide to Understanding Human Error by Sidney Dekker.
  • Understand how operators made the decisions they did during an incident. What was important to them at that time, and why? What else was happening in the setting of the incident that influenced their perspective? Avoid hindsight bias at all costs by looking through the lens of the operator.
  • “Provide accountability that encourages learning” (Dekker). A safe environment to fail encourages more team members to feel comfortable participating in discussions. This establishes more accounts of the incident, and the variety of viewpoints contributes to greater understanding of the system.
  • Reviewing and discovering new contributing factors can get complex, and it is impossible to capture everything. Set clear expectations for the incident retrospective. Don’t be afraid to timebox the review process, and allow time away between reviews for ideas to soak in.

Use SLOs to get in touch with customers

Giving you all you need to know about Site Reliability Engineering. https://www.blameless.com/blog/
