5 Best Practices for Nailing Incident Retrospectives

Reading about incident retrospectives (or postmortems) is different from seeing them in action. Retrospectives are like snowflakes; no two will ever look the same. There isn’t a template that will work in every situation, but there are some best practices that can help. Here are five practices to take your retrospectives to a new level. Each practice has an example that proves the method

Use visuals

As Steve McGhee says, “A ‘what happened’ narrative with graphs is the best textbook-let for teaching other engineers how to get better at progressing through future incidents.” Days, weeks, or even years after a retrospective is written, graphs still provide an engineer with a quick and in-depth explanation for what was happening during the incident.

Be a historian

Using timelines when writing retrospectives is very valuable. But, there’s an art to crafting them. As Steve McGhee says, “There is little utility to including the entire chat log of an incident. Instead, consider illustrating a timeline of the important inflection points (e.g. actions that turned the situation around). This may prove to be very helpful for troubleshooting future incidents.” Retrospective timelines need the perfect balance of information. Too much to sift through, and the retrospective will become cluttered. Too little and it’s vague.

Publish promptly

As the Google SRE book says, “A prompt postmortem tends to be more accurate because information is fresh in the contributors’ minds. The people who were affected by the outage are waiting for an explanation and some demonstration that you have things under control. The longer you wait, the more they will fill the gap with the products of their imagination. That seldom works in your favor!”

Be blameless

Blameless retrospectives are often referred to when talking about best practices. But, what does blameless culture actually look like? When writing blameless retrospectives, there are 3 important things to keep in mind.

  • Everyone on the team is working with good intentions. People make mistakes. It’ rare for a team member to cause problems on purpose. Everyone is doing what makes the most sense to them at the time to be helpful.
  • Failure will happen. There’s no way around it. But, by having a good incident resolution and retrospective practice in place, failure can actually be beneficial. It uncovers areas to focus on to improve resiliency. As long as you learn from an incident, you’ve made progress.

Tell a story

An incident is a story. To tell a story well, many components must work together.

  • Without a plan to rectify outstanding action items, the story loses a resolution.
  • Without a timeline dictating what happened, the story loses its plot.

Giving you all you need to know about Site Reliability Engineering. https://www.blameless.com/blog/

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store