The Ultimate, Free Incident Retrospective Template

Incident retrospectives (or postmortems, post-incident reports, RCAs, etc.) are the most important part of an incident. This is where you take the gift of that experience and turn it into knowledge. This knowledge then feeds back into the product, improving reliability and ensuring that no incident is a wasted learning opportunity. Every incident is an unplanned investment and teams should strive to make the most of it.

Yet, many teams find themselves unable to complete incident retrospectives on a regular basis. One common reason for this is that day-to-day tasks such as fixing bugs, managing fire drills, and deploying new features take precedence, making it hard to invest in a process to streamline post-incident report completion. To make the most of each incident, teams need a solid post-incident template that can help minimize cognitive load during the analysis process. Below is an example of what a comprehensive, narrative incident retrospective could look like.

Summary

Example: Google Compute Engine Incident #17007

The summary states: “On Wednesday 5 April 2017, requests to the Google Cloud HTTP(S) Load Balancer experienced a 25% error rate for a duration of 22 minutes. We apologize for this incident. We understand that the Load Balancer needs to be very reliable for you to offer a high quality service to your customers. We have taken and will be taking various measures to prevent this type of incident from recurring.”

People involved and roles

  • Incident commander: Runs the incident. Their ultimate goal is to bring the incident to resolution as quickly as possible.
  • Communications lead: Handles communications with stakeholders; for smaller incidents, this role is typically subsumed by the incident commander.
  • Technical lead: Someone knowledgeable in the technical domain in question who helps drive the technical resolution by liaising with subject matter experts.
  • Scribe: A person who may not be actively responding, but who records key information as the incident unfolds.

You may have all, some, or none of these roles, depending on how you structure your incident response.

Customer impact

Example: Google Cloud Networking Incident #19009

In the section titled “DETAILED DESCRIPTION OF IMPACT,” the authors thoroughly break down which users and capabilities were affected.

Follow-up actions

Example: Sentry’s Security Incident (June 12, 2016)

While detailed action items are rarely visible to the public, Sentry did publish a list of improvements the team planned to make after this outage, covering both fixes and process changes.

Contributing factors

Example: Travis CI’s Container-based Linux Precise infrastructure emergency maintenance

In this retrospective, the authors cover contributing factors such as a change in how the Docker backend executes build scripts, missing alerting coverage for the resulting errors, and more.

Narrative

Who were the characters, and how did they feel and react during the incident? What were the plot points? How did the story end? This narrative will be incomplete without everyone’s perspective.

Make sure the entire team involved in the incident gets a chance to write their own part of this narrative, whether through async document collaboration, templated questions, or other means.

Timeline

Lay out a chronological record of the incident, from first alert to full resolution, with timestamps for key events, decisions, and escalations.

Technical analysis

Here are some questions to answer with your team:

  • Have you seen an incident like this before?
  • Has this bug occurred previously, and if so, how often?
  • What dependencies came into play here?

Incident management process analysis

Here are some questions to answer with your team:

  • What went well?
  • What went poorly?
  • Where did you get lucky and how can you improve moving forward?
  • Did your monitoring and alerting capture this issue?

Messaging

Here, document the messaging that was disseminated to different categories of stakeholders. This way, you can build templates for the future to continue streamlining communication.

Example: Google Compute Engine Incident #15056

In this incident, Google ensured that all major updates were communicated regularly. The team also let users know when they could next expect an update: “We are still working on restoring the service of Google Compute Engine Persistent Disks in europe-west1-b. We will provide another status update by 19:00 US/Pacific with current details.”
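
To make this kind of update repeatable, some teams codify their recurring messages as templates. Below is a minimal Python sketch of what that could look like; the wording, the field names, and the render_status_update helper are illustrative assumptions rather than part of any particular tool.

    from datetime import datetime, timedelta, timezone

    # Hypothetical status-update template, modeled on the Google example above.
    # The wording and field names are assumptions for illustration only.
    STATUS_UPDATE_TEMPLATE = (
        "We are still working on restoring {service} in {region}. "
        "We will provide another status update by {next_update} with current details."
    )

    def render_status_update(service, region, minutes_until_next_update=60):
        """Fill in the template and commit to a concrete time for the next update."""
        next_update = datetime.now(timezone.utc) + timedelta(minutes=minutes_until_next_update)
        return STATUS_UPDATE_TEMPLATE.format(
            service=service,
            region=region,
            next_update=next_update.strftime("%H:%M UTC"),
        )

    print(render_status_update("Google Compute Engine Persistent Disks", "europe-west1-b"))

Even a simple template like this forces every update to name the affected service, the scope, and a concrete time for the next communication.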

Other Best Practices to Keep in Mind

  • Ensure reports are housed such that they can be dynamically surfaced during incidents (a sketch of one way to do this follows this list)
  • Add graphics and charts to help readers visualize the incident
  • Be blameless. Remember that everyone is doing their best and failure is an opportunity to learn
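
On the first point above, “dynamically surfaced” can be as simple as tagging each retrospective with the services it touched and looking up matches when a new incident is declared. Here is a minimal Python sketch of that idea; the Retrospective structure, the example URLs, and the in-memory list stand in for whatever system your reports actually live in (a wiki, an incident management tool, and so on).

    from dataclasses import dataclass, field

    # Hypothetical record of a past retrospective; in practice this data would
    # come from wherever your reports are stored (wiki, incident tool, etc.).
    @dataclass
    class Retrospective:
        title: str
        url: str
        services: set = field(default_factory=set)

    def surface_related(reports, affected_services):
        """Return past retrospectives that touched any of the currently affected services."""
        return [r for r in reports if r.services & set(affected_services)]

    reports = [
        Retrospective("Load balancer 5xx spike", "https://wiki.example.com/retro/123",
                      services={"load-balancer", "edge"}),
        Retrospective("Persistent disk latency", "https://wiki.example.com/retro/456",
                      services={"persistent-disk"}),
    ]
    print(surface_related(reports, ["load-balancer"]))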

Parting Thoughts

By using this template, your team is on the way to taking full advantage of every incident.


Originally published at https://www.blameless.com.
