Incident retrospectives (or postmortems, post-incident reports, RCAs, etc.) are the most important part of an incident. This is where you take the gift of that experience and turn it into knowledge. This knowledge then feeds back into the product, improving reliability and ensuring that no incident is a wasted learning opportunity. Every incident is an unplanned investment and teams should strive to make the most of it.
Yet, many teams find themselves unable to complete incident retrospectives on a regular basis. One common reason for this is that day-to-day tasks such as fixing bugs, managing fire drills, and deploying new features take precedence, making it hard to invest in a process to streamline post-incident report completion. To make the most of each incident, teams need a solid post-incident template that can help minimize cognitive load during the analysis process. Below is an example of what a comprehensive, narrative incident retrospective could look like.
The summary should contain 2–3 sentences that give a reader an overview of the incident’s contributing factors, resolution, classification, and customer impact level. The briefer, the better, as this is what engineers will look at first when trying to solve a similar incident.
This summary states “On Wednesday 5 April 2017, requests to the Google Cloud HTTP(S) Load Balancer experienced a 25% error rate for a duration of 22 minutes. We apologize for this incident. We understand that the Load Balancer needs to be very reliable for you to offer a high quality service to your customers. We have taken and will be taking various measures to prevent this type of incident from recurring.”
People involved and roles
This section should list the participants in the incident as well as what roles they played. Common roles include:
- Incident commander: Runs the incident. Their ultimate goal is to bring the incident to completion as fast as possible.
- Communications lead: Owns communication with stakeholders during the incident, though for smaller incidents, this role is typically subsumed by the incident commander.
- Technical lead: An individual who is knowledgeable in the technical domain in question, and helps to drive the technical resolution by liaising with Subject Matter Experts.
- Scribe: A person who may not be actively working the incident, but who transcribes key information as it unfolds.
You may have all, some, or none of these roles depending on how you structure incident response.
This section describes the level of customer impact. How many customers did the incident affect? Did customers lose partial or total functionality? Adding tags can be helpful here as well to help with future reporting, filtering and search.
In the section titled “DETAILED DESCRIPTION OF IMPACT,” the authors thoroughly break down which users and capabilities were affected.
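If you store retrospectives somewhere searchable, the impact questions and tags above map naturally onto a small record. The following is only a sketch of one possible shape: the `ImpactRecord` class, its field names, the incident IDs, and the tag values are all hypothetical and not part of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class ImpactRecord:
    """Customer impact for one incident (hypothetical field names)."""
    incident_id: str
    customers_affected: int
    functionality_lost: str                       # "partial" or "total"
    tags: set[str] = field(default_factory=set)   # free-form labels for search


def with_tag(records: list[ImpactRecord], tag: str) -> list[ImpactRecord]:
    """Return only the records that carry the given tag."""
    return [r for r in records if tag in r.tags]


# Made-up example data, purely for illustration.
records = [
    ImpactRecord("INC-1", 1200, "partial", {"load-balancer", "sev2"}),
    ImpactRecord("INC-2", 40, "total", {"billing", "sev1"}),
    ImpactRecord("INC-3", 300, "partial", {"load-balancer", "sev3"}),
]

for record in with_tag(records, "load-balancer"):
    print(record.incident_id, record.customers_affected, record.functionality_lost)
```

Keeping tags as a set makes it cheap to filter past incidents when a similar one comes up.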
The follow-up actions section is incredibly important because it makes accountability for addressing an incident’s contributing factors forward-looking. Follow-up actions can include upgrading your monitoring and observability, bug fixes, or even larger initiatives like refactoring part of the code base. The best follow-up actions also detail who is responsible for each item and when the rest of the team should expect an update.
While detailed action items are rarely visible to the public, Sentry did publish a list of improvements the team planned to make after this outage covering both fixes and process changes.
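One way to keep follow-up actions accountable is to record an owner and a next-update date with each item, then check for items whose update is overdue. The sketch below assumes a simple in-memory list. The `FollowUpAction` class, the `overdue` helper, and the sample names and dates are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FollowUpAction:
    """One follow-up item from a retrospective (hypothetical field names)."""
    description: str
    owner: str           # who is responsible for the item
    next_update: date    # when the team should expect an update
    done: bool = False


def overdue(actions: list[FollowUpAction], today: date) -> list[FollowUpAction]:
    """Return open actions whose promised update date has already passed."""
    return [a for a in actions if not a.done and a.next_update < today]


# Made-up example data, purely for illustration.
actions = [
    FollowUpAction("Add alerting on error rate", "alice", date(2024, 5, 1)),
    FollowUpAction("Fix retry bug in deploy script", "bob", date(2024, 5, 10), done=True),
    FollowUpAction("Refactor config loader", "carol", date(2024, 5, 20)),
]

for action in overdue(actions, today=date(2024, 5, 15)):
    print(f"{action.owner}: {action.description} (update was due {action.next_update})")
```

A check like this could run weekly so that open items do not quietly go stale.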
As systems grow more complex, it’s harder than ever to pinpoint a root cause for an incident. Each incident might involve multiple dependencies that impact the service, and each dependency might result in its own action items, so there is no single root cause. To determine a contributing factor, consider using “because, why” statements.
In this retrospective, authors cover contributing factors such as a change in how the Docker backend executes build scripts, missing alerting coverage for the errors, and more.
This section is one of the most important, yet one of the most rarely filled out. The narrative section is where you write out an incident like you’re telling a story.
Who are the characters and how did they feel and react during the incident? What were the plot points? How did the story end? This will be incomplete without everyone’s perspective.
Make sure the entire team involved in the incident gets a chance to write their own part of this narrative, whether through async document collaboration, templated questions, or other means.
The timeline is a crucial snapshot of the incident. It details the most important moments and can contain key communications, screenshots, and logs. Building it is often one of the most time-consuming parts of a post-incident report, which is why we recommend using tooling to aggregate the timeline automatically.
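At its simplest, aggregating a timeline means collecting timestamped events from several places and sorting them into one chronological list. Here is a minimal sketch of that idea. It is not how any specific tool works, and the `TimelineEvent` class, the `build_timeline` helper, the source names, and the sample events are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEvent:
    """A single moment in the incident (hypothetical field names)."""
    timestamp: datetime
    source: str   # e.g. "chat", "alerts", "deploys"
    text: str


def build_timeline(*sources: list[TimelineEvent]) -> list[TimelineEvent]:
    """Merge events from several sources into one chronologically ordered list."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda e: e.timestamp)


def utc(hour: int, minute: int) -> datetime:
    """Helper for the made-up sample data: a UTC time on an arbitrary day."""
    return datetime(2024, 5, 1, hour, minute, tzinfo=timezone.utc)


# Made-up example data, purely for illustration.
alerts = [TimelineEvent(utc(14, 2), "alerts", "Error rate alert fired")]
chat = [
    TimelineEvent(utc(14, 5), "chat", "Incident commander assigned"),
    TimelineEvent(utc(14, 31), "chat", "Rollback confirmed, errors recovering"),
]
deploys = [TimelineEvent(utc(13, 58), "deploys", "Config change rolled out")]

for event in build_timeline(chat, alerts, deploys):
    print(f"{event.timestamp:%H:%M} [{event.source}] {event.text}")
```

Using timezone-aware timestamps throughout avoids ordering mistakes when sources log in different time zones.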
Technical analyses are key to any successful retrospective. After all, this section serves as a record and a possible resolution for future incidents. Any information relevant to the incident, from architecture graphs to related incidents to recurring bugs, should be detailed here.
Here are some questions to answer with your team:
- Have you seen an incident like this before?
- Has this bug occurred previously, and if so, how often?
- What dependencies came into play here?
Incident management process analysis
At the heart of every incident is a team trying to right the ship. But how does that process go? Is your team panicked, hanging by a thread and relying on heroics? Or, does your team have a codified process that keeps everyone cool? This is the time to reflect on how the team worked together.
Here are some questions to answer with your team:
- What went well?
- What went poorly?
- Where did you get lucky and how can you improve moving forward?
- Did your monitoring and alerting capture this issue?
Communication during an incident is a necessity. Stakeholders such as managers, the line of business (e.g., sales, support, PR), C-level executives, and customers will want updates. But internal and external communication might look very different. Even internal communication might differ between what you would send a VPE and what you would send your sales team.
Here, document the messaging that was disseminated to different categories of stakeholders. This way, you can build templates for the future to continue streamlining communication.
In this incident, Google ensures that all major updates are regularly communicated. The team also lets users know when they can next expect to be updated. “We are still working on restoring the service of Google Compute Engine Persistent Disks in europe-west1-b. We will provide another status update by 19:00 US/Pacific with current details.”
Other Best Practices to Keep in Mind
- Do the report within 48 hours
- Ensure reports are housed such that they can be dynamically surfaced during incidents
- Add graphics and charts to help readers visualize the incident
- Be blameless. Remember that everyone is doing their best and failure is an opportunity to learn
Failure is the most powerful learning tool, and deserves time and attention. Each retrospective you complete pushes you closer to optimal reliability. While they do take time and effort, the result is an artifact that is useful long after the incident is resolved.
By using this template, your team is on the way to taking full advantage of every incident.
If you enjoyed this blog post, check out these resources:
- List of public-facing postmortems: a collection of postmortems, incident retrospectives, and more from various companies across the industry.
- 5 Best Practices on Nailing Postmortems: 5 best practices that your team can begin using today to take your retrospective to a new level.
- Improving Postmortem Practices with Veteran Google SRE, Steve McGhee: Google Veteran Steve McGhee gives advice from his experiences in the field.
- Improving Postmortems from Chores to Masterclass with Paul Osman: Paul Osman talks about how you can learn more from your incidents.
Originally published at https://www.blameless.com.