Best Practices for Effective Incident Management

Incident management is a set of processes used by operations teams to respond to latency or downtime, and return a service to its normal state. Incident management practices have long been well-defined through frameworks such as ITIL, but as software systems become more complex, teams increasingly need to adapt their incident management processes accordingly.

Below are five incident management best practices that your team can begin using today to improve the speed, efficiency, and effectiveness of your incident management process.

Why use incident management best practices

In today’s high-stakes, high-availability world, uptime has never been more important to focus on. Reliability has become the №1 feature for companies, and unreliable services can make or break an organization’s revenue and reputation.

Teams responding to incidents have become the soldiers on the front lines for a company’s overall health and well-being. With downtime costs skyrocketing, it’s important that your team is trained, prepared, and ready for battle.

This requires adopting a smooth, effective incident management process in order to resolve issues faster, communicate and collaborate through the process, and learn from these incidents to possibly prevent the same incident from happening again.

Adopt alerting and on-call best practices

Effective incident management begins with setting a strong foundation. Alerting and on-call procedures are crucial for your team’s success. When you’re experiencing an incident, this is how you determine what kind of incident you’re facing as well as who to call for help.

Set alerts that matter

There is such a thing as too much information. When your on-call team is getting paged at 12:34 AM, 1:11 AM, 2:46 AM, and on until dawn, it can be impossible for them to respond adequately to each alert. When pager fatigue sets in, quality and efficiency go down the drain. You need to determine what’s worth alerting on, and what isn’t.

One way to do this is by thinking about your customers first and determining SLIs, or service level indicators.

The touchpoints between the user and your service will involve requests and responses — the building blocks of SLIs. For each touchpoint you identify, you should be able to break down the specific SLIs measuring that interaction, such as the latency of the site’s response, the availability of key functions, and the liveness of data customers are accessing.

Next, you’ll use those SLIs to create SLOs, or service level objectives. An SLO is the internal threshold your SLI needs to meet to keep your customers happy. If the SLI breaches this threshold, an alert should be triggered.

Alerts on SLOs help diagnose the severity of the incident as well as “quantify impact to clients: when an SLO-alert fires, the responder knows that a client is impacted. Not only do SLO alerts indicate that clients are affected, they also indicate how many requests are affected.”
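As a rough illustration of the SLI-to-SLO step described above, the sketch below computes an availability SLI from request counts and checks it against an SLO target. The function name, the request counts, and the 99.9% target are assumptions made up for this example; they do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class SloCheck:
    sli: float             # fraction of requests that succeeded
    failed_requests: int   # how many requests were affected
    breached: bool         # True when the SLI is below the SLO target


def check_availability_slo(total_requests: int,
                           failed_requests: int,
                           slo_target: float) -> SloCheck:
    """Compare an availability SLI against an SLO target (hypothetical helper)."""
    if total_requests <= 0:
        # No traffic in the window: nothing to alert on.
        return SloCheck(sli=1.0, failed_requests=0, breached=False)
    sli = (total_requests - failed_requests) / total_requests
    return SloCheck(sli=sli,
                    failed_requests=failed_requests,
                    breached=sli < slo_target)


# Example window: 10,000 requests, 25 failures, 99.9% availability target.
result = check_availability_slo(10_000, 25, slo_target=0.999)
if result.breached:
    print(f"SLO alert: {result.failed_requests} requests affected "
          f"(SLI {result.sli:.4f})")
```

Carrying the failed request count alongside the breach flag lets the alert tell the responder how many requests are affected, not just that something is wrong.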

Prepare your team for on-call

Once you’ve been alerted to an incident, it’s just as important to make sure that your team is prepared to respond, no matter what the level of severity. While there are many components to this, two rise above the rest as priorities to focus on for a healthy on-call process.

There is a huge difference between spending a weekend on call with no incidents, and spending a weekend on call with 3 high-severity incidents. If we only look at time spent on call, we don’t get an accurate view of who is most likely to be too tired or burnt out to respond to another incident.

So, you have your alerts set up and your on-call team is prepared. What comes next?

Prioritize incidents and use runbooks to get ahead of the curve

You’ve been alerted that you have an incident, and you know who to call. But is it time to ring everyone? It’s important to know whether an incident requires waking your entire team in the middle of the night, or if it can wait until Monday morning. It’s also important to know what steps to take once the incident is discovered.

One way to determine the severity of incidents is by customer impact. After all, if your customers won’t know anything is wrong, it can probably wait a few hours until your team has had the chance to wake up and grab a cup of coffee.

For example, PagerDuty published a chart with their defined severity levels, which our team at Blameless has adapted for our own internal processes:

This may not be accurate for your team or service, but it’s important to determine this so your team members can make the right call during an incident. Key information like this should also be baked into a comprehensive runbook.
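To make the idea of severity by customer impact concrete, here is a minimal sketch of how a team might encode its own levels. The three levels, the 25% threshold, and the function names are hypothetical and chosen only for illustration; they are not PagerDuty’s or Blameless’s actual definitions.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # critical: core functionality down or broad customer impact
    SEV2 = 2  # major: some customers are affected
    SEV3 = 3  # minor: customers are not affected


def classify_severity(customers_affected_pct: float,
                      core_function_down: bool) -> Severity:
    """Map customer impact to a severity level (thresholds are assumptions)."""
    if core_function_down or customers_affected_pct >= 25.0:
        return Severity.SEV1
    if customers_affected_pct > 0.0:
        return Severity.SEV2
    return Severity.SEV3


def should_page_now(severity: Severity) -> bool:
    """Page immediately for SEV1 and SEV2; SEV3 can wait for business hours."""
    return severity <= Severity.SEV2


sev = classify_severity(customers_affected_pct=0.0, core_function_down=False)
print(sev.name, "page now" if should_page_now(sev) else "can wait")
```

Writing the rules down in one place, whether in code or in a runbook, means the person on call does not have to make the judgment from scratch at 2 AM.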

Runbooks — which are predefined procedures meant to be performed by operators — are important components of incident response. They help with:

  • Automating the toil from incidents when possible
  • Describing what to do in the event of an incident

Runbooks can tell you where to check for code, who to escalate to, as well as what the incident postmortem or retrospective process looks like, and can be tailored to the specific type and severity of incidents.

Though runbooks are very versatile and customizable, there are some components that all good runbooks should contain. According to AWS, here are a few of these must-haves:

  • Requirements to be able to execute the runbook
  • Constraints on the execution of the runbook
  • Procedure steps and expected outcomes
  • Escalation procedures
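One way to keep runbooks consistent is to give them a fixed structure that mirrors the four must-have components (requirements, constraints, procedure steps with expected outcomes, and escalation procedures). The sketch below is only an illustration of that idea; the class names, field names, and the sample runbook contents are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    action: str            # what the operator does
    expected_outcome: str  # what they should see if it worked


@dataclass
class Runbook:
    title: str
    requirements: List[str] = field(default_factory=list)  # access, tools needed
    constraints: List[str] = field(default_factory=list)   # limits on execution
    steps: List[Step] = field(default_factory=list)
    escalation: List[str] = field(default_factory=list)    # who to contact, in order


def missing_sections(runbook: Runbook) -> List[str]:
    """Return the names of must-have sections that are still empty."""
    sections = {
        "requirements": runbook.requirements,
        "constraints": runbook.constraints,
        "steps": runbook.steps,
        "escalation": runbook.escalation,
    }
    return [name for name, content in sections.items() if not content]


# Hypothetical runbook that has not been given an escalation path yet.
draft = Runbook(
    title="Elevated API latency",
    requirements=["Read access to the service dashboards"],
    constraints=["Do not restart more than one instance at a time"],
    steps=[Step("Check the latency dashboard", "The slow endpoint is identified")],
)
print(missing_sections(draft))
```

A check like this could run whenever a runbook is added or edited, so gaps are caught before an incident rather than during one.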

With prioritization and runbooks, your incidents are on the right path towards speedy resolution. But there are some additional incident management best practices that you’ll need to pay attention to as well.

Set defined roles, responsibilities, and communication guidelines

There are countless moving pieces during an incident, and even if you have runbooks, it can be difficult to keep in touch with your team about what you’ve done and haven’t done.

This is especially true in the era of remote work when you can’t simply go to your teammate’s desk or head to the incident war room to check in.

Instead, we need to focus on improving our collaboration skills with defined roles and responsibilities and communication guidelines.

Roles and responsibilities for incident management

There are four main roles during incident management, and each role has different responsibilities. With smaller teams, sometimes you’ll need to combine these roles in order to cover all your bases, and that’s fine. As long as someone takes charge of the responsibilities, the roles can be combined in the way that best fits your team.

Establishing communication guidelines

Once the roles have been filled and responsibilities doled out, you need to understand how teammates are expected to communicate with each other during an incident. While it’s important to know whether the protocol involves communicating over Slack and Zoom, or whether your team chats over Microsoft Teams, it’s even more important to know how to treat one another.

Every engineer makes mistakes; it’s how lessons are learned. When an incident happens, it’s easy to place blame on the last person who pushed code.

However, people are never the root cause of an incident; processes are. To be great at incident response, you will need to be compassionate in the face of these mistakes and seek to learn from them.

Issues won’t just cause incidents; they’ll pop up during incidents. Sometimes a fix can cause more damage to a service than it repairs, and you’ll need to learn to have compassion during these moments too.

Instead of getting angry with a team member, remember that they are just trying to help. Everyone is making the decisions they feel are best at that moment in time with the information they have.

Create comprehensive incident retrospectives

It is important that good incident management spans the whole lifecycle of an incident, beyond resolving or closing an incident. Even after resolution, there are important steps to complete for exceptional incident management. Creating comprehensive incident retrospectives to properly document what happened is key to overall success. Not only is it a record that your team can refer back to during future incidents, but it’s also something that you can share more widely to help spread knowledge within the entire organization.

There’s a craft to creating retrospectives that are valuable, however. Below are some tips to help:

Once you have a retrospective that you are proud to publish, it’s time to make sure all that knowledge is fed back into your system. Otherwise, this incident will have just been a hit to the business, and a missed opportunity for learning.

Close the circle in your incident management lifecycle

With the increasing frequency of incidents and complexity of systems, it’s not enough to simply fix an issue, fill out a quick Google doc for a retrospective, and move on. We need to make sure that we’re taking every opportunity to close the learning gap and take proactive, remediative actions in our incident management lifecycle.

To do this, make sure to track all follow-up items assigned from each incident. If some action items are lengthy, costly fixes, make sure to discuss with the product teams how they can be prioritized.

Additionally, you’ll need to make sure that you share concerns about these issues with stakeholders and adjoining teams to create battle plans. If reliability is being compromised for new features, you’ll need to discuss ways to incentivise reliability and encourage buy-in from all stakeholders.

Lastly, you will need to regularly examine your SLOs and error budget policies. This helps keep you apprised of changing customer expectations and makes sure you’re on the same page as your consumers.

If you’re consistently exceeding your error budgets yet customer satisfaction isn’t being affected, perhaps you’re not giving your team enough slack. If you’re meeting your SLOs but customers are unhappy, maybe it’s time to make your criteria more stringent.
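For readers who want to see the arithmetic behind an error budget review, the sketch below assumes a request-based budget: the budget is the share of requests the SLO allows to fail, and consumption is the ratio of actual failures to that allowance. The function name and the sample numbers are assumptions for illustration only.

```python
def error_budget_consumed(slo_target: float,
                          total_requests: int,
                          failed_requests: int) -> float:
    """Return the fraction of the error budget used (1.0 means fully spent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("SLO target and traffic leave no error budget to measure")
    return failed_requests / allowed_failures


# Example: a 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
used = error_budget_consumed(slo_target=0.999,
                             total_requests=1_000_000,
                             failed_requests=500)
print(f"Error budget consumed: {used:.0%}")
```

A value consistently above 1.0 with no drop in customer satisfaction suggests the SLO may be stricter than it needs to be, while a value well under 1.0 alongside unhappy customers suggests the opposite.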

Incident management best practices are crucial components to your team’s success during a crisis. They help you:

  • Ensure you’re never caught off guard
  • Minimize stress and thrash and optimize communication during incidents
  • Maximize learning to keep providing excellent customer satisfaction

With some planning and teamwork, you can begin employing incident management best practices today.

If you’re interested in seeing how Blameless can help you automate toil from incident management, try us out.

Originally published at https://www.blameless.com.
