Getting SRE Buy-in from a Manager or Lead for Incident Response

Adopting SRE best practices can be difficult, especially when you need approval from managers, VPs, CTOs, and more. In this blog post, we’ll walk you through crafting a winning pitch for each level of leadership to ensure that SRE buy-in will succeed in your organization. Let’s start at the beginning with your team lead or manager.

The situation

First, recognize that your manager will need a lot of support from engineering and DevOps teams for this transition. These teams will need training in the new incident management system so that they can use it each time an incident occurs.

Second, you need to define what you mean by incident management. We’ll define incident management as the process of assembling, investigating, resolving, and learning from incidents. This includes incident response playbooks, time-to-detection measurement, monitoring systems, and ticketing workflows.

Once you have a handle on the basic proposal, it’s time to think about what the team (manager included) will gain from an incident management system.

The incentives

  • Incident management best practices help you restore your systems as fast as possible when an incident occurs.
  • A playbook gives everyone a sense of control amidst the chaos. It defines a set of repeatable practices to drive consistency while helping everyone to be thorough with their problem-solving.
  • Measuring time to resolution (TTR) and time to detection (TTD) allows the manager to quantify the team’s improvement over time.
  • Integration with alerting and ticketing systems reduces context switching between different apps. This lowers the stress from mentally keeping track of many systems.
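
To make the TTR and TTD bullet concrete, here is a minimal Python sketch of how the two metrics could be computed from incident timestamps. The Incident record, its field names, and the sample data are hypothetical, and the sketch assumes both metrics are measured from the start of impact; your team may define them differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; the field names are illustrative."""
    started_at: datetime   # when impact began
    detected_at: datetime  # when an alert fired or someone noticed
    resolved_at: datetime  # when service was restored

    @property
    def ttd(self) -> timedelta:
        # Assumption: TTD runs from the start of impact to detection.
        return self.detected_at - self.started_at

    @property
    def ttr(self) -> timedelta:
        # Assumption: TTR runs from the start of impact to resolution.
        return self.resolved_at - self.started_at


def mean_minutes(durations):
    """Average a collection of timedeltas, expressed in minutes."""
    return mean(d.total_seconds() / 60 for d in durations)


# Made-up sample incidents, for illustration only.
incidents = [
    Incident(datetime(2024, 1, 5, 2, 0), datetime(2024, 1, 5, 2, 20),
             datetime(2024, 1, 5, 3, 30)),
    Incident(datetime(2024, 1, 19, 14, 0), datetime(2024, 1, 19, 14, 10),
             datetime(2024, 1, 19, 14, 50)),
]

mean_ttd = mean_minutes(i.ttd for i in incidents)
mean_ttr = mean_minutes(i.ttr for i in incidents)
print(f"Mean TTD: {mean_ttd:.0f} min, mean TTR: {mean_ttr:.0f} min")
```

Recording these three timestamps for every incident is usually enough to give your manager a baseline to compare against later.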

Yet, explaining these incentives to your manager and hoping for immediate support will not guarantee buy-in. You need to anticipate the resistance your manager will have towards this big change.

The resistance

To make this argument, you’ll need to rely on both a factual, logical appeal and an emotional one. There is no one right answer to this problem, since every organization, team, and manager is different, but there are some topics your manager might connect with better than others.

Here, you’ll have to empathize and put yourself in your manager’s shoes. What would motivate you?

The emotional appeal

One of the major sources of fear is loss of control. When an incident occurs, current manual processes fail. With the move to microservices, it can be hard to understand where the incident originated and how to mitigate it. Rollbacks are an option, but they don’t solve the underlying problem. Your manager is accountable for returning the service to normal and for explaining why the incident happened in the first place.

This responsibility is a considerable challenge. With a better incident management system, your service can be back up and running sooner. And with automated runbooks, resolving the incident can involve far less chaos. Faster and more consistent incident resolution can help your manager regain some control.

Another source of fear is losing your team. If your teammates are waking up at 2:00 AM with no end in sight, morale will be low. Additionally, manual processes are toilsome and stressful. The team wants to see the process getting less stressful over time, not worse as the number of services increases. Operational complexity is inevitable, but if it results in more incidents and unplanned work, it will lead to burnout and an unhealthy team culture.

People will begin searching for other employment options if these issues are not resolved. When headcount drops and turnover rates soar, your manager will need to keep the ship sailing while drowning in the labor-intensive process of backfilling, hiring, and onboarding new engineers. This cycle is not sustainable, and is enough to keep your manager up at night.

The logical appeal

It’s important not to blame your manager for these struggles. After all, some of these issues are beyond their control. Systems have become more complex, and the bar is higher than ever. Instead of pointing fingers, it’s time to add some logic. For this, you’ll need to provide your manager with two important pieces of information to promote adoption:

  • A service catalog listing the services and microservices you have and their dependencies. Show how these have grown and will continue to grow.
  • The trends of TTD and TTR, tracked during the proof-of-concept phase of the new incident management system. If the results are positive, you can justify rolling out the system and process changes to more teams.
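
As a rough illustration of the trend tracking described above, the following Python sketch compares median TTR before and during a proof of concept. The sample values are made up and the percent_change helper is hypothetical; the same comparison would apply to TTD.

```python
from statistics import median


def percent_change(baseline, pilot):
    """Percent change of the pilot median relative to the baseline median."""
    base = median(baseline)
    return (median(pilot) - base) / base * 100


# Made-up TTR samples in minutes, for illustration only.
baseline_ttr = [95, 120, 80, 150, 110]  # incidents before the proof of concept
pilot_ttr = [70, 90, 60, 85, 100]       # incidents during the proof of concept

ttr_change = percent_change(baseline_ttr, pilot_ttr)
print(f"Median TTR change: {ttr_change:+.1f}%")
```

The median is used here instead of the mean so that a single unusually long incident does not dominate the comparison. A negative number means incidents were resolved faster during the proof of concept.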

Armed with emotional and logical appeals, you can approach your team lead and discuss improving your incident management system. This is a great first step towards SRE adoption, but you can’t stop here, or you’ll reach a local maximum that falls short in the long term. You’ll need to think about how to frame SRE adoption for the next level of leadership to gain the buy-in you need.

Originally published at https://www.blameless.com.
