Getting SRE Buy-in from C-Levels for Error Budgets and SLOs

You now have postmortems implemented, automated, and well-structured. You’re generating reports and data automatically from all your incidents. Two levels of management have bought into your SRE efforts. That is a huge accomplishment! If you’re here, you’re gaining real traction in adopting SRE best practices, but the battle is not won yet. The hardest but most strategic effort will be proving to your C-levels why they should buy into SRE.

The situation

This phase revolves around well-defined SLOs and SLIs that hook into the right parts of the system. You’ll need your business teams to agree on the SLOs, the error budget thresholds, and what will happen when a threshold is breached. As you prepare this proposal, keep two key thoughts in mind.

  1. What does your error budget policy include? We define error budget policies as including SLOs, SLIs, and error budget responses.
  2. Organization-wide adoption of SRE will be a large undertaking for your C-levels. Your CEO/CTO/CIO will need company-wide support to connect engineering, product, and business units. So, your incentives need to persuade them.
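To make the first point concrete, here is a minimal sketch of an error budget policy captured as data: an SLI, an SLO target, and the agreed responses at each threshold. The class name, field names, the checkout SLI, and the threshold values are all hypothetical and only for illustration; your own policy would use whatever the business and engineering teams negotiate.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ErrorBudgetPolicy:
    """Hypothetical container for an error budget policy."""
    sli: str            # what is measured
    slo_target: float   # e.g. 0.999 means 99.9% of events must be good
    window_days: int    # rolling window the SLO is evaluated over
    # (fraction of budget consumed, agreed response) pairs
    responses: List[Tuple[float, str]] = field(default_factory=list)

    def error_budget(self) -> float:
        """Fraction of events allowed to fail within the window."""
        return 1.0 - self.slo_target

    def response_for(self, budget_consumed: float) -> str:
        """Return the response for the highest threshold that has been reached."""
        action = "Continue normal feature work"
        for threshold, response in sorted(self.responses):
            if budget_consumed >= threshold:
                action = response
        return action


# Assumed example values, not taken from any real policy.
policy = ErrorBudgetPolicy(
    sli="successful checkout requests / total checkout requests",
    slo_target=0.999,
    window_days=30,
    responses=[
        (0.50, "Review recent incidents in the weekly reliability sync"),
        (0.75, "Prioritize reliability work over new features"),
        (1.00, "Freeze feature releases until the budget recovers"),
    ],
)

print(policy.response_for(0.80))
```

Writing the policy down in this form makes the "what happens on a breach" conversation explicit: every threshold has an owner-approved response before an incident occurs.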

The incentives

  • Long-term competitive advantage: Protect the customer experience relative to competitors and increase customer loyalty.
  • Growing complexity of tech stacks and dependency on microservices: Issues worsen if unaddressed. As we move toward a world of complex, distributed systems, the way we operate must evolve to support that. This is the chance to catch up.
  • Reliability is feature №1: If a user can’t access your service or has a degraded experience, then features are irrelevant. Reliability is the foundation that all other features build upon.

Of course, you can expect resistance towards adoption, even with these high-level incentives.

The resistance

The emotional appeal

Additionally, there is a significant financial aspect involved. Without SRE, organizations risk direct customer impact and the SLA penalties that come with it. That can be very expensive and damaging to the brand and to customer trust. If the reliability issues are too disruptive to overlook, customers may churn. The data you collect on the cost of downtime can indicate how reliability affects your brand value.

To avoid triggering an SLA breach, you’ll need to adopt SLOs. These often act as a safety net, letting you know when you’re in danger before you need to start sounding the alarms. To prove to C-levels that SLOs are crucial, you can do two things.

  1. Quantify the cost of downtime (e.g. SLA losses) and estimate the bottom-line impact of reliability.
  2. Show them your organization’s Net Promoter Score (NPS) alongside a detailed customer satisfaction survey to correlate the score with reliability.
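The arithmetic behind the first item can be sketched in a few lines. This is a rough estimate under stated assumptions: the revenue-per-minute figure, the SLA credit amount, and the function names are hypothetical placeholders, and a real calculation would pull these numbers from finance and from your contracts.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime the SLO tolerates over the window, in minutes."""
    return (1.0 - slo_target) * window_days * 24 * 60


def downtime_cost(downtime_minutes: float,
                  revenue_per_minute: float,
                  sla_credits: float = 0.0) -> float:
    """Lost revenue during the outage plus any SLA credits owed to customers."""
    return downtime_minutes * revenue_per_minute + sla_credits


# A 99.9% SLO over 30 days tolerates roughly 43.2 minutes of downtime.
budget_minutes = allowed_downtime_minutes(0.999, window_days=30)

# Assumed figures: a 90-minute outage, 500 in revenue per minute,
# and 10,000 in SLA credits.
cost = downtime_cost(90, revenue_per_minute=500.0, sla_credits=10_000.0)

print(f"Allowed downtime: {budget_minutes:.1f} minutes")
print(f"Estimated cost of the outage: {cost:,.0f}")
```

Presenting the allowed downtime next to the cost of a single outage gives executives a direct comparison between what the SLO tolerates and what a breach costs.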

The logical appeal

Show your executives the metrics behind the SLOs and explain how they are set to optimize performance of the most important paths in the user’s journey. Consider bringing figures on the amount of data and the number of access points in the cloud, as well as the number of services the company depends on. This shows the need for a system that can adapt to the growing complexity that comes with moving toward cloud and microservices.

Proactive is always better than reactive: SLOs and the use of error budgets help us move from a reactive mode (knowing that incidents will occur but not where and why), to a proactive mode of anticipating areas of risk and failure. Error budgets with negotiated terms between the business and engineering teams allow teams to respond in the right way by standardizing actions and protocols.

To prove this, you’ll need two metrics:

  1. Automated reporting on incidents, SLOs, and error budgets that highlights risk areas before customers are impacted.
  2. A map of all areas of customer impact which could have been prevented with this knowledge.
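One way automated reporting can flag risk before customers feel it is to track how fast the error budget is being spent. The sketch below uses the common burn rate definition (observed error rate divided by the error rate the SLO allows); the function names and the request counts are assumptions for illustration, not output from any particular tool.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget would be exactly spent by the end of
    the window; anything higher means it will run out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_events == 0:
        return 0.0
    error_rate = bad_events / total_events
    return error_rate / (1.0 - slo_target)


def days_to_exhaustion(window_days: int, rate: float) -> float:
    """Days until a full budget is spent if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


# Assumed sample: 4,000 failed requests out of 1,000,000 against a 99.9% SLO.
rate = burn_rate(bad_events=4_000, total_events=1_000_000, slo_target=0.999)
remaining = days_to_exhaustion(window_days=30, rate=rate)

print(f"Burn rate: {rate:.1f}x")
print(f"Budget exhausted in about {remaining:.1f} days at this pace")
```

A report that shows a service burning its budget several times faster than planned points at an area of risk while there is still time to act, which is the shift from reactive to proactive described above.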

With these metrics and appeals to both the emotion and logic of your C-level executives, you’ll be able to convince them that investing in SRE is a strategic initiative that impacts the success of the entire company.


Originally published at https://www.blameless.com.
