SRE Leaders Panel: SRE Adoption as Organizational Transformation

Blameless
30 min read · Apr 6, 2021

Blameless recently had the privilege of hosting SRE leaders Kurt Andersen, SRE Architect at Blameless, Vanessa Yiu, Executive Director, Enterprise Architecture at Goldman Sachs, and Tony Hansmann, Former Global CTO at Pivotal Software, Inc. to discuss how to drive SRE adoption within an organization, including the processes teams should put in place, how to change minds and behaviors, how to get the right message to the right people, and how to garner internal support with both individual contributors and leaders.

The transcript below has been lightly edited. If you’re interested in watching the full panel, you can do so here.

Chris Hendrix: Welcome to Blameless’s latest industry panel. I’d like to start off with asking all of our participants to share their name and what they do for a living.

Vanessa Yiu: My name is Vanessa Yiu. I work at Goldman Sachs. I’ve been in the industry for about 17 years and I’ve done a whole range of things from Unix system administration through to managing CI/CD tool chains, DevOps, SRE, and, more recently, running enterprise architecture as well as a risk management program.

Tony Hansmann: I used to be a sysadmin too. I started in 1994 working on System V Release 3. I was a line sysadmin until about 2014. At Pivotal, they didn’t have sysadmins like that. They just had developers, so they sent me to a development team. Then they decided they did need sysadmins, and they called them SREs. I helped Pivotal productize all those kinds of things you have to do to run operations. My focus these days is on the transformation that you can do with your business using these tools.

Kurt Andersen: I joined Blameless as the SRE architect about a month ago. I’m focused on strategy for the product and for the company. I previously was with LinkedIn working in the product SRE team for about eight years.

Chris Hendrix: I am a staff software engineer and developer evangelist here at Blameless. Today’s panel is focused on the idea of SRE as organizational transformation. I actually chose this topic after reading Vanessa’s essays in the recently published 97 Things Every SRE Should Know, which Kurt also has some essays in.

Vanessa’s essays spoke about affecting SRE cultural changes in enterprise companies. That was a very dear topic for me because I come from a consulting background at Pivotal with Tony where it was our job to think about the transformation of other organizations.

I think a lot of our attendees might be participants in an SRE transformation. They might be SRE or DevOps, ICs or managers. But I don’t know how many of them are strategists in the SRE adoption process, or are thinking about this as an active process of change. I wanted to start there for this discussion. Vanessa, can you tell me more about your experiences with the process of SRE adoption? How long have you seen it take? Were there steps along that process that were particularly impactful or tricky?

Vanessa Yiu: The SRE journey at our organization started about three years ago. At the time, we kind of knew we wanted SRE, but not necessarily what we wanted SRE for. We actually spent a fair amount of time figuring out where in our organization SRE could bring the most value.

To give you some context of the size of the organization, we’re talking about over 11,000 engineers. As a firm, we’re over 100 years old. We’ve got all sorts of different systems serving different businesses at different stages of maturity. Some systems have been around for a long time, some are almost like startups we’re rebuilding new.

We quickly realized that there is really no one-size-fits-all. At the time, the company had only ever experienced SRE in a certain way. And that wasn’t necessarily going to scale for us. For external facing systems where we need very high availability, perhaps it makes sense to have embedded SRE teams where they’re on call, doing support, and very involved in the whole lifecycle management of the product. But not every one of our services needs that. There are internal services that may not need that level of SRE investment.

So then, we decided that maybe SRE can help teach others the core disciplines and make sure they understand SLOs or concepts of toil elimination rather than investing in a fully embedded SRE team. We quickly figured out that there are many ways of doing SRE, and we will need to do different things along the way.

Tony Hansmann: You’re going to be a pioneer, a settler, or in the town-planning or industrialist phase. What I found when I went to most organizations, in the Fortune 500 or so, was that companies didn’t have much of a basis. You had to start as a pioneer in SRE. What we did is we taught SLOs, SLIs, and SLAs, how they’re actually related, and how to have a half-hour conversation about them without getting lost in the definition of the acronym. Because that is actually a real problem.

When you say, “We’d like to adopt SRE models,” you’re saying, “I want to be a grown-up IT shop.” You’re saying, “We’re willing to work on budgets, we’re willing to make promises, and we’re willing to make revisions to those promises.” The starting process to me is to find one person who has this under their pillow (from the Accelerate book). For example, Vanessa would say, “I’m looking for six people across Goldman Sachs, across a 50,000-person organization who care.”

Then we have to identify a path to production and find the friendliest team on that path to production. We say, “Hey, can we try a few things? There are these four golden signals to measure, can we just measure yours? We don’t need you to do anything new, we just want to measure those. Can we ask you to make promises about SLOs, not even SLAs, for you to monitor internally?”

The basic thing is that we work with one team on the ground to make sure that they are eliminating toil. We teach them ways to identify toil. Once we’re done, they say, “Oh, my gosh, we never thought we could do this.” Transformation means something that used to be expensive or really hard goes away categorically or is pretty easy to do.

As soon as we get a team like that, we want to introduce them to the next adjacent team in that path to production. And we want them to tell their story about how their upstream or downstream just got a little better. Are they willing to join a journey like that? That’s the pioneer phase of it. And that’s where almost every organization I’ve ever walked into is at.

Kurt Andersen: I’ve seen that all over the place. I talk to people at conferences and they’re usually in the throes of that early phase, maybe with or without management support to protect them and support their experimentation. I think there’s another model, which can be the crisis model. I’ll harken back, even before I joined LinkedIn, to 2010, when LinkedIn had just gone public.

We were in merge hell. I’m taking these stories third-hand, but every time there was a release to production, the site would be down for a week while all the problems got worked through, and people would have sleeping bags in the office. It was one of those typical horror stories where life is horrible. Everybody’s burning out. This turned into the SRE organization consisting of eight people. Obviously, a much smaller LinkedIn at that time.

LinkedIn undertook a fundamental cultural change. They said, “This is not working, this will not scale, this will not go forward, we are going to change.” And part of it was that the company was small enough at the time that you could do this across-the-board commitment. It was a commitment from the top level of the engineering organization and the CEO. They just said, “What we have isn’t working, we need a new model, SRE is the way forward.” They brought in some key leaders and developed the whole practice from there.

Chris Hendrix: That sounds like the dream. It’s not often that you come into an organization that’s willing to really invest all the way through the management structure into something like this. This is the tension that people who are working on transforming either their organization or other organizations are really interested in. You also pointed to crisis as being this key aligner. Jonathan Smart talks about that in Sooner Safer Happier. In particular, he mentions that alignment between the business and the engineering objectives is a key principle in SRE.

That does bring us to our next topic which is, to what degree is the SRE adoption process about changing people’s minds versus changing people’s behaviors? Is there a meaningful difference? What is the tension that exists between those two?

Kurt Andersen: One of the things I like to do at the SRE conferences is run an unconference session. In Dublin 2019, we had an unconference session, and one of the topics that the group participants decided to explore was this idea of, you’ve tasted SRE, you know what SRE is, so what would it take for you to not be able to see things that way anymore?

They came to the conclusion that once you start looking at things in this way, once you start caring about reliability and a systems-wide perspective, you can’t stop seeing things that way. There is a very important mindset or paradigm shift that happens. I think that can be facilitated through the practices. People can go through the motions for a little while, and then things will just click and they’ll start seeing things differently.

Tony Hansmann: That’s the “What behavior needs to change.” You have to go to a group that believes positive change is possible. If they’re just like, “Hey, we’ve been ground down by this organization for 30 years,” then get out. These folks’ management structure disallows it. But the mindset you have to have is like, “Hey, are you willing to look?” Just like the non-ability to return to comfortable ignorance, that’s what you’re looking for. That’s the transformation hook. And so you hook them by saying, “Look, we’re just going to let you complain all day. And then we’re going to circle some of those complaints.” Like if finance is too slow. Okay, well, we can’t do much about that, right?

However, if we can’t get our machines as fast as we want, that’s a different concern. Maybe we can address that concern. When you’re looking to bootstrap things that can change, you’re usually looking at an already intractable problem that you can split a little bit. And when you split it, some toil goes away, and people are like, “Wait a minute.” And if you fix something like that, they’ll just open the map. They’ll be like, “Well, I don’t like this, and I don’t like this, and I don’t like this.”

As soon as you get a team to that point, you have everything you need. And then you just pull out the book, like, “Oh, we’re just going to start making promises that we’re going to actually track and understand something about.” Then that’s the behavior change. Soon people will be like, “I do need to know my saturation. That actually seems really critical to know, and I don’t know how I haven’t known it all this time.” That’s the behavior change and the mindset change. And as Kurt says, it’s a holographic problem. You don’t get to choose one or the other.

Chris Hendrix: That’s part of the intractability. Behavior comes with mindset, mindset comes with behavior.

Vanessa Yiu: You need both in order for SRE to be successful in the long term. If people don’t believe or see the value in doing something, they’re just not going to do it. Even if you make them do it for a short time, they’ll revert back to the same patterns. A lot of the times people are just in their comfort zone. They’ve been doing things a certain way, so they might not want to change.

I talked about the scale problem that we have in our organization, and one of the things that we thought about in terms of how to actually drive the behavioral changes is where we have gaps with our tools or processes. And then actually putting the right structures in place, or building the right processes, so that people don’t have to really think about it. This is the natural path that they will go down: they will use these tools, and then they will end up having a better production running experience as a result. That naturally drives the long term change in behavior. That’s really important for things to stick. It has to be what they do day-to-day, as opposed to something in addition that they have to think about.

Kurt Andersen: That model is called the golden path or the paved road. A lot of organizations adopt that by making it easy to do the right thing. And you can do other things; if you want to go bushwhacking, you can, but it’s on your own.

Chris Hendrix: Vanessa, I read in your essays about incentive structures and how you can design incentive structures to make doing the right thing easy and reinforced. Do you have any examples of what might be incentivized for a team that’s mature in their SRE journey, or early in their SRE journey, or what the rewards could be?

Vanessa Yiu: On the ground, the incentive is often what you gain by doing toil elimination. You’re going to have a much better experience when you’re on-call. You’re not going to be up at 3:00 AM. Making sure people understand that the investment upfront will mean that they will reap some of those rewards down the line is important. Other things help too, like transparency and management reinforcing this when they see someone do the right thing. Make wins known to the broader organization and publicize them internally.

Those are also things that will incentivize the right behaviors across the organization. They see model behavior, and then other people will want to mimic that.

Chris Hendrix: Absolutely. People talk about failure walls being something you might want to implement. It encourages being very open and honest about the things that you’ve learned that didn’t go well, or the incidents that you’ve had that you learned from. Obviously, there was an incident to begin with, which no one enjoys being a part of. But it’s important for management and leadership to publicly value the fact that people are being transparent about when something didn’t go well.

Tony Hansmann: If you have top-level air cover, you can do a whole category of things. It’s just nice. If you read the Accelerate book, and you’re that pioneer, it’s radically harder. I advise that pioneers start with building trust interpersonally. Like, “Hey, I’m a person who’s really interested in understanding the four golden signals, and explaining how that works to the business.” If that’s who you are, then find five other people who care about something like that and band together and start building trust, identify your managers, and convert them.

Say, “We think that this is meaningful, I’m not going to let it go, I want it part of my development plan. My target is in six months that you are converting your boss just like I’m converting you.” That’s how we build the scaffolding. The tools have been out there for a long time. Lean is old. And that’s all we’re doing here.

If you’re that initial pioneer, start building trust internally, because SRE is a broader model of trusting across the organization. That’s what it’s for. And higher trust means lower friction, lower friction means better iterations.

Chris Hendrix: Tony, could you talk about what those four golden signals are? We might have attendees who have never heard of those.

Tony Hansmann: Error rate, saturation rate, traffic, and latency. They’re also in the Google SRE Book. For those folks who are putting their toe in the water, the Google SRE book is amazing for Google, but you are going to have to do something radically different. You’re going to have to start at the bottom. So read the Google SRE book and dream in it. But what you’re going to have to do on the ground is going to be different.
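To make the four signals concrete, here is a minimal Python sketch of how they might be summarized for one measurement window. It was not shown in the panel. The `Request` record, the `golden_signals` function, and the choice to express saturation as observed traffic divided by a known capacity are all assumptions made for illustration; real systems usually measure saturation against whichever resource is most constrained.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape)."""
    latency_ms: float
    ok: bool


def golden_signals(requests, window_seconds, capacity_rps):
    """Summarize latency, traffic, errors, and saturation for one window."""
    total = len(requests)
    traffic_rps = total / window_seconds
    error_rate = sum(1 for r in requests if not r.ok) / total if total else 0.0

    # Nearest-rank 95th percentile, using integer math to avoid float rounding.
    latencies = sorted(r.latency_ms for r in requests)
    if total:
        rank = (95 * total + 99) // 100  # ceil(0.95 * total)
        latency_p95_ms = latencies[rank - 1]
    else:
        latency_p95_ms = 0.0

    # Assumption: saturation is approximated as traffic relative to capacity.
    saturation = traffic_rps / capacity_rps

    return {
        "latency_p95_ms": latency_p95_ms,
        "traffic_rps": traffic_rps,
        "error_rate": error_rate,
        "saturation": saturation,
    }
```

Feeding this one team's existing request logs is in the spirit of "we don't need you to do anything new, we just want to measure those."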

Kurt Andersen: I actually recommend people start with the workbook, which was the second one that came out.

Tony Hansmann: That’s what we would say at Pivotal too. The SRE book is great, but start with the handbook because that’s where they’re actually teaching you what you can do on the ground.

Kurt Andersen: One of the key aspects is collaboration and trust. Tony touched on this, and it is so critical to the effective work for SREs. Encouraging people in terms of psychological safety is really important. And you can do that, as Vanessa pointed out, through holding up examples of good behavior. But it also plays out in terms of being willing to intervene when people are exercising bad behavior. This shows up a lot in diversity and inclusion and equity training.

Let’s say I don’t like people with beards, since I can hit half of our panelists with that comment. If I make offhand remarks about how people with beards are untrustworthy, that is bad. And if Vanessa was to say privately, “Hey, wait a minute. Did you realize the comments you made were impugning and negative and undermining the inclusiveness and the psychological safety that people need in order to be innovative and continue creativity?”, those kinds of interactions can help take care of bad behavior or reduce bad behavior while at the same time hold up examples of good behavior.

Vanessa Yiu: Something that Tony said really echoed with me. Tony was like, “Get a few like-minded people and just go off and do the right thing.” And I think, going back to the first question on SRE adoption and the process, we definitely had to do some of that. There were people out there who were a bit skeptical. Initially, people asked, “Okay, well, what is this going to bring to our team or to our systems?”

There’s only so much talking you can do. At some point, you’ve got to just put your foot down and go, “Let’s just instrument those SLOs, get the data, and then we’ll go from there.” There have been examples where the first time an SLO picks up a problem in production before your users noticed, suddenly, it just clicks. And then everyone will be like, “Oh, okay, now I understand why this is valuable.” Then everyone would jump on board. That is actually very key to making sure that you have a successful onboarding.
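As a rough illustration of what "instrument those SLOs, get the data" can mean in practice, the sketch below computes a service level indicator (the measured fraction of good events) and compares it with a service level objective (the internal target). The function names and the good-events-over-total-events definition are assumptions chosen for the example, not something the panelists prescribed.

```python
def sli(good_events, total_events):
    """Service level indicator: the fraction of events that were good."""
    if total_events == 0:
        return 1.0  # Assumption: no traffic counts as meeting the objective.
    return good_events / total_events


def meets_slo(good_events, total_events, slo_target):
    """True when the measured SLI is at or above the SLO target."""
    return sli(good_events, total_events) >= slo_target


def budget_consumed(good_events, total_events, slo_target):
    """Fraction of the error budget used; above 1.0 means the SLO is breached."""
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 0.0 if actual_bad == 0 else float("inf")
    return actual_bad / allowed_bad
```

For example, 99,950 good requests out of 100,000 against a 99.9% target meets the SLO while using about half of the error budget, which is the kind of early signal that can surface a problem before users notice.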

Tony Hansmann: We call that gratitude turning to hunger. You hit that, you’re like, “Oh, my gosh, that works.” If you’re doing a ground level transformation, there’s a whole bunch of things that you’re looking for, but gratitude turning to hunger is one of the critical ones.

I want to circle back around to psych safety really quick, because that is absolutely critical. And there are two main things that you can do if you’re out there trying to figure this out. Research agile team agreements, and make the agreement beforehand about how we treat each other.

The other part of psychological safety for a team is proper assertiveness. Proper assertiveness means, if I’m sitting in a room, and someone’s talking, and they’re done, and I don’t understand what’s going on, my obligation is to say, “I don’t understand. If I picked up this story, I wouldn’t know what to do.” And then, someone might say, “Okay, you’re new. And so this thing is weird, and a pair will teach you.” Okay, great.

Or someone might say, “That story doesn’t make any sense to me either.” And then we take it out and rewrite it. But proper assertiveness means speaking up when you don’t understand. And being clear about your opinions and what informs them. We call it strong opinions loosely held. And these are tools that will help you through, because the trust-building process, I’m so happy to hear everyone say it, is what SRE is. It’s the trust-building process across the organization.

Kurt Andersen: Since we’re pitching books, I’ve got to say that Agile Conversations by Douglas Squirrel and Jeffrey Fredrick is a pretty amazing book and talks about analyzing how you carry on these conversations to uplevel your conversational capabilities.

Chris Hendrix: We’ve been talking about organizational transformation. And one of the ways that you do that is to create that hunger for people. A skill in this industry, and as an adult, is learning how to reframe a message to cater to an intended audience. I’m interested in how you message the benefits of SRE, and why it should be adopted? How do you cater that message to an executive, to a middle manager, or to an IC?

Vanessa Yiu: It doesn’t really matter who the audience is. There’s something in common, which is you have to articulate why what you’re doing is going to be of use to them. I think this applies across every level. It’s not just executives or ICs. If I am talking to an executive, I’m trying to get sponsorship or buy-in to do this thing. What are the problems that they care about that they want to see addressed as a result of us implementing SRE?

Is it to mitigate particular risks in the environment? Is it that they will end up being able to make operations more efficient? Maybe there are some dollar savings in there. You really have to think about the message and what they’re going to get out of it. And pitch it that way. We also have a lot of business users who are not engineers at all. However, something that everyone at GS is good at is risk management.

Traders might be thinking about their stock positions or their portfolio, which is risk management in a different form. So then, how do you articulate that not at the technical level, but by explaining to them that we want to do this because if your system goes down, you’re going to be out of market at a minimum for this duration? That is something that they can really get their head around. They will naturally understand what you’re talking about.

Chris Hendrix: That’s fascinating, because I don’t think naturally in terms of risk management, but it sounds like in a financial environment, that is a very common lens to analyze positions through.

Kurt Andersen: Vanessa was right in terms of identifying where their pain points are, what kinds of things keep them up at night, so to speak. Now, SRE isn’t going to be the answer to all of those. I mean, maybe they tend to eat too heavy a meal at night and they can’t sleep because of that.

But on the other hand, if their problem is worrying about whether their product area is going to be called on the carpet at the next executive review or something, then the SRE principles of measurement (knowing what you’re shooting for, how you’re measuring it, and whether you’re achieving what you were aiming at) are valuable principles that can help them be prepared for that management review. Maybe there’s automation that can save them a lot of time. Instead of having to spend eight hours plugging numbers into a spreadsheet, it can be automated and rolled up out of some ticketing system. It saves the toil for the managers, too.

Tony Hansmann: For any kind of transformation like this, you have to have a high-low strategy. For executives, there’s two ends of that high-low strategy. You can just sit down and say, “Look, what’s a goal I can help you meet?” That person is transactional, they want to do something. If you do something good for them and build trust, they might listen to you better next time. I search for that transformational executive, that executive who is just like that person who read Accelerate. I don’t care if that person’s in shipping. I want to do everything I can to make that person effective with the tools.

Sometimes with executives, I will lower the shoulder and hit them square on. I’m like, “Software has been industrialized. Google is so much better than you that it’s inconceivable. You’re the buggy whip manufacturer now, Ford has done its thing. And if you keep sitting here, pretending like what you’re doing is acceptable, the world is going to erode out from under you.” I’ve had executives look me right in the face and say, “I know you’re right. I’m not going on this climb.” And so I counsel that executive to leave as soon as they can, find a successor who wants to do it.

The problem is that 90% of the time, you do not get high engagement. Low engagement is where you can engage instantly, because you can just say, “I love a structured complaining session. Let’s just go into a room.” No joke. Everyone has complaints. Just go into a room, allow people to complain for about an hour, and spend half an hour categorizing the complaints into dilemmas, high cost-to-address, and low cost-to-address. Throw dilemmas away. And as a team, agree not to revisit dilemmas more than once every six months.

And if you are a really ambitious team, pick one dilemma to assault. Don’t spend any more emotional team energy on dilemmas for the next six months. For the low side, I want to start with what people have: complaints. We have so many tools to solve these complaints now, let’s just solve them.

Chris Hendrix: Sounds like a facilitated retrospective with a very unique purpose. Agile retrospectives are the core principle of the agile transformation process. Why shouldn’t it be the same for adopting SRE?

Vanessa, in your essay, you talked about a potential failure mode when SRE is pushed from top down and doesn’t also have that bottom-up, grassroots engagement. I have seen this. In particular, I’ve seen it with executives who’ve maybe read the book and are now asking for four nines availability for their services. I know any SRE or team member who’s here empathizes with this situation and understands why that is so frustrating. Have you seen a similar kind of failure mode with these really aggressive SLOs that are not based in reality? And if so, is there a way for an IC to push back against that? What’s the best way for someone who’s on the receiving end of something like that to participate?

Vanessa Yiu: First of all, if you’re not at four nines, you need to be able to quantify where you are at this point in time. You will need some data to back you up. And then, the conversation might then be “Okay, I’m running at three nines, but I’m not getting any complaints. Maybe the system simply does not need four nines.” If we’re running three nines and no one’s complaining, the users are happy, would you still want to invest dollars into making the service four nines?

We all know, the higher the number of nines we go, the greater the investment. We have to scale out infrastructure, eliminate all of the single points of failures, etc. It’s going to cost money, and it is going to require a lot of re-engineering effort as well. So do you still want that? Have that kind of dialogue with leadership. People typically cannot understand the pros and cons of both, so that’s typically how to start, but you need data to back you up. You need to be able to quantify.

Kurt Andersen: I’ll harken back to Vanessa’s risk management. There’s a working spreadsheet that Google put out under their CRE, which is Customer Reliability Engineering, that lets people estimate the risks and the frequency with which they expect those risks to hit them. So let’s say they have planned maintenance, and it’s one hour every week that they take the systems offline for planned maintenance. This spreadsheet will take that and translate it: you will never hit four nines, you’ll never even hit three nines, because you’re burning it on planned maintenance.

Now I know some people like to give themselves a pass or a mulligan on planned maintenance, but it’s still time when your customers can’t use this system. The reason doesn’t really matter. They can’t do what they want to do.

This spreadsheet gives you a framework where you can estimate the frequency with which these incidents might occur, how impactful they are, and how long they take to respond to. And then what does that do to your nines? It’s a way for people to get some sense of, if we want to move to four nines, that means we’ve got to fix all of these different things, and you can extend that to what it costs.
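The kind of estimate Kurt describes can be sketched in a few lines of Python. This is not a reproduction of Google's spreadsheet; the `Risk` fields and the simple model (incidents per year times minutes per incident times the fraction of users affected) are assumptions made for illustration.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


@dataclass
class Risk:
    """One source of unavailability (hypothetical model)."""
    name: str
    incidents_per_year: float
    minutes_per_incident: float
    fraction_of_users_affected: float = 1.0

    def bad_minutes_per_year(self):
        return (self.incidents_per_year
                * self.minutes_per_incident
                * self.fraction_of_users_affected)


def expected_availability(risks):
    """Availability implied by the expected bad minutes of all listed risks."""
    bad_minutes = sum(r.bad_minutes_per_year() for r in risks)
    return 1.0 - bad_minutes / MINUTES_PER_YEAR


# The panel's example: one hour of planned maintenance every week.
maintenance = Risk("planned maintenance", incidents_per_year=52,
                   minutes_per_incident=60)
availability = expected_availability([maintenance])
```

Under these assumptions, weekly one-hour maintenance alone puts availability at roughly 99.4%, below three nines, which matches the point that the maintenance window by itself rules out a four-nines target.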

Tony Hansmann: It’s funny, everything revolves around the nines. The Google SRE book tells you to pick 28 days as your time. And if you don’t take anything from this, pick that 28-day period, because you can reason about it as a human being. And so as an IC, you have a 28-day period, and then figure four nines is four minutes of that 28-day period.

Basically, at four minutes, you’re maybe at failover. But you may only be at failback. And so you have to have automated systems all the way through the stack. All you do is look at your path to production. Then you’re like, “We don’t have automation at any of these places. There’s no toil elimination along this track.”

So four minutes a month, not workable. You can laugh an executive out of the room. Don’t do that, right? I’ve made a career of doing that. I don’t recommend as an IC that you do that. But I do recommend that you say, “Look, I hear that you want four nines, right? There’s actually no magic, and four nines is four minutes a month.” Meanwhile, two nines is four hours a month. So that actually can have people in the loop. Four nines can have no people in the loop. When you frame it up that way, executives are super clear about what’s going on.
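The arithmetic behind these figures is easy to check. The sketch below, which assumes a 28-day window as Tony suggests and uses a hypothetical helper name, converts a number of nines into the downtime that target allows.

```python
WINDOW_DAYS = 28
WINDOW_MINUTES = WINDOW_DAYS * 24 * 60  # 40,320 minutes


def allowed_downtime_minutes(nines, window_minutes=WINDOW_MINUTES):
    """Downtime allowed in the window by a target of `nines` nines.

    Two nines is 99%, three nines is 99.9%, four nines is 99.99%.
    """
    return window_minutes / 10 ** nines


for n in (2, 3, 4):
    minutes = allowed_downtime_minutes(n)
    print(f"{n} nines: {minutes:.1f} min ({minutes / 60:.2f} h) per 28 days")
```

Four nines comes out at about four minutes per 28 days, as quoted. By the same arithmetic, two nines allows roughly 6.7 hours rather than four, and three nines about 40 minutes, so the contrast Tony draws between targets that can and cannot have people in the loop still holds.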

Chris Hendrix: I think we’re going to move into our Q&A session now. What’s the perspective on error budgets, which is a more advanced or mature factor of SRE? How are organizations looking at this and adopting it? Is there a required standardization of consequences and alignment when it’s breached?

Tony Hansmann: We have a position on error budgets. Google will talk about their error budget situation. For instance, their Chubby service routinely outperforms its error budget, so they take it offline on purpose. Error budgets are treated by Google as very serious. From our perspective, error budgets are what allow us to take risks.

Error budgets allow you to take a risk or to know when to turn the risk dial down, because you’re violating your error budgets. In the olden days, Google SREs would come out into the world, and they would say, “Well, what you should do is retask half of your engineering team to fix bugs and increase reliability. And if they can’t do it, you keep taking feature points away from them.” Google’s very punitive about it. In the real world, that doesn’t work, but error budgets are a big deal.

Kurt Andersen: Error budgets are how you get to the point that Vanessa described of catching things before they burn your customers. Essentially, it’s back to what Tony pointed out: you have a target, you have a promise. And you have to have some way of evaluating, are you likely to breach this promise or not? You can get really fancy, but you don’t need to. Ultimately, you want to hit this level, and you’re growing at this level over here. You just do a straightforward extrapolation to start.

You can get more sophisticated, and lots of people have. But it’s like, “I’ve got an error budget of 132 minutes in this 28-day window. And I’m using up two minutes a day. And once a week, I have a big incident where I’ll burn half an hour.” You take that half hour once a week, and you take the two minutes a day, and you add them up and say, “Am I going to be okay at the end of 28 days?” And that’s a start. That’s an error budget.
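
Kurt’s straightforward extrapolation can be written out directly. The numbers below are the ones from his example; the function and variable names are hypothetical. With these inputs the projected burn is 176 minutes against a budget of 132, so the answer to “Am I going to be okay?” is no, by 44 minutes.

```python
def projected_burn_minutes(window_days, daily_minutes,
                           incident_minutes, incidents_per_week):
    """Linear extrapolation of error-budget burn over the window."""
    weeks = window_days / 7
    steady_burn = window_days * daily_minutes
    incident_burn = weeks * incidents_per_week * incident_minutes
    return steady_burn + incident_burn


BUDGET_MINUTES = 132  # error budget for the 28-day window

projected = projected_burn_minutes(
    window_days=28,
    daily_minutes=2,       # background burn per day
    incident_minutes=30,   # one big incident...
    incidents_per_week=1,  # ...once a week
)
remaining = BUDGET_MINUTES - projected

print(f"projected burn: {projected:.0f} min of {BUDGET_MINUTES}")
print("within budget" if remaining >= 0
      else f"over budget by {-remaining:.0f} min")
```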

Vanessa Yiu: We definitely don’t use error budgets to do hard blocks of deployments. If we have remaining budget, we certainly don’t take systems down on purpose. As you can imagine, in our line of business, that’s probably not feasible. However, having the data means that we can drive that risk management dialogue. We do that, and that is important to the organization.

Chris Hendrix: Let’s move on to the next question. “I feel the SRE role description is still constantly being defined and varies a lot between companies. Would you say that you have an accurate definition of what SRE is as a role?”

Kurt Andersen: No. It will be different from company to company, depending on the particular needs that are going on and the mix of the other roles that they have. There are common characteristics, but I don’t think that there are hard edges that you could draw around it right now.

Tony Hansmann: Become a trust builder, right? SRE is about giving a business tools to build trust in the system so they can take better risks. That’s all it is. And so regardless of practice, and how practice evolves (because practice will evolve with technology), your bottom level is trust and understanding so that the joints of your organization can understand what something far away is doing and how they make promises.

Chris Hendrix: The concept of reliability is obviously an important one. A lot of the SRE role and responsibilities are trying to talk publicly about what reliability means and why it’s important for the business and how you can engineer reliability. That’s maybe the closest that one could get. But as Craig says in the chat, “SRE is more of a culture.” And it being a role is maybe not an anti-pattern, but a supporting structure for a culture. You can do SRE without ever having an SRE role.

Our next question is, “I work in a platform DevOps team, and they use OKRs for goal setting. Our team struggles a bit with introducing and maintaining SRE principles alongside other OKRs. Do you have any experience with this? And can you share any hints or guidance?”

Tony Hansmann: It’s all a matter of how empowered the team is to solve its own problems. And so if a top level OKR collides sideways with something the team needs to do to eliminate toil, then you’re stuck in a positive feedback loop. And if you’re an IC, what you want to do is identify your positive feedback loops, your negative feedback loops, and the ones that are both well balanced, positive and negative. Because right there, from a systems point of view, that just pops your problem space.

If OKRs say you need to be able to defy gravity, then you have to have a method to fold that back into the OKR. Like, “Hey, we don’t have the antigravity machine yet. If you know something, we don’t, we’ll take it.” You can’t say anything about the OKR model because probably the people who invented the philosophy of it are great at it, and everyone who read the article on it is like, “I’m going to call what I do OKRs.”

Kurt Andersen: To a certain degree, your objectives and your key results (which are how you measure achieving your objectives) can line up with what Tony was describing in terms of your promise. Now, if those are aligned with building trust, building reliability, and breaking down silos, then everything should be good. If they’re orthogonal, then they’re not mutually supporting.

Vanessa Yiu: I was going to echo the same point. Key results are meant to be measurable. In order for you to be successful at delivering your OKRs, you probably will have to apply SRE principles and do certain things in your system to make sure that your OKRs are measurable. There are quite a lot of synergies between the two.

Kurt Andersen: If your OKRs are imposed upon you, then that’s probably a sign of a broken system in and of itself. The extent to which you can influence those objectives is where you will be building trust and other things in accord with an SRE approach.

Chris Hendrix: One question we got in the chat is about the transformation to SRE. What do you suggest to use for SRE assessment to understand which level of maturity your organization is at or how they measure SRE? Do any of you know of any existing tools or topics?

Tony Hansmann: If you want an assessment of your organization, you say, “What percentage of my teams at the 95th percentile deliver via a machine?” If I have a 10-step path to production, how many of those 10 steps are delivered via machine only? I don’t care how ugly the process is in between, that right there will tell you a maturity of automating in your organization. And SRE is an automation discipline.
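
As a rough illustration of that measurement, the sketch below scores one team’s path to production by the share of steps that run with no human action. The ten step names and their values are entirely hypothetical, and `automation_ratio` is a made-up helper, not an established metric.

```python
# Hypothetical 10-step path to production for one team.
# True means the step is delivered by a machine with no human action.
path_to_production = {
    "commit triggers build": True,
    "unit tests": True,
    "static analysis": True,
    "artifact published": True,
    "deploy to staging": True,
    "integration tests": True,
    "security review": False,
    "change approval": False,
    "deploy to production": False,
    "post-deploy verification": False,
}


def automation_ratio(steps: dict) -> float:
    """Fraction of steps that are machine-only (True counts as 1)."""
    return sum(steps.values()) / len(steps)


ratio = automation_ratio(path_to_production)
manual = [name for name, automated in path_to_production.items()
          if not automated]

print(f"machine-only steps: {ratio:.0%}")
print("still manual:", ", ".join(manual))
```

Running the same tally for every team gives a distribution to compare against, and the list of manual steps shows where toil remains.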

You have the most obvious processes like commits not going through continuous integration systems. If your commits are not going through continuous integration systems, that is the highest-leverage thing you can do. If you’re looking at the SRE book, and you’re like, “I have to do all of this,” set it aside and get Martin Fowler’s Continuous Integration book and start teaching that, because that will move you further along than anything else. Then start rolling in SRE.

But notice that if you have CI, all of a sudden measurement becomes really straightforward on a lot of things. You’re going to speed up iteration cycles across the path to production. Don’t evaluate SRE if you don’t have any CI in your organization. If you have CI below the 95th percentile, start working to get it to the 95th percentile, because that’s what you need.

Vanessa Yiu: There are many foundational things to doing SRE. Observability is another one where you can measure where your teams are at in terms of maturity around observability. Are you only doing white box monitoring? Have you got customer journeys mapped out and you’re probing? Another place might be incident management. Do you have clear processes? Do you really have blameless postmortems? And do you really follow up on actions to remediate after incidents happen? You can measure across a number of different things because SRE is a broad discipline that covers many aspects.

Kurt Andersen: And I’ll pick on the terminology, just because sometimes I like to pick on semantics, and I think maturity models are an antiquated concept. I much prefer Nicole Forsgren’s way of phrasing it as sort of like stepping stones, that there are many paths that people pursue in their journey of continuous improvement. And if they are improving, then those are good paths. And the fact that you’re on one path and I’m on a different path is more contextual than an absolute.

Chris Hendrix: We have one last question. After we get past executive buy-in, what’s the next milestone towards bootstrapping an SRE practice between platform and app teams? Is it establishing joint SLOs and error budgets?

Vanessa Yiu: What we did was around observability. We had gaps with that. That was the place to start because you need to be able to measure. Unless you can measure it, you can’t really do anything at all about it.

Kurt Andersen: I go back to Mikey Dickerson’s pyramid, in which monitoring is the foundation of successful practice, whether you call it monitoring, which was the old term, or observability, which is a slightly more expansive term.

Chris Hendrix: So even before you get to the point of setting SLOs between platform and apps teams, it’s really ensuring that platform and apps teams have a shared understanding and are monitoring and observing their production systems.

Tony Hansmann: I have a slight difference. When you get executive buy-in, what you want to find is what I call the transformation executive, a person who, for three to five years, can carry the weight of this whole thing. I think of this person like a talk show host, because he or she is going to be going to other people’s offices and sitting down and saying, “Hey, there are these great practices that are well established, blah, blah, blah,” and the person whose office they are visiting will be like, “Get out.” Then they have to be like, “Great, I’ll see you next week.” They’re going to have to run the loop on all of this stuff, because the organization starts producing results. You want a focus for those results.

Chris Hendrix: Awesome. We’re pretty much at time. I wanted to thank all of our panelists and attendees for joining us on this first of our rebooted Blameless industry panels. If you or anyone that you know is interested, please reach out to Blameless either via Twitter or chris@blameless.com.

Originally published at https://www.blameless.com.
