Resilience in Action E6: Oversize Coffee Mugs, SLOs, and ML with Todd Underwood

Resilience in Action is a podcast about all things resilience, from SRE to software engineering, to how it affects our personal lives, and more. Resilience in Action is hosted by Kurt Andersen. Kurt is a practitioner and an active thought leader in the SRE community. He speaks at major DevOps & SRE conferences and publishes his work through O’Reilly in quintessential SRE books such as Seeking SRE, What is SRE?, and 97 Things Every SRE Should Know.

Before joining Blameless, Kurt was a Sr. Staff SRE at LinkedIn, implementing SLOs (reliability metrics) at scale across the board for thousands of independently deployable services. Kurt is a member of the USENIX Board of Directors and part of the steering committee for the world-wide SREcon conferences.

Todd Underwood’s 4.5 liter coffee mug.

In our sixth episode, Kurt chats with Todd Underwood, ML SRE Lead and PIT Site Lead at Google, about his work as an SRE, the challenges of implementing SLOs for traditional interactive online services, ML-based services and how to think of SLOs for them, and more.

See the full transcript of their conversation below, which has been lightly edited for length and clarity.

Kurt Andersen: Hello, I’m Kurt Andersen and welcome back to Resilience in Action. Today we’re talking with Todd Underwood, who is the machine learning SRE lead and the Pittsburgh site lead at Google, about his work as an SRE, the challenges of implementing SLOs for traditional interactive online services and machine learning based services, and how to think about SLOs for such services. Welcome Todd.

Todd Underwood: Hey, thanks for having me. Great to see you.

Kurt Andersen: I got to start off by asking about that picture that you’ve provided for the feature on the webpage. That coffee cup is impressively large. What does it hold, a gallon?

Todd Underwood: I don’t use those kinds of antiquated deprecated measures like gallon, but I know for sure that four and a half liters of coffee fit in that. It’s a lot. So I guess that’s more than a gallon.

Kurt Andersen: More than a gallon, that is truly impressive. And is that your quota for the day?

Todd Underwood: There is actually a funny story behind the cup. The cup was purchased by some people who were working on Google Shopping 10 years ago. And as we all know, one of the things that happens with online shopping is it’s a little bit difficult to be sure what you’re buying. You hear these stories of people who think they got a really good deal on a bed or a sofa and it comes in and it’s a very, very high quality doll bed that costs $80 instead. This is not what you’re hoping for. So, the coffee cup was trying to help people understand the diversity of products. They found this very, very cheap, enormous coffee cup, but in the picture it just looked like a decent-sized cup. It was like five or six bucks or something.

One day, one of my coworkers, just to punk me, literally filled the whole thing with coffee and I said, “Challenge accepted. I will drink it.” I’m not too proud to admit, I did not finish the coffee. I spent all day trying to finish the coffee. I did not.

Kurt Andersen: I was trying to think about how you might keep the coffee warm with that volume.

Todd Underwood: That particular size cup, I happen to know, fits exactly in the microwaves that we had in the Google Pittsburgh micro kitchens. And I know this from four or five trips to warm it up again. It was not a good day. It was funny for everyone but me, and sometimes that’s just what you have to do.

Todd Underwood: I joined Google in 2009 as an SRE manager really not knowing what SRE was. I think this is a common experience for many of us in this profession. First, I thought it had something to do with facilities. I was like, “There’s a site. Are you doing power engineering? Is this about failure modes of data centers? I know a little bit about data centers, but this doesn’t seem like my strength.” It took me quite a while to get the general gist of the whole thing. But yeah, I’ve been an SRE at Google since 2009. We can talk through the details, and I’m sure we will, and there are probably people more talented than me doing this.

In some ways, things have evolved considerably, but it’s a little bit surprising how little things have changed. We tweak it, we look at the facts on the ground. We try to change the way we staff teams and focus our work. But in many ways, I think you could take me from 10 years ago and plop him down here and we’d be okay. We wouldn’t be perfect, but we’d be okay.

Todd Underwood: That’s super insightful. And I think you’ve hit it exactly right. So, the initial services that we all think of as being the most important production services are request-based services. Think a payment system that has to receive requests, or a web search, or an ad serving system, or a website that we want to keep up, or a storage application where we send it an RPC and we get some blocks back or something.

Many of us have been building out these shared computing platforms: at Google we use Borg, people were using Mesos until recently, and people use Kubernetes clusters this way as well. On those platforms, you might want to mix a couple of different kinds of workloads.

You might have a serving workload that has some daily patterns, like a diurnal peak and a trough. But then you also have some people who think of it as filling traffic, batch traffic, “Oh, we’re going to go through the logs and analyze it. We’re going to count up the things.” There’s all kinds of stuff we do that’s not request based.

I think, initially, SRE at Google was definitely focused on SLOs for request-based services. When you ask people, “How many nines do you have?”, they’re like, “Well, divide the 200 requests by all the requests and that’s what you got.” I’m like, “That’s a delightfully simplistic point of view of the world, and it works for some services, but not the rest of the services.”
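As a rough sketch of the "divide the 200 requests by all the requests" calculation mentioned here, the Python below computes a request-based availability figure. The function name, the sample traffic, and the 0.999 target are illustrative assumptions, not anything described in the conversation.

```python
from collections import Counter

def availability(status_codes):
    """Fraction of requests answered with HTTP 200 (the 'good' requests)."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    if total == 0:
        return None  # no traffic, so the ratio is undefined
    return counts[200] / total

# Hypothetical sample: 9,995 successes and 5 server errors.
sample = [200] * 9995 + [500] * 5
sli = availability(sample)
slo_target = 0.999  # "three nines", an assumed target
print(f"availability={sli:.4f}, meets SLO: {sli >= slo_target}")
```

This ratio only makes sense for services that receive discrete requests, which is why it does not carry over to batch jobs or pipelines.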

Two interesting things happen. One, which we identified about 12 years ago, is that there’s this extra class of services. You have the request-based serving services on the left-hand side. You have batch services on the right-hand side, but orthogonal, maybe somewhere in the middle, are these production data processing pipelines. We were like, “What? Why isn’t that just batch?” And the answer is, you care about it in particular. You either have a deadline or a throughput requirement for it and we’ll talk about those in a second. But you care about it in a way that you might not care about others. Like batch, you’ll finish when you get a chance. If it takes two weeks, because the serving didn’t go down, that’s great.

But when you look at the pipelines, it’s frequently like, “Well, this is a billing run and the month is closing. We have to post the results. We have to send out the checks.” There’s no, “Do it when you can,” but it could be delayed in some cases by wall clock minutes or hours, as long as it’s not delayed by wall clock days or months.

And so, you have this very weird middle ground where you want to think about that being a production service, but you don’t want to treat it the same way. I need milliseconds worth of latency here. That’s really where we live. And the reason I bring this up in the context of ML is that the ML services are either serving services or pipeline services. They’re rarely batch services. So they need some kind of production requirements. They’re serving systems, but the training systems and the other ones are not.

Todd Underwood: It’s really tricky because many of us have worked on networks or storage or web serving or whatever it may be. And those are not the same. What we started doing to get our heads around this was saying, “Okay, well, why does it need an SLO at all? What happens if we just best-effort the whole thing?” I think that’s actually a really useful exercise, doing the counterfactual. We do this all the time. People think like, “Oh, Google is just swimming in staff. You can just put infinite people on infinite things.” But we’re pretty tight on this stuff.

When somebody says, “I want SREs to work on X,” or, “I want a development team to develop Y,” the first question is, “Okay. What happens if we don’t do that?” We start looking into that, and for many of these services, what we see is that catching up takes as long as or longer than doing it the first time. So many of the services don’t catch up at a rate of two or five seconds of data per second; some of them catch up at a rate barely faster than real time.

That means, if we have an eight hour outage, it might take us 16 or 20 or 30 or 50 or 200 hours to catch up depending on the details of the provisioning, et cetera. That’s where I think things get interesting where people say, “Oh, I don’t need SREs because I can tolerate a four hour outage.” You say, “Great. Can you tolerate a 24-hour outage?” They’re like, “Oh no, that might not be good.” “Can you tolerate a one week outage?” “No. Definitely not.” I’m like, “Okay. Well then we need to go back and say, how much of an outage can we actually tolerate?” Because the catch-up from the outage processes lots of data. Does that make sense?
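The catch-up arithmetic above can be sketched in a few lines of Python. This is a simplified model under a stated assumption: new data keeps arriving at one hour of data per wall-clock hour while the pipeline is catching up, so the backlog only shrinks at the catch-up rate minus one. The function name and the sample rates are hypothetical.

```python
def catch_up_hours(outage_hours, catch_up_rate):
    """Hours needed to drain the backlog left by an outage.

    catch_up_rate is hours of data processed per wall-clock hour.
    New data keeps arriving at 1 hour per hour (an assumption), so the
    backlog only shrinks at (catch_up_rate - 1).
    """
    if catch_up_rate <= 1:
        return float("inf")  # the pipeline never catches up
    return outage_hours / (catch_up_rate - 1)

for rate in (5.0, 2.0, 1.5, 1.04):
    print(f"rate {rate}x: {catch_up_hours(8, rate):.0f} hours to catch up")
```

Under this model, a pipeline that runs only slightly faster than real time turns an eight hour outage into days of being behind, which is the point being made about tolerating outages.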

Todd Underwood: In some cases you can, but in some cases I think there are simple answers. Sometimes other people are using them. They don’t want to get off them. And sometimes they have a legitimate argument why their thing is more important than your thing. There are other interesting cases and there’s a whole subset of what we do in ML SRE where we go, “Oh, but this is the ML part of it.”

There are certain kinds of machine learning training that don’t go infinitely fast. So if you try to train too fast, you move the bottlenecks somewhere else. Or, in some cases, you can create oscillations. So one of the ways to think about this is that there’s a map of what’s true somewhere. If you throw too much compute at that storage map without making better provisioning, then what happens is, the updates get queued a little bit.

So my view of what is true falls behind your view, because you’re updating stuff while I’m training on some other stuff. If that delta in time is pretty small, in the sense of milliseconds, it probably doesn’t matter. But as that gets bigger and bigger, the chances that you and I are queuing conflicting updates to the same keys, or the same information, or the same part of the map goes up and up and up. And then what happens is you say, “This is bad.” And I say, “Oh, it looks like nobody thinks that’s bad yet because I haven’t seen your updates yet.” So I say, “Make sure you say that’s bad.”

Then later both of us look, and we’re like, “Well, it’s not that bad.” So we both raise it. And then later we both look and it’s like, “It’s pretty bad. Why did you say it’s so good?” So we both lower it. As this happens, you oscillate around the truth and you never actually converge on a good value for that.
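A toy illustration of this effect, not a model of any real training system: the Python below runs gradient descent on a simple quadratic whose true minimum is 0, but each update is computed from a value seen several steps earlier. The function, the learning rate, and the delay are all assumptions chosen to make the behavior visible.

```python
def train(delay, steps=200, learning_rate=0.6, start=1.0):
    """Gradient descent on f(x) = x**2 / 2, whose true minimum is x = 0.

    Each update uses the gradient of the value seen `delay` steps ago,
    mimicking a worker whose view of the shared state is stale.
    """
    history = [start] * (delay + 1)
    for _ in range(steps):
        stale_view = history[-1 - delay]
        gradient = stale_view  # the derivative of x**2 / 2 is x
        history.append(history[-1] - learning_rate * gradient)
    return history

fresh = train(delay=0)
stale = train(delay=3)
print(f"no delay:     final |x| = {abs(fresh[-1]):.2e}")
print(f"3-step delay: largest recent |x| = {max(abs(x) for x in stale[-10:]):.2e}")
```

With no delay the value settles at the minimum. With a three-step delay and the same learning rate, the value overshoots in alternating directions and the swings grow instead of shrinking, which is a simplified version of oscillating around the truth without converging.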

I know that was a long, perhaps boring and technical answer to the question. The first answer is best. Other people might be using the machines.

Kurt Andersen: Some of these edge cases are not as edge as we think. As I would frequently point out to people, when you’re dealing with a problem that happens one in a million times, that’s great until you’re processing a billion events a minute and then it’s happening all the time.

You touched on this idea of throughput because we started with deadlines. And then you touched on this idea of how long it takes you to catch up if you suffer an outage. Are there other aspects of throughput? You talked about this question of conflicting views of reality. Are there aspects of throughput that become important and interesting as you structure SLOs right now?

Todd Underwood: The biggest thing, forgetting about the systems and taking a big step back, is trying to remember what it’s for, because how far you get behind, and how good the models are, and how late they are or aren’t, are all actually features of, “Well, what were you going to do with it?” This is probably true for all of the services that we deal with as SREs.

Here’s a concrete example that helps people understand this. Imagine a payment provider like Google Pay or Citibank. They have fraud models. All of these payment providers develop these fraud models where they look at a transaction and everything they know about it and they try to figure out if this is fraud. I think most of us experienced this as they turned your credit card off again or let through fraud that should have been obvious again.

Most of us get the frustrating side of this, but it’s important to know that none of our modern payment systems work at all without fraud detection. All of the merchants would refuse to take credit cards. All the credit card vendors would go home and everything would come to a screeching halt. So the fact that it works well enough to permit modern finance is pretty cool. But they should work.

So if you’ve got a fraud model and somebody comes up with a new kind of fraud that you start recognizing, the question is, how quickly do you start recognizing it and how much damage can they do before you start recognizing it? And the recognizing part is the machine learning model. You can do some of this with humans, but mostly it’s a machine learning model noticing some things happened, correlating those with some other things that happened, correlating that with a human evaluation of a thing, building a new model, and pushing it out to serving.

And now you can imagine, saying, “Oh, well now we can put a dollar value on the cost of this ML system being down for X minutes, X hours, X days, or even not being down. But what about behind?” When you and I were talking just a minute ago about deadlines versus throughput, this is an example where you really want a throughput capacity SLO for that system that is capable of keeping up with your expected transaction volumes.

If you’re training on transactions, you’d want to say, “Well, how many per X are there and how much compute do I need to process one of those?” And you just start doing some math. You may find that you don’t have that capacity, and at peak traffic times you predictably get behind. The thing about fraudsters is they figure that out. They’re always trying everything all the time. So what happens is, if they notice, “Hey, between 2:00 and 4:00 PM Pacific every day, X company is bad at recognizing new kinds of fraud,” then they will preferentially launch their new kinds of fraud between 2:00 and 4:00 PM Pacific so that they get extra minutes of revenue out of it.
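The "start doing some math" step might look something like the Python sketch below, which compares hourly transaction volume with a fixed training capacity and reports when a backlog builds up. The traffic numbers, the capacity, and the function name are hypothetical.

```python
def backlog_by_hour(arrivals_per_hour, capacity_per_hour):
    """Return the unprocessed backlog at the end of each hour."""
    backlog = 0
    result = []
    for arrivals in arrivals_per_hour:
        backlog = max(0, backlog + arrivals - capacity_per_hour)
        result.append(backlog)
    return result

# Hypothetical transaction counts (millions) for six consecutive hours,
# with a peak in the middle.
arrivals = [40, 60, 110, 120, 70, 30]
capacity = 90  # millions of transactions the trainer can process per hour

backlog = backlog_by_hour(arrivals, capacity)
behind_hours = [hour for hour, queued in enumerate(backlog) if queued > 0]
print("backlog:", backlog)
print("hours spent behind:", behind_hours)
```

The hours with a non-zero backlog are the predictable window in which the model lags behind reality, and a throughput SLO would aim to keep that list empty at expected peak volume.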

Kurt Andersen: I’m familiar with that from having been in the anti-spam fight for a long time.

Todd Underwood: It’s the same thing. They’re always trying the next thing. The longer it takes you to recognize it and deploy that recognition effectively in production, the bigger the window to walk through. And man, they walk through those windows.

Todd Underwood: The thing I would recommend for people who aren’t super familiar with ML is, don’t worry about the details too much because they’re changing constantly. And a lot of people are getting PhDs in this stuff. And I can’t tell if they’re making it complicated on purpose, but they’re definitely making it complicated. But don’t worry about that. The reality right now, for SREs and people working on production engineering, is actually pretty straightforward.

The story is, you have some data. Sometimes these are called examples, things that look like they’re true. You train a model which is just a computational representation of some insights that you extracted from that data. But it’s done automatically. You push that model into serving and you ask it questions when you need to. That’s the whole thing. Now, I glossed over an enormous amount of complexity, but that’s the whole thing. The most common thing for big models is offline, but periodic. So you train it, you try to get a snapshot representation of what was true 30 seconds ago or 30 days ago into serving.

Kurt Andersen: So it seems like it’s both the training time as well as the push to production, which is an area that often gets neglected, ignored, or glossed over. All of that affects the accuracy of the results you might expect in production from the serving system.

Todd Underwood: Pushing on that even further, I think one of the single most interesting and yet unresolved points about doing reliability engineering on machine learning (there’s actually a couple other kinds of services that have this), is that model quality matters. The point isn’t to push a model into serving. The point is to push a good model into serving that does what we meant it to do.

In general, there’s a little bit of a separation of concerns here. There’s somebody who develops and redevelops and improves that model. There’s often different people who maintain the production infrastructure who train the model. Sometimes that’s the model owner. Sometimes it’s not. We’ll call that first person model owner. The model owner might train their own model or they might not, depending on how big the company is and how advanced the infrastructure is. But the serving system is almost always maintained by someone else. Model owners are not usually in the business of, “I will run my model serving system.”

Kurt Andersen: It’s not their specialty.

Todd Underwood: It’s not their gig at all. It’s not what they want to do. And that’s fine. If you train a new model and your model is terrible (forget for now what that means, but it’s terrible in some objective way, and everyone agrees it’s terrible), who needs to fix that? The answer is, it depends. Is this the first time you ever trained this model? It was probably terrible to begin with, go fix it yourself. It’s your model. You should know how to make a better model. Figure it out. Not my job.

I’m sitting here. I’m SRE. Let’s say you’re like, “No, I have trained this model every day for the last two years and today it’s massively worse.” That’s starting to sound like it might be my fault, my problem, my job to troubleshoot, not yours. If that happens at the same time as five models from five other different teams also come out not so good, it’s definitely my problem.

So in the machine learning infrastructure, there’s all kinds of things from an infrastructure point of view that we can do to make models that are no good. I can delay or mess up some of the data. You say, “I’ve got a model that requires these five sources of data.” And I say, “Great. I’m going to give you a truncated version of the third and the fifth ones of those.”

If we aren’t careful, you now have a super garbage model. What if the third bit of data was whether a transaction succeeded and we’re still talking about this fraud example I made up? Well, without knowing whether the transaction succeeded or whether there was a chargeback, it’s irrelevant. And the fifth one could have been the customer fraud history.

So if I take those two things away from you, you probably think everything’s fine. You’re like, “No. There’s no fraud here. It’s great. The world was wonderful.” And with our SRE work, in order to do it properly, you need to be a little bit up in the business of the people who are using it. But if you do that too much or you get too deep, you can’t scale.
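One infrastructure-side guard against the truncated-input scenario described above is to compare how much data each source delivered with how much was expected before training starts. The Python below is a minimal sketch under assumptions: the source names, row counts, and the 95% threshold are all made up for the fraud example.

```python
def incomplete_sources(row_counts, expected_counts, min_fraction=0.95):
    """Return the names of sources that delivered too few rows.

    min_fraction is an assumed threshold; a real one would come from
    each source's historical variation.
    """
    flagged = []
    for name, expected in expected_counts.items():
        received = row_counts.get(name, 0)
        if received < min_fraction * expected:
            flagged.append(name)
    return flagged

# Hypothetical source names and counts for the fraud example.
expected = {"transactions": 1000, "merchants": 200, "outcomes": 1000,
            "devices": 500, "fraud_history": 800}
received = {"transactions": 1000, "merchants": 200, "outcomes": 310,
            "devices": 498, "fraud_history": 120}

print(incomplete_sources(received, expected))
```

A check like this needs no knowledge of what the model is for, so it scales across many models, but it cannot say whether the resulting model is actually good.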

We have ML systems at Google that are running thousands of models on behalf of hundreds of other teams. We’re talking about a team of SREs of like six people in two places. So 12 or 13 people total. They are not going to be able to know even what each model is for much less how long ago it was bad. So now you start to really have to think carefully about separating model quality from infrastructure availability, but knowing that they intersect.

Kurt Andersen: Interesting. Seems like a lot to wrap your arms around.

Todd Underwood: That’s the crux of it. A couple of us have been working on a book about productionizing machine learning which will come out early next year. And we’re talking about machine learning reliability. For it to be reliable, it doesn’t just have to be in serving. It’s got to be usable and of good quality in serving, but that is not the traditional domain of infrastructure engineers. That’s not where SREs spend their time.

Kurt Andersen: That would be like having to worry about whether the compiler did its job correctly in building whatever artifact is being shoved out into production.

Todd Underwood: Or even worse, worrying about what color the webpages are. I am a little bit serious. You work for a company which has some color branding. If they ship a webpage with the wrong color, it sometimes starts a reliability problem because now you’ve messed up the webpage. Or if you ship CSS that doesn’t let you press buttons on the webpage, that’s maybe a better example. All of us work in these companies that have mobile infrastructure these days. You ship an app. It’s a bundle of bytes. I got it onto the phone. Now, if the app refuses to load or the app loads, but can’t do anything useful, as an infrastructure engineer, that’s going to show up as less demand required. It’s a lot easier. But in reality, users experience a complete outage of whatever thing that the app now no longer does.

Todd Underwood: Right now, what we’re trying to do is to walk from one side to the other. And I’ll be honest. At Google, we’re still trying to figure out where we’re going to meet in the middle. I haven’t heard great answers or the right answer for this. So I’ll tell you how we’re approaching it, but I just want to super explicitly acknowledge this is not a problem we’ve solved. And if somebody else has solved it, I would be excited to learn more. I would go to that talk in a heartbeat.

We have processes running on machines. SRE 101. We’re like, “Is stuff up?” If it’s not up, we are not good. Are the data sets available? Are we reading data from the data sets at the expected rate? Are there errors reading data? Is the data formatted roughly as we expected?

On the integration test side, is the training system producing a model? Does the model pass some basic sanity checks? If we’re training in TensorFlow, is this a valid TensorFlow SavedModel? If it is, cool. If it is not, then something obviously went catastrophically wrong. But again, something could be a valid TensorFlow SavedModel, of the appropriate size, and produced recently yet be complete garbage. But it is necessary that all those things be true. So we cover all of that.
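The size and recency checks mentioned above could be sketched roughly as follows in Python. The thresholds and the function name are assumptions, and the format validation step is left out because it depends on the training framework.

```python
import time

def basic_model_checks(size_bytes, trained_at, now,
                       min_bytes=1_000_000, max_bytes=10_000_000_000,
                       max_age_seconds=24 * 3600):
    """Return a list of failed checks; an empty list means 'plausible'.

    Passing only shows the artifact is plausible, not that the model
    makes good predictions.
    """
    failures = []
    if not min_bytes <= size_bytes <= max_bytes:
        failures.append("size out of expected range")
    if now - trained_at > max_age_seconds:
        failures.append("model is stale")
    if trained_at > now:
        failures.append("training timestamp is in the future")
    return failures

now = time.time()
print(basic_model_checks(size_bytes=250_000_000, trained_at=now - 3600, now=now))
print(basic_model_checks(size_bytes=12, trained_at=now - 3 * 24 * 3600, now=now))
```

These checks are necessary rather than sufficient: a model can pass every one of them and still be garbage, which is why the quality metrics have to come from the model owners.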

On the other side, model owners are the people who understand the problem domain. What are they trying to accomplish with this model? They’re trying to predict whether you like ads. They’re trying to figure out how much you’re trying to defraud us. They’re trying to guess the next word you’re going to type on your phone. They’re doing all this stuff.

They devise quality metrics that they can get out of the serving system or out of some other system. Frequently, they’ll either take a copy of their model and do some analytics or grab real-world queries against their model from the serving system and look at the coverage, look at the quality of those. Look for an exogenous signal.

So, in web systems, you often see a search system that will show you some results. If you don’t click on any of those results, you could have just gotten distracted, but in reality, they were probably pretty bad results. If you do a search, you want to do something. You want to answer a question. I give you 10 answers, and you don’t like any of them. Those are probably all 10 bad answers. On the other hand, if you consistently like the third one, then my ranking might not be as good.

Maybe I should try to get that third one up to one. And why has nobody ever clicked on that first one for that query? There are signals in there that are from the humans, but what we don’t have is in the middle. So far, we don’t have a completely general way of saying, “Is this model any good?” because it’s always about what it’s trying to do. Then for each model, it comes down to some very specific metrics about that task. That makes it difficult to generalize the service.

Todd Underwood: I think that’s mostly the latter. This is the domain of model owners. There are a couple of cases where we make an exception. Ads is one at Google where the machine learning infrastructure teams are so… is it rude to say old? They’ve been around for a long time. Those teams are very experienced. Those teams existed before 2010. We have machine learning infrastructure teams in ads at Google.

That team has been around for a long time working with some of the same model owners. Their problem domain is more narrow. They’re not trying to predict fraud. They’re not trying to necessarily guess what words you’re going to type or rank some search results. They’re trying to do this more narrow, specialized set of things related to one particular set of tasks for Google’s ads products.

We see this in other areas, too. If you say to SREs, “Run the compute thing for anybody who wants to do compute,” their engagement is going to be fairly thin. They’re like, “I just got to get compute up. I don’t know what you’re doing. If your thing is seg faulting all the time and her thing is not, well, that’s probably your problem.” You’re like, “It might not be.” It might be the container. It might be some set of OS things. There might be a system library. It might be my fault as an infrastructure provider, but I just don’t have time to look into it.

But on the other hand, if I am the compute provider for the photo service, I’m going to know a lot about that service. I’m going to know about photo transformations and transcoding, different ways of mmapping files, etc. But I think it’s still mostly for model owners, for people who are building models who want them to work well. They should be thinking very early on, how they would know if the model is good. What would be five or six or 10 metrics that they would track to know whether their model is working well?

Kurt Andersen: I think that’ll be helpful to those readers who maybe haven’t had to deal with ML yet and it’ll be in their future, undoubtedly.

Todd Underwood: There’s going to be some more of it coming. It’s been the hype now. I thought this hype would die down a little bit, but it seems not to be yet.

Kurt Andersen: As a wrap up question here, where do you think the next frontiers are for SRE as a profession?

Todd Underwood: I’ll take this in a little bit of a different direction than the technical or the professional. I think SRE is still insufficiently representative and inclusive of the kinds of people that I want to work with, that I see are talented. One of the things I love about SRE is I actually do think, as a profession, we have great culture. I think we are generally inclusive of people. Most of us are from goofy backgrounds and where we were the weirdo in some teams. And I think that’s cool. We are generally not the people you have to look and behave exactly like.

But I still think there is a little bit of the SysOps, SysAdmin, just-say-no culture that underlies some of SRE. I want people who thought they wanted to be software engineers, but are cranky and skeptical that anything’s going to work well. I want those people to come to SRE regardless of whether they ever wanted to work at an ISP, or build storage infrastructure for Snapchat, or something. I want us to be like the kind of place lots of different kinds of people want to work at.

Kurt Andersen: Nice. That’s diversity across all the different axes.

Todd Underwood: Absolutely. There was an SRE I worked with for a long time. SRE, not just in Google but I think lots of places, has a cultural tradition of talking about beer and whiskey. And the person we worked with was like, “Hey, it’s totally cool if some people like beer and whiskey. But you’ve got to have alternatives because there’s a bunch of people who don’t want to drink, who can’t drink, who don’t drink. And this is all for all of us.” So tea became an official beverage at Google. It’s endorsed. We have high tea in Pittsburgh once a month. Back when we were in the office, we’d have scones and little cakes and amazing tea. It was fantastic. I’m not opposed to whiskey, but I’m super in favor of also having tea. That’s just this one little way of making it available for everyone.

Kurt Andersen: Awesome. Well, thank you very much for joining us today, Todd.

Todd Underwood: Excellent. Thanks so much for having me.

Originally published at https://www.blameless.com.