Getting to 99.999% Availability with Twilio’s Tyler Wells

A remarkable milestone for any company's site reliability engineering (SRE) practice is five 9s of availability. That's less than 30 seconds of service unavailability per month, and it's exactly what Twilio has accomplished. Twilio is the world's leading communication platform, with more than two million developer accounts. When we get an anonymous call or text from our Uber or Lyft driver, that's Twilio at play.
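
For context, the arithmetic behind that downtime budget is simple. Here is a quick back-of-the-envelope calculation (assuming a 30-day month):

    # Downtime budget per month for a given availability target,
    # assuming a 30-day month.
    SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds

    for label, availability in [("three 9s", 0.999), ("four 9s", 0.9999), ("five 9s", 0.99999)]:
        budget_s = SECONDS_PER_MONTH * (1 - availability)
        print(f"{label} ({availability:.3%}): about {budget_s:,.0f} seconds of downtime per month")

    # five 9s -> roughly 26 seconds per month, i.e. "less than 30 seconds"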

Tyler Wells is the Director of Engineering at Twilio. He oversees all Programmable Video (WebRTC) and Client SDK teams distributed across the globe: Vancouver, NYC, Austin, Madrid, and Latvia, to name a few. He shares how Twilio guides widely distributed teams to exceptional operational excellence. Throughout the interview, Tyler exudes rigor and discipline in his thinking and expression. The building blocks of getting to five 9s that he shares are empathy, chaos engineering, and the Operational Maturity Model (OMM). The key points from the interview are summarized below. (If you haven't heard of SRE, SLA, SLO, or SLIs, now is a good time to quickly read through this cheat sheet.)

If our service is not reliable, a person considering suicide may not get the help at a time of their greatest need.

At Twilio, each one of the 300+ engineers practices SRE principles. Small autonomous teams at Twilio take their products from idea/concept all the way through production. Teams are responsible for the operational excellence and upkeep of their systems.

Empathy — The foundation of 99.999% availability

To achieve five 9s of availability, the engineering org must understand how people's lives are impacted when you are not providing five 9s.

Chaos engineering — Break everything yourself

Do your own chaos engineering. Break everything yourself. Use a tool like Gremlin. Understand:

  • How long does it take for you to detect that something has gone wrong in your systems?
  • How long does it take to resolve an issue once you've detected that something has gone wrong?
  • What are the tools and instrumentation you have put in place to give you the signal that something is not quite right? (A minimal sketch of tracking the first two of these follows this list.)
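
To make those first two questions measurable, here is a minimal sketch that computes time-to-detect and time-to-resolve from incident records. The Incident fields and the sample timestamps are invented for illustration; they are not Twilio's tooling.

    from dataclasses import dataclass
    from datetime import datetime
    from statistics import mean

    @dataclass
    class Incident:
        started_at: datetime   # when the fault began (or was injected)
        detected_at: datetime  # when monitoring first alerted
        resolved_at: datetime  # when the service was restored

    def mean_seconds(deltas):
        return mean(d.total_seconds() for d in deltas)

    def report(incidents):
        mttd = mean_seconds(i.detected_at - i.started_at for i in incidents)
        mttr = mean_seconds(i.resolved_at - i.detected_at for i in incidents)
        print(f"mean time to detect: {mttd:.0f}s, mean time to resolve: {mttr:.0f}s")

    # Two example chaos-test incidents
    report([
        Incident(datetime(2020, 1, 1, 10, 0, 0), datetime(2020, 1, 1, 10, 0, 40), datetime(2020, 1, 1, 10, 6, 0)),
        Incident(datetime(2020, 1, 1, 14, 0, 0), datetime(2020, 1, 1, 14, 1, 20), datetime(2020, 1, 1, 14, 9, 0)),
    ])

Watching these two numbers trend down across chaos experiments is one concrete way to know a team is ready for production.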

Before a system ever even reaches production, you should’ve broken it a thousand times.

Removing fear with chaos testing

By the time they get to production, teams have the muscle memory and know how to react to incidents. They have validated their graphs and their incident tests, so they know they are getting a clean signal. Teams should be confident that the monitors they've created will provide directionality on why things are breaking.
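
One way to build that confidence is to assert, during a chaos experiment, that the expected monitor actually fires. Below is a minimal, vendor-neutral sketch; you supply the two callables, since nothing here is tied to a specific chaos or alerting tool.

    import time

    def verify_alert_fires(inject_failure, alert_fired, timeout_s=120, poll_s=5):
        """Break something on purpose, then verify the expected alert shows up.

        inject_failure: callable that triggers the fault (e.g. via your chaos tool).
        alert_fired:    callable returning True once the alert is visible in your
                        alerting system.
        """
        inject_failure()
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            if alert_fired():
                print("OK: alert fired -- the monitor gives a clean signal.")
                return True
            time.sleep(poll_s)
        print("FAILED: alert never fired -- fix the monitor before this reaches production.")
        return False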

Delays that improve availability are okay

There was a case during chaos testing when we expected a media failure, but our dashboard showed nothing! We stopped the test to figure out what was wrong. This type of delay is okay. By breaking everything yourself, you prevent the worst outcome: letting the customer be the one to tell you that something has failed inside your systems.

Operational Maturity Model (OMM): Don’t expect five 9s from day one

Operational maturity comes with concrete expectations. For example:

  • Be willing to communicate and include an SLA for your service in a contract
  • If there's a breach of the SLO/SLA, you know the customer impact and which customers are affected

Products don't start at five 9s. Newer products going from Aware to Scaling enter their beta stages at three to four 9s (99.9% to 99.99%). The beta launch gives the team time to have early incidents and learn from them, improving SLO performance for availability before launch. Teams can take from a couple of weeks to a couple of months to get Ironman certified. When a product is Ironman certified, it is generally available (GA) and published to customers. A customer can now trust that we have put a lot of time into operating, securing, and scaling our systems and making them reliable.

Meeting SLOs — How we win customers’ trust

  • If a service is error-prone because of failures in a downstream dependency, can I ensure that it doesn't cause an outage? (E.g. Can I route around it? Can I find another carrier that can handle that traffic by shifting everything from US East to US West?) A routing sketch of this idea follows the list.
  • If we see a concentration of 400-level errors for specific customers, we reach out to those customers and ask, "How can we reduce these errors in the future?" (E.g. with better documentation, UI, etc.)
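
Here is a hedged sketch of that failover idea: shift traffic away from a route once its recent error rate spikes. The region names and the 5% threshold are made up for illustration.

    # Route around a degraded downstream dependency by shifting traffic to a
    # healthier alternative. Threshold and route names are illustrative only.
    ERROR_RATE_THRESHOLD = 0.05  # fail over once 5% of recent calls are erroring

    def pick_route(error_rates, preferred="us-east", fallback="us-west"):
        """Return the route to send traffic to, preferring the primary region."""
        if error_rates.get(preferred, 0.0) < ERROR_RATE_THRESHOLD:
            return preferred
        if error_rates.get(fallback, 0.0) < ERROR_RATE_THRESHOLD:
            return fallback
        # Both unhealthy: pick the less-bad route rather than dropping traffic.
        return min(error_rates, key=error_rates.get)

    print(pick_route({"us-east": 0.12, "us-west": 0.01}))  # -> us-west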

Measuring our performance against SLOs, we are always analyzing: which service is throwing the most 500s, and why? The answers enable us to act and constantly improve our availability.
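
A minimal sketch of that kind of analysis, using a made-up request log (the service names and fields are illustrative, not Twilio's actual telemetry):

    from collections import Counter

    # Made-up request log entries: (service, customer, status_code)
    log = [
        ("video-api", "acct_A", 200), ("video-api", "acct_A", 500),
        ("video-api", "acct_B", 500), ("client-sdk", "acct_C", 400),
        ("client-sdk", "acct_C", 400), ("client-sdk", "acct_D", 200),
    ]

    # Which service is throwing the most 500s?
    server_errors = Counter(svc for svc, _, status in log if status >= 500)
    print("5xx by service:", server_errors.most_common())

    # Which customers see a concentration of 400-level errors? (Candidates for
    # better documentation, UI improvements, or a proactive reach-out.)
    client_errors = Counter(cust for _, cust, status in log if 400 <= status < 500)
    print("4xx by customer:", client_errors.most_common())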

Your site is reliable if it is: available + functional + resilient

Available: Are we up and reachable when customers need us? (E.g. Are we meeting our availability target, such as five 9s?)

Functional: Are we responding correctly, promptly, and accurately, with the expected response codes? (E.g. Is my request returning 500s when I'm expecting 200s?)

Resilient: Can we resolve incidents quickly or prevent them in the first place? (E.g. Is time taken to resolve incidents decreasing over time?)
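
To make the three dimensions concrete, here is a hedged sketch that turns each into a simple, trackable number. The helper names and sample figures are invented for illustration.

    from datetime import timedelta

    def availability(uptime: timedelta, window: timedelta) -> float:
        """Available: fraction of the window the service was up."""
        return uptime / window

    def functional_rate(responses):
        """Functional: fraction of responses matching the expected status code."""
        return sum(1 for got, expected in responses if got == expected) / len(responses)

    def resilience_improving(resolution_minutes):
        """Resilient: is time-to-resolve trending down? Compare recent vs. older incidents."""
        half = len(resolution_minutes) // 2
        older, recent = resolution_minutes[:half], resolution_minutes[half:]
        return sum(recent) / len(recent) <= sum(older) / len(older)

    month = timedelta(days=30)
    uptime = month - timedelta(seconds=26)  # roughly 26 seconds of downtime
    print(f"available:  {availability(uptime, month):.5%}")   # ~99.999%
    print(f"functional: {functional_rate([(200, 200), (200, 200), (500, 200)]):.1%}")
    print(f"resilient:  {'improving' if resilience_improving([45, 40, 30, 20]) else 'regressing'}")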

Ultimately, SRE is a specialty that needs to be embedded in the minds of every developer.

Written by Charlie Taylor

Originally published at https://www.blameless.com.
