Using Automation and SLOs to Create Margin in your Systems

With the challenges we’re facing during this time, it can be difficult to keep up with the growing demand for our services. You need to use every tool in your toolbelt to conserve your team’s cognitive resources. Two ways to do this are automating toil out of your processes and prioritizing with SLOs.

Creating margin in the system to allow for adaptive capacity

Brain space is at a premium during a crisis. As stress levels mount, cognitive capacity diminishes. Teams may feel too busy putting out fires to focus on automation, but it’s more important than ever to decrease the cognitive load they are facing. Automation can also help build a buffer between the loss of productivity teams face during this crisis and the need to perform at an increased capacity. It makes it more likely that you can reach a 50/50 split between engineering work and toil, giving you more room for innovation despite the constraints on resources.

Your team will also function better with decreased strain and toil. Richard Cook from Adaptive Capacity Labs notes that during this crisis, “Social spaces will become more tightly coupled. The effects of events and strains at work will transfer to home and vice versa. The influence of work on home (and home on work!) is usually moderated via social conventions. As stress saps energy it becomes more difficult to maintain boundaries.”

When toil becomes overwhelming, teams will lose energy and productivity. Automation helps build margin for your teams to recharge, take time with their families, and deal with this difficult time in a healthy way.

One way to bake in automation is through runbooks, which ease incident response. Here are some key steps to consider when creating automated runbooks:

  1. Understand and map your system architecture: To create runbooks that automatically use a variety of services, you’ll need to understand how each service functions and how they connect. Map these connections and include information on how automation tools can control each service to lay a solid foundation for future runbooks.
  2. Identify the right service owners: Once you’ve mapped out your architecture, you’ll need a repository of the owners of each service. This will help future runbook authors contact the right people for collaboration, advice, and sign-offs. Complex automated runbooks will work through many service areas, so involving the owners and experts of each space is a must.
  3. Lay out key procedures and checklist tasks: Common tasks often have common steps — subtask procedures like auditing, version control, and deployment are likely to overlap. Identify these key steps and clearly define their processes, then compile them into a list. Future runbook authors should use steps from this list when possible for consistency.
  4. Identify methods to bake into automation: Now that you have a list of key procedures that recur in many tasks, you also have a great starting point for finding automation opportunities. Look for things that can be scripted, and ways to have scripts trigger subsequent scripts. Make your automated steps modular so they can be baked into a variety of runbooks.
  5. Continue refining, learning, and improving: Resources like the architecture map, service owner repository, and list of common tasks aren’t to be created once and left untouched. Include updating these resources as a checklist task on procedures that would modify them, and also have regular checks to ensure they’re up to date. When you revisit them, take the opportunity to learn from them again, looking for new opportunities to automate and optimize.
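To make steps 2 through 4 a little more concrete, here is a minimal sketch in Python of modular runbook steps that can be reused across runbooks. Everything in it is an assumption made for illustration: the service names, the owner registry, and the `Step` and `Runbook` classes are hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical service-owner registry (step 2). Names are illustrative only.
SERVICE_OWNERS: Dict[str, str] = {
    "checkout-api": "payments-team",
    "orders-db": "data-platform-team",
}


@dataclass
class Step:
    """One reusable, scripted procedure (steps 3 and 4)."""
    name: str
    service: str
    action: Callable[[dict], bool]  # returns True on success


@dataclass
class Runbook:
    """An ordered list of modular steps."""
    title: str
    steps: List[Step] = field(default_factory=list)

    def run(self, context: dict) -> List[str]:
        """Run steps in order; stop and name the owner on the first failure."""
        log: List[str] = []
        for step in self.steps:
            ok = step.action(context)
            log.append(f"{step.name}: {'ok' if ok else 'FAILED'}")
            if not ok:
                owner = SERVICE_OWNERS.get(step.service, "unknown owner")
                log.append(f"escalate to {owner}")
                break
        return log


# Stand-in actions. Real ones would call your own automation tooling.
def restart_service(context: dict) -> bool:
    context["healthy"] = True
    return True


def verify_recovery(context: dict) -> bool:
    return context.get("healthy", False)


restart_runbook = Runbook(
    "Restart checkout-api",
    [
        Step("restart service", "checkout-api", restart_service),
        Step("verify recovery", "checkout-api", verify_recovery),
    ],
)

if __name__ == "__main__":
    for line in restart_runbook.run({}):
        print(line)
```

Because each step is a small function wrapped in a `Step`, the same `verify_recovery` check could be dropped into other runbooks, and a failure points straight to the team listed in the owner registry.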

In addition to automated runbooks, you can also use SLOs to help create margin through compassionate prioritization.

Using SLOs (compassionately) to drive prioritization

As teams are hit simultaneously with more unplanned work and reduced capacity, a tug of war over priorities could erupt. Even policies and metrics of success must change during this time, so SLOs and error budgets should be established with the team’s context in mind. As Alex said, “The best way to use the concept of an error budget isn’t that you have to actually have measurements, but rather that the concepts behind it give you a different way of thinking about things. And to have good discussions with people with that data and to help you make decisions based upon that.”

He also stressed the importance of revisiting a target whenever necessary: whether that’s due to an incident, change in code base, or a massive black swan event. Relaxing your error budget and compassionately setting flexible SLOs can help facilitate your team’s adaptive capacity, while improving shared prioritization of the work that matters most.
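As a rough illustration of what relaxing a target means in practice, the sketch below converts an availability SLO into an error budget in minutes. The 30-day window and the example targets of 99.9% and 99.5% are assumptions chosen for the arithmetic, not recommendations from this post.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unavailability allowed in the window for an availability SLO."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Budget left after observed downtime; negative means the budget is spent."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


if __name__ == "__main__":
    # Assumed example: 60 minutes of downtime in a 30-day window.
    for target in (0.999, 0.995):
        left = budget_remaining(target, downtime_minutes=60)
        print(f"target {target:.3%}: {left:.1f} minutes of budget remaining")
```

Under these assumptions, a 99.9% target allows roughly 43 minutes of downtime over 30 days, so a single 60-minute incident exhausts the budget. Relaxing the target to 99.5% allows roughly 216 minutes, which leaves the team room to absorb that incident and keep prioritizing other work.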
