What is a Kubernetes Operator and Why it Matters for SRE

Kubernetes is an open-source platform for deploying, scaling, and managing containerized workloads and services. Google open-sourced it in 2014, version 1.0 followed in 2015, and the project is now maintained by the Cloud Native Computing Foundation. Since its release, it has become a worldwide phenomenon. Many cloud native companies use it, vendors offer commercial prebuilt versions, and there’s even an annual convention!

What has made Kubernetes such a fundamental service? A major factor is its automation capabilities. Kubernetes can automatically change the configuration of deployed containers, or even deploy new containers, based on metrics it tracks or requests made by engineers. Having Kubernetes handle these processes saves time, reduces toil, and increases consistency.

If these benefits sound familiar, it might be because they overlap with the philosophies of SRE. But how do you incorporate the automation of Kubernetes into your SRE practices? In this blog post, we’ll explain the Kubernetes Operator, the Kubernetes extension pattern at the heart of customized automation, and discuss how it can evolve your SRE solution.

What the Kubernetes Operator can do

Kubernetes Operators complete sophisticated tasks

  • Deploying applications
  • Updating applications to new versions
  • Reconfiguring application settings
  • Scaling applications up and down depending on usage
  • Failure handling
  • Setting up monitoring infrastructure

Without Kubernetes Operators, engineers would need to complete these tasks manually. Automating them saves time and toil, and makes the procedures and results consistent.
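As a rough illustration of the scaling task in the list above, the sketch below computes a replica count from observed usage. It borrows the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler; an Operator is free to apply any policy it likes. The function name, bounds, and numbers are hypothetical and not taken from the original post.

```python
import math


def desired_replicas(current_replicas: int, current_usage: float,
                     target_usage: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Return a replica count that moves per-replica usage toward the target.

    Applies ceil(current_replicas * current_usage / target_usage) and clamps
    the result to [min_replicas, max_replicas]. Names and defaults are
    illustrative assumptions.
    """
    if target_usage <= 0:
        raise ValueError("target_usage must be positive")
    raw = math.ceil(current_replicas * current_usage / target_usage)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 replicas averaging 90% CPU against a 60% target: scale up.
    print(desired_replicas(4, 90, 60))  # -> 6
    # 4 replicas averaging 20% CPU against a 60% target: scale down.
    print(desired_replicas(4, 20, 60))  # -> 2
```

Clamping to a minimum and maximum keeps a noisy metric from scaling the application to zero or without bound.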

Kubernetes Operators control custom resources and applications

Kubernetes Operators make stateful decisions

For example, a custom resource could define the desired state of a new server instance as some amount of load capability based on its physical resources. The Operator would then adjust the configuration until new instances reached that standard.
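The idea of adjusting configuration until the observed state matches the desired state can be sketched as a small reconcile loop. This is a simulation in plain Python, not code against the Kubernetes API; the `ServerSpec` and `ServerStatus` types, the worker-count adjustment, and the capacity numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ServerSpec:
    """Desired state, as a custom resource might declare it (hypothetical)."""
    target_capacity: int  # requests per second the instance should handle


@dataclass
class ServerStatus:
    """Observed state of the running instance (hypothetical)."""
    workers: int
    capacity_per_worker: int

    @property
    def capacity(self) -> int:
        return self.workers * self.capacity_per_worker


def reconcile(spec: ServerSpec, status: ServerStatus) -> bool:
    """Take one step toward the desired state.

    Returns True when the observed state already satisfies the spec,
    False when a change was made and another pass is needed.
    """
    if status.capacity >= spec.target_capacity:
        return True
    status.workers += 1  # stand-in for a real configuration change
    return False


def run_until_converged(spec: ServerSpec, status: ServerStatus,
                        max_passes: int = 100) -> int:
    """Call reconcile repeatedly; return the number of passes used."""
    for passes in range(1, max_passes + 1):
        if reconcile(spec, status):
            return passes
    raise RuntimeError("did not converge within max_passes")


if __name__ == "__main__":
    spec = ServerSpec(target_capacity=1000)
    status = ServerStatus(workers=1, capacity_per_worker=300)
    print(run_until_converged(spec, status), status.workers)  # -> 4 4
```

Each pass makes one small change and then re-checks, so the loop can be interrupted and resumed without losing track of where it was.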

Kubernetes Operators and SRE

Operator monitoring, SLIs, and SLOs

The process of determining the metrics with the greatest impact is similar for Operators and SLIs. In the Kubernetes Operators textbook, Dobies and Wood suggest looking first at the “four golden signals” (a concept from Google’s SRE book) to determine what the Operator should monitor. These are:

  • Latency
  • Traffic
  • Errors
  • Saturation

Creating Operators for your applications will help you understand what SLIs and SLOs should be set for them. Likewise, setting SLIs and SLOs can help you understand what your Operators should monitor.
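One way to picture the overlap between Operator monitoring and SLIs is a single record of the four golden signals from Google’s SRE book (latency, traffic, errors, saturation), checked against thresholds. The sketch below assumes hypothetical field names and threshold values; real limits would come from your own SLOs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GoldenSignals:
    latency_ms: float   # time taken to serve a request
    traffic_rps: float  # demand on the system, in requests per second
    error_rate: float   # fraction of requests that fail, 0.0 to 1.0
    saturation: float   # fraction of capacity in use, 0.0 to 1.0


# Hypothetical limits; traffic is recorded but has no limit of its own here.
THRESHOLDS: Dict[str, float] = {
    "latency_ms": 300.0,
    "error_rate": 0.001,
    "saturation": 0.8,
}


def breached(signals: GoldenSignals,
             thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return the names of signals above their limit, sorted by name."""
    return sorted(
        name for name, limit in thresholds.items()
        if getattr(signals, name) > limit
    )


if __name__ == "__main__":
    sample = GoldenSignals(latency_ms=450.0, traffic_rps=120.0,
                           error_rate=0.0005, saturation=0.92)
    print(breached(sample))  # -> ['latency_ms', 'saturation']
```

The same list of breached signals could feed both an SLI dashboard and an Operator’s decision about when to act.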

For example, you might notice that when servers are overloaded, your customers are unhappy with the application’s availability.

You can define a custom resource that tracks the disk space available. At 5% remaining capacity, the Operator managing that resource will spin up new server instances, giving your customers better service. Your SLI will be based on availability, with disk space as the metric the Operator watches. Your SLO might dictate that you need to achieve 99.9% availability to keep your customers happy, which informs the Operator’s intervention points.
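To put numbers on this example, the sketch below works out how much unavailability a 99.9% SLO allows and checks the 5% disk threshold. The 30-day window, function names, and byte counts are assumptions added for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of unavailability the SLO allows over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


def should_add_capacity(free_bytes: int, total_bytes: int,
                        min_free_fraction: float = 0.05) -> bool:
    """True when remaining disk space is at or below the threshold."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes <= min_free_fraction


if __name__ == "__main__":
    # 99.9% over 30 days leaves roughly 43.2 minutes of allowed downtime.
    print(round(error_budget_minutes(0.999), 1))
    # 4 GB free out of 100 GB is below the 5% threshold.
    print(should_add_capacity(4 * 10**9, 100 * 10**9))
```

A budget of about 43 minutes a month suggests that intervention points should trigger well before the disk is actually full, since new instances take time to start.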

Automating SRE application deployment

SRE tools represent an investment in reliability. The time spent implementing them is paid for by the time they save. Creating Operators is a similar investment. By creating Operators, you save time on each deployment. Furthermore, deployments are consistent and reliable. Your SRE practices have less overhead and can scale with your organization.

Operators and incident management

When developing your incident response plan, the behavior of your Operators can be a valuable resource. If you know that the Operator will automatically try to correct a problem, you can incorporate that into your expectations and procedures. For example, if you have an incident response plan for oversaturated servers, your Operator could spin up new server instances or reconfigure load balancing. Your response plan would take this into account, saving you some troubleshooting steps and allowing you to focus on the originating issue. By combining Operators and automated runbooks, you can minimize the amount of manual escalation and resolve many incidents without human intervention. Because automation is a core goal of SRE, this is one more way that Kubernetes Operators fit into your reliability strategy.
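The combination of Operator behavior and automated runbooks described above might look like the following sketch: try each automated remediation in order, and escalate to a human only if the service is still unhealthy afterward. The function, the health check, and the simulated service are hypothetical; no real paging or Kubernetes API is involved.

```python
from typing import Callable, List


def handle_incident(is_healthy: Callable[[], bool],
                    remediations: List[Callable[[], None]],
                    escalate: Callable[[str], None]) -> str:
    """Try automated steps in order; escalate only if none restores health.

    Returns "resolved" or "escalated".
    """
    if is_healthy():
        return "resolved"
    for step in remediations:
        step()
        if is_healthy():
            return "resolved"
    escalate("automated remediation exhausted")
    return "escalated"


if __name__ == "__main__":
    # Simulated saturated service: healthy when each instance
    # carries at most 100 units of load.
    state = {"instances": 2, "load": 300}
    pages: List[str] = []

    def is_healthy() -> bool:
        return state["load"] / state["instances"] <= 100

    def add_instance() -> None:
        state["instances"] += 1

    outcome = handle_incident(is_healthy, [add_instance, add_instance],
                              pages.append)
    print(outcome, state["instances"], pages)  # -> resolved 3 []
```

Re-checking health after every step means the runbook stops as soon as the service recovers, and the responder is paged only when automation has run out of options.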

As you shift your services to a container-based model and Kubernetes becomes more fundamental to your DevOps practices, it’s important to incorporate Operators into your reliability strategy. Operators let you extend Kubernetes with custom resources and responses, enabling more automation and less toil.

Originally published at https://www.blameless.com.
