When we talk about the reliability of services, SRE encourages us to take a holistic view. Unreliability in service delivery can be due to anything, from hardware malfunctions to errors in code. One source of unreliability that is often overlooked is security. A security breach can damage customer trust far beyond the impact of the breach itself. Even smaller infractions, like failing a service audit, can make users wary. As reliability is a subjective quality which is determined by users, teams can’t ignore issues that make users feel insecure.
Implementing SRE can help. The goals of SRE and security are well-aligned. Both teams want to avoid as many incidents as possible and create the most resilient system they can. SRE practices and tools can help achieve security objectives. In this blog post, we’ll break down how to use SRE to enhance your security procedures.
Audit compliance increases resiliency
Security audits are an important requirement for any IT organization. They can be mandated by an agency, conducted by a third party organization, or run in-house. Regardless of the method, passing security audits is both an internal and external indicator that a system is trustworthy.
From an SRE perspective, passing audits increases your resiliency. A pass signals that you have protection against a certain level of known security threats, which lowers your “attack surface.” The attack surface of a software environment is the sum of the different points (attack vectors) where an unauthorized user can try to enter or extract data from an environment. Attack vectors can include ransomware, compromised credentials, and more.
Teams can lower their attack surface by creating a CVE (Common Vulnerabilities and Exposures) process. The CVE system is a way of identifying known vulnerabilities, their impact, and their attack vector. It assigns each vulnerability an identifier so that organizations can assess their exposure to it and take appropriate action. Below is a chart with metrics for CVEs, how they’re measured, and the desired trend.
The auditing process is critical to exposing individual vulnerabilities and larger patterns of vulnerabilities. Most audits concentrate on whether certain known CVEs have been resolved, mitigated, or at least addressed with a conscious decision that they do not pose a substantial threat. Each cycle of discovering vulnerabilities and then resolving them results in a more resilient system. It can also help protect against future security incidents.
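As a rough illustration of what a CVE process might track, here is a minimal Python sketch. The record fields, the four statuses, and the two metrics are assumptions made for this example, and the CVE IDs are placeholders rather than real entries.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Status(Enum):
    OPEN = "open"
    RESOLVED = "resolved"
    MITIGATED = "mitigated"
    ACCEPTED = "accepted"  # consciously judged not to be a substantial threat


@dataclass
class CveRecord:
    cve_id: str       # placeholder IDs below, not real CVEs
    severity: float   # CVSS-style score, 0.0 to 10.0
    found: date
    status: Status = Status.OPEN
    closed: Optional[date] = None


def unaddressed(records):
    """CVEs an audit would flag: no fix, mitigation, or recorded decision."""
    return [r for r in records if r.status is Status.OPEN]


def mean_days_to_close(records):
    """Average days from discovery to closure, over closed records only."""
    days = [(r.closed - r.found).days for r in records if r.closed is not None]
    return sum(days) / len(days) if days else None


records = [
    CveRecord("CVE-0000-0001", 9.8, date(2024, 1, 2), Status.RESOLVED, date(2024, 1, 9)),
    CveRecord("CVE-0000-0002", 5.3, date(2024, 1, 5), Status.ACCEPTED, date(2024, 1, 8)),
    CveRecord("CVE-0000-0003", 7.5, date(2024, 2, 1)),
]

print([r.cve_id for r in unaddressed(records)])  # ['CVE-0000-0003']
print(mean_days_to_close(records))               # 5.0
```

Tracking the count of unaddressed CVEs and the time to close them from one audit cycle to the next is one way to see whether each cycle is actually making the system more resilient.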
Audit compliance also gives security teams a chance to work closely with dev teams, increasing the dev teams’ level of security awareness and helping them have a better understanding of how to prioritize the resolution of vulnerabilities. With SRE, the goal is to shift quality, reliability, and security left into the software development lifecycle.
As Blameless SRE Geoff White says, “If you don’t have a security team who will go Darth Vader on the other teams, nothing will save you. However, if security is as much a priority as a feature, then you don’t need Darth Vader.”
With open communication between teams, collaboration, and an emphasis on resilience, security teams can use SRE methods to ease compliance.
Security robustness with chaos engineering
SRE teaches us that failure is inevitable. There will always be security risks that can’t be captured in an audit. Tracking down these unknown risks is one of the biggest challenges of security. It can feel like stumbling around in the dark, never being sure that you’ve gone far enough down any given path. One SRE best practice to test for these vulnerabilities is chaos engineering.
In an article for opensource.com, Patrick B. and Aaron Rinehart describe the combination of chaos engineering and security as “security experimentation.” They show how this forms a more proactive approach to identifying security risks. Using chaos engineering as a framework for security experiments allows you to set standards for the frequency and severity of tests. Chaos engineering can also help you understand the ramifications of a potential breach and practice countermeasures. You can play out the worst possible scenario and ensure that you still have a plan.
Patrick B. and Aaron Rinehart also talk about how chaos engineering provides a feedback loop within security. They argue that the increasing complexity of systems means security risks continue to evolve and change. Without a feedback loop, security would lag behind these environmental changes. Chaos engineering experimentation is an ideal way to keep the loop moving.
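To make the idea of a security experiment concrete, here is a toy Python sketch: inject a known misconfiguration, check whether the control under test notices it, then roll back. The `Firewall` class, `detect_drift`, and `run_experiment` are hypothetical names invented for this illustration. A real experiment would target your actual infrastructure APIs and detection tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Firewall:
    """Toy stand-in for a real firewall or security-group API."""
    allowed_ports: set = field(default_factory=lambda: {443})


def detect_drift(firewall, approved_ports):
    """The control under test: report ports open beyond the approved set."""
    return firewall.allowed_ports - approved_ports


def run_experiment(firewall, approved_ports, injected_port):
    """Inject a misconfiguration, check whether it is detected, then roll back."""
    # Remember prior state so rollback only removes what this experiment added.
    was_open = injected_port in firewall.allowed_ports
    firewall.allowed_ports.add(injected_port)
    try:
        detected = injected_port in detect_drift(firewall, approved_ports)
    finally:
        if not was_open:
            firewall.allowed_ports.discard(injected_port)
    return {"injected_port": injected_port, "detected": detected}


fw = Firewall()
result = run_experiment(fw, approved_ports={443}, injected_port=23)
print(result)            # {'injected_port': 23, 'detected': True}
print(fw.allowed_ports)  # {443}
```

A result of `detected: False` would be the interesting outcome: it tells you the control has a gap, which feeds back into the next round of security work.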
Risk analysis with SLOs and SLIs
You may think that you need a 100% perfectly secure system. It is a tempting idea, but unfortunately, like a 100% reliable system, it is impossible. SRE teaches us that chasing 100% is detrimental because the target is virtually unattainable. Even attempting to achieve the revered “5 nines” (99.999%) is very costly, and that effort might be more impactful elsewhere.
The minimum acceptable level of security is the indicator threshold. Changes that may put security at risk are measured against the indicator threshold. If a change keeps security at or above the threshold, it is approved to progress. When security dips below the threshold, an alert is triggered. From that point, the incident response process can begin.
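One way to read the threshold logic above is as two checks: a gate on proposed changes and an alert on the live indicator. The Python sketch below assumes the security indicator is a single score between 0 and 1 and that a projected score can be estimated for each change. Both are simplifications, and the names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Change:
    name: str
    projected_score: float  # estimated security indicator after the change


def review_change(change, threshold):
    """Approve a change only if the projected indicator stays at or above the threshold."""
    return change.projected_score >= threshold


def check_indicator(current_score, threshold):
    """Return an alert message when the live indicator dips below the threshold."""
    if current_score < threshold:
        return (
            f"ALERT: security indicator {current_score:.2f} "
            f"is below threshold {threshold:.2f}"
        )
    return None


THRESHOLD = 0.95  # hypothetical minimum acceptable level

print(review_change(Change("enable legacy login", 0.91), THRESHOLD))  # False
print(review_change(Change("rotate API keys", 0.97), THRESHOLD))      # True
print(check_indicator(0.93, THRESHOLD))
```

In this sketch an alert message is simply returned. In practice it would be routed to whatever starts your incident response process.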
If you’re an SRE and this all sounds familiar, you’re not wrong. In an article for Markerbench, Andrew Jaquith breaks down how security and SRE terminology are similar.
By connecting security concepts to SRE tools, we see how we can apply other features of those tools. SLIs correspond to key performance indicators. KPIs are the metrics security engineers use to evaluate the security of a system. In developing SLIs, you look for the areas of biggest customer impact. This perspective of considering customer impact also helps in developing KPIs. Putting yourself in the customer’s shoes, think of what potential breaches would be most impactful. Your KPIs can prioritize these areas.
SLOs correspond to the indicator threshold. SLOs are set to the customer’s pain point. As long as the metrics are above this point, the customer won’t perceive any unreliability. This mentality can apply to the indicator thresholds as well. Consider what matters to your customer’s perception of your security. These factors can range from passing security audits to implementing specific tools like 2-factor authentication. Setting your indicator thresholds based on the impact of these factors will ensure that you aren’t spending development resources on something without impact.
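The same arithmetic used for reliability SLOs can be applied to a security indicator. The sketch below assumes a hypothetical SLI, the fraction of logins completed with 2-factor authentication, and a hypothetical target of 99%. The function names and figures are made up for illustration.

```python
def sli(good_events, total_events):
    """Fraction of events that met the security criterion."""
    if total_events == 0:
        return 1.0  # no events, so nothing violated the criterion
    return good_events / total_events


def budget_remaining(good_events, total_events, slo_target):
    """Fraction of the allowed 'bad' events that has not been used yet."""
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else 0.0
    return max(0.0, 1.0 - actual_bad / allowed_bad)


# Hypothetical month: 10,000 logins, 9,960 of them with 2-factor authentication.
SLO_TARGET = 0.99
current = sli(9_960, 10_000)
remaining = budget_remaining(9_960, 10_000, SLO_TARGET)

print(f"SLI: {current:.3%}, meets SLO: {current >= SLO_TARGET}")
print(f"Budget remaining: {remaining:.0%}")
```

When the remaining budget approaches zero, that is a signal to prioritize security work over new features, just as an exhausted error budget does for reliability.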
A holistic perspective on security
As services continue to evolve from monolithic to microservice-based structures, understanding possible security threats becomes more complex. Here is a very incomplete list of security risks that many organizations now face:
- Vulnerabilities in third party incorporated services
- Vulnerabilities in in-house tools
- Malicious attacks aimed at third parties your organization relies on
- Password leaks of other services that compromise accounts on your service
- Breaches of cloud service providers
- Data leaks through account hacks in infrastructure tools such as Slack or JIRA
You’ll notice that almost all of these risks involve third party tools or services. These are things outside of engineers’ control. They cannot perform audits on Slack’s code, or run chaos engineering experiments in AWS. But this doesn’t mean that these areas fall outside of security’s purview. If a security breach affects your service, users will still be wary of your security, regardless of whether the breach was within your control.
To prepare for issues with third party dependencies, teams can use monitoring tools. Understanding how an outage of one third party tool affects the system as a whole can help mitigate risk. Security should also communicate with internal teams about what their monitoring captures. Does it provide enough detail to troubleshoot a breach? If not, security should work with the service owners to make sure that monitoring is up to snuff.
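Understanding how one third party outage ripples through the system can start with a simple dependency map. The Python sketch below walks a hypothetical map to find every internal service that depends, directly or indirectly, on a failed dependency. The service names are invented for the example.

```python
from collections import deque

# Hypothetical map: each key depends directly on the listed services.
DEPENDS_ON = {
    "checkout": ["payments-api", "auth"],
    "auth": ["identity-provider"],
    "notifications": ["chat-tool"],
    "dashboard": ["auth"],
}


def affected_by(outage, depends_on):
    """Return every service that directly or transitively depends on `outage`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    seen = set()
    queue = deque([outage])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return seen


print(sorted(affected_by("identity-provider", DEPENDS_ON)))
# ['auth', 'checkout', 'dashboard']
```

A list like this shows which service owners security should talk to first about what their monitoring captures.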
Security should also keep up to date on recent incidents. Reading incident retrospectives is a great way to do this. Even incidents that seem unrelated to security issues can reveal vulnerabilities. At the very least, they can help decrease knowledge silos.
If you’re interested in adopting SRE best practices in your security teams, Blameless can help. To learn more, sign up for a free demo.
Edited by: Geoff White
Originally published at https://www.blameless.com.