We live in an era of reliability where users depend on having consistent access to services. When choosing between competing services, no feature is more important to users than reliability. But what does reliability mean?
To answer this question, we’ll break down reliability in terms of other metrics within reliability engineering: availability and maintainability. Distinguishing these terms isn’t a matter of semantics. Understanding the differences can help you better prioritize development efforts towards customer happiness.
Availability is the simplest building block of reliability. This metric describes what percentage of the time a service is functioning. This is also referred to as the “uptime” of a service. Availability can be monitored by continuously querying the service and confirming responses return with expected speed and accuracy.
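As a rough illustration of the monitoring described above, the sketch below computes availability from a series of probe results. The `Probe` record, the 500 ms latency threshold, and the sample data are assumptions made for illustration; they do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One synthetic check against the service (hypothetical record shape)."""
    ok: bool           # response matched the expected content
    latency_ms: float  # time the service took to respond


def availability(probes: list[Probe], max_latency_ms: float = 500.0) -> float:
    """Return the percentage of probes that were both correct and fast enough."""
    if not probes:
        raise ValueError("no probes recorded")
    good = sum(1 for p in probes if p.ok and p.latency_ms <= max_latency_ms)
    return 100.0 * good / len(probes)


# Assumed sample data: four probes, one too slow and one incorrect.
samples = [
    Probe(ok=True, latency_ms=120.0),
    Probe(ok=True, latency_ms=900.0),   # correct but slower than the threshold
    Probe(ok=False, latency_ms=80.0),   # fast but wrong
    Probe(ok=True, latency_ms=200.0),
]
print(f"availability: {availability(samples):.1f}%")
```

A probe counts as "up" here only if it passes both checks, which mirrors the idea that a slow or incorrect response is not a functioning service from the user's point of view.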
A service’s availability is a major component of how users perceive its reliability. With this in mind, it can be tempting to set a goal of 100% uptime. But SRE teaches us that failure is inevitable; downtime-causing incidents will always occur outside of engineering expectations. Availability is often expressed in “nines,” meaning the number of consecutive nines in the uptime percentage. Some major software companies boast of “five nines,” or 99.999% uptime, but never 100%.
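To make the "nines" concrete, here is a small sketch that converts a number of nines into the downtime it allows. The function name and the 365-day year are assumptions chosen for the example.

```python
def allowed_downtime_minutes(nines: int, period_days: float = 365.0) -> float:
    """Downtime allowed per period for an availability target with `nines` nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** -nines  # e.g. five nines -> 0.00001
    return unavailability * period_days * 24 * 60


for n in range(2, 6):
    target = 100.0 * (1 - 10.0 ** -n)
    print(f"{n} nines ({target:.3f}%): {allowed_downtime_minutes(n):.1f} min/year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime in a year, which shows how quickly each extra nine becomes expensive to defend.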
Moreover, users will tolerate or even fail to notice downtime in some areas of your service. Development resources devoted to improving availability beyond expectations won’t increase customer happiness. Your service’s maintainability might need these resources instead.
Another major building block of reliability is maintainability. Maintainability factors into availability by describing how downtime originates and is resolved. When an incident causing downtime occurs, maintainable services can be repaired quickly. The sooner the incident is resolved, the sooner the service becomes available again.
There are two major components of maintainability: proactive and reactive.
- Proactive maintainability involves building a codebase that can be easily understood and changed. As development progresses, issues will arise from incompatibility with existing code. If engineers are writing “spaghetti code” instead of prioritizing maintainability, issues are likely to occur and be difficult to find and solve. Proactive maintenance also includes procedures such as quality assurance and testing.
- Reactive maintainability describes a service’s ability to be repaired after incidents. This is influenced by a service’s incident response procedures. As incidents are inevitable, great incident response and guardrails are a necessity. If incident response procedures are reliable, teams will resolve incidents quickly. Proper incident responses also foster learning to reduce recurrence. A highly maintainable service allows engineers to implement these lessons effectively.
Maintainability is reflected in availability metrics. Shortening downtime in length or frequency results in higher availability. But maintainability isn’t only a means to an end for availability. Taking that approach can result in poorly allocated development resources. Investing in maintainability may not immediately result in better uptime. When you refactor old code to resolve technical debt, the service will function the same as before, with the same availability. It isn’t until incidents occur that you’ll see the benefits of this higher maintainability. Maintainability should be thought of as an investment in reliability, rather than just a component of availability.
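One common way to model the link between downtime length, downtime frequency, and availability is the steady-state formula availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. The article does not use these terms, so treat the sketch below, including its example numbers, as an illustrative assumption rather than the author's method.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Percentage availability from mean time between failures and mean time to repair."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return 100.0 * mtbf_hours / (mtbf_hours + mttr_hours)


# Assumed numbers, chosen only to show the direction of the effect.
baseline = steady_state_availability(mtbf_hours=500.0, mttr_hours=4.0)
faster_repair = steady_state_availability(mtbf_hours=500.0, mttr_hours=1.0)
fewer_incidents = steady_state_availability(mtbf_hours=1000.0, mttr_hours=4.0)

print(f"baseline:        {baseline:.2f}%")
print(f"faster repair:   {faster_repair:.2f}%")
print(f"fewer incidents: {fewer_incidents:.2f}%")
```

Both levers raise availability in this model: quicker repairs reflect reactive maintainability, while less frequent failures reflect proactive maintainability.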
Reliability can be defined as the likelihood of a service functioning as expected when accessed by a user. This may seem identical to how we defined availability, but there are key differences. Availability looks at whether the service is working, whether a user is accessing it or not. If users accessed the service uniformly across all features and at all times, availability would determine reliability. This is rarely the case. Consider two services:
Service A:
- User log-on page has 97% availability
- Catalog search has 97% availability
- Site settings page has 97% availability
Service B:
- User log-on page has 99% availability
- Catalog search has 98% availability
- Site settings page has 90% availability
Just looking at the metric of availability, Service A wins out. But if the log-on page is used by 100% of users, the catalog search by 90% of users, and the site settings page by only 30% of users, Service B will be perceived as more reliable. Reliability accounts for actual usage, converting availability metrics into a measure of customer happiness.
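The comparison above can be sketched numerically. The availability and usage figures come from the example; weighting each page's availability by its share of users is one simple way to model perceived reliability, assumed here for illustration rather than taken from the article.

```python
# Share of users who touch each page (from the example above).
usage = {"log-on": 1.00, "catalog search": 0.90, "site settings": 0.30}

# Per-page availability, in percent, for the two services.
service_a = {"log-on": 97.0, "catalog search": 97.0, "site settings": 97.0}
service_b = {"log-on": 99.0, "catalog search": 98.0, "site settings": 90.0}


def mean_availability(avail: dict[str, float]) -> float:
    """Plain average across pages, ignoring how often each page is used."""
    return sum(avail.values()) / len(avail)


def usage_weighted(avail: dict[str, float], weights: dict[str, float]) -> float:
    """Average availability weighted by the share of users on each page."""
    total_weight = sum(weights[page] for page in avail)
    return sum(avail[page] * weights[page] for page in avail) / total_weight


for name, avail in (("Service A", service_a), ("Service B", service_b)):
    print(
        f"{name}: mean {mean_availability(avail):.2f}%, "
        f"usage-weighted {usage_weighted(avail, usage):.2f}%"
    )
```

On the plain average Service A comes out ahead, but once usage is factored in Service B edges past it, because its weakest page is the one that the fewest users visit.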
By understanding the reliability of a system, development teams can avoid wasting time improving availability beyond what the customer can appreciate. Service level indicators (SLIs) bundle metrics such as latency and availability into a more impactful measurement. Then, service level objectives (SLOs) can be set at the threshold where customers become dissatisfied. This approach looks at reliability from the perspective of customers. How they perceive the reliability of the service is more important than its availability.
Maintainability can also be evaluated through this lens. The time spent responding to incidents drains a service’s error budget for uptime. SLIs and SLOs can help allocate development efforts to improving the maintainability and incident response procedures most impacting customer happiness.
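As a minimal sketch of how incident time drains an error budget, the code below derives a downtime budget from an uptime SLO and subtracts the minutes lost to incidents. The 99.9% target, the 30-day window, and the incident durations are hypothetical values picked for the example.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Total downtime the SLO tolerates over the window, in minutes."""
    if not 0.0 < slo_percent <= 100.0:
        raise ValueError("slo_percent must be in (0, 100]")
    return (1.0 - slo_percent / 100.0) * window_days * 24 * 60


def budget_remaining(
    slo_percent: float, incident_minutes: list[float], window_days: int = 30
) -> float:
    """Budget left after subtracting downtime from incidents (negative means overspent)."""
    return error_budget_minutes(slo_percent, window_days) - sum(incident_minutes)


# Hypothetical month: a 99.9% uptime SLO and two incidents of 12 and 20 minutes.
incidents = [12.0, 20.0]
total = error_budget_minutes(99.9)
left = budget_remaining(99.9, incidents)
print(f"budget: {total:.1f} min, spent: {sum(incidents):.1f} min, left: {left:.1f} min")
```

Resolving those same incidents faster would leave more of the budget intact, which is one way to see the value of investing in incident response.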
Here is a table summarizing the distinctions between availability, maintainability, and reliability:
Reliability isn’t only a collection of metrics or a quality of your codebase. It’s a big-picture concept, incorporating the perspective of the users, the inevitability of change and growth, and the humans developing your code. This holistic approach is the foundation of SRE, a collection of practices and cultural lessons that improve your service’s reliability.
Blameless helps take your reliability solution to the next level. Understand the impact of your availability metrics, improve incident response with better collaboration and retrospectives, and focus development with SLOs and error budgets. If you want to thrive in the era of SRE, learn the basics by checking out our bi-weekly live demo.
Originally published at https://www.blameless.com.