Blameless’ SRE Journey

SRE is a practice adopted by best-in-class companies all over the world. As a software reliability platform purpose-built for SREs, Blameless strives to practice what we preach, using SRE best practices daily to cultivate a culture of resilience.

However, this wasn’t always the case. In the early days of our company’s history (like many other companies at the beginning of their journeys), we often needed to move fast without looking through the lens of reliability, prioritizing feature development and product-market fit over scalability and resilience. As you can imagine, this wasn’t sustainable, and we needed to make a change.

In this post, we will share our SRE journey and how we operationalized the best practices we hold dear.

The initial pain

Something needed to change, so CEO Ashar Rizqi halted all current feature development. In Ashar’s words, “You can’t improve what you can’t measure. We needed to objectively prove that we are a reliable platform because that’s core to establishing trust in our customers. Vulnerability is the most important thing. The second is transparency.”

This change allowed the engineering team to invest their efforts in fixing technical debt and reliability. At this point, Blameless didn’t have an SRE program or team, so we decided it was time for us to become customer zero of our own product.

Practice makes perfect

  • Setting up our own SLOs inside Blameless using integrations like Prometheus, and watching our dashboard.
  • Weekly operational review meetings where we looked at key user journeys, the associated SLOs, and the SLO statuses.
  • Setting error budget policies for those SLOs and tracking them.
  • Getting buy-in from respective component owners to commit to changing their sprints if we violated our error budget.
  • Mandating that any regression from customer expectations would be considered a high severity incident requiring immediate attention.
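
As a rough illustration of the error budget tracking described in the list above, here is a minimal Python sketch. It is not how Blameless computes budgets: the `SloWindow` shape, the request-based budget formula, and the 25% "low budget" threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO over one review window (hypothetical shape)."""
    name: str
    target: float          # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int
    failed_requests: int


def error_budget_remaining(window: SloWindow) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    allowed_failures = (1.0 - window.target) * window.total_requests
    if allowed_failures == 0:
        # A 100% target or an empty window leaves no budget to spend.
        return 0.0 if window.failed_requests == 0 else float("-inf")
    return 1.0 - window.failed_requests / allowed_failures


def policy_action(window: SloWindow) -> str:
    """Map the remaining budget to an action; the thresholds are illustrative."""
    remaining = error_budget_remaining(window)
    if remaining < 0:
        return "budget exhausted: owner reprioritizes the sprint toward reliability"
    if remaining < 0.25:  # assumed warning threshold, not from the article
        return "budget low: raise in the weekly operational review"
    return "budget healthy: continue planned work"
```

For a 99.9% target over 1,000,000 requests, roughly 1,000 failures are allowed, so 400 failures would leave about 60% of the budget, while 1,200 failures would overspend it and trigger the sprint change.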

We also began setting KPIs for both the software development and SRE functions, such as number of production deployments, lines of code changed, commits per deployment, and number of regressions (which were prioritized in operational review). These changes were a big departure from how our teams had previously been structured and operated.

Taking the questioning out of QA & testing

In the past, developers would write a piece of code, but the code was tested by a manual QA tester. Someone would write code, merge it into the DEV branch and the main branch, and wait a week for someone from manual QA to run an end-to-end test. Then, if QA found an issue, the team would open a ticket, adding a delay.

Ashar and the team decided to move away from this process. Ashar stated, “If you are building the feature, and you are writing the code, then you are the one who will be held accountable for the quality of that code.”

Our team succeeded in making this change, and we were excited by the results. We started moving drastically faster as developers began finding errors before turning in code to QA, eliminating the lengthy turnaround. Additionally, the manual QA team was freed up to automate away toil and focus on more important testing.

Incidents reimagined

This feature lets us see how long it takes us to respond to an incident and, when the response time is too long, where the gaps are occurring. Our KPIs for response time are:

  • < 5 minutes for SEV 0
  • < 30 minutes for SEV 1
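
To make these targets concrete, the sketch below checks a single incident's response time against them. It is only an illustration: the `SEV0`/`SEV1` keys, the function names, and the choice to measure from detection to first response are assumptions, not a description of the Blameless product.

```python
from datetime import datetime, timedelta
from typing import Optional

# Response-time targets from the list above; the key spelling is an assumption.
RESPONSE_TARGETS = {
    "SEV0": timedelta(minutes=5),
    "SEV1": timedelta(minutes=30),
}


def response_time(detected_at: datetime, responded_at: datetime) -> timedelta:
    """Time elapsed between detection and the first response."""
    return responded_at - detected_at


def meets_response_kpi(
    severity: str, detected_at: datetime, responded_at: datetime
) -> Optional[bool]:
    """Return True/False against the target, or None if no target is defined."""
    target = RESPONSE_TARGETS.get(severity)
    if target is None:
        return None
    # The targets are strict ("<"), so hitting the limit exactly counts as a miss.
    return response_time(detected_at, responded_at) < target
```

Running this over a period's incidents and grouping the misses by severity is one simple way to see where response gaps cluster.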

Beyond incident management, we also set KPIs for our new postmortem process. At Blameless, for every production incident, we require a 100% completion rate for the postmortem survey and a 100% completion rate for the resulting action items. Filling out the survey, in particular, has a strict SLA around it. Rather than spending more time on creating postmortems (especially for minor incidents), we created our survey function in Blameless, which is highly customizable and homes in on the key questions. We put our survey responses into our big data analytics product, which bubbles up key insights quickly to inform engineering decision making. This also helps to streamline the process of starting to build a narrative. As Ashar states, “The writeup can always happen later, but we want a 100% completion rate on this survey while the memory is still fresh.”

Empowering on-call

“The idea there was to give our team members encouragement that they can own the troubleshooting of their services, including infrastructure,” Ashar said. The team, led by Moiz Virani, also implemented better practices for the documentation and handoffs for the on-call process. Now, the on-call staff member creates an on-call incident within Blameless where they track all issues and activities during the on-call shift. The postmortem for that on-call incident becomes the complete and detailed handoff for the next person coming in, giving them a confidence boost at the beginning of their shift.

The SRE dream team

The SRE team would not be responsible for production services; instead, it was only responsible for the SRE frameworks. Our SREs don’t determine what the dev team’s SLO is going to be, but they are responsible for guiding devs through the process of setting up the SLO and making sure the postmortems are being completed.

Ashar put it best when he said, “SREs are not going to necessarily resolve incidents for you, but they will be the catalyst to make sure that SRE best practices are being obeyed throughout the process.”

Initially, the SRE team’s main focus was to help set up the SLOs for the most critical user journeys in Blameless. During this time, the SRE team also owned infrastructure engineering, the monitoring systems, observability platforms, and key decision making in terms of reliability and tooling.

After the team laid down this groundwork, the role evolved into being the caretakers of reliability here at Blameless. That means our SREs have two major focuses:

  1. Making sure that we’re diligent about tracking all of our key reliability KPIs.
  2. Governing our reliability practices and making sure our people are disciplined about following those practices.

These big changes have been a success, yielding significant business impacts for us.

Blameless today

One of our engineers, Dyllen, was so excited by this massive and speedy overhaul that he wrote his own story of Blameless’ journey. According to Dyllen, “By using Blameless, we identified our critical customer issues, created tickets for tracking progress through our SCRUM process, orchestrated an area for collaboration between my backend engineer, our product owners, and myself. We finally resolved the issue with indexable information on how we as a team will improve our processes to ensure that our product becomes hardened through our growth.”

These changes also gave our leadership and board more confidence. According to Ashar, “At a business outcome level, what we have now is confidence in the ability to move faster. We have this laser sharp focus. We know what we need to build and focus on. We know how much we can push and what the output of that push is going to be.”

Our SRE journey in conclusion

  • Happier, more productive engineers
  • More confidence in handling on-call
  • Better customer experience
  • Increased platform reliability
  • Focus and alignment on prioritizing engineering work
  • More confidence from our investors and board

If your team is kicking off its own journey to SRE and would like some guidance on where to invest first, we’re here to help. Contact us.

Originally published at https://www.blameless.com.
