Blameless’ SRE Journey

The initial pain

Practice makes perfect

  • Setting up our own SLOs inside Blameless using integrations like Prometheus, and watching our dashboard.
  • Weekly operational review meetings where we looked at key user journeys, the associated SLOs, and the SLO statuses.
  • Setting error budget policies for those SLOs and tracking them.
  • Getting buy-in from respective component owners to commit to changing their sprints if we violated our error budget.
  • Mandating that any regression from customer expectations would be considered a high severity incident requiring immediate attention.
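
To make the error budget idea in the list above concrete, here is a minimal sketch of how a budget could be computed and mapped to a policy action. It is not Blameless' actual implementation: the `SLO` class, the function names, the 25% "low budget" threshold, and the action strings are all hypothetical, and the request counts would in practice come from a metrics source such as Prometheus.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """A hypothetical SLO: a named user journey and a success-rate target."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests should succeed


def error_budget_remaining(slo: SLO, total: int, failed: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means nothing has been spent, 0.0 means the budget is exactly
    used up, and a negative value means the budget is overspent.
    """
    if total == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failures = (1.0 - slo.target) * total
    if allowed_failures == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures


def policy_action(remaining: float) -> str:
    """Map remaining budget to an illustrative policy action (assumed thresholds)."""
    if remaining <= 0:
        return "budget exhausted: reprioritize sprint toward reliability work"
    if remaining < 0.25:
        return "budget low: review in weekly operational meeting"
    return "budget healthy: continue planned work"


if __name__ == "__main__":
    login = SLO(name="login", target=0.999)
    remaining = error_budget_remaining(login, total=100_000, failed=50)
    print(f"{login.name}: {remaining:.0%} of budget left -> {policy_action(remaining)}")
```

The point of expressing the policy as a function of remaining budget is that the agreed action (for example, component owners changing their sprint) follows from the numbers rather than from a case-by-case debate.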

Taking the questioning out of QA & testing

Incidents reimagined

  • < 5 minutes for SEV 0
  • < 30 minutes for SEV 1
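
As a rough illustration, per-severity time targets like the ones above could be encoded and checked as follows. This is only a sketch: the list does not say which duration the targets apply to, so `elapsed` here is a generic placeholder, and the `TARGETS` table and `within_target` function are hypothetical names.

```python
from datetime import timedelta

# Targets taken from the list above; the dictionary itself is illustrative.
TARGETS = {
    "SEV 0": timedelta(minutes=5),
    "SEV 1": timedelta(minutes=30),
}


def within_target(severity: str, elapsed: timedelta) -> bool:
    """Return True if the elapsed time is strictly under the severity's target."""
    target = TARGETS.get(severity)
    if target is None:
        raise ValueError(f"no target defined for severity {severity!r}")
    return elapsed < target


if __name__ == "__main__":
    print(within_target("SEV 0", timedelta(minutes=4)))
    print(within_target("SEV 1", timedelta(minutes=45)))
```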

Empowering on-call

The SRE dream team

  1. Making sure that we’re diligent about tracking all of our reliability KPIs.
  2. Governing our reliability practices and making sure our people are disciplined about following those practices.

Blameless today

Our SRE journey in conclusion

  • Happier, more productive engineers
  • More confidence in handling on-call
  • Better customer experience
  • Increased platform reliability
  • Focus and alignment on prioritizing engineering work
  • More confidence from our investors and board

Blameless

Giving you all you need to know about Site Reliability Engineering. https://www.blameless.com/blog/