SREview Issue #10 February 2021

Is love in the air? We think so. While we don’t have chocolate or flowers for you, we have something just as sweet. Here are some of the most exciting Tweets, content, and events happening in the SRE and resilience engineering community this February.

Tweets that have us twittering

How much of what is labeled “human error” is simply real world limits of human perception, information processing and other constraints of being human in a complex world?

- matt scanlon (@picudoc13) February 8, 2021

The ability to roll back safely is important, but once you have a reasonable feedback loop and gradual rollout, the vast majority of your problems in prod will be long-extant problems that were surfaced under just the right conditions.

- Vallery Lancey (@vllry) February 15, 2021

Tired: blameless post-mortems
Wired: gripping accounts of the incident experience

- Lorin Hochstein E_TOO_MANY_FAILURE_MODES (@lhochstein) February 13, 2021

SREading

“I’m Just Doing my Job,” An SRE Myth: Blameless SRE Darrell Pappa writes about how organizations can become more customer-centric. Featured in SRE Weekly #256.

On Not Being a Cog in the Machine: Honeycomb’s first SRE Fred Hebert writes about his thoughts on human processes, socio-technical systems, and observability.

Communication Tool Down? Here are 3 Ways to Handle it: Learn how to work through a communication tooling failure via chaos engineering, eliminating single points of failure (SPOFs), and more.

Slack’s Outage on January 4th 2021: Laura Nolan writes an in-depth retrospective on Slack’s recent incident.

4 Tips on Preparing for a [Great] Failure: SRE techniques for mitigating the impact of system failure, including building runbooks, assessing with SLOs, monitoring metrics, and more.

How Cloud Services Platform Teams Can Drive The Adoption Of Effective SRE Practices: Tina Huang writes about using cloud transformations to drive SRE adoption.

Give it a whirl

Teams have a new tool in their tool belts. Blameless Runbook Documentation is available for early access.

Runbooks are an industry best practice: they let teams codify the incident response process so that it is repeatable and consistent. These sets of instructions allow teams to resolve incidents faster, with greater confidence and less toil.

Fill out this form to see Runbook Documentation in action.

Blameless Bi-Weekly Demo March 2 at 8 AM PST: Check out a live demo of Blameless as we walk you through operations best practices, and get your questions answered.

SRE Thought Leader Panel: Watch for our announcement on Twitter! You won’t want to miss this panel.

Want to contribute?

If you’re looking to share your insights with the SRE and resilience engineering community, we’d love to partner with you on content. Fill out our form here and we’ll reach out!

Originally published at https://www.blameless.com.
