SREview Issue #12 April 2021

Spring is here! We have rain! We have flowers! We have allergies! We also have some of the most exciting Tweets, content, and events happening in the SRE and resilience engineering community this month.

Tweets that have us twittering

We need to stop with the “they need to feel their own pain” framing for service owners being on-call for their products. That’s such a counter-productive message. On-call is an opportunity to gain expertise on how the service works in the context of Prod.

- Alex Elman (@_pkill) April 7, 2021

Half of my job is Googling the other half is documentation. There I said it.

- ca$s:e cage 💫 (@akolsuoicauqol) April 7, 2021

INCIDENT RESOLVED: This outage has been resolved. In investigating the incident, our engineers learned that the “fifth nine” was their friendship all along

- dan slimmon (@danslimmon) April 7, 2021

SREading

Incident analysis as guerrilla case study research: Lorin Hochstein writes about how to use the desire for closure to justify spending time examining how work is really done.

Having On-call Nightmares? Runbooks can Help you Wake Up.: Senior Software Engineer Harry Hull writes about how to use runbooks to improve your incident response, even at 2 AM.

Advice for someone moving from SRE to backend engineering: Charles Cary writes about how dynamic Ops and SRE are, misconceptions about creativity, and on-call duties.

Resilience in Action, Episode 6: Our podcast, Resilience in Action, is back! Host Kurt Andersen speaks with Todd Underwood, ML SRE Lead and Pittsburgh Site Lead for Google.

The Mightiest Monolith: Robert Barron writes about what modern developers, DevOps practitioners, and Site Reliability Engineers can learn from the Space Shuttle program.

Give it a whirl

Updated Incident Summary: The incident summary has been updated to display a more concise rundown. The new summary has improved UX/UI design and shares the incident summary, severity, status, and type as well as timestamps and the people involved.

Automated shortcut message: We also added a message that gives users access to documents and shortcuts to help automate commands. This reduces toil for responders.

Incident help suggestions: When launching a new incident, the Blameless bot will now provide suggestions for commands. The bot will direct the user to the list of commonly used “slash” commands and provide a link to the website where all the commands are listed.

Task list enhancements: In addition to creating an expanded checklist, we’ve made it possible to check off tasks within the Slack UI for a smoother user experience. To limit noise within the Slack channel, only task owners can now see their assigned tasks. A newly created task appears to its owner in a readable format and updates that person’s main task list.

Lastly, instead of changing a task’s status from a drop-down menu, users now cross tasks off as they are completed. This makes the previous Blameless commands “_mark task as pending” and “/blameless complete task” obsolete. Instead, users should run the command “/blameless show tasks” and cross off completed items from the checklist that appears.

If you’d like to see these upgrades in action, try Blameless today.

Events

Failover Conf April 27: Learn how teams have adapted over the past year, share your own stories, and engage with others in the reliability community.

99 Percent Visible: DevOps Reliability April 27 at 9 AM PDT: Kat Cosgrove will give her talk, “Learning to Learn by Teaching,” and discuss her experience teaching developers.

Blameless Bi-Weekly Demo April 27 at 8 AM PDT: Check out a live demo of Blameless as we walk you through operations best practices, and get your questions answered.

SRE Leaders Panel: Business Agility & SRE April 29 at 11 AM PDT: Join Chris Hendrix, Garima Bajpai, and Jason Fraser for a discussion on how reliability impacts the flow of value.

Deserted Island DevOps April 30: A single-day virtual event streamed on Twitch. All presentations will take place in the world of Animal Crossing: New Horizons.

Want to contribute?

Originally published at https://www.blameless.com.

Giving you all you need to know about Site Reliability Engineering. https://www.blameless.com/blog/
