Is love in the air? We think so. While we don’t have chocolate or flowers for you, we have something just as sweet. Here are some of the most exciting Tweets, content, and events happening in the SRE and resilience engineering community this February.

Tweets that have us twittering

How much of what is labeled “human error” is simply real world limits of human perception, information processing and other constraints of being human in a complex world?

- matt scanlon (@picudoc13) February 8, 2021

The ability to roll back safely is important, but once you have a reasonable feedback loop and gradual rollout, the vast…

QA Engineers, This is How SRE will Transform your Role

When implementing SRE, almost every role within your IT organization will change. One of the biggest transformations will be in your Quality Assurance teams. A common misconception is that SRE “replaces” QA. People believe SLOs and other SRE best practices render the traditional role of QA engineering obsolete, as testing and quality shift left in the SDLC. This leads to QA teams resisting SRE adoption.

But QA teams can and should embrace the transformation that SRE can bring, as SRE elevates their role to a strategic partner in designing performant software and scalable practices. SRE removes silos from QA expertise…

Getting started as an SRE? Here are 3 Things you Need to Know

We live in the era of reliability. The most important feature for a service is how dependable it is in the eyes of a user. Companies are hiring with this in mind. In a 2019 LinkedIn article, site reliability engineers were listed as the 2nd most promising career in the United States.

But how do you get started as an SRE? In this blog post, we’ll look at:

  • Key comprehensions and skills for an SRE
  • Positions and credentials that can develop into the SRE role
  • The career paths of some successful SREs

Key Comprehensions and Skills for an SRE

SRE is a multifaceted role. You will contribute…

When we think of reliability tools, we may overlook the humble checklist. While tools like SLOs represent the cutting edge of SRE, checklists have been recommended in many industries such as surgery and aviation for almost a century. Checklists owe this long and widespread adoption to their usefulness.

Checklists can also help limit errors when deploying code to production. In this blog post, we’ll cover:

  • How to make a production checklist
  • Why production checklists are helpful
  • Keeping your checklist up to date
  • How Blameless can help integrate your checklists

How to make a production checklist

Production checklists should be holistic. They should cover everything from…

4 Tips on Preparing for a [Great] Failure

The most essential lesson of SRE is that failure is inevitable. This shouldn’t be a cause for despair. SRE shows how embracing failure is empowering. By celebrating failure, you can accelerate development and foster a culture of learning.

Rather than hoping to prevent failure, SRE prepares you to respond well to it. It can be difficult, if not impossible, to anticipate where failure will occur in complex systems given unknown unknowns. It follows that understanding how to evaluate severity and respond appropriately becomes a complex task, making preparedness even more critical. …

On January 4th, 2021, the communication service Slack suffered a major outage. Teams working remotely found their primary communication method unavailable. The incident lasted over 4 hours, during which some customers had intermittent or delayed service, and others had no service at all. It was a reminder that even the most established tools are susceptible to downtime. This is a core lesson of SRE: that failure is inevitable.

SRE also teaches us the importance of planning for failure and reacting to it as resiliently as possible. Failures are not limited to our own systems, either. With the rise of microservices, our…

By: Darrell Pappa

“Sorry, but I’m just doing my job.” I heard this recently from a customer service representative. What they were saying made sense (after all, we don’t have total control over our work environments), but it felt wrong. As a customer, I was left dissatisfied with our interaction. However, the representative assured me that they were simply following protocol. This got me thinking: can established practices and protocols sometimes get in the way of excellent customer experience?

As an SRE, I know one thing: we exist to serve an end user. That’s it. Sure, we are people, too, with…

Implementing SRE is fundamentally about shifting culture, but it often means adding new tooling and processes to your team’s workflows to support that cultural change. Teams add new steps and checks to incident response procedures. Incident responders write retrospectives and create new meetings to review them. Engineers consult new tools like monitoring dashboards and SLOs. In other words, SRE creates another layer of consideration in development and operations.

With all of these additions, it may seem inevitable that new steps would slow down the process. But investing in reliability will actually save you time. …

In the reliability era, many services are migrating from in-house servers to the cloud. The cloud model allows your service to capitalize on the benefits of large hosting providers such as AWS, Microsoft Azure, or Google Cloud. These servers can be more reliable than in-house servers for reasons including:

  • Large hosting providers have many infrastructure redundancies, which means individual servers can fail without affecting customers
  • Cloud providers benefit from strong security measures to mitigate breaches
  • Clouds have high bandwidth and capacity, reducing the risk of outages

However, as with all things, cloud providers present their own risks and challenges as…

New year, new SRE! We’ve said goodbye to 2020 and hello to 2021. Here are some of the most exciting Tweets, content, and events happening in the SRE and resilience engineering community so far this year.

Tweets that have us twittering

Coders often talk about refactoring, but I’d like to see more “prefactorings” — refactoring done to make a subsequent change simpler. Put these into their own commits (or even PRs!) which are verifiably “no-impact”. Use them to make your “real” change more obvious and surgical.

- Tim Hockin (@thockin) January 4, 2021

Abstraction teaches us that we must elide details in order to be able…

Blameless

Giving you all you need to know about Site Reliability Engineering. https://www.blameless.com/blog/
