Is love in the air? We think so. While we don’t have chocolate or flowers for you, we have something just as sweet. Here are some of the most exciting Tweets, content, and events happening in the SRE and resilience engineering community this February.
How much of what is labeled “human error” is simply real world limits of human perception, information processing and other constraints of being human in a complex world?
- matt scanlon (@picudoc13) February 8, 2021
The ability to roll back safely is important, but once you have a reasonable feedback loop and gradual rollout, the vast…
When implementing SRE, almost every role within your IT organization will change. One of the biggest transformations will be in your Quality Assurance teams. A common misconception is that SRE “replaces” QA. People believe SLOs and other SRE best practices render the traditional role of QA engineering obsolete, as testing and quality shift left in the SDLC. This leads to QA teams resisting SRE adoption.
But QA teams can and should embrace the transformation that SRE can bring, as SRE elevates their role to a strategic partner in designing performant software and scalable practices. SRE removes silos from QA expertise…
We live in the era of reliability. The most important feature of a service is how dependable it is in the eyes of a user. Companies are hiring with this in mind. In a 2019 LinkedIn article, site reliability engineer was listed as the 2nd most promising career in the United States.
But how do you get started as an SRE? In this blog post, we’ll look at:
SRE is a multifaceted role. You will contribute…
When we think of reliability tools, we may overlook the humble checklist. While tools like SLOs represent the cutting edge of SRE, checklists have been recommended in many industries such as surgery and aviation for almost a century. Checklists owe this long and widespread adoption to their usefulness.
Checklists can also help limit errors when deploying code to production. In this blog post, we’ll cover:
Production checklists should be holistic. They should cover everything from…
The most essential lesson of SRE is that failure is inevitable. This shouldn’t be a cause for despair. SRE shows how embracing failure is empowering. By celebrating failure, you can accelerate development and foster a culture of learning.
Rather than hoping to prevent failure, SRE prepares you to respond well to it. It can be difficult, if not impossible, to anticipate where failure will occur in complex systems given unknown unknowns. It follows that understanding how to evaluate severity and respond appropriately becomes a complex task, making preparedness even more critical. …
On January 4th, 2021, the communication service Slack suffered a major outage. Teams working remotely found their primary communication method unavailable. The incident lasted over 4 hours, during which some customers had intermittent or delayed service, and others had no service at all. It was a reminder that even the most established tools are susceptible to downtime. This is a core lesson of SRE: that failure is inevitable.
SRE also teaches us the importance of planning for failure and reacting to it as resiliently as possible. Failures are not limited to our own systems, either. With the rise of microservices, our…
By: Darrell Pappa
“Sorry, but I’m just doing my job.” I heard this recently from a customer service representative. What they were saying made sense (after all, we don’t have total control over our work environments), but it felt wrong. As a customer, I was left dissatisfied with our interaction. However, the representative assured me that they were simply following protocol. This got me thinking: can established practices and protocols sometimes get in the way of excellent customer experience?
As an SRE, I know one thing: we exist to serve an end user. That’s it. Sure, we are people, too, with…
Implementing SRE is fundamentally about shifting culture, but it often means adding new tooling and processes to your team’s workflows to support that cultural change. Teams add new steps and checks to incident response procedures. Incident responders write retrospectives and create new meetings to review them. Engineers consult new tools like monitoring dashboards and SLOs. In other words, SRE creates another layer of consideration in development and operations.
With all of these additions, it may seem inevitable that new steps would slow down the process. But investing in reliability will actually save you time. …
In the reliability era, many services are migrating from in-house servers to the cloud. The cloud model allows your service to capitalize on the benefits of large hosting providers such as AWS, Microsoft Azure, or Google Cloud. Cloud-hosted servers can be more reliable than in-house servers for reasons including:
However, as with all things, cloud providers present their own risks and challenges as…
New year, new SRE! We’ve said goodbye to 2020 and hello to 2021. Here are some of the most exciting Tweets, content, and events happening in the SRE and resilience engineering community so far this year.
Coders often talk about refactoring, but I’d like to see more “prefactorings” — refactoring done to make a subsequent change simpler. Put these into their own commits (or even PRs!) which are verifiably “no-impact”. Use them to make your “real” change more obvious and surgical.
- Tim Hockin (@thockin) January 4, 2021
Abstraction teaches us that we must elide details in order to be able…
Giving you all you need to know about Site Reliability Engineering. https://www.blameless.com/blog/