Who else is glad that 2020 is almost over? We’ve had one of the most difficult years in recent history. With everything going on, it’s been difficult to think further than a few days out, much less into the new year. But, we’re hopeful that 2021 will be a better year for everyone. And we’re predicting some exciting things in the future for SRE.
Here’s our two cents: SRE adoption will only continue to grow. Yet, the practice and culture shift, rather than the role, will take priority in 2021. More people (not only SREs) will have a reliability mindset, which means reliability will be shifting left through the software lifecycle. SLIs, SLOs, and error budget policies will become common practice. …
In light of the pandemic, the global economy is suffering. While this downturn is extreme, it’s not irreparable. In fact, after experiencing economic meltdowns such as the Great Depression and the Great Recession, we’ve learned much about how to regulate our economies to prevail through and recover from such upsets.
“Are We Safer? The Case for Strengthening the Bagehot Arsenal,” by former United States Secretary of the Treasury and President of the Federal Reserve Bank of New York Tim Geithner, focuses on how disaster happens, disaster response, and the craft of financial crisis management. …
Black Friday: we all know what it looks like. Hundreds of people swarming stores after Thanksgiving, jostling for the best deals. But in light of COVID-19, this arrangement could be dangerous.
Over the last few years, Black Friday has become a digital event, and this year should be even more so. According to Forbes writer Richard Kestenbaum, “88% of global consumers told a Visa study they’re planning to buy gifts this holiday season.” Yet “only 20% of U.S. consumers plan to do their shopping exclusively in-store, while nearly a third plan to do most of their shopping online.”
This mostly digital Black Friday event will mean retailers must be at the top of their game. With downtime costs per minute like $220,318.80 (Amazon) and $40,771.20 (Walmart), outages are expensive. …
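To put those per-minute figures in perspective, here is a minimal sketch that scales them to a full outage. It assumes cost grows linearly with outage length, which is a simplification. The 15-minute duration and the `outage_cost` helper are illustrative choices, not figures or code from the original article.

```python
from decimal import Decimal

# Per-minute downtime costs quoted in the excerpt above (USD).
COST_PER_MINUTE = {
    "Amazon": Decimal("220318.80"),
    "Walmart": Decimal("40771.20"),
}


def outage_cost(retailer: str, minutes: int) -> Decimal:
    """Estimate the cost of an outage, assuming cost scales linearly with time."""
    return COST_PER_MINUTE[retailer] * minutes


if __name__ == "__main__":
    # 15 minutes is an arbitrary, illustrative outage length.
    for retailer in COST_PER_MINUTE:
        print(f"{retailer}: ${outage_cost(retailer, 15):,.2f}")
```

Under that linear assumption, a 15-minute outage works out to roughly $3.3 million for Amazon and about $612,000 for Walmart.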
In a recent fireside chat, Mohan Bhatkar, Head of Engineering for the Customer Reliability Platform at Mercari, Inc., sat down with Blameless Co-Founder Ashar Rizqi. They talked about scaling while avoiding silos, exciting day-to-day challenges, instilling a culture of empowerment, and more. Here are their top insights and the lightly edited transcript of their conversation.
At Blameless, we value every opportunity to learn. Whether it’s taking time on Focus Fridays to attend a cool webinar, or conducting retrospectives for incidents, lost deals, events, and more, learning is core to our mission.
To learn even more about our craft, we decided to start a book club at Blameless. People from every team (engineering, sales, SRE, marketing, product, people, and more) attended. One of the books we’ve been reading together is none other than Alex Hidalgo’s Implementing Service Level Objectives.
Below is a summary of key topics from Alex’s book, along with thoughts our team had while reading. In this blog post, we’ll cover part one of Implementing Service Level Objectives, “SLO Development.” …
We’re drinking Pumpkin Spice Lattes, lighting candles, and wearing flannel. Oh, and reading a bunch of great stuff. Here’s the November issue of SREview! This monthly zine features epic Tweets, content, and events happening in the SRE and resilience engineering community.
You’re never going to know everything there is to know in tech.
Instead, aim for the confidence to know that you can figure it out.
- Angie Jones (@techgirl1908) October 20, 2020
Watched Jurassic Park with my 11-yr old son. There’s a scene where Samuel L. Jackson’s character says rebooting the entire system *should* resolve the problem, but they’ve never done it before and it may not come back at all, and it’s the most realistic computer scene in history. …
Written by: Chris Hendrix and Hannah Culver
Metrics are the golden ticket to knowing what’s going on with your system… or so everyone thinks. But there can be too much of a good thing. Are your metrics really doing you any favors? Are they letting you see into what your customers truly want from you? If not, you might have a problem. You might be fetishizing your metrics. The good news is you’re definitely not alone.
Like The Hobbit’s dragon Smaug lying on his pile of gold, never spending and only hoarding, many of us stockpile pretty, feel-good, but useless metrics that never make a difference. In fact, these metrics could be clouding your ability to get the context and clarity you need. …
Blameless recently had the pleasure of interviewing Yury Niño Roa, Site Reliability Engineer, Solutions Architect and Chaos Engineering Advocate at ADL Digital Labs. She’s worked in roles ranging from solutions architect, to software engineering professor, to DevOps engineer, to SRE. Additionally, Yury is an avid blogger and conference speaker who regularly presents at events such as Chaos Conf, DevOpsDays Bogotá, and more.
In this interview, we’ll delve into what draws Yury to SRE and chaos engineering, how she defines resilience, as well as her predictions on emerging trends in the SRE landscape.
I am a Site Reliability Engineer and Chaos Engineering Advocate in Colombia. I love building software applications, reading blogs, writing articles, solving hard performance and resilience issues, and teaching software concepts. …
Onboarding is an essential yet challenging part of the hiring process. As your organization matures, more of its processes become unique. This makes it harder for new employees to get up to speed. Investing in custom processes and tooling to achieve your specific goals is a valuable practice. But, you must balance this with an investment in onboarding.
Fortunately, an investment in SRE is also an investment in onboarding, as one of the important goals of SRE is to help democratize context across software teams. At first, SRE may seem like an area with a steep learning curve. The diversity of skills expected of the SRE role can make it difficult to hire for. However, these skills help broaden engineers’ abilities and their understanding of their organization’s systems. …
Atlassian JIRA, one of the most popular ticketing systems, allows teams to catalogue incidents, follow-up actions, bugs, stories, and more. As a common tool in any DevOps/SRE operation’s toolchain, JIRA is a key integration at Blameless.
Blameless’ integration with JIRA allows teams to automatically generate a ticket within both Blameless and JIRA. This integration also allows teams to track follow-up actions via Blameless’ postmortem tool.
Creating an incident in Blameless and JIRA: At the start of an incident, you can create a new incident with Blameless. Creating a new incident spins up three things: