The Essential List of Top SRE Resources

Blameless
Jul 17, 2020

Are you looking to get up to speed on SRE fundamentals with the best SRE books and best DevOps books? Or are you hoping to expand your SRE knowledge into new domains? Either way, we’ve got you covered in our list of essential SRE resources!

The big books

These comprehensive tomes of SRE expertise are a great place to start.

Google provides an overview of SRE implementation, covering the guiding principles that led to the organization-wide adoption of SRE, and detailing practices ranging from upper-level management to the nuances of load balancing.

The Essential Guide to SRE Best Practices

Offered by Blameless, this eBook guides you through implementing your own SRE solution and is centered around three key principles: creating a mindset of resiliency, reducing engineering problems and innovation blockers, and approaching systems from a human perspective. If you’re looking to see how SRE will work within your organization, this eBook provides solutions that are not one-size-fits-all and that you can begin implementing today.

Site Reliability Engineering

This O’Reilly textbook offers the most comprehensive dive into the inner workings of an SRE solution, covering everything from the fundamental theories of SRE to a breakdown of work-as-done. A companion book, The Site Reliability Workbook, provides illustrative case studies.

If you’re more pressed for time, Principal Developer Advocate for Honeycomb Liz Fong-Jones offers a playlist of essential O’Reilly SRE resources.

Site Reliability Engineering Tools

A variety of tools have been developed to help you on your SRE journey. These guides will help you decide what best fits your needs.

Blameless Buyers’ Guide for Reliability

Offered by Blameless, this guide looks at the goals of a successful SRE solution, and discusses what features a tool should have to accomplish them. It also breaks down the pros and cons of building tooling yourself, purchasing a tool, or adapting an open-source tool.

Awesome Site Reliability Tools

Curated by SREs, this list of tools is sorted by function to help you find vendors who provide services ranging from project management tools to infrastructure and container orchestration.

This article looks at a complete cycle of development and operations and breaks down how SRE tooling could help DevOps teams at each stage.

Choosing the Right Tools when Building Your SRE Toolchain

This talk by engineers at VictorOps, Grafana, and Influxdata outlines what an SRE toolchain could look like and how to experiment with options to build a solution.

Hiring Site Reliability Engineers: Why You Need an SRE

Thinking about staffing an SRE team? Having dedicated engineers working on the long view of reliability problems is a worthy investment in your reliability. But how can you find good SREs, and what should they be doing? These articles and talks will answer these questions and more.

This SREcon talk given by Andrew Fong breaks down how Dropbox hired its SRE team, covering everything from sourcing talent to interviewing rubrics.

This guide explains the importance of investing in reliability staff and outlines how to find the perfect candidate for your first SRE role.

From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams

Andrew Widdowson outlines Google’s recommendations for training your new SRE hires. Techniques such as learning opportunities, systems thinking, and imparting the philosophy of SRE help you get your team up and running.

Becoming a Certified SRE

Are you looking to step into the exciting role of SRE? These links will help you find site reliability engineering certifications and other learning opportunities.

Kubedex — How do I become a SRE?

This guide provides a concise spreadsheet of online courses in SRE topics. It builds up the SRE role from fundamental skills in Linux system administration and software development, making it the perfect guide for someone starting their career.

Site Reliability Engineering: Measuring and Managing Reliability on Coursera

Created by the Google Cloud team, this course covers the Google SRE book in an engaging guided format. Quizzes and short assignments reinforce your learning, with an optional paid certification for completion.

Site Reliability Engineering Philosophy and Culture

SRE isn’t just a set of practices and tools. The underlying philosophies of SRE motivating these practices are fundamental to making your organization truly resilient. These articles and blogs will help you embrace failure as inevitable, put aside blame, develop for resiliency, and more.

The Many Shapes of Site Reliability Engineering

This article looks at the different ways SRE can be implemented and the benefits of each on both practical and cultural levels.

What exactly is the difference between DevOps and SRE? How do you incorporate the practices of each? This presentation by Google will answer these questions and more.

Convincing Management to Invest in Reliability

This talk by Blameless co-founder Lyon Wong provides strategies for getting SRE buy-in at the level of management, VPs, and CTOs. A series of blog posts also covers the topic at each level: management, VP level, and CTO level.

This weekly newsletter curated by Lex Neva, SRE at Fastly, brings you the latest in case studies, think pieces, and SRE news.

Everything Else

Many links in this list were sourced from the Awesome Site Reliability Resources page. Check it out if you’d like further resources for any of these topics, or if there are other areas of SRE you’d like to explore.

If you’d like to learn more about SRE and how to begin employing best practices in your organization, feel free to reach out to us for a demo or try us out for free.

Originally published at https://www.blameless.com.
