Choosing the Right SRE Tools

Monitoring Tools

  • Resource monitoring: reports on how servers are running with metrics such as RAM usage, CPU load, and remaining disk space.
  • Network monitoring: reports on incoming and outgoing traffic, which can be broken down by the frequency and size of specific requests.
  • Application performance monitoring: reports on the performance of services by sending internal requests to them and monitoring metrics such as response time, completeness of response, and data freshness.
  • Third-party component monitoring: reports on the health and availability of third-party services integrated into your system.
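
As a rough illustration of the resource monitoring described above, the following sketch collects a couple of host metrics with Python's standard library and compares them with thresholds. The metric names, the threshold values, and the functions `collect_metrics` and `find_breaches` are assumptions made for this example; a real monitoring tool would gather far more data and ship it to a central store.

```python
import os
import shutil


def collect_metrics(path: str = ".") -> dict[str, float]:
    """Gather a few host metrics using only the standard library."""
    usage = shutil.disk_usage(path)
    metrics = {"disk_free_fraction": usage.free / usage.total}
    # os.getloadavg() exists only on Unix-like systems and may raise OSError.
    if hasattr(os, "getloadavg"):
        try:
            one_minute_load, _, _ = os.getloadavg()
            metrics["load_per_cpu"] = one_minute_load / (os.cpu_count() or 1)
        except OSError:
            pass
    return metrics


# Hypothetical thresholds: "min" means the value must stay above the limit,
# "max" means it must stay below it.
THRESHOLDS = {
    "disk_free_fraction": ("min", 0.10),
    "load_per_cpu": ("max", 1.50),
}


def find_breaches(metrics: dict[str, float], thresholds=THRESHOLDS) -> list[str]:
    """Return the names of metrics that violate their threshold."""
    breaches = []
    for name, (kind, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not available on this platform
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            breaches.append(name)
    return breaches


if __name__ == "__main__":
    print(find_breaches(collect_metrics()))
```

Keeping collection separate from evaluation means the same threshold check could, in principle, be reused for network or application metrics.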

SLOs and Error Budgeting

  • Consolidating monitoring data into service level indicators (SLIs), combining several sources into a single measurement.
  • Empowering you to set thresholds for this metric over time, such as a total amount of downtime per month.
  • Dictating policies to be enacted when the metric exceeds these thresholds, and integrating them with alerting and collaboration tools.
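
The three points above can be sketched as a small error budget calculation. This is a minimal illustration, not how any particular SLO tool works: the `Slo` class, the source names, the policy thresholds, and the policy actions are all hypothetical, and summing downtime across sources assumes their outages do not overlap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Slo:
    target: float        # availability objective, e.g. 0.999
    window_minutes: int  # length of the evaluation window

    @property
    def budget_minutes(self) -> float:
        """Total downtime the objective tolerates within the window."""
        return (1.0 - self.target) * self.window_minutes


def budget_consumed(slo: Slo, downtime_by_source: dict[str, float]) -> float:
    """Combine downtime from several sources into one fraction of the budget."""
    return sum(downtime_by_source.values()) / slo.budget_minutes


# Hypothetical policies, checked from the most to the least severe.
POLICIES = [
    (1.00, "freeze feature releases"),
    (0.75, "page the on-call lead"),
    (0.50, "notify the team channel"),
]


def policy_for(consumed: float) -> Optional[str]:
    """Return the action for the highest threshold reached, if any."""
    for threshold, action in POLICIES:
        if consumed >= threshold:
            return action
    return None


if __name__ == "__main__":
    slo = Slo(target=0.999, window_minutes=30 * 24 * 60)  # 30-day window
    used = budget_consumed(slo, {"uptime_probe": 12.0, "synthetic_checks": 10.0})
    print(f"{used:.0%} of the budget used, policy: {policy_for(used)}")
```

With a 99.9% target over 30 days the budget is roughly 43.2 minutes, so 22 minutes of recorded downtime uses a little over half of it.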

Alerting

Incident Management

  • Assessing and prioritizing incidents through classification
  • Preparing responses based on classification, including runbooks
  • Alerting and escalating to get the correct people involved
  • Communicating and coordinating through defined roles
  • Logging and documenting the response in an incident retrospective
  • Learning from the retrospective and integrating the lessons into further development
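
The first three steps above (classification, prepared responses, and escalation) might be modeled along these lines. Every detail here is an assumption for illustration: the severity levels, the classification cut-offs, the runbook paths, and the role names are hypothetical and would differ per organization.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # most severe
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class ResponsePlan:
    runbook: str             # hypothetical runbook location
    notify: tuple[str, ...]  # roles to alert, in escalation order


def classify(users_affected_fraction: float, data_loss: bool) -> Severity:
    """Assign a severity from two illustrative impact signals."""
    if data_loss or users_affected_fraction >= 0.5:
        return Severity.SEV1
    if users_affected_fraction >= 0.05:
        return Severity.SEV2
    return Severity.SEV3


# Prepared responses keyed by classification.
PLANS = {
    Severity.SEV1: ResponsePlan(
        "runbooks/major-outage.md",
        ("on-call-engineer", "incident-commander", "communications-lead"),
    ),
    Severity.SEV2: ResponsePlan(
        "runbooks/partial-outage.md",
        ("on-call-engineer", "incident-commander"),
    ),
    Severity.SEV3: ResponsePlan("runbooks/minor-issue.md", ("on-call-engineer",)),
}


if __name__ == "__main__":
    severity = classify(users_affected_fraction=0.12, data_loss=False)
    plan = PLANS[severity]
    print(severity.name, plan.runbook, plan.notify)
```

Tying the runbook and the notification list to the classification keeps the response predictable: responders look up a plan rather than improvising one during the incident.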

Incident Retrospectives

Chaos Engineering

Blameless

Giving you all you need to know about Site Reliability Engineering. https://www.blameless.com/blog/