How to Choose Monitoring Tools for DevOps and SRE

When developing for reliability or implementing resilient DevOps practices, data is at the heart of your decision-making. Without carefully monitoring key metrics like uptime, network load, and resource usage, you won’t know where to spend development effort or which operational practices to refine. Fortunately, a wide variety of monitoring tools are available to help you collect this data and gain visibility into it.

While it might be tempting to try to monitor absolutely everything in your system, more focused monitoring will be easier to implement and leave you with more actionable data. SRE practices like SLOs are most useful when based on metrics for customer impact. Deciding what and how to monitor is an important decision. We’ll walk you through the basics in this blog post. We’ll also suggest a few popular monitoring tools for your consideration.

Where to implement monitoring

It’s important to decide where in your system architecture you’ll implement monitoring. This will allow you to develop your architecture around the monitoring tool, rather than having to retrofit existing code. Depending on the location of implementation, monitoring tools will be able to observe different types of data. Here’s a breakdown of the most common types of monitoring implementations, along with examples of tools offering that type of monitoring:

Resource monitoring: Also known as server monitoring or infrastructure monitoring, this operates by gathering data on how your servers are running. Resource monitoring tools report on RAM usage, CPU load, and remaining disk space. In architectures with physical servers, information on hardware health, such as CPU temperatures and component uptime, can also be helpful to avoid server failure. In cloud-based environments, aggregates of your virtual server system are more useful.
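To make the idea concrete, here is a minimal sketch of the kind of data a resource monitor gathers, using only the Python standard library. The function name resource_snapshot and the keys in the returned dictionary are hypothetical choices for illustration; real resource monitoring tools collect far more and run continuously.

```python
import os
import shutil


def resource_snapshot(path="."):
    """Collect a few basic resource metrics for the host running this code."""
    disk = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    snapshot = {
        "disk_free_pct": 100.0 * disk.free / disk.total,
        "cpu_count": os.cpu_count(),
    }
    # os.getloadavg() only exists on Unix-like systems and may raise OSError
    # if the load average cannot be obtained, so treat it as optional.
    if hasattr(os, "getloadavg"):
        try:
            snapshot["load_avg_1m"] = os.getloadavg()[0]
        except OSError:
            pass
    return snapshot


snapshot = resource_snapshot()
```

A scheduler could call a function like this every few seconds and ship the result to whatever storage or dashboard you use.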

Network monitoring: This looks at the data coming in and out of your computer network. Your monitoring tool captures all incoming requests and outgoing responses across all components such as switches, firewalls, servers, and more. The data collected from network monitoring can be as simple as the total amount of data coming and going or as nuanced as the frequency of particular requests.
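As a rough illustration of the two ends of that range, the sketch below totals the bytes in and out and also counts how often each request occurs. The record layout (method, path, bytes in, bytes out) and the sample values are assumptions made up for this example, not the output format of any particular tool.

```python
from collections import Counter


def summarize_traffic(records):
    """Summarize (method, path, bytes_in, bytes_out) request records."""
    total_in = sum(r[2] for r in records)
    total_out = sum(r[3] for r in records)
    # Frequency of each distinct (method, path) pair.
    per_request = Counter((r[0], r[1]) for r in records)
    return {"bytes_in": total_in, "bytes_out": total_out, "requests": per_request}


# Hypothetical sample records for illustration only.
sample = [
    ("GET", "/home", 200, 5120),
    ("GET", "/home", 180, 5120),
    ("POST", "/login", 640, 300),
]
summary = summarize_traffic(sample)
```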

Application performance monitoring: APM solutions collect data on how an overall service is performing. These tools will send their own requests to the service and track metrics such as the speed and completeness of the response. The goal is to drive detection and diagnosis of application performance issues to ensure services perform at expected levels.
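A synthetic check of this kind might look roughly like the following sketch. It sends one request, times it, and checks that the response is complete. The names probe and http_get, the expected_text parameter, and the rule that only a 200 status counts as healthy are all assumptions for illustration; commercial APM tools do considerably more.

```python
import time
import urllib.request


def http_get(url, timeout=5.0):
    """Fetch a URL and return (status_code, body_bytes)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read()


def probe(url, fetch=http_get, expected_text=b""):
    """Send one synthetic request and record its speed and completeness."""
    start = time.monotonic()
    try:
        status, body = fetch(url)
        # "Complete" here means a 200 status and the expected text in the body.
        ok = status == 200 and expected_text in body
    except OSError:
        # urllib's URLError and socket timeouts are both OSError subclasses.
        ok = False
    latency = time.monotonic() - start
    return {"url": url, "ok": ok, "latency_s": latency}
```

Passing the fetch function as a parameter lets you swap in a fake for testing, or a different HTTP client, without changing the probe logic.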

Third-party component monitoring: This involves monitoring the health and availability of third-party components in your architecture. In this era of microservices, it’s likely that your service depends on the proper functioning of external services, from cloud hosting to ad servers. Like application performance monitoring, tools can check the status of these services with their own requests.

You will likely want to include some of each type of monitoring in your overall solution. Prioritize having robust, redundant monitoring tools to ensure potential issues aren’t missed. At the same time, tie metrics and alerts to specific services so that they stay relevant to business impact.

What you need from your data

Having actionable data isn’t just about the data itself; in order to respond properly to what your monitoring tools are reporting, you need to have that data presented in the most useful way. Here are some things that monitoring tools can do for you:

  • Trigger alerts when metrics exceed certain thresholds
  • Create logs of events, highlighting based on parameters
  • Create graphs of metrics over time
  • Provide a dashboard of key service health components at a glance
  • Create databases of logs that can be queried
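The first item, threshold alerts, can be sketched in a few lines. The metric names and limits below are hypothetical, and a real tool would route alerts to a pager or chat channel rather than return strings.

```python
def check_thresholds(metrics, thresholds):
    """Return an alert message for each metric above its threshold."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        # Skip metrics that were not reported in this sample.
        if value is not None and value > limit:
            alerts.append(f"ALERT {name}={value} exceeds {limit}")
    return alerts


# Hypothetical readings and limits for illustration only.
metrics = {"cpu_load_pct": 93.0, "disk_used_pct": 41.5}
thresholds = {"cpu_load_pct": 90.0, "disk_used_pct": 80.0}
alerts = check_thresholds(metrics, thresholds)
```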

When making development decisions or responding to an incident, try to get in the habit of asking yourself, “What would I need to be looking at right now to make the best choice?” Then picture that ideal view: what data it would contain and which metrics matter.

Open source vs purchased

Another important point to consider is where you’ll find your monitoring tools and who will maintain them. There are both open source and purchasable tools with their own pros and cons.

Open source tools are free, which is an advantage for companies with limited tooling budgets. They’re also completely customizable, allowing you to integrate them into your own architecture. However, this customization will require dedicated development time and perhaps specialized knowledge. Furthermore, there is no SLA guaranteeing availability, security, update frequency, etc. Your team would own these responsibilities.

Purchased tools cost money but offer robustness that open source tools cannot. The service provider will be accountable for keeping the tool functioning and up-to-date. The provider will likely offer customer service, training, documentation, and other resources to help you integrate the tool with your stack. In the era of reliability, making investments to ensure your monitoring eyes are always open is worth considering.

Comparison of Monitoring Tools

Here are 10 of the most popular monitoring tools for SRE and DevOps to consider for your system.

No matter what monitoring tools you ultimately use, you’ll want to make the most of the data they provide in the context of a larger reliability solution that drives actionability. Blameless helps you transform monitoring data into SLOs and error budgets, and incorporate it into reliability insights. To see more of what Blameless can do, join us for a demo!

Originally published at
