
Introduction
A few days ago, I wrote an article about why we need to collect logs. I wrote it because I’ve lately been working on a new internal standard for my company, Digitalis, covering how we’re upgrading the way we monitor.
The legacy model
When we engage a new managed services customer, we deploy a monitoring stack at the customer’s premises to monitor the applications for which we become responsible. If the application in question is Kafka or Cassandra, we use AxonOps. Otherwise, we use a Prometheus-based monitoring stack.
When deploying a monitoring stack to Kubernetes, we use Helm charts to install Grafana, Mimir, Prometheus, and Alertmanager. These tools form the backbone of monitoring and observability in Kubernetes environments:
- Grafana: A powerful visualization tool for creating dashboards and analyzing metrics.
- Mimir: A scalable time-series database integrated with Grafana for storing metrics.
- Prometheus: A widely-used metrics collection system for monitoring Kubernetes clusters.
- Alertmanager: Used for managing alerts generated by Prometheus.
Once the stack is deployed, Prometheus exporters are employed to monitor applications, whether running inside or outside of Kubernetes. Exporters provide metrics in a format that Prometheus can scrape. Examples include:
- Node Exporter: Collects resource utilization data from nodes.
- Kube-State-Metrics: Exposes Kubernetes API object metrics.
- Custom application exporters: These can be built using the official Prometheus client libraries.
For streamlined deployment, the kube-prometheus-stack Helm chart provides an integrated solution that includes Prometheus, Grafana, and Alertmanager along with pre-configured dashboards and alerting rules.
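As a rough sketch, deploying that stack with Helm usually boils down to something like the commands below; the release name and namespace are placeholders, and in practice we pin the chart version and drive the values from our own automation.

```console
# Add the community chart repository and install the combined stack
# (release name and namespace are examples only)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace
```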
For logs, we prefer Grafana Loki, and these are shipped from the clients using Promtail.

However, if we’re deploying to virtual machines or bare metal, we use a slightly different approach. We use standard Grafana (not Mimir) and Elasticsearch for logs. On the client side, we use Filebeat.
We currently deploy a single Prometheus server that scrapes metrics from application-specific exporters. For example, we use the Prometheus Node Exporter for OS metrics and the Prometheus Elasticsearch Exporter for Elasticsearch metrics.
Everything is installed and managed with Ansible.
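To make that concrete, the single Prometheus server ends up with a static scrape configuration along these lines; the hostnames and credential paths below are invented for illustration, and the real files are templated by Ansible.

```yaml
# prometheus.yml (fragment) - static targets in the VM/bare-metal model
scrape_configs:
  - job_name: node              # Prometheus Node Exporter, default port 9100
    scheme: https
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/secrets/node_exporter
    static_configs:
      - targets:
          - db01.example.internal:9100
          - db02.example.internal:9100

  - job_name: elasticsearch     # Prometheus Elasticsearch Exporter, default port 9114
    scheme: https
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/secrets/elasticsearch_exporter
    static_configs:
      - targets:
          - es01.example.internal:9114
```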
Why the discrepancy?
It’s simple: the tech stack we use in Kubernetes, and Kubernetes itself, hasn’t been around for that long. Grafana Loki only reached general availability at the end of 2019, whereas Elasticsearch is roughly a decade older.
Furthermore, we use Grafana Mimir and Loki because most of the Kubernetes deployments we manage are in the cloud, where we can take advantage of object storage such as Amazon S3, GCS and Azure Blob Storage, which is cheaper and more resilient.
This is why I’m unifying the monitoring stack into a new solution.
Why do we need to change?
Aside from unifying Kubernetes with VMs and baremetal, there are some other important reasons for the change.
- Dynamic environments: dynamic environments are now the norm. Most companies deploying to the cloud are using dynamic DNS, dynamic IPs, etc. Whilst it is possible to configure Prometheus to discover services using plugins, such as EC2 service discovery on AWS, this has security implications (see the sketch after this list).
- Security: as per the above point, in dynamic environments the Prometheus server needs additional permissions to perform discovery. This is not a security problem per se, but it does add to the overall attack surface.
- Complexity: we use more vendors than we need, which adds complexity to our own monitoring stack.
- Cost: this level of complexity also implies higher costs: longer time to market, more virtual machines and more maintenance time.
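To illustrate the first two points, this is roughly what service discovery looks like on AWS; the region, port and tag are examples, and the Prometheus server needs IAM credentials with at least ec2:DescribeInstances for it to work.

```yaml
# prometheus.yml (fragment) - EC2 service discovery instead of static targets.
# The extra IAM permissions this requires are the security trade-off
# mentioned above.
scrape_configs:
  - job_name: ec2-nodes
    ec2_sd_configs:
      - region: eu-west-1
        port: 9100
    relabel_configs:
      # Only keep instances tagged for monitoring (tag name is an example)
      - source_labels: [__meta_ec2_tag_Monitoring]
        regex: enabled
        action: keep
```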
The new model

The highlights are:
- Remove Elasticsearch in favour of Grafana Loki
- Change the direction of metrics from pull to push
- Use Grafana Alloy to push metrics, logs and, where available, telemetry
- Retain Prometheus and Alertmanager
Grafana vs. Grafana Mimir: Choosing the Right Tool for Your Needs
When comparing Grafana and Grafana Mimir, it’s essential to understand their unique strengths. Grafana is renowned for its versatile data visualization capabilities, supporting various data sources and offering customizable dashboards, alerting, and notifications. It is user-friendly and widely adopted, making it an excellent choice for small deployments where simplicity and ease of use are paramount.
On the other hand, Grafana Mimir is a scalable, Prometheus-compatible time series database designed for large-scale environments. It excels in handling massive amounts of data with features like multi-tenancy, longer retention, and faster queries.
However, for small deployments, the complexity and resource requirements of Grafana Mimir may be unnecessary. You’ll have to assess your individual needs and decide on one or the other.
Grafana Loki
When Grafana Loki first launched, I have to admit I didn’t like it. I found the Grafana UI for querying logs slow and cumbersome compared to Elastic with Kibana. However, it has improved since then, and I’ve learned to use it better.
Loki’s architecture is designed to scale by separating the read path from the write path (queriers from ingesters) when the volume is high. Additionally, you can choose more cost-effective storage options such as Amazon Simple Storage Service (S3), Google Cloud Storage (GCS), and Azure Blob Storage.

Prometheus and Alertmanager
Grafana Mimir can replace Prometheus and Alertmanager. However, I’m still not too keen on it. Personally, I find Prometheus much easier to use and manage, and both applications are simple and well understood by the entire team.
Furthermore, the Prometheus Operator running in Kubernetes is excellent, making alert management very straightforward from an automation perspective (think Ansible, Terraform, GitOps, etc.).
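For example, with the operator an alert is just another Kubernetes manifest that can live in Git next to everything else. Here is a minimal sketch; the names, labels and threshold are illustrative, and the release label has to match the ruleSelector of your Prometheus resource.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-alerts
  namespace: monitoring
  labels:
    release: monitoring          # must match the Prometheus ruleSelector
spec:
  groups:
    - name: node.rules
      rules:
        - alert: HostHighCpuLoad
          expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "CPU usage above 90% on {{ $labels.instance }}"
```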
Additionally, I find Grafana’s query editor quite challenging to use compared to the Prometheus query browser.
Grafana Alloy
The cherry on the cake is Grafana Alloy. Until now, for metrics, we’ve been using different Prometheus exporters running on each node, exposed over HTTPS with authentication for the Prometheus server to scrape.
Then, for logs, we use different tools. Some servers run Promtail (which, by the way, if you haven’t heard, has been deprecated since February 2025), others run Filebeat… it depends on when they were set up and whether they ship to Elasticsearch or Loki.
What I like the most about Alloy is that it excels in both performance and ease of installation and configuration. There is a good Ansible role for it, and there is also a well-maintained Helm chart available.
The other important change is the direction. Until now, Prometheus has been responsible for connecting to the exporters to gather the metrics. With Grafana Alloy, however, logs and metrics are pushed to Loki and Prometheus using remote write, or to Mimir via its push API.
The other advantage is that Alloy ships with a core set of exporters ready to use. For example, we always install the Prometheus Node Exporter; with Alloy we no longer need it, as we can use the prometheus.exporter.unix component it provides (see the sketch below).
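Putting those pieces together, a minimal Alloy configuration for a single VM could look like the sketch below. The endpoints and credentials are invented for illustration; a real deployment would add external labels, TLS settings and proper secret handling.

```alloy
// Host metrics via the built-in unix exporter (replaces the standalone Node Exporter)
prometheus.exporter.unix "host" { }

// Scrape the local exporter and forward the samples to remote write
prometheus.scrape "host" {
  targets    = prometheus.exporter.unix.host.targets
  forward_to = [prometheus.remote_write.default.receiver]
}

prometheus.remote_write "default" {
  endpoint {
    url = "https://prometheus.example.internal/api/v1/write"   // or a Mimir push endpoint
    basic_auth {
      username = "alloy"
      password = "CHANGE_ME"   // in practice this comes from a secret, not the config file
    }
  }
}

// Tail local log files and push them to Loki (replaces Promtail/Filebeat)
local.file_match "system" {
  path_targets = [{ "__path__" = "/var/log/*.log" }]
}

loki.source.file "system" {
  targets    = local.file_match.system.targets
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "https://loki.example.internal/loki/api/v1/push"
  }
}
```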
Although not discussed in this blog post, another great feature in Grafana Alloy is the support for OpenTelemetry. Grafana Alloy operates as an OpenTelemetry (OTEL) agent, enabling the collection and transmission of telemetry data from diverse sources to monitoring systems for analysis and visualisation.
Final words
Essentially, I’m responsible for ensuring the smooth operation of numerous client platforms. My primary objectives are twofold: firstly, to maintain a consistent 100% uptime for these platforms, and secondly, to minimise the need for after-hours incident response, specifically, to avoid disruptive PagerDuty alerts that have a knack for ruining perfectly good dreams. 😄
To achieve these goals, a robust and reliable monitoring platform is indispensable. It’s about proactive management, preventing issues before they escalate, and ensuring a stable environment for our clients.
Contact our team for a free consultation to discuss how we can tailor our approach to your specific needs and challenges.
I, for one, welcome our new robot overlords