Prometheus Blackbox-Exporter – monitoring TLS certificates

17 Aug, 2020


Introduction

In any environment that exposes endpoints, it’s important to ensure that they are secured and monitored. Although there is plenty of documentation and examples on performance monitoring, there is very little on monitoring the security of the endpoints, specifically TLS/SSL certificates on non-HTTPS endpoints. This blog outlines how you can use the Prometheus Blackbox Exporter to do this. I hope you find it useful!

As an example, when we deploy Apache Cassandra it is typical to secure the database endpoints with certificates to ensure the data is encrypted in flight and, if required, to enforce client certificate validation. Knowing when these certificates expire is important, and being able to monitor and alert on it is critical. The same approach applies to any application that exposes a TLS/SSL endpoint, such as LDAP, Kafka or ELK.

For the sake of simplicity, we will use a single Blackbox probe located on the same VM as our single Prometheus instance to monitor a certificate on an Apache Cassandra database.

Note:

In a high availability or production environment, we recommend using multiple probes and multiple Prometheus instances. For Apache Cassandra in particular, we also suggest using a complete monitoring system and operational toolset such as https://www.axonops.com/

First of all, let’s manually check the connection from the Prometheus instance to our Apache Cassandra node, cas01.dev.db.myexample.io, on port 9142. We can do that with the openssl command:

$ echo -n | openssl s_client -connect cas01.dev.db.myexample.io:9142 2> /dev/null | openssl x509 -noout -text
Certificate:
    Data:
        Version: 1 (0x0)
        Serial Number: 13315684638806572674 (0xb8cad0a12530fe82)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: C=US, O=myexample.io, OU=DevCluster, CN=rootCA
        Validity
            Not Before: Jan 22 13:01:15 2019 GMT
            Not After : Jan 21 13:01:15 2021 GMT
        Subject: C=US, O=myexample.io, OU=DevCluster, CN=cas01.dev.db.myexample.io
[...]

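echo -n sends empty input to openssl s_client, so it exits right after the handshake instead of waiting for input and the prompt returns immediately. openssl s_client connects to the endpoint and retrieves the certificate, and openssl x509 -noout -text prints the certificate details in readable form.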

Now let’s test the tcp_cert module on the Blackbox Exporter with a debug probe request. The module configuration is printed at the end of the output:

$ curl 'http://prom.blog.myexample.io:9115/probe?target=cas01.dev.db.myexample.io%3A9142&module=tcp_cert&debug=true'
Logs for the probe:
ts=2020-08-01T09:39:41.498464359Z caller=main.go:304 module=tcp_cert target=cas01.dev.db.myexample.io:9142 level=info msg="Beginning probe" probe=tcp timeout_seconds=5
ts=2020-08-01T09:39:41.498568598Z caller=tcp.go:41 module=tcp_cert target=cas01.dev.db.myexample.io:9142 level=info msg="Resolving target address" ip_protocol=ip6
ts=2020-08-01T09:39:41.503386291Z caller=tcp.go:41 module=tcp_cert target=cas01.dev.db.myexample.io:9142 level=info msg="Resolved target address" ip=10.0.4.20
ts=2020-08-01T09:39:41.503413503Z caller=tcp.go:111 module=tcp_cert target=cas01.dev.db.myexample.io:9142 level=info msg="Dialing TCP with TLS"
ts=2020-08-01T09:39:41.524602341Z caller=main.go:119 module=tcp_cert target=cas01.dev.db.myexample.io:9142 level=info msg="Successfully dialed"
ts=2020-08-01T09:39:41.524669931Z caller=main.go:304 module=tcp_cert target=cas01.dev.db.myexample.io:9142 level=info msg="Probe succeeded" duration_seconds=0.026147547


Metrics that would have been returned:
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.004831258
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.026147547
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_ssl_earliest_cert_expiry Returns earliest SSL cert expiry date
# TYPE probe_ssl_earliest_cert_expiry gauge
probe_ssl_earliest_cert_expiry 1.611234074e+09
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
# HELP probe_tls_version_info Returns the TLS version used, or NaN when unknown
# TYPE probe_tls_version_info gauge
probe_tls_version_info{version="TLS 1.2"} 1



Module configuration:
prober: tcp
timeout: 5s
http:
    ip_protocol_fallback: true
tcp:
    ip_protocol_fallback: true
    tls: true
    tls_config:
        insecure_skip_verify: true
icmp:
    ip_protocol_fallback: true
dns:
    ip_protocol_fallback: true

The probe works: it can connect to our Apache Cassandra database over TLS, and we can see the metrics it exports.
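
The module configuration printed at the end of the debug output corresponds to a module definition in the Blackbox Exporter configuration file. A minimal sketch of that definition follows; the file name blackbox.yml is an assumption, while the values are taken from the output above:

```yaml
# blackbox.yml (file name assumed; use the file that --config.file points to)
modules:
  tcp_cert:
    prober: tcp
    timeout: 5s
    tcp:
      # Wrap the TCP connection in TLS so the certificate can be read.
      tls: true
      tls_config:
        # Skip chain and hostname verification, as in the debug output.
        # The expiry metric is still exported.
        insecure_skip_verify: true
```

With insecure_skip_verify enabled, the probe can succeed even when the probe host does not trust the issuing CA, which suits a private root CA like the one in this example. It also means that probe_success on its own says nothing about whether the certificate chain is valid.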

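The most useful metrics are:

  • probe_success shows whether the probe succeeded, which tells us that the endpoint is reachable;
  • probe_ssl_earliest_cert_expiry is the earliest certificate expiry time, as a Unix timestamp in seconds (1.611234074e+09 above is 21 January 2021, matching the Not After date of the certificate);
  • probe_duration_seconds is the time the probe took, which is useful for checking the responsiveness of the node.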
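
Because probe_ssl_earliest_cert_expiry is an absolute timestamp, it is usually compared with the current time. A minimal sketch of a Prometheus recording rule that stores the remaining lifetime in days is shown here; the group name, the rule name and the rule file are hypothetical:

```yaml
# Hypothetical rule file, loaded through rule_files in prometheus.yml.
groups:
  - name: tls_certificate_expiry
    rules:
      # Days left until the earliest certificate expiry seen by the probe.
      - record: instance:probe_ssl_cert_expiry:days
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400
```

A value in days is easier to read on a dashboard than a raw timestamp, and it becomes negative once the certificate has expired.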

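Finally, let’s configure Prometheus to scrape the node through the Blackbox Exporter. The relabel_configs below pass each target to the exporter as the target parameter, keep it as the instance label, and redirect the scrape itself to the exporter at 127.0.0.1:9115: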

$ cat /var/lib/prometheus/config/prometheus.yml

[...]
- job_name: blackbox_cas
  params:
    module:
    - tcp_cert
  metrics_path: /probe
  static_configs:
  - targets:
    - cas01.dev.db.myexample.io:9142
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 127.0.0.1:9115
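
With the job in place, alerting on these metrics is a matter of adding rules. The following is a sketch rather than a recommended policy: the alert names, thresholds, severities and rule file are assumptions, and the job label matches the blackbox_cas job above.

```yaml
# Hypothetical rule file, referenced from rule_files in prometheus.yml.
groups:
  - name: blackbox_cas_alerts
    rules:
      # Fires when the earliest certificate expiry is less than 30 days away.
      - alert: TLSCertificateExpiringSoon
        expr: probe_ssl_earliest_cert_expiry{job="blackbox_cas"} - time() < 30 * 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate on {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"
      # Fires when the TLS connection to the endpoint keeps failing.
      - alert: TLSEndpointProbeFailed
        expr: probe_success{job="blackbox_cas"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Blackbox probe of {{ $labels.instance }} is failing"
```

The 30-day threshold is only a starting point; it should leave enough time to issue and roll out a new certificate before the old one expires.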

Summary

To visualize the metrics we can use the dashboards from the Grafana website, like: https://grafana.com/grafana/dashboards/7587 or https://grafana.com/grafana/dashboards/11529

Monitoring TLS/SSL certificates is not sufficient on its own for a high availability or production environment, but it should be part of any monitoring system. No one likes to be woken up in the middle of the night because the entire production environment is down due to an expired certificate. Make sure that your monitoring system displays certificate expiry dates and alerts on them well in advance.

Mario Nugnes


DevOps Engineer

Mario Nugnes is a mid-level DevOps Engineer with extensive experience at both large and small companies. He has worked on both complex and relatively simple environments and is keen to constantly improve himself and the systems he works on. His main areas of expertise are Prometheus, Cassandra and Ansible, and he also has experience with Kafka and Elastic.
