Using PgBouncer with Consul for PostgreSQL high availability

April 29, 2022
Sasha Aliashkevich

If you work with PostgreSQL you have most probably heard of PgBouncer. It’s a lightweight external connection pooler that allows PostgreSQL to serve more client connections using fewer server-side backend processes.

Normally, to handle a large number of clients, PgBouncer is installed on the same host as the PostgreSQL server; this way it can serve more real clients than the max_connections setting in PostgreSQL would allow. There are a few cases, though, when you would want to have PgBouncer on the client side.

The first one is when an application needs to establish connections frequently. Connecting to PostgreSQL is an expensive operation, so if you want to reduce connection latency it’s a good idea to install PgBouncer on the same host or container as the application. The client then connects to PgBouncer, which hands out an already established server connection, reducing the connection time.
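As a rough illustration, here is a micro-benchmark sketch; the hostnames, port, and user below are made up for the example and assume a local PgBouncer pooling connections to the same database:

# Rough sketch: compare connection latency directly vs. via a local
# PgBouncer (hypothetical hosts and credentials).
import time
import psycopg2

def time_connect(**kw):
    t0 = time.monotonic()
    psycopg2.connect(**kw).close()
    return time.monotonic() - t0

direct = dict(host="db.example.com", port=5432, dbname="postgres", user="app_user")
pooled = dict(host="localhost", port=6432, dbname="postgres", user="app_user")

time_connect(**pooled)  # the first pooled connect pays the full price and warms the pool
print(f"direct: {time_connect(**direct):.3f}s")
print(f"pooled: {time_connect(**pooled):.3f}s")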

Another case might not be that obvious. When a database restarts or a failover happens, the established client connections go dead, so the application has to reconnect to the database once its current connection drops. Unfortunately, not every application is written with this in mind, so having PgBouncer between your database and your application can help handle this kind of issue.

In our previous article we described how to leverage Consul DNS to handle automatic failover of application connections.

Let’s change our demo a little so that the application doesn’t have any reconnect logic in it, and test how it behaves with a direct connection to PostgreSQL and via PgBouncer. You can find the entire working example in the pgbouncer branch of our postgresql-consul-demo project.

So here is our app:

#!/usr/bin/env python3

import logging
import time
import psycopg2
import sys

logging.basicConfig(format='%(asctime)s %(levelname)s %(message)s', stream=sys.stdout, level=logging.DEBUG)

class DB:

    def __init__(self):
        logging.info("Starting.")
        # Retry until the first connection succeeds.
        while True:
            try:
                self.connect()
                break
            except Exception as e:
                logging.error(e)
                time.sleep(2)

    def connect(self):
        logging.info('Connecting.')
        # An empty DSN makes psycopg2/libpq read the connection settings
        # from the standard environment variables (PGHOST, PGPORT, ...).
        self.conn = psycopg2.connect("")
        self.conn.set_session(autocommit=True)

    def connected_to(self):
        try:
            cur = self.conn.cursor()
            cur.execute(("SELECT inet_server_addr(), "
                         "CASE WHEN pg_is_in_recovery()"
                         " THEN 'replica'"
                         " ELSE 'primary'"
                         "END"))
            res = cur.fetchall()
            cur.close()
            return res[0]
        except Exception as e:
            logging.error(e)
            return (None, None)


db = DB()

while True:
    db_ip, role = db.connected_to()
    msg = f"I'm connected to {db_ip} and it's {role}"
    logging.info(msg)
    time.sleep(1)

On start it connects to PostgreSQL, and then every second it executes a query to fetch the IP address of the database it’s connected to and the role of that instance, i.e. whether it’s the primary or a replica.
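Note the empty connection string: psycopg2 (via libpq) picks the connection settings up from environment variables. Spelled out explicitly, the direct connection would look roughly like this; the exact values are assumptions borrowed from the PgBouncer config shown later:

# Hypothetical explicit equivalent of psycopg2.connect(""):
# in the demo these settings come from libpq environment variables
# (PGHOST, PGPORT, PGDATABASE, PGUSER, ...).
conn = psycopg2.connect(
    host="master.pglab.service.consul",  # Consul DNS name of the current primary
    port=5432,
    dbname="postgres",
    user="app_user",
)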

When we connect it via Consul DNS straight to PostgreSQL and then at some point perform a database switchover, it produces the following log:

11:10:02,171 INFO I'm connected to 172.19.0.6 and it's primary
11:10:03,177 INFO I'm connected to 172.19.0.6 and it's primary
11:10:04,179 INFO I'm connected to 172.19.0.6 and it's primary
11:10:05,180 ERROR terminating connection due to administrator command
SSL connection has been closed unexpectedly
server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.
11:10:05,180 INFO I'm connected to None and it's None
11:10:06,181 ERROR connection already closed
11:10:06,181 INFO I'm connected to None and it's None
11:10:07,181 ERROR connection already closed
11:10:07,181 INFO I'm connected to None and it's None

Since our application has no reconnect logic, its connection remains dead after the database switchover, even though Consul DNS already points to the proper PostgreSQL primary host.
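For comparison, this is roughly the reconnect logic the app would need without PgBouncer; a minimal sketch under the demo’s setup, not code from the project:

# A minimal sketch of the reconnect logic our app deliberately lacks:
# when the connection dies, drop it and dial again so that libpq
# re-resolves the Consul DNS name and reaches the new primary.
def connected_to_with_retry(db):
    db_ip, role = db.connected_to()
    if db_ip is None:  # the query failed, the connection is likely dead
        try:
            db.conn.close()
        except Exception:
            pass
        db.connect()  # may block or raise while the new primary is promoted
        db_ip, role = db.connected_to()
    return (db_ip, role)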

Now let’s connect it to PgBouncer listening on localhost port 6432 and do the same:

11:13:49,857 INFO I'm connected to 172.19.0.6 and it's primary
11:13:50,860 INFO I'm connected to 172.19.0.6 and it's primary
11:13:51,861 INFO I'm connected to 172.19.0.6 and it's primary
11:13:53,448 INFO I'm connected to 172.19.0.5 and it's primary
11:13:54,450 INFO I'm connected to 172.19.0.5 and it's primary
11:13:55,452 INFO I'm connected to 172.19.0.5 and it's primary

Wow, this time we see no errors on the application side: the IP of the database has changed, and the role remained primary. Our application survived the database switchover without any extra handling.
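The application itself is unchanged; only the connection target differs. In the explicit form from above, that is roughly:

# Hypothetical: the same app pointed at the local PgBouncer instead of
# PostgreSQL directly; PgBouncer forwards the traffic to
# master.pglab.service.consul as configured below.
conn = psycopg2.connect(
    host="localhost",  # PgBouncer's listen_addr
    port=6432,         # PgBouncer's listen_port
    dbname="postgres",
    user="app_user",
)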

This is our PgBouncer config file:

[databases]
postgres = host=master.pglab.service.consul port=5432 dbname=postgres
[pgbouncer]
listen_port = 6432
listen_addr = localhost
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
logfile = /var/log/pgbouncer/pgbouncer.log
pidfile = /home/pgbouncer/pgbouncer.pid
admin_users = app_user
pool_mode = transaction
dns_max_ttl = 0
server_login_retry = 0

The first section configures the connection to the remote PostgreSQL primary via Consul DNS. The second section configures PgBouncer itself. The important parameters here are:

  • pool_mode = transaction — the magic only works in transaction or statement pooling mode, since a session is by its nature tied to a single server connection and can’t be swapped to another one
  • dns_max_ttl = 0 — makes PgBouncer refresh its DNS cache on every lookup, which is important when you rely on Consul DNS
  • server_login_retry = 0 — don’t wait before retrying when the connection to the database is lost
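By the way, you can watch PgBouncer swap its server connections during a switchover through the admin console (the config above lists app_user in admin_users); a quick sketch:

# Sketch: inspect PgBouncer's server-side connections through its virtual
# "pgbouncer" admin database; app_user has admin rights per the config above.
import psycopg2

admin = psycopg2.connect(host="localhost", port=6432,
                         dbname="pgbouncer", user="app_user")
admin.autocommit = True  # the admin console doesn't support transactions
cur = admin.cursor()
cur.execute("SHOW SERVERS")  # one row per connection to PostgreSQL
for row in cur.fetchall():
    print(row)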

Let’s change dns_max_ttl to its default value of 15 seconds and see how the behavior changes:

11:44:09,755 INFO I'm connected to 172.19.0.5 and it's primary
11:44:10,757 INFO I'm connected to 172.19.0.5 and it's primary
11:44:11,759 INFO I'm connected to 172.19.0.5 and it's primary
11:44:14,903 INFO I'm connected to 172.19.0.5 and it's replica
11:44:15,905 INFO I'm connected to 172.19.0.5 and it's replica
11:44:16,908 INFO I'm connected to 172.19.0.5 and it's replica

When PgBouncer restores the broken connection to PostgreSQL, its internal DNS cache still holds the old IP address of the primary node. So instead of connecting to the new primary, it connects to the same IP address, which has become a replica by this point.
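You can check what the Consul name resolves to at any moment; assuming the host’s resolver forwards .consul queries to Consul, a one-liner is enough:

# Check which IP master.pglab.service.consul currently points to
# (assumes the system resolver forwards *.consul queries to Consul).
import socket
print(socket.gethostbyname("master.pglab.service.consul"))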

Now let’s return dns_max_ttl to 0 and change server_login_retry to 15:

11:47:15,726 INFO I'm connected to 172.19.0.6 and it's primary
11:47:16,729 INFO I'm connected to 172.19.0.6 and it's primary
11:47:17,732 INFO I'm connected to 172.19.0.6 and it's primary
11:47:33,812 INFO I'm connected to 172.19.0.5 and it's primary
11:47:34,816 INFO I'm connected to 172.19.0.5 and it's primary
11:47:35,818 INFO I'm connected to 172.19.0.5 and it's primary

Note the 15-second gap between the app being connected to the .6 and the .5 address: PgBouncer holds the client’s request for 15 seconds until it reconnects to PostgreSQL.

Summary

As you can see, having PgBouncer on the application side can help you handle PostgreSQL restarts and failovers smoothly, even for applications that are not prepared for them and would otherwise require a restart. It’s worth mentioning that this only works for applications that don’t rely on session state. You can also combine PgBouncer with other HA tools instead of Consul, such as HAProxy.
