Introducing pg_auto_failover: Automated failover for your Postgres database
When you’re building a new app, there is a lot you need to focus on. Behind the scenes, there is data stored somewhere, often in a Postgres database. Data is essential, and it needs to be accessible and available. Most of us don’t want to become database experts early on; we’d rather focus on what gets customers to engage with the product.
At Microsoft we care about enhancing developers’ day-to-day productivity. Earlier this year we released our support for PostgreSQL in Azure Data Studio to provide a modern open source GUI for working with Postgres.
Today, we are excited to introduce pg_auto_failover, an extension for automated failover in Postgres. It takes care of keeping your Postgres database available, so you don’t have to.
pg_auto_failover is focused on simple, automated failover built on Postgres streaming replication, without any external third-party dependencies. Here is how it works:
As you can see, there are three major components: the primary, the secondary, and the monitor. Each component runs a Postgres server, as well as a keeper process for orchestration. The application runs SQL queries directly against the primary, with the secondary specified as a fallback connection in the client.
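In practice, that fallback is just a multi-host connection string. As a preview, here is a minimal sketch using the ports we pick later in this walkthrough (pg_autoctl can generate the exact URL for you, as we’ll see below):

psql 'postgres://127.0.0.1:6010,127.0.0.1:6011/?target_session_attrs=read-write'

With target_session_attrs=read-write, the client tries each host in turn and settles on the one that accepts writes, which is the current primary.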
To keep it simple, we’ll install all three components on the same Linux VM. We’ll set up the primary and secondary Postgres servers and use pg_auto_failover to replicate data between them. Then we’ll simulate a failure on the primary and watch the system switch (fail over) to the secondary.
This release is focused on Linux, and we provide packages for both RPM- and Debian-based distributions. We support Postgres 10 and Postgres 11. In this example, we install the pg_auto_failover package for Postgres 11 on an RPM-based Linux distribution:
# Enable the package repository that distributes pg_auto_failover
curl https://install.citusdata.com/community/rpm.sh | sudo bash

# Install pg_auto_failover version 1.0 for Postgres 11
sudo yum install -y pg-auto-failover10_11
Let’s start by setting up the monitor. It is the first component to run. It periodically attempts to contact the other nodes and watches their health. It also maintains global state that keepers on each node consult to determine their own roles in the system.
We use the “pg_autoctl” command to control pg_auto_failover:
sudo su - postgres
export PATH="$PATH:/usr/pgsql-11/bin"

pg_autoctl create monitor \
    --pgdata ./monitor \
    --pgport 6000 \
    --nodename `hostname --fqdn`
Next, we’ll create the primary database. To simulate what happens when a node runs out of disk space, we’ll store the primary node’s data files in a small temporary filesystem.
# Create intentionally small disk for node A
sudo mkdir /mnt/node_a
sudo mount -t tmpfs -o size=400m tmpfs /mnt/node_a
sudo mkdir /mnt/node_a/data
sudo chown postgres -R /mnt/node_a

# Initialize Postgres on that disk
pg_autoctl create postgres \
    --pgdata /mnt/node_a/data \
    --pgport 6010 \
    --nodename 127.0.0.1 \
    --pgctl `which pg_ctl` \
    --monitor postgres://autoctl_node@127.0.0.1:6000/pg_auto_failover
In the example above, the keeper creates a primary database. It chooses to set up node A as primary because the monitor reports there is no primary registered yet. This is one example of how the keeper is state based: it makes observations and then adjusts its state, in this case from “init” to “single.”
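If you’re curious, the monitor keeps this state in regular Postgres tables, so you can watch the state machine with plain SQL. Here is a minimal sketch, assuming the monitor database name from the connection URL above (pg_auto_failover) and that this version of the extension stores node state in a pgautofailover.node table; check the schema of your installed version:

# Inspect the monitor's view of every registered node
psql -p 6000 -d pg_auto_failover -c 'select * from pgautofailover.node;'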
At this point the monitor and primary nodes are running. Now we need to run the keeper. It’s an independent process so that it can continue operating even if the Postgres primary goes down:
pg_autoctl run --pgdata /mnt/node_a/data
This will remain running in the terminal, outputting logs. We can open another terminal and start a secondary database the same way we created the primary:
pg_autoctl create postgres \
    --pgdata ./node_b \
    --pgport 6011 \
    --nodename 127.0.0.1 \
    --pgctl `which pg_ctl` \
    --monitor postgres://autoctl_node@127.0.0.1:6000/pg_auto_failover

pg_autoctl run --pgdata ./node_b
All that differs here is that we’re running it on another port and pointing it at another data directory. It discovers from the monitor that a primary exists, and then uses Postgres streaming replication to catch up.
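Because this is ordinary Postgres streaming replication under the hood, the standard monitoring views apply. As a quick sanity check, assuming the ports from this walkthrough, we can ask the primary which replicas are streaming from it:

# On the primary, list connected streaming replicas
psql -p 6010 -c 'select application_name, state, sync_state from pg_stat_replication;'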
First, let’s verify that the monitor knows about our nodes, and see what states it has assigned them:
pg_autoctl show state --pgdata ./monitor

     Name |  Port | Group |  Node |   Current State |  Assigned State
----------+-------+-------+-------+-----------------+----------------
127.0.0.1 |  6010 |     0 |     1 |         primary |         primary
127.0.0.1 |  6011 |     0 |     2 |       secondary |       secondary
This looks good. We can add data to the primary and watch it be reflected in the secondary.
# Add data to primary
psql -p 6010 -c 'create table foo as select generate_series(1,1000000) bar;'

# Query secondary
psql -p 6011 -c 'select count(*) from foo;'
  count
---------
 1000000
Let’s make it interesting and introduce a problem. We’ll run the primary out of disk space and watch the secondary get promoted. In one terminal let’s keep an eye on events:
watch pg_autoctl show events --pgdata ./monitor
In another terminal we’ll consume node A’s disk space and try to restart the database. It will refuse to start up.
pg_ctl -D /mnt/node_a/data stop &&
dd if=/dev/zero of=/mnt/node_a/bigfile bs=300MB count=1

# Confirm there is no free disk space
df /mnt/node_a
Filesystem     1K-blocks   Used Available Use% Mounted on
tmpfs             409600 409600         0 100% /mnt/node_a
After a few failed attempts to restart node A, its keeper signals that the node is unhealthy and the node is put into the “demoted” state. The monitor promotes node B to be the new primary.
pg_autoctl show state --pgdata ./monitor

     Name |  Port | Group |  Node |   Current State |  Assigned State
----------+-------+-------+-------+-----------------+----------------
127.0.0.1 |  6010 |     0 |     1 |         demoted |      catchingup
127.0.0.1 |  6011 |     0 |     2 |    wait_primary |    wait_primary
Node B is not considered in full “primary” state since there is no secondary present. It is marked as “wait_primary” until a secondary appears.
Let’s say we want to connect to this cluster. We could use a proxy, but that is complex to set up. What if the client could handle this for us?
Starting with Postgres 10, clients like psql, and application frameworks based on libpq, can attempt connections to more than one database server. This works by listing all the known servers in the connection string and adding the “target_session_attrs” parameter: with target_session_attrs=read-write, the client keeps trying hosts until it finds one that accepts writes. pg_autoctl provides a helper that gives us the correct connection URL:
pg_autoctl show uri --formation default --pgdata ./monitor

psql 'postgres://127.0.0.1:6010,127.0.0.1:6011/?target_session_attrs=read-write'
When nodes A and B were both running, psql connected to node A, because node B was read-only. Now that A is offline and B is writable, psql will connect to B. If we free the disk space on node A, it will become available again as a secondary, as sketched below.
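A minimal sketch of that recovery, reusing the paths from this walkthrough:

# Free node A's disk so Postgres can start again
rm /mnt/node_a/bigfile

# If node A's keeper is no longer running, start it again;
# the node rejoins the cluster as a secondary and catches up
pg_autoctl run --pgdata /mnt/node_a/data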
That’s it! You now have a simple Postgres setup that stays available automatically.
You can start using pg_auto_failover today. Microsoft is releasing the source code under the Postgres license. You can find the source and documentation on GitHub.
If you are looking for a managed offering, or would like to scale out using the Citus extension, we recommend Azure Database for PostgreSQL. We take care of high availability, disaster recovery, and more on your behalf.
Questions or feedback? Please let us know in the comments below.