Incremental backups with rsync and hard links
In this post I am going to describe a way to build a simple incremental backup solution using rsync and hard links. You may already be familiar with rsync, but for anyone who is not, rsync is a command-line tool commonly used on Linux and other UNIX-like operating systems to copy and synchronise directories. I will assume some prior knowledge of rsync in this post, so if you have not used it before there may be some parts that confuse you!
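As a very quick refresher, a basic rsync run copies the contents of a source directory into a destination, transferring only the files that are new or have changed since the last run. The paths below are just placeholders:
# Synchronise /source/dir into /dest/dir, preserving permissions,
# ownership and timestamps (-a), with verbose output (-v)
rsync -av /source/dir/ /dest/dir/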
Before getting to rsync itself, let’s look at how hard links behave. Here is a directory containing five files; the first column of ls -li shows each file’s inode number:
[user1@backupbox dir1]$ ls -li
total 128
33839002 -rw-rw-r--. 1 user1 user1 12942 Oct 2 16:14 file1
33839003 -rw-rw-r--. 1 user1 user1 14106 Oct 2 16:14 file2
33839004 -rw-rw-r--. 1 user1 user1 19360 Oct 2 16:14 file3
33839005 -rw-rw-r--. 1 user1 user1 17093 Oct 2 16:14 file4
33839006 -rw-rw-r--. 1 user1 user1 16094 Oct 2 16:14 file5
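The transcript doesn’t show the commands used, but hard links are created with ln and symbolic links with ln -s, along these lines:
# Create hard links that share the same inodes as file1 and file2
ln file1 hardlink1
ln file2 hardlink2
# Create symbolic links, which are separate files that just store the target name
ln -s file1 symlink1
ln -s file2 symlink2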
Listing the directory again after creating the links, notice that each hard link shares its inode number with the original file and that the link count (the number after the permissions) has increased to 2, while the symbolic links have inodes of their own:
[user1@backupbox dir1]$ ls -li
total 64
33839002 -rw-r--r--. 2 user1 user1 12942 Oct 2 16:14 file1
33839003 -rw-r--r--. 2 user1 user1 14106 Oct 2 16:14 file2
33839002 -rw-r--r--. 2 user1 user1 12942 Oct 2 16:14 hardlink1
33839003 -rw-r--r--. 2 user1 user1 14106 Oct 2 16:14 hardlink2
33695760 lrwxrwxrwx. 1 user1 user1 5 Oct 2 16:15 symlink1 -> file1
33695762 lrwxrwxrwx. 1 user1 user1 5 Oct 2 16:15 symlink2 -> file2
Because hard links share a single inode, they also share the file’s contents. In the next example file1 is a small text file with one hard link, hardlink1:
[user1@backupbox dir1]$ ls -li
total 8
33839002 -rw-r--r--. 2 user1 user1 47 Oct 2 16:19 file1
33839002 -rw-r--r--. 2 user1 user1 47 Oct 2 16:19 hardlink1
[user1@backupbox dir1]$ cat file1
This is file1
[user1@backupbox dir1]$ cat hardlink1
This is file1
[user1@backupbox dir1]$ echo "an extra line" >>file1
[user1@backupbox dir1]$ cat file1
This is file1
an extra line
[user1@backupbox dir1]$ cat hardlink1
This is file1
an extra line
[user1@backupbox dir1]$ echo "another extra line" >>hardlink1
[user1@backupbox dir1]$ cat file1
This is file1
an extra line
another extra line
[user1@backupbox dir1]$ cat hardlink1
This is file1
an extra line
another extra line
Appending through either name modifies the same underlying file. Ownership and permissions also live in the inode, so changing them through one name is reflected in the other:
[user1@backupbox dir1]$ ls -li
total 8
33839002 -rw-r--r--. 2 user1 user1 47 Oct 2 16:19 file1
33839002 -rw-r--r--. 2 user1 user1 47 Oct 2 16:19 hardlink1
[user1@backupbox dir1]$ sudo chown root.root file1
[user1@backupbox dir1]$ ls -li
total 8
33839002 -rw-r--r--. 2 root root 47 Oct 2 16:19 file1
33839002 -rw-r--r--. 2 root root 47 Oct 2 16:19 hardlink1
[user1@backupbox dir1]$ sudo chmod 0666 hardlink1
[user1@backupbox dir1]$ ls -li
total 8
33839002 -rw-rw-rw-. 2 root root 47 Oct 2 16:19 file1
33839002 -rw-rw-rw-. 2 root root 47 Oct 2 16:19 hardlink1
Deleting one of the names only removes that directory entry; the data remains on disk for as long as at least one hard link to the inode exists. A symbolic link, by contrast, just stores the target name, so it breaks when that name is removed (the ownership and permissions below have been reset between examples):
[user1@backupbox dir1]$ ls -li
total 8
33839002 -rw-r--r--. 2 user1 user1 47 Oct 2 16:19 file1
33839002 -rw-r--r--. 2 user1 user1 47 Oct 2 16:19 hardlink1
33695760 lrwxrwxrwx. 1 user1 user1 5 Oct 2 16:15 symlink1 -> file1
[user1@backupbox dir1]$ rm -f file1
[user1@backupbox dir1]$ ls -li
total 4
33839002 -rw-r--r--. 1 user1 user1 47 Oct 2 16:19 hardlink1
33695760 lrwxrwxrwx. 1 user1 user1 5 Oct 2 16:15 symlink1 -> file1
[user1@backupbox dir1]$ cat hardlink1
This is file1
an extra line
another extra line
[user1@backupbox dir1]$ cat symlink1
cat: symlink1: No such file or directory
Even though the original file1 is gone, we can still create a new hard link from the surviving name, and the data lives on:
[user1@backupbox dir1]$ ls -li
total 4
33839002 -rw-r--r--. 1 user1 user1 47 Oct 2 16:19 hardlink1
[user1@backupbox dir1]$ ln hardlink1 newlink1
[user1@backupbox dir1]$ ls -li
total 8
33839002 -rw-r--r--. 2 user1 user1 47 Oct 2 16:19 hardlink1
33839002 -rw-r--r--. 2 user1 user1 47 Oct 2 16:19 newlink1
[user1@backupbox dir1]$ rm hardlink1
[user1@backupbox dir1]$ ls -li
total 4
33839002 -rw-r--r--. 1 user1 user1 47 Oct 2 16:19 newlink1
[user1@backupbox dir1]$ cat newlink1
This is file1
an extra line
another extra line
Now back to rsync. A simple way to back up a directory from a remote server is to mirror it into a local directory:
rsync -av --delete server1:/home/data/ /backup/server1/
We would then run the same command again each time we wanted to update the mirror with the latest changes from the server.
To implement a basic incremental backup system we might consider making a local copy of the previous backup before starting the rsync:
[user1@backupbox dir1]$ cp -a /backup/server1/ /backup/server1Old/
Then we update our mirror from the remote server:
[user1@backupbox dir1]$ rsync -av --delete server1:/home/data/ /backup/server1/
Obviously this isn’t very efficient in either time or space, so we could improve on it by using hard links instead, which can be done by adding the -l argument to the cp command:
# Create a hard-linked clone of the current backup
cp -al /backup/server1 /backup/server1Old
# update our mirror from the remote server
rsync -av --delete server1:/home/data/ /backup/server1/
To improve things further we can use a feature of rsync which enables us to efficiently create hard-linked copies of a directory’s contents, with only the changed files taking up additional space on disk. The rsync feature we need is the --link-dest argument. Suppose we have the following three paths:
server1:/home/data: Remote source directory
/backup/server1New: Destination for a new backup. Does not yet exist
/backup/server1Old: Existing previous backup
What we want is a complete new backup under /backup/server1New in which only files that have changed since the previous backup are actually transferred and stored, while unchanged files become hard links to the copies already in /backup/server1Old. This is exactly what the --link-dest argument does for us. It performs a normal rsync from server1:/home/data to /backup/server1New, but for each file it also checks the same relative path under /backup/server1Old. If the file there is identical to the one on the remote server, rsync creates a hard link from /backup/server1Old into /backup/server1New instead of copying the data again.
To use this we just add the “old” directory as the --link-dest argument to our rsync command:
rsync -av --link-dest /backup/server1Old server1:/home/data/ /backup/server1New/
Here we can see the old backup directory’s contents:
[user1@backupbox ~]$ ls -lRi /backup/server1Old/
/backup/server1Old/:
total 0
68876 drwxrwxr-x. 3 user1 user1 53 Oct 2 17:30 files
/backup/server1Old/files:
total 72
33651935 drwxrwxr-x. 2 user1 user1 42 Oct 2 17:30 bar
68882 -rw-rw-r--. 1 user1 user1 28883 Oct 2 17:30 foo1
68883 -rw-rw-r--. 1 user1 user1 27763 Oct 2 17:30 foo2
68884 -rw-rw-r--. 1 user1 user1 10487 Oct 2 17:30 foo3
/backup/server1Old/files/bar:
total 76
33695759 -rw-rw-r--. 1 user1 user1 32603 Oct 2 17:30 bar1
33838984 -rw-rw-r--. 1 user1 user1 15318 Oct 2 17:30 bar2
33839003 -rw-rw-r--. 1 user1 user1 26122 Oct 2 17:30 bar3
On the server we then modify a file:
[user1@server1 files]$ echo "Hello world" >/home/data/files/foo3
Now we run our incremental backup command:
[user1@backupbox ~]$ rsync -av --link-dest=/backup/server1Old server1:/home/data/ /backup/server1New/
receiving incremental file list
created directory /backup/server1New
files/foo3
sent 136 bytes received 272 bytes 816.00 bytes/sec
total size is 130,701 speedup is 320.35
We can see from the rsync output that only the changed file has been copied, but if we list the contents of the new directory we can see that it contains all of the files. The unchanged files keep the same inode numbers as the old backup and now have a link count of 2, while the modified foo3 has a new inode of its own:
[user1@backupbox ~]$ ls -lRi /backup/server1New/
/backup/server1New/:
total 0
101051460 drwxrwxr-x. 3 user1 user1 53 Oct 2 17:30 files
/backup/server1New/files:
total 64
68885 drwxrwxr-x. 2 user1 user1 42 Oct 2 17:30 bar
68882 -rw-rw-r--. 2 user1 user1 28883 Oct 2 17:30 foo1
68883 -rw-rw-r--. 2 user1 user1 27763 Oct 2 17:30 foo2
101051461 -rw-rw-r--. 1 user1 user1 12 Oct 2 17:40 foo3
/backup/server1New/files/bar:
total 76
33695759 -rw-rw-r--. 2 user1 user1 32603 Oct 2 17:30 bar1
33838984 -rw-rw-r--. 2 user1 user1 15318 Oct 2 17:30 bar2
33839003 -rw-rw-r--. 2 user1 user1 26122 Oct 2 17:30 bar3
Using du we can also see the space saving. du counts each hard-linked file only once (here it attributes the shared files to server1New, the first directory the shell glob expands to), so the two backups together come to just 152K, only slightly more than a single full copy; the extra 12K reported against server1Old is essentially just the old version of the changed file:
[user1@backupbox ~]$ du -chs /backup/server1*
140K /backup/server1New
12K /backup/server1Old
152K total
Here is an example script that can be used to create daily incremental backups of a directory. Each backup is stored in a directory named after today’s date, and the script looks for yesterday’s backup to use as the base for the hard links:
#!/bin/bash
# The source path to backup. Can be local or remote.
SOURCE=servername:/source/dir/
# Where to store the incremental backups
DESTBASE=/backup/servername_data
# Where to store today's backup
DEST="$DESTBASE/$(date +%Y-%m-%d)"
# Where to find yesterday's backup
YESTERDAY="$DESTBASE/$(date -d yesterday +%Y-%m-%d)/"
# Use yesterday's backup as the incremental base if it exists
if [ -d "$YESTERDAY" ]
then
    OPTS="--link-dest $YESTERDAY"
fi
# Run the rsync
rsync -av $OPTS "$SOURCE" "$DEST"
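To take a backup automatically each night you could run the script from cron. Assuming it has been saved as /usr/local/bin/incremental-backup.sh (a name chosen purely for illustration), a crontab entry like this would run it at 2am every day:
# m h dom mon dow command
0 2 * * * /usr/local/bin/incremental-backup.sh >>/var/log/incremental-backup.log 2>&1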
Kafka Connect gotcha – SSL
We recently deployed a Kafka Connect environment to consume Avro messages from a topic and write them into an Oracle database. Everything seemed to be functioning just fine until we got a message from the team saying their connectors had suddenly stopped working.
On further investigation we found errors like this in the Kafka Connect logs:
2020-01-17 12:56:48 ERROR Uncaught exception in thread 'kafka-producer-network-thread | producer-25':
java.lang.OutOfMemoryError: Java heap space
2020-01-17 13:02:54 ERROR Uncaught exception in thread 'kafka-producer-network-thread | producer-57':
java.lang.OutOfMemoryError: Direct buffer memory
Our first thought was that Kafka Connect just needed more heap space, so we increased it from the defaults (256MB-1GB) up to a fixed 8GB heap, but the errors kept coming. We increased it further, up to 20GB, and the errors were still happening. The machine was receiving one message every few seconds, yet the Kafka Connect process was using around 97% of the RAM and over 80% of the CPU. This machine has 8 CPUs and 32GB of RAM, so clearly something wasn’t right!
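For reference, how the heap is set depends on how Connect is launched; with the stock Apache Kafka scripts it is controlled by the KAFKA_HEAP_OPTS environment variable, along these lines (the properties path is illustrative):
# Give the Kafka Connect worker a fixed 8GB heap before starting it
export KAFKA_HEAP_OPTS="-Xms8G -Xmx8G"
bin/connect-distributed.sh config/connect-distributed.properties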
In this case we were using a custom Kafka Connect plugin to convert messages from the topic into the required format to be inserted into Oracle so the first thought was, do we have a memory leak in our code? We went over and over our plugin code and could not see anywhere that could possibly be leaking memory so we looked back at Kafka Connect itself.
What we could see was that our sink connectors would run until an invalid message was pushed into the topic, at which point the OutOfMemoryError exceptions started appearing in the logs. This made sense as the errors were only ever logged from producer threads and this Kafka Connect instance was only running Sink connectors, so it must be related to pushing the invalid messages to dead letter queues.
A typical connector configuration for this use case looks something like this:
{
  "name": "oracle-sink-test",
  "config": {
    "connector.class": "com.mydomain.PluginClass",
    "connector.type": "sink",
    "tasks.max": "1",
    "topics": "source_topic",
    "topic.type": "avro",
    "connection.user": "DBUserName",
    "connection.password": "DBPassword",
    "connection.url": "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=oracle1)(PORT=9020))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=SINKTEST)))",
    "db.driver": "oracle.jdbc.driver.OracleDriver",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "https://schemaregistry:8443",
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "dlq_sink_test",
    "errors.deadletterqueue.topic.replication.factor": 1,
    "errors.deadletterqueue.context.headers.enable": true
  }
}
As you can see in the example configuration our connectors were configured with dead letter queues, so we tried changing the connectors to errors.tolerance=none and removing the dead letter queue config. As we had hoped, this change meant the connectors would now fail with an error when they encountered an invalid message, but we also observed much lower CPU and RAM usage while the connectors were running. The next thing we tried was leaving errors.tolerance set to none and putting the dead letter queue config back into the connector. This resulted in the CPU and RAM use going back up again, and instead of failing on an invalid message the connector job would hang and its consumer would eventually get timed out by the broker.
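For completeness, this kind of change can be applied through the Kafka Connect REST API by updating the connector configuration in place. A rough sketch of the first variant we tried is below; the worker address (localhost:8083) is assumed, the unchanged connection and converter settings from the example above are omitted for brevity, and a real PUT must include the full configuration because it replaces it:
# Fail fast on invalid messages: errors.tolerance=none, no dead letter queue settings
curl -X PUT -H "Content-Type: application/json" \
  http://localhost:8083/connectors/oracle-sink-test/config \
  -d '{
        "connector.class": "com.mydomain.PluginClass",
        "connector.type": "sink",
        "tasks.max": "1",
        "topics": "source_topic",
        "errors.tolerance": "none"
      }'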
For a detailed explanation of error handling in Kafka Connect see this blog post which explains it far better than I ever could: https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues/.
So what was going on?
For a hint, here is an example of what the end of our Kafka Connect worker configuration file looked like:
bootstrap.servers=broker1:9095,broker2:9095
security.protocol=SSL
ssl.truststore.location=/path/to/truststore.jks
ssl.truststore.password=truststorepassword
ssl.keystore.location=/path/to/keystore.jks
ssl.keystore.password=keystorepassword
ssl.key.password=keypassword
consumer.bootstrap.servers=broker1:9095,broker2:9095
consumer.security.protocol=SSL
consumer.ssl.truststore.location=/path/to/truststore.jks
consumer.ssl.truststore.password=truststorepassword
consumer.ssl.keystore.location=/path/to/keystore.jks
consumer.ssl.keystore.password=keystorepassword
consumer.ssl.key.password=keypassword
As you can see, we have SSL enabled on the brokers. We didn’t give this much thought at first: there were SSL settings in there, we had no SSL-related errors in the logs, and nothing else suggested that the issue was related to SSL. After much investigation and Googling for answers we found this open issue in the Kafka bug tracker: “JVM runs into OOM if (Java) client uses a SSL port without setting the security protocol” (https://issues.apache.org/jira/browse/KAFKA-4090). This was the hint we needed to fix our problem.
We had configured SSL settings for Kafka Connect’s internal connections and for the consumers, but we had not configured SSL for the producer threads. This was possibly an oversight as we were only running Sink connectors in this environment, but of course there are still producer threads running to push invalid messages to the dead letter queues. Based on the information in KAFKA-4090 we decided to add explicit SSL settings for the producer threads like this:
producer.bootstrap.servers=broker1:9095,broker2:9095
producer.security.protocol=SSL
producer.ssl.truststore.location=/path/to/truststore.jks
producer.ssl.truststore.password=truststorepassword
producer.ssl.keystore.location=/path/to/keystore.jks
producer.ssl.keystore.password=keystorepassword
producer.ssl.key.password=keypassword
After making this change we restarted Kafka Connect and suddenly the CPU use went from 80-90% down to 5% and the RAM use went down from 95% to just the Java heap size plus a little (around 300MB in total as the heap was now set to 256MB-1GB).
We reverted the connector configuration back to errors.tolerance=all with the dead letter queue configured and hey presto! Messages started being consumed from the topic and the invalid messages were correctly pushed to the dead letter queue.
So in summary, if you are seeing unexpected out of memory exceptions in Kafka Connect and you are using SSL to communicate with the brokers, make sure you configure the SSL settings individually for all three types of connection – internal connections, consumers and producers.