In this post I am going to describe a way to build a simple incremental backup solution using rsync and hard links. If you have not come across it before, rsync is a command-line tool commonly used on Linux and other UNIX-like operating systems to copy and synchronise directories. I will assume some prior knowledge of rsync, so if you have not used it before some parts of this post may be hard to follow.
A bit of background
[user1@backupbox dir1]$ ls -li
total 128
33839002 -rw-rw-r--. 1 user1 user1 12942 Oct 2 16:14 file1
33839003 -rw-rw-r--. 1 user1 user1 14106 Oct 2 16:14 file2
33839004 -rw-rw-r--. 1 user1 user1 19360 Oct 2 16:14 file3
33839005 -rw-rw-r--. 1 user1 user1 17093 Oct 2 16:14 file4
33839006 -rw-rw-r--. 1 user1 user1 16094 Oct 2 16:14 file5
[user1@backupbox dir1]$ ls -li
total 64
33839002 -rw-r--r--. 2 user1 user1 12942 Oct 2 16:14 file1
33839003 -rw-r--r--. 2 user1 user1 14106 Oct 2 16:14 file2
33839002 -rw-r--r--. 2 user1 user1 12942 Oct 2 16:14 hardlink1
33839003 -rw-r--r--. 2 user1 user1 14106 Oct 2 16:14 hardlink2
33695760 lrwxrwxrwx. 1 user1 user1 5 Oct 2 16:15 symlink1 -> file1
33695762 lrwxrwxrwx. 1 user1 user1 5 Oct 2 16:15 symlink2 -> file2
[user1@backupbox dir1]$ ls -li
total 8
33839002 -rw-r--r--. 2 user1 user1 47 Oct 2 16:19 file1
33839002 -rw-r--r--. 2 user1 user1 47 Oct 2 16:19 hardlink1
[user1@backupbox dir1]$ cat file1
This is file1
[user1@backupbox dir1]$ cat hardlink1
This is file1
[user1@backupbox dir1]$ echo "an extra line" >>file1
[user1@backupbox dir1]$ cat file1
This is file1
an extra line
[user1@backupbox dir1]$ cat hardlink1
This is file1
an extra line
[user1@backupbox dir1]$ echo "another extra line" >>hardlink1
[user1@backupbox dir1]$ cat file1
This is file1
an extra line
another extra line
[user1@backupbox dir1]$ cat hardlink1
This is file1
an extra line
another extra line
[user1@backupbox dir1]$ ls -li
total 8
33839002 -rw-r--r--. 2 user1 user1 47 Oct 2 16:19 file1
33839002 -rw-r--r--. 2 user1 user1 47 Oct 2 16:19 hardlink1
[user1@backupbox dir1]$ sudo chown root.root file1
[user1@backupbox dir1]$ ls -li
total 8
33839002 -rw-r--r--. 2 root root 47 Oct 2 16:19 file1
33839002 -rw-r--r--. 2 root root 47 Oct 2 16:19 hardlink1
[user1@backupbox dir1]$ sudo chmod 0666 hardlink1
[user1@backupbox dir1]$ ls -li
total 8
33839002 -rw-rw-rw-. 2 root root 47 Oct 2 16:19 file1
33839002 -rw-rw-rw-. 2 root root 47 Oct 2 16:19 hardlink1
[user1@backupbox dir1]$ ls -li
total 8
33839002 -rw-r--r--. 2 user1 user1 47 Oct 2 16:19 file1
33839002 -rw-r--r--. 2 user1 user1 47 Oct 2 16:19 hardlink1
33695760 lrwxrwxrwx. 1 user1 user1 5 Oct 2 16:15 symlink1 -> file1
[user1@backupbox dir1]$ rm -f file1
[user1@backupbox dir1]$ ls -li
total 4
33839002 -rw-r--r--. 1 user1 user1 47 Oct 2 16:19 hardlink1
33695760 lrwxrwxrwx. 1 user1 user1 5 Oct 2 16:15 symlink1 -> file1
[user1@backupbox dir1]$ cat hardlink1
This is file1
an extra line
another extra line
[user1@backupbox dir1]$ cat symlink1
cat: symlink1: No such file or directory
[user1@backupbox dir1]$ ls -li
total 4
33839002 -rw-r--r--. 1 user1 user1 47 Oct 2 16:19 hardlink1
[user1@backupbox dir1]$ ln hardlink1 newlink1
[user1@backupbox dir1]$ ls -li
total 8
33839002 -rw-r--r--. 2 user1 user1 47 Oct 2 16:19 hardlink1
33839002 -rw-r--r--. 2 user1 user1 47 Oct 2 16:19 newlink1
[user1@backupbox dir1]$ rm hardlink1
[user1@backupbox dir1]$ ls -li
total 4
33839002 -rw-r--r--. 1 user1 user1 47 Oct 2 16:19 newlink1
[user1@backupbox dir1]$ cat newlink1
This is file1
an extra line
another extra line
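The sessions above can be reproduced in a scratch directory. The following is a minimal sketch rather than the exact commands used for the listings: the temporary directory and the variable names are my own, and it only checks the three properties shown above (shared inode number, shared contents, and data surviving the removal of one name).

```shell
#!/bin/bash
# Illustrative sketch: runs in a throwaway directory created by mktemp.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

echo "This is file1" > file1
ln file1 hardlink1        # second name for the same inode
ln -s file1 symlink1      # separate inode that stores the path "file1"

# Both hard-linked names report the same inode number.
inode_file=$(ls -i file1 | awk '{print $1}')
inode_link=$(ls -i hardlink1 | awk '{print $1}')
echo "file1 inode: $inode_file, hardlink1 inode: $inode_link"

# Appending through one name is visible through the other.
echo "an extra line" >> file1
linked_content=$(cat hardlink1)

# Removing one name leaves the data reachable through the other,
# while the symbolic link is left dangling.
rm -f file1
surviving_content=$(cat hardlink1)
if [ -e symlink1 ]; then symlink_state=ok; else symlink_state=dangling; fi
echo "symlink1 is $symlink_state"

# Clean up the scratch directory.
cd /
rm -rf "$workdir"
```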
So what does this have to do with rsync and incremental backups?
rsync -av --delete server1:/home/data/ /backup/server1/
We would then run the same command again each time we wanted to update the mirror with the latest changes from the server.
To implement a basic incremental backup system we might consider making a local copy of the previous backup before starting the rsync:
[user1@backupbox dir1]$ cp -a /backup/server1/ /backup/server1Old/
Then we update our mirror from the remote server:
[user1@backupbox dir1]$ rsync -av --delete server1:/home/data/ /backup/server1/
Copying the whole backup on every run is inefficient in both time and disk space. We can improve on this by creating hard links instead of copies, which is done by adding the -l argument to the cp command:
# Create a hard-linked clone of the current backup
cp -al /backup/server1 /backup/server1Old
# update our mirror from the remote server
rsync -av --delete server1:/home/data/ /backup/server1/
To improve things further we can use a feature in rsync which enables us to efficiently create hard-linked copies of a directory’s contents with only the changed files taking up space on disk. The rsync feature we need is the --link-dest argument. The examples that follow use three directories:
server1:/home/data: Remote source directory
/backup/server1New: Destination for a new backup. Does not yet exist
/backup/server1Old: Existing previous backup
The --link-dest argument does this hard linking for us. It performs a normal rsync from server1:/home/data to /backup/server1New, but for each file that does not yet exist in /backup/server1New it looks at the same relative path under /backup/server1Old. If the file in /backup/server1Old is the same as the file on the remote server then, instead of copying it over, rsync creates a hard link to it in /backup/server1New.
To use this we just add the “old” directory as the --link-dest argument to our rsync command:
rsync -av --link-dest /backup/server1Old server1:/home/data/ /backup/server1New/
Here we can see the old backup directory’s contents:
[user1@backupbox ~]$ ls -lRi /backup/server1Old/
/backup/server1Old/:
total 0
68876 drwxrwxr-x. 3 user1 user1 53 Oct 2 17:30 files
/backup/server1Old/files:
total 72
33651935 drwxrwxr-x. 2 user1 user1 42 Oct 2 17:30 bar
68882 -rw-rw-r--. 1 user1 user1 28883 Oct 2 17:30 foo1
68883 -rw-rw-r--. 1 user1 user1 27763 Oct 2 17:30 foo2
68884 -rw-rw-r--. 1 user1 user1 10487 Oct 2 17:30 foo3
/backup/server1Old/files/bar:
total 76
33695759 -rw-rw-r--. 1 user1 user1 32603 Oct 2 17:30 bar1
33838984 -rw-rw-r--. 1 user1 user1 15318 Oct 2 17:30 bar2
33839003 -rw-rw-r--. 1 user1 user1 26122 Oct 2 17:30 bar3
On the server we then modify a file:
[user1@server1 files]$ echo "Hello world" >/home/data/files/foo3
Now we run our incremental backup command:
[user1@backupbox ~]$ rsync -av --link-dest=/backup/server1Old server1:/home/data/ /backup/server1New/
receiving incremental file list
created directory /backup/server1New
files/foo3
sent 136 bytes received 272 bytes 816.00 bytes/sec
total size is 130,701 speedup is 320.35
We can see from the rsync output that only the changed file has been copied but if we list the contents of the new directory we can see it contains all of the files:
[user1@backupbox ~]$ ls -lRi /backup/server1New/
/backup/server1New/:
total 0
101051460 drwxrwxr-x. 3 user1 user1 53 Oct 2 17:30 files
/backup/server1New/files:
total 64
68885 drwxrwxr-x. 2 user1 user1 42 Oct 2 17:30 bar
68882 -rw-rw-r--. 2 user1 user1 28883 Oct 2 17:30 foo1
68883 -rw-rw-r--. 2 user1 user1 27763 Oct 2 17:30 foo2
101051461 -rw-rw-r--. 1 user1 user1 12 Oct 2 17:40 foo3
/backup/server1New/files/bar:
total 76
33695759 -rw-rw-r--. 2 user1 user1 32603 Oct 2 17:30 bar1
33838984 -rw-rw-r--. 2 user1 user1 15318 Oct 2 17:30 bar2
33839003 -rw-rw-r--. 2 user1 user1 26122 Oct 2 17:30 bar3
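Using du we can also see the saving in disk space. du counts a hard-linked file only once and attributes it to the first directory it scans (server1New here), so server1Old shows only the 12K used by its own copy of foo3, and the two backups together take far less space than two full copies would: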
[user1@backupbox ~]$ du -chs /backup/server1*
140K /backup/server1New
12K /backup/server1Old
152K total
Putting it all together
Here is an example script that can be used to create daily incremental backups of a directory. Each backup is stored in a directory named after today’s date and it will look for yesterday’s backup to create the hard links:
#!/bin/bash
# The source path to back up. Can be local or remote.
SOURCE=servername:/source/dir/
# Where to store the incremental backups
DESTBASE=/backup/servername_data
# Where to store today's backup
DEST="$DESTBASE/$(date +%Y-%m-%d)"
# Where to find yesterday's backup (date -d is a GNU date option)
YESTERDAY="$DESTBASE/$(date -d yesterday +%Y-%m-%d)/"
# Use yesterday's backup as the incremental base if it exists.
# An array keeps the path intact even if it contains spaces.
OPTS=()
if [ -d "$YESTERDAY" ]
then
    OPTS=(--link-dest "$YESTERDAY")
fi
# Run the rsync
rsync -av "${OPTS[@]}" "$SOURCE" "$DEST"
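One consequence of looking only for yesterday’s directory is that if a run is missed, no --link-dest is passed the next day and everything is copied in full. A possible variation, which is not part of the script above, is to link against the most recent dated directory that exists. The latest_backup function below is a hypothetical helper written for this illustration, and the dated directory names in the demonstration are made up.

```shell
#!/bin/bash
# Hypothetical helper: print the newest YYYY-MM-DD directory under $1
# whose name sorts before today's name $2, or nothing if there is none.
latest_backup() {
    local base=$1 today=$2 dir name latest=""
    for dir in "$base"/[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/; do
        [ -d "$dir" ] || continue      # the glob matched nothing
        name=$(basename "$dir")
        # Globs expand in sorted order, so the last earlier match wins.
        if [[ "$name" < "$today" ]]; then
            latest=${dir%/}
        fi
    done
    if [ -n "$latest" ]; then
        printf '%s\n' "$latest"
    fi
}

# Demonstration with made-up dates in a scratch directory.
demo=$(mktemp -d)
mkdir "$demo/2023-10-01" "$demo/2023-10-03" "$demo/2023-10-07"
previous=$(latest_backup "$demo" "2023-10-07")
none=$(latest_backup "$demo" "2023-10-01")
echo "link-dest candidate: $previous"
rm -rf "$demo"
```

In the backup script this could replace the YESTERDAY lookup, for example by calling latest_backup "$DESTBASE" "$(date +%Y-%m-%d)" and passing the result to --link-dest when it is not empty.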
Limitations
- Changes in permissions or ownership on a source file mean the file is counted as a new file so it will be copied again even if its contents have not changed. There are options in rsync to control this behaviour.
- If you move or rename a file on the source server it will count as a new file and will be copied in full even if its contents have not changed and it still has the same inode number.
- Directories themselves cannot be hard linked on most filesystems so this is not supported by rsync. For most use cases this is not a problem but if you have an enormous number of directories in the backup they will start to take a noticeable amount of space on the backup disk.
Conclusion
Richard Gooding
Technical Lead