Incremental backups with rsync and hard links

by Richard Gooding



13 Nov, 2020



AWS | DevOps | Insights | Linux

In this post I am going to describe a way to build a simple incremental backup solution using rsync and hard links. You may already be familiar with rsync but for anyone who is not, rsync is a command-line tool commonly used on Linux and other UNIX-like operating systems to copy and synchronise directories. I will assume some prior knowledge of rsync in this post so if you have not used it before there may be some parts that confuse you!

A bit of background

Before we go into the details you should understand how files are stored on the filesystem and how hard links work.

Every file and directory is represented in the filesystem by an inode, and each inode is identified by an inode number which is the filesystem’s internal identity for the file. If you run ls -li in a directory you can see the inode numbers listed on the left:

[user1@backupbox dir1]$ ls -li
total 128
33839002 -rw-rw-r--. 1 user1 user1 12942 Oct  2 16:14 file1
33839003 -rw-rw-r--. 1 user1 user1 14106 Oct  2 16:14 file2
33839004 -rw-rw-r--. 1 user1 user1 19360 Oct  2 16:14 file3
33839005 -rw-rw-r--. 1 user1 user1 17093 Oct  2 16:14 file4
33839006 -rw-rw-r--. 1 user1 user1 16094 Oct  2 16:14 file5

A “file” as we see it by path and filename is in fact a reference to the inode and is often referred to as a “link”. When you create a hard link from one file to another you are creating a separate reference (link) from a new filename to the same inode number. This is different from a “soft” or “symbolic” link (symlink) which is a reference from one location to another path in the filesystem. You can see the difference in the output of ls -li:

[user1@backupbox dir1]$ ls -li
total 64
33839002 -rw-r--r--. 2 user1 user1 12942 Oct  2 16:14 file1
33839003 -rw-r--r--. 2 user1 user1 14106 Oct  2 16:14 file2
33839002 -rw-r--r--. 2 user1 user1 12942 Oct  2 16:14 hardlink1
33839003 -rw-r--r--. 2 user1 user1 14106 Oct  2 16:14 hardlink2
33695760 lrwxrwxrwx. 1 user1 user1     5 Oct  2 16:15 symlink1 -> file1
33695762 lrwxrwxrwx. 1 user1 user1     5 Oct  2 16:15 symlink2 -> file2
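
A listing like the one above can be reproduced with the ln command. The following is a minimal sketch rather than the exact commands used for this post: it assumes GNU coreutils (for stat -c) and works in a scratch directory created with mktemp, and the file names are only illustrative.

```shell
#!/bin/bash
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo "This is file1" > file1

# Hard link: a second name for the same inode.
ln file1 hardlink1
# Symbolic link: a separate inode whose content is the path "file1".
ln -s file1 symlink1

# %i = inode number, %h = number of hard links, %N = name (with symlink target)
stat -c '%i %h %N' file1 hardlink1 symlink1

# Capture the values so they can be compared.
inode_file=$(stat -c %i file1)
inode_hard=$(stat -c %i hardlink1)
inode_sym=$(stat -c %i symlink1)   # stat does not follow symlinks by default
links=$(stat -c %h file1)
```

The hard link reports the same inode number as file1 and a link count of 2, while the symlink has an inode of its own.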

When you edit the original file the changes are also visible in the hard-linked version:

[user1@backupbox dir1]$ ls -li
total 8
33839002 -rw-r--r--. 2 user1 user1 47 Oct  2 16:19 file1
33839002 -rw-r--r--. 2 user1 user1 47 Oct  2 16:19 hardlink1
[user1@backupbox dir1]$ cat file1
This is file1
[user1@backupbox dir1]$ cat hardlink1
This is file1
[user1@backupbox dir1]$ echo "an extra line" >>file1
[user1@backupbox dir1]$ cat file1
This is file1
an extra line
[user1@backupbox dir1]$ cat hardlink1
This is file1
an extra line

And if you edit the hard-linked file the changes are seen in the original file:

[user1@backupbox dir1]$ echo "another extra line" >>hardlink1
[user1@backupbox dir1]$ cat file1
This is file1
an extra line
another extra line
[user1@backupbox dir1]$ cat hardlink1
This is file1
an extra line
another extra line

Changing the ownership and permissions also affects both files:

[user1@backupbox dir1]$ ls -li
total 8
33839002 -rw-r--r--. 2 user1 user1 47 Oct  2 16:19 file1
33839002 -rw-r--r--. 2 user1 user1 47 Oct  2 16:19 hardlink1
[user1@backupbox dir1]$ sudo chown root.root file1
[user1@backupbox dir1]$ ls -li
total 8
33839002 -rw-r--r--. 2 root  root  47 Oct  2 16:19 file1
33839002 -rw-r--r--. 2 root  root  47 Oct  2 16:19 hardlink1
[user1@backupbox dir1]$ sudo chmod 0666 hardlink1
[user1@backupbox dir1]$ ls -li
total 8
33839002 -rw-rw-rw-. 2 root  root  47 Oct  2 16:19 file1
33839002 -rw-rw-rw-. 2 root  root  47 Oct  2 16:19 hardlink1

Now if we delete the original file we will see that the hard link still exists and the file content remains intact. In contrast a symlink pointing to the original file will no longer be valid:

[user1@backupbox dir1]$ ls -li
total 8
33839002 -rw-r--r--. 2 user1 user1 47 Oct  2 16:19 file1
33839002 -rw-r--r--. 2 user1 user1 47 Oct  2 16:19 hardlink1
33695760 lrwxrwxrwx. 1 user1 user1  5 Oct  2 16:15 symlink1 -> file1
[user1@backupbox dir1]$ rm -f file1
[user1@backupbox dir1]$ ls -li
total 4
33839002 -rw-r--r--. 1 user1 user1 47 Oct  2 16:19 hardlink1
33695760 lrwxrwxrwx. 1 user1 user1  5 Oct  2 16:15 symlink1 -> file1
[user1@backupbox dir1]$ cat hardlink1
This is file1
an extra line
another extra line
[user1@backupbox dir1]$ cat symlink1
cat: symlink1: No such file or directory

We can even create another hard link and delete the existing one and the data still remains intact:

[user1@backupbox dir1]$ ls -li
total 4
33839002 -rw-r--r--. 1 user1 user1 47 Oct  2 16:19 hardlink1
[user1@backupbox dir1]$ ln hardlink1 newlink1
[user1@backupbox dir1]$ ls -li
total 8
33839002 -rw-r--r--. 2 user1 user1 47 Oct  2 16:19 hardlink1
33839002 -rw-r--r--. 2 user1 user1 47 Oct  2 16:19 newlink1
[user1@backupbox dir1]$ rm hardlink1
[user1@backupbox dir1]$ ls -li
total 4
33839002 -rw-r--r--. 1 user1 user1 47 Oct  2 16:19 newlink1
[user1@backupbox dir1]$ cat newlink1
This is file1
an extra line
another extra line

When you delete a file using the rm command, or any other method, what you are actually doing is just removing the link to the inode. This is why the function to delete a file in languages such as C and PHP is called “unlink”. When all links to an inode have been removed, and no process still has the file open, the filesystem frees the inode and its data. As long as there is at least one link pointing to it the inode and the data will remain intact.
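
The link count that ls shows in its second column can also be read directly, which makes the effect of rm easy to observe. This is a small sketch assuming GNU stat; the file names are made up for the example.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo "some data" > original
count_before=$(stat -c %h original)    # one name points at the inode

ln original second_name
count_linked=$(stat -c %h original)    # two names point at the inode

rm original                            # removes one name, not the data
count_after=$(stat -c %h second_name)  # one name remains
content=$(cat second_name)

echo "links: $count_before -> $count_linked -> $count_after"
echo "content still readable: $content"
```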

So what does this have to do with rsync and incremental backups?

Let’s say we want to create a mirror of a remote directory /home/data from a server named server1 into a local directory /backup/server1. Typically we would do something like this:

rsync -av --delete server1:/home/data/ /backup/server1/

We would then run the same command again each time we wanted to update the mirror with the latest changes from the server.

To implement a basic incremental backup system we might consider making a local copy of the previous backup before starting the rsync:

[user1@backupbox dir1]$ cp -a /backup/server1/ /backup/server1Old/

Then we update our mirror from the remote server:

[user1@backupbox dir1]$ rsync -av --delete server1:/home/data/ /backup/server1/

Obviously this isn’t very efficient in either time or space so we could improve this by using hard links instead, which can be done by adding the -l argument to the cp command:

# Create a hard-linked clone of the current backup
cp -al /backup/server1 /backup/server1Old
# update our mirror from the remote server
rsync -av --delete server1:/home/data/ /backup/server1/

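The previous backup is preserved in /backup/server1Old, while /backup/server1 contains the entire new backup and only uses extra space for the new and changed files. This works because rsync, by default, writes an updated file to a temporary file and renames it into place rather than overwriting the existing inode, so the hard-linked copy in /backup/server1Old keeps the old content (the --inplace option would defeat this). This is an efficient way to implement incremental backups, but it still has its limitations, especially when dealing with large numbers of files.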

To improve things further we can use a feature in rsync which enables us to efficiently create hard-linked copies of a directory’s contents with only the changed files taking up space on disk. The rsync feature we need is the --link-dest argument.

Taking this as a starting point:

server1:/home/data: Remote source directory

/backup/server1New: Destination for a new backup. Does not yet exist

/backup/server1Old: Existing previous backup

The result we want in /backup/server1New is that all unchanged files are hard links to the existing files in /backup/server1Old and only the changed files are copied from the remote server and take up space in the new backup.

This is exactly what the --link-dest argument does for us. It performs a normal rsync from server1:/home/data to /backup/server1New but if the file does not exist in /backup/server1New it will look at the same relative path under /backup/server1Old to see if the file has changed. If the file in /backup/server1Old is the same as the file on the remote server then instead of copying it over rsync will create a hard link from the file in /backup/server1Old into /backup/server1New.

To use this we just add the “old” directory as the --link-dest argument to our rsync command:

rsync -av --link-dest /backup/server1Old server1:/home/data/ /backup/server1New/

Here we can see the old backup directory’s contents:

[user1@backupbox ~]$ ls -lRi /backup/server1Old/
/backup/server1Old/:
total 0
68876 drwxrwxr-x. 3 user1 user1 53 Oct  2 17:30 files
 
/backup/server1Old/files:
total 72
33651935 drwxrwxr-x. 2 user1 user1    42 Oct  2 17:30 bar
   68882 -rw-rw-r--. 1 user1 user1 28883 Oct  2 17:30 foo1
   68883 -rw-rw-r--. 1 user1 user1 27763 Oct  2 17:30 foo2
   68884 -rw-rw-r--. 1 user1 user1 10487 Oct  2 17:30 foo3
 
/backup/server1Old/files/bar:
total 76
33695759 -rw-rw-r--. 1 user1 user1 32603 Oct  2 17:30 bar1
33838984 -rw-rw-r--. 1 user1 user1 15318 Oct  2 17:30 bar2
33839003 -rw-rw-r--. 1 user1 user1 26122 Oct  2 17:30 bar3

On the server we then modify a file:

[user1@server1 files]$ echo "Hello world" >/home/data/files/foo3

Now we run our incremental backup command:

[user1@backupbox ~]$ rsync -av --link-dest=/backup/server1Old server1:/home/data/ /backup/server1New/
receiving incremental file list
created directory /backup/server1New
files/foo3
 
sent 136 bytes  received 272 bytes  816.00 bytes/sec
total size is 130,701  speedup is 320.35

We can see from the rsync output that only the changed file has been copied but if we list the contents of the new directory we can see it contains all of the files:

[user1@backupbox ~]$ ls -lRi /backup/server1New/
/backup/server1New/:
total 0
101051460 drwxrwxr-x. 3 user1 user1 53 Oct  2 17:30 files
 
/backup/server1New/files:
total 64
    68885 drwxrwxr-x. 2 user1 user1    42 Oct  2 17:30 bar
    68882 -rw-rw-r--. 2 user1 user1 28883 Oct  2 17:30 foo1
    68883 -rw-rw-r--. 2 user1 user1 27763 Oct  2 17:30 foo2
101051461 -rw-rw-r--. 1 user1 user1    12 Oct  2 17:40 foo3
 
/backup/server1New/files/bar:
total 76
33695759 -rw-rw-r--. 2 user1 user1 32603 Oct  2 17:30 bar1
33838984 -rw-rw-r--. 2 user1 user1 15318 Oct  2 17:30 bar2
33839003 -rw-rw-r--. 2 user1 user1 26122 Oct  2 17:30 bar3

If you compare the inode numbers to the listing of /backup/server1Old above you will see that only the modified file and the directories have different inode numbers.

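Using du we can also see how little extra space the second backup takes. Note that du counts each inode only once per invocation, so the shared files are charged to whichever directory it processes first (here server1New) and the other directory shows only the files that are unique to it. Both backups together use 152K, barely more than a single full copy: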

[user1@backupbox ~]$ du -chs /backup/server1*
140K	/backup/server1New
12K	/backup/server1Old
152K	total
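
How du attributes hard-linked files depends on the order of its arguments, which is easy to confirm with a local experiment. The following sketch assumes GNU du and cp; the directory names and the 200 KiB test file are arbitrary.

```shell
#!/bin/bash
set -eu
workdir=$(mktemp -d)
mkdir "$workdir/backupA"
# 200 KiB of random data (random so it is neither sparse nor compressible).
head -c 204800 /dev/urandom > "$workdir/backupA/data.bin"
# Flush to disk so du sees the allocated blocks on lazily allocating filesystems.
sync

# Hard-linked clone: backupB shares the inode of data.bin.
cp -al "$workdir/backupA" "$workdir/backupB"

# One invocation: the shared inode is charged to the first directory only.
du -sk "$workdir/backupA" "$workdir/backupB"

together_b=$(du -sk "$workdir/backupA" "$workdir/backupB" | awk 'NR==2 {print $1}')
# Separate invocation: backupB is charged the full size.
alone_b=$(du -sk "$workdir/backupB" | awk '{print $1}')

echo "backupB measured together with backupA: ${together_b}K"
echo "backupB measured on its own:            ${alone_b}K"
```

Measured on its own, each backup directory reports its full size, so the total line of a single du run is the figure that reflects real disk usage.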

Putting it all together

Here is an example script that can be used to create daily incremental backups of a directory. Each backup is stored in a directory named after today’s date and it will look for yesterday’s backup to create the hard links:

#!/bin/bash

# The source path to backup. Can be local or remote.
SOURCE=servername:/source/dir/
# Where to store the incremental backups
DESTBASE=/backup/servername_data

# Where to store today's backup
DEST="$DESTBASE/$(date +%Y-%m-%d)"
# Where to find yesterday's backup (date -d requires GNU date)
YESTERDAY="$DESTBASE/$(date -d yesterday +%Y-%m-%d)/"

# Extra rsync options, kept in an array so paths containing spaces survive
OPTS=()
# Use yesterday's backup as the incremental base if it exists
if [ -d "$YESTERDAY" ]
then
	OPTS=(--link-dest "$YESTERDAY")
fi

# Run the rsync
rsync -av "${OPTS[@]}" "$SOURCE" "$DEST"
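
If a day is missed, yesterday’s directory will not exist and the script above falls back to a full copy. One possible variation, which is an assumption on my part rather than part of the original script, is to link against the most recent earlier backup instead. The helper name latest_backup is hypothetical, and it assumes the YYYY-MM-DD directory naming used above.

```shell
#!/bin/bash
# Print the newest date-named backup directory under $1 that is older than
# the date given in $2 (YYYY-MM-DD). Prints nothing if there is none.
latest_backup() {
    local base=$1 today=$2 dir name latest_name=""
    for dir in "$base"/[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/; do
        [ -d "$dir" ] || continue      # the glob matched nothing
        name=$(basename "$dir")
        # ISO dates compare correctly as plain strings.
        if [[ "$name" < "$today" && "$name" > "$latest_name" ]]; then
            latest_name=$name
        fi
    done
    if [ -n "$latest_name" ]; then
        printf '%s\n' "$base/$latest_name"
    fi
    return 0
}

# Example use in place of the YESTERDAY check:
#   PREVIOUS=$(latest_backup "$DESTBASE" "$(date +%Y-%m-%d)")
#   OPTS=()
#   [ -n "$PREVIOUS" ] && OPTS=(--link-dest "$PREVIOUS")
```

Directories that do not match the date pattern are ignored, so other files can live alongside the backups.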

The beauty of doing your backups this way is that each daily backup is a full mirror of the remote directory. This means there is no complex logic required to find the latest version of a file or to find a file from a specific date: just go to the directory named with the date you want and open the file as normal. Each backup directory is completely independent of the others so if you need to free up some space you can just delete any of the backups that you no longer require. Removing a backup will not impact the backups before or after it, because any file they share stays on disk until its last link is removed. A simple rm -rf is all you need!
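
Since removing a backup is just an rm -rf, pruning old backups can be automated. The function below is a sketch and not part of the original script: the name prune_backups and the retention period are hypothetical, and it assumes GNU date and the YYYY-MM-DD directory naming used above.

```shell
#!/bin/bash
# Delete date-named backup directories under $1 that are more than $2 days old.
prune_backups() {
    local base=$1 keep_days=$2 cutoff dir name
    # Oldest date to keep, e.g. 2020-10-14 when run on 2020-11-13 with 30 days.
    cutoff=$(date -d "$keep_days days ago" +%Y-%m-%d)
    for dir in "$base"/[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/; do
        [ -d "$dir" ] || continue      # the glob matched nothing
        name=$(basename "$dir")
        # ISO dates compare correctly as plain strings.
        if [[ "$name" < "$cutoff" ]]; then
            echo "Removing $base/$name"
            rm -rf -- "$base/$name"
        fi
    done
}

# Example: keep 30 days of backups
#   prune_backups /backup/servername_data 30
```

Bear in mind that deleting a backup only frees the space of files that no other backup still links to.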

Limitations

As with every backup solution this one has its limitations and you must choose a method that fits your particular use-case. Here are a few examples of limitations in this solution:

Changes in permissions or ownership on a source file mean the file is counted as a new file so it will be copied again even if its contents have not changed. There are options in rsync to control this behaviour.
If you move or rename a file on the source server it will count as a new file and will be copied in full even if its contents have not changed and it still has the same inode number.
Directories themselves cannot be hard linked on most filesystems so this is not supported by rsync. For most use cases this is not a problem but if you have an enormous number of directories in the backup they will start to take a noticeable amount of space on the backup disk.

Conclusion

When it comes to using rsync for backups this is only the tip of the iceberg. There are many different options that control the behaviour of the backup process and how it determines what files to copy, link or delete. Further information about rsync can be found on the project website, https://rsync.samba.org/.

Richard Gooding

Technical Lead

Richard has a varied history in development, devops and databases so he is always comfortable on either side of the dev/ops fence. His past experience includes web and email hosting, software testing, building desktop and mobile apps, managing large Cassandra clusters, building and running large-scale distributed applications and more.


© 2022 digitalis.io – All rights reserved