How to learn a code base - rsync Part 2
Now that we have a working setup with binaries built from source, we can think about making modifications to the code.
The plan is to add support for snappy, a recently released compression/decompression library developed in-house by Google.
Part 2 - Making things snappy
Before we start planning our changes, we need to take a step back and see how our compilation issue was resolved by the project maintainer. Our attempt to compile the code initially resulted in a problem with the linux/falloc.h header file missing on our system. We resolved the problem by replacing the include directive in rsync.h with an include of fcntl.h. We also sent a report of the encountered problem and the applied solution to the mailing list. The problem was later reviewed and fixed by the maintainer.
The proper fix was to add a configuration check for the presence of the header file and conditionally include the header only if it's present. The additional include of fcntl.h we added to rsync.h is redundant because it's already done higher up in rsync.h - this was a mistake and we should have grep'ed for it in the first place.
Let's fetch the newest code and recompile to see if everything passes nicely as expected.
git fetch
git merge origin
We are now on revision d518e02243c8a4aad550b402f3421720860f340e.
It's now time to rebuild:
make reconfigure # we need this because our files changed after fetching the newest changes
make clean
./configure
make
mulander@bunkier_mysli:~/code/blog/lac/rsync$ ./rsync --version
rsync version 3.1.0dev protocol version 31.PR13
Copyright (C) 1996-2011 by Andrew Tridgell, Wayne Davison, and others.
Web site: http://rsync.samba.org/
Capabilities:
64-bit files, 64-bit inums, 32-bit timestamps, 64-bit long ints,
socketpairs, hardlinks, symlinks, IPv6, batchfiles, inplace,
append, ACLs, xattrs, iconv, symtimes, prealloc
rsync comes with ABSOLUTELY NO WARRANTY. This is free software, and you
are welcome to redistribute it under certain conditions. See the GNU
General Public Licence for details.
Everything went fine, so let's talk a bit about our planned modification.
The idea came from observing this thread on the official rsync mailing list. The author is restoring a 6 TB backup, which takes a long time. He observed that rsync is CPU bound, maxing out one core of the machine, while plenty of network bandwidth remains unused. Friendly folks on the mailing list suggested two changes to reduce the CPU load. The first is to turn off the delta-transfer algorithm with the --whole-file option. The second is to drop compression by not passing the --compress flag. The author claims that making both changes sped up the transfer by a factor of 4.5. rsync still remained CPU bound after these changes, but there was no further discussion on the mailing list.
We can already assume that changing the compression will not be a silver bullet. The point is that there are very probably rsync users who avoid compression altogether because of the CPU overhead it introduces at run time. Some of them might opt in to a compression option that is an order of magnitude faster but has a lower compression ratio than the default provided by zlib. Most importantly, though, this modification should be fun, not overly complicated, and a really good way to learn a part of the code base faster.
So what exactly is snappy? From the project home page:
Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.
Snappy is widely used inside Google, in everything from BigTable and MapReduce to our internal RPC systems. (Snappy has previously been referred to as “Zippy” in some presentations and the likes.)
We know that rsync is written in C. Fortunately for us, snappy has an official C binding, mentioned in the Usage section of the README; the bindings are kept in snappy-c.h.
Here is the general plan of our modification.
- Download and compile snappy locally
- Write a simple C test to compress / decompress a file (just to get the hang of it)
- See how zlib is handled by the build system
- Add snappy to the build system, with a configure option to build rsync with or without snappy support
- See where and how zlib is used in the code
- Add an rsync flag --snappy-compress that makes rsync use snappy instead of zlib
- Implement compression/decompression using snappy
- Try out our modifications
- Document the changes and send them upstream for a review
Hopefully we will end up with an interesting change that we can submit upstream to the maintainer.
1. Getting snappy
Download the 1.0.2 release of snappy. Unpack it and compile using:
./configure
make
I didn't encounter any problems with this step. You might consider installing it system-wide with:
make install
Alternatively, consider using your distribution's package manager instead of doing a system install from the sources.
2. Trying out snappy
Before we add any code, it's a good idea to try out simple things with the library. To do that I wrote a simple throwaway program that I called test.c. The program opens a test.txt file from the current directory, compresses its content and saves it to test.snappy. As a final step, the compressed content (the copy in memory, not the one saved on disk) is decompressed and saved to test.orig. I then ran a simple diff to check that both files (test.txt and test.orig) are still the same after that pass.
3. How is zlib handled by the build system
Initially I thought that seeing how zlib is plugged into the build system would be a good guide to adding support for snappy. Unfortunately zlib is a special case here: the rsync developers had to introduce rsync-specific changes to its code base, so it's bundled with rsync and used in the build even if a system-wide zlib installation is available. An rsync built against a zlib without the rsync-specific modifications would not be able to work with an rsync built with the bundled library (most of the world).
4. Modifying the build system
In order to include snappy in the build system we need to add a header check for snappy-c.h to configure.ac. This allows us to make sure that the C bindings for the library are available on the system. If the header file is present, a constant HAVE_SNAPPY_C_H set to 1 will be defined in config.h. This can be achieved by adding snappy-c.h at the end of the AC_CHECK_HEADERS list (after linux/falloc.h).
Additionally we will add an optional flag, --with-snappy-compress, that has to be passed in order to build the feature. Since some people might not want the additional code, we will make it disabled by default and compile the additions only when configure is explicitly called with --with-snappy-compress.
We use AC_ARG_WITH to add the new flag and its description.
Later in the file we define an additional test, starting with AC_MSG_CHECKING, which checks if the --with-snappy-compress flag was provided to the configure script. If the flag was passed, a new constant named SUPPORT_SNAPPY_COMPRESSION will be defined with the value 1.
We can see that our modifications work by calling:
make reconfigure
./configure --help
... SNIP ...
Optional Packages:
--with-PACKAGE[=ARG] use PACKAGE [ARG=yes]
--without-PACKAGE do not use PACKAGE (same as --with-PACKAGE=no)
--with-included-popt use bundled popt library, not from system
--with-snappy-compress http://code.google.com/p/snappy/ faster than zlib
but worse compression
... SNIP ...
Here is the diff of the changes:
Now that we have our constants figured out we can add a conditional include to rsync.h that will add the snappy-c.h header file if it's available and the --with-snappy-compress flag was provided:
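A conditional include of this kind might look roughly like the sketch below. It assumes the header check ends up defining HAVE_SNAPPY_C_H (the name autoconf's AC_CHECK_HEADERS derives from snappy-c.h) and that the configure flag defines SUPPORT_SNAPPY_COMPRESSION. The SNAPPY_COMPILED_IN macro and the snappy_compiled_in helper are hypothetical and only there to make the outcome of the check visible.

```c
/* Assumption: config.h (generated by configure) may define
 * HAVE_SNAPPY_C_H and SUPPORT_SNAPPY_COMPRESSION. */
#if defined(HAVE_SNAPPY_C_H) && defined(SUPPORT_SNAPPY_COMPRESSION)
#include <snappy-c.h>
#define SNAPPY_COMPILED_IN 1   /* hypothetical convenience macro */
#else
#define SNAPPY_COMPILED_IN 0
#endif

/* Hypothetical helper: report whether the snappy code paths
 * were compiled into this binary. */
int snappy_compiled_in(void)
{
    return SNAPPY_COMPILED_IN;
}
```

Requiring both constants means the header is only pulled in when it exists and the user explicitly asked for the feature, so a default build stays untouched.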
5. Observing how zlib is being used
We've read this usage example for zlib in order to better understand how it might be used in the rsync code base. The next step was to grep for inflate, deflate and zlib.h to find the places where the library is used. This search resulted in:
- options.c - option parsing
- batch.c - responsible for batch mode
- token.c - code responsible for the actual file transfer
options.c
This file is mostly responsible for parsing and initializing the options passed from the command line. The zlib.h header file is included here in order to access some constants defined in the header (for example the compression level) but actual compression/decompression code is not called here.
batch.c
The batch mode is used to save time while performing rsync operations to a set of different hosts replicating the same source data. The batch writing is responsible for recording all the steps required to perform the same synchronization to a different host. Performing the write allows rsync to skip checking the file status, checksums and data block generation. The file is created using the --write-batch option and later replayed with the --read-batch option.
The zlib.h header file is included in batch.c. I'm not sure why, as no code is called from it; probably in order to use some constants defined in the header. The do_compression flag is present on a list of flags, with a comment marking the protocol version in which it was introduced. There is also a list of literal names for the flags. Later on we will have to add --snappy-compress to these lists in a similar fashion.
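To illustrate the idea of two parallel lists (one of flag variables, one of their literal names), here is a simplified sketch. It is not a copy of batch.c: the arrays contain only two entries, do_snappy_compression is the flag we plan to add, and the pack_flags and unpack_flags helpers are assumptions about how such lists can be turned into a bitmask and back.

```c
#include <stddef.h>

int do_compression = 0;
int do_snappy_compression = 0;   /* the flag we plan to add */

/* One pointer per recorded option; the position in the list
 * is the bit number used when the flags are stored. */
static int *flag_ptr[] = {
    &do_compression,          /* 0 */
    &do_snappy_compression,   /* 1 (hypothetical addition) */
    NULL
};

/* Literal names in the same order as flag_ptr. */
static const char *flag_name[] = {
    "--compress (-z)",
    "--snappy-compress",
    NULL
};

/* Pack the current flag values into a bitmask. */
int pack_flags(void)
{
    int i, bits = 0;

    for (i = 0; flag_ptr[i] != NULL; i++) {
        if (*flag_ptr[i])
            bits |= 1 << i;
    }
    return bits;
}

/* Restore the flag values from a bitmask. */
void unpack_flags(int bits)
{
    int i;

    for (i = 0; flag_ptr[i] != NULL; i++)
        *flag_ptr[i] = (bits >> i) & 1;
}

/* Look up the literal name for a flag position. */
const char *flag_label(int i)
{
    return flag_name[i];
}
```

Because positions carry meaning in a layout like this, a new entry has to be appended at the end of both lists so that existing bit numbers stay the same.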
token.c
Part of the code is responsible for deciding whether a file is worth compressing and with what level of compression. Sending and receiving tokens/buffers of data, both with compression enabled and disabled, is also handled by functions in this file.
There seem to be three exported functions in this code.
- send_token - responsible for sending data. If compression is enabled it calls send_deflated_token otherwise simple_send_token is called. send_token is called from match.c
- recv_token - responsible for getting data from the other end. If compression is enabled it calls recv_deflated_token otherwise simple_recv_token is called. recv_token is called from receiver.c
- see_token - seems to be used to feed the zlib history buffer - not sure about this one. Will require more reading. see_token is called from receiver.c
At this point I think our modification to token.c will end up introducing two new functions, recv_snappy_token and send_snappy_token, initially based on simple_send_token and simple_recv_token.
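The dispatch described above could be extended along these lines. This is only a sketch of the decision, not code from token.c: the enum and the choose_token_path function are hypothetical, and the real functions take more arguments than a pair of flags.

```c
/* Which family of token functions would handle the data. */
enum token_path {
    PATH_SIMPLE,    /* simple_send_token / simple_recv_token */
    PATH_DEFLATE,   /* send_deflated_token / recv_deflated_token */
    PATH_SNAPPY     /* planned send_snappy_token / recv_snappy_token */
};

/* Hypothetical helper mirroring the choice made in send_token and
 * recv_token, with a snappy branch added. It relies on the plan that
 * --snappy-compress also switches on do_compression. */
enum token_path choose_token_path(int do_compression,
                                  int do_snappy_compression)
{
    if (!do_compression)
        return PATH_SIMPLE;
    if (do_snappy_compression)
        return PATH_SNAPPY;
    return PATH_DEFLATE;
}
```

Keeping the zlib branch as the fallback means a plain --compress run behaves exactly as before.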
Overall we will have a lot of mundane testing ahead of us to see if our changes didn't break anything important ;)
6. Adding a --snappy-compress flag
We want our changes to be transparent to the way rsync worked before. If someone uses the --compress flag, we still want to make sure that only the zlib code is being used. In order to use the snappy library the user will have to provide the --snappy-compress flag, or both --compress and --snappy-compress. We will still implicitly set the --compress flag to true if the user provides just the --snappy-compress flag. The reason is that we don't want to skip code that may not be directly tied to zlib but still takes part in compression/decompression (protocol handling? gathering statistics?). We will perform incremental changes and make additional modifications if we notice problems introduced by our changes.
A quick grep of the code base revealed that the actual command line argument parsing takes place in options.c.
All options in this file have a defined variable to store the parsed argument. We will define a new one called do_snappy_compression; the name is based on the names of existing variables.
The code in options.c also performs some other tasks. One of them is printing out the list of available features when rsync is called with the --version flag. The function responsible for this is called print_rsync_version, and it has a set of string constants initialized to "no ". After initialization each feature is checked (for example by checking whether a compile-time constant was defined) and its string is set to an empty string if the feature is present. Further down, each feature is printed via a printf wrapper called rprintf, with the current value of the feature string printed before the feature name. The result is the string "no " printed before the feature name if it's not present, or just the feature name if it's available.
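The "no " prefix trick is easy to show in isolation. The sketch below uses plain snprintf instead of rprintf, and the format_capability function is a hypothetical stand-in rather than anything found in options.c.

```c
#include <stdio.h>

/* Write "<name>" or "no <name>" into buf, following the pattern
 * described above: start with the "no " prefix and clear it when
 * the feature is present. Returns the length of the text. */
int format_capability(char *buf, size_t size, const char *name, int present)
{
    const char *prefix = "no ";   /* assume the feature is missing */

    if (present)
        prefix = "";              /* feature available: drop the prefix */

    return snprintf(buf, size, "%s%s", prefix, name);
}
```

In rsync itself the `present` decision would come from a compile-time check such as #ifdef SUPPORT_SNAPPY_COMPRESSION rather than from a function argument.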
Next up is the usage function, which is responsible for printing out a list of all the available flags and the basic syntax the application handles. We will add our new flag along with a short description here, guarded by a conditional based on the definition of SUPPORT_SNAPPY_COMPRESSION which we defined in rsync.h if the feature is available.
According to its documentation, set_refuse_options is used to disable all the options that the rsyncd.conf file defines to turn off. At this point I think this only refers to rsync working in daemon mode. Regular compression is handled there, so we will remember to take a look here later on, but for now we will skip modifying this part of the code.
In parse_arguments we will add a conditional check based on do_snappy_compression. If our flag was provided we will additionally set do_compression to 1.
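Reduced to its core, that check is a single implication. The sketch below isolates it in a hypothetical struct and function. In options.c the two flags are plain variables and the check would sit inside parse_arguments.

```c
/* Hypothetical container for the two flags discussed above. */
struct compress_opts {
    int do_compression;          /* set by --compress / -z */
    int do_snappy_compression;   /* set by --snappy-compress */
};

/* After parsing: --snappy-compress implies --compress, so code that
 * only looks at do_compression keeps working. */
void finalize_compress_opts(struct compress_opts *opts)
{
    if (opts->do_snappy_compression)
        opts->do_compression = 1;
}
```

The implication only goes one way: passing --compress alone leaves do_snappy_compression at 0, so the zlib path is still the default.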
Now let's recompile and check if our flag is present.
mulander@bunkier_mysli:~/code/blog/lac/rsync$ make
mulander@bunkier_mysli:~/code/blog/lac/rsync$ ./rsync --help
... SNIP ...
--link-dest=DIR hardlink to files in DIR when unchanged
-z, --compress compress file data during the transfer
--compress-level=NUM explicitly set compression level
--snappy-compress use snappy compression instead of zlib
--skip-compress=LIST skip compressing files with a suffix in LIST
... SNIP ...
mulander@bunkier_mysli:~/code/blog/lac/rsync$ ./rsync --version
rsync version 3.1.0dev protocol version 31.PR13
Copyright (C) 1996-2011 by Andrew Tridgell, Wayne Davison, and others.
Web site: http://rsync.samba.org/
Capabilities:
64-bit files, 64-bit inums, 32-bit timestamps, 64-bit long ints,
socketpairs, hardlinks, symlinks, IPv6, batchfiles, inplace,
append, ACLs, xattrs, iconv, symtimes, prealloc, snappy,
rsync comes with ABSOLUTELY NO WARRANTY. This is free software, and you
are welcome to redistribute it under certain conditions. See the GNU
General Public Licence for details.
That will be all for today. The last three (and biggest) steps are left for the next post(s): implementing the actual code for compression/decompression, making sure that everything works correctly, and creating a GitHub clone of the official repository to store the changes and notify the maintainers on the mailing list.