How to learn a code base - rsync Prologue

With this post I'm starting a series in which I will use rsync as an example to illustrate how one can go about learning the code, community, tools and documentation - the ecosystem behind a project.

We will start without any knowledge about the project. In the process we will produce documentation of the methods we used to obtain information and end up with a guide which should help new developers get a kick start on the project.

Table of Contents

Prologue - This post
Part 1 - Getting the source and compiling it
Part 2 - Making things snappy

Prologue

First of all, why are we looking at rsync? To be honest, I wanted to do this series using the SETI@Home Service, an Ada implementation of the popular distributed computation effort, as an example. Unfortunately, the required command line client seems to be no longer available, and not being able to run the code would be a major blocker for the series. After hitting that wall, I asked Gynvael to pitch an open source project for this purpose and rsync was his choice. Now that we know who to blame - let's start with gathering some basic information about the project.

I said we will start without any knowledge about the project. This is almost perfectly true. The only thing I know about rsync at this point is that it's an open source application used to synchronize files and directories between two locations. I never had the need to use it and scp was always enough for my needs.

We will record our progress in an org-mode file kept under version control (using git) and published in this GitHub repository.

The best way to start is to find the project's Wikipedia entry and homepage. Let's read the Wikipedia entry first and note down bits of information that seem interesting at this point. We are searching for a list of supported platforms, algorithms, utilized protocols and possible feature enumerations. These things are often used as boundaries for components in a software application. Identifying them can speed up code navigation later on.

From the Wikipedia entry we can learn that both Unix (Mac OS X, GNU/Linux) and Windows platforms are supported. It's worth noting that the application was ported to Windows via Cygwin. This indicates that the application originated on Unix, which may be reflected in its dependencies. The link in the tutorial section also shows that it's possible to run it on FreeBSD.

Minimizing data transfer seems to be the main goal of the software. This is a hint that the most complicated part of the application will be the data exchange algorithms and the protocols used for them. Wikipedia notes that smaller transfers are achieved by the use of delta encoding.
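To make the idea of delta encoding concrete before we look at any real code, here is a toy sketch in Python. It is not how rsync does it (we have not read the source yet). The names make_delta and apply_delta, the tuple format and the tiny block size are all made up for illustration. The point is only that the new data can be described as "copy this block you already have" plus a few literal bytes.

```python
BLOCK_SIZE = 4  # hypothetical and tiny on purpose, so the delta is easy to read


def make_delta(old: bytes, new: bytes, block_size: int = BLOCK_SIZE):
    """Describe `new` as copies of blocks from `old` plus literal bytes."""
    # Index every full, aligned block of the old data by its content.
    known = {}
    for offset in range(0, len(old) - block_size + 1, block_size):
        known.setdefault(old[offset:offset + block_size], offset)

    delta = []
    literal = bytearray()
    pos = 0
    while pos < len(new):
        chunk = new[pos:pos + block_size]
        offset = known.get(chunk) if len(chunk) == block_size else None
        if offset is not None:
            # The receiver already has this block: send only a reference.
            if literal:
                delta.append(("literal", bytes(literal)))
                literal.clear()
            delta.append(("copy", offset))
            pos += block_size
        else:
            # No match at this position: send one byte and slide forward.
            literal.append(new[pos])
            pos += 1
    if literal:
        delta.append(("literal", bytes(literal)))
    return delta


def apply_delta(old: bytes, delta, block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild the new data from the old data and a delta."""
    out = bytearray()
    for kind, value in delta:
        if kind == "copy":
            out += old[value:value + block_size]
        else:
            out += value
    return bytes(out)


old = b"the quick brown fox jumps"
new = b"the quick red fox jumps"
delta = make_delta(old, new)
rebuilt = apply_delta(old, delta)
literal_bytes = sum(len(v) for kind, v in delta if kind == "literal")
```

Only the literal parts and a handful of block references would have to cross the wire. Note that the sender has to try a match at every byte position of the new data, which is exactly the kind of work a real tool would need to make cheap.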

The application can be run in daemon mode and by default listens on port 873. It can serve files in the native rsync protocol or via remote shell such as RSH or SSH. In the latter case the executable has to be present on both sides of the transaction. We need to keep this in mind and see which protocols are implemented in the rsync code base and which are simply used as external libraries or executable files in the target environment. Since the daemon mode is mentioned separately we can assume that the application can work both in a daemon mode and a client mode.

Rsync was written as a replacement for rcp and scp and because of that has a user interface similar to those tools. It will be worth comparing later where the interfaces differ and why. This section of the article also mentions the scriptability of rsync, which might point to an important component of the tool in the code.

Wikipedia also notes the rsync algorithm invented by Andrew Tridgell, which is used for efficient transmission of a structure (such as a file) across a communications link when the receiver already has a different (but similar) version of the same structure. The paper is linked in the article, and Wikipedia itself has a nice rundown of it. Both of these resources might come in extremely handy when we go through the code later on.
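One piece of the algorithm that is easy to try out in isolation is the weak "rolling" checksum described in the paper: a checksum of a block that can be updated in constant time when the block window slides forward by one byte. Below is a minimal Python sketch of that idea as I understand it from the description. The function names are mine, and the actual rsync source may well differ in details, so treat it as an assumption to verify once we read the code.

```python
MOD = 1 << 16  # both halves of the weak checksum are kept modulo 2^16


def weak_checksum(block: bytes):
    """Compute the two halves (a, b) of the weak checksum from scratch."""
    n = len(block)
    a = sum(block) % MOD
    # The first byte is weighted by n, the last byte by 1.
    b = sum((n - i) * byte for i, byte in enumerate(block)) % MOD
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, block_len: int):
    """Slide the window one byte: drop out_byte at the front, add in_byte at the end."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - block_len * out_byte + a) % MOD
    return a, b


def rolling_checksums(data: bytes, block_len: int):
    """Yield (offset, a, b) for every window of block_len bytes in data."""
    if len(data) < block_len:
        return
    a, b = weak_checksum(data[:block_len])
    yield 0, a, b
    for start in range(1, len(data) - block_len + 1):
        a, b = roll(a, b, data[start - 1], data[start + block_len - 1], block_len)
        yield start, a, b


data = b"how to learn a code base - rsync prologue"
rolled = list(rolling_checksums(data, 8))
```

Every rolled value should equal the checksum recomputed from scratch for the same window, which is what makes it practical to look for matching blocks at every byte offset rather than only at block boundaries.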

The application is actively maintained (this is very important) and had its latest stable release 40 days ago (March 26, 2011). It was initially released on June 19th, 1996.

Other interesting bits noted from Wikipedia:

Synchronizes files and directories
Mirroring takes place with only one transmission in each direction
Optionally can use compression and recursion
Licensed under the GNU GPL

A comment found in the discussion section of the Wikipedia article indicates that rsync and librsync are two different implementations of a similar algorithm. This is something we should check out later.

This seems to cover the information we can get from Wikipedia. Let's head to the project homepage for a quick look.

Several things are quickly brought to our attention. First, the project intro describes the tool as fast. We need to identify later what the main enabler of this is (small transfers? efficient implementation? both? which is more important?). The second thing is a security advisory for users of an xattr-enabled rsync older than 3.0.2, a writable rsync daemon older than 3.0.0 or a version of rsync older than 2.6.6. We will need to read this advisory later on to learn what was wrong with the code, but it's interesting how much additional information we can scavenge from this one sentence. Extended file attributes (xattr) are mentioned as a feature that can possibly be toggled in rsync, so this marks a third mode of operation for the software. A writable mode is also mentioned for the daemon, which implies a read-only mode as well. That totals five possible modes of operation (daemon writable/read-only, client, xattr-enabled/disabled). We don't know how they depend on each other yet, but it's worth remembering.

We can update our features list from the one listed on the homepage.

can update whole directory trees and filesystems
optionally preserves symbolic links, hard links, file ownership, permissions, devices and times
requires no special privileges to install
internal pipelining reduces latency for multiple files
can use rsh, ssh or direct sockets as the transport
supports anonymous rsync which is ideal for mirroring

Optional preservation of symbolic links, hard links, file ownership, permissions, devices and times is probably a separate mode from the xattr mode we saw before. So our supported modes could go up to six, but we will leave this out of the list for now. We will just remember to see which compilation/configuration flag relates to this behavior.

I'm interested in the third point. What allows the application to install without special privileges? How does it hinder its features (if at all)?

What is the internal pipelining? Is it internal to the application or to the OS?

We knew about rsh and ssh, I wonder if direct sockets refer to the native rsync protocol.

We need to see what makes anonymous rsync ideal for mirroring.

It's good to see that the project has a bug tracker, documentation, some examples of usage, resources, a FAQ and mailing lists. All of this should come in handy soon.

I also like to take a look at ohloh.net if a project is listed there. In the case of rsync we can quickly see that it's mostly written in C (around 38k lines of code) and has a mature, well-established codebase, a small development team and decreasing year-over-year development activity. Note that these are just automatic estimations from ohloh.net and may be wrong in places, but they give a nice overview if you just want to take a glimpse at the project before diving in.

At this point I must note that I have only very basic knowledge of and experience with C. I see this as a great opportunity to learn the language better and improve, but if you can't stand beginner struggles and mistakes you might want to skip this entire series. On the other hand, if you want to see how I approach learning a language and a code base - this will be exactly the thing you should keep an eye out for.

In our next post we will get rsync on our system (preferably built from source), try some basic use cases and start navigating its code base.