Chapter 10. Large Projects

Centralized Version Control

Centralized version control is the most popular model, and perhaps the easiest to understand. In this model, there is a central repository, operated by the project administrators. This repository keeps a virtual filesystem and a history of the changes made to that filesystem over time.

Figure 10-1 illustrates the typical working model used for centralized development.

Figure 10-1. Centralized version control

A developer follows this basic procedure to work with a version control system:

1. Create a working copy (a local copy of the code for development) by performing a checkout. This downloads the latest revision of the code.
2. Work on the code locally. Periodically issue update commands, which will retrieve any changes that have been made to the repository since the last checkout or update. These changes can usually be merged automatically, but sometimes manual intervention is required.
3. When the unit of work is complete, perform a commit to send the changes to the repository. Repeat from step 2, as you already have a working copy.

CVS

The Concurrent Versions System (CVS, http://www.nongnu.org/cvs/) is the oldest version control system still in common use. Although Subversion is generally favored as its replacement, CVS pioneered several features considered essential to centralized version systems, such as the following:

Concurrent access: Previous version control systems such as RCS required a developer to "check out" a file, locking it for update, and then check it in to release the lock. CVS introduced the copy-modify-merge model for text files. Under this model, many developers can work on different parts of the same file concurrently, merging the changes together upon commit.
Repository hooks: The ability to run external scripts upon commit—to run tests, notify a team, start a build, or anything else.
Branches and modules: CVS allows multiple concurrent branches for development. Vendor branches can pull code from unrelated projects; modules map symbolic names to groups of files for convenience.

One often-cited drawback to CVS is that it does not guarantee atomic commits. If interrupted while in process, commits can leave the working copy and repository in an inconsistent state. Most other version control systems provide an atomicity guarantee: commits are applied to the repository either in full or not at all.

In practice, this is more of an annoyance than a critical flaw. Important repositories should be backed up regularly, regardless of what version control system they are backed by. Still, this and other limitations make some developers uncomfortable. In response, many other version control systems have evolved out of the CVS model, and CVS is not used very much anymore for new projects.

Subversion

Subversion (http://subversion.tigris.org/) is currently the most popular version control system among Rails developers. It was designed to be a replacement for CVS, and it has been very successful. Developers used to CVS will feel at home with Subversion's commands.

As a centralized version control system, Subversion uses one primary server that keeps a master repository. Developers check out a working copy, make their changes, and check them back in. By default, Subversion uses the copy-modify-merge model for concurrent development. Multiple people can check out the same file, make concurrent changes, and have their work merged. Non-overlapping changes will be merged automatically; conflicting changes must be merged by hand.
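The copy-modify-merge idea can be illustrated with a toy three-way merge. This is only a sketch, not Subversion's actual algorithm: it assumes all three versions of the file have the same number of lines, and the method name three_way_merge and the sample stylesheet lines are made up for the example. A line changed on only one side is taken automatically; a line changed differently on both sides is reported as a conflict.

```ruby
# Toy three-way merge over files with the same number of lines.
# base:   the revision both developers checked out
# mine:   my edited copy
# theirs: the other developer's edited copy
# Returns [merged_lines, conflicting_line_indexes].
def three_way_merge(base, mine, theirs)
  unless base.length == mine.length && base.length == theirs.length
    raise ArgumentError, "this sketch only handles same-length files"
  end

  merged = []
  conflicts = []
  base.each_index do |i|
    if mine[i] == theirs[i]
      merged << mine[i]      # both sides agree (changed or unchanged)
    elsif mine[i] == base[i]
      merged << theirs[i]    # only the other developer changed this line
    elsif theirs[i] == base[i]
      merged << mine[i]      # only I changed this line
    else
      merged << base[i]      # both changed it differently: needs a human
      conflicts << i
    end
  end
  [merged, conflicts]
end

base   = ["color: red",  "margin: 0", "padding: 0"]
mine   = ["color: blue", "margin: 0", "padding: 0"]
theirs = ["color: red",  "margin: 0", "padding: 4px"]

merged, conflicts = three_way_merge(base, mine, theirs)
puts merged
puts "conflicts: #{conflicts.inspect}"
```

Here the two edits touch different lines, so both survive in the merged result and the conflict list is empty. Real merge tools work on variable-length files by first computing a diff against the common ancestor.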

Files that cannot be merged (such as image files) can be locked for serialized access:

	$ svn lock images/logo.png -m "Changing header color"
	(work with logo.png...)
	$ svn ci images/logo.png -m "Changed header to blue"

You can also use the svn:needs-lock property to designate that a file should be locked before editing. If a file marked with that property is checked out without a lock, the working copy version will be set as read-only to remind the developer to lock the file before changing it.

Subversion improves on CVS in several ways:

Subversion has truly atomic commits; interrupted commits are guaranteed to leave the repository in a consistent state (though they may leave outstanding locks in the working copy).
Subversion supports constant-time branching using copy-on-write semantics for copies. Branches and tags are simply directories; they are not separate objects as in CVS.
Directories are tracked independently of the files they contain. Directories and files can be moved while retaining their version history.
Symbolic links can be stored in the repository and versioned as links.

Subversion provides the best fit for many developers, especially in the open source world. Many projects have migrated from CVS to Subversion over the past few years. Subversion is successful in large part because it strikes a good balance between features and ease of use.

One drawback of Subversion is that it can be difficult to build the server from source because of its dependencies. It is built on top of APR (the Apache Portable Runtime), a portability layer for network applications. Although the basic dependencies are included for an svnserve installation, you may run into difficulty if you want to use Apache as a Subversion server. However, once you have the dependencies in order, building the server is straightforward.

Decentralized Version Control

Centralized version control has some drawbacks, especially when working in larger teams. The central server and repository can become a bottleneck, especially when dealing with many developers, as in large open source projects. A new paradigm, decentralized version control, is attempting to fix some of these issues. Though it is not widely used among Rails developers, it is worth knowing about, as it is extremely useful for certain situations.

In contrast to the hierarchical structure of centralized versioning systems, decentralized systems provide a more egalitarian approach. (It's the cathedral versus the bazaar, if you will.) Rather than having many working copies that all must communicate their changes back to the repository, each working copy is in fact a full repository. Any of the local repositories can pull and push changes to and from each other, and changesets destined for production will ultimately be pushed back to the authoritative repository.

In fact, the only thing that designates a repository as authoritative is community support: project administrators set up a repository as the master and publish its network address. Developers can pull changes from any repository that decides to publish them. Interestingly, this parallels the meritocracy inherent in open source software: your worth is measured by how much you contribute and how many people listen to you.

The decentralized development model can be more complicated to learn, but it is much more flexible, especially with large projects that have many contributors. The Mercurial wiki gives an example of distributed development best practices based on the development style of the Linux kernel.^[84]

At first glance, a distributed workflow might look fairly similar to a centralized one. In fact, a decentralized version control system can be used as a centralized system; its functionality is a superset of that of centralized systems. Using Mercurial, a developer can "check out" a codebase (hg clone), make modifications, update from the repository (hg pull; hg merge), and check in (hg commit). This process is illustrated in Figure 10-2.

Figure 10-2. Decentralized version control

There is a slight difference here from the centralized paradigm, in that the pull and merge steps are separate. Mercurial gives the developer complete control over the local repository and working copy, so merges do not take place unless requested.

The real power comes from the ability to synchronize repositories. Changesets can be pulled from any repository, not just the master. So, if Bob developed a feature that Alice needs to test, Alice can pull it directly from him, merge it into her repository, and test it before committing it to the master. This is most commonly done today with centralized systems using diff and patch, but distributed systems formalize this method. The process looks something like Figure 10-3.

Figure 10-3. A repository can pull from or push to any other repository
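The pull relationship between repositories can be modeled in a few lines. This is a deliberately simplified sketch and not how Mercurial stores history: the Repo class, its methods, and the changeset names are all invented for the example, and real changeset identifiers are hashes rather than readable strings. Each repository maps a changeset ID to its parents; pulling copies whatever the other side has that we lack.

```ruby
# Toy model of a distributed repository: a map of changeset id => parent ids.
class Repo
  attr_reader :changesets

  def initialize
    @changesets = {}
  end

  # Record a new changeset on top of the current heads
  # (two or more parents make it a merge).
  def commit(id, parents = heads)
    @changesets[id] = parents
  end

  # Copy over every changeset the other repository has and we lack.
  # Returns the ids that were added.
  def pull(other)
    missing = other.changesets.keys - @changesets.keys
    missing.each { |id| @changesets[id] = other.changesets[id] }
    missing
  end

  # Heads are changesets that no other changeset names as a parent.
  def heads
    @changesets.keys - @changesets.values.flatten
  end
end

master = Repo.new
master.commit("initial")

bob = Repo.new
bob.pull(master)            # a "clone" is a pull into an empty repository
alice = Repo.new
alice.pull(master)

bob.commit("bobs-feature")
alice.commit("alices-fix")

alice.pull(bob)             # Alice pulls directly from Bob, not from master
puts alice.heads.inspect    # two heads: the lines of development diverged
alice.commit("merge")       # a merge changeset joins both heads
master.pull(alice)          # only now does the work reach the master
puts master.heads.inspect
```

Nothing in the model distinguishes master from the other two repositories; it is authoritative only because the developers agree to treat it that way.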

One of the most compelling features of decentralized version control is its compatibility with offline development. With a centralized system, the developer must be able to contact the server whenever he wants to check code in. Under the decentralized model, a developer can check in code to his local repository on his laptop in the Bahamas, and then push all of the changesets at once to the authoritative repository when he has an Internet connection. This keeps the changesets clean and focused, while not requiring a connection to the main repository on every commit. In effect, this method creates a hierarchy of repositories (see Figure 10-4).

Figure 10-4. Disconnected or offline development with decentralized version control

The primary technical drawback to distributed systems (other than their complexity) is that each working copy is a full repository. Because each repository contains a full change history, a checkout of a large or often-changing system can be quite large. As an example, the Linux kernel source code is around 50 MB (bzipped), but a git clone checkout of the same source (with history) transfers hundreds of megabytes across the network.

Branching and Merging

In large software development projects, there is usually a need to keep multiple lines of development separate. This need exists for a few reasons:

Ongoing feature development will take place almost immediately after a release is issued. If the release is buggy, the developers need a mechanism to fix the version that was released without introducing any of the changes that were introduced since the release.
A development team will often work on multiple features concurrently. It would be a nightmare if each developer had to ensure that his half-developed, half-tested feature worked with another developer's half-developed, half-tested feature every time he checked in code.
When creating a release for public consumption, there is often a period of testing and evaluation. If the entire development team were frozen during this test period, it would be very hard to get anything done.
A team may offer support for multiple versions of the software at the same time, in effect making public their branching system. Bugfixes and occasionally features must be backported to old releases of the software.

Most version control systems offer flexible branching and merging support. A branch is an independent line of development that can be developed on its own and merged back into the trunk.

Subversion branching and merging

Subversion does not actually have a built-in branching or tagging mechanism as such; all branches and tags are simply copies of part of the directory tree. Subversion creates cheap copies using copy-on-write semantics; data is written to disk only when the copy is actually changed. The amount of extra information required to maintain a branch is roughly proportional to the difference between the branch and its parent.

This characteristic has some drawbacks, though. Subversion 1.4 has very primitive merging support. It does not keep track of when branches were created or merged, and does not prevent a change from being applied twice. Most developers who do at least a moderate amount of merging use svnmerge.py,^[85] which keeps track of this metadata in Subversion properties.
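The cheap-copy behavior described above can be pictured with a small model. This is an illustration only, not Subversion's storage format: the Tree class and its methods are hypothetical. A copy holds a reference to its source and nothing else; only paths written after the copy take up space of their own.

```ruby
# Toy copy-on-write tree: a copy stores only what differs from its source.
class Tree
  attr_reader :changes   # path => contents stored by this tree itself

  def initialize(parent = nil, changes = {})
    @parent = parent
    @changes = changes.freeze
  end

  # A copy records only a reference to its source, so it is constant-time.
  def copy
    Tree.new(self)
  end

  # Writing returns a new tree; the source tree is never modified.
  def write(path, contents)
    Tree.new(@parent, @changes.merge(path => contents))
  end

  # Look in our own changes first, then fall back to the source.
  def read(path)
    return @changes[path] if @changes.key?(path)
    @parent ? @parent.read(path) : nil
  end
end

trunk = Tree.new.write("main.css", "color: red").write("app.rb", "puts 1")

branch = trunk.copy
puts branch.changes.size        # the fresh copy stores nothing of its own

branch = branch.write("main.css", "color: blue")
puts branch.read("main.css")    # the branch sees its own change
puts branch.read("app.rb")      # untouched files come from the trunk
puts trunk.read("main.css")     # the trunk is unaffected
puts branch.changes.size        # storage grows with the difference only
```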

There are many different paradigms for how branches are used. Here are some of the most common ones for web development:

Production branch

The trunk is used for ongoing development. When a feature is fully developed and tested, it is merged into the production branch and deployed. This style is well suited to web applications, which tend to have a single development team working on one feature package at a time. Urgent production defects can be fixed in the production branch without disturbing feature work, and later can be merged into the trunk.

For typical web applications, there is only one release branch, as there is only one version of the software running at a time. When multiple release versions must be supported, the production-branch model is strong, as multiple branches can be created. This can be useful on occasion even in web applications; for example, a large feature release can be staged as a "beta" to a subset of users. If the beta is long-lived, it is useful to create a branch so that development can continue independently.

This model is a slight deviation from the ordinary non-web software development model. In that model, features are developed in the trunk, stable work toward a release is kept in a branch, and finished releases are tagged by copying them to the tags directory. The Rails framework itself uses that model.

Feature branches

This is essentially the opposite of the production-branch model. One branch is created for each new feature to be deployed. The trunk is always expected to be stable and represents the latest stable version of the software.

Some prefer the feature-branch model over the production-branch model for web applications, as it compartmentalizes features and isolates them from one another during development and testing. It supports the single-deployment-environment paradigm, but it is difficult to support multiple releases under this model.

Developer branches

Again, the trunk is a stable codebase. Each developer has his own branch that he can use as his "sandbox" for developing and testing features. He will either merge code into the trunk himself or submit the changesets to be integrated by one person. Often, this is found in large teams, as it integrates well with a formal code review process.^[86]

If you have a large enough team that developer branches are necessary, you may find yourself passing around and manually applying patches way too often. In that situation, it may be worthwhile to consider moving to a distributed version control system such as Mercurial or Bazaar.

Of course, the appropriate model will vary from project to project. Do not feel constrained by these models. The trunk, branches, and tags directories are only the traditional conventions used by Subversion developers. You could just as easily set up features, production, and snapshots if it suited your fancy.

Mercurial branching and merging

Branching under distributed version control systems such as Mercurial is much more natural. Any Mercurial repository is automatically a branch, because any repository can pull changes from and push changes to any other repository, even between two different directories on the same filesystem. Thus, the standard branching method under Mercurial is to clone an entire project to a new directory, make the changes, and then use hg pull to retrieve and merge the changes from a branch when needed.

As an example, suppose we are changing an application's color scheme and want to branch to keep the color-related changes together while doing other development. First, we clone the trunk to a new feature branch:

	$ hg clone trunk trunk-newcolors
	47 files updated, 0 files merged, 0 files removed, 0 files unresolved

Now trunk-newcolors contains an identical copy of the trunk, including all history. We are going to make changes to trunk-newcolors, preview them, and then merge them back into trunk. We now make the appropriate changes to trunk-newcolors and commit them:

	$ cd trunk-newcolors/
	$ sed -i.bak -e 's/color: red/color: blue/g' public/stylesheets/main.css
	$ hg ci -m "Changed red to blue in main stylesheet"
	$ hg tip
	changeset:   1:18bb8b07ec40
	tag:         tip
	user:        Brad Ediger <brad.ediger@madriska.com>
	date:        Fri Oct 26 13:08:01 2007 -0500
	summary:     Changed red to blue in main stylesheet

We can preview this line of development for as long as we like, and then merge it back into trunk. To merge, we first pull the changes from trunk-newcolors into the trunk repository:

	$ cd ../trunk/
	$ hg pull ../trunk-newcolors
	pulling from ../trunk-newcolors
	searching for changes
	adding changesets
	adding manifests
	adding file changes
	added 1 changesets with 1 changes to 1 files (+1 heads)
	(run 'hg heads' to see heads, 'hg merge' to merge)

The "(+1 heads)" note indicates that the trunk has changed since we branched, so the two lines of development will need to be merged.

Tip

Mercurial requires an explicit merge step, even if the merge turns out to be trivial. In some cases, when you pull, you do not want to merge. An extension called FetchExtension provides an hg fetch command to automate the pull/merge/commit process in the case of trivial merges.

We use the hg heads command to see the two heads (two branches of development), one from our local repository at trunk and the other from trunk-newcolors. The merge step using hg merge is simple, and in this case, it is a trivial merge (without any conflicts). Had there been conflicts, hg merge would have attempted to find a three-way merge tool such as FileMerge or kdiff3 to help us resolve the changes. When the merge is complete and we have approved it, we need to commit the merge.

	$ hg heads
	changeset:  2:18bb8b07ec40
	tag:        tip
	parent:     0:65aca7b5860a
	user:       Brad Ediger <brad.ediger@madriska.com>
	date:       Fri Oct 26 13:08:01 2007 -0500
	summary:    Changed red to blue in main stylesheet

	changeset:  1:800424c888ed
	user:       Brad Ediger <brad.ediger@madriska.com>
	date:       Fri Oct 26 13:08:57 2007 -0500
	summary:    added another CSS class

	$ hg merge
	merging public/stylesheets/main.css
	0 files updated, 1 files merged, 0 files removed, 0 files unresolved
	(branch merge, don't forget to commit)
	$ hg ci -m "Merged"

The newly committed merge shows the two changesets from earlier as its parents:

	$ hg tip
	changeset:  3:5f98ca15ccbc
	tag:        tip
	parent:     1:800424c888ed
	parent:     2:18bb8b07ec40
	user:       Brad Ediger <brad.ediger@madriska.com>
	date:       Fri Oct 26 13:10:00 2007 -0500
	summary:    Merged

Often, cloning a repository in this way can be difficult. Rails applications can accumulate a good deal of configuration files (in particular, database.yml) that are not version controlled, and so must be recreated on each clone. There are a few ways around this:

hg clone is basically an atomic recursive copy when working between two repositories on the same filesystem. So, if you can be sure that the source repository will not change during the copy, the following two commands are roughly equivalent:
```
	$ hg clone trunk trunk-newcolors
	$ cp -R trunk trunk-newcolors
```
Of course, the latter has the advantage of preserving files that are not kept under revision control.
Mercurial keeps all of its revision control metadata, including the entire repository, in a single .hg directory under the project root. You can recursively copy this .hg directory over the .hg directory of another repository and then perform an hg update --clean from the target repository to update the working copy (which may contain extra, non-version-controlled files).

Mercurial also has support for named branches, which are separate branches of development within one repository. This support has been mature since version 0.9.4. However, named branches complicate certain aspects of using Mercurial, and they are a somewhat advanced feature. Named branches are preferable for long-lived development branches, while branching by cloning is still preferred for feature branches. Chapter 8 of Distributed Revision Control with Mercurial goes into detail about branching and merging (http://hgbook.red-bean.com/hgbookch8.html).

Database Migrations

When working with large Rails projects, especially those with multiple developers or feature branches, an issue that frequently comes up is synchronizing database migrations. Since Rails migrations are numbered sequentially in the order in which they are generated (with respect to the current project), the generate script will happily use a number that may have been in use elsewhere, in other versions of the project. This causes difficulty upon merging. The typical workflow is this:

The current migration version number in the trunk is 123. You branch the project for a new feature, and in the branch you generate a migration for the database support:
```
	[branches/feature]$ script/generate migration AddNewFeature 
	      exists  db/migrate 
	      create  db/migrate/124_add_new_feature.rb
```
You need to fix an issue in the trunk, so you create and apply a migration to trunk. It is created with version number 124, because the other version 124 is not visible yet:
```
	[trunk]$ script/generate migration BugFix 
	      exists  db/migrate 
	      create  db/migrate/124_bug_fix.rb
```
Upon merging, there are two migrations with version 124. These must be manually renumbered, which can be difficult if there were many migrations. The database must then be migrated down to the lowest migration common to both branches, and migrated back up. If the migrations are not fully reversible, the changes may have to be applied manually.
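A collision like this is easy to detect mechanically before it causes trouble. The following is a minimal sketch, not part of Rails: the duplicate_versions method is hypothetical, and 123_create_users.rb is an invented filename standing in for the existing trunk migration. It groups migration files by their numeric prefix and reports any number claimed more than once.

```ruby
# Group migration file names by their numeric prefix and report
# any version number that is claimed by more than one file.
def duplicate_versions(filenames)
  by_version = Hash.new { |hash, key| hash[key] = [] }
  filenames.each do |name|
    base = File.basename(name)
    match = base.match(/\A(\d+)_/)
    next unless match                  # skip files that are not migrations
    by_version[match[1].to_i] << base
  end
  by_version.select { |_version, names| names.size > 1 }
end

# In a real project the list would come from Dir.glob("db/migrate/*.rb").
merged = [
  "db/migrate/123_create_users.rb",    # hypothetical earlier migration
  "db/migrate/124_add_new_feature.rb",
  "db/migrate/124_bug_fix.rb"
]

duplicate_versions(merged).each do |version, names|
  puts "version #{version} is used by: #{names.join(', ')}"
end
```

A check along these lines could run as a Rake task or a pre-commit hook, so that the clash is caught at merge time rather than at deployment.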

This situation can also happen when there are multiple developers generating their own migrations. The solution for that situation is good communication: developers should always pull the db/migrate directory from the version control system immediately before generating a migration. Conversely, these migrations should be checked in as soon as is practical after generation, so all developers have access.

Unfortunately, when using branches, it is not generally possible to publish every schema change across all branches. If it were, a simple solution would be to set up a shared migrations directory in the version control repository, and import it via a svn:externals (or equivalent) declaration. In most cases, schema changes to separate branches must be kept separate; at the least, production databases should not be polluted with database changes for new features. So, another solution must be found. There are several schools of thought on how this should work.

The simplest solution, which is probably the most popular, is Courtenay's Independent Migrations plugin (http://blog.caboo.se/articles/2007/3/27/independent-migrations-plugin). The basic assumption is that migrations which are created in different branches or working copies are logically independent of each other. (If this assumption doesn't hold, you will have problems when merging, no matter how you slice it.)

After installing the plugin, simply tag your independent migrations as such by inheriting from ActiveRecord::IndependentMigration rather than ActiveRecord::Migration.

	class AddFeature < ActiveRecord::IndependentMigration 
	  # ...
	end

Multiple independent migrations will then be applied concurrently, so migrations can be merged without renumbering. However, this does not eliminate the need to migrate down and back up when several migrations have been applied to a database; the plugin will not search old version numbers (older than the current version) for new migrations.

My solution, Subverted Migrations,^[87] is more complicated, but it aims to be as transparent as possible once you understand it. As the name suggests, it only works with Subversion. The intent is to synchronize version numbers across all branches. That way, all developers and all branches have the same view of the migrations that have been applied project-wide. It applies two changes to the Rails version-numbering mechanism:

It serializes version numbers across all branches by scanning the Subversion repository for all branches to find a free version number.
It changes the semantics of the schema_info table: rather than being the number of the latest-applied migration, the schema version is a list of migrations that have been applied to the database. When older changes from other branches are merged in, a simple rake db:migrate applies them without the need to migrate down and up.
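The effect of the second change can be seen by comparing the two ways of deciding which migrations are pending. This sketch does not reproduce the plugin's code; the method names and version numbers are made up for illustration. With a single version number, a migration merged in with a lower number is silently skipped; with a list of applied versions, it is picked up.

```ruby
# Single version number: only migrations numbered above the
# current version are considered pending.
def pending_by_latest(available, current_version)
  available.select { |version| version > current_version }.sort
end

# List of applied versions: anything not yet applied is pending,
# even if its number is lower than the newest applied migration.
def pending_by_list(available, applied)
  (available - applied).sort
end

# Migration 125 was written and applied locally; 124 then arrived
# from another branch through a merge.
available = [121, 122, 123, 124, 125]
applied   = [121, 122, 123, 125]

puts pending_by_latest(available, applied.max).inspect   # 124 is missed
puts pending_by_list(available, applied).inspect         # 124 is found
```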

Of course, this only works if all developers promptly check in their new migrations, and if the migrations are truly independent from each other in the first place. The multiple-developer scenario always requires good communication. Another drawback of Subverted Migrations is that it requires access to the Subversion repository every time a migration is generated. The other solutions operate only with the working copy.

The last solution is François Beausoleil's Timestamped Migrations patch. This patches Rails to use UTC timestamps rather than simple version numbers. Like Subverted Migrations, this method changes the semantics of the schema_info table to reflect exactly which migrations have been applied. Timestamped Migrations is not available as a plugin, but only as a patch against edge Rails (http://blog.teksol.info/articles/search?q=timestamp).
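The idea behind timestamp-based versions can be sketched briefly. The patch's exact format is not described here, so the 14-digit YYYYMMDDHHMMSS layout below is an assumption, as is the timestamp_version method name. Because the version is derived from the moment of generation in UTC, two developers working in different branches (or time zones) get distinct versions without consulting each other.

```ruby
# Build a migration version from a UTC timestamp.
# The YYYYMMDDHHMMSS layout is an assumption made for this sketch.
def timestamp_version(time = Time.now)
  # getutc returns a new Time in UTC and leaves the argument untouched.
  time.getutc.strftime("%Y%m%d%H%M%S").to_i
end

# Two migrations generated a minute apart, in different branches and
# different time zones, still sort in the order they were created.
feature = timestamp_version(Time.new(2007, 10, 26, 13, 8, 1, "-05:00"))
bug_fix = timestamp_version(Time.new(2007, 10, 26, 18, 9, 1, "+00:00"))

puts feature
puts bug_fix
puts feature < bug_fix
```

Collisions remain possible if two migrations are generated within the same second, but that is far less likely than two branches picking the same sequential number.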