Fools ignore complexity. Pragmatists suffer it. Some can avoid it. Geniuses remove it.
—Alan Perlis
This chapter introduces several concepts that are related to deploying large applications in general, and Rails applications in particular. These are valuable concepts for any project, regardless of the framework being used.
For all but the tiniest of projects, version control is non-negotiable. Version control is like a time machine for a project; it aids in collaboration, troubleshooting, release management, and even systems administration. Even for a solo developer working on a small project on one workstation, the ability to go back in time across a codebase is one of the most valuable things to have.
There are two primary models for version control systems: centralized and decentralized. Though the former is the most widely known, the latter is steadily gaining in popularity and has some amazing capabilities.
Centralized version control is the most popular model, and perhaps the easiest to understand. In this model, there is a central repository, operated by the project administrators. This repository keeps a virtual filesystem and a history of the changes made to that filesystem over time.
Figure 10-1 illustrates the typical working model used for centralized development.
A developer follows this basic procedure to work with a version control system:
1. Create a working copy (a local copy of the code for development) by performing a checkout. This downloads the latest revision of the code.
2. Work on the code locally. Periodically issue update commands, which will retrieve any changes that have been made to the repository since the last checkout or update. These changes can usually be merged automatically, but sometimes manual intervention is required.
3. When the unit of work is complete, perform a commit to send the changes to the repository. Repeat from step 2, as you already have a working copy.
The Concurrent Versions System (CVS, http://www.nongnu.org/cvs/) is the oldest version control system still in common use. Although Subversion is generally favored as its replacement, CVS pioneered several features considered essential to centralized version control systems, such as the following:
Previous version control systems such as RCS required a developer to "check out" a file, locking it for update, and then check it in to release the lock. CVS introduced the copy-modify-merge model for text files. Under this model, many developers can work on different parts of the same file concurrently, merging the changes together upon commit.
The ability to run external scripts upon commit—to run tests, notify a team, start a build, or anything else.
CVS allows multiple concurrent branches for development. Vendor branches can pull code from unrelated projects; modules map symbolic names to groups of files for convenience.
One often-cited drawback to CVS is that it does not guarantee atomic commits. If interrupted while in process, commits can leave the working copy and repository in an inconsistent state. Most other version control systems provide an atomicity guarantee: commits are applied to the repository either in full or not at all.
In practice, this is more of an annoyance than a critical flaw. Important repositories should be backed up regularly, regardless of what version control system they are backed by. Still, this and other limitations make some developers uncomfortable. In response, many other version control systems have evolved out of the CVS model, and CVS is not used very much anymore for new projects.
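The atomicity guarantee described above can be illustrated with a toy in-memory repository. This is only a sketch of the all-or-nothing idea, not how any real version control system stores data; the `ToyRepository` class and its methods are hypothetical names invented for the example.

```ruby
# Toy in-memory repository illustrating all-or-nothing commits.
# ToyRepository and its methods are hypothetical names for this sketch.
class ToyRepository
  attr_reader :files, :revision

  def initialize
    @files = {}      # path => contents
    @revision = 0
  end

  # Apply every change to a private copy first; publish the copy only if
  # all changes succeed. A failure leaves @files and @revision untouched.
  def commit(changes)
    staged = @files.dup
    changes.each do |path, contents|
      raise ArgumentError, "invalid contents for #{path}" unless contents.is_a?(String)
      staged[path] = contents
    end
    @files = staged
    @revision += 1
  end
end

repo = ToyRepository.new
repo.commit({ "README" => "hello" })

begin
  # The second change is invalid, so the whole commit is rejected.
  repo.commit({ "a.rb" => "puts 1", "b.rb" => nil })
rescue ArgumentError
  # Nothing was published, including the valid change to a.rb.
end

puts repo.revision             # 1
puts repo.files.key?("a.rb")   # false
```

A non-atomic implementation would write each change straight into `@files`, so an interruption halfway through would leave `a.rb` committed without `b.rb`.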
Subversion (http://subversion.tigris.org/) is currently the most popular version control system among Rails developers. It was designed to be a replacement for CVS, and it has been very successful. Developers used to CVS will feel at home with Subversion's commands.
As a centralized version control system, Subversion uses one primary server that keeps a master repository. Developers check out a working copy, make their changes, and check them back in. By default, Subversion uses the copy-modify-merge model for concurrent development. Multiple people can check out the same file, make concurrent changes, and have their work merged. Non-overlapping changes will be merged automatically; conflicting changes must be merged by hand.
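The copy-modify-merge behavior described above can be sketched as a line-by-line three-way merge. This is a deliberately simplified illustration, not Subversion's actual algorithm: it assumes every edit changes a line in place, so all three versions have the same number of lines. The function name `three_way_merge` is hypothetical.

```ruby
# Line-by-line three-way merge. Assumes, for simplicity, that edits change
# lines in place, so base, mine, and theirs all have the same line count.
# Real tools also handle inserted and deleted lines.
def three_way_merge(base, mine, theirs)
  unless base.length == mine.length && base.length == theirs.length
    raise ArgumentError, "this sketch only handles same-length versions"
  end

  merged = []
  conflicts = []
  base.each_index do |i|
    if mine[i] == theirs[i]
      merged << mine[i]      # unchanged, or both sides made the same edit
    elsif mine[i] == base[i]
      merged << theirs[i]    # only the other developer changed this line
    elsif theirs[i] == base[i]
      merged << mine[i]      # only we changed this line
    else
      merged << base[i]      # both changed it differently: needs a human
      conflicts << i
    end
  end
  [merged, conflicts]
end

base  = ["color: red", "margin: 0", "padding: 0"]
alice = ["color: blue", "margin: 0", "padding: 0"]
bob   = ["color: red", "margin: 0", "padding: 4px"]

merged, conflicts = three_way_merge(base, alice, bob)
p merged      # ["color: blue", "margin: 0", "padding: 4px"]
p conflicts   # []
```

Had Alice and Bob both edited the first line differently, its index would appear in `conflicts`, which corresponds to the case that must be merged by hand.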
Files that cannot be merged (such as image files) can be locked for serialized access:
$ svn lock images/logo.png -m "Changing header color"
(work with logo.png...)
$ svn ci images/logo.png -m "Changed header to blue"
You can also use the svn:needs-lock property to designate that a file should be locked before editing. If a file marked with that property is checked out without a lock, the working copy version will be set as read-only to remind the developer to lock the file before changing it.
Subversion was designed as a replacement for CVS, and it improves on CVS in many ways:
Subversion has truly atomic commits; interrupted commits are guaranteed to leave the repository in a consistent state (though they may leave outstanding locks in the working copy).
Subversion supports constant-time branching using copy-on-write semantics for copies. Branches and tags are simply directories; they are not separate objects as in CVS.
Directories are tracked independently of the files they contain. Directories and files can be moved while retaining their version history.
Symbolic links can be stored in the repository and versioned as links.
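The copy-on-write branching mentioned in the list above can be sketched in a few lines. This is an illustration of the idea only, not Subversion's storage format; `CowTree` is a hypothetical name, and the sketch assumes the parent tree is not modified after it has been branched (Subversion avoids that problem because a copy refers to a fixed revision).

```ruby
# A tree is either a root snapshot or a cheap copy that records only the
# paths that differ from its parent. CowTree is a hypothetical name.
# Assumption: a parent is not written to after it has been branched.
class CowTree
  def initialize(parent = nil, files = {})
    @parent = parent
    @changes = files.dup   # only the paths written in this tree
  end

  # Creating a branch stores nothing but a reference to the parent,
  # so it takes the same time no matter how large the tree is.
  def branch
    CowTree.new(self)
  end

  def write(path, contents)
    @changes[path] = contents
  end

  # Look in this tree first, then fall back to the parent chain.
  def read(path)
    return @changes[path] if @changes.key?(path)
    @parent && @parent.read(path)
  end

  # Number of entries this tree stores itself.
  def own_size
    @changes.size
  end
end

trunk = CowTree.new(nil, { "app.rb" => "v1", "main.css" => "red" })
feature = trunk.branch
puts feature.own_size           # 0
feature.write("main.css", "blue")
puts feature.read("main.css")   # blue
puts feature.read("app.rb")     # v1
puts trunk.read("main.css")     # red
puts feature.own_size           # 1
```

The branch stores one entry after one change, which mirrors the point that the cost of a branch grows with its difference from the parent rather than with the size of the tree.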
Subversion provides the best fit for many developers, especially in the open source world. Many projects have migrated from CVS to Subversion over the past few years. Subversion is successful in large part because it strikes a good balance between features and ease of use.
One drawback of Subversion is that it can be difficult to build the server from source because of its dependencies. It is built on top of APR (the Apache Portable Runtime), a portability layer for network applications. Although the basic dependencies are included for an svnserve installation, you may run into difficulty if you want to use Apache as a Subversion server. However, once you have the dependencies in order, building the server is straightforward.
Centralized version control has some drawbacks, especially when working in larger teams. The central server and repository can become a bottleneck, especially when dealing with many developers, as in large open source projects. A new paradigm, decentralized version control, is attempting to fix some of these issues. Though it is not widely used among Rails developers, it is worth knowing about, as it is extremely useful for certain situations.
In contrast to the hierarchical structure of centralized versioning systems, decentralized systems provide a more egalitarian approach. (It's the cathedral versus the bazaar, if you will.) Rather than having many working copies that all must communicate their changes back to the repository, each working copy is in fact a full repository. Any of the local repositories can pull and push changes to and from each other, and changesets destined for production will ultimately be pushed back to the authoritative repository.
In fact, the only thing that designates a repository as authoritative is community support: project administrators set up a repository as the master and publish its network address. Developers can pull changes from any repository that decides to publish them. Interestingly, this parallels the meritocracy inherent in open source software: your worth is measured by how much you contribute and how many people listen to you.
The decentralized development model can be more complicated to learn, but it is much more flexible, especially with large projects that have many contributors. The Mercurial wiki gives an example of distributed development best practices based on the development style of the Linux kernel.[84]
At first glance, a distributed workflow might look fairly similar to a centralized one. In fact, a decentralized version control system can be used as a centralized system; its functionality is a superset of that of centralized systems. Using Mercurial, a developer can "check out" a codebase (hg clone), make modifications, update from the repository (hg pull; hg merge), and check in (hg commit). This process is illustrated in Figure 10-2.
There is a slight difference here from the centralized paradigm, in that the pull and merge steps are separate. Mercurial gives the developer complete control over the local repository and working copy, so merges do not take place unless requested.
The real power comes from the ability to synchronize repositories. Changesets can be pulled from any repository, not just the master. So, if Bob developed a feature that Alice needs to test, Alice can pull it directly from him, merge it into her repository, and test it before committing it to the master. This is most commonly done today with centralized systems using diff and patch, but distributed systems formalize this method. The process looks something like Figure 10-3.
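The Bob-and-Alice exchange above can be modeled as repositories that copy missing changesets from one another. This is a minimal sketch of the synchronization idea, not Mercurial's implementation; the `PeerRepo` class and the changeset ids are hypothetical.

```ruby
# Each repository holds its own set of changesets, keyed by id.
# PeerRepo and the changeset ids are hypothetical names for this sketch.
class PeerRepo
  attr_reader :name, :changesets

  def initialize(name, changesets = {})
    @name = name
    @changesets = changesets.dup
  end

  def commit(id, summary)
    @changesets[id] = summary
  end

  # Copy over every changeset we do not have yet; return the new ids.
  def pull(other)
    missing = other.changesets.keys - @changesets.keys
    missing.each { |id| @changesets[id] = other.changesets[id] }
    missing
  end

  # Pushing is pulling seen from the other side.
  def push(other)
    other.pull(self)
  end
end

master = PeerRepo.new("master", { "c0" => "initial import" })
bob    = PeerRepo.new("bob", master.changesets)
alice  = PeerRepo.new("alice", master.changesets)

bob.commit("c1", "new feature")
p alice.pull(bob)          # ["c1"]
p master.changesets.keys   # ["c0"]
p alice.push(master)       # ["c1"]
```

Alice receives Bob's changeset without the master being involved, and the master sees it only once someone pushes it there.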
One of the most compelling features of decentralized version control is its compatibility with offline development. With a centralized system, the developer must be able to contact the server whenever he wants to check code in. Under the decentralized model, a developer can check in code to his local repository on his laptop in the Bahamas, and then push all of the changesets at once to the authoritative repository when he has an Internet connection. This keeps the changesets clean and focused, while not requiring a connection to the main repository on every commit. In effect, this method creates a hierarchy of repositories (see Figure 10-4).
The primary technical drawback to distributed systems (other than their complexity) is that each working copy is a full repository. Because each repository contains a full change history, a checkout of a large or often-changing system can be quite large. As an example, the Linux kernel source code is around 50 MB (bzipped), but a git clone checkout of the same source (with history) transfers hundreds of megabytes across the network.
In large software development projects, there is usually a need to keep multiple lines of development separate. This need exists for a few reasons:
Ongoing feature development will take place almost immediately after a release is issued. If the release is buggy, the developers need a mechanism to fix the version that was released without introducing any of the changes that were introduced since the release.
A development team will often work on multiple features concurrently. It would be a nightmare if each developer had to ensure that his half-developed, half-tested feature worked with another developer's half-developed, half-tested feature every time he checked in code.
When creating a release for public consumption, there is often a period of testing and evaluation. If the entire development team were frozen during this test period, it would be very hard to get anything done.
A team may offer support for multiple versions of the software at the same time, in effect making public their branching system. Bugfixes and occasionally features must be backported to old releases of the software.
Most version control systems offer flexible branching and merging support. A branch is an independent line of development that can be developed on its own and merged back into the trunk.
Subversion does not actually have a built-in branching or tagging mechanism as such; all branches and tags are simply copies of part of the directory tree. Subversion creates cheap copies using copy-on-write semantics; data is written to disk only when the copy is actually changed. The amount of extra information required to maintain a branch is roughly proportional to the difference between the branch and its parent.
This characteristic has some drawbacks, though. Subversion 1.4 has very primitive merging support. It does not keep track of when branches were created or merged, and does not prevent a change from being applied twice. Most developers who do at least a moderate amount of merging use svnmerge.py,[85] which keeps track of this metadata in Subversion properties.
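The merge-tracking problem described above comes down to remembering which revisions have already been merged. The sketch below shows that bookkeeping in isolation. It is not how svnmerge.py is implemented, only the idea behind it; `MergeLog` is a hypothetical name and the revision numbers are made up.

```ruby
require "set"

# Remembers which source revisions have been merged into a branch, so the
# same change is never applied twice. MergeLog is a hypothetical name.
class MergeLog
  def initialize
    @merged = Set.new
  end

  # Return the revisions that still need merging, and record them as merged.
  def merge(revisions)
    pending = revisions.reject { |rev| @merged.include?(rev) }
    @merged.merge(pending)
    pending
  end
end

log = MergeLog.new
p log.merge([101, 102, 103])   # [101, 102, 103]
p log.merge([102, 103, 104])   # [104]
```

Without such a record, the second merge would reapply revisions 102 and 103.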
There are many different paradigms for how branches are used. Here are some of the most common ones for web development:
In the production-branch model, the trunk is used for ongoing development. When a feature is fully developed and tested, it is merged into the production branch and deployed. This style is well suited to web applications, which tend to have a single development team working on one feature package at a time. Urgent production defects can be fixed in the production branch without disturbing feature work, and later can be merged into the trunk.
For typical web applications, there is only one release branch, as there is only one version of the software running at a time. When multiple release versions must be supported, the production-branch model is strong, as multiple branches can be created. This can be useful on occasion even in web applications; for example, a large feature release can be staged as a "beta" to a subset of users. If the beta is long-lived, it is useful to create a branch so that development can continue independently.
This model is a slight deviation from the ordinary non-web software development model. In that model, features are developed in the trunk, stable work toward a release is kept in a branch, and finished releases are tagged by copying them to the tags directory. The Rails framework itself uses that model.
The feature-branch model is essentially the opposite of the production-branch model. One branch is created for each new feature to be deployed. The trunk is always expected to be stable and represents the latest stable version of the software.
Some prefer the feature-branch model over the production-branch model for web applications, as it compartmentalizes features and isolates them from one another during development and testing. It supports the single-deployment-environment paradigm, but it is difficult to support multiple releases under this model.
In the developer-branch model, the trunk is again a stable codebase. Each developer has his own branch that he can use as his "sandbox" for developing and testing features. He will either merge code into the trunk himself or submit the changesets to be integrated by one person. Often, this is found in large teams, as it integrates well with a formal code review process.[86]
If you have a large enough team that developer branches are necessary, you may find yourself passing around and manually applying patches way too often. In that situation, it may be worthwhile to consider moving to a distributed version control system such as Mercurial or Bazaar.
Of course, the appropriate model will vary from project to project. Do not feel constrained by these models. The trunk, branches, and tags directories are only the traditional conventions used by Subversion developers. You could just as easily set up features, production, and snapshots if it suited your fancy.
Branching under distributed version control systems such as Mercurial is much more natural. Any Mercurial repository is automatically a branch, because any repository can pull changes from and push changes to any other repository, even between two different directories on the same filesystem. Thus, the standard branching method under Mercurial is to clone an entire project to a new directory, make the changes, and then use hg pull to retrieve and merge the changes from a branch when needed.
As an example, suppose we are changing an application's color scheme and want to branch to keep the color-related changes together while doing other development. First, we clone the trunk to a new feature branch:
$ hg clone trunk trunk-newcolors
47 files updated, 0 files merged, 0 files removed, 0 files unresolved
Now trunk-newcolors contains an identical copy of the trunk, including all history. We are going to make changes to trunk-newcolors, preview them, and then merge them back into trunk. We now make the appropriate changes to trunk-newcolors and commit them:
$ cd trunk-newcolors/
$ sed -ie 's/color: red/color: blue/g' public/stylesheets/main.css
$ hg ci -m "Changed red to blue in main stylesheet"
$ hg tip
changeset:   1:18bb8b07ec40
tag:         tip
user:        Brad Ediger <brad.ediger@madriska.com>
date:        Fri Oct 26 13:08:01 2007 -0500
summary:     Changed red to blue in main stylesheet
We can preview this line of development for as long as we like, and then merge it back into trunk. To merge, we first pull the changes from trunk-newcolors into the trunk repository:
$ cd ../trunk/
$ hg pull ../trunk-newcolors
pulling from ../trunk-newcolors
searching for changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 1 changes to 1 files (+1 heads)
(run 'hg heads' to see heads, 'hg merge' to merge)
The "(+1 heads)" note indicates that there have been changes to the trunk since the branch was made, so we will need to merge.
Mercurial requires an explicit merge step, even if the merge turns out to be trivial. In some cases, when you pull, you do not want to merge. An extension called FetchExtension provides an hg fetch command to automate the pull/merge/commit process in the case of trivial merges.
We use the hg heads command to see the two heads (two branches of development), one from our local repository at trunk and the other from trunk-newcolors. The merge step using hg merge is simple, and in this case, it is a trivial merge (without any conflicts). Had there been conflicts, hg merge would have attempted to find a three-way merge tool such as FileMerge or kdiff3 to help us resolve the changes. When the merge is complete and we have approved it, we need to commit the merge.
$ hg heads
changeset:   2:18bb8b07ec40
tag:         tip
parent:      0:65aca7b5860a
user:        Brad Ediger <brad.ediger@madriska.com>
date:        Fri Oct 26 13:08:01 2007 -0500
summary:     Changed red to blue in main stylesheet

changeset:   1:800424c888ed
user:        Brad Ediger <brad.ediger@madriska.com>
date:        Fri Oct 26 13:08:57 2007 -0500
summary:     added another CSS class

$ hg merge
merging public/stylesheets/main.css
0 files updated, 1 files merged, 0 files removed, 0 files unresolved
(branch merge, don't forget to commit)
$ hg ci -m "Merged"
The newly committed merge shows the two changesets from earlier as its parents:
$ hg tip
changeset:   3:5f98ca15ccbc
tag:         tip
parent:      1:800424c888ed
parent:      2:18bb8b07ec40
user:        Brad Ediger <brad.ediger@madriska.com>
date:        Fri Oct 26 13:10:00 2007 -0500
summary:     Merged
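The notion of a "head" in the session above can be made concrete with a small model of the changeset graph. This is a sketch for illustration, not Mercurial's data structure: the hash maps each local revision number from the session to its parent revisions, and the `heads` function is a hypothetical helper.

```ruby
# Changeset graph from the session above: revision number => parent revisions.
# A head is a changeset that no other changeset names as a parent.
def heads(parents_of)
  all_parents = parents_of.values.flatten
  parents_of.keys.reject { |rev| all_parents.include?(rev) }.sort
end

before_merge = { 0 => [], 1 => [0], 2 => [0] }
after_merge  = before_merge.merge({ 3 => [1, 2] })

p heads(before_merge)   # [1, 2]
p heads(after_merge)    # [3]
```

Before the merge, revisions 1 and 2 both descend from revision 0 and neither has children, so there are two heads. Committing the merge creates revision 3 with both as parents, leaving a single head again.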
Often, cloning a repository in this way can be difficult. Rails applications can accumulate a good deal of configuration files (in particular, database.yml) that are not version controlled, and so must be recreated on each clone. There are a few ways around this:
hg clone is basically an atomic recursive copy when working between two repositories on the same filesystem. So, if you can be sure that the source repository will not change during the copy, the following two commands are roughly equivalent:
$ hg clone trunk trunk-newcolors
$ cp -R trunk trunk-newcolors
Of course, the latter has the advantage of preserving files that are not kept under revision control.
Mercurial keeps all of its revision control metadata, including the entire repository, in a single .hg directory under the project root. You can recursively copy this .hg directory over the .hg directory of another repository and then perform an hg update --clean from the target repository to update the working copy (which may contain extra, non-version-controlled files).
Mercurial also has support for named branches, which are separate branches of development within one repository. This support has been mature since version 0.9.4. However, named branches complicate certain aspects of using Mercurial, and they are a somewhat advanced feature. Named branches are preferable for long-lived development branches, while branching by cloning is still preferred for feature branches. Chapter 8 of Distributed Revision Control with Mercurial goes into detail about branching and merging (http://hgbook.red-bean.com/hgbookch8.html).
When working with large Rails projects, especially those with multiple developers or feature branches, an issue that frequently comes up is synchronizing database migrations. Since Rails migrations are numbered sequentially in the order in which they are generated (with respect to the current project), the generate script will happily use a number that may have been in use elsewhere, in other versions of the project. This causes difficulty upon merging. The typical workflow is this:
The current migration version number in the trunk is 123. You branch the project for a new feature, and in the branch you generate a migration for the database support:
[branches/feature]$ script/generate migration AddNewFeature
      exists  db/migrate
      create  db/migrate/124_add_new_feature.rb
You need to fix an issue in the trunk, so you create and apply a migration to trunk. It is created with version number 124, because the other version 124 is not visible yet:
[trunk]$ script/generate migration BugFix
      exists  db/migrate
      create  db/migrate/124_bug_fix.rb
Upon merging, there are two migrations with version 124. These must be manually renumbered, which can be difficult if there were many migrations. The database must then be migrated down to the lowest migration common to both branches, and migrated back up. If the migrations are not fully reversible, the changes may have to be applied manually.
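The collision described in the workflow above is easy to detect mechanically after a merge. The following is a minimal sketch, not part of Rails: `duplicate_versions` is a hypothetical helper, and the first filename in the sample list is made up to stand in for an earlier migration.

```ruby
# Group migration filenames by their numeric prefix and report any version
# number that is used by more than one file.
def duplicate_versions(filenames)
  by_version = filenames.group_by { |name| File.basename(name)[/\A\d+/].to_i }
  by_version.select { |_version, names| names.size > 1 }
end

merged = [
  "db/migrate/123_create_users.rb",      # hypothetical earlier migration
  "db/migrate/124_add_new_feature.rb",   # generated in the feature branch
  "db/migrate/124_bug_fix.rb"            # generated in the trunk
]

p duplicate_versions(merged).keys   # [124]
```

In a real project the list could come from `Dir["db/migrate/*.rb"]`, and a check like this could run before a merge is committed.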
This situation can also happen when there are multiple developers generating their own migrations. The solution for that situation is good communication: developers should always pull the db/migrate directory from the version control system immediately before generating a migration. Conversely, these migrations should be checked in as soon as is practical after generation, so all developers have access.
Unfortunately, when using branches, it is not generally possible to publish every schema change across all branches. If it were, a simple solution would be to set up a shared migrations directory in the version control repository, and import it via a svn:externals (or equivalent) declaration.
In most cases, schema changes to separate branches must be kept separate; at the least, production databases should not be polluted with database changes for new features. So, another solution must be found. There are several schools of thought on how this should work.
The simplest solution, which is probably the most popular, is Courtenay's Independent Migrations plugin (http://blog.caboo.se/articles/2007/3/27/independent-migrations-plugin). The basic assumption is that migrations which are created in different branches or working copies are logically independent of each other. (If this assumption doesn't hold, you will have problems when merging, no matter how you slice it.)
After installing the plugin, simply tag your independent migrations as such by inheriting from ActiveRecord::IndependentMigration rather than ActiveRecord::Migration.
class AddFeature < ActiveRecord::IndependentMigration
  # ...
end
Multiple independent migrations will then be applied concurrently, so migrations can be merged without renumbering. However, this does not eliminate the need to migrate down and back up when several migrations have been applied to a database; the plugin will not search old version numbers (older than the current version) for new migrations.
My solution, Subverted Migrations,[87] is more complicated, but it aims to be as transparent as possible once you understand it. As the name suggests, it only works with Subversion. The intent is to synchronize version numbers across all branches. That way, all developers and all branches have the same view of the migrations that have been applied project-wide. It applies two changes to the Rails version-numbering mechanism:
It serializes version numbers across all branches by scanning the Subversion repository for all branches to find a free version number.
It changes the semantics of the schema_info table: rather than being the number of the latest-applied migration, the schema version is a list of migrations that have been applied to the database. When older changes from other branches are merged in, a simple rake db:migrate applies them without the need to migrate down and up.
Of course, this only works if all developers promptly check in their new migrations, and if the migrations are truly independent from each other in the first place. The multiple-developer scenario always requires good communication. Another drawback of Subverted Migrations is that it requires access to the Subversion repository every time a migration is generated. The other solutions operate only with the working copy.
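The difference between storing a single schema version and storing the list of applied migrations, as described for Subverted Migrations above, can be shown with plain integers. This is a sketch of the two bookkeeping strategies only, not code from the plugin; both function names and the version numbers are hypothetical.

```ruby
# Compare two ways of deciding which migrations still need to run.
# Version numbers are plain integers here.

# Single-version tracking: only the latest-applied number is stored, so
# anything numbered at or below it is assumed to be applied already.
def pending_by_version(available, current_version)
  available.select { |v| v > current_version }.sort
end

# List-based tracking: every applied migration is recorded individually.
def pending_by_set(available, applied)
  (available - applied).sort
end

# Migration 124 arrived late, merged in from another branch after 125
# had already been applied to this database.
available = [121, 122, 123, 124, 125]

p pending_by_version(available, 125)                # []
p pending_by_set(available, [121, 122, 123, 125])   # [124]
```

With a single version number, migration 124 is silently skipped unless the database is migrated down and back up. With the list, it is simply reported as pending.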
The last solution is François Beausoleil's Timestamped Migrations patch. This patches Rails to use UTC timestamps rather than simple version numbers. Like Subverted Migrations, this method changes the semantics of the schema_info table to reflect exactly which migrations have been applied. Timestamped Migrations is not available as a plugin, but only as a patch against edge Rails (http://blog.teksol.info/articles/search?q=timestamp).
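The timestamp idea can be sketched as follows. The YYYYMMDDHHMMSS format and the helper name `timestamped_migration_filename` are assumptions made for this illustration; the exact format used by the patch is not stated here.

```ruby
# Build a migration filename from a UTC timestamp instead of a sequence
# number. The YYYYMMDDHHMMSS format is an assumption for this sketch.
def timestamped_migration_filename(name, time = Time.now)
  version = time.getutc.strftime("%Y%m%d%H%M%S")
  "db/migrate/#{version}_#{name}.rb"
end

trunk_time  = Time.utc(2007, 10, 26, 18, 8, 1)
branch_time = Time.utc(2007, 10, 26, 18, 8, 57)

puts timestamped_migration_filename("bug_fix", trunk_time)
# db/migrate/20071026180801_bug_fix.rb
puts timestamped_migration_filename("add_new_feature", branch_time)
# db/migrate/20071026180857_add_new_feature.rb
```

Two migrations generated in different branches collide only if they are created in the same second, so merging rarely requires renumbering.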
[86] Google uses a similar method, without explicit developer branches, for its internal development. They use NFS, Perforce, and a code review tool called Mondrian developed by Guido van Rossum.