Refactoring

Note

Programmers

Every day, our code is slightly better than it was the day before.

Entropy always wins. Eventually, chaos turns your beautifully imagined and well-designed code into a big mess of spaghetti.

At least, that’s the way it used to be, before refactoring. Refactoring is the process of changing the design of your code without changing its behavior—what it does stays the same, but how it does it changes. Refactorings are also reversible; sometimes one form is better than another for certain cases. Just as you can change the expression x2 - 1 to (x + 1)(x - 1) and back, you can change the design of your code—and once you can do that, you can keep entropy at bay.

Refactoring enables an approach to design I call reflective design. In addition to creating a design and coding it, you can now analyze the design of existing code and improve it. One of the best ways to identify improvements is with code smells: condensed nuggets of wisdom that help you identify common problems in design.

A code smell doesn’t necessarily mean there’s a problem with the design. It’s like a funky smell in the kitchen: it could indicate that it’s time to take out the garbage, or it could just mean that Uncle Gabriel is cooking with a particularly pungent cheese. Either way, when you smell something funny, take a closer look.

[Fowler 1999], writing with Kent Beck, has an excellent discussion of code smells. It’s well worth reading. Here are a summary of the smells I find most often, including some that Fowler and Beck didn’t mention.[46]

Reflective design requires that you understand the design of existing code. The easiest way to do so is to ask someone else on the team. A conversation around a whiteboard design sketch is a great way to learn.

In some cases, no one on the team will understand the design, or you may wish to dive into the code yourself. When you do, focus on the responsibilities and interactions of each major component. What is the purpose of this package or namespace? What does this class represent? How does it interact with other packages, namespaces, and classes?

For example, NUnitAsp is a tool for unit testing ASP.NET code-behind logic. One of its classes is HttpClient, which you might infer makes calls to an HTTP (web) server—presumably an ASP.NET web server.

To confirm that assumption, look at the names of the class’ methods and instance variables. HttpClient has methods named GetPage, FollowLink, SubmitForm, and HasCookie, along with some USER_AGENT constants and several related methods and properties. In total, it seems pretty clear that HttpClient emulates a web browser.

Now expand that understanding to related elements. Which classes does this class depend on? Which classes depend on this one? What are their responsibilities? As you learn, diagram your growing understanding on a whiteboard.

Creating a UML sequence diagram[47] can be helpful for understanding how individual methods interact with the rest of the system. Start with a method you want to understand and look at each line in turn, recursively adding each called method to your sequence diagram. This is fairly time-consuming, so I only recommend it if you’re confused about how or why a method works.

Although these techniques are useful for understanding the design of unfamiliar code, you shouldn’t need them often. You should already have a good understanding of most parts of your software, or be able to talk to someone on the team who does. Unless you’re dealing with a lot of legacy code, you should rarely have trouble understanding the design of existing code: your team wrote it, after all. If you find yourself needing these techniques often, something is wrong—ask your mentor for help.

Reflective design helps you understand what to change; refactoring gives you the ability to make those changes. When you refactor, proceed in a series of small transformations. (Confusingly, each type of transformation is also called a refactoring.) Each refactoring is like making a turn on a Rubik’s cube. To achieve anything significant, you have to string together several individual refactorings, just as you have to string together several turns to solve the cube.

The fact that refactoring is a sequence of small transformations sometimes gets lost on people new to refactoring. Refactoring isn’t rewriting. You don’t just change the design of your code; to refactor well, you need to make that change in a series of controlled steps. Each step should only take a few moments, and your tests should pass after each one.

There are a wide variety of individual refactorings. [Fowler 1999]’s Refactoring is the classic work on the subject. It contains an in-depth catalog of refactoring, and is well worth studying—I learned more about good code from reading that book than from any other.

You don’t need to memorize all the individual refactorings. Instead, try to learn the mindset behind the refactorings. Work from the book in the beginning. Over time, you’ll learn how to refactor intuitively, creating each refactoring as you need it.

Note

Some IDEs offer automated refactorings. Although this is helpful for automating some tedious refactorings, learn how to refactor manually, too. There are many more refactoring options available to you than your IDE provides.

Any significant design change requires a sequence of refactorings. Learning how to change your design through a series of small refactorings is a valuable skill. Once you’ve mastered it, you can make dramatic design changes without breaking anything. You can even do this in pieces, by fixing part of the design one day and the rest of it another day.

To illustrate this point, I’ll show each step in the simple refactoring from my TDD example (see the TDD example in Test-Driven Development” earlier in this chapter). Note how small each step is. Working in small steps enables you to remain in control of the code, prevents confusion, and allows you to work very quickly.

Note

Don’t let the small scale of this example distract you from the main point: making changes in a series of small, controlled steps. For larger examples, see Further Reading” at the end of this section.

The purpose of this example was to create an HTTP query string parser. At this point, I had a working, bare-bones parser (see A TDD Example” earlier in this chapter). Here are the tests:

  public class QueryStringTest extends TestCase {

      public void testOneNameValuePair() {
          QueryString query = new QueryString("name=value");
          assertEquals(1, query.count());
          assertEquals("value", query.valueFor("name"));
      }

      public void testMultipleNameValuePairs() {
          QueryString query = new QueryString("name1=value1&name2=value2&name3=value3");
          assertEquals(3, query.count());
          assertEquals("value1", query.valueFor("name1"));
          assertEquals("value2", query.valueFor("name2"));
          assertEquals("value3", query.valueFor("name3"));
      }

      public void testNoNameValuePairs() {
          QueryString query = new QueryString("");
          assertEquals(0, query.count());
      }

      public void testNull() {
          try {
              QueryString query = new QueryString(null);
              fail("Should throw exception");
          }
          catch (NullPointerException e) {
              // expected
          }
      }
  }

The code worked—it passed all the tests—but it was ugly. Both the count() and valueFor() methods had duplicate parsing code. I wanted to eliminate this duplication and put parsing in just one place.

  public class QueryString {
      private String _query;

      public QueryString(String queryString) {
          if (queryString == null) throw new NullPointerException();
          _query = queryString;
      }

      public int count() {
          if ("".equals(_query)) return 0;
          String[] pairs = _query.split("&");
          return pairs.length;
      }

      public String valueFor(String name) {
          String[] pairs = _query.split("&");
          for (String pair : pairs) {
              String[] nameAndValue = pair.split("=");
              if (nameAndValue[0].equals(name)) return nameAndValue[1];
          }
          throw new RuntimeException(name + " not found");
      }
  }

To eliminate the duplication, I needed a single method that could do all the parsing. The other methods would work with the results of the parse rather than doing any parsing of their own. I decided that this parser, called from the constructor, would parse the data into a HashMap.

Although I could have done this in one giant step by moving the body of valueFor() into a parseQueryString() method and then hacking until the tests passed again, I knew from hard-won experience that it was faster to proceed in small steps.

My first step was to introduce HashMap() into valueFor(). This would make valueFor() look just like the parseQueryString() method I needed. Once it did, I could extract out parseQueryString() easily.

  public String valueFor(String name) {
      HashMap<String, String> map = new HashMap<String, String>();

      String[] pairs = _query.split("&");
      for (String pair : pairs) {
          String[] nameAndValue = pair.split("=");
          map.put(nameAndValue[0], nameAndValue[1]);
      }
      return map.get(name);
  }

After making this refactoring, I ran the tests. They passed.

Note

By removing the RuntimeException, I had changed the behavior of the code when the name was not found. That was OK because the RuntimeException was just an assertion about an unimplemented feature. I hadn’t yet written any tests around it. If I had, I would have changed the code to maintain the existing behavior.

Now I could extract the parsing logic into its own method. I used my IDE’s built-in Extract Method refactoring to do so.

  public String valueFor(String name) {
      HashMap<String, String> map = parseQueryString();
      return map.get(name);
  }

  private HashMap<String, String> parseQueryString() {
      HashMap<String, String> map = new HashMap<String, String>();

      String[] pairs = _query.split("&");
      for (String pair : pairs) {
          String[] nameAndValue = pair.split("=");
          map.put(nameAndValue[0], nameAndValue[1]);
      }
      return map;
  }

The tests passed again, of course. With such small steps, I’d be surprised if they didn’t. That’s the point: by taking small steps, I remain in complete control of my changes, which reduces surprises.

I now had a parseQueryString() method, but it was only available to valueFor(). My next step was to make it available to all methods. To do so, I created a _map instance variable and had parseQueryString() use it.

  public class QueryString {
      private String _query;
      private HashMap<String, String> _map = new HashMap<String, String>();

  ...

      public String valueFor(String name) {
          HashMap<String, String> map = parseQueryString();
          return map.get(name);
      }

      private HashMap<String, String> parseQueryString() {
          String[] pairs = _query.split("&");
          for (String pair : pairs) {
              String[] nameAndValue = pair.split("=");
              _map.put(nameAndValue[0], nameAndValue[1]);
          }
          return _map;
      }
  }

This is a trickier refactoring than it seems. Whenever you switch from a local variable to an instance variable, the order of operations can get confused. That’s why I continued to have parseQueryString() return _map, even though it was now available as an instance variable. I wanted to make sure this first step passed its tests before proceeding to my next step, which was to get rid of the unnecessary return.

  public class QueryString {
      private String _query;
      private HashMap<String, String> _map = new HashMap<String, String>();

  ...

      public String valueFor(String name) {
          parseQueryString();
          return _map.get(name);
      }

      private void parseQueryString() {
          String[] pairs = _query.split("&");
          for (String pair : pairs) {
              String[] nameAndValue = pair.split("=");
              _map.put(nameAndValue[0], nameAndValue[1]);
          }
      }
  }

The tests passed again.

Because parseQueryString() now stood entirely on its own, its only relationship to valueFor() was that it had to be called before valueFor()’s return statement. I was finally ready to achieve my goal of calling parseQueryString() from the constructor.

  public class QueryString {
      private String _query;
      private HashMap<String, String> _map = new HashMap<String, String>();

      public QueryString(String queryString) {
          if (queryString == null) throw new NullPointerException();
          _query = queryString;
          parseQueryString();
      }

  ...

      public String valueFor(String name) {
          return _map.get(name);
      }

  ...

  }

This seemed like a simple refactoring. After all, I moved only one line of code. Yet when I ran my tests, they failed. My parse method didn’t work with an empty string—a degenerate case that I hadn’t yet implemented in valueFor(). It wasn’t a problem as long as only valueFor() ever called parseQueryString(), but it showed up now that I called parseQueryString() in the constructor.

Note

This shows why taking small steps is such a good idea. Because I had only changed one line of code, I knew exactly what had gone wrong.

The problem was easy to fix with a guard clause.

  private void parseQueryString() {
      if ("".equals(_query)) return;

      String[] pairs = _query.split("&");
      for (String pair : pairs) {
          String[] nameAndValue = pair.split("=");
          _map.put(nameAndValue[0], nameAndValue[1]);
      }
  }

At this point, I was nearly done. The duplicate parsing in the count() method caused all of this mess, and I was ready to refactor it to use the _map variable rather than do its own parsing. It went from:

  public int count() {
      if ("".equals(_query)) return 0;
      String[] pairs = _query.split("&");
      return pairs.length;
  }

to:

  public int count() {
      return _map.size();
  }

I love it when I can delete code.

I reviewed the code and saw just one loose end remaining: the _query instance variable that stored the unparsed query string. I no longer needed it anywhere but parseQueryString(), so I demoted it from an instance variable to a parseQueryString() parameter.

  public class QueryString {
      private HashMap<String, String> _map = new HashMap<String, String>();

      public QueryString(String queryString) {
          if (queryString == null) throw new NullPointerException();
          parseQueryString(queryString);
      }

      public int count() {
          return _map.size();
      }

      public String valueFor(String name) {
          return _map.get(name);
      }

      private void parseQueryString(String query) {
          if ("".equals(query)) return;

          String[] pairs = query.split("&");
          for (String pair : pairs) {
              String[] nameAndValue = pair.split("=");
              _map.put(nameAndValue[0], nameAndValue[1]);
          }
      }
  }

When you compare the initial code to this code, there’s little in common. Yet this change took place as a series of small, careful refactorings. Although it took me a long time to describe the steps, each individual refactoring took a matter of seconds in practice. The whole series occurred in a few minutes.

How often should we refactor?

Constantly. Perform little refactorings as you use TDD and bigger refactorings every week. Every week, your design should be better than it was the week before. (See Incremental Design and Architecture” later in this chapter.)

Isn’t refactoring rework? Shouldn’t we design our code correctly from the beginning?

If it were possible to design your code perfectly from the beginning, then refactoring would be rework. However, as anybody who’s worked with large systems knows, mistakes always creep in. It isn’t possible to design software perfectly. That’s why refactoring is important. Rather than bemoan the errors in the design, celebrate your ability to fix them.

What about our database? That’s what really needs improvement.

You can refactor databases, too. Just as with normal refactorings, the trick is to proceed in small, behavior-preserving steps. For example, to rename a table, you can create a new table, copy the data from one to the next, update constraints, update stored procedures and applications, then delete the old table.[48] See Further Reading” at the end of this section for more.

How can we make large design changes without conflicting with other team members?

Take advantage of communication and continuous integration. Before taking on a refactoring that will touch a bunch of code, check in your existing code and let people know what you’re about to do. Sometimes other pairs can reduce the chance of integration conflicts by mirroring any renaming you’re planning to do. (IDEs with refactoring support make such renaming painless.)

During the refactoring, I like to use the distributed version control system SVK, built atop Subversion. It allows me to commit my changes to a local repository one at a time, then push all of them to the main repository when I reach the point of integration. This doesn’t prevent conflicts with other pairs, but it allows me to checkpoint locally, which reduces my need to disturb anyone else before I finish.

My refactoring broke the tests! They passed, but then we changed some code and now they fail. What happened?

It’s possible you made a mistake in refactoring, but if not, it could be a sign of poor tests. They might test the code’s implementation rather than its behavior. Undo the refactoring and take a look at the tests; you may need to refactor them.

Is refactoring tests actually worthwhile?

Absolutely! Too many developers forget to do this and find themselves maintaining tests that are brittle and difficult to change. Tests have to be maintained just as much as production code does, so they’re valid targets for refactoring, too.

Exercise caution when refactoring tests. It’s easier to unwittingly break a test than it is to break production code because you can make the test pass when it shouldn’t. I like to temporarily change my production code to make the tests fail just to show that they still can. Pair programming also helps.

Sometimes it’s valuable to leave more duplication in your tests than you would in the code itself. Tests have a documentation value, and reducing duplication and increasing abstraction can sometimes obscure the intent of the tests. This can be a tough judgment call—err on the side of eliminating duplication.

When you use refactoring as an everyday part of your toolkit, the code constantly improves. You make significant design changes safely and confidently. Every week, the code is at least slightly better than it was the week before.

Refactoring requires good tests. Without it, it’s dangerous because you can’t easily tell whether your changes have modified behavior. When I have to deal with untested legacy code, I often write a few end-to-end tests first to provide a safety net for refactoring.

Refactoring also requires collective code ownership. Any significant design changes will require that you change all parts of the code. Collective code ownership makes this possible. Similarly, refactoring requires continuous integration. Without it, each integration will be a nightmare of conflicting changes.

It’s possible to spend too much time refactoring. You don’t need to refactor code that’s unrelated to your present work. Similarly, balance your need to deliver stories with the need to have good code. As long as the code is better than it was when you started, you’re doing enough. In particular, if you think the code could be better, but you’re not sure how to improve it, it’s OK to leave it for someone else to improve later.

There is no real alternative to refactoring. No matter how carefully you design, all code accumulates technical debt. Without refactoring, that technical debt will eventually overwhelm you, leaving you to choose between rewriting the software (at great expense and risk) or abandoning it entirely.

“Clean Code: Args—A Command-line Argument Parser” [Martin 2005] is a rare treasure: a detailed walk-through of an extended refactoring. If you liked my refactoring example but want more, read this article. It’s online at http://www.objectmentor.com/resources/articles/Clean_Code_Args.pdf.

Refactoring: Improving the Design of Existing Code [Fowler 1999] is the definitive reference for refactoring. It’s also a great read. Buy it.

Refactoring to Patterns [Kerievsky] takes Fowler’s work one step further, showing how refactorings can string together to achieve significant design changes. It’s a good way to learn more about how to use individual refactorings to achieve big results.

Refactoring Databases: Evolutionary Database Design [Ambler & Sadalage] shows how refactoring can apply to database schemas.



[46] Wannabee Static, Time Dependency, Half-Baked Objects, and Coddling Nulls are new in this book.

[47] [Fowler & Scott] is a good resource for learning more about UML.

[48] These steps assume that the database isn’t live during the refactoring. A live refactoring would have a few more steps.