Chapter 2.1

P.11: Encapsulate messy constructs, rather than spreading through the code

In the course of this chapter we’ll learn about encapsulation, and also about the closely related concepts of information hiding and abstraction. These three are often used interchangeably, usually at the cost of clarity. Before we start, we want to set the scene by building a parser. We will return to these three concepts in a moment.

All in one gulp

One of the qualities that marks out a great engineer is the ability to notice when things start to get out of hand. We are all excellent at spotting when things have got completely out of hand: who has not started a code review with the phrase, “There’s quite a lot going on here and I found it hard to follow”? Sadly, that skill rarely stretches to anticipating the bloom of messy code.

Let’s walk through an example. Early in the execution of my program, I want to read in some options from an external file identified on the command line. They are declared as key-value pairs. There are only a dozen possible options, but being a smart engineer, I decide to create a separate function for doing this. I am going to call it parse_options_file. It will take a filename which I will pull out of the command line. If none is declared, I will not call the function. This leads me to the function signature:

void parse_options_file(const char*);

The function body is simple: open the file, read each option line by line, and update the state appropriately, until we reach the end of the file. It will probably look something like this:

void parse_options_file(const char* filename)
{
  auto options = std::ifstream(filename);
  if (!options.good()) return;
  auto key = std::string{};
  auto value = std::string{};
  // Testing the extraction itself, rather than eof(), avoids processing
  // a spurious empty pair after the last line of the file.
  while (options >> key >> value)
  {
    if (key == "debug")
    {
      debug = (value == "true");
    }
    else if (key == "language")
    {
      language = value;
    }
    // and so on...
  }
}

This is great! I can easily add new options in one place and withdraw old options without any hassle. If the file contains invalid options, I can just notify the user in the final else clause.

A couple of days later, a new option is defined whose value could be several words. That’s fine, I can just read until the end of the line in that option. I’ll create a 500-character buffer and make a string from it. That will look like this:

else if (key == "tokens")
{
  char buf[500];
  options.getline(&buf[0], 500);
  tokens = std::string{buf};
}

I totally rock. I’ve come up with a simple, flexible way of parsing options. I can just add code for each option! Great!

The following week, a colleague tells me that some of the tokens are only valid if the user has set the debug flag. That’s also fine: I set the debug flag at the top of the function, so I can query it later in the function and be sure that I only apply tokens when they’re applicable. Mind you, I’m keeping track of state now, and that is something to be mindful of.

Mmm...

The next month, incorrect preferences files are causing a stir. I am asked to apply the preferences only if they are all valid. I sigh. That is fine. I can create a preferences object containing all the new state and return that if it is valid. std::optional will come in useful here. I go to modify the function and discover that my lovely, neat, tidy, elegant, beautiful, special function has been the target of considerable interest as other engineers have thrown their preferences tokens into the mix. There are now 115 preferences being interrogated. That is going to be quite the maintenance issue, but that is fine, it is just a set of values that I am going to set in the function and then transfer piecewise at the call site...
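For the record, a minimal sketch of that std::optional idea might look like the following. The preferences struct, its two members, and the name parse_preferences are assumptions made for illustration; the real function would have to carry every one of those 115 values.

```cpp
#include <istream>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical aggregate of the state that the options file can set.
struct preferences {
  bool debug = false;
  std::string language = "en";
};

// Returns the parsed preferences only if every pair is valid.
// On any invalid pair it returns std::nullopt, so the caller applies nothing.
std::optional<preferences> parse_preferences(std::istream& in) {
  auto result = preferences{};
  auto key = std::string{};
  auto value = std::string{};
  while (in >> key >> value) {
    if (key == "debug") {
      if (value != "true" && value != "false") return std::nullopt;
      result.debug = (value == "true");
    } else if (key == "language") {
      result.language = value;
    } else {
      return std::nullopt;  // unknown key: reject the whole file
    }
  }
  return result;
}
```

The call site would test the returned optional and copy the members across only when it holds a value, which is exactly the piecewise transfer described above.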

Stop. Just stop. Look at yourself. You have a 600-line function full of lots of state and a bajillion conditional statements. Are you really the new Donald Knuth?1 What happened here?

1. https://en.wikipedia.org/wiki/Donald_Knuth

You have created, or allowed the creation of, a messy construct. A single function, many screens long, several tabs deep, and still steadily growing is the very definition of such a thing. This function has suffered from appalling scope creep and you need to find some way of effecting scope retreat before it all comes crashing down around you: you know full well that it’s only a matter of time before the bugs start pouring in and you are enslaved to the constant maintenance of this beast. You must plead for time to refactor before disaster engulfs the codebase and your career.

What it means to encapsulate a messy construct

We mentioned encapsulation, information hiding, and abstraction at the start of this item. As promised, we will take a look at these concepts now.

Encapsulation is the process of enclosing one or more things into a single entity. Confusingly, that entity is called an encapsulation. C++ offers a number of encapsulation mechanisms. The class is the most obvious: take some data and some functions, wrap them up in a pair of braces, and put class (or struct) and an identifier at the front. Of course, there is also the enumeration: take a bunch of constants, wrap them up in a pair of braces, and put enum and an identifier at the front. Function definitions are a form of encapsulation: take a bunch of instructions, wrap them up in a pair of braces, and put an identifier and a pair of parentheses, optionally containing parameters, at the front.

How about namespaces? Take a bunch of definitions and declarations, wrap them up in a pair of braces, and put namespace and an optional identifier on the outside. Source files work in a similar way: take a bunch of definitions and declarations, put them in a file, and save it to your file system with a name. Modules are the first new encapsulation mechanism in a long time. These work in a similar way to source files: take a bunch of definitions and declarations, put them in a file, add the export keyword at the top, and save it to your file system with a name.
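To make the gathering concrete, here is a small sketch that uses four of these mechanisms at once. Every name in it (options, log_level, settings, describe) is invented for illustration and does not come from the parser we are building.

```cpp
#include <string>

namespace options {  // namespace: gathers related definitions and declarations

// enumeration: gathers a bunch of constants
enum class log_level { quiet, normal, verbose };

// class (here spelled struct): gathers data and functions
struct settings {
  bool debug = false;
  log_level level = log_level::normal;
  bool is_noisy() const { return debug || level == log_level::verbose; }
};

// function: gathers a bunch of instructions behind an identifier
inline std::string describe(settings const& s) {
  return s.is_noisy() ? "noisy" : "calm";
}

}  // namespace options
```

Note that nothing here is hidden: every member of settings is public, which is the point of the distinction drawn next.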

Encapsulation is only part of the story, as anyone with any experience of modules will tell you. All we have done in each of these examples is gathered things together and named them as a single entity. If we are smart, we will have gathered related things together. Information hiding is a more subtle activity, requiring you to make more careful decisions. In addition to gathering, you must decide which items you are going to reveal to the outside world, and which you are going to hide. Information hiding implies that some encapsulation is taking place, but encapsulation does not imply that information hiding is taking place.

Some of the encapsulation mechanisms of C++ support information hiding. The class offers us access levels. Members within the private implementation are hidden from clients of the class. This is how we relieve clients of the burden of enforcing class invariants: by hiding the implementation and thus preventing clients from breaking them. The enumeration offers no information hiding: there is no way of exposing only a few members of an enumeration. Functions hide information perfectly, by merely exposing an identifier, parameter types, and a return type, while hiding away the implementation. Namespaces can expose declarations and hide definitions by distributing over more than one file. Header and source files do the same thing, as do modules.
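As a sketch of how access levels protect an invariant, consider a hypothetical language_setting class. The class, its member names, and the two-lowercase-letter rule are assumptions chosen purely for illustration.

```cpp
#include <string>

// Invariant: code() always returns exactly two lowercase letters.
class language_setting {
public:
  // Returns false and leaves the stored value untouched if the code is malformed.
  bool set(std::string const& code) {
    if (!is_valid(code)) return false;
    code_ = code;
    return true;
  }
  std::string const& code() const { return code_; }

private:
  static bool is_valid(std::string const& code) {
    return code.size() == 2 &&
           code[0] >= 'a' && code[0] <= 'z' &&
           code[1] >= 'a' && code[1] <= 'z';
  }
  std::string code_ = "en";  // hidden: clients cannot write to this directly
};
```

Because code_ is private, the only route to changing it is through set, so no client can break the invariant, even by accident.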

Consider the problem at hand. How will encapsulation help us? We have lots and lots of options being handled in a single function. Perhaps we can have a different function for each option. We then call the correct function within the if statement that checks the key. The function could return a bool depending on whether or not the parameter data was valid.

This looks good: we have encapsulated all the different options in their own function and we can easily add further functions for new options; we just need to grow the parsing function for each one. We can even capture the return value to validate the options file. We still have to create an object which can apply the options if they are all valid, so we’ll need to update that when we add a new option, but that’s an easy thing to document, and anyway, other engineers will get the pattern when they look at other examples of these option functions.
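A sketch of this function-per-option arrangement might look like the following. The two globals stand in for the program state used in the earlier listing, the validation rules are invented, and apply_option is a hypothetical name for the dispatch that would live inside the parsing loop.

```cpp
#include <string>

// Stand-ins for the program state set by the original function.
bool debug = false;
std::string language = "en";

// One function per option; each reports whether its value was valid.
bool set_debug(std::string const& value) {
  if (value != "true" && value != "false") return false;
  debug = (value == "true");
  return true;
}

bool set_language(std::string const& value) {
  if (value.empty()) return false;
  language = value;
  return true;
}

// The if statement that checks the key, reduced to pure dispatch.
bool apply_option(std::string const& key, std::string const& value) {
  if (key == "debug") return set_debug(value);
  else if (key == "language") return set_language(value);
  return false;  // unknown key
}
```

Each new option still costs a new function and a new branch in apply_option, which is the weakness examined next.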

Mmm…

Your Spidey-Sense should be kicking off about the places where this can go wrong. You are still relying on the other engineers to do the right thing when they add new options. They might misunderstand the nature of the validation process, or forget to check for the function in the if statement, or misspell the option text. You have improved matters considerably, but sometimes encapsulation and information hiding are not enough. To solve these remaining problems, we are going to have to bring out the big guns: abstraction.

The purpose of language and the nature of abstraction

Abstraction is a tricky word. Matters aren’t helped by the fact that the result of abstraction is an abstraction, just as an encapsulation is the result of encapsulation. Let us consider the process of abstraction, in the same way that we just considered the processes of encapsulation and information hiding.

Literally, abstraction means to draw off from. In a programming context, it means identifying and isolating the important parts of a problem, drawing them off, and discarding the remainder. We separate them from the details of implementation. We label abstractions with identifiers. Again, consider the nature of functions: we bundle a set of instructions into a single entity and label it. The function is an abstraction with a name meaningful to the problem domain. Similarly with classes: the class is an abstraction with a name meaningful to the problem domain, containing relevant functionality to model behavior implied by the name.

However, the art of abstraction is deciding what should lie within the scope of the abstraction and what should stay outside. This is where it differs from mere encapsulation. Also, we use the word “art” advisedly: there is no mechanical method for deciding where to draw the line between what is relevant to the abstraction and what is not. That ability comes with practice and experience.

Returning to our problem, we are trying to parse a file of key-value pairs and apply the results to the environment if they are valid. The function is well named: parse_options_file. The problem we have is safely adding arbitrary key-value pairs. Is the identity of the full set of pairs actually relevant to parse_options_file? Is it within scope? Can we separate the options from the function?

At the moment we are simply pulling keys from the file and checking each in an ever-growing if-else statement since we cannot switch-case on strings. This sounds like an associative container. In fact, a map of keys against function pointers sounds perfect here. Suddenly our function has lost a huge amount of repetition, which has been replaced with a single interrogation of a map and a corresponding function call.

using namespace std::string_literals; // for the "..."s literal suffix

auto options_table = std::map<std::string, bool(*)(std::string const&)>
{{"debug"s, set_debug},
 {"language"s, set_language}}; // Extend as appropriate

void parse_options_file(const char* filename) {
  auto options = std::ifstream(filename);
  if (!options.good()) return;
  auto key = std::string{};
  auto value = std::string{};
  while (options >> key >> value) { // stop as soon as a pair cannot be read
    auto fn = options_table.find(key);
    if (fn != options_table.end()) {
      fn->second(value); // call the handler registered for this key
    }
  }
}

The important part of this function is that it parses an options file and does something with each key. Unfortunately, along the way, we have lost the capability for values to contain spaces. The chevron operator will stop extracting when it reaches white space. We’ll return to this shortly.
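One possible way to recover multi-word values, offered here as an assumption rather than as the approach this chapter settles on, is to extract the key with the chevron operator and then take the remainder of the line with std::getline. The name read_pairs and the returned vector of pairs are hypothetical.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Reads "key rest-of-line" pairs; the value keeps its internal spaces.
std::vector<std::pair<std::string, std::string>> read_pairs(std::istream& in) {
  auto pairs = std::vector<std::pair<std::string, std::string>>{};
  auto key = std::string{};
  while (in >> key) {
    auto value = std::string{};  // fresh each time, so a failed read leaves it empty
    std::getline(in, value);     // everything up to the end of the current line
    // Drop the spaces or tabs that separated the key from the value.
    auto first = value.find_first_not_of(" \t");
    value = (first == std::string::npos) ? std::string{} : value.substr(first);
    pairs.emplace_back(key, value);
  }
  return pairs;
}
```

Trailing white space is left in place in this sketch, and a key with nothing after it simply yields an empty value.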

However, this is certainly feeling better. All we must do is initialize the map of keys and function pointers. Unfortunately, we have just moved the problem around. The initializer is another point where users can trip up: it is easy to forget to update the initializer. Perhaps we can automate that?

We absolutely can. Rather than mapping keys against function pointers, we can map them against function objects with constructors, and create static objects rather than functions. The constructor can insert the address of the object into the map. In fact, we can derive all the function objects from a base class that will do that for us. Also, now that we have a base class, we can add a validation function and perform a validation pass followed by a commit pass. It all seems to be coming together.

using namespace std::string_literals; // for the "..."s literal suffix

class command; // declared first so that the map can hold pointers to it

auto options_table = std::map<std::string, command*>{};

class command {
public:
  command(std::string const& id) {
    options_table.emplace(id, this); } // every command registers itself
  virtual ~command() = default;
  virtual bool validate(std::string const&) = 0;
  virtual void commit(std::string const&) = 0;
};

class debug_cmd : public command {
public:
  debug_cmd() : command("debug"s) {}
  bool validate(std::string const& s) override;
  void commit(std::string const& s) override;
};
debug_cmd debug_cmd_instance;

class language_cmd : public command {
public:
  language_cmd() : command("language"s) {}
  bool validate(std::string const& s) override;
  void commit(std::string const& s) override;
};
language_cmd language_cmd_instance;

What next? Although we are parsing an options file, we are only reading a series of characters. They do not have to come from a file: they could come from the command line itself. We should rename the function parse_options and change the input parameter to a std::istream. If a key is not found, it could be treated as a filename and an attempt could be made to open the file. Then we could simply recurse.

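void parse_options(std::istream& options) {
  auto key = std::string{};
  while (options >> key) {
    auto fn = options_table.find(key);
    if (fn != options_table.end()) {
      // Known key: only now consume the value that belongs to it.
      auto value = std::string{};
      options >> value;
      if (fn->second->validate(value)) {
        fn->second->commit(value);
      }
    } else {
      // Unknown key: treat it as the name of a file holding more options.
      // If the file cannot be opened, the recursive call returns at once.
      auto file = std::ifstream(key);
      parse_options(file);
    }
  }
}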

Now that we have separate function objects for each key, we are not limited to initializing data. We can treat each key as a command, and suddenly, we have a basic scripting facility. Whereas at the start of this chapter the engineer had to extend a function in an unbounded fashion, all they must do now is derive a new class from command and override validate and commit.
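To show how little a newcomer has to write, here is a sketch of adding one option. The base class and table are repeated so that the example stands alone; the verbosity option, its 0 to 3 range, and the global it updates are all assumptions made for illustration.

```cpp
#include <map>
#include <string>

class command;
// Same shape as the registry above, repeated so the sketch is self-contained.
std::map<std::string, command*> options_table;

class command {
public:
  explicit command(std::string const& id) { options_table.emplace(id, this); }
  virtual ~command() = default;
  virtual bool validate(std::string const&) = 0;
  virtual void commit(std::string const&) = 0;
};

// Hypothetical state controlled by the new option.
int verbosity = 0;

// Everything a new option needs: one derived class and one static instance.
class verbosity_cmd : public command {
public:
  verbosity_cmd() : command("verbosity") {}
  bool validate(std::string const& s) override {
    return s.size() == 1 && s[0] >= '0' && s[0] <= '3';
  }
  void commit(std::string const& s) override { verbosity = s[0] - '0'; }
};
verbosity_cmd verbosity_cmd_instance;  // registers itself during static initialization
```

The parsing function is not touched at all. One caveat with this sketch: the table must be constructed before any instance registers with it, which holds here because both live in the same source file with the table defined first.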

We have now moved from a single, potentially enormous parsing function to a small bounded function and a dictionary of parsing objects. We have also gained command-line parsing at very little cost as an added extra. This was all achieved by considering what was relevant to which part of the problem. What started life as a messy construct has become a clean and easily maintainable scripting facility with bonus content. Everybody wins.

Levels of abstraction

Another way of moving from the single parsing function to smaller bundles of functionality might have been to group related activities in separate functions, something like this:

void parse_options_file(const char* filename)
{
  auto options = std::ifstream(filename);
  if (!options.good()) return;
  auto key = std::string{};
  auto value = std::string{};
  while (options >> key >> value) // stop when no further pair can be read
  {
    parse_debug_options(key, value);
    parse_UI_options(key, value);
    parse_logging_options(key, value);
    // and so on...
  }
}

This does indeed address the issue of encapsulating a messy construct: you now have several functions, each labeled by category. However, this has only moved the problem around rather than improved matters. Future maintainers must decide which is the correct function to add their parser to. Decisions need to be made when those functions get too big about how to divide them further. Such an approach does not respect levels of abstraction.

To explain levels of abstraction, consider the seven-layer OSI model.2 This model partitions a communication system into abstraction layers. Each layer exposes an interface to the next layer, but not to the previous layer. Engineers work in the layer that fits their specialty. For example, I am a software engineer rather than an electronics engineer. I would feel very uncomfortable working in layer 1, the physical layer, and much happier working in layer 7, the application layer. You may have heard the term “full-stack engineer.” This engineer is comfortable in all the layers. They are mythical creatures.

2. https://en.wikipedia.org/wiki/OSI_model

The levels of abstraction in the parsing problem can be described thus:

  1. The streaming layer, which delivers a stream of data to...

  2. The parsing layer, which delivers individual symbols to...

  3. The dictionary layer, which matches symbols to tokens and delivers them to...

  4. The token layer, which validates input and updates values

These abstractions are all distinct, nonoverlapping parts of the problem.
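The four layers listed above can be sketched in code, one construct per layer. This is an illustration under assumptions rather than the design built earlier: the names handler, dictionary, symbol_pair, next_pair, and run are all invented, and std::function replaces the command class to keep the sketch short.

```cpp
#include <functional>
#include <istream>
#include <map>
#include <optional>
#include <sstream>
#include <string>

// Layer 4, token: validates input and updates a value.
using handler = std::function<bool(std::string const&)>;

// Layer 3, dictionary: matches a symbol to its token handler.
using dictionary = std::map<std::string, handler>;

struct symbol_pair {
  std::string key;
  std::string value;
};

// Layer 2, parsing: turns the stream into individual symbols.
std::optional<symbol_pair> next_pair(std::istream& in) {
  auto p = symbol_pair{};
  if (in >> p.key >> p.value) return p;
  return std::nullopt;
}

// Layer 1, streaming, is whichever std::istream the caller supplies.
// Returns the number of pairs that were recognized and accepted.
int run(std::istream& in, dictionary const& dict) {
  auto accepted = 0;
  while (auto p = next_pair(in)) {
    auto entry = dict.find(p->key);
    if (entry != dict.end() && entry->second(p->value)) ++accepted;
  }
  return accepted;
}
```

Each layer knows only about the one beneath it: run never inspects characters, and next_pair never learns which keys exist.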

Abstraction by refactoring and drawing the line

The key to abstraction is knowing where to draw the line that separates the different layers. As we remarked earlier, it is an art, not a science, but there are three things you can look for.

First, excessive detail. Does the code spend time carrying out tasks that seem singularly unrelated to the task at hand? The Core Guideline for this chapter uses a busy for loop involving reading a file, validating, and performing reallocation as an example: there is too much going on in there. One can also consider designing baroque data structures especially for local use. Is this data structure of any use outside of this context? Will it ever be? If the answer is yes, take this data structure and move it from the code under examination to a more generic library. Separating detail into different libraries is a form of abstraction that is applicable to both the guideline example and the notion of pulling out data structures.

Second, verbose repetition. What sort of patterns can you see? Has there been some devious copying and pasting? Are you able to express them as an individual function or function template? Pull that code out into a function, give it a name, and rejoice in having identified an abstraction.
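As a sketch of that last step, suppose several option handlers each contain their own copy of “convert the text to a number and check that it lies in range.” The function template below, whose name parse_in_range and whose signature are assumptions for illustration, gives that pattern a single home.

```cpp
#include <sstream>
#include <string>

// One template replaces a family of near-identical parse-and-range-check snippets.
// On failure, out is left unchanged. Trailing characters after the number are
// not checked in this sketch.
template <typename T>
bool parse_in_range(std::string const& text, T low, T high, T& out) {
  auto in = std::istringstream{text};
  auto parsed = T{};
  if (!(in >> parsed) || parsed < low || parsed > high) return false;
  out = parsed;
  return true;
}
```

A handler for a hypothetical width option then shrinks to a single call such as parse_in_range(value, 1, 200, width), and the same template serves floating-point options unchanged.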

Third, and we would love a better word for this, wheel reinvention. This is slightly different from repetition and is a combination of the first two items. A lot of time has been spent identifying and naming the fundamental algorithms of computing. Make sure you are familiar with what they are and how they are offered through the standard library.

Repetition is a hint that there is an algorithm waiting to be uncovered, and a good algorithm is merely a concise description of what a piece of code does. In 2013 Sean Parent gave a talk called C++ Seasoning,3 which spent much of the first half addressing the mantra “no raw loops.” His advice was to use an existing algorithm such as std::find_if, or implement a known algorithm as a function template and contribute it to an open source library, or devise a brand-new algorithm, write a paper about it, and become famous giving talks. This is excellent advice that will guide you on your way to eliminating messy code.

3. https://www.youtube.com/watch?v=W2tWOdzgXHA
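To illustrate the “no raw loops” advice, here is the same search written twice: once as a hand-rolled loop and once with std::find_if. The option struct and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct option {
  std::string key;
  std::string value;
};

// Raw loop: the reader has to work out that this is a search.
option const* find_raw(std::vector<option> const& opts, std::string const& key) {
  for (auto i = std::size_t{0}; i != opts.size(); ++i) {
    if (opts[i].key == key) return &opts[i];
  }
  return nullptr;
}

// Named algorithm: std::find_if states the intent directly.
option const* find_named(std::vector<option> const& opts, std::string const& key) {
  auto it = std::find_if(opts.begin(), opts.end(),
                         [&](option const& o) { return o.key == key; });
  return it != opts.end() ? &*it : nullptr;
}
```

Both functions return the same pointer for the same input; the second simply tells the reader what it is doing before they have read the body.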

Summary

Messy code stops the reader from understanding at a glance what is going on. Avoid this by:

  • Identifying patterns of existing code or existing algorithms, preferably as they happen, and lifting them out of the mess and abstracting them into their own function

  • Identifying the different levels of abstraction

  • Identifying how to encapsulate all these parts