Chapter 5.4

ES.22: Don’t declare a variable until you have a value to initialize it with

The importance of expressions and statements

This is the fourth guideline from the ES section of the Core Guidelines, Expressions and Statements, to which we have devoted a chapter. One reason for that is the size of this section (over sixty items), but more importantly it gets to the heart of C++. The section opens with the following:

“Expressions and statements are the lowest and most direct way of expressing actions and computation.”

This guideline binds together with ES.5: “Keep scopes small” and ES.10: “Declare one name (only) per declaration.” As you saw in ES.5, keeping scopes small is a great improvement to readability, since scope creates context. Small scopes also improve resource management, since resources are released by deterministic destruction when the scope ends, minimizing their retention. In ES.10 you saw the importance of separating declarations into individual objects rather than grouping them together C-style. The facility to group declarations behind a single type is a feature of backward compatibility, no more than that. It offers no advantage to the C++ programmer, and in fact can serve to confuse and complicate.

Another way of improving readability is to delay the declaration of objects until the last possible moment. Functions grow and change. They just do. Often, we do not quite spot their rampant expansion in a timely fashion and we are left with a somewhat unwieldy collection of objects and logic. When we realize what has happened, we seek to partition the function into useful parts, to abstract it into smaller functions, hiding the complexity and restoring order to our source file. This process is hampered if declaration and initialization are spread over multiple lines, especially if they are not near to each other in the source. Let’s take a quick journey through three programming styles.

C-style declaration

C used to require you to declare all your variables at the top of a function before any code was executed. It was quite normal to inadvertently declare more variables than were in use in the function. As the function changed algorithm, changed purpose, mutated in any number of ways in fact, the context required to execute it would change, and inevitably some variables would no longer be needed. For example:

int analyze_scatter_data(scatter_context* sc) {
  int range_floor;
  int range_ceiling;
  float discriminator;
  int distribution_median;
  int distribution_lambda;
  int poisson_fudge;
  int range_average;
  int range_max;
  unsigned short noise_correction;
  char chart_name[100];
  int distribution_mode;
  … // and on we go
}

This function was first written 10 years ago and has been through a lot of revisions. Five years ago, distribution_median was exchanged for distribution_mode as a summary value, but both were calculated without consideration for whether the median was needed. Nobody withdrew the state declared to support an unused part of the algorithm. Out of sight, out of mind. Additionally, it is the engineer’s habit to append to a list rather than find the proper location, so range_average and range_max are separated from range_floor and range_ceiling.

If the programmer were dedicated, keen, and not in a hurry, those variables would be withdrawn from the function, improving its clarity, and the remainder appropriately collected together to highlight how pieces of state are related. If the programmer had Lint to hand, a static code analysis tool released to the public in 1979, it would warn about unused objects, making this task easier. Otherwise, the function would consume a few extra unnecessary bytes on the stack. Today this would not even be noticed as compilers would eliminate the unused variables anyway thanks to the as-if rule, but in the constrained environments of the 1970s, where the size of offline files could exceed the RAM in your machine, this might be significant.

Another problem this introduced was variable reuse. The function may reuse an int for a loop counter, or for storing the result of a function call. Again, as the function mutated, reused variables might take on unexpected new meanings, and their reuse might be inappropriate. The trade-off sacrificed readability to satisfy resource constraints.

A further impact of early declaration was that initialization was somewhat haphazard. It was easy to introduce dependencies between variables and to use them before they had been initialized. Every function had to be closely examined for variable coupling. Whenever a line of code was moved, it was important to ensure that any dependencies were honored.

This is not a pleasant way to write code. Declaring a batch of variables, and then carefully initializing them one by one, is painful, error prone, and makes your gums ache. C++ introduced late declaration, and the first thing that happened was that structs and built-in types were declared and then initialized in a safe order. This was a significant leap forward. Rather than worrying about whether a struct was ready for initialization, you could declare all its dependent data first, initialize it, and then safely initialize the struct, without a care in the world.
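The difference between the two approaches can be sketched with a small function. This is only an illustration: range_width_c_style and range_width are hypothetical names that do not appear in analyze_scatter_data, and both assume a non-empty input.

```cpp
#include <algorithm>
#include <vector>

// C-style: every name is declared up front and holds an indeterminate value
// until its assignment is reached. (Hypothetical example function.)
// Precondition: samples is not empty.
int range_width_c_style(std::vector<int> const& samples) {
  int range_floor;
  int range_ceiling;
  int width;
  // Moving the subtraction above the two assignments below would still
  // compile, but it would read indeterminate values.
  range_floor = *std::min_element(samples.begin(), samples.end());
  range_ceiling = *std::max_element(samples.begin(), samples.end());
  width = range_ceiling - range_floor;
  return width;
}

// Late declaration: each name is introduced by the expression that gives it
// a value, so it can be const and cannot be read too early.
// Precondition: samples is not empty.
int range_width(std::vector<int> const& samples) {
  int const range_floor = *std::min_element(samples.begin(), samples.end());
  int const range_ceiling = *std::max_element(samples.begin(), samples.end());
  return range_ceiling - range_floor;
}
```

In the second version, reordering the lines incorrectly produces a compilation error rather than a silent read of an uninitialized int.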

Declare-then-initialize

When I came to C++, code still looked like C, although it was a new style of programming. I would see things like:

class Big_type {
  int m_acc;
  int m_bell;
  int m_length;

public:
  Big_type();
  void set_acc(int);
  void set_bell(int);
  void set_length(int);
  int get_acc() const;
  int get_bell() const;
  int get_length() const;
};

void fn(int a, int b, char* c)
{
  Big_type bt1;
  bt1.set_acc(a);
  bt1.set_bell(b);
  bt1.set_length(strlen(c));
 …
  Big_type bt2;
  bt2.set_acc(bt1.get_bell() * bt1.get_acc());
  bt2.set_bell(bt2.get_acc() * 12);
 …
}

There are some good things about this code. The data is in the private interface, the function variables were declared as they were needed, but the get and set functions grated on me. Why go to all the trouble of putting your implementation details in the private interface only to expose them through get and set functions? Also, why put the private interface at the top of the definition, when most clients should only want to see the public interface?

We looked at this in C.131: “Avoid trivial getters and setters”: this was a hangover from the days of declare-then-initialize. The constructor would most likely set all the values to zero if it existed at all. I encountered local coding rules such as “ensure every data member has a getter and a setter” put in place to support this programming style. Even in the early 2000s I would encounter reluctance to declare constructors because of the associated overhead of initializing everything to zero only to subsequently overwrite the values. It was only when strict determinism was required that I could persuade my colleagues to write and use constructors.

We are of course talking about default constructors, where the class designer would dictate what the initial value would be (typically 0 throughout). The great shift was when programmers would start to use nondefault constructors and even eliminate default constructors altogether. This is the third programming style.

Maximally delayed declaration

Default constructors have their place, but they presume the existence of a default value. Some early implementations of C++ would require classes to have default constructors if they were to be contained in a std::vector. This requirement would propagate through the member data, requiring those types to also have default values, or for the default constructor of the containing class to be able to construct them with a meaningful value. Happily, this is no longer the case and we always advise that default constructors be added deliberately and carefully.

You might be wondering how this relates to the title of this guideline. Simply put, if a constructor demands a full set of initial conditions, you cannot create an instance of the class until you are ready to use it. As you can see from the preceding styles of initialization, declare-then-initialize, in whatever form, is an accident waiting to happen, since there is no indication that an object is ready for use. Additional instructions can be inserted during the initialization which only serves to further confuse the development of the state in your function. Reasoning about which objects are ready for use is an unnecessary burden, relieved by insisting on full initialization at the point of declaration. Let’s modify class Big_type:

class Big_type {
  int m_acc;
  int m_bell;
  int m_length;

public:
  Big_type(int acc, int bell, int length);
  void set_acc(int);
  void set_bell(int);
  void set_length(int);
  int get_acc() const;
  int get_bell() const;
  int get_length() const;
};

void fn(int a, int b, char* c)
{
  Big_type bt1(a, b, strlen(c));
 …
  // bt2 cannot appear in its own initializer, so both values derive from bt1
  Big_type bt2(bt1.get_bell() * bt1.get_acc(),
               bt1.get_bell() * bt1.get_acc() * 12, 0);
 …
}

Both bt1 and bt2 can now be declared const, which is preferable to mutable state. The setters have been kept but it is entirely likely that they will now be unnecessary. This code is immediately more apprehensible.
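As a minimal sketch of the const version, the following fills in details that the listing above leaves open. The constructor body, the int return value, and the name fn_const are assumptions made purely so that the fragment is complete and its effect observable; the setters are omitted because nothing here needs them.

```cpp
#include <cstring>

class Big_type {
  int m_acc;
  int m_bell;
  int m_length;

public:
  // Assumed implementation: store the three initial values as given.
  Big_type(int acc, int bell, int length)
    : m_acc(acc), m_bell(bell), m_length(length) {}
  int get_acc() const { return m_acc; }
  int get_bell() const { return m_bell; }
  int get_length() const { return m_length; }
};

// Hypothetical variant of fn that returns a value so the result can be seen.
int fn_const(int a, int b, char const* c) {
  Big_type const bt1(a, b, static_cast<int>(std::strlen(c)));
  int const acc2 = bt1.get_bell() * bt1.get_acc();
  Big_type const bt2(acc2, acc2 * 12, 0);
  // Neither object can be modified from here on: calling a non-const member
  // function on bt1 or bt2 would be rejected by the compiler.
  return bt2.get_bell();
}
```

Every object in fn_const is complete at the point of declaration, so there is no interval during which a half-initialized Big_type can be observed.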

There is another reason for delaying declaration until immediately prior to first use, which is eliminating redundancy. Look at this code:

class my_special_type {
public:
  my_special_type(int, int);
  int magical_invocation(float);
…
};

int f1(int a, int b, float c) {
  my_special_type m{a, b};
  if (a > b) return a;
  prepare_for_invocation();
  return m.magical_invocation(c);
}

Clearly, m does not need to be declared until the prepare_for_invocation() function call has returned. In fact, there is no need to declare a named value at all. If we rewrite this function while observing ES.5: “Keep scopes small,” we might arrive at this:

int f2(int a, int b, float c) {
  if (a > b) return a;
  prepare_for_invocation();
  return my_special_type{a, b}.magical_invocation(c);
}

The assembly generated by the compiler will likely be identical, in accordance with the as-if rule, assuming no side effects to construction, so there is no performance optimization going on here. However, there is one less line of code to read, as well as no possibility of introducing confusing code between declaration and use of the my_special_type instance.

Observe how far we have delayed the instantiation of the object. It started off as an lvalue named m at the top of the function and has ended up as an rvalue at the bottom. This function now has no state other than that which is passed in. Again, this is easier to apprehend since there is nothing to keep track of other than the execution order of the function calls.
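When construction does have side effects, the redundancy in the early declaration becomes observable. The sketch below assumes a hypothetical construction counter, an invented body for magical_invocation, and an empty prepare_for_invocation; none of these come from the original class, and they exist only to make the difference measurable.

```cpp
// Hypothetical counter, used only to observe how often the constructor runs.
int& construction_count() {
  static int count = 0;
  return count;
}

class my_special_type {
  int m_a;
  int m_b;

public:
  my_special_type(int a, int b) : m_a(a), m_b(b) { ++construction_count(); }
  // Assumed behavior, chosen only so that the function returns something.
  int magical_invocation(float c) { return m_a + m_b + static_cast<int>(c); }
};

void prepare_for_invocation() {}  // stand-in for the real preparation step

int f1(int a, int b, float c) {
  my_special_type m{a, b};  // constructed even if we return on the next line
  if (a > b) return a;
  prepare_for_invocation();
  return m.magical_invocation(c);
}

int f2(int a, int b, float c) {
  if (a > b) return a;      // early exit: no object is ever constructed
  prepare_for_invocation();
  return my_special_type{a, b}.magical_invocation(c);
}
```

Calling both functions with a greater than b returns the same value, but f1 increments the counter and f2 does not.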

Localization of context-specific functionality

Delaying instantiation has other benefits. Look at this function:

my_special_type f2(int a, int b) {
  int const x = perform_f2_related_checks(a, b);
  int const y = perform_boundary_checks(a, x);
  int const z = perform_initialization_checks(b, y);
  return {y, z};
}

Not only is the object instantiated at the end of the function, but it also benefits from copy elision via return value optimization.

In that last example, you will have seen attention being paid to P.10: “Prefer immutable data to mutable data”; x, y, and z were declared const. While this is a trivially easy guideline to follow for built-in types, more complex considerations can require more involved initialization. Look at this fragment of code:

int const_var;

if (first_check()) {
  const_var = simple_init();
} else {
  const_var = complex_init();
}

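We would like to make const_var a constant value, but a const object must be initialized where it is declared, and if we declare it inside either branch of the conditional statement the name falls out of scope at the closing brace. This example could be resolved with: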

int const var = first_check() ? simple_init() : complex_init();

but clearly this is not going to scale well.

Also, consider this class:

class special_cases {
public:
  special_cases(int, int);
…
private:
  my_special_type m_mst;
};

As you may be able to infer from the function f2 above, construction of my_special_type includes some rather particular conditions being satisfied. How are we to construct the member datum m_mst? You might immediately suggest:

special_cases::special_cases(int a, int b)
 : m_mst(f2(a, b))
{}

but unless f2 has other uses, you have created a special function solely for invocation by the constructor. That is not a nice thing to do for your maintainers, who are most likely to be future you. This is a use case for an Immediately Invoked Lambda Expression, or IILE. The IILE is a simple idiom that looks like this:

special_cases::special_cases(int a, int b)
 : m_mst([=](){ // Capture by value, taking no parameters
  int const x = perform_f2_related_checks(a, b);
  int const y = perform_boundary_checks(a, x);
  int const z = perform_initialization_checks(b, y);
  return my_special_type{y, z}; }
 ())            // Immediately invoke the lambda expression
{}

We declare the lambda expression, and then we immediately invoke it. Naming is hard, and sometimes you just have to say what you are doing, convert it into an acronym, and live with it (see also RAII). Now we have our initialization function in one place.

We can also apply this to the other example:

int const var = [](){
  if (first_check()) return simple_init();
  return complex_init();
}();

Single pieces of functionality used to initialize objects often result in the creation of temporary state whose utility expires after the declaration of the object in question. Since the temporary state is a dependency of the object, it must be declared at the same scope. Bundling it into a lambda expression creates a local scope which can export a value.
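A small sketch of that last point follows. The function summarize and its averaging logic are hypothetical, chosen only because computing a mean needs a running total that has no purpose once the result exists.

```cpp
#include <numeric>
#include <vector>

// Hypothetical example: the mean of the samples, truncated toward zero,
// or 0 when there are no samples.
int summarize(std::vector<int> const& samples) {
  int const mean = [&]() {  // capture by reference, taking no parameters
    if (samples.empty()) return 0;
    // total is temporary state: it exists only inside the lambda's scope.
    long long const total =
        std::accumulate(samples.begin(), samples.end(), 0LL);
    return static_cast<int>(total / static_cast<long long>(samples.size()));
  }();                      // immediately invoke the lambda expression
  // Only mean is visible here; total is already gone.
  return mean;
}
```

Both return statements yield an int, so the lambda's return type is deduced without an explicit trailing return type.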

Eliminating state

Late declaration is even applicable to containers. Consider this function:

void accumulate_mst(std::vector<my_special_type>& vec_mst,
                    std::vector<int> const& source) {
  auto source_it = source.begin();
  while (source_it != source.end()) {
    auto s_it = source_it++;
    my_special_type mst{*s_it, *s_it};
    vec_mst.push_back(mst);
  }
}

Within the while loop, an instance of my_special_type is constructed and pushed onto the vector. You might consider dispensing with the named object and pushing back an rvalue instance instead:

void accumulate_mst(std::vector<my_special_type>& vec_mst,
                    std::vector<int> const& source) {
  auto source_it = source.begin();
  while (source_it != source.end()) {
    auto s_it = source_it++;
    vec_mst.push_back(my_special_type{*s_it, *s_it});
  }
}

In an unoptimized build this will create a temporary object and invoke push_back(my_special_type&&), allowing the move constructor to be used rather than the copy constructor. But we can go even further by using emplace_back:

void accumulate_mst(std::vector<my_special_type>& vec_mst,
                    std::vector<int> const& source) {
  auto source_it = source.begin();
  while (source_it != source.end()) {
    auto s_it = source_it++;
    vec_mst.emplace_back(*s_it, *s_it);
  }
}

We have eliminated some state from the function. Now, you may think that you have improved the performance of your program by doing this. However, each of those functions does the same thing, and the compiler will probably generate the same assembly for each. In the first example the compiler can see that the instance of my_special_type does not persist beyond the while loop, and so is able to invoke push_back(my_special_type&&) rather than push_back(my_special_type const&). In the third function we are delaying the construction still further, but this is simply a case of moving the location of the copy elision. The object will be constructed once, in the correct place, thanks to copy elision. In fact, emplace_back is more expensive to compile than push_back: it is a class template member function template rather than a class template member function. This may have an impact on your decision to use it.

These examples assume that the move constructor and the copy constructor have been defaulted and are trivial. If there is no move constructor, or it is costly to execute, then you might consider using emplace_back. By default, though, use push_back and construct in place.
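The non-trivial case can be illustrated by counting special member function calls. In this sketch, Tracked, Tally, and the three fill functions are hypothetical stand-ins rather than the types above. Because the counters are observable side effects, the compiler cannot collapse the three variants into identical code, which is exactly the situation in which the choice starts to matter.

```cpp
#include <vector>

// Hypothetical bookkeeping for the special member functions that run.
struct Tally {
  int constructed = 0;
  int copied = 0;
  int moved = 0;
};

Tally& tally() {
  static Tally t;
  return t;
}

// Hypothetical type whose copy and move constructors are not trivial.
struct Tracked {
  int a;
  int b;
  Tracked(int a_, int b_) : a(a_), b(b_) { ++tally().constructed; }
  Tracked(Tracked const& o) : a(o.a), b(o.b) { ++tally().copied; }
  Tracked(Tracked&& o) noexcept : a(o.a), b(o.b) { ++tally().moved; }
};

Tally fill_with_named_object(std::vector<int> const& source) {
  tally() = Tally{};
  std::vector<Tracked> out;
  out.reserve(source.size());  // rule out moves caused by reallocation
  for (int const s : source) {
    Tracked t{s, s};
    out.push_back(t);          // lvalue argument: copy constructor
  }
  return tally();
}

Tally fill_with_temporary(std::vector<int> const& source) {
  tally() = Tally{};
  std::vector<Tracked> out;
  out.reserve(source.size());
  for (int const s : source) {
    out.push_back(Tracked{s, s});  // rvalue argument: move constructor
  }
  return tally();
}

Tally fill_with_emplace(std::vector<int> const& source) {
  tally() = Tally{};
  std::vector<Tracked> out;
  out.reserve(source.size());
  for (int const s : source) {
    out.emplace_back(s, s);        // constructed directly in the vector
  }
  return tally();
}
```

For a three-element source, the first function performs three copies, the second three moves, and the third neither. When the copy and move constructors are defaulted and trivial, as the text assumes, these differences are likely to disappear under optimization.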

Summary

You might be asking yourself, “Well, if they all generate the same code, what’s the difference? Why should I choose one of these over another?” The answer is that this guideline is not about performance, but about maintenance. Declaring at point of use improves readability of code, and not declaring state at all improves things even further. Reasoning about state requires a comprehensive memory, which is a diminishing asset as codebases expand.

Delaying declaration as late as possible carries several benefits. Unnecessary state can be reduced or even eliminated. Scope is minimized. Objects are not used prior to initialization. Readability is improved. All of this adds up to improved safety and, occasionally, improved performance characteristics. I think this is a golden rule to be followed as far as possible.