Chapter 3.3

C.90: Rely on constructors and assignment operators, not `memset` and `memcpy`

Chasing maximum performance

C++ has a proud reputation for bare metal performance. Other languages have come and gone, trying to take the performance crown from C++, but still it remains the go-to language for zero-overhead abstractions. It has inherited this pedigree from C, which offers some very efficient library functions. Some of these can be implemented as single processor instructions.

For example, consider double floor(double arg). This function lives in header <cmath> and will return the largest integer value not greater than arg. There is a single x86 instruction that will do this for you, called ROUNDSD. A call to floor in the hands of a smart optimizing compiler can be replaced with an inline invocation of that instruction. This will fill a typical performance-hungry engineer with delight.
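As a minimal sketch of this point, the function below does nothing but call std::floor. The names floor_of and floor_demo are hypothetical. Pasted into Compiler Explorer with optimization and SSE4.1 enabled (for instance -O2 -msse4.1 on GCC or Clang, which is an assumption about your toolchain), the call can typically be seen to shrink to a single ROUNDSD instruction.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical wrapper: returns the largest integer value not greater than arg.
double floor_of(double arg) {
  return std::floor(arg);
}

// Records the expected behavior for a positive and a negative input.
void floor_demo() {
  assert(floor_of(2.7) == 2.0);
  assert(floor_of(-2.1) == -3.0);  // floor rounds toward negative infinity
}
```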

There are several surprising instructions available to this CISC processor. Perhaps you want to know the number of leading zeros in a value so that you can assess what the nearest power of 2 is. There is an instruction called LZCNT which does precisely this. Perhaps you want to know the number of set bits in a value because you are calculating Hamming distance. Step forward, POPCNT. This is such a useful instruction that Clang and GCC will spot if you are trying to write it and replace your code with a call to POPCNT. This is excellent service. Remember to tip your compiler writer.
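The sketch below shows the kind of hand-written bit count that an optimizer can typically recognize. The function names are hypothetical, and whether POPCNT is actually emitted depends on the compiler version and target flags (for example -O2 -mpopcnt), so treat that as something to check in Compiler Explorer rather than a guarantee.

```cpp
#include <cassert>
#include <cstdint>

// Hand-written bit count: each pass clears the lowest set bit.
int count_set_bits(std::uint32_t v) {
  int n = 0;
  while (v != 0) {
    v &= v - 1;  // clear the lowest set bit
    ++n;
  }
  return n;
}

// Hamming distance: the number of bit positions in which x and y differ.
int hamming_distance(std::uint32_t x, std::uint32_t y) {
  return count_set_bits(x ^ y);
}

void bit_count_demo() {
  assert(count_set_bits(0xFFu) == 8);
  assert(hamming_distance(0b1011u, 0b0010u) == 2);  // differ in bits 0 and 3
}
```

From C++20 onward, std::popcount and std::countl_zero in header <bit> state the same intent directly, which is the higher level of abstraction this chapter argues for.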

When I first started programming, I quickly jumped from BASIC to assembly language, first Z80, then 68000. When I first learned C, I was, in the right light, able to treat it as a macro assembly programming language, which made my transition to C rather smooth. I was reasoning about my code like it was assembly language, except that it was faster to write and easier to test and debug. I was producing excellent code much more quickly than when I was using 68000 assembly.

When I started moving to C++, I was a little suspicious about certain aspects of the language. A little examination would usually alleviate my suspicions. For example, virtual functions looked like dark magic until I realized that they were function pointers at the top of my struct, although they were an additional level of indirection away from where I would expect them. Function overloads and function templates were nice, since I was able to eliminate swathes of symbols and I learned about eliminating implementation from interface, leading to much more readable code.

The things I liked most in the language were the syntactic sugar that allowed me to write clearer code. Things that slowed me down were definitely not wanted on voyage.

Constructors, though. They were the worst.

The horrendous overhead of the constructor

The moment I learned assembly language, I learned to isolate an area of memory for work, fill it with zeros with a single assembly instruction, and get on with my life. If I was feeling particularly confident I wouldn’t even zero it, I would initialize it according to context, although that occasionally complicated debugging as I would lose track of which addresses I had already set and which were yet to be set.

In C, I quickly learned to declare all my ints, floats, and structs at the top of a function and, in debug builds, call the memset library function in <string.h> to initialize everything to zero in one call. I was simply increasing (or decreasing) the stack pointer and backfilling the space with zeros.

With C++ I had to unlearn this habit. I had to get used to the existence of default constructors. I had to learn that they would be called no matter what, that I could not suppress them. I had to look at the generated assembly and wince slightly. Nothing was as fast as The Old Ways. The best mitigation I could come up with was to call memset in the body of the constructor. Initialization lists just would not do the trick: I would directly set everything to zero in one assembly instruction.

You can imagine how I felt about assignment operators and copy constructors. Why weren’t they calling memcpy? What was it with this delicate, dainty, member-by-member stuff? I could understand in those cases where I actually had something to do during the constructor body, but when I was simply setting aside an area of memory, why was there such wastage?

I struggled through, cursing these inefficiencies, and trading them against the more intelligible code the other features were yielding. Sometimes I would write the truly performance-critical parts in C and exploit the fact that both languages were mutually intelligible to the linker.

The chimera is a mythical fire-breathing beast with the head of a lion, the body of a goat, and the tail of a dragon. I was writing these ghastly code chimeras in the 1990s. It took me a long time to realize that the error I was making was in declaring my objects too early in the function before they were ready to be used. In addition, this was prior to standardization and the introduction of the as-if rule. It took me even longer to realize the true value of that fine piece of legislation. Let’s look a little closer at the rules about construction.

The standard describes initialization of classes over 12 pages of standardese, starting at [class.init],¹ referring to a further eight pages at [dcl.init]² if there is no constructor. This really isn’t the place to parse all that, so we’ll keep things simple and summarize, starting with aggregates.

1. https://eel.is/c++draft/class.init

2. https://eel.is/c++draft/dcl.init

The simplest possible class

Aggregates are classes with

No user-declared or inherited constructors
No nonpublic data members which are not static
No virtual functions
No nonpublic or virtual base classes

Here is an example:

struct Agg {
  int a;
  int b;
  int c;
};

Aggregates are useful things. You can initialize them with curly braces like so:

Agg t = {1, 2, 3};

The initialization rule is that each element is copy initialized from the corresponding initializer, in order of declaration. In the above example, this looks like:

t.a={1};
t.b={2};
t.c={3};
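The aggregate rules can be checked at compile time with std::is_aggregate_v from <type_traits> (C++17). The sketch below is illustrative only: NotAgg, make_agg, and aggregate_demo are hypothetical names, and NotAgg exists purely to show a user-declared constructor breaking the first rule in the list.

```cpp
#include <cassert>
#include <type_traits>

struct Agg {
  int a;
  int b;
  int c;
};

// Hypothetical counter-example: a user-declared constructor means this
// class is no longer an aggregate.
struct NotAgg {
  NotAgg(int x) : a{x} {}
  int a;
};

static_assert(std::is_aggregate_v<Agg>, "Agg satisfies the aggregate rules");
static_assert(!std::is_aggregate_v<NotAgg>, "NotAgg has a user-declared constructor");

// Each member is initialized from the corresponding initializer, in order.
Agg make_agg() {
  Agg t = {1, 2, 3};
  return t;
}

void aggregate_demo() {
  const Agg t = make_agg();
  assert(t.a == 1);
  assert(t.b == 2);
  assert(t.c == 3);
}
```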

If no elements are explicitly initialized, then each element is initialized with a default initializer or copy initialized from an empty initializer, in order of declaration. This will fail if one of the members is a reference since they must be bound at instantiation. For example:

auto t = Agg{};

Declaring t like this will result in t.a being initialized with {}, then t.b, then t.c. Since these are all ints, each is value-initialized to zero. Omit the braces, though, and declare t as plain Agg t; and nothing is initialized at all: there is no constructor for built-in types, so the members hold indeterminate values. “Ah!” we hear you exclaim. “So, this is where I call memset, obviously. The contents of the struct are nondeterministic and that is a bad thing, so I shall simply zero them. It’s clearly the right thing to do.”

No, it is not. The right thing to do is to add a constructor that performs this ini-tialization, like so:

struct Agg {
  Agg() : a{0}, b{0}, c{0} {}
  int a;
  int b;
  int c;
};

“But now it’s not an aggregate anymore,” you observe, correctly. “I want that brace initialization feature and memset please.”

All right then, what you can do is use member initializers, like this:

struct Agg {
  int a = 0;
  int b = 0;
  int c = 0;
};

Now if you declare

auto t = Agg{};

t.a will be initialized with = 0, as will t.b and t.c. Even better, you can use designated initializers, new to C++20, which allow parts of the object to be initialized with different values, like this:

auto t = Agg{.c = 21};

Now, t.a and t.b will still be initialized with 0, but t.c will be initialized with 21.
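Putting the last two snippets together gives the following sketch, which needs a C++20 compiler for the designated initializer. The helper names make_default, make_with_c, and member_init_demo are hypothetical. The static_assert confirms that default member initializers do not stop Agg from being an aggregate.

```cpp
#include <cassert>
#include <type_traits>

struct Agg {
  int a = 0;
  int b = 0;
  int c = 0;
};

static_assert(std::is_aggregate_v<Agg>, "member initializers keep Agg an aggregate");

// Every member falls back to its default member initializer.
Agg make_default() {
  return Agg{};
}

// Only c is named; a and b still use their default member initializers.
Agg make_with_c(int c) {
  return Agg{.c = c};
}

void member_init_demo() {
  const Agg d = make_default();
  assert(d.a == 0 && d.b == 0 && d.c == 0);

  const Agg t = make_with_c(21);
  assert(t.a == 0 && t.b == 0 && t.c == 21);
}
```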

“All right, yes, designated initializers are nice and it’s back to being an aggregate,” (I can sense a “but” forming in your consciousness) “but the members are still being initialized one at a time! I want to use memset to initialize them in a single instruction.”

That is a really bad idea. You are separating the initialization of the object from its definition. What happens when you add members to the aggregate? Your call to memset will only cover part of the object. C++ allows you to collect the whole life cycle of an object into a single abstraction, the class, which is of immense value. You should not try and subvert it.

“I shall use sizeof to ensure that I remain independent of any changes to the class.”

Still not a good idea. What if you introduce a member that does NOT default to zero initialization? You will then have to ensure that your memset call honors the value of that member, perhaps by splitting it into two. That is simply an accident waiting to happen.
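To make that accident concrete, here is a hedged sketch using a hypothetical Settings struct whose scale member must default to 1. The cast to void* is there only to quiet warnings such as GCC's -Wclass-memaccess; the memset is well behaved as far as the language is concerned, and it still silently discards the member's intended default.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical struct: scale was added later and must NOT default to zero.
struct Settings {
  int width = 0;
  int height = 0;
  int scale = 1;
};

// Lets the class describe its own initial state.
Settings make_with_initializers() {
  return Settings{};
}

// The old habit: zero the whole object, whatever it contains.
Settings make_with_memset() {
  Settings s;
  std::memset(static_cast<void*>(&s), 0, sizeof s);  // overwrites scale = 1
  return s;
}

void memset_hazard_demo() {
  assert(make_with_initializers().scale == 1);  // the intended default
  assert(make_with_memset().scale == 0);        // the default has been lost
}
```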

“I do not accept this! I own the aggregate, it is defined in a private implementation file, not a header, it is NOT going to be modified in nature without my knowledge, it is TOTALLY SAFE to call memset! What is going on here? Why shouldn’t I call memset?”

Well, the fact of the matter is that you do not actually need to call memset. Let’s talk about the abstract machine.

What is the standard talking about anyway?

P.2: “Write in ISO Standard C++” is one of the first Core Guidelines. The standard dictates how a conforming implementation of C++ behaves. Any divergence from this is not standard C++. There are many implementations of C++ for many platforms, all of which behave in different ways depending on things like the machine word size and other target-specific features. Some platforms do not feature offline storage in the form of disk drives. Others do not feature a standard input. How does the standard accommodate all this variation?

The first three clauses of the standard are Scope,³ Normative references,⁴ and Terms and definitions.⁵ On pages 10 to 12 the fourth clause, General principles,⁶ precisely explains this problem. This is only one of the reasons why it is important to RTFM (Read The Front Matter).

3. https://eel.is/c++draft/intro.scope

4. https://eel.is/c++draft/intro.refs

5. https://eel.is/c++draft/intro.defs

6. https://eel.is/c++draft/intro

The first four clauses that make up the front matter tell you how the document is structured, what the conventions are, what “undefined behavior” means, what an “ill-formed program” is: in fact, the entire frame of reference is described here. In the General principles clause, in particular in section [intro.abstract],⁷ you will find the following text:

7. https://eel.is/c++draft/intro.abstract

“The semantic descriptions in this document define a parameterized nondeterministic abstract machine. This document places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.”

A footnote is attached to this paragraph, which says:

“This provision is sometimes called the “as-if” rule, because an implementation is free to disregard any requirement of this document as long as the result is as if the requirement had been obeyed, as far as can be determined from the observable behavior of the program. For instance, an actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no side effects affecting the observable behavior of the program are produced.”

This is a marvelous get-out clause (more correctly, get-out paragraph). All that an implementation must do is emulate the observable behavior. This means it can look at your code, examine the result of its execution, and do whatever is necessary to match that result. This is how optimization works: by looking at a result and substituting the optimal set of instructions required for achieving it.
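A small, hedged illustration of the as-if rule: the hypothetical function sum_to below is written as a loop, but nothing about the loop itself is observable. An optimizing compiler is therefore free to replace the body of sum_to_ten with code that simply returns 55, and at -O2 GCC and Clang typically do so; check the output in Compiler Explorer rather than taking this on trust.

```cpp
#include <cassert>

// Hypothetical example: adds the integers 1..n one at a time.
int sum_to(int n) {
  int total = 0;
  for (int i = 1; i <= n; ++i) {
    total += i;
  }
  return total;
}

// Only the returned value is observable, so the implementation may
// compute it however it likes, including at compile time.
int sum_to_ten() {
  return sum_to(10);
}

void as_if_demo() {
  assert(sum_to_ten() == 55);
}
```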

What does this mean for our example aggregate class?

Since the member initializers are all zero, the compiler will see that a default instantiation of an Agg object will set three ints to zero. This is identical to a call to memset, so it will probably call memset. A manual call to memset is unnecessary.

But wait! The class consists of only three integers. On a typical 64-bit platform with 32-bit integers, this means that only 12 bytes need to be set to zero. This can be done in two instructions on an x64 platform. Why on earth would you want to call memset? We can check this by visiting the Compiler Explorer website and trying out some code:

struct Agg {
  int a = 0;
  int b = 0;
  int c = 0;
};
void fn(Agg&);

int main() {
 auto t = Agg{}; // (1)
 fn(t);          // (2)
}

The function call at (2) prevents the compiler from optimizing away t.

The x86-64 gcc compiler, with the optimization flag set to -O3, yields the following:

main:
      sub    rsp, 24
      mov    rdi, rsp
      mov    QWORD PTR [rsp], 0   // (1)
      mov    DWORD PTR [rsp+8], 0
      call   fn(Agg&)             // (2)
      xor    eax, eax
      add    rsp, 24
      ret

We can see the two mov instructions doing the work of zeroing the three ints. The compiler writer knows that this is the fastest way of setting three ints to zero. If there were many more members to be set to zero, SIMD instructions (SSE or AVX on x86-64) would be brought into play. The joy of the Compiler Explorer website is that you can try this out yourself very easily.

We hope this convinces you not to use memset.

But what about `memcpy`?

Just as I would use memset in my C programs to zero a struct, so would I use memcpy to assign it to another instance. C++ assignment is very similar to initialization: by default, it copies data member-wise in the order of declaration using the assignment operator of that member’s type. You can write your own assignment operator, and, unlike the constructor, it does not start by implicitly performing a member-wise copy. You might think the argument for calling memcpy is stronger here, but for the same reasons as above, it is neither a good idea nor necessary. We can return to the Compiler Explorer website and make a modest change to the source:

struct Agg {
  int a = 0;
  int b = 0;
  int c = 0;
};

void fn(Agg&);

int main() {
  auto t = Agg{}; // (1)
  fn(t);          // (2)
  auto u = Agg{}; // (3)
  fn(u);          // (4)
  t = u;          // (5)
  fn(t);          // (6)
}

This now yields the following:

main:
      sub     rsp, 40
      mov     rdi, rsp
      mov     QWORD PTR [rsp], 0      // (1)
      mov     DWORD PTR [rsp+8], 0
      call    fn(Agg&)                // (2)
      lea     rdi, [rsp+16]
      mov     QWORD PTR [rsp+16], 0   // (3)
      mov     DWORD PTR [rsp+24], 0
      call    fn(Agg&)                // (4)
      mov     rax, QWORD PTR [rsp+16] // (5)
      mov     rdi, rsp                // (6)
      mov     QWORD PTR [rsp], rax    // (5)
      mov     eax, DWORD PTR [rsp+24]
      mov     DWORD PTR [rsp+8], eax
      call    fn(Agg&)                // (6)
      xor     eax, eax
      add     rsp, 40
      ret

As you can see, the compiler has generated the same QWORD/DWORD trick and has emitted code to directly copy the memory from the original object in four instructions. Again, why would you call memcpy?
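The property that lets the compiler do this can be stated in code. In the sketch below, the static_asserts confirm that Agg is trivially copyable, so plain assignment is already allowed to be a bytewise copy. The helpers copy_by_assignment and same_bytes are hypothetical, and the memcmp comparison assumes Agg has no padding bytes, which holds for three ints on common platforms.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

struct Agg {
  int a = 0;
  int b = 0;
  int c = 0;
};

static_assert(std::is_trivially_copyable_v<Agg>, "assignment may be a bytewise copy");
static_assert(std::is_trivially_copy_assignable_v<Agg>, "no user code runs on assignment");

// Ordinary assignment: no memcpy in sight.
Agg copy_by_assignment(const Agg& src) {
  Agg dst;
  dst = src;
  return dst;
}

// Compares object representations (assumes Agg contains no padding).
bool same_bytes(const Agg& x, const Agg& y) {
  return std::memcmp(&x, &y, sizeof(Agg)) == 0;
}

void assignment_demo() {
  const Agg u{1, 2, 3};
  const Agg t = copy_by_assignment(u);
  assert(t.a == 1 && t.b == 2 && t.c == 3);
  assert(same_bytes(t, u));  // identical to what memcpy would have produced
}
```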

Note that if you turn down the optimization level, then the generated code will behave more explicitly like the standard dictates and will make less use of the as-if rule. This code is faster to generate and easier to step through in the general case. If you are considering using memset and memcpy, then we are going to assume that optimization is at the top of your priority list, and you would be content to generate the most optimized code. In the above assembly you can see that some unexpected reordering has taken place. The compiler author knows all about the execution characteristics of these instructions and has reordered the code appropriately: all that is required is to emulate the observable behavior.

Never underestimate the compiler

The way to get the most out of your compiler is to tell it exactly what you want it to do and at the highest available level of abstraction. As we have seen, memset and memcpy have higher levels of abstraction available to them: construction and assignment. As a final example, consider std::fill. Rather than setting a range of memory with a single value, or copying a multiword object to a single piece of memory, std::fill solves the problem of duplicating a multiword object to a range of memory.

The naïve implementation would be to create a raw loop and iteratively construct in place or assign to the existing object:

#include <array>

struct Agg {
  int a = 0;
  int b = 0;
  int c = 0;
};

std::array<Agg, 100> a;

void fn(Agg&);
int main() {
  auto t = Agg{};
  fn(t);
  for (int i = 0; i < 1000; ++i) { // Fill the array
    a[i] = t;
  }
}

std::fill will do this for you, though, so there is less code to read, and you are less likely to insert a bug as happened above. (Did you see that? Check the size of the array and the iteration count of the for loop.)

int main() {
  auto t = Agg{};
  fn(t);
  std::fill(std::begin(a), std::end(a), t); // Fill the array
}
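For completeness, here is a self-contained variant of the std::fill version that can be compiled on its own. The helpers make_filled and all_equal_to are hypothetical names introduced for this sketch. Because the range comes from the container itself, there is no separate element count to get wrong.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

struct Agg {
  int a = 0;
  int b = 0;
  int c = 0;
};

// Fills every element with a copy of value; the bounds come from the array.
std::array<Agg, 100> make_filled(const Agg& value) {
  std::array<Agg, 100> result;
  std::fill(result.begin(), result.end(), value);
  return result;
}

// Returns true if every element matches value member for member.
bool all_equal_to(const std::array<Agg, 100>& arr, const Agg& value) {
  return std::all_of(arr.begin(), arr.end(), [&value](const Agg& x) {
    return x.a == value.a && x.b == value.b && x.c == value.c;
  });
}

void fill_demo() {
  const Agg t{1, 2, 3};
  const auto filled = make_filled(t);
  assert(filled.size() == 100);
  assert(all_equal_to(filled, t));
}
```

std::array also offers a fill member function, so result.fill(value) would express the same thing in one call.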

Compiler writers go to a lot of effort to generate the best code possible. A typical implementation of std::fill may include SFINAE machinery (or, more likely now, requires clauses) to select a simple memset or memcpy path for trivially copyable types, where bytewise operations are safe and constructor invocation is not necessary.

The motivation behind this guideline is not simply to dissuade you from using memset and memcpy. It is to persuade you to use the facilities of the language to give the compiler the best possible information to generate the optimal code. Do not make the compiler guess: it is asking you “what would you like me to do?” and will respond best of all to the correct and fullest answer.

Summary

In summary:

Use construction and assignment rather than memset and memcpy.
Use the highest available level of abstraction to communicate with the compiler.
Help the compiler to do the best job it can for you.
