5

Digital Canaries

The second core component of security, following confidentiality, is data integrity, which provides evidence that information has not been altered. During your typical day in the physical world, you relied on several integrity mechanisms, such as envelope seals and banknote holograms. However, you also relied on context to provide integrity, such as believing in the integrity of prescribed medication because it resembled genuine medicine and was sold to you by someone who looked like a pharmacist. In cyberspace, context is a far less reliable basis for assessing security. Ensuring integrity in cyberspace requires some tools.

The Unreliability of Data

When considering what is meant by security of data, people usually think about confidentiality. Since we all have secrets, the provision of confidentiality usually tops cryptographic shopping lists. But should it?

Consider your bank account for a moment. However much, or little, is amassed there, you probably don’t want everyone to know the balance. You might be concerned about unsolicited approaches from marketers, offering luxury products if it is high or loans if it is low. You might also be worried about others speculating about your lifestyle. It seems entirely reasonable to want to keep details of a bank balance confidential.

What if, however, you had to choose between keeping your balance secret and ensuring it was correct? Hopefully, you will never be faced with such a ridiculous choice, but what if you were? A mistakenly high balance might be welcome, but could you cope with one that was inaccurately low?1

Unlike the written word, data on computers is prone to becoming accidentally corrupted. This can happen at any time. Data can be accidentally altered while it is being either written to or read from memory. It can be damaged while being processed by an application, or when being transmitted over a network, particularly a wireless one. It can even change during storage, while simply sitting around doing nothing other than aging.

Much more concerning, however, is deliberate manipulation. Data is very easy to alter. In just a few seconds of unauthorized access, a “data vandal” can wreak havoc with a company’s annual-return spreadsheet or change the ending of a novelist’s latest work. Careful addition of a single zero to your bank balance might propel you to financial security; deletion of a zero could be your ruin.

Data integrity provides assurance that data has not changed since the point at which it was created legitimately. Crucially, a data integrity mechanism cannot prevent data from being corrupted. What a data integrity mechanism can do is indicate when it seems likely that the data has changed in some way.2 A data integrity mechanism can serve as a warning, just like a canary falling off its perch in a Victorian coal mine.3

Degrees of Integrity

The concept of data integrity is not as clear-cut as confidentiality, since there are nuances. As a result, a range of different security mechanisms provide data integrity.

One nuance is the severity of the perceived threat to integrity. Some security mechanisms detect accidental changes to data but offer no protection against deliberate ones.

Another nuance is whether data integrity should include identification of the source of the data, meaning whoever created the data in the first place. We expect such assurance with many everyday uses of data. For example, when you transfer money, you expect the recipient to be sure it’s you making the transfer. Rather than providing assurance that data has not changed since the point at which it was legitimately created by whomever, this assurance is stronger, verifying that data has not changed since the point at which it was legitimately created by an identifiable source. For this reason, this stronger notion of data integrity is sometimes known as data origin authentication.

A third nuance concerns the entities to which data integrity should be demonstrable. In many situations, such as when one person sends a file to another, only the recipient needs the ability to verify the integrity of the received data. For digital contracts, however, it is extremely important to be able to demonstrate data integrity to someone else as well, such as a judge resolving a future dispute.

The sections that follow give examples of data integrity mechanisms suited to each of these degrees of data integrity.

Fake News

It is worth noting that while data integrity is all about making sure information is correct, in the sense that it is unchanged, this is not the same as a consideration of whether information is true.

To appreciate the difference, think about the concept of fake news,4 where misinformation is presented as fact. It’s easy for a journalist to fabricate a fake news story and release it into the wild jungles of online media. Fake news is well suited to cyberspace, since the lack of physical context surrounding a digital news story makes it harder for people to determine its truth.5 A fake news story might well be an untrue story, but as long as readers receive the fake news story that the journalist intended to write, I would argue that data integrity has been preserved. A data integrity mechanism enables readers to detect whether any changes have been made to a story since it was originally created. In this sense, a fake news story can be shown to be correct (as written), even though it is untrue. In other words, readers can be assured that the story is precisely as fake as the day it was penned.

This confusion between truth and correctness arises because the traditional concept of integrity has different meanings.6 One definition is “the quality of being honest and having strong moral principles,” both of which seem to be somewhat lacking in the world of fake news. Honesty and morality are qualities best assessed by human beings, not machines—which means that cryptographic data integrity mechanisms can do little to support them. However, the term integrity also means “the state of being whole and undivided.” This is the notion of data integrity that I’m considering here. Cryptography can be used to detect whether data has remained whole and undivided since the moment it was created. Ironically, this means that cryptography can be used to protect, but not prevent, fake news.

Integrity or No Integrity, That Is the Question

To confirm the whole and undivided nature of data, we first need an accepted source of the “truth” about what state the data should be in. Do we have data integrity or not? Where should we turn for the answer?

The most obvious option is to identify a specific source we can trust enough to act as an integrity reference point. If your friend says something is true and you trust your friend, then you tend to believe in the integrity of what they tell you.7 Another common approach is to defer trust to a higher authority of some sort. If you’re not sure how to spell a word, for example, you might consult an authoritative source, such as the Oxford English Dictionary.

In reality, matters of trust are often less clear-cut. For example, if you download software from a website, you will often see some data displayed on the website called an MD5 hash.8 This value enables you to verify whether the software you download is exactly the same as the software the website thinks it’s making available to you. This verification process works only if you “trust” the website—not just that the website has good intentions, but also that it has good cybersecurity processes in place and could not possibly have been hacked. The website is offering itself directly as a point of trust for providing integrity. Trust me, or don’t trust me; the choice is yours.

Most cryptographic mechanisms for supporting data integrity rely on specific sources of trust. These sources are typically linked to keys. I’ll explain shortly how this works for a few different cryptographic data integrity tools. However, there is another possible reference point for data integrity. Rather than relying on a specific source, we might regard something as correct because everyone says so.

In 2016, the relatively lowly English soccer team Leicester City confounded the pundits, and almost everyone with even a passing interest in the sport, by winning the English Premier League. But how does anyone know for sure that this happened without having been there when Leicester City was awarded the trophy? Should you believe it to be correct because you read about it in a newspaper? Or saw it on television? Or because your trusted Uncle Angus told you? Should you contact the English Premier League directly to seek confirmation in writing? No, most people accept that Leicester City won the trophy because they consistently heard about it from everyone, and everywhere. Rather than relying on a specific source for the integrity of this information, we rely on the fact that all the sources agree. Leicester City was the winning team because the world agreed it to be so.

There is an increasing interest today, for a variety of reasons, in integrity mechanisms that use a more global type of reference point. These include technologies such as Bitcoin (more on this shortly), which enables the integrity of a digital currency without the need for a single trusted bank.

Integrity Checks

One way of checking correctness of information is to seek corroborative evidence. Determining the integrity of information during a court case usually means seeking information from multiple sources, then establishing which aspects of this information are agreed upon. Scientists determine the integrity of experimental results by rerunning past experiments. Ideally, we make up our minds about the integrity of information by evaluating evidence received from different sources, all of which we trust to varying degrees.

In many situations, however, we cannot afford the luxury of seeking supporting evidence. When our web browser is communicating with an online store, there is no alternative source to consult regarding the integrity of the data being exchanged. Decisions about the integrity of data need to be made immediately, on the basis of the current communication session, and need to be resolved efficiently, without delay.

Think, for a moment, about how we approach this problem in the physical world. An example of important written information requiring an assurance of integrity is a transcript listing a job applicant’s academic qualifications. In extreme circumstances, a potential employer could personally call the applicant’s teachers in order to directly confirm the validity of a transcript, but this would be an inefficient means of verifying integrity.

More commonly, the transcript bears an official stamp or seal.9 The real purpose of the seal is to indirectly state that the integrity of the information on this piece of paper is assured by the creator of this stamp. The stamp itself is small, containing much less information than the transcript, yet it vouches for the integrity of the entire document. The employer only has to scrutinize the stamp and, if satisfied, can assume that the information in the rest of the document is likely to be correct.

In many other situations as well, a small piece of information is used as a verifiable representation of the integrity of a larger piece of information. Perhaps the most widespread means of providing assurance of integrity is the handwritten signature. Signatures are used in several different security contexts, but their most common use is to vouch for the correctness of a longer document. When you sign a letter or a contract, you are really confirming that you’re happy with the integrity of the information contained therein. Anyone relying on the contents of the signed document will assume, on checking the validity of your signature, that you were happy with the document’s integrity at the time of signing.

Stamps and handwritten signatures are compact seals of approval of the integrity of a written document. However, their effectiveness also relies on the materiality of the document itself. An unscrupulous job seeker could try to modify the grades on a transcript and hope that a potential employer will not notice the change. Likewise, a fraudster could sign a letter and then later modify the contents. Cumbersome legal procedures, such as keeping duplicate copies of a contract at a lawyer’s office, are perhaps the only means of countering such fraud.

The main problem with compact assurances of integrity, such as stamps and signatures, is that they have a static form; they don’t change each time they are used. The stamp on the transcript is precisely the same on both the original transcript and the one the fraudster modified. The handwritten signature on the letter remains the same, no matter how many modifications are subsequently made to the contents. Indeed, in the physical world, it’s hard to imagine how it could ever be otherwise. This is one reason context is so important to the provision of integrity in the physical world.

The digital world presents an amazing opportunity to do much better. Information in the digital world is represented by numbers. Since numbers can be combined and computed, it is possible to do something in the digital world that’s unthinkable in the physical world: we can design means of assuring the integrity of data that are not just compact and easy to check but also dependent on the data itself. In other words, we can produce a digital stamp on a transcript that will cease to be valid if the transcript changes. While we lose physical and contextual integrity in cyberspace, we can use data integrity mechanisms that are more sophisticated than those of the physical world.

The Evil Librarian

Let’s start with a simple data integrity mechanism designed for information consisting of numbers. The International Standard Book Number (ISBN) is a universally recognized means of uniquely identifying a published book (in fact, you’ll find one printed on the jacket of this book).10 For example, the must-read title Dachshunds for Dummies by Eve Adamson (John Wiley, 2007) has the ISBN 978-0-470-22968-2. The ISBN identifies this title precisely. Should anyone else choose to write a book with the same title, it will have a different ISBN. The ISBN is particularly useful for librarians and booksellers, who can be sure they are accessing the correct title by referencing it.

That said, “Dachshunds for Dummies” trots off the tongue, and keyboard, much more readily than does “978-0-470-22968-2.” Even if you spell dachshund incorrectly, most computer spellcheckers will automatically rectify your error. You probably won’t be so fortunate if you mistype one of the digits of 978-0-470-22968-2. For this reason, each ISBN has a built-in integrity check, designed to make it likely that errors will be detected. While the first twelve digits of an ISBN form the unique serial number, the purpose of the last digit is to verify the integrity of the first twelve. This final check digit is computed from the others by means of a simple calculation: add the digits in positions 1, 3, 5, 7, 9, and 11 to 3 times the sum of the digits in positions 2, 4, 6, 8, 10, and 12; then subtract the last digit of the answer from 10. The result is the thirteenth digit of the ISBN (if the result is 10, the check digit is 0). Check our example: 9 + 8 + 4 + 0 + 2 + 6 = 29 is added to 3 × (7 + 0 + 7 + 2 + 9 + 8) = 3 × 33 = 99 to get 128, whose last digit is 8, which, subtracted from 10, is 2.
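The check-digit rule can be sketched in a few lines of Python. The function name `isbn13_check_digit` is purely illustrative. The final `% 10` handles the case where the weighted sum already ends in 0, so that the check digit is 0 rather than 10.

```python
def isbn13_check_digit(first_twelve: str) -> int:
    """Compute the thirteenth digit of an ISBN from its first twelve digits."""
    # Ignore hyphens and keep only the digits.
    digits = [int(ch) for ch in first_twelve if ch.isdigit()]
    if len(digits) != 12:
        raise ValueError("expected exactly twelve digits")
    odd_positions = sum(digits[0::2])    # positions 1, 3, 5, 7, 9, 11
    even_positions = sum(digits[1::2])   # positions 2, 4, 6, 8, 10, 12
    total = odd_positions + 3 * even_positions
    # Subtract the last digit of the total from 10; a result of 10 becomes 0.
    return (10 - total % 10) % 10


print(isbn13_check_digit("978-0-470-22968"))  # 2
```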

Whenever an ISBN is entered into a computer, the thirteenth digit can be automatically recomputed. If an error has occurred in any of the first twelve digits, it is highly likely that the result of the calculation will not be the thirteenth digit of the ISBN. For example, if the fourth digit in our example is mistakenly entered as 1 rather than 0 (resulting in the incorrect 978-1-470-22968-2), then the check-digit calculation yields 9. Since the thirteenth digit is 2, something is wrong. There is a small chance that an error could go undetected and, unluckily, the check digit would compute correctly anyway (for instance, if we made two mistakes and wrongly entered the ISBN in our example as 978-1-470-22968-9), but in most cases errors are flagged.
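As a rough sketch of what a computer might do when an ISBN is typed in, the following Python function recomputes the check digit and compares it with the thirteenth digit. The name `isbn13_is_valid` is an assumption made for illustration. The three calls reproduce the cases just described: the genuine ISBN, a single mistyped digit, and a pair of mistakes that happen to cancel out.

```python
def isbn13_is_valid(isbn: str) -> bool:
    """Return True if the thirteenth digit matches the recomputed check digit."""
    digits = [int(ch) for ch in isbn if ch.isdigit()]
    if len(digits) != 13:
        return False
    # Weight 1 for positions 1, 3, ..., 11 and weight 3 for positions 2, 4, ..., 12.
    total = sum(digits[0:12:2]) + 3 * sum(digits[1:12:2])
    expected = (10 - total % 10) % 10
    return expected == digits[12]


print(isbn13_is_valid("978-0-470-22968-2"))  # True: the genuine ISBN
print(isbn13_is_valid("978-1-470-22968-2"))  # False: one mistyped digit is flagged
print(isbn13_is_valid("978-1-470-22968-9"))  # True: two mistakes go undetected
```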

Importantly, the ISBN is not designed to cope with intentional errors. If a manipulative librarian decides to make deliberate changes to an ISBN, then this mechanism offers no protection. For example, suppose a librarian changes the twelfth digit of the ISBN in our example from 8 to 7. If this is the only change, then 978-0-470-22967-2 will be flagged invalid because the check digit does not compute correctly. However, anyone can compute what the thirteenth digit should be, so an evil librarian simply has to calculate what the correct check digit is for the twelve digits 978-0-470-22967. Since 29 plus 3 × 32 is 125, whose last digit is 5, the correct check is 10 – 5 = 5. To avoid detection, the librarian should thus alter the final digit to 5, resulting in a valid ISBN of 978-0-470-22967-5, which happens to correspond to the sister title Chihuahuas for Dummies (perish the thought that such a horrendous crime might ever be committed).

Librarians are generally not evil, as we all know. The ISBN integrity check mechanism is extremely lightweight and designed to cope with only accidental errors. Nonetheless, we rely on numbers like the ISBN for many aspects of our lives, and including a lightweight integrity check as part of these numbers is better than having no integrity check at all. Integrity check digits computed using similar calculations are included in the likes of credit card numbers, social security numbers, and the numbering system for European locomotives.11

Toward Stronger Integrity Checking

The check digit included as part of the ISBN is very basic. However, integrity check digits share a few important features with stronger cryptographic integrity mechanisms that are worth commenting on.

Most fundamentally, unlike stamps on a physical transcript, check digits serve as compact representatives of the data being protected because they are computed from the data itself. For a particular data item, such as a book number, there will be only one correct check digit. However, because there are many fewer possible check digits (only ten) than book numbers, there will inevitably be many ISBNs with the same check digit. This is not a problem per se, but it is the reason that an erroneous ISBN might not be detected, since an incorrect number’s check digit could compute correctly. Adding extra check digits could reduce this risk, but at a cost of reduced efficiency (in this case the ISBN would have to become longer). Likewise, cryptographic integrity mechanisms sometimes allow such a trade-off between security and efficiency.

It is worth reemphasizing that check digits don’t guarantee the detection of errors; rather, they detect errors with a known probability of success. Nor do they prevent errors from occurring (realistically, no mechanism based solely on a data computation would be able to do this) or correct errors that have occurred. These observations apply just as readily to cryptographic integrity mechanisms.

Unfortunately, check digits have a property that is highly undesirable in stronger cryptographic integrity mechanisms. Because the ISBN check digit is computed simply by adding multiples of the first twelve digits together, it is straightforward to predict how changes to the first twelve digits of an ISBN will affect the value of the check digit. This means it is easy to determine how the check digit changes if, for example, we alter one digit of an ISBN or add two ISBNs together. It also makes it very easy to predict when two ISBNs will have the same check digit.
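To make this predictability concrete, here is a small Python sketch (the function names are hypothetical). Because the check digit is just a weighted sum reduced to a single digit, the effect of changing one digit can be predicted from the old check digit alone, without looking at the rest of the ISBN.

```python
def check_digit(digits: list) -> int:
    """ISBN check digit for a list of twelve digits."""
    total = sum(digits[0::2]) + 3 * sum(digits[1::2])
    return (10 - total % 10) % 10


def predicted_check_digit(old_check: int, position: int,
                          old_digit: int, new_digit: int) -> int:
    """Predict the new check digit after changing the digit at `position` (1 to 12)."""
    weight = 1 if position % 2 == 1 else 3
    # Raising a digit by some amount lowers the check digit by weight times that amount.
    return (old_check - weight * (new_digit - old_digit)) % 10


original = [9, 7, 8, 0, 4, 7, 0, 2, 2, 9, 6, 8]
old_check = check_digit(original)  # 2

# Changing the twelfth digit from 8 to 7 shifts the check digit from 2 to 5.
print(predicted_check_digit(old_check, 12, 8, 7))  # 5
```

A cryptographic integrity mechanism should offer no such shortcut.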

That said, booksellers and librarians don’t care about the predictability of check digits. Nor does anyone ever want to add two different social security numbers together. Check digits work just fine for these examples.

The Cryptographic Swiss Army Knife

Integrity check digits are the “store brand” data integrity tool. If you are willing to pay a little bit more, where the currency in this case is a combination of complexity to implement and time to compute, then the integrity mechanism you should use is a cryptographic hash function.12

A hash function inputs data of any length and outputs a short integrity check for this data known as a hash, or digest. Somewhat unusually for a cryptographic tool, no key is involved in this process. Just like a check digit, a hash is much smaller than the underlying data and is computed directly from it. If we want to detect whether a file has suffered accidental changes while being transmitted across a network, for example, then the sender of the file can first compute a hash of the file. The sender transmits both the file and its hash to the receiver. In order to check the integrity of the file arriving at the destination, the receiver computes the hash of the received file and compares it with the sent hash. If they match, the receiver concludes that the file did not accidentally change en route.
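The send-and-compare procedure might look like this in Python. The choice of SHA-256 (via the standard `hashlib` module) and the function names are assumptions made for this sketch; the text does not prescribe a particular hash function here.

```python
import hashlib


def compute_hash(data: bytes) -> str:
    """Return the SHA-256 hash of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def receiver_accepts(received_file: bytes, sent_hash: str) -> bool:
    """Recompute the hash of the received file and compare it with the sent hash."""
    return compute_hash(received_file) == sent_hash


# Sender side: compute the hash and transmit it alongside the file.
file_bytes = b"Quarterly report: sales up 4 percent"
sent_hash = compute_hash(file_bytes)

# Receiver side: an intact file is accepted, an accidentally altered one is not.
print(receiver_accepts(file_bytes, sent_hash))                                 # True
print(receiver_accepts(b"Quarterly report: sales up 9 percent", sent_hash))    # False
```

Note that the hash is always 256 bits (64 hexadecimal characters) long, however large the file.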

Where a hash differs from a check digit is in the process used to compute it. While a check digit is calculated in a very simple way from the data, a hash is computed by a cryptographic algorithm. Recall that I previously compared cryptographic algorithms to blenders. This analogy works quite well for encryption, since an encryption algorithm mixes a set of ingredients into a blended form with no loss of mass. In other words, the ciphertext is a randomized version of the plaintext but remains (approximately) the same size as the plaintext. A hash function also blends the underlying data, but it outputs something much smaller than the data itself. A hash function is more like a juicer: the ingredients are pulped, but what is eventually output is much smaller in quantity than what was put in.

The main advantage of a hash over a check digit is that the cryptographic process used to compute a hash obscures the connection between the data and its hash. In contrast to a check digit, changes to the data being hashed result in unpredictable changes to the hash. Even if you change just one bit of information in a file, for example, the resulting hash will bear no apparent relationship to the hash of the original file. In addition, unlike for check digits, it is extremely difficult to find two different files with the same hash.
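The following sketch illustrates the effect of a one-bit change, again assuming SHA-256 from Python's `hashlib`. The helper names `flip_bit` and `differing_bits` are made up for this example. For a well-designed hash function, roughly half of the output bits are expected to differ, although the exact count depends on the input.

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of the data with a single bit inverted."""
    changed = bytearray(data)
    changed[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(changed)


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


original = b"Pay Alice 100 pounds"
tampered = flip_bit(original, 0)

hash_original = hashlib.sha256(original).digest()
hash_tampered = hashlib.sha256(tampered).digest()

print(differing_bits(original, tampered))            # 1
print(differing_bits(hash_original, hash_tampered))  # a large share of the 256 bits
```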

Hash functions turn out, perhaps surprisingly, to be one of the most useful tools that cryptographers have ever invented.13 Unlike encryption, hash functions aren’t much use on their own. However, they are used in all sorts of other ways to support more complex cryptographic operations. For this reason they are sometimes described as the “Swiss Army knives” of the cryptographic tool kit.

For starters, they can be used as a sort of cryptographic “glue,” to bind different data items together. Because hashes of data are essentially unpredictable, hash functions can also be used to generate random numbers. Since they compress data, hash functions are often used as central components of other cryptographic mechanisms, including digital signatures, to improve efficiency. Another use of hash functions is to protect passwords. And the Bitcoin cryptocurrency scheme is built from multiple uses of hash functions, which means that hash functions help to drive the economy of the Dark Web. I will discuss all three of these latter uses of hash functions in due course.

Integrity in the Presence of Malice

Unfortunately, a hash function on its own cannot provide integrity in any situation where there might exist an attacker capable of deliberately changing data. While an evil librarian is unlikely to gain much from manipulating ISBN check digits, the same cannot be said for an attacker who observes a file and its hash being sent over the internet. If the attacker wants to change the file without detection, all they need to do is modify the file and then compute the new hash for the modified file. When the receiver retrieves the modified file, this new hash will be verified as being correct. This situation arises because, just as anyone can compute the check digit of an ISBN, anyone can compute a hash of some data.
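A short Python sketch of this attack, assuming SHA-256 and a hypothetical `attacker_rewrite` function standing in for whoever intercepts the transmission:

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Sender transmits the file together with its hash.
original_file = b"Transfer 100 to Bob"
message_in_transit = (original_file, sha256_hex(original_file))


def attacker_rewrite(message):
    """Replace the file and recompute the hash; no secret is needed to do so."""
    _intercepted_file, _intercepted_hash = message
    forged_file = b"Transfer 900 to Mallory"
    return forged_file, sha256_hex(forged_file)


# Receiver checks the hash exactly as before, and the forgery passes.
received_file, received_hash = attacker_rewrite(message_in_transit)
accepted = sha256_hex(received_file) == received_hash
print(accepted)  # True
```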

There are two approaches to dealing with this problem. The first is to make sure the hash is communicated to the recipient of the file by means that an attacker cannot manipulate. Suppose you want to send a file to a friend. First, you email the file to your friend. Your friend then telephones you and asks what the hash of the file should be. Since a hash is a short piece of data, it is easy to read over the phone. Your friend then checks the hash of the file they received from you and compares it with the hash you’ve just told them.

In many situations, however, it is either impossible, or at least inconvenient, to employ a separate means of protecting the hash of some data. In these cases the idea of a hash function needs to be adapted so that not everyone is able to compute the hash of some data. Fortunately, there is an obvious way of doing this. Remember, a hash function is a cryptographic algorithm that simply compresses data into a smaller hash without using a key. The solution is thus to include a key, somehow, in the hash computation process.

On the Origin of Data

The next product upgrade in our data integrity store is a keyed hash function. Suppose you and your friend agree on a secret cryptographic key. You append this key to a file and then compute a hash on the combined file and key. You now send the file (without the appended key) and this hash to your friend, who appends the shared secret key to the received file and recomputes the hash. If this recomputed hash matches the sent hash, your friend concludes that the file is unmodified.

This process should protect the file against an attacker who makes deliberate changes. The attacker can intercept the file during transmission and make any changes they like. What they cannot do is compute a new hash that will be valid for the modified file. Although they know the contents of the modified file, since they don’t know the secret key they cannot compute the hash of the key appended to the modified file. Any alteration of the file will thus be detected.
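The append-then-hash idea can be sketched as follows. SHA-256 and the names `naive_keyed_hash` and `shared_key` are assumptions made for illustration, and the fixed key is a placeholder. This shows only the simple construction described above; it is not a recommended design.

```python
import hashlib


def naive_keyed_hash(file_bytes: bytes, key: bytes) -> str:
    """Hash the file with the shared secret key appended (illustration only)."""
    return hashlib.sha256(file_bytes + key).hexdigest()


shared_key = b"hypothetical secret agreed in advance"

# Sender: compute the keyed hash, then send the file and the hash (never the key).
file_bytes = b"Meet at noon"
sent_hash = naive_keyed_hash(file_bytes, shared_key)

# Receiver: append the same key to whatever arrived and recompute.
print(naive_keyed_hash(file_bytes, shared_key) == sent_hash)          # True
print(naive_keyed_hash(b"Meet at nine", shared_key) == sent_hash)     # False

# Attacker: without the key, a recomputed plain hash does not match what the receiver expects.
forged_hash = hashlib.sha256(b"Meet at nine").hexdigest()
print(naive_keyed_hash(b"Meet at nine", shared_key) == forged_hash)   # False
```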

The basic idea here is excellent. Alas, however, it does not quite work, for a variety of technical reasons, none of which I will bore you with here.14 In practice, special hash functions are used that incorporate the secret key in a more sophisticated manner than simply appending it to the data. These hash functions are more commonly referred to as message authentication codes, or MACs. One of the most popular MAC algorithms is known as HMAC15 and is built directly from a hash function (hence the “H”). Other MAC algorithms are built in different ways, including CMAC,16 which is built from a block cipher (hence the “C”).
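As a minimal sketch, HMAC is available in Python's standard `hmac` module. The pairing with SHA-256, the 32-byte random key, and the helper names `compute_mac` and `verify_mac` are choices made for this example rather than anything the text specifies.

```python
import hashlib
import hmac
import secrets


def compute_mac(key: bytes, data: bytes) -> bytes:
    """Compute an HMAC-SHA-256 tag on the data under the shared key."""
    return hmac.new(key, data, hashlib.sha256).digest()


def verify_mac(key: bytes, data: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare it with the received one in constant time."""
    return hmac.compare_digest(compute_mac(key, data), tag)


shared_key = secrets.token_bytes(32)   # known only to sender and receiver
file_bytes = b"Transfer 100 to Bob"
tag = compute_mac(shared_key, file_bytes)

print(verify_mac(shared_key, file_bytes, tag))                  # True
print(verify_mac(shared_key, b"Transfer 900 to Mallory", tag))  # False

# An attacker who does not know the key cannot produce a tag that verifies.
attacker_key = secrets.token_bytes(32)
forged_tag = compute_mac(attacker_key, b"Transfer 900 to Mallory")
print(verify_mac(shared_key, b"Transfer 900 to Mallory", forged_tag))  # False
```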

In fact, MACs are one of the most important cryptographic mechanisms used to protect everyday applications in cyberspace. The reason MACs are so useful is that the introduction of a key not only enables protection against an attacker who deliberately manipulates data but also strengthens the level of integrity protection to enable the provision of data origin authentication, which I introduced earlier. When the receiver of a file successfully verifies a MAC on the file, the key also provides evidence of the source of the file itself. Whoever computed the MAC on the file must know the key, and the receiver knows that the specified sender is the only other entity who knows this key. The file must thus have originated with the specified sender.

You have undoubtedly used MACs many times without realizing it. They provide data origin authentication (and hence data integrity) for bank transactions, card payments, Wi-Fi communications, secure web connections, and many other applications. Indeed, it is relatively unusual to symmetrically encrypt data without also adding a MAC of the data. Confidentiality and data origin authentication are so commonly required together that a number of special authenticated-encryption modes of operation of a block cipher have been proposed in order to both encrypt and compute a MAC on data in one go. These authenticated-encryption modes are increasingly popular and are likely to become default choices in the future.17

Anything You Can Do, I Can Do

In terms of providing strong data integrity in the form of data origin authentication, you would think, surely, that MACs are the perfect tool. They can detect the tiniest changes to data, whether accidental or deliberate; they can be used to determine the sources of data; they are widely deployed in many of the most vital applications of cryptography. What’s not to like?

A MAC provides the receiver of a file with assurance that the file is unmodified and came from the specified sender. This is sufficient assurance for most applications in the real world. But it’s not the strongest notion of data origin authentication that we could ever ask for. To see why, ask yourself this question: Does a MAC allow anyone to be sure that a file is unmodified and came from the specified sender?

Consider using a MAC to protect a digital contract being sent over the internet. The MAC allows the receiver to be confident that the contract came from the sender. But what happens if the sender and receiver later argue about the contract? If they call in a third party to resolve their dispute, the receiver might present the MAC as evidence that the sender originated, and by default agreed to, the contract. The sender, on the other hand, could argue that no such thing happened and that the receiver created the contract and its MAC without any involvement of the sender. This problem arises because of the symmetric capabilities of both sender and receiver. The third party can certainly conclude that the file originated with a holder of the MAC key. But which one? Was it the sender or the receiver who created the MAC? They both have the key, so either could have computed it.18

This example reveals an inherent problem with using symmetric cryptography to provide data origin authentication. Because a symmetric key is shared between, in this case, a sender and receiver, anything one can do, the other can also do (not better, just equally). Therefore, the receiver can confirm the sender as the originator of a received file, but the MAC does not enable anyone else to be so sure.
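The symmetry is easy to see in code. In this sketch (HMAC-SHA-256 and the placeholder key are assumptions), the sender and the receiver each compute a MAC on the same contract. The two tags are identical, so the tag alone cannot tell a third party which of them produced it.

```python
import hashlib
import hmac

# Hypothetical key held by both the sender and the receiver.
shared_key = b"key known to both parties"
contract = b"The sender agrees to pay the receiver 1000 pounds."

# What the sender would compute when originating the contract.
tag_by_sender = hmac.new(shared_key, contract, hashlib.sha256).digest()

# What the receiver could compute alone, with no involvement of the sender.
tag_by_receiver = hmac.new(shared_key, contract, hashlib.sha256).digest()

print(tag_by_sender == tag_by_receiver)  # True
```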

Thus, MACs are excellent cryptographic mechanisms for providing data origin authentication, unless you want to demonstrate the authentication to someone else. If this yet stronger ability to provide data origin authentication to a third party is required, some asymmetry needs to be introduced into the keying arrangements of a MAC so that only one person has the ability to create a MAC. Fortunately, you already know all about asymmetric keys—don’t you?

Digital Anti-padlocks

Most physical integrity mechanisms have a property not provided by check digits, hash functions, or MACs. Sealing a document with an official stamp or writing a signature on a contract might not prevent anyone from subsequently modifying the contents, but both of these integrity-assuring mechanisms provide undeniable evidence of who created them. The stamp on an academic transcript is an authoritative linking of the transcript to the issuing institution. A handwritten signature on a contract essentially states: “The signatory was here.” In contrast, anyone can compute a hash, and anyone who holds the symmetric key can compute a MAC.

The ability to link an integrity check to a unique source is sometimes called nonrepudiation, because it prevents whoever created an integrity check from denying they did so. Nonrepudiation is the premium version of data integrity, and it is needed wherever an attacker could manipulate data and where the source of data may need to be proved to a third party. These are strong requirements and call for a powerful cryptographic tool.

Nonrepudiation requires a cryptographic mechanism producing an integrity check that anyone can verify as uniquely linked to the creator of this check. If you think about it, this is almost the opposite of what a padlock does. Remember that a padlock allows anyone to lock something up in such a way that only the key holder can unlock it. What is required is a sort of “anti-padlock,” which allows only a key holder to create an integrity check that anyone can verify.

Can we use our knowledge of how to build digital padlocks to create a digital anti-padlock? The good news is that we can. The concept of a digital padlock, enabled by asymmetric encryption, can be adapted to create a cryptographic mechanism for providing nonrepudiation. Since this mechanism links data to a unique source in much the same way that a handwritten signature does, it is known as a digital signature.

Digital Signatures

The principle behind a digital signature is to use an asymmetric encryption scheme somehow in reverse. By reverse, I mean swapping the roles of the public and private keys. In asymmetric encryption, the sender encrypts their plaintext data by using the widely available public key of the recipient, who then decrypts the ciphertext by using the private key that, critically, only the recipient has access to. To create a digital signature, the sender encrypts the data by using their private key, and the recipient verifies the integrity of the data by decrypting it using the sender’s public key. Only the sender can create this digital signature, because the signature relies on the sender’s private key. Anyone can verify this digital signature, because verification requires the sender’s public key, which is not a secret. This is the idea anyway.

In practice, things don’t work quite so simply. For one thing, most asymmetric encryption algorithms need to be slightly reworked to facilitate this reversal. Perhaps more fundamentally, however, digital signatures are an integrity check, not a means of providing confidentiality. Since the data is not secret, it is reasonable to assume that whoever needs to verify the digital signature will also be sent the data itself. An integrity check should thus accompany the underlying data, just as a handwritten signature is an addition to a document. If the data were simply “encrypted” in order to produce a digital signature, the signature would end up being a “ciphertext” as large as the data itself. Compared to the compactness of check digits, hash functions, and MACs, digital signatures would be clumsy and inefficient.19

The crucial observation is that to create a digital signature, it is not necessary to “encrypt” (a better verb in this case is sign) the entire data. It suffices to sign a compact representation of the data—something small that depends on every bit of the data. If you’ve been paying attention, you should remember that we have a great cryptographic tool for this! A digital signature is typically created by first using a hash function to generate a hash of the data; then this hash is signed using the sender’s private key. Anyone can then verify this digital signature by first computing a hash of the data and then checking whether the “decryption” (let’s call this verification) of the digital signature results in the same hash. If it does, then whoever is verifying this digital signature learns several things. Let’s go through them one by one.

First, the verifier is assured of data integrity. If an attacker has modified the file in transit, the hash of the modified file will be different. The attacker can compute this modified hash, but what the attacker cannot do is produce a new valid digital signature on the modified hash, because they don’t have access to the original sender’s private key.

Second, the verifier is assured of data origin authentication. The only circumstance in which the digital signature could be “decrypted” via the sender’s public key to the correct hash of the data is when the sender’s private key was used to create the digital signature. So, the verifier knows that the data originated with the sender.

Third, we have nonrepudiation. Anyone can verify this digital signature, since all that’s needed is knowledge of the sender’s public key. Importantly, the sender cannot deny signing the data, since only they know the private key corresponding to the public key used to verify the signature.
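For readers who like to see the moving parts, the hash-then-sign idea can be sketched with "textbook" RSA, which needs nothing more than Python's built-in pow function (version 3.8 or later for the modular inverse). Everything here is an assumption made for illustration: the two primes are far too small and too special to be secure, the names sign and verify are invented, and real signature schemes add carefully designed padding that this sketch omits.

```python
import hashlib

# Toy key generation. These primes are assumptions picked so the sketch
# runs instantly; they offer no real security whatsoever.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q                            # public modulus
E = 65537                            # public verification exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private signing exponent


def digest_as_number(data: bytes) -> int:
    """Hash the data with SHA-256 and reduce the hash to a number below N."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N


def sign(data: bytes) -> int:
    """Only the holder of the private exponent D can compute this value."""
    return pow(digest_as_number(data), D, N)


def verify(data: bytes, signature: int) -> bool:
    """Anyone who knows the public values (N, E) can run this check."""
    return pow(signature, E, N) == digest_as_number(data)


message = b"I agree to the terms of this contract."
signature = sign(message)
genuine_ok = verify(message, signature)
tampered_ok = verify(b"I agree to the terms of that contract.", signature)
```

The signature is a single number roughly the size of the modulus, however long the message is, and checking it uses only public information.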

Game, set, and match.

Digital signatures are top-drawer data integrity mechanisms; in terms of the strength of the data integrity offered, they really can’t be beaten. However, you don’t get the best without paying for it. Hopefully, you’ll have already noticed that digital signatures come with the same two costs as asymmetric encryption. First, we have the problem of determining the validity of public keys, in this case public verification keys. Second, digital signatures are slower to compute than are other data integrity mechanisms, because they rely on computations similar to those of asymmetric encryption.

Just as for asymmetric encryption, it is probably wise not to deploy digital signatures unless you really need the strength of integrity they offer. As noted, MACs suffice for many of our daily uses of cryptography. In some sense, digital signatures are to integrity what asymmetric encryption is to confidentiality. In the types of open environments where we tend to need asymmetric encryption, we also tend to need digital signatures. For example, most secure email systems give users the option to encrypt email (using hybrid encryption), to digitally sign it, or to do both. Your Wi-Fi, on the other hand, uses symmetric encryption and MACs because it’s a closed environment where sharing keys is simple.

Ironically, however, one of the most significant uses of digital signatures is to address the primary problem with using them in the first place! Digital signatures form an important component of the most common method for validating public keys, whether those are public keys to support asymmetric encryption or digital signatures. More on this later.

Digital Signatures Are Not Signatures

The term digital signature conjures up the image of some sort of futuristic cyber hand signing digital data. It is tempting, therefore, to claim that digital signatures are the cyberspace equivalent of a handwritten signature. Tempting, maybe, but the analogy is treacherous. While they indeed share certain features, digital signatures are very different beasts. Perhaps they should really be described as nonrepudiation mechanisms, but that doesn’t have the same ring to it!

In many respects, digital signatures are superior to handwritten signatures. By far the strongest advantage of digital signatures is that they are computed directly from the underlying data. If the data changes in any way, then the digital signature changes. Therefore, every version of a document has a different digital signature. Sure, you have your “I’m in a hurry but you don’t really need to read this” signature for the delivery driver, and your “I can do really nice handwriting when I want to” signature for your passport application, but in general, handwritten signatures vary only slightly from document to document.

Another positive feature of digital signatures is that they can be reproduced and checked with precision. For schemes that sign deterministically, signing the same data a second time with the same signature key produces exactly the same digital signature. Schemes that mix fresh randomness into each signature produce different values each time, but every one of them either verifies against the data or does not, with no gray area. This precision has the potential to provide strong evidence in a court case. Handwritten signatures do not have such absolute accuracy, and sometimes they require experts to determine whether two handwritten signatures are the same.

However, there are some disadvantages of digital signatures. Most significantly, digital signatures rely on a cryptographic infrastructure, which requires good key-management practices and reliable technology. If there are weaknesses in this potentially costly infrastructure, then digital signatures become ineffective. For example, if someone manages to steal another person’s signature key, the thief will be able to create digital signatures appearing to originate from the victim. Handwritten signatures require no such infrastructure and are, quite literally, portable.20

The Wisdom of the Crowd

It’s time to think again about who, or what, we rely on in order to determine data integrity. Is this data correct or not? What helps us decide? We can verify the MD5 hash of a downloaded file by checking it against the one displayed on a website. This verification works, as long as we trust the website. We can verify the MAC on a received file by locally recomputing it. This verification also works, as long as we trust that the sender is the only other person with a copy of the key used to compute the MAC. We can verify the digital signature on an email message by applying the appropriate public verification key to the signature. This verification works as well, as long as we trust that the public verification key is the one belonging to the sender of the email.
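The first of these checks, comparing the hash of a downloaded file with the one published on a website, might look something like the following sketch. The function name and the sample data are assumptions for illustration, and MD5 is used only because it is the example in the text; a stronger hash function from the same library could be swapped in by changing one line.

```python
import hashlib
import hmac


def matches_published_hash(file_bytes: bytes, published_hex: str) -> bool:
    """Recompute the MD5 hash locally and compare it with the published one."""
    local_hex = hashlib.md5(file_bytes).hexdigest()
    return hmac.compare_digest(local_hex, published_hex.strip().lower())


# Hypothetical download and the hash its website claims for it.
downloaded = b"contents of the installer we just fetched"
published = hashlib.md5(downloaded).hexdigest()

intact = matches_published_hash(downloaded, published)
corrupted = matches_published_hash(downloaded + b"\x00", published)
```

Notice what the check cannot do: if an attacker controls the website, they can publish a hash that matches their own altered file, which is why the verification is only as good as our trust in the site.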

All of these examples rely on some very specific trusted roles. The effectiveness of the MD5 hash requires trust in the implementation and management of the website. The MAC requires trust in the distribution and secrecy of MAC keys. The digital signature requires trust in the distribution and secrecy of private signature keys, as well as trust in the authenticity of public verification keys. What do we do when trust of this type is lacking?

One option, which I introduced earlier when considering Leicester City’s year of glory, is to trust the integrity of some information if everyone says it is correct. We must be a little careful with this argument, however, since much depends on who “everyone” really is.

The citizens of North Korea, for example, are subject to very strict information control. They have few means of communicating with the outside world, and their ability to freely exchange information with one another is limited through press control, surveillance, and travel restrictions. North Korean citizens are also required to tune in to daily radio broadcasts from the governing regime. As a result of these measures, they undoubtedly believe many things that most of us would not, largely because what everyone agrees on is the result of an information control operation, tightly managed by their political leaders.21

North Korean state broadcasts may not always be factually correct, but because the government controls information within the country’s borders, the political messages received by North Korean citizens do have integrity in the sense that the information North Koreans absorb is unchanged from the information they were intended to receive. The fact that everyone receives this same information, and most citizens believe it, helps to reinforce integrity. The truth, as noted during our discussion on fake news, is quite a different matter.

Societies more democratic than North Korea are not always in a much better position to assess the correctness of information. Traditional media, social networks, and search engines are all known to create filter bubbles, where experiencing a consistent message from multiple sources influences a user’s beliefs.22 A user may come to trust in the integrity of some information because “everyone” seems to say it is correct. In these examples, however, the notion of “everyone” is often restricted in ways that the user may not appreciate or understand. Readers of a particular newspaper often have shared political beliefs, social networks are self-selecting because most people choose friends with whom they share interests, and search engines are driven by algorithms heavily influenced by a user’s previous behavior (search history, visits to web pages, etc.). “Everyone,” in these instances, might be only a few people and will most likely be unrepresentative of the population at large.

The web encyclopedia Wikipedia provides another interesting example. You read it on Wikipedia, so it must be correct, right? Some people scoff at the very idea that anyone would trust Wikipedia, while others regard it as a reliable source of information. The important thing to recognize is that almost anyone can create and subsequently edit a Wikipedia web page. The information displayed on Wikipedia evolves over time through a process whereby users read, contest, and correct the entries. It could thus be argued that a Wikipedia page eventually represents a consensus of what “everyone” says. The problem with this argument is that some Wikipedia pages are scrutinized extensively, while others are rarely visited. For each individual page there is thus a very different notion of who “everyone” is. As a result, the quality of information on Wikipedia is highly variable.23

As we’ve seen, the perceived wisdom of the crowd comes with hazards. The correctness of information depends very much on which crowd is used. Nonetheless, especially when there’s no central point of trust on which to base integrity, the idea of using a global reference point for integrity is highly compelling. Does anyone doubt that Paris is the capital of France?

We cannot always depend on “everyone” being sufficiently excited about some particular information that its integrity becomes globally recognized. Nor can we always wait for integrity to grow over months and years in the manner of information on a Wikipedia page. So, how do we use this idea of a global reference point to support the integrity of day-to-day information, such as precisely how much bitcoin you own? How do we find a crowd whose wisdom can be universally relied on?

You Are Your Bank

How much money is in your bank account? Don’t tell us! But consider how you establish the integrity of this number (whether positive or negative). How do you really know your correct bank balance? Whether you like it or not, the ultimate answer to this question is that you need to trust your bank. The bank is the authoritative source of your balance. You are welcome to dispute the details, but frankly, if you don’t trust your bank, then you should move your money elsewhere.24

For some types of information, however, there might not be any single authority we can trust. Or we might not want such a central point of trust to exist. An example is the Bitcoin digital-currency scheme.25 Its main purpose is to emulate the perceived freedoms of cash. These include not needing a relationship with a bank, and the relative anonymity of conducting transactions. Digital cash can be facilitated by a single central bank, but all users need to trust this bank.26 The alternative, which Bitcoin uses, is to simulate the role of a bank without actually having one.

What do banks do anyway? In terms of currency management, the most important role of a bank is to serve as a trusted witness of transactions into and out of your account. In doing so, your bank acts as the definitive source of truth regarding the integrity of your account balance. It’s not easy to be a trusted bank; the bank has to work hard to establish the necessary authority to have such trust placed in it. To accomplish this status, a bank must engage in multiple interrelated activities, including managing the bank brand, adhering to financial regulations, subjecting itself to financial audit, managing personnel, and using numerous physical and cybersecurity mechanisms (banks are avid users of cryptography).27 All this effort ultimately protects the financial information that the bank oversees. You can think of this information as being represented by a centralized ledger, containing the accounts of the financial data of all the customers for whom the bank is responsible.

If we cannot have a bank witnessing transactions, then who will do it? The answer is “everyone.” The idea of a distributed ledger is to do away with the need for an official centralized version of all financial transactions (well, of anything really, but let’s stick to banks for now) and replace it with an entirely open ledger that everyone keeps a copy of. In other words, you don’t need a bank, because you, and everyone else who has money, are the bank.

At first, this seems a fairly mind-blowing idea. Every Bitcoin user maintains their own version of the ledger of all transactions, which represents the true state of Bitcoin finances. While a distributed ledger is simple to propose, the practical challenge is clear. This concept will fly only if everyone agrees on the ledger’s contents.

Clearly, it’s not possible for every Bitcoin user to sit down each night with the entire day’s Bitcoin receipts (accompanied by a large glass of wine) and check the validity of each transaction in order to establish which bitcoins have gone where. Fortunately, computers are more efficient at this type of task. It remains, however, a formidable challenge to develop and manage an agreed-on, but distributed, version of the Bitcoin ledger. The solution that Bitcoin deploys is ingenious, and it’s built almost entirely from cryptography.

The Bitcoin Blockchain

Bitcoin uses the idea of a blockchain to facilitate a distributed ledger. It’s worth emphasizing that although distributed ledgers and blockchains are often treated as being synonymous, simply because of the high profile of Bitcoin, which uses the latter to enable the former, a distributed ledger does not necessarily have to be based on the use of a blockchain.

The users of Bitcoin form the Bitcoin network. A Bitcoin user can maintain as many Bitcoin “accounts” as they like. Each account is identified by a Bitcoin address, which is derived from a cryptographic public key that can be used to verify digital signatures. Importantly, while a Bitcoin address is unique to its owner, it does not explicitly identify the owner. This masking provides Bitcoin’s relative anonymity. A Bitcoin transaction consists of a digitally signed statement (cryptographically signed with the private signature key of the payer) that a certain amount of bitcoin should be transferred from the payer’s Bitcoin address to the payee’s Bitcoin address.

Each time a Bitcoin transaction is conducted, the details are made available to everyone else on the Bitcoin network. You can think of Bitcoin, therefore, as a whole bunch of individual transaction statements that are all flying around the Bitcoin network. Since new transactions are continuously being made, at a rate of several every second, the challenge is to manage all this information in a sufficiently organized manner that every user can agree on what has been happening.

A block is a collection of Bitcoin transactions (roughly the equivalent of ten minutes’ worth of Bitcoin payments). Whenever a new block is formed and approved, it is “glued” to the previous blocks, thus forming an ever-growing chain of blocks. This growing collection of blocks, bound to one another, forms the Bitcoin ledger that everyone has to agree on. Since each block consists of data, digital glue is required to join the blocks. Digital glue? With luck, you may recall that binding data together is one of the many important uses of a hash function, the great Swiss Army knife of cryptography.
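The "digital glue" can be sketched in a few lines of Python. This is a simplified illustration rather than Bitcoin's actual block format: the dictionary layout, the JSON encoding, and the function names are all assumptions, chosen only to show how storing the hash of the previous block inside each new block binds the chain together.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    """Hash a block by hashing a canonical text encoding of its contents."""
    encoded = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


def append_block(chain: list, transactions: list) -> None:
    """Glue a new block onto the chain by recording the previous block's hash."""
    previous = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"previous_hash": previous, "transactions": transactions})


def chain_is_consistent(chain: list) -> bool:
    """Check that every block still points at the hash of its predecessor."""
    return all(
        chain[i]["previous_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


ledger = []
append_block(ledger, ["alice pays bob 5"])
append_block(ledger, ["bob pays carol 2"])
append_block(ledger, ["carol pays dave 1"])
consistent_before = chain_is_consistent(ledger)

# Quietly rewriting history in the first block breaks the glue.
ledger[0]["transactions"] = ["alice pays mallory 5"]
consistent_after = chain_is_consistent(ledger)
```

Changing an old block changes its hash, so the next block no longer points at it; repairing that would mean rebuilding every block that follows.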

It would be anarchic if every user in the Bitcoin network were constantly forming new blocks and trying to bind them simultaneously to the blockchain. How would a single agreed-on version of the blockchain ever emerge? The cunning solution to this problem is to make the forming of a new block rather difficult, but not impossible. The effect is that new block creation is slowed down to approximately one block every ten minutes.28 This pace is fast enough to ensure that transactions find their way into the Bitcoin ledger fairly soon after they are conducted. However, it is also slow enough that a new block has time to propagate through the Bitcoin network and become accepted by the majority of Bitcoin users before the next block is created.

The process of creating a new block, which lies at the heart of Bitcoin, is called mining. This term reflects the fact that forming a new block takes considerable effort. The mining task is to collect some floating Bitcoin transactions that are not yet in a block of the current blockchain, verify that they’re in the right format, and then cryptographically bind them together. As part of this process, the miner must attach some data known as a header to the start of the new block. This header includes an indication of which block the miner believes is currently at the end of the blockchain (and to which this new block should be attached), as well as a cryptographic “summary” of all the transactions in the new block. However, the header also must include something else, and this something else is what makes new blocks so hard to mine.

Recall that hash functions are cryptographic “juicers,” which compress data that is input into a smaller number (a hash). If you hash some data, then make even a tiny change to the data, the hash of the modified data will have no apparent relationship to the hash of the original data. In other words, the hash of some data appears to be randomly generated. Therefore, if you want to find some data with a particular hash value, the only thing you can do is keep trying to hash different things until you get lucky.
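This behavior is easy to observe directly. The sketch below hashes two messages that differ in a single character and counts how many of the 256 output bits disagree; the messages and function names are assumptions made up for the demonstration, and SHA-256 is used as a representative hash function.

```python
import hashlib


def sha256_as_number(data: bytes) -> int:
    """Return the SHA-256 hash of the data as one large integer."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which the two hashes disagree."""
    return bin(sha256_as_number(a) ^ sha256_as_number(b)).count("1")


original = b"Pay Bob 10 bitcoin"
modified = b"Pay Bob 11 bitcoin"
changed = differing_bits(original, modified)
```

If the hash really does behave as though it were randomly generated, the count should land somewhere near half of the 256 bits, with no visible pattern connecting the two outputs.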

This is precisely the challenge presented to a bitcoin miner. The miner must include a randomly generated number in the block header that results in the hash of the entire block header having a particular property. As soon as the miner has gathered a sufficient number of transactions to form a block, they have to try out different random numbers in the hope that one of them will result in the header of the new block having the right hash. This is a somewhat frantic process, because all around the Bitcoin network, rival miners are also attempting to form a new block. The first to do so will “win.” But what’s the prize?

Nobody sane would expend the considerable resources required to mine a new block just for the fun of it. Mining involves not just trying out the odd random number or two, but trying out millions and millions of them. So many, in fact, that you must have considerable computing power to be a successful bitcoin miner.29 A miner who successfully creates a new block is thus paid a financial reward—in bitcoin, of course.
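A toy version of the miner's search shows why the work is expensive to do yet cheap to check. This is a sketch under stated assumptions, not Bitcoin's real procedure: the header layout and function names are invented, the "particular property" is taken to be a hash that begins with a chosen number of zero digits, and the candidate numbers are tried in order rather than at random so that the example behaves the same way every time it is run.

```python
import hashlib


def header_hash(previous_hash: str, summary: str, nonce: int) -> str:
    """Hash a simplified header: previous block, transaction summary, nonce."""
    header = f"{previous_hash}|{summary}|{nonce}".encode()
    return hashlib.sha256(header).hexdigest()


def mine(previous_hash: str, summary: str, zero_digits: int) -> int:
    """Try number after number until the header hash starts with enough zeros."""
    target_prefix = "0" * zero_digits
    nonce = 0
    while not header_hash(previous_hash, summary, nonce).startswith(target_prefix):
        nonce += 1
    return nonce


# Hypothetical inputs standing in for the end of the chain and a new block.
previous_hash = "0" * 64
summary = hashlib.sha256(b"alice pays bob 1; bob pays carol 2").hexdigest()

nonce = mine(previous_hash, summary, zero_digits=4)
winning_hash = header_hash(previous_hash, summary, nonce)
```

Each extra zero digit demanded multiplies the expected number of attempts by sixteen, whereas anyone can confirm a claimed solution by computing a single hash.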

Once a new block is formed, the users of the Bitcoin network are all notified, and each user adds the block to the version of the blockchain that they currently believe to be correct. Each user is able to check the validity of this new block and can be reasonably sure that their version of the blockchain is the same as everyone else’s. They are only reasonably sure, though, because it’s possible for two different blocks to be found at approximately the same time by different users of the network. In this case, two different versions of the blockchain will be forming, each extended by a different block.30

This problem is unavoidable, but resolvable. As soon as the next block is found, one of these two versions of the blockchain will become further extended. Whenever a user of Bitcoin encounters two possible different versions of the blockchain, and one is longer than the other, the user rejects the shorter one. In practice, every Bitcoin user can be almost certain that within half an hour of being conducted, most transactions are included in a universally accepted version of the blockchain (only the very end of the blockchain might vary, and any differences will soon be sorted out).
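The rule for settling such disagreements is simple enough to write down. The sketch below follows the description above and compares chains by length alone; the function name and the sample chains are assumptions, and real Bitcoin software judges competing chains by the total mining work they represent, which in the simplest case favors the longer chain.

```python
def choose_chain(current: list, candidate: list) -> list:
    """Keep the current chain unless the candidate is strictly longer."""
    return candidate if len(candidate) > len(current) else current


# Two users briefly disagree about the most recent block.
shared_history = ["block 1", "block 2", "block 3"]
version_a = shared_history + ["block 4a"]
version_b = shared_history + ["block 4b"]

# While the two versions are the same length, each user keeps their own.
still_a = choose_chain(version_a, version_b)

# The next block happens to extend version B, so everyone switches to it.
version_b_extended = version_b + ["block 5"]
agreed = choose_chain(version_a, version_b_extended)
```

Once one version pulls ahead, every user applying the same rule converges on it, and the transactions in the abandoned block go back to waiting for inclusion.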

Blockchain This, Blockchain That

Bitcoin is a wonderful cryptographic construct. An account is associated with a cryptographic key, transactions are digitally signed statements, formation of new blocks requires the solving of a cryptographic challenge, and the Bitcoin blockchain is cryptographically bound together by hash functions. This is why Bitcoin, alongside the hundreds of other similar digital-cash technologies now in circulation, is often described as a cryptocurrency.31 However, it’s important to remember why we are discussing Bitcoin. The Bitcoin blockchain is, first and foremost, a security mechanism for providing data integrity—in this particular case, the integrity of Bitcoin transactions.

The Bitcoin blockchain is not without its flaws. For one thing, the amount of computer time and energy required to form new Bitcoin blocks has raised serious questions about whether bitcoin is an environmentally sustainable currency. From time to time, the cost of mining Bitcoin blocks exceeds the value of the new currency generated. As noted earlier, however, using a blockchain the way Bitcoin does is not the only way a distributed ledger can be instantiated.

Distributed ledgers have much wider potential applications than just to digital cash. They could, at least in theory, be used to protect any data that is not confidential but requires absolute integrity. Instead of Bitcoin transactions, the data protected in a distributed ledger could concern any form of formal record keeping, such as legal contracts, supply-chain details, or government registers.

As we have seen, distributed ledgers have the distinct advantage of providing data integrity without requiring a centralized point of trust. However, we should be cautious about rushing to place everything in a blockchain, or any alternative form of distributed ledger. The architecture of blockchains and distributed ledgers is very different from the traditional architecture, in which integrity is provided by the data being secured in protected, centralized databases. Distributed ledgers represent a significant change to the way most data is protected today. While distributed ledgers are a fascinating idea, the purposes (beyond Bitcoin) for which they ultimately prove effective as data integrity mechanisms remain to be seen.

The Integrity of Integrity

It’s easy to underestimate how important integrity is to the functioning of our daily lives. In the physical world, integrity is often provided implicitly. In cyberspace, however—a place where data is relatively simple to manipulate—the explicit provision of data integrity is paramount.

Data integrity mechanisms cannot stop data from being altered, but they can warn us when such modification has occurred. Which data integrity mechanism you choose depends on what you realistically imagine could go wrong. Friendly environments, such as the book cataloguing of a public library, require only lightweight data integrity mechanisms. Hostile war zones (like the internet!) require strong data integrity mechanisms such as MACs and digital signatures. If you have no single place you can go to establish the trust necessary to provide data integrity, then you can consider deploying a distributed ledger.

Data integrity mechanisms work. For this reason, as long as appropriate data integrity mechanisms are used, criminals do not tend to manipulate the value of a bank transfer as you make it, or change the wording of an email as you download it from your webmail provider, or remove previous transactions from the Bitcoin blockchain. They don’t do these things because they can’t do these things, at least without being caught out.

However, data integrity mechanisms can only tell you that data has not been altered since the point at which it was created by . . . a MAC key holder or a signature key owner or the possessor of a Bitcoin address private key—whoever they are.

Good cybercriminals don’t waste their time trying to manipulate integrity-protected data. A far better strategy is to appear to be someone they’re not. If you are fooled regarding the identity of someone you’re talking to in cyberspace, then the integrity of any data they subsequently send you is worth very little. As valueless as a bitcoin, perhaps, when all you want to know is heads or tails.