7 Biology Meets Computation

If the twentieth century was the century of physics, the twenty-first may well become the century of biology. One may argue that biology is a science with an already long and rich history and that the twenty-first century will be a century of many technologies. Still, the parallel has its merits. Biology will certainly be one of the defining technologies of the twenty-first century, and, as happened with physics in the preceding century, exciting advances took place in biology in the first years of the new century. Much as physics changed in 1905 with the publication of Einstein’s influential articles, biology changed in 2001 with the publication, by two competing teams, of two drafts of the whole sequence of the human genome.

Sequencing the Genome

Although DNA’s structure and its role in genetics were known from the seminal work of Watson and Crick in the early 1950s, for about twenty years there was no technique that could be used to decode the sequence of bases in the genome of any living being. However, it was clear from the beginning that decoding the human genome would lead to new methods that could be used to diagnose and treat hereditary diseases, and to address many other conditions that depend on the genetic makeup of each individual.

The sequencing of the human genome resulted from the research effort developed in the decades that followed the discovery of DNA’s structure. The interest in determining the sequence of the human genome ultimately led to the creation of the Human Genome Project, headed by James Watson, in 1990. That project’s main goals were to determine the sequence of chemical base pairs that make up human DNA and to identify and map the tens of thousands of genes in the human genome. Originally expected to cost $3 billion and to take 15 years, the project was to be carried out by an international consortium that included geneticists in the United States, China, France, Germany, Japan, and the United Kingdom.

In 1998, a similar, but privately funded, project was launched by J. Craig Venter and his firm, Celera Genomics. The effort was intended to proceed at a much faster pace and at only a fraction of the cost of the Human Genome Project. The two projects used different approaches and reached the goal of obtaining a first draft of the human genome almost simultaneously in 2001.

The exponentially accelerating developments in sequencing technology that led to the sequencing of the human genome didn’t stop in 2001, and they are unparalleled in any other field, even in the explosively growing field of digital circuit design. The rapidly advancing pace of sequencing technologies led to an explosion of genome-sequencing projects, and the early years of the twenty-first century saw a very rapid rise in the number of species with completely sequenced genomes.

The technological developments of sequencing technologies have outpaced even the very rapid advances of microelectronics. When the Human Genome Project started, the average cost for sequencing DNA was $10 per base pair. By 2000 the cost had fallen to one cent per base pair, and the improvements have continued at an exponential pace ever since. (See figure 7.1.) In 2015 the raw cost stood at roughly 1/100,000 of a cent per base pair, and it continues to fall, though more slowly than before (Wetterstrand 2015). The budget for the Human Genome Project was $3 billion (approximately $1 per base pair). In 2007, Watson’s own genome was sequenced using a new technology, at a cost of $2 million. This may seem expensive, but it was less than a thousandth as expensive as the original Human Genome Project. As of this writing, the cost of sequencing a human genome is approaching the long-sought target of $1,000. The cost per DNA base pair has fallen by a factor of 200,000 in 15 years, which is a reduction by a factor of more than 2 every year, vastly outpacing the reduction in costs that results from Moore’s Law, which has reduced costs by a factor of 2 every two years.


Figure 7.1 A graph comparing sequencing costs with Moore’s Law, based on data compiled by the National Human Genome Research Institute. Values are in US dollars per DNA base.
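The annualized improvement factors quoted above can be checked with a few lines of Python. The inputs are the numbers in the text: a 200,000-fold cost reduction over 15 years for sequencing, versus Moore’s Law’s factor of 2 every two years.

```python
# Annual cost-reduction factor implied by a 200,000-fold drop over 15 years,
# compared with Moore's Law (a factor of 2 every two years).
sequencing_annual = 200_000 ** (1 / 15)
moore_annual = 2 ** (1 / 2)
print(round(sequencing_annual, 2))  # about 2.26: more than a factor of 2 per year
print(round(moore_annual, 2))       # about 1.41: Moore's Law per-year factor
```

The sequencing factor of roughly 2.26 per year is what the text means by "a reduction by a factor of more than 2 every year."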

Sequencing technology had a slow start in the years that followed the discovery of DNA, despite the growing interest in methods that could be used to determine the specific sequence of bases in a given strand of DNA. About twenty years passed between the discovery of the DNA double helix and the invention of the first practical methods for determining the actual bases in a DNA sequence. Since it is not possible to observe the sequence of bases in a DNA chain directly, it is necessary to resort to indirect methods. The most practical of these first methods was the chain-terminator method (Sanger, Nicklen, and Coulson 1977), still known today as Sanger sequencing. The principle behind that method is the use of specific molecules that terminate the growth of a chain of DNA in a bath of deoxynucleotides, the essential constituents of DNA. The resulting DNA fragments are then separated by size, which makes it possible to read the sequence of base pairs.

The method starts by obtaining single-stranded DNA through a process called denaturation, in which DNA is heated until the double-stranded helix separates into two strands. The single-stranded DNA is then used as a template to create new double-stranded DNA, obtained by adding, in four physically separate reactions, the four standard DNA deoxynucleotides, A, C, T, and G. Together with the standard deoxynucleotides, each reaction also contains a solution of one modified deoxynucleotide, a dideoxynucleotide that stops the growth of the double-stranded DNA when it is incorporated. Each of these chain terminators attaches to one specific base (A, C, T, or G) on the single-stranded DNA and stops the DNA double chain from growing further. DNA fragments of different sizes can then be separated by making them flow through a gel in which fragments of different sizes move at different speeds. From the positions of the fragments in the gel, the specific sequence of bases in the original DNA fragment can be read.
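The read-out step, in which the base sequence is recovered from the lengths of the terminated fragments, can be illustrated with a deliberately simplified sketch. It ignores the chemistry entirely and reads the template strand directly rather than its complement; the point is only to show why sorting fragments by length reveals the sequence.

```python
def sanger_read(template):
    """Toy model of Sanger read-out: in each of four reactions, chains
    terminate at every occurrence of one base; pooling all fragments and
    sorting them by length recovers the original sequence."""
    fragments = []  # (fragment length, terminating base)
    for base in "ACGT":                    # four separate reactions
        for pos, b in enumerate(template):
            if b == base:                  # chain stops just after this base
                fragments.append((pos + 1, base))
    # The shortest fragment ends at position 1, the next at position 2, ...
    return "".join(base for _, base in sorted(fragments))

print(sanger_read("ATGCGTA"))  # ATGCGTA
```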

The Sanger sequencing technique enabled researchers to deal with small DNA sequences and was first applied to the study of small biological sub-systems. In-depth study of specific genes or other small DNA sequences led to an increasing understanding of cellular phenomena.

For a number of decades, the ability to sequence the complete genome of any organism was beyond the reach of existing technologies. As the technology evolved, sequencing of longer and longer sequences led to the first sequencing of a complete gene in 1972 and to the first complete sequencing of an organism (the 3,569-base DNA sequence of the bacteriophage MS2) in 1976 (Fiers et al. 1976). This was, however, still very far from sequencing any complex organism. (Complex organisms have billions of bases in their genomes.)

The first realistic proposal for sequencing the human genome appeared in 1985. However, not until 1988 did it become clear that such an ambitious objective was within reach. The selection of model organisms with increasing degrees of complexity enabled not only the progressive development of the technologies necessary for the sequencing of the human genome, but also the progressive development of research in genomics. The model organisms selected to be sequenced included Haemophilus influenzae (with 1.8 megabases, or Mb), in 1995, Escherichia coli (with 5 Mb), in 1996, Saccharomyces cerevisiae (baker’s yeast, with 12 Mb), in 1996, Caenorhabditis elegans (a roundworm, with 100 Mb), in 1998, Arabidopsis thaliana (a small flowering plant, with 125 Mb), in 2000, and Drosophila melanogaster (the fruit fly, with 200 Mb), in 2000. The final steps of the process resulted in the simultaneous publication, in 2001, of two papers on the human genome in the journals Science and Nature (Lander et al. 2001; Venter et al. 2001). The methods that had been used in the two studies were different but not entirely independent.

The approach used by the Human Genome Project to sequence the human genome was originally the same as the one used for the S. cerevisiae and C. elegans genomes. The technology available at the time the project started was the BAC-to-BAC method. A bacterial artificial chromosome (BAC) is a human-made DNA sequence used for transforming and cloning DNA sequences in bacteria, usually E. coli. The size of a BAC is usually between 150 and 350 kilobases (kb).

The BAC-to-BAC method starts by creating a rough physical map of the genome. Constructing this map requires segmenting the chromosomes into large pieces and figuring out the order of these big segments of DNA before sequencing the individual fragments. To achieve this, several copies of the genome are cut, in random places, into pieces about 150 kb long. Each of these fragments is inserted into a BAC. The collection of BACs containing the pieces of the human genome is called a BAC library. Segments of these pieces are “fingerprinted” to give each of them a unique identification tag that can be used to determine the order of the fragments. This is done by cutting each BAC fragment with a single enzyme and finding common sequences in overlapping fragments that determine the location of each BAC along the chromosome. The BACs overlap and have these markers every 100,000 bases or so, and they can therefore be used to create a physical map of each chromosome. Each BAC is then sliced randomly into smaller pieces (between 1 and 2 kb) and sequenced, typically about 500 base pairs from each end of the fragment. The millions of sequences that result from this process are then assembled (using an algorithm that looks for overlapping sequences) and placed in the correct position in the genome (using the physical map generated earlier).

The “shotgun” sequencing method, proposed and used by Celera, avoids the need for a physical map. Copies of the genome are randomly sliced into pieces between 2 and 10 kb long. Each of these fragments is inserted into a plasmid (a piece of DNA that can replicate in bacteria). The so-called plasmid libraries are then sequenced, roughly 500 base pairs from each end, to obtain the sequences of millions of fragments. Putting all these sequences together required the use of sophisticated computer algorithms.
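The computational core of shotgun assembly is the detection and merging of overlapping fragments. A minimal greedy sketch, not the actual Celera assembler (which was vastly more sophisticated), repeatedly merges the two fragments with the longest suffix-prefix overlap:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a matching a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (0, 0, 1)        # (overlap length, index of a, index of b)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:              # no overlaps left: just concatenate
            frags = ["".join(frags)]
        else:
            merged = frags[i] + frags[j][n:]
            frags = [f for k, f in enumerate(frags) if k not in (i, j)]
            frags.append(merged)
    return frags[0]

reads = ["GGCTGAC", "CTGACAT", "ACATTGA"]
print(greedy_assemble(reads))  # GGCTGACATTGA
```

Real assemblers must cope with sequencing errors, repeats, and millions of fragments, which is why the naive quadratic search above gives way to far more elaborate algorithms.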

Further developments in sequencing led to the many technologies that exist today, most of them based on different biochemical principles. Present-day sequencers generate millions of DNA reads in each run. A run takes two or three days and generates a volume of data that now approaches a terabyte (10¹² bytes)—more than the total amount of data generated in the course of the entire Human Genome Project.

Each so-called read contains the sequence of a small section of the DNA being analyzed. These sections range in size from a few dozen base pairs to several hundred, depending on the technology. Competing technologies are now on the market (Schuster 2007; Metzker 2010), the most common being pyrosequencing (Roche 454), sequencing by ligation (SOLiD), sequencing by synthesis (Illumina), and Ion Torrent. Sanger sequencing still remains the technology that provides the longest reads and the fewest errors per position, but its throughput is much lower than that of next-generation sequencers, and thus it is much more expensive to use.

Computer scientists became interested in algorithms for the manipulation of biological sequences when it was recognized that, in view of the number and the lengths of DNA sequences, it is far from easy to assemble genomes from reads or even to find similar sequences in the middle of large volumes of data.

Because all living beings share common ancestors, there are many similarities in the DNA of different organisms. Being able to look for these similarities is very useful, because the role of a particular gene in one organism is likely to be similar to the role of a similar gene in another organism. The development of algorithms with which biological sequences (particularly DNA sequences) could be processed efficiently led to the appearance of a new scientific field, called bioinformatics. Especially useful for sequencing the human genome were algorithms that could manipulate very long sequences of symbols, also known as strings. Although manipulating small strings is relatively easy for computers, the task becomes harder when the strings have millions of symbols, as DNA sequences do. Efficient string manipulation thus became one of the most important topics in computer science, with applications both in bioinformatics and in Web technology (Baeza-Yates and Ribeiro-Neto 1999; Gusfield 1997).

A particularly successful tool for bioinformatic analysis was BLAST, a program that looks for exact or approximate matches between biological sequences (Altschul et al. 1990). BLAST is still widely used to search databases for sequences that closely match DNA sequences or protein sequences of interest. Researchers in bioinformatics developed many other algorithms for the manipulation of DNA strings, some of them designed specially to perform the assembly of the millions of fragments generated by the sequencers into the sequence of each chromosome (Myers et al. 2000).
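The seed-and-verify idea behind tools such as BLAST can be sketched in a few lines: index the positions of every short k-mer once, then use a pattern’s first k-mer to jump straight to candidate positions instead of scanning the whole string. BLAST’s real algorithm, with approximate matching and statistical scoring, is far more elaborate; this sketch shows only why indexing makes search over long strings fast.

```python
from collections import defaultdict

def build_index(text, k):
    """Record every position of every k-mer, so later lookups need no scan."""
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i)
    return index

def find(pattern, text, index, k):
    """Seed with the pattern's first k-mer, then verify the full match."""
    return [i for i in index[pattern[:k]]
            if text[i:i + len(pattern)] == pattern]

genome = "ACGTACGTGACG" * 1000   # stand-in for a long DNA string
idx = build_index(genome, 4)
print(find("ACGTGACG", genome, idx, 4)[:3])  # [4, 16, 28]
```

Building the index costs time up front, but each subsequent query touches only the handful of positions where the seed occurs.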

Even though BAC-to-BAC sequencing and shotgun sequencing look similar, both involving assembly of small sequence fragments into the complete sequences of the chromosomes, there is a profound difference between the two approaches. In BAC-to-BAC sequencing, the physical map is known, and finding the sequence that corresponds to the fragments is relatively easy because we know approximately where each segment belongs. It’s a bit like solving a puzzle in which the approximate location of each piece is known, perhaps because of its color. However, in the case of the human genome, creating the BAC library and deriving the physical map was a long and complex process that took many years to perfect and to carry out. In shotgun sequencing this process is much faster and less demanding, since the long DNA sequences are immediately broken into smaller pieces and no physical map is constructed. On the other hand, assembling millions of fragments without knowing even approximately where they belong in the genome is a much harder computational problem, much like solving a puzzle all of whose pieces are the same color. Significant advances in algorithms and in computer technology were necessary before researchers were able to develop programs that could perform this job efficiently enough.

One algorithm that has been used to construct DNA sequences from DNA reads is based on the construction of a graph (called a de Bruijn graph) from the reads. In such a graph, each node corresponds to one specific segment of k bases, and two nodes are connected by an edge if there is a read of k + 1 bases formed by the overlap of the two k-base segments.

For example, suppose you are given the following set of four-base reads: TAGG, AGGC, AGGA, GGAA, GAAT, GCGA, GGCG, CGAG, AATA, GAGG, ATAG, TAGT. What is the shortest DNA sequence that could have generated all these reads by being cut in many different places? The problem can be solved by drawing the graph shown in figure 7.2, in which each node corresponds to a triplet of bases and each edge corresponds to one read. Finding the shortest DNA sequence that contains all these reads is then equivalent to finding a Eulerian path in this directed graph—that is, a path in the graph starting in one node and going through each edge exactly once (Pevzner, Tang, and Waterman 2001). As we saw in chapter 4, for undirected graphs such a path exists if, and only if, exactly two vertices have an odd degree. For directed graphs, Euler’s theorem must be changed slightly, to take into account the fact that edges can be traversed in only one direction. A directed graph has a Eulerian path if every vertex has the same in degree and out degree, except for the starting and ending vertices (the starting vertex must have an out degree equal to the in degree plus 1, and the ending vertex an in degree equal to the out degree plus 1). In this graph, all vertices have in degree equal to out degree except for vertices TAG and AGT, which satisfy the conditions for the starting and ending vertices, respectively. The Eulerian path therefore starts in node TAG, goes through node AGG and around the nodes in the topmost loop, returns to node AGG, and finishes in node AGT after going through the nodes in the bottom loop. This path gives TAGGCGAGGAATAGT, the shortest DNA sequence containing all the given reads.


Figure 7.2 DNA sequencing from reads, using Eulerian paths. The reads on the right are used to label the edges of the graph. Two nodes are connected if their labels are contained in the read that labels the edge connecting them.
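The reasoning above can be carried out mechanically. A short Python sketch builds the de Bruijn graph from the twelve reads and finds the Eulerian path with Hierholzer’s algorithm:

```python
from collections import defaultdict

def eulerian_path_sequence(reads, k=3):
    """Assemble the shortest sequence containing all (k+1)-base reads by
    finding a Eulerian path in the de Bruijn graph of k-mers."""
    graph = defaultdict(list)    # node -> list of successor nodes
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for read in reads:           # each read of k + 1 bases is one edge
        u, v = read[:k], read[-k:]
        graph[u].append(v)
        outdeg[u] += 1
        indeg[v] += 1
    # Start at the vertex whose out degree exceeds its in degree by 1, if any
    start = next((n for n in graph if outdeg[n] - indeg[n] == 1),
                 next(iter(graph)))
    # Hierholzer's algorithm: walk unused edges, backtracking when stuck
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    # Spell out the sequence: first k-mer, then the last base of each node
    return path[0] + "".join(node[-1] for node in path[1:])

reads = ["TAGG", "AGGC", "AGGA", "GGAA", "GAAT", "GCGA",
         "GGCG", "CGAG", "AATA", "GAGG", "ATAG", "TAGT"]
print(eulerian_path_sequence(reads))  # TAGGCGAGGAATAGT
```

The answer matches the path traced in the text: TAG through the top loop, back to AGG, through the bottom loop, and ending in AGT.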

It is also possible to formulate this problem as finding a Hamiltonian path in a graph in which each read is represented by a node and the overlap between reads is represented by an edge (Compeau, Pevzner, and Tesler 2011). That approach, which was indeed used in the Human Genome Project (Lander et al. 2001; Venter et al. 2001), is computationally much more intensive and less efficient, because the Hamiltonian path problem is an NP-complete problem. In practice, assembling a genome using de Bruijn graphs is more complex than it appears to be in figure 7.2 because of errors in the sequences, missing sequences, highly repetitive DNA regions, and several other factors.

The race to sequence the human genome, which ended in a near tie, made it clear that computers and algorithms were highly useful in the study of biology. In fact, assembling the human genome wouldn’t have been possible without the use of these algorithms, and they became the most important component of the technologies that were developed for the purpose. The field of bioinformatics was born with these projects. Since then it has become a large enterprise involving many research groups around the world.

Today, many thousands of researchers and professionals work in bioinformatics, applying computers to problems in biology and medicine. Bioinformatics (also called computational biology) is now a vast field with many sub-fields. In fact, manipulating DNA sequences is only one of the many tasks routinely performed by computers, helping us along the long path that will lead to a more complete understanding of biological systems. Researchers and professionals in bioinformatics are a diverse lot and apply their skills to many different challenges, including the modeling and simulation of many types of biological networks. These networks—abstract models that are used to study the behaviors of biological systems—are essential tools in our quest to understand nature.

Biological Networks

A multi-cellular organism consists of millions of millions of cells. The human body contains more than 30 trillion cells (Bianconi et al. 2013), a number so large as to defy the imagination.

Although DNA contains the genetic information of an organism, there are many more components inside a living cell. Recently developed technologies enable researchers to obtain vast amounts of data related to the concentrations of various chemicals inside cells. These measurements are critically important for researchers who want to understand cellular behavior, in particular, and biological systems, in general. The advances in genomics have been paralleled by advances in other areas of biotechnology, and the amounts of data now being collected by sequencing centers and biological research institutes are growing at an ever-increasing rate.

To make sense of these tremendous amounts of data, computer models of biological processes have been used to study the biological systems that provided the data. Although there are many types of biological models, the most widespread and useful models are biological networks. In general, it is useful to view complex biological systems as a set of networks in which the various components of a system interact. Networks are usually represented as graphs in which each node is a component in the network and the edges between the nodes represent interactions between components.

Many types of biological networks have been used to model biological systems, and each type has many sub-types. Here I will use a small number of representative types of biological networks to illustrate how such models are used to study, simulate, and understand biological systems.

In general, nodes of biological networks need not correspond to actual physically separate locations. The network represents, in this case, a useful abstraction, but there is no one-to-one correspondence between a node in a network and a physical location, as there is in a road network or in an electrical network. Of the four types of networks I will discuss, only in neuronal networks is there direct correspondence between a node in the network and a specific location in space, and even in that case the correspondence isn’t perfect.

The four types of biological networks I will describe here are metabolic networks, protein-protein interaction networks, gene regulatory networks, and neuronal networks. I will use the term neuronal networks to refer to networks that model biological neurons, in order to distinguish them from neural networks (which were discussed in chapter 5 as a machine learning technique). Even though other types of biological networks are used to model biological systems, these four types constitute a large fraction of the models used to study the behavior of cells, systems, and even whole organisms.

Metabolic networks are used to model the interactions among chemical compounds of a living cell, which are connected by biochemical reactions that convert one set of compounds into another. Metabolic networks are commonly used to model the dynamics of the concentrations of various chemical compounds inside a cell. These networks describe the structure and the dynamics of the chemical reactions of metabolism (also known as the metabolic pathways), as well as the mechanisms used to regulate the interactions between compounds. The reactions are catalyzed by enzymes, which are selective catalysts, accelerating the rates of the metabolic reactions that are involved in most biological processes, from the digestion of nutrients to the synthesis of DNA. Although there are other types of enzymes, most enzymes are proteins and are therefore encoded in the DNA. When the complete genome of an organism is known, it is possible to reconstruct partially or entirely the network of biochemical reactions in a cell, and to derive mathematical expressions that describe the dynamics of the concentration of each compound.
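As a minimal illustration of such dynamic expressions, consider a single enzymatic reaction converting a substrate S into a product P under Michaelis-Menten kinetics, simulated here by simple forward-Euler integration. The Vmax and Km values are arbitrary, chosen only for illustration; a real metabolic network couples hundreds of such equations.

```python
def simulate_mm(s0, vmax=1.0, km=0.5, dt=0.001, t_end=10.0):
    """Forward-Euler simulation of S -> P with Michaelis-Menten kinetics:
    dS/dt = -vmax * S / (km + S),  dP/dt = +vmax * S / (km + S)."""
    s, p = s0, 0.0
    for _ in range(int(t_end / dt)):
        rate = vmax * s / (km + s)   # reaction rate at the current instant
        s -= rate * dt               # substrate is consumed...
        p += rate * dt               # ...and product accumulates
    return s, p

s, p = simulate_mm(2.0)
print(round(s + p, 6))  # 2.0 -- total mass is conserved throughout
```

By the end of the run nearly all the substrate has been converted, and the total concentration S + P never changes, as the symmetric update guarantees.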

Protein-protein interaction networks are used very extensively to model the dynamics of a cell or a system of cells. Proteins, encoded by the genes, interact to perform a very large fraction of the functions in a cell. Many essential processes in a cell are carried out by molecular machines built from a number of protein components. The proteins interact as a result of biochemical events, but mainly because of electrostatic forces. In fact, the quaternary structure of proteins, to which I alluded in chapter 6, is the arrangement of several folded proteins into a multi-protein complex. The study of protein-protein interaction networks is fundamental to an understanding of many processes inside a living cell and of the interaction between different cells in an organism.

Gene regulatory networks are used to model the processes that lead to differentiated activity of the genes in different environments. Even though all cells in a multi-cellular organism (except gametes, the cells directly involved in sexual reproduction) have the same DNA, there are many different kinds of cells in a complex organism and many different states for a given cell type. The process of cellular differentiation creates a more specialized cell from a less specialized cell type, usually starting, in adult organisms, from stem cells. Cell differentiation leads to cells with different sizes, shapes, and functions and is controlled mainly by changes in gene expression. In different cell types (and even in different cells of the same type in different states) different genes are expressed, and thus different cells can have very different characteristics, despite sharing the same genome. Even a specific cell can be in many different states, depending on the genes expressed at a given time.

The activity of a gene is controlled by a number of pre-transcription and post-transcription regulatory mechanisms. The simplest mechanism is related to the start of transcription. A gene can be transcribed (generating mRNA that will then be translated into a protein) only if a number of proteins (called transcription factors) are attached to a region near the start of the coding region of the gene: the promoter region. If for some reason the transcription factors don’t attach to the promoter region, the gene is not transcribed and the protein it encodes is not created. Gene regulatory networks, which model this mechanism, are, in fact, DNA-protein interaction networks. Each node in such a network is either a transcription factor (a protein) or a gene (a section of DNA). Gene regulatory networks can be used to study cell behavior, in general, and cell differentiation, in particular. Understanding the behavior and the dynamics of gene regulatory networks is an essential step toward understanding the behavior of complex biological systems.
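In the simplest formulation, a gene regulatory network is modeled as a Boolean network: each gene is either on or off, and a logical rule over its transcription factors determines its next state. The three-gene network below is hypothetical, invented only to show how a feedback loop produces cyclic expression patterns.

```python
def step(state, rules):
    """Synchronously update every gene according to its Boolean rule."""
    return {gene: rule(state) for gene, rule in rules.items()}

# Hypothetical network: A activates B, B activates C, and C represses A.
rules = {
    "A": lambda s: not s["C"],   # A is transcribed unless repressor C is present
    "B": lambda s: s["A"],       # B needs transcription factor A
    "C": lambda s: s["B"],       # C needs transcription factor B
}

state = {"A": True, "B": False, "C": False}
for _ in range(6):
    state = step(state, rules)
    print(state)
# After six updates the network is back in its initial state:
# the negative feedback loop makes expression oscillate with period 6.
```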

Neuronal networks are one type of cell-interaction network used extensively to study brain processes. Although different types of cell-interaction networks are commonly used, the most interesting type is the neuron-to-neuron interaction network, the so-called neuronal network. The nervous system of a complex organism consists of interconnected neural cells, supported by other types of cells. Modeling this network of neurons is a fundamental step toward a more profound understanding of the behavior of nervous systems. Each node in a neuronal network corresponds to one nerve cell, and an edge corresponds to a connection between neurons. The activity level of each neuron and the connections (excitatory or inhibitory) between neurons lead to patterns of neuron excitation in the nervous system that correspond to specific brain behaviors. In the central nervous systems of humans and in those of our close evolutionary relatives, the primates, these patterns of activity lead to intelligent and presumably conscious behavior. The study of neuronal networks is, therefore, of fundamental importance to the understanding of brain behavior.

As we saw in chapter 5, simplified mathematical models of neurons have been used to create artificial neural networks. As I noted, artificial neural networks capture some characteristics of networks of neurons, even though they are not faithful models of biological systems. Networks of these simplified artificial neurons have been used extensively both as models of parts of the brain and as mechanisms to perform complex computational tasks requiring adaptation and learning. However, in general, artificial neural networks are not used to model actual cell behavior in the brain. Modeling the actual behavior of cells in living brains requires more sophisticated models of neurons, which will be described in chapter 8.

Emulating Life

The development of accurate models for biological systems will enable us to simulate in silico (that is, by means of a computer) the behavior of actual biological systems. Simulation of the dynamic behavior of systems is common in many other fields of science. Physicists simulate the behavior of atoms, molecules, mechanical bodies, planets, stars, star systems, and even whole galaxies. Engineers routinely use simulation to predict the behaviors of electronic circuits, engines, chemical reactors, and many other kinds of systems. Weather forecasters simulate the dynamics of the atmosphere, the oceans, and the land to predict, with ever-increasing accuracy, tomorrow’s weather. Economists simulate the behavior of economic systems and, with some limited success, of whole economies.

In computer science the word simulation has a very special meaning—it usually refers to the simulation of one computer by another. In theoretical computer science, if one computing engine can simulate another then they are, in some sense, equivalent. As I noted in chapter 4, in this situation I will use the word emulate instead of simulate. We have already encountered universal Turing machines, which can emulate any other Turing machine, and the idea that a universal Turing machine can emulate any existing computer, if only slowly. In more practical terms, it has become common to use the concept of virtualization, in which one computer, complete with its operating system and its programs, is emulated by a program running on some other computer. This other computer may itself be just an emulation running on yet another computer. At some point this regress must end, of course, and some actual physical computer must exist, but there are no conceptual limits to this virtualization of computation.

The ability to simulate the behavior of a system requires the existence of an abstract computation mechanism, either analog or digital. Although analog simulators have been used in the past (as we saw in chapter 2), and although they continue to be used to some extent, the development of digital computers made it possible to simulate very complex systems flexibly and efficiently. Being able to simulate a complex system is important for a number of reasons, the most important of which are the ability to adjust the time scale, the ability to observe variables of interest, the ability to understand the behavior of the system, and the ability to model systems that don’t exist physically or that are not accessible.

The time scales involved in the behavior of actual physical systems may make it impossible to observe events in those systems, either because the events happen too slowly (as in a collision of two galaxies) or because the events happen too fast (as in molecular dynamics). In a simulation, the time scales can be compressed or expanded in order to make observation of the dynamics of interest possible. A minute-long movie of two galaxies colliding can compress an event that, in reality, would last many hundreds of millions of years. Similarly, the time scale of molecular dynamics can be slowed down by a factor of many millions to enable researchers to understand the dynamic behavior of atoms and molecules in events that happen in nanoseconds.

Simulation also enables scientists and engineers to observe any quantity of interest (e.g., the behavior of a part of a transistor in an integrated circuit)—something that, in many cases, is not possible in the actual physical system. Obviously one can’t measure the temperature inside a star, or even the temperature at the center of the Earth, but simulations of these systems can tell us, with considerable confidence, that the temperature at the center of the Sun is roughly 16 million kelvin and that the temperature at the center of the Earth is approximately 6,000 kelvin. In a simulator that is detailed enough, any variable of interest can be modeled and its value monitored.

The ability to adjust the time scales and to monitor any variables internal to the system gives the modeler a unique insight into the behavior of the system, ultimately leading to the possibility of a deep understanding that cannot otherwise be obtained. By simulating the behavior of groups of atoms that form a protein, researchers can understand why those proteins perform the functions they evolved to perform. This deep understanding is relatively simple to obtain in some cases and more complex (or even impossible) in others, but simulations always provide important insights into the behavior of the systems—insights that otherwise would be impossible to obtain.

The fourth and final reason why the ability to simulate a system is important is that it makes it possible to predict the behavior of systems that have not yet come into existence. When a mechanical engineer simulates the aerodynamic behavior of a new airplane, the simulation is used to inform design decisions that influence the design of the plane before it is built. Similarly, designers of computer chips simulate the chips before they are fabricated. In both cases, simulation saves time and money.

Simulation of biological systems is a comparatively new entrant in the realm of simulation technology. Until recently, our understanding of biological systems in general, and of cells in particular, was so limited that we weren’t able to build useful simulators. Before the last two decades of the twentieth century, the dynamics of cell behavior were so poorly understood that the knowledge needed to construct a working model was simply not available. The understanding of the mechanisms that control the behavior of cells, from the processes that create proteins to the many mechanisms of cell regulation, opened the door to the possibility of detailed simulation of cells and of cell systems.

The dynamics of the behavior of a single cell are very complex, and the detailed simulation of a cell at the atomic level is not yet possible with existing computers. It may, indeed, never be possible or economically feasible. However, this doesn’t mean that very accurate simulation of cells and of cell systems cannot be performed. The keys to the solution are abstraction and hierarchical multi-level simulation, concepts extensively used in the simulation of many complex systems, including, in particular, very-large-scale integrated (VLSI) circuits.

The idea behind abstraction is that, once the behavior of a part of a system is well understood, its behavior can be modeled at a higher level. An abstract model of that part of the system is then obtained, and that model can be used in a simulation, performed at a higher level, leading to hierarchical multi-level simulation, which is very precise and yet very efficient. For instance, it is common to derive an electrical-level model for a single transistor. The transistor itself is composed of many millions or billions of atoms, and detailed simulation, at the atom and electron level, of a small section of the transistor may be needed to characterize the electrical properties of the device. This small section of the transistor is usually simulated at a very detailed physical level. The atoms in the crystalline structure of the semiconductor material are taken into consideration in this simulation, as are the atoms of impurities present in the crystal. The behavior of these structures, and that of the electrons present in the electron clouds of these atoms, which are subject to electric fields, is simulated in great detail, making it possible to characterize the behavior of this small section of the transistor. Once the behavior of a small section of the transistor is well understood and well characterized, an electrical-level model for the complete transistor can be derived and used to model very accurately the behavior of the device, even though it doesn’t model each individual atom or electron. This electrical-level model describes how much current flows through a transistor when a voltage difference is applied at its terminals; it also describes other characteristics of the transistor. It is precise enough to model the behavior of a transistor in a circuit, yet it is many millions of times faster than the simulation at the atomic level that was performed to obtain the model.
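A small sketch can make this concrete. The Python function below implements the classic square-law model of an nMOS transistor, the kind of electrical-level abstraction described above; the parameter values (threshold voltage `v_th` and transconductance factor `k`) are illustrative stand-ins for quantities that would, in practice, be extracted from detailed physical-level simulation:

```python
def drain_current(v_gs, v_ds, v_th=0.7, k=2e-4):
    """Approximate drain current (amperes) of an nMOS transistor
    using the textbook square-law model. v_th and k are the kind
    of parameters extracted from atomic-level characterization."""
    if v_gs <= v_th:
        return 0.0                        # cutoff: no channel formed
    v_ov = v_gs - v_th                    # overdrive voltage
    if v_ds < v_ov:
        # triode (linear) region
        return k * (v_ov * v_ds - v_ds ** 2 / 2)
    # saturation region
    return k * v_ov ** 2 / 2
```

Evaluating this expression takes a handful of arithmetic operations, whereas simulating the same device atom by atom would take hours; that ratio is the whole point of abstraction.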

Thousands or even millions of transistors can then be effectively simulated using these models for the transistors, even if the full simulation at the electron level and at the atom level remains impossible. In practice, complete and detailed simulation of millions (or even billions) of transistors in an integrated circuit is usually not carried out. Systems containing hundreds or thousands of transistors are characterized using these electrical-level models, and more abstract models of those systems are developed and used in simulations performed at a higher level of abstraction. For instance, the arithmetic and logic unit (ALU) that is at the core of a computer consists of a few thousand transistors, but it can be characterized solely in terms of how much time and power it requires to compute a result, depending on the type of operation and the types of operands. Such a model is then used to simulate the behavior of the integrated circuit when the unit is connected to memory and to other units. High-level models for the memory and for the other units are also used, making it practicable to simulate the behavior of billions of transistors, each consisting of billions of atoms. If one considers other parts of the circuit that are also being implicitly simulated, such as the wires and the insulators, this amounts to the complete simulation of a system with many billions of trillions of atoms.
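A unit-level model of this kind can be almost trivially simple. The sketch below characterizes a hypothetical ALU purely by per-operation delay and energy figures; the numbers are invented for illustration, standing in for values that would come from electrical-level characterization of the real circuit:

```python
# Hypothetical per-operation characterization data, of the kind
# that electrical-level simulation of the ALU would provide.
ALU_TABLE = {
    # operation: (delay in nanoseconds, energy in picojoules)
    "add": (0.8, 1.2),
    "mul": (2.5, 6.0),
    "and": (0.4, 0.5),
}

def alu_cost(ops):
    """Total delay and energy of a sequence of ALU operations,
    computed from the abstract model rather than from a
    transistor-level simulation of the thousands of devices
    that make up the unit."""
    delay = sum(ALU_TABLE[op][0] for op in ops)
    energy = sum(ALU_TABLE[op][1] for op in ops)
    return delay, energy
```

A system-level simulator can call such a model billions of times per second, which is what makes whole-chip simulation practicable.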

Techniques for the efficient simulation of very complex biological systems also use this approach, even though they are, as of now, much less well developed than the techniques used in the simulation of integrated circuits. Researchers are working intensively on the development of models for the interaction between molecules in biological systems—especially the interactions among DNA, proteins, and metabolites. Models for the interaction between many different types of molecules can then be used in simulations at higher levels of abstraction. These higher levels of abstraction include the various types of biological networks that were discussed in the preceding section. For instance, simulation of a gene regulatory network can be used to derive the dynamics of the concentration of a particular protein when a cell responds to exposure to a specific chemical compound. The gene regulatory network represents a model at a high level of abstraction that looks only at the concentrations of a set of proteins and other molecules inside a cell. However, that model can be used to simulate the phenomena that occur when a specific transcription factor attaches to the DNA near the region where the gene is encoded. Modeling the attachment between a transcription factor (or a set of transcription factors) and a section of the DNA is a very complex task that may require the extensive simulation of the molecular dynamics involved in the attachment. In fact, the mechanism is so complex that it is not yet well understood, and only incomplete models exist. However, extensive research continues, and good detailed models for this process will eventually emerge.
It will then be practicable to use those detailed models to obtain a high-level model that will relate the concentration of the transcription factors with the level of transcription of the gene, and to use the higher-level models in the gene regulatory network model to simulate the dynamics of the concentrations of all the molecules involved in the process, which will lead to a very efficient and detailed simulation of a system that involves many trillions of atoms.
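A minimal sketch of such a high-level model, using the Hill function commonly employed to relate transcription-factor concentration to transcription rate, might look as follows; all parameter values here are illustrative, standing in for the quantities that detailed molecular models would supply:

```python
def hill(tf, k=1.0, n=2):
    """Hill function: fraction of maximal transcription as a
    function of transcription-factor concentration tf. The
    parameters k (half-saturation constant) and n (cooperativity)
    are the kind of values a detailed molecular model would yield."""
    return tf ** n / (k ** n + tf ** n)

def simulate_protein(tf, beta=1.0, gamma=0.1, dt=0.01, steps=5000):
    """Euler integration of dP/dt = beta * hill(tf) - gamma * P:
    protein production driven by the transcription factor,
    balanced against degradation at rate gamma."""
    p = 0.0
    for _ in range(steps):
        p += (beta * hill(tf) - gamma * p) * dt
    return p
```

With a fixed transcription-factor level, the protein concentration settles toward the steady state beta · hill(tf) / gamma, reproducing the dynamics that would otherwise require simulating trillions of atoms.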

By interconnecting the various types of networks used to model different aspects of biological systems and using more and more abstract levels of modeling, it may eventually become possible to model large systems of cells, complete organs, and even whole organisms. Modeling and simulating all the processes in an organism will not, in general, be of interest. In most cases, researchers will be interested in modeling and understanding specific processes, or pathways, in order to understand how a given dysfunction or disease can be tackled. Many of the problems of interest that require the simulation of complex biological systems are related to illnesses and other debilitating conditions. In fact, each complex disease requires the work of many thousands of researchers to put together models that can be used to understand the mechanisms of disease and to investigate potential treatments. Many of these treatments involve designing specific molecules that will interfere with the behavior of the biological networks that lead to the disease. In many types of cancer, the protein and gene regulatory networks controlling cell division and cell multiplication are usually the targets of study, with the aim of understanding how uncontrolled cell division can be stopped. Understanding neurodegenerative diseases that progressively destroy or damage neurons requires accurate models of neuronal networks and also of the molecular mechanisms that control the behavior of each neuron. Aging isn’t a disease, as such, but its effects can be delayed (or perhaps averted) if better models for the mechanisms of cellular division and for the behavior of networks of neurons become available.

Partial models of biological organisms will therefore be used to understand and cure many illnesses and conditions that affect humanity. The use of these models will not, in general, raise complex ethical problems, since they will be in silico models of parts of organisms and will be used to create therapies. In most cases, they will also reduce the need for animal experiments to study the effects of new molecules, at least in some phases of the research. However, the ability to perform a simulation of a complete organism raises interesting ethical questions because, if accurate and detailed enough, it will reproduce the behavior of the actual organism and will become, effectively, an emulation. Perhaps the most advanced effort to date to obtain the simulation of a complex organism is the OpenWorm project, which aims at simulating the entire roundworm Caenorhabditis elegans.

Digital Animals

The one-millimeter-long worm Caenorhabditis elegans has a long history in science as a result of its extensive use as a model for the study of simple multi-cellular organisms. It was, as was noted above, the first animal to have its genome sequenced, in 1998. But well before that, in 1963, Sydney Brenner proposed it as a model organism for the investigation of neural development in animals, an idea that would lead to Brenner’s research at the MRC Laboratory of Molecular Biology in Cambridge, England, in the 1970s and the 1980s. In an effort that lasted more than twelve years, the complete structure of the brain of C. elegans was reverse engineered, leading to a diagram of the wiring of each neuron in this simple brain. The work was made somewhat simpler by the fact that the structure of the brain of this worm is completely encoded by the genome, leading to virtually identical brains in all the individuals that share a particular genome.

The painstaking effort of reverse engineering a worm brain included slicing several worm brains very thinly, taking about 8,000 photos of the slices with an electron microscope, and connecting each neuron section of each slice to the corresponding neuron section in the neighboring slices (mostly by hand). The complete wiring diagram of the 302 neurons and the roughly 7,000 synapses that constitute the brain of C. elegans was described in minute detail in a 340-page article titled “The Structure of the Nervous System of the Nematode Caenorhabditis elegans” (White et al. 1986). The running head of the article was shorter and more expressive: “The Mind of a Worm.” This detailed wiring diagram could, in principle, be used to create a very detailed simulation model with a behavior that would mimic the behavior of the actual worm. In fact, the OpenWorm project aims at constructing a complete model, not only of the 302 neurons and the 95 muscle cells, but also of the remaining 1,000 cells in each worm (more exactly, 959 somatic cells plus about 2,000 germ cells in the hermaphrodites and 1,031 cells in the males).

The OpenWorm project, begun in 2011, has to overcome significant challenges to obtain a working model of the worm. The wiring structure of the brain, as obtained by Brenner’s team, isn’t entirely sufficient to define its actual behavior. Much more information is required, such as detailed models for each connection between neurons (the synapses), dynamical models for the neurons, and additional information about other variables that influence the brain’s behavior. However, none of this information is, in principle, impossible to obtain through a combination of further research in cell behavior, advanced instrumentation techniques (which I will address in chapter 9), and fine-tuning of the models’ parameters.
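To give a flavor of the kind of dynamical model that must supplement the wiring diagram, here is a minimal leaky integrate-and-fire neuron, a standard textbook abstraction; the parameter values (membrane time constant, threshold, reset voltage) are generic illustrative numbers, not measurements from C. elegans:

```python
def lif_neuron(input_current, v_rest=-65.0, v_thresh=-50.0,
               v_reset=-70.0, tau=10.0, dt=0.1):
    """Leaky integrate-and-fire model of a single neuron. The wiring
    diagram says which neurons connect; dynamical parameters such as
    tau (membrane time constant, ms) and the firing threshold must
    come from further measurement or model fitting.
    Returns the list of spike times (ms)."""
    v = v_rest
    spikes = []
    for step, i_in in enumerate(input_current):
        v += ((v_rest - v) / tau + i_in) * dt   # leaky integration
        if v >= v_thresh:                       # threshold crossing
            spikes.append(step * dt)
            v = v_reset                         # reset after a spike
    return spikes
```

A whole-worm simulator needs a model of this sort (or a more realistic one) for every neuron, plus models for each of the roughly 7,000 synapses, before the wiring diagram can produce behavior.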

The OpenWorm project brings together coders (who will write the complete simulator using a set of programming languages) and scientists working in many fields, including genetics, neuron modeling, fluid dynamics, and chemical diffusion. The objective is to have a complete simulator of all the electrical activity in all the muscles and neurons, and an integrated simulation environment that will model body movement and physical forces within the worm and the worm’s interaction with its environment.

The interesting point for the purpose of the present discussion is that, in principle, the OpenWorm project will result in the ability to simulate the complete behavior of a fairly complex animal by emulating the cellular processes in a computer. Let us accept, for the moment, the reasonable hypothesis that this project will succeed in creating a detailed simulation of C. elegans. If its behavior, in the presence of a variety of stimuli, is indistinguishable from that of a real-world worm, one has to accept that the program which simulates the worm has passed, in a very restricted and specific way, a sort of Turing Test: A program, running in a computer, simulates a worm well enough to fool an observer. One may argue that a simulation of a worm will never fool an observer into taking it for the real thing. This is, however, a simplistic view. If one has a good simulation of the worm, it is relatively trivial to use standard computer graphics technologies to render its behavior on a screen, in real time, in such a way that it will not be distinguishable from the behavior of a real worm.

In reality, no one will care much whether a very detailed simulation of a worm is indistinguishable from the real worm. This possibility, however, opens the door to a much more challenging one: What if, with the evolution of science and technology, we were to become able to simulate other, more complex animals in equivalent detail? What if we could simulate, in this way, a mouse, a cat, or even a monkey, or at least significant parts of those animals? And if we can simulate a monkey, what stops us from eventually being able to simulate a human? After all, humans and other primates are fairly close in biological complexity. The most challenging differences are in the brain, which is also the most interesting organ to simulate. Even though the human brain has on the order of 100 billion neurons and the brain of a chimpanzee only 7 billion, this difference of one order of magnitude will probably not be a roadblock if we are ever able to reverse engineer and simulate a chimp brain. A skeptical reader will probably argue, at this point, that this is a far-fetched possibility, not very likely to happen any time soon. It is one thing to simulate a worm with 302 neurons and no more than 1,000 cells; it is a completely different thing to simulate a more complex animal, not to speak of a human being with a hundred billion neurons and many trillions of cells.

Nonetheless, we must keep in mind that such a possibility exists, and that simulating a complex animal is, in itself, a worthwhile goal that, if achieved, will change our ability to understand, manipulate, and design biological systems. At present this ability is still limited, but advances in synthetic biology may change that state of affairs.

Synthetic Biology

Advances in our understanding of biological systems have created the exciting possibility that we will one day be able to design new life forms from scratch. To do that, we will have to explore the vastness of the Library of Mendel in order to find the specific sequences of DNA that, when interpreted by the cellular machinery, will lead to the creation of viable new organisms. Synthetic biology is usually defined as the design and construction of biological devices and systems for useful purposes. The geneticist Wacław Szybalski may have been the first to use the term. In 1973, when asked during a panel discussion what he would like to be doing in the “somewhat non-foreseeable future,” he replied as follows:

Let me now comment on the question “what next.” Up to now we are working on the descriptive phase of molecular biology. … But the real challenge will start when we enter the synthetic phase of research in our field. We will then devise new control elements and add these new modules to the existing genomes or build up wholly new genomes. This would be a field with an unlimited expansion potential and hardly any limitations to building “new better control circuits” or … finally other “synthetic” organisms, like a “new better mouse.” … I am not concerned that we will run out of exciting and novel ideas … in the synthetic biology, in general.

Designing a new biological system is a complex enterprise. Unlike human-made electronic and mechanical systems, which are composed of modular components interconnected in a clear way, each with a different function, biological systems are complex contraptions in which each part interacts with all the other parts in complicated and unpredictable ways. The same protein, encoded in a gene in DNA, may have a function in the brain and another function, entirely different, in the skin. The same protein may fulfill many different functions in different cells, and at present we don’t have enough knowledge to build, from scratch, an entirely new synthetic organism.

However, we have the technology to create specific sequences of DNA and to insert them into cells emptied of their own DNA. The machinery of those cells will interpret the inserted synthetic DNA to create new copies of whatever cell is encoded in the DNA. It is now possible to synthesize, from a given specification, DNA sequences with thousands of base pairs. Technological advances will make it possible to synthesize much longer DNA sequences at very reasonable costs.

The search for new organisms in the Library of Mendel is made easier by the fact that some biological parts have well-understood functions, which can be replicated by inserting the code for these parts in the designed DNA. The DNA parts used most often are the BioBrick plasmids invented by Tom Knight (2003). BioBricks are DNA sequences that encode specific proteins, each of which has a well-defined function in a cell. BioBricks are stored and made available in a registry of standard biological parts and can be used by anyone interested in designing new biological systems. In fact, this “standard” for the design of biological systems has been used extensively by thousands of students around the world in biological system design competitions such as the iGEM competition. Other parts and methodologies, similar in spirit to the BioBricks, also exist and are widely available.
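The spirit of part-based design can be conveyed with a toy sketch: parts are named sequences, and a construct is assembled by concatenating them in order. The part names and sequences below are invented for illustration and are not actual registry entries:

```python
# Hypothetical parts, in the spirit of a registry of standard
# biological parts; names and sequences are illustrative only.
PARTS = {
    "promoter":   "TTGACA",
    "rbs":        "AGGAGG",
    "cds":        "ATGAAA",
    "terminator": "TTTTTT",
}

def assemble(part_names, scar=""):
    """Concatenate part sequences in the given order, optionally
    inserting a fixed junction (scar) sequence between consecutive
    parts, as assembly standards typically do."""
    return scar.join(PARTS[name] for name in part_names)
```

Real assembly standards also specify flanking restriction sites and the junction left behind by each ligation, but the composition-by-concatenation idea is the same.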

Existing organisms have already been “re-created” by synthesizing their DNA and introducing it into the machinery of a working cell. In 2003, a team from the J. Craig Venter Institute synthesized the genome of the bacteriophage Phi X 174, the first organism to have had its genome (5,386 bases long) sequenced. The team used the synthesized genome to assemble a living bacteriophage. Most significantly, in 2006, the same team constructed and patented a synthetic genome of Mycoplasma laboratorium, a novel minimal bacterium derived from the genome of Mycoplasma genitalium, with the aim of creating a viable synthetic organism. Mycoplasma genitalium was chosen because it was the known organism with the smallest number of genes. Later the research group decided to use another bacterium, Mycoplasma mycoides, and managed to transplant the synthesized genome of this bacterium into an existing cell of a different bacterium that had had its DNA removed. The new bacterium was reported to have been viable and to have replicated successfully.

These and other efforts haven’t yet led to radically new species with characteristics and behaviors completely different from those of existing species. However, there is little doubt that such efforts will continue and will eventually lead to new technologies that will enable researchers to alter existing life forms. Altered life forms have the potential to solve many problems, in many different areas. In theory, newly designed bacteria could be used to produce hydrocarbons from carbon dioxide and sunlight, to clean up oil spills, to fight global warming, and to perform many, many other tasks.

If designing viable new unicellular organisms from scratch is still many years in the future, designing complex multi-cellular eukaryotes is even farther off. The developmental processes that lead to the creation of different cell types and body designs are very complex and depend in very complicated ways on the DNA sequence. Until we have a much more complete understanding of the way biological networks work in living organisms, we may be able to make small “tweaks” and adjustments to existing life forms, but we will not be able to design entirely new species from scratch. Unicorns and centaurs will not appear in our zoos any time soon. In a way, synthetic biology will be, for many years to come, a sophisticated version of the genetic engineering that has been taking place for many millennia on horses, dogs, and other animals, which have been, in reality, designed by mankind.

Before we embark, in the next chapters, on a mixture of educated guesses and wild speculations on how the technologies described in the last chapters may one day change the world, we need to better understand how the most complex of organs, the brain, works. In fact, deriving accurate models of the brain is probably the biggest challenge we face in our quest to understand completely the behavior of complex organisms.