Appendix A: Notes on Visualization

The Tectonic Diagrams

The visual layouts in chapter 2 of the most significant words in the titles of eighteenth-century systems code the probability of the co-occurrence of these terms as literal distances. We determined a word’s significance by measuring the ratio between the number of times each word appeared in a title with system during a specific historical period and the number of times that we would expect it to appear during that period given its overall frequency in the corpus. In the tectonic maps, we included only words that are relatively frequent in titles with system but significantly less frequent in the overall corpus: this process allowed us to exclude very rare words that appear on only one or two title pages, as well as very frequent words, such as articles, pronouns, and modifiers, that appear on almost every title page. Neither group contributes much to determining the unique clusters of words that surround system.
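The ratio described above can be sketched in a few lines of Python. This is only one possible reading of the measure, not the code used to produce the maps: the function name, the representation of a title as a list of lowercase words, and the choice to count a word once per title are all assumptions made for illustration. The sketch also assumes that the titles for the period are drawn from the same corpus, so that every observed word has a nonzero overall frequency.

```python
from collections import Counter

def significance_ratios(period_titles, corpus_titles, target="system"):
    """Observed/expected ratio for words sharing a title with `target`.

    Hypothetical helper: titles are lists of lowercase words, and a word
    is counted at most once per title.
    """
    # Number of corpus titles in which each word appears.
    corpus_counts = Counter(w for t in corpus_titles for w in set(t))
    n_corpus = len(corpus_titles)

    # Titles from the period that contain the target word.
    target_titles = [set(t) for t in period_titles if target in t]

    # Observed: how many of those titles each other word appears in.
    observed = Counter(w for t in target_titles for w in t if w != target)

    ratios = {}
    for word, obs in observed.items():
        # Expected: what the word's overall frequency predicts for
        # this many titles.
        expected = len(target_titles) * corpus_counts[word] / n_corpus
        ratios[word] = obs / expected
    return ratios

# Toy data, invented for illustration only.
corpus = [
    ["system", "of", "nature"],
    ["system", "of", "grammar"],
    ["history", "of", "england"],
    ["sermons", "of", "tillotson"],
]
ratios = significance_ratios(corpus[:2], corpus)
```

In this toy corpus a word such as of, which appears in every title, receives a ratio of 1 (it appears with system exactly as often as chance predicts), while nature receives a ratio of 2. A frequency filter of the kind described above would then discard words at either extreme.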

The visualization itself was developed through a new methodology based in part on work in information visualization at Bell Laboratories (Gansner et al. 2009, 345–346). Each map is based on titles that contain the word system within the ECCO archive. The term system therefore acts as the center for each diagram: all other words are placed in relation to both it and each other. Although in some cases it may appear as though system lies outside the visual center of the map, each map should be read with the tile that contains system in the center. In these cases, the borders of the map actually extend farther in the direction against which system is offset.

The distances between terms are visual representations based on the underlying relationships between the terms and are a function of the probability of their co-occurrence within a single title. The farther apart two words are, the less likely they are to both appear within a single title that also contains the word system. The closer together they are, the more likely they are to appear on a title page that also contains the term system. The cardinality of each term is determined by the relationship of that term and the word system. Groups of words that occur to the “south” of system share a different set of similar frequencies with each other than do groups of words that occur to the “north” of system on a given map. The specific location of each word on the map is a function of both how often it is likely to appear with system and what words it is most likely to share a title page with. Words in the northeast corner therefore share a similar set of frequencies with each other that is different from those shared by words in the southwest corner. Each set of words may be equidistant from system (suggesting that both sets share title pages with system with the same frequency), but as the layout of the map is based on the relationship of every word to every other word, those in the northeast are much more likely to appear next to each other than to any word from the southwest corner.

In the early map, for example, the words sir, isaac, and newton all appear next to each other on the extreme eastern edge; this means that they are all equally likely to appear on a title page with system and much more likely to appear with each other than with a word from the western edge, such as grammar, which is more likely to appear with language and exact. The absolute cardinality is arbitrary (“north” and “south” have no absolute meaning); rather, the distance and configurations of groups of words as they radiate out from the term system can be read as meaningful, because these represent the relationship of each word to every other word. The size of each tile is a measure of its cluster density. The smaller the tile, the more tightly clustered it is and the more related to the words that surround it. The larger the tile, the less relationship it has to the words that surround it. The smallest tiles on each map therefore represent the words that are the most densely clustered and therefore have the highest probability of co-occurrence on the same title page.

Each map shows the configuration of the most significant words that share a title page with system in a different part of the eighteenth century. In dividing the period into three parts, we are primarily concerned with meaningful temporal divisions rather than dividing our total sample of 224,759 texts into three equal groupings. As the number of texts held by the ECCO archive per year is not constant across the century (reflecting the increasing number of books printed each year in the eighteenth century), our three maps are based on the following sample sizes: 43,703 for the early period, 61,559 for the middle period, and 114,497 for the late period. In each case, our results were scaled by the number of words on the title page and number of texts per period. The maps therefore represent the probabilities of co-occurrence rather than the raw values. A word that is rarely used but that always appears with history when it is used would be closer to history on the map than a word that is used very frequently but appears on a title page with history only half of the time. This latter type of word would appear with history more times than the former, but the probability of the former appearing with history is much higher, and therefore it would be placed closer to history on the tectonic map. This allows us to compare the maps to each other, even though each is based on a different number of texts.
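The contrast between raw counts and probabilities in the history example can be made concrete with a short Python sketch. The counts below are invented for illustration, and the function is a hypothetical simplification: it shows only the basic proportion, not the scaling by title length and number of texts per period described above.

```python
def cooccurrence_probability(joint_count, word_count):
    """Share of a word's title-page appearances that also include the
    other term (here, history). Hypothetical helper for illustration."""
    return joint_count / word_count

# Invented counts: a rare word that always appears with history, and a
# frequent word that appears with history half of the time.
rare_word = cooccurrence_probability(joint_count=5, word_count=5)
frequent_word = cooccurrence_probability(joint_count=50, word_count=100)

# The frequent word co-occurs with history more often in raw terms
# (50 titles against 5), but the rare word has the higher probability
# and would therefore sit closer to history on the map.
closer_to_history = "rare" if rare_word > frequent_word else "frequent"
```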

The underlying relationships that structure the map are derived from the distance metrics that establish the relationships between words based on their co-occurrence in titles that also contain the word system in the holdings of the ECCO database. In short, each word receives a distance score from every other word based on the likelihood or probability that both would appear within the same title. Next, these data are used to build a network graph by reconfiguring this distance matrix as a series of nodes and edges. In the network, each word is represented by one node, with each node serving as the locus for a set of edges that extend to its nearest neighbors. The number of edges is limited by the distance of each node to its surrounding nodes: those within tight clusters of terms have more edges than those that occupy their own space. A force-directed layout of the network is then derived using a GEM (graph embedder) analysis of the data (Frick, Ludwig, and Mehldau 1994). This algorithm uses the node and edge information to simulate a physical system of springs whose equilibrium of forces is used to arrange the nodes within a meaningful and readable layout. The number of neighbors each node has and their distance determine the location of that node in relation to all of the others. In every case, system, as the only term that is related to all other nodes, remains the center point of the graph. Finally, the points at the coordinates determined by the placement algorithm are used to derive tessellated polygons in a Voronoi diagram, creating a topological layout of the network (Okabe et al. 2000, 43–45). The resulting map provides a more complex visualization of the relationships among the individual members and their local neighborhoods, replacing the raw distances with a depiction of relationality based on proximity, area and conjunction.
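The first two steps of this process, the distance matrix and its conversion into nodes and edges, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used for the maps: the distance measure (one minus the share of titles containing either word that contain both), the fixed distance threshold for drawing edges, and the function names are all assumptions, since the metric and the edge rule are not specified in detail here.

```python
from itertools import combinations

def distance_matrix(titles, target="system"):
    """Pairwise distances between words that share titles with `target`.

    Assumed metric: 1 - (titles with both words / titles with either).
    """
    # Keep only titles containing the target, and drop the target itself.
    docs = [set(t) - {target} for t in titles if target in t]
    vocab = sorted(set().union(*docs))
    dist = {}
    for a, b in combinations(vocab, 2):
        both = sum(1 for d in docs if a in d and b in d)
        either = sum(1 for d in docs if a in d or b in d)
        dist[(a, b)] = 1 - both / either
    return vocab, dist

def threshold_edges(dist, max_distance):
    """Edges between words no farther apart than `max_distance`.

    Assumed edge rule: words in tight clusters receive more edges than
    isolated words, as in the description above.
    """
    return sorted(pair for pair, d in dist.items() if d <= max_distance)

# Toy titles, invented for illustration only.
titles = [
    ["system", "sir", "isaac", "newton"],
    ["system", "sir", "isaac", "newton", "philosophy"],
    ["system", "grammar", "language"],
    ["history", "england"],
]
vocab, dist = distance_matrix(titles)
edges = threshold_edges(dist, max_distance=0.0)
```

In this toy example sir, isaac, and newton form one fully connected cluster and grammar and language another, with no edges between the two groups. The resulting node and edge lists are the kind of input a force-directed layout takes; the layout itself and the Voronoi tessellation are not sketched here.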
