Appendix A: Notes on Visualization

Mark Algee-Hewitt in consultation with Clifford Siskin

The Scatter Plots

These graphs are composed of points indicating the percentage of texts per year held by the Eighteenth Century Collections Online (ECCO) archive that contain (1) the word system anywhere in the text, (2) the word system on the title page, or (3) the word essay as compared to system on the title page. As the number of texts held by the archive varies per year, the use of the percentage (the raw number of texts per year that contain the term in question divided by the total raw number of texts per year) allows each year to be compared to the others. The overall presence of system or essay throughout the century can therefore be assessed. These numbers can also be read as measures of probability; for example, there is a 40 percent chance that a text drawn at random from ECCO’s holdings for the year 1791 would contain the word system at least once.
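The percentage calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming the per-year counts are available as dictionaries; the function name `percent_per_year` and the counts shown are hypothetical and are not ECCO figures.

```python
def percent_per_year(hits_by_year, totals_by_year):
    """Return {year: percentage of that year's texts containing the term}."""
    return {
        year: 100.0 * hits_by_year.get(year, 0) / total
        for year, total in totals_by_year.items()
        if total > 0  # skip years with no holdings to avoid dividing by zero
    }

# Hypothetical counts, for illustration only (not drawn from ECCO).
totals = {1790: 2000, 1791: 2500}
hits = {1790: 700, 1791: 1000}
percentages = percent_per_year(hits, totals)
```

Read as a probability, a value of 40.0 for a given year means that a text drawn at random from that year's holdings has a 40 percent chance of containing the term.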

To aid in interpreting the trends that each graph charts, each also contains a regression line, a prediction model that attempts to assess the pattern (or trend) of the data (whether it increases, decreases, or remains stable) and predict what the future of that trend will be. The equation shown on the body of the graph describes the characteristics of this line. In the graph “Percent of Texts Containing System,” this is a linear equation (showing that the increase in the use of the word system was linearly progressive). In the other two graphs of title occurrences, the line is a higher-order polynomial, demonstrating the complexity of these data.

The graph titled “Percent of Titles Containing System” indicates an exponential increase in the use of system toward the end of the century: its data could also be fit to an exponential curve with little reduction in the goodness of fit. Below the equation on each graph is an R-squared number giving the goodness of fit of the line: a number between 0 and 1 that describes how well the regression model fits the observed data. An R-squared of 1 would indicate a perfect fit. These values suggest not only the predictive power of the model but also how regular the data are. With an R-squared of over 0.95, the linear graph is an excellent fit, showing a high degree of regularity in the increase of the word system across the century. The graph of system in titles has an R-squared of 0.68, suggesting a good overall fit and a high correspondence to an exponential pattern of growth. Finally, the graph of essay in titles is the most irregular: the R-squared of 0.174 indicates that any observable historical pattern that may be present is remarkably weak. Unlike the graphs of system, this weakness of fit for essay indicates the nonsignificance of any increase or decrease in its use in titles throughout the century.
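For readers who want to see how a regression line and its R-squared are obtained, the following is a minimal sketch of an ordinary least-squares linear fit. It assumes the simplest (linear) case only; the function names and the sample values are hypothetical, and the polynomial fits used for the title graphs are not reproduced here.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys): returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(ys, predicted):
    """1 minus the ratio of residual variation to total variation."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical data: x is years since the start of the series, y is a percentage.
xs = [0, 1, 2, 3, 4]
ys = [1.0, 4.0, 4.0, 8.0, 8.0]
slope, intercept = linear_fit(xs, ys)
r2 = r_squared(ys, [slope * x + intercept for x in xs])
```

The closer the points lie to the fitted line, the smaller the residual variation and the closer R-squared is to 1.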

The Tectonic Diagrams

The visual layouts in chapter 2 of the most significant words in the titles of eighteenth-century systems code the probability of the co-occurrence of these terms as literal distances. We determined a word’s significance by measuring the ratio between the number of times each word appeared in a title with system during a specific historical period and the number of times that we would expect it to appear during that period given its overall frequency in the corpus. In the tectonic maps, we included only words that are relatively frequent in titles with system but significantly less frequent in the overall corpus: this process allowed us to exclude very rare words that appear on only one or two title pages, as well as very frequent words, such as articles, pronouns, and modifiers, that appear on almost every title page. Both have little meaning in determining the unique clusters of words that surround system.
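The ratio of observed to expected occurrences can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: titles are treated as lists of lowercase tokens, the titles containing system are assumed to be a subset of the full corpus, and the expected count is taken to be a word's overall relative frequency multiplied by the number of tokens in the system titles. The function name and the toy titles are hypothetical.

```python
from collections import Counter

def significance_ratios(system_titles, all_titles):
    """Observed / expected count for every word found in the system titles.

    Assumes system_titles is a subset of all_titles, so every word seen
    in a system title also has a nonzero count in the full corpus.
    """
    system_counts = Counter(w for title in system_titles for w in title)
    corpus_counts = Counter(w for title in all_titles for w in title)
    system_total = sum(system_counts.values())
    corpus_total = sum(corpus_counts.values())
    ratios = {}
    for word, observed in system_counts.items():
        # Expected count if the word were spread evenly across the corpus.
        expected = corpus_counts[word] / corpus_total * system_total
        ratios[word] = observed / expected
    return ratios

# Toy titles, for illustration only.
system_titles = [
    ["a", "system", "of", "grammar"],
    ["a", "system", "of", "astronomy"],
]
all_titles = system_titles + [
    ["a", "sermon"],
    ["a", "history", "of", "england"],
]
ratios = significance_ratios(system_titles, all_titles)
```

A ratio above 1 marks a word that appears with system more often than its overall frequency would predict; in the toy data grammar scores above 1 while the article a scores below it. Any thresholds for excluding very rare or very frequent words would be applied on top of these ratios.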

The visualization itself was developed through a new methodology based in part on work in information visualization at Bell Laboratories (Gansner et al. 2009, 345–346). Each map is based on titles that contain the word system within the ECCO archive. The term system therefore acts as the center for each diagram: all other words are placed in relation to both it and each other. Although in some cases it may appear as though system lies outside the visual center of the map, each map should be read with the tile that contains system in the center. In these cases, the borders of the map actually extend farther in the direction against which system is offset.

The distances between terms are visual representations based on the underlying relationships between the terms and are a function of the probability of their co-occurrence within a single title. The farther apart two words are, the less likely they are to both appear within a single title that also contains the word system. The closer together they are, the more likely they are to appear on a title page that also contains the term system. The cardinality of each term is determined by the relationship of that term and the word system. Groups of words that occur to the “south” of system share a different set of similar frequencies with each other than do groups of words that occur to the “north” of system on a given map. The specific location of each word on the map is a function of both how often it is likely to appear with system and what words it is most likely to share a title page with. Words in the northeast corner therefore share a similar set of frequencies with each other that is different from those shared by words in the southwest corner. Each set of words may be equidistant from system (suggesting that both sets share title pages with system with the same frequency), but as the layout of the map is based on the relationship of every word to every other word, those in the northeast are much more likely to appear next to each other than to any word from the southwest corner.

In the early map, for example, the words sir, isaac, and newton all appear next to each other on the extreme eastern edge; this means that they are all equally likely to appear on a title page with system and much more likely to appear with each other than with a word from the western edge, such as grammar, which is more likely to appear with language and exact. The absolute cardinality is arbitrary (“north” and “south” have no absolute meaning); rather, the distance and configurations of groups of words as they radiate out from the term system can be read as meaningful, as these represent the relationship of each word to every other word. The size of each tile is a measure of its cluster density. The smaller the tile, the more tightly clustered it is and the more related to the words that surround it. The larger the tile, the less relationship it has to the words that surround it. The smallest tiles on each map therefore represent the words that are the most densely clustered and therefore have the highest probability of co-occurrence on the same title page.
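The link between tile size and clustering follows from the geometry of the tiles, which are Voronoi cells: each tile covers the area closer to its own word than to any other. The sketch below approximates tile areas by sampling a grid and assigning each sample to its nearest point. The coordinates are hypothetical (they are not taken from the maps), and clipping the cells to a unit square is an assumption made so that edge tiles have a finite area.

```python
def approximate_cell_areas(sites, bounds=(0.0, 0.0, 1.0, 1.0), resolution=100):
    """Approximate the Voronoi cell area of each site inside a bounding box.

    sites: {name: (x, y)}. Each grid sample is credited to its nearest site.
    """
    xmin, ymin, xmax, ymax = bounds
    dx = (xmax - xmin) / resolution
    dy = (ymax - ymin) / resolution
    counts = {name: 0 for name in sites}
    for i in range(resolution):
        x = xmin + (i + 0.5) * dx
        for j in range(resolution):
            y = ymin + (j + 0.5) * dy
            nearest = min(
                sites,
                key=lambda n: (sites[n][0] - x) ** 2 + (sites[n][1] - y) ** 2,
            )
            counts[nearest] += 1
    return {name: count * dx * dy for name, count in counts.items()}

# Hypothetical placement: three tightly clustered words and one isolated word.
sites = {
    "system": (0.50, 0.50),
    "sir": (0.80, 0.50),
    "isaac": (0.85, 0.55),
    "newton": (0.85, 0.45),
    "grammar": (0.10, 0.50),
}
areas = approximate_cell_areas(sites)
```

In this toy placement the word hemmed in by close neighbors on every side (sir) receives the smallest tile, while the isolated word (grammar) receives a much larger one.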

Each map shows the configuration of the most significant words that share a title page with system in a different part of the eighteenth century. In dividing the period into three parts, we are primarily concerned with meaningful temporal divisions rather than with dividing our total sample of 224,759 texts into three equal groupings. As the number of texts held by the ECCO archive per year is not constant across the century (reflecting the increasing number of books printed each year in the eighteenth century), our three maps are based on the following sample sizes: 43,703 for the early period, 61,559 for the middle period, and 114,497 for the late period. In each case, our results were scaled by the number of words on the title page and number of texts per period. The maps therefore represent the probabilities of co-occurrence rather than the raw values. A word that is rarely used but always appears with history when it is used would be closer to history on the map than a word that is used very frequently but appears on a title page with history only half of the time. This latter type of word would appear with history more times than the former, but the probability of the former appearing with history is much higher, and therefore it would be placed closer to history on the tectonic map. This allows us to compare the maps to each other, even though each is based on a different number of texts.
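The difference between raw counts and probabilities in the history example can be made concrete with a small sketch. It assumes titles are represented as sets of words and uses a simple conditional share (titles containing both words divided by titles containing the first word); the additional scaling by title length and by texts per period described above is not reproduced. The function names and toy titles are hypothetical.

```python
def raw_cooccurrence(titles, word, other):
    """Number of titles that contain both words."""
    return sum(1 for title in titles if word in title and other in title)

def cooccurrence_probability(titles, word, other):
    """Share of the titles containing `word` that also contain `other`."""
    with_word = [title for title in titles if word in title]
    if not with_word:
        return 0.0
    return sum(1 for title in with_word if other in title) / len(with_word)

# Toy titles: "ecclesiastical" is rare but always appears with "history";
# "new" is frequent but appears with "history" only half of the time.
titles = [
    {"system", "ecclesiastical", "history"},
    {"system", "ecclesiastical", "history"},
    {"system", "new", "history"},
    {"system", "new", "history"},
    {"system", "new", "history"},
    {"system", "new", "geography"},
    {"system", "new", "grammar"},
    {"system", "new", "astronomy"},
]
rare_raw = raw_cooccurrence(titles, "ecclesiastical", "history")
frequent_raw = raw_cooccurrence(titles, "new", "history")
rare_prob = cooccurrence_probability(titles, "ecclesiastical", "history")
frequent_prob = cooccurrence_probability(titles, "new", "history")
```

Here the frequent word co-occurs with history more often in raw terms, but the rare word has the higher probability, so on a map built from probabilities it would sit closer to history.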

The underlying relationships that structure the map are derived from the distance metrics that establish the relationships between words based on their co-occurrence in titles that also contain the word system in the holdings of the ECCO database. In short, each word receives a distance score from every other word based on the likelihood or probability that both would appear within the same title. Next, these data are used to build a network graph by reconfiguring this distance matrix as a series of nodes and edges. In the network, each word is represented by one node, with each node serving as the locus for a set of edges that extend to its nearest neighbors. The number of edges is limited by the distance of each node to its surrounding nodes: those within tight clusters of terms have more edges than those that occupy their own space. A force-directed layout of the network is then derived using a GEM (graph embedder) analysis of the data (Frick, Ludwig, and Mehldau 1994). This algorithm uses the node and edge information to simulate a physical system of springs whose equilibrium of forces is used to arrange the nodes within a meaningful and readable layout. The number of neighbors each node has and their distance determine the location of that node in relation to all of the others. In every case, system, as the only term that is related to all other nodes, remains the center point of the graph. Finally, the points at the coordinates determined by the placement algorithm are used to derive tessellated polygons in a Voronoi diagram, creating a topological layout of the network (Okabe et al. 2000, 43–45). The resulting map provides a more complex visualization of the relationships among the individual members and their local neighborhoods, replacing the raw distances with a depiction of relationality based on proximity, area and conjunction.
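The first two steps of this pipeline, the distance matrix and its conversion into nodes and edges, can be sketched as follows. The sketch rests on assumptions that are not stated above: it uses a Jaccard distance (one minus the share of titles containing either word that contain both) as a stand-in for the distance metric, and it keeps an edge only when the distance falls below a hypothetical threshold `max_distance`, which gives nodes in tight clusters more edges than isolated ones. The layout and tessellation steps are not shown.

```python
from itertools import combinations

def jaccard_distance(titles, a, b):
    """1 - (titles containing both words) / (titles containing either word)."""
    has_a = {i for i, title in enumerate(titles) if a in title}
    has_b = {i for i, title in enumerate(titles) if b in title}
    either = has_a | has_b
    if not either:
        return 1.0
    return 1.0 - len(has_a & has_b) / len(either)

def distance_matrix(titles, words):
    """Symmetric {word: {word: distance}} table over the chosen words."""
    dist = {w: {w: 0.0} for w in words}
    for a, b in combinations(words, 2):
        d = jaccard_distance(titles, a, b)
        dist[a][b] = d
        dist[b][a] = d
    return dist

def neighbor_edges(dist, max_distance):
    """Edges (a, b, distance) between words no farther apart than max_distance."""
    words = sorted(dist)
    return [
        (a, b, dist[a][b])
        for a, b in combinations(words, 2)
        if dist[a][b] <= max_distance
    ]

# Toy titles (all assumed to contain "system", which is omitted here).
titles = [
    {"sir", "isaac", "newton"},
    {"sir", "isaac", "newton", "philosophy"},
    {"grammar", "language"},
    {"grammar", "language", "exact"},
]
words = ["sir", "isaac", "newton", "grammar", "language"]
dist = distance_matrix(titles, words)
edges = neighbor_edges(dist, max_distance=0.5)
```

In the toy data sir, isaac, and newton form one fully connected cluster and grammar and language form another, with no edges between the two groups.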

Technical Notes

  1. Although the ECCO archive contains many duplicate titles (through the inclusion of foreign reprints and later editions), we have elected to retain these duplicated titles rather than filter the archive down to unique texts. As duplication is an indicator of a text’s significance (only the most popular texts are issued in multiple editions), their inclusion allows us to more finely trace the cultural significance of our terms.
  2. The tectonic maps presented in chapter 2 are based on a composite of two force-directed layouts. The first uses the scaled distance between nodes to derive the shape of the network. The second recomputes the layout using the initial placement of the nodes in the first graph (to which the GEM algorithm is highly sensitive) instead of the distance: this smooths the layout and establishes readable distances between the points. (See Fruchterman and Reingold 1991.)
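The idea of seeding one layout pass with the result of another can be illustrated with a very simple spring relaxation. This sketch is neither GEM nor the Fruchterman and Reingold algorithm: it treats every pair of nodes as a spring whose rest length is a target distance and nudges the nodes until the springs settle. The starting placement on a circle, the step size, the iteration counts, and the target distances are all hypothetical. The second pass here reuses the same targets purely to show the seeding mechanism; the second pass described in note 2 drops the distances, and its exact formulation is not reproduced.

```python
import math

def circle_positions(nodes, radius=1.0):
    """Deterministic starting placement: nodes evenly spaced on a circle."""
    n = len(nodes)
    return {
        node: (radius * math.cos(2 * math.pi * k / n),
               radius * math.sin(2 * math.pi * k / n))
        for k, node in enumerate(nodes)
    }

def spring_layout(targets, initial, step=0.1, iterations=2000):
    """Relax springs whose rest lengths are the target distances.

    targets: {(a, b): rest_length}; initial: {node: (x, y)}.
    """
    pos = {node: list(xy) for node, xy in initial.items()}
    for _ in range(iterations):
        moves = {node: [0.0, 0.0] for node in pos}
        for (a, b), rest in targets.items():
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            length = max(math.hypot(dx, dy), 1e-9)
            # Positive when the spring is stretched, negative when compressed.
            pull = (length - rest) / length
            moves[a][0] += step * pull * dx
            moves[a][1] += step * pull * dy
            moves[b][0] -= step * pull * dx
            moves[b][1] -= step * pull * dy
        for node, (mx, my) in moves.items():
            pos[node][0] += mx
            pos[node][1] += my
    return {node: tuple(xy) for node, xy in pos.items()}

def layout_distance(pos, a, b):
    """Euclidean distance between two placed nodes."""
    return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])

# Hypothetical target distances between three words.
targets = {
    ("system", "newton"): 1.0,
    ("system", "grammar"): 3.0,
    ("newton", "grammar"): 3.0,
}
first = spring_layout(targets, circle_positions(["system", "newton", "grammar"]))
# Second pass seeded with the placement produced by the first.
second = spring_layout(targets, first, iterations=200)
```

Because a relaxation of this kind only moves nodes from wherever they start, the placement handed to it shapes the result, which is why a second pass seeded with the first layout refines that layout rather than producing an unrelated one.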