While in the physical sciences the investigator will be able to measure what, on the basis of a prima facie theory, he thinks important, in the social sciences often that is treated as important which happens to be accessible to measurement. This is sometimes carried to the point where it is demanded that our theories must be formulated in such terms that they refer only to measurable magnitudes.
—Friedrich August von Hayek, Nobel Prize Lecture, December 11, 19741
Today, the research that uses computation and large cultural datasets is spread among many academic fields, professional practices, and publication formats. It includes publications in academic journals, conference papers, blog posts, code in GitHub repositories, large long-term institutional projects to assemble the digital records of many separate collections (such as Europeana.eu), short-lived art installations in museums and public spaces, and interactive projects by data artists and designers.
Let’s look at some examples of this research and the academic publications and conferences in which this work often appears. In computer science, relevant publications and conference papers that analyze cultural content and interactions now number in the hundreds of thousands. Some of this research appears in conferences on “social computing”2 and “computational social science.” Other work is done in various subfields of computer science, including computer multimedia, computer vision, music information retrieval, natural language processing, web science, and machine learning. Nature and Science, the two most prestigious international science journals, have also published a number of important papers (I will discuss two of them later).3 Another prestigious journal with many publications that use computational methods to analyze large social media datasets is PLOS One.4 Among annual conferences that feature such work, two very important ones are the International World Wide Web Conference (1994–present) and, as mentioned earlier, the International AAAI Conference on Web and Social Media (2007–present).
Much of this research in computer science relies on large samples of user content shared on social networks and on data about people’s behaviors on these networks, such as the numbers of views, likes, and shares for a post, lists of followers, and so on. The papers analyze user behavior on the most popular social networks and media sharing services, such as Weibo, Facebook, Instagram, Flickr, YouTube, Pinterest, and Tumblr. They also computationally analyze the characteristics of image, video, and text posts and propose models that connect user behavior to these characteristics. For example, in one research area called computational aesthetics, scientists create mathematical models that predict which images and videos will be popular and how this popularity is affected by their content and by other characteristics such as “memorability,” “interestingness,” “beauty,” or “creativity.”5 (The researchers proposed metrics to measure these characteristics.)
For examples of how scientists analyze cultural behaviors on a media sharing platform, consider work on Instagram. On Google Scholar, my search for “Instagram dataset” conducted on February 3, 2020, returned 17,110 journal articles and conference papers. One publication analyzed the most popular Instagram subjects and types of users in terms of what kinds of subjects frequently appear in photos in their feeds.6 Another paper used a sample of 4.1 million Instagram photos to quantify the effect of using filters on the numbers of views and comments.7 In another paper, a group of researchers analyzed temporal and demographic trends in Instagram selfies by using 5.5 million photos with faces they collected from the network. They have also tested three alternative hypotheses about the reasons behind posting selfies in each of 117 countries in their dataset.8 Yet another paper investigated clothing and fashion styles in forty-four world cities using one hundred million Instagram photos.9
These papers illustrate the general characteristics common to a large proportion of cultural research in computer science. This research deals with the present time. It relies on large random samples of user-created content and user activities, such as many millions of posts on social and media sharing networks by millions of people. As a result, what this research looks at and quantifies is popular culture—that is, the tastes, interests, and imagination shared by majorities. (Because of privacy issues, scientists can’t ask each of these users to identify themselves or submit demographic information.)
There are obvious advantages to such scale (e.g., we can find more reliable statistical patterns), but the desire to model and predict human cultural behavior on that scale can also be blinding. As I will discuss later in more detail, the small “islands” of global culture—groups of unique cultural artifacts, unique cultural behaviors, and unique tastes—can easily become invisible when we aggregate all the data together and analyze it as though it comes from a single population.
Contemporary popular culture as it exists in social media, blogs, forums, and other online platforms receives the most attention in computational research, but we can also find some very interesting quantitative work on media history. A number of scientists have published studies of historical visual and audio media that creatively use methods from the fields of image processing, computer vision, and music information retrieval. The examples of such work that I find particularly interesting are “Toward Automated Discovery of Artistic Influence,”10 “Measuring the Evolution of Contemporary Western Popular Music,”11 and “Quicker, Faster, Darker: Changes in Hollywood Film over 75 Years.”12 The first paper presents a mathematical model for the automated discovery of influence among artists. The model was tested using 1,710 images of paintings by sixty-six well-known artists. Although some of the discovered influences are the same ones often described by art historians, the model also suggests visual influences among artists that had not been discussed previously. The second paper investigates changes in popular music using a dataset of 464,411 songs produced between 1955 and 2010. The third paper analyzes gradual changes in average shot duration across 9,400 English-language narrative films created between 1912 and 2013.
The analysis of mostly historical text cultures has been central to the field of digital humanities as it developed in literary studies departments. The history that this field constructed for itself (especially in English-language countries)13 starts in 1949 with a project by the Italian priest Roberto Busa to make an index of words in the writings of St. Thomas Aquinas, a project eventually supported by IBM. (For alternative histories of the field’s beginnings, see “A Genealogy of Distant Reading” by Ted Underwood14 and “Search and Replace: Josephine Miles and the Origins of Distant Reading” by Rachel Sagner Buurma and Laura Heffernan.15) Important institutional milestones for the field’s development include, in the United States, the founding of the journal Computers and the Humanities (1966–present), the Association for Computers and the Humanities (1978–present), and the NEH Office of Digital Humanities (2008–present), and, internationally, the annual Digital Humanities Conference (1989–present).16 Any attempt at a summary of the field today will be incomplete given its size and diversity, but for a compact view from 2015, I recommend “Seven Ways Humanists Are Using Computers to Understand Text” by Underwood.17 I could give many interesting examples of digital humanities research, but I will mention just one here because it illustrates well what I see as the most interesting type of inquiry: using bigger cultural data to question our existing concepts and methods of analysis (this idea appears first in my list of twelve research questions for cultural analytics in the book’s introduction). The authors of the paper “Mapping Mutable Genres in Structurally Complex Volumes” apply computational methods to analyze the texts of 469,200 digitized English-language volumes spanning several centuries.18 The initial problem of automatically classifying these volumes by genre leads to a discussion of the instability of genre categories over time:
Existing metadata rarely provides unambiguous information about genre. More troublingly, when you dig into the problem, it becomes clear that no amount of manual categorization will ever produce a definitive boundary between fiction and nonfiction in a collection with a significant timespan, because the boundary changes over time. Form and content didn’t necessarily align in earlier centuries as they do for us. Nineteenth-century biographies that invent imagined dialogue often read exactly like a novel; eighteenth-century essays like Richard Steele’s Tatler use thinly fictionalized characters as a veil for nonfiction journalism.19
Among the multitudes of papers that analyze cultural data computationally, some of the most interesting are those that test existing cultural theories and/or propose new ones. One such study is called “Fashion and Art Cycles Are Driven by Counter-Dominance Signals of Elite Competition: Quantitative Evidence from Music Styles.”20 The paper uses data on eight million music albums released between 1952 and 2010 to test two common theories of art and fashion cycles. As the authors of the paper summarize it, “According to ‘top down’ theories, elite members signal their superior status by introducing new symbols (e.g., fashion styles), which are adopted by low-status groups. In response to this adoption, elite members would need to introduce new symbols to signal their status. According to many ‘bottom-up’ theories, style cycles evolve from lower classes and follow an essentially random pattern.” The quantitative analysis of the historical data leads the authors to propose a different theory supported by statistical tests: “changes in art and fashion styles happen whenever a new elite successfully challenges the hegemony of a previous elite.” As they note, sociologists have been interested in the mechanisms of style cycles ever since Georg Simmel’s 1905 book Philosophy of Fashion. By formulating and testing quantitative models for different mechanisms of change, the paper provides a methodology that can be used to study style cycles in other cultural areas besides popular music.
Work with large cultural datasets includes not only researchers doing analysis in their labs and publishing papers, but also the creation of interactive web interfaces that allow the public to explore trends in such datasets. One prominent project is the Ngram Viewer, created in 2010 by Google scientists Jon Orwant and Will Brockman following a prototype by two Harvard University PhD students in biology and applied math.21 A visitor to the Ngram Viewer’s website can enter several words or phrases and instantly see plots comparing the frequencies of these words’ appearances across millions of books published over a few centuries.
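The core computation behind such a viewer is simple to sketch. Below is a minimal toy illustration (my own, not Google’s actual implementation, which runs over millions of scanned and dated books): for each year, count how often a word occurs relative to all words published that year, producing the values that a frequency plot would display.

```python
from collections import Counter

def word_frequencies(corpus):
    """Compute relative word frequencies per year from a toy corpus
    given as (year, text) pairs -- the values an ngram plot displays."""
    freqs = {}
    for year, text in corpus:
        words = text.lower().split()
        counts = Counter(words)
        total = len(words)
        freqs[year] = {w: c / total for w, c in counts.items()}
    return freqs

# Toy corpus: two "years" of text instead of millions of books
corpus = [(1900, "the war and the peace"), (1950, "the peace movement")]
f = word_frequencies(corpus)
print(f[1900]["the"])  # → 0.4 (2 occurrences out of 5 words)
```

Plotting `f[year][word]` over all years for a chosen word yields the familiar rising or falling frequency curves.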
Among the experiments to create interfaces to large image collections, I want to mention the pioneering projects by the New York Public Library (NYPL) labs. One of these projects, created in 2016, allows online visitors to browse 187,000 public domain images from the NYPL by century, genre, collection, and color.22 The interface shows all 187,000 images at once at small size; clicking on a thumbnail brings up the larger image and related information. Another project, the Photographers’ Identities Catalog, supports exploration of data on 128,668 photographers, studios, and dealers covering the history of photography worldwide.23 The interface includes an interactive map showing detailed locations for records at the city street level. If a photographer lived in a number of places, the map connects all these places, thus giving us a spatial overview of the photographer’s career.
In our own lab, we created two projects that make it possible for visitors to explore and interact with large collections of social media images and data. Selfiecity (2014–2015) enables interactive comparison of patterns in thousands of Instagram self-portraits (selfies) that were shared in six global cities (see plate 5). On Broadway (2014) uses a touch screen to present an interface for navigating a “data city”—specifically, the area along 21 km of Broadway in Manhattan (see plate 11). The images and data used in this project include 660,000 geocoded Instagram photos, eight million Foursquare check-ins, and twenty-two million taxi pickups and drop-offs for one year. Our lab participants collected and organized the data; interface design and programming were done by a team consisting of Moritz Stefaner, one of the world’s leading data visualization designers; Dominicus Baur, an expert in programming interactive applications; and Daniel Goddemeyer, a data products designer.
The previous examples of cultural analytics research and projects may create an impression that this work serves only academic or artistic interests. However, cultural analytics is also often carried out as part of design projects that create new digital products and services or improve existing ones. These can range from the design of new media interfaces for the digital collections of museums and libraries to the analysis of urban social media for guiding urban design and policy. The large-scale analysis of people’s interactions with media content and with each other, mediated by computer systems, can be used to improve these systems. For example, we can propose algorithms to help people discover more types of content or discover content they would normally ignore. In fact, computer scientists who work on improving recommendation systems devote significant energy to figuring out how to deliver more diverse but still relevant recommendations. (In October 2018, Spotify said that its listening diversity, defined as the number of artists the average user streams per month, “has risen on Spotify over the past 10 years at an average of about 8 percent per year.”24)
Some computer scientists have been studying the aesthetic preferences and dynamics of attention in visual media among social network users—asking what images or videos users prefer and how these preferences can be predicted from media content and visual characteristics. For example, consider the 2015 publication titled “An Image is Worth More than a Thousand Favorites.”25 (One of the authors of this work, Miriam Redi, later collaborated with us on the analysis of Instagram images.) The paper presents an “analysis of ordinary people’s aesthetics perception of web images” using nine million Flickr images with Creative Commons licenses. Reviewing the large body of quantitative research that uses large data, the authors note:
The dynamics of attention in social media tend to obey power laws. Attention concentrates on a relatively small number of popular items, neglecting the vast majority of content produced by the crowd. Although popularity can be an indication of the perceived value of an item within its community, previous research has hinted at the fact that popularity is distinct from intrinsic quality. As a result, content with low visibility but high quality lurks in the tail of the popularity distribution. This phenomenon can be particularly evident in the case of photo-sharing communities, where valuable photographers who are not highly engaged in online social interactions contribute high-quality pictures that remain unseen.
The authors propose an algorithm that can find “unpopular” images (i.e., images that have been seen by only a small proportion of users) that are equal in aesthetic quality to popular images. Implementing such an algorithm would allow more creators to find audiences for their works. Such research exemplifies how large-scale quantitative analysis of cultural patterns and situations can be used to offer constructive solutions that change these situations for the better.
The research that analyzes large cultural datasets using computational methods today can be found in many academic disciplines, including computer science, data science, anthropology, sociology, communication, media studies, game studies, linguistics, geography, folklore studies, history, art history, and literary studies. The examples in the previous section illustrated some of the questions being studied. But rather than providing more examples from each of these disciplines, I want to move from individual examples to a larger question. This question is about the assumptions and goals of the larger intellectual paradigms that separate these disciplines—and the possibility of bringing them together in cultural analytics as equal intellectual contributors.
These three paradigms are the humanities and qualitative social sciences, the quantitative social sciences, and computer science. Each has different goals, different research methods, and different ways to evaluate the originality of research. When researchers study cultural data, what they do with it and how they do it reflects the assumptions and norms of these paradigms. In fact, if we know these norms, we may expect that research in each paradigm will develop in its own direction. Thus, computer scientists can be expected to search for general laws that describe patterns in large cultural datasets and to create quantitative models that can predict future patterns, particularly in relation to user behaviors online (following recommendations, disseminating information, purchasing, etc.). Quantitative social scientists will be asking social science questions and applying to the data the statistical methods accepted in their fields. Given that their focus is on social phenomena, we may also expect them to study group behaviors online. Humanists will be analyzing particular historical datasets and particular cultural texts, ideally questioning existing interpretations of cultural histories and offering new ones.
But we don’t have to select either of these approaches or goals. Cultural analytics does not have to choose between humanistic and scientific goals and methodologies or subordinate one to another. Instead, we may want to put together elements from both humanities and sciences for the study of cultures. The humanities can contribute their strengths—focus on the particular (e.g., single artifacts and authors), the meanings of the artifacts, and orientation toward the past. And the sciences can give us theirs—focus on the general (e.g., large-scale patterns), use of the scientific method and mathematics, and interest in predicting the future.
In this section, I will further look at some of the assumptions and norms of humanities, qualitative social sciences, and computer science and will discuss how cultural analytics can potentially combine them. To get started, let’s ask a question: What types of cultural data so far have been analyzed in computer science and the humanities? In other words, what counts as “culture” in each discipline?
In keeping with the historical orientation of the humanities, researchers in the humanities have been using computers to analyze mostly historical artifacts created mostly by professional authors, whether medieval manuscripts by learned monks or nineteenth-century novels whose authors were paid by their publishers. This focus on historical data is easy to see if you go through the issues of digital humanities journals such as Digital Humanities Quarterly (2007–present) or the programs of the annual international Digital Humanities Conference.
In contrast, as I noted earlier, relevant publications in computer science focus almost exclusively on the period after 2005 because they analyze data from social networks, media sharing services, online forums, and blogs. The datasets used in these studies are often much larger than those used in digital humanities. Tens or hundreds of millions of posts, photos, or other items and billions of recorded interactions are not uncommon. And because the great majority of user-generated content is created by regular people rather than professionals, computer scientists have been studying nonprofessional vernacular culture by default. Or, as I put it earlier, what this research looks at and quantifies is popular culture.
Thus, we have two research universes that often use the same computational methods but apply them to different “cultures.” On the humanities side, we have the past that stretches into hundreds or even thousands of years. On the computer science side, we have the present that starts in the beginning of the twenty-first century. On the humanities side, we have artifacts created by professional elites. On the computer science side, we have artifacts and online behavior by everybody else.
The scale of the research in computer science that uses web and social media datasets may be surprising to humanities and arts practitioners, who may not realize how many scientists are working in this area. I have given a number of research examples thus far but have not made fully clear how much is being published on these topics. Let’s again turn to Google Scholar to see this. My recent searches on Google Scholar for “Twitter dataset algorithm,” “YouTube dataset algorithm,” and “Flickr images algorithm” returned hundreds of thousands of journal articles and conference papers. I used the words dataset and algorithm to limit the results to papers that use computational methods. Not all these publications directly ask cultural questions, but many do.
Why do computer scientists rarely work with large historical datasets of any kind? Typically, they justify their research by reference to already existing industrial applications—for example, search or recommendation systems for online content. The general assumption is that computer science will create better algorithms and other computer technologies useful to industry, government, NGOs, and other organizations. The analysis of historical artifacts falls outside this goal, and consequently not many computer scientists work on historical data (the field of digital heritage being one exception).
However, looking at many examples of these papers, it is clear that they are actually asking questions typical of the humanities or media studies in relation to contemporary media—but using bigger data to answer them. Consider, for example, these papers: “Quantifying Visual Preferences around the World” and “What We Instagram: A First Analysis of Instagram Photo Content and User Types.”26 The first study analyzes worldwide preferences for website design using 2.4 million ratings from forty thousand people in 179 countries. The study of aesthetics and design was traditionally part of the humanities. The second study analyzes the most frequent subjects of Instagram photos—a topic that can be compared to art historical studies of the genres of seventeenth-century Dutch art.
Another example is an influential paper called “What is Twitter, a Social Network or a News Media?”27 Published in 2010, it has since been cited 7,480 times.28 This paper describes the first large-scale analysis of the Twitter social network using 106 million tweets by 41.7 million users. The authors looked at trending topics, exploring “what categories trending topics are classified into, how long they last, and how many users participate.” Such analysis can be seen as an update of the classical work in the communication field, going back to the pioneering work of Paul F. Lazarsfeld and his colleagues in the late 1930s when they manually counted the topics of radio broadcasts.29 The big difference is that in the 1930s such broadcasts were created by a small number of professional stations and belonged to a small number of genres, whereas Twitter may have numerous topics with different levels of generality, time duration, and geographic coverage. At the same time, given that Twitter and other microblogging services represent a new form of media—like oil painting, printed books, and photography before them—understanding the specificity of Twitter as a medium can be also seen as a contribution to the humanities.
When the humanities were concerned with “small data”—that is, content created by single authors or small groups—the sociological perspective was only one of many options for interpretation—unless you were a Marxist. But when we start studying the online content and activities of millions of people, this perspective becomes almost inevitable. When we look at big cultural data, the cultural and the social closely overlap. Large groups of people from different countries and socioeconomic backgrounds (the sociological perspective) create, share, and interact with images, videos, and texts, and they make certain semantic and aesthetic choices when they do this (the humanities perspective). Because of this overlap, the kinds of questions investigated in the sociology of culture of the twentieth century, as exemplified by its most influential researcher, Pierre Bourdieu, are directly relevant for cultural analytics.30
Given that demographic categories are now taken for granted in our thinking about society, it appears natural today to group people into these categories and compare them in relation to social, economic, or cultural indicators. For example, the Pew Research Center regularly reports statistics on popular social platform use in the United States, breaking its user sample into demographics such as gender, ethnicity, age, education, income, and place of residence (urban, suburban, or rural).31 So if we are looking at types of social media content and behavior, such as the types of images shared and liked, the filters used, or selfie poses, it is logical to study the differences in content and activity among people from different cities and countries, ethnicities, socioeconomic backgrounds, levels of technical expertise, education, and so on. The first wave of relevant publications in computer science in the second half of the 2000s often did not do this, treating all social media users as one undifferentiated pool of humanity. However, later publications started to break users into demographic groups.
Although this is a very good move, we also want to be careful. Humanistic analysis of cultural phenomena and processes using quantitative methods should not be simply reduced to sociology—that is, to considering the common characteristics and behaviors of human groups defined using taken-for-granted criteria such as age, gender, income, and education. And given that we can now see daily the cultural choices of millions of individuals on social networks, is it still necessary to divide people into socioeconomic groups and look for differences between the cultural preferences and behaviors of these groups? The idea that a group or a single person has consistent cultural behaviors and tastes made sense in ancient and modern societies, when taste was governed by prescriptive aesthetic norms. (This was the society of Kant and of Pierre Bourdieu.) But with the numerous cultural choices available today, and the ability to “vote” for this or that choice with a single press of a button, we may find that the idea of a stable taste or stable “cultural personality” is an illusion.
Sociological tradition is concerned with finding and describing the general patterns in human behavior, rather than with analyzing or predicting the behaviors of particular individuals. Cultural analytics is also interested in patterns that can be derived from the analysis of large cultural datasets. However, ideally the analysis of the larger cultural patterns should also lead us to particular individual cases—that is, individual creators and their particular creations or cultural behaviors. (And as I just suggested, an individual can be further divided into separate personas with many different behaviors and cultural tastes.) For instance, the computational analysis of all photos made by a photographer during her long career may lead us to the outliers—the photos that are most different from all the rest. Similarly, we may analyze millions of Instagram images shared in multiple cities to discover the images unique to each city and the most original local photographers.
In other words, we may combine the concern of social science, and the sciences in general, with the general and the regular, and the concern of the humanities with the individual and the particular. Analyzing large cultural datasets to detect unique outliers, as in the examples just described, is one simple way of doing this, but it is not the only one.
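One simple way to operationalize this search for outliers can be sketched in a few lines. This is my own minimal illustration, not the method of any project discussed here, and it assumes each artifact (e.g., a photo) has already been reduced to a numeric feature vector: rank items by their distance from the collection’s average.

```python
import numpy as np

def find_outliers(features, k=3):
    """Return indices of the k items whose feature vectors lie
    farthest from the collection's mean (Euclidean distance)."""
    features = np.asarray(features, dtype=float)
    center = features.mean(axis=0)
    distances = np.linalg.norm(features - center, axis=1)
    # Sort by distance, largest first, and keep the top k
    return np.argsort(distances)[::-1][:k]

# Toy example: 2-D "style features" for six images; the last is unusual
features = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],
            [0.2, 0.2], [0.1, 0.15], [5.0, 5.0]]
print(find_outliers(features, k=1))  # index of the most atypical image
```

In practice the feature vectors might encode color, composition, or learned visual embeddings, and the same ranking idea surfaces the photos most unlike the rest of a photographer’s or a city’s output.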
The goal of science is to explain phenomena and develop compact mathematical models that describe how these phenomena work. Newton’s three laws of motion are a perfect example of how classical science approached this goal. Since the middle of the nineteenth century, a number of new scientific fields have adopted a different, probabilistic approach to describing physical reality. The first instance of this new approach was the statistical distribution describing the likely speeds of gas particles, presented by the physicist James Clerk Maxwell in 1860 (it is now called the Maxwell-Boltzmann distribution).
And what about the social sciences? Throughout the eighteenth and nineteenth centuries, many thinkers expected that, as in physics, the quantitative laws governing societies would eventually be found. In his 1785 Essay on the Application of Analysis to the Probability of Majority Decisions, the French mathematician Condorcet writes: “All that is necessary to reduce the whole of Nature to laws similar to those which Newton discovered with the aid of calculus, is to have a sufficient number of observations and a mathematics that is complex enough.” In the nineteenth century, the founder of sociology, Auguste Comte, makes a similar statement in Cours de philosophie positive (1830–1842): “Now that the human mind has grasped celestial and terrestrial physics, mechanical and chemical, organic physics, both vegetable and animal, there remains one science, to fill up the series of sciences of observations—social physics.”32
However, this never happened in a way similar to classical physics. The closest that nineteenth-century social thought came to postulating objective laws were the theories of Karl Marx. But by the end of the nineteenth century, economists demonstrated that his analysis was mostly wrong, and twentieth-century attempts to create new societies based on his theories all ended in disaster. Instead, when quantitative social sciences started to develop in the late nineteenth and early twentieth centuries, they also adopted a probabilistic approach. Instead of looking for deterministic laws of society, social scientists study correlations among measurable characteristics and model the relations between dependent and independent variables using various statistical techniques.
After the deterministic and probabilistic paradigms in science, the next paradigm was computational simulation—running models on computers to simulate the behavior of systems. The first large-scale computer simulation was created in the 1940s by the Manhattan Project to model a nuclear explosion. Subsequently, simulation was adopted in many hard sciences, and in the 1990s it was also taken up in the social sciences.
The twentieth-century humanities stayed away from looking for either physics-like laws of culture or modeling cultural processes probabilistically. Although literary studies, art history, and later film and media studies described various semantic and aesthetic patterns in the cultural corpora they studied, counting how frequently these patterns appeared in these corpora and interpreting the results was not seen as something humanists should be doing. A handful of people who were doing such quantitative analysis were real exceptions (e.g., Boris Jarkho in Russia in the 1930s or Barry Salt in the United States in the 1970s).
The explosion of digital cultural content and online interactions mediated by software and networks in the early twenty-first century has changed how culture functions. The volume of this content and of user interactions allows us to think of a possible science of culture. For example, by the summer of 2015, Facebook users were sharing four hundred million photos and sending forty-five billion messages daily,33 and the number of monthly users worldwide reached 2.5 billion by the end of 2019. This scale is still much smaller than that of atoms and molecules; for example, 1 cm³ of water contains 3.33 × 10²² molecules. However, the number of weekly Facebook messages already exceeds the number of neurons in an average adult human brain, estimated at around one hundred billion.
Although the idea of a science of culture may terrify some readers, you should not be scared. As I explained, the concept of science as a set of hard laws is only one among others. Today, science includes at least three different fundamental approaches to studying and understanding phenomena: deterministic laws, statistical models, and simulation. Let’s continue our thought experiment and ask which of these approaches will be most useful for a hypothetical science of culture.
Looking at the papers of computer scientists who are studying social media datasets, it is clear that their default approach is statistical. They characterize social media data and user behavior in terms of probabilities. They frequently create statistical models—mathematical equations that specify the relations between variables that may be described using probability distributions rather than specific values. Many papers published after 2010 also use supervised machine learning—a paradigm for teaching a computer to classify or predict values of new data using already existing examples. Note that in both cases, a model usually can correctly describe or classify only some of the data and not all of it. This is typical of the statistical approach.
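To make the supervised learning paradigm described above concrete, here is a minimal toy sketch, with all data and category names invented for illustration (it is not drawn from any of the papers discussed): a classifier learns from already labeled examples and then assigns labels to new, unseen data points.

```python
# Toy illustration of supervised machine learning: a 1-nearest-neighbor
# classifier that learns from labeled examples and predicts labels for
# new data points. All numbers and labels here are invented.

def nearest_neighbor_predict(training_data, new_point):
    """Return the label of the training example closest to new_point."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    closest = min(training_data, key=lambda ex: distance(ex[0], new_point))
    return closest[1]

# Hypothetical training set: (features, label) pairs, where features
# might be measurements of an image and labels content categories.
training_data = [
    ((0.9, 0.1), "landscape"),
    ((0.8, 0.2), "landscape"),
    ((0.2, 0.9), "portrait"),
    ((0.1, 0.8), "portrait"),
]

print(nearest_neighbor_predict(training_data, (0.85, 0.15)))  # landscape
print(nearest_neighbor_predict(training_data, (0.15, 0.85)))  # portrait
```

As the chapter notes, such a classifier will typically label only most, not all, new inputs correctly: its accuracy depends on how well the training examples cover the variety of the data.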
Computer scientists use statistics differently than social scientists. The latter want to explain social, economic, or political phenomena—for example, the effect of family background on children’s educational performance. Computer scientists are generally not concerned with explaining the patterns they discover in social media or other cultural data by referencing external social, economic, or technological factors. Instead, they typically either analyze social media phenomena on their own terms or try to predict outside phenomena using information extracted from social media datasets. Examples of the former include network measurements of connections between friends in a social network, or a statistical model that predicts the effect of filter use on the number of views and comments an Instagram photo may receive. An example of the latter is the Google Flu Trends service that was designed to predict flu activity using a combination of Google search data and US Centers for Disease Control and Prevention (CDC) official flu data.34
The difference between deterministic laws and nondeterministic models is that the latter describe only probabilities, not certainties. The laws of classical mechanics apply to all macroscopic objects. In contrast, a probabilistic model for predicting the numbers of views and comments for an Instagram photo as a function of filter use cannot predict these numbers exactly for any particular photo; it only describes the overall trend. So, if we have to choose between deterministic laws and probabilistic models for a hypothetical science of culture, the second approach is better. If instead we start postulating deterministic laws of human cultural activity, what happens to the idea of free will? Even in the case of seemingly almost automatic cultural behavior, such as social media photos of perfect beaches or luxury hotels getting likes, we do not want to reduce humans to mechanical automata that always behave in the same way when presented with the appropriate stimulus.
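The distinction between a trend and an exact prediction can be sketched in a few lines. The following toy model, with entirely invented view counts, fits a least-squares line relating filter use (0 = no filter, 1 = filter) to views: the fitted line captures the average difference between the two groups, while individual photos scatter around it.

```python
# Toy sketch of a trend model (all numbers invented): ordinary
# least-squares fit of views as a function of filter use. The line
# describes the average trend, not any particular photo.

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical sample: filter use and view counts for eight photos.
filter_used = [0, 0, 0, 0, 1, 1, 1, 1]
views       = [90, 110, 80, 120, 140, 180, 150, 170]

slope, intercept = fit_line(filter_used, views)
print(intercept)          # average views without a filter: 100.0
print(intercept + slope)  # average views with a filter: 160.0
```

The model says that, on average, filtered photos in this invented sample get sixty more views, yet no single photo in the data receives exactly the predicted value; that gap between trend and instance is what the probabilistic approach accepts by design.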
The current focus on probabilistic models of online activity in computer science studies of social media data leaves out the third scientific paradigm: simulation. In sociology, economics, political theory, and history, simulation has already been in use for a few decades, and recently a few digital humanities scholars have also shown interest in using this paradigm.35
In 2009, scientists at the IBM Research Almaden Center simulated the human visual cortex using 1.6 billion virtual neurons with nine trillion synapses.36 Given this, why can’t we begin thinking about how to simulate, for instance, all content shared every month on Instagram? Or all content shared by all users of major social networks? Or, more interestingly, can we simulate the evolution of the types of content being shared and aesthetic strategies over time?
The point of such simulations is not to get everything right or to precisely predict what people will be sharing next year. Instead, we can follow the authors of the influential textbook Simulation for the Social Scientist, who state that one of the purposes of simulation is “to obtain a better understanding of some features of the social world” and that simulation can be used as “a method of theory development.”37 Because computer simulation requires developing an explicit and precise model of the phenomenon being simulated, thinking about how cultural processes might be simulated can help us develop more explicit and detailed theories of those processes.38
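A minimal sketch can suggest what "simulation as theory development" might look like for cultural sharing. The following toy agent-based model, with invented parameters and no connection to any real platform, has agents that each prefer one of two content types; at every step a random agent copies another agent's preference, so one type can gradually come to dominate.

```python
# Toy agent-based simulation sketch (invented parameters): agents share
# one of two content types; at each step a randomly chosen agent copies
# the preference of another randomly chosen agent. This is a minimal
# imitation model for thinking about taste dynamics, not a prediction.

import random

def simulate(num_agents=100, steps=2000, seed=42):
    random.seed(seed)
    # Each agent starts with a random content preference.
    agents = [random.choice(["photo", "video"]) for _ in range(num_agents)]
    for _ in range(steps):
        imitator = random.randrange(num_agents)
        model = random.randrange(num_agents)
        agents[imitator] = agents[model]  # copy another agent's choice
    return agents.count("photo"), agents.count("video")

photos, videos = simulate()
print(photos + videos)  # always 100: the population size is conserved
```

Building even a model this simple forces explicit choices—who imitates whom, how often, whether novelty ever enters—and each choice is a small theoretical claim about how cultural preferences spread.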
And what about big data? Does it represent a new paradigm in science that allows us to think about and study phenomena differently? In the natural sciences, the impact of big data depends on the particular field. But if we are talking about research methods and techniques, the developments in computer hardware in the 2000s, including increasing CPU speed and RAM size and the use of GPUs and computing clusters, were likely more important than the availability of larger datasets. And although supervised machine learning with large training datasets has achieved remarkable successes in some cases, as can be seen in industrial applications such as speech recognition and synthesis or image content categorization, its role in the sciences is more ambiguous. If we assume that the goal of science is to provide an explanation and a mathematical model of some natural or biological phenomenon, the existence of a successful machine learning system that can correctly classify new inputs usually does not by itself explain that phenomenon.
However, as I argue in this book, big data is certainly of fundamental importance for the study of culture. (See in particular the section “Why We Need Big Data to Study Cultures” in chapter 5.) But the magnitude of this impact also has to do with the fact that the humanities and media theory did not use scientific principles and methods before. So along with big data, the humanities are also discovering how scientific thinking and methodologies can be applied to their subjects. And here the concepts and methods of sampling, feature extraction, and exploratory data analysis are more important than data size (see chapters 5–9).