Conclusion: Can We Think without Categories?

Since the turn of the twenty-first century, global digital culture has entered a new stage. Computational analysis of cultural digital artifacts, their online “lives,” and people’s interactions with these artifacts and each other has redefined the dynamics and mechanisms of culture. This analysis is now used by numerous players—the companies that run social networks, NGOs planning their outreach, tens of millions of small businesses around the world advertising online, and hundreds of millions of people who plan, create, and manage their self-presentation on the web. At the same time, fashion companies, book and music publishers, television and film producers, food and beverage companies, hotel chains, airlines, and other big companies in the culture and lifestyle industries use relevant big data and data science to design, test, and market their offerings and to predict demand.

Some of the same computational methods and their underlying concepts also make possible exciting new research about cultures and societies in fields that include computer science, data science, computational social science, digital humanities, urban studies, media studies, data visualization, data design, data art, and so on. Cultural analytics is one such field; it emerged in the second half of the 2000s. The goal of this book was to present my own cultural analytics journey and what I have learned since starting this research in 2007. The book took you through the sequence of steps involved in representing and exploring cultural phenomena as data. My goal was to examine the concepts involved in each step, questioning normal ways of doing things and pointing out possibilities not yet explored.

The book’s particular perspective reflects the three domains in which I have been working for a long time—media theory, digital art, and data science. Its media theory contributions include an analysis of some of the key concepts and practices of data science and of a new stage in the development of modern technological media that I call media analytics. This stage is characterized by algorithmic large-scale analysis of media and user interactions and the use of the results in algorithmic decision-making: contextual advertising, recommendations, search and other kinds of information retrieval, filtering of search results and user posts, content categorization of user photos, document classification, plagiarism detection, video fingerprinting, automatic news production, and more. And we are still only at the beginning of this stage. Given the trajectory of gradual automation of more and more functions in modern society using algorithms, I expect that the production and customization of many forms of commercial culture will also gradually become more automated. Thus, in the future, the already developed digital distribution platforms and media analytics will be joined by a third part: algorithmic media generation. Experimental artists, designers, composers, poets, and filmmakers have been using algorithms to generate work since the early 1960s, but in the future this is likely to become the norm across the culture industries. We can see this at work already today in automatically generated news stories, online content written about topics suggested by algorithms, commercially distributed music generated with AI, movie acquisition and release decisions, video games, and the production of television shows and TV broadcasts of sports events in which multiple robotic cameras automatically follow and zoom in on dynamic human performances. (My 2018 book AI Aesthetics discusses cultural uses of AI in more detail.1)

Until the beginning of the twenty-first century, the key cultural techniques we used to represent and reason about the world and other humans included natural languages, capturing reality as media information (photos, videos, and audio recordings), map making, logic, calculus, and digital computers. The core concepts of the data society covered in this book are now just as important. They form the data society’s mind—its particular ways of encountering, understanding, and acting on the world and the humans in it. And this is why, even if you have no intention of doing practical cultural analytics research yourself, you still need to become familiar with these new data-centered cultural techniques.2

Contemporary data science includes hundreds of algorithms and dozens of methods for working with data. They belong to a number of larger areas defined by different types of operations—data preparation, exploratory data analysis (including visualization), descriptive statistics, unsupervised machine learning (clustering, dimension reduction, etc.), statistical models, supervised machine learning (with its main applications—classification and regression), time series analysis, network analysis, and others. Some of these areas had already started to develop in the first decades of the twentieth century; others became popular only recently because their algorithms and methods require faster computers or rely on very large datasets.

Among these areas, two in particular are so important today that we can think of them as types of the data society’s cognition: unsupervised machine learning and supervised machine learning. In our work, we focus on the first as opposed to the second because we want to see the structures of cultural fields and their “landscapes” and find groupings and connections that can be revealed by starting with objects’ features—rather than impose existing categories and classification systems on cultural data. But you should become familiar with both approaches because both can be used creatively. Certainly, both descriptive statistics and visualization should also be in your toolbox. Among all other areas, network analysis and time series analysis are particularly relevant for exploring culture, in my view. And you should learn the methods specific to the particular media type you are interested in—natural language processing, computer vision, music information retrieval, or spatial analysis (used for the analysis of texts, images and videos, music and sound, and space, respectively).
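To make the contrast concrete, here is a minimal sketch of the unsupervised approach: a plain k-means pass that discovers groupings from feature values alone, with no categories given in advance. The two-dimensional “feature vectors” and the starting centroids are hypothetical, and a real project would use a library implementation (e.g., scikit-learn) rather than this toy loop.

```python
# Minimal k-means sketch: groupings emerge from features alone,
# rather than from pre-existing categories. All numbers are invented.
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [math.dist(p, c) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Six artifacts described only by two features (say, brightness and
# saturation); two visually distinct groups are present in the data.
points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
          (0.8, 0.9), (0.9, 0.8), (0.85, 0.95)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (1.0, 1.0)])
print(len(clusters[0]), len(clusters[1]))  # → 3 3
```

The algorithm never sees labels such as “genre” or “period”; the two groups are recovered purely from the geometry of the feature space.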

Do We Want to “Explain” Culture?

Approaching cultural processes and artifacts as data can lead us to ask the kinds of questions about culture that people who professionally write about it, curate it, and manage it do not normally ask today—because such questions would go against the accepted understanding of culture, creativity, aesthetics, and taste in the humanities, popular media, or the art world. For example, would collectors and museums pay millions for the works of contemporary artists if data analysis shows that they are completely unoriginal despite their high financial valuations? Or what if data analysis reveals that trends in the art world can be predicted as accurately as trends in fashion?

The most well-known and influential quantitative analysis of cultural data within the social sciences remains Pierre Bourdieu’s Distinction (1979). As I already mentioned, the data used in this analysis comes from surveys of the French public. For the analysis and visualization of this data, Bourdieu used the then recently developed method of correspondence analysis. It is similar to PCA but works with discrete categories, showing their relations in graphical form. For Bourdieu, this form of data analysis and visualization went along with his theoretical concepts about society and culture, and that is why it plays a central role in his book. Distinction is Bourdieu’s most well-known book, and in 2012 Bourdieu was the second-most quoted academic author in the world, just behind Michel Foucault.3

Bourdieu did not use the most common method in quantitative social science: “explaining” some observed phenomena, represented by dependent variables, by predicting their values using other phenomena, represented by independent variables, via a mathematical model. However, given the variety and the scale of cultural data available today, maybe now such a method can produce interesting results?

What would happen if we also took other standard methods of quantitative social science and used them to “explain” the seemingly elusive, subjective, and irrational world of culture? For example, we can use factor analysis to analyze the choices and preferences of local audiences around the world for music videos from many countries to understand the dimensions people use to compare musicians and songs. Or we can use regression analysis and a combination of demographic, social, and economic variables to model choices made by cultural omnivores—people who like cultural offerings associated with both elite and popular taste.4 (For the first quantitative study of cultural taste that uses large social media data, see the 2016 paper “Understanding Musical Diversity via Online Social Media.”)5

In quantitative sociology of culture and marketing and advertising research, investigators ask similar questions all the time in relation to consumer goods and cultural artifacts. And computer scientists do this as well when they analyze social media and web data. But this does not happen in the humanities. In fact, if you are in the arts or humanities, such ideas may make you feel really uncomfortable. And this is precisely why we should explore them.

The point of any application of quantitative or computational methods to the analysis of culture is not whether it ends up being successful, unless you are in the media analytics business. Rather, such an application can force us to look at the subject matter in new ways, to become explicit about our assumptions, and to precisely define our concepts and the dimensions we want to study.

So at least as a thought experiment, let’s think about applying the quantitative social science paradigm to culture. Quantitative social science aims to provide explanations of social phenomena, expressed as mathematical relations among small numbers of variables (what influences what and by how much). Once such models are created, they are often used for prediction. The common statistical methods for such explanations are regression models, versions of factor analysis, or fitting a probability distribution to the data. The latter means determining if observed data can be described using a simple mathematical model: the Gaussian distribution, the log-normal distribution, the Pareto distribution, and so on. For example, in quantitative film studies, a number of researchers found that shot durations in twentieth-century Hollywood films follow a log-normal distribution.6
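As an illustration of the distribution-fitting approach, here is a hedged sketch of checking a log-normal model: a variable is log-normal when its logarithm is normally distributed, so the parameters are estimated on the log scale. The shot lengths below are invented numbers, not data from the studies cited.

```python
# Hedged sketch: fitting a log-normal model to shot lengths (seconds).
# The values are hypothetical, chosen only to illustrate the method.
import math
import statistics

shot_lengths = [2.1, 3.4, 1.8, 5.2, 2.9, 4.1, 1.5, 8.3, 2.4, 3.7]
logs = [math.log(x) for x in shot_lengths]

# If the data is log-normal, the logs are normal with these parameters.
mu = statistics.mean(logs)
sigma = statistics.stdev(logs)

# Quick sanity check: the median of a log-normal distribution is
# exp(mu), which should be close to the sample median.
print(round(math.exp(mu), 2), round(statistics.median(shot_lengths), 2))
```

A real study would follow this with a formal goodness-of-fit test rather than a single point comparison.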

Are we interested in trying to predict the future of culture with mathematical models? Do we need to explain culture through external economic and social variables? Do we really need to find that an author’s biography, for example, accounts for 30 percent of the variability in her works? Or that age, location, and gender variables account for, let’s say, 20 percent of the variability in Instagram posts? And even if we find that a combination of some variables can predict the content and style of the Instagram posts of some users with 95 percent accuracy, probably what is really important in this cultural sphere is the 5 percent we cannot predict.

Applied to real-life data, regression models typically can predict only some of the data, not all of it. The part that is not predicted is often treated as “noise” because it does not fit the mathematical model. In fact, in the standard presentation of regression analysis, the term that is added to the model to represent the unpredicted data is called the error term, or noise. The assumption is that the noise is due to some possibly random variation, which adds disturbance to the process we are observing and modeling. However, in cultural processes, the parts that statistical models cannot predict are what is most interesting. We call something original if it can’t be predicted.
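The split between the predicted part and the residual “noise” can be shown with a minimal one-variable regression. The numbers are hypothetical; the point is only that the model decomposes every observation into a fitted value plus a residual, the part the model cannot account for.

```python
# Sketch: ordinary least squares for y = a + b*x, with hypothetical
# data. The residuals are the "error term" of the standard account --
# for cultural data, arguably the most interesting part.
from statistics import mean

x = [1, 2, 3, 4, 5]            # e.g., an external variable
y = [2.1, 4.2, 5.8, 8.1, 9.9]  # e.g., a measured cultural feature

# Closed-form OLS estimates for slope and intercept.
b = sum((xi - mean(x)) * (yi - mean(y)) for xi, yi in zip(x, y)) / \
    sum((xi - mean(x)) ** 2 for xi in x)
a = mean(y) - b * mean(x)

# Each observation = fitted value + residual.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print([round(r, 2) for r in residuals])  # → [-0.02, 0.13, -0.22, 0.13, -0.02]
```

The residuals sum to zero by construction; what the model treats as disturbance is exactly what a study of originality might foreground.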

Can we pay equal attention to the norms and the exceptions? Our common histories of culture have often focused too much on the original elements and the new inventions. This focus on the avant-garde in human history comes at the expense of the norm, the typical, the conventional. The use of bigger cultural data and quantitative methods is well suited for the study of these norms. As Moretti wrote already in 1998 in relation to literature: “A history of literature as a history of norms, then: a less innocent, much ‘flatter’ configuration than the one we are used to; repetitive, slow—boring, even. But this is exactly what most of life is like, and instead of redeeming literature from its prosaic features we should learn to recognize them and understand what they mean.”7 I fully agree—but I also hope that cultural analytics can do more than this. Ideally, it should look both at the boring and the exciting, the norms and the inventions. And this may require a fundamental rethinking of statistical and other data science methods if they can only account for the regular parts of cultural history.

Is the Goal of Cultural Analytics to Study Patterns? (Yes and No)

While humanities often focus on individual human creations, they are also concerned with larger cultural patterns. Some of the terms that twentieth-century humanities used to refer to such patterns are techniques, conventions, structures, types, themes, topics, and motifs. These patterns were discussed as common features of works or authors who belonged to particular genres, cultural movements, historical periods, national cultural traditions, or subcultures.

Cultural analytics, as I see it, is situated within the same paradigm, but it works with bigger and more representative samples, expands the number of dimensions for study, and uses computer techniques to measure and quantify these dimensions. Most importantly, rather than starting with already accepted cultural categories, it analyzes “raw” cultural data to find patterns, connections, and clusters that often do not correspond to these categories.

Cultural analytics thus also can be defined as the quantitative study of cultural patterns on different scales. But we need to keep in mind that any cultural pattern we may detect and describe captures similarities among a number of artifacts on only some dimensions, ignoring their other differences. When we start considering these differences, what looked like a single group of similar artifacts reveals the presence of multiple and distinct smaller groups. A single pattern breaks down into many patterns. Thus, any cultural analytics results are always relative to what dimensions we choose to compare and which ones we choose, for the time being, to ignore.

In summary, although we want to discover repeating patterns on different scales in cultural data, we should always remember that they account for only some aspects of the artifacts and their reception.

In the previous section, I briefly explored the implications of looking at culture the way twentieth-century social scientists looked at society. Do we actually want to do this? Cultural analytics does not want to “explain” most or even some data using a simple mathematical model and treat the rest as an error or as noise just because our mathematical model cannot account for it. And we do not want to assume that cultural variation is a deviation from a mean. We also do not want to assume that large proportions of works in a particular medium or genre follow a single or only a few patterns, such as the hero’s journey, the golden ratio, or binary oppositions, or that every culture goes through the same three or five stages of development as was claimed by some art historians in the nineteenth century.

I believe that we should study cultural diversity without assuming that it is caused by variations from certain types or structures. This is very different from the modern thinking of quantitative social science and the statistical paradigm it adapted. As I explained in this book, the historical development of statistics in the eighteenth and nineteenth centuries led that field to consider observed data in terms of deviations from the mean.

Does this mean that we are only interested in the differences and that we want to avoid any kind of reduction at all costs? To postulate the existence of cultural patterns is to accept that we are doing at least some reduction when we think and analyze data. Without this, we cannot compare anything, unless we are dealing with extreme cultural minimalism or seriality, such as the artworks of Sol LeWitt, who kept everything else equal and varied only a single variable.

My answer is presented in the next two paragraphs, and for me, these are the most important paragraphs in the whole book. In them, I describe cultural analytics as it developed until now, and I sketch a more difficult task ahead.

Unless it is a 100 percent copy of another cultural artifact or produced mechanically or algorithmically to be identical with others, every expression and interaction is unique. In some cases, this uniqueness is not important in analysis, and in other cases it is. For example, certain facial features we extracted from a dataset of Instagram self-portraits revealed interesting differences in how people represent themselves in this medium in particular cities and time periods we analyzed. But the reason we do not get tired looking at endless faces, bodies, and landscapes when we browse Instagram is that each of them is unique.

The ultimate goal of cultural analytics can be to map and understand in detail the diversity of contemporary professional and user-generated artifacts created globally—that is, to focus on what is different among numerous artifacts and not only on what they share. In the nineteenth and twentieth centuries, the lack of appropriate technologies to store, organize, and compare large cultural datasets contributed to the popularity of reductive cultural theories. Today I can use any computer to map and visualize thousands of differences among tens of millions of objects. We no longer have an excuse to focus only on what cultural artifacts or behaviors share, which is what we do when we categorize them or perceive them as instances of general types. So though we may have to start by extracting patterns just to draw our initial maps of contemporary cultural production and dynamics, given their scale, eventually these patterns may recede into the background or even completely dissolve as we focus only on the differences among individual objects.

How to Think without Categories

In my experience, these ideals are easier to state than to put into practice. The human brain and languages are categorizing machines. Our cognition constantly processes sensory information and categorizes it. Observing a pattern is like constructing a new category: a recognition that some things, or some aspects of these things, have something in common. So how can we learn to think about culture without categories?

How do we move away from the assumption of the humanities (which until now “owned” thinking and writing about culture) that the goal of research is the discovery and interpretation of general cultural types, be they “modernism,” “narrative structures,” or “selfies”? How do we instead learn to see cultures in more detail, without immediately looking for, and noticing, only types, structures, or patterns?

In this book, I described one possible strategy for doing this, which borrows ideas from data science. First, we need sufficiently large cultural data samples. Next, we extract sufficiently large numbers of features that capture characteristics of the artifacts, their reception and use by audiences, and their circulation. (We also need to think more systematically about how to represent cultural processes and interactions—especially since today we use interactive digital cultural media as opposed to historical static artifacts.) Once we have such datasets, we can explore them using data science methods—while also keeping in mind that the features can’t capture everything, so our human abilities to see and reason about similarities and differences and our knowledge of cultural histories and theories are still crucial.
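As a toy illustration of the feature-extraction step in this strategy, here is a sketch that turns a tiny “image” (a grid of grayscale values) into a two-number feature vector. Both the images and the chosen features (mean brightness and a crude contrast measure) are hypothetical; real work would rely on libraries such as Pillow or OpenCV and on far richer features.

```python
# Hedged sketch of feature extraction: represent each artifact as a
# small vector of measured characteristics. Images here are invented
# grids of grayscale pixel values (0-255).
def extract_features(image):
    """Return a tiny feature vector: mean brightness and contrast
    (here, simply the range between darkest and lightest pixel)."""
    pixels = [p for row in image for p in row]
    mean_brightness = sum(pixels) / len(pixels)
    contrast = max(pixels) - min(pixels)
    return (mean_brightness, contrast)

dark_flat = [[10, 12], [11, 13]]
bright_contrasty = [[5, 250], [240, 20]]
print(extract_features(dark_flat))  # → (11.5, 3)
print(extract_features(bright_contrasty))
```

Once every artifact in a sample is reduced to such a vector, the exploratory methods discussed above (clustering, visualization, and so on) can be applied, with the caveat from the text that no feature set captures everything.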

In part III, I discussed some exploratory methods we have been using for visual media datasets. But now I want to suggest an alternative and more general way to think about what it means to observe cultures. It further expands the idea expressed earlier that one possible goal of cultural analytics is to focus on what is different among numerous cultural artifacts, as opposed to what they have in common, as we did in the nineteenth and twentieth centuries.

Most generally, I suggest that to observe and analyze culture means to be able to map and measure five fundamental characteristics. The first four are diversity, uniqueness, dynamics (temporal changes), and structure. The last term here means clusters, networks, and other types of relations between many objects—that is, structure as it is understood in exploratory data analysis and unsupervised machine learning, as opposed to in 1960s structuralism. In situations in which artifacts were created using a prescriptive aesthetic or template—for example, Instagram filters provided by the app or the Instagram themes described and illustrated in thousands of advice posts—we can also consider a fifth characteristic: variability. For example, if we analyze a sample of Instagram images, we can first detect the presence of all themes suggested in many posts and then look at deviations from these themes and also at images that do not follow any of them. But we do not want to assume that the deviation from the type (or from a mean or another statistic we can compute for our dataset) is a necessary measurement for all cultural situations.
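One simple way to operationalize variability as deviation from a template, subject to the caveat just stated, is sketched below: the “theme” is taken to be the mean feature vector of a sample, and each artifact’s deviation is its distance from that mean. The feature vectors are invented for illustration.

```python
# Hedged sketch: variability as distance from a template. The feature
# vectors are hypothetical (say, brightness, saturation, and hue of
# images that mostly follow one visual theme).
import math

images = [(0.7, 0.5, 0.3), (0.72, 0.48, 0.31), (0.69, 0.52, 0.3),
          (0.2, 0.9, 0.8)]  # the last image departs from the theme

# Here the "theme" is simply the mean feature vector of the sample.
theme = tuple(sum(dim) / len(images) for dim in zip(*images))
deviations = [math.dist(img, theme) for img in images]

# Images far from the theme are candidates for the atypical part.
outlier = deviations.index(max(deviations))
print(outlier)  # → 3
```

As the text stresses, treating deviation from a mean as the measurement is itself an assumption, appropriate only for cultural situations where a prescriptive template actually exists.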

The development and testing of measures of cultural variability, diversity, temporal change, influence, uniqueness, and structure appropriate for many kinds of cultural artifacts and experiences is a massive theoretical and practical task. Certainly, cultural analytics is not going to quickly solve this challenge by itself. Many such measures, or concepts that can be turned into measures, have already been developed in statistics, information theory, computer science, ecology, demography, machine learning, and other fields—for example, diversity indexes in biology, the Gini coefficient in economics, the index of dissimilarity in demographics, and the divergence measure and shared information distance in information theory. Seeing which measures work better with different types of cultural data, and refining them, is an important direction for cultural analytics as I see it.

As an example of this research, in the Visual Earth project we used the Gini index to measure social media inequality—that is, the unequal spatial distribution of volumes of social media posts in a particular geographic area or between areas. We used a dataset of 7,442,454 public and geocoded Instagram images shared in Manhattan over five months (March–July) in 2014, along with selected demographic data for 287 census tracts in Manhattan. (The inequality of Instagram sharing in Manhattan turned out to be greater than the inequalities in levels of income, rent, and unemployment.8)
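A minimal sketch of the Gini index itself, using a standard sorted-values formula, might look like this (the per-area post counts are hypothetical, not the Manhattan data):

```python
# Sketch of the Gini index applied to "social media inequality":
# counts of posts per geographic area. The counts are invented.
def gini(values):
    """Gini coefficient: 0 means perfectly equal; values approaching 1
    mean one area has nearly everything. Uses the standard formula
    based on the rank-weighted sum of sorted values."""
    v = sorted(values)
    n = len(v)
    rank_weighted = sum((i + 1) * x for i, x in enumerate(v))
    return (2 * rank_weighted) / (n * sum(v)) - (n + 1) / n

equal = [100, 100, 100, 100]      # posts spread evenly across areas
unequal = [1, 2, 5, 392]          # most posts come from one area
print(gini(equal) < gini(unequal))  # → True
```

The same function works unchanged for any nonnegative quantity per unit (income per tract, plays per song), which is what makes it attractive as a general cultural measure.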

For inspiration for thinking about quantitative analysis of diversity, variability, uniqueness, structure, and temporal change, we can turn to the pioneering work of Russian quantitative humanities scholar Boris Jarkho from the 1920s and the 1930s. A presentation at the 2018 Digital Humanities conference to which I contributed summarized features of his approach: “variety, mutability and continuity as the methodology’s foundations; discovering typical patterns in literary works using comparison and statistics; finding changes in patterns across time and genres; biology and systems approaches as benchmarks for future literary studies.”9 As Jarkho wrote,

Theoretically, literature can be perceived as a structure, not as a combination but as a system of proportions and relations between its properties. The system is perceived to be in a continuous movement, with properties (features) moving along the curves of various types, sometimes independently of each other, sometimes in pairs or sets of features. This, in turn, results in understanding the organic dynamics (change) in literature, with a variety of qualitative, quantitative and hierarchical concepts to follow. . . . The specifics of these concepts is, first, that they are very close to how contemporary science understands the concepts of life (organic world) and second, that most of these properties can be measured.

Jarkho’s vision of cultural dynamics as temporal curves showing changes in features is very relevant today, but one of his ideas—literature as a single system—may be more problematic for us. It is crucial to remember, however, that he was theorizing the development of literature in previous centuries, when the number of professional creators active in each cultural field in a particular country in a given period was relatively small, so for these creators and their critics their professional world could indeed have appeared as a single system. When we consider certain cultural fields today with a small number of agents, such as big social network companies and other massive online platforms, their behaviors may also look like a system (e.g., they closely monitor each other and periodically include new features added by one company in their own platforms). But in other contemporary fields with millions of participants, such as music, fashion, filmmaking, or design, nobody can hear, read, view, or interact with all the works being created, and even large-scale computational analysis (i.e., cultural analytics) can’t reveal all patterns.

Learning to See at a New Scale

To explore is to compare. And to compare, we need first to see. To see contemporary culture at its new scale, we can use data science methods and larger datasets. But even if a given historical or contemporary cultural field or phenomenon does not have such scale, data-driven analysis helps us question our inherited intuitions, assumptions, and categories.

Until the twenty-first century, we typically compared small numbers of artifacts, and the use of our human cognitive capacities unaided by machines was considered sufficient. But today we need to be able to compare the millions or billions of cultural artifacts being created. User-generated content, professionally produced shared content, and online interactions (between users, between users and software, and between users and artifacts) are examples of large-scale cultural data, but some digitized collections of historical artifacts can also be large, running into tens of millions of items (e.g., the Internet Archive collections,10 the Europeana collections, or the Russian proza.ru and stihi.ru). So even if we use sampling to select small parts of such cultural universes for analysis, we have no choice but to use computational methods.

If we do not learn to see at sufficient resolution what people today create and how they behave culturally, any theories or interpretations we may propose based on our intuitions and received knowledge are likely to be misguided. This was the case when we analyzed data on 16 million Instagram images shared in 17 global cities in 2012–2015, one million manga pages, five thousand paintings by Impressionist artists, and other cultural datasets. In each case, my assumptions about what I was going to see, based on intuition and accepted knowledge, were overturned.

This computational cultural vision can be understood as an extension of the most basic method of the humanities—comparing cultural artifacts, periods, authors, genres, movements, themes, techniques, and topics. So though it may be radical in terms of its scale—how much you can see in one glance, so to speak—it continues the humanities’ most basic and oldest way of thinking.

I want you to think of cultural analytics as a toolkit of ideas, methods, and techniques to experiment, explore, discover, and communicate your discoveries. This book’s purpose was to discuss why we need this toolkit today and to present the core concepts and a few methods we found most useful in our explorations. Now it is up to you to expand this toolkit and make your own discoveries.