3

Culture Industry and Media Analytics

Culture today is infecting everything with sameness. Film, radio, and magazines form a system. . . . Interested parties like to explain culture industry in technological terms. Its millions of participants, they argue, demand reproduction processes that inevitably lead to the use of standard processes to meet the same needs at countless locations. . . . In reality, the cycle of manipulation and retroactive need is unifying the system ever more tightly.

—Max Horkheimer and Theodor W. Adorno, Dialectic of Enlightenment, 19471

Scuba is Facebook’s fast slice-and-dice data store. It stores thousands of tables in about 100 terabytes in memory. It ingests millions of new rows per second and deletes just as many. Throughput peaks around 100 queries per second, scanning 100 billion rows per second, with most response times under 1 second.

—J. Wiener and N. Bronson, “Facebook’s Top Open Data Problems,” 20142

Our data is literally a big deal. Measuring every second of engagement on every single page on most every major website in the globe means a scientifically defined insane amount of data.

—Chartbeat, “About,” 20153

When I first thought of cultural analytics in November of 2005, the paradigm of computing culture—using algorithms to analyze online digital content and people’s online behaviors—was already largely in place. The first web search engines were created in 1993–1994, and Google started operating in 1998. In March 2005, Amazon began to display a few statistics calculated from the texts of all its books in searches, such as the most unique phrases per book and the one hundred most common words in a book.4 Earlier, in 2001, Amazon engineers presented a paper describing an important recommendation algorithm that was later implemented on the Amazon site: item-to-item collaborative filtering.5 The social network Friendster, which launched in 2002, patented a few fundamental techniques of social networking: “A method and apparatus for calculating, displaying and acting upon relationships in a social network,” “System and method for managing connections in an online social network,” and “Method of inducing content uploads in a social network.”6 However, as of 2005, social networking was not yet a massive phenomenon, the iPhone did not exist, and the term data science was not yet popular.

This situation changed dramatically over the next few years. The types of digital cultural data being analyzed, the methods for analysis, the scale of data, and the number of companies involved all grew quickly. By November 2017, Facebook was available in 101 languages; 75 percent of its two billion users were outside the United States and Canada.7 Instagram reached eight hundred million users by September 2017, while the Chinese networks WeChat, QQ, and Qzone reached 960, 850, and 650 million users, respectively.8 When the top US social networks limited access to their data, hundreds of academic researchers signed a letter explaining why they needed this access, giving examples of numerous social science studies that would be impossible without it.9 In 2018, Facebook—together with its partners, Harvard University’s Institute for Quantitative Social Science and the Social Science Research Council—started a project called Social Science One, with the goal of enabling “academics to analyze the increasingly rich troves of information amassed by private industry in responsible and socially beneficial ways.”

In this chapter, I will discuss large-scale analysis of online cultural content and users’ interactions with this content and each other by companies, NGOs, and other actors. I call these practices media analytics. While cultural analytics and media analytics share the same idea—large-scale computational analysis of cultural artifacts and behaviors—their goals and motivations are different. Media analytics always serves practical goals: deciding when and what ads to show to users, indexing billions of web pages as part of a search engine, automatically picking the best images to represent businesses on a recommendation site, and so on. The goals of cultural analytics as I see it are the observation and analysis of global culture, and the development of analytical concepts and methods that combine the respective strengths of data science, the humanities, and media theory.

Another key difference is what happens with the results of the analysis. Companies that use media analytics do so to improve their services and almost never make the detailed results of such analysis available (Google Trends is one exception). Cultural analytics researchers should not only publish their research findings and datasets, but ideally also create public interactive visualization and exploration tools accessible to everybody.

Cultural analytics research can certainly benefit from learning the details of how industry analyzes digital media artifacts and user activities. In its quest to optimize products, automate decisions, and create personalized experiences, industry often analyzes more dimensions and details of cultural artifacts and interactions than researchers in the humanities or social sciences were ever able to do or even imagine. Another fundamental aspect of media analytics is its scale. In the humanities, scholars of literature, cinema, music, digital media, and other art forms often think about the effects of artistic works on readers, viewers, and listeners using only their own experience with particular works. In the social sciences, the sociology of culture and communication studies have used surveys and interviews to learn about the cultural behaviors of larger groups—but this approach does not scale well. In contrast, industry is capturing aspects of the cultural experiences of billions of people.

Digital humanities has mostly ignored opportunities to study vernacular digital culture because, as I explained earlier, it follows the traditional humanities paradigm of studying professional and high cultures. But social scientists concerned with society at large have welcomed opportunities to analyze social phenomena via digital networks in the process of developing new research methods such as the design of online experiments. As the organizers of the MIT Conference on Digital Experimentation (2017) point out:

The newly emerging capability to rapidly deploy and iterate micro-level, in-vivo, randomized experiments in complex social and economic settings at population scale is, in our view, one of the most significant innovations in modern social science. As more and more social interactions, behaviors, decisions, opinions and transactions are digitized and mediated by online platforms, our ability to quickly answer nuanced causal questions about the role of social behavior in population-level outcomes such as health, voting, political mobilization, consumer demand, information sharing, product rating and opinion aggregation is becoming unprecedented.10

The use of digital experiments in social science suggests that researchers of culture should also be analyzing data about cultural reception and interaction on a large scale and conducting large digital experiments. So far, such culture experiments are done by the industry itself; think, for instance, of A/B testing in web design or the automatic selection of friends’ posts in social networks such as Facebook. These are digital humanities experiments, so to speak, performed by industry. Of course, the data collected from these experiments is not publicly available, and the experiments are aimed at only a few pragmatic goals: to increase engagement (such as time spent on the site), to increase brand awareness, or to lead people to purchase goods or services. This is why we need to conduct our own experiments and ask alternative questions.
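To make the mechanics of such an experiment concrete, here is a minimal sketch in Python of how the results of a simple A/B test might be analyzed: two versions of a page are shown to different users, and their click-through rates are compared with a two-proportion z-test. The counts and variable names are invented for illustration; industrial systems run thousands of such tests on live traffic and use more elaborate statistical machinery.

from math import sqrt
from scipy.stats import norm

# Hypothetical counts: how many users saw each version of a page, and how many clicked.
clicks_a, views_a = 1210, 24000   # version A
clicks_b, views_b = 1325, 24000   # version B

rate_a = clicks_a / views_a
rate_b = clicks_b / views_b

# Pooled click-through rate under the null hypothesis that the versions do not differ.
pooled = (clicks_a + clicks_b) / (views_a + views_b)
se = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))

z = (rate_b - rate_a) / se
p_value = 2 * norm.sf(abs(z))   # two-sided p-value

print(f"CTR A = {rate_a:.4f}, CTR B = {rate_b:.4f}")
print(f"z = {z:.2f}, p = {p_value:.4f}")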

A New Stage in Media Technology History

The history of technological media can be imagined as a series of many overlapping stages. At each stage, new technologies and new practices for creating, storing, distributing, and using content become prominent. But these practices do not replace each other in a linear fashion. Instead, the older ones continue to coexist along with the new ones. For example, consider mass reproduction of print (1500–), broadcasting (1920–), use of personal computers for media creation (1981–), the web as a publishing and distribution platform (1993–), and social networks and media sharing sites (2003–), to name just a few of these practices. All of them are active today, although over long periods of time, the earlier practices may become less important or be transformed in significant ways.

Media analytics is the newest stage in the development of modern technological media. Unlike other stages, it is not at its core about creation, publishing, or distribution, although it also affects these operations. The focus of this new stage is automatic computational analysis of the content of all media available online, as well as online personal and group behaviors and communication.

The motivations and uses of media analytics are multiple, but they all are related to the scale of digital culture in the early twenty-first century. This scale is the volume of digital content: in 2017 the web had 14 billion web pages, and 2 billion photos were shared daily on Facebook alone. It is also the number of people active online. As of early 2020, there were 3.8 billion active social network users and 4.5 billion internet users, and these numbers continue to grow. Therefore, to say that media analytics and the rise of the big data paradigm are related is an understatement. In fact, it was Google and Facebook that, because of the volumes of data they deal with, developed the next generation of technologies for storing, retrieving, and analyzing big data; these technologies are now also used in many other fields.

Media Analytics Examples

Companies that sell cultural goods and services via websites or apps (e.g., Amazon, Apple, Spotify, Netflix); organize and make searchable information and knowledge (Google, Baidu, Yandex); provide recommendations (Yelp, TripAdvisor); and enable social communication, information sharing (Facebook, QQ, WeChat, WhatsApp, Twitter), and media sharing (Instagram, Pinterest, YouTube, iQiyi) all rely on computational analysis of massive media datasets and data streams. These datasets include the following:

The term dataset is often used in industry to refer to static or historical data organized in databases. The term historical in industrial data analytics applications means everything that is more than a few seconds, or sometimes even fractions of a second, in the past. Data streams refers to data that arrives in real time and is analyzed continuously using platforms such as Spark Streaming and Apache Storm.11 So far, digital humanities and computational social sciences have only been analyzing historical, static datasets; meanwhile, industry has been increasingly using real-time analysis of data streams that are larger and require special technologies such as Hadoop, Apache Cassandra, Apache HBase, and MongoDB.
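To make the contrast between static datasets and data streams concrete, here is a minimal sketch using PySpark’s Structured Streaming interface to Spark. Instead of querying a finished table, the program subscribes to a continuously arriving stream of text lines and maintains an always-updating count of hashtags. The socket source, host, port, and the hashtag-counting task are illustrative assumptions, not a description of any company’s actual pipeline.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

# Hypothetical source: lines of text arriving in real time on a local socket.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split each line into words, keep those that look like hashtags, and count them.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
hashtags = words.filter(col("word").startswith("#"))
counts = hashtags.groupBy("word").count()

# Unlike a query over a static dataset, this query never "finishes":
# the counts are updated continuously as new data arrives.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()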

Let’s look at one example of industry computational analysis of media content and the use of this analysis. Spotify extracts many characteristics of each music track in its collection of over forty million tracks. These characteristics, or features, are also made available to external developers via the Get Audio Features for a Track Spotify API method. The current specification for this method lists thirteen audio features.12 Many of them are built on top of lower-level features extracted by algorithms from the track audio file. These features are “acousticness,” “danceability,” duration in milliseconds, “energy,” “instrumentalness,” key, “liveness,” loudness, mode, “speechiness,” tempo, time signature, and valence. (Feature extraction is a key part of contemporary data analysis in general, and I will discuss it in chapter 6.)
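As a sketch of what working with these features looks like in practice, the following Python fragment requests the audio features of a single track through the Spotify Web API endpoint behind the method mentioned above. The access token and track ID are placeholders; obtaining a token requires registering an application with Spotify, a step omitted here.

import requests

# Placeholder values: a real OAuth access token and a real track ID are required.
ACCESS_TOKEN = "YOUR_SPOTIFY_ACCESS_TOKEN"
TRACK_ID = "SOME_TRACK_ID"

# The endpoint for the Get Audio Features for a Track method.
url = f"https://api.spotify.com/v1/audio-features/{TRACK_ID}"
headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

features = requests.get(url, headers=headers).json()

# Print a few of the thirteen features listed in the specification.
for name in ["danceability", "energy", "tempo", "valence", "loudness"]:
    print(name, features.get(name))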

Spotify and other music streaming services use such extracted features to automatically create custom playlists for users, seeded by a song, album, artist, or genre. If you start with a single song, the app’s algorithms select and stream songs that are close to it in a feature space. The advantage of this method is that the new songs do not have to belong to the same album or artist—they only need to share some musical features with the previous songs.
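Spotify has not published how its playlist generation works internally, but the general idea—selecting songs close to a seed track in feature space—can be sketched with a simple nearest-neighbor search over feature vectors. The toy catalog and feature values below are invented for illustration; a real system would work with millions of tracks and many more features.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

# Toy catalog: each row describes a track by danceability, energy, tempo, and valence.
tracks = ["track_a", "track_b", "track_c", "track_d", "track_e"]
features = np.array([
    [0.80, 0.70, 118.0, 0.65],
    [0.78, 0.72, 121.0, 0.60],
    [0.30, 0.25,  78.0, 0.20],
    [0.75, 0.68, 115.0, 0.70],
    [0.28, 0.30,  82.0, 0.25],
])

# Put the features on a comparable scale, then index them for neighbor search.
scaled = StandardScaler().fit_transform(features)
index = NearestNeighbors(n_neighbors=3).fit(scaled)

# A "playlist" seeded by track_a: its nearest neighbors in feature space.
distances, neighbors = index.kneighbors(scaled[[0]])
for i in neighbors[0][1:]:   # skip the seed track itself
    print(tracks[i])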

There are numerous other applications of media analytics. For example, to make its search service possible, Google continuously analyzes the full content and markup of billions of web pages. It looks at every page on the web that its spiders can reach—its text, layout, fonts used, images, and so on, extracting over two hundred signals in total.13 Email spam detection relies on the analysis of the texts of numerous emails. Amazon analyzes the purchases of millions of its customers to recommend books. Netflix analyzes the choices of millions of subscribers to recommend films and TV shows. It also analyzes information on all its offerings to create more than seventy thousand genre categories.14 Contextual advertising systems such as Google AdSense analyze the content of web pages and automatically select the relevant ads to show. Video game companies capture the gaming actions of millions of players and use this to optimize game design. Facebook’s algorithm analyzes all updates by all friends of every user to automatically select which ones to show in a user’s feed if that user is using the default Top Stories option.15 And it does this for all the posts of its 2.5 billion (as of early 2020) users. Other uses of media analytics in the industry include automatic translation (Google, Skype) and recommendations for people to follow or to add to your friends list (Twitter, Facebook). Using the voice interface in Google Search, Google Voice transcriptions,16 Microsoft’s Cortana, Siri, Amazon Alexa, or the Yandex browser also relies on computational analysis of millions of hours of previous voice interactions.

The development of algorithms and software that make this data collection and analysis and subsequent actions possible is carried out by researchers in a number of academic fields, including machine learning, computer vision, music information retrieval, computational linguistics, and natural language processing. Many of these fields started to develop in the 1950s, with the key concept of information retrieval appearing in 1950 (discussed later in this chapter). The newest term is data science, which became popular after 2010. It refers to professionals who know contemporary algorithms and methods for data analysis—described today by the overlapping terms machine learning, data mining, and AI—as well as classical statistics, and who can implement the gathering, analysis, reporting, and storage of big data using current technologies.

People outside the industry may be surprised to learn that many key parts of media analytics technologies are open-sourced. To speed up the progress of research, most top companies regularly share many parts of their code. For example, on November 9, 2015, Google open-sourced TensorFlow, its data and media analysis system that powers many of its services.17 Other companies, such as Facebook and Microsoft, have also open-sourced their software systems for organizing massive datasets. Cassandra and Hive are two popular systems developed by Facebook, and they are now used by numerous commercial and nonprofit organizations. The reverse is also true: the data from the community mapping project OpenStreetMap (openstreetmap.org), with its more than two million members, is used by many commercial companies, including Microsoft and Craigslist, in their applications.18 The most popular programming languages used for media analytics research today are open source: R and Python.

If we want to date the establishment of the practices of massive analysis of content and interaction data across the industry, we may pick 1995 as the starting date (early web search engines) and 2010 (when Facebook reached five hundred million users) as the date these practices fully matured. Today, media analytics is taken for granted, with every large company selling services or products, online or offline, doing this analysis daily and increasingly in real time. The same analysis is performed by hundreds of companies that offer social media dashboards—web tools for monitoring and analyzing user activity and posting content—and perform custom analysis for numerous clients, both for-profit and nonprofit, including private and public universities.

The Two Parts of Media Analytics

Media analytics is the new stage of media technology that impacts the everyday cultural experiences of significant percentages of the population in most countries. One part of media analytics—the practices of gathering and algorithmically analyzing user interaction data (i.e., digital traces)—has received significant attention. However, most discussions of these practices focus on political and social issues such as privacy, surveillance, access rights, discrimination, fairness, and biases, as opposed to the history and theory of technological media.

The second part of media analytics—the practices of algorithmic analysis of all types of online media content by the industry, including images, videos, and music—has received less attention in comparison. However, only if we consider the two parts of media analytics together—analysis of user interaction data and analysis of media content—does the magnitude of the shift that gradually took place between 1995 and 2010 become fully apparent. Although articles in popular media have discussed details of computational analysis of cultural content and data in some cases, such as Google Search, Netflix’s recommendation system, or US presidential election campaigns starting with Obama’s in 2008, they have not explained that media analytics is now used throughout the culture industry.19

Media analytics practices and technologies are employed in platforms and services via which people share, purchase, and interact with cultural products and with each other. They are used by companies to automatically select what will be shown on these platforms to each user, including updates from friends and recommended content, and how and when. They are also used by millions of individuals who participate in the culture industry not only as consumers but also as content and opinion creators; George Ritzer and Nathan Jurgenson called this combination of consumption and production prosumer capitalism.20 For example, Google Analytics for websites and blogs, and the analytics dashboards provided by Facebook, Twitter, and other major social networks, are used by millions to fine-tune their content and posting strategies.

Both parts of media analytics are historically new. At the time when Max Horkheimer and Theodor Adorno were writing their book that introduced the term culture industry (see the quote at the start of this chapter), interpersonal and group interactions were not part of the culture industry. Today, they too have become industrialized—influenced in part by algorithms deciding what content, updates, and information from people in your networks to show you. These interactions are also industrialized in a different sense: the interfaces and tools of social networks and messaging apps are designed with input from user interface (UI) researchers and designers, who test endless possibilities to ensure that every UI element, such as buttons and menus, is optimized and engineered to achieve maximum results.

The second part of media analytics—computational analysis of media content—also did not exist until recently. The first computer technologies that could retrieve computer-encoded text in response to a query were introduced in the 1940s. At a conference held in 1948, participants learned about the UNIVAC computer, which was “capable of searching for text references associated with a subject code.”21 Calvin Mooers coined the term information retrieval in his master’s thesis at MIT and published the definition of the term in 1950. According to this definition, information retrieval is “finding information whose location or very existence is a priori unknown.”22 While the earliest systems only used subject and author codes, in the late 1950s IBM computer scientist Hans Peter Luhn introduced full-text processing. I identify this moment as the beginning of the media analytics paradigm.

In the 1980s, the first search engines applied information retrieval technology to files on the internet. After the World Wide Web started to grow, new search engines for websites were created. The first well-known engine that searched the text of websites was WebCrawler, released in 1994. In the second half of the 1990s, many other search engines, including Yahoo!, Magellan, Lycos, Infoseek, Excite, and AltaVista, were analyzing website texts. And in the 2000s, massive analysis of other types of online media, including images, videos, and songs, began. Google introduced Image Search in July 2001, indexing 1 billion images by 2005 and 10 billion by 2010. Another image search service, TinEye, indexed forty billion web images by early 2020. Music streaming services analyze the characteristics of millions of songs and use this analysis for recommendation. YouTube analyzes the content of posted videos to see if a new video matches an already existing item in its database of millions of copyrighted videos.

Automation: Media Analysis

If we look at the media analytics stage of media history in terms of automation, it follows the earlier stage when software tools and computers were adapted for authoring individual media products.23 The important moments in this history include the introduction of Quantel Paintbox for video effects (1981), Microsoft Word for writing (1983), PageMaker for desktop publishing (1985), Illustrator for vector drawing (1987), Photoshop for image editing (1990), and Video Toaster for video editing (1990). These software tools made possible faster workflows, the exchange and sharing of projects’ digital files and assets, the creation of modular content (e.g., layers in Photoshop), and the ability to easily change parts of the created content in the future. Later, these tools were joined by other technologies that support computational media authoring, such as render farms and media workflow management software.

The tools of media analytics are different: they automate the analysis of (1) billions of pieces of media content shared and published online and (2) data from trillions of interactions between users and software services and apps. For example, in 2018, an Instagram algorithm that creates recommendations for each user utilized these main factors (in addition to many others):

What is now being automated is no longer the creation of individual media items but other media operations. These include selection and filtering (what to show), content placement (behavioral advertising), and discovery (search, recommendations). Another application is how to show; for example, the news portal Mashable automatically adjusts the placement of stories based on real-time analysis of users’ interactions with this content. Yet another application of media analytics is what to create; for example, in 2015, New York Times writers started using an in-house application that recommends topics to cover.25

Just as the adoption of computers for media authoring gradually democratized this process, the development of concepts, techniques, software, and hardware for media analytics also democratizes its use. Today, every creator of web content has free tools that until recently were only available to big advertising agencies or marketers. Every person who runs a blog or posts content on their social media networks can now act as a media company, studying data about clicks, reshares, and likes; paying to promote any post; and systematically planning what they share and where. All popular media sharing and networking platforms show people detailed graphs and statistics about the interactions other users have with their content.

As another example, consider Mailchimp, a popular service for sending and tracking mass emails. When I use Mailchimp to send an email to my own mailing list (Mailchimp is currently free for up to two thousand email addresses and twelve thousand emails per month), I use its Send Time Optimization option. Mailchimp then analyzes data from my previous email campaigns and “determines the best sending time for the subscribers you’re sending to and distributes it at the optimal time.”26 To create my posts for Facebook and Twitter, I use the Buffer app, which also calculates the best time for me to post to each network. If I want to promote my Facebook page or Twitter posts, I can use free advertising features that can create a custom audience for my campaign by selecting users on these networks based on hundreds of settings, including country, age, gender, interests, and behaviors. Category-based market segmentation was already in use earlier in marketing and advertising, but Twitter also allows you to “target users who are similar to the people who already follow you” for your account.27 In this new situation, I no longer have to start with explicit categories or terms; instead, I can let Twitter’s media analytics build a custom audience for me.

Web giants such as Google, Baidu, Yandex, and Facebook have significant advantages: their technical and talent resources for data analysis, and their access to data about the use of their services by billions of people daily. These resources allow them to analyze user interactions and act on them in ways that are quantitatively different from an individual user or a business using Google Analytics or Facebook analytics on their own accounts, or using any of the social media dashboards—but qualitatively, in terms of concepts and most of the technologies, it is exactly the same. One key difference between the largest companies and smaller ones is that the former have top scientists developing their machine learning systems (a modern form of AI), which analyze and make decisions based on billions of data points captured in near real time. Another difference is that Google and Facebook dominate online search and advertising in many countries and therefore have a disproportionate effect on the discovery of new content and information by hundreds of millions of people.

Media analytics is big, and it is used throughout the culture industry. But still, why do I call it a stage, as opposed to just one among other trends of the contemporary culture industry? Because in some industries, media analytics is used to algorithmically process every cultural artifact. For example, digital music services that use media analytics accounted for 70 percent of music revenue in the United States in 2014.28 This is the new logic of how media work internally and how they function in society. In short, it is crucial both practically and theoretically. Any future discussion of media theory or communication has to start with this new situation.

(Of course, I am not saying that nothing else has happened after 1993 in media technologies. I can list many other important developments, such as the move from hierarchical organization of information to search, the rise of social media, the integration of geolocation information, mobile computing, the integration of cameras and web browsing into phones, and the adoption of supervised machine learning across media analytics applications and other areas of data analysis after 2010.)

Companies that are key players in big media data processing—Google, Baidu, VK, Amazon, eBay, Facebook, Instagram, and the like—are very young. They developed in the web era, as opposed to older, twentieth-century culture industry players, such as movie studios or book publishers. These older players were, and continue to be, the main producers of “professional” content. The newer players act as interfaces between people and this professional content, as well as user-generated content. The older players are gradually moving toward the adoption of analytics, but key decisions (e.g., publishing a book) are still made by individuals following their instincts. In contrast, the new players have built their businesses on computational media analytics from the beginning.

On the one hand, companies use media analytics to optimize distribution, marketing, advertising, discovery, and recommendations—that is, the part of the culture industry in which customers find and purchase cultural products. On the other hand, the users of social networks and web platforms become “products” to each other. Thus, Amazon algorithms analyze data about what goods people look at and what they purchase and use this analysis to provide personal recommendations to each of its users. Facebook algorithms analyze what people do on Facebook to select what content appears in each person’s news feed.29

Although the word algorithms and the term algorithmic culture are convenient, they can also be misleading—and that is why I use analytics instead. The most frequently used technology for big data analysis and prediction is supervised machine learning using neural networks, and it is quite different from our common understanding of an algorithm as a finite sequence of steps executed to accomplish some task. Some machine learning applications are interpretable, but many are not. The process of creating such a system often results in a black box that has good practical performance but is not interpretable; that is, we do not know how it generates its results.30 For these reasons, I prefer to avoid using the terms algorithms and algorithmic when referring to the real-world systems deployed by companies to analyze data, make predictions, or execute automatic actions based on this analysis. My preferred term is software, which is more general; it does not assume that the system uses traditional algorithms, nor that these algorithms are interpretable.31

Media analytics is the key aspect of the “materiality” of media today. Fifteen years ago, this concept might have been used in discussions of computer hardware, programming languages, databases, network protocols, and media authoring, publishing, and sharing software.32 Today, media materiality includes big data storage and processing technologies such as Hadoop and Storm, paradigms such as supervised machine learning and deep learning, and popular machine learning algorithms such as k-means clustering, decision trees, support vector machines, and k-NN (k-nearest neighbors). Materiality is Facebook “scanning 100 billion rows per second,”33 and Google processing 100+ TB of data per day (2014 estimate)34 and automatically creating “multiple [predictive] models for every person based on the time of the day.”35

Automation: Media Actions

So far, our discussion has focused on automatic analysis of media content and user interactions with the content. I now want to talk about another novel aspect of media culture enabled by media analytics: automation of media actions based on the results of earlier and/or real-time analysis. These actions can be divided into two types: (1) automatic actions partly controlled by explicit user inputs or chosen settings and (2) automatic actions not controlled by explicit user inputs.

Examples of automatic actions partly controlled by explicit user inputs or chosen settings include search results returned in response to a text search query, image search results produced in response to the user choosing an image type to find, and music tracks recommended by a music streaming service in response to the user’s initial selection of a musician or tracks. For example, Google image search options currently offer a choice of image type (face, photo, clip art, line drawing, or animation) and color (full color, a dominant color, or black and white). Examples of settings that can be changed by users are the ads chosen by the system in response to the user’s ad preferences and the types of images shown in response to “safe search” settings.

These user inputs and settings are combined with the results of content and interaction analysis to determine the actions taken by the software. The choice of actions may combine previous data from the particular user and data for all other users—such as the purchasing history of all Amazon customers. Other information also can be used to determine actions. For instance, real-time algorithmic actions involving thousands of ads determine which ads will be shown on the user’s page at a given moment.

Automatic actions not controlled by explicit user inputs depend on the analysis of user interaction activity but do not require the user to choose anything explicitly. In other words, a user “votes” with all of their previous actions. The automatic filtering of emails in Gmail into Important and Everything Else categories is a good example of this type of action. Most of the automatic actions we encounter in our interactions with web services and apps today can be partly controlled by us via settings; however, not every user is willing to spend the time to understand and change the default settings for every service (e.g., those at https://www.facebook.com/settings).

We can also divide automatic actions into two types, depending on whether they are arrived at in a deterministic or nondeterministic way:

The overall result is another new condition of media: what we are shown and recommended every time is not completely determined by us or by system designers. This shift from the strictly deterministic technologies and practices of the culture industry in the twentieth century to nondeterministic technologies in the twenty-first century is another important aspect of the new stage of media culture. What was strictly the realm of experimental arts—the use of indeterminacy by John Cage or stochastic processes by Iannis Xenakis to create or perform compositions—has been adopted by the culture industry as a method to deal with the new massive scale of available content. But of course, the industry’s goal is different: not to create a possibly uncomfortable and shocking aesthetic experience, but to expose a person to more existing content that fits that person’s existing taste, as manifested in their previous choices. However, we should keep in mind that industry recommendation systems can also be used to expand one’s taste and knowledge, if one gradually keeps moving further from one’s initial selections—and certainly web hyperlinking structures, Wikipedia, and open-access publications can also be used to do this.

In addition to the examples of automatic actions based on media analytics I already mentioned, there are many other types of such actions that also make contemporary media different from media of the past. For example, the data on users’ interaction with a web service, app, or device is also often used to make automatic design adjustments in that web service, app, or device. The data is also used to create more cognitive automation, allowing the system to “anticipate” what users may need at any given location and time and deliver the information best tailored to this location, moment, user profile, and type of activity. The term context-aware is often used to describe computer systems that can react to location, time, identity, and activity.36 The Google Now assistant, introduced in 2012, is one example of such context-aware computing (since 2016, its functions have been incorporated into Google Assistant).

Twentieth-century industrial and software designers and advertisers used user testing, focus groups, and other techniques to test new products and to refine them. But in the media analytics stage, a service or a product can automatically adjust its behavior for each individual user based on that user’s interaction history, as well as analysis of interactions of every other user with the service or product. Following the model popularized by Google, every web and app user has become a beta tester of many constantly changing systems that learn from every interaction.

Large-scale media analytics is often used in making decisions about what cultural products to create, their contents and aesthetics, and how they should be marketed and to what groups. For example, when you create a post that you want to promote and let Facebook, Twitter, or another social network automatically build a particular audience segment—say, one similar to your current followers—you are using media analytics. Here the system automatically decides what audience will be most interested in your content. But the media industry is already going further, sometimes using analytics to decide what to create in the first place. Here Netflix has been a pioneer, using data to decide on elements of a new show that became very successful (House of Cards in 2013).37 Netflix also systematically analyzes all kinds of data about what its viewers watch, as well as the content of the films and TV shows it offers. As the director of engineering at Netflix, Xavier Amatriain, explained in an interview in 2013: “We know what you played, searched for, or rated, as well as the time, date, and device. We even track user interactions such as browsing or scrolling behavior. All that data is fed into several algorithms, each optimized for a different purpose. In a broad sense, most of our algorithms are based on the assumption that similar viewing patterns represent similar user tastes. We can use the behavior of similar users to infer your preferences.”38

Netflix is even analyzing the colors of the cover images for its programs. On its technical blog, the company published examples of visualizations it created to compare the color palettes of its shows. Describing one such visualization from 2013 that compares the palettes of two covers that are hard to tell apart, Phil Simon points out that it shows that “subtle differences exist—and Netflix can precisely quantify those differences. What’s more, Netflix can see if they have any discernible impact on subscriber viewing habits, recommendations, ratings, and the like.”39 In yet another application of media analytics, the blog describes how Netflix uses computer vision algorithms to automatically find images from its films and TV series that would best represent this content on smaller mobile screens.40 And these are only a few examples of how a company like Netflix uses media analytics to drive all kinds of decisions.
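Netflix has not made its palette-comparison method public, but the basic operation—reducing a cover image to a handful of dominant colors that can then be compared across titles—can be sketched with k-means clustering of pixel values, using Pillow and scikit-learn. The file name is a placeholder and the choice of five colors is an arbitrary assumption.

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Placeholder file: any cover image would do. Downsizing keeps the clustering fast.
image = Image.open("cover.jpg").convert("RGB").resize((100, 150))
pixels = np.asarray(image).reshape(-1, 3).astype(float)

# Cluster the pixels into five groups; the cluster centers approximate the palette.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(pixels)
palette = kmeans.cluster_centers_.astype(int)

# Share of the image covered by each palette color.
shares = np.bincount(kmeans.labels_) / len(kmeans.labels_)

for color, share in zip(palette, shares):
    rgb = tuple(int(c) for c in color)
    print(f"RGB {rgb}: {share:.1%}")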

In another example, Yelp is using media analytics to automatically select the best photos to represent businesses on its review site. As explained on its engineering blog (2016):

In order to provide a great experience for Yelp users, the Photo Understanding team had the challenging task of determining what qualities make photos appealing, and developing an algorithm that can reliably assess photos using these characteristics. . . . At Yelp, each business’s page showcases a few of its best photos, which we call cover photos. For many years we have chosen these photos purely by calculating a function based on likes, votes, upload date, and image caption. However, this approach suffered from a few drawbacks. . . . Now, as a result of our scoring algorithm, we believe that the quality of cover photos for restaurants has significantly improved.41

Media Analytics and Cultural Analytics

Many of the cultural effects—as opposed to economic, social, and political ones—of the new computational organization of media culture have not been systematically studied empirically by either industry or academic researchers. For example, we know many things about the language of conservative and liberal Twitter users in the United States or about political polarization on the same platform.42 But we do not know anything about the evolution in topics of hundreds of millions of blogs over the last fifteen years, or the changes in characteristics of billions of Flickr photos during the same period, or the differences in types of content shared on Instagram in thousands of cities worldwide. Nor do we know anything quantitatively about how exposure to algorithmically selected images recommended by Instagram changes users’ tastes and affects the new images they create themselves.

The datasets to research such questions exist or can be created. In 2014, Flickr released to all interested researchers an open dataset of one hundred million Creative Commons-licensed photos shared between 2004 and 2014.43 Such datasets can be used to study both the evolution of global photo culture over time and the differences between local photo cultures. In our own work, we have analyzed one hundred thousand Instagram images shared in five global cities in one week in December 2013 and found significant differences in content, visual styles, and photo techniques between cities.44 In another project from 2013, we compared the temporal rhythms of image sharing in thirteen global cities using a sample of 2.3 million Instagram images.45

The industry does extract numerous patterns from professional and user-generated content online, but often the only ones “looking” at these patterns are algorithms and neural networks. Companies use this information for search, recommendation, design, marketing, advertising, and other applications, but they usually do not publish results of the analysis. The business clients of media analytics services are also typically interested in only particular content (e.g., all social media mentions of a particular brand or the activity of competing companies) and particular user behaviors or user activities (e.g., likes for this brand).

Often the same analytical methods used in the culture industry to rationalize and refine content and communications can also be used to research, map, quantify, and interpret the cultural effects of industrial media analytics. For example, if the industry uses cluster analysis to study audiences for particular songs or movies, we can use cluster analysis to understand relations among thousands of movies being offered. But as this example demonstrates, there is a crucial asymmetry between what industry is doing and what independent researchers can do. I can assemble large datasets of user-generated content from some social networks, and also some types of professional content, such as music videos or motion graphics shared by designers and companies on Vimeo or design projects shared on Behance. If a given social network API provides this data, I can also access data on how users interact with a particular post, such as the number of likes, comments, and so on. However, I cannot access all such data for all professional media created today, nor can I get the kinds of details Netflix has access to: who is watching every program, at what time, in what locations, what else they searched for, their previous interaction histories, their mouse clicks, and so on. The same goes for data available to Spotify, the iTunes Store, Google Play, Amazon, Etsy, AliExpress, and so on.
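To illustrate the kind of analysis that is within reach of an independent researcher, here is a minimal sketch that clusters a hypothetical table of movies described by a few numeric attributes of the sort one can collect from public sources. The column names, values, and the number of clusters are assumptions for illustration only; a real study would use thousands of titles and richer features.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical dataset: one row per movie with a few numeric descriptors.
movies = pd.DataFrame({
    "title":   ["m1", "m2", "m3", "m4", "m5", "m6"],
    "runtime": [92, 135, 88, 141, 95, 128],
    "year":    [2004, 2012, 1998, 2015, 2001, 2010],
    "rating":  [6.1, 7.8, 5.4, 8.2, 6.5, 7.1],
    "votes":   [1200, 250000, 800, 410000, 5300, 98000],
})

# Standardize the numeric features and assign each movie to one of two clusters.
X = StandardScaler().fit_transform(movies[["runtime", "year", "rating", "votes"]])
movies["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Inspect which movies ended up together and what characterizes each cluster.
print(movies.sort_values("cluster"))
print(movies.groupby("cluster")[["runtime", "rating", "votes"]].mean())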

One free system that provides a wealth of data, an easy-to-use interface with graphs, and the ability to download results is Google Trends. It can be used to ask interesting cultural questions, and in fact many researchers use its results in their papers. It is also possible to become a paying client for social media monitoring software (Hootsuite, Sprout Social, Brandwatch, Critical Mention, Crimson Hexagon [now part of Brandwatch], etc.) and monitor social media, blogs, reviews, news, forums, and other sources for certain keywords, hashtags, and topics, seeing their relative popularity over time and geography (similar to how Google Trends shows patterns for search terms). However, the main purpose of this software is to enable a business or organization to plan its social media activity, to see what people say about it, and to compare it with its competitors. Therefore, it can’t be used as a general cultural analytics tool. To ask many research questions or to be able to analyze large cultural datasets directly instead of relying on algorithms built into social media monitoring software, you have to learn programming and data science, then acquire data (download data via an API, scrape it from websites, or purchase it from data providers such as DataSift or Webhose.io), and then you can start the analysis. If we are interested in fine-grained historical or large-scale cross-cultural analysis, this is often the only way.

The term culture industry, which has already appeared in this book a number of times, has a precise origin. As I already mentioned, it was developed by the German culture theorists Horkheimer and Adorno in their 1947 book Dialectic of Enlightenment. They wrote this book in Los Angeles when the Hollywood studio system was in its classical—that is, most integrated—period. There were eight major film conglomerates, and five of them (20th Century Fox, Paramount, RKO Pictures, Warner Brothers, and Loew’s) had their own production studios, distribution divisions, theater chains, directors, and actors. According to some film theorists, the films produced by these studios during this period also had a very consistent style and narrative construction.46 Regardless of whether Horkheimer and Adorno had already fully formed their ideas before arriving in Los Angeles as emigrants from Germany, the tone of the book and its statements, such as the famous quote “Culture today is infecting everything with sameness,”47 seem to fit the Hollywood classical era—although even during this era, films by different directors were different from each other.

How does the new “computational base” (i.e., media analytics) affect both the products that the culture industry creates and what consumers get to see and choose? For example, do the computational recommendation systems used today by many companies help people choose apps, books, videos, movies, or songs more widely (i.e., the long tail effect), or do they, on the contrary, guide them toward “top lists”? What about systems used by Twitter and Facebook to recommend whom to follow and which groups to join? (For an example of an industry publication that presents details of a recommendation system, see the 2013 paper “Location Based Personalized Restaurant Recommendation System for Mobile Environments”;48 for a quantitative analysis of the effects of an industry recommendation system on media consumption, see the 2010 paper “The Impact of YouTube Recommendation System on Video Views.”49)

Or consider the interfaces and tools of popular media capture and sharing apps, such as Instagram, with its standard set of filters and adjustment controls appearing in a certain order on the user’s phone. Does this lead to homogenization of image styles, with the same few filters dominating the rest? Such questions about the effect of digital tools and services on cultural diversity can now be studied quantitatively using large-scale cultural data from the web and data science methods. For example, when we compared the use of Instagram filters in 2.3 million photos shared in thirteen global cities during the spring of 2012, we found remarkable consistency between the cities.50 The relative frequencies of different filters were very similar across the cities, and their popularity was almost perfectly correlated with the order of their appearance in the Instagram app interface.
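The check behind that last observation can be sketched in a few lines: given a count of how often each filter was used and each filter’s position in the app interface, a rank correlation measures how closely popularity follows interface order. The filter positions and counts below are invented for illustration, not our actual data.

from scipy.stats import spearmanr

# Hypothetical usage counts for filters, listed in their interface order
# (position 1 appears first in the app).
interface_order = [1, 2, 3, 4, 5, 6, 7, 8]
usage_counts    = [94000, 71000, 65000, 40000, 38000, 21000, 15000, 9000]

# A Spearman's rho close to -1 would mean that filters appearing earlier
# in the interface are almost always used more often.
rho, p_value = spearmanr(interface_order, usage_counts)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.4f}")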

Digitization of historical cultural media also makes it possible to quantitatively analyze changes in diversity and homogeneity over time. In the paper “Measuring the Evolution of Contemporary Western Popular Music” (2012), researchers applied computational methods to a dataset of 464,411 distinct music recordings for the 1955–2010 period. They found that many sound parameters of popular music did not change during this period, but some changed significantly. The researchers highlight three changes: “the restriction of pitch transitions, the homogenization of the timbral palette, and the growing loudness levels.”51 The first two findings suggest that on these dimensions, popular Western music became less diverse during the fifty-five-year period being studied.

Another publication, “The Evolution of Popular Music: USA 1960–2010,” analyzed 17,094 songs that appeared in the charts for this period. The authors analyzed sound properties “to produce an audio-based classification of musical styles and study the evolution of musical diversity and disparity, testing, and rejecting, several classical theories of cultural change.” They also investigated “whether pop musical evolution has been gradual or punctuated” and found that while some periods had gradual changes, there were also three stylistic “revolutions” around 1964, 1983, and 1991.52

In this chapter, we looked at media analytics—the computational analysis of digital cultural content and user activities that has become the foundation of contemporary digital culture. However, though large-scale computational analysis of content and interaction data by companies such as Google, Facebook, Instagram, Amazon, and their counterparts in other countries gives them significant power, they are not simply new iterations of the tightly integrated Hollywood conglomerates of the 1940s. The web, social media, and the use of media analytics create a new type of culture industry that coexists and interacts with the older one established in the 1910s–1940s. This earlier culture industry was focused on creating, distributing, and marketing content such as movies, radio shows, songs, books, and TV programs. The new culture industry of our time focuses on organizing, presenting, and recommending content created by various actors, as well as capturing and analyzing individuals’ interactions with this content. In other words, these companies are usually not content creators themselves.

The actors creating content include professional producers of different sizes (e.g., big movie studios, television production companies, book publishers, and music labels—the “old” culture industry) and billions of ordinary casual users, as well as millions of people who are situated at many points between these two extremes. Examples include minicelebrities and “influencers” on social media; freelancers such as photographers, designers, yoga instructors, hairstylists, or interior decorators, along with small shops or individual sellers that promote themselves using social media; creators of online videos in numerous genres, such as anime music videos, YouTube reaction videos, Russian schools’ graduation videos, Chinese minimovies, and so on; thirty-five million artists sharing their works on DeviantArt (deviantart.com); one hundred and thirteen million academics who have accounts on academia.edu;53 and more.54

And the content itself is also qualitatively different from what was produced at the time Horkheimer and Adorno wrote their book (the early 1940s): it is not only songs, films, books, and TV shows, but also our individual posts, messages, images, and videos shared on Twitter, Facebook, Vine, Instagram, YouTube, and Vimeo, as well as academic papers, code, and so on. All the content published by the entire culture industry in the United States in the 1940s probably amounted to fewer than a few million items per year; today, the content shared on social networks adds up to many billions of items every day.

Surfacing the variability of this content so we can understand and interpret it can only be done using computational methods. As I am writing this, the academic fields that aim to understand media and digital phenomena—media theory, digital culture studies, and new media studies—have not yet adopted cultural analytics methods. But just as researchers in the recently emerged fields of digital history, digital humanities, and digital art history have started to apply these methods in their own areas, it is only a matter of time before media theory starts doing the same. This new area may be called computational media studies—or perhaps, by the time this adoption fully happens, it will be seen simply as another set of tools and methods that media and new media theory can use, and it will not need its own special name.