Intervention: Too Much Data
“This is really an anticlimax! Last time we met, we were like, ‘Wow, we’ve got unlimited data, what should we do with it?’ And then it took just a couple of days and we discovered that we still have too little data!” This disheartened remark is called out by one of us as we’re sitting in front of our laptops, hardly believing what we are seeing in the Excel files. Having compiled the data from our last run of bots, we have just realized that despite tons and tons of data points in the files, none of our ninety-six bots actually got any recommendations at all from Spotify’s Discover feature. And the Discover feature is the only thing we’re interested in. We have loads of data here, but not the data we wanted.1
These short notes, recorded during the implementation of one of our bot experiments—the so-called gender case—point to the perils of navigating and trying to understand the effects of a constantly changing software system such as Spotify. They also raise critical and self-reflective questions about doing humanistic inquiry during what David Berry has termed “the computational turn.”2 What did it mean for us to have “loads of data”? What were the epistemic rationales and subjective investments at work here?3 These are issues that we have struggled with during the course of our experimental case studies, and they were acutely brought to the fore in the case mentioned above.4 By drawing on the field notes made during this process, we can go back to the beginning:
In early 2015, we were starting to gain momentum in the research project and were gradually establishing our roles as researchers and developers. One of the first suggestions that came up during an initial brainstorming session was to approach the issue of user profiling by investigating the relation between users’ self-categorization and their recommended content. As discussed in chapter 3, gender and age are the two demographic categories required to sign up for the Spotify service. Based on what we knew about gender and age-skewed artists,5 we were interested in how these categories corresponded to the kind of music that was recommended—and thus, how male and female users were constructed and treated in interaction with the software.
At the outset, we were unsure about how best to design such an investigation. Should the data be collected manually, by creating a few accounts and observing the results in line with an ethnographic tradition, or would it be better to use automated scripts? Inspired by earlier work on algorithmic auditing,6 and bearing the public interest in mind, we were interested in potential instances of algorithmic discrimination. We did not intend to cause harm to Spotify’s software system or to collect any personal data about Spotify users.
As researchers coming from fields primarily characterized by qualitative methods and social constructionist stances, we were well experienced in doing research based on close readings, digital ethnography, and rich, contextual data, but we had little previous experience of using quantitative scientific approaches. Notes from our first meetings reveal that we considered the manual approach to be familiar and potentially fruitful. At the same time, the idea of an automated setup was very appealing to us, as we imagined that automation would make it possible to collect structured data that, in turn, would allow more reliable comparisons between users of each gender.
Hence, after some discussion, we opted for the automated approach. We agreed that the developers at Humlab would design a script for running identical user accounts and retrieving their recommended content from the supposedly personalized Discover feature, and we were eager to start working with this type of rule-bound data collection. The developers repeatedly asked us how we wanted the data to be reported (in what format, according to what structure, etc.). Yet at the start of our project, this seemed like an abstract question to us. Furthermore, having been trained to collect empirical material in an open-ended fashion, we did not want to limit ourselves to certain forms before we knew where the study was heading. This meant, in essence, that we had no specific plan as to how we would organize the data after obtaining it.
The first step, once a structure of data capture had been developed, was to run a consistency test to analyze the stability of the system and learn more about the basics of Spotify’s recommendations. For instance, we wanted to explore what type of input was needed for recommended content to be displayed to users and whether identically registered bots with identical behaviors were also given identical recommendations. For this purpose, we used seventeen bots. Some of these bots played songs, others followed artists, and some did nothing at all. The resulting amount of data was quite small, and we managed to analyze it manually. It showed that streaming was indeed needed for recommendations to appear and that the bots that streamed the same songs also received the same recommendations. Satisfied to see that the system worked the way we had hoped, we set out to design a small pilot study with sixteen bots that were identical apart from their gender (half were registered as male users and half as female). The bots were divided into four music genre groups, meaning that two bots of each gender streamed the same tracks. This design, we thought, might indicate potential gender and genre differences in the recommendations, and depending on the results of the pilot, we would then proceed with an extended study.
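To give a concrete sense of what such a consistency check involves, the following minimal Python sketch is illustrative only, not the project's actual tooling. It assumes a hypothetical CSV export of the captured data with columns named bot_id, behavior, and recommended_artist, and simply asks whether bots that behaved identically also received identical sets of recommendations.

```python
# Minimal consistency-check sketch (illustrative only, not the project's code).
# Assumes a hypothetical CSV export with columns: bot_id, behavior, recommended_artist.
import csv
from collections import defaultdict


def load_captures(path):
    """Return each bot's behavior profile and its set of recommended artists."""
    recommendations = defaultdict(set)
    behaviors = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            recommendations[row["bot_id"]].add(row["recommended_artist"])
            behaviors[row["bot_id"]] = row["behavior"]
    return behaviors, recommendations


def consistency_report(path):
    """For each behavior profile, check whether all bots got identical recommendations."""
    behaviors, recommendations = load_captures(path)
    groups = defaultdict(list)
    for bot, behavior in behaviors.items():
        groups[behavior].append(bot)
    for behavior, bots in sorted(groups.items()):
        rec_sets = [recommendations[bot] for bot in bots]
        identical = all(s == rec_sets[0] for s in rec_sets)
        print(f"{behavior}: {len(bots)} bots, identical recommendations: {identical}")


if __name__ == "__main__":
    consistency_report("consistency_test_captures.csv")  # hypothetical file name
```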
The pilot study went on for five days, with data capture occurring twice a day. The results were presented to us in the form of spreadsheets as well as through an interface built specifically for the project, which enabled us to view the documentation for each individual bot and session but not compare them or aggregate data from several bots.

Figure 3.7
The interface used to facilitate research in some of the case studies.
Although captivated by the structured feel of the data presentation—where bot IDs, time stamps, input data, and output data were recorded—we were not able to get close enough to the data to detect any gendered patterns in the recommendations. Had the pilot study not been running long enough? Did we approach the data in the wrong way by only doing synchronic comparisons between recommendations provided at the same point in time? Or were there simply no gender differences—something that, we reassuringly told one another, would be an important result in itself? As we were beginning to experience frustration, one of the developers demonstrated some simple network visualizations of the data. This presentation enabled us to see that there were indeed variations in which artists were recommended to male and female registered bots—differences that we had not previously been able to identify manually and that were noticeable only when aggregating and comparing all recommendations for each gender.
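The kind of aggregation that made these variations visible can be approximated along the following lines. The sketch is again illustrative rather than the developers' actual tool; it assumes a hypothetical export with columns bot_id, bot_gender, and recommended_artist, and builds a simple bipartite bot-artist network with the networkx and matplotlib libraries, in which artists recommended to bots of only one registered gender stand out as one-sided clusters.

```python
# Illustrative bipartite bot-artist network (not the developers' actual tool).
# Assumes a hypothetical CSV export with columns: bot_id, bot_gender, recommended_artist.
import csv

import matplotlib.pyplot as plt
import networkx as nx

G = nx.Graph()
with open("pilot_captures.csv", newline="", encoding="utf-8") as f:  # hypothetical file
    for row in csv.DictReader(f):
        bot = f'{row["bot_gender"]}:{row["bot_id"]}'
        G.add_node(bot, kind="bot", gender=row["bot_gender"])
        G.add_node(row["recommended_artist"], kind="artist")
        G.add_edge(bot, row["recommended_artist"])

# Artists connected to bots of only one registered gender end up as
# one-sided clusters in the aggregated network.
positions = nx.spring_layout(G, seed=42)
node_colors = ["tab:blue" if G.nodes[n]["kind"] == "bot" else "tab:gray" for n in G]
nx.draw(G, positions, node_color=node_colors, node_size=40, with_labels=False)
plt.show()
```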
With these preliminary results in mind, we decided to scale up the test. Aware of the opacity and potential bias of algorithmic recommendations and the way these are subject to constant change, we wanted to confirm the results from the pilot study. Having embarked on a project that involved structured data collection and quantification of results, we also felt a simultaneous urge and duty to continue on this chosen methodological path. By using more bots and capturing more data, we assumed that we would generate more valid results. Thus, we eventually decided to use ninety-six bots for the next round. And instead of only targeting selected elements of the web client, such as the Discover recommendations in the pilot study, we now wanted to capture as much as possible. Notes and recordings from our meetings reveal how every nook and cranny of the client was perceived as a possible source of knowledge about music streaming: What if we missed something? Why not capture everything when we had the means to do so? We realized that this would result in quite a lot of data and that we had no definite plan for how to manage it, but we agreed that we’d cross that bridge when we came to it.
Thus, a system was set up where roughly fifteen thousand data points would be collected daily.7 Using both the web client and the Spotify application programming interface (API), we received screenshots, HTML documents, and logs of each instance of data capture. When one week’s worth of data had been collected, we found ourselves in the midst of a flood of data detailing different aspects of music recommendations. While we were thrilled, it also felt slightly unsettling. What now? What would we do with all of this information? How does one even begin to sift through all of these entries? We had a hundred thousand data points to consider, and we were not even well experienced in working with spreadsheets.
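For readers curious about what one slice of such capture can look like in practice, the sketch below polls two documented endpoints of Spotify's public Web API (featured playlists and new releases) and appends each time-stamped response to a log file. It is an assumption-laden illustration rather than the system Humlab built: the access token, country code, and file name are placeholders, and the personalized Discover feature discussed here was captured from the web client rather than through browse endpoints like these.

```python
# Rough illustration of periodic capture via Spotify's public Web API
# (not the actual capture system built for the project).
import json
import time

import requests

ACCESS_TOKEN = "..."  # placeholder: a valid OAuth access token is required
HEADERS = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
ENDPOINTS = {
    "featured_playlists": "https://api.spotify.com/v1/browse/featured-playlists",
    "new_releases": "https://api.spotify.com/v1/browse/new-releases",
}


def capture_once(country="SE"):
    """Fetch each browse endpoint once and return a time-stamped record."""
    record = {"captured_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())}
    for name, url in ENDPOINTS.items():
        response = requests.get(url, headers=HEADERS, params={"country": country})
        response.raise_for_status()
        record[name] = response.json()
    return record


if __name__ == "__main__":
    with open("capture_log.jsonl", "a", encoding="utf-8") as log:
        log.write(json.dumps(capture_once()) + "\n")
```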
In other words, we were immersed in data, but not in the ethnographic sense. The data lacked the richness and deep contextualization that we were used to, and the immersion did not help us get any closer to understanding the potentially gendered dimensions of Spotify’s recommendations. And while our notes, reflections, and documentation from project meetings provided a much-needed context to our methodological decisions, they did not shed light on our actual research question. At the same time, we felt truly excited that we had managed to extract such tremendous loads of data. In a meeting with the project team, we discussed how smoothly the process of data collection had run and that “there’s simply too much damn data” for us to work with.8 A few of us promised to look further into it and get back to the team a week later with some initial observations.
Because we realized that we could not review this mass of data from the existing interface, we asked the developers to design a tool for comparing input data between bots. With this solution in our hands, we sat down together and began scrolling through the entry list in the drop-down menu. Even though we had known from the start that we would end up with a large amount of data, the seeming endlessness of the list was almost shocking. Slightly overwhelmed by the sheer mass of records, we collected ourselves enough to start running comparisons between male and female registered bots in each of the four music genres. As playlist recommendations popped up on our screen, highlighted by bold colors, we gradually noted that they only referred to the country-specific Featured Playlists and Genres & Moods features. The specific recommendation category that we were interested in, Spotify’s supposedly personalized Discover feature, did not show up on the screen at all. Perplexed by this fact, we double-checked it with the developers, who confirmed our worries. There were simply no Discover recommendations in the huge database.
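The check that reveals this absence can be sketched in a few lines. As before, the column names and file name are assumptions rather than the project's actual schema; the point is simply to tally captured entries per recommendation feature and see whether any Discover items exist at all.

```python
# Sketch of a per-feature tally over the captured data (illustrative only).
# Assumes a hypothetical CSV export with a "feature" column naming the part of
# the client each entry came from (e.g., "Featured Playlists", "Discover").
import csv
from collections import Counter

feature_counts = Counter()
with open("full_run_captures.csv", newline="", encoding="utf-8") as f:  # hypothetical file
    for row in csv.DictReader(f):
        feature_counts[row["feature"]] += 1

for feature, count in feature_counts.most_common():
    print(f"{feature}: {count} entries")

print("Discover entries:", feature_counts.get("Discover", 0))
```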
This realization, then, takes us back to the opening quote of this section. Despite our successful capture of substantial amounts of data, we still lacked the specific data points that we were after. More precisely, none of our ninety-six bots had received any personalized recommendations during the course of data collection—not one single bot! We could track the greetings they had received at different times during the week, as well as new album releases and the prepackaged playlists delivered to them through the Genres & Moods feature. But we already knew that these elements were broadcast to larger populations and not tailored to individuals. This was not the data needed to explore patterns of gendering. The data that lay before us was useless for our purposes.
This insight left us disappointed, of course, but we were also puzzled as to why the Discover recommendations were absent in the first place. Trying to figure out where things had gone wrong, we speculated together with the developers: Had we somehow accidentally managed to create a setup that made the bots immune to user profiling? Could the lack of personalized recommendations be related to the weeks that passed between registration of the bot accounts and implementation of the actual test? Perhaps Spotify had categorized our bots as lazy and unworthy of recommendations because they didn’t use the service immediately after registration? Was it possible that the service had identified our bots as bots and therefore blocked their recommendations? But then again, why not block them from the service as a whole if that was the case? Perhaps our bots had simply not streamed enough music? Even if ten played songs had been sufficient to generate recommendations in the pilot study, Spotify could have changed its requirements so that the study was no longer reproducible. A thousand theories like these were running through our heads as we traced the historical details of each data capture, making sure that our bots had behaved as intended.
For comparison, we then ran a series of manual tests using fresh Spotify accounts. Here, to our great surprise, recommendations appeared shortly after streaming began, sometimes after as few as two played songs and never after more than eleven streamed tracks. It did not make sense. If that was the approximate range of streams needed, then at least some of our bots—having each streamed ten tracks on seven separate occasions—should have received some recommendations. Annoyed by this result, we repeated the automated setup with sixteen gendered bots, letting them stream songs for yet another week. During this period, personalized recommendations finally began to roll out for some of the bots, though not in any coherent manner. Instead, there were quite large variations between the bots as to the number of streams needed to generate recommendations. Was it all simply a matter of performance on the part of the service, with recommendations being partially scheduled to avoid overloading the systems? Frustrated by the lack of consistency, we found ourselves struggling to find a rational—and, in particular, causal—explanation for what seemed to be entirely random patterns of recommendations.
Without any clear answer, we began to think of the gender case as an outright methodological failure. Scrutinizing our setup, we also started doubting if the number of bots was sufficient for answering our initial research question. We had not fully assessed different ways of approaching the data—for instance, whether it would be useful to have an in-depth, interpretative look at the musical content delivered to our bots—but simply found ourselves oriented toward statistical comparison ever since deciding on the automated setup. Having initiated this automated approach, we felt as if we were also compelled to engage in a form of knowledge production that required quantification, and we began to think that the bots might be too few to generate any valid results. To address this problem, we scaled up the test to 144 bots of each gender, which was the limit of what our systems could handle. This time, we let it run until we had made sure that all bots received some personalized recommendations. Finally, it all proceeded without major difficulties. (The results, by the way, did not indicate any significant differences in how male and female users were treated by the client.9)
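As an illustration of what such a comparison might involve (the specific analysis used in the project is not reproduced here), one conventional option would be a chi-square test on a contingency table recording how often each artist was recommended to male- versus female-registered bots, sketched below under the same hypothetical schema as the earlier examples.

```python
# One conventional way to test for gendered differences in recommendations
# (illustrative only; not the analysis actually carried out in the project).
import csv
from collections import Counter

from scipy.stats import chi2_contingency

counts = {"male": Counter(), "female": Counter()}
with open("final_run_captures.csv", newline="", encoding="utf-8") as f:  # hypothetical file
    for row in csv.DictReader(f):
        counts[row["bot_gender"]][row["recommended_artist"]] += 1

artists = sorted(set(counts["male"]) | set(counts["female"]))
table = [
    [counts["male"][artist] for artist in artists],
    [counts["female"][artist] for artist in artists],
]
chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p_value:.3f}")
```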
So, why are we telling this story of an initially failed case study? It is not because of some masochistic desire to put our mistakes and ignorance on display but because we want to intervene with our own assumptions and stress the need to reflect on how research could always have been done otherwise. Our narrative describes the complications of trying to capture the outcomes of a constantly changing system, as well as the ways in which digital skills and disciplinary backgrounds on the part of researchers may impede (or facilitate) such endeavors. Furthermore, it sheds light on how scientific identities and methodologies are not only bound up with different ontologies and epistemologies but also “built around emotionally charged constructs” or fantasies.10
By fantasy, we refer to Jason Glynos’s Lacanian use of the term as a way of critically explaining why people become invested in certain discourses. More specifically, Glynos describes “a narrative structure involving some reference to an idealized scenario promising an imaginary fullness or wholeness (the beatific side of fantasy) and, by implication, a disaster scenario (the horrific side of fantasy).”11 From an ethnographic point of view, as Anna Johansson and Anna Sofia Lundgren demonstrate, the use of digital technology for research purposes has historically been construed both in terms of an imaginary fullness and a threatening disaster: the former because it provides ethnography with methodological rigor, order, and distance, and the latter because it might be seen to threaten traditional values in ethnographic research, such as empirical immersion, creativity, and personal involvement. Such fantasies of digital technology thus build on a longer history of perceived oppositions between technology and culture—oppositions that tie in with persistent tensions between explicit and less explicit methodologies; between computers as orderly and fieldwork as messy; between computer-driven and human-driven modes of analysis; between quantitative and qualitative analysis; and between distance and closeness to data.12 In other words, while digital technology has sometimes been conceptualized as a threat to the foundational values of qualitative research, it is also linked to dominant notions of objectivity and “scientificity.”13 Engaging with computational methods as an ethnographer—or, in a broader sense, as a humanistic or social science scholar—hence implies that one is drawn into (and possibly subordinated by) already established power relations between different scientific epistemologies.
In hindsight, our methodological choices and experiences during the experiment on Spotify’s gendered recommendations were—at least to some extent—structured precisely by a beatific fantasy of digital technology. Our first tests did not include a large number of bots; manual observations and collection of data in this context would indeed have been possible in practice. Still, we opted for the automated setup, having great hopes for what it would help us achieve. Our hopes were fueled by the idea that automated bot methods would provide the rigor and systematicness necessary for a scientifically valid analysis of algorithmic effects—an idea that was, in turn, related to unspoken assumptions about digital data as being possible to master through standardization (and, at a later stage, quantification).
Jason Glynos’s notion of fantasy involves a promise of “imaginary fullness,” but as this fullness is inherently unattainable, any fantasy also involves the construction of “obstacles” that explain why fullness is not yet achieved.14 In the gender case, our choice of methods seems to have been guided by a fantasy in which digital technology symbolized a fullness to come in the form of a quantified and well-organized scientificity. However, as demonstrated in our narration of the process, the paths and detours taken en route to the end result proved to be far from the systematic and well-ordered scientific endeavor we had imagined. The frustration and unease that we experienced during this process were caused not only by the actual system failures—or even by our own troubles in making sense of the large amounts of data. Rather, our ambivalent feelings toward our methodological approach were related to our initial fantasy of what the bot setup would help us achieve: a systematic and well-ordered analysis of structured and easily comparable data.
Invested in this fantasy, we were oblivious to the fact that our frustration and perceived methodological failures were indeed reminiscent of how qualitative, and especially ethnographic, research is always expected to unfold: as an unpredictable, open-ended, and messy process. Instead of recognizing this complexity as a valuable aspect of the research setup—an aspect that could tell us something about the incoherent and constantly shifting ways in which algorithmic systems work—we perceived it as an obstacle to attaining a certain form of desired scientificity. By discussing these methodological slips and scrutinizing our own decisions and experiences along the way, we want to remind ourselves and others of the need for reflexivity throughout the research process. We need to account for not only the methodological and interpretative choices made along the way but the subjective “desires, identifications, and investments”15 that permeate any research process and its actual outcomes.