10
Retrieving

Carolin Gerlitz

Introduction

Digital and social media have opened up new avenues for data collection about social, cultural and political life. Since the advent of digital online media, their data have been of interest to a range of disciplines which have approached digital data with their respective questions and methodologies. Within sociology, online and social media data initially inspired the hope that these new data formats – unlike traditional sociological methods such as interviews or questionnaires – might not be ‘contaminated’ by the interferences of researchers and their methodologies (Savage and Burrows 2009). Engaging with the making and technicity of digital and social media data, however, soon confronted researchers with multiple inscriptions of media, methods and their tools (Ruppert, Law and Savage 2013). Various methodological approaches to access such data preformatted by media have emerged in the context of media and communication studies, sociology and Science and Technology Studies (STS). In the field of digital research methods, digital media are treated as research devices, capable of structured data production (Rogers 2013; Weltevrede 2015). STS have embraced digital data as defined by the actors themselves rather than by researchers (Callon 2006), and contributions from the field of Actor Network Theory (Latour et al. 2012) have pointed out that digital data offer both a granular view on individual actions and an aggregated overview, allowing the possibility to cut across the micro/macro distinction that has been so central to sociological debates.

Across disciplines, different ways of accessing data from online and social media have emerged, most notably scraping, that is the extraction of preformatted data from user interfaces (Marres and Weltevrede 2013), but also retrieval, the extraction of data via application programming interfaces (APIs) offered by digital and social media platforms. APIs are software interfaces that enable researchers and other third parties to connect to associated databases in order to produce content for, or extract data from, platforms. This access is usually highly structured, standardized and regulated by the associated platform, offering data access to developers, business partners and researchers among others. As scraping is limited by what can be extracted from user interfaces, retrieving has gained increasing relevance in the context of ever-growing volumes of data.

Interdisciplinary methodological debates have drawn attention to the various inscriptions at stake when working with digital and social media data and have attended to possible bias built into the data by the media that pre-structure them for their own purposes (Marres and Gerlitz 2016). Twitter data, for instance, may well be used to study public debates, but is originally structured according to Twitter’s own valuation logic, which largely focuses on identifying popularity and trending topics. Data retrieval through APIs has emerged as an interdisciplinary methodological approach which enables access to such actor-defined data, allowing researchers to attune their methods to their research object. However, when tracing the inscriptions of retrieved data, it becomes apparent that retrieval not only confronts researchers with the self-categorization of the medium that provides APIs in the first place, but with a larger cascade of inscriptions, as the data accessible from one platform might not necessarily have their origin in that same platform. Retrieval, this chapter suggests, poses particular challenges to inventive methods that seek to account for the ongoing happening of social life (Lury and Wakeford 2012) while attending to the mutual inscription of method and problem. In the context of complex data ecologies and interoperability between platforms, the question emerges as to what data retrieval makes accessible in the first place and to which inscription or bias methods need to attune. To account for the inventiveness in retrieving, we need to attend to its technicity, namely APIs, first.

Application programming interfaces and the grammatization of data

Many digital and social media platforms offer (a variety of) APIs to build upon, produce content or extract data, enabling structured access to platform databases. In the case of Twitter, for example, the social media platform offers a so-called REST API for discrete queries, a Streaming API for continued data-capture in real time and an Advertising API.1 The data available via APIs can be considered pre-structured on many levels. On a first level, they are the result of standardized platform activities – or grammars of action, to draw on the work of Philip Agre (1994) – which enable users to act in particular ways and platforms to instantly capture data about these actions in standardized form. In the context of Twitter, these grammars focus on organizing user relations (friending, following, muting), comments, likes, retweets, status updates and posts among others, as well as their metadata. The grammatization of user action in the front-end is met with another layer of grammatization in the back-end in the form of API commands and regulations that determine which data can be accessed by whom in what quantities. To remain with the case of Twitter, API access is organized through OAuth,1 a personalized access token for third-party API access. Once access is granted, input to or retrieval from the database is pre-structured through the platform’s developer-facing grammars. In the case of Twitter’s REST API, grammars are organized around query-related GET commands, which retrieve data, and activity-focused POST commands, which allow the API user to post content. These commands largely mirror the grammars of front ends, but also offer additional data and are policed through extensive documentation, good practice cases and Twitter’s rules of conduct.
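This developer-facing grammar can be illustrated with a minimal sketch. The endpoint and parameter names follow the REST API (v1.1) as documented at dev.twitter.com during the period discussed here; the bearer token is a placeholder, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode

# Base URL of Twitter's REST API (v1.1); the Streaming API
# used a separate host and a persistent connection instead.
REST_BASE = "https://api.twitter.com/1.1"

def build_search_request(query, count=100, token="YOUR-OAUTH-BEARER-TOKEN"):
    """Construct (but do not send) a GET request against the
    search/tweets endpoint. The parameter names mirror the
    platform's developer-facing grammar: what can be asked for,
    and in what quantities, is fixed in advance."""
    params = urlencode({"q": query, "count": count, "result_type": "recent"})
    url = f"{REST_BASE}/search/tweets.json?{params}"
    # Access is granted via an OAuth token passed in the request headers.
    headers = {"Authorization": f"Bearer {token}"}
    return url, headers

url, headers = build_search_request("#example", count=10)
```

The point of the sketch is that every element of the request – endpoint, query parameter, result count, access token – is defined by the platform, not the researcher.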

Looking at data retrieval through the lens of grammatization brings to attention that organizing data into categories and units is increasingly distributed, at least between the researcher, users realizing these grammars and the media, as platforms define what data formats users can generate and retrieve through APIs. This redistribution of ordering capacities away from the researcher towards media platforms was initially perceived as the promise of transactional digital data resulting from the direct capture of user activities. Following Callon, ‘One way of testing the relevance and robustness of a proposed categorization is to allow the entities studied to participate in the enterprise of classification’ (2006: 8). Indeed, the grammatization of API data suggests that researchers should attune their methods to the categorizations, inferences and objections made by the medium.

Retrieval and realism

However, by attending only to the categorizations suggested by a medium, data research enabled through API retrieval runs the risk of re-enacting a specific form of realism. If we follow Alain Desrosieres (2001), there are different degrees of data realism: metrological realism, which assumes an unproblematic relationship between the world and its measure; accounting realism, which establishes the trustworthiness of metrics through standardized practices; and proof-in-use realism, in which realities are defined by the databases that promise to describe them, while little attention is paid to how the data are made, captured and animated.

For users in this third group, ‘reality’ is nothing more than the database to which they have access. Normally, such users do not want to (or cannot) know what happened before the data entered the base. They want to be able to trust the ‘source’ (here the database) as blindly as possible to make their arguments.

Desrosieres 2001: 346

Such trust in a data source is akin to the hopes and phantasies characteristic of those early advocates of big data debates who claimed digital transactional data would be more ‘raw’ than qualitative research data. Despite considerable caution being expressed about this approach, such proof-in-use realism may still inform today’s investment in data as pre-structured by or specific to a medium. Desrosieres’ analysis suggests that while the increasing valorization of transactional and actor-categorized data in the context of social media research might initially have been driven by an interest in attuning methods to objects, it might yet end up contributing to a proof-in-use realism if it remains inattentive to the question of how platform databases actually count, capture and compose data in the first place. To understand what digital data is animated by and which other actors participate in its categorization, it is relevant to attend to the wider infrastructures in which retrieval operates.

The ecosystem of data retrieval

APIs not only lend themselves to data retrieval, but also enable a variety of third parties to build on top and produce content for respective platforms, allowing a variety of actors to participate in data production. Let us return to the case of Twitter. Since its launch in 2006, Twitter has offered APIs for developers to extract and input data. The POST commands have resulted in a proliferating ecosystem of third-party Twitter clients, sources and access points, which allow Twitter users to engage with the content and grammars of Twitter via alternative interfaces (Gerlitz and Rieder 2014). Among these access points are Twitter-specific clients concerned with the de- and re-composition of topical streams by offering multiple timelines, professional clients focused on team tweeting, follower growth, journalistic or marketing practices such as Hootsuite, as well as automators such as If This Then That and cross-syndication apps that allow the sharing of content of one platform with another. Each of these clients is built on Twitter’s front-end and back-end grammars, but they also extend them, as they are informed by different ideas of ‘being on Twitter’ (Gerlitz and Rieder 2014). Such clients not only provide alternative interfaces to Twitter grammars, they also might come with a re-interpretation or expansion of these grammars. Take the example of Twitter’s previous favourite and current like button: while some third-party apps interpreted favourites as a means to bookmark and save tweets into collections, others treated them as signs of appreciation and collected favourites received into rankings of popularity (Passmann and Gerlitz 2014). Such divergent interpretations of the same action enabled even more activities to fold into the same grammar and thus the same data point. In doing so, third-party clients contributed to realizing the ‘interpretative flexibility’ of platform grammars (Bijker, Hughes and Pinch 1987), which may come fixed in form, but offer users a certain flexibility to define what a tweet, a favourite/like or a @reply stands for.

In addition, data retrieval has to face a third layer of grammatization, as clients are not only able to reinterpret grammars, but are also able to fold the data of one platform into the grammars of another platform. In the case of cross-platform syndication, that is the automatic posting of content from one platform to another, the grammars of one platform (hashtags/posts/images on Instagram or Facebook) are transposed into the grammars of another platform (Twitter). The data researchers retrieve from the Twitter API might seem comparable and countable through their standardized forms, but might not even have been created for Twitter in the first place: they may have been cross-syndicated from Facebook, transformed from RSS feeds or automatically created from news postings. Is a hashtag produced within Twitter’s web interface comparable to a hashtag cross-syndicated from Instagram or a hashtag automatically selected through software? The grammatization of a platform suggests that the data units it generates are comparable if not similar entities, while at the same time creating the conditions for third parties and users to fold heterogeneous interpretations of these grammars into the platform.
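What such a cross-syndication app does can be sketched as follows. The function, its name and the sample post are hypothetical, invented for illustration; the 140-character limit was Twitter’s tweet grammar at the time.

```python
# Hypothetical sketch of cross-platform syndication: the grammars of one
# platform (an Instagram-style post with a caption and hashtags) are
# transposed into the grammars of another (a tweet of at most 140
# characters, the limit at the time, with a link back to the source).

def syndicate_to_tweet(caption, hashtags, permalink, limit=140):
    """Fold a post from one platform into another platform's grammar."""
    tags = " ".join(f"#{t}" for t in hashtags)
    text = f"{caption} {tags}".strip()
    suffix = f" {permalink}"
    # Truncate so that the caption plus the link fits the tweet grammar.
    if len(text) + len(suffix) > limit:
        text = text[: limit - len(suffix) - 1] + "…"
    return text + suffix

tweet = syndicate_to_tweet(
    "Morning light over the harbour",     # invented sample caption
    ["nofilter", "harbour"],              # invented sample hashtags
    "https://instagr.am/p/abc123",        # invented sample permalink
)
```

The resulting tweet carries hashtags that were never typed into Twitter, which is precisely why a hashtag retrieved from the Twitter API cannot simply be assumed to be a Twitter-native unit.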

Lively metrics

Looking at the proliferating client-ecosystem thus challenges a proof-in-use realism of retrieved data by leading us to ask: what do we actually count when retrieving data through APIs? The activity of categorizing is not only distributed between the researcher and the platform, but is realized through users, their practices and interpretations, third-party clients and cross-platform syndication. What API data retrieval gives access to, therefore, are ‘lively metrics’, that is data categories that are internally dynamic, situated, localized and alive. Their liveliness – as opposed to mere currency or liveness (Marres and Weltevrede 2013) – refers to the multiple ways in which platform grammars can be realized and interpreted. It is hence not only the platform that categorizes and grammatizes the data that can be retrieved; the lively metrics available via APIs are animated by the entire ecosystem of users, practices and clients. Hence, the moment researchers retrieve data through APIs, the data have already been pre-composed in dynamic, local and distributed ways. Or, put the other way around, expanding Agre’s work in the context of social media platforms, grammatization not only enables capture, but establishes avenues for new, dynamic and thus lively forms of data composition, which are often made invisible through standardized data and its retrieval infrastructures.

This opens up new avenues for inventive methods that seek to let objects pose their own problems (Lury and Wakeford 2012).

  1. Lively metrics refuse a single interpretation. Aggregated data units provided through API data retrieval are not comparable from the outset, but need to be made comparable through additional interpretation of the wider ecosystem of actors and practices in which the data are produced. The retrieval of pre-structured data is thus not a discrete process but invites an attentiveness as to how the capture and composition of data are entangled on many levels.
  2. Retrieving data from a single platform means working with data from a multiplicity of sources. The proliferation of clients and cross-platform syndication allows the grammar of one medium to fold into the grammar of another. API retrieval thus gives access to data formats that are themselves already composed as distributed accomplishments. To attune methods to the data retrieved requires the researcher to move beyond a single medium perspective and advance the notion of medium-specificity (Rogers 2013) to include the distributed ecologies of platforms.
  3. Retrieving lively data confronts researchers with the insight that it is not only social life that can be regarded as ‘happening’ (Lury and Wakeford 2012), but data, their capture and composition, are equally subject to such happening. The process of retrieval contributes to the happening of the categorization of data, as it creates specially composed samples.
  4. The liveliness of metrics should not be addressed as a matter of data cleaning but be part of the quest to engage with the messiness and internal heterogeneity of data. Rather than seeking to retain only data formats that are fully comparable and rely on the same interpretation of grammars, data retrieval asks us to attend to the wider dynamics through which data formats are animated. Many platform APIs offer cues for such approaches, as they allow researchers to retrieve the source or client from which platform data was produced in the first place (Gerlitz and Rieder 2014).
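The fourth point can be sketched concretely. Twitter’s REST API returned, for each tweet, a ‘source’ field identifying the client from which it was posted, delivered at the time as an HTML anchor string; the sample tweets below are invented for illustration.

```python
import re
from collections import Counter

def client_name(source_html):
    """Extract the client name from a tweet's 'source' field, which
    the Twitter REST API delivered as an HTML anchor string, e.g.
    '<a href="http://instagram.com" rel="nofollow">Instagram</a>'."""
    match = re.search(r">([^<]+)</a>", source_html)
    return match.group(1) if match else source_html

# Invented sample of retrieved tweets, reduced to the 'source' field.
sample = [
    {"source": '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>'},
    {"source": '<a href="http://instagram.com" rel="nofollow">Instagram</a>'},
    {"source": '<a href="http://instagram.com" rel="nofollow">Instagram</a>'},
    {"source": '<a href="https://hootsuite.com" rel="nofollow">Hootsuite</a>'},
]

# Count which clients animated the retrieved data, rather than
# treating all tweets as equal, interchangeable units.
clients = Counter(client_name(t["source"]) for t in sample)
```

Grouping retrieved data by client in this way keeps the heterogeneity of sources visible in the analysis instead of cleaning it away.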

As data retrieval through APIs is enabled by platforms that seek to open themselves to multiple stakeholders, retrieval should be expanded to capture these relations, foldings and distributed accomplishments of grammars. Engaging with the assembly of capture and composition at stake in platform data suggests that data retrieval has more in common with established methods such as interviews and surveys than with big data hopes for raw and unfiltered transactional data, as the data come with cascades of inscriptions. On the other hand, data retrieval confronts researchers with an even more distributed process of categorization, objection and inscription, which is made partly invisible by data infrastructures and which expands beyond the data source and the researcher to involve all of a platform’s stakeholders. In the context of platform media, data categorization itself can be understood as happening, and inventive retrieval infrastructures can be called to account for its liveliness.

Note

1 https://dev.twitter.com/overview/documentation

References

Agre, P. E. (1994). Surveillance and capture: two models of privacy. The Information Society, 10(2): 101–127.

Bijker, W. E., Hughes, T. P. and Pinch, T. J. (1987). The Social Construction of Technological Systems: New Directions in the Sociology and History of Technology. Cambridge, MA: MIT Press.

Callon, M. (2006). Can methods for analysing large numbers organize a productive dialogue with the actors they study? European Management Review, 3(1): 7–16.

Desrosieres, A. (2001). How real are statistics? Four possible attitudes. Social Research, 68(2): 339–355.

Gerlitz, C. and Rieder, B. (2014). Tweets Are Not Created Equal. Intersecting Devices in the 1% Sample. Presentation at the AoIR conference, Daegu, South Korea.

Latour, B. et al. (2012). ‘The whole is always smaller than its parts’: a digital test of Gabriel Tardes’ monads. The British Journal of Sociology, 63(4): 590–615.

Lury, C. and Wakeford, N. (2012). Inventive Methods: The Happening of the Social. London: Routledge.

Marres, N. and Gerlitz, C. (2016). Interface methods: renegotiating relations between digital social research, STS and sociology. The Sociological Review, 64(1): 21–46.

Marres, N. and Weltevrede, E. (2013). Scraping the social? Journal of Cultural Economy, 6(3): 313–335.

Passmann, J. and Gerlitz, C. (2014). ‘Good’ platform political reasons for ‘bad’ platform data. Zur sozio-technischen Geschichte der Plattformaktivitäten Fav, Retweet und Like. Datenkritik. Retrieved March 2018 from: www.medialekontrolle.de/wp-content/uploads/2014/09/Passmann-Johannes-Gerlitz-Carolin-2014-03-01.pdf

Rogers, R. (2013). Digital Methods. Cambridge, MA: The MIT Press.

Ruppert, E., Law, J. and Savage, M. (2013). Reassembling social science methods: the challenge of digital devices. Theory, Culture & Society, 30(4): 22–46.

Savage, M. and Burrows, R. (2009). Some further reflections on the coming crisis of empirical sociology. Sociology, 43(4): 762–772.

Weltevrede, E. (2015). Repurposing Digital Methods: The Research Affordances of Platforms and Engines. PhD dissertation, University of Amsterdam.