We can think of an algorithm as a machine with an established function or series of functions. Once an algorithm has data, and some parameters with which to relate and classify that data, it can be set to make decisions and to learn how to make better ones. For example, if a business decides to use an algorithm to choose the best-qualified individual for a job, it may wish to set some guidelines for what a good employee looks like and let the algorithm identify the prime candidates.
This applies to other sectors too. With all the data we now have, we can train machines to learn to make decisions and to carry out whatever task they are set clinically, objectively and without prejudice or bias.
Data is collected at every moment of our lives: through our smartphones, when we touch our cards on public transport, through all the connected devices we use and as we browse the web. Recent estimates suggest that humanity generates around 2.5 quintillion bytes of data every day12 – enough, if printed, to create a pile of paper that would stretch nearly one and a quarter times around the earth.13
Let’s take the example of a fridge. Until recently, a fridge was a pretty simple analogue device: a tool to store food and keep it cold. Some have now become ‘smart’, which means they monitor what we buy, like and consume. That data can then be used for marketing, to send us specific products and recipes and to monitor our eating habits. There are, almost invariably, unforeseen consequences of technological evolution, and recently a smart fridge was used by a teenager to tweet, as her parents had confiscated her mobile phone.14
The collection and analysis of this huge quantity of data is possible because of the increased computing power we now have, which is only going to increase still further with the 5G network coming into our lives.
Before we look at the politics of AI, it is necessary to understand the politics of data. I want to challenge the basic idea that data can inform better decisions because of some inherent integrity within it. The concept is that, because data mirrors society and reflects it honestly, the data must be ‘true’, and the AI applied to it will be able to read it and so drive our policies.
A delegate of a national education system at an international meeting recently told me there was ‘no need for judgement, just use data!’ This concept is what I call the ‘sanctity of data’, and it is a dangerous one, especially for women and the most vulnerable in our society.
Data is not neutral, and the fact that we collect a huge amount of it brings many challenges – not just from the standpoint of privacy but also from the standpoint of power dynamics.
What we must never forget is that data is not simply information. It is our public and private desires, our likes on Facebook, every purchase made, and every opinion expressed. These are all expressions of ourselves: our identities, our personalities. Data is, essentially, what constitutes us as human beings, and it should be treated as such.
I first became interested in data because I was alarmed at the proliferation of CCTV cameras in the United States, where I was studying as an exchange student in 1996. I remember feeling oppressed by the sense of being watched at all times, but even then it was more than that. It felt more like being invaded.
This commodification of our personalities into raw material is what underpins the data economy, which drives the collection of information during every moment of our daily experience. We have blindly accepted that we are being studied and analysed to train machines; that our behaviours become fuel for algorithms so they can serve us the adverts they think we will buy into; that our consumption of energy at home, which reflects the way we live our lives, is analysed, allegedly to offer us the best service.
We have internalized the idea that there is nothing more objective, more neutral, more informative and more efficient than data. This is misleading. When an algorithm is fed data, a decision has already been made. Someone has already decided that some data should be chosen, and other data should not. And if data is, in reality, people, then some of us are being selected while others are being silenced.
We can see this play out in the medical field. Research into cardiovascular disease has progressed rapidly over the past few decades, but it was only very recently that it emerged that the symptoms of heart disease in women are different from those in men, which has had very serious consequences for the prevention and detection of illness in half the population. Why did this happen so late? Simple: because, until then, most of the research was focused on men and used men’s bodies and data. Another example is endometriosis. As reported by Huntington and Gilmour in 2005,
The time period between initially seeking medical help to a diagnosis being made typically took 5–10 years… Characteristically, diagnosis was a time of relief […] after years of having experiences negated by medical authorities and being told the pain was a normal part of menstruation. Women’s feelings that their pain was being dismissed as imaginary have also been noted in other studies.15
Endometriosis is common. About 10 per cent of women suffer from it. And yet, according to Morassutto et al. (2016), it’s quite possible that about six out of ten cases still go undiagnosed.16 In reality, women know that pretty much any pathology that disproportionately affects them follows similar patterns: misdiagnosis, lack of understanding, dismissal, mistreatment.
To understand why this is happening, we need to look at the non-neutrality of data: medical research is still a male-dominated field, where men overwhelmingly decide what to study and what not to study, thus informing which data is collected and which data is excluded.17
Choosing data to train algorithms means making a choice about which individuals will form the data set, the consequences of which can be profound and pervasive. Any sense of objectivity here is wholly illusory. Think about surveys. To collect data around victims of a particular crime, we would get a far more accurate (and higher) figure by asking victims anonymously than by asking local police forces to report the figures voluntarily. This is because a lot of crime doesn’t get reported – especially domestic abuse. And what is reported may not give a true picture of crime in an area, which causes problems if the data is uncritically used to inform decisions about the allocation of resources. Consider how those resource allocations and policy decisions might shape the way a local authority responds to and supports women suffering from domestic abuse.
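To make the point concrete, here is a minimal sketch with entirely invented figures (the incident counts and reporting rates are assumptions, chosen purely for illustration): two areas compared on recorded crime alone.

```python
# Illustrative only: invented figures showing how under-reporting can
# invert a comparison between two areas when we rely on recorded crime.

areas = {
    # area: (actual incidents per year, share of incidents reported to police)
    "Area A": (1000, 0.8),   # most incidents are reported
    "Area B": (1500, 0.4),   # more incidents, but far fewer are reported
}

for area, (actual, reporting_rate) in areas.items():
    recorded = actual * reporting_rate
    print(f"{area}: actual={actual}, recorded={recorded:.0f}")

# Output:
#   Area A: actual=1000, recorded=800
#   Area B: actual=1500, recorded=600
# On recorded figures alone, Area A appears to need more resources,
# even though Area B has 50 per cent more actual incidents.
```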
The choice about which data sets are studied is being made by people. It is a subjective decision, and a political one. Each individual, once entered into a data set, becomes part of a new transaction between them and the unseen force that has put them into it, and has used that data set to train an algorithm and ultimately make a decision about them.
This represents an asymmetry of power, and this asymmetry – the outcome of choice and power – is what underpins the politics of data and, ultimately, the data economy. The data economy is political at every level, not least because some organizations hold a huge amount of power over others by deciding who gets onto a data set, and who is left out, a decision that may have far-reaching implications.
I often talk about data violence, and that is because I see an intrinsic violence in choosing or disregarding data. It is a new form of violence, a new way of silencing people – and an insidious one; not only because it is more subtle and less understood than others, but because it is uncontested.
British activist and journalist Caroline Criado-Perez, in her book Invisible Women, shows the sometimes fatal consequences of ignoring or excluding data about women, from healthcare to industrial design to urban planning.18 Her examples are deeply shocking, and range from Siri, who could give you information about Viagra, but wouldn’t know where to find the nearest abortion clinic, to the way we have designed our cities to serve the mythical male breadwinner with a home, wife and kids in the suburbs.
Choices about data are a matter of power, and this power dynamic needs to be interrogated because it lies at the very heart of the model of our data-driven society. We have already seen how diseases mostly affecting women have been less studied than others, with much less data available about them. If we look again at how algorithms work, we can take this argument to a greater and graver level.
Algorithms need to be fed data, but because data, and the choices around it, reflect current societal structures, the output of the algorithm is likely to be the product of those same societal structures. With algorithms now being used for everything from determining access to loans to deciding who should be released from jail, who should receive an immigration visa, who should be hired and who should be assigned social housing, this issue is becoming ever more important.
And, indeed, a lot of literature has been produced recently under the heading of algorithmic bias: namely that, because algorithmic predictions are built on historic data (that is, on past decisions made by humans, which are incomplete and, as we have seen, mirror structural dynamics in our society), their output is likely to be biased too.
At a recent AI conference in London, I heard Evanna Hu, CEO of tech company Omelas, describing three types of bias.
The first she called ‘pre-existing bias’, and that is bias that emerges straight out of existing data. That is what happened with Word2vec, a system built when Google researchers trained a neural network on Google News text, producing vectors for a vocabulary of around 3 million words and phrases. The network’s goal was to look for patterns in the way words appear next to each other, and the product was powerful. For example, if you typed ‘France: Paris as Japan: x’ the system would return the word ‘Tokyo’.
But in 2016 researchers probing the model found it was also remarkably chauvinistic. If you asked it ‘father: doctor as mother: x’ it would say ‘x = nurse’. And the query ‘man: computer programmer as woman: x’ gave ‘x = homemaker’. The reason was that the data came from the internet, with all the historical patterns it contained.
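For readers who want to see how such analogy queries are posed in practice, the sketch below uses the open-source gensim library with the publicly released Google News vectors. The file name is an assumption, and the exact words returned depend on the model you load, so treat it as illustrative rather than a reproduction of the original experiments.

```python
# Sketch of analogy queries against pretrained word vectors, using gensim.
# Assumes the publicly released Google News Word2vec file has been
# downloaded; the neighbours returned depend on the model version loaded.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True
)

# 'France: Paris as Japan: x'  ->  vector(Paris) - vector(France) + vector(Japan)
print(vectors.most_similar(positive=["Paris", "Japan"], negative=["France"], topn=1))

# 'man: computer programmer as woman: x'
print(vectors.most_similar(
    positive=["computer_programmer", "woman"], negative=["man"], topn=1
))
```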
The second type of bias Hu defined as ‘technical bias’, something that Virginia Eubanks, Associate Professor of Political Science at the University at Albany, discusses in her book Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor. She gives the example of the Pennsylvania child triage system, a statistical model supposedly able to predict which children might in the future be victims of abuse or neglect. Predictive models like this use statistics to forecast which parents might maltreat their children,19 but the data serving as their foundation was collected only on families that use public programs, leading to high-tech risk-detection systems that confuse parenting while poor with poor parenting.20
The third kind of bias Hu called ‘emerging bias’, something that happened with Microsoft’s Tay. TayTweets was an account controlled by AI, designed to learn from conversations on social media. What happened, though, is that Tay soon started to put out highly offensive content, including, ‘I fucking hate feminists and they should all die and burn in hell.’21 Tay was the victim of a coordinated attack, with lots of people tweeting awful things at it, and that is where it learned its language. What is disappointing, however, is that neither Tay’s developers and publishers nor the system itself had been prepared for this potential outcome and for the way the system would learn from the environment in which it was set to operate.
Bias is complex and mostly inevitable. An AI artefact will be fed data, and that data somehow or other represents the world as it is now. But that is not the only reason. An artefact is also the product of who decided to put it together and what they decided to use it for. As such, bias is always present. What I also find troubling is the idea that a technical fix can resolve such problems.22 Many readers will have heard of Amazon having to pull a piece of recruitment software when it emerged that it systematically favoured male CVs.23 Using AI, specifically ML, the tool in question reviewed job applicants’ résumés and assigned each of them a score from one to five stars. To produce its ranking, this AI was trained by spotting patterns in résumés submitted to the company over the previous decade. However, given the under-representation of women in the technology industry, the vast majority of these résumés belonged to male candidates.
Inevitably, based upon this data, the AI system learned to favour males. So, for example, if a résumé included the word ‘women’s’ (as in ‘women’s chess-club captain’), or the names of all-women’s colleges, the system would disregard it. Instead, it would favour vocabulary such as ‘executed’ and ‘captured’, commonly found in male engineers’ self-descriptions.
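The mechanism is easy to reproduce in miniature. The sketch below has nothing to do with Amazon’s actual system: it trains a plain logistic-regression model on a handful of invented ‘historical’ shortlisting decisions that favour one group, then inspects the learned weights to show that the model has simply converted that history into a penalty for the word ‘women’s’.

```python
# Toy illustration (invented data, not Amazon's system): a text model trained
# on skewed historical hiring outcomes learns to penalise gendered words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# "Historical" CVs and whether they were shortlisted (1) or not (0).
# In this invented history, shortlisted CVs happen to be male-coded.
cvs = [
    "captain of chess club, executed large project",        # shortlisted
    "led robotics team, captured market insights",          # shortlisted
    "women's chess club captain, executed large project",   # rejected
    "women's coding society lead, strong portfolio",        # rejected
    "executed migration plan, led backend team",            # shortlisted
    "women's engineering society, led backend team",        # rejected
]
shortlisted = [1, 1, 0, 0, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(cvs)
model = LogisticRegression().fit(X, shortlisted)

# Words with the most negative weights are the ones the model penalises.
weights = dict(zip(vectorizer.get_feature_names_out(), model.coef_[0]))
for word, w in sorted(weights.items(), key=lambda kv: kv[1])[:3]:
    print(f"{word}: {w:.2f}")
# In this toy data, "women" (the tokenised form of "women's") comes out with
# a strongly negative weight: the model has learned the skew of its training
# history, nothing more.
```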
This applies to images as well as words. In 2018, Google announced that it had fixed its Photos application, which had been mistakenly identifying black people as gorillas. The application uses ML to recognize people, places and events depicted in photographs and to automatically group those with similar content. Three years after Jacky Alciné, an African-American consumer and developer, pointed out that Google Photos labelled photographs of him and his friends with the tag ‘gorillas’, all the company had managed to do was remove the tag completely, so that the ML algorithm would not assign it to any image whatsoever.24
We have seen how some of the bias is due to technical factors: AI needs data, and data reflects the bias in society. The problem lies with characterizing this structural discrimination as bias. As algorithms are increasingly deployed by our local authorities, banks, hiring companies and the health sector to make decisions and to perform tasks on our behalf, the issue of outputs reflecting structural inequality and power imbalances in society will be escalated to a new level – and this is happening with little or no scrutiny.
Women have fought for centuries to make organizations accountable for discrimination, and now we are finding all this returning through the back door of automation. By calling this ‘bias’, we are making two dangerous mistakes. Firstly, we are humanizing these machines by attributing to them human characteristics, and, by doing so, we are shifting the responsibility away from us to the artefact. AI is a human tool, created, developed, organized and fed data by humans. When AI goes wrong, we are both the victims and the perpetrators. The responsibility is ours, and absolving ourselves of the outcomes of algorithmic decisions is a grave error.
Secondly, by calling it ‘bias’ we are missing the real issue. It is not simply bias. If it discriminates on the grounds of race, it is racist. I am struck by how reluctant we are to say ‘algorithmic racism’ or ‘algorithmic chauvinism’. Instead, we revert to the use of the soft term ‘bias’. Are we really supposed to accept that, because humans can be racist, it is by extension somehow acceptable for unaccountable, uncontrolled and proprietary software to embed that racism into every corner of decision-making, so it becomes ingrained and, worse yet, unchallenged?
Let’s take the issue of housing. Alongside education, housing is one of the biggest opportunities for people to get out of poverty. So what happens when algorithms make decisions about assigning homes to families? It is an interesting question because the answer is, remarkably, unknown. It is not by chance that in the United States the Department of Housing and Urban Development is discussing new proposals that would give landlords much greater protection from discrimination claims.25 The reality is that most of the products that government agencies use come straight off the shelf: they are pieces of proprietary software protected by commercial confidentiality, which makes the system unaccountable. And even if accountability tools were introduced, the reasons underpinning discrimination can emerge at different stages of the process.
Let’s say the police want to identify who is more likely to carry a knife. In this case they would look at statistics related to who has been found carrying a knife in a particular time frame to identify patterns of behaviour or common characteristics. But, again, it is easy to imagine how skewed that data set is likely to be: we know that black people are more likely to be stopped and searched by the police.26
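A back-of-the-envelope sketch, again with invented numbers, shows how the recorded figures can come to ‘prove’ a pattern that is really a pattern of police attention rather than of behaviour.

```python
# Invented numbers: two groups with the SAME underlying rate of knife
# carrying, but very different rates of being stopped and searched.
groups = {
    # group: (population, annual stop-and-search rate, true carry rate)
    "Group 1": (1_000_000, 0.01, 0.02),
    "Group 2": (1_000_000, 0.09, 0.02),   # stopped nine times as often
}

for name, (population, stop_rate, carry_rate) in groups.items():
    stops = population * stop_rate
    found = stops * carry_rate          # searches that find a knife
    print(f"{name}: stops={stops:,.0f}, knives recorded={found:,.0f}")

# Output:
#   Group 1: stops=10,000, knives recorded=200
#   Group 2: stops=90,000, knives recorded=1,800
# The records show nine times as many knives for Group 2, although the
# underlying carry rate is identical; a model trained on these records
# will 'learn' the disparity in policing, not a disparity in behaviour.
```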
The increased use of these algorithms for decision-making and prediction means that algo-sexism and algo-racism are reaching a completely new level: unaccountable, unchallenged, embedded in the very fabric of our techno-chauvinistic societies, in which public-relations departments, as well as austerity-inclined politicians, feed us the message that tech is, necessarily, better, cheaper and more efficient.
The shrinking of public-sector services often means that local authorities and agencies are receptive to the idea of saving money through automated systems with the alluring prospect of efficiency and cost-cutting. Why employ a human to decide who to assign a house to if an algorithm can do it? And why not use ML to analyse patterns, thus identifying which families are more at risk of poverty, payment arrears or gambling addiction?
The problem, though, is that these analyses inform policy decisions and the allocation of resources. If the data ingested into the system to determine these decisions has the power-related complexities we discussed above, the outcome will be that both the patterns and the solutions informed by them will reinforce society as it is now, rather than breaking the cultural and social norms underpinning it and thus transforming it for the better.
Perhaps the biggest misconception about AI is that, because humans are biased, irrational and emotional, we must automatically welcome algorithms because they are by nature ‘scientific’ and thus not deformed by cultural norms. Or, even, that we can code ‘fairness’ into those algorithms by virtue of a mathematical fix.
This simplistic view does not take into account the power relation, discussed already, between who is in a data set and who has decided to put them in it. If society as it is today is the only model we use to train algorithms that are going to affect us tomorrow, then we risk hard-coding injustices and prejudices into societies of the future.
To solve this problem, we must change the vocabulary around it, first by eradicating the word ‘bias’, which is an easy way out for companies who can – or can at least claim to – offer a technical solution to it. Instead, it is time we acknowledge that data is simply not neutral, and, as such, every single decision and action around data is a political one. How we respond to algo-sexism and algo-racism is through politics, not through algorithmic fixes.
The obsession with fixing algorithms is a smokescreen, and, for some, a rather useful one because it deters us from acknowledging the real issue, and that is the power dynamics underpinning algorithms and, therefore, the entire data economy.
In 2018, Sundar Pichai, the CEO of Google, told a town-hall event in San Francisco that AI is one of the most important things humanity is working on: ‘It is more profound than, I don’t know, electricity or fire.’27 That sounds bold, but it is arguably true, and to fully appreciate his claim we need to look further into what underpins AI: namely, data.
Some insist that data is the new oil, and that is an interesting and illuminating comparison, though a limited one: firstly, and most obviously, because data can be used over and over again; unlike oil, it is not a finite resource. And secondly, and more importantly, because I think it is time to start seeing data not as a commodity but as capital.
The reasons for this are, in my view, clear. For example, companies like Siemens or GE now present themselves as data firms rather than technology companies.28 Data has been the driver and the underlying reason for many acquisitions, including Amazon’s purchase of Whole Foods for $13.7 billion in 2017.29
There is little doubt that the accumulation of data is a core component of the political economy in the twenty-first century.30 We need only consider the power of Facebook and Google, the vast amount of data they have amassed and are continuing to amass about us, as well as their power over politics and regulators alike. Some will recall Facebook CEO and founder Mark Zuckerberg being questioned by (mostly clueless) US Congress members about what Facebook does and doesn’t do – what its business model is and how it operates.31 This was a clear example of the chasm between tech and politics, with politicians simply not grasping the extent of the digital realm and its influence on the world we inhabit.
If data accumulation is a core component of the political economy, ‘data extractivism’ is the tool behind it. Extracting data is the way to grow power and influence. And because data is capital, it behaves like capital. At a global level, data is becoming a geopolitical arsenal that is creating a new form of colonialism, whereby large corporations take over the digital infrastructure of the Global South for exploitation and control. For example, Netflix is increasingly buying up content from Africa32 and now ranks number one globally in the generation of internet traffic.33 Meanwhile, Uber’s aggressive expansion into the taxi industry in South Africa led to escalating violence during 2017’s ‘Uber wars’.34
Similarly, China has signed an agreement with Zimbabwe to deploy facial-recognition software developed by CloudWalk Technology, which will prove invaluable for China, as the country needs non-Asian faces to train its facial-recognition algorithms.35 Another example is the UK-based firm De La Rue rolling out a national ID programme in Rwanda, collecting the biometric data of all citizens.36
In recent years, one tech company in particular has generated headlines, and that is Jumia, known as ‘the Amazon of Africa’. Jumia operates in fourteen African countries and is the first Africa-focused e-commerce company to be listed on the New York Stock Exchange. Customers can order anything from an iPhone to a chicken korma at the touch of a screen.
The success of Jumia is particularly interesting and revealing, considering that many African countries have put their faith in ‘leapfrogging’ – the idea that developing nations in Africa can skip whole stages of development. Jumia does indeed fit that narrative and gives many Africans hope for a future in which e-commerce, not extraction, becomes the engine of African growth. Cities like Lagos boast vibrant tech districts, and more and more people are using apps.
However, critics have started to question how well Jumia is serving the denizens of Africa. Jumia was in fact incorporated in 2012 in Berlin, though it has been known to tell inquirers that it was headquartered in Nigeria. It was originally called Kasuwa, which means ‘market’ in Hausa, a language used in northern Nigeria. Later, it was renamed Jumia. At the most senior level, the company is managed not by Africans but by French executives, who were operating out of Paris until they moved to their current headquarters in Dubai. Much of Jumia’s capital was raised in Europe and America.37
People started to wonder how different Jumia was from companies like Shell, a large corporation that employs lots of Africans but can hardly claim to be African. Rebecca Enonchong, a Cameroon-born tech entrepreneur living between Africa and the United States, sees Jumia as a foreign company dressed in African robes. Jumia, she says, is the brainchild of Rocket Internet, a German company that ‘copy pastes’ ideas developed in Silicon Valley and applies them to the rest of the world. ‘This is a Rocket Internet company. It is not an African start-up. We have a painful history with European companies, this colonial legacy that is very recent. It seems like it’s being repeated in the start-up world.’38
These examples and the debates they generate go to the heart of what many refer to as ‘digital colonialism’. The issue is not whether companies from all over the world should be trading in African countries. It is, rather, that local African companies are being outgrown within their own ecosystem, as it is very difficult for them to match the cash injections Western companies can provide.
My point, though, is that this taking over of the digital infrastructures could lead to a real form of extraterritorial jurisdiction, and that is the main danger in my understanding of data as capital. The question, therefore, becomes whether or not the accumulation of data will increase the global divide, exacerbating the dramatic consequences that we all see unfolding around us already, from forced migration to the threats to our environment.