WE ARE AT a moment in history where the way we interact with knowledge is changing dramatically. The age we now live in is an era of ‘digital abundance’, where digital information saturates our lives.1 The volume of information created daily, held in digital form and available online, is astonishing. In a typical minute in 2019, 18.1 million text messages were sent, 87,500 people tweeted, and over 390,000 apps were downloaded across the globe.2 We must be concerned not only with the narrative of those texts, or the images in those tweets, but also with the underlying data that underpins them, which is now part of society’s knowledge.
Many libraries and archives now hold ‘hybrid’ collections, dealing with traditional and digital media together. In many institutions, digital collections fall into two categories: those that have been digitised from existing collections of books, manuscripts and records, and those materials that were ‘born-digital’, created in digital form from the start, such as email, word-processing files, spreadsheets, digital images and so on. Scholars don’t just write articles in learned journals; they create research data from scientific instruments or other scholarly processes, often in huge quantities. The scale of the digital collections of many libraries and archives has been growing rapidly. In the Bodleian, for example, we have around 134 million digital image files, spread across multiple storage locations, which require preservation.3 Such an abundance of information has become normal. We now take for granted the ease and convenience of being able to access information, and we regard the opportunities for research in all fields that it enables as routine.
As our everyday lives are increasingly played out in digital form, what does this mean for the preservation of knowledge? Since the digital shift has been driven by a relatively small number of powerful technology companies, who will be responsible for the control of history and for preserving society’s memory? Is knowledge less vulnerable to attack when it is controlled by private organisations? Should libraries and archives still have a role to play in stewarding digital memory from one generation to the next as they have done since the ancient civilisations of Mesopotamia?
Libraries and archives have been very active in digitising their collections and putting them online so they can be shared. The phenomenon of the Distributed Denial of Service (DDoS) attack is familiar to anyone publishing information online. DDoS attacks subject a public website to a bombardment of queries, thousands or even tens of thousands of times a second, from a range of internet addresses, often using a network of hijacked computers known as a botnet. This normally overwhelms the servers hosting the website under attack. These attacks can be regular and frequent, sometimes the work of idle hackers attracted to the challenge of ‘taking down’ the website of a large, famous, venerable or respected institution (such as the Bodleian, which has suffered such attacks from time to time), but there is growing evidence that states are also using DDoS against their rivals and enemies. The organisations on the receiving end respond by building stronger infrastructure, at ever greater cost. But this kind of attack is just the most ‘straightforward’ variety in the digital world. There are also more insidious forms.
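Before turning to those, the ‘straightforward’ kind can at least be illustrated from the defender’s side. Below is a minimal sketch, in Python, of one common first line of defence: a per-address token-bucket rate limiter that throttles any single source flooding a server with queries. The allowances chosen here (ten requests a second, bursts of twenty) are illustrative assumptions, not any real institution’s configuration, and a genuinely distributed attack from thousands of addresses requires mitigation at network scale far beyond this.

```python
import time
from collections import defaultdict

RATE = 10.0   # sustained requests per second allowed per address (assumed)
BURST = 20.0  # bucket capacity: short bursts tolerated (assumed)

# Each client address gets a bucket of tokens, refilled over time.
buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow_request(client_ip: str) -> bool:
    """Return True if the request should be served, False if throttled."""
    bucket = buckets[client_ip]
    now = time.monotonic()
    # Refill tokens in proportion to the time elapsed since the last call.
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
    bucket["last"] = now
    if bucket["tokens"] >= 1.0:
        bucket["tokens"] -= 1.0
        return True
    return False  # this address has exhausted its allowance
```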
A new existential challenge faces libraries and archives, one that affects the whole of society. Knowledge in digital form is increasingly created by a relatively small number of very large companies, which are so powerful that the future of cultural memory is under their control, almost unwittingly, with consequences and implications that we are only just waking up to. They are collecting knowledge created by us all and we now refer to it simply as ‘data’. This data is gathered from the entire globe, and because it relates to our interaction with their platforms the companies often have exclusive access to it. They are using it to manipulate our behaviour in many different ways, mostly by trying to shape our purchasing habits, but this influence is also entering other areas of life – our voting behaviour and even our health. They are doing this in secretive ways, which are hard for people to understand.
The rapid rise of these companies, with their global customer bases and vast revenues, has been unprecedented. Perhaps the closest parallel is that of the Roman Catholic Church in the Middle Ages and Renaissance. The Catholic Church likewise held both spiritual and temporal powers over vast swathes of the globe, with massive financial interests. Its authority was vested in a single individual, albeit one who worked within a power structure that gave immense authority to a relatively small number of people. A commonly held belief system, together with a common language, enabled its global authority to be maintained and to grow. Facebook today boasts of its ‘single global community’, and the statistics show that Google has the overwhelming market share of online search and consequently the largest share of ‘adtech’, the data used to track the behaviour of users of these services, which is then sold to online advertisers (and others).4 The large tech companies in China, like Tencent and Alibaba, have billions of users, who interact with their platforms many times a day. All of these companies offer free online hosting of images, messages, music and other content for their users, taking up vast amounts of storage using cloud technologies (Amazon is now the world’s largest provider of data storage through its subsidiary Amazon Web Services). We have become used to clicking ‘like’ or engaging with posts and adverts created by other social media users or advertisers. The power that these companies now hold has led the historian Timothy Garton Ash to call them ‘private superpowers’.5 And the way these companies operate has been termed ‘surveillance capitalism’.6
At the end of 2018 the photo-sharing site Flickr, struggling to keep pace with competition from the likes of Instagram, announced that it was reducing the amount of free storage for its account holders. After February 2019 users of free accounts were limited to 1,000 photos and videos, and any excess was automatically deleted by the company. Millions of Flickr users found that much of their content had been permanently removed. What happened at Flickr shows us that ‘free’ services aren’t really free at all. Their business model is based on the (often unacknowledged) trading of user data, and as market share is lost to competitors, ‘free’ tiers give way to paid ‘premium’ services (the so-called freemium model). Storage is not the same as preservation.7
The problem that the Flickr case throws up is one of trust in the companies that now control knowledge online. Active users will have known about the coming changes and were perhaps able to move their data on to other platforms. Others who did not move fast enough may have lost images of their loved ones or a photographic record of their adventures. Gone in the blink of an eye. Consumers have had similar experiences with other ‘free’ platforms such as Myspace, which lost twelve years of user-uploaded music in 2019, and Google+, which was closed down the same year with little advance notice. YouTube destroyed thousands of hours of videos documenting the Syrian Civil War in 2017.8 Precious information was lost, much of it gone for ever.9 These sites, and the companies that maintain them, are driven by commercial gain and are (for the most part) answerable to shareholders. They have no public benefit mission, and any knowledge that they store is kept only to support their commercial operations.
Libraries and archives are trying to engage with this new information order and to play a positive role in the preservation of digital knowledge, but the tasks are complex and expensive. The Library of Congress, for example, announced a groundbreaking partnership with the social media giant Twitter in 2010, aspiring to build a complete archive of every public tweet since the platform’s launch in March 2006, continuing into the future. The Library of Congress has been one of the leading institutions working in digital preservation; as the national library of the richest nation on earth, it would seem natural for it to form a partnership with a technology company at the forefront of the social media revolution.
Unfortunately, owing to funding shortfalls the arrangement ceased in 2017, and the library now preserves tweets only ‘on a selective basis’.10 Given the power of social media platforms like Twitter and Facebook, and the use made of them by leading individuals and organisations in politics and other aspects of public life, the lack of any systematically preserved record cannot be good for the health of an open society.
As we increasingly play out our lives on social media, we need to find ways for libraries and archives to help society remain open. As the political sphere has embraced digital information, we have seen the rise of ‘fake news’ and ‘alternative facts’. Preserving knowledge in order to inform citizens and to provide transparency in public life is becoming a critical issue for the future of democracy. The behaviour of tech companies, especially the social media firms and the data corporations employed in political campaigning, is coming under increasing scrutiny. Archives can be vital in providing evidence of that behaviour.
Libraries and archives that preserve the web (in ‘web archives’) have become particularly important, as they provide a permanent record of a huge range of human endeavours documented online in websites, blogs and other web-based resources. The public statements of political candidates, office holders and government officials appear on the web (often to their embarrassment), and there is an increasing sentiment that they should be preserved so that the public, the media and, eventually, voters can call their representatives to account for those statements.
Web archiving is still a relatively new tool. The UK Web Archive, for example, is a collaborative effort of the six copyright libraries in the United Kingdom and the Republic of Ireland.11 These libraries have enjoyed the privilege of ‘legal deposit’, whereby printed publications have been required to be deposited in designated libraries, since the Licensing Act of 1662 and the Copyright Act of Queen Anne in 1710.12 The archiving of the UK web domain began in 2004 as a British Library initiative, collecting carefully selected websites through a voluntary, ‘permissions-based’ approach: sites were selected for capture, and each website owner had to be contacted and grant explicit permission before the site could be added to the archive. All the preserved sites were then openly accessible to the public, online. In 2013 the legal deposit legislation was updated when the ‘Non-Print Legal Deposit Regulations’ passed into law. These regulations transformed the voluntary system into one mandated by law and entrusted it to the six legal deposit libraries, which now co-fund the vast enterprise.13
Archiving the web is a complex task, as the targets are constantly moving. Many websites disappear or change address frequently. The UK Web Archive shows an astonishing rate of attrition among the sites it has captured. Of the websites preserved in any one year, around half have gone from the open web within two years, or cannot be found for some reason (in technical terms, their web addresses no longer resolve). After three years the figure reaches 70 per cent or so. Despite these problems, the web archive is growing. In 2012 it held regularly archived copies of around 20,000 websites. At the end of the last complete ‘crawl’ of the UK web, in 2019 (the crawls take almost a year to complete), the archive contained copies of over six million websites, comprising over 1.5 billion web resources. The archive also holds deeper collections of over nine thousand curated ‘special collections’ websites, which our curatorial teams have identified as being of more significant research value. These are crawled much more frequently: monthly, weekly or even daily, and as the sites are regularly re-crawled they amount to 500 million web resources.14
One of the UK Web Archive’s special collections of blogs and websites has captured 10,000 sites relating to Brexit: the 2016 referendum on European Union (EU) membership and the political aftermath of the vote. In April 2019 the Vote Leave campaign deleted a great deal of content from their public website, including references to the campaign’s promise to spend £350 million a week on the National Health Service (NHS) if Britain left the EU, a promise that by 2019 had become increasingly controversial. Fortunately the UK Web Archive had captured the website before that content was deleted.
Access to knowledge on the web is now a social necessity. Yet in 2013 the Harvard scholars Jonathan Zittrain, Kendra Albert and Lawrence Lessig discovered that more than 70 per cent of the websites referenced in articles in the Harvard Law Review and other legal journals, and, even more importantly, 50 per cent of the URLs cited in opinions of the US Supreme Court, were broken, suffering from what the digital preservation community calls ‘linkrot’. These websites are of huge social importance: how can society function unless it knows what the laws of the land are?15
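Linkrot is easy to measure. The sketch below, assuming Python and the third-party requests library, audits a list of cited URLs and reports which no longer resolve; the URLs shown are placeholders. A real audit would also need to catch ‘soft 404s’, pages that return success codes but no longer carry the cited content.

```python
import requests

# Placeholder citations; a real audit would extract these from the articles.
citations = [
    "https://example.com/some-cited-page",
    "https://example.org/another-source",
]

for url in citations:
    try:
        # HEAD keeps the check lightweight; follow redirects, since many
        # live pages have simply moved.
        r = requests.head(url, allow_redirects=True, timeout=10)
        status = "alive" if r.status_code < 400 else f"broken ({r.status_code})"
    except requests.RequestException:
        status = "broken (address does not resolve)"
    print(f"{url}: {status}")
```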
The growth of digital information has moved faster than libraries and archives have been able to keep pace with, and other players have moved in to try to fill the gaps. The uber-web archive of them all, the Internet Archive, is a good example of this kind of private archiving initiative. Founded by the internet pioneer Brewster Kahle in 1996, it is based in San Francisco. The archive’s strapline, ‘Universal Access to All Human Knowledge’, is typical of the bold thinking you encounter in this part of California. Since it was founded, its key service, the Wayback Machine, has captured more than 441 billion web pages. The captures are publicly viewable over the internet, and the tool has been developed entirely through the use of web-crawlers, which ‘scrape’ data from the public web. No permissions are sought, and there is no explicit legal basis for the activity equivalent to the legal deposit regulations in the UK.
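The Wayback Machine’s holdings can be queried programmatically as well as through a browser. As a small illustration, the sketch below uses the archive’s public ‘availability’ endpoint to find the capture of a page closest to a chosen date; the page looked up is a placeholder.

```python
import requests

def closest_snapshot(url: str, timestamp: str):
    """Return the archived URL closest to timestamp (YYYYMMDD), or None."""
    r = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": timestamp},
        timeout=30,
    )
    snap = r.json().get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None

# For example: the capture of a site nearest to 1 January 2006.
print(closest_snapshot("example.com", "20060101"))
```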
The Internet Archive has itself been subject to attempts to destroy the knowledge it holds. In June 2016 a massive DDoS attack was launched against the Internet Archive by groups angry that the site hosted websites and videos created by members of the extremist organisation ISIS and its supporters; it failed. What the incident highlights is the relatively fine line between the legitimate acquisition of, and provision of access to, knowledge, and the censorship of knowledge that is either offensive to the majority of citizens or being used as a propaganda tool by groups that have been legitimately banned for their violent or illegal views.16
What concerns me most about the Internet Archive is its long-term sustainability. It is a small organisation, with a board to oversee its activities, but it operates on a modest funding base. It has no parent body to look after it – perhaps this is why it has been able to achieve what it has so quickly – but a parent body could provide it with a greater capacity for longevity. At some point it must become part of, or allied to, a larger institution, one which shares its long-term goals to preserve the world’s knowledge and make it available. I have used the Internet Archive many times and it is incredibly valuable. When my family and I moved to Oxford in 2003 we had to fight a case with the Local Education Authority to make it possible for our two children to attend the same local primary school. We were able to prove that the authority’s public information about its policy had changed on a certain date by accessing preserved copies of its website through the Wayback Machine.
The Internet Archive is a reminder that there are some areas of public life where archives and libraries are not keeping up with the needs of society. They tend to be cautious institutions and slow to act. In many ways this has been one of their strengths, as the structures they have built have tended to be resilient. My sense is that the Internet Archive is now an ‘organised body of knowledge’ of huge importance for global society, but it is one that is ‘at risk’ in its current independent state. The international community of libraries and archives needs to come together to develop new ways of supporting the Internet Archive’s mission.
The work of the Internet Archive is one example of what I would like to call ‘public archiving’, or ‘activist archiving’: initiatives that emerge from concerned members of the public who have taken tasks on themselves, independently of ‘memory organisations’ like libraries and archives. These public archiving activities can sometimes move faster than institutionally bound ones, particularly in response to the rise of ‘fake news’, where public archiving has had to step in.
One of the features of political life in the United States under the Trump administration has been the presidential use of social media. Donald Trump has an astonishing 73.1 million followers on Twitter as of 28 February 2020 (equating to 22 per cent of the US population), and 17.9 million followers on Instagram. Such an enormous following gives him the ability to reach the voting public of America directly. His statements on social media therefore have a powerful impact, with potentially profound consequences for the whole world. The organisation Factbase has been keeping track of the presidential Twitter feed and its deletions. Between joining Twitter in 2009 and 28 February 2020, the president tweeted an astonishing 46,516 times, and a small number of those tweets – 777 – have been deleted, presumably either by the president himself or by members of his staff. Under the rigours of the Presidential Records Act, the presidential Twitter feed should eventually become part of the Presidential Archive, and if that turns out to be the case it will become the responsibility of the National Archives and Records Administration.17
The Presidential Records Act depends on trust between the presidential office and the National Archives. The archivist of the United States cannot realistically force the president or his team to comply with the Act. The Act requires the president to ‘take all such steps as may be necessary to assure that the activities, deliberations, decisions, and policies that reflect the performance of the President’s constitutional, statutory, or other official or ceremonial duties are adequately documented, and that such records are preserved and maintained as Presidential records’, but the president also has discretion to ‘dispose of those Presidential records of such President that no longer have administrative, historical, informational, or evidentiary value’. The Act states that such disposals can only happen if the advice of the archivist of the United States is sought, but the president is not bound by law to adhere to this advice. As such, during the tenure of office of a US President, the archivist has limited ability to take any steps to preserve presidential records, beyond seeking the advice of two congressional committees.
Although Donald F. McGahn II, the White House Counsel to the President, issued a memorandum to all White House personnel in February 2017 on their obligation to maintain presidential records (as defined by the Presidential Records Act), expressly referring to electronic communication, it remains to be seen whether the administration, or indeed the president himself, is complying with the Act. The Act has no teeth because of its inherent assumption that all presidents would honour the system. The memorandum expressly prohibits, without the approval of the Office of the White House, the use of social networks and other ‘internet-based means of electronic communication to conduct official business’, including encrypted messaging apps that allow messages to be automatically deleted after a period of time pre-set by the user (WhatsApp, for example, is known to be widely used by the president’s inner circle of advisers).18 The use of such technologies ought to have prompted advice to be sought from the archivist of the United States, and numerous commentators have claimed that their use violates the Presidential Records Act.19
Prior to becoming president, Donald Trump maintained a video log (vlog) from 2011 to 2014, which was mounted on the Trump Organization’s YouTube channel. He deleted most of it prior to 2015 (only 6 of the original 108 entries are still to be found on YouTube), but Factbase maintains a record of it on its website, in order to add it to the public record. One section of the website covers media interviews that the president has given during his term of office. The dominance of media outlets owned and controlled by Newscorp is one of the revealing pieces of data that Factbase makes available to the public: 36.4 per cent of all his interviews have been given to Newscorp organisations. Factbase has sourced, captured, transcribed and made all of these searchable, but it is not the only tool designed to document the president’s online behaviour; a website called Trump Twitter Archive also tracks his tweets in a similar way.20
The work of Factbase, Trump Twitter Archive and others makes the public utterances of the president available for public scrutiny in a way that no previous president has been subject to, at least not during his term of office. This ‘public knowledge’ is essential for the health of an open democratic system, particularly one where the incumbent of the most powerful political office in the world uses public media channels extensively to promote his political agenda. The work is made even more important when the president or his aides are prone to deleting these public utterances. It relies on screenshots of the Trump tweets, followed by automated routines that transcribe the tweets, add metadata and place them in a database for further analysis.
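The underlying principle is simple, even if the engineering at Factbase’s scale is not: capture the feed repeatedly, and whatever is present in one snapshot but absent from the next has been deleted in between. The sketch below illustrates the idea only; it is not Factbase’s actual pipeline, and the snapshot files and their format are assumptions.

```python
import json
from datetime import datetime, timezone

def load_snapshot(path: str) -> dict:
    """Load a previously captured snapshot mapping tweet ID -> text."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

earlier = load_snapshot("snapshot_yesterday.json")  # assumed capture
later = load_snapshot("snapshot_today.json")        # assumed capture

# Any ID present earlier but missing later has been deleted in between.
for tweet_id, text in earlier.items():
    if tweet_id not in later:
        record = {
            "id": tweet_id,
            "text": text,  # survives only because it was captured earlier
            "noticed_missing": datetime.now(timezone.utc).isoformat(),
        }
        print(json.dumps(record))  # in practice, written to a database
```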
Another example of public archiving has been developed by an independent organisation in the UK called Led By Donkeys. The name originates in a phrase from the First World War, when British infantrymen were described as ‘lions led by donkeys’, giving a sense of what the men at the front thought of their generals. Operating in the public sphere, both online and in the physical setting of billboards and other public places in major cities, Led By Donkeys has been preserving statements by leading politicians that now differ from their stated policy positions and making them public – essentially holding those politicians to account.21
These public archiving activities reveal the importance of preserving information that can call politicians to account for their comments. Political discourse has often been a battleground between truth and falsehood but the digital arena amplifies the influence that political falsehoods can have on the outcomes of elections. Public archiving initiatives like Factbase and Led By Donkeys seem to me to be filling a void where public institutions could, and should, be saving this kind of information more systematically.
One of the most heavily used ‘organised bodies of knowledge’ in the present day is the online encyclopaedia Wikipedia. Founded in 2001, it grew rapidly, adding its millionth entry within six years. Despite its many critics and undoubted limitations, it is now a huge and heavily used resource, serving around 5,000 to 6,000 hits a second across its 6 million entries. Libraries and archives, far from feeling threatened by it, have from the outset chosen to work with it.
The knowledge held in Wikipedia is subject to attack. Public relations companies have, for instance, been paid to edit or remove material that their clients find uncomfortable. The popular beer Stella Artois, for example, was once nicknamed ‘wife-beater’: a verifiable fact, backed up by sources and included in the Wikipedia article about the brand. Such a nickname is no longer tolerated in Western society, and at one point the reference was deleted. The account that deleted it turned out to belong to the PR company Portland Communications. Members of the Wikipedia community restored the deleted references.22
Politicians have deleted unwelcome references in Wikipedia to the so-called ‘expenses scandal’, a series of revelations made by the Daily Telegraph and other newspapers relating to improper expense claims made by Members of Parliament. By analysing the IP addresses of the computers that made changes to the biographies of those MPs, the journalist Ben Riley-Smith uncovered the fact that the references, although verifiably in the public domain, were deleted by staff within the Palace of Westminster.23
Wikipedia is built on a culture of openness. Every change made to an entry is tracked and openly viewable: the nature of the content deleted (or changed), the date and time of the edit, and the account that made it can all be seen. The Wikipedia community organises teams of ‘watchers’ who regularly read a set of pre-identified entries known to attract malicious deletion or incorrect editing. Anyone with an account can elect to ‘watch’ any selection of pages, so that they notice any change made in their area of interest.
Every contributor also has an openly viewable contributions record, so if someone is only making edits about certain individuals or topics, that too is visible to other users. Although there is a human layer of ‘watchers’, they are supported by a technological layer of software tools (or ‘bots’) that do large-scale automated ‘watching’.
Wikipedia itself monitors the entire site. Its bots can detect events such as a significant part of an article being deleted, or a homophobic or racial slur being added. When a large amount of text is added, they can automatically search sentences from the article on Google to detect plagiarism. When the staff of a politician delete material, bots and human editors flag the edit, trace the pattern of edits made by the same account or computer, and restore the deleted material with one click. Sometimes an attempt to delete or censor Wikipedia generates its own media story, which then gets cited in the article.
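The public MediaWiki API that powers Wikipedia makes this kind of automated watching straightforward. Below is a minimal sketch of the principle: poll the recent-changes feed and flag any edit that removes a large amount of text. The 10,000-byte threshold is an illustrative assumption, not the logic of any actual Wikipedia bot.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"
THRESHOLD = 10_000  # bytes removed; an assumed cut-off for this sketch

params = {
    "action": "query",
    "list": "recentchanges",
    "rcprop": "title|user|timestamp|sizes",  # 'sizes' gives old/new lengths
    "rctype": "edit",
    "rclimit": 50,
    "format": "json",
}
changes = requests.get(API, params=params, timeout=30).json()

for rc in changes["query"]["recentchanges"]:
    removed = rc["oldlen"] - rc["newlen"]
    if removed > THRESHOLD:
        # A human patroller (or a revert bot) would now review this edit.
        print(f"{rc['timestamp']}  {rc['title']}: "
              f"{removed} bytes removed by {rc['user']}")
```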
The shift in the creation of knowledge to digital form is posing challenges for administrators who, facing the digital deluge, struggle to cope with the burden of dealing with large quantities of digital information. In December 2018 the state government in Maine revealed that it had suffered a catastrophic loss of public documents from the administrations of governors Angus King and John Baldacci: most state government emails sent before 2008 were irretrievably lost, and many other kinds of documents were destroyed by state officials before they reached the Maine State Archives. Not only has information for future historians gone, but these emails could also have contained documentation vital to high-profile legal cases. As the work done by lawyers like Larry Chapin on the Libor scandal of 2012 has shown, email records, when pieced together, can tell a story in enough detail to help secure a conviction or prevent a defendant from going to jail.24
There are other areas of life where future access to knowledge will be of critical importance, and where commercial interests would not necessarily be beneficial. A good example is the nuclear industry. As a society we need to be sure long into the future – not just five to ten years, but hundreds and even thousands of years hence – exactly where we have stored nuclear waste, what material it consists of, when it was placed there, what kind of container it was stored in, and so on. This data exists today, but the challenge facing the Nuclear Decommissioning Authority and other players in the nuclear world is how property developers, mining companies, water suppliers, local authorities and governments in, say, five hundred years’ time can have guaranteed access to all this information. We need to know where to find the information, to be sure that the format it is stored in can still be accessed, and to be able to make sense of it when we need it. When businesses turn bad, as in the case of Enron in the early years of the present century, litigation is made much harder if digital preservation has been neglected in the corporate world: Enron employees deleted vast numbers of emails and other digital information, hampering the ability of its auditors to know what was going on, and making the job of litigation harder and more costly.
The preservation of knowledge is fundamentally not about the past but the future. The ancient libraries of Mesopotamia contained a preponderance of texts concerning the prediction of the future: astrology, astronomy and divination. Rulers wanted information to help them decide on the optimal time to go to war. Today, the future continues to depend on access to the knowledge of the past, and will do even more as digital technology changes the way we can predict what will happen. It will also be contingent on how the knowledge created by our digital lives is harnessed for political and commercial gain by a number of organisations that are becoming increasingly powerful.
The tech industry is now pouring huge investment into the ‘internet of things’, where domestic devices such as fridges are connected to the internet, operating on data passed from sensors. The internet of things is moving into the field of wearable devices, such as watches and jewellery, designed to monitor our health and generating massive amounts of biometric data. The volume of data will reach a point where medics will be able to make accurate predictions about our future health. This will help in the prevention of disease, but it will also open up major ethical issues. Who will own this data? We may be happy to share this material with our doctor, but would we be happy sharing it with our health insurer? It may be that libraries and archives could play a much stronger role in providing secure access to personal digital information, where the citizen controls who has access to it, but where anonymised, aggregated use of that information could be facilitated by libraries for the purposes of public health. If this knowledge were to be destroyed it could have profound consequences for the health of individuals, as we become tied more tightly than ever to digital health systems.
In June 2019 Microsoft announced that it was taking offline a huge database of images of human faces, over 10 million images in all, relating to 100,000 individuals, which had been used to train facial recognition AI systems around the world. The images had been collected without permission, ‘scraped’ from the open web.25 Other similar databases, openly available on the web, were discovered by the researcher Adam Harvey, whose work has led to a number of other facial recognition datasets being identified, including examples created by Duke and Stanford universities. These even include a dataset scraped from postings by transgender groups on YouTube, which was used to train facial recognition AI for transgender people.26
Until recently, worries about the gathering of data generated by users of online services centred on the invasion of privacy and the monetisation of that data. The concerns are now broader. When so much political campaigning takes place in the realm of social media, how can we be sure that our feeds are not being manipulated unlawfully, and that online campaigning is being conducted openly, fairly and with the consent of individuals, unless the data collected by those companies can be archived for open scrutiny?
Through 2017 and 2018 it became clear that data generated by users of Facebook had been used, almost certainly illegally, by a private company, Cambridge Analytica, to create targeted political advertising. At around the same time one of the major credit agencies, Equifax, exposed the financial information of over 147 million people in a massive data breach.27 These incidents have raised concerns about individuals’ information being held by private companies under weak or non-existent legislative frameworks. It has also been alleged that a number of governments have manipulated these platforms for their own political advantage.
The Cambridge Analytica website has long since disappeared, but fortunately several web archives captured the site before it went offline. On 21 March 2018 it was describing itself in these terms: ‘Data drives all we do: Cambridge Analytica uses data to change audience behaviour.’ Visitors were invited to ‘visit our Commercial or Political divisions to see how we can help you’. With offices in New York, Washington, London, Brazil and Kuala Lumpur, Cambridge Analytica were digital mercenaries, offering to change the behaviour of audiences for anyone willing to pay, no matter what the political or commercial intent. The site claimed that the company had gathered 5,000 data points on each American voter who uses the internet.
The web archives of their site seem to be the only archival traces of their behaviour, but the company had access to the data of a staggering 87 million Facebook users without their consent. The full scope of their activities remains unclear, and the details of what went on are still being uncovered. ‘Nobody has seen the Facebook data set for the Trump campaign,’ commented Carole Cadwalladr, whose investigative journalism for the Guardian helped to uncover the scandal, on Twitter. ‘Nobody has seen the ad archive. Nobody knows what Cambridge Analytica did. Nobody knows what worked. *If anything*. It’s why we need the evidence.’28
Archiving the datasets created by the big tech companies, such as the advertisements on Facebook, the posts on Twitter, or the ‘invisible’ user data harvested by the adtech companies, is, I believe, one of the major challenges facing the institutions charged with the preservation of knowledge. Libraries and archives can make only relatively modest inroads into an area where the volumes of data are vast. But society needs such archives to exist if it is to understand what our culture is doing today, and what role key individuals, corporations and others play in the way society is changing.
The problem of archiving social media sites is daunting, and we have seen in the case of Twitter that preserving an entire social media platform is a challenge greater than even the largest library in the world can meet. These sites are dynamic: they change every second and are presented to each user in a unique and personalised way. We need to archive both the communications on the platform itself and the data transmission that underpins it. The messages are one thing, but the ‘likes’, the ‘nudges’ and the other social tools that the platforms put in place can tell us a great deal about social behaviour, culture, politics, health and much more. In my view, preserving the great social media and adtech platforms is becoming one of the critical issues of the current period.
There are, however, some approaches to archiving social media beginning to emerge. In the summer of 2019 the National Library of New Zealand announced a project asking New Zealanders to donate their Facebook profiles to its Alexander Turnbull Library. As Jessica Moran, digital service team leader at the library, explains in her blog:
We hope to collect a representative sample of Facebook archives. We want to build a collection that future researchers may use to understand what we saved and how we used social media platforms like Facebook, but also to better understand the rich context of early 21st Century digital culture and life. In return for your donation, we can offer potential donors a trusted digital repository that is committed to preserving these digital archives into the future.29
The National Library of New Zealand has highlighted two key issues. Firstly, memory institutions must begin to archive the information held in the major social media platforms: the future needs to know what happened in the past, and if this cannot be done at the platform level (there are currently over 2.5 billion monthly active users of Facebook worldwide), it must be done in smaller chunks. A sample of users in a relatively small country like New Zealand is a very good way to approach such a large problem. Secondly, they know that some current users of Facebook are interested in having their own histories preserved by a trusted public institution, one that will do most of the work, and bear the cost, on their behalf. Crucially, the National Library also makes very clear statements about respecting the privacy of anyone who donates their Facebook material.
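What might a library actually do with a donated profile? Facebook lets users download their own data as a structured archive, and an ingest workflow could begin with a simple inventory of what a donor has handed over. The sketch below assumes an unzipped export of JSON files; the directory layout and file structure vary between export versions, so the details here are illustrative assumptions only.

```python
import json
from pathlib import Path

donation = Path("donated_facebook_export")  # assumed unzipped donor archive

files = 0
entries = 0
for path in donation.rglob("*.json"):
    files += 1
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Assumed structure: many export files hold a top-level list of items
    # (posts, comments, messages), each a small dictionary.
    if isinstance(data, list):
        entries += len(data)

print(f"Donation contains {files} JSON files "
      f"and roughly {entries} list entries to appraise before ingest.")
```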
Society has been too slow to catch up with the commercial realities that big data and ubiquitous computing have created. Our laws and institutions have not kept pace with an industry that is now incredibly rich and employs very smart people. As the data scientist Pedro Domingos has said: ‘whoever has the best algorithms and the most data wins’.30 The construction of the platforms and the ‘data industry’ around them has created what Shoshana Zuboff terms a ‘private knowledge kingdom’ (although ‘kingdoms’ might be the better analogy), all of this data and technology having been created ‘for the purposes of modification, prediction, monetization, and control’.31 The warning sounded by Zuboff and other writers who have studied the growth of surveillance capitalism is that a disproportionate amount of the world’s memory has been outsourced to tech companies without society realising the fact or being fully able to comprehend the consequences.
At the heart of the current relationship between the public and the major tech companies is the problem of trust. We all use their services, partly because we have become reliant on them, but increasingly the public does not trust them. Society has created a huge bank of knowledge but has privatised its ownership, management and use, even though the knowledge was created freely by individuals around the world. Arguably the owners of the companies are beginning to be viewed by the public with a sense of dystopian fear and suspicion.
A 2016 study by the Pew Research Center reported that 78 per cent of American adults felt that libraries guide them to information that is trustworthy and reliable. The figure is even higher among the 18–35 age group (the so-called ‘Millennials’). There are no long-term studies that allow us to plot this trend over time, but the Pew researchers regard these levels of trust as increasing among adults, in stark contrast to levels of trust in financial companies and social media organisations.32 And even governments.
Given that the public trust in libraries and archives is high (and growing), perhaps they could become the place where individuals could store their personal data? Perhaps society is beginning to enter into an era that will challenge the dominance of the ‘private superpowers’ and bring the interests of society to the fore. Can we conceive of a future where the data of individuals is placed in the hands of public institutions, as trusted stewards of public data?
Certain conditions would need to be fulfilled. Firstly, there would have to be legislation to establish the facilities and to put regulation in place.33 The public should be consulted and involved in developing the policies and in the way the system was established. Such laws would need to be harmonised across geopolitical boundaries. Secondly, there would need to be significant levels of funding to allow the libraries to undertake the task. This could be derived from a ‘memory tax’ levied on the tech companies themselves.34
Existing bodies, such as the Digital Preservation Coalition, would be key players in supporting digital preservation, and national bodies like the British Library and the National Archives, together with their sister organisations in Scotland, Wales and Northern Ireland, could work in collaboration to achieve this. There are models for such a modus operandi, among them the shared responsibility for legal deposit, which as we have seen was extended in 2013 to digital publications. While not perfect, the legislation and the system that the six legal deposit libraries have built works.
This would not in itself be sufficient. A new data architecture for the internet is required, one that allows individuals to control who has access to their data.35 The General Data Protection Regulation (GDPR), implemented in the UK through the Data Protection Act 2018, has gone a long way in Europe towards increasing the protection of individuals’ data.
The move of society’s knowledge from the personal domain to the commercial has brought with it massive issues that society must address. The rights of individuals are certainly at stake. In other areas of life there is a notion of ‘duty of care’, under which companies and institutions have to follow standards in, for example, the design and operation of public buildings. This notion could and should be applied to the digital world.36 If we do not archive the data that is being exploited, we will never properly understand the full extent of that exploitation and the effect it has had. Until we have a full archive of Facebook’s political adverts, enabling analysis, study and interrogation of these organisations and the advertisements on their platforms, we will not be able to assess how electorates were influenced. Without that information, we will never know.
A hundred years from now, historians, political scientists, climate scientists and others will be looking for answers to how the world of 2120 came to be the way it is. There is still time for libraries and archives to take control of these digital bodies of knowledge in the early twenty-first century, to preserve this knowledge from attack and, in so doing, to protect society itself.