What impact will data and AI have on the distribution of geopolitical power and economic wealth? It’s another dynamic that centers in part on the United States and China, but with even broader implications for the rest of the world. It’s one of the overarching questions of our time, and in the fall of 2018 a pessimistic view emerged.
As we met with members of Congress in Washington, DC, a few senators mentioned that they had read the advance galleys sent to them for a new book called AI Superpowers. Its author, Kai-Fu Lee, is a former executive at Apple, Microsoft, and Google. Born in Taiwan, he is now a leading venture capitalist based in Beijing. His argument is sobering. He asserts that “the AI world order will combine winner-take-all economics with an unprecedented concentration of wealth in the hands of a few companies in China and the United States.”1 As he puts it, “other countries will be left to pick up the scraps.”2
What’s the basis for this view? Mostly it comes down to the power of data. The argument is that the firm that gains the most users will gain the most data, and because data is rocket fuel for AI, its AI product will become stronger as a result. With a stronger AI product, the firm will attract even more users and hence more data. The cycle will continue with returns to scale, so that eventually this firm will crowd out everyone else in the market. According to Kai-Fu, “AI naturally gravitates toward monopolies . . . once a company has jumped out to an early lead, this kind of ongoing repeating cycle can turn that lead into an insurmountable barrier to entry for other firms.”3
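To see how such a loop might compound, consider a toy simulation. It is only a sketch with invented numbers: two hypothetical firms, a modest head start for one of them, and an assumed rule that a better product attracts a disproportionate share of new users. It is not a forecast of any real market.

```python
# A toy model of the data feedback loop described above. All numbers are
# illustrative assumptions: quality rises with accumulated data, and better
# products attract a disproportionate share of new users (increasing returns).

firm_data = {"A": 110.0, "B": 100.0}    # firm A begins with a modest head start

for month in range(36):                 # simulate three years of monthly cohorts
    # Assume each firm's pull on new users grows with the square of its data,
    # a stand-in for "better AI attracts disproportionately more users."
    pull = {name: data ** 2 for name, data in firm_data.items()}
    total_pull = sum(pull.values())
    for name in firm_data:
        new_users = 1000 * pull[name] / total_pull
        firm_data[name] += new_users    # each new user contributes one unit of data

share_a = firm_data["A"] / sum(firm_data.values())
print(f"Firm A's share of all accumulated data after 36 months: {share_a:.0%}")
# Under these assumptions, the early leader ends up holding most of the
# accumulated data and winning nearly all new users each month; with linear
# (constant-returns) pull, the two firms' shares never move at all.
```

Under those assumptions the early leader captures the lion's share of the data and nearly all of the new users. Change the assumption so that more data no longer buys a disproportionately better product, and the shares barely move. The argument, in other words, turns on whether data really does keep paying off at scale.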
The concept is a common one in information technology markets. It’s referred to as “network effects.” It has long been true in the development of applications for an operating system, for example. Once an operating system is in a leadership position, everyone wants to develop apps for it. While a new operating system might emerge with superior features, it’s difficult to persuade app developers to consider it. We benefited from this phenomenon in the 1990s with Windows and then hit the barrier on the other side twenty years later, competing against the iPhone and Android with our Windows Phone. Any new social media platform that wants to take on Facebook encounters the same problem today. It’s part of what defeated Google Plus.
According to Kai-Fu, AI will benefit from a similar network effect on steroids, with AI leading to increased concentration of power in nearly every sector of the economy. The company in any sector that most effectively deploys AI will gain the most data about its customers and create the strongest feedback loop. In one scenario, the outcome could be even worse. Data could be locked up and processed by a few giant tech companies, while every other economic sector relies on these companies for their AI services. Over time, this would likely lead to an enormous transfer of economic wealth from other industrial sectors to these AI leaders. And if, as Kai-Fu projects, these companies are located mostly on the east coast of China and the west coast of the United States, then these two areas will benefit at every other region’s expense.
What should we make of these predictions? Like many things, they are based on a kernel of truth. And in this case, perhaps more than one.
AI depends upon cloud-based computing power, the development of algorithms, and mountains of data. All three are essential, but the most important of these is data—data about the physical world, the economy, and how we live our daily lives. As machine learning has evolved rapidly over the past decade, it has become apparent that there’s no such thing as too much data for an AI developer.
The implications of data in an AI-driven world go well beyond the impact on the tech sector. Consider what a product like a new automobile will be like in 2030. One study recently estimated that fully half of the cost of a car by that time will consist of electronics and computing components, up from 20 percent in the year 2000.4 It’s apparent that by 2030, cars will always be connected to the internet for autonomous or semiautonomous driving and navigation, as well as for communications, entertainment, maintenance, and safety features. All of this is likely to involve artificial intelligence and large quantities of data based on cloud computing.
This scenario raises an important question: What industries and companies will reap the profit generated from what increasingly is a massive AI-driven computer on wheels? Will it be traditional automobile makers or tech companies?
This question has profound implications. To the extent this economic value is retained by automakers, there is cause for more optimism in the longer-term futures of car companies like General Motors, BMW, Toyota, and others. And, of course, this is likely to provide brighter prospects for the salaries and jobs at these companies and for the people who hold them. Put in this context, it’s apparent that this is also an important issue for these companies’ shareholders and for the communities and even nations where these companies are based. It’s not an exaggeration to say that the economies in places like Michigan, Germany, and Japan have their future riding on this outcome.
If this seems far-fetched, consider Amazon’s impact on book publishing—and now many retail sectors—or what Google and Facebook have done to advertising. AI could have the same impact on everything from airlines to pharmaceuticals and shipping. This in effect is the future painted by Kai-Fu Lee. It’s why there is at least a plausible basis to conclude the future could bring an ever-increasing transfer of wealth to the small number of companies that hold the largest pools of data and the regions where they are based.
As is so often the case, however, there is no single and inevitable path into the future. While there is a risk the future could unfold in this manner, there is an alternative course we can chart and pursue. We need to empower people with broader access to all the tools needed to put data to work. We also need to develop data-sharing approaches that will create effective opportunities for companies, communities, and countries large and small to reap the benefits from data. In short, we need to democratize AI and the data on which it relies.
So how do we create a bigger opportunity for smaller players in a world where large quantities of data matter?
One person who may have the answer is Matthew Trunnell.
Trunnell is the chief data officer at the Fred Hutchinson Cancer Research Center, a leading cancer research center in Seattle named for a hometown hero who pitched ten seasons for the Detroit Tigers and managed three major league baseball teams. In 1961, Fred Hutchinson took the Cincinnati Reds to the World Series.
Sadly, Fred’s successful baseball career and life were cut short when he died of cancer in 1964 at the age of forty-five.5 His brother, Bill Hutchinson, was a surgeon who treated Fred’s cancer. After his younger brother’s death, Bill founded the “Fred Hutch,” a research center devoted to curing cancer.
Trunnell came to Seattle in 2016 to work at the Hutch. The center has twenty-seven hundred employees working in thirteen buildings that sit on the south shore of Lake Union. Seattle’s iconic Space Needle is visible in the distance.
The Hutch’s mission is ambitious: to eliminate cancer and its related deaths as a cause of human suffering.6 It brings together scientists, three of whom have won Nobel Prizes, with doctors and other researchers to pursue cutting-edge research and treatments. It partners closely with its neighbor, the University of Washington, which has globally renowned medical and computer science centers. The Hutch has built an impressive track record that has included innovative treatments for leukemia and other blood cancers, bone marrow transplants, and now new immunotherapy treatments.
The Hutch has become like almost every institution and company in virtually every field on earth: Its future depends on data. As Hutch president Gary Gilliland has concluded, data is “going to transform cancer prevention, diagnosis and treatment.”7 He notes that researchers are turning data into a “fantastic new microscope” that shows “how our immune system responds to diseases like cancer.”8 As a result, the future of biomedical science is no longer in biology alone, but in its convergence with computer science and data science.
While Trunnell has never met Kai-Fu Lee, this recognition set him on a path that in effect challenges the author’s thesis that the future belongs only to those who control the world’s largest supply of data. If that were the case, then it would be difficult for even a world-class team of scientists in a midsize city in a far corner of North America to aspire to be among the first to find a cure for one of the planet’s most challenging diseases. The reason is clear. While the Hutch has access to important collections of health record data that help it pursue AI-based cancer research, in no way does it possess the world’s largest data sets. Like most organizations and companies, if the Hutch is to continue to lead into the future, it must compete without actually owning all the data that it will need.
The good news is that there is a clear path to success. And it builds on two features that set data apart from most other important resources.
First, unlike traditional natural resources such as oil or gas, humans create data themselves. As Satya put it during one of our Friday senior leadership team meetings at Microsoft, data is probably “the world’s most renewable resource.” What other valuable resource do we so often create without even intending to? Human beings are creating data at a rapidly growing rate. Unlike resources for which there is a finite supply or even a shortage, the world, if anything, is awash in an ever-expanding ocean of data.
This doesn’t mean that scale doesn’t matter or that larger players don’t have an advantage. They do. China has more human beings and hence more capacity to create data than any other country. But unlike, say, the Middle East, which has more than half of the world’s proven oil reserves,9 it will be hard for any country to corner the world’s market on data. People everywhere create data, and over the course of this century, it seems reasonable to expect nations everywhere to create data in some rough combination of their relative population size and economic activity.
China and the United States may be early AI leaders. But China, as large as it is, accounts for only 18 percent of the world’s population.10 And the United States represents only 4.3 percent of the world’s people.11 When it comes to the size of their economies, the United States and China have more of an advantage. The US represents 23 percent of the world’s GDP, while China represents 16 percent.12 But because the two countries are far more likely to compete than join forces, the real question is whether one nation can dominate the world’s data with less than a quarter of the global supply.
While there is no single guaranteed outcome, there is a greater opportunity for smaller players based on data’s second feature, which, as it turns out, is even more critical. Data, as economists put it, is “non-rivalrous.” When a factory is powered by a barrel of oil, that barrel is not available to any other factory. But data can be used again and again, and dozens of organizations can draw insights and learning from the same data without detracting from its utility. The key is to ensure that data can be shared and used by multiple participants.
Perhaps not surprisingly, the academic research community has been a leader in using data in this way. Given the nature and role of academic research, universities have begun to set up data repositories, where data can be shared for multiple uses. Microsoft Research is pursuing this data-sharing approach too, making available a collection of free data sets to advance research in areas such as natural language processing and computer vision, as well as in the physical and social sciences.
It was this ability to share data that inspired Matthew Trunnell. He recognized that the best way to accelerate the race to cure cancer is to enable multiple research organizations to share their data in new ways.
While this sounds simple in theory, its execution is complicated. To begin, even in a single organization, data is often stashed in silos that must be connected, a challenge made even greater when the silos sit in different institutions. The data may not be stored in a form that is readable by machines. Even if it is, different data sets are likely to be formatted, labeled, and structured in different ways that make them harder to share and use in common. If the data came from individuals, legal issues around privacy will need to be worked through. And even if the data doesn’t involve personal information, other big questions need to be hammered out, such as the governance process among organizations and the ownership of the data as it grows and is improved.
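A small, hypothetical example shows why even the formatting step is real work. Suppose two institutions record the same clinical facts under different column names and codes; before the data can be pooled, someone has to map each export into a shared schema. Everything below, from the field names to the values, is invented for illustration.

```python
# A minimal sketch of the formatting problem: two hypothetical institutions
# record the same facts under different column names, labels, and codes, so
# each export must be mapped to a shared schema before the data can be pooled.

import csv
import io

# Hypothetical export from institution 1: ages in years, sex spelled out.
SITE_A = """patient_id,age_years,sex,diagnosis_code
A-001,54,Female,C50.9
A-002,61,Male,C61
"""

# Hypothetical export from institution 2: different headers, coded sex.
SITE_B = """record,age,gender_code,icd10
B-117,47,2,C50.9
B-204,69,1,C61
"""

def normalize_site_a(row: dict) -> dict:
    return {"patient_id": row["patient_id"],
            "age": int(row["age_years"]),
            "sex": row["sex"].upper()[0],          # "Female" -> "F"
            "diagnosis": row["diagnosis_code"]}

def normalize_site_b(row: dict) -> dict:
    return {"patient_id": row["record"],
            "age": int(row["age"]),
            "sex": {"1": "M", "2": "F"}[row["gender_code"]],
            "diagnosis": row["icd10"]}

shared = []
for raw, normalize in [(SITE_A, normalize_site_a), (SITE_B, normalize_site_b)]:
    for row in csv.DictReader(io.StringIO(raw)):
        shared.append(normalize(row))

print(shared)   # one common, machine-readable structure both sites can analyze
```

Multiply that by dozens of institutions, thousands of fields, and data that changes every day, and the scale of the plumbing problem comes into focus.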
These challenges are not just technical in nature. They’re also organizational, legal, social, and even cultural. As Trunnell recognized, they stem in part from the fact that most research institutes have conducted much of their technology work with tools they developed in-house. As he says, “In addition to keeping data siloed at one organization, this approach often results in duplicative data collection, lost patient histories and outcomes, and a lack of knowledge about potentially complementary data elsewhere. Together these problems hinder discovery, slow the pace of health data research, and drive up costs.”13
The collective impact of all these impediments, Trunnell observed, makes it difficult for research organizations and technology companies to partner with each other. And the result, he found, hinders the aggregation of data sets large enough to support machine learning. In effect, the inability to overcome these barriers offers the best prospect for the AI dominance envisioned by Kai-Fu Lee.
Trunnell and others at the Hutch saw a data problem that needed to be solved, and they set out to solve it. In August 2018, Satya, himself a Hutch board member, invited a group of senior Microsoft employees to a dinner to hear about the Hutch’s work. Trunnell spoke about his vision for a data commons that would enable multiple cancer research institutes to share their data in new ways. His vision would bring together several organizations to pool their data in partnership with a tech company.
My enthusiasm grew as I listened to his presentation. In many ways, the challenge was like many others we had learned about and even experienced ourselves. As Trunnell described his plans, it reminded me of the evolution of software development. In the early days of Microsoft’s history, developers protected their source code as a trade secret, and most tech companies and other organizations developed their code by themselves. But open source had revolutionized the creation and use of software. Increasingly software developers were publishing their code under a variety of open-source models that allowed others to incorporate, use, and contribute improvements to it. This enabled broad collaboration among developers that helped accelerate software innovation.
When these developments began, Microsoft had not only been slow to embrace the change, we’d actively resisted it, including by asserting our patents against companies shipping products with open-source code. I had been a central participant in the latter aspect. But over time, and especially after Satya became the company’s CEO in 2014, we began to recognize this was a mistake. In 2016, we acquired Xamarin, a start-up that supports the open-source community. Its CEO, Nat Friedman, joined Microsoft and brought an important outside perspective to our leadership ranks.
By the start of 2018, Microsoft was using more than 1.4 million open-source components in its products, contributing back to many of these and other open-source projects, and even open sourcing many of its own foundational technologies. As a sign of how far we’d come, Microsoft had become the most prolific contributor to open source on GitHub,14 a company that was the home of software developers around the world and especially of the open-source community. In May, we decided to spend $7.5 billion to acquire GitHub.
We decided that Nat would lead the business, and as we worked on the deal, we concluded that we should join forces with key open-source groups and do the opposite of what we had done a decade before. We would pledge our patents to defend the open-source developers that had created Linux and other key open-source components. As I talked this through with Satya, Bill Gates, and then others on our board of directors, I said it was time to “cross the Rubicon.” We had been on the wrong side of history, and as we all concluded, it was time to change course and go all in on open source.
I recalled these lessons as I listened to Trunnell describe the data commons. The challenges, while complicated, were like many the open-source community had overcome. Within Microsoft, our increasing use of open-source software had led us to think through the technical, organizational, and legal challenges involved in creating it. More recently we had built one of the tech sector’s leading efforts to address the privacy and legal challenges involved in shared data use as well.
But even more striking than the pitfalls was the promise of what Trunnell described. What if we could create an open-data revolution that would do for data what open-source code had done for software? And what if this approach could outperform the work of an inward-facing institution relying on the largest proprietary data set?
The discussion reminded me of a meeting I had attended a couple of years earlier, which had surprisingly ended up focusing on the real-world impact of sharing data.
In early December 2016, a month after the presidential election, a meeting was held in Microsoft’s Washington, DC, offices to examine the impact of technology on the presidential race. The two political parties and various campaigns had used our products, as well as technology from other companies. Groups of Democrats and Republicans had agreed to meet with us separately to talk about how they’d used technology and what they’d learned.
We first met with advisers on the team from Hillary Clinton’s campaign. Throughout the 2016 campaign season, they were considered the country’s political data powerhouse. They had set up a large analytics department that built on the success of the Democratic National Committee, or DNC, and Barack Obama’s successful campaign for reelection in 2012.
The Clinton campaign had leading tech experts building what was regarded as the world’s most advanced campaign tech solutions to utilize and improve what was perhaps the country’s single best political data set. As the tech and campaign advisers told us, Robby Mook, the bright and affable Clinton campaign manager, had based most of his decision making on the insights generated by the analytics department. Reportedly as the sun set on election day on the East Coast, the entire campaign organization believed they had won the race, thanks in no small measure to their data analysis capabilities. At about dinnertime, the analytics team stepped away from its computers to receive a standing ovation from a grateful campaign staff.
A month later, that initial applause had been replaced by a growing silence from the analytics side of the defeated Clinton campaign. The campaign team had been criticized publicly for missing a rising Republican swing in Michigan until a week before the election and in Wisconsin until the night the votes were counted. But a high degree of confidence in the campaign’s data prowess still prevailed. As our debrief reached its conclusion, I asked the assembled Democratic team a simple question: “Do you believe you lost because of your data operation or in spite of it?”
Their reaction was both swift and full of self-confidence. “Without question we had the better data operation. We lost despite that.”
We took a break as the Democratic team left, and a team of leading Republicans sat down with us to compare notes.
As they described it, the surprising twists and turns that had led to the nomination of Donald Trump had a decisive impact on his campaign’s data strategy. Shortly after Barack Obama’s reelection in 2012, Reince Priebus was reelected to a second term heading the Republican National Committee, or RNC. He and his new chief of staff, Mike Shields, undertook a top-to-bottom review of the RNC’s operations in the wake of the 2012 defeat, including its technology strategy. And as often happens in the fast-paced world of technology, there emerged an opportunity to leapfrog the competition.
Priebus and Shields utilized data models from three Republican technology consulting firms and embedded them in-house at the RNC. While they lacked easy access to the Democratic-leaning talent pool in Silicon Valley, they brought in a new CTO from the University of Michigan and a young technologist from the Virginia Department of Transportation to build new algorithms for the world of politics. The two RNC leaders believed—and proved—that there was top data science talent everywhere.
Most important to the Republican tech strategists that morning was what Priebus and his team did next. They succeeded in establishing a data-sharing model that convinced not only Republican candidates across the country but also a variety of super PACs and other conservative organizations to contribute their information to a large, federated file of foundational data. Shields believed it was important to assemble as much data as possible from as many sources as possible in part because the RNC had no idea who the ultimate presidential nominee would be. Until then, they couldn’t know what types of issues or voters the candidate would find most important. So the RNC team worked to connect with as many organizations as possible and to federate as much diverse data as it could. It created a much richer total data set than anything the DNC or the Clinton campaign possessed.
When Donald Trump secured the Republican nomination in the spring of 2016, his operation lacked the deep technology infrastructure of the Clinton campaign. To make up for this deficit, Trump’s son-in-law, Jared Kushner, worked with the campaign’s digital director, Brad Parscale, on a digital strategy that would build on what the RNC already had rather than create their own. Based on the RNC’s data sets, they had identified a group of fourteen million Republicans who said they did not like Donald Trump. To turn this group of skeptics into supporters, the Trump team created Project Alamo in Parscale’s hometown of San Antonio to consolidate fund-raising, messaging, and targeting, especially on Facebook. They communicated to these voters repeatedly with messages on topics that the data said were likely to be important to them, like the opioid epidemic and the Affordable Care Act.
The Republican team described what their data operation revealed as the election approached. Ten days before the election, they estimated that they were down two points to Clinton in key battleground states. But they had identified 7 percent of the population that was still undecided about whether it would vote. And the campaign had email addresses for seven hundred thousand people who, the team believed, were likely to vote for Trump in these states if they went to the polls. They put all their energy into persuading this group to turn out.
We asked the Republican team what technology lessons they had learned from their experience. They offered several: Don’t go as deep as the Clinton team had gone in building a data operation from the ground up. Instead use one of the major commercial tech platforms and focus on building on top of it. Build with a broader federated ecosystem that brings together as many partners as possible to contribute and share data, as the RNC had done. Use this approach to focus resources on differentiated capabilities that can run on top of a commercial platform, like those developed by Parscale. And never assume that your algorithms are as good as you believe. Instead test and refine them constantly.
As the meeting concluded I asked a question similar to the one I had put to the Democrats. “Did you win because you had the best data operation or despite the fact the Clinton campaign had a better operation?”
Their response was as swift as the answer we had heard from the Democrats earlier in the day. “No question we had the better data operation. We saw Michigan start to break for Trump before the Clinton campaign did. And we saw something else that the Clinton team never saw. We saw Wisconsin break for Trump the weekend before election day.”
After both political teams had left, I turned to the Microsoft team and asked for a show of hands. Who thought the Clinton team had the better data operation and who thought the RNC/Trump team had the better operation? The vote was unanimous. Everyone in the room concluded that the approach used by Reince Priebus and the Trump campaign was superior.
The Clinton campaign had relied upon its technical prowess and its early lead. The Trump campaign, by contrast and out of necessity, had relied on something closer to the shared-data approach Matthew Trunnell described.
There will always remain plenty of room to debate the various factors that determined the outcome of the 2016 presidential race, especially in battleground states where the votes were close, such as Michigan, Wisconsin, and Pennsylvania. But as we concluded that day, Reince Priebus and the RNC’s data model quite possibly helped change the course of American history.
If a more open approach to data can do that, just imagine what else it could do.
The key to this type of technology collaboration lies with human values and processes and not just a focus on technology. Organizations need to decide whether and how to share data, and if so, on what terms. A few principles will be foundational.
The first is concrete arrangements to protect privacy. Given the evolution of privacy concerns, this is a prerequisite both for enabling organizations to share data about people and for people to be comfortable sharing data about themselves. A key challenge will involve the development and selection of techniques to share data while protecting privacy. This will likely include so-called “differential privacy” techniques, as well as providing access to aggregated or de-identified data or enabling query-only access to a data set. It may also involve the use of machine learning that is trained on encrypted data. We may well see new models emerge that enable people to decide whether to share their data collectively for this type of purpose.
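To make the idea concrete, here is a minimal sketch of one differential-privacy technique, in the spirit of the query-only access described above: the data holder answers an aggregate question with a small amount of calibrated random noise, so the released number is useful in the aggregate but reveals essentially nothing about any one person. The data set, the question, and the privacy parameter below are all invented for illustration.

```python
# A minimal sketch of one differential-privacy idea: answer an aggregate query
# (a count) with calibrated random noise, so no individual record can be
# inferred from the released number. The cohort and epsilon are illustrative.

import random

records = [{"id": i, "responded_to_treatment": random.random() < 0.4}
           for i in range(10_000)]            # hypothetical, synthetic cohort

def private_count(rows, predicate, epsilon=0.5):
    """Return a count with Laplace noise of scale 1/epsilon.

    A count changes by at most 1 when any single record is added or removed
    (its "sensitivity"), so Laplace noise with scale 1/epsilon masks any one
    person's contribution while keeping the aggregate useful.
    """
    true_count = sum(1 for r in rows if predicate(r))
    # The difference of two exponential draws is Laplace-distributed.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

answer = private_count(records, lambda r: r["responded_to_treatment"])
print(f"Noisy count of responders: {answer:.0f} (the raw records never leave the server)")
```

Real systems layer more on top of this, including budgets that cap how many such questions can be asked, but the core idea is the same: share the answer, not the records.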
A second critical need will involve security. Clearly, if data is federated and accessible by more than one organization, the cybersecurity challenges of recent years take on an added dimension. While part of this will require continuing security enhancements, we’ll also need improvements in operational security that enable multiple organizations to manage security together.
We’ll also need practical arrangements to address fundamental questions around data ownership. We need to enable groups to share data without giving up their ownership and ongoing control of the data they share. Just as landowners sometimes enter into easements or other arrangements that allow others onto their property without losing their ownership rights, we’ll need to create new approaches to manage access to data. These must enable groups to choose collaboratively the terms on which they want to share data, including how the data can be used.
In addressing all these issues, the open-data movement can take a page from open-source trends for software. At first that effort was hampered by questions about license rights. But over time standard open-source licenses emerged. We can expect similar efforts for data.
Government policies can also help advance an open-data movement. This can begin by making more government data available for public use, thereby reducing the data deficit for smaller organizations. A good example was the decision by the US Congress in 2014 to pass the Digital Accountability and Transparency Act, which makes publicly available more budget information in a standardized way. The Obama administration built on this in 2016 with a call for open data for AI, and the Trump administration has followed by proposing an integrated federal data strategy to “leverage data as a strategic asset” for government agencies.15 The United Kingdom and the European Union are pursuing similar efforts. But today only one in five government data sets is open. Much more needs to be done.16
Open data also raises important issues for the evolution of privacy laws. Current laws were mostly written before AI developments began to accelerate, and there are tensions between current laws and open data that deserve serious consideration. For example, European privacy laws focus on so-called purpose limitations that restrict the use of information to the purpose specified when the data was collected. But many times new opportunities emerge to share data in ways that will advance societal goals—like curing cancer. Fortunately, European law allows data to be repurposed when the new use is fair and compatible with the original purpose. Now there will be critical questions about how to interpret this provision.
There will also be important intellectual-property issues, especially in the copyright space. It has long been accepted that anyone can learn from a copyrighted work, like reading a book. But some now question whether this rule should apply when the learning is conducted by machines. If we want to encourage broader use of data, it will be critical that machines can do so.
Beyond practical arrangements for data owners and supportive government policy, one more need will be vital. This is the development of technology platforms and tools to enable easier and less costly data sharing.
This is one of the needs that Trunnell has encountered at the Hutch. He’s taken note of the difference between the work pursued by the cancer research community and by tech companies. New, cutting-edge tools for managing, integrating, and analyzing diverse data sets are being developed by the technology sector. But as Trunnell recognized, “the divide between those producing the data and those building novel tools is a huge missed opportunity for making impactful, life-changing—and potentially lifesaving—discoveries using the massive amount of scientific, educational, and clinical trial data being generated every day.”17
But for this to be viable, data users need a strong technology platform that is optimized for open-data use. Here the market is starting to go to work. As different tech companies consider different business models, they have alternatives from which to choose. Some may choose to collect and consolidate data on their own platform and offer access to their insights as a technology or consulting service. In many respects, this is what IBM has done with Watson and what Facebook and Google have done in the world of online advertising.
Interestingly, as I was listening to Matthew Trunnell that August evening, a team from Microsoft, SAP, and Adobe was already working on a different but complementary effort. A month later, the three companies announced the Open Data Initiative, designed to provide a technology platform and tools that enable organizations to federate data while continuing to own and maintain control of the data they share. It will include tech tools that organizations can use to identify and assess the useful data they already possess and put it into a machine-readable and structured format suitable for sharing.
Perhaps as much as anything else, an open-data revolution will require experimentation to get this right. Before our dinner ended, I pulled up a chair next to Trunnell and asked what we might do together. I was especially intrigued by the opportunity to advance work that we at Microsoft were already pursuing with other cancer institutes in our corner of North America, including with leading organizations in Vancouver, British Columbia.
By December, this work had borne fruit and we announced a $4 million Microsoft commitment to support the Hutch’s project. Formally called the Cascadia Data Discovery Initiative, the work is designed to help identify and facilitate the sharing of data in privacy-protected ways among the Hutch, the University of Washington, and the University of British Columbia and the BC Cancer Agency, both based in Vancouver. It is an early example of what is starting to spread, including the California Data Collaborative, where cities, water retailers, and land planning agencies are federating data to enable analytics-driven solutions to address water shortages.18
All of this provides cause for optimism about the future of open data, at least if we seize the moment. While some technologies benefit some companies and countries more than others, that is not always the case. For example, nations have never been forced to grapple with hard questions about who would be the world leader in electricity. Any country could put the invention to use, and the question was who would have the foresight to apply it as broadly as possible.
Societally we should aim to make the effective use of data as accessible as electricity. It is not an easy task. But with the right approach to sharing data and the right support from governments, it is more than possible for the world to create a model that will ensure that data does not become the province of a few large companies and countries. Instead it can become what the world needs it to be—an important engine everywhere for a new generation of economic growth.