8  Data: The New Frontier

8.1  Introduction

This book so far has described the phenomenon of cloud computing. This chapter moves somewhat beyond the cloud to focus on data, which is the functional essence of the cloud. Data is the new frontier, and something fundamentally new is happening to it. We are witnessing a movement away from raw data to intelligent or smart data, a movement rooted in the most nourishing of environments: the cloud. There data from different sources can combine and generate new information, in turn yielding fresh insights and business intelligence and creating valuable new products and services.

As water droplets make up the real clouds in the sky, data is the critical component of the cloud in computing. And as with clouds in the sky, data clouds can range from single isolated ones to massive complex ones formed through the interactions of powerful internal and external forces.

8.2  Consumer Data in the Cloud

The web has undergone significant innovative and disruptive phases over the last 15 years. From the initial browser through e-commerce to, more recently, the Web 2.0 phenomenon of mass participation in community-based social media sites, each phase has brought new challenges and opportunities. The movement of data into the cloud is one such phase, offering both evolutionary and revolutionary opportunities to fundamentally change what we do with data and the way we think about it. Questions of ownership, management, distribution, control, and privacy of data will become challenging. This is already happening for a significant portion of web users. Social media users upload their profiles, social activities, and musings onto the web with a willing acceptance that ownership and privacy concerns are secondary to communication and connectedness. People gladly hand over their personal details to be able to play games or run a virtual farm on Facebook. Sometimes they don’t even know that they have moved data from their digital devices such as phones and computers into the cloud. The services being consumed and the types of data shared in the cloud are varied, as is evident in the range of sites and companies shown in Figure 8.1 (Meeker et al. 2010).

8-1.jpg

Figure 8.1: The cloud is everywhere today: the connected user can access a wide range of products and services on the cloud. The user can work, shop and be entertained from anywhere—a paradise for some! (Copyright 2010 Morgan Stanley)

Being able to see and follow activity within a network of friends has helped drive a greater demand for transparency all round, particularly in the US. Citizens demand insight into government activities and especially into how their taxes are being spent. Similarly, these groups of connected users, marking a new generation of workers, expect connectedness amongst their peers as a workplace norm. This in turn has encouraged some C-level executives to experiment with and learn from new methods of managing and controlling data flow, both inside and outside of their organization, to foster an environment where serendipity and good discovery become the norm.

8.3  Change in Mindset

Conceptually cloud computing has done away with the need for physical ownership of computing resources. It also challenges the orthodoxy of the traditional IT command and control model. Cloud computing enables the decoupling of applications from infrastructure, of data from infrastructure and applications, and even of data from the organization. The challenge now is to leverage this change in mindset to maximize the potential of data in the cloud.

Large amounts of consumer opinions on products and services are being willingly offered and published by consumers themselves on social networking sites. These comments are becoming a treasure trove of business intelligence providing market insights into consumer behavior. They are being used to great effect by companies such as Ford Motor Company, PepsiCo, and Southwest Airlines, to name but a few (Bughin 2010). This group of active consumers has driven the first wave of data in the cloud.

Governments are behind the second wave, offering their data more or less unencumbered in the cloud, thereby giving companies and in particular start-ups and active citizens the opportunity to exploit it for their own advantage. Some companies are already leveraging data currently available in the cloud. More will do so to complement their own internal information systems.

The opportunity now exists for companies to experiment with placing some or all of their own data in the cloud, where it will meld with other data there and be used by others in different, perhaps even unexpected and unimagined ways. Business data in the cloud will form the third wave.

The movement of data into the cloud has started. It is likely that large amounts of data from governments, non-governmental organizations (NGOs), corporations, commercial information providers and web users will become available through the cloud. We can expect that there will be many interesting opportunities to participate in making data available in the cloud and harnessing other data there.

8.4  Data Evolution

The Internet is pervasive and is connecting, creating, transporting, and consuming ferocious amounts of data from increasingly diverse sources including mobile phones, RFID (radio frequency identification) tags, sensors, healthcare, financial services, and social networking environments. It is estimated that the amount of data on networks will triple by 2014 from today’s volume, as shown in Cisco’s Visual Networking Index (VNI) in Figure 8.2 (see Cisco in the References section of this book). This explosive growth will then be equivalent to moving approximately 14 billion DVDs over the network every month (Miller 2010).

These enormous amounts of data support the metaphor of data being the lifeblood of the Internet. Data is critical to the effective functioning of every modern person and organization. This is evidenced by what has been achieved over the last 15 years on the web.

8-2.jpg

Figure 8.2: Cisco’s Visual Networking Index (VNI) charts the explosive growth in data and breaks down the data types

The initial web—Web 0—saw the launch and commercialization of the browser, the birth of Netscape, and the subsequent browser wars. This professionalized the publication and sharing of material on the web, and it was transformational: content could now be viewed easily by many rather than a few. Commercial transactions and new types of business then evolved on the web. This phase—Web 1.0—saw companies like eBay, Amazon, and Google define and dominate the web.

The next phase—Web 2.0—sees the advent of participation and engagement where users actively engage and communicate with each other. Web 2.0 and social media have become synonymous (Bloem 2009). Companies like Facebook, Twitter and Apple are currently driving this era of the web. It also marks a movement towards a plethora of device types that are used to create and consume content: the PC, phone, console, TV, and tablet. The movement from publication to participation has again been transformative. The potential of true participation and its impact on society, governments, and business is only starting to be realized. Participation is as significant an event as the original browser, according to the distinguished scholar Professor Larry Prusak (see Prusak’s URL in the References section of this book).

Yet despite all of the advances in the web, which we now take for granted, the underlying data that drives and enables each phase of the web has changed little. It has grown enormously in volume over the years, but data at its core remains much the same. All of the complexity and increased functionality of use has been achieved through multiple waves of innovation. More complex and faster computers, routers, and networks have helped create better technologies, methods, algorithms, and applications, with the many acronyms shown in Figure 8.3 (Hawke 2010; Linked Data 2010). These have helped deliver today’s web but mask the underlying rawness of the core data. At issue is the difficulty computer technology has in intelligently understanding data: a router knows the destination of a packet but not its content; a search engine will return a result without knowing the meaning of the words in it.

8-3.jpg

Figure 8.3: Linked data leverages the best of today’s web technology and is the forefront of creating a smart data environment

8.5  Dumb versus Smart Data

The data underpinning the web today can be characterized as “dumb”: computers don’t know the meaning of the text forming the web pages we read. The innovation challenge now is to make the data “smart,” so that computer technology can infer meaning and do smart things with it. If we realize that all of the successes of the web to date have been achieved on “dumb” data, we can barely imagine what will be possible with “smart” data. Until now, the perception of smartness has been achieved by applications performing ever more clever extract, transform, and load (ETL) functions to create and deliver the right information.

One of the most endearing features of the web is its ease of use and lack of formal structure. Text is relatively easy to input and publish, and humans automatically know how to read, interpret, and understand the meaning and context of text on the web, converting raw text into coherent and readable stories. Computers, on the other hand, cannot easily complete the same task. Without structure, computers are at a loss: they suffer from a text problem. As most of the web is created using unstructured text, the challenge is to create structure without constraining the ability to easily input and publish. Add to this the enormous amount of unstructured data already inside the enterprise, from emails, marketing material, customer information, and feedback, and the challenges ahead are daunting.
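To make the text problem concrete, here is a minimal sketch of lifting structure out of unstructured text. The sentences, the pattern, and the `capitalOf` vocabulary are illustrative assumptions, not a real information-extraction system; industrial approaches use NLP rather than a single regular expression.

```python
import re

# Illustrative only: turn unstructured sentences into structured
# (subject, predicate, object) statements a computer can act on.
TEXT = "Dublin is the capital of Ireland. Paris is the capital of France."

def extract_capitals(text):
    """Pull 'X is the capital of Y' statements out of raw text."""
    pattern = re.compile(r"(\w+) is the capital of (\w+)")
    return [(city, "capitalOf", country) for city, country in pattern.findall(text)]

triples = extract_capitals(TEXT)
print(triples)
# A human reads the sentences effortlessly; the computer only gains
# access to the facts once they are lifted into this structured form.
```

The brittleness of the pattern is the point: any phrasing it does not anticipate is invisible to the machine, which is why creating structure at web scale is so hard.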

Significant advances have been made in creating smart data and putting structure around unstructured text. The smartest data being researched today centers on the semantic web, often referred to as “Web 3.0.” Sir Tim Berners-Lee first proposed the semantic web in 2001 (Berners-Lee et al. 2001). It is a natural extension of the current web and anticipates a future where computers will (semi-)automatically understand data using advances in fields such as artificial intelligence (AI), natural language processing (NLP), ontologies, linguistics, and reasoning. It is a very active research topic: at DERI, the largest semantic web research center in the world (see DERI in the References section of this book), over 130 researchers are already dedicated to this topic alone.

The semantic web is a web where data and content are linked, in contrast to today’s web of linked pages and hyperlinks. It is a web understood by humans and computers alike. While the realization of this vision may be some time off, the first instances of structured smart data are already being created and made available in the cloud. The first version of this smart data is called “linked data” and sits between the existing web and the semantic web (as illustrated in Figure 8.3). The set of linked data is growing all the time. A snapshot view of it is shown in Figure 8.4 (W3C-SWEO 2010): the diagram gives a high-level view of the different data domains currently available and of the links between them.

For example, the geographic references in the DBpedia data set (which is essentially Wikipedia in a more structured form; see DBpedia in the References section of this book) are linked to the GeoNames data set, which in turn is linked to and from many other data sets such as US Census data (see Figure 8.4). Once a data set is linked into the cloud, it is possible to navigate across, access, and link each of the other data sets. Therefore, an application that references a city, for example, can automatically access available information about the city from linked data sources in the cloud, such as the CIA’s Factbook (see CIA in the References section of this book).
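The mechanics of that traversal can be sketched in a few lines. The triples below are simplified stand-ins for the DBpedia and GeoNames records just described: the identifiers, the `sameAs` link, and the facts are all illustrative assumptions rather than real URIs or real data.

```python
# Simplified stand-ins for two linked data sets. The "sameAs" triple is
# the cross-dataset link that makes traversal possible.
dbpedia_like = [
    ("dbpedia:Galway", "type", "City"),
    ("dbpedia:Galway", "sameAs", "geonames:2964180"),  # link into GeoNames-like set
]
geonames_like = [
    ("geonames:2964180", "population", 75529),
    ("geonames:2964180", "country", "Ireland"),
]

# Once published as linked data, the two sets form one navigable graph.
graph = dbpedia_like + geonames_like

def facts_about(entity, triples):
    """Follow 'sameAs' links and gather every fact reachable from entity."""
    facts, seen, frontier = [], set(), [entity]
    while frontier:
        node = frontier.pop()
        if node in seen:
            continue
        seen.add(node)
        for s, p, o in triples:
            if s == node:
                if p == "sameAs":
                    frontier.append(o)  # hop into the other data set
                else:
                    facts.append((p, o))
    return facts

print(facts_about("dbpedia:Galway", graph))
```

An application that starts from the city entity picks up the population and country facts automatically, even though they were published by a different data set; this is the navigation described above.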

8-4.jpg

Figure 8.4: Linked data is represented by a graph. Once you enter the graph you can traverse it and access any information attached (Cyganiak and Jentzsch 2010)

Putting data in the cloud in its current “dumb” state limits its potential. Exposing data in a linked, smart format in the cloud is different: potentially groundbreaking and powerful. It benefits from the network effect described metaphorically by Metcalfe’s law and illustrated in Figure 8.5 (see Wikipedia in the References section of this book). Linked data gains value with the addition of each new participant to the network, much as the telephone network did.

8-5.jpg

Figure 8.5: Linked Data benefits from being connected as the telephone did with the addition of new users. Metcalfe’s law applies

The network effect will be achieved most when all linked data is exposed to and accessible throughout the cloud. We have already seen the benefits from linking computers and users, and now it is time to take advantage of linking and connecting data.
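The arithmetic behind Metcalfe’s law makes the point vivid: a network of n participants contains n(n−1)/2 possible pairwise connections, so value grows roughly with the square of the number of participants. The short sketch below simply computes that count.

```python
# Metcalfe's law, as invoked above: a network's value grows with the
# number of possible connections among its n participants, n*(n-1)/2.

def possible_connections(n):
    """Number of distinct pairwise links in a network of n nodes."""
    return n * (n - 1) // 2

for n in (2, 10, 100, 1000):
    print(n, possible_connections(n))
# Doubling the participants roughly quadruples the connections, which is
# why each newly linked data set enriches every set already in the cloud.
```

The same reasoning applies whether the nodes are telephones, users, or data sets: each addition benefits everyone already connected.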

8.6  Data in the Cloud

Several ingenious cases are emerging that exemplify the potential of linked data for organizations. These cases cross several sectors including life sciences, government, and media.

A life science example

Linking Open Drug Data (LODD) is an initiative of the Health Care and Life Sciences Interest Group of the W3C to create open linked data about drugs. The sources of data range from information about the impacts of drugs on gene expression to clinical trial results. Approximately 370,000 links to external data sources are contained in the LODD data sets (see Figure 8.6 (W3C-HCLSIG 2010)). One interesting case demonstrated how users find information on the effect of Chinese herbs on particular diseases. The users can also find relevant clinical trial information, active ingredients, and any reported side effects. The data has also been used to help medical researchers investigate genes of herbs and how they could help in specific diseases (Jentzsch 2009).

8-6.jpg

Figure 8.6: Linking open drug data

A government example

High-profile initiatives, like the US government’s effort to make its data more accessible, have helped greatly in making massive amounts of linked data available in the cloud. Data.gov is the site for accessing this data, with a stated purpose of democratizing public-sector data and driving innovation. One year after its launch it has helped create a community around open linked data that includes 6 countries establishing open linked data initiatives (including Britain and Canada), 8 US states offering data sites, 8 American cities with open data, and approximately 275,000 data sets made available (see Data-Gov-A in the References section of this book). Some of these data sets are illustrated in Figure 8.7 (Li and Hendler 2009).

8-7.jpg

Figure 8.7: The breadth and depth of available and accessible government data is enormous. Its utility is greatly increased when accessible in linked format and connected to other open and linked data sources.

It is a great example of where the old adage of “build it and they will come” actually works. Some very interesting and diverse uses are being created using this data. One such use is shown in Figure 8.8, which charts the percentage of cancelled or diverted flights by destination (see Fly on Time in the References section of this book).

8-8.jpg

Figure 8.8: The percentage of cancelled or delayed flights by destination, based on US government open linked data

A media example

The BBC created a new music web site that reuses linked data from MusicBrainz and Wikipedia to help it deliver an enhanced and differentiating service. A snapshot of the site is shown in Figure 8.9 (see BBC Music in the References section of this book). Matthew Shorter, interactive editor of music at the BBC, puts forward three arguments for using linked data (Blumauer 2009B, Ferguson 2009):

1. Reuse: BBC avoids wasting money in creating data that is already available in the public domain through MusicBrainz and Wikipedia.
2. Search Engine Optimization (SEO): more meaningful linkages between data yields better search retrieval of content.
3. Open Platform: a better user experience means extended session times. It also increases the likelihood that other users will access the site and bring their music links with them, thereby extending its reach and value to all users.

8-9.jpg

Figure 8.9: BBC music integrates various linked data music sources to create a richer site and better user experience

8.7  Tapping into the Data Potential for Organizations

As the previous examples show, the real value of linked data lies in the actual links. These links can be exploited to access information, enabling organizations to experiment, evaluate, and innovate with new sources and types of information. As with any type of new initiative, business executives will be looking for a clear business rationale to justify the expenditure and define the benefits. These benefits could be for the organization itself, but could also be a contribution to the organization’s ecosystem, where collaboration is taking place. For governments and NGOs the incremental cost of exposing already existing information to the cloud in linked-data format can be defended in terms of serving and benefiting the common good. For most commercial organizations, open linked data as a full-on business proposition is a long way off. In addition, companies cannot contemplate moving mission-critical information systems until the technology creating linked data is more robust and mature. At the moment, most data that resides in organizations’ private clouds or inside traditional platforms does not have enough classification information in it to allow easy distinction between the data you would and would not want to share.

In an interview, Prof. Dr. Chris Bizer, the force behind DBpedia, suggested that publicly available linked data could be used by organizations as a data backdrop to augment corporate data. This augmentation can begin in an experimental manner and quickly develop into a more robust deployed application. Most critically, it can be achieved without radically changing the corporate applications or data sets. In the same interview, he suggested that linked data could be used as a lightweight data-integration technology.

This approach is incremental and experimental, but avoids the big upfront investment required in modeling global schemas used in classic data-warehousing projects (Blumauer 2009A).
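The backdrop idea can be sketched as a lightweight join: corporate records are enriched at query time with public facts keyed on a shared attribute, and the corporate data set itself is never modified. All of the records, field names, and figures below are illustrative assumptions, not real corporate or DBpedia data.

```python
# Hypothetical corporate records, deliberately left untouched.
corporate_customers = [
    {"name": "Acme Ltd", "city": "Berlin"},
    {"name": "Globex",   "city": "Madrid"},
]

# Facts of the kind a public linked-data source such as DBpedia might
# supply; the values here are illustrative assumptions.
public_city_facts = {
    "Berlin": {"country": "Germany", "population": 3_400_000},
    "Madrid": {"country": "Spain",   "population": 3_200_000},
}

def augment(records, backdrop, key):
    """Merge public facts into copies of each record; originals are untouched."""
    return [{**r, **backdrop.get(r[key], {})} for r in records]

enriched = augment(corporate_customers, public_city_facts, "city")
print(enriched)
```

Because the enrichment happens in copies, the experiment can be discarded or extended freely, which is exactly the incremental, low-commitment quality being described.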

logoDBpedia.jpg

As of April 2010, the DBpedia data set describes more than 3.4 million things, of which 1.5 million are classified in a consistent manner, including 312,000 persons, 413,000 places, 94,000 music albums, 49,000 films, 15,000 video games, 140,000 organizations, 146,000 species, and 4,600 diseases. The data set features labels and abstracts for these things in up to 92 different languages; 1,460,000 links to images and 5,543,000 links to external web pages; 4,887,000 external links into other RDF data sets; 565,000 Wikipedia categories; and 75,000 YAGO categories (see Wikipedia in the References section of this book).

At the Semantic Technology 2010 conference, linked data was described as a viable means of augmenting corporate data and creating better information for applications in financial services. Sample applications include (Semantic Universe 2010):

Mergers and acquisitions.
Anti-money laundering.
Anti-counterfeiting.
Customer and market analysis.
Business intelligence.

Like many large data sets, linked data needs to be sourced, cleansed, packaged, and of good enough quality and accuracy to be of use to organizations. Maintaining or verifying it might be done by corporations themselves, but can also be taken on by third parties. Real value and intellectual property (IP) can be created from this type of data, which creates new business opportunities for the companies themselves, incumbent information providers and start-ups alike.

8.8  Challenges

As with any technology or phenomenon that is new and evolving, linked data in the cloud introduces a whole raft of new concerns for C-level executives. These concerns need to be examined before taking big steps in sharing data publicly or using someone else’s data. The challenges can be grouped into three sections: legal, data, and technology.

As linked data itself is new and evolving, legal opinions are also entering new and uncharted territories. Principal amongst these is that actual ownership, location, and consumption of data may reside simultaneously in different jurisdictions with conflicting regulatory, legal, IP, and privacy requirements. Established governance, regulatory, and legal frameworks may also not be appropriate. As computers can infer meaning from linked data, possibly misleading or untrue statements could be created, leading to all sorts of potential legal problems (Harley et al. 2009). At the same time, the regulatory and legal frameworks that are currently in place are probably not yet ready to cover large-scale adoption and use of linked data by organizations. This too will add to the legal concerns. However, innovative technology solutions and the necessities of businesses to innovate and participate in this space will help replace legal uncertainty with legal clarity. It only takes a few pioneers to lead the way, greatly helping other organizations’ adoption.

From a data perspective several interesting issues arise, some of which also hold true for more traditional data sources. They include:

Attributing authorship of original and derivative data.
Knowing the quality and the accuracy of the data and its source.
Dealing with data duplication and disambiguation (resolving conflicts in meaning).
Identifying data source and lineage (or provenance) and knowing the retention requirements for different types of data across various jurisdictions.

A recent Pew Research Center report expressed wariness about further exposing private information to governments, corporations, thieves, opportunists, and human and machine error (Anderson and Rainie 2010). A lot of these issues are evolving in the social media space and solutions to most of these issues are being created as they arise.

When it comes to analyzing data, companies like Google handle text differently from most organizations. Traditional web development environments (for example, LAMP: Linux, Apache, MySQL, and PHP) find it difficult to scale and process large volumes of data. Newer technologies (like Hadoop and NoSQL) are becoming available to overcome these difficulties. Acquiring the competencies to use these new and emerging technologies will take time, and these skills will initially be scarce.

logohadoop.jpg

Hadoop is a software framework that enables applications to work with thousands of nodes and massive amounts of data.

logomongoDB.jpg

NoSQL databases are next-generation databases that are non-relational, distributed, and scalable. MongoDB is one such database.
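The processing model that lets Hadoop spread work across thousands of nodes can be illustrated in miniature: map each record to key-value pairs, group the pairs by key, then reduce each group. The toy word count below runs in a single process and is for intuition only; real Hadoop jobs distribute these phases across a cluster.

```python
from collections import defaultdict

# A toy, single-process illustration of the map/reduce model.
documents = ["the cloud holds data", "smart data in the cloud"]

def map_phase(docs):
    # Map: emit a (word, 1) pair for every word in every document.
    return [(word, 1) for doc in docs for word in doc.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts for each key. In Hadoop, a shuffle step
    # first routes all pairs with the same key to the same reducer.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

word_counts = reduce_phase(map_phase(documents))
print(word_counts)
```

Because the map phase treats each document independently and the reduce phase treats each key independently, both can be parallelized across machines, which is the property that makes the model scale to the data volumes discussed above.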

Additional technical complexities arise in creating, curating, and provisioning data when real-time access to information is required, as in financial services and new media. Latency or lack of bandwidth can also be a problem, depending on particular business and data needs. As new requirements and demands emerge, new technologies will be developed to satisfy them, so technological obsolescence is an ongoing risk.

8.9  Conclusion

Something new is afoot regarding data. Gone are the days of data created in a manner that renders it “dumb” and unintelligible to computers. Making data “smart” is the next big thing—the new frontier. With smart data, computers can discover, interpret, and manipulate data and infer meaning, and linked data is a first step in this direction. Governments are in a first wave that aims to make data more accessible to citizens and commercial entities. Industry is following, with news media and life sciences in particular showing early promise, and other industries close behind. Linked data affords organizations the opportunity to look at and exploit data differently; decoupled from technology, applications, and indeed the organization itself, it offers a cost-effective way to experiment with diverse data sources. These sources can be private, partner, public-domain, or third-party information providers. Linked data presents the opportunity to truly tap into a vast resource of data and convert it into real information and knowledge. Companies that plan for and innovate around this new type of data will engage with their customers, partners, and competitors differently and will bring new types of products and services to market faster. Data is innovation. Data is the new frontier.