Coping with “Big Data”

eScience

8.1 Introduction

The growing interest in making research publicly available has resulted in the development of web infrastructure and services that allow the production, management, and storage of large volumes of data. The rise of “Big Data,” a term used to denote the exponential growth of data volumes, was called “the fourth paradigm” of scientific research by Jim Gray of Microsoft, the first three paradigms being experimental, theoretical, and computational science (Tolle et al., 2011). Initially a characteristic of some areas of biology, particle physics, astronomy, and the environmental sciences, this revolution has since spread to the humanities, social sciences, and other disciplines (Bartling and Friesike, 2014).

The rapid accumulation of research data is creating an urgent need for action before organizations and companies get swamped by unmanageable volumes of data. How are all these data going to be managed, preserved, and stored? Who will do it? How are we going to find them? What are researchers going to do with the data?

A government report, “U.S. Open Data Action Plan,” provided directions for “managing information as a national asset and opening up its data, where possible, as a public good to advance government efficiency, improve accountability, and fuel private sector innovation, scientific discovery, and economic growth” (US Government, 2014).

8.2 Types of research data

Data are everything produced in the course of research. According to the National Science Foundation (NSF), “What constitutes such data will be determined by the community of interest through the process of peer review and program management. This may include, but is not limited to: data, publications, samples, physical collections, software and models” (US National Science Foundation (NSF), 2010). The UK government defined data as “Qualitative or quantitative statements or numbers that are assumed to be factual, and not the product of analysis or interpretation” (UK Government, 2014).

Data are generated through experimental, statistical, or other processes and are often stored locally. Some of the most common types of data include laboratory experimental data, computer simulation, textual analysis, and geospatial data. In the social sciences, data include numeric data, audio, video, and other digital content generated in the course of social studies or obtained from administrative records. For scholars in the humanities, data consist of textual and semantic elements. Data can be structured or unstructured; simple or complex; and uniform or heterogeneous in terms of syntax, structure, and semantics. They are organized in datasets that are curated, archived, stored, and made accessible.

Graphic displays can communicate information much more clearly than pure numerical data. Some of the visual representations of data may include a dynamic component such as time and computer animation and simulation. Visualization of research data is particularly important in some disciplines, such as astrophysics, biology, chemistry, geophysics, mathematics, meteorology, and medicine.

8.3 Managing data

Data management is an important part of the research process. Researchers have to ensure that their data are accurate, complete, and authentic. Some of the tools that are used to manage research data depend on the nature of the discipline and the type of experiment, while others are of a more general type that can be used in many research activities, such as data storage and analysis (Liu et al., 2014; Ray, 2014).

The management of data consists of four layers: curation, preservation, archiving, and storage.

8.3.1 Data curation

A uniform understanding of the meaning or semantics of the data and their correct interpretation requires defining a number of characteristics or attributes. The process of data authentication, archiving, management, preservation, retrieval, and representation is known as data curation. In the course of this process, metadata (which are data about data) are created to help data discovery and retrieval. Metadata summarize data content, structure, context, relationships, and relevance. They can be viewed as a specific kind of information structured to describe, explain, locate, or make it possible to retrieve, use, or manage data. There are three types of metadata:

• Descriptive metadata are about the content of data (e.g. title, author, and keywords).

• Administrative metadata stipulate preservation, ownership, and technical information about formats.

• Structural metadata are about the design and specification of data structures.
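The three types can be illustrated with a small, hypothetical metadata record for a dataset; the field names below are invented for illustration and are not drawn from any particular metadata standard:

```python
# A hypothetical metadata record for a dataset, grouped by metadata type.
# Field names are invented for illustration; real schemas (e.g., Dublin
# Core or the DataCite schema) define their own vocabularies.
record = {
    "descriptive": {            # the content of the data
        "title": "Hourly air-temperature readings, Station A",
        "creator": "Smith, J.",
        "keywords": ["temperature", "meteorology"],
    },
    "administrative": {         # preservation, ownership, and formats
        "rights_holder": "Example University",
        "file_format": "text/csv",
        "preservation_copy": True,
    },
    "structural": {             # design of the data structure itself
        "columns": ["timestamp", "temperature_c"],
        "row_count": 8760,
    },
}
```

Grouping a record this way makes each role explicit: a search interface would index the descriptive block, a repository would act on the administrative block, and software reading the files would rely on the structural block.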

Metadata include information about data provenance, a term used to describe the history of data from the moment they were created, through all their modifications in content and location. Data provenance is particularly important to scholarship today, because the digital environment makes it so easy to copy and paste data (Frey and Bird, 2013; Liu et al., 2014).

The adoption of the digital object identifier (DOI) was an important step in improving the process of dataset identification, verification, and discovery. A DOI (discussed in more detail in Chapter 15 of this book) is a unique persistent identifier that is attached to a dataset, an article, or other creative work. It is similar to an International Standard Book Number (ISBN), which allows a book to be tagged, discovered, and retrieved.
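Because all DOIs resolve through the central doi.org proxy, turning an identifier into a persistent link is purely mechanical, as the short sketch below shows. The DOI used is invented (the 10.9999 prefix is not registered to anyone):

```python
def doi_to_url(doi: str) -> str:
    """Build the resolver URL for a DOI.

    Every DOI resolves through the doi.org proxy, so a persistent
    link is simply the identifier appended to https://doi.org/.
    """
    return "https://doi.org/" + doi.strip()

# Invented example DOI:
print(doi_to_url("10.9999/example.dataset.1"))
# https://doi.org/10.9999/example.dataset.1
```

Because the proxy, not the link itself, knows the dataset's current location, the link keeps working even when the dataset moves to a new server.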

8.3.2 Data preservation, archiving, and storage

Data preservation and storage are among the biggest challenges that data present. Surveys of researchers at different institutions have shown that research data are often kept in spreadsheets that do not support complex analysis, sharing, or reuse. They are saved on computers or external hard drives—sometimes without any backup. More and more data, though, are now stored in the “cloud” or managed with locally installed software tools (Frey and Bird, 2013). Hardware architecture for storage is very important, but software also plays a significant role. Data can be stored as flat files, indexed files, relational databases, binary files, or any other electronic format. Some data are not easy to preserve and require special software to store them. If only some of the data are to be preserved, rules have to be designed to filter the data and prevent the loss of those that are important.
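The difference between a flat file and a relational store can be sketched with Python's standard library alone; the sample readings below are invented:

```python
import csv
import io
import sqlite3

# Invented sample data: (timestamp, temperature in degrees Celsius)
readings = [("2015-01-01T00:00", 4.2), ("2015-01-01T01:00", 3.9)]

# Flat-file storage: a CSV, written here to an in-memory buffer.
flat = io.StringIO()
writer = csv.writer(flat)
writer.writerow(["timestamp", "temperature_c"])
writer.writerows(readings)

# Relational storage: the same data in an in-memory SQLite table,
# which supports the ad hoc queries a spreadsheet or flat file cannot.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (timestamp TEXT, temperature_c REAL)")
db.executemany("INSERT INTO readings VALUES (?, ?)", readings)
(mean_temp,) = db.execute("SELECT AVG(temperature_c) FROM readings").fetchone()
```

The flat file is easy to create and exchange, while the relational copy can be filtered, aggregated, and joined, which is exactly the kind of analysis the spreadsheets mentioned above do not support.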

8.4 Data standards

Data standards are accepted agreements on format, representation, definition, structuring, manipulation, tagging, transmission, use, and management of data. A data standard describes the required content and format in which particular types of data are to be presented and exchanged. Establishing standards is an important step toward managing data in a more consistent way, and their absence results in data corruption, loss of information, and lack of interoperability.

In the United States, the development and application of technical standards for publishing bibliographic and library applications are coordinated by the National Information Standards Organization (NISO) (www.niso.org). NISO is currently involved in the development of assessment criteria for nontraditional research outputs such as datasets, visualization, software, and other applications (National Information Standards Organization, 2015). The organization also provides webinars and training in data management, discovery tools, new forms of publishing, and alternative metrics (discussed in Chapter 14 of this book) for evaluation of research.

Building new data standards is a difficult task, because different areas of science require different standards. Genomics research, for example, generates enormous volumes of data that are not all numeric. In a review addressing the creation of data standards for the life sciences, Vogt presented arguments for establishing two different types of standard—one for scientific argumentation and another one for the production of data. The first type would “… use criteria of logical coherence and empirical grounding. They are important for the justification of an explanatory hypothesis. Data and metadata standards, on the other hand, concern the data record itself and all steps and actions taken during data production and are based on virtues of objectivity …” (Vogt, 2013).

Vogt also suggested that, “… in order to meet the requirements of eScience, the specification of the currently popular minimum information checklists should be complemented to cover four aspects: (i) content standards, which increase reproducibility and operational transparency of data production, (ii) concept standards, which increase the semantic transparency of the terms used in data records, (iii) nomenclatural standards, which provide stable and unambiguous links between the terms used and their underlying definitions or their real referents, and (iv) format standards, which increase compatibility and computer-parsability of data records.”

8.5 Citing data

While there are established conventions for citing published papers, there is no uniformly accepted format for citing digital research data. The conventions now emerging vary by discipline, but some common elements within them are becoming apparent. A CODATA-ICSTI report presents an overview of the citation practices of individual organizations and disciplines and identifies the following set of “first principles” for data citation, which can be adapted by different disciplines, organizations, and countries to guide the development and implementation of data citation practices and protocols (CODATA-ICSTI, 2013):

1. Status of data: Data citations should be accorded the same importance in the scholarly record as the citation of other objects.

2. Attribution: Citations should facilitate giving scholarly credit and legal attribution to all parties responsible for those data.

3. Persistence: Citations should be as durable as the cited objects.

4. Access: Citations should facilitate access both to the data themselves and to such associated metadata and documentation as are necessary for both humans and machines to make informed use of the referenced data.

5. Discovery: Citations should support the discovery of data and their documentation.

6. Provenance: Citations should facilitate the establishment of provenance of data.

7. Granularity: Citations should support the finest-grained description necessary to identify the data.

8. Verifiability: Citations should contain information sufficient to identify the data unambiguously.

9. Metadata standards: Citations should employ widely accepted metadata standards.

10. Flexibility: Citation methods should be sufficiently flexible to accommodate the variant practices among communities but should not differ so much that they compromise interoperability of data across communities.
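Several of these principles (attribution, persistence, access) correspond directly to elements of a typical data citation. The function below assembles one; the element order is an illustrative sketch of emerging practice, not a prescribed style, and the example dataset and DOI are invented:

```python
def format_data_citation(creators, year, title, publisher, doi, version=None):
    """Assemble a data citation from its common elements.

    The order (creator, year, title, version, publisher, persistent
    identifier) is illustrative only; real styles vary by community.
    """
    parts = ["; ".join(creators), f"({year})", title + "."]
    if version:
        parts.append(f"Version {version}.")
    parts.append(publisher + ".")
    parts.append("https://doi.org/" + doi)  # persistent identifier
    return " ".join(parts)

# Invented example dataset and DOI:
print(format_data_citation(
    creators=["Smith, J.", "Lee, K."],
    year=2014,
    title="Hourly air-temperature readings, Station A",
    publisher="Example Data Repository",
    doi="10.9999/example.dataset.1",
    version="1.2",
))
```

The DOI at the end serves the persistence and access principles, the creator list supports attribution, and the version number supports granularity and verifiability.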

Data publication and citation are very important to the scientific community, as they make the scientific process more transparent and allow data creators to receive credit for their work. Just like citation of articles, citation of datasets demonstrates the impact of the research and will benefit authors. Developing mechanisms for peer review of data will ensure the quality of datasets and will allow analysis of data and conclusions made on the basis of these data (Kratz and Strasser, 2014).

8.6 Data sharing

Many research labs are now using cloud storage facilities such as Dropbox, OneDrive, Google Drive, and Box to share data (Mitroff, 2014). Services such as the subscription-based Basecamp (https://basecamp.com) are used by research groups and larger units for project management purposes. Scientific online collaboration platforms such as colwiz (www.colwiz.com) and Zoho (www.zoho.com) are also popular for managing research data and files. Chapter 9 of this book is entirely devoted to electronic laboratory notebooks (ELNs), which are used for recording, managing, and sharing research data and other information.

As studies of the research practices of scientists have shown, there are significant differences between the disciplines in the practices used for managing and sharing data, which often depend on the culture of the lab or the organization (Bird and Frey, 2013; Harley, 2013; Long and Schonfeld, 2013; Meadows, 2014; Ray, 2014; Tenopir et al., 2011). Researchers in some disciplines share data through a central database available to many researchers, while others do not share them until they publish the results.

An example of such a discipline-specific research culture is bioinformatics, a conglomerate of interdisciplinary techniques that transforms large volumes of diverse data into more manageable and usable information. In this discipline, data are handled and analyzed by annotating macromolecular sequence and structure databases and by classifying sequences or structures into groups. Researchers can interact with these databases and manipulate them, a capability of critical importance to users (Scott and Parsons, 2006).

Wiley’s Researcher Data Insights Survey of 90,000 recent authors of papers in different disciplines demonstrated significant differences across research fields and geographic locations in how researchers share their data (Ferguson, 2014; Wiley, 2014). The results showed that only 52% of researchers make their data publicly available. Of the seven countries for which data were presented, Germany had the highest rate of sharing (55%). Researchers also differed in where they shared their data: 67% provided their data as supplementary material to papers submitted for publication; 37% posted them on a personal, institutional, or project web page; 26% deposited them in an institutional repository; 19% used a discipline-specific repository; and 6% submitted them to a general-purpose repository such as Dryad or figshare.

Researchers had different motivations for sharing data: 57% said that sharing was common practice in their organization; 55% wanted to make their research more visible and increase its impact; and half of the respondents considered sharing a public benefit. Other reasons included journal, institutional, or funders’ requirements; transparency and reuse; discoverability and accessibility; and preservation.

According to an article that examined disciplinary differences in data sharing, researchers in the life sciences were the most likely to share their data (66%), while those in the social sciences and the humanities were the least likely (36%). Another article, based on surveys of and interviews with researchers about current data-sharing practices, discussed the concerns researchers have about sharing data (Tenopir et al., 2011). According to this article, the most common concerns were associated with legal issues, data misuse, and incompatible data types. In the United States, most raw data are not copyrightable, but the situation is different in other countries. It is important that authors understand their rights before publishing their data and adhere to ethical rules when reusing the data of others.

8.7 eScience/eResearch

The term eScience (also called eResearch) has different meanings, but all of them are associated with large datasets and with the tools for producing and managing large volumes of digital scientific data. The National Science Board of the NSF addressed the complexities of modern science in the report “Digital Research Data Sharing and Management”: “Science and engineering increasingly depend on abundant and diverse digital data. The reliance of researchers on digital data marks a transition in the conduct of research” (US National Science Board, 2011). The increasing scale, scope, and complexity of datasets are fundamentally changing the way researchers share, store, and analyze data. Large and complex datasets pose significant challenges for the research community, because they can be difficult to store, manage, and analyze with conventional approaches and tools. Such datasets may need special management processes and may even have to be split and stored across several servers. Another major problem in managing large volumes of data is that existing applications often use proprietary data types, which prevents their data from being parsed by third-party applications.

The long-term preservation of digital data is of critical importance for their future use. The US National Science Board outlined the following strategies for cooperation in data preservation and storage (US National Science Board, 2011):

Strategic partnerships between key stakeholder communities should be developed to collectively support the development of effective data repositories and stewardship policies. Funding agencies, university-based research libraries, disciplinary societies, publishers, and research consortia should distribute responsibilities that address the establishment and maintenance of digital repositories.

8.8 Data repositories and organizations involved in data preservation

Some of the available data repositories are funded by governments or by professional organizations, while others are commercial enterprises. The discipline-specific repositories are specifically designed to accommodate metadata for the particular subject area. They are smaller and are usually funded by grants. In the United States, the federal funding for repositories is increasing, and Europe is also making significant investments in building such facilities. Some of the data repositories and organizations involved in data preservation and sharing are presented below:

The Board on Research Data and Information at the National Academies (The National Academies, 2014) has a mission to improve the policy and use of digital data and information in science and society.

The Coalition for Networked Information (CNI) (www.cni.org) publishes papers focused on eScience and data management. Also, on the CNI website (http://www.cni.org/event-calendar/), there is a list of upcoming conferences and workshops that include some on eScience/data management.

The Committee on Data for Science and Technology (CODATA) (www.codata.org), an interdisciplinary committee under the umbrella of the International Council for Science (ICSU), is dedicated to improving the quality, management, and availability of data in science and technology.

DataCite (datacite.org) is a data registry created and supported by several libraries from different countries. DataCite assigns DOIs—an effort that is coordinated by the California Digital Library (CDL), Purdue University Libraries, and the Office of Scientific and Technical Information (OSTI) of the Department of Energy (DOE). The creation of DataCite and the assignment of DOIs to datasets was a major undertaking, which turned out to be even more difficult than the creation of ISBNs.

Databib (http://www.re3data.org/) is a searchable catalog, registry, directory, and bibliography of research data repositories. It allows users to identify and locate online repositories of research data.

Digital Curation Centre (DCC) (www.dcc.ac.uk) is a leading international center for data management located in the United Kingdom.

The Data Observation Network for Earth (DataONE) (www.dataone.org) is an aggregator supported by the NSF under the DataNet program. DataONE provides archiving facilities for ecological and environmental data produced by scientists worldwide.

The DataNet Project (US National Science Foundation (NSF)) is an aggregator of resources for interdisciplinary research.

The Dataverse Network (thedata.org) provides open-source software that can be downloaded, installed, and customized by an institution or organization to host their own Dataverse repository.

The Digital Public Library of America (DPLA) (dp.la) is a free aggregated repository combining resources from many organizations, including the Library of Congress, HathiTrust, and the Internet Archive. It provides access to large digitized collections of books, images, historic records, and audiovisual materials.

Distributed Data Curation Center (D2C2) (d2c2.lib.purdue.edu) is a research center at Purdue University.

The Dryad Digital Repository (datadryad.org) is a nonprofit free repository for data underlying publications from the international scientific and medical literature. It is a curated resource that accepts many data types and makes them discoverable.

DuraSpace (www.duraspace.org) is an independent nonprofit organization founded in 2009 through a collaboration of the Fedora Commons organization and the DSpace Foundation, which are two of the largest providers of open-source repository software. Many institutions worldwide that use DSpace or Fedora open-source repository software belong to the DuraSpace community.

figshare (figshare.com) is a commercial online digital repository where researchers can preserve and share their research outputs, including figures, datasets, images, and videos. It is free to upload content and free to access, in adherence to the principle of open data. It allows users to upload any file format that can be visualized in a browser (Fenner, 2012).

The Harvard Dataverse Network (thedata.harvard.edu/dvn) is open to all researchers worldwide to publish research data across all disciplines. It is a repository for long-term preservation of research data that provides permanent identifiers for datasets.

The National Information Standards Organization (NISO) (www.niso.org) is a nonprofit organization that supports the discovery, retrieval, management, and preservation of published content. It connects libraries, publishers, and information systems vendors that develop technical standards and provide education about technological advances in information exchange.

OpenDOAR (www.opendoar.org) is a directory of open-access academic repositories. It is one of the SHERPA services including RoMEO and JULIET that are discussed in Chapter 2 of this book.

The Registry of Research Data Repositories (www.re3data.org) offers researchers, funding organizations, libraries, and publishers a directory of existing international repositories for research data. It was created initially by the Berlin School of Library and Information Science at the Humboldt-Universität zu Berlin, the Library and Information Services (LIS) Department of the GFZ German Research Centre for Geosciences, and the KIT Library at the Karlsruhe Institute of Technology (KIT). All records from Databib are now integrated in it, and by the end of 2015, re3data.org will become a service of DataCite and be included in its suite of services.

VIVO (www.vivoweb.org) is an interdisciplinary network that enables collaboration among researchers across all disciplines and allows users to search for information on people, departments, courses, grants, and publications. Initiated by Cornell University (vivo.cornell.edu), VIVO is an open-source semantic web application that is installed locally. Participating institutions have control over content and over the search and browse capabilities.

Zenodo (zenodo.org), a repository hosted at CERN (near Geneva, Switzerland), was created through the OpenAIREplus project of the European Commission to enable researchers and institutions in the European Union (EU) to share multidisciplinary research results such as data and publications that are not part of the other institutional or disciplinary repositories.

8.9 Data management plans

Research institutions and government organizations have become concerned about how researchers manage, preserve, and share their data. NSF and other government funding organizations have introduced new policies aimed at preserving the integrity of data and allowing their sharing, analysis, and discovery (Peters and Dryden, 2011; US National Science Foundation, 2014). Grant proposals submitted to the NSF now must include a data management plan, which describes how the proposal will conform to NSF policy on the dissemination and sharing of research results (US National Science Foundation, 2013). This plan may include the following information (US National Science Foundation, 2014):

• Types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project.

• Standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies).

• Policies for access and sharing, including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements.

• Policies and provisions for reuse, redistribution, and the production of derivatives.

• Plans for archiving data, samples, and other research products, and for preservation of access to them.

The requirements do not stipulate how data will be managed internally by investigators, but rather focus on how they will be shared and disseminated externally and preserved in the long term. Each data management plan addresses the following aspects of research data: metadata (descriptions of created materials such as types of data, samples, physical collections, and software), standards (for data and metadata formats and content), access policies, and archiving.
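The plan's required elements lend themselves to a simple checklist. The sketch below is a hypothetical skeleton, not an official NSF template; the section names are invented paraphrases of the elements described above:

```python
# A hypothetical skeleton for a data management plan. Section names
# paraphrase the typical required elements; they are illustrative,
# not official NSF wording.
DMP_SECTIONS = [
    "types_of_data",        # data, samples, collections, software produced
    "standards",            # data and metadata formats and content
    "access_and_sharing",   # policies, privacy, confidentiality, IP
    "reuse_and_redistribution",
    "archiving_and_preservation",
]

def missing_sections(plan: dict) -> list:
    """Return the required sections a draft plan has not yet filled in."""
    return [s for s in DMP_SECTIONS if not plan.get(s)]

draft = {"types_of_data": "Sensor time series and analysis scripts.",
         "standards": "CSV files with documented metadata headers."}
print(missing_sections(draft))
# ['access_and_sharing', 'reuse_and_redistribution', 'archiving_and_preservation']
```

Treating the plan as structured data rather than free text makes it easy for an institution to check drafts for completeness before submission.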

8.10 eScience and academic libraries

It is rare to see a job announcement for a science librarian today that does not have “data management” or “eScience” mentioned in the title. Academic librarians are exploring new roles in supporting researchers by helping them in the acquisition, storage, retrieval, analysis, and preservation of data (Bailey, 2011; Bennett and Nicholson, 2007; Carlson et al., 2011; Corrall et al., 2013; Cox and Corrall, 2013; Heidorn, 2011; Prado and Marzal, 2013; Wang, 2013).

In 2011, the Association of Research Libraries (ARL) (Association of Research Libraries, 2014) offered a 6-month-long program called eScience Institute, designed to help research libraries develop a strategic agenda for eScience support, with a particular focus on the sciences. The institute included a series of modules for teams to complete at their institutions and organized a final workshop for participants in the program (Antell et al., 2014).

8.10.1 Data curation

Academic libraries will have an important role to play in ensuring the archiving and preservation of data (Ball, 2013; Cox and Pinfield, 2014; Garritano and Carlson, 2009). Datasets will be treated as important collections. Several years ago, I interviewed James Mullins, Dean of Purdue University Libraries, who was one of the pioneers in introducing eScience in academic libraries (Baykoucheva, 2011). This is how Dr. Mullins described the possible roles that librarians could play in eScience:

Working in the area of data management draws upon the principles of library and archival sciences. Our ability to see structure to overlay on a mass of disparate “parts,” as well as the ability to identify taxonomies to create a defined language for accessing and retrieving data is what is needed from us …

Once a librarian has the experience of talking with researchers about their research and the challenges they have with managing data, it becomes clear that the most important factor is not our subject expertise (although some subject understanding is needed) but rather the librarian’s knowledge of metadata and taxonomies. In the old days we would have said that this is “cataloging and classification,” but today, to convey that we have morphed into a new role, it is best to use the more technical terminologies since it may help identify our “new” role as a cutting edge initiative and not be encumbered with past misperceptions.

8.10.2 Data preservation and storage

Should libraries be responsible for storing research data? And if they are not, who should be? Many universities now maintain institutional repositories, which are online archives for collecting, preserving, and disseminating the intellectual output of a research institution. These repositories provide a preservation and archival infrastructure that allows researchers to share and get recognition for their data. University repositories have started accepting datasets, but libraries will have to adapt the existing standards for bibliographic metadata to new research outputs such as datasets. How do you catalog entities that have no title or publisher, sometimes even no obvious author? The new generation of institutional repository systems maintained by academic libraries will need to handle a variety of metadata formats, allow immediate analysis, and provide streaming capabilities. More sophisticated archival and preservation features will be essential in making repositories more attractive to researchers as places to deposit their work.

8.10.3 Data literacy

Being able to read text is not a problem, but how about reading and understanding data? To be able to do this, you need to have certain skills in data literacy, which is a different type of literacy. There is a need for creating data literacy standards and core data literacy competencies, and some efforts in this direction have already been made (Prado and Marzal, 2013).

Research institutions and individual academic departments are now looking at introducing a mandatory requirement that graduate students attend training in data management. Until now, graduate students obtained some data management information as part of research ethics training. Academic libraries could find a niche opportunity in offering such training to graduate students through workshops or by integrating it in the library instruction classes they teach in science courses.

8.10.4 Are the academic libraries promising more than they can deliver?

The initial view of how libraries could support eScience was that they would help researchers draft the data management plans required by funding agencies (Ball, 2013). It soon became clear that researchers do not need librarians’ help with these plans, which tend to be standard for any given discipline and can be prepared from templates. As for helping researchers create metadata for their datasets, this role might not be easy to play, mainly because of the subject expertise it requires. A possible involvement for libraries could be to educate students and researchers in how to create metadata, but it is still not clear how important this role could be, as most datasets carry very little metadata.

Steven Bell considers the engagement of libraries with “Big Data” a “double-edged sword.” He sees “… a place for academic librarians to help their institutions succeed in the effort to capture and capitalize on Big Data. That is well and good, but let’s remember to bring to it the same sensibility that has made us wise evaluators and collectors of information in the past” (Bell, 2013). In spite of all the enthusiasm about eScience, some authors are not certain what role librarians will play in it. An article that looked at the skills and requirements for eScience-related library positions concluded that at present, eScience librarianship is not a well-defined field and that “the role of librarians in e-science is nebulous” (Alvaro et al., 2011).

In his interview included in Chapter 10 of this book, Gary Wiggins discussed the role of academic libraries in supporting eScience, cautioning that this might be an effort for which libraries will need more time to prepare. The biggest problems that academic librarians will face will be their lack of subject expertise and specific skills and gaining the trust of researchers as experts in this field. As suggested in an article, “librarians need to act as a voice for balance and reason within their organizations. While Big Data analytics has many interesting possibilities that should be explored, there is no substitute for more traditional methods of research” (Hoy, 2014).

8.11 Conclusion

Data constitute an integral part of the research life cycle, and where they are not available to support published results, questions may be raised about the trustworthiness of those results. Many scientific journals now require that raw data be included with papers submitted for publication. Research data alone have no meaning unless they are properly described and interpreted. This is how Joshua Schimel summarized the transition from raw data to interpretation (Schimel, 2011):

The role of scientists is to collect data and transform them into understanding. Their role as authors is to present that understanding. However, going from data to understanding is a multi-step process. The raw data that come from an instrument need to be converted to information, which is then transformed into knowledge, which in turn is synthesized and used to produce understanding.

The issue with huge volumes of data is not only having enough space to store them but also describing and indexing datasets and creating metadata, which take much longer. In the future, efforts in the area of data management will be focused on preserving the integrity and quality of data. Data classification, visualization, and contextualization will be equally important.

With research becoming more interdisciplinary and global, the need to design technologies and platforms that will allow researchers to collaborate more efficiently will become even more important. Further implementation of a semantic-based eScience infrastructure, Science 2.0 tools, and new web technologies will make information sharing and collaboration much easier (Bird and Frey, 2013). As the mandate by the NSF and other funding agencies forces a change in the sharing practices, researchers will see more benefits of making their results more open.

To create a culture of data citation and linking on a larger scale, significant changes are needed in the publishing infrastructure that will make data citation, linking, and reuse an integral part of the publication models. Proper citing and attribution of data will make it more difficult for someone to “steal” research data. The adoption of stricter rules for data attribution is likely to convince more researchers to share their data with their peers and even to make them openly available.

Academic libraries are transitioning from providing platforms for information to providing data-related services (Arms et al., 2009; Gradmann, 2014; Peters and Dryden, 2011; Tenopir et al., 2014). Engaging in these new areas will give academic libraries an opportunity to reimagine themselves and become more closely aligned with the research and educational missions of their institutions.