The goal of data science is to improve decision making by basing decisions on insights extracted from large data sets. As a field of activity, data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting nonobvious and useful patterns from large data sets. It is closely related to the fields of data mining and machine learning, but it is broader in scope. Today, data science drives decision making in nearly all parts of modern societies. Some of the ways that data science may affect your daily life include determining which advertisements are presented to you online; which movies, books, and friend connections are recommended to you; which emails are filtered into your spam folder; what offers you receive when you renew your cell phone service; the cost of your health insurance premium; the sequencing and timing of traffic lights in your area; how the drugs you may need were designed; and which locations in your city the police are targeting.
The growth in the use of data science across our societies is driven by the emergence of big data and social media, the speedup in computing power, the massive reduction in the cost of computer memory, and the development of more powerful methods for data analysis and modeling, such as deep learning. Together these factors mean that it has never been easier for organizations to gather, store, and process data. At the same time, these technical innovations and the broader application of data science mean that the ethical challenges related to the use of data and individual privacy have never been more pressing. The aim of this book is to provide an introduction to data science that covers the essential elements of the field at a depth that provides a principled understanding of it.
Chapter 1 introduces the field of data science and provides a brief history of how it has developed and evolved. It also examines why data science is important today and some of the factors that are driving its adoption. The chapter finishes by reviewing and debunking some of the myths associated with data science. Chapter 2 introduces fundamental concepts relating to data. It also describes the standard stages in a data science project: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Chapter 3 focuses on data infrastructure and the challenges posed by big data and the integration of data from multiple sources. One aspect of a typical data infrastructure that can be challenging is that data in databases and data warehouses often reside on servers different from the servers used for data analysis. As a consequence, when large data sets are handled, a surprisingly large amount of time can be spent moving data between the servers that host a database or data warehouse and the servers used for data analysis and machine learning. Chapter 3 begins by describing a typical data science infrastructure for an organization and some of the emerging solutions to the challenge of moving large data sets within a data infrastructure, which include the use of in-database machine learning, the use of Hadoop for data storage and processing, and the development of hybrid database systems that seamlessly combine traditional database software and Hadoop-like solutions. The chapter concludes by highlighting some of the challenges in integrating data from across an organization into a unified representation that is suitable for machine learning. Chapter 4 introduces the field of machine learning and explains some of the most popular machine-learning algorithms and models, including neural networks, deep learning, and decision-tree models.
Chapter 5 focuses on linking machine-learning expertise with real-world problems by reviewing a range of standard business problems and describing how they can be addressed with machine-learning solutions. Chapter 6 reviews the ethical implications of data science, recent developments in data regulation, and some of the new computational approaches to preserving the privacy of individuals within the data science process. Finally, chapter 7 describes some of the areas where data science will have a significant impact in the near future and sets out some of the principles that are important in determining whether a data science project will succeed.