An obvious trend in modern societies is the proliferation of systems that can sense and react to the world: smart phones, smart homes, self-driving cars, and smart cities. This proliferation of smart devices and sensors presents challenges to our privacy, but it is also driving the growth of big data and the development of new technology paradigms, such as the Internet of Things. In this context, data science will have a growing impact across many areas of our lives. There are two areas, however, where data science will lead to particularly significant developments in the coming decade: personalized medicine and the development of smart cities.
In recent years, the medical industry has been exploring and adopting data science and predictive analytics. Doctors have traditionally had to rely on their experience and instincts when diagnosing a condition or deciding on the next treatment. The evidence-based-medicine and precision-medicine movements argue that medical decisions should instead be based on data, ideally linking the best available data to an individual patient’s predicament and preferences. For example, in the case of precision medicine, fast genome-sequencing technology means that it is now feasible to analyze the genomes of patients with rare diseases in order to identify the mutations that cause the disease and to design and select therapies specific to that individual. Another factor driving data science in medicine is the cost of health care. Data science, in particular predictive analytics, can be used to automate some health care processes. For example, predictive analytics has been used to decide when antibiotics and other medicines should be administered to babies and adults, and it is widely reported that many lives have been saved because of this approach.
Medical sensors that are worn, ingested, or implanted are being developed to continuously monitor a patient’s vital signs and behaviors and how his or her organs are functioning throughout the day. These data are continuously gathered and fed back to a centralized monitoring server, where health care professionals can access the data generated by all their patients, assess each patient’s condition, understand what effects a treatment is having, and compare a patient’s results with those of other patients with similar conditions to inform decisions about the next steps in the treatment regime. Medical science is using the data generated by these sensors, integrated with additional data from the various parts of the medical profession and the pharmaceutical industry, to determine the effects of current and new medicines. Personalized treatment programs are being developed based on the type of patient, his or her condition, and how his or her body responds to various medicines. In addition, this new type of medical data science is feeding into new research on medicines and their interactions, the design of more efficient and detailed monitoring systems, and the uncovering of greater insights from clinical trials.
Various cities around the world are adopting new technology to gather and use the data generated by their citizens in order to better manage the cities’ organizations, utilities, and services. There are three core enablers of this trend: data science, big data, and the Internet of Things. The name “Internet of Things” describes the internetworking of physical devices and sensors so that these devices can share information. This may sound mundane, but it means that we can now remotely control smart devices (such as a properly configured smart home), and it opens up the possibility that networked machine-to-machine communication will enable smart environments to autonomously predict and react to our needs (for example, there are now commercially available smart refrigerators that can warn you when food is about to spoil and allow you to order fresh milk through your smart phone).
Smart-city projects integrate real-time data from many different sources into a single data hub, where the data are analyzed and used to inform management and planning decisions. Some smart-city projects involve building brand-new cities that are smart from the ground up. Both Masdar City in the United Arab Emirates and Songdo City in South Korea are brand-new cities built with smart technology at their core and with a focus on being eco-friendly and energy efficient. However, most smart-city projects involve retrofitting existing cities with new sensor networks and data-processing centers. For example, in the SmartSantander project in Spain,1 more than 12,000 networked sensors have been installed across the city to measure temperature, noise, ambient lighting, carbon monoxide levels, and parking. Smart-city projects often focus on improving energy efficiency, planning and routing traffic, and matching utility services to population needs and growth.
Japan has embraced the smart-city concept with a particular focus on reducing energy usage. The Tokyo Electric Power Company (TEPCO) has installed more than 10 million smart meters across homes in its service area.2 At the same time, TEPCO is developing and rolling out smart-phone applications that enable customers to track the electricity used in their homes in real time and to change their electricity contract. These applications also enable TEPCO to send each customer personalized energy-saving advice. Outside the home, smart-city technology can reduce energy usage through intelligent street lighting. The Glasgow Future Cities Demonstrator is piloting street lighting that switches on and off depending on whether people are present. Energy efficiency is also a top priority for new buildings, particularly large local-government and commercial buildings, whose energy efficiency can be optimized by automatically managing climate controls through a combination of sensor technology, big data, and data science. An extra benefit of these smart-building monitoring systems is that they can also monitor pollution levels and air quality and activate the necessary controls and warnings in real time.
Transport is another area where cities are using data science. Many cities have implemented traffic-monitoring and management systems that use real-time data to control the flow of traffic through the city. For example, they can control traffic-light sequences in real time, in some cases to give priority to public-transport vehicles. Data on city transport networks are also useful for planning public transport. Cities are examining routes, schedules, and vehicle management to ensure that services support the maximum number of people and to reduce the costs of delivering those services. In addition to modeling the public-transport network, data science is being used to monitor official city vehicles to ensure their optimal usage. Such projects combine data on traffic conditions (collected by sensors along the road network, at traffic lights, and so on), the type of task being performed, and other factors to optimize route planning, with dynamic adjustments fed to the vehicles as live updates to their routes.
Beyond energy usage and transport, data science is being used to improve the provision of utility services and to support longer-term planning of infrastructure projects. The provision of utility services is constantly monitored based on current and projected usage, taking into account previous usage under similar conditions. Utility companies are using data science in a number of ways. One is monitoring the delivery network for the utility: the supply, the quality of the supply, any network issues, areas with higher-than-expected usage, automated rerouting of the supply, and any anomalies in the network. Another is monitoring their customers: looking for unusual usage that might indicate criminality (for example, a grow house), customers who may have tampered with the equipment or meters in their buildings, and customers who are most likely to default on their payments. Data science is also being used in city planning to examine the best way to allocate housing and associated services. Models of population growth are built to forecast future demand, and based on various simulations city planners can estimate when and where support services, such as high schools, will be needed.
Data science projects sometimes fail to deliver what was hoped for: they get bogged down in technical or political issues, they do not produce useful results, or, more typically, they are run once (or a couple of times) but never again. Just like Leo Tolstoy’s happy families,3 the success of a data science project depends on a number of factors. Successful data science projects need focus, good-quality data, the right people, the willingness to experiment with multiple models, integration into the business information technology (IT) architecture and processes, buy-in from senior management, and an organization’s recognition that because the world changes, models go out of date and need to be rebuilt semiregularly. Failure in any of these areas is likely to result in a failed project. This section details the common factors that determine the success of data science projects as well as the typical reasons why they fail.
Every successful data science project begins by clearly defining the problem that the project will help solve. In many ways, this step is just common sense: it is difficult for a project to be successful unless it has a clear goal. Having a well-defined goal informs the decisions regarding which data to use, what ML algorithms to use, how to evaluate the results, how the analysis and models will be used and deployed, and when the optimal time might be to go through the process again to update the analysis and models.
A well-defined question can be used to define what data are needed for the project. Having a clear understanding of what data are needed helps to direct the project to where these required data are located. It also helps identify what data are currently unavailable and hence which additional projects might be needed to capture and make these data available. It is important, however, to ensure that the data used are of good quality. Organizations may have poorly designed applications, a poor data model, or staff who are not trained to ensure that good data get entered; indeed, myriad factors can lead to bad-quality data in systems. The need for good-quality data is so important that some organizations hire people specifically to inspect the data, assess their quality, and feed back ideas on how to improve the data captured by the applications and by the people inputting them. Without good-quality data, it is very difficult for a data science project to succeed.
When sourcing the required data, it is important to check what data are already being captured and used across the organization. Unfortunately, the approach taken by some data science projects is to look at what data are available in the transactional databases (and other data sources) and then to integrate and clean these data before going on to data exploration and analysis. This approach completely ignores the BI team and any data warehouse that might exist. In many organizations, the BI and data-warehouse team is already gathering, cleaning, transforming, and integrating the organization’s data into one central repository. If a data warehouse already exists, then it probably contains all or most of the data required by a project, so using it can save a significant amount of time on integrating and cleaning the data. The warehouse will also contain much more historical data than the current transactional databases. If the data warehouse is used, it is possible to go back a number of years, build predictive models using the historic data, roll these models through various time periods, and then measure each model’s predictive accuracy. This process allows for the monitoring of changes in the data and how they affect the models. In addition, it is possible to monitor variations in the models produced by ML algorithms and how the models evolve over time. Following this kind of approach makes it possible to demonstrate how the models work and behave over a number of years and helps build the customer’s confidence in what is being done and what can be achieved. For example, in one project where five years of historical data were available in the data warehouse, it was possible to demonstrate that the company could have saved US$40 million or more over that time period. If the data warehouse had not been available or used, it would not have been possible to demonstrate this conclusion. Finally, when a project uses personal data, it is essential to ensure that the use of these data is in line with the relevant antidiscrimination and privacy regulations.
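To make this rolling evaluation concrete, here is a minimal sketch, assuming the warehouse extract can be loaded into a pandas DataFrame with a year column, a set of feature columns, and a binary target; the column names, the choice of logistic regression, and the accuracy metric are all illustrative rather than prescriptive.

```python
# Illustrative sketch: train a model on each year of warehouse data and
# evaluate it on the following year, to show how predictive accuracy
# holds up over time. All names (df, feature_cols, "target", "year")
# are hypothetical placeholders for the real warehouse extract.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def rolling_backtest(df, feature_cols, target_col="target", year_col="year"):
    """Return a dict mapping each test year to the model's accuracy on it."""
    results = {}
    years = sorted(df[year_col].unique())
    for train_year, test_year in zip(years[:-1], years[1:]):
        train = df[df[year_col] == train_year]
        test = df[df[year_col] == test_year]
        model = LogisticRegression(max_iter=1000)
        model.fit(train[feature_cols], train[target_col])
        predictions = model.predict(test[feature_cols])
        results[test_year] = accuracy_score(test[target_col], predictions)
    return results
```

Charting the per-year scores produced by this kind of loop is one way to show stakeholders how a model trained in one period performs in later periods and whether its accuracy degrades as conditions change.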
A successful data science project often involves a team of people with a blend of data science competencies and skills. In most organizations, a variety of people in existing roles can and should contribute to data science projects: people working with databases, people who work with the ETL process, people who perform data integration, project managers, business analysts, domain experts, and so on. But organizations often still need to hire data science specialists—that is, people with the skills to work with big data, to apply ML, and to frame real-world problems in terms of data-driven solutions. Successful data scientists are willing and able to work and communicate with the management team, end users, and everyone else involved to show and explain how data science can support their work. It is difficult to find people who have both the required technical skill set and the ability to communicate and work with people across an organization. However, this blend is crucial to the success of data science projects in most organizations.
It is important to experiment with a variety of ML algorithms to discover which works best with a given data set. All too often in the literature, examples are given of cases where only one ML algorithm was used, perhaps because the authors are discussing the algorithm that worked best for them or simply their favorite. Currently there is a great deal of interest in the use of neural networks and deep learning, but many other algorithms can be used, and these alternatives should be considered and tested. Furthermore, for data science projects based in the EU, the General Data Protection Regulation (GDPR), which goes into effect in May 2018, may become a factor in the selection of algorithms and models. A potential side effect of this regulation is that an individual’s “right to explanation” in relation to automated decision processes that affect them may limit the use, in some domains, of complex models that are difficult to interpret and explain (such as deep neural network models).
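As a hedged illustration of this kind of experimentation, the sketch below uses scikit-learn cross-validation to compare a handful of candidate classifiers on a stand-in data set; the particular algorithms, hyperparameters, and data set are illustrative choices, not recommendations.

```python
# Illustrative sketch: compare several candidate algorithms with
# cross-validation rather than committing to a single favorite.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the project's data

candidates = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "decision tree": DecisionTreeClassifier(max_depth=5),
    "random forest": RandomForestClassifier(n_estimators=200),
    "neural network": MLPClassifier(max_iter=2000),
}

for name, model in candidates.items():
    # Scale the inputs and estimate accuracy with 5-fold cross-validation.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

A comparison like this also surfaces the interpretability trade-off noted above: the logistic regression and decision tree are far easier to explain to an affected individual than the neural network, even when their accuracy is similar.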
When the goal of a data science project is being defined, it is vital also to define how the outputs and results of the project will be deployed within the organization’s IT architecture and business processes. Doing so involves identifying where and how the model will be integrated within existing systems and whether the generated results will be used directly by end users or fed into another process. The more automated this process is, the quicker the organization can respond to its customers’ changing profiles, thereby reducing costs and increasing potential profits. For example, if a customer-risk model is built for the loan process in a bank, it should be built into the front-end system that captures the customer’s loan application. That way, when the bank employee is entering the loan application, she can be given live feedback by the model and can use this feedback to address any issues with the customer. Another example is fraud detection. Traditionally, it can take four to six weeks to identify a potential fraud case that needs investigation. By building data science into transaction-monitoring systems, organizations can now detect potential fraud cases in near real time. By automating and integrating data-driven models, quicker response times are achieved, and actions can be taken at the right time. If the outputs and models created by a project are not integrated into the business processes, then these outputs will not be used, and, ultimately, the project will fail.
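To make the loan example concrete, here is a minimal sketch of how a trained risk model might sit behind the application-capture screen; the model file, feature names, and review threshold are hypothetical, and a production deployment would add input validation, logging, and audit trails.

```python
# Hypothetical sketch: a scoring function the loan-application front end
# could call to give the employee live feedback. The model file, feature
# names, and threshold are illustrative, not from any real system.
import joblib
import pandas as pd

MODEL = joblib.load("customer_risk_model.joblib")  # assumed pre-trained pipeline
REVIEW_THRESHOLD = 0.7                              # illustrative cut-off

def score_loan_application(application: dict) -> dict:
    """Return a risk score and a recommendation for a single application."""
    features = pd.DataFrame([application])          # one-row feature frame
    risk = float(MODEL.predict_proba(features)[0, 1])
    return {
        "risk_score": risk,
        "recommendation": "refer for review" if risk >= REVIEW_THRESHOLD else "proceed",
    }

# Example call from the front-end system (field names are hypothetical):
# score_loan_application({"income": 42000, "loan_amount": 15000, "tenure_years": 4})
```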
In most organizations, support from senior management is crucial to the success of a data science project. However, most senior IT managers are very focused on the here and now: keeping the lights on, making sure their day-to-day applications are up and running, making sure the backups and recovery processes are in place (and tested), and so on. Successful data science projects tend to be sponsored by senior business managers (rather than by an IT manager) because the former are focused not on the technology but on the processes involved in the data science project and how its outputs can be used to the organization’s advantage. The more focused a project sponsor is on these factors, the more successful the project is likely to be. The sponsor also plays a key role in informing the rest of the organization about the project and selling its benefits. But even when data science has a senior manager as an internal champion, a data science strategy can still fail in the long term if the initial data science project is treated as a box-ticking exercise. The organization should not view data science as a one-off project. For an organization to reap long-term benefits, it needs to build its capacity to execute data science projects regularly and to use the outputs of these projects. Viewing data science as a strategy takes long-term commitment from senior management.
Most data science projects will need to be updated and refreshed on a semiregular basis. In each new iteration, new data can be added, new algorithms can be tried, and so on. The frequency of these iterations will vary from project to project; it could be daily, quarterly, biannual, or annual. Checks should be built into the production deployment of the data science outputs to detect when models need updating (see Kelleher, Mac Namee, and D’Arcy 2015 for an explanation of how to use a stability index to identify when a model should be updated).
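One common formulation of such a check is a population-style stability index, which compares the distribution of model scores at training time with the distribution on recent data; the sketch below is illustrative, and the exact definition used by Kelleher, Mac Namee, and D’Arcy (2015) may differ in its details.

```python
# Illustrative sketch of a stability index: bin the scores the model produced
# at training time ("expected") and the scores on recent data ("actual"),
# then sum (actual% - expected%) * ln(actual% / expected%) across the bins.
import numpy as np

def stability_index(expected_scores, actual_scores, bins=10):
    """Larger values indicate a bigger shift between the two distributions."""
    edges = np.histogram_bin_edges(expected_scores, bins=bins)
    expected_counts, _ = np.histogram(expected_scores, bins=edges)
    actual_counts, _ = np.histogram(actual_scores, bins=edges)
    # Convert counts to proportions; clip to avoid log(0) and division by zero.
    expected_pct = np.clip(expected_counts / expected_counts.sum(), 1e-6, None)
    actual_pct = np.clip(actual_counts / actual_counts.sum(), 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Rule-of-thumb thresholds (illustrative): below 0.1 little change has occurred,
# 0.1 to 0.25 suggests monitoring, and above 0.25 suggests rebuilding the model.
```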
Humans have always abstracted from the world and tried to understand it by identifying patterns in their experiences of it. Data science is the latest incarnation of this pattern-seeking behavior. However, although data science has a long history, the breadth of its impact on modern life is without precedent. In modern societies, the words precision, smart, targeted, and personalized are often indicative of data science projects: precision medicine, precision policing, precision agriculture, smart cities, smart transport, targeted advertising, personalized entertainment. The common factor across all these areas of human life is that decisions have to be made: What treatment should we use for this patient? Where should we allocate our policing resources? How much fertilizer should we spread? How many high schools do we need to build in the next four years? Who should we send this advertisement to? What movie or book should we recommend to this person? The power of data science to help with decision making is driving its adoption. Done well, data science can provide actionable insight that leads to better decisions and ultimately better outcomes.
Data science, in its modern guise, is driven by big data, computer power, and human ingenuity from a number of fields of scientific endeavor (from data mining and database research to machine learning). This book has tried to provide an overview of the fundamental ideas and concepts required to understand data science. The CRISP-DM project life cycle makes the data science process explicit and provides a structure for the data science journey from data to wisdom: understand the problem, prepare the data, use ML to extract patterns and create models, use the models to get actionable insight. The book also touches on some of the ethical concerns relating to individual privacy in a data science world. People have genuine and well-founded concerns that data science has the potential to be used by governments and vested interests to manipulate our behaviors and police our actions. We, as individuals, need to develop informed opinions about what type of a data world we want to live in and to think about the laws we want our societies to develop in order to steer the use of data science in appropriate directions. Despite the ethical concerns we may have around data science, the genie is already very much out of the bottle: data science is having and will continue to have significant effects on our daily lives. When used appropriately, it has the potential to improve our lives. But if we want the organizations we work with, the communities we live in, and the families we share our lives with to benefit from data science, we need to understand and explore what data science is, how it works, and what it can (and can’t) do. We hope this book has given you the essential foundations you need to go on this journey.