Chapter 3

Machine Learning Means … Using a Machine to Learn from Data


If you’ve been watching any news for the past decade, you’ve no doubt heard of a concept called machine learning — often referenced when reporters are covering stories on the newest amazing invention from artificial intelligence. In this chapter, you dip your toes into the area called machine learning, and in Part 3 you see how machine learning and data science are used to increase business profits.

Defining Machine Learning and Its Processes

Machine learning is the practice of applying algorithmic models to data over and over again so that your computer discovers hidden patterns or trends that you can use to make predictions. It’s also called algorithmic learning. Machine learning has a vast and ever-expanding assortment of use cases, including

  • Real-time Internet advertising
  • Internet marketing personalization
  • Internet search
  • Spam filtering
  • Recommendation engines
  • Natural language processing and sentiment analysis
  • Automatic facial recognition
  • Customer churn prediction
  • Credit score modeling
  • Survival analysis for mechanical equipment

Walking through the steps of the machine learning process

Three main steps are involved in machine learning: setup, learning, and application. Setup involves acquiring data, preprocessing it, selecting the most appropriate variables for the task at hand (called feature selection), and breaking the data into training and test datasets. You use the training data to train the model, and the test data to test the accuracy of the model’s predictions. The learning step involves model experimentation, training, building, and testing. The application step involves model deployment and prediction.

Remember Here’s a rule of thumb for breaking data into training and test sets: Randomly sample two-thirds of the original dataset and use that sample to train the model. Use the remaining one-third of the data as test data, for evaluating the model’s predictions.

Technicalstuff A random sample contains observations that each have an equal probability of being selected from the original dataset. A simple example of a random sample is illustrated in Figure 3-1. You need your sample to be randomly chosen so that it represents the full dataset in an unbiased way. Random sampling allows you to train and test a model without selection bias.

FIGURE 3-1: An example of a simple random sample
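
To make the two-thirds rule of thumb concrete, here is a minimal Python sketch of a random train-and-test split. It uses only the standard library. The function name train_test_split, the seed value, and the 30-row stand-in dataset are assumptions made for illustration, not part of any particular toolkit.

```python
import random

def train_test_split(rows, train_fraction=2 / 3, seed=42):
    """Randomly split rows into (train, test) lists."""
    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # every row is equally likely to land anywhere
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

dataset = list(range(30))                   # stand-in for 30 observations
train, test = train_test_split(dataset)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which helps when you want to compare several models against the same test data.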

Becoming familiar with machine learning terms

Before diving too deeply into a discussion of machine learning methods, you need to know about the (sometimes confusing) vocabulary associated with the field. Because machine learning is an offshoot of both traditional statistics and computer science, it has adopted terms from both fields and added a few of its own. Here is what you need to know:

  • Instance: The same as a row (in a data table), an observation (in statistics), and a data point. Machine learning practitioners are also known to call an instance a case.
  • Feature: The same as a column or field (in a data table) and a variable (in statistics). In regression methods, a feature is also called an independent variable (IV).
  • Target variable: The same as a predictand or dependent variable (DV) in statistics.

Remember In machine learning, feature selection is a relatively straightforward process of choosing appropriate variables from those already in the dataset. Feature engineering, by contrast, requires substantial domain expertise and strong data science skills, because you manually design new input variables from the underlying dataset. You use feature engineering in cases where your model needs a better representation of the problem being solved than is available in the raw dataset.
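
The following Python sketch shows how this vocabulary maps onto a small table, and how feature selection differs from feature engineering. The customer churn columns, the values, and the helper names select_features and engineer_features are all hypothetical, chosen only to illustrate the terms.

```python
# Each dict is one instance (row). Every key except "churned" is a feature (column).
customers = [
    {"monthly_fee": 30.0, "months_active": 24, "support_calls": 1, "churned": 0},
    {"monthly_fee": 55.0, "months_active": 3, "support_calls": 6, "churned": 1},
]
TARGET = "churned"   # the target variable the model should predict

def select_features(instance, names):
    """Feature selection: keep only some of the columns that already exist."""
    return {name: instance[name] for name in names}

def engineer_features(instance):
    """Feature engineering: derive a new input variable from raw columns."""
    enriched = dict(instance)
    enriched["calls_per_month"] = instance["support_calls"] / instance["months_active"]
    return enriched

print(select_features(customers[0], ["monthly_fee", "months_active"]))
print(engineer_features(customers[1])["calls_per_month"])
```

Deciding that calls per month matters more than the raw call count is the kind of judgment that takes domain knowledge, which is why feature engineering is the harder of the two tasks.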

Warning Although machine learning is often referred to in the context of data science and artificial intelligence, these terms are all separate and distinct. Machine learning is a practice within data science, but there is more to data science than just machine learning, as you will learn throughout this book. Artificial intelligence often, but not always, involves data science and machine learning. Artificial intelligence is a term that describes autonomously acting agents. In some cases AI agents are robots; in others they are software applications. If the agent’s actions are triggered by outputs from an embedded machine learning model, then the AI is powered by data science and machine learning. On the other hand, if the AI’s actions are governed by a rules-based decision mechanism, then you can have AI that doesn’t actually involve machine learning or data science at all.

Considering Learning Styles

Machine learning can be applied in three main styles: supervised, unsupervised, and reinforcement. Supervised and unsupervised methods are behind most modern machine learning applications, and reinforcement learning is an up-and-coming star.

Learning with supervised algorithms

Supervised learning algorithms require that input data be labeled, meaning that each instance in the training data comes with a known value for the target variable. These algorithms learn from the known features and labels of that data to produce an output model that predicts labels for new incoming, unlabeled data points. You use supervised learning when you have a labeled dataset composed of historical values that are good predictors of future events. Use cases include survival analysis and fraud detection, among others. Logistic regression is a type of supervised learning algorithm, and you can read more on that topic in Chapter 4.

Technicalstuff Survival analysis, also known as event history analysis in social science, is a statistical method that attempts to predict the time of a particular event — such as a mother’s age at first childbirth in the case of demography, or age at first incarceration for criminologists.
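
As a rough illustration of supervised learning, the sketch below trains a one-feature logistic regression on labeled data and then predicts labels for new points. It is a bare-bones gradient descent written for readability, not a production implementation. The data values, the learning rate, and the function names are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, learning_rate=0.1, epochs=1000):
    """Fit w and b so that sigmoid(w * x + b) estimates P(label = 1)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Average gradient of the log loss over the labeled training data.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def predict(w, b, x):
    """Label a new, unlabeled point using the trained model."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

# Labeled training data: one hypothetical, centered feature and a 0/1 label.
xs = [-4.0, -3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 4.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_logistic(xs, ys)
print(predict(w, b, -2.5), predict(w, b, 2.5))
```

The labels in ys are what make this supervised: the algorithm is told the right answer for every training instance and adjusts w and b to reproduce those answers.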

Learning with unsupervised algorithms

Unsupervised learning algorithms accept unlabeled data and attempt to group observations into categories based on underlying similarities in input features, as shown in Figure 3-2. Principal component analysis, k-means clustering, and singular value decomposition are all examples of unsupervised machine learning algorithms. Popular use cases include recommendation engines, facial recognition systems, and customer segmentation.

FIGURE 3-2: Unsupervised machine learning breaks down unlabeled data into subgroups.
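
Here is a small Python sketch of k-means clustering, one of the unsupervised algorithms named above, applied to unlabeled one-dimensional data. The spending figures and the starting centroids are made up for illustration, and real projects would normally rely on a tested library rather than this hand-rolled loop.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Group 1-D values around centroids by alternating assignment and update steps."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

spend = [12, 15, 14, 90, 95, 88]   # unlabeled, hypothetical customer spend
final_centroids, clusters = kmeans_1d(spend, centroids=[12, 90])
print(clusters)
```

Notice that no labels are supplied anywhere. The two subgroups emerge purely from how close the values are to one another.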

Learning with reinforcement

Reinforcement learning is a behavior-based learning model. It’s based on a mechanism similar to how humans and animals learn. The model is given “rewards” based on how it behaves, and it subsequently learns to maximize the sum of its rewards by adapting the decisions it makes.
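
One simple way to see reward-driven learning in action is an epsilon-greedy agent choosing between two actions. The sketch below is a toy under stated assumptions: the reward averages, the noise level, the exploration rate, and the function name run_bandit are all invented for illustration.

```python
import random

def run_bandit(true_means, steps=500, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: usually take the best-known action, sometimes explore."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)   # running average reward per action
    counts = [0] * len(true_means)
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(true_means))                        # explore
        else:
            action = max(range(len(true_means)), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[action], 0.1)   # noisy reward from the environment
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, total_reward

estimates, total = run_bandit(true_means=[0.2, 1.0])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print(best_action)
```

The agent is never told which action is better. It discovers that by trying both, tracking the rewards it receives, and shifting its decisions toward the action that pays more.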

Seeing What You Can Do

Whether you’re just becoming familiar with the algorithms that are involved in machine learning or you’re looking to find out more about what’s happening in cutting-edge machine learning advancements, this section has something for you. First, I give you an overview of machine learning algorithms, broken down by function, and then I describe more about the advanced areas of machine learning that are embodied by deep learning and Apache Spark.

Selecting algorithms based on function

When you need to choose a class of machine learning algorithms, it’s helpful to consider each model class based on its functionality. For the most part, algorithmic functionality falls into the categories shown in Figure 3-3.

  • Regression: You can use this type to model relationships between features in a dataset. You can read more on linear and logistic regression methods and ordinary least squares in Chapter 4.
  • Instance-based: If you want to use observations in your dataset to classify new observations based on similarity, you can use this type. To model with instances, you can use methods like k-nearest neighbor classification, covered in Chapter 5.
  • Regularizing: You can use regularization to introduce added information as a means by which to prevent model overfitting or to solve an ill-posed problem. In case the term is new to you, model overfitting is a situation in which a model is so tightly fit to its underlying dataset, as well as its noise or random error, that the model performs poorly as a predictor for new observations.
  • Naïve Bayes: If you want to predict the likelihood of an event’s occurrence based on some evidence in your data, you can use this method, based on classification and regression. Naïve Bayes is covered in Chapter 4.
  • Decision tree: A tree structure is useful as a decision-support tool. You can use it to build models that predict for potential downstream implications that are associated with any given decision.
  • Clustering: You can use this type of unsupervised machine learning method to uncover subgroups within an unlabeled dataset. Both k-means clustering and hierarchical clustering are covered in Chapter 5.
  • Dimension reduction: If you’re looking for a method to use as a filter to remove redundant information, unexplainable random variation, and outliers from your data, consider dimension reduction techniques such as factor analysis and principal component analysis. These topics are covered in Chapter 4.
  • Neural network: A neural network mimics how the brain solves problems, by using a layer of interconnected neural units as a means by which to learn — and infer rules — from observational data. It’s often used in image recognition and computer vision applications.

    Imagine that you’re deciding whether you should go to the beach. You never go to the beach if it’s raining, and you don’t like going if it’s colder than 75 degrees (Fahrenheit) outside. These are the two inputs for your decision. Your preference to not go to the beach when it’s raining is a lot stronger than your preference to not go to the beach when it’s colder than 75 degrees, so you weight these two inputs accordingly. For any given instance where you decide whether you’re going to the beach, you consider these two criteria, add up the result, and then decide whether to go. If you decide to go, your decision threshold has been satisfied. If you decide not to go, your decision threshold was not satisfied. This is a simplistic analogy for how neural networks work.

    Remember Now, for a more technical definition. The simplest type of neural network is the perceptron. It accepts more than one input, weights them, adds them up on a processor layer, and then, based on the activation function and the threshold you set for it, outputs a result. An activation function is a mathematical function that transforms inputs into an output signal. The processor layer is called a hidden layer. A neural network is a layer of connected perceptrons that all work together as a unit to accept inputs and return outputs that signal whether some criterion is met. A key feature of neural nets is that they’re self-learning: they adapt, learn, and optimize per changes in input data. Figure 3-4 is a schematic layout that depicts how a perceptron is structured.

  • Deep learning method: This method incorporates traditional neural networks in successive layers to offer deep-layer training for generating predictive outputs. I tell you more about this topic in the next section.
  • Ensemble algorithm: You can use ensemble algorithms to combine machine learning approaches to achieve results that are better than would be available from any single machine learning method on its own.
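
The beach decision from the neural network entry above can be sketched as a single perceptron in Python. The weights (0.7 and 0.3) and the threshold (0.6) are assumptions picked so that rain always rules out the trip while cold weather alone does not. A trained network would learn such values from data rather than have them set by hand.

```python
def perceptron(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= threshold else 0   # step activation function

# Inputs are (is_dry, is_warm), each 1 or 0. These numbers are illustrative only.
WEIGHTS = (0.7, 0.3)   # staying dry matters more than warmth
THRESHOLD = 0.6

def go_to_beach(raining, temperature_f):
    is_dry = 0 if raining else 1
    is_warm = 1 if temperature_f >= 75 else 0
    return perceptron((is_dry, is_warm), WEIGHTS, THRESHOLD) == 1

print(go_to_beach(raining=False, temperature_f=80))
print(go_to_beach(raining=True, temperature_f=80))
```

With these values, a dry but chilly day still clears the threshold, whereas any rainy day falls short of it, which mirrors the stronger preference against rain.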

FIGURE 3-3: Machine learning algorithms can be broken down by function.

FIGURE 3-4: Neural networks are connected layers of artificial neural units.

If you use Gmail, you’ve probably seen its autoreply functionality. You know, the three one-line messages from which you can choose an autoreply to a message someone sent you? Well, this autoreply functionality within Gmail is called Smart Reply, and it is built on deep learning algorithms. Another innovation built on deep learning is Facebook DeepFace, the Facebook feature that automatically recognizes and suggests tags for the people who appear in your Facebook photos. Figure 3-5 is a schematic layout that depicts how a deep learning network is structured.

FIGURE 3-5: A deep learning network is a neural network with more than one hidden layer.

Deep learning is a machine learning method that uses hierarchical neural networks to learn from data in an iterative and adaptive manner. It’s an ideal approach for learning patterns from unlabeled and unstructured data. It’s essentially the same concept as the neural network, except that deep learning algorithms have two or more hidden layers. In fact, computer vision applications — like those that support facial recognition for images uploaded to Facebook, or the self-driving cars produced by Tesla — have been known to implement more than 150 hidden layers in a single deep neural network. The more hidden layers there are, the more complex a decision the algorithm can make.
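
To show what "two or more hidden layers" looks like structurally, here is a minimal Python sketch of a forward pass through a tiny network with two hidden layers. The layer sizes and every weight and bias are hand-picked placeholders, not trained values, and the choice of ReLU and sigmoid activations is an assumption.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: each unit weights all inputs, adds a bias, then activates."""
    return [
        activation(sum(x * w for x, w in zip(inputs, unit_weights)) + b)
        for unit_weights, b in zip(weights, biases)
    ]

# Untrained, illustrative parameters for a 2-3-2-1 network (two hidden layers).
HIDDEN_1 = ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1])
HIDDEN_2 = ([[0.6, -0.1, 0.2], [-0.4, 0.3, 0.5]], [0.05, 0.0])
OUTPUT = ([[0.7, -0.5]], [0.1])

def forward(inputs):
    h1 = dense(inputs, *HIDDEN_1, relu)       # first hidden layer
    h2 = dense(h1, *HIDDEN_2, relu)           # second hidden layer
    return dense(h2, *OUTPUT, sigmoid)[0]     # single output between 0 and 1

print(forward([1.0, 2.0]))
```

Each layer feeds its outputs to the next one, so adding depth is a matter of stacking more dense calls. Training, which adjusts the weights, is left out of this sketch.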

Using Spark to generate real-time big data analytics

Apache Spark is an in-memory distributed computing application that you can use to deploy machine learning algorithms on streaming big data sources and generate analytics in near-real-time. Whew!

Technicalstuff In-memory refers to processing data within the computer’s memory, without actually reading and writing its computational results onto the disk. In-memory computing provides its results a lot faster but cannot process much data per processing interval.

Because Spark processes data in microbatches, with 3-second cycle times, you can use it to significantly decrease time-to-insight in cases where time is of the essence. It can be run on data that sits in a wide variety of storage architectures, including Hadoop HDFS, Amazon Redshift, MongoDB, Cassandra, Solr, and AWS. Spark is composed of the following submodules:

  • Spark SQL: You use this module to work with and query structured data using Spark. Within Spark, you can query data using Spark’s built-in SQL package, Spark SQL. You can also query structured data using Hive, but then you’d use the HiveQL language and run the queries using the Spark processing engine.
  • GraphX: The GraphX library is how you store and process network data from within Spark.
  • Streaming: The Streaming module is where the big data processing takes place. This module basically breaks a continuously streaming data source into much smaller data streams, called DStreams (discretized streams). Because the DStreams are small, these batch cycles can be completed within three seconds, which is why it’s called microbatch processing.
  • MLlib: The MLlib submodule is where you analyze data, generate statistics, and deploy machine learning algorithms from within the Spark environment. MLlib has APIs for Java, Scala, Python, and R. The MLlib module allows data professionals to work within Spark to build machine learning models in Python or R, and those models will then pull data directly from the requisite data storage repository, whether that be on-premises, in a cloud, or even a multicloud environment. This helps reduce the reliance that data scientists sometimes have on data engineers. Furthermore, computations can run up to 100 times faster when processed in-memory using Spark as opposed to the traditional MapReduce framework.
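
The microbatch idea described for the Streaming module can be imitated in plain Python. The sketch below is a conceptual stand-in and does not use the Spark API: it cuts batches by item count rather than by a time interval, and the function name microbatches and the sample events are assumptions.

```python
from itertools import islice

def microbatches(stream, batch_size):
    """Yield consecutive small lists (microbatches) from a possibly endless stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:          # the stream is exhausted
            return
        yield batch

events = range(10)             # stand-in for a continuous stream of events
batches = list(microbatches(events, batch_size=3))
per_batch_totals = [sum(batch) for batch in batches]   # one small result per batch
print(batches)
print(per_batch_totals)
```

Because each batch is small, its result is ready almost as soon as the data arrives, which is the intuition behind near-real-time analytics on a stream.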

You can deploy Spark on-premises by downloading the open-source framework from the Apache Spark website. Another option is to run Spark in the cloud via the Databricks service.