Glossary

activation (neuron)

The emission of output from a neuron.

activation function

A function that determines the output of a neuron based on its input.

acyclic graph

A graph that has no cycle.

adjacency matrix

A matrix that represents a graph. It has a row and column for each vertex of the graph. Its contents are 1 in each entry whose row and column correspond to two vertices connected by an edge in the graph; all other entries are 0.

algorithm

1. Go to the first page of the book.

2. Read the current page.

3. If you don’t understand, go to step 2. Otherwise go to step 4.

4. If there is a next page, make it your current page and go to step 2. Otherwise terminate.

approximation

Solving a problem by using an algorithm that may not find the optimal solution, but one that is not far from it.

automatic differentiation

A set of techniques to evaluate the derivative of a function numerically—that is, not analytically, which would entail using the calculus rules for differentiating functions.

backlink

A link that points to the web page we are visiting, and by extension, the web pages that contain links that point to the web page we are visiting.

backpropagation algorithm

A fundamental algorithm for training neural networks. The network corrects its configuration (its weights and biases) by propagating adjustments from the final layer back toward the first layer.

bias

A numerical value attached to a neuron that controls its propensity to fire.

big O

A notation for computational complexity. Given an algorithm and input greater than some threshold, it gives us an upper bound on the expected number of steps required by the algorithm to complete. We want the input to be larger than some threshold because we are interested in the behavior of an algorithm on large data. The big O complexity for an algorithm gives us a guarantee that for large data, the algorithm will not require more than a particular number of steps. For example, a complexity of means that for input of size n that exceeds some threshold, the algorithm will not take more than a constant multiple of steps to complete.

binary search

A search algorithm that works on ordered data. We check the item in the middle of the search space. If it matches the one we are looking for, we are fine. Otherwise, we repeat the procedure to the left or right half, depending on whether we have overshot or undershot our target.

bit

The basic unit of information stored on a computer. A bit can take one of two values, 0 or 1. The word bit comes from binary digit.

bug

An error in a program. The term bug was used by Thomas Edison for a technical fault. In the early days of computing, real bugs would make their way into the machinery, causing them to fail. A moth that did that was found inside the Harvard Mark II computer in 1947. The moth has been preserved in the machine’s logbook, which is part of the collection of the Smithsonian National Museum of American History.

categorical cross-entropy

A loss function that calculates the difference between two probability distributions.

central processing unit (CPU)

The chip that carries out the instructions of a program inside a computer.

chromatic index

In graph coloring, the minimum number of colors required to color the edges of the graph.

Church-Turing thesis

The hypothesis that everything that can be computed by an algorithm can be computed by a Turing machine.

classifier

A program that classifies an observation in one out of a number of possible classes.

complexity (computational complexity)

The time required for an algorithm to run. The time is expressed on the order of elementary computational steps required to complete.

complexity class

A set of problems that require the same amount of a resource (such as time or memory) to be solved.

control structure

The three ways in which steps can be combined in an algorithm or program: sequence, selection, and iteration.

cycle

In graphs, a path that starts and end at the same node.

dangling node

In the PageRank algorithm, a node with only incoming edges and no outgoing edges.

data structure

A way to organize data, such that we can handle the data with a set of specific, prescribed operations.

decision boundary

The values of one or more variables that form the boundary between two different outcomes of a single decision based on the variable or variables.

deep learning

Neural networks that consist of many hidden layers, arranged such that succeeding layers represent deeper concepts, corresponding to higher abstraction levels.

degree (node)

The number of edges adjacent to a node.

densely connected

Layers in a neural network arranged such that all the neurons of a layer are connected to all the neurons of the following layer.

derivative

The slope of a function at a point; equivalently, the rate of change of a function. For example, acceleration is the derivative of speed (the rate of change of speed in time).

Dijkstra’s algorithm

An algorithm invented in 1956 by a young Dutch computer scientist, Edsger Dijkstra, to find the shortest path between two nodes in a graph. It works with graphs that contain positive weights.

directed graph

A graph in which the edges are directed. A directed graph is also called a digraph for short.

divide and conquer

A problem-solving method where we solve a problem by breaking it into smaller problems (typically two) and then do the same on the smaller problems, until the problems get so small that the solution is straightforward to find.

edge coloring

The assignment of colors to the edges of a graph so that no two adjacent edges share the same color.

eigenvector

In linear algebra, an eigenvector is a vector that, when we multiply it by a specific matrix, the result is the same vector multiplied by a number; that number is its eigenvalue. PageRank finds the first eigenvector of the Google matrix—that is, the eigenvector of the Google matrix with the largest eigenvalue, which is equal to one.

epoch

In machine learning, a pass, during training, through the whole training data set.

Euclid’s algorithm

An algorithm for finding the greatest common divisor of two integers, presented in the Elements, a set of 13 books written by the ancient Greek mathematician Euclid (ca. 300 BCE). The Elements treats geometry and number theory, starting from axioms and proving theorems based on the axioms. It is the oldest extant work of mathematics that uses this deductive approach, and as such, one of the most influential books in the history of science.

Eulerian path

A trail through a graph such that each edge is visited exactly once. It is also called a Euleurian walk.

Eulerian tour

A Eulerian path that starts and ends at the same node. It is also called a Eulerian tour.

Euler’s number

The mathematical constant e, approximately equal to 2.71828. It is the limit of as n approaches infinity.

execution path

The series of steps that an algorithm carries out during its execution.

exponential growth

A growth pattern in which a number of things is successively multiplied by itself. For example, we may start with a things, and then we’ll get things, then , and in general . Numbers grow fast with exponential growth.

factorial

The factorial of a natural number n is the product of all numbers from 1 up to and including n. We use the symbol n! so we have . The definition can be extended to all real numbers, but that does not concern us here.

factorial complexity

Computational complexity that follows factorial growth. In big O notation, it is .

fire (neuron)

See activation (neuron).

fitting

In machine learning, the process of learning from the data. In this process we construct a model that fits the observations.

garbage in, garbage out

If we feed a program garbage, instead of its expected input, we should expect no miracles: the program will produce garbage instead of its expected output.

global optimum

The best overal solution to a problem.

Google matrix

A special kind of matrix (a modification of the hyperlink matrix) that is used in the power method in the PageRank algorithm.

gradient

A vector containing all the partial derivatives of a function.

graph

A set of nodes, also called vertices, and edges, also called links, connecting them. Graphs can be used to model any kind of linked structure, from people to computer networks. As a result, many problems can be modeled as graphs, and many algorithms have been developed that work on top of them.

graph coloring

The edge or vertex coloring of a graph.

graphics processing unit (GPU)

A chip specially designed to handle the instructions for the creation and manipulation of images inside a computer.

greatest common divisor (gcd)

Given two integers, the largest integer that divides both.

greedy algorithm

An algorithm in which when we have to choose between alternative courses of action, we choose the one that gives us the greatest immediate payoff. This does not necessarily lead to the optimum outcome in the end.

hardware

The physical components that make up a computer or digital device. The term complements software.

head

The first item in a list.

heuristic

A strategy for making choices among alternatives in an algorithm. A greedy heuristic would require us to take the option that looks best right now (never mind what could happen in the future).

hidden layer

A neural network layer that is not directly connected to the input or output of the network.

Hierholzer algorithm

An algorithm for finding Eulerian circuits on graphs. It was published by the German mathematician Carl Hierholzer in 1873.

hill climbing

A metaphor for describing problem solving. The solution is at the top of the hill, and we have to climb from its foot. At each step there may be a decision to take among alternative paths. Depending on our choices, we may select the best path overall, a path that is not the best but still takes us to the top, or alas a path that leads to a plateau. If the worst happens and we reach a plateau, we’ll have to go back to a previous position to start moving along a different path.

hyperlink

A reference from a text to another part of the text or a different text. On the web, hyperlinks are links between web pages that the user may follow while browsing.

hyperlink matrix

A matrix representing the structure of a graph; it is like an adjacency matrix, but we divide the elements of its row by the number of nonzero elements in the row.

hyperplane

The generalization of the plane in more than three dimensions.

hypertext

Text that contains hyperlinks.

image recognition

The computational task of recognizing patterns in images.

insertion sort

A sorting method where we take each item and insert it into its correct position among the already sorted items.

internet

A global network of computers and digital devices, interconnected by means of a common suite of communication protocols. Initially, it was with its first letter capitalized (Internet) because internet could refer to any network that extended beyond the internal confines of an institution, which is called an intranet. As the global internet took off, however, the initial capital fell out of favor, probably saving a significant amount of ink.

intractable problem

A problem for which the best algorithms we know will take an inordinate amount of time to handle anything but trivial cases.

iteration

See loop.

key

A part of a record that we use for sorting or finding it. A key may be atomic, when it cannot be decomposed into smaller parts (for instance, an identification number), or composite, when it consists of smaller pieces of data (like the full name comprising first name, middle name, and surname).

label

In machine learning, a value representing the category to which an observation belongs. In training, the computer is given problems along with their solutions; when the problem is classification, the solutions are the labels representing the classes.

linear search

A search algorithm in which we examine each item in turn until we find the one we are looking for. It is also called a sequential search.

linear time

Time proportional to the input of an algorithm, written as .

linearly separable

A data set whose observations can be separated into two categories by a straight line in two dimensions, plane in three dimensions, or hyperplane in more dimensions.

list

A data structure that contains items. Each item points to the next one, apart from the last item, which points nowhere, or to null, as we say. The items are therefore linked to each other, and such a list is also called a linked list.

local optimum

A solution that is better than all the other neighboring solutions, but not the overall best. A neighboring solution is a solution in which we can get with a single move from the solution we are now.

logarithm

The inverse of raising to a power. The logarithm is the answer to the question, “To which power should I raise a number to get the value I want?” If we ask, “To which power should I raise 10 to get 1,000?,” the answer is 3 because . The number we will raise to the power is called the base of the logarithm. We write if . For a = 2 we write lgx.

logarithmic time

Time proportional to the logarithm of the input of an algorithm—for example, . Good searching algorithms take logarithmic time.

loglinear time

Time proportional to the product of the size of the input and logarithm of the input of an algorithm—for example, . Good sorting algorithms take loglinear time.

loop

A sequence of instructions in a computer program that is repeated. A loop ends when a condition is fulfilled. A loop that does not end is an infinite loop and is usually a bug because it may lead to a program that fails to terminate. See iteration.

loss

The difference between the actual and desired output of a machine learning algorithm. It is typically calculated by a loss function.

machine learning

The use of algorithms that solve problems by learning automatically from examples.

matrix

A rectangular array, typically of numbers or more generally mathematical expressions. The contents of a matrix are arranged horizontally in rows and vertically in columns.

Matthew effect

The phenomenon of the rich getting richer and poorer getting poorer. Named after the Gospel of Matthew (25:29), it has been found to apply to many contexts, not just material wealth.

minimization problem

A problem in which, among the possible solutions, we try to find the one with the minimum value.

merge sort

A sorting method that works by repeatedly merging larger and larger sets of sorted items.

Moore’s law

The observation, made in 1965 by Gordon Moore, founder of Fairchild Semiconductor and Intel, that the number of transistors in an integrated circuit doubles about every two years. It is an example of exponential growth.

move to front

A self-organizing search algorithm. When we find the item we are looking for, we move it to the first position.

multigraph

A graph in which an edge can occur more than once.

multiset

A set in which an element can appear multiple times; in mathematics, in a normal set an element cannot appear more than once.

node

An item in various data structures. Items in lists are called nodes.

neuron

A neuron is a cell that forms the basic building block of the nervous system. It receives signals from other neurons and propagates them to other neurons in the nervous system.

null

Nothingness in a computer.

online algorithm

An algorithm that does not require the full input to a problem in order to produce a solution. An online algorithm gets the input incrementally, as this arrives, and at each point produces a solution that takes account of the input it has received so far.

onset

The accented part of a rhythm.

optimal stopping problem

The problem of knowing the best time to stop when you are trying to maximize a reward or minimize a penalty.

optimizers

Algorithms that optimize the value of a function. In machine learning, optimizers typically minimize the value of the loss function.

overfitting

The equivalent of learning by rote in machine learning. The model that we are trying to train follows the training data so closely that it fits them too well. As a result, it does not predict correct values for other, unknown data.

overflow

Going beyond the range of allowable values on a computer.

PageRank

An algorithm used to rank web pages in terms of their importance. It was developed by the founders of Google and was the foundation of the Google search engine. The rank of a web page is its pagerank.

pagerank vector

A vector containing the pageranks of a graph.

partial derivative

In a function of many variables, the derivative of the function with respect to one variable, holding all other variables constant.

path

In a graph, a sequence of edges that connect a sequence of nodes.

path length

The sum of the weights along a path in a graph. If a graph does not have weights, it is the number of the links constituting the path.

Perceptron

An artificial neuron that uses the step function for its activation.

permutation

A rearrangement of some data in a different order.

pointer

A place in computer memory that holds the address of another place in computer memory. In this way, the former points to the latter.

polynomial time

Time proportional to the input to an algorithm raised to a constant power, such as .

power method

An algorithm that starts with a vector, multiplies it by a matrix, and then repeatedly multiplies the result by the matrix until it converges into a stable value. The power method is at the heart of PageRank; the vector at which it converges is the first eigenvector of the Google matrix.

program

A set of instructions, written in a programming language, that describes a computational process.

programming

The art of writing computer programs.

programming language

An artificial language that can be used to describe computational steps. A programming language can be executed on a computer. Like a human language, a programming language has syntax and grammar, specifying what can be written in it. Several programming languages exist, and new programming languages are developed all the time in an effort to make programming more productive (or because many people cannot resist creating their own language and hope it will be widely adopted). A programming language can be high level, when it looks somewhat akin to a human language, or low level, when its constructs are rudimentary, mirroring the underlying hardware.

punched card

A piece of stiff paper that records information by the location of the punched holes on it. It is also called a punch card. The cards were used in early computers, and before that, in machines such as Jacquard looms, in which they described the pattern to be woven.

quantum computer

A computer that leverages quantum phenomena to perform computations. Quantum computers work with qubits instead of bits. Some problems can be solved much faster on quantum computers than on classical ones. The manufacture of quantum computers presents difficult physical challenges.

qubit

The basic unit of quantum information. A qubit can exist in a superposition of two states, 0 and 1, until we measure it, when it collapses to one of the two binary values. A qubit can be implemented using quantum properties, such as the spin of an electron.

quicksort

A sorting method that works by repeatedly selecting an item and moving the other items around it so that all smaller items are on the one side and all the rest on its other side.

radix sort

A sorting method that works by breaking the keys into their parts (for example, digits for numerical keys) and placing the items into piles corresponding to the values of their parts (ten piles, one for each digit). We start by forming piles based on the last digit, then we stack all piles and redistribute to piles based on the one but last digit, and so on. When we do the procedure for the first digit, we end up with a sorted pile. It is a string sorting method because we treat numerical keys as a string of digits.

random surfer

A person who surfs the web by going from page to page, choosing the next page according to the probability given by the Google matrix.

randomization

The use of randomness in algorithms. In this way, an algorithm may be able to find good solutions to a problem in most cases, even if it would be computationally infeasible to find the optimal solution.

record

A set of related data describing an entity for a particular application. For example, a student record can include identification data, enrollment year, and transcripts.

rectifier

An activation function that turns all negative inputs to zero, or otherwise its output is directly proportional to its input.

relaxation

A method in graph algorithms, where we assign the worst possible value to the values we want to find, and the algorithm proceeds by producing better and better estimates for these values. We therefore start with the most extreme values possible, and gradually relax them with values that are closer and closer to the final result.

ReLU

A neuron that uses a rectifier as its activation function. ReLU stands for rectified linear unit.

search space

The domain of values in which we search.

secretary problem

An optimal stopping problem. From a pool of candidates, we examine each one in turn. We must make the decision to hire or not on the spot, without being able to reverse past decisions, and without having examined the remaining candidates.

selection

In algorithms and programming, a choice, based on some logical condition, between alternative series of steps to be executed.

selection sort

A sorting method where each time we find the minimum of the unsorted items and put it into its correct position.

self-organizing search

Search algorithms that take advantage of the popularity of search items by moving them to positions where we’ll be able to find them faster.

sequence

In algorithms and programming, a series of steps executed one after the other.

shortest path

The shortest path between two nodes in graph.

sigmoid

An S-shaped function whose values range from 0 to 1.

social network

A graph in which nodes are people, and the edges are the relationships between them.

softmax

An activation function that takes as input a vector of real numbers and turns it into another vector that is a probability distribution.

software

The set of programs running on a computer or digital device; the term complements hardware. The terms have been used before computers in a different setting. In 1850, rubbish-tip pickers were using the terms “soft-ware” and “hard-ware” to distinguish between material that would decompose and everything else. These meanings may bring solace to anybody struggling with a computer that won’t do what it is supposed to do.

spallation

Breaking a material into smaller pieces. In nuclear physics, the material is a heavy nucleus that emits a large number of protons and neutrons after being bombarded with a high-energy particle.

sparse matrix

A matrix in which most elements are equal to zero.

string

A sequence of symbols. Traditionally a string was a sequence of characters, but nowadays what can go into a string depends on the actual application; it may be digits, alphabetic characters, punctuation, or even more recently invented symbols such as emojis.

string sorting method

A sorting method that treats its keys as a sequence of symbols. For example, the key 1234 is treated as the string of symbols 1, 2, 3, 4 instead of the number 1,234.

supervised learning

A machine learning approach in which we provide an algorithm with input problems accompanied by their solutions.

synapse

A connection between neurons.

tabulating machine

Electromechanical devices that could read punched cards and use the information on them to produce a tally.

tanh (hyperbolic tangent)

An activation function that looks like the sigmoid function, but its output ranges from to 1.

test data set

Data that we set aside during training so that we can use them to check how well a particular machine learning approach will perform with real-world data.

tour

A path that starts and ends at the same node in a graph. It is also called a circuit.

training

In machine learning, the process of providing an algorithm with example inputs so that it can learn to produce correct outputs.

training data set

Data that we use with machine learning algorithms to train them to solve problems.

transposition method

A self-organizing search algorithm. When we find an element, we swap it with the one preceding it. In this way, popular items are moving to the front.

traveling salesman problem

Also known as the traveling salesperson problem, but people did not put much thought into gender definitions. The problem that asks us, If we have a list of cities and the distances between each pair of them, what is the shortest possible route that one should take to visit each city once and return to the origin city? It is probably the most famous intractable problem.

Turing machine

An idealized (abstact) machine, described by Alan Turing, consisting of an infinite tape and movable head that reads and writes symbols on the tape following a set of prescribed rules. The Turing machine can implement any algorithm and therefore can be used as a model of what can be computed.

unary numeral system

The number system using a single symbol for representing numbers; for instance, a stroke representing a unit, so that III represents three.

undirected graph

A graph in which the edges are undirected.

unsupervised learning

A machine learning approach in which we provide an algorithm input problems without their solutions. The machine learning algorithm then must derive what the expected input should be in order to be able to produce it.

vector

A horizontal row or vertical column of numbers (or more generally, mathematical expressions). Usually we meet vectors in geometry, where it is a geometric entity with a length and direction, represented as a row or column containing their numerical coordinates; however, the notion of a vector is more general than that—take, for example, the pagerank vector. A vector is a special case of a matrix.

vertex coloring

The assignment of colors to the vertices of a graph so that no two adjacent vertices share the same color.

weight (graph)

A number attached to an edge of a graph. Such a number may, for example, model a reward or penalty associated with the link between the nodes connected by the edge.

weight (neuron)

A numerical value attached to a synapse in a neuron. From each synapse, the neuron receives an input multiplied by the weight of the synapse.

weighted input (neuron)

The sum of the products of the inputs with the weights of a neuron.