Glossary

algorithmic complexity
The complexity of an algorithm is generally measured by the time it takes to run or by how much space (memory or disk space) it needs to run.
annotation
In an NLP context, an annotation is a marking on a segment of text or audio with some extra information. Generally, an annotation will require character indices for the start and end of the annotated segment, as well as an annotation type.
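As a minimal sketch of the fields described above, an annotation can be modeled as a small record. The class name `Annotation`, its field names, and the choice of an exclusive end index are assumptions made for this illustration; real libraries differ on these details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A hypothetical annotation record (not any specific library's class)."""
    start: int            # index of the first character of the segment
    end: int              # index one past the last character (assumed exclusive)
    annotation_type: str  # e.g. "token", "sentence", "pos"


text = "Spark NLP extends Spark MLlib."
token = Annotation(start=0, end=5, annotation_type="token")

# The indices let us recover the annotated segment from the original text.
covered = text[token.start:token.end]
```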
annotator
An annotator is a function that takes text and produces annotations. It is not uncommon for some annotators to have a dependency on another type of annotator.
Apache Hadoop
Hadoop is an open source implementation of the MapReduce paper. Initially, Hadoop required that the map, reduce, and any custom format readers be implemented and deployed to the cluster. Eventually, higher level abstractions were developed, like Apache Hive and Apache Pig.
Apache Parquet
Parquet is a data format originally created for Hadoop. It allows for efficient compression of columnar data. It is a popular format in the Spark ecosystem.
Apache Spark
Spark is a distributed computing framework with a high-level interface and in-memory processing. Spark was developed in Scala, but there are now APIs for Java, Python, R, and SQL.
application
An application is a program with an end user. Many applications have a graphical user interface (GUI), though this is not necessary. In this book, we also consider programs that do batch data processing as “applications”.
array
An array is a data structure where elements are associated with an index. Arrays are implemented differently in different programming languages. NumPy arrays (`ndarray` objects) are the most popular kind of array among Python users, especially data scientists.
autoencoder
An autoencoder is a neural-network–based technique used to convert some input data into vectors, matrices, or tensors. This new representation is generally of a lower dimension than the input data.
Bidirectional Encoder Representations from Transformers (BERT)
BERT from Google is a technique for converting words into a vector representation. Unlike Word2vec, which disregards context, BERT uses the context a word is found in to produce the vector.
classification
In a machine learning context, classification is the task of assigning classes to examples. The simplest form is the binary classification task where each example can have one of two classes. The binary classification task is a special case of the multiclass classification task where each example can have one of a fixed set of classes. There is also the multilabel classification task where each example can have zero or more labels from a fixed set of labels.
clustering
In the machine learning context, clustering is the task of grouping examples into related groups. This is generally an unsupervised task, that is, the algorithm does not use preexisting labels, though there do exist some supervised clustering algorithms.
container
In software there are two common senses of “container.” In this book, the term is primarily used to refer to a virtual environment that contains a program or programs. The term “container” is also sometimes used to refer to an abstract data type or data structure that contains a collection of elements.
context
In an NLP context, “context” generally refers to the language data surrounding a segment of text or audio. In linguistics, it can also refer to the “real world” context in which a language act occurs.
CSV
A CSV (Comma Separated Values) file is a common way to store structured data. Elements are separated by commas, and rows are separated by new lines. Another common separator is the tab character. Files that use the tab are called TSVs. It is not uncommon for files that use a separator other than a comma to still be called CSVs.
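The effect of the separator can be sketched with Python's standard `csv` module. The sample data and the helper name `read_rows` are made up for illustration.

```python
import csv
import io

# The same small table, once comma-separated and once tab-separated.
csv_text = "word,count\nspark,3\nnlp,5\n"
tsv_text = "word\tcount\nspark\t3\nnlp\t5\n"


def read_rows(text, delimiter=","):
    """Parse delimited text into a list of rows (each row is a list of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


csv_rows = read_rows(csv_text)
tsv_rows = read_rows(tsv_text, delimiter="\t")
```

Both calls produce the same rows; only the delimiter passed to the reader changes. Note that every value is read as a string, so numeric columns need to be converted separately.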
data scientist
A data scientist is someone who uses scientific techniques to analyze data or build applications that consume data.
DataFrame
A DataFrame is a data structure that is used to manipulate tabular data.
decision tree
In a machine learning context, a decision tree is a data structure that is built for classification or regression tasks. Each node in the tree splits on a particular feature.
deep learning
Deep learning is a collection of neural-network techniques that generally use multiple layers.
dialect
In a linguistics context, a dialect is a particular variety of a language associated with a specific group of people.
differentiate
In a mathematics context, to differentiate is to find the derivative of a function. The derivative function is a function that maps from the domain to the instantaneous rate of change of the original function.
discourse
In a linguistics context, a discourse is a sequence of language acts, especially between two or more people.
distributed computing
Distributed computing is using multiple computers to perform parallelized computation.
distributional semantics
In an NLP context, this refers to techniques that attempt to represent words in a numerical form, almost always a vector, based on the words’ distribution in a corpus. This name originally comes from linguistics where it refers to theories that attempt to use the distribution of words in data to understand the words’ semantics.
Docker
Docker is software that allows users to create containers (virtual environments) with Docker scripts.
document
In an NLP context, a document is a complete piece of text, especially one that contains multiple sentences.
embedding
In an NLP context, an embedding is a technique of representing words (or other language elements) as a vector, especially when such a representation is produced by a neural network.
encoding
In an NLP context, the encoding or character encoding refers to the mapping from characters, e.g. “a”, “?”, to bytes.
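A short sketch of how the same characters map to different bytes under different encodings, using Python's built-in `str.encode` and `bytes.decode`. The sample string is arbitrary.

```python
text = "naïve"

# The same five characters become different byte sequences
# depending on the encoding used.
utf8_bytes = text.encode("utf-8")      # "ï" takes two bytes in UTF-8
latin1_bytes = text.encode("latin-1")  # "ï" takes one byte in Latin-1

# Decoding with the encoding that produced the bytes recovers the text.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with the wrong encoding silently produces garbled text.
garbled = utf8_bytes.decode("latin-1")
```

This kind of mismatch is a common source of garbled characters in text data, which is why the encoding of an input file should be known rather than guessed.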
estimator
In a Spark MLlib context, an estimator is a stage of a pipeline that uses data to produce a model that transforms the data.
evaluator
In a Spark MLlib context, an evaluator is a stage of a pipeline that produces metrics from predictions.
feature
In a machine learning context, a feature is an attribute of an input, especially a numerical attribute. For example, if the input is a document, the number of unique tokens in the document is a feature. The words present in a document are also referred to as features.
function
In a programming context, a function is a sequence of instructions. In a mathematics context, a function is a mapping between two sets, the domain and the range, such that each element of the domain is mapped to a single element in the range.
GloVe
GloVe is a distributional semantics technique for representing words as vectors using word-to-word co-occurrences.
graph
In a computer science or mathematics context, a graph is a set of nodes and edges that connect the nodes.
guidelines
In a human labeling context, guidelines are the instructions given to the human labelers.
hidden Markov model
A hidden Markov model is a technique for modeling sequences using a hidden state, in which the hidden state at each step depends only on the preceding hidden state.
hyperparameter
In a machine learning context, a hyperparameter is a setting of a learning algorithm. For example, in a neural network, the weights are parameters, but the number and size of the layers are hyperparameters.
index
In an information retrieval context, an index is a mapping from documents to the words contained in the documents.
interlabeler agreement
In a human labeling context, interlabeler agreement is a measure of how much labelers agree (generally unknowingly) when labeling the same example.
inverted index
In an information retrieval context, an inverted index is a mapping from words to the documents that contain the words.
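The relationship between the two mappings can be sketched with plain dictionaries. The toy documents and whitespace tokenization are assumptions made for illustration; production search systems use far more elaborate structures.

```python
from collections import defaultdict

# Toy corpus (made up for this example), tokenized by whitespace.
documents = {
    "doc1": "spark is fast",
    "doc2": "spark nlp extends spark",
    "doc3": "nlp is fun",
}

# Index: document -> the words it contains.
index = {doc_id: set(text.split()) for doc_id, text in documents.items()}

# Inverted index: word -> the documents that contain it.
inverted_index = defaultdict(set)
for doc_id, words in index.items():
    for word in words:
        inverted_index[word].add(doc_id)
```

Looking up `inverted_index["spark"]` returns the documents containing that word without scanning every document, which is what makes the inverted form useful for search.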
Java
Java is an object-oriented programming language. Java is almost always compiled to run on the Java Virtual Machine (JVM). Scala and a number of other popular languages run on the JVM and so are interoperable with Java.
Java Virtual Machine (JVM)
The JVM is a virtual machine that runs programs that have been compiled into Java bytecode. As the name suggests, Java is the primary language which uses the JVM, but Scala and a number of other programming languages use it as well.
JSON
JavaScript Object Notation (JSON) is a data format.
K-Means
K-Means is a technique for clustering. It works by randomly placing K points, called centroids, and iteratively moving them to minimize the squared distance of elements of a cluster to their centroid.
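The iterative loop can be sketched for one-dimensional data as follows. This is a simplified illustration, not a reference implementation: the function name `k_means_1d` is hypothetical, the starting centroids are passed in explicitly rather than chosen at random (so the run is reproducible), and the loop runs for a fixed number of iterations instead of checking for convergence.

```python
def k_means_1d(points, centroids, iterations=10):
    """Simplified K-Means on 1-D points with caller-supplied starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (nearest by squared distance).
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p - centroids[i]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
```

Moving each centroid to the mean of its assigned points is what reduces the squared distance within each cluster at every iteration.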
knowledge base
A knowledge base is a collection of knowledge or facts in a computationally usable format.
labeling
In a machine learning context, labeling is the process of assigning labels to examples, especially when done by humans.
language model
In an NLP context, a language model is a model of the probability distribution of word sequences.
latent Dirichlet allocation (LDA)
LDA is a technique for topic modeling that treats documents as a sequence of words selected from weighted topics (probability distributions over words).
latent semantic indexing (LSI)
LSI is a technique for topic modeling that performs singular value decomposition on the term-document matrix.
linear algebra
Linear algebra is the branch of mathematics focused on linear equations. In a programming context, linear algebra generally refers to the mathematics that describe vectors, matrices, and their associated operations.
linear regression
Linear regression is a statistical technique for modeling the relationship between a single variable and one or more other variables. In a machine learning context, linear regression refers to a regression model based on this statistical technique.
linguist
A linguist is a person who studies human languages.
linguistic typology
Linguistic typology is a field of linguistics that groups languages by their traits.
logging
In a software context, logging is information output by an application for use in monitoring and debugging the application.
logistic regression
Logistic regression is a statistical technique for modeling the probability of an event. In a machine learning context, logistic regression refers to a classification model based on this statistical technique.
long short-term memory (LSTM)
LSTM is a neural-network technique that is used for learning sequences. It attempts to learn when to use and update the context.
loss
In a machine learning context, loss refers to a measure of how wrong a supervised model is.
machine learning
Machine learning is a field of computer science and mathematics that focuses on algorithms for building and using models “learned” from data.
MapReduce
MapReduce is a style of programming based on functional programming that was the basis of Hadoop.
matrix
A matrix is a rectangular array of numeric values. The mathematical definition is much more abstract.
metric
In a machine learning context, a metric is a measure of how good or bad a particular model is at its task. In a software context, a metric is a measure defined for an application, program, or function.
model
In a general scientific context, a model is some formal description, especially a mathematical one, of a phenomenon or system. In the machine learning context, a model is a set of hyperparameters, a set of learned parameters, and an evaluation or prediction function, especially one learned from data. In Spark MLlib, a model is what is produced by an Estimator when fitted to data.
model publishing
Once a machine learning model has been learned, it must be published to be used by other applications.
model training
Model training is the process of fitting a model to data.
monitoring
In a software context, monitoring is the process of recording and publishing information about a running application.
morphology
Morphology is a branch of linguistics focused on the structure and parts of words (more precisely, morphemes).
N-gram
An N-gram is a subsequence of words. Sometimes, “N-gram” can refer to a subsequence of characters.
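A minimal sketch of extracting N-grams from either a list of words or a string of characters. The helper name `ngrams` is chosen for this illustration, and N-grams are taken here to be contiguous subsequences.

```python
def ngrams(sequence, n):
    """Return all contiguous subsequences of length n, as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


# Word bigrams (N = 2) over a whitespace-tokenized sentence.
words = "the red dog wags his tail".split()
bigrams = ngrams(words, 2)

# Character trigrams (N = 3) over a string.
char_trigrams = ngrams("dog!", 3)
```

The same function serves both senses of the term because it only relies on slicing, which works for lists of words and for strings alike.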
naïve Bayes
Naïve Bayes is a classification technique built on the naïve assumption that the features are all independent of each other.
named-entity recognition (NER)
NER is a task in NLP that focuses on finding particular entities in text.
natural language
Natural language is a language spoken or signed by people, in contrast to a programming language which is used for giving instruction to computers. Natural language also contrasts with artificial or constructed languages, which are designed by a person or group of people.
natural language processing (NLP)
NLP is a field of computer science and linguistics focused on techniques and algorithms for processing data containing natural language.
neural network
An artificial neural network is a collection of neurons connected by weights.
notebook
In this book, a notebook refers to a programming and writing environment, for example Jupyter Notebook and Databricks notebooks.
numpy
NumPy is a Python library for performing linear algebra operations and an assortment of other mathematical operations.
object
In an object-oriented programming context, an object is an instance of a class or type.
optical character recognition (OCR)
OCR is the set of techniques used to identify characters in an image.
overfitting
In machine learning, our data has biases as well as useful information for our task. The more exactly our machine learning model fits the data, the more it reflects these biases. Overfitting occurs when the model fits the training data so closely that its predictions are based on spurious relationships that incidentally occur in that data.
pandas
pandas is a Python library for data analysis and processing that uses DataFrames.
parallelism
In computer science, parallelism is how much an algorithm is or can be distributed across multiple threads, processes, or machines.
parameter
In a mathematics context, a parameter is a value in a mathematical model. In a programming context, a parameter is another name for an argument of a function. In a machine learning context, a parameter is a value learned in the training process using the training data.
partition
In Spark, a partition is a subset of the distributed data that is collocated on a machine.
parts of speech (POS)
POS are word categories. The most well known are nouns and verbs. In an NLP context, the Penn Treebank tags are the most frequently used set of parts of speech.
PDF
Portable document format (PDF) is a common file format for formatted text. It is a common input to NLP applications.
phonetics
Phonetics is the branch of linguistics focused on the study of speech sounds.
phrase
In linguistics, a phrase is a sequence of words that makes up a constituent. For example, in the sentence “The red dog wags his tail,” “the red dog” is a noun phrase, but “the red” is not.
pickle
The pickle module is part of the Python standard library used for serializing data.
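A short round-trip sketch using `pickle.dumps` and `pickle.loads`. The dictionary being serialized is made up for illustration.

```python
import pickle

# An arbitrary Python object (hypothetical settings, for illustration only).
model_settings = {"name": "example", "layers": [64, 32], "trained": True}

# Serialize the object to bytes, then rebuild an equal object from them.
payload = pickle.dumps(model_settings)
restored = pickle.loads(payload)
```

The restored object is equal to the original but is a separate copy. Unpickling can execute arbitrary code, so only data from a trusted source should be unpickled.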
pipeline
In data processing, a pipeline is a sequence of processing steps combined into a single object. In Spark MLlib, a pipeline is a sequence of stages. A Pipeline is an estimator containing transformers, estimators, and evaluators. When it is trained, it produces a PipelineModel containing transformers, models, and evaluators.
pragmatics
Pragmatics is the branch of linguistics focused on understanding meaning in context.
process
In a computing context, a process is a running program.
product owner
In software development, the product owner is the person or people who represent the customer in the development process. They also own the requirements and prioritize development tasks.
production
Production is the environment an application is deployed into.
profiling
In an application context, profiling is the process of measuring the resources an application or program requires to run.
program
A program is a set of instructions given to a computer.
programming language
A programming language is a formal language for writing high-level (human readable) instructions for a computer.
Python
Python is a programming language that is popular among NLP developers and data scientists. It is a multi-paradigm language, allowing object-oriented, functional, and imperative programming.
random forest
Random forest is a machine learning technique for training an ensemble of decision trees. The training data for each decision tree is a subset of the rows and features of the total data.
recurrent neural network (RNN)
An RNN is a special kind of neural network used for modeling sequential data.
register
In linguistics, a register is a variation of language that is defined by the context in which it is used. This contrasts with a dialect, which is defined by the group of people who speak it.
regression
In a machine learning context, regression is the task of assigning a scalar value to each example.
regular expression
A regular expression is a string that defines a pattern to be matched in text.
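As a small illustration using Python's standard `re` module, the pattern below matches dates written as four digits, two digits, and two digits separated by hyphens. The sample text is made up, and the pattern checks only the shape of the string, not whether the date is valid.

```python
import re

# \d{4} matches exactly four digits; \b marks a word boundary so the
# pattern does not match inside a longer run of digits.
pattern = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

text = "Logged on 2020-06-25 and again on 2021-01-05."
dates = pattern.findall(text)
```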
repository
In a software context, a repository is a data store that contains the code and/or data for a project.
resilient distributed dataset (RDD)
In Spark, an RDD is a distributed collection. In early versions of Spark, they were the fundamental elements of Spark programming.
scale out
In computing, scaling out is when more machines are used to increase the available resources.
scale up
In computing, scaling up is when a machine with more resources is used to increase available resources.
schema
In data engineering, a schema is the structure of a dataset along with some metadata (e.g., column names and types). In Spark, the schema is the metadata that defines a Spark DataFrame.
script
In programming, a script is a computer program that is generally written in a single runnable code file (also called a script).
Scrum
Scrum is a style of agile software development. It is built around the idea of iterative development and short daily meetings (called scrums) where progress or problems are shared.
search
In computing, search is a task in information retrieval concerned with finding documents that are relevant to a query.
semantics
Semantics is a branch of linguistics focused on the meaning communicated by language.
sentence
In linguistics, a sentence is a special kind of phrase, especially a clausal phrase, that is considered complete.
sentiment
In an NLP context, sentiment is the emotion or opinion a human encodes in a language act.
serialization
In computing, serialization is the process of converting objects or other programming elements into a format for storage.
software developer
A software developer is someone who writes software, especially using software engineering.
software development
Software development is the process of making an application (or an update to an application) available in the production environment.
software engineering
Software engineering is the discipline and best practices used in developing software.
software library
A software library is a piece of software that is not necessarily an application. Applications are generally built by combining libraries. Some software libraries also contain applications.
software test
A software test is a program, function, or set of human instructions used to test or verify the behavior of a piece of software.
Spark NLP
Spark NLP is an NLP annotation library that extends Spark MLlib.
stakeholder
In software development, a stakeholder is a person who has a vested interest in the software being developed. For example, customers and users are stakeholders.
stop word
In an NLP context, a stop word is a word or token that is considered to have negligible value for the given task.
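A minimal sketch of stop word removal. The stop word list here is a tiny, hypothetical one; in practice the list depends on the task and the language.

```python
# A deliberately small, made-up stop word list for illustration.
stop_words = {"the", "a", "is", "of"}

tokens = "the meaning of a word is contextual".split()

# Keep only the tokens that are not in the stop word list.
content_tokens = [t for t in tokens if t not in stop_words]
```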
structured query language (SQL)
SQL is a programming language used to interact with relational data.
syntax
In a linguistics context, syntax is a branch of linguistics focused on the structure of phrases and sentences. It is also used to refer to the rules used by a language for constructing phrases and sentences.
tag
In an NLP context, a tag is a kind of annotation where a subsequence, especially a token, is marked with a label from a fixed set of labels. For example, annotators that identify the POS of tokens are often called POS taggers.
TensorFlow
TensorFlow is a data processing and mathematics library. It was popularized for its implementation of neural networks.
TF.IDF
TF.IDF is a technique developed in information retrieval. TF refers to the term frequency of a given term in a given document, and IDF refers to the inverse of the document frequency of the given term. TF.IDF is the product of the term frequency and the inverse document frequency, which is supposed to represent the relevance of the given document to the given term.
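The product described above can be sketched as follows. There are many TF.IDF variants; this sketch assumes one common choice, a raw count for TF and the logarithm of (number of documents / document frequency) for IDF. The toy corpus and function names are made up for illustration.

```python
import math

# Toy corpus of already-tokenized documents.
documents = {
    "doc1": ["spark", "is", "fast"],
    "doc2": ["spark", "nlp", "extends", "spark"],
    "doc3": ["nlp", "is", "fun"],
}


def tf(term, tokens):
    """Term frequency: how many times the term occurs in one document."""
    return tokens.count(term)


def idf(term, docs):
    """Inverse document frequency, using the log-scaled variant (an assumption)."""
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / df) if df else 0.0


def tf_idf(term, doc_id, docs):
    """Product of term frequency and inverse document frequency."""
    return tf(term, docs[doc_id]) * idf(term, docs)


spark_doc2 = tf_idf("spark", "doc2", documents)
spark_doc3 = tf_idf("spark", "doc3", documents)
fun_doc3 = tf_idf("fun", "doc3", documents)
```

A term that appears often in a document but in few documents overall scores highest, while a term absent from a document scores zero for that document.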
thread
In computing, a thread is a subsequence of instructions in a program that may be executed in parallel.
token
In an NLP context, a token is a unit of text, generally—but not necessarily—a word.
topic
In an NLP context, a topic is a kind of cluster of meaning (or a quantified representation).
Transformer
In a Spark MLlib context, a Transformer is a stage of a pipeline that does not need to be fit or trained on data.
Unicode
Unicode is a standard for encoding characters.
vector
In a mathematics context, a vector is an element of a Cartesian space with more than one dimension.
virtual machine
A virtual machine is a software representation of a computer.
word
In linguistics, a word is loosely defined as an unbound morpheme, that is, a unit of language that can be used alone and still have meaning.
word vector
In distributional semantics, a word is represented as a vector. The mapping from word to vector is learned from data.
Word2vec
Word2vec is a distributional semantics technique that learns word representations by building a neural network.
XML
Extensible Markup Language is a markup language used to encode data.