Glossary

algorithmic complexity: The complexity of an algorithm is generally measured in the time it takes to run or how much space (memory or disk space) is needed to run it.
annotation: In an NLP context, an annotation is a marking on a segment of text or audio with some extra information. Generally, an annotation will require character indices for the start and end of the annotated segment, as well as an annotation type.
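As a minimal sketch of the idea above, an annotation can be modeled as a typed span over a text. The `Annotation` class and its field names are hypothetical and used only for illustration; they are not the data model of any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical minimal annotation: a typed span over a text."""
    start: int            # index of the first character of the segment
    end: int              # index one past the last character (exclusive)
    annotation_type: str  # e.g. "token", "sentence", "pos"


text = "Spark NLP annotates text."
token = Annotation(start=0, end=5, annotation_type="token")

# Slicing the text with the stored indices recovers the annotated segment.
segment = text[token.start:token.end]
print(segment)  # Spark
```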
annotator: An annotator is a function that takes text and produces annotations. It is not uncommon for some annotators to have a dependency on another type of annotator.
Apache Hadoop: Hadoop is an open source implementation of the MapReduce paper. Initially, Hadoop required that the map, reduce, and any custom format readers be implemented and deployed to the cluster. Eventually, higher level abstractions were developed, like Apache Hive and Apache Pig.
Apache Parquet: Parquet is a data format originally created for Hadoop. It allows for efficient compression of columnar data. It is a popular format in the Spark ecosystem.
Apache Spark: Spark is a distributed computing framework with a high-level interface and in memory processing. Spark was developed in Scala, but there are now APIs for Java, Python, R, and SQL.
application: An application is a program with an end user. Many applications have a graphical user interface (GUI), though this is not necessary. In this book, we also consider programs that do batch data processing as “applications”.
array: An array is a data structure where elements are associated with an index. Arrays are implemented differently in different programming languages. Numpy arrays, `ndarrays`, are the most popular kind of array among Python users, especially data scientists.
autoencoder: An autoencoder is a neural-network–based technique used to convert some input data into vectors, matrices, or tensors. This new representation is generally of a lower dimension than the input data.
Bidirectional Encoder Representations from Transformers (BERT): BERT from Google is a technique for converting words into a vector representation. Unlike Word2vec, which assigns each word a single vector regardless of the context in which it appears, BERT uses the context a word is found in to produce the vector.
classification: In a machine learning context, classification is the task of assigning classes to examples. The simplest form is the binary classification task where each example can have one of two classes. The binary classification task is a special case of the multiclass classification task where each example can have one of a fixed set of classes. There is also the multilabel classification task where each example can have zero or more labels from a fixed set of labels.
clustering: In the machine learning context, clustering is the task of grouping examples into related groups. This is generally an unsupervised task, that is, the algorithm does not use preexisting labels, though there do exist some supervised clustering algorithms.
container: In software, there are two common senses of “container.” In this book, the term is primarily used to refer to a virtual environment that contains a program or programs. The term “container” is also sometimes used to refer to an abstract data type or data structure that contains a collection of elements.
context: In an NLP context, “context” generally refers to the language data surrounding a segment of text or audio. In linguistics, it can also refer to the “real world” context in which a language act occurs.
CSV: A CSV (Comma Separated Values) file is a common way to store structured data. Elements are separated by commas, and rows are separated by new lines. Another common separator is the tab character. Files that use the tab are called TSVs. It is not uncommon for files that use a separator other than a comma to still be called CSVs.
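The following sketch shows how the separator is the only thing that distinguishes a CSV from a TSV when reading one with Python's standard `csv` module. The sample data is made up for illustration, and in-memory strings stand in for files on disk.

```python
import csv
import io

# Hypothetical sample data; in-memory strings stand in for files on disk.
csv_data = "word,count\nspark,3\nnlp,5\n"
tsv_data = "word\tcount\nspark\t3\nnlp\t5\n"

# The same reader handles both formats; only the delimiter changes.
csv_rows = list(csv.reader(io.StringIO(csv_data)))
tsv_rows = list(csv.reader(io.StringIO(tsv_data), delimiter="\t"))

print(csv_rows)              # [['word', 'count'], ['spark', '3'], ['nlp', '5']]
print(csv_rows == tsv_rows)  # True
```

Note that every value is read as a string; converting columns such as `count` to numbers is left to the caller.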
data scientist: A data scientist is someone who uses scientific techniques to analyze data or build applications that consume data.
DataFrame: A DataFrame is a data structure that is used to manipulate tabular data.
decision tree: In a machine learning context, a decision tree is a data structure that is built for classification or regression tasks. Each node in the tree splits on a particular feature.
deep learning: Deep learning is a collection of neural-network techniques that generally use multiple layers.
dialect: In a linguistics context, a dialect is a particular variety of a language associated with a specific group of people.
differentiate: In a mathematics context, to differentiate is to find the derivative of a function. The derivative function is a function that maps from the domain to the instantaneous rate of change of the original function.
discourse: In a linguistics context, a discourse is a sequence of language acts, especially between two or more people.
distributed computing: Distributed computing is using multiple computers to perform parallelized computation.
distributional semantics: In an NLP context, this refers to techniques that attempt to represent words in a numerical form, almost always a vector, based on the words’ distribution in a corpus. This name originally comes from linguistics where it refers to theories that attempt to use the distribution of words in data to understand the words’ semantics.
Docker: Docker is software that allows users to create containers (virtual environments) with Docker scripts.
document: In an NLP context, a document is a complete piece of text, especially if it contains multiple sentences.
embedding: In an NLP context, an embedding is a technique of representing words (or other language elements) as a vector, especially when such a representation is produced by a neural network.
encoding: In an NLP context, the encoding or character encoding refers to the mapping from characters, e.g. “a”, “?”, to bytes.
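A short sketch of why the encoding matters: the same string maps to different bytes under different encodings, and decoding with the wrong one garbles the text. The choice of UTF-8 and Latin-1 here is only an example.

```python
text = "na\u00efve"  # the word "naive" written with a diaeresis on the "i"

utf8_bytes = text.encode("utf-8")      # the accented character becomes two bytes
latin1_bytes = text.encode("latin-1")  # the accented character becomes one byte

print(len(text))          # 5
print(len(utf8_bytes))    # 6
print(len(latin1_bytes))  # 5

# Decoding with the same encoding round-trips the text.
round_trip = utf8_bytes.decode("utf-8")
print(round_trip == text)  # True

# Decoding with a different encoding silently produces garbled text.
garbled = utf8_bytes.decode("latin-1")
print(garbled == text)  # False
```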
estimator: In a Spark MLlib context, an estimator is a stage of a pipeline that uses data to produce a model that transforms the data.
evaluator: In a Spark MLlib context, an evaluator is a stage of a pipeline that produces metrics from predictions.
feature: In a machine learning context, a feature is an attribute of an input, especially a numerical attribute. For example, if the input is a document, the number of unique tokens in the document is a feature. The words present in a document are also referred to as features.
function: In a programming context, a function is a sequence of instructions. In a mathematics context, a function is a mapping between two sets, the domain and the range, such that each element of the domain is mapped to a single element in the range.
GloVe: GloVe is a distributional semantics technique for representing words as vectors using word-to-word co-occurrences.
graph: In a computer science or mathematics context, a graph is a set of nodes and edges that connect the nodes.
guidelines: In a human labeling context, guidelines are the instructions given to the human labelers.
hidden Markov model: A hidden Markov model is a technique for modeling sequences in which each observed element depends on a hidden state, and each hidden state depends only on the previous hidden state.
hyperparameter: In a machine learning context, a hyperparameter is a setting of a learning algorithm. For example, in a neural network, the weights are parameters, but the number and size of the layers are hyperparameters.
index: In an information retrieval context, an index is a mapping from documents to the words contained in the documents.
interlabeler agreement: In a human labeling context, interlabeler agreement is a measure of how much labelers agree (generally unknowingly) when labeling the same example.
inverted index: In an information retrieval context, an inverted index is a mapping from words to the documents that contain the words.
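To illustrate the relationship between the two mappings, here is a minimal sketch that builds an inverted index from an index. The document IDs and words are hypothetical.

```python
from collections import defaultdict

# Hypothetical index: document ID -> words contained in the document.
index = {
    "doc1": ["spark", "nlp"],
    "doc2": ["spark", "hadoop"],
}

# Invert it: word -> set of IDs of the documents that contain the word.
inverted_index = defaultdict(set)
for doc_id, words in index.items():
    for word in words:
        inverted_index[word].add(doc_id)

print(sorted(inverted_index["spark"]))  # ['doc1', 'doc2']
print(sorted(inverted_index["nlp"]))    # ['doc1']
```

Looking up a query word in the inverted index is a single dictionary access, which is why search systems generally store this form.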
Java: Java is an object-oriented programming language. Java is almost always compiled to run on the Java Virtual Machine (JVM). Scala and a number of other popular languages run on the JVM and so are interoperable with Java.
Java Virtual Machine (JVM): The JVM is a virtual machine that runs programs that have been compiled into Java bytecode. As the name suggests, Java is the primary language which uses the JVM, but Scala and a number of other programming languages use it as well.
JSON: JavaScript Object Notation (JSON) is a data format.
K-Means: K-Means is a technique for clustering. It works by randomly placing K points, called centroids, and iteratively moving them to minimize the squared distance of elements of a cluster to their centroid.
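The loop described above can be sketched in plain Python as follows. This is an illustrative sketch, not a production implementation: it assumes the initial centroids are sampled from the data points (one common way of "randomly placing" them), and the function names, the `seed` parameter, and the sample points are all made up for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids


# Two well-separated groups of hypothetical 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids = sorted(k_means(points, k=2))
print(centroids)
```

On this data the two centroids settle near the means of the two groups. On less cleanly separated data, the result can depend on the initial placement.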
knowledge base: A knowledge base is a collection of knowledge or facts in a computationally usable format.
labeling: In a machine learning context, labeling is the process of assigning labels to examples, especially when done by humans.
language model: In an NLP context, a language model is a model of the probability distribution of word sequences.
latent Dirichlet allocation (LDA): LDA is a technique for topic modeling that treats documents as a sequence of words selected from weighted topics (probability distributions over words).
latent semantic indexing (LSI): LSI is a technique for topic modeling that performs singular value decomposition on the term-document matrix.
linear algebra: Linear algebra is the branch of mathematics focused on linear equations. In a programming context, linear algebra generally refers to the mathematics that describe vectors, matrices, and their associated operations.
linear regression: Linear regression is a statistical technique for modeling the relationship between a single variable and one or more other variables. In a machine learning context, linear regression refers to a regression model based on this statistical technique.
linguist: A linguist is a person who studies human languages.
linguistic typology: Linguistic typology is a field of linguistics that groups languages by their traits.
logging: In a software context, logging is information output by an application for use in monitoring and debugging the application.
logistic regression: Logistic regression is a statistical technique for modeling the probability of an event. In a machine learning context, logistic regression refers to a classification model based on this statistical technique.
long short-term memory (LSTM): LSTM is a neural-network technique that is used for learning sequences. It attempts to learn when to use and update the context.
loss: In a machine learning context, loss refers to a measure of how wrong a supervised model is.
machine learning: Machine learning is a field of computer science and mathematics that focuses on algorithms for building and using models “learned” from data.
MapReduce: MapReduce is a style of programming based on functional programming that was the basis of Hadoop.
matrix: A matrix is a rectangular array of numeric values. The mathematical definition is much more abstract.
metric: In a machine learning context, a metric is a measure of how good or bad a particular model is at its task. In a software context, a metric is a measure defined for an application, program, or function.
model: In a general scientific context, a model is some formal description, especially a mathematical one, of a phenomenon or system. In the machine learning context, a model is a set of hyperparameters, a set of learned parameters, and an evaluation or prediction function, especially one learned from data. In Spark MLlib, a model is what is produced by an Estimator when fitted to data.
model publishing: Once a machine learning model has been learned, it must be published to be used by other applications.
model training: Model training is the process of fitting a model to data.
monitoring: In a software context, monitoring is the process of recording and publishing information about a running application.
morphology: Morphology is a branch of linguistics focused on the structure and parts of words (more precisely, morphemes).
N-gram: An N-gram is a subsequence of words. Sometimes, “N-gram” can refer to a subsequence of characters.
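A minimal sketch of extracting contiguous N-grams from a sequence. The `ngrams` helper is a hypothetical name; because it only relies on slicing, it works on a list of words or on the characters of a string.

```python
def ngrams(items, n):
    """Return all contiguous subsequences of length n as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


words = "the red dog wags his tail".split()

word_bigrams = ngrams(words, 2)
print(word_bigrams)
# [('the', 'red'), ('red', 'dog'), ('dog', 'wags'), ('wags', 'his'), ('his', 'tail')]

# The same helper yields character N-grams when given a string.
char_bigrams = ngrams("dog", 2)
print(char_bigrams)  # [('d', 'o'), ('o', 'g')]
```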
naïve Bayes: Naïve Bayes is a classification technique built on the naïve assumption that the features are all independent of each other.
named-entity recognition (NER): NER is a task in NLP that focuses on finding particular entities in text.
natural language: Natural language is a language spoken or signed by people, in contrast to a programming language which is used for giving instruction to computers. Natural language also contrasts with artificial or constructed languages, which are designed by a person or group of people.
natural language processing (NLP): NLP is a field of computer science and linguistics focused on techniques and algorithms for processing data containing natural language.
neural network: An artificial neural network is a collection of neurons connected by weights.
notebook: In this book, a notebook refers to a programming and writing environment, for example Jupyter Notebook and Databricks notebooks.
numpy: Numpy is a Python library for performing linear algebra operations and an assortment of other mathematical operations.
object: In an object-oriented programming context, an object is an instance of a class or type.
optical character recognition (OCR): OCR is the set of techniques used to identify characters in an image.
overfitting: In machine learning, our data has biases as well as useful information for our task. The more exactly our machine learning model fits the data, the more it reflects these biases. This means that the predictions may be based on spurious relationships that incidentally occur in the training data.
pandas: pandas is a Python library for data analysis and processing that uses DataFrames.
parallelism: In computer science, parallelism is how much an algorithm is or can be distributed across multiple threads, processes, or machines.
parameter: In a mathematics context, a parameter is a value in a mathematical model. In a programming context, a parameter is another name for an argument of a function. In a machine learning context, a parameter is a value learned in the training process using the training data.
partition: In Spark, a partition is a subset of the distributed data that is collocated on a machine.
parts of speech (POS): POS are word categories. The most well known are nouns and verbs. In an NLP context, the Penn Treebank tags are the most frequently used set of parts of speech.
PDF: Portable document format (PDF) is a common file format for formatted text. It is a common input to NLP applications.
phonetics: Phonetics is the branch of linguistics focused on the study of speech sounds.
phrase: In linguistics, a phrase is a sequence of words that make up a constituent. For example, in the sentence “The red dog wags his tail,” “the red dog” is a noun phrase, but “the red” is not.
pickle: The pickle module is part of the Python standard library used for serializing data.
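A minimal sketch of a serialization round trip with `pickle`. The dictionary being serialized is a made-up example.

```python
import pickle

# Hypothetical object to serialize.
model_settings = {"name": "demo", "layers": [64, 32], "trained": True}

data = pickle.dumps(model_settings)  # object -> bytes
restored = pickle.loads(data)        # bytes -> equivalent object

print(isinstance(data, bytes))     # True
print(restored == model_settings)  # True
```

Unpickling can execute arbitrary code, so `pickle.loads` should only be used on data from a trusted source.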
pipeline: In data processing, a pipeline is a sequence of processing steps combined into a single object. In Spark MLlib, a pipeline is a sequence of stages. A Pipeline is an estimator containing transformers, estimators, and evaluators. When it is trained, it produces a PipelineModel containing transformers, models, and evaluators.
pragmatics: Pragmatics is the branch of linguistics focused on understanding meaning in context.
process: In a computing context, a process is a running program.
product owner: In software development, the product owner is the person or people who represent the customer in the development process. They also own the requirements and prioritize development tasks.
production: Production is the environment an application is deployed into.
profiling: In an application context, profiling is the process of measuring the resources an application or program requires to run.
program: A program is a set of instructions given to a computer.
programming language: A programming language is a formal language for writing high-level (human readable) instructions for a computer.
Python: Python is a programming language that is popular among NLP developers and data scientists. It is a multi-paradigm language, allowing object-oriented, functional, and imperative programming.
random forest: Random forest is a machine learning technique for training an ensemble of decision trees. The training data for each decision tree is a subset of the rows and features of the total data.
recurrent neural network (RNN): An RNN is a special kind of neural network used for modeling sequential data.
register: In linguistics, a register is a variation of language that is defined by the context in which it is used. This contrasts with a dialect, which is defined by the group of people who speak it.
regression: In a machine learning context, regression is the task of assigning a scalar value to each example.
regular expression: A regular expression is a string that defines a pattern to be matched in text.
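As a small illustration using Python's standard `re` module, the pattern below matches simple decimal numbers. The pattern and the sample sentence are chosen only for the example.

```python
import re

# Pattern: one or more digits, a literal dot, then one or more digits.
pattern = re.compile(r"\d+\.\d+")

text = "Spark 2.4 and Spark 3.0 both run on the JVM."
matches = pattern.findall(text)
print(matches)  # ['2.4', '3.0']
```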
repository: In a software context, a repository is a data store that contains the code and/or data for a project.
resilient distributed dataset (RDD): In Spark, an RDD is a distributed collection. In early versions of Spark, they were the fundamental elements of Spark programming.
scale out: In computing, scaling out is when more machines are used to increase the available resources.
scale up: In computing, scaling up is when a machine with more resources is used to increase available resources.
schema: In data engineering, a schema is the structure of a dataset along with some metadata (e.g. column names and types). In Spark, this is the metadata for defining a Spark DataFrame.
script: In programming, a script is a computer program that is generally written in a runnable code file (also called a script).
Scrum: Scrum is a style of agile software development. It is built around the idea of iterative development and short daily meetings (called scrums) where progress or problems are shared.
search: In computing, search is a task in information retrieval concerned with finding documents that are relevant to a query.
semantics: Semantics is a branch of linguistics focused on the meaning communicated by language.
sentence: In linguistics, a sentence is a special kind of phrase, especially a clausal phrase, that is considered complete.
sentiment: In an NLP context, sentiment is the emotion or opinion a human encodes in a language act.
serialization: In computing, serialization is the process of converting objects or other programming elements into a format for storage.
software developer: A software developer is someone who writes software, especially using software engineering.
software development: Software development is the process of making an application (or an update to an application) available in the production environment.
software engineering: Software engineering is the discipline and best practices used in developing software.
software library: A software library is a piece of software that is not necessarily an application. Applications are generally built by combining libraries. Some software libraries also contain applications.
software test: A software test is a program, function, or set of human instructions used to test or verify the behavior of a piece of software.
Spark NLP: Spark NLP is an NLP annotation library that extends Spark MLlib.
stakeholder: In software development, a stakeholder is a person who has a vested interest in the software being developed. For example, customers and users are stakeholders.
stop word: In an NLP context, a stop word is a word or token that is considered to have negligible value for the given task.
structured query language (SQL): SQL is a programming language used to interact with relational data.
syntax: In a linguistics context, syntax is a branch of linguistics focused on the structure of phrases and sentences. It is also used to refer to the rules used by a language for constructing phrases and sentences.
tag: In an NLP context, a tag is a kind of annotation where a subsequence, especially a token, is marked with a label from a fixed set of labels. For example, annotators that identify the POS of tokens are often called POS taggers.
TensorFlow: TensorFlow is a data processing and mathematics library. It was popularized for its implementation of neural networks.
TF.IDF: TF.IDF refers to the technique developed in information retrieval. TF refers to the term frequency of a given term in a given document, and IDF refers to the inverse of the document frequency of the given term. TF.IDF is the product of the term frequency and the inverse document frequency which is supposed to represent the relevance of the given document to the given term.
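The product described above can be sketched as follows. There are several TF.IDF variants; this sketch assumes a raw count for the term frequency and the logarithm of (number of documents / document frequency) for the inverse document frequency, which is one common choice rather than the only one. The `tf_idf` function and the tiny corpus are hypothetical.

```python
import math
from collections import Counter

# Hypothetical corpus: document ID -> tokens.
docs = {
    "doc1": ["spark", "nlp", "spark"],
    "doc2": ["hadoop", "spark"],
    "doc3": ["nlp", "pipeline"],
}


def tf_idf(term, doc_id, docs):
    # TF: how many times the term occurs in the given document.
    tf = Counter(docs[doc_id])[term]
    # DF: how many documents contain the term at least once.
    df = sum(1 for words in docs.values() if term in words)
    if df == 0:
        return 0.0  # the term appears nowhere in the corpus
    # Assumption: log-scaled inverse document frequency.
    idf = math.log(len(docs) / df)
    return tf * idf


print(round(tf_idf("spark", "doc1", docs), 3))     # 0.811
print(round(tf_idf("pipeline", "doc3", docs), 3))  # 1.099
print(tf_idf("spark", "doc3", docs))               # 0.0
```

A rare term such as "pipeline" scores higher than a frequent one such as "spark", even though "spark" occurs twice in its document.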
thread: In computing, a thread is a subsequence of instructions in a program that may be executed in parallel.
token: In an NLP context, a token is a unit of text, generally—but not necessarily—a word.
topic: In an NLP context, a topic is a kind of cluster of meaning (or a quantified representation).
Transformer: In a Spark MLlib context, a Transformer is a stage of a pipeline that does not need to be fit or trained on data.
Unicode: Unicode is a standard for encoding characters.
vector: In a mathematics context, a vector is an element of a Cartesian space with more than one dimension.
virtual machine: A virtual machine is a software representation of a computer.
word: In linguistics, a word is loosely defined as an unbound morpheme, that is, a unit of language that can be used alone and still have meaning.
word vector: In distributional semantics, a word is represented as a vector. The mapping from word to vector is learned from data.
Word2vec: Word2vec is a distributional semantics technique that learns word representations by building a neural network.
XML: Extensible Markup Language is a markup language used to encode data.