Index
Half title
Copyright
Title
Abstract
Contents
Preface
Acknowledgments
1 Introduction
1.1 The Challenges of Natural Language Processing
1.2 Neural Networks and Deep Learning
1.3 Deep Learning in NLP
1.3.1 Success Stories
1.4 Coverage and Organization
1.5 What’s not Covered
1.6 A Note on Terminology
1.7 Mathematical Notation
Part I Supervised Classification and Feed-forward Neural Networks
2 Learning Basics and Linear Models
2.1 Supervised Learning and Parameterized Functions
2.2 Train, Test, and Validation Sets
2.3 Linear Models
2.3.1 Binary Classification
2.3.2 Log-linear Binary Classification
2.3.3 Multi-class Classification
2.4 Representations
2.5 One-Hot and Dense Vector Representations
2.6 Log-linear Multi-class Classification
2.7 Training as Optimization
2.7.1 Loss Functions
2.7.2 Regularization
2.8 Gradient-based Optimization
2.8.1 Stochastic Gradient Descent
2.8.2 Worked-out Example
2.8.3 Beyond SGD
3 From Linear Models to Multi-layer Perceptrons
3.1 Limitations of Linear Models: The XOR Problem
3.2 Nonlinear Input Transformations
3.3 Kernel Methods
3.4 Trainable Mapping Functions
4 Feed-forward Neural Networks
4.1 A Brain-inspired Metaphor
4.2 In Mathematical Notation
4.3 Representation Power
4.4 Common Nonlinearities
4.5 Loss Functions
4.6 Regularization and Dropout
4.7 Similarity and Distance Layers
4.8 Embedding Layers
5 Neural Network Training
5.1 The Computation Graph Abstraction
5.1.1 Forward Computation
5.1.2 Backward Computation (Derivatives, Backprop)
5.1.3 Software
5.1.4 Implementation Recipe
5.1.5 Network Composition
5.2 Practicalities
5.2.1 Choice of Optimization Algorithm
5.2.2 Initialization
5.2.3 Restarts and Ensembles
5.2.4 Vanishing and Exploding Gradients
5.2.5 Saturation and Dead Neurons
5.2.6 Shuffling
5.2.7 Learning Rate
5.2.8 Minibatches
Part II Working with Natural Language Data
6 Features for Textual Data
6.1 Typology of NLP Classification Problems
6.2 Features for NLP Problems
6.2.1 Directly Observable Properties
6.2.2 Inferred Linguistic Properties
6.2.3 Core Features vs. Combination Features
6.2.4 Ngram Features
6.2.5 Distributional Features
7 Case Studies of NLP Features
7.1 Document Classification: Language Identification
7.2 Document Classification: Topic Classification
7.3 Document Classification: Authorship Attribution
7.4 Word-in-context: Part of Speech Tagging
7.5 Word-in-context: Named Entity Recognition
7.6 Word in Context, Linguistic Features: Preposition Sense Disambiguation
7.7 Relation Between Words in Context: Arc-Factored Parsing
8 From Textual Features to Inputs
8.1 Encoding Categorical Features
8.1.1 One-hot Encodings
8.1.2 Dense Encodings (Feature Embeddings)
8.1.3 Dense Vectors vs. One-hot Representations
8.2 Combining Dense Vectors
8.2.1 Window-based Features
8.2.2 Variable Number of Features: Continuous Bag of Words
8.3 Relation Between One-hot and Dense Vectors
8.4 Odds and Ends
8.4.1 Distance and Position Features
8.4.2 Padding, Unknown Words, and Word Dropout
8.4.3 Feature Combinations
8.4.4 Vector Sharing
8.4.5 Dimensionality
8.4.6 Embeddings Vocabulary
8.4.7 Network’s Output
8.5 Example: Part-of-Speech Tagging
8.6 Example: Arc-factored Parsing
9 Language Modeling
9.1 The Language Modeling Task
9.2 Evaluating Language Models: Perplexity
9.3 Traditional Approaches to Language Modeling
9.3.1 Further Reading
9.3.2 Limitations of Traditional Language Models
9.4 Neural Language Models
9.5 Using Language Models for Generation
9.6 Byproduct: Word Representations
10 Pre-trained Word Representations
10.1 Random Initialization
10.2 Supervised Task-specific Pre-training
10.3 Unsupervised Pre-training
10.3.1 Using Pre-trained Embeddings
10.4 Word Embedding Algorithms
10.4.1 Distributional Hypothesis and Word Representations
10.4.2 From Neural Language Models to Distributed Representations
10.4.3 Connecting the Worlds
10.4.4 Other Algorithms
10.5 The Choice of Contexts
10.5.1 Window Approach
10.5.2 Sentences, Paragraphs, or Documents
10.5.3 Syntactic Window
10.5.4 Multilingual
10.5.5 Character-based and Sub-word Representations
10.6 Dealing with Multi-word Units and Word Inflections
10.7 Limitations of Distributional Methods
11 Using Word Embeddings
11.1 Obtaining Word Vectors
11.2 Word Similarity
11.3 Word Clustering
11.4 Finding Similar Words
11.4.1 Similarity to a Group of Words
11.5 Odd-one Out
11.6 Short Document Similarity
11.7 Word Analogies
11.8 Retrofitting and Projections
11.9 Practicalities and Pitfalls
12 Case Study: A Feed-forward Architecture for Sentence Meaning Inference
12.1 Natural Language Inference and the SNLI Dataset
12.2 A Textual Similarity Network
Part III Specialized Architectures
13 Ngram Detectors: Convolutional Neural Networks
13.1 Basic Convolution + Pooling
13.1.1 1D Convolutions Over Text
13.1.2 Vector Pooling
13.1.3 Variations
13.2 Alternative: Feature Hashing
13.3 Hierarchical Convolutions
14 Recurrent Neural Networks: Modeling Sequences and Stacks
14.1 The RNN Abstraction
14.2 RNN Training
14.3 Common RNN Usage-patterns
14.3.1 Acceptor
14.3.2 Encoder
14.3.3 Transducer
14.4 Bidirectional RNNs (biRNN)
14.5 Multi-layer (stacked) RNNs
14.6 RNNs for Representing Stacks
14.7 A Note on Reading the Literature
15 Concrete Recurrent Neural Network Architectures
15.1 CBOW as an RNN
15.2 Simple RNN
15.3 Gated Architectures
15.3.1 LSTM
15.3.2 GRU
15.4 Other Variants
15.5 Dropout in RNNs
16 Modeling with Recurrent Networks
16.1 Acceptors
16.1.1 Sentiment Classification
16.1.2 Subject-verb Agreement Grammaticality Detection
16.2 RNNs as Feature Extractors
16.2.1 Part-of-speech Tagging
16.2.2 RNN–CNN Document Classification
16.2.3 Arc-factored Dependency Parsing
17 Conditioned Generation
17.1 RNN Generators
17.1.1 Training Generators
17.2 Conditioned Generation (Encoder-Decoder)
17.2.1 Sequence to Sequence Models
17.2.2 Applications
17.2.3 Other Conditioning Contexts
17.3 Unsupervised Sentence Similarity
17.4 Conditioned Generation with Attention
17.4.1 Computational Complexity
17.4.2 Interpretability
17.5 Attention-based Models in NLP
17.5.1 Machine Translation
17.5.2 Morphological Inflection
17.5.3 Syntactic Parsing
Part IV Additional Topics
18 Modeling Trees with Recursive Neural Networks
18.1 Formal Definition
18.2 Extensions and Variations
18.3 Training Recursive Neural Networks
18.4 A Simple Alternative–Linearized Trees
18.5 Outlook
19 Structured Output Prediction
19.1 Search-based Structured Prediction
19.1.1 Structured Prediction with Linear Models
19.1.2 Nonlinear Structured Prediction
19.1.3 Probabilistic Objective (CRF)
19.1.4 Approximate Search
19.1.5 Reranking
19.1.6 See Also
19.2 Greedy Structured Prediction
19.3 Conditional Generation as Structured Output Prediction
19.4 Examples
19.4.1 Search-based Structured Prediction: First-order Dependency Parsing
19.4.2 Neural-CRF for Named Entity Recognition
19.4.3 Approximate NER-CRF With Beam-Search
20 Cascaded, Multi-task and Semi-supervised Learning
20.1 Model Cascading
20.2 Multi-task Learning
20.2.1 Training in a Multi-task Setup
20.2.2 Selective Sharing
20.2.3 Word-embeddings Pre-training as Multi-task Learning
20.2.4 Multi-task Learning in Conditioned Generation
20.2.5 Multi-task Learning as Regularization
20.2.6 Caveats
20.3 Semi-supervised Learning
20.4 Examples
20.4.1 Gaze-prediction and Sentence Compression
20.4.2 Arc Labeling and Syntactic Parsing
20.4.3 Preposition Sense Disambiguation and Preposition Translation Prediction
20.4.4 Conditioned Generation: Multilingual Machine Translation, Parsing, and Image Captioning
20.5 Outlook
21 Conclusion
21.1 What Have We Seen?
21.2 The Challenges Ahead
Bibliography
Author’s Biography