Index
Half title
Copyright
Title
Abstract
Contents
Preface
Acknowledgments
1 Introduction
1.1 The Challenges of Natural Language Processing
1.2 Neural Networks and Deep Learning
1.3 Deep Learning in NLP
1.3.1 Success Stories
1.4 Coverage and Organization
1.5 What’s not Covered
1.6 A Note on Terminology
1.7 Mathematical Notation
Part I Supervised Classification and Feed-forward Neural Networks
2 Learning Basics and Linear Models
2.1 Supervised Learning and Parameterized Functions
2.2 Train, Test, and Validation Sets
2.3 Linear Models
2.3.1 Binary Classification
2.3.2 Log-linear Binary Classification
2.3.3 Multi-class Classification
2.4 Representations
2.5 One-Hot and Dense Vector Representations
2.6 Log-linear Multi-class Classification
2.7 Training as Optimization
2.7.1 Loss Functions
2.7.2 Regularization
2.8 Gradient-based Optimization
2.8.1 Stochastic Gradient Descent
2.8.2 Worked-out Example
2.8.3 Beyond SGD
3 From Linear Models to Multi-layer Perceptrons
3.1 Limitations of Linear Models: The XOR Problem
3.2 Nonlinear Input Transformations
3.3 Kernel Methods
3.4 Trainable Mapping Functions
4 Feed-forward Neural Networks
4.1 A Brain-inspired Metaphor
4.2 In Mathematical Notation
4.3 Representation Power
4.4 Common Nonlinearities
4.5 Loss Functions
4.6 Regularization and Dropout
4.7 Similarity and Distance Layers
4.8 Embedding Layers
5 Neural Network Training
5.1 The Computation Graph Abstraction
5.1.1 Forward Computation
5.1.2 Backward Computation (Derivatives, Backprop)
5.1.3 Software
5.1.4 Implementation Recipe
5.1.5 Network Composition
5.2 Practicalities
5.2.1 Choice of Optimization Algorithm
5.2.2 Initialization
5.2.3 Restarts and Ensembles
5.2.4 Vanishing and Exploding Gradients
5.2.5 Saturation and Dead Neurons
5.2.6 Shuffling
5.2.7 Learning Rate
5.2.8 Minibatches
Part II Working with Natural Language Data
6 Features for Textual Data
6.1 Typology of NLP Classification Problems
6.2 Features for NLP Problems
6.2.1 Directly Observable Properties
6.2.2 Inferred Linguistic Properties
6.2.3 Core Features vs. Combination Features
6.2.4 Ngram Features
6.2.5 Distributional Features
7 Case Studies of NLP Features
7.1 Document Classification: Language Identification
7.2 Document Classification: Topic Classification
7.3 Document Classification: Authorship Attribution
7.4 Word-in-context: Part of Speech Tagging
7.5 Word-in-context: Named Entity Recognition
7.6 Word in Context, Linguistic Features: Preposition Sense Disambiguation
7.7 Relation Between Words in Context: Arc-Factored Parsing
8 From Textual Features to Inputs
8.1 Encoding Categorical Features
8.1.1 One-hot Encodings
8.1.2 Dense Encodings (Feature Embeddings)
8.1.3 Dense Vectors vs. One-hot Representations
8.2 Combining Dense Vectors
8.2.1 Window-based Features
8.2.2 Variable Number of Features: Continuous Bag of Words
8.3 Relation Between One-hot and Dense Vectors
8.4 Odds and Ends
8.4.1 Distance and Position Features
8.4.2 Padding, Unknown Words, and Word Dropout
8.4.3 Feature Combinations
8.4.4 Vector Sharing
8.4.5 Dimensionality
8.4.6 Embeddings Vocabulary
8.4.7 Network’s Output
8.5 Example: Part-of-Speech Tagging
8.6 Example: Arc-factored Parsing
9 Language Modeling
9.1 The Language Modeling Task
9.2 Evaluating Language Models: Perplexity
9.3 Traditional Approaches to Language Modeling
9.3.1 Further Reading
9.3.2 Limitations of Traditional Language Models
9.4 Neural Language Models
9.5 Using Language Models for Generation
9.6 Byproduct: Word Representations
10 Pre-trained Word Representations
10.1 Random Initialization
10.2 Supervised Task-specific Pre-training
10.3 Unsupervised Pre-training
10.3.1 Using Pre-trained Embeddings
10.4 Word Embedding Algorithms
10.4.1 Distributional Hypothesis and Word Representations
10.4.2 From Neural Language Models to Distributed Representations
10.4.3 Connecting the Worlds
10.4.4 Other Algorithms
10.5 The Choice of Contexts
10.5.1 Window Approach
10.5.2 Sentences, Paragraphs, or Documents
10.5.3 Syntactic Window
10.5.4 Multilingual
10.5.5 Character-based and Sub-word Representations
10.6 Dealing with Multi-word Units and Word Inflections
10.7 Limitations of Distributional Methods
11 Using Word Embeddings
11.1 Obtaining Word Vectors
11.2 Word Similarity
11.3 Word Clustering
11.4 Finding Similar Words
11.4.1 Similarity to a Group of Words
11.5 Odd-one Out
11.6 Short Document Similarity
11.7 Word Analogies
11.8 Retrofitting and Projections
11.9 Practicalities and Pitfalls
12 Case Study: A Feed-forward Architecture for Sentence Meaning Inference
12.1 Natural Language Inference and the SNLI Dataset
12.2 A Textual Similarity Network
Part III Specialized Architectures
13 Ngram Detectors: Convolutional Neural Networks
13.1 Basic Convolution + Pooling
13.1.1 1D Convolutions Over Text
13.1.2 Vector Pooling
13.1.3 Variations
13.2 Alternative: Feature Hashing
13.3 Hierarchical Convolutions
14 Recurrent Neural Networks: Modeling Sequences and Stacks
14.1 The RNN Abstraction
14.2 RNN Training
14.3 Common RNN Usage-patterns
14.3.1 Acceptor
14.3.2 Encoder
14.3.3 Transducer
14.4 Bidirectional RNNs (biRNN)
14.5 Multi-layer (stacked) RNNs
14.6 RNNs for Representing Stacks
14.7 A Note on Reading the Literature
15 Concrete Recurrent Neural Network Architectures
15.1 CBOW as an RNN
15.2 Simple RNN
15.3 Gated Architectures
15.3.1 LSTM
15.3.2 GRU
15.4 Other Variants
15.5 Dropout in RNNs
16 Modeling with Recurrent Networks
16.1 Acceptors
16.1.1 Sentiment Classification
16.1.2 Subject-verb Agreement Grammaticality Detection
16.2 RNNs as Feature Extractors
16.2.1 Part-of-speech Tagging
16.2.2 RNN–CNN Document Classification
16.2.3 Arc-factored Dependency Parsing
17 Conditioned Generation
17.1 RNN Generators
17.1.1 Training Generators
17.2 Conditioned Generation (Encoder-Decoder)
17.2.1 Sequence to Sequence Models
17.2.2 Applications
17.2.3 Other Conditioning Contexts
17.3 Unsupervised Sentence Similarity
17.4 Conditioned Generation with Attention
17.4.1 Computational Complexity
17.4.2 Interpretability
17.5 Attention-based Models in NLP
17.5.1 Machine Translation
17.5.2 Morphological Inflection
17.5.3 Syntactic Parsing
Part IV Additional Topics
18 Modeling Trees with Recursive Neural Networks
18.1 Formal Definition
18.2 Extensions and Variations
18.3 Training Recursive Neural Networks
18.4 A Simple Alternative–Linearized Trees
18.5 Outlook
19 Structured Output Prediction
19.1 Search-based Structured Prediction
19.1.1 Structured Prediction with Linear Models
19.1.2 Nonlinear Structured Prediction
19.1.3 Probabilistic Objective (CRF)
19.1.4 Approximate Search
19.1.5 Reranking
19.1.6 See Also
19.2 Greedy Structured Prediction
19.3 Conditional Generation as Structured Output Prediction
19.4 Examples
19.4.1 Search-based Structured Prediction: First-order Dependency Parsing
19.4.2 Neural-CRF for Named Entity Recognition
19.4.3 Approximate NER-CRF With Beam-Search
20 Cascaded, Multi-task and Semi-supervised Learning
20.1 Model Cascading
20.2 Multi-task Learning
20.2.1 Training in a Multi-task Setup
20.2.2 Selective Sharing
20.2.3 Word-embeddings Pre-training as Multi-task Learning
20.2.4 Multi-task Learning in Conditioned Generation
20.2.5 Multi-task Learning as Regularization
20.2.6 Caveats
20.3 Semi-supervised Learning
20.4 Examples
20.4.1 Gaze-prediction and Sentence Compression
20.4.2 Arc Labeling and Syntactic Parsing
20.4.3 Preposition Sense Disambiguation and Preposition Translation Prediction
20.4.4 Conditioned Generation: Multilingual Machine Translation, Parsing, and Image Captioning
20.5 Outlook
21 Conclusion
21.1 What Have We Seen?
21.2 The Challenges Ahead
Bibliography
Author’s Biography