Index
Natural Language Annotation for Machine Learning
Preface
Natural Language Annotation for Machine Learning
Audience
Organization of This Book
Software Requirements
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
James Adds:
Amber Adds:
1. The Basics
The Importance of Language Annotation
The Layers of Linguistic Description
What Is Natural Language Processing?
A Brief History of Corpus Linguistics
What Is a Corpus?
Early Use of Corpora
Corpora Today
Kinds of Annotation
Language Data and Machine Learning
Classification
Clustering
Structured Pattern Induction
The Annotation Development Cycle
Model the Phenomenon
Annotate with the Specification
Train and Test the Algorithms over the Corpus
Evaluate the Results
Revise the Model and Algorithms
Summary
2. Defining Your Goal and Dataset
Defining Your Goal
The Statement of Purpose
Refining Your Goal: Informativity Versus Correctness
The scope of the annotation task
What will the annotation be used for?
What will the overall outcome be?
Where will the corpus come from?
How will the result be achieved?
Background Research
Language Resources
Organizations and Conferences
NLP Challenges
Assembling Your Dataset
The Ideal Corpus: Representative and Balanced
Collecting Data from the Internet
Eliciting Data from People
Read speech
Spontaneous speech
The Size of Your Corpus
Existing Corpora
Distributions Within Corpora
Summary
3. Corpus Analytics
Basic Probability for Corpus Analytics
Joint Probability Distributions
Bayes Rule
Counting Occurrences
Zipf’s Law
N-grams
Language Models
Summary
4. Building Your Model and Specification
Some Example Models and Specs
Film Genre Classification
Adding Named Entities
Semantic Roles
Adopting (or Not Adopting) Existing Models
Creating Your Own Model and Specification: Generality Versus Specificity
Using Existing Models and Specifications
Using Models Without Specifications
Different Kinds of Standards
ISO Standards
Annotation format standards
Annotation specification standards
Community-Driven Standards
Other Standards Affecting Annotation
Summary
5. Applying and Adopting Annotation Standards
Metadata Annotation: Document Classification
Unique Labels: Movie Reviews
Multiple Labels: Film Genres
Text Extent Annotation: Named Entities
Inline Annotation
Stand-off Annotation by Tokens
Stand-off Annotation by Character Location
Linked Extent Annotation: Semantic Roles
ISO Standards and You
Summary
6. Annotation and Adjudication
The Infrastructure of an Annotation Project
Specification Versus Guidelines
Be Prepared to Revise
Preparing Your Data for Annotation
Metadata
Preprocessed Data
Splitting Up the Files for Annotation
Writing the Annotation Guidelines
Example 1: Single Labels—Movie Reviews
Example 2: Multiple Labels—Film Genres
Example 3: Extent Annotations—Named Entities
Example 4: Link Tags—Semantic Roles
Annotators
Choosing an Annotation Environment
Evaluating the Annotations
Cohen’s Kappa (κ)
Fleiss’s Kappa (κ)
Interpreting Kappa Coefficients
Calculating κ in Other Contexts
Creating the Gold Standard (Adjudication)
Summary
7. Training: Machine Learning
What Is Learning?
Defining Our Learning Task
Classifier Algorithms
Decision Tree Learning
Gender Identification
Naïve Bayes Learning
Movie genre identification
Sentiment classification
Maximum Entropy Classifiers
Other Classifiers to Know About
Sequence Induction Algorithms
Clustering and Unsupervised Learning
Semi-Supervised Learning
Matching Annotation to Algorithms
Summary
8. Testing and Evaluation
Testing Your Algorithm
Evaluating Your Algorithm
Confusion Matrices
Calculating Evaluation Scores
Percentage accuracy
Precision and recall
F-measure
Other evaluation metrics
Interpreting Evaluation Scores
Problems That Can Affect Evaluation
Dataset Is Too Small
Algorithm Fits the Development Data Too Well
Too Much Information in the Annotation
Final Testing Scores
Summary
9. Revising and Reporting
Revising Your Project
Corpus Distributions and Content
Model and Specification
Annotation
Guidelines
Annotators
Tools
Training and Testing
Reporting About Your Work
About Your Corpus
About Your Model and Specifications
About Your Annotation Task and Annotators
About Your ML Algorithm
About Your Revisions
Summary
10. Annotation: TimeML
The Goal of TimeML
Related Research
Building the Corpus
Model: Preliminary Specifications
Times
Signals
Events
Links
Annotation: First Attempts
Model: The TimeML Specification Used in TimeBank
Time Expressions
Events
Signals
Links
Confidence
Annotation: The Creation of TimeBank
TimeML Becomes ISO-TimeML
Modeling the Future: Directions for TimeML
Narrative Containers
Expanding TimeML to Other Domains
Event Structures
Summary
11. Automatic Annotation: Generating TimeML
The TARSQI Components
GUTime: Temporal Marker Identification
EVITA: Event Recognition and Classification
GUTenLINK
Slinket
SputLink
Machine Learning in the TARSQI Components
Improvements to the TTK
Structural Changes
Improvements to Temporal Entity Recognition: BTime
Temporal Relation Identification
Temporal Relation Validation
Temporal Relation Visualization
TimeML Challenges: TempEval-2
TempEval-2: System Summaries
Overview of Results
Future of the TTK
New Input Formats
Narrative Containers/Narrative Times
Medical Documents
Cross-Document Analysis
Summary
12. Afterword: The Future of Annotation
Crowdsourcing Annotation
Amazon’s Mechanical Turk
Games with a Purpose (GWAP)
User-Generated Content
Handling Big Data
Boosting
Active Learning
Semi-Supervised Learning
NLP Online and in the Cloud
Distributed Computing
Shared Language Resources
Shared Language Applications
And Finally...
A. List of Available Corpora and Specifications
Corpora
Specifications, Guidelines, and Other Resources
Representation Standards
B. List of Software Resources
Annotation and Adjudication Software
Multipurpose Tools
Corpus Creation and Exploration Tools
Manual Annotation Tools
Automated Annotation Tools
Multipurpose tools
Phonetic annotation
Part-of-speech taggers/syntactic parsers
Tokenizers/chunkers/stemmers
Other
Machine Learning Resources
C. MAE User Guide
Installing and Running MAE
Loading Tasks and Files
Loading a Task
Loading a File
Annotating Entities
Attribute information
Nonconsuming tags
Annotating Links
Deleting Tags
Saving Files
Defining Your Own Task
Task Name
Elements (a.k.a. Tags)
Attributes
id attributes
start attribute
Attribute types
Default attribute values
Frequently Asked Questions
D. MAI User Guide
Installing and Running MAI
Loading Tasks and Files
Loading a Task
Loading Files
Adjudicating
The MAI Window
Adjudicating a Tag
Extent Tags
Link Tags
Nonconsuming Tags
Adding New Tags
Deleting tags
Saving Files
E. Bibliography
References for Using Amazon’s Mechanical Turk/Crowdsourcing
Index
About the Authors
Colophon
Copyright