Natural Language Annotation for Machine Learning by Pustejovsky, James

Index

Natural Language Annotation for Machine Learning
Preface

Natural Language Annotation for Machine Learning
Audience
Organization of This Book
Software Requirements
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments

James Adds:
Amber Adds:

1. The Basics

The Importance of Language Annotation

The Layers of Linguistic Description
What Is Natural Language Processing?

A Brief History of Corpus Linguistics

What Is a Corpus?
Early Use of Corpora
Corpora Today
Kinds of Annotation

Language Data and Machine Learning

Classification
Clustering
Structured Pattern Induction

The Annotation Development Cycle

Model the Phenomenon
Annotate with the Specification
Train and Test the Algorithms over the Corpus
Evaluate the Results
Revise the Model and Algorithms

Summary

2. Defining Your Goal and Dataset

Defining Your Goal

The Statement of Purpose
Refining Your Goal: Informativity Versus Correctness

The scope of the annotation task
What will the annotation be used for?
What will the overall outcome be?
Where will the corpus come from?
How will the result be achieved?

Background Research

Language Resources
Organizations and Conferences
NLP Challenges

Assembling Your Dataset

The Ideal Corpus: Representative and Balanced
Collecting Data from the Internet
Eliciting Data from People

Read speech
Spontaneous speech

The Size of Your Corpus

Existing Corpora
Distributions Within Corpora

Summary

3. Corpus Analytics

Basic Probability for Corpus Analytics

Joint Probability Distributions
Bayes Rule

Counting Occurrences

Zipf’s Law
N-grams

Language Models
Summary

4. Building Your Model and Specification

Some Example Models and Specs

Film Genre Classification
Adding Named Entities
Semantic Roles

Adopting (or Not Adopting) Existing Models

Creating Your Own Model and Specification: Generality Versus Specificity
Using Existing Models and Specifications
Using Models Without Specifications

Different Kinds of Standards

ISO Standards

Annotation format standards
Annotation specification standards

Community-Driven Standards
Other Standards Affecting Annotation

Summary

5. Applying and Adopting Annotation Standards

Metadata Annotation: Document Classification

Unique Labels: Movie Reviews
Multiple Labels: Film Genres

Text Extent Annotation: Named Entities

Inline Annotation
Stand-off Annotation by Tokens
Stand-off Annotation by Character Location

Linked Extent Annotation: Semantic Roles
ISO Standards and You
Summary

6. Annotation and Adjudication

The Infrastructure of an Annotation Project
Specification Versus Guidelines
Be Prepared to Revise
Preparing Your Data for Annotation

Metadata
Preprocessed Data
Splitting Up the Files for Annotation

Writing the Annotation Guidelines

Example 1: Single Labels—Movie Reviews
Example 2: Multiple Labels—Film Genres
Example 3: Extent Annotations—Named Entities
Example 4: Link Tags—Semantic Roles

Annotators
Choosing an Annotation Environment
Evaluating the Annotations

Cohen’s Kappa (κ)
Fleiss’s Kappa (κ)
Interpreting Kappa Coefficients
Calculating κ in Other Contexts

Creating the Gold Standard (Adjudication)
Summary

7. Training: Machine Learning

What Is Learning?
Defining Our Learning Task
Classifier Algorithms

Decision Tree Learning
Gender Identification
Naïve Bayes Learning

Movie genre identification
Sentiment classification

Maximum Entropy Classifiers
Other Classifiers to Know About

Sequence Induction Algorithms
Clustering and Unsupervised Learning
Semi-Supervised Learning
Matching Annotation to Algorithms
Summary

8. Testing and Evaluation

Testing Your Algorithm
Evaluating Your Algorithm

Confusion Matrices
Calculating Evaluation Scores

Percentage accuracy
Precision and recall
F-measure
Other evaluation metrics

Interpreting Evaluation Scores

Problems That Can Affect Evaluation

Dataset Is Too Small
Algorithm Fits the Development Data Too Well
Too Much Information in the Annotation

Final Testing Scores
Summary

9. Revising and Reporting

Revising Your Project

Corpus Distributions and Content
Model and Specification
Annotation

Guidelines
Annotators
Tools

Training and Testing

Reporting About Your Work

About Your Corpus
About Your Model and Specifications
About Your Annotation Task and Annotators
About Your ML Algorithm
About Your Revisions

Summary

10. Annotation: TimeML

The Goal of TimeML
Related Research
Building the Corpus
Model: Preliminary Specifications

Times
Signals
Events
Links

Annotation: First Attempts
Model: The TimeML Specification Used in TimeBank

Time Expressions
Events
Signals
Links
Confidence

Annotation: The Creation of TimeBank
TimeML Becomes ISO-TimeML
Modeling the Future: Directions for TimeML

Narrative Containers
Expanding TimeML to Other Domains
Event Structures

Summary

11. Automatic Annotation: Generating TimeML

The TARSQI Components

GUTime: Temporal Marker Identification
EVITA: Event Recognition and Classification
GUTenLINK
Slinket
SputLink
Machine Learning in the TARSQI Components

Improvements to the TTK

Structural Changes
Improvements to Temporal Entity Recognition: BTime
Temporal Relation Identification
Temporal Relation Validation
Temporal Relation Visualization

TimeML Challenges: TempEval-2

TempEval-2: System Summaries
Overview of Results

Future of the TTK

New Input Formats
Narrative Containers/Narrative Times
Medical Documents
Cross-Document Analysis

Summary

12. Afterword: The Future of Annotation

Crowdsourcing Annotation

Amazon’s Mechanical Turk
Games with a Purpose (GWAP)
User-Generated Content

Handling Big Data

Boosting
Active Learning
Semi-Supervised Learning

NLP Online and in the Cloud

Distributed Computing
Shared Language Resources
Shared Language Applications

And Finally...

A. List of Available Corpora and Specifications

Corpora
Specifications, Guidelines, and Other Resources
Representation Standards

B. List of Software Resources

Annotation and Adjudication Software

Multipurpose Tools
Corpus Creation and Exploration Tools
Manual Annotation Tools
Automated Annotation Tools

Multipurpose tools
Phonetic annotation
Part-of-speech taggers/syntactic parsers
Tokenizers/chunkers/stemmers
Other

Machine Learning Resources

C. MAE User Guide

Installing and Running MAE
Loading Tasks and Files

Loading a Task
Loading a File
Annotating Entities

Attribute information
Nonconsuming tags

Annotating Links
Deleting Tags

Saving Files
Defining Your Own Task

Task Name
Elements (a.k.a. Tags)
Attributes

id attributes
start attribute
Attribute types
Default attribute values

Frequently Asked Questions

D. MAI User Guide

Installing and Running MAI
Loading Tasks and Files

Loading a Task
Loading Files

Adjudicating

The MAI Window
Adjudicating a Tag
Extent Tags
Link Tags
Nonconsuming Tags
Adding New Tags
Deleting tags

Saving Files

E. Bibliography

References for Using Amazon’s Mechanical Turk/Crowdsourcing

Index
About the Authors
Colophon
Copyright
