Index
Preface
Computational Challenges of Natural Language
Linguistic Data: Tokens and Words
Enter Machine Learning
Tools for Text Analysis
What to Expect from This Book
Who This Book Is For
Code Examples and GitHub Repository
Conventions Used in This Book
Using Code Examples
O’Reilly Safari
How to Contact Us
Acknowledgments
1. Language and Computation
The Data Science Paradigm
Language-Aware Data Products
The Data Product Pipeline
The model selection triple
Language as Data
A Computational Model of Language
Language Features
Contextual Features
Structural Features
Conclusion
2. Building a Custom Corpus
What Is a Corpus?
Domain-Specific Corpora
The Baleen Ingestion Engine
Corpus Data Management
Corpus Disk Structure
The Baleen disk structure
Corpus Readers
Streaming Data Access with NLTK
Reading an HTML Corpus
Corpus monitoring
Reading a Corpus from a Database
Conclusion
3. Corpus Preprocessing and Wrangling
Breaking Down Documents
Identifying and Extracting Core Content
Deconstructing Documents into Paragraphs
Segmentation: Breaking Out Sentences
Tokenization: Identifying Individual Tokens
Part-of-Speech Tagging
Intermediate Corpus Analytics
Corpus Transformation
Intermediate Preprocessing and Storage
Writing to pickle
Reading the Processed Corpus
Conclusion
4. Text Vectorization and Transformation Pipelines
Words in Space
Frequency Vectors
With NLTK
In Scikit-Learn
The Gensim way
One-Hot Encoding
With NLTK
In Scikit-Learn
The Gensim way
Term Frequency–Inverse Document Frequency
With NLTK
In Scikit-Learn
The Gensim way
Distributed Representation
The Gensim way
The Scikit-Learn API
The BaseEstimator Interface
Extending TransformerMixin
Creating a custom Gensim vectorization transformer
Creating a custom text normalization transformer
Pipelines
Pipeline Basics
Grid Search for Hyperparameter Optimization
Enriching Feature Extraction with Feature Unions
Conclusion
5. Classification for Text Analysis
Text Classification
Identifying Classification Problems
Classifier Models
Building a Text Classification Application
Cross-Validation
Streaming access to k splits
Model Construction
Model Evaluation
Model Operationalization
Conclusion
6. Clustering for Text Similarity
Unsupervised Learning on Text
Clustering by Document Similarity
Distance Metrics
Partitive Clustering
k-means clustering
Optimizing k-means
Handling uneven geometries
Hierarchical Clustering
Agglomerative clustering
Modeling Document Topics
Latent Dirichlet Allocation
In Scikit-Learn
The Gensim way
Visualizing topics
Latent Semantic Analysis
In Scikit-Learn
The Gensim way
Non-Negative Matrix Factorization
In Scikit-Learn
Conclusion
7. Context-Aware Text Analysis
Grammar-Based Feature Extraction
Context-Free Grammars
Syntactic Parsers
Extracting Keyphrases
Extracting Entities
n-Gram Feature Extraction
An n-Gram-Aware CorpusReader
Choosing the Right n-Gram Window
Significant Collocations
n-Gram Language Models
Frequency and Conditional Frequency
Estimating Maximum Likelihood
Unknown Words: Back-off and Smoothing
Language Generation
Conclusion
8. Text Visualization
Visualizing Feature Space
Visual Feature Analysis
n-gram viewer
Network visualization
Co-occurrence plots
Text x-rays and dispersion plots
Guided Feature Engineering
Part-of-speech tagging
Most informative features
Model Diagnostics
Visualizing Clusters
Visualizing Classes
Diagnosing Classification Error
Classification report heatmaps
Confusion matrices
Visual Steering
Silhouette Scores and Elbow Curves
Silhouette scores
Elbow curves
Conclusion
9. Graph Analysis of Text
Graph Computation and Analysis
Creating a Graph-Based Thesaurus
Analyzing Graph Structure
Visual Analysis of Graphs
Extracting Graphs from Text
Creating a Social Graph
Finding entity pairs
Property graphs
Implementing the graph extraction
Insights from the Social Graph
Centrality
Structural analysis
Entity Resolution
Entity Resolution on a Graph
Blocking with Structure
Fuzzy Blocking
Conclusion
10. Chatbots
Fundamentals of Conversation
Dialog: A Brief Exchange
Maintaining a Conversation
Rules for Polite Conversation
Greetings and Salutations
Handling Miscommunication
Entertaining Questions
Dependency Parsing
Constituency Parsing
Question Detection
From Tablespoons to Grams
Learning to Help
Being Neighborly
Offering Recommendations
Conclusion
11. Scaling Text Analytics with Multiprocessing and Spark
Python Multiprocessing
Running Tasks in Parallel
Process Pools and Queues
Parallel Corpus Preprocessing
Cluster Computing with Spark
Anatomy of a Spark Job
Distributing the Corpus
RDD Operations
NLP with Spark
From Scikit-Learn to MLLib
Feature extraction
Text clustering with MLLib
Text classification with MLLib
Local fit, global evaluation
Conclusion
12. Deep Learning and Beyond
Applied Neural Networks
Neural Language Models
Artificial Neural Networks
Training a multilayer perceptron
Deep Learning Architectures
TensorFlow: A framework for deep learning
Keras: An API for deep learning
Sentiment Analysis
Deep Structure Analysis
Predicting sentiment with a bag-of-keyphrases
The Future Is (Almost) Here
Glossary
Index