Contents
Preface
  • Approach of the Book
  • Prerequisites
  • Some Important Libraries to Know
  • Books to Read
  • Conventions Used in This Book
  • Using Code Examples
  • O’Reilly Online Learning
  • How to Contact Us
  • Acknowledgments
1. Gaining Early Insights from Textual Data
  • What You’ll Learn and What We’ll Build
  • Exploratory Data Analysis
  • Introducing the Dataset
  • Blueprint: Getting an Overview of the Data with Pandas
      • Calculating Summary Statistics for Columns
      • Checking for Missing Data
      • Plotting Value Distributions
      • Comparing Value Distributions Across Categories
      • Visualizing Developments Over Time
  • Blueprint: Building a Simple Text Preprocessing Pipeline
      • Performing Tokenization with Regular Expressions
      • Treating Stop Words
      • Processing a Pipeline with One Line of Code
  • Blueprints for Word Frequency Analysis
      • Blueprint: Counting Words with a Counter
      • Blueprint: Creating a Frequency Diagram
      • Blueprint: Creating Word Clouds
      • Blueprint: Ranking with TF-IDF
  • Blueprint: Finding a Keyword-in-Context
  • Blueprint: Analyzing N-Grams
  • Blueprint: Comparing Frequencies Across Time Intervals and Categories
      • Creating Frequency Timelines
      • Creating Frequency Heatmaps
  • Closing Remarks
2. Extracting Textual Insights with APIs
  • What You’ll Learn and What We’ll Build
  • Application Programming Interfaces
  • Blueprint: Extracting Data from an API Using the Requests Module
      • Pagination
      • Rate Limiting
  • Blueprint: Extracting Twitter Data with Tweepy
      • Obtaining Credentials
      • Installing and Configuring Tweepy
      • Extracting Data from the Search API
      • Extracting Data from a User’s Timeline
      • Extracting Data from the Streaming API
  • Closing Remarks
3. Scraping Websites and Extracting Data
  • What You’ll Learn and What We’ll Build
  • Scraping and Data Extraction
  • Introducing the Reuters News Archive
  • URL Generation
  • Blueprint: Downloading and Interpreting robots.txt
  • Blueprint: Finding URLs from sitemap.xml
  • Blueprint: Finding URLs from RSS
  • Downloading Data
  • Blueprint: Downloading HTML Pages with Python
  • Blueprint: Downloading HTML Pages with wget
  • Extracting Semistructured Data
  • Blueprint: Extracting Data with Regular Expressions
  • Blueprint: Using an HTML Parser for Extraction
  • Blueprint: Spidering
      • Introducing the Use Case
      • Error Handling and Production-Quality Software
  • Density-Based Text Extraction
      • Extracting Reuters Content with Readability
      • Summary Density-Based Text Extraction
  • All-in-One Approach
  • Blueprint: Scraping the Reuters Archive with Scrapy
  • Possible Problems with Scraping
  • Closing Remarks and Recommendation
4. Preparing Textual Data for Statistics and Machine Learning
  • What You’ll Learn and What We’ll Build
  • A Data Preprocessing Pipeline
  • Introducing the Dataset: Reddit Self-Posts
      • Loading Data Into Pandas
      • Blueprint: Standardizing Attribute Names
      • Saving and Loading a DataFrame
  • Cleaning Text Data
      • Blueprint: Identify Noise with Regular Expressions
      • Blueprint: Removing Noise with Regular Expressions
      • Blueprint: Character Normalization with textacy
      • Blueprint: Pattern-Based Data Masking with textacy
  • Tokenization
      • Blueprint: Tokenization with Regular Expressions
      • Tokenization with NLTK
      • Recommendations for Tokenization
  • Linguistic Processing with spaCy
      • Instantiating a Pipeline
      • Processing Text
      • Blueprint: Customizing Tokenization
      • Blueprint: Working with Stop Words
      • Blueprint: Extracting Lemmas Based on Part of Speech
      • Blueprint: Extracting Noun Phrases
      • Blueprint: Extracting Named Entities
  • Feature Extraction on a Large Dataset
      • Blueprint: Creating One Function to Get It All
      • Blueprint: Using spaCy on a Large Dataset
      • Persisting the Result
      • A Note on Execution Time
  • There Is More
      • Language Detection
      • Spell-Checking
      • Token Normalization
  • Closing Remarks and Recommendations
5. Feature Engineering and Syntactic Similarity
  • What You’ll Learn and What We’ll Build
  • A Toy Dataset for Experimentation
  • Blueprint: Building Your Own Vectorizer
      • Enumerating the Vocabulary
      • Vectorizing Documents
      • The Document-Term Matrix
      • The Similarity Matrix
  • Bag-of-Words Models
      • Blueprint: Using scikit-learn’s CountVectorizer
      • Blueprint: Calculating Similarities
  • TF-IDF Models
      • Optimized Document Vectors with TfidfTransformer
      • Introducing the ABC Dataset
      • Blueprint: Reducing Feature Dimensions
      • Blueprint: Improving Features by Making Them More Specific
      • Blueprint: Using Lemmas Instead of Words for Vectorizing Documents
      • Blueprint: Limit Word Types
      • Blueprint: Remove Most Common Words
      • Blueprint: Adding Context via N-Grams
  • Syntactic Similarity in the ABC Dataset
      • Blueprint: Finding Most Similar Headlines to a Made-up Headline
      • Blueprint: Finding the Two Most Similar Documents in a Large Corpus (Much More Difficult)
      • Blueprint: Finding Related Words
      • Tips for Long-Running Programs like Syntactic Similarity
  • Summary and Conclusion
6. Text Classification Algorithms
  • What You’ll Learn and What We’ll Build
  • Introducing the Java Development Tools Bug Dataset
  • Blueprint: Building a Text Classification System
      • Step 1: Data Preparation
      • Step 2: Train-Test Split
      • Step 3: Training the Machine Learning Model
      • Step 4: Model Evaluation
  • Final Blueprint for Text Classification
  • Blueprint: Using Cross-Validation to Estimate Realistic Accuracy Metrics
  • Blueprint: Performing Hyperparameter Tuning with Grid Search
  • Blueprint Recap and Conclusion
  • Closing Remarks
  • Further Reading
7. How to Explain a Text Classifier
  • What You’ll Learn and What We’ll Build
  • Blueprint: Determining Classification Confidence Using Prediction Probability
  • Blueprint: Measuring Feature Importance of Predictive Models
  • Blueprint: Using LIME to Explain the Classification Results
  • Blueprint: Using ELI5 to Explain the Classification Results
  • Blueprint: Using Anchor to Explain the Classification Results
      • Using the Distribution with Masked Words
      • Working with Real Words
  • Closing Remarks
8. Unsupervised Methods: Topic Modeling and Clustering
  • What You’ll Learn and What We’ll Build
  • Our Dataset: UN General Debates
      • Checking Statistics of the Corpus
      • Preparations
  • Nonnegative Matrix Factorization (NMF)
      • Blueprint: Creating a Topic Model Using NMF for Documents
      • Blueprint: Creating a Topic Model for Paragraphs Using NMF
  • Latent Semantic Analysis/Indexing
      • Blueprint: Creating a Topic Model for Paragraphs with SVD
  • Latent Dirichlet Allocation
      • Blueprint: Creating a Topic Model for Paragraphs with LDA
      • Blueprint: Visualizing LDA Results
  • Blueprint: Using Word Clouds to Display and Compare Topic Models
  • Blueprint: Calculating Topic Distribution of Documents and Time Evolution
  • Using Gensim for Topic Modeling
      • Blueprint: Preparing Data for Gensim
      • Blueprint: Performing Nonnegative Matrix Factorization with Gensim
      • Blueprint: Using LDA with Gensim
      • Blueprint: Calculating Coherence Scores
      • Blueprint: Finding the Optimal Number of Topics
      • Blueprint: Creating a Hierarchical Dirichlet Process with Gensim
  • Blueprint: Using Clustering to Uncover the Structure of Text Data
  • Further Ideas
  • Summary and Recommendation
  • Conclusion
9. Text Summarization
  • What You’ll Learn and What We’ll Build
  • Text Summarization
      • Extractive Methods
      • Data Preprocessing
  • Blueprint: Summarizing Text Using Topic Representation
      • Identifying Important Words with TF-IDF Values
      • LSA Algorithm
  • Blueprint: Summarizing Text Using an Indicator Representation
  • Measuring the Performance of Text Summarization Methods
  • Blueprint: Summarizing Text Using Machine Learning
      • Step 1: Creating Target Labels
      • Step 2: Adding Features to Assist Model Prediction
      • Step 3: Build a Machine Learning Model
  • Closing Remarks
  • Further Reading
10. Exploring Semantic Relationships with Word Embeddings
  • What You’ll Learn and What We’ll Build
  • The Case for Semantic Embeddings
      • Word Embeddings
      • Analogy Reasoning with Word Embeddings
      • Types of Embeddings
  • Blueprint: Using Similarity Queries on Pretrained Models
      • Loading a Pretrained Model
      • Similarity Queries
  • Blueprints for Training and Evaluating Your Own Embeddings
      • Data Preparation
      • Blueprint: Training Models with Gensim
      • Blueprint: Evaluating Different Models
  • Blueprints for Visualizing Embeddings
      • Blueprint: Applying Dimensionality Reduction
      • Blueprint: Using the TensorFlow Embedding Projector
      • Blueprint: Constructing a Similarity Tree
  • Closing Remarks
  • Further Reading
11. Performing Sentiment Analysis on Text Data
  • What You’ll Learn and What We’ll Build
  • Sentiment Analysis
  • Introducing the Amazon Customer Reviews Dataset
  • Blueprint: Performing Sentiment Analysis Using Lexicon-Based Approaches
      • Bing Liu Lexicon
      • Disadvantages of a Lexicon-Based Approach
  • Supervised Learning Approaches
      • Preparing Data for a Supervised Learning Approach
  • Blueprint: Vectorizing Text Data and Applying a Supervised Machine Learning Algorithm
      • Step 1: Data Preparation
      • Step 2: Train-Test Split
      • Step 3: Text Vectorization
      • Step 4: Training the Machine Learning Model
  • Pretrained Language Models Using Deep Learning
      • Deep Learning and Transfer Learning
  • Blueprint: Using the Transfer Learning Technique and a Pretrained Language Model
      • Step 1: Loading Models and Tokenization
      • Step 2: Model Training
      • Step 3: Model Evaluation
  • Closing Remarks
  • Further Reading
12. Building a Knowledge Graph
  • What You’ll Learn and What We’ll Build
  • Knowledge Graphs
      • Information Extraction
  • Introducing the Dataset
  • Named-Entity Recognition
      • Blueprint: Using Rule-Based Named-Entity Recognition
      • Blueprint: Normalizing Named Entities
      • Merging Entity Tokens
  • Coreference Resolution
      • Blueprint: Using spaCy’s Token Extensions
      • Blueprint: Performing Alias Resolution
      • Blueprint: Resolving Name Variations
      • Blueprint: Performing Anaphora Resolution with NeuralCoref
      • Name Normalization
      • Entity Linking
  • Blueprint: Creating a Co-Occurrence Graph
      • Extracting Co-Occurrences from a Document
      • Visualizing the Graph with Gephi
  • Relation Extraction
      • Blueprint: Extracting Relations Using Phrase Matching
      • Blueprint: Extracting Relations Using Dependency Trees
  • Creating the Knowledge Graph
      • Don’t Blindly Trust the Results
  • Closing Remarks
  • Further Reading
13. Using Text Analytics in Production
  • What You’ll Learn and What We’ll Build
  • Blueprint: Using Conda to Create Reproducible Python Environments
  • Blueprint: Using Containers to Create Reproducible Environments
  • Blueprint: Creating a REST API for Your Text Analytics Model
  • Blueprint: Deploying and Scaling Your API Using a Cloud Provider
  • Blueprint: Automatically Versioning and Deploying Builds
  • Closing Remarks
  • Further Reading
Index
