The Artificial Intelligence Infrastructure Workshop
Preface
About the Book
Audience
About the Chapters
Conventions
Code Presentation
Setting up Your Environment
Installing Anaconda
Installing Scikit-Learn
Installing gawk
Installing Apache Spark
Installing PySpark
Installing Tweepy
Installing spaCy
Installing MySQL
Installing MongoDB
Installing Cassandra
Installing Apache Spark and Scala
Installing Airflow
Installing AWS
Registering Your AWS Account
Creating an IAM Role for Programmatic AWS Access
Installing the AWS CLI
Installing an AWS Python SDK – Boto
Installing MySQL Client
Installing pytest
Installing Moto
Installing PyTorch
Installing Gym
Installing Docker
Kubernetes – Minikube
Installing Maven
Installing JDK
Installing Netcat
Installing Libraries
Accessing the Code Files
1. Data Storage Fundamentals
Introduction
Problems Solved by Machine Learning
Image Processing – Detecting Cancer in Mammograms with Computer Vision
Text and Language Processing – Google Translate
Audio Processing – Automatically Generated Subtitles
Time Series Analysis
Optimizing the Storing and Processing of Data for Machine Learning Problems
Diving into Text Classification
Looking at TF-IDF Vectorization
Looking at Terminology in Text Classification Tasks
Exercise 1.01: Training a Machine Learning Model to Identify Clickbait Headlines
Designing for Scale – Choosing the Right Architecture and Hardware
Optimizing Hardware – Processing Power, Volatile Memory, and Persistent Storage
Optimizing Volatile Memory
Optimizing Persistent Storage
Optimizing Cloud Costs – Spot Instances and Reserved Instances
Using Vectorized Operations to Analyze Data Fast
Exercise 1.02: Applying Vectorized Operations to Entire Matrices
Activity 1.01: Creating a Text Classifier for Movie Reviews
Summary
2. Artificial Intelligence Storage Requirements
Introduction
Storage Requirements
The Three Stages of Digital Data
Data Layers
From Data Warehouse to Data Lake
Exercise 2.01: Designing a Layered Architecture for an AI System
Requirements per Infrastructure Layer
Raw Data
Security
Basic Protection
The AIC Rating
Role-Based Access
Encryption
Exercise 2.02: Defining the Security Requirements for Storing Raw Data
Scalability
Time Travel
Retention
Metadata and Lineage
Historical Data
Security
Scalability
Availability
Exercise 2.03: Analyzing the Availability of a Data Store
Availability Consequences
Time Travel
Locality of Data
Metadata and Lineage
Streaming Data
Security
Performance
Availability
Retention
Exercise 2.04: Setting the Requirements for Data Retention
Analytics Data
Performance
Cost-Efficiency
Quality
Model Development and Training
Security
Availability
Retention
Activity 2.01: Requirements Engineering for a Data-Driven Application
Summary
3. Data Preparation
Introduction
ETL
Data Processing Techniques
Exercise 3.01: Creating a Simple ETL Bash Script
Traditional ETL with Dedicated Tooling
Distributed, Parallel Processing with Apache Spark
Exercise 3.02: Building an ETL Job Using Spark
Activity 3.01: Using PySpark for a Simple ETL Job to Find Netflix Shows for All Ages
Source to Raw: Importing Data from Source Systems
Raw to Historical: Cleaning Data
Raw to Historical: Modeling Data
Historical to Analytics: Filtering and Aggregating Data
Historical to Analytics: Flattening Data
Analytics to Model: Feature Engineering
Analytics to Model: Splitting Data
Streaming Data
Windows
Event Time
Late Events and Watermarks
Exercise 3.03: Streaming Data Processing with Spark
Activity 3.02: Counting the Words in a Twitter Data Stream to Determine the Trending Topics
Summary
4. The Ethics of AI Data Storage
Introduction
Case Study 1: Cambridge Analytica
Summary and Takeaways
Case Study 2: Amazon's AI Recruiting Tool
Imbalanced Training Sets
Summary and Takeaways
Case Study 3: COMPAS Software
Summary and Takeaways
Finding Built-In Bias in Machine Learning Models
Exercise 4.01: Observing Prejudices and Biases in Word Embeddings
Exercise 4.02: Testing Our Sentiment Classifier on Movie Reviews
Activity 4.01: Finding More Latent Prejudices
Summary
5. Data Stores: SQL and NoSQL Databases
Introduction
Database Components
SQL Databases
MySQL
Advantages of MySQL
Disadvantages of MySQL
Query Language
Terminology
Data Definition Language (DDL)
Data Manipulation Language (DML)
Data Control Language (DCL)
Transaction Control Language (TCL)
Data Retrieval
SQL Constraints
Exercise 5.01: Building a Relational Database for the FashionMart Store
Data Modeling
Normalization
Dimensional Data Modeling
Performance Tuning and Best Practices
Activity 5.01: Managing the Inventory of an E-Commerce Website Using a MySQL Query
NoSQL Databases
Need for NoSQL
Consistency Availability Partitioning (CAP) Theorem
MongoDB
Advantages of MongoDB
Disadvantages of MongoDB
Query Language
Terminology
Exercise 5.02: Managing the Inventory of an E-Commerce Website Using a MongoDB Query
Data Modeling
Lack of Joins
Joins
Performance Tuning and Best Practices
Activity 5.02: Data Model to Capture User Information
Cassandra
Advantages of Cassandra
Disadvantages of Cassandra
Dealing with Denormalizations in Cassandra
Query Language
Terminology
Exercise 5.03: Managing Visitors of an E-Commerce Site Using Cassandra
Data Modeling
Column Family Design
Distributing Data Evenly across Clusters
Considering Write-Heavy Scenarios
Performance Tuning and Best Practices
Activity 5.03: Managing Customer Feedback Using Cassandra
Exploring the Collective Knowledge of Databases
Summary
6. Big Data File Formats
Introduction
Common Input Files
CSV – Comma-Separated Values
JSON – JavaScript Object Notation
Choosing the Right Format for Your Data
Orientation – Row-Based or Column-Based
Row-Based
Column-Based
Partitions
Schema Evolution
Compression
Introduction to File Formats
Parquet
Exercise 6.01: Converting CSV and JSON Files into the Parquet Format
Avro
Exercise 6.02: Converting CSV and JSON Files into the Avro Format
ORC
Exercise 6.03: Converting CSV and JSON Files into the ORC Format
Query Performance
Activity 6.01: Selecting an Appropriate Big Data File Format for Game Logs
Summary
7. Introduction to Analytics Engine (Spark) for Big Data
Introduction
Apache Spark
Fundamentals and Terminology
How Does Spark Work?
Apache Spark and Databricks
Exercise 7.01: Creating Your Databricks Notebook
Understanding Various Spark Transformations
Exercise 7.02: Applying Spark Transformations to Analyze the Temperature in California
Understanding Various Spark Actions
Spark Pipeline
Exercise 7.03: Applying Spark Actions to the Gettysburg Address
Activity 7.01: Exploring and Processing a Movie Locations Database Using Transformations and Actions
Best Practices
Summary
8. Data System Design Examples
Introduction
The Importance of System Design
Components to Consider in System Design
Features
Hardware
Data
Architecture
Security
Scaling
Examining a Pipeline Design for an AI System
Reproducibility – How Pipelines Can Help Us Keep Track of Each Component
Exercise 8.01: Designing an Automatic Trading System
Making a Pipeline System Highly Available
Exercise 8.02: Adding Queues to a System to Make It Highly Available
Activity 8.01: Building the Complete System with Pipelines and Queues
Summary
9. Workflow Management for AI
Introduction
Creating Your Data Pipeline
Exercise 9.01: Implementing a Linear Pipeline to Get the Top 10 Trending Videos
Exercise 9.02: Creating a Nonlinear Pipeline to Get the Daily Top 10 Trending Video Categories
Challenges in Managing Processes in the Real World
Automation
Failure Handling
Retry Mechanism
Exercise 9.03: Creating a Multi-Stage Data Pipeline
Automating a Data Pipeline
Exercise 9.04: Automating a Multi-Stage Data Pipeline Using a Bash Script
Automating Asynchronous Data Pipelines
Exercise 9.05: Automating an Asynchronous Data Pipeline
Workflow Management with Airflow
Exercise 9.06: Creating a DAG for Our Data Pipeline Using Airflow
Activity 9.01: Creating a DAG in Airflow to Calculate the Ratio of Likes-Dislikes for Each Category
Summary
10. Introduction to Data Storage on Cloud Services (AWS)
Introduction
Interacting with Cloud Storage
Exercise 10.01: Uploading a File to an AWS S3 Bucket Using AWS CLI
Exercise 10.02: Copying Data from One Bucket to Another Bucket
Exercise 10.03: Downloading Data from Your S3 Bucket
Exercise 10.04: Creating a Pipeline Using AWS SDK Boto3 and Uploading the Result to S3
Getting Started with Cloud Relational Databases
Exercise 10.05: Creating an AWS RDS Instance via the AWS Console
Exercise 10.06: Accessing and Managing the AWS RDS Instance
Introduction to NoSQL Data Stores on the Cloud
Key-Value Data Stores
Document Data Stores
Columnar Data Store
Graph Data Store
Data in Document Format
Activity 10.01: Transforming a Table Schema into Document Format and Uploading It to Cloud Storage
Summary
11. Building an Artificial Intelligence Algorithm
Introduction
Machine Learning Algorithms
Model Training
Closed-Form Solution
Non-Closed-Form Solutions
Gradient Descent
Exercise 11.01: Implementing a Gradient Descent Algorithm in NumPy
Getting Started with PyTorch
Exercise 11.02: Gradient Descent with PyTorch
Mini-Batch SGD with PyTorch
Exercise 11.03: Implementing Mini-Batch SGD with PyTorch
Building a Reinforcement Learning Algorithm to Play a Game
Exercise 11.04: Implementing a Deep Q-Learning Algorithm in PyTorch to Solve the Classic Cart Pole Problem
Activity 11.01: Implementing a Double Deep Q-Learning Algorithm to Solve the Cart Pole Problem
Summary
12. Productionizing Your AI Applications
Introduction
pickle and Flask
Exercise 12.01: Creating a Machine Learning Model API with pickle and Flask That Predicts Survivors of the Titanic
Activity 12.01: Predicting the Class of a Passenger on the Titanic
Deploying Models to Production
Docker
Kubernetes
Exercise 12.02: Deploying a Dockerized Machine Learning API to a Kubernetes Cluster
Activity 12.02: Deploying a Machine Learning Model to a Kubernetes Cluster to Predict the Class of Titanic Passengers
Model Execution in Streaming Data Applications
PMML
Apache Flink
Exercise 12.03: Exporting a Model to PMML and Loading it in the Flink Stream Processing Engine for Real-time Execution
Activity 12.03: Predicting the Class of Titanic Passengers in Real Time
Summary
Appendix
1. Data Storage Fundamentals
Activity 1.01: Creating a Text Classifier for Movie Reviews
2. Artificial Intelligence Storage Requirements
Activity 2.01: Requirements Engineering for a Data-Driven Application
3. Data Preparation
Activity 3.01: Using PySpark for a Simple ETL Job to Find Netflix Shows for All Ages
Activity 3.02: Counting the Words in a Twitter Data Stream to Determine the Trending Topics
4. Ethics of AI Data Storage
Activity 4.01: Finding More Latent Prejudices
5. Data Stores: SQL and NoSQL Databases
Activity 5.01: Managing the Inventory of an E-Commerce Website Using a MySQL Query
Activity 5.02: Data Model to Capture User Information
Activity 5.03: Managing Customer Feedback Using Cassandra
6. Big Data File Formats
Activity 6.01: Selecting an Appropriate Big Data File Format for Game Logs
7. Introduction to Analytics Engine (Spark) for Big Data
Activity 7.01: Exploring and Processing a Movie Locations Database by Using Spark's Transformations and Actions
8. Data System Design Examples
Activity 8.01: Building the Complete System with Pipelines and Queues
9. Workflow Management for AI
Activity 9.01: Creating a DAG in Airflow to Calculate the Ratio of Likes-Dislikes for Each Category
10. Introduction to Data Storage on Cloud Services (AWS)
Activity 10.01: Transforming a Table Schema into Document Format and Uploading It to Cloud Storage
11. Building an Artificial Intelligence Algorithm
Activity 11.01: Implementing a Double Deep Q-Learning Algorithm to Solve the Cart Pole Problem
12. Productionizing Your AI Applications
Activity 12.01: Predicting the Class of a Passenger on the Titanic
Activity 12.02: Deploying a Machine Learning Model to a Kubernetes Cluster to Predict the Class of Titanic Passengers
Activity 12.03: Predicting the Class of Titanic Passengers in Real Time