The Artificial Intelligence Infrastructure Workshop
Preface
About the Book
Audience
About the Chapters
Conventions
Code Presentation
Setting up Your Environment
Installing Anaconda
Installing Scikit-Learn
Installing gawk
Installing Apache Spark
Installing PySpark
Installing Tweepy
Installing spaCy
Installing MySQL
Installing MongoDB
Installing Cassandra
Installing Apache Spark and Scala
Installing Airflow
Installing AWS
Registering Your AWS Account
Creating an IAM Role for Programmatic AWS Access
Installing the AWS CLI
Installing an AWS Python SDK – Boto
Installing MySQL Client
Installing pytest
Installing Moto
Installing PyTorch
Installing Gym
Installing Docker
Kubernetes – Minikube
Installing Maven
Installing JDK
Installing Netcat
Installing Libraries
Accessing the Code Files
1. Data Storage Fundamentals
Introduction
Problems Solved by Machine Learning
Image Processing – Detecting Cancer in Mammograms with Computer Vision
Text and Language Processing – Google Translate
Audio Processing – Automatically Generated Subtitles
Time Series Analysis
Optimizing the Storing and Processing of Data for Machine Learning Problems
Diving into Text Classification
Looking at TF-IDF Vectorization
Looking at Terminology in Text Classification Tasks
Exercise 1.01: Training a Machine Learning Model to Identify Clickbait Headlines
Designing for Scale – Choosing the Right Architecture and Hardware
Optimizing Hardware – Processing Power, Volatile Memory, and Persistent Storage
Optimizing Volatile Memory
Optimizing Persistent Storage
Optimizing Cloud Costs – Spot Instances and Reserved Instances
Using Vectorized Operations to Analyze Data Fast
Exercise 1.02: Applying Vectorized Operations to Entire Matrices
Activity 1.01: Creating a Text Classifier for Movie Reviews
Summary
2. Artificial Intelligence Storage Requirements
Introduction
Storage Requirements
The Three Stages of Digital Data
Data Layers
From Data Warehouse to Data Lake
Exercise 2.01: Designing a Layered Architecture for an AI System
Requirements per Infrastructure Layer
Raw Data
Security
Basic Protection
The AIC Rating
Role-Based Access
Encryption
Exercise 2.02: Defining the Security Requirements for Storing Raw Data
Scalability
Time Travel
Retention
Metadata and Lineage
Historical Data
Security
Scalability
Availability
Exercise 2.03: Analyzing the Availability of a Data Store
Availability Consequences
Time Travel
Locality of Data
Metadata and Lineage
Streaming Data
Security
Performance
Availability
Retention
Exercise 2.04: Setting the Requirements for Data Retention
Analytics Data
Performance
Cost-Efficiency
Quality
Model Development and Training
Security
Availability
Retention
Activity 2.01: Requirements Engineering for a Data-Driven Application
Summary
3. Data Preparation
Introduction
ETL
Data Processing Techniques
Exercise 3.01: Creating a Simple ETL Bash Script
Traditional ETL with Dedicated Tooling
Distributed, Parallel Processing with Apache Spark
Exercise 3.02: Building an ETL Job Using Spark
Activity 3.01: Using PySpark for a Simple ETL Job to Find Netflix Shows for All Ages
Source to Raw: Importing Data from Source Systems
Raw to Historical: Cleaning Data
Raw to Historical: Modeling Data
Historical to Analytics: Filtering and Aggregating Data
Historical to Analytics: Flattening Data
Analytics to Model: Feature Engineering
Analytics to Model: Splitting Data
Streaming Data
Windows
Event Time
Late Events and Watermarks
Exercise 3.03: Streaming Data Processing with Spark
Activity 3.02: Counting the Words in a Twitter Data Stream to Determine the Trending Topics
Summary
4. The Ethics of AI Data Storage
Introduction
Case Study 1: Cambridge Analytica
Summary and Takeaways
Case Study 2: Amazon's AI Recruiting Tool
Imbalanced Training Sets
Summary and Takeaways
Case Study 3: COMPAS Software
Summary and Takeaways
Finding Built-In Bias in Machine Learning Models
Exercise 4.01: Observing Prejudices and Biases in Word Embeddings
Exercise 4.02: Testing Our Sentiment Classifier on Movie Reviews
Activity 4.01: Finding More Latent Prejudices
Summary
5. Data Stores: SQL and NoSQL Databases
Introduction
Database Components
SQL Databases
MySQL
Advantages of MySQL
Disadvantages of MySQL
Query Language
Terminology
Data Definition Language (DDL)
Data Manipulation Language (DML)
Data Control Language (DCL)
Transaction Control Language (TCL)
Data Retrieval
SQL Constraints
Exercise 5.01: Building a Relational Database for the FashionMart Store
Data Modeling
Normalization
Dimensional Data Modeling
Performance Tuning and Best Practices
Activity 5.01: Managing the Inventory of an E-Commerce Website Using a MySQL Query
NoSQL Databases
Need for NoSQL
Consistency Availability Partitioning (CAP) Theorem
MongoDB
Advantages of MongoDB
Disadvantages of MongoDB
Query Language
Terminology
Exercise 5.02: Managing the Inventory of an E-Commerce Website Using a MongoDB Query
Data Modeling
Lack of Joins
Joins
Performance Tuning and Best Practices
Activity 5.02: Data Model to Capture User Information
Cassandra
Advantages of Cassandra
Disadvantages of Cassandra
Dealing with Denormalizations in Cassandra
Query Language
Terminology
Exercise 5.03: Managing Visitors of an E-Commerce Site Using Cassandra
Data Modeling
Column Family Design
Distributing Data Evenly across Clusters
Considering Write-Heavy Scenarios
Performance Tuning and Best Practices
Activity 5.03: Managing Customer Feedback Using Cassandra
Exploring the Collective Knowledge of Databases
Summary
6. Big Data File Formats
Introduction
Common Input Files
CSV – Comma-Separated Values
JSON – JavaScript Object Notation
Choosing the Right Format for Your Data
Orientation – Row-Based or Column-Based
Row-Based
Column-Based
Partitions
Schema Evolution
Compression
Introduction to File Formats
Parquet
Exercise 6.01: Converting CSV and JSON Files into the Parquet Format
Avro
Exercise 6.02: Converting CSV and JSON Files into the Avro Format
ORC
Exercise 6.03: Converting CSV and JSON Files into the ORC Format
Query Performance
Activity 6.01: Selecting an Appropriate Big Data File Format for Game Logs
Summary
7. Introduction to Analytics Engine (Spark) for Big Data
Introduction
Apache Spark
Fundamentals and Terminology
How Does Spark Work?
Apache Spark and Databricks
Exercise 7.01: Creating Your Databricks Notebook
Understanding Various Spark Transformations
Exercise 7.02: Applying Spark Transformations to Analyze the Temperature in California
Understanding Various Spark Actions
Spark Pipeline
Exercise 7.03: Applying Spark Actions to the Gettysburg Address
Activity 7.01: Exploring and Processing a Movie Locations Database Using Transformations and Actions
Best Practices
Summary
8. Data System Design Examples
Introduction
The Importance of System Design
Components to Consider in System Design
Features
Hardware
Data
Architecture
Security
Scaling
Examining a Pipeline Design for an AI System
Reproducibility – How Pipelines Can Help Us Keep Track of Each Component
Exercise 8.01: Designing an Automatic Trading System
Making a Pipeline System Highly Available
Exercise 8.02: Adding Queues to a System to Make It Highly Available
Activity 8.01: Building the Complete System with Pipelines and Queues
Summary
9. Workflow Management for AI
Introduction
Creating Your Data Pipeline
Exercise 9.01: Implementing a Linear Pipeline to Get the Top 10 Trending Videos
Exercise 9.02: Creating a Nonlinear Pipeline to Get the Daily Top 10 Trending Video Categories
Challenges in Managing Processes in the Real World
Automation
Failure Handling
Retry Mechanism
Exercise 9.03: Creating a Multi-Stage Data Pipeline
Automating a Data Pipeline
Exercise 9.04: Automating a Multi-Stage Data Pipeline Using a Bash Script
Automating Asynchronous Data Pipelines
Exercise 9.05: Automating an Asynchronous Data Pipeline
Workflow Management with Airflow
Exercise 9.06: Creating a DAG for Our Data Pipeline Using Airflow
Activity 9.01: Creating a DAG in Airflow to Calculate the Ratio of Likes-Dislikes for Each Category
Summary
10. Introduction to Data Storage on Cloud Services (AWS)
Introduction
Interacting with Cloud Storage
Exercise 10.01: Uploading a File to an AWS S3 Bucket Using AWS CLI
Exercise 10.02: Copying Data from One Bucket to Another Bucket
Exercise 10.03: Downloading Data from Your S3 Bucket
Exercise 10.04: Creating a Pipeline Using AWS SDK Boto3 and Uploading the Result to S3
Getting Started with Cloud Relational Databases
Exercise 10.05: Creating an AWS RDS Instance via the AWS Console
Exercise 10.06: Accessing and Managing the AWS RDS Instance
Introduction to NoSQL Data Stores on the Cloud
Key-Value Data Stores
Document Data Stores
Columnar Data Store
Graph Data Store
Data in Document Format
Activity 10.01: Transforming a Table Schema into Document Format and Uploading It to Cloud Storage
Summary
11. Building an Artificial Intelligence Algorithm
Introduction
Machine Learning Algorithms
Model Training
Closed-Form Solution
Non-Closed-Form Solutions
Gradient Descent
Exercise 11.01: Implementing a Gradient Descent Algorithm in NumPy
Getting Started with PyTorch
Exercise 11.02: Gradient Descent with PyTorch
Mini-Batch SGD with PyTorch
Exercise 11.03: Implementing Mini-Batch SGD with PyTorch
Building a Reinforcement Learning Algorithm to Play a Game
Exercise 11.04: Implementing a Deep Q-Learning Algorithm in PyTorch to Solve the Classic Cart Pole Problem
Activity 11.01: Implementing a Double Deep Q-Learning Algorithm to Solve the Cart Pole Problem
Summary
12. Productionizing Your AI Applications
Introduction
pickle and Flask
Exercise 12.01: Creating a Machine Learning Model API with pickle and Flask That Predicts Survivors of the Titanic
Activity 12.01: Predicting the Class of a Passenger on the Titanic
Deploying Models to Production
Docker
Kubernetes
Exercise 12.02: Deploying a Dockerized Machine Learning API to a Kubernetes Cluster
Activity 12.02: Deploying a Machine Learning Model to a Kubernetes Cluster to Predict the Class of Titanic Passengers
Model Execution in Streaming Data Applications
PMML
Apache Flink
Exercise 12.03: Exporting a Model to PMML and Loading it in the Flink Stream Processing Engine for Real-time Execution
Activity 12.03: Predicting the Class of Titanic Passengers in Real Time
Summary
Appendix
1. Data Storage Fundamentals
Activity 1.01: Creating a Text Classifier for Movie Reviews
2. Artificial Intelligence Storage Requirements
Activity 2.01: Requirements Engineering for a Data-Driven Application
3. Data Preparation
Activity 3.01: Using PySpark for a Simple ETL Job to Find Netflix Shows for All Ages
Activity 3.02: Counting the Words in a Twitter Data Stream to Determine the Trending Topics
4. Ethics of AI Data Storage
Activity 4.01: Finding More Latent Prejudices
5. Data Stores: SQL and NoSQL Databases
Activity 5.01: Managing the Inventory of an E-Commerce Website Using a MySQL Query
Activity 5.02: Data Model to Capture User Information
Activity 5.03: Managing Customer Feedback Using Cassandra
6. Big Data File Formats
Activity 6.01: Selecting an Appropriate Big Data File Format for Game Logs
7. Introduction to Analytics Engine (Spark) for Big Data
Activity 7.01: Exploring and Processing a Movie Locations Database by Using Spark's Transformations and Actions
8. Data System Design Examples
Activity 8.01: Building the Complete System with Pipelines and Queues
9. Workflow Management for AI
Activity 9.01: Creating a DAG in Airflow to Calculate the Ratio of Likes-Dislikes for Each Category
10. Introduction to Data Storage on Cloud Services (AWS)
Activity 10.01: Transforming a Table Schema into Document Format and Uploading It to Cloud Storage
11. Building an Artificial Intelligence Algorithm
Activity 11.01: Implementing a Double Deep Q-Learning Algorithm to Solve the Cart Pole Problem
12. Productionizing Your AI Applications
Activity 12.01: Predicting the Class of a Passenger on the Titanic
Activity 12.02: Deploying a Machine Learning Model to a Kubernetes Cluster to Predict the Class of Titanic Passengers
Activity 12.03: Predicting the Class of Titanic Passengers in Real Time