Index
The Artificial Intelligence Infrastructure Workshop Preface
About the Book
Audience
About the Chapters
Conventions
Code Presentation
Setting up Your Environment
Installing Anaconda
Installing Scikit-Learn
Installing gawk
Installing Apache Spark
Installing PySpark
Installing Tweepy
Installing spaCy
Installing MySQL
Installing MongoDB
Installing Cassandra
Installing Apache Spark and Scala
Installing Airflow
Installing AWS
Registering Your AWS Account
Creating an IAM Role for Programmatic AWS Access
Installing the AWS CLI
Installing an AWS Python SDK – Boto
Installing MySQL Client
Installing pytest
Installing Moto
Installing PyTorch
Installing Gym
Installing Docker
Kubernetes – Minikube
Installing Maven
Installing JDK
Installing Netcat
Installing Libraries
Accessing the Code Files
1. Data Storage Fundamentals
Introduction
Problems Solved by Machine Learning
Image Processing – Detecting Cancer in Mammograms with Computer Vision
Text and Language Processing – Google Translate
Audio Processing – Automatically Generated Subtitles
Time Series Analysis
Optimizing the Storing and Processing of Data for Machine Learning Problems
Diving into Text Classification
Looking at TF-IDF Vectorization
Looking at Terminology in Text Classification Tasks
Exercise 1.01: Training a Machine Learning Model to Identify Clickbait Headlines
Designing for Scale – Choosing the Right Architecture and Hardware
Optimizing Hardware – Processing Power, Volatile Memory, and Persistent Storage
Optimizing Volatile Memory
Optimizing Persistent Storage
Optimizing Cloud Costs – Spot Instances and Reserved Instances
Using Vectorized Operations to Analyze Data Fast
Exercise 1.02: Applying Vectorized Operations to Entire Matrices
Activity 1.01: Creating a Text Classifier for Movie Reviews
Summary
2. Artificial Intelligence Storage Requirements
Introduction
Storage Requirements
The Three Stages of Digital Data
Data Layers
From Data Warehouse to Data Lake
Exercise 2.01: Designing a Layered Architecture for an AI System
Requirements per Infrastructure Layer
Raw Data
Security
Basic Protection
The AIC Rating
Role-Based Access
Encryption
Exercise 2.02: Defining the Security Requirements for Storing Raw Data
Scalability
Time Travel
Retention
Metadata and Lineage
Historical Data
Security
Scalability
Availability
Exercise 2.03: Analyzing the Availability of a Data Store
Availability Consequences
Time Travel
Locality of Data
Metadata and Lineage
Streaming Data
Security
Performance
Availability
Retention
Exercise 2.04: Setting the Requirements for Data Retention
Analytics Data
Performance
Cost-Efficiency
Quality
Model Development and Training
Security
Availability
Retention
Activity 2.01: Requirements Engineering for a Data-Driven Application
Summary
3. Data Preparation
Introduction
ETL
Data Processing Techniques
Exercise 3.01: Creating a Simple ETL Bash Script
Traditional ETL with Dedicated Tooling
Distributed, Parallel Processing with Apache Spark
Exercise 3.02: Building an ETL Job Using Spark
Activity 3.01: Using PySpark for a Simple ETL Job to Find Netflix Shows for All Ages
Source to Raw: Importing Data from Source Systems
Raw to Historical: Cleaning Data
Raw to Historical: Modeling Data
Historical to Analytics: Filtering and Aggregating Data
Historical to Analytics: Flattening Data
Analytics to Model: Feature Engineering
Analytics to Model: Splitting Data
Streaming Data
Windows
Event Time
Late Events and Watermarks
Exercise 3.03: Streaming Data Processing with Spark
Activity 3.02: Counting the Words in a Twitter Data Stream to Determine the Trending Topics
Summary
4. The Ethics of AI Data Storage
Introduction
Case Study 1: Cambridge Analytica
Summary and Takeaways
Case Study 2: Amazon's AI Recruiting Tool
Imbalanced Training Sets
Summary and Takeaways
Case Study 3: COMPAS Software
Summary and Takeaways
Finding Built-In Bias in Machine Learning Models
Exercise 4.01: Observing Prejudices and Biases in Word Embeddings
Exercise 4.02: Testing Our Sentiment Classifier on Movie Reviews
Activity 4.01: Finding More Latent Prejudices
Summary
5. Data Stores: SQL and NoSQL Databases
Introduction
Database Components
SQL Databases
MySQL
Advantages of MySQL
Disadvantages of MySQL
Query Language
Terminology
Data Definition Language (DDL)
Data Manipulation Language (DML)
Data Control Language (DCL)
Transaction Control Language (TCL)
Data Retrieval
SQL Constraints
Exercise 5.01: Building a Relational Database for the FashionMart Store
Data Modeling
Normalization
Dimensional Data Modeling
Performance Tuning and Best Practices
Activity 5.01: Managing the Inventory of an E-Commerce Website Using a MySQL Query
NoSQL Databases
Need for NoSQL
Consistency Availability Partitioning (CAP) Theorem
MongoDB
Advantages of MongoDB
Disadvantages of MongoDB
Query Language
Terminology
Exercise 5.02: Managing the Inventory of an E-Commerce Website Using a MongoDB Query
Data Modeling
Lack of Joins
Joins
Performance Tuning and Best Practices
Activity 5.02: Data Model to Capture User Information
Cassandra
Advantages of Cassandra
Disadvantages of Cassandra
Dealing with Denormalizations in Cassandra
Query Language
Terminology
Exercise 5.03: Managing Visitors of an E-Commerce Site Using Cassandra
Data Modeling
Column Family Design
Distributing Data Evenly across Clusters
Considering Write-Heavy Scenarios
Performance Tuning and Best Practices
Activity 5.03: Managing Customer Feedback Using Cassandra
Exploring the Collective Knowledge of Databases
Summary
6. Big Data File Formats
Introduction
Common Input Files
CSV – Comma-Separated Values
JSON – JavaScript Object Notation
Choosing the Right Format for Your Data
Orientation – Row-Based or Column-Based
Row-Based
Column-Based
Partitions
Schema Evolution
Compression
Introduction to File Formats
Parquet
Exercise 6.01: Converting CSV and JSON Files into the Parquet Format
Avro
Exercise 6.02: Converting CSV and JSON Files into the Avro Format
ORC
Exercise 6.03: Converting CSV and JSON Files into the ORC Format
Query Performance
Activity 6.01: Selecting an Appropriate Big Data File Format for Game Logs
Summary
7. Introduction to Analytics Engine (Spark) for Big Data
Introduction
Apache Spark
Fundamentals and Terminology
How Does Spark Work?
Apache Spark and Databricks
Exercise 7.01: Creating Your Databricks Notebook
Understanding Various Spark Transformations
Exercise 7.02: Applying Spark Transformations to Analyze the Temperature in California
Understanding Various Spark Actions
Spark Pipeline
Exercise 7.03: Applying Spark Actions to the Gettysburg Address
Activity 7.01: Exploring and Processing a Movie Locations Database Using Transformations and Actions
Best Practices
Summary
8. Data System Design Examples
Introduction
The Importance of System Design
Components to Consider in System Design
Features
Hardware
Data Architecture
Security
Scaling
Examining a Pipeline Design for an AI System
Reproducibility – How Pipelines Can Help Us Keep Track of Each Component
Exercise 8.01: Designing an Automatic Trading System
Making a Pipeline System Highly Available
Exercise 8.02: Adding Queues to a System to Make It Highly Available
Activity 8.01: Building the Complete System with Pipelines and Queues
Summary
9. Workflow Management for AI
Introduction
Creating Your Data Pipeline
Exercise 9.01: Implementing a Linear Pipeline to Get the Top 10 Trending Videos
Exercise 9.02: Creating a Nonlinear Pipeline to Get the Daily Top 10 Trending Video Categories
Challenges in Managing Processes in the Real World
Automation
Failure Handling
Retry Mechanism
Exercise 9.03: Creating a Multi-Stage Data Pipeline
Automating a Data Pipeline
Exercise 9.04: Automating a Multi-Stage Data Pipeline Using a Bash Script
Automating Asynchronous Data Pipelines
Exercise 9.05: Automating an Asynchronous Data Pipeline
Workflow Management with Airflow
Exercise 9.06: Creating a DAG for Our Data Pipeline Using Airflow
Activity 9.01: Creating a DAG in Airflow to Calculate the Ratio of Likes-Dislikes for Each Category
Summary
10. Introduction to Data Storage on Cloud Services (AWS)
Introduction
Interacting with Cloud Storage
Exercise 10.01: Uploading a File to an AWS S3 Bucket Using AWS CLI
Exercise 10.02: Copying Data from One Bucket to Another Bucket
Exercise 10.03: Downloading Data from Your S3 Bucket
Exercise 10.04: Creating a Pipeline Using AWS SDK Boto3 and Uploading the Result to S3
Getting Started with Cloud Relational Databases
Exercise 10.05: Creating an AWS RDS Instance via the AWS Console
Exercise 10.06: Accessing and Managing the AWS RDS Instance
Introduction to NoSQL Data Stores on the Cloud
Key-Value Data Stores
Document Data Stores
Columnar Data Store
Graph Data Store
Data in Document Format
Activity 10.01: Transforming a Table Schema into Document Format and Uploading It to Cloud Storage
Summary
11. Building an Artificial Intelligence Algorithm
Introduction
Machine Learning Algorithms
Model Training
Closed-Form Solution
Non-Closed-Form Solutions
Gradient Descent
Exercise 11.01: Implementing a Gradient Descent Algorithm in NumPy
Getting Started with PyTorch
Exercise 11.02: Gradient Descent with PyTorch
Mini-Batch SGD with PyTorch
Exercise 11.03: Implementing Mini-Batch SGD with PyTorch
Building a Reinforcement Learning Algorithm to Play a Game
Exercise 11.04: Implementing a Deep Q-Learning Algorithm in PyTorch to Solve the Classic Cart Pole Problem
Activity 11.01: Implementing a Double Deep Q-Learning Algorithm to Solve the Cart Pole Problem
Summary
12. Productionizing Your AI Applications
Introduction
pickle and Flask
Exercise 12.01: Creating a Machine Learning Model API with pickle and Flask That Predicts Survivors of the Titanic
Activity 12.01: Predicting the Class of a Passenger on the Titanic
Deploying Models to Production
Docker
Kubernetes
Exercise 12.02: Deploying a Dockerized Machine Learning API to a Kubernetes Cluster
Activity 12.02: Deploying a Machine Learning Model to a Kubernetes Cluster to Predict the Class of Titanic Passengers
Model Execution in Streaming Data Applications
PMML
Apache Flink
Exercise 12.03: Exporting a Model to PMML and Loading It in the Flink Stream Processing Engine for Real-Time Execution
Activity 12.03: Predicting the Class of Titanic Passengers in Real Time
Summary
Appendix
1. Data Storage Fundamentals
Activity 1.01: Creating a Text Classifier for Movie Reviews
2. Artificial Intelligence Storage Requirements
Activity 2.01: Requirements Engineering for a Data-Driven Application
3. Data Preparation
Activity 3.01: Using PySpark for a Simple ETL Job to Find Netflix Shows for All Ages
Activity 3.02: Counting the Words in a Twitter Data Stream to Determine the Trending Topics
4. The Ethics of AI Data Storage
Activity 4.01: Finding More Latent Prejudices
5. Data Stores: SQL and NoSQL Databases
Activity 5.01: Managing the Inventory of an E-Commerce Website Using a MySQL Query
Activity 5.02: Data Model to Capture User Information
Activity 5.03: Managing Customer Feedback Using Cassandra
6. Big Data File Formats
Activity 6.01: Selecting an Appropriate Big Data File Format for Game Logs
7. Introduction to Analytics Engine (Spark) for Big Data
Activity 7.01: Exploring and Processing a Movie Locations Database by Using Spark's Transformations and Actions
8. Data System Design Examples
Activity 8.01: Building the Complete System with Pipelines and Queues
9. Workflow Management for AI
Activity 9.01: Creating a DAG in Airflow to Calculate the Ratio of Likes-Dislikes for Each Category
10. Introduction to Data Storage on Cloud Services (AWS)
Activity 10.01: Transforming a Table Schema into Document Format and Uploading It to Cloud Storage
11. Building an Artificial Intelligence Algorithm
Activity 11.01: Implementing a Double Deep Q-Learning Algorithm to Solve the Cart Pole Problem
12. Productionizing Your AI Applications
Activity 12.01: Predicting the Class of a Passenger on the Titanic
Activity 12.02: Deploying a Machine Learning Model to a Kubernetes Cluster to Predict the Class of Titanic Passengers
Activity 12.03: Predicting the Class of Titanic Passengers in Real Time