Data Engineering with Python
Why subscribe?
Contributors
About the author
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Section 1: Building Data Pipelines – Extract, Transform, and Load
Chapter 1: What is Data Engineering?
What data engineers do
Required skills and knowledge to be a data engineer
Data engineering versus data science
Data engineering tools
Programming languages
Databases
Data processing engines
Data pipelines
Summary
Chapter 2: Building Our Data Engineering Infrastructure
Installing and configuring Apache NiFi
A quick tour of NiFi
PostgreSQL driver
Installing and configuring Apache Airflow
Installing and configuring Elasticsearch
Installing and configuring Kibana
Installing and configuring PostgreSQL
Installing pgAdmin 4
A tour of pgAdmin 4
Summary
Chapter 3: Reading and Writing Files
Writing and reading files in Python
Writing and reading CSVs
Reading and writing CSVs using pandas DataFrames
Writing JSON with Python
Building data pipelines in Apache Airflow
Handling files using NiFi processors
Working with CSV in NiFi
Working with JSON in NiFi
Summary
Chapter 4: Working with Databases
Inserting and extracting relational data in Python
Inserting data into PostgreSQL
Inserting and extracting NoSQL database data in Python
Installing Elasticsearch
Inserting data into Elasticsearch
Building data pipelines in Apache Airflow
Setting up the Airflow boilerplate
Running the DAG
Handling databases with NiFi processors
Extracting data from PostgreSQL
Running the data pipeline
Summary
Chapter 5: Cleaning, Transforming, and Enriching Data
Performing exploratory data analysis in Python
Downloading the data
Basic data exploration
Handling common data issues using pandas
Dropping rows and columns
Creating and modifying columns
Enriching data
Cleaning data using Airflow
Summary
Chapter 6: Building a 311 Data Pipeline
Building the data pipeline
Mapping a data type
Triggering a pipeline
Querying SeeClickFix
Transforming the data for Elasticsearch
Getting every page
Backfilling data
Building a Kibana dashboard
Creating visualizations
Creating a dashboard
Summary
Section 2: Deploying Data Pipelines in Production
Chapter 7: Features of a Production Pipeline
Staging and validating data
Staging data
Validating data with Great Expectations
Building idempotent data pipelines
Building atomic data pipelines
Summary
Chapter 8: Version Control with the NiFi Registry
Installing and configuring the NiFi Registry
Installing the NiFi Registry
Configuring the NiFi Registry
Using the Registry in NiFi
Adding the Registry to NiFi
Versioning your data pipelines
Using git-persistence with the NiFi Registry
Summary
Chapter 9: Monitoring Data Pipelines
Monitoring NiFi using the GUI
Monitoring NiFi with the status bar
Monitoring NiFi with processors
Using Python with the NiFi REST API
Summary
Chapter 10: Deploying Data Pipelines
Finalizing your data pipelines for production
Backpressure
Improving processor groups
Using the NiFi variable registry
Deploying your data pipelines
Using the simplest strategy
Using the middle strategy
Using multiple registries
Summary
Chapter 11: Building a Production Data Pipeline
Creating a test and production environment
Creating the databases
Populating a data lake
Building a production data pipeline
Reading the data lake
Scanning the data lake
Inserting the data into staging
Querying the staging database
Validating the staging data
Inserting the data into the warehouse
Deploying a data pipeline in production
Summary
Section 3: Beyond Batch – Building Real-Time Data Pipelines
Chapter 12: Building a Kafka Cluster
Creating ZooKeeper and Kafka clusters
Downloading Kafka and setting up the environment
Configuring ZooKeeper and Kafka
Starting the ZooKeeper and Kafka clusters
Testing the Kafka cluster
Testing the cluster with messages
Summary
Chapter 13: Streaming Data with Apache Kafka
Understanding logs
Understanding how Kafka uses logs
Topics
Kafka producers and consumers
Building data pipelines with Kafka and NiFi
The Kafka producer
The Kafka consumer
Differentiating stream processing from batch processing
Producing and consuming with Python
Writing a Kafka producer in Python
Writing a Kafka consumer in Python
Summary
Chapter 14: Data Processing with Apache Spark
Installing and running Spark
Installing and configuring PySpark
Processing data with PySpark
Spark for data engineering
Summary
Chapter 15: Real-Time Edge Data with MiNiFi, Kafka, and Spark
Setting up MiNiFi
Building a MiNiFi task in NiFi
Summary
Appendix
Building a NiFi cluster
The basics of NiFi clustering
Building a NiFi cluster
Building a distributed data pipeline
Managing the distributed data pipeline
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think