Principles of Database Management · the Practical Guide to Storing, Managing and Analyzing Big and Small Data by Lemahieu, Wilfried -- Read -- Imperial Library of Trantor

Index

Half title Reviews Title page Imprints page Brief Contents Contents About the Authors Preface

Who This Book is For Topics Covered in this Book How to Read this Book Cross-Chapter Case Study: Sober Additional Material Acknowledgments

Part I Databases and Database Design

Contents 1 Fundamental Concepts of Database Management

1.1 Applications of Database Technology 1.2 Key Definitions 1.3 File versus Database Approach to Data Management

1.3.1 The File-Based Approach 1.3.2 The Database Approach

1.4 Elements of a Database System

1.4.1 Database Model versus Instances 1.4.2 Data Model 1.4.3 The Three-Layer Architecture 1.4.4 Catalog 1.4.5 Database Users 1.4.6 Database Languages

1.5 Advantages of Database Systems and Database Management

1.5.1 Data Independence 1.5.2 Database Modeling 1.5.3 Managing Structured, Semi-Structured, and Unstructured Data 1.5.4 Managing Data Redundancy 1.5.5 Specifying Integrity Rules 1.5.6 Concurrency Control 1.5.7 Backup and Recovery Facilities 1.5.8 Data Security 1.5.9 Performance Utilities

Summary Problems and Exercises

2 Architecture and Categorization of DBMSs

2.1 Architecture of a DBMS

2.1.1 Connection and Security Manager 2.1.2 DDL Compiler 2.1.3 Query Processor

2.1.3.1 DML Compiler 2.1.3.2 Query Parser and Query Rewriter 2.1.3.3 Query Optimizer 2.1.3.4 Query Executor

2.1.4 Storage Manager

2.1.4.1 Transaction Manager 2.1.4.2 Buffer Manager 2.1.4.3 Lock Manager 2.1.4.4 Recovery Manager

2.1.5 DBMS Utilities 2.1.6 DBMS Interfaces

2.2 Categorization of DBMSs

2.2.1 Categorization Based on Data Model

2.2.1.1 Hierarchical DBMSs 2.2.1.2 Network DBMSs 2.2.1.3 Relational DBMSs 2.2.1.4 Object-Oriented DBMSs 2.2.1.5 Object-Relational/Extended Relational DBMSs 2.2.1.6 XML DBMSs 2.2.1.7 NoSQL DBMSs

2.2.2 Categorization Based on Degree of Simultaneous Access 2.2.3 Categorization Based on Architecture 2.2.4 Categorization Based on Usage

Summary Problems and Exercises

3 Conceptual Data Modeling Using the (E)ER Model and UML Class Diagram

3.1 Phases of Database Design 3.2 The Entity Relationship Model

3.2.1 Entity Types 3.2.2 Attribute Types

3.2.3.1 Domains 3.2.3.2 Key Attribute Types 3.2.3.3 Simple versus Composite Attribute Types 3.2.3.4 Single-Valued versus Multi-Valued Attribute Types 3.2.3.5 Derived Attribute Type

3.2.4 Relationship Types

3.2.4.1 Degree and Roles 3.2.4.2 Cardinalities 3.2.4.3 Relationship Attribute Types

3.2.5 Weak Entity Types 3.2.6 Ternary Relationship Types 3.2.7 Examples of the ER Model 3.2.8 Limitations of the ER Model

3.3 The Enhanced Entity Relationship (EER) Model

3.3.1 Specialization/Generalization 3.3.2 Categorization 3.3.3 Aggregation 3.3.4 Examples of the EER Model 3.3.5 Designing an EER Model

3.4 The UML Class Diagram

3.4.1 Recap of Object Orientation 3.4.2 Classes 3.4.3 Variables 3.4.4 Access Modifiers 3.4.5 Associations

3.4.5.1 Association Class 3.4.5.2 Unidirectional versus Bidirectional Association 3.4.5.3 Qualified Association

3.4.6 Specialization/Generalization 3.4.7 Aggregation 3.4.8 UML Example 3.4.9 Advanced UML Modeling Concepts

3.4.9.1 Changeability Property 3.4.9.2 Object Constraint Language (OCL) 3.4.9.3 Dependency Relationship

3.4.10 UML versus EER

Summary Problems and Exercises

4 Organizational Aspects of Data Management

4.1 Data Management

4.1.1 Catalogs and the Role of Metadata 4.1.2 Metadata Modeling 4.1.3 Data Quality

4.1.3.1 Data Quality Dimensions

Accuracy Completeness Consistency Accessibility

4.1.3.2 Data Quality Problems

4.1.4 Data Governance

4.2 Roles in Data Management

4.2.1 Information Architect 4.2.2 Database Designer 4.2.3 Data Owner 4.2.4 Data Steward 4.2.5 Database Administrator 4.2.6 Data Scientist

Summary Problems and Exercises

Part II Types of Database Systems

Contents 5 Legacy Databases

5.1 The Hierarchical Model 5.2 The CODASYL Model Summary Problems and Exercises

6 Relational Databases

6.1 The Relational Model

6.1.1 Basic Concepts 6.1.2 Formal Definitions 6.1.3 Types of Keys

6.1.3.1 Superkeys and Keys 6.1.3.2 Candidate Keys, Primary Keys, and Alternative Keys 6.1.3.3 Foreign Keys

6.1.4 Relational Constraints 6.1.5 Example Relational Data Model

6.2 Normalization

6.2.1 Insertion, Deletion, and Update Anomalies in an Unnormalized Relational Model 6.2.2 Informal Normalization Guidelines 6.2.3 Functional Dependencies and Prime Attribute Types 6.2.4 Normalization Forms

6.2.4.1 First Normal Form (1 NF) 6.2.4.2 Second Normal Form (2 NF) 6.2.4.3 Third Normal Form (3 NF) 6.2.4.4 Boyce–Codd Normal Form (BCNF) 6.2.4.5 Fourth Normal Form (4 NF)

6.3 Mapping a Conceptual ER Model to a Relational Model

6.3.1 Mapping Entity Types 6.3.2 Mapping Relationship Types

6.3.2.1 Mapping a Binary 1:1 Relationship type 6.3.2.2 Mapping a Binary 1:N Relationship Type 6.3.2.3 Mapping a Binary M:N Relationship Type 6.3.2.4 Mapping Unary Relationship Types 6.3.2.5 Mapping n-ary Relationship Types

6.3.3 Mapping Multi-Valued Attribute Types 6.3.4 Mapping Weak Entity Types 6.3.5 Putting it All Together

6.4 Mapping a Conceptual EER Model to a Relational Model

6.4.1 Mapping an EER Specialization 6.4.2 Mapping an EER Categorization 6.4.3 Mapping an EER Aggregation

Summary Problems and Exercises

7 Relational Databases

7.1 Relational Database Management Systems and SQL

7.1.1 Key Characteristics of SQL 7.1.2 Three-Layer Database Architecture

7.2 SQL Data Definition Language

7.2.1 Key DDL Concepts 7.2.2 DDL Example 7.2.3 Referential Integrity Constraints 7.2.4 DROP and ALTER Command

7.3 SQL Data Manipulation Language

7.3.1 SQL SELECT Statement

7.3.1.1 Simple Queries 7.3.1.2 Queries with Aggregate Functions 7.3.1.3 Queries with GROUP BY/HAVING 7.3.1.4 Queries with ORDER BY 7.3.1.5 Join Queries

Inner Joins Outer Joins

7.3.1.6 Nested Queries 7.3.1.7 Correlated Queries 7.3.1.8 Queries with ALL/ANY 7.3.1.9 Queries with EXISTS 7.3.1.10 Queries with Subqueries in SELECT/FROM 7.3.1.11 Queries with Set Operations

7.3.2 SQL INSERT Statement 7.3.3 SQL DELETE Statement 7.3.4 SQL UPDATE Statement

7.4 SQL Views 7.5 SQL Indexes 7.6 SQL Privileges 7.7 SQL for Metadata Management Summary Problems and Exercises

8 Object-Oriented Databases and Object Persistence

8.1 Recap: Basic Concepts of OO 8.2 Advanced Concepts of OO

8.2.1 Method Overloading 8.2.2 Inheritance 8.2.3 Method Overriding 8.2.4 Polymorphism and Dynamic Binding

8.3 Basic Principles of Object Persistence

8.3.1 Serialization

8.4 OODBMS

8.4.1 Object Identifiers 8.4.2 ODMG Standard 8.4.3 Object Model 8.4.4 Object Definition Language (ODL) 8.4.5 Object Query Language (OQL)

8.4.5.1 Simple OQL Queries 8.4.5.2 SELECT FROM WHERE OQL Queries 8.4.5.3 Join OQL Queries 8.4.5.4 Other OQL Queries

8.4.6 Language Bindings

8.5 Evaluating OODBMSs Summary Problems and Exercises

9 Extended Relational Databases

9.1 Limitations of the Relational Model 9.2 Active RDBMS Extensions

9.2.1 Triggers 9.2.2 Stored Procedures

9.3 Object-Relational RDBMS Extensions

9.3.1 User-Defined Types

9.3.1.1 Distinct Data Types 9.3.1.2 Opaque Data Types 9.3.1.3 Unnamed Row Types 9.3.1.4 Named Row Types 9.3.1.5 Table Data Types

9.3.2 User-Defined Functions 9.3.3 Inheritance

9.3.3.1 Inheritance at Data Type Level 9.3.3.2 Inheritance at Table Type Level

9.3.4 Behavior 9.3.5 Polymorphism 9.3.6 Collection Types 9.3.7 Large Objects

9.4 Recursive SQL Queries Summary Problems and Exercises

10 XML Databases

10.1 Extensible Markup Language

10.1.1 Basic Concepts 10.1.2 Document Type Definition and XML Schema Definition 10.1.3 Extensible Stylesheet Language 10.1.4 Namespaces 10.1.5 XPath

10.2 Processing XML Documents 10.3 Storage of XML Documents

10.3.1 The Document-Oriented Approach for Storing XML Documents 10.3.2 The Data-Oriented Approach for Storing XML Documents 10.3.3 The Combined Approach for Storing XML Documents

10.4 Differences Between XML Data and Relational Data 10.5 Mappings Between XML Documents and (Object-) Relational Data

10.5.1 Table-Based Mapping 10.5.2 Schema-Oblivious Mapping 10.5.3 Schema-Aware Mapping 10.5.4 SQL/XML

10.6 Searching XML Data

10.6.1 Full-Text Search 10.6.2 Keyword-Based Search 10.6.3 Structured Search With XQuery 10.6.4 Semantic Search With RDF and SPARQL

10.7 XML for Information Exchange

10.7.1 Message-Oriented Middleware 10.7.2 SOAP-Based Web Services 10.7.3 REST-Based Web Services 10.7.4 Web Services and Databases

10.8 Other Data Representation Formats Summary Problems and Exercises

11 NoSQL Databases

11.1 The NoSQL Movement

11.1.1 The End of the “One Size Fits All” Era? 11.1.2 The Emergence of the NoSQL Movement

11.2 Key–Value Stores

11.2.1 From Keys to Hashes 11.2.2 Horizontal Scaling 11.2.3 An Example: Memcached 11.2.4 Request Coordination 11.2.5 Consistent Hashing 11.2.6 Replication and Redundancy 11.2.7 Eventual Consistency 11.2.8 Stabilization 11.2.9 Integrity Constraints and Querying

11.3 Tuple and Document Stores

11.3.1 Items with Keys 11.3.2 Filters and Queries 11.3.3 Complex Queries and Aggregation with MapReduce 11.3.4 SQL After All…

11.4 Column-Oriented Databases 11.5 Graph-Based Databases

11.5.1 Cypher Overview 11.5.2 Exploring a Social Graph

11.6 Other NoSQL Categories Summary Problems and Exercises

Part III Physical Data Storage, Transaction Management, and Database Access

Contents 12 Physical File Organization and Indexing

12.1 Storage Hardware and Physical Database Design

12.1.1 The Storage Hierarchy 12.1.2 Internals of Hard Disk Drives 12.1.3 From Logical Concepts to Physical Constructs

12.2 Record Organization 12.3 File Organization

12.3.1 Introductory Concepts: Search Keys, Primary, and Secondary File Organization 12.3.2 Heap File Organization 12.3.3 Sequential File Organization 12.3.4 Random File Organization (Hashing)

12.3.4.1 Key-to-Address Transformation 12.3.4.2 Factors that Determine the Efficiency of Random File Organization

12.3.5 Indexed Sequential File Organization

12.3.5.1 Basic Terminology of Indexes 12.3.5.2 Primary Indexes 12.3.5.3 Clustered Indexes 12.3.5.4 Multilevel Indexes

12.3.6 List Data Organization (Linear and Nonlinear Lists)

12.3.6.1 Linear Lists 12.3.6.2 Tree Data Structures

12.3.7 Secondary Indexes and Inverted Files

12.3.7.1 Characteristics of Secondary Indexes 12.3.7.2 Inverted Files 12.3.7.3 Multicolumn Indexes 12.3.7.4 Other Index Types

12.3.8 B-Trees and B+-Trees

12.3.8.1 Multilevel Indexes Revisited 12.3.8.2 Binary Search Trees 12.3.8.3 B-Trees 12.3.8.4 B+-Trees

Summary Problems and Exercises

13 Physical Database Organization

13.1 Physical Database Organization and Database Access Methods

13.1.1 From Database to Tablespace 13.1.2 Index Design 13.1.3 Database Access Methods

13.1.3.1 Functioning of the Query Optimizer 13.1.3.2 Index Search (with Atomic Search Key) 13.1.3.3 Multiple Index and Multicolumn Index Search 13.1.3.4 Index-Only Access 13.1.3.5 Full Table Scan

13.1.4 Join Implementations

13.1.4.1 Nested-Loop Join 13.1.4.2 Sort-Merge Join 13.1.4.3 Hash Join

13.2 Enterprise Storage Subsystems and Business Continuity

13.2.1 Disk Arrays and RAID 13.2.2 Enterprise Storage Subsystems

13.2.2.1 Overview and Classification 13.2.2.2 DAS (Directly Attached Storage) 13.2.2.3 SAN (Storage Area Network) 13.2.2.4 NAS (Network Attached Storage) 13.2.2.5 NAS Gateway 13.2.2.6 iSCSI/Storage Over IP

13.2.3 Business Continuity

13.2.3.1 Contingency Planning, Recovery Point, and Recovery Time 13.2.3.2 Availability and Accessibility of Storage Devices 13.2.3.3 Availability of Database Functionality 13.2.3.4 Data Availability

Summary Problems and Exercises

14 Basics of Transaction Management

14.1 Transactions, Recovery, and Concurrency Control 14.2 Transactions and Transaction Management

14.2.1 Delineating Transactions and the Transaction Lifecycle 14.2.2 DBMS Components Involved in Transaction Management 14.2.3 The Logfile

14.3 Recovery

14.3.1 Types of Failures 14.3.2 System Recovery 14.3.3 Media Recovery

14.4 Concurrency Control

14.4.1 Typical Concurrency Problems

14.4.1.1 Lost Update Problem 14.4.1.2 Uncommitted Dependency Problem (aka Dirty Read Problem) 14.4.1.3 Inconsistent Analysis Problem 14.4.1.4 Other Concurrency-Related Problems

14.4.2 Schedules and Serial Schedules 14.4.3 Serializable Schedules 14.4.4 Optimistic and Pessimistic Schedulers 14.4.5 Locking and Locking Protocols

14.4.5.1 Purposes of Locking 14.4.5.2 The Two-Phase Locking Protocol (2PL) 14.4.5.3 Cascading Rollbacks 14.4.5.4 Dealing with Deadlocks 14.4.5.5 Isolation Levels 14.4.5.6 Lock Granularity

14.5 The ACID Properties of Transactions Summary Problems and Exercises

15 Accessing Databases and Database APIs

15.1 Database System Architectures

15.1.1 Centralized System Architectures 15.1.2 Tiered System Architectures

15.2 Classification of Database APIs

15.2.1 Proprietary versus Universal APIs 15.2.2 Embedded versus Call-Level APIs 15.2.3 Early Binding versus Late Binding

15.3 Universal Database APIs

15.3.1 ODBC 15.3.2 OLE DB and ADO 15.3.3 ADO.NET 15.3.4 Java DataBase Connectivity (JDBC) 15.3.5 Intermezzo: SQL Injection and Access Security 15.3.6 SQLJ 15.3.7 Intermezzo: Embedded APIs versus Embedded DBMSs 15.3.8 Language-Integrated Querying

15.4 Object Persistence and Object-Relational Mapping APIs

15.4.1 Object Persistence with Enterprise JavaBeans 15.4.2 Object Persistence with the Java Persistence API 15.4.3 Object Persistence with Java Data Objects 15.4.4 Object Persistence in Other Host Languages

15.5 Database API Summary 15.6 Database Access in the World Wide Web

15.6.1 Introduction: the Original Web Server 15.6.2 The Common Gateway Interface: Toward Dynamic Web Pages 15.6.3 Client-Side Scripting: The Desire for a Richer Web 15.6.4 JavaScript as a Platform 15.6.5 DBMSs Adapt: REST, Other Web Services, and a Look Ahead

Summary Problems and Exercises

16 Data Distribution and Distributed Transaction Management

16.1 Distributed Systems and Distributed Databases 16.2 Architectural Implications of Distributed Databases 16.3 Fragmentation, Allocation, and Replication

16.3.1 Vertical Fragmentation 16.3.2 Horizontal Fragmentation (Sharding) 16.3.3 Mixed Fragmentation 16.3.4 Replication 16.3.5 Distribution and Replication of Metadata

16.4 Transparency 16.5 Distributed Query Processing 16.6 Distributed Transaction Management and Concurrency Control

16.6.1 Primary Site and Primary Copy 2PL 16.6.2 Distributed 2PL 16.6.3 The Two-Phase Commit Protocol (2PC) 16.6.4 Optimistic Concurrency and Loosely Coupled Systems 16.6.5 Compensation-Based Transaction Models

16.7 Eventual Consistency and BASE Transactions

16.7.1 Horizontal Fragmentation and Consistent Hashing 16.7.2 The CAP Theorem 16.7.3 BASE Transactions 16.7.4 Multi-Version Concurrency Control and Vector Clocks 16.7.5 Quorum-Based Consistency

Summary Problems and Exercises

Part IV Data Warehousing, Data Governance, and (Big) Data Analytics

Contents 17 Data Warehousing and Business Intelligence

17.1 Operational versus Tactical/Strategic Decision-Making 17.2 Data Warehouse Definition 17.3 Data Warehouse Schemas

17.3.1 Star Schema 17.3.2 Snowflake Schema 17.3.3 Fact Constellation 17.3.4 Specific Schema Issues

17.3.4.1 Surrogate Keys 17.3.4.2 Granularity of the Fact Table 17.3.4.3 Factless Fact Tables 17.3.4.4 Optimizing the Dimension Tables 17.3.4.5 Defining Junk Dimensions 17.3.4.6 Defining Outrigger Tables 17.3.4.7 Slowly Changing Dimensions 17.3.4.8 Rapidly Changing Dimensions

17.4 The Extraction, Transformation, and Loading (ETL) Process 17.5 Data Marts 17.6 Virtual Data Warehouses and Virtual Data Marts 17.7 Operational Data Store 17.8 Data Warehouses versus Data Lakes 17.9 Business Intelligence

17.9.1 Query and Reporting 17.9.2 Pivot Tables 17.9.3 On-Line Analytical Processing (OLAP)

17.9.3.1 MOLAP 17.9.3.2 ROLAP 17.9.3.3 HOLAP 17.9.3.4 OLAP Operators 17.9.3.5 OLAP Queries in SQL

Summary Problems and Exercises

18 Data Integration, Data Quality, and Data Governance

18.1 Data and Process Integration

18.1.1 Convergence of Analytical and Operational Data Needs 18.1.2 Data Integration and Data Integration Patterns

18.1.2.1 Data Consolidation: Extract, Transform, Load (ETL) 18.1.2.2 Data Federation: Enterprise Information Integration (EII) 18.1.2.3 Data Propagation: Enterprise Application Integration (EAI) 18.1.2.4 Data Propagation: Enterprise Data Replication (EDR) 18.1.2.5 Changed Data Capture (CDC), Near-Real-Time ETL, and Event Processing 18.1.2.6 Data Virtualization 18.1.2.7 Data as a Service and Data in the Cloud

18.1.3 Data Services and Data Flows in the Context of Data and Process Integration

18.1.3.1 Business Process Integration 18.1.3.2 Patterns for Managing Sequence Dependencies and Data Dependencies in Processes 18.1.3.3 A Unified View on Data and Process Integration

18.2 Searching Unstructured Data and Enterprise Search

18.2.1 Principles of Full-Text Search 18.2.2 Indexing Full-Text Documents 18.2.3 Web Search Engines 18.2.4 Enterprise Search

18.3 Data Quality and Master Data Management 18.4 Data Governance

18.4.1 Total Data Quality Management (TDQM) 18.4.2 Capability Maturity Model Integration (CMMI) 18.4.3 Data Management Body of Knowledge (DMBOK) 18.4.4 Control Objectives for Information and Related Technology (COBIT) 18.4.5 Information Technology Infrastructure Library

18.5 Outlook 18.6 Conclusion Problems and Exercises

19 Big Data

19.1 The 5 Vs of Big Data 19.2 Hadoop

19.2.1 History of Hadoop 19.2.2 The Hadoop Stack

19.2.2.1 The Hadoop Distributed File System 19.2.2.2 MapReduce 19.2.2.3 Yet Another Resource Negotiator

19.3 SQL on Hadoop

19.3.1 HBase: The First Database on Hadoop 19.3.2 Pig 19.3.3 Hive

19.4 Apache Spark

19.4.1 Spark Core 19.4.2 Spark SQL 19.4.3 MLlib, Spark Streaming, and GraphX

19.5 Conclusion Problems and Exercises

20 Analytics

20.1 The Analytics Process Model 20.2 Example Analytics Applications 20.3 Data Scientist Job Profile 20.4 Data Pre-Processing

20.4.1 Denormalizing Data for Analysis 20.4.2 Sampling 20.4.3 Exploratory Analysis 20.4.4 Missing Values 20.4.5 Outlier Detection and Handling

20.5 Types of Analytics

20.5.1 Predictive Analytics

20.5.1.1 Linear Regression 20.5.1.2 Logistic Regression

Logistic Regression Properties

20.5.1.3 Decision Trees

Splitting Decision Stopping Decision Decision Tree Properties Regression Trees

20.5.1.4 Other Predictive Analytics Techniques

20.5.2 Evaluating Predictive Models

20.5.2.1 Splitting Up the Dataset 20.5.2.2 Performance Measures for Classification Models 20.5.2.3 Performance Measures for Regression Models 20.5.2.4 Other Performance Measures for Predictive Analytical Models

20.5.3 Descriptive Analytics

20.5.3.1 Association Rules

Basic Setting Support, Confidence, and Lift Post-Processing Association Rules

20.5.3.2 Sequence Rules 20.5.3.3 Clustering

Hierarchical Clustering K-means Clustering

20.5.4 Social Network Analytics

20.5.4.1 Social Network Definitions 20.5.4.2 Social Network Metrics 20.5.4.3 Social Network Learning

20.6 Post-Processing of Analytical Models 20.7 Critical Success Factors for Analytical Models 20.8 Economic Perspective on Analytics

20.8.1 Total Cost of Ownership (TCO) 20.8.2 Return on Investment 20.8.3 In- versus Outsourcing 20.8.4 On-Premises versus Cloud Solutions 20.8.5 Open-Source versus Commercial Software

20.9 Improving the ROI of Analytics

20.9.1 New Sources of Data 20.9.2 Data Quality 20.9.3 Management Support 20.9.4 Organizational Aspects 20.9.5 Cross-Fertilization

20.10 Privacy and Security

20.10.1 Overall Considerations Regarding Privacy and Security 20.10.2 The RACI Matrix 20.10.3 Accessing Internal Data

20.10.3.1 Anonymization 20.10.3.2 SQL Views 20.10.3.3 Label-Based Access Control

20.10.4 Privacy Regulation

20.11 Conclusion Problems and Exercises

Appendix Using the Online Environment

How to Access the Online Environment Environment: Relational Databases and SQL Environment: MongoDB Environment: Neo4j and Cypher Environment: Tree Structure Visualizations Environment: HBase

Glossary Index Endorsements

← Prev
Back
Next →

← Prev
Back
Next →