Algorithms and Data Structures in Action MEAP V12 by Rocca, Marcello La -- Read -- Imperial Library of Trantor

Index

1.1 Why should you learn about Data Structures?

1.1.1 Algorithms and Data Structures: is there a difference?

1.2 Data Structures

1.2.1 Defining a Data Structure 1.2.2 Describing a Data Structure

1.3 Setting Goals: Your Expectations After Reading this Book 1.4 Packing your Knapsack: Data Structures Meet the Real World

1.4.1 Abstracting the Problem Away 1.4.2 Looking for Solutions 1.4.3 Algorithms to the Rescue 1.4.4 Thinking (literally) out of the Box 1.4.5 Happy Ending

1.5 Summary

2 Improve priority queues: d-way heaps

2.1 Structure of this chapter 2.2 The problem: handling priority

2.2.1 Priority in practice: bug tracking

2.3 Solutions at hand: keeping a sorted list

2.3.1 From sorted lists to priority queues

2.4 Describing the data structure API: Priority queues

2.4.1 Priority Queue at work 2.4.2 Priority Matters: Generalize FIFO

2.5 Concrete Data Structures

2.5.1 Comparing performance 2.5.2 What’s the right concrete data structure? 2.5.3 Heap 2.5.4 Priority, min-heap and max-heap 2.5.5 Advanced variant: d-ary heap

2.6 How to implement a heap

2.6.1 BubbleUp 2.6.2 PushDown 2.6.3 Insert 2.6.4 Top 2.6.5 Update 2.6.6 Dealing with Duplicates 2.6.7 Heapify 2.6.8 Beyond API methods: contains 2.6.9 Performance recap 2.6.10 From Pseudo-code to implementation

2.7 Use case: find the k largest elements

2.7.1 The right data structure… 2.7.2 … and the right use 2.7.3 Coding it up

2.8 More use-cases

2.8.1 Minimum distance in graphs: Dijkstra 2.8.2 More graphs: Prim's algorithm 2.8.3 Data Compression: Huffman Codes

2.9 Analysis of branching factor[20]

2.9.1 Do we need D-ary heaps? 2.9.2 Running time 2.9.3 Finding the optimal branching factor 2.9.4 Branching factor vs memory

2.10 Performance Analysis: Finding the best Branching Factor

2.10.1 Please Welcome Profiling 2.10.2 Interpreting Results 2.10.3 The Mystery with Heapify 2.10.4 Choosing the Best Branching Factor

2.11 Summary

3 Treaps: Using Randomization to Balance Binary Search Trees

3.1 Problem: Multi-Indexing

3.1.1 The Gist of the Solution

3.2 Solution: Description and API 3.3 Treap

3.3.1 Rotations 3.3.2 A Few Design Questions 3.3.3 Implementing Search 3.3.4 Insert 3.3.5 Delete 3.3.6 Top, Peak, Update 3.3.7 Min, Max 3.3.8 Performance Recap

3.4 Applications: Randomized Treaps

3.4.1 Balanced Trees 3.4.2 Introducing Randomization 3.4.3 Applications of Randomized Treaps

3.5 Performance Analysis and Profiling

3.5.1 Theory: Expected Height 3.5.2 Profiling Height 3.5.3 Profiling Running Time 3.5.4 Profiling Memory Usage 3.5.5 Conclusions

3.6 Summary

4 Bloom Filter: Reduce the memory needed to keep track of content

4.1 The Dictionary Problem: Keeping Track of Things 4.2 Alternatives to Implement a Dictionary 4.3 Describing the Data Structure API: Associative Array 4.4 Concrete Data Structures

4.4.1 Unsorted Array: Fast Insertion, Slow Search 4.4.2 Binary Search: Slow Insertion, Fast(-ish) Search 4.4.3 Hash Table: Constant-Time on Average, Unless You Need Ordering 4.4.4 Binary Search Tree: Every Operation is Logarithmic 4.4.5 Bloom Filter: as Fast as Hash Tables, But Saving Memory (with a catch)

4.5 Under the Hood - How Do Bloom Filters Work 4.6 Implementation

4.6.1 Using a Bloom Filter 4.6.2 Read and Write Bits 4.6.3 Find Where a Key is Stored 4.6.4 Generate Hash Functions 4.6.5 Constructor 4.6.6 Checking a Key 4.6.7 Storing a Key 4.6.8 Estimating Accuracy

4.7 Applications

4.7.1 Cache 4.7.2 Router 4.7.3 Crawler 4.7.4 IO Fetcher 4.7.5 Spell checker 4.7.6 Distributed Databases and File Systems

4.8 Why Bloom Filters Work[15]

4.8.1 Why There Are No False Negatives... 4.8.2 ...But there are false positives 4.8.3 Bloom Filters as Randomized Algorithms

4.9 Performance Analysis

4.9.1 Running Time 4.9.2 Constructor 4.9.3 Storing an Element 4.9.4 Looking Up an Element

4.10 Estimating Bloom Filter Precision[19]

4.10.1 Explanation of the False-Positive Ratio Formula

4.11 Improved Variants

4.11.1 Bloomier Filter 4.11.2 Combining Bloom Filters 4.11.3 Layered Bloom Filter 4.11.4 Compressed Bloom Filter 4.11.5 Scalable Bloom Filter

4.12 Summary

5 Disjoint Set: Sub-linear-time processing of disjoint sets

5.1 The Distinct Subsets Problem 5.2 Reasoning on Solutions 5.3 Describing the Data Structure API: Disjoint Set 5.4 Naïve Solution[4]

5.4.1 Implementing Naïve Solution

5.5 Using a Tree-like Structure[10]

5.5.1 From List to Trees 5.5.2 Implementing the Tree Version

5.6 Heuristics to Improve the Running Time[12]

5.6.1 Path Compression 5.6.2 Implementing Balancing and Path Compression

5.7 Applications

5.7.1 Graphs: Connected Components 5.7.2 Graphs[14]: Kruskal Algorithm for Minimum Spanning Tree 5.7.3 Clustering 5.7.4 Unification

5.8 Summary

6 Trie, Radix Trie: Efficient Strings Search

6.1 Spell Check

6.1.1 A Prncess, a Damon and an Elf Walkz into a Bar 6.1.2 Compression is the Key 6.1.3 Description and API

6.2 Trie

6.2.1 Why is it Better Again? 6.2.2 Search 6.2.3 Insert 6.2.4 Remove 6.2.5 Longest Prefix 6.2.6 Keys Matching a Prefix 6.2.7 When Should We Use Tries?

6.3 Radix Tries

6.3.1 Nodes and Edges 6.3.2 Search 6.3.3 Insert 6.3.4 Remove 6.3.5 Longest Common Prefix 6.3.6 Keys Starting with a Prefix

6.4 Applications

6.4.1 Spell Checker 6.4.2 Strings Similarity 6.4.3 String Sorting 6.4.4 T9 6.4.5 Autocomplete

6.5 Summary

7 Use case: LRU Cache

7.1 Don’t Compute Things Twice 7.2 First Attempt: Remembering Values

7.2.1 Description and API 7.2.2 Fresh Data, Please 7.2.3 Handling Asynchronous Calls 7.2.4 Marking Cache Values as “Loading”

7.3 Memory Is Not Enough (Literally) 7.4 Getting Rid of Stale Data: LRU cache

7.4.1 Sometimes you have to double down on problems 7.4.2 Temporal Ordering 7.4.3 Performance

7.5 When Fresher Data Is More Valuable: LFU

7.5.1 So How Do We Choose? 7.5.2 What Makes LFU Different 7.5.3 Performance 7.5.4 Problems With LFU

7.6 How to Use cache is Just as Important 7.7 Introducing Synchronization

7.7.1 Solving Concurrency (in Java) 7.7.2 Introducing Locks 7.7.3 Acquiring a Lock 7.7.4 Reentrant Locks 7.7.5 Read Locks 7.7.6 Other Approaches to Concurrency

7.8 Cache Applications 7.9 Summary

8 Nearest Neighbors search

8.1 The Nearest Neighbors Search Problem 8.2 Solutions

8.2.1 First Attempts 8.2.2 Sometimes Caching is NOT the Answer 8.2.3 Simplify Things to Get a Hint 8.2.4 Carefully Choose a Data Structure

8.3 Description and API 8.4 Moving to k-dimensional Spaces

8.4.1 Unidimensional Binary Search 8.4.2 Moving to Higher Dimensions 8.4.3 Modeling 2D Partitions with a Data Structure

8.5 Summary

9 K-d Trees: Multi-dimensional Data Indexing

9.1 Right Where We Left 9.2 Moving to k-D Spaces: Cycle Through Dimensions

9.2.1 Constructing the BST 9.2.2 Invariants 9.2.3 The Importance of Being Balanced

9.3 Methods

9.3.1 Search 9.3.2 Insert 9.3.3 Balanced Tree 9.3.4 Remove 9.3.5 Nearest Neighbor 9.3.6 Region Search 9.3.7 A Recap of all Methods

9.4 Limits and Possible Improvements 9.5 Summary

10 Similarity Search Trees: Approximate Nearest Neighbors Search for Image Retrieval

10.1 Right Where We Left

10.1.1 A New (More Complex) Example 10.1.2 Overcoming k-d trees Flaws

10.2 R-tree

10.2.1 A step back: Introducing B-trees 10.2.2 From B-Tree to R-tree 10.2.3 Inserting Points in an R-tree 10.2.4 Search

10.3 Similarity Search Tree

10.3.1 SS-tree Search 10.3.2 Insert 10.3.3 Insertion: Variance, Means, and Projections 10.3.4 Insertion: Split Nodes 10.3.5 Delete

10.4 Similarity Search

10.4.1 Nearest Neighbor Search 10.4.2 Region Search 10.4.3 Approximated Similarity Search

10.5 Ss+-Tree[17]

10.5.1 Are Ss-trees Better? 10.5.2 Mitigating Hyper-Spheres Limitations 10.5.3 Improved Split Heuristic 10.5.4 Reducing Overlap

10.6 Summary

11 Applications of Nearest Neighbors Search

11.1 An Application: Find Nearest Hub

11.1.1 Sketching a Solution 11.1.2 Trouble in Paradise

11.2 Centralized Application

11.2.1 Filtering Points 11.2.2 Complex Decisions

11.3 Moving to a Distributed Application

11.3.1 Issues Handling HTTP Communication 11.3.2 Keeping the Inventory in Sync 11.3.3 Lessons Learned

11.4 Other Applications

11.4.1 Color reduction 11.4.2 Particle Interaction 11.4.3 Multidimensional DB queries optimization 11.4.4 Clustering

11.5 Summary

12 Clustering

12.1 Intro to Clustering

12.1.1 Types of Learning 12.1.2 Types of Clustering

12.2 K-means

12.2.1 Issues With K-means 12.2.2 The Curse of Dimensionality Strikes Again 12.2.3 K-means Performance Analysis 12.2.4 Boosting k-means with k-d Trees 12.2.5 Final Remarks on k-means

12.3 DBSCAN

12.3.1 Directly vs Density-Reachable 12.3.2 From Definitions to an Algorithm 12.3.3 And Finally, an Implementation 12.3.4 PROs and CONs With DBSCAN

12.4 OPTICS

12.4.1 Definitions 12.4.2 OPTICS Algorithm 12.4.3 From Reachability Distance to Clustering 12.4.4 Hierarchical clustering 12.4.5 Performance Analysis and Final Considerations

12.5 Evaluating Clustering Results: Evaluation Metrics

12.5.1 Interpreting the Results

12.6 Summary

13 Parallel Clustering: Map-Reduce and Canopy Clustering

13.1 Parallelization

13.1.1 Parallel vs Distributed 13.1.2 Parallelizing k-means 13.1.3 Canopy Clustering 13.1.4 Applying Canopy Clustering

13.2 MapReduce

13.2.1 Imagine You Are Donald Duck… 13.2.2 First Map, Then Reduce 13.2.3 There is More, Under the Hood

13.3 MapReduce k-means

13.3.1 Parallelizing Canopy Clustering 13.3.2 Centroid Initialization with Canopy Clustering 13.3.3 MapReduce Canopy Clustering

13.4 MapReduce DBSCAN 13.5 Summary

14 An Introduction to Graphs: Finding Paths of Minimum Distance

14.1 Definitions

14.1.1 Implementing Graphs 14.1.2 Graphs as Algebraic Types 14.1.3 Pseudo code

14.2 Graph Properties

14.2.1 Undirected 14.2.2 Connected 14.2.3 Acyclic

14.3 Graph Traversal: BFS, DFS

14.3.1 Optimizing Delivery Routes 14.3.2 Breadth First Search 14.3.3 Reconstructing the Path to Target 14.3.4 Depth First Search 14.3.5 It’s Queue vs Stack Again 14.3.6 Best Route to Deliver a Parcel

14.4 Shortest Path in Weighted Graphs: Dijkstra

14.4.1 Differences with BFS 14.4.2 Implementation 14.4.3 Analysis 14.4.4 Shortest Route for Deliveries

14.5 Beyond Dijkstra: A*

14.5.1 How Good is A* Search? 14.5.2 Heuristics as a Way to Balance Real-time Data

14.6 Summary

15 Graph Embeddings and Planarity: Drawing Graphs with Minimal Edges Intersections

15.1 Graph Embeddings

15.1.1 Some Basic Definitions 15.1.2 Complete and Bipartite Graphs

15.2 Planar Graphs

15.2.1 Using Kuratowski’s Theorem in Practice 15.2.2 Planarity Testing 15.2.3 A Naïve Algorithm for Planarity Testing 15.2.4 Improving performance 15.2.5 Efficient Algorithms

15.3 Non-Planar Graphs

15.3.1 Finding the Crossing Number 15.3.2 Rectilinear Crossing Number

15.4 Edge Intersections

15.4.1 Straight-line Segments 15.4.2 Polylines 15.4.3 Bézier Curves 15.4.4 Intersections Between Quadratic Bézier Curves 15.4.5 Vertex-Vertex and Edge-Vertex Intersections

15.5 Summary

16 Gradient Descent: Optimization Problems (not just) on Graphs

16.1 Heuristics for Crossing Number

16.1.1 Did You Just Say Heuristics? 16.1.2 Extending to Curve-Line Edges

16.2 How Optimization Works

16.2.1 Cost Functions 16.2.2 Step Functions and Local Minima 16.2.3 Optimizing Random Sampling

16.3 Gradient Descent

16.3.1 The Math of Gradient Descent 16.3.2 Geometrical Interpretation 16.3.3 When is Gradient Descent Appliable? 16.3.4 Problems with Gradient Descent

16.4 Applications of Gradient Descent

16.4.1 An Example: Linear Regression

16.5 Gradient Descent for Graph Embedding

16.5.1 A Different Criterion 16.5.2 Implementation

16.6 Summary

17 Simulated Annealing Optimization Beyond Local Minima

17.1 Simulated Annealing

17.1.1 Sometimes You Need to Climb Up to Get to the Bottom 17.1.2 Implementation 17.1.3 Why Simulated Annealing Works 17.1.4 Short-range vs Long-range Transitions 17.1.5 Variants 17.1.6 Simulated Annealing vs Gradient Descent: Which One Should I Use?

17.2 Simulated Annealing + Traveling Salesman

17.2.1 Exact vs Approximated Solutions 17.2.2 Visualizing Cost 17.2.3 Pruning the Domain 17.2.4 State Transitions 17.2.5 Adjacent vs Random Swaps 17.2.6 Applications of TSP

17.3 Simulated Annealing and Graph Embedding

17.3.1 Minimum Edge Crossing 17.3.2 Force-directed Drawing

17.4 Summary

Appendix A. A Quick Guide to Pseudo-Code

A.1 Variables and Basics A.2 Arrays A.3 Conditional Instructions

A.3.1 Else-if A.3.2 Switch

A.4 Loops

A.4.1 For Loop A.4.2 While Loop A.4.3 Break and Continue

A.5 Blocks and Indent A.6 Functions

A.6.1 Overloading and default arguments A.6.2 Tuples A.6.3 Tuples and destructuring objects

Appendix B. Big-O Notation

B.1 Algorithms and Performance B.2 The RAM Model B.3 Order of Magnitude B.4 Notation B.5 Examples

Appendix C. Core Data Structures

C.1 Core Data Structures

C.1.1 Array C.1.2 Linked List C.1.3 Tree C.1.4 Starting drag operations

C.2 Hash Table

C.2.1 Storing key-value pairs C.2.2 Hashing C.2.3 Conflicts Resolution in Hashing C.2.4 Performance

C.3 Comparative Analysis of Core Data Structures

Appendix D. Containers as priority queues

D.1 Bag D.2 Stack D.3 Queue D.4 A Comparative Analysis for Containers

Appendix E. Recursion

E.1 Simple Recursion

E.1.1 Pitfalls E.1.2 Good Recursion E.1.3 Tail Recursion E.1.4 Mutual Recursion

Appendix F. Classification problems Randomized algorithms Metrics

F.1 Decision Problems F.2 Las Vegas Algorithms F.3 Monte Carlo Algorithms F.4 Classification Metrics

F.4.1 Accuracy F.4.2 Precision and Recall F.4.3 Other Metrics + Recap

Index

← Prev
Back
Next →

← Prev
Back
Next →