Chapter 8

  1. The BSP computer is an abstract computer model made up of a collection of potentially heterogeneous processors that are interconnected via a computer network. Processors can not only access their own local memory, but they can also use the network link to exchange data with other processors. In other words, the BSP computer is effectively a distributed memory computer that can perform computations in parallel.
  2. The Single Program Multiple Data (SPMD) technique models distributed data processing tasks as a self-contained piece of software that runs on a single-core machine. The program receives a set of data as input, applies a processing function to it, and emits some output. Parallelism is then achieved by splitting the dataset into batches, launching multiple instances of the same program to process each batch in parallel, and combining the results.
  3. A super-step is broken down into two phases, or sub-steps:
    • A compute step, where each processor executes (in parallel) a single iteration of the user's program using the data that was assigned to the processor as input.
    • A communication step that runs after all the processors complete the compute step. During this step, processors communicate through the network and compare, exchange, or aggregate the results of their individual computations.

  1. The following block of code demonstrates how we can create an aggregator to keep track of the minimum int64 value we've seen so far. The use of an int64 pointer allows us to detect whether any value has been seen so far (otherwise, the pointer will be nil) and if so, the minimum value that's been seen by the Aggregate method. Atomic access to the int64 value is enforced via the use of sync.Mutex:
  1. Under the random surfer model, a user performs an initial search and lands on a page from the link graph. From that point on, users randomly select one of the following two options:
    • They can click any outgoing link from the current page and navigate to a new page
    • Alternatively, they can decide to run a new search query

The preceding steps continue in perpetuity.

  1. A PageRank score reflects the probability that a random surfer lands on a particular web page. In other words, the score expresses the importance (ranking) of each web page relative to every other web page on the internet.
  2. At each step of the PageRank algorithm, each link distributes its accumulated PageRank score to its outgoing links. Dead-ends receive the PageRank scores from pages that are linked to them but never redistribute them as they have no outgoing links. If we don't take steps to handle these problematic cases, the graph dead-ends will end up with a significantly higher (and incorrect) PageRank score compared to regular pages in the graph.