Relaying messages between workers

For the master to be able to relay messages between workers, it needs to be able to efficiently answer the following question: "given a destination ID, which partition does it belong to?"

This certainly sounds like a query that the Range type should be able to answer! To jog your memory, this is what the Range type definition from Chapter 10, Building, Packaging, and Deploying Software, looks like:

type Range struct {
    start       uuid.UUID
    rangeSplits []uuid.UUID
}

The start field keeps track of the range's start UUID while rangeSplits[p] tracks the end UUID value for the p_th partition. Therefore, the UUID range for a partition p can be calculated as follows:

Before we examine how the UUID-to-partition number query is actually implemented, try as a simple thought exercise to think of an algorithm for answering this type of query (no peeking!).

One way to achieve this is to iterate the rangeSplits slice and locate a range that includes the specified ID. While this naive approach would yield the correct answer, it will unfortunately not scale in a scenario where you might have hundreds of workers exchanging messages with each other.

Can we do any better? The answer is yes. We can exploit the observation that the values in the rangeSplits field are stored in sorted order and use the handy Search function from the Go sort package to perform a binary search.

Here is a much more efficient implementation of this type of query:

func (r *Range) PartitionForID(id uuid.UUID) (int, error) {
    partIndex := sort.Search(len(r.rangeSplits), func(n int) bool {
        return bytes.Compare(id[:], r.rangeSplits[n][:]) < 0
    })

    if bytes.Compare(id[:], r.start[:]) < 0 || partIndex >= len(r.rangeSplits) {
        return -1, xerrors.Errorf("unable to detect partition for ID %q", id)
    }
    return partIndex, nil
}

The sort.Search function executes a binary search on a slice and returns the smallest index for which a user-defined predicate function returns true. Our predicate function checks that the provided ID value is strictly less than the end UUID of the partition currently being scanned.

Now that we have the means to efficiently answer UUID-to-partition queries, let's take a look at the implementation of the relayMessageToWorker method, which is invoked by the worker payload handler for message relay requests:

func (c *masterJobCoordinator) relayMessageToWorker(srcWorkerIndex int, relayMsg *proto.RelayMessage) {
    dstUUID, err := uuid.Parse(relayMsg.Destination)
    if err != nil {
        c.cancelJobCtx()
        return
    }

    partIndex, err := c.partRange.PartitionForID(dstUUID)
    if err != nil || partIndex == srcWorkerIndex {
        c.cancelJobCtx()
        return
    }

    c.sendToWorker(c.cfg.workers[partIndex], &proto.MasterPayload{
        Payload: &proto.MasterPayload_RelayMessage{RelayMessage: relayMsg},
    })
}

The first thing we need to do is to parse the destination ID and make sure that it actually contains a valid UUID value.

Then, we call the PartitionForID helper to look up the index of the partition that the destination ID belongs to and forward the message to the worker assigned to it.

What if it turns out that the worker that asked us to relay the message in the first place is also the one we need to relay the message to? In such a scenario, we will treat the destination ID as being invalid and abort the job with an error. The justification for this decision is that if the local graph was aware of that particular destination, it would simply locally enqueue the message for delivery instead of attempting to relay it through the master node.