Raft is a distributed consensus algorithm designed to be easily understood and implemented. It’s the consensus algorithm behind services like Etcd (the distributed key-value store that backs Kubernetes), Consul, and soon Kafka, whose team is migrating from ZooKeeper to Raft.[48] Because Raft is easy to understand and implement, developers have written quality Raft libraries that are used in many projects, and it’s become the most widely deployed consensus algorithm today.
Let’s talk about Raft’s leader election first and then its replication, which will lead into coding replication in our service.
A Raft cluster has one leader and the rest of the servers are followers. The leader maintains power by sending heartbeat requests to its followers, effectively saying: “I’m still here and I’m still the boss.” If the follower times out waiting for a heartbeat request from the leader, then the follower becomes a candidate and begins an election to decide the next leader. The candidate votes for itself and then requests votes from the followers. “The boss is gone! I’m the new boss, right?” If the candidate receives a majority of the votes, it becomes the leader, and it sends heartbeat requests to the followers to establish authority: “Hey y’all, new boss here.”
Multiple followers can become candidates simultaneously if they time out waiting for the leader’s heartbeats at around the same time. They’ll each hold their own election, and the election might not produce a new leader because the votes split. In that case they’ll hold another election, and candidates will keep holding elections until one wins and becomes the new leader.
Every Raft server has a term: a monotonically increasing integer that tells other servers how authoritative and current this server is. The servers’ terms act as a logical clock: a way to capture chronological and causal relationships in distributed systems, where real-time clocks are untrustworthy and unimportant. Each time a candidate begins an election, it increments its term. If the candidate wins the election and becomes the leader, the followers update their terms to match and the terms don’t change until the next election. Servers vote once per term for the first candidate that requests votes, as long as the candidate’s term is greater than the voters’. These conditions help prevent vote splits and ensure the voters elect an up-to-date leader.
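To make the voting rules concrete, here’s a minimal sketch in Go of how a server might decide whether to grant its vote. The types and fields are hypothetical simplifications; a real implementation tracks more state and also compares log positions to make sure the candidate’s log is at least as up to date as the voter’s.

```go
package raft

// server holds the election state we care about for voting, simplified for
// illustration.
type server struct {
	term     uint64 // current term
	votedFor string // candidate granted our vote this term ("" if none)
}

// voteRequest is a candidate's request for this server's vote.
type voteRequest struct {
	term      uint64 // candidate's term
	candidate string // candidate asking for the vote
}

// requestVote decides whether to grant the vote.
func (s *server) requestVote(req voteRequest) bool {
	if req.term < s.term {
		return false // stale candidate; reject
	}
	if req.term > s.term {
		// The candidate's term is newer: adopt it and clear our old vote.
		s.term = req.term
		s.votedFor = ""
	}
	if s.votedFor == "" || s.votedFor == req.candidate {
		// Vote once per term, for the first candidate that asks.
		s.votedFor = req.candidate
		return true
	}
	return false
}
```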
Depending on your use case, you might use Raft just for leader election. Imagine you’ve built a job system with a database of jobs to run and a program that queries the database every second to check if there’s a job to run and, if so, runs the job. You want this system to be highly available and resilient to failures, so you run multiple instances of the job runner. But you don’t want all of the runners running simultaneously and duplicating the work. So you use Raft to elect a leader; only the leader runs the jobs, and if the leader fails, Raft elects a new leader that runs the jobs. Most use cases rely on Raft for both its leader election and replication to get consensus on state.
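If you used Raft only for leader election like that, the runner’s loop might look something like this sketch, where isLeader and runPendingJobs are hypothetical stand-ins for asking your Raft library whether this server currently leads and for querying your job database:

```go
package jobs

import "time"

// runJobs polls for work every second but only does the work while this
// server is the elected leader.
func runJobs(isLeader func() bool, runPendingJobs func()) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for range ticker.C {
		if !isLeader() {
			continue // followers stay idle
		}
		runPendingJobs()
	}
}
```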
Raft’s leader election can be useful by itself, but usually the point is to elect a leader that’s responsible for replicating a log to its followers and doing something with the log data. Raft breaks consensus into two parts: leader election and log replication. Let’s talk about how Raft’s replication works.
The leader accepts client requests, each of which represents some command to run across the cluster. (In a key-value service, for example, you’d have a command to assign a key’s value.) For each request, the leader appends the command to its log and then requests that its followers append the command to their logs. Once a majority of the servers in the cluster have replicated the command, the leader considers the command committed: it executes the command with a finite-state machine and responds to the client with the result. The leader tracks the highest committed offset and sends it in its requests to the followers. When a follower receives a request, it executes all commands up to the highest committed offset with its own finite-state machine. All Raft servers run the same finite-state machine, which defines how to handle each command.
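As a sketch of that commit bookkeeping, suppose the leader keeps a hypothetical slice recording the highest offset each server (itself included) has replicated. Taking the lower median gives the highest offset stored on a majority of servers:

```go
package raft

import "sort"

// committedOffset returns the highest offset replicated on a majority of
// servers. matchOffsets holds the highest replicated offset for every server
// in the cluster, the leader included. (Real Raft adds one more check: the
// leader only commits offsets from its own term this way.)
func committedOffset(matchOffsets []uint64) uint64 {
	offsets := make([]uint64, len(matchOffsets))
	copy(offsets, matchOffsets)
	sort.Slice(offsets, func(i, j int) bool { return offsets[i] < offsets[j] })
	// The lower-median offset is stored on a majority of the servers.
	return offsets[(len(offsets)-1)/2]
}
```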
Replication saves us from losing data when servers fail, but it isn’t free. Like any insurance, replication has a cost (in complexity, network bandwidth, and data storage), and the benefit of still having your data when a server fails makes that cost worth paying for all the time the servers are healthy. Because a Raft leader waits until a majority of the servers in its cluster have replicated a command before considering it committed, we won’t lose committed data unless a majority of the servers fail.
The recommended number of servers in a Raft cluster is three or five. A Raft cluster of three servers will tolerate a single server failure, while a cluster of five will tolerate two server failures. I recommend odd-numbered cluster sizes because Raft will handle (N–1)/2 failures, where N is the size of your cluster. A cluster with four servers would only handle losing one server, the same as a cluster with three, so you’d pay for an extra server that didn’t increase your fault tolerance. For larger clusters, CockroachDB wrote a layer on top of Raft called MultiRaft[49] that divides the database’s data into ranges, each with its own consensus group. To keep our project simple, we’ll have a single Raft cluster.
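Here’s that arithmetic as a tiny Go helper, just to make the even-versus-odd point concrete:

```go
package raft

// faultTolerance returns how many server failures a cluster of n servers can
// survive while still reaching a majority: (n-1)/2 with integer division.
func faultTolerance(n int) int { return (n - 1) / 2 }

// faultTolerance(3) == 1, faultTolerance(4) == 1, faultTolerance(5) == 2.
```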
Our service’s use case is unique because replicating a log is our end goal. Raft’s algorithm replicates a log, and we could defer all log management to Raft’s internals. This would make our service efficient and easy to code, but wouldn’t teach you how to use Raft to build distributed services that aren’t distributed logs.
In other services, you’ll use Raft as a means to replicate a log of commands and then execute those commands with state machines. If you were building a distributed SQL database, you’d replicate and execute the insert and update SQL commands; if you were building a key-value store, you’d replicate and execute set commands. Because the other services you build will replicate a log as a means rather than an end, we’ll build our service the way you would build those other kinds of services: by replicating the transformation commands, which in our service are append commands. Technically, we’ll replicate two logs: the log containing Raft’s commands and the log that results from the finite-state machines applying those commands. This service may not be as optimized as it could be, but what you learn will be more useful when you build other services.
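To give you a feel for where that leads, here’s a sketch of the shape such a finite-state machine could take, assuming a hypothetical Log type whose Append method stores a record and returns its offset; you’d flesh this out against your Raft library’s state-machine hook.

```go
package log

// Log stands in for the commit log we append records to.
type Log struct{ /* ... */ }

// Append is a stand-in; a real log durably stores the record and returns the
// offset it was stored at.
func (l *Log) Append(record []byte) (offset uint64, err error) {
	return 0, nil
}

// fsm applies replicated append commands to this server's log, so every
// server ends up with the same records at the same offsets.
type fsm struct {
	log *Log
}

func (f *fsm) applyAppend(record []byte) (offset uint64, err error) {
	return f.log.Append(record)
}
```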