Consensus Algorithms: Raft, Paxos, and Byzantine Fault Tolerance

Introduction

Distributed systems are the foundation of modern computing. Every time you use a database cluster, a distributed file system, or a replicated service, there's a consensus algorithm working behind the scenes to ensure that all nodes agree on the same state. Consensus is the fundamental problem of distributed computing: how do you get multiple independent computers to agree on a single value when messages can be delayed, nodes can crash, and networks can partition?

The consensus problem is deceptively simple to state but remarkably difficult to solve. If a network splits in half, should each partition continue accepting writes (risking inconsistency) or should one partition stop accepting writes (sacrificing availability)? This trade-off, formalized by the CAP theorem, is at the heart of every consensus algorithm's design.

This guide covers the three most important families of consensus algorithms: Paxos, the theoretically elegant but practically challenging algorithm; Raft, designed for understandability and practical implementation; and Byzantine Fault Tolerance (BFT) algorithms that handle malicious nodes. Understanding these algorithms is essential for anyone building or operating distributed systems.

Understanding Consensus: Core Concepts

The Consensus Problem

A consensus algorithm must satisfy four properties: agreement (all correct nodes decide on the same value), validity (the decided value was proposed by some node), termination (all correct nodes eventually decide), and integrity (each node decides at most once). These properties must hold even when some nodes crash or messages are delayed.

The number of faults a system can tolerate depends on the type of faults. For crash faults (nodes that stop responding), a system of 2f+1 nodes can tolerate f failures. For Byzantine faults (nodes that can behave arbitrarily, including sending conflicting messages), a system needs 3f+1 nodes to tolerate f failures. This difference in fault tolerance is the fundamental distinction between crash-fault-tolerant (CFT) and Byzantine-fault-tolerant (BFT) algorithms.

Leader-Based Consensus

Most practical consensus algorithms use a leader-based approach. One node is elected as the leader, and all client requests are directed to the leader. The leader proposes values, replicates them to followers, and once a majority acknowledges the value, it's considered committed. If the leader fails, a new leader is elected.

The leader-based model simplifies reasoning about the system: there's a single point of coordination. However, it creates a potential bottleneck and requires careful leader election to avoid split-brain scenarios where two nodes believe they're the leader simultaneously.

Log Replication

Consensus algorithms typically maintain a replicated log—a sequence of commands that all nodes apply in the same order. The log ensures that all nodes reach the same state by applying the same sequence of operations. This is the foundation of state machine replication: if all nodes start from the same initial state and apply the same log entries in the same order, they'll reach the same final state.

Raft: Designed for Understandability

Raft was designed by Diego Ongaro and John Ousterhout specifically to be more understandable than Paxos while providing the same guarantees. It decomposes consensus into three relatively independent subproblems: leader election, log replication, and safety.

Leader Election

In Raft, time is divided into terms. Each term begins with an election. A node starts as a follower. If it doesn't hear from a leader within a randomized timeout (typically 150-300ms), it becomes a candidate and requests votes from other nodes. A candidate becomes leader if it receives votes from a majority of nodes. The randomized timeout ensures that elections are usually decided quickly without split votes.

// Raft node states
enum NodeState {
  FOLLOWER = 'follower',
  CANDIDATE = 'candidate',
  LEADER = 'leader'
}
 
interface RaftNode {
  id: string
  state: NodeState
  currentTerm: number
  votedFor: string | null
  log: LogEntry[]
  commitIndex: number
  lastApplied: number
  nextIndex: Record<string, number>
  matchIndex: Record<string, number>
}
 
interface LogEntry {
  term: number
  index: number
  command: any
}
 
async function startElection(node: RaftNode, peers: string[]): Promise<void> {
  node.currentTerm++
  node.state = NodeState.CANDIDATE
  node.votedFor = node.id
  
  let votes = 1 // Vote for self
  const promises = peers.map(peer => requestVote(peer, {
    term: node.currentTerm,
    candidateId: node.id,
    lastLogIndex: node.log.length - 1,
    lastLogTerm: node.log[node.log.length - 1]?.term || 0
  }))
  
  const results = await Promise.allSettled(promises)
  
  for (const result of results) {
    if (result.status === 'fulfilled' && result.value.voteGranted) {
      votes++
    }
  }
  
  if (votes > peers.length / 2) {
    node.state = NodeState.LEADER
    initializeLeaderState(node, peers)
  }
}

Log Replication

The leader accepts client requests, appends them to its log, and replicates them to followers via AppendEntries RPCs. Once a majority of nodes have stored the entry, the leader commits it and applies it to the state machine. The leader tracks each follower's progress and retries entries that haven't been acknowledged.

async function replicateLog(
  node: RaftNode, peers: string[]
): Promise<void> {
  if (node.state !== NodeState.LEADER) return
  
  const promises = peers.map(async peer => {
    const nextIdx = node.nextIndex[peer]
    const prevLogEntry = node.log[nextIdx - 1]
    
    const response = await appendEntries(peer, {
      term: node.currentTerm,
      leaderId: node.id,
      prevLogIndex: nextIdx - 1,
      prevLogTerm: prevLogEntry?.term || 0,
      entries: node.log.slice(nextIdx),
      leaderCommit: node.commitIndex
    })
    
    if (response.success) {
      node.nextIndex[peer] = node.log.length
      node.matchIndex[peer] = node.log.length - 1
    } else {
      // Decrement nextIndex and retry
      node.nextIndex[peer] = Math.max(0, nextIdx - 1)
    }
  })
  
  await Promise.allSettled(promises)
  
  // Update commitIndex
  for (let i = node.log.length - 1; i > node.commitIndex; i--) {
    const replicatedCount = peers.filter(
      p => node.matchIndex[p] >= i
    ).length + 1 // +1 for leader
    
    if (replicatedCount > peers.length / 2) {
      node.commitIndex = i
      break
    }
  }
}

Safety Guarantees

Raft guarantees that if a log entry is committed, it will be present in the logs of all future leaders. This is ensured by the election restriction: a candidate must have a log that is at least as up-to-date as a majority of nodes. This means a leader never needs to delete committed entries—it already has all committed entries in its log.

Paxos: The Theoretical Foundation

Paxos, invented by Leslie Lamport in 1989, is the most well-known consensus algorithm. It's theoretically elegant but notoriously difficult to understand and implement. Lamport's original paper described the algorithm through the metaphor of a parliamentary system on the island of Paxos.

Basic Paxos

Basic Paxos decides a single value through a two-phase process. In Phase 1 (Prepare), a proposer sends a Prepare message with a proposal number to acceptors. If a majority responds, the proposer learns of any previously accepted values. In Phase 2 (Accept), the proposer sends an Accept message with the value to be decided. If a majority accepts, the value is chosen.

// Basic Paxos implementation sketch
interface PaxosProposer {
  proposalNumber: number
  value: any
}
 
interface PrepareResponse {
  accepted: boolean
  highestProposal: number
  acceptedValue: any | null
}
 
async function runPaxos(
  proposer: PaxosProposer,
  acceptors: string[]
): Promise<{ chosen: boolean; value: any }> {
  // Phase 1: Prepare
  const prepareResponses = await Promise.all(
    acceptors.map(acceptor =>
      sendPrepare(acceptor, proposer.proposalNumber)
    )
  )
  
  const promises = prepareResponses.filter(r => r.accepted)
  if (promises.length <= acceptors.length / 2) {
    return { chosen: false, value: null }
  }
  
  // Use the value from the highest-numbered proposal, or our own
  const highestPromise = promises.reduce((a, b) =>
    a.highestProposal > b.highestProposal ? a : b
  )
  const value = highestPromise.acceptedValue || proposer.value
  
  // Phase 2: Accept
  const acceptResponses = await Promise.all(
    acceptors.map(acceptor =>
      sendAccept(acceptor, proposer.proposalNumber, value)
    )
  )
  
  const accepted = acceptResponses.filter(r => r.accepted)
  if (accepted.length > acceptors.length / 2) {
    return { chosen: true, value }
  }
  
  return { chosen: false, value: null }
}

Multi-Paxos

Basic Paxos decides a single value. Multi-Paxos extends this to a sequence of values (a log) by electing a stable leader that skips Phase 1 for subsequent proposals. This optimization reduces the message complexity from four messages per decision to two, making Multi-Paxos competitive with Raft in practice.

Byzantine Fault Tolerance

Byzantine faults are the most challenging class of failures. A Byzantine node can behave arbitrarily—sending conflicting messages, lying about its state, or colluding with other faulty nodes. Byzantine Fault Tolerance (BFT) algorithms handle these adversarial scenarios, making them essential for blockchain and adversarial environments.

PBFT (Practical Byzantine Fault Tolerance)

PBFT, proposed by Miguel Castro and Barbara Liskov in 1999, was the first BFT algorithm practical enough for real systems. It requires 3f+1 nodes to tolerate f Byzantine faults and uses a three-phase protocol (pre-prepare, prepare, commit) to ensure consistency.

// PBFT message types
enum PBFTMessageType {
  REQUEST = 'request',
  PRE_PREPARE = 'pre-prepare',
  PREPARE = 'prepare',
  COMMIT = 'commit',
  VIEW_CHANGE = 'view-change',
  NEW_VIEW = 'new-view'
}
 
interface PBFTMessage {
  type: PBFTMessageType
  view: number
  sequence: number
  digest: string
  replicaId: string
}
 
async function handlePrePrepare(
  node: PBFTNode,
  message: PBFTMessage
): Promise<void> {
  // Verify the pre-prepare is from the current primary
  if (getPrimary(node.view) !== message.replicaId) return
  
  // Verify we haven't accepted a different pre-prepare for this view/sequence
  const existing = node.log.find(
    e => e.view === message.view && e.sequence === message.sequence
  )
  if (existing && existing.digest !== message.digest) return
  
  // Accept and broadcast prepare
  node.log.push(message)
  await broadcast(node, {
    type: PBFTMessageType.PREPARE,
    view: node.view,
    sequence: message.sequence,
    digest: message.digest,
    replicaId: node.id
  })
}

Real-World Implementations

etcd (Raft)

etcd, the key-value store used by Kubernetes for cluster state, is the most widely deployed Raft implementation. It provides linearizable reads and writes, ensuring that all clients see a consistent view of the data. etcd's Raft implementation handles leader election, log compaction, and membership changes.

ZooKeeper (ZAB)

Apache ZooKeeper uses the ZAB (ZooKeeper Atomic Broadcast) protocol, which is similar to Raft but predates it. ZAB provides total ordering of messages and is used for configuration management, leader election, and distributed locking in Hadoop, Kafka, and other distributed systems.

Tendermint (BFT)

Tendermint is a BFT consensus engine used in blockchain applications. It provides instant finality—once a block is committed, it's final and cannot be reverted. This is in contrast to probabilistic finality in Proof-of-Work systems like Bitcoin.

Performance and Trade-offs

Algorithm	Fault Type	Min Nodes	Message Complexity	Throughput	Use Case
Raft	Crash	2f+1	O(n) per decision	High	etcd, CockroachDB
Paxos	Crash	2f+1	O(n) per decision	High	Chubby, Spanner
PBFT	Byzantine	3f+1	O(n²) per decision	Moderate	Blockchains, BFT-SMaRt
HotStuff	Byzantine	3f+1	O(n) per decision	High	Libra/Diem

The key trade-off is between fault tolerance and performance. Crash-fault-tolerant algorithms (Raft, Paxos) are faster because they don't need to handle adversarial behavior. Byzantine-fault-tolerant algorithms require more messages and more nodes but can handle malicious actors.

Common Pitfalls and Solutions

Pitfall	Impact	Solution
Split-brain in leader election	Data inconsistency	Use proper majority quorum, fencing tokens
Log compaction too infrequent	Unbounded log growth	Implement snapshotting with configurable thresholds
Network partition during election	Availability loss	Proper timeout tuning, pre-vote mechanism
Byzantine node flooding	DoS on consensus	Rate limiting, view changes on timeout
Ignoring clock skew	Incorrect timeouts	Use logical clocks, NTP synchronization
Single leader bottleneck	Throughput limit	Multi-leader or leaderless designs for reads
Unbounded view changes	Liveness issues	Exponential backoff, leader rotation

Advanced Patterns

Flexible Paxos

Flexible Paxos relaxes the majority requirement: Phase 1 needs a quorum of Q1 nodes and Phase 2 needs a quorum of Q2 nodes, where Q1 + Q2 > N. This allows optimizing for the common case—for example, requiring only 1 node to acknowledge in Phase 2 if Phase 1 involves all nodes.

Witness Replicas

In geo-distributed databases, witness replicas participate in consensus without storing data. They vote in elections and acknowledge writes, contributing to the majority quorum without the storage overhead. This reduces the cost of maintaining a quorum across regions.

Leader Lease

Raft implementations often use leader leases—a period during which a leader can serve reads without verifying its leadership. This improves read performance but requires careful clock synchronization to avoid serving stale reads after a new leader is elected.

Testing Consensus Algorithms

Testing consensus algorithms requires simulating network partitions, message delays, and node failures. Jepsen is the industry standard for testing distributed systems, using randomized fault injection to find consistency violations.

// Simulating a network partition in tests
async function testNetworkPartition(cluster: RaftCluster) {
  // Partition nodes into two groups
  const group1 = cluster.nodes.slice(0, 2)
  const group2 = cluster.nodes.slice(2)
  
  await cluster.partition(group1, group2)
  
  // Write to group1 (has majority)
  const result = await group1[0].propose({ key: 'test', value: '1' })
  expect(result.committed).toBe(true)
  
  // Group2 should not be able to write
  const result2 = await group2[0].propose({ key: 'test', value: '2' })
  expect(result2.committed).toBe(false)
  
  // Heal partition
  await cluster.heal()
  
  // All nodes should agree on value '1'
  for (const node of cluster.nodes) {
    const value = await node.get('test')
    expect(value).toBe('1')
  }
}

Choosing the Right Consensus Algorithm

The choice of consensus algorithm depends on your system's requirements for consistency, availability, and partition tolerance. Raft is the best choice for systems that prioritize understandability and strong leader-based consistency, which is why etcd and Consul use it. Paxos variants like Multi-Paxos suit systems requiring maximum throughput with leader election. Byzantine fault tolerant algorithms are essential for blockchain and permissionless networks where participants cannot be trusted. For most distributed databases and coordination services, Raft provides the best balance of simplicity and performance.

Community Resources and Further Learning

The technology landscape evolves rapidly, making continuous learning essential for maintaining expertise. Building a systematic approach to staying current with developments in your technology stack ensures you can leverage new features and avoid deprecated patterns.

Curated Learning Pathways

Rather than consuming content randomly, create structured learning pathways aligned with your current projects and career goals. Start with official documentation and specification documents, which provide the most accurate and comprehensive information. Follow this with hands-on tutorials and workshops that reinforce concepts through practical application.

Technical blogs from framework maintainers and core team members often provide deeper insights into design decisions and upcoming features. Subscribe to the official blogs of your primary frameworks and libraries to stay ahead of breaking changes and deprecation timelines.

Contributing to Open Source

Contributing to open-source projects in your technology stack provides unparalleled learning opportunities. Start with documentation improvements and bug reports, then progress to fixing small issues tagged as "good first issue" in your favorite projects. This direct engagement with maintainers and the codebase accelerates your understanding far beyond what passive learning can achieve.

# Setting up for contribution
git clone https://github.com/project/repository.git
cd repository
git checkout -b fix/issue-description
 
# Run the project's contribution setup
npm run setup:dev
npm run test  # Ensure tests pass before making changes
 
# Make your changes, then run the full test suite
npm run test:full
npm run lint
npm run build
 
# Submit your contribution
git add -A
git commit -m "fix: description of the fix
 
Closes #1234"
git push origin fix/issue-description

Building a Technical Knowledge Base

Maintain a personal knowledge base that captures insights, solutions, and patterns you discover during your work. Tools like Obsidian, Notion, or even a simple Markdown repository can serve as an external memory that grows more valuable over time.

Organize your notes by topic rather than chronologically, and include code examples, links to relevant documentation, and explanations of why certain approaches work better than others. When you encounter a particularly insightful article or conference talk, write a summary that captures the key takeaways and how they apply to your current projects.

Staying Current with Industry Trends

Follow key conferences and their published talks to stay informed about emerging patterns and best practices. Many conferences publish recorded talks on YouTube within weeks of the event, making world-class technical content freely accessible.

Join relevant Discord servers, Slack communities, and forums where practitioners discuss real-world challenges and solutions. These communities provide early warning about emerging issues and access to collective wisdom that isn't available through formal documentation.

Teaching others is one of the most effective ways to deepen your own understanding. Consider writing technical blog posts, giving talks at local meetups, or mentoring junior developers. The process of explaining concepts to others forces you to organize your knowledge and identify gaps in your understanding.

Pair programming sessions with colleagues of different experience levels create mutual learning opportunities. Senior developers gain fresh perspectives on problems they've solved the same way for years, while junior developers benefit from exposure to production-grade thinking and decision-making processes.

Conclusion

Consensus algorithms are the foundation of reliable distributed systems. Raft provides understandable, practical consensus for crash-fault-tolerant systems. Paxos offers theoretical elegance and has influenced every subsequent algorithm. BFT algorithms enable consensus in adversarial environments where nodes may be malicious.

Key takeaways:

Raft is the practical choice for most distributed systems—its leader-based design and clear semantics make it the default for new implementations (etcd, CockroachDB, TiKV).
Understand the fault model your system needs: crash faults (2f+1 nodes) vs. Byzantine faults (3f+1 nodes). Most systems only need crash fault tolerance.
Log replication is the core mechanism—all nodes apply the same log entries in the same order, ensuring consistent state across the cluster.
Test with fault injection using tools like Jepsen to verify that your implementation maintains consistency under network partitions and node failures.

Refer to the Raft paper for the definitive reference, Paxos Made Live for Google's practical Paxos experience, and the Jepsen website for distributed systems testing methodology.

Minh Vo

Slaying code & making it lit fr fr 🔥 tagline

Consensus Algorithms: Raft, Paxos, and Byzantine Fault Tolerance

Introduction

Understanding Consensus: Core Concepts

The Consensus Problem

Leader-Based Consensus

Log Replication

Raft: Designed for Understandability

Leader Election

Log Replication

Safety Guarantees

Paxos: The Theoretical Foundation

Basic Paxos

Multi-Paxos

Byzantine Fault Tolerance

PBFT (Practical Byzantine Fault Tolerance)

Real-World Implementations

etcd (Raft)

ZooKeeper (ZAB)

Tendermint (BFT)

Performance and Trade-offs

Common Pitfalls and Solutions

Advanced Patterns

Flexible Paxos

Witness Replicas

Leader Lease

Testing Consensus Algorithms

Choosing the Right Consensus Algorithm

Community Resources and Further Learning

Curated Learning Pathways

Contributing to Open Source

Building a Technical Knowledge Base

Staying Current with Industry Trends

Conclusion

Minh Vo

Slaying code & making it lit fr fr 🔥 tagline

Consensus Algorithms: Raft, Paxos, and Byzantine Fault Tolerance

Introduction

Understanding Consensus: Core Concepts

The Consensus Problem

Leader-Based Consensus

Log Replication

Raft: Designed for Understandability

Leader Election

Log Replication

Safety Guarantees

Paxos: The Theoretical Foundation

Basic Paxos

Multi-Paxos

Byzantine Fault Tolerance

PBFT (Practical Byzantine Fault Tolerance)

Real-World Implementations

etcd (Raft)

ZooKeeper (ZAB)

Tendermint (BFT)

Performance and Trade-offs

Common Pitfalls and Solutions

Advanced Patterns

Flexible Paxos

Witness Replicas

Leader Lease

Testing Consensus Algorithms

Choosing the Right Consensus Algorithm

Community Resources and Further Learning

Curated Learning Pathways

Contributing to Open Source

Building a Technical Knowledge Base

Staying Current with Industry Trends

Mentorship and Knowledge Sharing

Conclusion