Powered by GitBook

How to propagate state change across nodes in the distributed network reliably?

Fault Tolerant Consensus Algorithms

What is it?

Consensus algorithms allow a collection of machines to work as a coherent group that can survive the failures of some of its members. Because of this, they play a key role in building reliable large-scale software systems.

What are the possible ways of distributed system failure?

System crashed
Message lost
Message delayed/ out of order
Network partition
Malicious (Byzantine) - Not consider this.

Why is it so useful?

Any algorithm that relies on multiple processes maintaining common state relies on solving the consensus problem. Some examples of places where consensus has come in useful are:

synchronizing replicated state machines and making sure all replicas have the same (consistent) view of system state.
electing a leader (e.g., for mutual exclusion)
distributed, fault-tolerant logging with globally consistent sequencing
managing group membership
deciding to commit or abort for distributed transactions

You can solve this consensus problem in the following ways:

single system-wide coordinator that determines the final outcome
2PC

Paxos

Paxos has been used in several production systems like Google Chubby, ZooKeeper, and Spanner.

Raft