How to propagate state change across nodes in the distributed network reliably?
Fault Tolerant Consensus Algorithms
What is it?
Consensus algorithms allow a collection of machines to work as a coherent group that can survive the failures of some of its members. Because of this, they play a key role in building reliable large-scale software systems.
What are the possible ways of distributed system failure?
- System crashed
- Message lost
- Message delayed/ out of order
- Network partition
- Malicious (Byzantine) - Not consider this.
Why is it so useful?
Any algorithm that relies on multiple processes maintaining common state relies on solving the consensus problem. Some examples of places where consensus has come in useful are:
- synchronizing replicated state machines and making sure all replicas have the same (consistent) view of system state.
- electing a leader (e.g., for mutual exclusion)
- distributed, fault-tolerant logging with globally consistent sequencing
- managing group membership
- deciding to commit or abort for distributed transactions
You can solve this consensus problem in the following ways:
- single system-wide coordinator that determines the final outcome
- 2PC
Paxos
Paxos has been used in several production systems like Google Chubby, ZooKeeper, and Spanner.