Consensus components

IPFS Cluster peers can be started with different implementations for some of their components. The most important one is the “consensus component”, which is in charge of:

  • Managing the global cluster pinset by receiving modifications from other peers and publishing them.
  • Managing the persistent storage of pinset-related data on disk.
  • Achieving consistency between all peers: all peers should eventually converge to the same pinset.
  • Managing the Cluster peerset: performing the necessary operations to add or remove peers from the Cluster.
  • Setting peer trust: defining which peers are trusted to perform actions and access local RPC endpoints.

IPFS Cluster offers two “consensus component” implementations, and users must choose one when initializing a cluster peer by passing either --consensus crdt or --consensus raft to the init command.
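For example, assuming the standard ipfs-cluster-service binary, the choice is made at initialization time:

```sh
# Initialize a peer with the CRDT-based consensus component (the default):
ipfs-cluster-service init --consensus crdt

# Or initialize it with the Raft-based one:
ipfs-cluster-service init --consensus raft
```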

For offline cluster pinset management, see the Data, backups and recovery section.
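The state subcommands of ipfs-cluster-service are the entry point for that kind of offline work; a minimal sketch (the peer must be stopped):

```sh
# Export the local pinset state:
ipfs-cluster-service state export

# Import a previously exported pinset:
ipfs-cluster-service state import <exported-file>
```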

CRDT

crdt is the default implementation of the Cluster’s “consensus component”, based on an IPFS-powered distributed key-value store. It:

  • Publishes updates to the pinset via libp2p pubsub (GossipSub) and locates and exchanges data via ipfs-lite (DHT + Bitswap).
  • Stores all persistent data in a local datastore inside the .ipfs-cluster/ folder (a pebble, badger3, badger or leveldb folder, depending on the selected option).
  • Uses Merkle-CRDTs (via go-ds-crdt) to obtain eventual consistency. These are append-only, immutable Merkle-DAGs: they cannot be compacted under normal conditions, and new peers must discover and traverse them from the root, which can be a slow operation if the DAG is very deep.
  • Does not need to perform any peerset management: every peer from which “pings” are received via pubsub is considered a member of the Cluster until its last metric expires.
  • Trusts peers as defined in the trusted_peers configuration option: only those peers can modify the pinset on the local peer and access “trusted” RPC endpoints (see the configuration excerpt after this list).
  • Can optionally batch many pin/unpin operations into a single update, which scales pin ingestion capacity (also shown below).
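As an illustration, both options live in the crdt section of a peer’s service.json. The excerpt below is a sketch, not a full configuration: trusted_peers accepts peer IDs or "*" (trust everyone), and batching stays disabled while both limits are zero (exact keys and defaults may vary between versions):

```json
{
  "consensus": {
    "crdt": {
      "cluster_name": "ipfs-cluster",
      "trusted_peers": ["*"],
      "batching": {
        "max_batch_size": 0,
        "max_batch_age": "0s"
      }
    }
  }
}
```

With both limits above zero, a batch is committed as soon as it holds max_batch_size items or becomes max_batch_age old, whichever happens first.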

We successfully use crdt on very large clusters and it is regularly maintained and heavily battle-tested.

Raft

The Raft implementation has not received much attention for a long time and should be considered legacy. It likely does not behave well with very large pinsets in terms of memory consumption.

raft is an implementation of the Cluster’s “consensus component” based on Raft consensus. It:

  • Publishes updates by connecting and sending them directly to every Cluster peer.
  • Stores all persistent data on a local BoltDB store and on regular snapshots in the .ipfs-cluster/raft folder.
  • Uses a Raft consensus implementation (hashicorp/raft with a libp2p network transport) to obtain consistency and protection against network partitions. Peerset views can be outdated in Raft, but they can never diverge in ways that need reconciliation. Raft clusters elect a leader, which is in charge of committing every entry to the log. For an entry to be valid, more than half of the peers in the cluster must acknowledge it (e.g. 3 out of 5 peers). The append-only log can be consolidated and compacted into a snapshot, which can be sent to new peers.
  • Performs peerset management by making peerset changes (adding and removing peers) commit operations in the Raft log, thus subject to the same requirements: an elected leader and more than half of the peers online (see the example after this list).
  • Trusts all peers. Any peer can request joining a Raft-based cluster and any peer can access RPC endpoints of others (as long as they know the Cluster secret).
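For instance, peerset changes translate directly into Raft log commits. A sketch using the ipfs-cluster-ctl tool against a running peer:

```sh
# List the current Cluster peerset:
ipfs-cluster-ctl peers ls

# Remove a peer: this is committed as a Raft log entry, so it
# requires an elected leader and a majority of peers being online.
ipfs-cluster-ctl peers rm <peerID>
```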

Choosing a consensus component

Choose CRDT when:

  • You expect your cluster to work well with peers easily coming and going.
  • You plan to have follower peers without permissions to modify the pinset.
  • You do not have fixed peers for bootstrapping, or you need to take advantage of mDNS autodiscovery.
  • The cluster needs to accommodate regular and heavy bursts of pinning/unpinning operations (batching support helps).

Choose Raft when:

  • You think the above warning about Raft consensus being legacy does not apply to you.
  • Your cluster peerset is stable (always the same peers, always running) and not updated frequently.
  • You cannot tolerate temporary partitions that result in divergent states.
  • You don’t need any of the features that CRDT mode provides.
  • Your cluster will be small in terms of number of pins.

CRDT vs Raft comparison

| CRDT | Raft |
| --- | --- |
| GossipSub broadcast | Explicit connection to every peer + acknowledgments |
| Trusted-peer support | All peers trusted |
| “Follower peers” and “publisher peers” supported | All peers are publishers |
| Peers can come and go freely | More than 50% of the peers must be online at all times or nothing works; errors are logged when a peer is offline |
| State size always grows | State size is reduced after deletions |
| State compaction only possible by taking the full cluster offline | Automatic compaction of the state |
| Potentially slow first sync | Faster first sync by sending a full snapshot |
| Works with a very large number of peers | Works with a small number of peers |
| Based on IPFS tech (Bitswap, DHT, pubsub) | Based on Hashicorp tech (raft) |
| Strong eventual consistency: pinsets can diverge until they are consolidated | Consensus: pinsets can be outdated but never diverge |
| Fast pin ingestion with batching support | Slow pin ingestion |
| Pin committed != pin arrived at most peers | Pin committed == pin arrived at most peers |
| Maximum tested size: 60M pins | Maximum tested size: 100k pins |