Download and installation

In order to run an IPFS Cluster peer and perform actions on the Cluster, you will need to obtain the ipfs-cluster-service and ipfs-cluster-ctl binaries. The former runs the Cluster peer. The latter lets you interact with it:

  • Visit the download page for instructions on the different ways to obtain ipfs-cluster-service and ipfs-cluster-ctl.
  • Place the binaries in a location where they can be run unattended by an ipfs system user (usually /usr/local/bin). IPFS Cluster should be installed and run alongside ipfs (go-ipfs).
  • Consider configuring your systems to start ipfs and ipfs-cluster-service automatically, but first make sure that your cluster is fully operational and that peers discover each other. Some sample Systemd service files are available here: ipfs-cluster-service, ipfs.

Initialization

To generate a default configuration file, a unique identity for your peer and an empty peerstore file, run:

$ ipfs-cluster-service init --consensus <crdt/raft>

This assumes that the ipfs-cluster-service command is installed in one of the folders in your $PATH.

If all went well, after running this command there will be three different files in $HOME/.ipfs-cluster:

  • service.json contains a default peer configuration. Usually, all peers in a Cluster should have exactly the same configuration.
  • identity.json contains the peer private key and ID. These are unique to each Cluster peer.
  • peerstore is an empty file used to store the peer addresses of other peers so that this peer knows where to contact them.

The --consensus flag chooses whether to initialize the configuration with a raft or a crdt section. All peers should be initialized in the same way. The choice between raft and crdt depends on multiple factors and affects how the cluster is started and the peerset modified. We have gathered more in-depth explanations in the Consensus Components section.

The new service.json file generated by ipfs-cluster-service init will have a randomly generated secret value in the cluster section. For a Cluster to work, this value should be the same in all cluster peers. This is usually a source of pitfalls since initializing default configurations everywhere results in different random secrets.

If present, the CLUSTER_SECRET environment variable is used when running ipfs-cluster-service init to set the cluster secret value.
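
One way to avoid mismatched secrets is to generate the value once and export it on every peer before running init. The following is a minimal sketch, assuming the secret is a 32-byte hex-encoded string like the randomly generated default; the helper name new_cluster_secret is illustrative and not part of any Cluster tool.

```python
import secrets


def new_cluster_secret() -> str:
    """Return 32 random bytes encoded as 64 hexadecimal characters."""
    return secrets.token_hex(32)


if __name__ == "__main__":
    # Print a line that can be pasted into the shell of every peer
    # before running `ipfs-cluster-service init`.
    print(f"export CLUSTER_SECRET={new_cluster_secret()}")
```

Generate the value a single time and distribute it through your usual secret management. Running the script separately on each peer would recreate the pitfall described above.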

Remote configuration

ipfs-cluster-service can be initialized to use a remote configuration file accessible on an HTTP(s) location which is read to obtain the running configuration every time the peer is launched. This is useful to initialize all peers with the same configuration and provide seamless upgrades to it.

A good trick is to use IPFS to store the actual configuration and, for example, call init with a gateway url as follows:

$ ipfs-cluster-service init http://localhost:8080/ipns/config.mydomain.com

(A DNSLink TXT record needs to be configured for the example above to work. A regular URL can be used too.)

Do not host configurations publicly unless it is OK to expose the Cluster secret. This is only OK in crdt-based clusters which have set trusted_peers to something other than *.

Trusted peers

The crdt section of the service.json file includes a single * value for the trusted_peers array. By default, peers running in crdt mode trust all other peers. In raft mode, all peers trust all other peers and this option does not exist.

Read more about trusted peers in the Security and Ports guide.

The peerstore file

The peerstore file will be maintained by the running Cluster peer and will be used to store known-peer addresses. However, you can also pre-fill this file (one line per multiaddress) to help this peer connect to others during its first start. Here is an example:

/dns4/cluster1.domain/tcp/9096/ipfs/QmcQ5XvrSQ4DouNkQyQtEoLczbMr6D9bSenGy6WQUCQUBt
/dns4/cluster2.domain/tcp/9096/ipfs/QmdFBMf9HMDH3eCWrc1U11YCPenC3Uvy9mZQ2BedTyKTDf
/ip4/192.168.1.10/tcp/9096/ipfs/QmSGCzHkz8gC9fNndMtaCZdf9RFtwtbTEEsGo4zkVfcykD

Alternatively, you can also use the peer_addresses configuration value to provide addresses for other peers.
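
If you provision peers with a script, pre-filling the peerstore can be automated. This is a small sketch, assuming only what is stated above (one multiaddress per line); the function name and the basic sanity check are illustrative, and no real multiaddress validation is attempted.

```python
from pathlib import Path


def write_peerstore(path, multiaddrs):
    """Write one multiaddress per line, skipping blanks and duplicates.

    Returns the list of addresses actually written, in order.
    """
    kept = []
    for addr in (a.strip() for a in multiaddrs):
        if not addr or addr in kept:
            continue
        # Very loose check: multiaddresses always start with a slash.
        if not addr.startswith("/"):
            raise ValueError(f"not a multiaddress: {addr!r}")
        kept.append(addr)
    Path(path).write_text("".join(a + "\n" for a in kept))
    return kept
```

For a default setup the target path would be $HOME/.ipfs-cluster/peerstore, written before the peer is started for the first time.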

Ports

By default, Cluster uses:

  • 9096/tcp as the cluster swarm endpoint, which should be open and dialable by other cluster peers.
  • 9094/tcp as the HTTP API endpoint, when enabled.
  • 9095/tcp as the Proxy API endpoint, when enabled.
  • 9097/tcp as the IPFS Pinning API endpoint, when enabled.
  • 8888/tcp as the Prometheus metrics endpoint, when enabled.
  • 6831/tcp as the Jaeger agent endpoint for traces, when enabled.

A full description of the ports and endpoints is available in the Security guide.

Settings for production

The default IPFS and Cluster settings are conservative and work for simple setups out of the box. There are, however, a number of options that can be tuned for:

  • Large pinsets
  • Large number of peers
  • Networks with very high or very low latencies

In addition to the settings mentioned here, the configuration reference contains detailed information for every configuration section, with extended descriptions of what each value means.

IPFS Configuration

IPFS daemons can be optimized for production. The options are documented in the official repository:

Server profile for cloud deployments

Initialize ipfs using the server profile: ipfs init --profile=server or ipfs config profile apply server if the configuration already exists.

Pay attention to AddrFilters and NoAnnounce options. They should be pre-filled to sensible values with the server configuration profile, but depending on the type of network you are running on, you may want to modify them.

Datastore settings

Unlike a previous recommendation, we have found that the flatfs datastore performs better than badger for very large repositories on modern hardware, and gives fewer headaches (for example, it does not need several minutes to be ready).

sync should be set to false in the configuration (big performance impact otherwise), and the backing filesystem should probably be XFS or ZFS (faster when working with folders that hold a large number of files). IPFS puts huge pressure on disk by performing random reads, especially when providing popular content.

Flatfs can be improved by setting the Sharding function to /repo/flatfs/shard/v1/next-to-last/3 (next-to-last/2 is the default). This should only be done for multi-terabyte repositories.

Updating the sharding function can be done by initializing from a configuration template or by setting it in config and the datastore_spec and removing the blocks/ folder. It should be done during the first setup of the IPFS node, although small datastores can be converted using ipfs-ds-convert.

Increasing Datastore.BloomFilterSize should be considered in most cases, according to the expected IPFS repository size: 1048576 (1MB) is a good value to start (more info here).

Do not forget to set Datastore.StorageMax to a value according to the disk you want to dedicate for the ipfs repo. This will affect how cluster calculates how much free space there is in every peer.

Connection manager settings

Increase Swarm.ConnMgr.HighWater (the maximum number of connections) and reduce GracePeriod to 20s. HighWater can be as high as your machine can take (10000 is a good value for large machines). Adjust Swarm.ConnMgr.LowWater to about 25% of the HighWater value.
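
The relationship between the three values can be sketched as a small helper that patches an already-loaded IPFS configuration. This is only an illustration: tune_connmgr is a hypothetical name, and the dictionary is assumed to mirror the JSON structure of the IPFS config file.

```python
import json


def tune_connmgr(config: dict, high_water: int = 10000) -> dict:
    """Set connection manager limits in an IPFS config dict (in place)."""
    connmgr = config.setdefault("Swarm", {}).setdefault("ConnMgr", {})
    connmgr["HighWater"] = high_water
    connmgr["LowWater"] = high_water // 4  # about 25% of HighWater
    connmgr["GracePeriod"] = "20s"
    return config


if __name__ == "__main__":
    # Starting from an empty dict just to show the resulting fragment.
    print(json.dumps(tune_connmgr({}), indent=2))
```

Existing keys in the ConnMgr section are left untouched, so the helper can be applied to a full configuration before writing it back.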

Experimental DHT providing

Large datastores have a lot to provide, so enabling AcceleratedDHTClient is recommended.

File descriptor limit

The IPFS_FD_MAX environment variable controls the FD ulimit value that go-ipfs sets for itself. Depending on your HighWater value, you may want to increase it to 8192 or more.

Garbage collection

We recommend keeping automatic garbage collection on IPFS disabled when using IPFS Cluster to add content, as the GC process may delete content as it is being added. Alternatively, it is possible to add content directly to IPFS (this will lock the GC process in the meantime), and only use the Cluster to pin it once added.

Bitswap optimizations

It is also very important to adjust the internal Bitswap configuration when nodes have lots of traffic. Multiplying the defaults by 100 is not unheard of on big machines. However, you will have to find the right balance, as this will make IPFS consume much more memory when it is busy bitswapping. Example for a machine with 128GB of RAM:

  "Internal": {
    "Bitswap": {
      "EngineBlockstoreWorkerCount": 2500,
      "EngineTaskWorkerCount": 500,
      "MaxOutstandingBytesPerPeer": 1048576,
      "TaskWorkerCount": 500
    }
  }

Multiply these values by 2 if your machines can handle it.

Peerings

The Peering configuration section can be used to ensure that IPFS peers in the cluster always stay connected.

IPFS Cluster configuration

First, it is important to ensure that the ipfs-cluster-service daemon can operate with ample “file descriptor count” limits. This can be done by adding LimitNOFILE=infinity to systemd service unit files (in the [Service] section), or by ensuring ulimits are correctly set. The number of file descriptors used by the daemon is proportional to:

  • The number of open badger tables and write-ahead log files (badger configuration)
  • The number of items being pinned in parallel (pintracker configuration)
  • The number of ongoing add requests.

Additionally, the service.json configuration file contains a few options which should be tweaked according to your environment, capacity and requirements.

cluster section

When dealing with a large number of pins, you may further increase the cluster.pin_recover_interval. This operation will perform checks for every pin in the pinset and will trigger ipfs pin ls --type=recursive calls, which may be slow when the number of pinned items is huge. For example, for multimillion pinsets, this should be set to at least 2 hours.

Consider increasing the cluster.monitor_ping_interval and monitor.*.check_interval. These dictate how long cluster takes to realize a peer is not responding (and potentially trigger re-pins if cluster.disable_repinning is set to false). Re-pinning might be very expensive in your cluster, so you may want to set these to be long enough. You can use the same value for both. In general, we recommend leaving repinning disabled, although when enabled, replication_factor_max and replication_factor_min allow some leeway: for example, a min/max of 2/3 will allow one peer to be down without re-allocating the content assigned to it somewhere else.
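
The leeway given by the two replication factors is simple arithmetic. The sketch below assumes that content is only re-allocated once the number of healthy allocations would fall below replication_factor_min, and that both factors are positive (the special "replicate everywhere" setting is not covered). The function name is illustrative.

```python
def peers_that_can_fail(replication_factor_min: int,
                        replication_factor_max: int) -> int:
    """Number of allocated peers that may be down before re-pinning."""
    if not 0 < replication_factor_min <= replication_factor_max:
        raise ValueError("expected 0 < min <= max")
    return replication_factor_max - replication_factor_min


if __name__ == "__main__":
    # A pin allocated to 3 peers with a minimum of 2 tolerates one failure.
    print(peers_that_can_fail(2, 3))
```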

raft section

These options only apply when running raft-based clusters.

If you are planning to re-start all Raft peers at the same time (for example, after an upgrade), consider setting raft.wait_for_leader_timeout to something that gives ample time for all your peers to be restarted and come online at once, usually 30s or 1m.

If your network is very unstable, you can try increasing raft.commit_retries and raft.commit_retry_delay. Note: more retries and higher delays imply slower failures.

For high-latency clusters (like having peers around the world), you can try increasing heartbeat_timeout, election_timeout, commit_timeout and leader_lease_timeout, although defaults are quite big already. For low-latency clusters, these can all be decreased (at least by half).

For very large pinsets, increase raft.snapshot_interval. If your cluster pins or unpins very frequently, increase raft.snapshot_threshold.

crdt section

These options only apply when running crdt-based clusters.

Reducing the crdt.rebroadcast_interval (default 1m) to a few seconds should make new peers start downloading the state faster, and badly connected peers should have more chances to receive bits of information, at the expense of increased pubsub chatter in the network. It can be reduced to 10 seconds for clusters under 10 peers, but in general we recommend leaving it at 1 minute.

Most importantly, the batching configuration can increase CRDT throughput by orders of magnitude by batching multiple pin updates in a single delta (at the cost of delays). For example, you can let clusters only commit new pins every minute, or when there are 500 in a batch, as follows:

      "batching": {
        "max_batch_size": 500,
        "max_batch_age": "1m",
        "max_queue_size": 50000
      },
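
The "size or age, whichever comes first" rule behind max_batch_size and max_batch_age can be illustrated with a toy model. This is not Cluster's implementation: the Batcher class is hypothetical, time is passed in explicitly as seconds, and max_queue_size is not modelled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Batcher:
    max_batch_size: int = 500
    max_batch_age: float = 60.0  # seconds, i.e. "1m"
    pending: List[str] = field(default_factory=list)
    first_added_at: Optional[float] = None

    def add(self, pin: str, now: float) -> List[str]:
        """Queue a pin; return the committed batch, or [] if none is due."""
        if not self.pending:
            self.first_added_at = now  # age counts from the oldest item
        self.pending.append(pin)
        return self.flush_if_due(now)

    def flush_if_due(self, now: float) -> List[str]:
        """Commit the batch when it is full or its oldest item is too old."""
        if not self.pending:
            return []
        full = len(self.pending) >= self.max_batch_size
        old = now - self.first_added_at >= self.max_batch_age
        if full or old:
            batch, self.pending = self.pending, []
            self.first_added_at = None
            return batch
        return []
```

In this model a burst of 500 pins is committed immediately as one batch, while a single isolated pin waits up to one minute, which is the delay the text refers to.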

You can edit the crdt.cluster_name, as long as it is the same for all peers.

pin_tracker section

The stateless pin tracker handles two pinning queues: a priority one and a “normal” one. When processing the queues, pins queued in the priority queue always have preference. All new pins go to the priority queue and stay there until they become old or have been retried too many times:

  "pin_tracker": {
    "stateless": {
      "max_pin_queue_size": 200000000,
      "concurrent_pins": 10,
      "priority_pin_max_age": "24h",
      "priority_pin_max_retries": 3
    }
  },

The right value for concurrent_pins depends on the size of the pins and the performance of IPFS to both fetch and write them to disk. We have observed that values between 10 and 20 tend to work better than larger ones (which may cause too much contention).

ipfs_connector section

Pin requests performed by cluster time out based on the last block fetched by IPFS (not on the total length of the pin request). Therefore the pin_timeout setting can be set very low: 20 seconds will ask cluster to give up on a pin if no block can be fetched for 20 seconds. Lower pin timeouts let cluster churn through pinning queues faster. Pin errors will be retried later.
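
The difference between a progress-based timeout and a total-duration timeout can be shown with a short simulation. This is a sketch under assumptions: block arrival times are given in seconds since the pin request started, the last block is taken to complete the pin, and pin_abandoned_at is a hypothetical helper, not a Cluster function.

```python
def pin_abandoned_at(block_times, pin_timeout=20.0, start=0.0):
    """Return when the pin would be given up, or None if it never stalls.

    The timer restarts every time a block arrives, so only the gap
    between consecutive blocks matters, not the total elapsed time.
    """
    last = start
    for t in sorted(block_times):
        if t - last > pin_timeout:
            return last + pin_timeout
        last = t
    return None


if __name__ == "__main__":
    # Ten minutes of steady progress (one block every 15s): never abandoned.
    print(pin_abandoned_at(range(15, 601, 15)))
    # A 35s stall after the first block: abandoned 20s after that block.
    print(pin_abandoned_at([5, 40]))
```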

The ipfs_request_timeout is set to a conservative 5m value by default. However, on pinsets with millions of items this might be too low and increasing it is necessary.

api section

Adjust the api.restapi/pinsvcapi/ipfsproxy network timeouts depending on your API usage. This may protect against misuse of the API or DDoS attacks. Note that there are usually client-side timeouts that can be modified too if you control the clients.

Each of the APIs can be disabled by just removing their configuration from the api section.

informer section

On big clusters (many pins, many peers or large storage), you can increase the metric_ttl for all informers to 5 or 10 minutes, as there is no need to have extremely fresh freespace metrics and the like. This means that peers will be producing a stable set of allocations for new pins as long as the metrics are not refreshed, which allows the rest of the peers to “rest”, giving some space to deal with other tasks like pin recovery.

If you want to geo-distribute your pins, set up the tags informer with a region tag and the right value for every peer.

The pinqueue informer should be set with a high weight_bucket_size adjusted to how big it is acceptable for a pinning queue to grow before we start pinning somewhere else. The default weight_bucket_size is 100k, which is a very high number. This means that all peers under 100k items in the queue are considered equal with regards to this metric, allowing the freespace metric to be the deciding factor for pinning allocations (when things are so configured in the balanced allocator section).

allocator section

The balanced allocator can help distribute pins and load effectively. We recommend setting allocate_by to something like:

[ "tag:region", "pinqueue", "freespace" ]

Given a replication factor of 3, and a cluster with peers in 3 regions correctly configured, this will place each pin replica in 1 of the regions, choosing the peer with the most free space in that region among those with the lowest queue weight (“pin queue size / weight_bucket_size”). You can also add additional tag metrics (e.g. [ "tag:region", "tag:availability_zone", ...]). See the pinning guide for more information on how the pinning process works.
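
The selection described above can be sketched in a few lines. This is a deliberately simplified model, not the balanced allocator itself: it picks at most one peer per region, the peer dictionaries and their keys (name, region, pinqueue, freespace) are made up for the example, and the queue weight is computed with integer division.

```python
from collections import defaultdict


def allocate(peers, replication_factor=3, weight_bucket_size=100_000):
    """Pick one peer per region: lowest queue weight, then most free space."""
    by_region = defaultdict(list)
    for peer in peers:
        by_region[peer["region"]].append(peer)

    def rank(peer):
        # Peers whose queues fall in the same bucket get the same weight,
        # so free space decides between them.
        queue_weight = peer["pinqueue"] // weight_bucket_size
        return (queue_weight, -peer["freespace"])

    # Best candidate of each region, then the best candidates overall.
    best = [min(group, key=rank) for group in by_region.values()]
    best.sort(key=rank)
    return [p["name"] for p in best[:replication_factor]]


if __name__ == "__main__":
    peers = [
        {"name": "eu-1", "region": "eu", "pinqueue": 10, "freespace": 500},
        {"name": "eu-2", "region": "eu", "pinqueue": 20, "freespace": 900},
        {"name": "us-1", "region": "us", "pinqueue": 150_000, "freespace": 999},
        {"name": "us-2", "region": "us", "pinqueue": 0, "freespace": 100},
        {"name": "ap-1", "region": "ap", "pinqueue": 5, "freespace": 300},
    ]
    print(allocate(peers))
```

In the sample data, us-1 has the most free space but loses to us-2 because its queue has crossed the 100k bucket, which is the effect weight_bucket_size is meant to produce.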

datastore section

The badger defaults are very conservative and designed to minimize Badger’s memory consumption. On a big machine with 128GB of RAM, these have proven to be much better:

    "badger": {
      "gc_discard_ratio": 0.2,
      "gc_interval": "15m0s",
      "gc_sleep": "10s",
      "badger_options": {
        "dir": "",
        "value_dir": "",
        "sync_writes": false,
        "table_loading_mode": 2,
        "value_log_loading_mode": 0,
        "num_versions_to_keep": 1,
        "max_table_size": 268435456,
        "level_size_multiplier": 10,
        "max_levels": 7,
        "value_threshold": 512,
        "num_memtables": 10,
        "num_level_zero_tables": 10,
        "num_level_zero_tables_stall": 20,
        "level_one_size": 268435456,
        "value_log_file_size": 1073741823,
        "value_log_max_entries": 1000000,
        "num_compactors": 2,
        "compact_l_0_on_close": true,
        "read_only": false,
        "truncate": false
      }
    }

It is very important in any case to spend time understanding the Badger options. Things like the table and value log loading modes, max_table_size, value_threshold etc. have a big impact on how much memory Badger uses, and how it performs. This is especially relevant for 10M+ pinsets.