Download and installation
In order to run an IPFS Cluster peer and perform actions on the Cluster, you will need to obtain the ipfs-cluster-service and ipfs-cluster-ctl binaries. The former runs the Cluster peer. The latter allows you to interact with it:
- Visit the download page for instructions on the different ways to obtain ipfs-cluster-service and ipfs-cluster-ctl.
- Place the binaries somewhere they can be run unattended by an ipfs system user (usually /usr/local/bin). IPFS Cluster should be installed and run along ipfs (go-ipfs).
- Consider configuring your systems to start ipfs and ipfs-cluster-service automatically (but make sure beforehand that your cluster is fully operational and peers discover each other). Some sample systemd service files are available here: ipfs-cluster-service, ipfs.
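For example, on a linux-amd64 machine, a manual installation could look roughly like the following sketch (the version number is a placeholder; check the download page for current releases and platforms):
$ wget https://dist.ipfs.tech/ipfs-cluster-service/v1.0.0/ipfs-cluster-service_v1.0.0_linux-amd64.tar.gz
$ wget https://dist.ipfs.tech/ipfs-cluster-ctl/v1.0.0/ipfs-cluster-ctl_v1.0.0_linux-amd64.tar.gz
$ tar xzf ipfs-cluster-service_v1.0.0_linux-amd64.tar.gz && tar xzf ipfs-cluster-ctl_v1.0.0_linux-amd64.tar.gz
$ sudo install -m 0755 ipfs-cluster-service/ipfs-cluster-service ipfs-cluster-ctl/ipfs-cluster-ctl /usr/local/bin/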
Initialization
To generate a default configuration file, a unique identity for your peer and an empty peerstore file, run:
$ ipfs-cluster-service init
This assumes that the ipfs-cluster-service command is installed in one of the folders in your $PATH.
If all went well, after running this command there will be three different files in $HOME/.ipfs-cluster:
- service.json contains a default peer configuration. Usually, all peers in a Cluster should have exactly the same configuration.
- identity.json contains the peer private key and ID. These are unique to each Cluster peer.
- peerstore is an empty file used to store the addresses of other peers so that this peer knows where to contact them.
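For instance, listing the configuration folder after a successful init should show the three files mentioned above:
$ ls $HOME/.ipfs-cluster
identity.json  peerstore  service.json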
The peer will be initialized using the default crdt “consensus”. This is the recommended option for most setups. See Consensus Components for more information.
The default and recommended datastore backend (optional --datastore flag) is pebble. See the Datastore backends guide for more information.
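The defaults can also be selected explicitly at init time. As a sketch, using the --consensus and --datastore flags:
$ ipfs-cluster-service init --consensus crdt --datastore pebble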
The new service.json file generated by ipfs-cluster-service init will have a randomly generated secret value in the cluster section. For a Cluster to work, this value should be the same in all cluster peers. This is a common source of pitfalls, since initializing default configurations everywhere results in different random secrets.
The CLUSTER_SECRET environment variable, when set while running ipfs-cluster-service init, is used as the cluster secret value.
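For example, a shared secret can be generated once and exported on every peer before initialization (openssl is just one way to produce a 32-byte hex string):
$ export CLUSTER_SECRET=$(openssl rand -hex 32)
$ ipfs-cluster-service init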
Remote configuration
ipfs-cluster-service can be initialized to use a remote configuration file accessible on an HTTP(s) location, which is read to obtain the running configuration every time the peer is launched. This is useful to initialize all peers with the same configuration and provide seamless upgrades to it.
A good trick is to use IPFS to store the actual configuration and, for example, call init with a gateway URL as follows:
$ ipfs-cluster-service init http://localhost:8080/ipns/config.mydomain.com
(A DNSLink TXT record needs to be configured for the example above to work. A regular URL can be used too.)
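A sketch of that workflow, assuming a local IPFS daemon with its gateway on the default port (the CID is a placeholder):
$ ipfs add -Q service.json
QmYourConfigCID
$ ipfs name publish /ipfs/QmYourConfigCID   # or point the DNSLink TXT record of config.mydomain.com at this CID
$ ipfs-cluster-service init http://localhost:8080/ipns/config.mydomain.com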
Remember to set trusted_peers to other than *.
Trusted peers
The crdt section of the service.json file includes a single * value for the trusted_peers array. By default, peers running in crdt mode trust all other peers. In raft mode, all peers trust all other peers and this option does not exist.
Read more about trusted peers in the Security and Ports guide.
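For production deployments you will usually want to replace the wildcard with an explicit list of peer IDs in the crdt section of service.json; a sketch (the IDs are placeholders):
"crdt": {
  "trusted_peers": [
    "12D3KooWTrustedPeerID1...",
    "12D3KooWTrustedPeerID2..."
  ]
}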
The peerstore file
The peerstore file will be maintained by the running Cluster peer and will be used to store known-peer addresses. However, you can also pre-fill this file (one multiaddress per line) to help this peer connect to others during its first start. Here is an example:
/dns4/cluster1.domain/tcp/9096/ipfs/QmcQ5XvrSQ4DouNkQyQtEoLczbMr6D9bSenGy6WQUCQUBt
/dns4/cluster2.domain/tcp/9096/ipfs/QmdFBMf9HMDH3eCWrc1U11YCPenC3Uvy9mZQ2BedTyKTDf
/ip4/192.168.1.10/tcp/9096/ipfs/QmSGCzHkz8gC9fNndMtaCZdf9RFtwtbTEEsGo4zkVfcykD
Alternatively, you can also use the peer_addresses configuration value to provide addresses for other peers.
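peer_addresses lives in the cluster section of service.json and takes the same multiaddress format, for example:
"cluster": {
  "peer_addresses": [
    "/dns4/cluster1.domain/tcp/9096/ipfs/QmcQ5XvrSQ4DouNkQyQtEoLczbMr6D9bSenGy6WQUCQUBt",
    "/dns4/cluster2.domain/tcp/9096/ipfs/QmdFBMf9HMDH3eCWrc1U11YCPenC3Uvy9mZQ2BedTyKTDf"
  ]
}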
Ports
By default, Cluster uses:
- 9096/tcp as the cluster swarm endpoint, which should be open and diallable by other cluster peers.
- 9094/tcp as the HTTP API endpoint, when enabled.
- 9095/tcp as the Proxy API endpoint, when enabled.
- 9097/tcp as the IPFS Pinning API endpoint, when enabled.
- 8888/tcp as the Prometheus metrics endpoint, when enabled.
- 6831/tcp as the Jaeger agent endpoint for traces, when enabled.
A full description of the ports and endpoints is available in the Security guide.
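As an illustration only (assuming a Linux host with ufw, and a setup where the APIs are only consumed locally), the swarm port can be opened while everything else stays closed:
$ sudo ufw allow 9096/tcp    # cluster swarm, reachable by other cluster peers
$ sudo ufw deny 9094/tcp     # HTTP API, keep local or behind a reverse proxy
$ sudo ufw deny 9095/tcp     # IPFS Proxy API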
Settings for production
The default IPFS and Cluster settings are conservative and work for simple setups out of the box. There are, however, a number of options that can be optimized with regard to:
- Large pinsets
- Large number of peers
- Networks with very high or very low latencies
In addition to the settings mentioned here, the configuration reference contains detailed information for every configuration section, with extended descriptions of what each value means.
IPFS Configuration
IPFS daemons can be optimized for production. The options are documented in the official repository:
Server profile for cloud deployments
Initialize ipfs using the server profile: ipfs init --profile=server, or ipfs config profile apply server if the configuration already exists.
Pay attention to the AddrFilters and NoAnnounce options. They should be pre-filled to sensible values by the server configuration profile, but depending on the type of network you are running on, you may want to modify them.
IPFS Datastore settings
Unlike a previous recommendation, we have found that the flatfs datastore performs better than badger for very large repositories on modern hardware, and gives fewer headaches (i.e. it does not need several minutes to be ready).
sync should be set to false in the configuration (big performance impact otherwise), and the backing filesystem should probably be XFS or ZFS (faster when working with folders containing a large number of files). IPFS puts huge pressure on the disk by performing random reads, especially when providing popular content.
Flatfs can be improved by setting the sharding function to /repo/flatfs/shard/v1/next-to-last/3 (next-to-last/2 is the default). This should only be done for multi-terabyte repositories.
Updating the sharding function can be done by initializing from a configuration template, or by setting it in the config and the datastore_spec file and removing the blocks/ folder. It should be done during the first setup of the IPFS node, although small datastores can be converted using ipfs-ds-convert.
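For reference, a flatfs mount in the IPFS Datastore.Spec configuration would then look roughly like the following sketch (based on the standard kubo flatfs spec; it must stay consistent with the datastore_spec file):
{
  "child": {
    "path": "blocks",
    "shardFunc": "/repo/flatfs/shard/v1/next-to-last/3",
    "sync": false,
    "type": "flatfs"
  },
  "mountpoint": "/blocks",
  "prefix": "flatfs.datastore",
  "type": "measure"
}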
Increasing Datastore.BloomFilterSize should be considered in most cases, according to the expected IPFS repository size: 1048576 (1MB) is a good value to start with (more info here).
Do not forget to set Datastore.StorageMax to a value according to the disk space you want to dedicate to the IPFS repo. This will affect how Cluster calculates how much free space there is in every peer.
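Both values can be set with the ipfs config command, for example:
$ ipfs config --json Datastore.BloomFilterSize 1048576
$ ipfs config Datastore.StorageMax 5TB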
Connection manager settings
Increase the Swarm.ConnMgr.HighWater value (maximum number of connections) and reduce GracePeriod to 20s. HighWater can be as high as your machine can handle (10000 is a good value for large machines). Adjust Swarm.ConnMgr.LowWater to about 25% of the HighWater value.
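For example, on a large machine:
$ ipfs config --json Swarm.ConnMgr.HighWater 10000
$ ipfs config --json Swarm.ConnMgr.LowWater 2500
$ ipfs config Swarm.ConnMgr.GracePeriod 20s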
Experimental DHT providing
Large datastores will have a lot to provide, so enabling the AcceleratedDHTClient is a good idea.
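Depending on your go-ipfs/kubo version, the option lives under the Experimental or the Routing section; for example:
$ ipfs config --json Experimental.AcceleratedDHTClient true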
File descriptor limit
The IPFS_FD_MAX environment variable controls the FD ulimit value that go-ipfs sets for itself. Depending on your HighWater value, you may want to increase it to 8192 or more.
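For example, export it in the environment that launches the daemon (shown here for a shell and for a systemd unit):
$ export IPFS_FD_MAX=8192          # shell
Environment="IPFS_FD_MAX=8192"     # systemd [Service] section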
Garbage collection
We recommend keeping automatic garbage collection disabled on IPFS when using IPFS Cluster to add content, as the GC process may delete content while it is being added. Alternatively, it is possible to add content directly to IPFS (this will lock the GC process in the meantime), and only use the Cluster to pin it once added.
Bitswap optimizations
It is also very important to adjust Bitswap's internal configuration when nodes have lots of traffic. Multiplying the defaults by 100 is not unheard of on big machines. However, you will have to find the right balance, as this will make IPFS consume much more memory when it is busy bitswapping. Example for a machine with 128GB of RAM:
"Internal": {
"Bitswap": {
"EngineBlockstoreWorkerCount": 2500,
"EngineTaskWorkerCount": 500,
"MaxOutstandingBytesPerPeer": 1048576,
"TaskWorkerCount": 500
}
}
Multiply these values by 2 if your machines can handle it.
Peerings
The Peering configuration can be used to ensure that the IPFS daemons in the cluster always stay connected to each other.
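A sketch of the corresponding go-ipfs configuration, peering each IPFS daemon with the IPFS daemons of the other cluster members (ID and address are placeholders):
"Peering": {
  "Peers": [
    {
      "ID": "12D3KooWPeerIDOfAnotherMember",
      "Addrs": ["/dns4/cluster2.domain/tcp/4001"]
    }
  ]
}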
IPFS Cluster configuration
First, it is important to ensure that the ipfs-cluster-service daemon can operate with ample “file descriptor count” limits. This can be done by adding LimitNOFILE=infinity to the systemd service unit file (in the [Service] section, as shown in the sketch after this list), or by ensuring ulimits are correctly set. The number of file descriptors used by the daemon is proportional to:
- The number of Pebble’s open files and write-ahead log files (Pebble’s configuration)
- The number of items being pinned in parallel (pintracker configuration)
- The number of ongoing add requests.
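A minimal sketch of the relevant parts of such a unit file (paths and user are assumptions to adapt to your setup):
[Service]
User=ipfs
LimitNOFILE=infinity
ExecStart=/usr/local/bin/ipfs-cluster-service daemon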
Additionally, the service.json configuration file contains a few options which should be tweaked according to your environment, capacity and requirements.
cluster section
Cluster regularly performs full-pinset sweeps to make sure, for example, that pins in error state are retried, or that expired pins are unpinned. On very large pinsets, these operations are costly. Thus we recommend increasing the following intervals:
- cluster.pin_recover_interval: how often to trigger pin recovery operations. This triggers ipfs pin ls --type=recursive calls and lists all items in the cluster pinset. For clusters with multimillion pinsets, this should be set to at least 2 hours, or as long as you can tolerate.
- cluster.state_sync_interval: specifies how often the cluster will check for expired pins and trigger unpin operations, which requires visiting every item in the pinset. For clusters with multimillion pinsets, this should be set to at least 2 hours, or as long as you can tolerate.
For large clusters, we recommend leaving repinning disabled (cluster.disable_repinning). As it is implemented now, repinning can trigger re-allocation of content when no more heartbeats (pings) are received from a node. On multi-million pinsets, this can be an expensive operation (especially if the node can recover afterwards). When repinning is enabled, cluster.monitor_ping_interval and monitor.*.check_interval dictate how long cluster takes to realize a peer is not responding (and potentially trigger re-pins). If you enable repinning, we recommend using replication_factor_min and replication_factor_max to allow some leeway: i.e. a 2/3 setting will allow one peer to be down without re-allocating the content assigned to it somewhere else.
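Putting the above together, the cluster section of a large deployment might be tuned roughly as follows (a sketch; adjust the intervals to your pinset size and tolerance):
"cluster": {
  "pin_recover_interval": "2h0m0s",
  "state_sync_interval": "2h0m0s",
  "disable_repinning": true
}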
consensus section
We recommend using crdt consensus only, as raft is mostly unmaintained and won't scale.
Reducing the crdt.rebroadcast_interval (default 1m) to a few seconds should make new peers start downloading the state faster, and badly connected peers will have more opportunities to receive bits of information, at the expense of increased pubsub chatter in the network. It can be reduced to 10 seconds for clusters under 10 peers, but in general we recommend leaving it at 1 minute.
Most importantly, the crdt.batching configuration allows increasing CRDT throughput by orders of magnitude by batching multiple pin updates in a single delta (at the cost of some delay). For example, you can let clusters only commit new pins every minute, or when there are 500 pins in a batch, as follows:
"batching": {
"max_batch_size": 500,
"max_batch_age": "1m",
"max_queue_size": 50000
},
You can edit the crdt.cluster_name, as long as it is the same for all peers.
pin_tracker section
The stateless pin tracker handles two pinning queues: a priority one and a “normal” one. When processing the queues, pins queued in the priority queue always have preference. All new pins go to the priority queue and stay there until they become old or have been retried too many times:
"pin_tracker": {
"stateless": {
"max_pin_queue_size": 200000000,
"concurrent_pins": 10,
"priority_pin_max_age" : "24h",
"priority_pin_max_retries" : 3
}
},
The right value for concurrent_pins depends on the size of the pins and how fast IPFS can fetch and write them to disk. We have observed that values between 10 and 20 tend to work better than larger ones (which may cause too much contention).
ipfs_connector section
Pin requests performed by cluster time out based on the last block fetched by IPFS (not on the total length of the pin request). Therefore the pin_timeout setting can be set very low: 20 seconds will make cluster give up on a pin if no block can be fetched for 20 seconds. Lower pin timeouts let cluster churn through pinning queues faster. Pin errors will be retried later.
The ipfs_request_timeout is set to a conservative 5m value by default. However, on pinsets with millions of items this might be too low and increasing it may be necessary.
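In service.json these settings live under the ipfshttp subsection; a sketch with a low pin timeout and a raised request timeout:
"ipfs_connector": {
  "ipfshttp": {
    "pin_timeout": "20s",
    "ipfs_request_timeout": "15m0s"
  }
}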
api section
Adjust the api.restapi/pinsvcapi/ipfsproxy network timeouts depending on your API usage. This may protect against misuse of the API or DDoS attacks. Note that there are usually client-side timeouts that can be modified too, if you control the clients.
Each of the APIs can be disabled by removing its configuration from the api section.
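As an illustration, the restapi subsection accepts standard HTTP server timeouts (similar fields exist for pinsvcapi and ipfsproxy); the values here are placeholders to adapt to your usage:
"api": {
  "restapi": {
    "read_timeout": "30s",
    "read_header_timeout": "5s",
    "write_timeout": "1m0s",
    "idle_timeout": "2m0s"
  }
}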
informer section
On big clusters with many pins, many peers or large storage sizes, you can increase the metric_ttl for all informers to 5 or 10 minutes, as there is no need for extremely fresh freespace metrics etc. This means that peers will produce a stable set of allocations for new pins as long as the metrics are not refreshed, which allows the rest of the peers to “rest”, giving them some room to deal with other tasks like pin recovery.
If you want to geo-distribute your pins, set up the tags allocator with a region tag and the right value for every peer.
The pinqueue allocator should be set with a high weight_bucket_size, adjusted to how large a pinning queue is allowed to grow before we start pinning somewhere else. The default weight_bucket_size is 100k, which is a very high number. This means that all peers with fewer than 100k items in the queue are considered equal with regards to this metric, allowing the freespace metric to be the deciding factor for pinning allocations (when things are so configured in the balanced allocator section).
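A sketch of an informer section along these lines (the region value and TTLs are examples, to be adapted per peer):
"informer": {
  "disk": {
    "metric_ttl": "5m0s",
    "metric_type": "freespace"
  },
  "tags": {
    "metric_ttl": "5m0s",
    "tags": {
      "region": "eu-west"
    }
  },
  "pinqueue": {
    "metric_ttl": "5m0s",
    "weight_bucket_size": 100000
  }
}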
allocator section
The balanced allocator can help distribute pins and load effectively. We recommend setting allocate_by to something like:
[ "tag:region", "pinqueue", "freespace" ]
Given a replication factor of 3, and a cluster with peers correctly configured in 3 regions, this will place each pin replica in one of the regions, choosing the peer with the most free space in that region among those with the lowest queue weight (“pin queue size / weight_bucket_size”). You can also add additional tag metrics (i.e. [ "tag:region", "tag:availability_zone", ...]). See the pinning guide for more information on how the pinning process happens.
datastore section
Our default configuration for pebble (our default backend) should be tuned to support very large cluster deployments with a very large number of pins. We have modified some of the Pebble defaults that we consider a bit conservative (even though they probably work just fine for 95% of cases). Our choices, with some additional comments, are documented here. All Pebble options are documented at https://pkg.go.dev/github.com/cockroachdb/pebble#Options.
Badger3 is an alternative, particularly for platforms that do not support Pebble. The defaults should be mostly fine too, but it is worth tuning and understanding our defaults and the meaning of the options (https://pkg.go.dev/github.com/dgraph-io/badger/v3#Options).
We heavily discourage using badger and leveldb for anything at this point, but we used to recommend the following Badger settings for a large machine with 128GB of RAM:
"badger": {
"gc_discard_ratio": 0.2,
"gc_interval": "15m0s",
"gc_sleep": "10s",
"badger_options": {
"dir": "",
"value_dir": "",
"sync_writes": false,
"table_loading_mode": 2,
"value_log_loading_mode": 0,
"num_versions_to_keep": 1,
"max_table_size": 268435456,
"level_size_multiplier": 10,
"max_levels": 7,
"value_threshold": 512,
"num_memtables": 10,
"num_level_zero_tables": 10,
"num_level_zero_tables_stall": 20,
"level_one_size": 268435456,
"value_log_file_size": 1073741823,
"value_log_max_entries": 1000000,
"num_compactors": 2,
"compact_l_0_on_close": true,
"read_only": false,
"truncate": false
}
}