Adding new nodes is called “bootstrapping”. The
num_tokens parameter will define the amount of virtual nodes
(tokens) the joining node will be assigned during bootstrap. The tokens define the sections of the ring (token ranges)
the node will become responsible for.
With the default token allocation algorithm the new node will pick
num_tokens random tokens to become responsible
for. Since tokens are distributed randomly, load distribution improves with a higher amount of virtual nodes, but it
also increases token management overhead. The default of 256 virtual nodes should provide a reasonable load balance with
On 3.0+ a new token allocation algorithm was introduced to allocate tokens based on the load of existing virtual nodes
for a given keyspace, and thus yield an improved load distribution with a lower number of tokens. To use this approach,
the new node must be started with the JVM option
<keyspace> is the keyspace from which the algorithm can find the load information to optimize token assignment for.
You may specify a comma-separated list of tokens manually with the
cassandra.yaml parameter, and
if that is specified Cassandra will skip the token allocation process. This may be useful when doing token assignment
with an external tool or when restoring a node with its previous tokens.
After the tokens are allocated, the joining node will pick current replicas of the token ranges it will become responsible for to stream data from. By default it will stream from the primary replica of each token range in order to guarantee data in the new node will be consistent with the current state.
In the case of any unavailable replica, the consistent bootstrap process will fail. To override this behavior and
potentially miss data from an unavailable replica, set the JVM flag
On 2.2+, if the bootstrap process fails, it’s possible to resume bootstrap from the previous saved state by calling
nodetool bootstrap resume. If for some reason the bootstrap hangs or stalls, it may also be resumed by simply
restarting the node. In order to cleanup bootstrap state and start fresh, you may set the JVM startup flag
On lower versions, when the bootstrap proces fails it is recommended to wipe the node (remove all the data), and restart the bootstrap process again.
It’s possible to skip the bootstrapping process entirely and join the ring straight away by setting the hidden parameter
auto_bootstrap: false. This may be useful when restoring a node from a backup or creating a new data-center.
You can take a node out of the cluster with
nodetool decommission to a live node, or
nodetool removenode (to any
other machine) to remove a dead one. This will assign the ranges the old node was responsible for to other nodes, and
replicate the appropriate data there. If decommission is used, the data will stream from the decommissioned node. If
removenode is used, the data will stream from the remaining replicas.
No data is removed automatically from the node being decommissioned, so if you want to put the node back into service at a different token on the ring, it should be removed manually.
num_tokens: 1 it’s possible to move the node position in the ring with
nodetool move. Moving is both a
convenience over and more efficient than decommission + bootstrap. After moving a node,
nodetool cleanup should be
run to remove any unnecessary data.
In order to replace a dead node, start cassandra with the JVM startup flag
-Dcassandra.replace_address_first_boot=<dead_node_ip>. Once this property is enabled the node starts in a hibernate
state, during which all the other nodes will see this node to be down.
The replacing node will now start to bootstrap the data from the rest of the nodes in the cluster. The main difference between normal bootstrapping of a new node is that this new node will not accept any writes during this phase.
Once the bootstrapping is complete the node will be marked “UP”, we rely on the hinted handoff’s for making this node consistent (since we don’t accept writes since the start of the bootstrap).
If the replacement process takes longer than
max_hint_window_in_ms you MUST run repair to make the
replaced node consistent again, since it missed ongoing writes during bootstrapping.
Bootstrap, replace, move and remove progress can be monitored using
nodetool netstats which will show the progress
of the streaming operations.
As a safety measure, Cassandra does not automatically remove data from nodes that “lose” part of their token range due
to a range movement operation (bootstrap, move, replace). Run
nodetool cleanup on the nodes that lost ranges to the
joining node when you are satisfied the new node is up and working. If you do not do this the old data will still be
counted against the load on that node.