Adding, replacing, moving and removing nodes
Bootstrap
Adding new nodes is called "bootstrapping". The num_tokens parameter defines the number of virtual nodes (tokens) the joining node will be assigned during bootstrap. The tokens define the sections of the ring (token ranges) the node will become responsible for.
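As a minimal sketch, num_tokens is set in cassandra.yaml on the joining node before it is started for the first time (the value shown is the default):

```yaml
# cassandra.yaml on the joining node
# Number of virtual nodes (tokens) this node will request during bootstrap.
num_tokens: 256
```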
Token allocation
With the default token allocation algorithm the new node will pick num_tokens random tokens to become responsible for. Since tokens are distributed randomly, load distribution improves with a higher number of virtual nodes, but it also increases token management overhead. The default of 256 virtual nodes should provide a reasonable load balance with acceptable overhead.
On 3.0+, a new token allocation algorithm was introduced that allocates tokens based on the load of existing virtual nodes for a given keyspace, and thus yields an improved load distribution with a lower number of tokens. To use this approach, the new node must be started with the JVM option -Dcassandra.allocate_tokens_for_keyspace=<keyspace>, where <keyspace> is the keyspace from which the algorithm can find the load information to optimize token assignment for.
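For example, the option can be passed when starting the joining node; the keyspace name my_keyspace below is a placeholder for one of your own keyspaces:

```shell
# Start the joining node with load-aware token allocation,
# optimizing placement for the replication settings of "my_keyspace".
cassandra -Dcassandra.allocate_tokens_for_keyspace=my_keyspace
```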
Manual token assignment
You may specify a comma-separated list of tokens manually with the initial_token parameter in cassandra.yaml; if it is specified, Cassandra will skip the token allocation process. This may be useful when assigning tokens with an external tool or when restoring a node with its previous tokens.
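A sketch of manual assignment in cassandra.yaml; the token values below are illustrative only, and num_tokens should match the number of tokens listed:

```yaml
# cassandra.yaml — manual token assignment (skips automatic allocation).
# These three token values are placeholders for tokens computed externally
# or recorded from the node being restored.
num_tokens: 3
initial_token: '-9223372036854775808,-3074457345618258603,3074457345618258602'
```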
Range streaming
After the tokens are allocated, the joining node will pick current replicas of the token ranges it will become responsible for to stream data from. By default it will stream from the primary replica of each token range in order to guarantee data in the new node will be consistent with the current state.
In the case of any unavailable replica, the consistent bootstrap process will fail. To override this behavior and potentially miss data from an unavailable replica, set the JVM flag -Dcassandra.consistent.rangemovement=false.
Resuming a failed or hung bootstrap
On 2.2+, if the bootstrap process fails, it is possible to resume bootstrap from the previously saved state by running nodetool bootstrap resume. If for some reason the bootstrap hangs or stalls, it may also be resumed by simply restarting the node. To clean up the saved bootstrap state and start fresh, set the JVM startup flag -Dcassandra.reset_bootstrap_progress=true.
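The two recovery paths above can be sketched as follows (both assume they are run on the joining node):

```shell
# Resume a failed bootstrap from its last saved state (2.2+):
nodetool bootstrap resume

# Or discard the saved bootstrap state and start fresh on next startup:
cassandra -Dcassandra.reset_bootstrap_progress=true
```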
On lower versions, when the bootstrap process fails it is recommended to wipe the node (remove all of its data) and restart the bootstrap process from scratch.
Removing nodes
You can take a node out of the cluster with nodetool decommission, run on the live node itself, or with nodetool removenode, run from any other node, to remove a dead one. This will assign the ranges the old node was responsible for to other nodes and replicate the appropriate data there. If decommission is used, the data will stream from the decommissioned node; if removenode is used, the data will stream from the remaining replicas.
No data is removed automatically from the node being decommissioned, so if you want to put the node back into service at a different token on the ring, its data should be removed manually.
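The two removal paths can be sketched as follows; the Host ID below is a placeholder (obtain the real one from nodetool status):

```shell
# On the node that is leaving the cluster (node must be live):
nodetool decommission

# From any other node, to remove a dead node by its Host ID:
nodetool removenode 192d3b42-5cb1-4f33-9e2f-0a3b1c4d5e6f
```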
Moving nodes
When num_tokens: 1 it is possible to move a node's position in the ring with nodetool move. Moving is both a convenience over and more efficient than decommission + bootstrap. After moving a node, nodetool cleanup should be run to remove any unnecessary data.
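A sketch of a move on a single-token node; the token value is a placeholder for the target position in the ring:

```shell
# On the node being moved (only valid when num_tokens: 1):
nodetool move 3074457345618258602
```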
Replacing a dead node
To replace a dead node, start Cassandra with the JVM startup flag -Dcassandra.replace_address_first_boot=<dead_node_ip>. While this property is enabled the node starts in a hibernate state, during which all the other nodes will see it as DOWN (DN); the node, however, will see itself as UP (UN). The accurate replacement state can be found in nodetool netstats.
The replacing node will now start to bootstrap the data from the rest of the nodes in the cluster. A replacing node will only receive writes during the bootstrapping phase if it has a different IP address from the node that is being replaced (see CASSANDRA-8523 and CASSANDRA-12344).
Once the bootstrapping is complete the node will be marked "UP".
If any of the following cases apply, you MUST run repair to make the replaced node consistent again, since it missed ongoing writes during/prior to bootstrapping. The replacement timeframe refers to the period from when the node initially dies to when a new node completes the replacement process.
- The node was down for longer than max_hint_window_in_ms before being replaced.
- You are replacing the node using the same IP address it had, and the replacement takes longer than max_hint_window_in_ms.
Monitoring progress
Bootstrap, replace, move and remove progress can be monitored using nodetool netstats, which will show the progress of the streaming operations.
Cleanup data after range movements
As a safety measure, Cassandra does not automatically remove data from nodes that "lose" part of their token range due to a range movement operation (bootstrap, move, replace). Run nodetool cleanup on the nodes that lost ranges to the joining node once you are satisfied the new node is up and working. If you do not, the old data will still be counted against the load on those nodes.
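For example, after a range movement has completed:

```shell
# On each node that gave up ranges, once the new node is confirmed healthy:
nodetool cleanup
```

Cleanup rewrites SSTables to drop data outside the node's current ranges, so it is I/O-intensive; it is usually run one node at a time.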