Find The Misbehaving Nodes
The first step in troubleshooting a Cassandra issue is to use error messages, metrics, and monitoring information to determine whether the issue lies with the clients or the server and, if it lies with the server, to find the problematic nodes in the Cassandra cluster. The goal is to determine whether this is a systemic issue (e.g. a query pattern that affects the entire cluster) or one isolated to a subset of nodes (e.g. neighbors holding a shared token range, or even a single node with bad hardware).
There are many sources of information that help determine where the problem lies. Some of the most common are mentioned below.
Client Logs and Errors
Clients of the cluster often leave the best breadcrumbs to follow. Perhaps client latencies or error rates have increased in a particular datacenter (likely ruling out nodes in other datacenters), or clients are receiving a particular kind of error code indicating a particular kind of problem. Troubleshooters can often rule out many failure modes just by reading the error messages. In fact, many Cassandra error messages include the last coordinator contacted, which gives operators a node to start with.
Some common errors (likely culprit in parentheses), assuming the client uses error names similar to those of the Datastax drivers:
- SyntaxError (client): This and other QueryValidationException errors indicate that the client sent a malformed request. These are rarely server issues and usually indicate bad queries.
- UnavailableException (server): The Cassandra coordinator node rejected the query because it believes that insufficient replica nodes are available. If many coordinators are throwing this error, it likely means that multiple nodes really are down in the cluster, and you can identify them using nodetool status. If only a single coordinator is throwing this error, it may mean that node has been partitioned from the rest.
- OperationTimedOutException (server): This is the most frequent timeout message raised when clients set timeouts. It is a client-side timeout, meaning that the query took longer than the timeout the client specified. The error message includes the coordinator node that was last tried, which is usually a good starting point. This error usually indicates either aggressive client timeout values or slow server coordinators or replicas.
- ReadTimeoutException or WriteTimeoutException (server): These are raised when clients do not specify lower timeouts and the coordinator times out based on the values supplied in the cassandra.yaml configuration file. They usually indicate a serious server-side problem, as the default values are usually multiple seconds.
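The single-coordinator versus many-coordinator reasoning for UnavailableException can be sketched in a few lines of Python. This is only an illustration: the `(error_name, coordinator)` pairs are assumed to have already been parsed out of client logs, the function names are hypothetical, and treating "more than one coordinator" as "many" is a simplifying assumption.

```python
from collections import defaultdict


def coordinators_by_error(events):
    """Group the distinct coordinators that reported each error type.

    `events` is an iterable of (error_name, coordinator) pairs, assumed to
    have been extracted from client logs beforehand (hypothetical format).
    """
    grouped = defaultdict(set)
    for error_name, coordinator in events:
        grouped[error_name].add(coordinator)
    return grouped


def triage_unavailable(events):
    """Suggest where to look based on who raised UnavailableException."""
    coordinators = coordinators_by_error(events).get("UnavailableException", set())
    if not coordinators:
        return "no UnavailableException seen"
    if len(coordinators) == 1:
        # Only one coordinator complains: it may be cut off from its peers.
        return "single coordinator: possibly partitioned from the rest"
    # Several coordinators agree: replicas are probably really down.
    return "multiple coordinators: likely nodes down, check nodetool status"
```

Feeding it a window of recent client errors gives a first hint about whether to inspect one coordinator or the cluster as a whole.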
Metrics
If you have Cassandra metrics reporting to a centralized location such
as Graphite or Grafana, you can typically use those to narrow down the
problem. At this stage, narrowing down the issue to a particular
datacenter, rack, or even group of nodes is the main goal. Some helpful
metrics to look at are:
Errors
Cassandra refers to internode messaging errors as "drops", and provides
a number of Dropped Message Metrics to help narrow down errors. If
particular nodes are actively dropping messages, they are likely related
to the issue.
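One way to tell "actively dropping" apart from a large but stale counter is to compare two samples of the same dropped-message counter. The sketch below assumes you have already collected a per-node cumulative drop count at two points in time; the function name and the input dictionaries are hypothetical, not part of any Cassandra tooling.

```python
def actively_dropping(previous, current):
    """Return {node: new_drops} for nodes whose drop counter grew.

    `previous` and `current` map node name to a cumulative dropped-message
    count sampled at two points in time (hypothetical inputs).
    """
    flagged = {}
    for node, count in current.items():
        if node not in previous:
            # No baseline for this node, so growth cannot be judged.
            continue
        delta = count - previous[node]
        if delta > 0:
            flagged[node] = delta
        # A negative delta usually means the counter was reset (for example
        # by a restart); it is ignored here rather than guessed at.
    return flagged
```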
Latency
For timeouts or latency-related issues, you can start with the table
metrics by comparing coordinator-level metrics, e.g.
CoordinatorReadLatency or CoordinatorWriteLatency, with their associated
replica metrics, e.g. ReadLatency or WriteLatency. Issues usually show
up on the 99th percentile before they show up on the 50th percentile or
the mean. While maximum coordinator latencies are not typically very
helpful due to the exponentially decaying reservoir used internally to
produce metrics, maximum replica latencies that correlate with increased
99th percentiles on coordinators can help narrow down the problem.
There are usually three main possibilities:
- Coordinator latencies are high on all nodes, but only a few nodes' local read latencies are high. This points to slow replica nodes; the high coordinator latencies are just a side effect. This usually happens when clients are not token aware.
- Coordinator latencies and replica latencies increase at the same time on a few nodes. If clients are token aware, this is almost always what happens, and it points to slow replicas of a subset of token ranges (only part of the ring).
- Coordinator and local latencies are high on many nodes. This usually indicates either a tipping point in the cluster capacity (too many writes or reads per second) or a new query pattern.
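The three possibilities above can be expressed as a rough classification over per-node 99th percentile latencies. This is a minimal sketch under stated assumptions: the latencies have already been pulled from your metrics system into dictionaries keyed by node, and `threshold_ms` and `many_fraction` are hypothetical tuning knobs, not Cassandra settings.

```python
def classify_latency_pattern(coordinator_p99, replica_p99, threshold_ms,
                             many_fraction=0.5):
    """Map per-node p99 latencies (ms) to one of the three patterns.

    Both dicts are keyed by node name and assumed to cover the same nodes.
    A node counts as slow when its latency exceeds `threshold_ms`; "many"
    means at least `many_fraction` of all nodes.
    """
    total = len(coordinator_p99)
    slow_coordinators = {n for n, ms in coordinator_p99.items() if ms > threshold_ms}
    slow_replicas = {n for n, ms in replica_p99.items() if ms > threshold_ms}

    if not slow_coordinators and not slow_replicas:
        return "healthy"

    many_coordinators = len(slow_coordinators) >= many_fraction * total
    many_replicas = len(slow_replicas) >= many_fraction * total

    if many_coordinators and many_replicas:
        # High everywhere: capacity tipping point or a new query pattern.
        return "cluster-wide"
    if many_coordinators and slow_replicas:
        # Coordinators suffer broadly because of a few slow replicas.
        return "slow-replicas"
    if slow_coordinators & slow_replicas:
        # The same few nodes are slow in both roles: part of the ring.
        return "slow-token-range"
    return "inconclusive"
```

The labels are only starting points for investigation; real clusters often sit between these cases.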
It’s important to remember that, depending on the client’s load
balancing behavior and consistency levels, coordinator and replica
metrics may or may not correlate. In particular, if you use TokenAware
policies, the same node’s coordinator and replica latencies will often
increase together, but if you just use normal DCAwareRoundRobin,
coordinator latencies can increase along with the latencies of unrelated
replica nodes. For example:
- TokenAware + LOCAL_ONE: coordinator and replica latencies on the same node should always rise together.
- TokenAware + LOCAL_QUORUM: coordinator and multiple replica latencies should always rise together in the same datacenter.
- TokenAware + QUORUM: replica latencies in other datacenters can affect coordinator latencies.
- DCAwareRoundRobin + LOCAL_ONE: coordinator latencies and the latencies of unrelated replica nodes will rise together.
- DCAwareRoundRobin + LOCAL_QUORUM: different coordinator and replica latencies will rise together with little correlation.
Query Rates
Sometimes the table query rate metrics can help narrow down load issues,
as a "small" increase in coordinator queries per second (QPS) may
correlate with a very large increase in replica-level QPS. This most
often happens with BATCH writes, where a client may send a single BATCH
query that contains 50 statements, which, if you have 9 copies (RF=3,
three datacenters), means that every coordinator BATCH write turns into
450 replica writes! This is why keeping BATCHes to the same partition is
so critical; otherwise you can exhaust significant CPU capacity with a
"single" query.
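The amplification arithmetic can be made explicit with a small helper. This is a simplified model, not a description of Cassandra internals: it assumes the same replication factor in every datacenter and counts one replica write per distinct partition per copy. The function and parameter names are hypothetical.

```python
def replica_writes_per_batch(distinct_partitions, replication_factor, datacenters):
    """Estimate replica writes caused by one coordinator BATCH write.

    Simplified model: every distinct partition touched by the batch is
    written once to each copy, and each datacenter holds
    `replication_factor` copies.
    """
    copies = replication_factor * datacenters
    return distinct_partitions * copies


# The example from the text: 50 statements on 50 different partitions,
# RF=3 in each of three datacenters.
spread_out = replica_writes_per_batch(50, 3, 3)

# The same batch confined to a single partition, under the same model.
single_partition = replica_writes_per_batch(1, 3, 3)
```

Under this model the spread-out batch costs 50 times as many replica writes as the single-partition one, which is the effect the coordinator-level QPS graph hides.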