The first step to troubleshooting a Cassandra issue is to use error messages, metrics and monitoring information to identify if the issue lies with the clients or the server and if it does lie with the server find the problematic nodes in the Cassandra cluster. The goal is to determine if this is a systemic issue (e.g. a query pattern that affects the entire cluster) or isolated to a subset of nodes (e.g. neighbors holding a shared token range or even a single node with bad hardware).
There are many sources of information that help determine where the problem lies. Some of the most common are mentioned below.
Clients of the cluster often leave the best breadcrumbs to follow. Perhaps client latencies or error rates have increased in a particular datacenter (likely eliminating other datacenter’s nodes), or clients are receiving a particular kind of error code indicating a particular kind of problem. Troubleshooters can often rule out many failure modes just by reading the error messages. In fact, many Cassandra error messages include the last coordinator contacted to help operators find nodes to start with.
Some common errors (likely culprit in parenthesis) assuming the client has similar error names as the Datastax drivers:
SyntaxError
(client). This and other QueryValidationException
indicate that the client sent a malformed request. These are rarely server
issues and usually indicate bad queries.UnavailableException
(server): This means that the Cassandra
coordinator node has rejected the query as it believes that insufficent
replica nodes are available. If many coordinators are throwing this error it
likely means that there really are (typically) multiple nodes down in the
cluster and you can identify them using nodetool status If only a single coordinator is throwing this error it may
mean that node has been partitioned from the rest.OperationTimedOutException
(server): This is the most frequent
timeout message raised when clients set timeouts and means that the query
took longer than the supplied timeout. This is a client side timeout
meaning that it took longer than the client specified timeout. The error
message will include the coordinator node that was last tried which is
usually a good starting point. This error usually indicates either
aggressive client timeout values or latent server coordinators/replicas.ReadTimeoutException
or WriteTimeoutException
(server): These
are raised when clients do not specify lower timeouts and there is a
coordinator timeouts based on the values supplied in the cassandra.yaml
configuration file. They usually indicate a serious server side problem as
the default values are usually multiple seconds.If you have Cassandra metrics reporting to a centralized location such as Graphite or Grafana you can typically use those to narrow down the problem. At this stage narrowing down the issue to a particular datacenter, rack, or even group of nodes is the main goal. Some helpful metrics to look at are:
Cassandra refers to internode messaging errors as “drops”, and provided a number of Dropped Message Metrics to help narrow down errors. If particular nodes are dropping messages actively, they are likely related to the issue.
For timeouts or latency related issues you can start with Table
Metrics by comparing Coordinator level metrics e.g.
CoordinatorReadLatency
or CoordinatorWriteLatency
with their associated
replica metrics e.g. ReadLatency
or WriteLatency
. Issues usually show
up on the 99th
percentile before they show up on the 50th
percentile or
the mean
. While maximum
coordinator latencies are not typically very
helpful due to the exponentially decaying reservoir used internally to produce
metrics, maximum
replica latencies that correlate with increased 99th
percentiles on coordinators can help narrow down the problem.
There are usually three main possibilities:
It’s important to remember that depending on the client’s load balancing
behavior and consistency levels coordinator and replica metrics may or may
not correlate. In particular if you use TokenAware
policies the same
node’s coordinator and replica latencies will often increase together, but if
you just use normal DCAwareRoundRobin
coordinator latencies can increase
with unrelated replica node’s latencies. For example:
TokenAware
+ LOCAL_ONE
: should always have coordinator and replica
latencies on the same node rise togetherTokenAware
+ LOCAL_QUORUM
: should always have coordinator and
multiple replica latencies rise together in the same datacenter.TokenAware
+ QUORUM
: replica latencies in other datacenters can
affect coordinator latencies.DCAwareRoundRobin
+ LOCAL_ONE
: coordinator latencies and unrelated
replica node’s latencies will rise together.DCAwareRoundRobin
+ LOCAL_QUORUM
: different coordinator and replica
latencies will rise together with little correlation.Sometimes the Table query rate metrics can help
narrow down load issues as “small” increase in coordinator queries per second
(QPS) may correlate with a very large increase in replica level QPS. This most
often happens with BATCH
writes, where a client may send a single BATCH
query that might contain 50 statements in it, which if you have 9 copies (RF=3,
three datacenters) means that every coordinator BATCH
write turns into 450
replica writes! This is why keeping BATCH
’s to the same partition is so
critical, otherwise you can exhaust significant CPU capacitity with a “single”
query.
Once you have narrowed down the problem as much as possible (datacenter, rack , node), login to one of the nodes using SSH and proceed to debug using logs, nodetool, and os tools. If you are not able to login you may still have access to logs and nodetool remotely.