Apache Cassandra is a highly scalable and reliable database. Cassandra is used in web based applications that serve large number of clients and the quantity of data processed is web-scale (Petabyte) large. Cassandra makes some guarantees about its scalability, availability and reliability. To fully understand the inherent limitations of a storage system in an environment in which a certain level of network partition failure is to be expected and taken into account when designing the system it is important to first briefly introduce the CAP theorem.
According to the CAP theorem it is not possible for a distributed data store to provide more than two of the following guarantees simultaneously.
Consistency: Consistency implies that every read receives the most recent write or errors out
Availability: Availability implies that every request receives a response. It is not guaranteed that the response contains the most recent write or data.
Partition tolerance: Partition tolerance refers to the tolerance of a storage system to failure of a network partition. Even if some of the messages are dropped or delayed the system continues to operate.
CAP theorem implies that when using a network partition, with the inherent risk of partition failure, one has to choose between consistency and availability and both cannot be guaranteed at the same time. CAP theorem is illustrated in Figure 1.
Figure 1. CAP Theorem
High availability is a priority in web based applications and to this objective Cassandra chooses Availability and Partition Tolerance from the CAP guarantees, compromising on data Consistency to some extent.
Cassandra makes the following guarantees.
Eventual Consistency of writes to a single table
Lightweight transactions with linearizable consistency
Batched writes across multiple tables are guaranteed to succeed completely or not at all
Secondary indexes are guaranteed to be consistent with their local replicas data
Cassandra is a highly scalable storage system in which nodes may be added/removed as needed. Using gossip-based protocol a unified and consistent membership list is kept at each node.
Cassandra guarantees high availability of data by implementing a fault-tolerant storage system. Failure detection in a node is detected using a gossip-based protocol.
Cassandra guarantees data durability by using replicas. Replicas are multiple copies of a data stored on different nodes in a cluster. In a multi-datacenter environment the replicas may be stored on different datacenters. If one replica is lost due to unrecoverable node/datacenter failure the data is not completely lost as replicas are still available.
Meeting the requirements of performance, reliability, scalability and high availability in production Cassandra is an eventually consistent storage system. Eventually consistent implies that all updates reach all replicas eventually. Divergent versions of the same data may exist temporarily but they are eventually reconciled to a consistent state. Eventual consistency is a tradeoff to achieve high availability and it involves some read and write latencies.
Data must be read and written in a sequential order. Paxos consensus protocol is used to implement lightweight transactions. Paxos protocol implements lightweight transactions that are able to handle concurrent operations using linearizable consistency. Linearizable consistency is sequential consistency with real-time constraints and it ensures transaction isolation with compare and set (CAS) transaction. With CAS replica data is compared and data that is found to be out of date is set to the most consistent value. Reads with linearizable consistency allow reading the current state of the data, which may possibly be uncommitted, without making a new addition or update.
The guarantee for batched writes across multiple tables is that they will eventually succeed, or none will. Batch data is first written to batchlog system data, and when the batch data has been successfully stored in the cluster the batchlog data is removed. The batch is replicated to another node to ensure the full batch completes in the event the coordinator node fails.