Cassandra Documentation

DataStax glossary

A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | R | S | T | U | V | W | X, Y, Z

A

anti-entropy

The synchronization of replica data on nodes to ensure that the data is fresh.

Approximate Nearest Neighbor (ANN)

A machine learning algorithm that locates the most similar vectors to a given item in a dataset.

authentication

Process of establishing the identity of a user or application.

authorization

Process of establishing permissions to database resources through roles.

B

back pressure

Pausing or blocking incoming requests once a buffering threshold is reached, until internal processing of the already-buffered requests catches up.

bloom filter

An off-heap structure associated with each SSTable that checks if any data for the requested row exists in the SSTable before doing any disk I/O.

bootstrap

The process by which new nodes join the cluster, transparently gathering the data they need from existing nodes.

C

cardinality

The number of unique values in a column. For example, a column of ID numbers unique for each employee would have high cardinality while a column of employee ZIP codes would have low cardinality because multiple employees can have the same ZIP code.

An index on a column with low cardinality can boost read performance because the index is significantly smaller than the column. An index for a high-cardinality column may reduce performance. If your application requires a search on a high-cardinality column, a materialized view is ideal.
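
For illustration, a minimal CQL sketch (the employees table and its columns are hypothetical) of indexing a low-cardinality column:

  CREATE TABLE employees (
      id uuid PRIMARY KEY,
      name text,
      zip_code text
  );

  -- Secondary index on the low-cardinality zip_code column
  CREATE INDEX ON employees (zip_code);

  -- Reads can then filter on the indexed column
  SELECT * FROM employees WHERE zip_code = '95050';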

cell

The smallest increment of stored data. Contains a value in a row-column intersection.

cluster

Two or more database instances that exchange messages using the gossip protocol.

clustering

The storage engine process that creates an index and keeps data in order based on the index.

clustering column

In the table definition, a clustering column is a column that is part of the compound primary key definition. Note that the clustering column cannot be the first column because that position is reserved for the partition key. Columns are clustered in multiple rows within a single partition. The clustering order is determined by the position of columns in the compound primary key definition.
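
A minimal CQL sketch (the sensor_readings table and its columns are hypothetical) showing a clustering column and an explicit clustering order:

  CREATE TABLE sensor_readings (
      sensor_id uuid,          -- partition key
      reading_time timestamp,  -- clustering column
      value double,
      PRIMARY KEY (sensor_id, reading_time)
  ) WITH CLUSTERING ORDER BY (reading_time DESC);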

coalescing strategy

Strategy to combine multiple network messages into a single packet for outbound TCP connections to nodes in the same data center (intra-DC) or to nodes in a different data center (inter-DC). A coalescing strategy is provided with a blocking queue of pending messages and an output collection for messages to send.

column

The smallest increment of data. Contains a name, a value, and a timestamp.

column family

A container for rows, similar to the table in a relational system. Called a table in CQL 3.

commit log

A file to which the database appends changed data for recovery in the event of a hardware failure.

compaction

The process of consolidating SSTables, discarding tombstones, and regenerating the SSTable index. The available compaction strategies are described in their own entries: LeveledCompactionStrategy (LCS), SizeTieredCompactionStrategy (STCS), TimeWindowCompactionStrategy (TWCS), and UnifiedCompactionStrategy (UCS).

composite partition key

A partition key consisting of multiple columns.

compound primary key

A primary key consisting of the partition key, which determines the node on which data is stored, and one or more additional columns that determine clustering.
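
A minimal CQL sketch (the user_events table and its columns are hypothetical) combining a composite partition key with a clustering column:

  CREATE TABLE user_events (
      tenant_id uuid,
      day date,
      event_time timestamp,
      event_type text,
      -- (tenant_id, day) is the composite partition key;
      -- event_time is the clustering column.
      PRIMARY KEY ((tenant_id, day), event_time)
  );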

consistency

The synchronization of data on replicas in a cluster. Consistency is categorized as weak or strong.

consistency level

A setting that defines a successful write or read by the number of cluster replicas that acknowledge the write or respond to the read request, respectively.
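
In the CQL shell (cqlsh), the consistency level can be set per session; a minimal sketch:

  -- Require a quorum of replicas to acknowledge writes and answer reads
  CONSISTENCY QUORUM;

  -- Show the current consistency level
  CONSISTENCY;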

coordinator node

The node that determines which nodes in the ring should get the request, based on the cluster's configured snitch.

cosine similarity

A metric measuring the similarity between two non-zero vectors in a multi-dimensional space. It quantifies the cosine of the angle between the vectors, which reflects their orientation relative to each other. One (1) indicates the same orientation (complete similarity), zero (0) indicates orthogonal vectors (no similarity), and negative one (-1) indicates exactly opposite orientation.
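
For two vectors A and B, the metric is:

  \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}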

CQL shell

The Cassandra Query Language shell (cqlsh) utility.

cross-data center forwarding

A technique for optimizing replication across datacenters by sending data from one datacenter to a node in another datacenter. The receiving node then forwards the data to the other nodes in its own datacenter.

D

datacenter

A group of related nodes that are configured together within a cluster for replication and workload segregation purposes. Not necessarily a separate location or physical data center. Datacenter names are case sensitive and cannot be changed.

data type

A particular kind of data item, defined by the values it can take or the operations that can be performed on it.

denormalization

Denormalization is the process of optimizing the read performance of a database by adding redundant data, typically by duplicating data across multiple tables or by grouping data to match the queries that will be run.

E

EBNF

EBNF (Extended Backus-Naur Form) syntax expresses a context-free grammar that formally describes a language. EBNF extends its precursor BNF (Backus-Naur Form) with additional operators allowed in expansions. Syntax (railroad) diagrams graphically depict EBNF grammars.

embeddings

A mathematical technique in machine learning where complex, high-dimensional data is represented as points in a lower-dimensional space. The process of creating an embedding preserves the relevant properties of the original data, such as distance and similarity, enabling easier computational processing. For instance, words with similar meanings in Natural Language Processing (NLP) can be set close to each other in the reduced space, facilitating their use in machine learning models.

Euclidean distance

A coordinate geometry non-negative distance metric between two points, quantifying the similarity or dissimilarity between those data points represented as vectors. Use it to compare generated samples to real data points.
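
For two n-dimensional points p and q represented as vectors, the distance is:

  d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}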

eventual consistency

The database maximizes availability and partition tolerance. The database ensures eventual data consistency by updating all replicas during read operations and by periodically checking and updating any replicas not directly accessed. This updating and checking ensures that all replicas of any given row eventually become completely consistent with each other, even though an individual read is not guaranteed to see the most recent write.

F

G

garbage collector

A Java background process that frees heap memory when it is no longer in use by the program. The main Java algorithms to allocate and clean up memory are Concurrent Mark Sweep (CMS) and Garbage-First (G1).

gossip

A peer-to-peer communication protocol for exchanging location and state information between nodes.

H

HDD

A hard disk drive (HDD), or spinning disk, is a data storage device used for storing and retrieving digital information using one or more rigid, rapidly rotating disks. Compare to SSD.

HDFS

Hadoop Distributed File System (HDFS) stores data across the nodes of a Hadoop cluster to improve performance. HDFS is a necessary component, in addition to MapReduce, in a Hadoop distribution.

headroom

The amount of disk space required by a process (such as compaction) in addition to the space occupied by the data being processed.

hint

One of the three ways, in addition to read-repair and full/incremental anti-entropy repair, that Cassandra implements the eventual consistency guarantee that all updates are eventually received by all replicas.

I

idempotent

An operation that can occur multiple times without changing the result, such as performing the same update multiple times without affecting the outcome.

immutable

Data on a disk that cannot be overwritten.

index

A native capability for locating data by a column other than the primary key.

J

Jaccard similarity

A measure of similarity between two sets of features or elements in generated data and real data. The mathematical calculation is the size of the intersection of two sets divided by the size of their union, and ranges from zero (0) to one (1). One (1) indicates identical sets.
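
For two sets A and B, the measure is:

  J(A, B) = \frac{|A \cap B|}{|A \cup B|}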

K

keyspace

A namespace container that defines how data is replicated on nodes in each datacenter.
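
A minimal CQL sketch (the keyspace and datacenter names are hypothetical) of per-datacenter replication settings:

  CREATE KEYSPACE cycling
    WITH replication = {
      'class': 'NetworkTopologyStrategy',
      'DC1': 3,
      'DC2': 2
    };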

L

LeveledCompactionStrategy (LCS)

This compaction strategy creates SSTables of a fixed, relatively small size that are grouped into levels. Within each level, SSTables are guaranteed to be non-overlapping. Each level (L0, L1, L2, and so on) is ten times as large as the previous level. Disk I/O is more uniform and predictable on higher levels than on lower levels as SSTables are continuously being compacted into progressively larger levels. At each level, row keys are merged into non-overlapping SSTables in the next level. This process improves performance for reads because the database can determine which SSTables in each level to check for the existence of row key data.

linearizable consistency

Also called serializable consistency, linearizable consistency is the restriction that one operation cannot be executed unless and until another operation has completed.

The database supports lightweight transactions to ensure linearizable consistency in writes. The first phase of a lightweight transaction works at SERIAL consistency and follows the Paxos protocol to ensure that the required operation succeeds. If this phase succeeds, the write is performed at the consistency level specified for the operation. Reads performed at the SERIAL consistency level execute without the database's built-in read repair operations.

listen address

The address or interface that a node binds to and advertises to other Cassandra nodes for connections.

M

Machine Learning (ML)

A branch of artificial intelligence (AI) and computer science that uses and develops computer systems capable of learning and adapting without explicit instruction. ML uses algorithms and statistical models to analyze data, identify patterns, make decisions, and improve over time.

MapReduce

Hadoop’s parallel processing engine that quickly processes large data sets. A necessary component, in addition to HDFS, in a Hadoop distribution.

materialized view

A materialized view is a table whose data is automatically inserted and updated from another base table. It has a primary key that differs from the base table's, allowing it to serve different queries.
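
A minimal CQL sketch (the users table and view name are hypothetical) of a view keyed differently from its base table:

  CREATE TABLE users (
      id uuid PRIMARY KEY,
      email text,
      name text
  );

  -- Every primary key column of the view must be restricted with IS NOT NULL
  CREATE MATERIALIZED VIEW users_by_email AS
    SELECT id, email, name
    FROM users
    WHERE email IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY (email, id);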

memtable

A database table-specific, in-memory data structure that resembles a write-back cache.

mutation

A mutation is either an insertion or a deletion.

N

Natural Language Processing (NLP)

A field of artificial intelligence that helps computers interpret, process, and generate human language.

node

A Java virtual machine (a platform-independent execution environment that converts Java bytecode into machine language and executes it) that runs an instance of the database software.

node repair

A process that makes all data on a replica consistent.

normalization

Normalization refers to a series of steps used to eliminate redundancy and reduce the chances of data inconsistency in a database’s schema. In DataStax Enterprise, this process is inefficient because joining data in multiple tables for queries requires accessing more nodes.

O

OLTP

Online transaction processing (OLTP) is characterized by a large number of short on-line transactions for data entry and retrieval.

P

partition

A partition is a collection of data addressable by a key. This data resides on one node in a Cassandra cluster. A partition is replicated on as many nodes as the replication factor specifies.

partition index

A list of primary keys and the start position of data.

partition key

A partition key represents a logical entity that helps a Cassandra cluster know on which node the requested data resides.

The partition key is the first column declared in the primary key definition or, in a composite partition key, the set of columns (enclosed in parentheses) that together form the first element of the primary key definition.

partition range

The limits of the partition range differ depending on the configured partitioner. The Murmur3Partitioner (default) range is -2^63 to +2^63, and the RandomPartitioner range is 0 to 2^127-1.

partition summary

A subset of the partition index. By default, 1 partition key out of every 128 is sampled.

Partitioner

Distributes data across a cluster. The types of partitioners are Murmur3Partitioner (default), RandomPartitioner, and OrderPreservingPartitioner.

primary key

One or more columns that uniquely identify a row in a table. The primary key is composed of the partition key and, optionally, one or more clustering columns.

R

range movement

A change in the range of tokens assigned to a node.

read repair

A process that updates database replicas with the most recent version of frequently read data.

replica

A copy of a portion of the whole database. Each node holds some replicas.

replica placement strategy

A specification that determines the replicas for each row of data.

replication factor (RF)

The total number of replicas across the cluster, abbreviated as RF. A replication factor of 1 means that there is only one copy of each row in the cluster. If the node containing the row goes down, the row cannot be retrieved. A replication factor of 2 indicates two copies of each row and that each copy is on a different node. All replicas are equally important; there is no primary or master replica.

replication group

See datacenter.

role

A set of permissions assigned to users that limits their access to database resources. When using internal authentication, roles can also have passwords and represent a single user, DSE client tool, or application.

rolling restart

A procedure for upgrading the nodes in a cluster with zero downtime. Nodes are upgraded and restarted one at a time while the other nodes continue to operate online.

row

1) Columns that have the same primary key.
2) A collection of cells per combination of columns in the storage engine.

row cache

A database component for improving the performance of read-intensive operations. In off-heap memory, the row cache holds the most recently read rows from the local SSTables. Each local read operation stores its result set in the row cache and sends it to the coordinator node. The next read first checks the row cache. If the required data is there, the database returns it immediately. This initial read can save further seeks in the Bloom filter, partition key cache, partition summary, partition index, and SSTables.

The database uses LRU (least-recently-used) eviction to ensure that the row cache is refreshed with the most frequently accessed rows. The size of the row cache can be configured in the cassandra.yaml file.

S

seed

A seed, or seed node, is used to bootstrap the gossip process for new nodes joining a cluster. A seed node provides no other function and is not a single point of failure for a cluster.

serializable consistency

See linearizable consistency.

SizeTieredCompactionStrategy (STCS)

The default compaction strategy. This strategy triggers a minor compaction when there are a number of similarly sized SSTables on disk, as configured by the table subproperty min_threshold. A minor compaction does not involve all the tables in a keyspace. Also see the STCS compaction subproperties in the relevant CQL documentation.
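
A minimal CQL sketch (the keyspace and table names are hypothetical) of setting this strategy and its min_threshold subproperty:

  ALTER TABLE cycling.events
    WITH compaction = {
      'class': 'SizeTieredCompactionStrategy',
      'min_threshold': 4
    };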

slice

A set of clustered columns in a partition that you query as a set using, for example, a conditional WHERE clause.
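
A minimal CQL sketch (reusing the hypothetical sensor_readings table above) that reads a slice of clustered rows within a single partition:

  SELECT reading_time, value
  FROM sensor_readings
  WHERE sensor_id = 5b6962dd-3f90-4c93-8f61-eabfa4a803e2
    AND reading_time >= '2024-01-01'
    AND reading_time < '2024-02-01';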

Snitch

The mapping from the IP addresses of nodes to physical and virtual locations, such as racks and datacenters. The request routing mechanism is affected by which of the several types of snitches is used.

SSD

A solid-state drive (SSD) is a solid-state storage device that uses integrated circuits to persistently store data. Compare to HDD.

SSTable

A sorted string table (SSTable) is an immutable data file to which the database writes memtables periodically. SSTables are stored on disk sequentially and maintained for each database table.

static column

A special column that is shared by all rows of a partition.
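
A minimal CQL sketch (the orders table and its columns are hypothetical) of a static column shared by all rows in a partition:

  CREATE TABLE orders (
      customer_id uuid,
      order_id timeuuid,
      customer_name text STATIC,  -- one shared value per partition
      amount decimal,
      PRIMARY KEY (customer_id, order_id)
  );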

streaming

A component that handles data exchange among nodes in a cluster. Data is exchanged in the form of SSTable files.

Examples include:

  • When bootstrapping a new node, the new node gets data from existing nodes using streaming.

  • When running nodetool repair, nodes exchange out-of-sync data using streaming.

  • When bulkloading data from backup, sstableloader uses streaming to complete a task.

strong consistency

When reading data, the database performs read repair before returning results.

superuser

Superuser is a role attribute that provides root database access. Superusers have all permissions on all objects. Apache Cassandra databases include the superuser role cassandra with password cassandra by default. This account runs queries, including logins, with a consistency level of QUORUM. It is recommended to create your own superuser role for deployments and remove the default cassandra role.

T

table

A collection of columns ordered by name and fetched by row. A row consists of columns and has a primary key; the first part of the key is the partition key, and subsequent parts of a compound key are clustering columns that define the order of data within the partition.

TimeWindowCompactionStrategy (TWCS)

This compaction strategy compacts SSTables based on a series of time windows. During the current time window, the SSTables are compacted into one or more SSTables. At the end of the current time window, all SSTables are compacted into a single larger SSTable. The compaction process repeats at the start of the next time window. Each TWCS time window contains data within a specified range and contains varying amounts of data.

token

An element on the ring that depends on the partitioner. Determines the node’s position on the ring and the portion of data for which it is responsible. The range for the Murmur3Partitioner (default) is -2^63 to +2^63. The range for the RandomPartitioner is 0 to 2^127-1.

tombstone

A marker in a row that indicates a column was deleted. During compaction, marked columns are deleted.

TTL

Time-to-live (TTL) is an optional expiration date for values that are inserted into a column.
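
A minimal CQL sketch (the sessions table and its columns are hypothetical) of writing a value that expires after one day:

  CREATE TABLE sessions (
      session_id uuid PRIMARY KEY,
      username text
  );

  INSERT INTO sessions (session_id, username)
  VALUES (uuid(), 'alice')
  USING TTL 86400;  -- the value expires after 86400 seconds (one day)

  -- Check the remaining time to live, in seconds
  SELECT TTL(username) FROM sessions;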

tunable consistency

The database ensures that all replicas of any given row eventually become completely consistent. For situations requiring immediate and complete consistency, the database can be tuned to provide 100% consistency for specified operations, datacenters, or clusters. The database cannot be tuned to complete consistency for all data and operations.

U

UnifiedCompactionStrategy (UCS)

This compaction strategy unifies the applications of leveled, tiered, and time-windowed compaction, including combinations of leveled and tiered at different levels of the compaction hierarchy. UCS can work in modes similar to STCS (with w = T4 matching STCS's default threshold of 4) and LCS (with w = L10 matching LCS's default fan factor of 10), and it can also work well for time-series workloads when used with a large tiered fan factor (for example, w = T20). Read-heavy workloads, especially those that cannot benefit from bloom filters or time order (that is, wide-partition, non-time-series workloads), are best served by leveled configurations. Write-heavy, time-series, or key-value workloads are best served by tiered configurations.

upsert

A change in the database that updates a specified column in a row if the column exists. If the column does not exist, then that column is inserted.
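
A minimal CQL sketch (reusing the hypothetical sessions table above) of upsert behavior, where both UPDATE and INSERT write the value whether or not it already exists:

  -- Writes the column even if the row did not previously exist
  UPDATE sessions
  SET username = 'bob'
  WHERE session_id = 1b4a2f0e-8c3d-4e5f-9a6b-7c8d9e0f1a2b;

  -- Repeating the same INSERT simply overwrites the earlier value
  INSERT INTO sessions (session_id, username)
  VALUES (1b4a2f0e-8c3d-4e5f-9a6b-7c8d9e0f1a2b, 'bob');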

V

Vector

An array of floating point type that represents a specific object or entity.

The database compares vectors by determining the distance between them: the closer they are, the more similar the data; the greater the distance, the less similar the data.

Vnode

Vnode is a virtual node. Normally, nodes are responsible for a single partitioning range in the full token range of a cluster. With vnodes enabled, each node is responsible for several virtual nodes, effectively spreading a partitioning range across more nodes in the cluster. Enabling vnodes can reduce the risk of hotspotting or straining one node in the cluster.

W

weak consistency

When reading data, the database performs read repair after returning results.

wide row

A data partition that CQL 3 transposes into familiar row-based resultsets.

X, Y, Z

zombie

A row or cell that reappears in a database table after deletion. This can happen if a node goes down for a long period of time and is then restored without being repaired.

Deleted data is not erased from database tables; it is marked with tombstones until compaction. The tombstones created on one node must be propagated to the nodes containing the deleted data. If one of these nodes goes down before this happens, the node may not receive the most up-to-date tombstones. If the node is not repaired before it comes back online, the database finds the non-tombstoned items and propagates them to other nodes as new data.

To avoid this problem, run nodetool repair on any restored node before rejoining it to its cluster.