Apache Cassandra 5.0 is the project’s major release for 2023, and it promises some of the biggest changes for Cassandra to date. After more than a decade of engineering work dedicated to stabilizing and building Cassandra as a distributed database, we now look forward to introducing a host of exciting features and enhancements that empower users to take their data-driven applications to the next level, including machine learning and artificial intelligence.

This blog series aims to give a deeper dive into some of the key features of Cassandra 5.0.

When it comes to managing and retrieving large volumes of data at high speeds, Apache Cassandra is one of the top choices for distributed NoSQL databases. Its robust architecture and scalability make it a go-to solution for organizations dealing with vast amounts of data. Storage Attached Indexes (SAI) are one of the key features within the upcoming Apache Cassandra 5.0 release that stand to make Cassandra even more powerful. In this blog post, we’ll explore what Apache Cassandra Storage Attached Indexes are, why they are valuable, and how they can help you optimize your data retrieval operations.

What are Storage Attached Indexes?

Apache Cassandra is designed to handle massive datasets distributed across multiple nodes, making it a popular choice for applications that require high availability and fault tolerance. However, retrieving data from a Cassandra cluster efficiently can be challenging due to its distributed nature and NoSQL data model. This is where Storage Attached Indexes come into play.

Storage Attached Indexes, or SAI, are a new feature in Apache Cassandra 5.0. SAI provides an indexing mechanism that is closely integrated with the Cassandra storage engine to make data modeling easier for users. Unlike the traditional Cassandra secondary indexes, which had their limitations, SAI is designed to address these limitations and provide a more efficient and versatile indexing solution.

The Benefits of Storage Attached Indexes with Apache Cassandra

1. Improved Query Performance

The primary value proposition of SAI is its ability to significantly improve query performance. With traditional secondary indexes, queries often required full scans of the data, resulting in slow response times as the dataset grew. SAI, on the other hand, uses an efficient indexing mechanism that dramatically reduces the amount of data scanned during query execution. (See ‘Reduced Storage Overhead’ below for more on improved legacy index speeds.) SAI achieves this by building a more compact and precise index structure. This allows for faster data retrieval and reduced I/O operations, making it ideal for applications that require low-latency responses.

2. Reduced Storage Overhead

Traditional secondary indexes in Cassandra often result in significant storage overhead, which can become problematic for applications with large datasets as the storage overhead grows linearly as the number of indexed columns on a table increases. SAI addresses this issue by minimizing the storage footprint required for indexing. Because it shares index elements common to an SSTable, SAI disk usage grows only marginally as more columns are indexed on a table/SSTable. By reducing storage overhead, SAI helps organizations save on infrastructure costs while still benefiting from improved query performance.

3. Support for a Wide Range of Queries

One of the drawbacks of traditional Cassandra secondary indexes was their inability to support a wide range of query types. SAI, however, is designed to support complex queries, including numeric range queries, test suffix/prefix queries and boolean queries, making it a versatile solution for various use cases.

Whether you need to retrieve data based on specific criteria or perform analytics on your Cassandra dataset, SAI can handle it efficiently.

4. Seamless Integration

Apache Cassandra Storage Attached Indexes are designed to seamlessly integrate with existing Cassandra data models. This means that organizations can take advantage of SAI without the need for a complete overhaul of their database design. It offers backward compatibility with existing data structures, making the transition to SAI relatively smooth.

5. Eases Operational Task Burden

Apache Cassandra Storage Attached Indexes are attached to the normal Memtable/SSTable lifecycle, which means the index data mimics the structure/locality of backing table data. This makes operational tasks like streaming, importing data, and partial/resumable bootstrapping ultimately easier to execute.

How to Get Started with Storage Attached Indexes

To leverage the value of Apache Cassandra Storage Attached Indexes, you will need to upgrade to Cassandra 5.0. Once you’ve upgraded, you can start using SAI for your data modeling and query optimization needs.

To get started with Storage-Attached Indexing (SAI), we’ll do the following steps:

  • Create a database and keyspace.

  • Create a table.

  • Create an index using SAI.

  • Add data.

  • Create and run a query using SAI.

The examples in the following quickstart topic show SAI indexes with non-partition key columns.

Create a database and keyspace

Follow the preliminary steps in the Astra DB documentation to create and launch your Astra DB database. In those steps, you’ll specify a keyspace. Name it cycling.

Create a table

Using cqlsh or the CQL Console, create a cyclist_semi_pro database table in the cycling keyspace or the keyspace name of your choice:

CREATE TABLE IF NOT EXISTS cycling.cyclist_semi_pro (
  id int,
  firstname text,
  lastname text,
  age int,
  affiliation text,
  country text,
  registration date,
  PRIMARY KEY (id));

Create an index using SAI

To test a non-trivial query, you’ll need some SAI indexes. Use CREATE CUSTOM INDEX commands to create SAI indexes on a few non-primary-key columns in the cyclist_semi_pro table:

CREATE CUSTOM INDEX lastname_sai_idx ON cycling.cyclist_semi_pro (lastname)
USING 'Sai'
WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'};

CREATE CUSTOM INDEX age_sai_idx ON cycling.cyclist_semi_pro (age)
USING 'Sai';

CREATE CUSTOM INDEX country_sai_idx ON cycling.cyclist_semi_pro (country)
USING 'Sai'
WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'};

CREATE CUSTOM INDEX registration_sai_idx ON cycling.cyclist_semi_pro (registration)
USING 'Sai';

Note: Index names are unique per keyspace. The index name must be a unique identifier for the index for each table within a keyspace. This requirement is true for both vector and non-vector indexes.

Let’s take a look at the description of the table and its indexes:

QUERY

DESCRIBE TABLE cycling.cyclist_semi_pro;

RESULT

CREATE TABLE cycling.cyclist_semi_pro (
    id int PRIMARY KEY,
    affiliation text,
    age int,
    country text,
    firstname text,
    lastname text,
    registration date
) WITH additional_write_policy = '99PERCENTILE'
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND nodesync = {'enabled': 'true', 'incremental': 'true'}
    AND read_repair = 'BLOCKING'
    AND speculative_retry = '99PERCENTILE';
CREATE CUSTOM INDEX registration_sai_idx ON cycling.cyclist_semi_pro (registration) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX country_sai_idx ON cycling.cyclist_semi_pro (country) USING 'StorageAttachedIndex' WITH OPTIONS = {'normalize': 'true', 'case_sensitive': 'false', 'ascii': 'true'};
CREATE CUSTOM INDEX age_sai_idx ON cycling.cyclist_semi_pro (age) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX lastname_sai_idx ON cycling.cyclist_semi_pro (lastname) USING 'StorageAttachedIndex' WITH OPTIONS = {'normalize': 'true', 'case_sensitive': 'false', 'ascii': 'true'};

Add data to your table

Use CQLSH INSERT commands to add some data to the cyclist_semi_pro database table:

INSERT INTO cycling.cyclist_semi_pro (id, firstname, lastname, age, affiliation, country, registration) VALUES (1, 'Carlos', 'Perotti', 22, 'Recco Club', 'ITA', '2020-01-12');
INSERT INTO cycling.cyclist_semi_pro (id, firstname, lastname, age, affiliation, country, registration) VALUES (2, 'Giovani', 'Pasi', 19, 'Venezia Velocità', 'ITA', '2016-05-15');
INSERT INTO cycling.cyclist_semi_pro (id, firstname, lastname, age, affiliation, country, registration) VALUES (3, 'Frances', 'Giardello', 24, 'Menaggio Campioni', 'ITA', '2018-07-29');
INSERT INTO cycling.cyclist_semi_pro (id, firstname, lastname, age, affiliation, country, registration) VALUES (4, 'Mark', 'Pastore', 19, 'Portofino Ciclisti', 'ITA', '2017-06-16');
INSERT INTO cycling.cyclist_semi_pro (id, firstname, lastname, age, affiliation, country, registration) VALUES (5, 'Irene', 'Cantona', 24, 'Como Velocità', 'ITA', '2012-07-22');
INSERT INTO cycling.cyclist_semi_pro (id, firstname, lastname, age, affiliation, country, registration) VALUES (6, 'Hugo', 'Herrera', 23, 'Bellagio Ciclisti', 'ITA', '2004-02-12');
INSERT INTO cycling.cyclist_semi_pro (id, firstname, lastname, age, affiliation, country, registration) VALUES (7, 'Marcel', 'Silva', 21, 'Paris Cyclistes', 'FRA', '2018-04-28');
INSERT INTO cycling.cyclist_semi_pro (id, firstname, lastname, age, affiliation, country, registration) VALUES (8, 'Theo', 'Bernat', 19, 'Nice Cavaliers', 'FRA', '2007-05-15');
INSERT INTO cycling.cyclist_semi_pro (id, firstname, lastname, age, affiliation, country, registration) VALUES (9, 'Richie', 'Draxler', 24, 'Normandy Club', 'FRA', '2011-02-26');
INSERT INTO cycling.cyclist_semi_pro (id, firstname, lastname, age, affiliation, country, registration) VALUES (10, 'Agnes', 'Cavani', 22, 'Chamonix Hauteurs', 'FRA', '2020-01-02');
INSERT INTO cycling.cyclist_semi_pro (id, firstname, lastname, age, affiliation, country, registration) VALUES (11, 'Pablo', 'Verratti', 19, 'Chamonix Hauteurs', 'FRA', '2006-05-15');
INSERT INTO cycling.cyclist_semi_pro (id, firstname, lastname, age, affiliation, country, registration) VALUES (12, 'Charles', 'Eppinger', 24, 'Chamonix Hauteurs', 'FRA', '2018-07-29');
INSERT INTO cycling.cyclist_semi_pro (id, firstname, lastname, age, affiliation, country, registration) VALUES (13, 'Stanley', 'Trout', 30, 'Bolder Boulder', 'USA', '2016-02-12');
INSERT INTO cycling.cyclist_semi_pro (id, firstname, lastname, age, affiliation, country, registration) VALUES (14, 'Juan', 'Perez', 31, 'Rutgers Alumni Riders', 'USA', '2017-06-16');
INSERT INTO cycling.cyclist_semi_pro (id, firstname, lastname, age, affiliation, country, registration) VALUES (15, 'Thomas', 'Fulton', 27, 'Exeter Academy', 'USA', '2012-12-15');
INSERT INTO cycling.cyclist_semi_pro (id, firstname, lastname, age, affiliation, country, registration) VALUES (16, 'Jenny', 'Hamler', 28, 'CU Alums Crankworkz', 'USA', '2012-07-22');
INSERT INTO cycling.cyclist_semi_pro (id, firstname, lastname, age, affiliation, country, registration) VALUES (17, 'Alice', 'McCaffrey', 26, 'Pennan Power', 'GBR', '2020-02-12');
INSERT INTO cycling.cyclist_semi_pro (id, firstname, lastname, age, affiliation, country, registration) VALUES (18, 'Nicholas', 'Burrow', 26, 'Aberdeen Association', 'GBR', '2016-02-12');
INSERT INTO cycling.cyclist_semi_pro (id, firstname, lastname, age, affiliation, country, registration) VALUES (19, 'Tyler', 'Higgins', 24, 'Highclere Agents', 'GBR', '2019-07-31');
INSERT INTO cycling.cyclist_semi_pro (id, firstname, lastname, age, affiliation, country, registration) VALUES (20, 'Leslie', 'Boyd', 18, 'London Cyclists', 'GBR', '2012-12-15');

Create and run a query using SAI.

To test a non-trivial query, you’ll need some SAI indexes. Use CREATE CUSTOM INDEX commands to create SAI indexes on a few non-primary-key columns in the cyclist_semi_pro table:

CREATE CUSTOM INDEX lastname_sai_idx ON cycling.cyclist_semi_pro (lastname)
USING 'StorageAttachedIndex'
WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'};

CREATE CUSTOM INDEX age_sai_idx ON cycling.cyclist_semi_pro (age)
USING 'StorageAttachedIndex';

CREATE CUSTOM INDEX country_sai_idx ON cycling.cyclist_semi_pro (country)
USING 'StorageAttachedIndex'
WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'};

CREATE CUSTOM INDEX registration_sai_idx ON cycling.cyclist_semi_pro (registration)
USING 'StorageAttachedIndex';

To get started with Storage-Attached Indexing (SAI), refer to this Quickstart Guide.

Apache Cassandra Storage Attached Indexes (SAI) represent a significant advancement in the world of distributed NoSQL databases. They offer improved query performance, support for a wide range of queries, reduced storage overhead, and seamless integration with existing data models. As organizations continue to deal with ever-growing datasets, SAI becomes a valuable tool in their arsenal for optimizing data retrieval operations and delivering low-latency, high-performance applications.

Learn More About Apache Cassandra

As we get closer to the General Availability of Cassandra 5.0, there are a host of ways to get more involved in the community and follow project developments:

Cassandra Summit + Code AI is taking place Dec. 12-13 in San Jose, CA. Cassandra Summit is THE gathering place for Apache Cassandra data practitioners, developers, engineers and enthusiasts, and it’s where we’ll be diving deeper into Cassandra 5.0 features.

For more information about Apache Cassandra or to join the community discussion, you can join us on these channels: