Vector search concepts
Vector Search is a new feature in Cassandra 5.0. It is a powerful technique for finding relevant content within large datasets and is particularly useful for AI applications. Vector Search is built on Storage-Attached Indexes (SAI) and takes advantage of SAI's new modular design. It is the first feature to validate the extensibility of SAI.
Data stored in a database is useful, but the context of that data is critical to applications. Machine learning in applications allows users to get product recommendations, match similar images, and a host of other capabilities. A machine learning model is a program that can find patterns or make decisions from a previously unseen dataset. To power a machine learning model in an application, Vector Search compares stored data for similarity, discovering connections that may not be explicitly defined.
One key to doing similarity comparisons in a machine learning model is the ability to store embedding vectors: arrays of floating-point numbers that represent specific objects or entities in such a way that similar items have similar vectors. Vector Search brings that functionality to the highly available Apache Cassandra database.
The foundation of Vector Search is the embedding: a compact representation of text or images as a high-dimensional vector of floating-point numbers. For text processing, embeddings are generated by feeding the text to a machine learning model. These models generally use a neural network to transform the input into a fixed-length vector. When words are represented as high-dimensional vectors, the aim is to arrange the vectors so that similar words end up closer together in the vector space and dissimilar words end up further apart. Creating the vectors in this manner is referred to as preserving semantic or structural similarity. Embeddings capture the semantic meaning of the text, which in turn allows queries to rely on a more nuanced understanding of the text than traditional term-based approaches provide.
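To make "closer together in the vector space" concrete, the following is a minimal sketch of one common similarity measure, cosine similarity, applied to a tiny lookup table. The three-dimensional vectors are hand-picked for illustration and are an assumption, not output from any real embedding model; real embeddings typically have hundreds or thousands of dimensions. The function names are hypothetical as well.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical, hand-picked "embeddings" (assumption: not from a real model).
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "truck": [0.1, 0.2, 0.95],
}


def most_similar(word, table):
    """Return (other_word, score) for the entry most similar to `word`."""
    query = table[word]
    candidates = [
        (other, cosine_similarity(query, vector))
        for other, vector in table.items()
        if other != word
    ]
    return max(candidates, key=lambda pair: pair[1])


nearest, score = most_similar("cat", embeddings)
print(nearest, round(score, 3))  # "kitten" scores higher than "truck"
```

This sketch compares the query against every stored vector, which is only practical for small tables. An index exists to avoid that exhaustive scan on large datasets.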
Large Language Models (LLMs) generate contextual embeddings for the data, and optimize embeddings for queries. Trained embeddings like those produced by LLMs can be used in Natural Language Processing (NLP) tasks such as text classification, sentiment analysis, and machine translation.
Vector Search requires SAI, which provides high I/O throughput for Vector Search as well as for other indexed searches. SAI is a highly scalable, globally distributed index that can add a column-level index to any column of the vector data type.
SAI provides extensive indexing functionality: it indexes both queries and content (large inputs such as documents, words, and images) to capture semantics.
For more about SAI, see the Storage Attached Index documentation.
You cannot change index settings without dropping and rebuilding the index.
It is better to create the index and then load the data. This method avoids the concurrent building of the index as data loads.
A new vector data type has been added to CQL to support Vector Search. It is designed to store and retrieve embedding vectors.
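As a rough illustration of how the vector data type and an SAI index fit together, the sketch below only builds CQL statements as strings and prints them; it does not connect to a cluster. The keyspace, table, column, and index names are hypothetical, and the statement forms are an assumption that should be checked against the CQL reference for your Cassandra version. The statements are ordered so that the index is created before any data is loaded.

```python
def vector_literal(values):
    """Format a sequence of numbers as a CQL vector literal such as [0.1, 0.2]."""
    return "[" + ", ".join(repr(float(v)) for v in values) + "]"


def build_statements(keyspace, table, dimension, row_vector, query_vector, limit):
    """Return illustrative CQL statements for a fixed-dimension vector column.

    All identifiers are hypothetical; nothing here talks to a database.
    """
    for vector in (row_vector, query_vector):
        if len(vector) != dimension:
            raise ValueError("every vector must match the declared dimension")
    name = f"{keyspace}.{table}"
    return [
        # A vector column declares its element type and fixed dimension.
        f"CREATE TABLE IF NOT EXISTS {name} "
        f"(id int PRIMARY KEY, embedding vector<float, {dimension}>);",
        # Create the SAI index before loading data.
        f"CREATE INDEX IF NOT EXISTS {table}_ann_idx ON {name} (embedding) USING 'sai';",
        f"INSERT INTO {name} (id, embedding) VALUES (1, {vector_literal(row_vector)});",
        # Approximate nearest neighbor query ordered by similarity to the query vector.
        f"SELECT id FROM {name} ORDER BY embedding "
        f"ANN OF {vector_literal(query_vector)} LIMIT {limit};",
    ]


statements = build_statements(
    "demo", "items", 3, [0.1, 0.2, 0.3], [0.2, 0.2, 0.2], 2
)
for statement in statements:
    print(statement)
```

Because the dimension is part of the column type, the helper rejects vectors whose length does not match it rather than emitting a statement that would fail.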