Kubernetes is a hot technology, and while it seems like everyone is using it to automate the deployment, scaling, and management of containerized applications, you’ll still face fundamental issues as you grow from a beginner into an intermediate Kubernetes operator. One of these hurdles is how to store and manage data.
Where and how to store Kubernetes data
Kubernetes is an amazingly flexible and robust way to host stateless computation, but the data layer is less straightforward. Traditionally, computation happens within a cluster, with every container in that cluster reading and writing data to a database hosted outside it.
Running applications in Kubernetes with databases external to Kubernetes creates a mismatched architecture. This mismatch limits developer productivity, requires duplicate stacks for monitoring application and database infrastructure, and increases cloud computing costs.
What is Cassandra?
Apache Cassandra is an open source NoSQL distributed database trusted by thousands of companies for scalability and high availability without compromising performance. Linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. The latest major version, Apache Cassandra 4.0, is now generally available.
Cassandra merges the ease-of-use of NoSQL with the reliability of a mature open source project. Most importantly, for real-world applications, it’s designed with distributed architectures in mind.
"Distributed" means Cassandra can run on multiple machines while appearing to users as a unified whole. There is little point in running Cassandra as a single node, although it is constructive to help you get up to speed on how it works. But to get the maximum benefit out of Cassandra, you would run it on multiple machines.
Apache Cassandra and Kubernetes
Kubernetes has emerged as the leading orchestration platform for deploying and managing containerized systems in the cloud. As infrastructure management standardizes around Kubernetes, many organizations are looking at the data plane as something that should be managed under the same umbrella.
Kubernetes simplifies distributed systems lifecycle management. It’s natural to use Kubernetes to build your flexible, distributed database with Cassandra.
The Challenge of Kubernetes: Complexity
Kubernetes enables you to auto-scale whole containers, provisioning resources, spinning up new instances, and balancing load. But without careful management, rather than removing the complexity of managing workloads and containers, Kubernetes can increase the complexity of a system and make it even harder to manage.
Running Cassandra on Kubernetes can be difficult. Kubernetes has only limited insight into database functionality: it is blind to the key operational requirements of the database in use, and it takes significant effort to script around existing Kubernetes functionality to run a production-grade Cassandra deployment.
To reduce those complexities, the Apache Cassandra community built Cass Operator, which is installed via Helm (see below). Rather than making you describe many lower-level Kubernetes components, operators provide a more straightforward, logical interface for describing an application. They act as a translation layer between what Kubernetes needs to maintain services and how the database actually implements them.
Multiple Kubernetes operators try to solve the same problem, including those from Instaclustr and Sky UK, but the Cassandra community has coalesced around Cass Operator and is merging in features from other operators, such as CassKop, which Orange developed.
As with any Kubernetes operator, the goal is to create a robot that makes it easier to set up, maintain, and scale complex configurations of containers in Kubernetes.
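To make that "translation layer" idea concrete, here is a minimal sketch of the kind of resource Cass Operator reconciles into StatefulSets, Services, and ConfigMaps on your behalf. The field names follow the cassandra.datastax.com/v1beta1 CassandraDatacenter API as documented for cass-operator 1.x, but the specific names (dc1, cluster1, the namespace, and the storage class) are placeholders, so check the operator's documentation for your version.

```sh
# Sketch: describe a three-node Cassandra datacenter at a high level and let
# Cass Operator create the underlying Kubernetes resources. Names and storage
# settings are placeholders; verify the schema against your operator version.
cat <<EOF | kubectl apply -n cass-operator -f -
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: "4.0.0"
  size: 3                       # number of Cassandra nodes (pods)
  storageConfig:
    cassandraDataVolumeClaimSpec:
      storageClassName: standard
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
EOF
```

Everything below this short description, from persistent volumes to seed services, is the operator's job rather than yours.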
How to simplify deployment: Apache Cassandra as a Helm Chart
Helm is a package manager for Kubernetes, the Kubernetes equivalent of yum or apt. Helm deploys charts, which you can think of as packaged applications: a chart is a collection of all your versioned, pre-configured application resources, which can be deployed as one unit.
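The basic workflow looks like the sketch below. The Bitnami repository and its nginx chart are used purely as a familiar example, and the release name is arbitrary.

```sh
# Basic Helm workflow, analogous to apt/yum: add a repository, find a chart,
# install it as a named release, then upgrade or remove it later.
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm search repo bitnami/nginx          # find a chart
helm install my-web bitnami/nginx       # deploy the chart as release "my-web"
helm list                               # show installed releases
helm upgrade my-web bitnami/nginx       # roll out new chart versions or values
helm uninstall my-web                   # remove everything the chart created
```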
The goal when adopting Cassandra on Kubernetes should be to deploy it as a single Helm chart. There are many options here from multiple vendors; the open source K8ssandra project is one of them and is in active development, having reached v1.3, which supports the new Apache Cassandra 4.0 GA.
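As an example, a K8ssandra 1.x install can look roughly like the following. The repository URL and chart name come from the K8ssandra project; the release name, namespace, and the values shown (cassandra.version and the datacenters list) reflect the 1.x chart as documented, but you should confirm them against the chart's own values.yaml for your version.

```sh
helm repo add k8ssandra https://helm.k8ssandra.io/stable
helm repo update

# Illustrative overrides; consult the chart's values.yaml for the options your
# chart version actually supports.
cat <<EOF > k8ssandra-values.yaml
cassandra:
  version: "4.0.0"
  datacenters:
    - name: dc1
      size: 3
EOF

helm install k8ssandra k8ssandra/k8ssandra \
  --namespace k8ssandra --create-namespace \
  -f k8ssandra-values.yaml
```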
What is K8ssandra?
K8ssandra is a cloud native, open source distribution of Apache Cassandra that runs on Kubernetes. Accompanying Cassandra is a suite of tools to ease and automate operational tasks, including metrics, data anti-entropy services for running repairs, and backup tooling. As part of K8ssandra’s installation process, all of these components are installed and wired together, freeing your teams from the tedious plumbing of components.
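For illustration, the sketch below toggles a few of those components and then lists what the chart wired together. The value keys shown (reaper.enabled, medusa.enabled, stargate.enabled) follow the K8ssandra 1.x chart, but treat them as assumptions and confirm them with `helm show values`; Medusa in particular also needs backup storage settings (bucket and credentials) that are not shown here.

```sh
helm show values k8ssandra/k8ssandra     # list every tunable component

# Illustrative component toggles; verify key names against the output above
# for your chart version.
cat <<EOF > components-values.yaml
reaper:
  enabled: true        # anti-entropy repair service
medusa:
  enabled: true        # backup/restore tooling (also needs storage settings)
stargate:
  enabled: false       # optional data API gateway
EOF

helm upgrade k8ssandra k8ssandra/k8ssandra \
  --namespace k8ssandra -f components-values.yaml

# See the pods and services that were installed and wired together.
kubectl get pods,svc -n k8ssandra
```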
Apache Cassandra can be deployed in many environments, including bare-metal hosts, virtual machines, and container platforms. Each deployment type has its pros and cons, but in all cases automation should be used to ensure that every node is configured identically and reliably.
Operator Problems
Site reliability engineering (SRE) expertise remains an essential resource for running distributed workloads. Challenges such as configuring throttles, scheduling backups, and managing edge-case failures (for example, concurrent socket problems) are not yet covered by automation; the throttle example below shows the kind of manual tuning that is still involved.
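As one small example of that manual work, compaction and streaming throttles are typically adjusted with standard Cassandra nodetool commands run against each node. The pod and container names below are assumptions based on a typical operator-managed deployment, and nodetool may additionally need JMX credentials (-u/-pw) if JMX authentication is enabled.

```sh
# Check and adjust Cassandra's throttles by hand; pod and container names are
# placeholders, while the subcommands are standard Cassandra nodetool tooling.
kubectl exec my-cluster-dc1-default-sts-0 -c cassandra -- \
  nodetool getcompactionthroughput
kubectl exec my-cluster-dc1-default-sts-0 -c cassandra -- \
  nodetool setcompactionthroughput 64      # MB/s; 0 disables the throttle
kubectl exec my-cluster-dc1-default-sts-0 -c cassandra -- \
  nodetool setstreamthroughput 200         # megabits/s for node-to-node streaming
```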
While the tools available from multiple vendors for Kubernetes can simplify the process for deploying new workloads, you will still need a team that is excited to increase their Kubernetes expertise.
The particular tools you choose for solving the ‘data on Kubernetes’ problem are your own decision, but the good news is that there are viable solutions, both from the open source community around Apache Cassandra and from fully featured SaaS products that will spin up your cluster and handle the data layer for you.