The Path to Green CI

As we approach the Cassandra 4.1 GA release, it’s a great time to stop and reflect on some of our development history, the past year since we released 4.0, and where we’re headed in the future. In this blog post, we’re going to limit our focus to testing Cassandra, as that’s been a major area of investment in the 4.0+ time frame.

The Numbers

But don’t just take my word for it - let’s start with some numbers! We’re going to use “lines of code” (LoC) as a loose proxy for “where we’re spending our time”. This is definitely a fraught metric, but as one way of viewing things it paints a pretty interesting picture, consistent with many of our intuitions.

Below is a breakdown, for each of our past three major releases, of how much raw code exists in the following:

  1. src/java: raw database code

  2. test/unit: unit testing code

  3. test/distributed: in-JVM distributed tests written in Java

  4. The cassandra-dtest repo .py test files: Python distributed tests using ccm

Cassandra 3.0:

               Lines   % total
src          181,871    57.51%
junit        100,284    31.71%
jdtest         9,225     2.92%
pdtest        24,882     7.87%
3.0 Total    316,262   100.00%
All tests    134,391    42.49%


Cassandra 4.0:

               Lines   % total   % of 3.0
src          261,825    52.83%    143.96%
junit        172,672    34.84%    172.18%
jdtest        21,769     4.39%    235.98%
pdtest        39,300     7.93%    157.95%
4.0 Total    495,566   100.00%    156.69%
All tests    233,741    47.17%


Cassandra 4.1:

               Lines   % total   % of 4.0
src          297,685    52.58%    113.70%
junit        197,231    34.84%    114.22%
jdtest        31,306     5.53%    143.81%
pdtest        39,905     7.05%    101.54%
4.1 Total    566,127   100.00%    114.24%
All tests    268,442    47.42%

The biggest thing that immediately jumps out at me: our in-JVM dtests have been growing at a very strong pace relative to the rest of the code base. Immense effort has gone into not just authoring this testing framework, but also adding new tests to it. As a share of the total codebase, test code jumped almost five percentage points from 3.0 to 4.0 (42.49% to 47.17%).

We can also infer that new code addition to the database has accelerated in the past year: the 4.0 to 4.1 delta covers roughly one calendar year with a net add of ~36k LoC of database code, versus the 5.5-year gap between 3.0 and 4.0 with a net add of ~80k LoC. The database also had the 3.1-3.11 release line introducing new features and tests during that window, but for the sake of this analysis we’re only considering the traditional major releases.
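
If you’d like to reproduce rough counts like the ones above, here’s a minimal sketch in Python. The paths and counting rules are assumptions on my part (raw line counts over local checkouts of cassandra and cassandra-dtest), so don’t expect the output to match the tables exactly.

```python
# Rough LoC counter - a sketch, not the exact methodology behind the tables above.
# Assumes local checkouts of the cassandra and cassandra-dtest repositories.
from pathlib import Path

def count_lines(root: str, suffix: str) -> int:
    """Sum line counts of every file under `root` whose name ends in `suffix`."""
    total = 0
    for path in Path(root).rglob(f"*{suffix}"):
        with open(path, errors="ignore") as f:
            total += sum(1 for _ in f)
    return total

buckets = {
    "src":    count_lines("cassandra/src/java", ".java"),
    "junit":  count_lines("cassandra/test/unit", ".java"),
    "jdtest": count_lines("cassandra/test/distributed", ".java"),
    "pdtest": count_lines("cassandra-dtest", ".py"),
}

total = sum(buckets.values())
for name, lines in buckets.items():
    print(f"{name:<8}{lines:>9,}{lines / total:>9.2%}")
print(f"{'total':<8}{total:>9,}")
```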

So what does all this mean for those of us working on and depending on the project? It means that with the release of 4.1, we have more code just unit testing the database (~197k LoC) than we had in the entire database in 3.0 (~182k LoC). It means that new development on Cassandra is accelerating. It also means we’re constantly moving the goalposts on what’s required to keep our CI (continuous integration) green, i.e. passing.

Keeping it Green

We all know Cassandra is an incredibly complex piece of software. The power to scale up linearly to petabytes of data on hundreds of machines, with zero downtime, in a masterless single logical cluster where machines can drop out and in, hint, heal, and repair, simply cannot be implemented without a significant amount of code, infrastructure, and testing to ensure it works as expected. Further, since we support hot upgrades between versions with zero downtime, with mixed-version clusters running, we also have a strong commitment to backwards compatibility in the mix.

One of our struggles over time has been the software and hardware complexity required to keep our testing infrastructure “clean”, or green, on an ongoing basis. Balancing runtime, resourcing, and cost with a system as complex as Cassandra is challenging even at a fixed point in time, and the problem is only growing, as we’ve seen above.

We always drive down to a stable zero test failures for a GA release, but what we refer to as “flaky tests” sneak back into our suite over time. Let’s take a recent example: build 1112 on trunk (effectively Cassandra 4.1 pre-alpha).

19 test failures! Out of an entire suite of 49,704 tests, that’s a 99.96% pass rate. Nobody wants a database that works 99.96% of the time, however, and that’s assuming we have 100% test coverage of not just all our code but also all possible combinations of state, a problem so daunting that some contributors are pushing the bleeding edge of distributed database testing.

Burning down fewer than 20 flaky tests in the eight-week window between freeze and our release goal is quite doable, so why not just continue to float along with a low number of test failures? Well, it gets more complicated when we look at where we run our tests.

Circle vs. Jenkins

As we outline in our contributor guide on testing, tests can be run either on Apache Jenkins infrastructure or on CircleCI. The primary differences between these two systems are runtime, cost, and the resources allocated to each individual test. While some contributors have access to paid CircleCI accounts that allow them to dedicate more resources to their test runs and shorten feedback loops, this is an open source volunteer project and our canonical CI is the Apache Jenkins infrastructure.

One challenge this introduces is tests that “flake” due to resource allocation differences. For instance, if you run a particularly intensive unit test in a container with eight cores and 16 GB of RAM, you can expect a different runtime than in a container with two cores and 8 GB of RAM. Throw into the mix that all of us are doing development on different laptops, with different core counts and different architectures, and you have a recipe for some pretty challenging non-deterministic test runtimes.
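
To make that concrete, here’s a contrived illustration written pytest-style, like our python dtests (it is not a real Cassandra test): a hard-coded timing threshold that passes comfortably on a well-resourced container and flakes on a starved one.

```python
# Illustration only: a test that bakes in a wall-clock assumption will pass on
# a beefy CI container and flake on a resource-starved one.
import time

def do_compaction_like_work(n: int = 2_000_000) -> int:
    # Stand-in for CPU-bound work whose duration depends entirely on the hardware.
    return sum(i * i for i in range(n))

def test_finishes_quickly():
    start = time.monotonic()
    do_compaction_like_work()
    elapsed = time.monotonic() - start
    # Hard-coded thresholds like this are a classic source of flakiness:
    # 8 cores / 16 GB vs. 2 cores / 8 GB can easily straddle the limit.
    assert elapsed < 1.0, f"took {elapsed:.2f}s, expected < 1.0s"
```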

Currently we accept either a green run on CircleCI or a green DevBranch run on ASF Jenkins as sufficient for committers to merge code. This introduces a gap for us: CircleCI can allocate more resources to test containers depending on the plan you use, meaning a test could pass on Circle and subsequently fail on ASF Jenkins due to resourcing limitations.

Another challenge is that it can be difficult to author tests in Cassandra that have deterministic results in the face of scheduling pressure. Given the long legacy of our project (which dates back to 2008), we have quite a bit of static state without stub implementations for testing, meaning many of our unit tests spin up state in other areas of the database, write files to disk, and otherwise mutate state in adjacent subsystems. What this translates into is a need to run new tests 100 times on CI infrastructure before committing them, a practice we haven’t yet enshrined in our process.
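
As a rough sketch of what that practice can look like when done locally instead of on CI, the loop below repeats a single test class many times and tallies failures. The ant invocation and the test class name are illustrative assumptions on my part, so adjust them to whatever your branch and build actually use.

```python
# Crude local stand-in for "run the new test 100 times before committing".
# The ant target and test class name are examples, not project policy.
import subprocess

TEST_CLASS = "MyNewFlakyTest"  # hypothetical unit test class name
RUNS = 100

failures = 0
for i in range(1, RUNS + 1):
    # Assumes the build exits non-zero when a test fails; verify on your branch.
    result = subprocess.run(
        ["ant", "test", f"-Dtest.name={TEST_CLASS}"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        failures += 1
        print(f"run {i}: FAILED")

print(f"{failures}/{RUNS} runs failed")
```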

Testing Complex Systems is Itself Complex

The Cassandra testing ecosystem consists of a variety of suites targeting different subsystems and operations in the database. From a high level, a look at the top-level testing pipelines of the project shows standouts like testing with compression, with change data capture enabled, during upgrades, unit vs. distributed, and so on. We have a cluster of machines dedicated to testing Cassandra, all donated by different participants within the Apache Cassandra ecosystem, and tending to their needs is a significant task.

Taking a quick look at the runtime pipeline under the hood, you can see the large distributed effort involved in breaking the different jobs down across these agents. The code required to generate, distribute, build, collect logs from, tear down, and maintain all these jobs on these machines lives in the cassandra-builds repo in the Apache organization on GitHub.

Throwing all this hardware and parallelization at our almost 50,000 tests takes our total test runtime down to 4h 9m 4s. A big shout-out to Mick Semb Wever, committer and PMC member on the project, who’s done a ton of work to get us this far with our CI infrastructure!

We have a few ideas for ways to reduce the total processing burden of our tests; with this much compute required and this many tests, small percentages add up to big gains. Jacek Lewandowski is targeting some file operations and general speedups in CASSANDRA-17427, Berenguer Blasi is looking into re-using dtest clusters in our python dtests to cut out unnecessary cluster startup and shutdown time in CASSANDRA-16951, and after a little analysis I’ve found that roughly 20% of our unit test runtime comes from just 2.62% of our tests, giving us some low-hanging fruit to target in CASSANDRA-17371 to speed things up.
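
For the curious, that last number comes from the kind of analysis that’s easy to sketch: parse the JUnit XML reports a test run leaves behind and rank test cases by recorded time. The report directory below is an assumption, so point it at wherever your build writes its TEST-*.xml files.

```python
# Rank test cases by recorded runtime from JUnit XML reports.
import xml.etree.ElementTree as ET
from pathlib import Path

def collect_timings(report_dir: str) -> list[tuple[float, str]]:
    """Return (seconds, test name) pairs, slowest first."""
    timings = []
    for report in Path(report_dir).glob("TEST-*.xml"):
        for case in ET.parse(report).getroot().iter("testcase"):
            name = f"{case.get('classname')}.{case.get('name')}"
            timings.append((float(case.get("time") or 0), name))
    return sorted(timings, reverse=True)

timings = collect_timings("build/test/output")  # assumed report location
total = sum(seconds for seconds, _ in timings) or 1.0

slowest = timings[:50]  # adjust the cutoff to taste
share = sum(seconds for seconds, _ in slowest) / total
print(f"Top {len(slowest)} tests account for {share:.1%} of total recorded runtime")
for seconds, name in slowest[:10]:
    print(f"{seconds:8.1f}s  {name}")
```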

Lastly, in CASSANDRA-17277 we have a Jenkins-to-JIRA integration script drafted that would automatically update tickets with the results of their CI runs on ASF Jenkins infrastructure. This is necessary because we have two paths for code to get certified for inclusion (CircleCI or ASF Jenkins), with the former more heavily resourced than the latter, but the latter being our gatekeeper.
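
To give a flavor of the idea (this is not the script from CASSANDRA-17277, just a hypothetical sketch using the standard JIRA comment endpoint via requests), the core of such an integration might look like this:

```python
# Hypothetical sketch: post a CI result as a comment on an ASF JIRA ticket.
import os
import requests

JIRA_BASE = "https://issues.apache.org/jira"

def post_build_result(ticket: str, build_url: str, passed: bool, failures: int) -> None:
    verdict = "PASSED" if passed else f"FAILED ({failures} test failures)"
    comment = f"CI results: {verdict}\n{build_url}"
    resp = requests.post(
        f"{JIRA_BASE}/rest/api/2/issue/{ticket}/comment",
        json={"body": comment},
        auth=(os.environ["JIRA_USER"], os.environ["JIRA_PASS"]),  # credentials assumed
        timeout=30,
    )
    resp.raise_for_status()

# Hypothetical usage:
# post_build_result("CASSANDRA-12345",
#                   "https://ci-cassandra.apache.org/job/Cassandra-trunk/1112/",
#                   False, 19)
```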

The Future of Testing in Cassandra

As we head into the verification cycle for Cassandra 4.1 we’re going to be using the same Release Lifecycle definitions we ratified back in 2019. Of note, we won’t transition from alpha to beta without green tests: “No flaky tests - All tests (Unit Tests and DTests) should pass consistently. A failing test, upon analyzing the root cause of failure, may be “ignored in exceptional cases”, if appropriate, for the release, after discussion in the dev mailing list.”

So we’re going to drive back to a green test board as we do for each major release, but are we going to make an effort to stay there and if so, how?

I’ve been working on this project since early 2014 (!), and this has always been a challenge for us. That said, after analyzing the numbers for this blog post and realizing just how much we’re proportionally expanding our testing, I’m heartened by the progress we’re making; the proportion of flaky or failing tests is objectively falling over time. Fifteen failing tests out of 50,000 is a far smaller share than 15 out of 25,000, or 12,500, for example, so we’re definitely moving in the right direction.

If we take the value of having a green test board as self-evident (developer time, triaging, branch stability, feedback loops, etc.), how can we stay there after the 4.1 release? A bot letting us know ASAP when a patch correlates with a new test failure should help, as will lowering the total runtime required between running our tests and merging them. Lastly, in January of 2022 we introduced a new Build Lead role to shepherd integration with our CI tracking system, Butler, which has had a very positive impact on our visibility into and momentum on fixing test failures.

There’s a natural tension between wanting to get code changes into the system rapidly for contributors fortunate enough to be able to use CircleCI and wanting to provide for and encourage use of the freely available Apache Jenkins infrastructure, but we’re bridging the gap this creates.

Contributors around the globe are working hard to get Cassandra 4.1 to GA soon, and just like Cassandra 4.0 before it, we expect this to be the most stable, best-performing version of Apache Cassandra we’ve ever released. You can download the test build of Cassandra 4.1 here and test it out - let us know what you think!

If you haven’t yet, come join the Cassandra development community and get involved in making the most scalable and available database in the world!