Deterministic Simulation Testing on every commit

Distributed systems are notoriously hard to get right. You rely on Kafka; it’s the database you use when other databases are down. We can’t afford to get it wrong.

Our first step was writing Astradot in Rust, whose ownership and type system rule out memory corruption and data races in safe code.

Next comes testing. Unit and integration tests go only so far in establishing the correctness of a distributed system. To exercise the many ways communication between nodes can fail, one needs Deterministic Simulation Testing (DST).

The Tokio project maintains Turmoil, a Rust framework for building deterministic simulations. It can simulate hosts, networks, and time, allowing fine-grained control over the communication between components during testing. This makes it easy to simulate network partitions, dropped packets, and other failure scenarios. Additionally, we use the excellent Testcontainers framework to spin up containers that host the third-party systems Astradot interacts with.
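The principle behind a deterministic simulation can be shown without any framework: every random decision comes from one seeded generator, and time is a counter that the simulation advances itself. The sketch below is a std-only toy and does not use Turmoil's API; `SimRng`, `Event`, `run_simulation`, and `deliver_due` are hypothetical names invented for illustration.

```rust
/// SplitMix64: the single source of randomness in the simulation.
struct SimRng(u64);

impl SimRng {
    fn new(seed: u64) -> Self {
        SimRng(seed)
    }

    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

#[derive(Debug, Clone, PartialEq, Eq)]
enum Event {
    Dropped { tick: u64, msg: u32 },
    Delivered { tick: u64, msg: u32 },
}

/// Delivers every in-flight message whose delivery tick has been reached.
fn deliver_due(in_flight: &mut Vec<(u64, u32)>, now: u64, log: &mut Vec<Event>) {
    in_flight.sort(); // fixed delivery order among messages that are due
    while let Some(&(deliver_at, msg)) = in_flight.first() {
        if deliver_at > now {
            break;
        }
        in_flight.remove(0);
        log.push(Event::Delivered { tick: now, msg });
    }
}

/// Sends `messages` messages over a lossy, slow simulated link and returns
/// the full event history. No wall clock and no OS randomness are involved.
fn run_simulation(seed: u64, messages: u32, drop_percent: u64) -> Vec<Event> {
    let mut rng = SimRng::new(seed);
    let mut now: u64 = 0; // virtual clock, in ticks
    let mut in_flight: Vec<(u64, u32)> = Vec::new();
    let mut log = Vec::new();

    for msg in 0..messages {
        // The client sends one message per tick; the network may drop it.
        if rng.next_u64() % 100 < drop_percent {
            log.push(Event::Dropped { tick: now, msg });
        } else {
            let delay = 1 + rng.next_u64() % 5; // 1 to 5 ticks of latency
            in_flight.push((now + delay, msg));
        }
        now += 1;
        deliver_due(&mut in_flight, now, &mut log);
    }
    // Drain the remaining messages by advancing virtual time.
    while !in_flight.is_empty() {
        now += 1;
        deliver_due(&mut in_flight, now, &mut log);
    }
    log
}

fn main() {
    let first = run_simulation(42, 10, 30);
    let second = run_simulation(42, 10, 30);
    // Same seed, same history: a failing run can be replayed exactly.
    assert_eq!(first, second);
    for event in &first {
        match event {
            Event::Dropped { tick, msg } => println!("tick {tick}: message {msg} dropped"),
            Event::Delivered { tick, msg } => println!("tick {tick}: message {msg} delivered"),
        }
    }
}
```

Because the seed fully determines the drops and delays, a failing seed can be recorded by CI and replayed on a developer machine. A framework such as Turmoil applies the same idea to real async networking code rather than to a hand-written loop.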

Challenges of DST in CI

Astradot keeps the entire company codebase in a monorepo. We have numerous simulation tests and continue to add more. In principle, this means the CI system has to build every line of code and run every test in the company on every commit to any branch.

This presents a challenge with simulation tests, as they are highly resource-intensive. Not only does this drive up the cost of CI, but it also lengthens every build. This is particularly frustrating given that many commits are unrelated to the Rust code that comprises the Kafka implementation. Many involve changes to our Infrastructure-as-Code, Go code, or even minor comment or formatting adjustments to Rust code, none of which warrant rerunning all simulation tests.

Left unaddressed, this leads to significantly longer CI runs and higher resource requirements. The usual compromise is to run these tests only on the main branch, and even then only a few times a day.

Our Solution

When setting up our monorepo, we chose Bazel as our build system. It is the open-source version of the system Google uses to build and test its own code. Bazel can produce reproducible binaries for each commit. It tracks the exact inputs of every build and test target, determines what depends on the files a commit changed, and uses this information to build and test only what has been affected since the last build. This approach significantly reduces the number of tests run on each build. For commits that don’t touch the Rust code, or that make changes that don’t affect the resulting binary, such as adding comments or changing formatting, there is no need to run any of the simulation tests.
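The selection logic described above can be pictured as a comparison of digests. The sketch below is a toy model and not Bazel's implementation: it assumes each test target lists its inputs explicitly, and it uses a hypothetical `compile` stand-in that strips comments and whitespace to mimic an edit that leaves the built artifact unchanged. The target labels, file paths, and function names are invented for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Path -> file contents at a given commit (hypothetical representation).
type Sources = HashMap<&'static str, &'static str>;

struct TestTarget {
    name: &'static str,
    inputs: Vec<&'static str>, // files the test depends on, transitively
}

/// Stand-in for a compiler: drops `//` comments and all whitespace, so
/// comment-only and formatting-only edits yield the same "artifact".
fn compile(source: &str) -> String {
    source
        .lines()
        .map(|line| line.split("//").next().unwrap_or(""))
        .flat_map(|code| code.chars())
        .filter(|c| !c.is_whitespace())
        .collect()
}

/// Digest of everything a test consumes, computed over built artifacts.
fn test_digest(target: &TestTarget, sources: &Sources) -> u64 {
    let mut hasher = DefaultHasher::new();
    for path in &target.inputs {
        path.hash(&mut hasher);
        let artifact = sources.get(*path).map(|src| compile(src));
        artifact.hash(&mut hasher);
    }
    hasher.finish()
}

/// A test is rerun only if its digest differs between the two commits.
fn tests_to_run(targets: &[TestTarget], before: &Sources, after: &Sources) -> Vec<&'static str> {
    targets
        .iter()
        .filter(|t| test_digest(t, before) != test_digest(t, after))
        .map(|t| t.name)
        .collect()
}

fn example_targets() -> Vec<TestTarget> {
    vec![
        TestTarget {
            name: "//kafka:replication_sim_test",
            inputs: vec!["kafka/replication.rs", "kafka/log.rs"],
        },
        TestTarget {
            name: "//kafka:consumer_group_sim_test",
            inputs: vec!["kafka/consumer_group.rs", "kafka/log.rs"],
        },
    ]
}

fn example_sources() -> Sources {
    HashMap::from([
        ("kafka/replication.rs", "fn replicate() { send(); }"),
        ("kafka/consumer_group.rs", "fn rebalance() { assign(); }"),
        ("kafka/log.rs", "fn append() { write(); }"),
        ("infra/main.tf", "resource {}"),
    ])
}

fn main() {
    let targets = example_targets();
    let before = example_sources();

    // A comment-only edit changes the source but not the artifact.
    let mut comment_only = before.clone();
    comment_only.insert("kafka/replication.rs", "fn replicate() { send(); } // TODO retry");
    println!("comment edit -> {:?}", tests_to_run(&targets, &before, &comment_only));

    // A real code change affects only the tests that depend on that file.
    let mut code_change = before.clone();
    code_change.insert("kafka/consumer_group.rs", "fn rebalance() { assign_sticky(); }");
    println!("code change  -> {:?}", tests_to_run(&targets, &before, &code_change));
}
```

The key design choice in this model is that the digest is taken over artifacts rather than raw source text, which is why an edit that produces an identical artifact selects no tests.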

We also optimize our CI builds by coalescing them. For example, if commits A, B, and C are pushed to the same branch in quick succession, our CI system does not launch three separate builds; it builds only the latest commit, C, since that already includes the changes from A and B. If the build fails, developers can reproduce the failure on their dev machines and determine which of A, B, or C caused it. Our dev machines are remote servers with the same high-performance specs as our CI boxes, so they run Bazel with the same fast incremental builds, and any CI issue can be reproduced locally with ease.
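The coalescing step can be sketched in a few lines. This is a minimal illustration assuming the CI system holds a queue of pending pushes in arrival order; the `Push` type and the `coalesce` and `push` functions are hypothetical and not taken from any particular CI product.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq)]
struct Push {
    branch: String,
    commit: String,
}

fn push(branch: &str, commit: &str) -> Push {
    Push { branch: branch.to_string(), commit: commit.to_string() }
}

/// Given pending pushes in arrival order, keep only the newest commit per
/// branch. Older commits on a branch are covered by the newest one.
fn coalesce(pending: &[Push]) -> Vec<Push> {
    // Remember the position of the last push seen for each branch.
    let mut latest: HashMap<&str, usize> = HashMap::new();
    for (index, p) in pending.iter().enumerate() {
        latest.insert(p.branch.as_str(), index);
    }
    // Emit the survivors in their original arrival order.
    let mut keep: Vec<usize> = latest.into_values().collect();
    keep.sort_unstable();
    keep.into_iter().map(|index| pending[index].clone()).collect()
}

fn main() {
    let pending = vec![
        push("feature", "A"),
        push("feature", "B"),
        push("main", "X"),
        push("feature", "C"),
    ];
    // Only one build per branch remains in the queue.
    for build in coalesce(&pending) {
        println!("build {} at commit {}", build.branch, build.commit);
    }
}
```

Coalescing is only applied within a branch, so a push to one branch never cancels the build of another.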

To further reduce CI costs, we run CI on Hetzner servers, which we also use for our development machines. Hetzner’s servers are significantly less expensive than comparable AWS instances, and since our builds don’t require any AWS-specific resources, we give up nothing by making the switch.

The Result: 100% DST on Every Commit!

The combined effect of all these efforts is that, even with all the deterministic simulation tests, we can build the monorepo in under a minute for the vast majority of builds. For those builds that do run the simulation tests, Bazel efficiently slices through the massive test suite, executing only the tests directly affected by the changes in the commit.

This has enabled us to run deterministic simulation testing on all branches and for every commit, allowing us to catch bugs much earlier in the development process. We can continue adding more tests without the fear of significantly impacting build duration. The end result is a safer, more reliable implementation of Kafka that you can depend on.