Version: 5.11.x

Optimize for Speed and Cost

The goal in this stage is to tune both the code in the repository and the Workflows configuration, aiming for fast build and test results while reducing the cloud cost of operating CI.

Turn on "Build without the bytes"

This is the term of art for a Bazel setup in which outputs are never downloaded back to the machine where Bazel is running. Instead, they are stored only in the remote cache.

This feature is enabled by default in Bazel 7; see the Bazel blog post announcing it.

Aspect recommends upgrading to Bazel 7 if possible, since this feature is more stable there.

To enable it on Bazel 5 or Bazel 6, add build --remote_download_minimal to your .aspect/workflows/bazelrc file.
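Concretely, the bazelrc entry looks like this. The --remote_download_toplevel alternative is worth knowing about if developers also run builds locally and need output files on disk:

```
# .aspect/workflows/bazelrc
# Never download action outputs; keep them only in the remote cache.
build --remote_download_minimal

# Alternative middle ground: download only the outputs of top-level targets.
# build --remote_download_toplevel
```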

High cache-hit rate

Non-determinism is the property of a build step where the output varies even when inputs are the same.

Under Bazel, this causes downstream cache misses, as those changed outputs bust the cache key.

There are several ways to check the cache hit rate:

  • If the Grafana dashboard is set up, check the cache hit rate meter; it should stay above 95%.
  • Look for a line at the end of Bazel's output like Executed 1 out of 347 tests: 345 tests pass and 2 were skipped. The "executed" number should be low on most runs.
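A third signal is the process summary that Bazel prints at the end of a build, which breaks down how many actions were remote cache hits. A minimal sketch of extracting the hit count from such a line (the sample line below is hardcoded for illustration; the exact wording varies across Bazel versions):

```shell
# Sample process-summary line; in practice this comes from your build log.
line='INFO: 347 processes: 340 remote cache hit, 7 internal.'
# Extract the number immediately before "remote cache hit".
hits=$(printf '%s\n' "$line" | sed -E 's/.*processes: ([0-9]+) remote cache hit.*/\1/')
echo "$hits"   # 340
```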

To identify causes of non-determinism, collect Bazel's execution logs for two similar builds and compare them.

The bazel_debug_assistance option gathers execution logs. Enable it only while debugging, as it makes builds slower and uses a lot of disk space.

.aspect/workflows/config.yaml

  - build:
      bazel:
        bazel_debug_assistance: true
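Execution logs can also be collected from a local Bazel invocation, outside of Workflows. A sketch using Bazel's own flag (available in Bazel 5 and 6; newer Bazel releases also offer a compact log format):

```
# .bazelrc -- only while debugging; execution logs can be very large
build --execution_log_binary_file=/tmp/exec.log
```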

Once this option is enabled, download execution logs from two separate Bazel runs. One way to accomplish this is to merge a commit with this option enabled and let a run on the main branch produce the output. The log appears among the artifacts produced by the Workflows run, with the prefix exec., and is usually a large file, like the following sample on Buildkite:

Artifacts tab with an execution log

The next step is to gather a second log. Ideally it should not include any source code changes that cause legitimate cache misses, so simply trigger a "retry" in the CI system at the same commit as the first log. It is also best to ensure a different runner is used: terminate all runners, wait for the pool to scale in to zero, or retry a build from the previous day.

After downloading the two execution logs, follow the instructions for comparing them in the Bazel documentation.

coming soon

Aspect plans to add first-class determinism checking support including a built-in execution log comparison.

Right-size instance types

Look through the instance types available from your cloud provider, taking into account availability in your region or partition.

Cheaper instances with too few resources cause slow builds or out-of-memory failures, while generously resourced instances cost more.

Aspect recommends performing some experiments by doubling or halving instance sizes to "binary search" for an ideal trade-off between speed and cost.

Consider different CPU architectures as well - ARM machines are generally less expensive.

Enable "rebase" branch freshness

Bazel's performance degrades severely when a warm machine must sync to a version-control state that invalidates expensive cache entries. To avoid this, the update_strategy attribute can rebase Pull Requests onto the most recent commit of the target branch. See Configuration.
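As a sketch, the configuration might look like the following. The update_strategy attribute name comes from this page, but the rebase value and its placement in the file are assumptions; consult the Configuration reference for the exact schema:

```
# .aspect/workflows/config.yaml (value and placement are assumptions)
update_strategy: rebase
```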

Note that a secret may be required for interactions with Version Control.

Warming

When the runner pool scales out, new machines boot up to run Bazel workloads. If these machines are cold, builds and tests will be much slower.

coming soon

A dedicated page for warming setup is on the way.

GitHub webhook

To scale the worker pool, Workflows polls the CI system every minute for new workflow runs. This polling delay slows the scale-out of new workers; you can avoid it by setting up a webhook from GitHub.

In the GitHub settings for the repository, add a new webhook. Navigate to https://github.com/[org]/[repo]/settings/hooks/new, then fill in the form with the following values:

  1. Payload URL: Find the URL for the scaling function.
     • AWS: In the AWS Console, navigate to Lambda > Functions and select the Lambda named aw_{CI_SHORTFORM_NAME}_scaling_webhook__{RUNNER_GROUP_NAME}. The URL is shown under Function URL.
       • It looks similar to https://hn12345678r2s7q5eb33nc5nca0rnhip.lambda-url.us-east-2.on.aws/.
     • GCP: In the GCP Console, navigate to Cloud Run Functions and select the function named {CI_SHORTFORM_NAME}-scaling-webhook--{RUNNER_GROUP_NAME}. The URL is shown on the function's page.
       • It looks similar to https://{REGION}-{YOUR_PROJECT_NAME}.cloudfunctions.net/{CI_SHORTFORM_NAME}-scaling-webhook--{RUNNER_GROUP_NAME}.
  2. Content Type: Select application/json.
  3. Secret: Generate using your company's secrets policy.
  4. Which events would you like to trigger this webhook?: Choose "Let me select individual events", then "Workflow jobs" and "Workflow runs".
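For the Secret field, any high-entropy random string works. A minimal sketch using OpenSSL (assuming it is available; substitute your company's approved generator):

```shell
# Generate a 32-byte random secret, hex-encoded (64 characters),
# suitable for use as a GitHub webhook secret.
secret=$(openssl rand -hex 32)
echo "$secret"
```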

Copy the generated secret into Secrets Manager:

AWS:

  1. Navigate to AWS Console > AWS Secrets Manager > Secrets.
  2. Locate the key starting aw_{CI_SHORTFORM_NAME}_lambda_webhook_secret__{RUNNER_GROUP_NAME}_xxxxxxxxxxxxxxxx.
  3. Set the value to the generated secret.

GCP:

  1. Navigate to GCP Console > Secret Manager.
  2. Locate the key starting aw_{CI_SHORTFORM_NAME}_webhook_secret_{RUNNER_GROUP_NAME}_xxxxxxxxxxxxxxxx.
  3. Set the value to the generated secret.