Optimize for Speed and Cost
The goal in this stage is to tune both the code in the repository and the Workflows configuration, searching for a sweet spot that delivers very fast build & test results while reducing the cloud cost required to operate CI.
Turn on "Build without the bytes"
This is the term of art for a Bazel setup in which outputs are never downloaded back to the machine where bazel is running.
Instead, they are stored only in the remote cache.
This feature is enabled by default in Bazel 7: blog post.
Aspect recommends upgrading to Bazel 7 if possible, as this feature became more stable in that release.
To enable it on Bazel 5 or Bazel 6, add build --remote_download_minimal to your .aspect/workflows/bazelrc file.
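For Bazel 5 or 6, the resulting .aspect/workflows/bazelrc entry looks roughly like the following sketch (the file path and flag come from the text above; the comment is illustrative):

```
# .aspect/workflows/bazelrc
# Keep intermediate outputs in the remote cache only; never download
# them back to the CI runner.
build --remote_download_minimal
```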
High cache-hit rate
Non-determinism is the property of a build step where the output varies even when inputs are the same.
Under Bazel, this causes downstream cache misses, as those changed outputs bust the cache key.
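As a contrived illustration (not from any real repository), a genrule that embeds a timestamp produces a different output on every execution, so every downstream action consuming it misses the cache:

```
# BUILD.bazel -- a deliberately non-deterministic rule (illustrative only)
genrule(
    name = "stamp",
    outs = ["stamp.txt"],
    # `date` yields a new value on each run, so this output -- and the
    # cache key of everything depending on it -- changes even when no
    # source file changed.
    cmd = "date > $@",
)
```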
There are several ways to check the cache hit rate:
- If the Grafana dashboard is set up, check the cache hit rate meter. We expect over 95%.
- Look for a line at the end of Bazel output like Executed 1 out of 347 tests: 345 tests pass and 2 were skipped. The "executed" number should be low on most runs.
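As a rough sketch, the process summary near the end of a saved Bazel log can also be parsed to approximate the hit rate. The log line below is an invented sample; a real line lists whichever execution strategies your build actually used:

```shell
# Approximate cache-hit rate from a Bazel process summary line (sample data).
line='INFO: 347 processes: 345 remote cache hit, 2 linux-sandbox.'
total=$(echo "$line" | sed -n 's/INFO: \([0-9]*\) processes.*/\1/p')
hits=$(echo "$line" | sed -n 's/.*[^0-9]\([0-9]*\) remote cache hit.*/\1/p')
echo "cache hit rate: $((100 * hits / total))%"
```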
To identify causes of non-determinism, collect Bazel's execution logs for two similar builds and compare them.
The bazel_debug_assistance option will gather execution logs.
Only enable this while debugging, as it causes builds to be slower and use a lot of disk.
```yaml
- build:
    bazel:
      bazel_debug_assistance: true
```
Once this option is enabled, execution logs from two separate Bazel runs must be downloaded.
One way to accomplish this is to merge a commit with this option enabled, letting a main branch run produce the output.
The log will appear among the artifacts produced by the Workflows run, with the prefix exec., and is usually a large file.
The next step is to gather a second log. Ideally it should not include any source code changes that cause legitimate cache misses, so the CI system can simply be triggered as a "retry" at the same commit as the first log. It's also best to ensure a different runner is used: terminate all runners, wait for the pool to scale in to zero, or even retry a build from the previous day.
After downloading two execution logs, common instructions for comparing them are provided in the Bazel documentation.
Aspect plans to add first-class determinism checking support, including a built-in execution log comparison.
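The comparison itself boils down to diffing the two logs once they are in a stable text form (the conversion tooling is described in the Bazel documentation). A minimal sketch, using two invented log fragments in place of real converted logs:

```shell
# Sketch: diff two execution-log fragments. The contents are invented
# stand-ins for the converted logs downloaded from CI.
cat > exec1.log <<'EOF'
target: //app:bin
input digest: 1111aaaa
EOF
cat > exec2.log <<'EOF'
target: //app:bin
input digest: 2222bbbb
EOF
# A differing input digest for the same target points at a
# non-deterministic action somewhere upstream.
diff exec1.log exec2.log || true
```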
Right-size instance types
Look through the instance types available through your cloud provider, taking into account the availability in your region or partition.
Cheaper instances with too few resources lead to slow builds or out-of-memory failures; instances with generous resources, however, are usually expensive.
Aspect recommends performing some experiments by doubling or halving instance sizes to "binary search" for an ideal trade-off between speed and cost.
Consider different CPU architectures as well - ARM machines are generally less expensive.
Enable "rebase" branch freshness
Bazel's performance is severely degraded when a warm machine must sync to a version control state that invalidates expensive cache entries.
To avoid this, the update_strategy attribute can rebase Pull Requests onto the most recent commit of the target branch.
See Configuration.
Note that a secret may be required for interactions with Version Control.
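Assuming the attribute sits in the Workflows configuration file alongside the other settings (the exact placement and accepted values are in the Configuration reference; the file path and value below are a sketch), the setting looks roughly like:

```yaml
# .aspect/workflows/config.yaml (placement is a sketch; see Configuration)
update_strategy: rebase
```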
Warming
When the runner pool scales out, new machines are booted up to run bazel workloads.
If these machines are cold, the build and test will be much slower.
A dedicated page for warming setup is on the way.
GitHub webhook
To scale the worker pool, Workflows polls the CI system for new workflow runs every minute. This polling delay means scale-out of new workers takes longer than necessary; you can avoid it by setting up a webhook from GitHub.
In the GitHub settings for the repository, add a new webhook.
Navigate to https://github.com/[org]/[repo]/settings/hooks/new, then fill in the form with the following values:
- Payload URL: Find the URL for the scaling function.
  - AWS: In the AWS Console, navigate to Lambda > Functions and select the lambda named aw_{CI_SHORTFORM_NAME}_scaling_webhook__{RUNNER_GROUP_NAME}. Once the lambda is selected, you can find the URL under Function URL.
    - It looks similar to https://hn12345678r2s7q5eb33nc5nca0rnhip.lambda-url.us-east-2.on.aws/.
  - GCP: In the GCP Console, navigate to Cloud Run Functions and select the function named {CI_SHORTFORM_NAME}-scaling-webhook--{RUNNER_GROUP_NAME}. Once the function is selected, you can see the URL.
    - It looks similar to https://{REGION}-{YOUR_PROJECT_NAME}.cloudfunctions.net/{CI_SHORTFORM_NAME}-scaling-webhook--{RUNNER_GROUP_NAME}.
- Content Type: Select application/json.
- Secret: Generate using your company's secrets policy.
- Which events would you like to trigger this webhook?: Choose "Let me select individual events", then "Workflow jobs" and "Workflow runs".
Copy the generated secret into Secrets Manager:
AWS:
- Navigate to AWS Console > AWS Secrets Manager > Secrets.
- Locate the key starting aw_{CI_SHORTFORM_NAME}_lambda_webhook_secret__{RUNNER_GROUP_NAME}_xxxxxxxxxxxxxxxx.
- Set the value to the generated secret.
GCP:
- Navigate to GCP Console > Secret Manager.
- Locate the key starting aw_{CI_SHORTFORM_NAME}_webhook_secret_{RUNNER_GROUP_NAME}_xxxxxxxxxxxxxxxx.
- Set the value to the generated secret.
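Once both the webhook and the stored secret are in place, the endpoint can be sanity-checked by signing a request the way GitHub signs deliveries (an HMAC-SHA256 of the body in the X-Hub-Signature-256 header). The secret, body, and URL below are placeholders, not values from this setup:

```shell
# Sign a sample payload the way GitHub webhook deliveries are signed.
SECRET='example-webhook-secret'   # placeholder; use the real stored secret
BODY='{"action":"queued"}'
SIG="sha256=$(printf '%s' "$BODY" | openssl dgst -sha256 -hmac "$SECRET" | awk '{print $NF}')"
echo "$SIG"
# Then POST it to the scaling function URL (placeholder variable shown):
# curl -X POST "$WEBHOOK_URL" \
#   -H 'Content-Type: application/json' \
#   -H 'X-GitHub-Event: workflow_job' \
#   -H "X-Hub-Signature-256: $SIG" \
#   -d "$BODY"
```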