
Tuning and Operating your Bazel Deployment

Once you've deployed Bazel to a large monorepo with hundreds of developers, you'll have to keep things fast and support product engineers who don't care about build systems and just expect their tools to Just Work ™️.

By the end of this section, you might feel like an undervalued superhero municipal sewer engineer. 😁

Metaphor

Imagine developers in a poly-repo setup, where each team maintains its own build and test setup. They're like villagers carrying their laptops to the well to draw water. At their scale, this is fine.

The monorepo is a big city. You can't have engineers spending time walking to a well! They should expect a faucet in their apartment, and water just comes out of it.

Except, there's actually a small team of municipal water supply engineers in the city. There's a lot of work required to keep the plumbing working for everyone.

(Illustrations: the village and its well; the city and its plumbing)

These people are motivated by the value of the work they do, even though their effort is rarely recognized. Except when things go wrong, and the faucet has no water or the water is polluted.

This is a pretty good metaphor for a Developer Infrastructure team. Product engineers won't notice that work is required to maintain the services that keep them productive in the metropolis that is a monorepo.

Leverage

A key selling point of Bazel is that it provides a uniform interface for Build and Test that allows a small DevInfra team to support a large number of engineers. The business wants to minimize the overhead of infrastructure, and your job is to keep it that way.

This "economy of scale" justifies the difficulty in learning and operating Bazel.

Thus, one goal of a DevInfra team must be to protect your ability to operate at scale, by preventing teams from fragmenting the codebase and making "unique snowflakes" that force you to provide custom support.

takeaway

Resist efforts by product engineers to be "special" and diverge from the monorepo standards.

Ensuring high cache-hit rates

The DevInfra team must ensure that CI remains fast by avoiding any unneeded work. Monitor your cache hit rate and be vigilant about repairing regressions.

For example, someone introduces a protoc plugin that stamps a date into the output. This is non-deterministic and means that the files produced will be different on each build, so anything (transitively) depending on them will be a cache miss.
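
As a hypothetical minimal illustration of the same class of problem (not the protoc case itself), consider a genrule whose output embeds the current time; it can never be a cache hit, and neither can anything downstream of it:

    genrule(
        name = "version_info",
        outs = ["version.txt"],
        # Non-hermetic: the output depends on the wall clock, so it differs on
        # every execution and poisons the cache for everything that depends on it.
        cmd = "echo \"built at $$(date)\" > $@",
    )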

Non-determinism higher in the build graph causes more cascading cache misses and should be addressed first.

To locate non-determinism, there are two approaches:

  1. Run two builds at the same VCS snapshot (ideally on different machines), collecting an execution log from each with --execution_log_binary_file, then diff the logs (see the sketch after this list).
  2. Run bazel dump --action_cache and check whether the actionKey values match expectations.
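
For the first approach, a minimal sketch (the target pattern and paths are placeholders; the log parser lives in Bazel's own source tree, as described in Bazel's remote caching documentation, so the bazel run command below is executed from a checkout of that tree):

    # Build twice at the same commit, capturing an execution log each time.
    bazel build //... --execution_log_binary_file=/tmp/exec1.log
    bazel clean   # or repeat the build on a second machine at the same commit
    bazel build //... --execution_log_binary_file=/tmp/exec2.log

    # Convert both logs to a stable text form, then diff to find actions whose
    # inputs, environment, or outputs differ between the two builds.
    bazel run //src/tools/execlog:parser -- \
        --log_path=/tmp/exec1.log --log_path=/tmp/exec2.log \
        --output_path=/tmp/exec1.log.txt --output_path=/tmp/exec2.log.txt
    diff /tmp/exec1.log.txt /tmp/exec2.log.txt
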
note

Remote Build Execution is sometimes sold as a solution because it spreads the work over many machines, but if the work is unneeded in the first place, it's the wrong fix: you pay an excessive compute bill for work you shouldn't be doing at all.

We plan to add a non-determinism detector in our CI/CD product, workflows.

Become an expert at Bazel profiles

Every run of Bazel produces a profile. You don't need to set any flags to have one. You should configure CI builds to collect the Bazel profile and store the result where possible.

By default the profile of your most recent bazel run can be found in $(bazel info output_base)/command.profile.gz

It can be viewed in a couple of ways:

  • Load it into chrome://tracing or https://ui.perfetto.dev for an interactive timeline of the build.
  • Run bazel analyze-profile <path> to print a summary of phase timings and the critical path in the terminal.
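
For example (paths here are illustrative):

    # Grab the profile from the most recent invocation
    cp "$(bazel info output_base)/command.profile.gz" /tmp/last.profile.gz

    # Print a text summary of phases and the critical path
    bazel analyze-profile /tmp/last.profile.gz

    # In CI, write the profile to a known location so it can be archived
    bazel test //... --profile=/tmp/ci.profile.gz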

You'll often spot some easy low-hanging fruit in the profile, such as an action taking much longer than it should, or running when it doesn't need to.

Exercise

Let's understand the codelab repository by profiling it.

  • Run any bazel command in the codelab repository, for example bazel test //client/...
  • Open the profile.
  • Look for some things we want to improve. Is there anything in the profile we didn't expect?

Tuning resources per-action

Bazel has a heuristic-based scheduler that tries to maximize how much work can happen on the computer without overloading the system.

Resources that might be overused:

  • RAM: Bazel schedules too many compilation actions and exhausts system memory; the OS starts swapping and the machine hangs or becomes unusable.
  • CPU: Bazel schedules too many intensive tests in parallel and they all fail to complete within their timeout because they run too slowly on a loaded system.
  • Network throughput: A misconfigured NAT gateway throttles outbound connections, resulting in a hung container test.
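
You can cap Bazel's resource estimates and tag heavyweight targets so the scheduler accounts for them. A sketch, where the numbers are illustrative starting points to tune for your own machines and the test target is hypothetical:

    # .bazelrc: leave headroom for the OS and other processes
    build --local_cpu_resources=HOST_CPUS-2
    build --local_ram_resources=HOST_RAM*.67
    build --jobs=auto

    # BUILD: tell the scheduler this test needs several cores to itself
    sh_test(
        name = "load_test",
        srcs = ["load_test.sh"],
        tags = ["cpu:4"],  # reserve 4 CPUs when running this test locally
    )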

Prevent "weeds"

As a monorepo "gardener", you'll constantly fight against accumulation of undesirable things that interfere with proper operation. This is just like weeds which compete with the plants in a garden that you meant to grow.

Weeds are easy to pull out when they are tiny sprouts, but require a lot of work once the roots take hold. The same is true for monorepo maintenance - as a bad pattern gets adoption, it will have dependencies and coupling that make it much harder to remove.

The ideal solution prevents these from being checked in at all. The next best is a warning in code review that a bad pattern is being introduced. Ultimately you'll also need practices for detecting them, such as scanning user feedback, and a culture of good hygiene where engineers and their managers agree on pulling weeds early.

Prevent build and test actions from accumulating dependencies on the network

This is easy to prevent early in a repository, but very difficult once engineers have depended on lax rules!

Bazel's test sandbox can prevent tests from connecting to the network: --sandbox_default_allow_network=false.

Individual test targets can opt back in with the requires-network tag.
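
A minimal sketch of the two pieces together (the test target is hypothetical):

    # .bazelrc: no network inside the sandbox unless a target asks for it
    build --sandbox_default_allow_network=false

    # BUILD: a test that legitimately needs the network opts back in
    sh_test(
        name = "smoke_test_against_staging",
        srcs = ["smoke_test.sh"],
        tags = ["requires-network"],
    )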

Preventing arbitrary fetches from the internet is harder. We suggest setting up iptables-based network blocking on CI workers.
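
As a rough sketch of what that might look like on a Linux CI worker (the user name and hosts are placeholders, and real rules depend on your network layout):

    # Allow the build user to reach the remote cache and the VCS host,
    # then drop any other outbound traffic it generates.
    iptables -A OUTPUT -m owner --uid-owner ci-agent -d cache.internal.example -j ACCEPT
    iptables -A OUTPUT -m owner --uid-owner ci-agent -d vcs.internal.example -j ACCEPT
    iptables -A OUTPUT -m owner --uid-owner ci-agent -j DROP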

Prevent accidental dependency edges

You can't build a highly reliable service on top of a dependency with low availability.

The same is true for dependency edges in the graph. If engineers on an important, business-critical service take a dependency within the monorepo on poorly maintained library code, the library developers are stuck trying to meet an SLA they can't. They never signed up to support the needs of this service using their library.

Bazel's visibility system is an excellent way to force users to first "sign up" before they depend on your library (or, if you have a service with a client library, prevent them from using the service).

A good pattern is for the visibility to start out minimal, just for the current project. Then add a package_group when you need to expand the visibility. Applications that want to depend on the library have to first send a PR to add their package(s) to that group, and the code review process should require a review from a library maintainer. This is your chance to discuss whether the dependency should be allowed.
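
A sketch of that pattern (the package and target names are made up for illustration):

    # libs/parsing/BUILD
    package_group(
        name = "parsing_users",
        packages = [
            "//libs/parsing/...",  # start with just the library's own project
        ],
    )

    java_library(
        name = "parsing",
        srcs = glob(["src/main/java/**/*.java"]),
        visibility = [":parsing_users"],
    )

    # Later, an application that wants this library sends a PR adding, e.g.
    #     "//apps/checkout/...",
    # to the package_group, and a library maintainer reviews whether the
    # dependency (and the support expectation it implies) is acceptable.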