Cutting CI costs by over half
Introduction
Every good project has a Continuous Integration (CI) and Continuous Deployment (CD) process. CI and CD give you the confidence you need to ensure your code is working as expected.
In the Mozilla `application-services` project, we run a variety of CI/CD jobs. We use GitHub Actions, CircleCI, and a Mozilla-made tool called Taskcluster.
For this post, I'll focus only on CircleCI, which is what we use for most of our CI jobs that run on pull requests.
We have many jobs that run on CircleCI. Most of those jobs have one thing in common: they need to build our Rust code.
Caching in CI
Before I describe the problem and how we fixed it, let's first talk about caching in CI and why it's crucial.
I mentioned above that most of our CI jobs need to build our Rust code. We have a lot of Rust code, and it takes a long time to compile.
Without any caching, our job that runs the Rust tests would take over 25 minutes, and the majority of that time would be spent compiling.
Sccache to the rescue
sccache is a compiler cache that helps avoid compilation whenever possible. Lucky for us, it has first-class support for Rust.
Essentially, you tell sccache where to store your cache, and it will both cache and retrieve compilation results on its own.
You have to tell the Rust compiler how to work with sccache before anything can happen. This can be done by setting the `RUSTC_WRAPPER` environment variable.
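For example, from a shell (a minimal sketch; it assumes `sccache` is installed and on your `PATH`):

```shell
# Tell cargo/rustc to route compiler invocations through sccache.
export RUSTC_WRAPPER=sccache

# Any build run from this shell now goes through the cache, e.g.:
# cargo build
```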
Once you do that, whenever you invoke the Rust compiler, it will run through sccache.
Additionally, sccache uses a client-server architecture: a server listens for compilation requests and serves them. This helps abstract away where your cache is located, since sccache supports both local storage and cloud storage.
Sccache on CI
CI spends a long time compiling code, and sccache exists to help make sure we don't recompile the same code. But a question remains: how can we get sccache to run on CI and share caches between jobs?
You have a couple of options. I am not going to go too deep into the tradeoffs, but:
- You can host your cache on cloud storage! Something like Amazon S3 would work just fine.
- Bake the cache into a custom Docker image, then have CI use that image. I have not thought this one through, but it is possible. You need to make sure that the image stays updated, and you need somewhere to host it (Docker Hub, for example).
- Store the `sccache` cache in a local directory on the CI runner, then save and retrieve it as a cache using CircleCI's caching feature.
In `application-services`, we chose the third option. Below is what our CircleCI configuration used to look like; I'll go over what it's doing, why that's a problem, and how we fixed it in the rest of this post.
save-sccache-cache
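A hedged reconstruction of the old step as a CircleCI command (the `paths` entry is an assumed sccache cache location, not the actual config; the key matches the one described in this post):

```yaml
commands:
  save-sccache-cache:
    steps:
      - save_cache:
          key: sccache-cache-stable-{{ arch }}-{{ .Environment.CIRCLE_JOB }}-{{ epoch }}
          paths:
            - "~/.cache/sccache"
```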
This step persists the sccache cache to a CircleCI cache, so other jobs can then retrieve it. As with other caches, we need to specify the cache key. In this case, the key is `sccache-cache-stable-{{ arch }}-{{ .Environment.CIRCLE_JOB }}-{{ epoch }}`. Let's break it down.
- The `sccache-cache-stable` part is just a prefix to indicate that this is a sccache cache.
- The `{{ arch }}` part is the architecture of the machine. The architecture is important because machines running different architectures must have different caches: sccache is a compilation cache, and compilation results built for one architecture cannot be reused on another.
- The `{{ .Environment.CIRCLE_JOB }}` part is the name of the job. Different jobs should not share the same cache, because jobs run varying tasks and can require different compilation results. Important: jobs with the same name still share caches. For example, if a `rust-tests` job runs now and another `rust-tests` job runs later, the latter will use the cache from the first.
- The `{{ epoch }}` part is a timestamp. Every time the job executes, a new cache is uploaded. Let's keep this in mind: we added the timestamp to ensure that the cache is as fresh as possible, but it has a dramatic side effect we'll talk about later.
restore-sccache-cache
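A hedged reconstruction of the old restore step (again as a CircleCI command; the structure is a sketch, the key is the one the post describes):

```yaml
commands:
  restore-sccache-cache:
    steps:
      - restore_cache:
          keys:
            - sccache-cache-stable-{{ arch }}-{{ .Environment.CIRCLE_JOB }}
```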
This step is the reverse of `save-sccache-cache`: it restores the cache from the CircleCI cache. The key it uses is `sccache-cache-stable-{{ arch }}-{{ .Environment.CIRCLE_JOB }}`. The only difference is that it doesn't have the `{{ epoch }}` part. However, CircleCI is smart enough to get the most recent matching cache for us.
Discovering a problem
I happened to look at our CI usage data while making unrelated improvements to our CI.
That's when I noticed that we were grossly over our allotted storage space on CircleCI. In November, Mozilla used over 1.5 TB-months of storage; our allotment was 200 GB-months. Additionally, the vast majority of Mozilla's storage came from `application-services`, our repository. And the final nail in the coffin: the majority of that storage came from caches.
So we were 1.3 TB over our allotted storage per month. CircleCI indicates that for every GB of overage storage, we incur a cost of 420 credits (I am not entirely sure about the dollar cost of a credit, but from their FAQ it seems it's $0.5 per 820 credits).
So 1300 GB * 420 credits/GB = 546,000 credits.
In dollar amounts, that's 546,000 / ~1,600 ≈ $340 per month.
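The arithmetic, as a quick shell sketch using the numbers above:

```shell
# 1300 GB of overage at 420 credits per GB-month.
credits=$((1300 * 420))
echo "${credits} credits"          # 546000 credits

# Roughly 1,600 credits per dollar, per the FAQ estimate above.
echo "~\$$((credits / 1600)) per month"
```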
It’s pocket change for a company, but it’s a complete waste to keep that expenditure going.
That said, the 546,000 credits were higher than the monthly credit usage of all our CI resources in `application-services` combined.
The problem
The problem was that our caches were taking up too much storage space, and the storage cost was due to the `{{ epoch }}` part of the cache key.
What was happening is that whenever any job ran, it would retrieve the sccache cache from CircleCI's storage, run its task, then upload the cache back under a different cache key. The job derives the key from the job name and the timestamp, and the timestamp is always different.
We have many jobs that run per workflow, and each job would eventually end up with a 2 GB cache. Whenever we commit any code, our workflow runs, so our workflows run many times a day. The storage cost explodes into terabytes used by the caches.
This problem is avoidable. Most of our jobs of the same type don't compile any code that the job before them hadn't already compiled. So they don't need to upload a new cache; they can leave the cache be, and the next job will use the same cache.
The solution
Once we identified the problem, the solution was relatively simple. We need to ensure that we don’t upload a new cache if we don’t need to!
So when do we need to upload a new cache? Well, we need to upload a new cache if:
- There is no existing cache
- There have been changes to the code we need to compile. For example, there are new dependencies we need to build and cache.
The new save-sccache-cache step
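A hedged sketch of the new step (the `paths` entry is an assumed cache location; `{{ checksum "Cargo.lock" }}` is CircleCI's built-in checksum template):

```yaml
commands:
  save-sccache-cache:
    steps:
      - save_cache:
          key: sccache-cache-stable-{{ arch }}-{{ .Environment.CIRCLE_JOB }}-{{ checksum "Cargo.lock" }}
          paths:
            - "~/.cache/sccache"
```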
So what's different? We no longer have the `{{ epoch }}` part in the cache key; we use the checksum of the `Cargo.lock` file instead. Therefore, we only get a new key when the `Cargo.lock` file changes. In Rust, the `Cargo.lock` file records the dependencies and the exact version of each dependency.
Our dependencies are not updated often, relative to the number of jobs that run. This scheme saves us a ton of storage costs.
The new restore-sccache-cache step
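A hedged sketch of the new restore step, relying on CircleCI trying the listed keys in order and falling back to prefix matches:

```yaml
commands:
  restore-sccache-cache:
    steps:
      - restore_cache:
          keys:
            # Prefer an exact match for the current Cargo.lock...
            - sccache-cache-stable-{{ arch }}-{{ .Environment.CIRCLE_JOB }}-{{ checksum "Cargo.lock" }}
            # ...otherwise fall back to the next best cache for this job.
            - sccache-cache-stable-{{ arch }}-{{ .Environment.CIRCLE_JOB }}-
```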
Fundamentally, this doesn't change much.
We have multiple keys now. The idea is that we use the cache associated with the current `Cargo.lock` file if it exists; if not, we'll get the next best cache. Once the job is complete, it will upload its updated cache under a key associated with the current `Cargo.lock` file, and the following job will then use a fresh cache.
Conclusion
CI is critical, but it’s also crucial to make sure we are conscious of the resources we are using. As developers, we sometimes assume that our CI resources are unlimited. This leaves us with problems like the one highlighted in this post, problems that are completely avoidable.
As a complete side note, fixing this problem also fixed another very subtle problem with caching in CI.
To understand that problem, you need to know the names of our CircleCI jobs, mainly the following two:
- We have a job with a `CIRCLE_JOB` called `Rust tests` that runs all of our Rust tests.
- We have another job with a `CIRCLE_JOB` called `Rust tests - min supported version` that runs a subset of our Rust tests against the minimum supported Rust version.
This looks fine at first glance, but looking back at our old `restore-sccache-cache` step, the key was `sccache-cache-stable-{{ arch }}-{{ .Environment.CIRCLE_JOB }}`.
However, the `CIRCLE_JOB` of the `Rust tests - min supported version` job includes the `CIRCLE_JOB` of the `Rust tests` job as a prefix.
This means that sometimes, when a `Rust tests` job wants to retrieve a cache, it will retrieve the `Rust tests - min supported version` cache instead, due to how CircleCI's cache retrieval works.
Whenever that happens, our cache hit rate will be extremely low, causing a slow `Rust tests` job.
Querying for the new keys in `restore-sccache-cache` solves this problem.
The problem is avoided because, by attaching the checksum, the key for the `Rust tests` job would be something like:
`sccache-cache-stable-arch1-linux-amd64-6_85-Rust tests-mRDT2OUNSgh1rWIa7OFQlOS4Ear6gnwYnUEJBfMG7x4=`
The checksum at the end ensures that CircleCI can't pick up the cache for the `Rust tests - min supported version` job by mistake.