Cutting CI costs by over half
Introduction
Every good project has a Continuous Integration (CI) and Continuous Deployment (CD) process. CI and CD give you the confidence you need to ensure your code is working as expected.
In the Mozilla `application-services` project, we run a variety of CI/CD jobs. We use GitHub Actions, CircleCI, and a Mozilla-made tool called Taskcluster.
For this post, I'll focus only on CircleCI, which is what we use for most of our CI jobs that run on pull requests.
We have many jobs that run on CircleCI. Most of those jobs have one thing in common: they need to build our Rust code.
Caching in CI
Before I describe the problem and how we fixed it, let's first talk about caching in CI and why it's crucial.
I mentioned above that most of our CI jobs need to build our Rust code. We have a lot of Rust code, and it takes a long time to compile.
Without any caching, our job that runs the Rust tests would take over 25 minutes, and the majority of that time would be spent compiling.
Sccache to the rescue
sccache is a compiler cache that helps avoid compilation whenever possible. Lucky for us, it has first-class support for Rust.
Essentially, you tell sccache where to store your cache, and it will both cache and retrieve compilation results on its own.
You have to tell the Rust compiler how to work with sccache before anything can happen. This can be done by setting the `RUSTC_WRAPPER` environment variable.
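For example, from a shell (a minimal sketch; it assumes `sccache` is installed and on your `PATH`):

```shell
# Tell cargo/rustc to route compiler invocations through sccache.
export RUSTC_WRAPPER=sccache

# Any build run from this shell now goes through the cache, e.g.:
# cargo build
```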
Once you do that, whenever you invoke the Rust compiler, it will run through sccache.
Additionally, sccache uses a client-server architecture: a server listens for compilation requests and serves them. This helps abstract away where your cache is located, since sccache supports both local storage and cloud storage.
Sccache on CI
CI spends a long time compiling code, and sccache exists to help make sure we don't recompile the same code. But a question remains: how can we get sccache to run on CI and share caches between jobs?
You have a couple of options. I am not going to go too deep into the tradeoffs, but:
- You can host your cache on cloud storage! Something like Amazon S3 would work just fine.
- Bake the cache into a custom Docker image, then have CI use that image. I have not thought this one through, but it is possible. You need to make sure that the image stays updated, and you need somewhere to host it (Docker Hub, for example).
- Store the `sccache` cache in a local directory on the CI runner, then save and retrieve it as a cache using CircleCI's caching feature.
In `application-services`, we chose the third option. Below is what our CircleCI configuration used to look like; I'll go over what it's doing, why that's a problem, and how we fixed it in the rest of this post.
save-sccache-cache
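A hedged reconstruction of the old step as a CircleCI command (the `paths` entry is an assumed sccache cache location, not the actual config; the key matches the one described in this post):

```yaml
commands:
  save-sccache-cache:
    steps:
      - save_cache:
          key: sccache-cache-stable-{{ arch }}-{{ .Environment.CIRCLE_JOB }}-{{ epoch }}
          paths:
            - "~/.cache/sccache"
```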
This step persists the sccache cache to a CircleCI cache, so other jobs can then retrieve it. As with other caches, we need to specify the cache key. In this case, the key is `sccache-cache-stable-{{ arch }}-{{ .Environment.CIRCLE_JOB }}-{{ epoch }}`. Let's break it down.
- The `sccache-cache-stable` part is just a prefix to indicate that this is a sccache cache.
- The `{{ arch }}` part is the architecture of the machine. The architecture is important because machines running different architectures must have different caches: sccache is a compilation cache, and compilation results built for one architecture cannot be reused on another.
- The `{{ .Environment.CIRCLE_JOB }}` part is the name of the job. Different jobs should not share the same cache, because jobs run varying tasks and can require different compilation results. Important: jobs with the same name still share caches. For example, if a `rust-tests` job runs now and another `rust-tests` job runs later, the latter will use the cache from the first.
- The `{{ epoch }}` part is a timestamp. Every time the job executes, a new cache is uploaded. Let's keep this in mind: we added the timestamp to ensure that the cache is as fresh as possible, but it has a dramatic side effect we'll talk about later.
restore-sccache-cache
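A hedged reconstruction of the old restore step (again as a CircleCI command; the structure is a sketch, the key is the one the post describes):

```yaml
commands:
  restore-sccache-cache:
    steps:
      - restore_cache:
          keys:
            - sccache-cache-stable-{{ arch }}-{{ .Environment.CIRCLE_JOB }}
```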
This step is the reverse of `save-sccache-cache`: it restores the cache from the CircleCI cache. The key it uses is `sccache-cache-stable-{{ arch }}-{{ .Environment.CIRCLE_JOB }}`. The only difference is that it doesn't have the `{{ epoch }}` part. However, CircleCI is smart enough to get the most recent matching cache for us.
Discovering a problem
I happened to look at our CI usage data while making unrelated improvements to our CI.
That's when I noticed that we were grossly over our allotted storage space on CircleCI. In November, Mozilla used over 1.5 TB-months of storage; our allotment was 200 GB-months. Additionally, the vast majority of Mozilla's storage came from `application-services`, our repository. And the final nail in the coffin: the majority of that storage came from caches.
So we were 1.3 TB over our allotted storage per month. CircleCI indicates that for every GB of overage storage, we incur a cost of 420 credits (I am not entirely sure about the dollar cost of a credit, but from their FAQ it seems it's $0.5 per 820 credits).
So 1300 GB * 420 credits/GB = 546,000 credits.
In dollar amounts, that's 546,000 / ~1,600 ≈ $340 per month.
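The arithmetic, as a quick shell sketch using the numbers above:

```shell
# 1300 GB of overage at 420 credits per GB-month.
credits=$((1300 * 420))
echo "${credits} credits"          # 546000 credits

# Roughly 1,600 credits per dollar, per the FAQ estimate above.
echo "~\$$((credits / 1600)) per month"
```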
It’s pocket change for a company, but it’s a complete waste to keep that expenditure going.
That said, the 546,000 credits were higher than the monthly credit usage of all our CI resources in `application-services` combined.
The problem
The problem was that our caches were taking up too much storage space, and the storage cost was due to the `{{ epoch }}` part of the cache key.
What was happening is that whenever any job ran, it would retrieve the sccache cache from CircleCI's storage, run its task, then upload the cache back under a different cache key. The job derives the key from the job name and the timestamp, and the timestamp is always different.
We have many jobs that run per workflow, and each job would eventually end up with a 2 GB cache. Whenever we commit any code, our workflow runs, so our workflows run many times a day. The storage cost explodes into terabytes used by the caches.
This problem is avoidable. Most of our jobs of the same type don't compile any code that the job before them hadn't already compiled. So they don't need to upload a new cache; they can leave the cache be, and the next job will use the same cache.
The solution
Once we identified the problem, the solution was relatively simple. We need to ensure that we don’t upload a new cache if we don’t need to!
So when do we need to upload a new cache? Well, we need to upload a new cache if:
- There is no existing cache
- There have been changes to the code we need to compile. For example, there are new dependencies we need to build and cache.
The new save-sccache-cache step
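A hedged sketch of the new step (the `paths` entry is an assumed cache location; `{{ checksum "Cargo.lock" }}` is CircleCI's built-in checksum template):

```yaml
commands:
  save-sccache-cache:
    steps:
      - save_cache:
          key: sccache-cache-stable-{{ arch }}-{{ .Environment.CIRCLE_JOB }}-{{ checksum "Cargo.lock" }}
          paths:
            - "~/.cache/sccache"
```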
So what's different? We no longer have the `{{ epoch }}` part in the cache key; we use the checksum of the `Cargo.lock` file instead. Therefore, we only get a new key when the `Cargo.lock` file changes. In Rust, the `Cargo.lock` file records the dependencies and the exact version of each dependency.
Our dependencies are not updated often, relative to the number of jobs that run. This scheme saves us a ton of storage costs.
The new restore-sccache-cache step
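A hedged sketch of the new restore step, relying on CircleCI trying the listed keys in order and falling back to prefix matches:

```yaml
commands:
  restore-sccache-cache:
    steps:
      - restore_cache:
          keys:
            # Prefer an exact match for the current Cargo.lock...
            - sccache-cache-stable-{{ arch }}-{{ .Environment.CIRCLE_JOB }}-{{ checksum "Cargo.lock" }}
            # ...otherwise fall back to the next best cache for this job.
            - sccache-cache-stable-{{ arch }}-{{ .Environment.CIRCLE_JOB }}-
```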
Fundamentally, this doesn't change much.
We have multiple keys now. The idea is that we use the cache associated with the current `Cargo.lock` file if it exists; if not, we'll get the next best cache. Once the job is complete, it will upload its updated cache under a key associated with the current `Cargo.lock` file, and the following job will then use a fresh cache.
Conclusion
CI is critical, but it’s also crucial to make sure we are conscious of the resources we are using. As developers, we sometimes assume that our CI resources are unlimited. This leaves us with problems like the one highlighted in this post, problems that are completely avoidable.
As a complete side note, fixing this problem also fixed another very subtle problem with caching in CI.
To understand that problem, you need to know the names of our CircleCI jobs, mainly the following two:
- We have a job with a `CIRCLE_JOB` called `Rust tests` that runs all of our Rust tests.
- We have another job with a `CIRCLE_JOB` called `Rust tests - min supported version` that runs a subset of our Rust tests against the minimum supported Rust version.
This looks fine at first glance, but looking back at our old `restore-sccache-cache` step, the key was `sccache-cache-stable-{{ arch }}-{{ .Environment.CIRCLE_JOB }}`.
However, the `CIRCLE_JOB` of the `Rust tests - min supported version` job includes the `CIRCLE_JOB` of the `Rust tests` job as a prefix.
This means that sometimes, when a `Rust tests` job wants to retrieve a cache, it will retrieve the `Rust tests - min supported version` cache instead, due to how CircleCI's cache retrieval works.
Whenever that happens, our cache hit rate will be extremely low, causing a slow `Rust tests` job.
Querying for the new keys in `restore-sccache-cache` solves this problem.
The problem is avoided because, by attaching the checksum, the key for the `Rust tests` job would be something like:
`sccache-cache-stable-arch1-linux-amd64-6_85-Rust tests-mRDT2OUNSgh1rWIa7OFQlOS4Ear6gnwYnUEJBfMG7x4=`
The checksum at the end ensures that CircleCI can't pick up the cache for the `Rust tests - min supported version` job by mistake.