---
type: "blog-post"
title: "Superior caching with dagger"
description: "Dagger is an up-and-coming ci/cd orchestration tool as code; this may sound abstract, but it is quite simple, read on to learn more."
draft: false
date: "2023-08-02"
updates:
  - time: "2023-08-02"
    description: "first iteration"
tags:
  - '#blog'
---

Dagger is an up-and-coming ci/cd orchestration tool as code. This may sound
abstract, but it is quite simple; read on to learn more.

## Introduction

This post is about me finding a solution to a problem I've faced for a while
with `rust` caching for docker images. I ran into it while building a new tool
I am working on called `cuddle-please` (a release manager inspired by
[release-please](https://github.com/googleapis/release-please)).

I will start with a brief introduction to dagger, then the problem, and how
dagger solves it in comparison to docker.

## What is dagger

> If you already know what dagger is, feel free to skip ahead. I will explain
> briefly what it is, and give a short example.

Dagger is a tool where you can define your pipelines as code. Dagger doesn't
aim to replace your tools, such as bash, clis, apis and whatnot; it wants to
let you orchestrate them to your heart's content, and at the same time bring
proper engineering principles to it, such as testing, packaging, and
ergonomics.

Dagger allows you to write your pipelines in one of the supported languages
(the list of which is rapidly expanding).

The official languages, maintained by the dagger team, are:

- Go
- Python
- Typescript

Community-based ones are:

- Rust (I am currently the author and maintainer of this one, but I don't work
  for `dagger`)
- Elixir
- Dotnet (in progress)
- Java (in progress)
- Ruby, etc.

Dagger at its simplest is an api on top of `docker`, or rather `buildkit`, but
it brings so much more with it. You can kind of think of `dagger` as a
juiced-up `Dockerfile`, but it brings more interactivity and programmability.
It even has elements of `docker-compose` as well. I personally call it
`Programmatic Orchestration`.

Anyways, a sample pipeline could be:

```rust
#[tokio::main]
async fn main() -> eyre::Result<()> {
    let client = dagger_sdk::connect().await?;

    let output = client
        .container()
        .from("alpine")
        .with_exec(vec!["echo", "hello-world"])
        .stdout()
        .await?;

    println!("stdout: {output}");

    Ok(())
}
```

Now simply build and run it:

```bash
cargo run
```

This will go ahead and download the image and run the `echo "hello-world"`
command, whose output we can then extract and print. This is a very basic
example. The equivalent `Dockerfile` would look like this:

```Dockerfile
FROM alpine

RUN echo "hello-world"
```

> The only prerequisite is a newer version of `docker`, but you can also
> install `dagger` as well, for better ergonomics and output.

However, as its namesake suggests, dagger runs on dags (directed acyclic
graphs). This means that where you would normally use multi-stage
`Dockerfiles`:

```Dockerfile
FROM alpine as base

FROM base as builder
RUN ...

FROM base as production
COPY --from=builder /mnt/... .
```

This forms a dag when you run `docker build .`, where:

```
base runs first because builder depends on it.
after builder is done, production runs because it depends on builder.
```

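To make that ordering concrete, here is a tiny stand-alone sketch in plain Rust (not dagger code, and the stage graph is just the toy example above): a depth-first walk that visits each stage's dependencies before the stage itself, which is essentially what a dag scheduler does.

```rust
use std::collections::HashMap;

// Toy model of dag scheduling: resolve the order in which stages must
// run by visiting a stage's dependencies before the stage itself
// (a depth-first topological sort over a small, acyclic stage graph).
fn build_order(deps: &HashMap<&str, Vec<&str>>, target: &str) -> Vec<String> {
    fn visit(deps: &HashMap<&str, Vec<&str>>, stage: &str, order: &mut Vec<String>) {
        // Run everything this stage depends on first.
        for dep in deps.get(stage).map(|v| v.as_slice()).unwrap_or(&[]) {
            visit(deps, dep, order);
        }
        // Then run the stage itself, once.
        if !order.iter().any(|s| s.as_str() == stage) {
            order.push(stage.to_string());
        }
    }
    let mut order = Vec::new();
    visit(deps, target, &mut order);
    order
}
```

With `builder` depending on `base` and `production` depending on `builder`, the resolved order is `base`, `builder`, `production`, matching the description above.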
Dagger does the same thing behind the scenes, but with a much more capable
api.

In dagger you can easily share sockets, files, folders, containers, stdout,
etc. All of this can be done in a programming language, instead of in a
recipe-like declarative file such as a `Dockerfile`.

It should be noted that dagger transforms your code into a declarative
manifest behind the scenes, kind of like `Pulumi`, though it is still
interactive. Think `SQL`, where each query is a declarative command/query.

## Why orchestration matters

Dagger is a paradigm shift, because you can now apply engineering practices on
top of your pipelines. Normally, in Dockerfiles, you would download all sorts
of clis to manage your package managers, plus tooling such as `jq` and
whatnot, to perform small changes that make your scripts compatible with
`docker build`.

## The problem

A good example is building production images for rust. Building ci docker
images for rust is a massive pain. This is because when you run `cargo build`,
or any of its siblings, cargo refreshes the package registry if needed,
downloads dependencies, forms the dependency chain between crates, and builds
the final crates / binaries. This is very bad for caching, because you can't
tell `cargo` to only fetch dependencies and compile them, but leave your own
crates alone.

This generally means that you will cache-bust your dependencies each time you
make a code change to your crates, no matter how small. `Dockerfile`, or
rather `buildkit` on its own, isn't able to properly split the cache between
these commands, because from its point of view it is all a single atomic
command.

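A toy way to see the cache-busting: buildkit keys a layer roughly on a digest of its inputs, so a `COPY . .` layer changes whenever any copied file changes, and everything behind it re-runs. This is a plain-Rust illustration of that idea (hypothetical digests, no buildkit involved):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy stand-in for a layer cache key: hash every (path, contents) pair
// that a `COPY . .` would pull in. Any edit to any file changes the key,
// so the `cargo build` layer behind it re-runs from scratch.
fn layer_key(files: &[(&str, &str)]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for (path, contents) in files {
        path.hash(&mut hasher);
        contents.hash(&mut hasher);
    }
    hasher.finish()
}
```

Hash the same file set twice and the keys match (cache hit); touch one source file and the key changes, even though the dependency manifests are identical.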
Existing solutions have you download tools to handle this for you, but those
are cumbersome and, to be honest, unreliable. For example, `cargo-chef`:
cargo-chef lets you create a recipe.json file, which contains a list of all
your dependencies, which you can move from a planner step into your build
step, and cache the dependencies that way. I've honestly found this really
flaky, as the lower `recipe.json`-producing image would cache-bust all the
time.

```Dockerfile
FROM lukemathwalker/cargo-chef:latest-rust-1 AS chef
WORKDIR /app

FROM chef AS planner
COPY . .
RUN cargo chef prepare --recipe-path recipe.json

FROM chef AS builder
COPY --from=planner /app/recipe.json recipe.json
# Build dependencies - this is the caching Docker layer!
RUN cargo chef cook --release --recipe-path recipe.json
# Build application
COPY . .
RUN cargo build --release --bin app

# We do not need the Rust toolchain to run the binary!
FROM debian:buster-slim AS runtime
WORKDIR /app
COPY --from=builder /app/target/release/app /usr/local/bin
ENTRYPOINT ["/usr/local/bin/app"]
```

The above is the original example, but it has some flaws: it relies on the
checksum of the recipe.json staying the same. If you make a change in one of
your crates, it can bust the hash of the recipe.json, because we just load all
the files with `COPY . .`.

Instead, what we would like to do is load only the `Cargo.toml` and
`Cargo.lock` files for our workspace, as well as those of any crates we've
got, and then dynamically construct empty main.rs and lib.rs files to act as
the binaries. This is the simplest approach, but very bothersome in a
`Dockerfile`.

```Dockerfile
FROM rustlang/rust:nightly as base

FROM base as dep-builder
WORKDIR /mnt/src
COPY Cargo.lock .
COPY **/Cargo.toml .

RUN echo "fn main() {}" >> crates/<some-crate>/src/main.rs
RUN echo "fn some() {}" >> crates/<some-crate>/src/lib.rs

RUN echo "fn main() {}" >> crates/<some-other-crate>/src/main.rs
RUN echo "fn some() {}" >> crates/<some-other-crate>/src/lib.rs

# ...

RUN cargo build # refreshes the registry, fetches deps, compiles them, and links them into a dummy binary

FROM base as builder

WORKDIR /mnt/src

COPY --from=dep-builder /mnt/src/target target
COPY Cargo.lock .
COPY **/Cargo.toml .
COPY crates crates

RUN cargo build # compiles our own code and links everything together, reusing the cache from the incremental build done previously
```

This is very cumbersome, as you have to remember to update the `echo` lines
above. You can script your way out of it, but it is just an ugly approach that
is hard to maintain and grok.

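For completeness, "scripting your way out" could look like a small generator that emits those stub-creation `RUN` lines from a crate list, so they never go stale. This is a hypothetical sketch, not something the setup above actually uses:

```rust
// Hypothetical generator for the stub-creation RUN lines above, so the
// Dockerfile no longer has to be updated by hand for every new crate.
fn stub_run_lines(crates: &[&str]) -> String {
    let mut out = String::new();
    for krate in crates {
        // One stub binary entry point and one stub library root per crate.
        out.push_str(&format!(
            "RUN echo \"fn main() {{}}\" > crates/{krate}/src/main.rs\n"
        ));
        out.push_str(&format!(
            "RUN echo \"fn some() {{}}\" > crates/{krate}/src/lib.rs\n"
        ));
    }
    out
}
```

Even so, you now have a script generating a Dockerfile, which is exactly the kind of indirection the next section avoids.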
## The solution built in dagger

Instead, what we can do in `dagger` is use a proper programmatic tool for
this.

```rust
// Some stuff omitted for brevity

// 1
let mut rust_crates = vec![PathBuf::from("ci")];

// 2
let mut dirs = tokio::fs::read_dir("crates").await?;
while let Some(entry) = dirs.next_entry().await? {
    if entry.metadata().await?.is_dir() {
        rust_crates.push(entry.path());
    }
}

// 3
fn create_skeleton_files(
    directory: dagger_sdk::Directory,
    path: &Path,
) -> eyre::Result<dagger_sdk::Directory> {
    let main_content = r#"fn main() {}"#;
    let lib_content = r#"fn some() {}"#;

    let directory = directory.with_new_file(
        path.join("src").join("main.rs").display().to_string(),
        main_content,
    );
    let directory = directory.with_new_file(
        path.join("src").join("lib.rs").display().to_string(),
        lib_content,
    );

    Ok(directory)
}

// 4
let mut directory = directory;
for rust_crate in rust_crates.into_iter() {
    directory = create_skeleton_files(directory, &rust_crate)?;
}
```

You can find this in
[cuddle-please](https://git.front.kjuulh.io/kjuulh/cuddle-please/src/branch/main/ci/src/main.rs),
which uses dagger as part of its `ci`. Anyways, for those not versed in
`rust`, which most people probably aren't, what is happening here, in rough
terms, is:

1. We create a list of known crates. In this case `ci` is added manually,
   because it is a bit special.
2. We list all folders in the `crates` folder and add them to `rust_crates`.
3. An inline function is created, which can add new files to an existing
   directory; in this case it adds both a main.rs and a lib.rs file with some
   dummy content at a given path.
4. Here we apply these files for all the crates we found above.

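Because this is ordinary code, the skeleton step is also easy to test in isolation. Here is a dagger-free sketch of the same idea against the real filesystem (the directory layout is illustrative):

```rust
use std::fs;
use std::path::Path;

// Dagger-free version of the skeleton step: create stub src/main.rs and
// src/lib.rs files for a crate, so a dependency-only `cargo build` has
// something to compile and link against.
fn create_skeleton_files(root: &Path, krate: &str) -> std::io::Result<()> {
    let src = root.join("crates").join(krate).join("src");
    fs::create_dir_all(&src)?;
    fs::write(src.join("main.rs"), "fn main() {}")?;
    fs::write(src.join("lib.rs"), "fn some() {}")?;
    Ok(())
}
```

The dagger version does the same thing, except the files land in an in-pipeline `Directory` instead of on disk.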
This is roughly equivalent to what we had above, but this time we can test
individual parts of the code, or even share them. For example, I could create
a rust library containing this functionality, which I could reuse across all
of my projects. This is a game-changer!

> Note that rust is a bit more verbose than the other sdks, especially in
> comparison to the dynamic ones, such as Python or Elixir. But to me this is
> a plus, because it allows us to work in the language we're most comfortable
> with, which in my case is `rust`.

You can look at the rest of the
[file](https://git.front.kjuulh.io/kjuulh/cuddle-please/src/branch/main/ci/src/main.rs),
but now if I actually build using `cargo run -p ci`, it will first do
everything while it builds its cache, and then afterwards, if I make a code
change in any of the files, only the binary will be recompiled and linked.

This is mainly because of these two imports of files (which are equivalent to
`COPY` in Dockerfiles):

```rust
// 1
let dep_src = client.host().directory_opts(
    args.source
        .clone()
        .unwrap_or(PathBuf::from("."))
        .display()
        .to_string(),
    dagger_sdk::HostDirectoryOptsBuilder::default()
        .include(vec!["**/Cargo.toml", "**/Cargo.lock"])
        .build()?,
);

// 2
let src = client.host().directory_opts(
    args.source
        .clone()
        .unwrap_or(PathBuf::from("."))
        .display()
        .to_string(),
    dagger_sdk::HostDirectoryOptsBuilder::default()
        .exclude(vec!["node_modules/", ".git/", "target/"])
        .build()?,
);
```

1. Loads in only the Cargo files; this allows us to only cache-bust if any of
   those files change.
2. Loads in everything except for some ignored paths; this is a mix of `COPY`
   and `.dockerignore`.

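The include/exclude options can be pictured as a plain path filter over the host directory. A rough dagger-free sketch, with deliberately simplified matching (suffixes for include, prefixes for exclude, rather than real globs):

```rust
// Rough stand-in for dagger's include/exclude directory options: keep a
// path if it ends with any include suffix (when includes are given) and
// does not start with any exclude prefix. Real dagger uses glob patterns;
// this is only to illustrate the filtering behaviour.
fn filter_paths<'a>(paths: &[&'a str], include: &[&str], exclude: &[&str]) -> Vec<&'a str> {
    paths
        .iter()
        .copied()
        .filter(|p| include.is_empty() || include.iter().any(|s| p.ends_with(s)))
        .filter(|p| !exclude.iter().any(|s| p.starts_with(s)))
        .collect()
}
```

With an include of the Cargo files, only manifests survive; with an exclude of `target/`, everything but the build output does.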
Now we simply load them at different times and execute builds in between:

```rust
// 1
let rust_build_image = client.container().from(
    args.rust_builder_image
        .as_ref()
        .unwrap_or(&"rustlang/rust:nightly".into()),
);

// 2
let target_cache = client.cache_volume("rust_target");

// 3
let rust_build_image = rust_build_image
    .with_workdir("/mnt/src")
    .with_directory("/mnt/src", dep_src.id().await?)
    .with_exec(vec!["cargo", "build"])
    .with_mounted_cache("/mnt/src/target/", target_cache.id().await?)
    .with_directory("/mnt/src/crates", src.directory("crates").id().await?);

// 4
let rust_exe_image = rust_build_image.with_exec(vec!["cargo", "build"]);

// 5
rust_exe_image.exit_code().await?;
```

|
1. Do a `FROM` equivalent, creating a base container.
|
|
2. Builds a cache volume, this is extremely useful, because you can setup a
|
|
shared cache pool for these volumes, so that you don't have to rely on
|
|
buildkit-layer caching. (what is normally used in Dockerfiles)
|
|
3. Here we build the image
|
|
1. First we set the workdir,
|
|
2. then load in the directory fetched from above, this includes, the Cargo
|
|
files as well as stub main and lib.rs files
|
|
3. Next we fire off a normal build with `with_exec` which function like a
|
|
`RUN`. here we build the stub, with refreshed registry, downloaded and
|
|
compiled dependencies.
|
|
4. We load in the rest of the source and replace `crates` with out own
|
|
crates, this loads in the proper `.rs` files.
|
|
4. We now build the actual binary
|
|
5. We trigger exit_code, to actually run the dag, everything previously had been
|
|
lazy, so if we didn't fire off the exit_code, or do another code action on
|
|
it, we wouldn't actually execute the step. Now dagger will figure out the
|
|
most optimal way of running our pipeline for maximum performance and
|
|
cacheability.
|
|
|
|
## This is very verbose

Rust is a bit more verbose than other languages, especially in comparison to
scripting languages. In the future, I would probably package this up and
publish it as a `crate` I can depend on myself. That would be super nice, and
would make it quite easy to share this across all of my projects.

That project, like in my previous
[post](https://blog.kasperhermansen.com/posts/cuddle/), could serve as a
singular component, which could be tested in isolation, and serve as a proper
api and tool in general. This is something very hard, if not impossible, with
regular `Dockerfiles` (without templating).

## Conclusion

I've shown a rough outline of what dagger is, why it is useful, and how you
can do stuff with it that isn't possible using `Dockerfile` proper. The code
examples show some contrived code that highlights how you can solve real
problems using this new paradigm of mixing code with orchestration; in this
case an unholy union of `rust` and `buildkit` through `dagger`.