How to Start New Projects with saev
saev (docs) is a combination of a library and a framework that I co-developed with Jake Beattie and others to do research with sparse autoencoders (SAEs).
It’s structured like big_vision, Google’s ViT codebase. To get the most use out of saev, you should not use it as a requirement in your project; rather, you should build inside of the source code of saev. This post is a guide to that process.
TL;DR:
- Fork saev.
- Clone your fork.
- Create a new directory in contrib/.
- Update both src/saev and your new contrib directory as necessary.
- (Hopefully) publish.
- If your changes to src/saev are broadly useful and not overly restrictive, open a PR with your changes to src/saev.
Now, I’ll cover this process in depth. I am currently applying SAEs to audio of birdsong, so I’ll use that project as the running example.
First, fork and clone saev. Do this however you want, but GitHub has a guide on it.
Second, you probably want to store code related to your project in
this repo. Make a new directory in contrib/. I’m calling my
new subproject “birdsong.”
[I] samuelstevens@host ~/p/saev (main)> tree -L 1 contrib/
contrib/
├── birdsong
├── interactive_interp
└── trait_discovery
Use uv to make a new package inside your new
project:
[I] samuelstevens@host ~/p/s/c/birdsong (main)> uv init --package .
Adding `birdsong` as member of workspace `~/projects/saev`
Initialized project `birdsong` at `~/projects/saev/contrib/birdsong`
Now you have some additional files.
[I] samuelstevens@ascend-login02 ~/p/s/c/birdsong (main)> tree
.
├── pyproject.toml
├── README.md
└── src
    └── birdsong
        └── __init__.py
Now I can write scripts and source code for birdsong-specific stuff
in here. I’ll probably add a notebook under a new
birdsong/notebooks directory for looking at instances of
birdsong before and after using SAEs to identify patterns, and I’ll add
birdsong/logbook.md to track ongoing TODO items, and so
on.
To train SAEs on audio files, I’ll need to add a new dataset type to
save activations. In order to do this, I’ll edit
src/saev/data/datasets.py.
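As a rough illustration of what a new dataset type could look like, here is a minimal sketch using only the standard library. Everything in it is an assumption made for illustration: the AudioFolder name, its fields (root, sample_rate, clip_seconds), the default values, and the list_clips helper are all hypothetical and are not taken from saev. The real dataset types in src/saev/data/datasets.py may be structured quite differently.

```python
import dataclasses
import pathlib


@dataclasses.dataclass(frozen=True)
class AudioFolder:
    """Hypothetical config for a folder of audio clips; not part of saev."""

    root: str
    sample_rate: int = 32_000  # assumed target sample rate in Hz
    clip_seconds: float = 5.0  # assumed fixed clip length in seconds

    @property
    def n_samples(self) -> int:
        # Number of waveform samples per clip at the target sample rate.
        return int(self.sample_rate * self.clip_seconds)


def list_clips(
    cfg: AudioFolder, exts: tuple[str, ...] = (".wav", ".flac")
) -> list[pathlib.Path]:
    """Return every audio file under cfg.root, sorted so indices are stable."""
    root = pathlib.Path(cfg.root)
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() in exts
    )
```

Keeping the config frozen and the file listing sorted means the same config always maps the same index to the same clip, which matters if saved activations need to be traced back to the audio that produced them.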
I’ll also need to add another model, one that expects
audio files. Since I don’t think that DINOv3, OpenCLIP, or the other
existing model families will be suitable, I’ll need to add a new model
family. Again, this will need to go somewhere in
src/saev/data.
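One way to picture adding a model family is a small interface plus a name-to-constructor registry. The sketch below is only that, a sketch: AudioEncoder, register, load, the "toy-audio" family name, and the hop parameter are all hypothetical names invented here, and saev's actual mechanism for loading model families may look nothing like this.

```python
import typing


class AudioEncoder(typing.Protocol):
    """Hypothetical interface for an audio model family; not saev's actual API."""

    def n_tokens(self, n_samples: int) -> int:
        """How many token positions the model emits for a clip of n_samples."""
        ...


# Hypothetical registry: family name -> constructor taking a checkpoint name.
_FAMILIES: dict[str, typing.Callable[[str], AudioEncoder]] = {}


def register(family: str):
    """Class decorator that records a constructor under a family name."""

    def decorator(cls):
        if family in _FAMILIES:
            raise ValueError(f"family '{family}' is already registered")
        _FAMILIES[family] = cls
        return cls

    return decorator


def load(family: str, ckpt: str) -> AudioEncoder:
    """Instantiate a registered family, failing loudly on unknown names."""
    if family not in _FAMILIES:
        raise KeyError(f"unknown family '{family}'; known: {sorted(_FAMILIES)}")
    return _FAMILIES[family](ckpt)


@register("toy-audio")
class ToyAudioEncoder:
    """Stand-in model: one token per `hop` samples, no real weights."""

    def __init__(self, ckpt: str, hop: int = 320):
        self.ckpt = ckpt
        self.hop = hop

    def n_tokens(self, n_samples: int) -> int:
        return n_samples // self.hop
```

With this shape, adding a family touches only the new class and its decorator, so existing families are left alone; that is the kind of non-destructive change that is easy to send upstream later.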
If I’m smart about it, these changes will be nice and
non-destructive, and other users of saev can benefit from them. After I
publish some results, to share this code with others, I’ll open a PR
from my fork/branch to main with the new datasets/models. But I won’t
open a PR with birdsong because that’s specific to me,
rather than to the library.[1]
[1] Technically, birdsong will be in saev because, as the main developer, I’m a sort of privileged user. But other folks probably want their project-specific code attached to their own GitHub page, rather than OSU-NLP’s.
Sam Stevens, 2024