Module benchmark

Entrypoint for running all tasks in biobench.

Most of this script is self-documenting. Run python benchmark.py --help to see all the options.

Note that you will have to download all the datasets; each dataset includes its own download script with instructions. See biobench.newt.download for an example.

Examples

Run Everything

Suppose you want to get started by running all the tasks with all the default models on a local GPU (device 4, for example). You need to pass each task's --TASK-run flag and its --TASK-args.datadir option so that each task knows where to load its data from.

CUDA_VISIBLE_DEVICES=4 python benchmark.py \
  --kabr-run --kabr-args.datadir /local/scratch/stevens.994/datasets/kabr \
  --iwildcam-run --iwildcam-args.datadir /local/scratch/stevens.994/datasets/iwildcam \
  --plantnet-run --plantnet-args.datadir /local/scratch/stevens.994/datasets/plantnet \
  --birds525-run --birds525-args.datadir /local/scratch/stevens.994/datasets/birds525 \
  --newt-run --newt-args.datadir /local/scratch/stevens.994/datasets/newt \
  --beluga-run --beluga-args.datadir /local/scratch/stevens.994/datasets/beluga \
  --ages-run --ages-args.datadir /local/scratch/stevens.994/datasets/newt \
  --fishnet-run --fishnet-args.datadir /local/scratch/stevens.994/datasets/fishnet

More generally, you can configure options for individual tasks using --TASK-args.<OPTION>, all of which are documented in python benchmark.py --help.
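The --TASK-args.<OPTION> convention reflects nested dataclasses: each task contributes an Args dataclass, and its fields become dotted CLI flags. Below is a stdlib-only sketch of that mapping; the NewtArgs fields and the apply_flag helper are illustrative stand-ins, not biobench's actual code.

```python
import dataclasses

# Hypothetical per-task args; biobench's real fields may differ.
@dataclasses.dataclass(frozen=True)
class NewtArgs:
    datadir: str = ""
    batch_size: int = 256

@dataclasses.dataclass(frozen=True)
class Args:
    newt_run: bool = False
    newt_args: NewtArgs = dataclasses.field(default_factory=NewtArgs)

def apply_flag(args: Args, flag: str, value: str) -> Args:
    """Set a nested field named by a --TASK-args.<OPTION> style flag."""
    # "--newt-args.datadir" -> outer field "newt_args", inner field "datadir"
    outer, _, inner = flag.lstrip("-").replace("-", "_").partition(".")
    nested = getattr(args, outer)
    field_type = type(getattr(nested, inner))  # cast the string to the field's type
    nested = dataclasses.replace(nested, **{inner: field_type(value)})
    return dataclasses.replace(args, **{outer: nested})

args = apply_flag(Args(), "--newt-args.datadir", "/data/newt")
print(args.newt_args.datadir)
```

In biobench itself this parsing is handled by the CLI library; the sketch only shows how a dotted flag resolves to a nested field.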

Just One Task

Suppose you just want to run one task (NeWT).

CUDA_VISIBLE_DEVICES=4 python benchmark.py \
  --newt-run --newt-args.datadir /local/scratch/stevens.994/datasets/newt

Just One Model

Suppose you only want to run the SigLIP SO400M ViT from Open CLIP, but you want to run it on all tasks. Since that model is a checkpoint in Open CLIP, we can use the OpenClip class to load the checkpoint.

CUDA_VISIBLE_DEVICES=4 python benchmark.py \
  --kabr-run --kabr-args.datadir /local/scratch/stevens.994/datasets/kabr \
  --iwildcam-run --iwildcam-args.datadir /local/scratch/stevens.994/datasets/iwildcam \
  --plantnet-run --plantnet-args.datadir /local/scratch/stevens.994/datasets/plantnet \
  --birds525-run --birds525-args.datadir /local/scratch/stevens.994/datasets/birds525 \
  --newt-run --newt-args.datadir /local/scratch/stevens.994/datasets/newt \
  --beluga-run --beluga-args.datadir /local/scratch/stevens.994/datasets/beluga \
  --ages-run --ages-args.datadir /local/scratch/stevens.994/datasets/newt \
  --fishnet-run --fishnet-args.datadir /local/scratch/stevens.994/datasets/fishnet \
  --model open-clip ViT-SO400M-14-SigLIP/webli  # <- This is the new line!

Use Slurm

A Slurm cluster with many GPUs can run many tasks in parallel; with biobench, this takes only one extra flag.

python benchmark.py \
  --kabr-run --kabr-args.datadir /local/scratch/stevens.994/datasets/kabr \
  --iwildcam-run --iwildcam-args.datadir /local/scratch/stevens.994/datasets/iwildcam \
  --plantnet-run --plantnet-args.datadir /local/scratch/stevens.994/datasets/plantnet \
  --birds525-run --birds525-args.datadir /local/scratch/stevens.994/datasets/birds525 \
  --newt-run --newt-args.datadir /local/scratch/stevens.994/datasets/newt \
  --beluga-run --beluga-args.datadir /local/scratch/stevens.994/datasets/beluga \
  --ages-run --ages-args.datadir /local/scratch/stevens.994/datasets/newt \
  --fishnet-run --fishnet-args.datadir /local/scratch/stevens.994/datasets/fishnet \
  --slurm  # <- Just add --slurm to use slurm!

Note that you no longer need to set CUDA_VISIBLE_DEVICES, because jobs run on the cluster rather than on the local machine.

Design

biobench is designed to make it easy to add new models and new tasks that work with the existing tasks and models.

To add a new model, look at biobench.registry's documentation, which includes a tutorial for adding a new model.

Functions

def export_to_csv(args: Args) ‑> set[str]

Exports (and writes) results in a wide table format for viewing. Long table formats are better for further manipulation and graphing, but wide tables are easier to read.
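To illustrate the long-vs-wide distinction, here is a stdlib-only sketch of the pivot such an export performs. The column names (model, task, mean_score) are assumptions for illustration, not necessarily biobench's actual schema.

```python
import csv
import io

# Long format: one row per (model, task) result.
long_rows = [
    {"model": "bioclip", "task": "newt", "mean_score": 0.81},
    {"model": "bioclip", "task": "kabr", "mean_score": 0.64},
    {"model": "siglip", "task": "newt", "mean_score": 0.77},
]

# Pivot to wide format: one row per model, one column per task.
tasks = sorted({r["task"] for r in long_rows})
wide: dict[str, dict[str, float]] = {}
for r in long_rows:
    wide.setdefault(r["model"], {})[r["task"]] = r["mean_score"]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["model", *tasks])
for model, scores in sorted(wide.items()):
    # Missing (model, task) pairs become empty cells.
    writer.writerow([model, *(scores.get(t, "") for t in tasks)])
print(buf.getvalue())
```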

def main(args: Args)

Launch all jobs, using either a local GPU or a Slurm cluster. Then report results and save to disk.

def plot_task(conn: sqlite3.Connection, task: str)

Plots the most recent result for each model on a given task, including confidence intervals. Returns the figure so the caller can save or display it.

Args

conn
connection to database.
task
which task to plot.

Returns

matplotlib.pyplot.Figure
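Picking the most recent result per model is a common SQLite pattern: a bare column selected alongside a single MAX() aggregate comes from the maximizing row. The table and column names below are assumptions for illustration, not biobench's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (model TEXT, task TEXT, score REAL, posix INT)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?)",
    [
        ("bioclip", "newt", 0.79, 100),
        ("bioclip", "newt", 0.81, 200),  # newer run supersedes the first
        ("siglip", "newt", 0.77, 150),
    ],
)

# MAX(posix) per model selects each model's latest row for the task.
rows = conn.execute(
    """
    SELECT model, score, MAX(posix)
    FROM results
    WHERE task = ?
    GROUP BY model
    ORDER BY model
    """,
    ("newt",),
).fetchall()
print(rows)
```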

def save(args: Args, model_args: ModelArgsCvml | ModelArgsVlm, report: TaskReport) ‑> None

Saves the report to disk in a machine-readable SQLite format.

Args

args
launch script arguments.
model_args
a pair of model_org, model_ckpt strings.
report
the task report from the model_args.
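A hedged sketch of what saving a report could look like: one row per (model, task) run, with the full report serialized as JSON so it stays machine-readable. The table schema shown is an assumption, not biobench's actual one.

```python
import json
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS reports "
    "(model_org TEXT, model_ckpt TEXT, task TEXT, mean_score REAL, posix INT, report TEXT)"
)

# A toy report dict standing in for a TaskReport.
report = {"task": "newt", "mean_score": 0.81, "splits": {}}
conn.execute(
    "INSERT INTO reports VALUES (?, ?, ?, ?, ?, ?)",
    (
        "open-clip",
        "hf-hub:imageomics/bioclip",
        report["task"],
        report["mean_score"],
        int(time.time()),  # timestamp lets later queries pick the latest run
        json.dumps(report),
    ),
)
conn.commit()
print(conn.execute("SELECT model_ckpt, mean_score FROM reports").fetchone())
```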

Classes

class Args (slurm: bool = False, slurm_acct: str = 'PAS2136', models_cvml: Annotated[list[ModelArgsCvml], _ArgConfiguration(name='models', metavar=None, help=None, aliases=None, prefix_name=None, constructor_factory=None)] = <factory>, models_vlm: Annotated[list[ModelArgsVlm], _ArgConfiguration(name='vlms', metavar=None, help=None, aliases=None, prefix_name=None, constructor_factory=None)] = <factory>, device: Literal['cpu', 'cuda'] = 'cuda', debug: bool = False, ssl: bool = True, ages_run_cvml: bool = False, ages_run_vlm: bool = False, ages_args: Args = <factory>, beluga_run_cvml: bool = False, beluga_run_vlm: bool = False, beluga_args: Args = <factory>, birds525_run_cvml: bool = False, birds525_run_vlm: bool = False, birds525_args: Args = <factory>, fishnet_run_cvml: bool = False, fishnet_run_vlm: bool = False, fishnet_args: Args = <factory>, imagenet_run_cvml: bool = False, imagenet_run_vlm: bool = False, imagenet_args: Args = <factory>, inat21_run_cvml: bool = False, inat21_run_vlm: bool = False, inat21_args: Args = <factory>, iwildcam_run_cvml: bool = False, iwildcam_run_vlm: bool = False, iwildcam_args: Args = <factory>, kabr_run_cvml: bool = False, kabr_run_vlm: bool = False, kabr_args: Args = <factory>, leopard_run_cvml: bool = False, leopard_run_vlm: bool = False, leopard_args: Args = <factory>, newt_run_cvml: bool = False, newt_run_vlm: bool = False, newt_args: Args = <factory>, plankton_run_cvml: bool = False, plankton_run_vlm: bool = False, plankton_args: Args = <factory>, plantnet_run_cvml: bool = False, plantnet_run_vlm: bool = False, plantnet_args: Args = <factory>, rarespecies_run_cvml: bool = False, rarespecies_run_vlm: bool = False, rarespecies_args: Args = <factory>, report_to: str = './reports', graph: bool = True, graph_to: str = './graphs', log_to: str = './logs')

Params to run one or more benchmarks in a parallel setting.

@beartype.beartype
@dataclasses.dataclass(frozen=True)
class Args:
    """Params to run one or more benchmarks in a parallel setting."""

    slurm: bool = False
    """whether to use submitit to run jobs on a slurm cluster."""
    slurm_acct: str = "PAS2136"
    """slurm account string."""

    models_cvml: typing.Annotated[
        list[interfaces.ModelArgsCvml], tyro.conf.arg(name="models")
    ] = dataclasses.field(
        default_factory=lambda: [
            interfaces.ModelArgsCvml("open-clip", "RN50/openai"),
            interfaces.ModelArgsCvml("open-clip", "ViT-B-16/openai"),
            interfaces.ModelArgsCvml("open-clip", "ViT-B-16/laion400m_e32"),
            interfaces.ModelArgsCvml("open-clip", "hf-hub:imageomics/bioclip"),
            interfaces.ModelArgsCvml("open-clip", "ViT-B-16-SigLIP/webli"),
            interfaces.ModelArgsCvml(
                "timm-vit", "vit_base_patch14_reg4_dinov2.lvd142m"
            ),
        ]
    )
    """CV models; a pair of model org (interface) and checkpoint."""
    models_vlm: typing.Annotated[
        list[interfaces.ModelArgsVlm], tyro.conf.arg(name="vlms")
    ] = dataclasses.field(
        default_factory=lambda: [
            # interfaces.ModelArgsVlm("openrouter/google/gemini-2.0-flash-001"),
            interfaces.ModelArgsVlm("openrouter/google/gemini-flash-1.5-8b"),
        ]
    )
    """VLM checkpoints."""
    device: typing.Literal["cpu", "cuda"] = "cuda"
    """which kind of accelerator to use."""
    debug: bool = False
    """whether to run in debug mode."""
    ssl: bool = True
    """Use SSL when connecting to remote servers to download checkpoints; use --no-ssl if your machine has certificate issues. See `biobench.third_party_models.get_ssl()` for a discussion of how this works."""

    # Individual benchmarks.
    ages_run_cvml: bool = False
    """Whether to run the bird age benchmark with CV+ML."""
    ages_run_vlm: bool = False
    """Whether to run the bird age benchmark with VLM."""
    ages_args: ages.Args = dataclasses.field(default_factory=ages.Args)
    """Arguments for the bird age benchmark."""

    beluga_run_cvml: bool = False
    """Whether to run the Beluga whale re-ID benchmark with CV+ML."""
    beluga_run_vlm: bool = False
    """Whether to run the Beluga whale re-ID benchmark with VLM."""
    beluga_args: beluga.Args = dataclasses.field(default_factory=beluga.Args)
    """Arguments for the Beluga whale re-ID benchmark."""

    birds525_run_cvml: bool = False
    """Whether to run the Birds 525 benchmark with CV+ML."""
    birds525_run_vlm: bool = False
    """Whether to run the Birds 525 benchmark with VLM."""
    birds525_args: birds525.Args = dataclasses.field(default_factory=birds525.Args)
    """Arguments for the Birds 525 benchmark."""

    fishnet_run_cvml: bool = False
    """Whether to run the FishNet benchmark with CV+ML."""
    fishnet_run_vlm: bool = False
    """Whether to run the FishNet benchmark with VLM."""
    fishnet_args: fishnet.Args = dataclasses.field(default_factory=fishnet.Args)
    """Arguments for the FishNet benchmark."""

    imagenet_run_cvml: bool = False
    """Whether to run the ImageNet-1K benchmark with CV+ML."""
    imagenet_run_vlm: bool = False
    """Whether to run the ImageNet-1K benchmark with VLM."""
    imagenet_args: imagenet.Args = dataclasses.field(default_factory=imagenet.Args)
    """Arguments for the ImageNet-1K benchmark."""

    inat21_run_cvml: bool = False
    """Whether to run the iNat21 benchmark with CV+ML."""
    inat21_run_vlm: bool = False
    """Whether to run the iNat21 benchmark with VLM."""
    inat21_args: inat21.Args = dataclasses.field(default_factory=inat21.Args)
    """Arguments for the iNat21 benchmark."""

    iwildcam_run_cvml: bool = False
    """Whether to run the iWildCam benchmark with CV+ML."""
    iwildcam_run_vlm: bool = False
    """Whether to run the iWildCam benchmark with VLM."""
    iwildcam_args: iwildcam.Args = dataclasses.field(default_factory=iwildcam.Args)
    """Arguments for the iWildCam benchmark."""

    kabr_run_cvml: bool = False
    """Whether to run the KABR benchmark with CV+ML."""
    kabr_run_vlm: bool = False
    """Whether to run the KABR benchmark with VLM."""
    kabr_args: kabr.Args = dataclasses.field(default_factory=kabr.Args)
    """Arguments for the KABR benchmark."""

    leopard_run_cvml: bool = False
    """Whether to run the leopard re-ID benchmark with CV+ML."""
    leopard_run_vlm: bool = False
    """Whether to run the leopard re-ID benchmark with VLM."""
    leopard_args: leopard.Args = dataclasses.field(default_factory=leopard.Args)
    """Arguments for the leopard re-ID benchmark."""

    newt_run_cvml: bool = False
    """Whether to run the NeWT benchmark with CV+ML."""
    newt_run_vlm: bool = False
    """Whether to run the NeWT benchmark with VLM."""
    newt_args: newt.Args = dataclasses.field(default_factory=newt.Args)
    """Arguments for the NeWT benchmark."""

    plankton_run_cvml: bool = False
    """Whether to run the Plankton benchmark with CV+ML."""
    plankton_run_vlm: bool = False
    """Whether to run the Plankton benchmark with VLM."""
    plankton_args: plankton.Args = dataclasses.field(default_factory=plankton.Args)
    """Arguments for the Plankton benchmark."""

    plantnet_run_cvml: bool = False
    """Whether to run the Pl@ntNet benchmark with CV+ML."""
    plantnet_run_vlm: bool = False
    """Whether to run the Pl@ntNet benchmark with VLM."""
    plantnet_args: plantnet.Args = dataclasses.field(default_factory=plantnet.Args)
    """Arguments for the Pl@ntNet benchmark."""

    rarespecies_run_cvml: bool = False
    """Whether to run the Rare Species benchmark with CV+ML."""
    rarespecies_run_vlm: bool = False
    """Whether to run the Rare Species benchmark with VLM."""
    rarespecies_args: rarespecies.Args = dataclasses.field(
        default_factory=rarespecies.Args
    )
    """Arguments for the Rare Species benchmark."""

    # Reporting and graphing.
    report_to: str = os.path.join(".", "reports")
    """where to save reports to."""
    graph: bool = True
    """whether to make graphs."""
    graph_to: str = os.path.join(".", "graphs")
    """where to save graphs to."""
    log_to: str = os.path.join(".", "logs")
    """where to save logs to."""

    def to_dict(self) -> dict[str, object]:
        return dataclasses.asdict(self)

    def get_sqlite_connection(self) -> sqlite3.Connection:
        """Get a connection to the reports database.
        Returns:
            a connection to a sqlite3 database.
        """
        return sqlite3.connect(os.path.join(self.report_to, "reports.sqlite"))

Class variables

var ages_args : Args

Arguments for the bird age benchmark.

var ages_run_cvml : bool

Whether to run the bird age benchmark with CV+ML.

var ages_run_vlm : bool

Whether to run the bird age benchmark with VLM.

var beluga_args : Args

Arguments for the Beluga whale re-ID benchmark.

var beluga_run_cvml : bool

Whether to run the Beluga whale re-ID benchmark with CV+ML.

var beluga_run_vlm : bool

Whether to run the Beluga whale re-ID benchmark with VLM.

var birds525_args : Args

Arguments for the Birds 525 benchmark.

var birds525_run_cvml : bool

Whether to run the Birds 525 benchmark with CV+ML.

var birds525_run_vlm : bool

Whether to run the Birds 525 benchmark with VLM.

var debug : bool

whether to run in debug mode.

var device : Literal['cpu', 'cuda']

which kind of accelerator to use.

var fishnet_args : Args

Arguments for the FishNet benchmark.

var fishnet_run_cvml : bool

Whether to run the FishNet benchmark with CV+ML.

var fishnet_run_vlm : bool

Whether to run the FishNet benchmark with VLM.

var graph : bool

whether to make graphs.

var graph_to : str

where to save graphs to.

var imagenet_args : Args

Arguments for the ImageNet-1K benchmark.

var imagenet_run_cvml : bool

Whether to run the ImageNet-1K benchmark with CV+ML.

var imagenet_run_vlm : bool

Whether to run the ImageNet-1K benchmark with VLM.

var inat21_args : Args

Arguments for the iNat21 benchmark.

var inat21_run_cvml : bool

Whether to run the iNat21 benchmark with CV+ML.

var inat21_run_vlm : bool

Whether to run the iNat21 benchmark with VLM.

var iwildcam_args : Args

Arguments for the iWildCam benchmark.

var iwildcam_run_cvml : bool

Whether to run the iWildCam benchmark with CV+ML.

var iwildcam_run_vlm : bool

Whether to run the iWildCam benchmark with VLM.

var kabr_args : Args

Arguments for the KABR benchmark.

var kabr_run_cvml : bool

Whether to run the KABR benchmark with CV+ML.

var kabr_run_vlm : bool

Whether to run the KABR benchmark with VLM.

var leopard_args : Args

Arguments for the leopard re-ID benchmark.

var leopard_run_cvml : bool

Whether to run the leopard re-ID benchmark with CV+ML.

var leopard_run_vlm : bool

Whether to run the leopard re-ID benchmark with VLM.

var log_to : str

where to save logs to.

var models_cvml : list[ModelArgsCvml]

CV models; a pair of model org (interface) and checkpoint.

var models_vlm : list[ModelArgsVlm]

VLM checkpoints.

var newt_args : Args

Arguments for the NeWT benchmark.

var newt_run_cvml : bool

Whether to run the NeWT benchmark with CV+ML.

var newt_run_vlm : bool

Whether to run the NeWT benchmark with VLM.

var plankton_args : Args

Arguments for the Plankton benchmark.

var plankton_run_cvml : bool

Whether to run the Plankton benchmark with CV+ML.

var plankton_run_vlm : bool

Whether to run the Plankton benchmark with VLM.

var plantnet_args : Args

Arguments for the Pl@ntNet benchmark.

var plantnet_run_cvml : bool

Whether to run the Pl@ntNet benchmark with CV+ML.

var plantnet_run_vlm : bool

Whether to run the Pl@ntNet benchmark with VLM.

var rarespecies_args : Args

Arguments for the Rare Species benchmark.

var rarespecies_run_cvml : bool

Whether to run the Rare Species benchmark with CV+ML.

var rarespecies_run_vlm : bool

Whether to run the Rare Species benchmark with VLM.

var report_to : str

where to save reports to.

var slurm : bool

whether to use submitit to run jobs on a slurm cluster.

var slurm_acct : str

slurm account string.

var ssl : bool

Use SSL when connecting to remote servers to download checkpoints; use --no-ssl if your machine has certificate issues. See get_ssl() for a discussion of how this works.

Methods

def get_sqlite_connection(self) ‑> sqlite3.Connection

Get a connection to the reports database.

Returns

a connection to a sqlite3 database.

def to_dict(self) ‑> dict[str, object]

Convert the arguments to a plain dictionary using dataclasses.asdict.