BioBench Leaderboard

At a glance

9 tasks · 4 kingdoms · 6 imaging modalities
3.1 M images · 337 GB total
Evaluation: frozen backbone → simple scikit-learn-style classifier → macro-F1
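
As a rough illustration of the final step in that pipeline, macro-F1 is the unweighted mean of per-class F1 scores, so a rare class counts as much as a common one. The sketch below is plain Python written for this page, not BioBench's own code; the function name macro_f1 and the toy labels are assumptions made for the example.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    f1_scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1_scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy, made-up labels: 8 "deer" and 2 "fox"; the classifier always says "deer".
y_true = ["deer"] * 8 + ["fox"] * 2
y_pred = ["deer"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro_f1={score:.3f}")
```

On this toy input accuracy is 0.800 while macro-F1 is about 0.444, because the missed "fox" class contributes an F1 of zero. That sensitivity to rare classes is presumably why a macro-averaged metric suits long-tailed ecological label sets.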

Leaderboard

Checkpoint	Params (M)	ImageNet‑1K	NeWT	Mean ▼	Beluga	FishNet	FungiCLEF	Herbarium19	iWildCam	KABR	MammalNet	Plankton	Pl@ntNet
DINOv2 ViT‑g/14	304	79.8	82.8	32.5	4.5	75.2	34.2	29.9	22.9	25.3	57.9	3.4	39.7
SigLIP SO400M/14	428	82.2	86.0	32.3	4.0	69.0	38.6	22.6	17.5	33.0	66.5	2.8	36.2
SigLIP ViT‑L/16	316	81.0	85.4	30.7	4.1	67.6	38.0	18.1	15.9	30.7	65.5	3.0	33.5
SigLIP SO400M/14	428	81.1	84.7	30.5	3.9	68.3	37.4	16.9	14.7	33.0	63.6	2.7	33.7
DINOv2 ViT‑L/14	304	78.6	83.0	29.7	3.1	72.3	34.4	21.7	20.2	23.3	53.7	2.8	36.2
SigLIP ViT‑L/16	316	79.2	83.5	28.8	4.0	67.4	36.3	12.8	13.8	29.4	62.3	2.8	30.5
AIMv2 ViT‑1B/14	1237	81.8	84.2	28.8	2.1	60.0	34.3	13.9	12.9	36.3	61.1	2.2	36.1
AIMv2 ViT‑3B/14	2722	-	83.5	28.7	1.9	59.7	32.0	16.3	14.5	28.5	65.3	2.3	38.1
AIMv2 ViT‑3B/14	2721	83.0	82.6	28.7	1.5	61.6	33.8	15.0	13.4	28.8	65.2	2.5	36.6
AIMv2 ViT‑1B/14	1236	81.8	84.0	28.5	2.2	60.9	34.4	12.6	12.6	35.9	60.1	2.1	35.6

Why this exists

Web-photo benchmarks reward features that don’t transfer to camera-trap images, drone RGB, micrographs, or specimen photos. Above 75% ImageNet top-1 accuracy, ImageNet scores stop being a reliable guide to how models rank on ecology tasks. BioBench replaces that proxy metric with direct measurement.

How to reproduce

git clone https://github.com/samuelstevens/biobench.git
cd biobench
uv run benchmark.py --cfgs configs/all-models.toml
uv run report.py

A ViT-L checkpoint runs in about 1 hour on a single A6000. Results are saved to a SQLite database, which report.py converts to statistically validated JSON for easy analysis.
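
For readers who want to inspect results directly, the following is a minimal sketch of the general SQLite-to-JSON pattern using only Python's standard library. It is not report.py, and it performs no statistical validation. The database schema is not documented on this page, so the table name results and the columns model, task, and score are hypothetical; the three scores are copied from the DINOv2 ViT-g/14 row of the leaderboard above.

```python
import json
import sqlite3

# Hypothetical schema: one row per (model, task) score.
# The real BioBench database layout may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (model TEXT, task TEXT, score REAL)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        ("DINOv2 ViT-g/14", "FishNet", 75.2),
        ("DINOv2 ViT-g/14", "KABR", 25.3),
        ("DINOv2 ViT-g/14", "MammalNet", 57.9),
    ],
)


def export_scores(connection):
    """Return {model: {task: score}} built from the hypothetical results table."""
    scores = {}
    query = "SELECT model, task, score FROM results ORDER BY model, task"
    for model, task, score in connection.execute(query):
        scores.setdefault(model, {})[task] = score
    return scores


report = export_scores(conn)
report_json = json.dumps(report, indent=2, sort_keys=True)
print(report_json)
```

To try the same pattern on a real run, replace ":memory:" with the path to the generated database and adjust the query to the actual schema.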

How to cite

@software{stevens2025biobench,
  author = {Stevens, Samuel and Gu, Jianyang},
  license = {MIT},
  title = {{BioBench}},
  url = {https://github.com/samuelstevens/biobench/}
}