Behold, My Stuff


Design for Your Compute

I would prefer to live in a world where Nvidia sends me, a grad student, a single GPU with 1TB of VRAM and infinite memory bandwidth. Then I could just write PyTorch code like they do in the tutorials1 and it would all work out. Unfortunately, that’s not the world I live in. My lab has some 8xA6000 servers with a shared NFS drive (slow), and each server has its own NVMe drive (fast). We also run experiments on OSC, an HPC cluster that uses Slurm and has shared NFS drives plus fast node-local drives. At Meta, we had a ton of V100s (kind of old) available on a Slurm cluster with great disk read speeds.

Ideally, your experimental codebase is compute-agnostic: your experiments would hit the same MFU on any system without any code changes. I tried to achieve this for several years of my PhD.
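
For reference, MFU (model FLOPs utilization) is just the fraction of the hardware’s peak FLOPs that your training run actually achieves. A minimal back-of-envelope sketch, using the common ~6N FLOPs-per-token rule of thumb for dense decoder-only transformers; the parameter count, throughput, and peak-FLOPs figures below are made-up placeholders:

    def mfu(params: float, tokens_per_sec: float, peak_flops_per_sec: float) -> float:
        """Model FLOPs utilization, using the ~6 * params FLOPs/token
        rule of thumb for a dense decoder-only transformer (fwd + bwd)."""
        achieved_flops_per_sec = 6 * params * tokens_per_sec
        return achieved_flops_per_sec / peak_flops_per_sec

    # Hypothetical numbers: a 1B-param model at 20k tokens/sec on a GPU with
    # 312 TFLOP/s of peak bf16 compute (roughly an A100 spec-sheet number).
    print(f"{mfu(1e9, 20_000, 312e12):.1%}")  # ~38.5%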

I have abandoned this goal: I now design experiments for specific compute. It’s simply too challenging to get good MFU on all of these different systems. I realized I didn’t have to feel bad about this after seeing other, much better engineers run into the same problem.

Here are a couple examples.

Tokasaurus

Tokasaurus is a high-throughput LLM serving engine. It’s explicitly designed for systems without NVLink:

One of our original goals with Tokasaurus was to efficiently run multi-GPU inference on our lab’s L40S GPUs, which don’t have fast inter-GPU NVLink connections. Without NVLink, the communication costs incurred running TP across a node of eight GPUs are substantial. Therefore, efficient support for PP (which requires much less inter-GPU communication) was a high priority. (emphasis mine)

This serving engine is not really useful for huge models like DeepSeek-V3, or for massive datacenters that prioritize high-bandwidth interconnects.
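
To make the TP-vs-PP trade-off concrete, here is a rough communication estimate for decoding one token. This is a deliberately simplified model, not Tokasaurus’s actual accounting: Megatron-style tensor parallelism all-reduces the hidden activations twice per transformer block, while pipeline parallelism only sends activations across stage boundaries. The layer count and hidden size are illustrative.

    def tp_bytes_per_token(hidden: int, layers: int, gpus: int, dtype_bytes: int = 2) -> float:
        """Per-GPU traffic for Megatron-style TP: two all-reduces of the hidden
        activations per block; a ring all-reduce moves ~2*(G-1)/G of the tensor."""
        return layers * 2 * (2 * (gpus - 1) / gpus) * hidden * dtype_bytes

    def pp_bytes_per_token(hidden: int, stages: int, dtype_bytes: int = 2) -> float:
        """PP only sends the hidden activations across each stage boundary."""
        return (stages - 1) * hidden * dtype_bytes

    # Illustrative 70B-ish shapes on a node of eight GPUs.
    print(tp_bytes_per_token(hidden=8192, layers=80, gpus=8))  # ~4.6 MB per token
    print(pp_bytes_per_token(hidden=8192, stages=8))           # ~115 KB per token

The two estimates are orders of magnitude apart, which is why PP looks so attractive when your GPUs talk over PCIe instead of NVLink.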

DeepSeek

https://www.dwarkesh.com/p/sholto-trenton-2

On Dwarkesh Patel’s second podcast with Sholto Douglas and Trenton Bricken, Sholto explains why he appreciates DeepSeek’s algorithmic innovations:

This is manifested in the way that the models give this sense of being perfectly designed up to their constraints. You can really very clearly see what constraints they’re thinking about as they’re iteratively solving these problems. Let’s take the base Transformer and diff that to DeepSeek v2 and v3. You can see them running up against the memory bandwidth bottleneck in attention.

Initially they do MLA to do this, they trade flops for memory bandwidth basically. Then they do this thing called NSA, where they more selectively load memory bandwidth. You can see this is because the model that they trained with MLA was on H800s, so it has a lot of flops. So they were like, “Okay, we can freely use the flops.” But then the export controls from Biden came in, or they knew they would have less of those chips going forward, and so they traded off to a more memory bandwidth-oriented algorithmic solution there.

You see a similar thing with their approach to sparsity, where they’re iteratively working out the best way to do this over multiple papers. (again, emphasis mine)

DeepSeek, the company, makes algorithmic trade-offs based on the compute available for each model.
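
The “trade flops for memory bandwidth” point is easiest to see in the KV cache: MLA stores one small compressed latent per token instead of full per-head keys and values, and pays extra compute to re-expand it at attention time. A rough, hedged comparison; the head counts and dimensions are illustrative, only loosely based on DeepSeek-V2’s reported configuration:

    def mha_kv_bytes_per_token(n_heads: int, head_dim: int, layers: int, dtype_bytes: int = 2) -> int:
        """Standard multi-head attention caches full keys and values for every head."""
        return 2 * n_heads * head_dim * layers * dtype_bytes

    def mla_kv_bytes_per_token(latent_dim: int, rope_dim: int, layers: int, dtype_bytes: int = 2) -> int:
        """MLA caches one compressed latent (plus a small decoupled RoPE key) per
        token and spends extra flops re-projecting it into keys and values."""
        return (latent_dim + rope_dim) * layers * dtype_bytes

    print(mha_kv_bytes_per_token(n_heads=128, head_dim=128, layers=60))    # ~3.9 MB per token
    print(mla_kv_bytes_per_token(latent_dim=512, rope_dim=64, layers=60))  # ~69 KB per token

The bandwidth saved on every decoded token is paid for with extra up-projection flops, which is a good deal on flops-rich H800s.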

My Work

SAEs for Vision

I train sparse autoencoders for vision models (project website, code, preprint), and the very first design decision was to save activations to disk before training the SAE rather than computing them on the fly. This is inherently more complex, but it made sense for two reasons: I expected to spend far more time training SAEs than saving activations, and my lab servers had sufficient storage with good random-access read speed, so caching activations was a natural fit.
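
A minimal sketch of that cache-then-train pattern; the file paths, helper names, and interfaces here are mine for illustration, not the actual codebase:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Stage 1: run the vision model once and dump activations to the fast local NVMe drive.
    @torch.no_grad()
    def cache_activations(vit, img_loader, path="/local/nvme/acts.pt"):
        acts = [vit(imgs.cuda()).cpu() for imgs, _ in img_loader]
        torch.save(torch.cat(acts), path)

    # Stage 2: train the SAE for many epochs over the cached activations; the
    # expensive vision model never runs again, and reads come off the fast drive.
    def train_sae(sae, opt, path="/local/nvme/acts.pt", epochs=10):
        loader = DataLoader(TensorDataset(torch.load(path)), batch_size=4096, shuffle=True)
        for _ in range(epochs):
            for (batch,) in loader:
                batch = batch.cuda()
                recon, latents = sae(batch)  # assumes the SAE returns (reconstruction, latents)
                loss = ((recon - batch) ** 2).mean() + latents.abs().mean()
                opt.zero_grad(); loss.backward(); opt.step()

The complexity buys the ability to sweep SAE hyperparameters many times over without ever re-running the vision model.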

Reflections

I had a deep desire for simplicity over anything else. I still do; additional algorithmic complexity must lead to dramatic improvement to justify itself. But I am now much more appreciative of complexity that deals with compute constraints. “Scale is king” is a guiding mantra for me: simple learning algorithms with few inductive biases are appealing. The somewhat obvious corollary of “scale is king” is that complexity is worth it if it enables bigger models. MLA is more complex than plain multi-head attention, but it enabled training much larger models.


  1. This is what the tutorials tell you to do, which is fine, until you actually want to do something besides train a ResNet on an A100.

    import torch

    # Pick whichever accelerator is available and move everything onto it.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = MyNetwork().to(device)
    for imgs, tgts in dataloader:
        imgs = imgs.to(device)
        tgts = tgts.to(device)
        loss = model(imgs, tgts)
        loss.backward()
        ...


Sam Stevens, 2024