
Maximal Update Parameterization (muP or μP)

Maximal update parameterization (henceforth muP; arXiv) is an initialization and learning-rate scheme that lets you tune many hyperparameters (but not all) on much smaller models, then transfer those tuned hyperparameters to a much larger model with confidence.

I recently read the joint Eleuther-Cerebras post on muP and decided I would implement it for some ViT training runs I had lined up. I had some bumps along the way and decided to document them for future me and others.

Note that I am training vision models, with dense inputs (pixels) and an unchanging number of classes (there is no concept of “vocab”).

Given a standard initialization, where weights are sampled from \(\mathcal{N}(0, 0.02^2)\), the mean L1 norm of activations at various points in the model will increase as transformer width increases.

Standard parameterization leads to increasing activation norms as width increases.
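To make “mean L1 norm of activations” concrete, here is a minimal coord-check sketch, assuming PyTorch: hook every linear layer, run one batch, and record the mean absolute value of each layer’s output. `make_vit` is a hypothetical constructor for whatever model you are scaling.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def coord_check(model: nn.Module, batch: torch.Tensor) -> dict[str, float]:
    """Record the mean |activation| of each linear layer's output."""
    stats: dict[str, float] = {}
    hooks = []

    def make_hook(name: str):
        def hook(module, inputs, output):
            stats[name] = output.abs().mean().item()
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            hooks.append(module.register_forward_hook(make_hook(name)))

    model(batch)

    for handle in hooks:
        handle.remove()
    return stats


# Hypothetical usage: rerun at several widths and watch how the stats grow.
# for width in (192, 384, 768, 1536):
#     model = make_vit(width=width)
#     print(width, coord_check(model, torch.randn(8, 3, 224, 224)))
```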

We can fix this by sampling from \(\mathcal{N}\left(0, \left(\frac{0.02}{\sqrt{m_d}}\right)^2\right)\), where \(m_d = \frac{d_{in}}{d_{in,base}}\); that is, the factor by which you scale your model width relative to your base model.

muP parameterization leads to constant initial activation norms as width increases.
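As a minimal sketch of the initialization change (assuming PyTorch; `base_fan_in` is the corresponding layer’s fan-in in your small base model), this is just a per-layer rescaling of the standard deviation:

```python
import math

import torch.nn as nn


def mup_init_linear_(layer: nn.Linear, base_fan_in: int, std: float = 0.02) -> None:
    """Initialize one linear layer from N(0, (std / sqrt(m_d))^2), with m_d = fan_in / base_fan_in."""
    m_d = layer.in_features / base_fan_in  # m_d == 1 for layers whose fan-in does not scale
    nn.init.normal_(layer.weight, mean=0.0, std=std / math.sqrt(m_d))
    if layer.bias is not None:
        nn.init.zeros_(layer.bias)


# Example: hidden width scaled from 192 (base) to 768, so m_d = 4 and std becomes 0.02 / 2.
# mup_init_linear_(nn.Linear(768, 768), base_fan_in=192)
```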

However, after taking optimizer steps, our activations are no longer constant with respect to model width. This means we also need to scale the learning rate. For the Adam and AdamW optimizers, muP proposes dividing your original learning rate \(\eta\) by \(m_d\) to get \(\frac{\eta}{m_d}\) (for SGD, muP says no change is necessary).

muP parameterization and scaling the learning rate does not fix activations after optimization.

This tripped me up a lot when I first saw it. I had scaled the learning rate appropriately, and my model wasn’t working! Why was this happening?

My explanation is the patch embedding. Because the patch embedding’s \(d_{in}\) does not change as you scale width, the mean norm of its activations also does not change, so there is no reason to shrink its learning rate. I think this is why muP suggests not scaling the learning rate for your embedding layer. If you do not change the learning rate for your embedding layer, then you get the following figure.

muP parameterization and scaling the learning rate for all non-embedding layers fixes activations after optimization.
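Here is a rough sketch of how that looks with AdamW parameter groups. It assumes the patch embedding’s parameter names start with `patch_embed` (a placeholder for whatever your model calls it) and that `m_d` is the same width multiplier as before:

```python
import torch
import torch.nn as nn


def make_mup_adamw(model: nn.Module, lr: float, m_d: float, weight_decay: float = 0.1):
    """AdamW with muP learning rates: hidden layers get lr / m_d; the embedding keeps the base lr."""
    embed_params, hidden_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # "patch_embed" is a placeholder name for the input embedding layer.
        if name.startswith("patch_embed"):
            embed_params.append(param)
        else:
            hidden_params.append(param)

    return torch.optim.AdamW(
        [
            {"params": embed_params, "lr": lr},
            {"params": hidden_params, "lr": lr / m_d},
        ],
        weight_decay=weight_decay,
    )
```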

This is pretty good! There are a couple more things to add here:

  1. Divide attention logits by \(d_{head}\) instead of \(\sqrt{d_{head}}\) to account for correlation between \(Q\) and \(K\) that emerges later in training (see the sketch after this list). This doesn’t lead to any meaningful difference in the coord check charts.
  2. Not all hyperparameters transfer using muP. Specifically, weight decay and dropout do not transfer and need to be tuned for large models. The proxy (small) model must also be trained with your target model’s batch size, depth (number of layers) and learning rate schedule. See this section for more details.
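
For point 1, a minimal sketch of the attention change (plain PyTorch, no muP library):

```python
import torch


def mup_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Attention with logits scaled by 1/d_head (muP) instead of the usual 1/sqrt(d_head).

    q, k, v have shape (batch, heads, seq, d_head).
    """
    d_head = q.shape[-1]
    logits = (q @ k.transpose(-2, -1)) / d_head  # muP: divide by d_head, not sqrt(d_head)
    return logits.softmax(dim=-1) @ v
```

If you rely on a fused kernel instead, recent PyTorch versions of `F.scaled_dot_product_attention` accept a `scale` argument, so passing `scale=1.0 / d_head` should achieve the same thing.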

Potential Trip-Ups

Other Resources



Sam Stevens, 2024