Maximal Update Parameterization (muP or μP)
Maximal update parameterization (henceforth muP; arXiv) is an initialization and learning-rate scheme that lets you tune many hyperparameters (but not all) on much smaller models, then transfer those tuned hyperparameters to a much larger model with confidence.
I recently read the joint Eleuther-Cerebras post on muP and decided to implement it for some ViT training runs I had lined up. I hit some bumps along the way and decided to document them for future me and for others.
Note that I am training vision models, with dense inputs (pixels) and an unchanging number of classes (there is no concept of “vocab”).
Given a standard initialization of sampling from \(\mathcal{N}(0, 0.02^2)\), as transformer width increases, the mean L1 norm of activations at various points in the model will increase.
We can fix this by sampling from \(\mathcal{N}\left(0, \left(\frac{0.02}{\sqrt{m_d}}\right)^2\right)\), where \(m_d = \frac{d_{in}}{d_{in,base}}\); that is, the factor by which you scale your model width relative to the base model.
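For concreteness, here is a minimal sketch of that init rule in PyTorch; the function name, the per-layer `base_in_features` argument, and the choice to zero biases are my own assumptions, not something the muP paper prescribes verbatim:

```python
import math

import torch.nn as nn

def mup_init_(linear: nn.Linear, base_in_features: int, std: float = 0.02) -> None:
    """Initialize a hidden nn.Linear with std scaled by 1/sqrt(m_d), where
    m_d = d_in / d_in,base is this layer's fan-in relative to the same layer
    in the base (proxy) model."""
    m_d = linear.in_features / base_in_features
    nn.init.normal_(linear.weight, mean=0.0, std=std / math.sqrt(m_d))
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)  # zero biases; an assumption, not from the post

# e.g. an MLP layer in a width-1024 model whose base model had width 256 (m_d = 4):
# mup_init_(nn.Linear(1024, 4096), base_in_features=256)
```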
However, after taking optimizer steps, our activations are no longer constant with respect to model width. This means we also need to scale the learning rate. For the Adam and AdamW optimizers, muP proposes dividing your original learning rate \(\eta\) by \(m_d\) to get \(\frac{\eta}{m_d}\) (for SGD, muP says no change is necessary).
This tripped me up a lot when I first saw it. I had scaled the learning rate appropriately, and my model wasn’t working! Why had this happened?
My explanation comes down to the patch embedding activations. Because \(d_{in}\) of the patch embedding does not change as you scale width, the mean norm of the patch embedding activations also does not change. I think this is why muP suggests not scaling the learning rate for your embedding layer. If you leave the embedding layer’s learning rate unchanged, you get the following figure.
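Concretely, here is a minimal sketch of what that optimizer grouping might look like; the `"patch_embed"` name match is a placeholder for whatever your embedding module is called, and `m_d` is the width multiplier from above:

```python
import torch

def mup_param_groups(model: torch.nn.Module, base_lr: float, m_d: float):
    """Embedding params keep the base LR; everything else gets base_lr / m_d
    (the Adam/AdamW rule described above)."""
    embed, hidden = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # "patch_embed" is a placeholder; match your own embedding module's name.
        (embed if "patch_embed" in name else hidden).append(param)
    return [
        {"params": embed, "lr": base_lr},         # embedding: LR unchanged
        {"params": hidden, "lr": base_lr / m_d},  # hidden: LR divided by m_d
    ]

# e.g. optimizer = torch.optim.AdamW(mup_param_groups(vit, base_lr=1e-3, m_d=4.0))
```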
The resulting coord check looks pretty good! There are a couple more things to add here:
- Divide attention logits by \(d_{head}\) instead of \(\sqrt{d_{head}}\) to account for the correlation between \(Q\) and \(K\) that emerges later in training (see the sketch after this list). This doesn’t lead to any meaningful difference in the coord check charts.
- Not all hyperparameters transfer using muP. Specifically, weight decay and dropout do not transfer and need to be tuned for large models. The proxy (small) model must also be trained with your target model’s batch size, depth (number of layers) and learning rate schedule. See this section for more details.
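As a hedged sketch, this is what the \(1/d_{head}\) attention scaling looks like in plain PyTorch (the tensor shapes are an assumption; recent PyTorch versions also accept a `scale` argument in `F.scaled_dot_product_attention`, which you could set to `1 / d_head`):

```python
import torch

def mup_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Attention with logits divided by d_head (muP) rather than sqrt(d_head).
    q, k, v: (batch, heads, seq, d_head)."""
    d_head = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d_head  # 1/d_head, not 1/sqrt(d_head)
    return logits.softmax(dim=-1) @ v
```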
Potential Trip-Ups
- Gradient clipping: with clipping enabled, my muP setup wasn’t working at all. With muP, you shouldn’t need clipping (I think).
- \(d_{in,base}\) should be at least 256 per the original muP paper and Eleuther’s reproduction. But I think 128 will be fine and will update this page after trials.
Other Resources
- Eleuther post
- Official muP GitHub repo; in particular, the explanation of and tips for coord checks.
- Cerebras’s official documentation and their Cerebras-GPT paper; specifically Appendix G. Note, however, that they divide inits and LRs by \(m_d\), while the Eleuther post recommends \(\sqrt{m_d}\).
Sam Stevens, 2024