Maximal Update Parameterization (muP or μP)
Maximal update parameterization (henceforth muP; arXiv) is an initialization and learning-rate scheme that lets you tune many hyperparameters (but not all) on much smaller models, then transfer those tuned hyperparameters to a much larger model with confidence.
I recently read the joint Eleuther-Cerebras post on muP and decided to implement it for some ViT training runs I had lined up. I hit some bumps along the way and decided to document them for future me and for others.
Note that I am training vision models, with dense inputs (pixels) and an unchanging number of classes (there is no concept of “vocab”).
Given a standard initialization of sampling from \(\mathcal{N}(0, 0.02^2)\), as transformer width increases, the mean L1 norm of activations at various points in the model will increase.
We can fix this by sampling from \(\mathcal{N}\left(0, \left(\frac{0.02}{\sqrt{m_d}}\right)^2\right)\), where \(m_d = \frac{d_{in}}{d_{in,base}}\); that is, the factor by which you scale your model width relative to the base model.
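For concreteness, here is a minimal sketch of that init rule in PyTorch; the function name, the per-layer `base_in_features` argument, and the choice to zero biases are my own assumptions, not something the muP paper prescribes verbatim:

```python
import math

import torch.nn as nn

def mup_init_(linear: nn.Linear, base_in_features: int, std: float = 0.02) -> None:
    """Initialize a hidden nn.Linear with std scaled by 1/sqrt(m_d), where
    m_d = d_in / d_in,base is this layer's fan-in relative to the same layer
    in the base (proxy) model."""
    m_d = linear.in_features / base_in_features
    nn.init.normal_(linear.weight, mean=0.0, std=std / math.sqrt(m_d))
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)  # zero biases; an assumption, not from the post

# e.g. an MLP layer in a width-1024 model whose base model had width 256 (m_d = 4):
# mup_init_(nn.Linear(1024, 4096), base_in_features=256)
```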
However, after taking optimizer steps, our activations are no longer constant with respect to model width. This means we also need to scale the learning rate. For the Adam and AdamW optimizers, muP proposes dividing your original learning rate \(\eta\) by \(m_d\) to get \(\frac{\eta}{m_d}\) (for SGD, muP says no change is necessary).
This tripped me up a lot when I first saw it. I had scaled the learning rate appropriately, and my model wasn’t working! Why had this happened?
My explanation comes down to the patch embedding activations. Because \(d_{in}\) of the patch embedding does not change as you scale width, the mean norm of the patch embedding activations also does not change. I think this is why muP suggests not scaling the learning rate for your embedding layer. If you leave the embedding layer’s learning rate unchanged, you get the following figure.
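Concretely, here is a minimal sketch of what that optimizer grouping might look like; the `"patch_embed"` name match is a placeholder for whatever your embedding module is called, and `m_d` is the width multiplier from above:

```python
import torch

def mup_param_groups(model: torch.nn.Module, base_lr: float, m_d: float):
    """Embedding params keep the base LR; everything else gets base_lr / m_d
    (the Adam/AdamW rule described above)."""
    embed, hidden = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # "patch_embed" is a placeholder; match your own embedding module's name.
        (embed if "patch_embed" in name else hidden).append(param)
    return [
        {"params": embed, "lr": base_lr},         # embedding: LR unchanged
        {"params": hidden, "lr": base_lr / m_d},  # hidden: LR divided by m_d
    ]

# e.g. optimizer = torch.optim.AdamW(mup_param_groups(vit, base_lr=1e-3, m_d=4.0))
```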
The resulting coord check looks pretty good! There are a couple more things to add here:
- Divide attention logits by \(d_{head}\) instead of \(\sqrt{d_{head}}\) to account for the correlation between \(Q\) and \(K\) that emerges later in training (see the sketch after this list). This doesn’t lead to any meaningful difference in the coord check charts.
- Not all hyperparameters transfer using muP. Specifically, weight decay and dropout do not transfer and need to be tuned for large models. The proxy (small) model must also be trained with your target model’s batch size, depth (number of layers) and learning rate schedule. See this section for more details.
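As a hedged sketch, this is what the \(1/d_{head}\) attention scaling looks like in plain PyTorch (the tensor shapes are an assumption; recent PyTorch versions also accept a `scale` argument in `F.scaled_dot_product_attention`, which you could set to `1 / d_head`):

```python
import torch

def mup_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Attention with logits divided by d_head (muP) rather than sqrt(d_head).
    q, k, v: (batch, heads, seq, d_head)."""
    d_head = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d_head  # 1/d_head, not 1/sqrt(d_head)
    return logits.softmax(dim=-1) @ v
```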
Potential Trip-Ups
- Gradient clipping: with clipping enabled, my muP setup wasn’t working at all. With muP, you shouldn’t need clipping (I think).
- \(d_{in,base}\) should be at least 256 per the original muP paper and Eleuther’s reproduction. But I think 128 will be fine and will update this page after trials.
Other Resources
- Eleuther post
- Official muP GitHub repo; in particular, the explanation of and tips for coord checks.
- Cerebras’s official documentation and their Cerebras-GPT paper; specifically Appendix G. Note, however, that they divide inits and LRs by \(m_d\), while the Eleuther post recommends \(\sqrt{m_d}\).
Sam Stevens, 2024