
Tips & Tricks for Training Neural Networks

Read Andrej Karpathy’s guide. Do everything he says. Visualize your inputs, write some code to check that you’re not mixing information across the batch dimension; do it all.
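
As an example of that last check, here is a minimal sketch in PyTorch. It assumes a model whose forward pass maps a single input tensor to per-example outputs along the batch dimension; the function name and the eval-mode choice are mine, not Karpathy’s.

    import torch

    def check_batch_independence(model: torch.nn.Module, batch: torch.Tensor, i: int = 0):
        # Backprop from example i's output only; if the batch dimension is
        # independent, every *other* example's input gradient is exactly zero.
        model.eval()  # BatchNorm in train mode legitimately mixes the batch
        batch = batch.clone().requires_grad_(True)
        out = model(batch)
        out[i].sum().backward()
        other = torch.arange(batch.shape[0], device=batch.device) != i
        leaked = batch.grad[other].abs().max().item()
        assert leaked == 0, f"information leaks across the batch dimension (max grad {leaked})"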

These are the other tips and tricks that I picked up over many, many failed attempts.

Notes on Google’s Tuning Playbook

I took these notes while applying Swin Transformers to the iNaturalist 2021 dataset.

Just use the standard model and optimizer (typically read the model’s original paper to find the standard optimizer; check recent citations of the model’s original paper to find any changes).
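
For the Swin experiments above, for example, the standard recipe is AdamW with a cosine learning-rate schedule. A minimal PyTorch sketch; the exact values are placeholders rather than tuned settings, so check the paper and its recent citations for the real ones.

    import torch
    from torchvision.models import swin_t

    model = swin_t(weights=None)
    # AdamW + cosine decay is the standard pairing for Swin; lr, weight decay,
    # and T_max below are placeholders, not recommendations.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)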

Don’t treat batch size as a hyperparameter for tuning validation performance; you can get the same effect by tuning the optimizer hyperparameters, regularization hyperparameters, and number of training steps instead.

But in my experience, you might have a minimum required global batch size.

If you can double batch size and training throughput does not double, then you likely have an I/O or node synchronization bottleneck. You should fix that first.

If throughput only increases up to some batch size N, then there is no benefit to using batch sizes above N (unless N is still below the minimum batch size you need for stable training).
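
A rough way to check for that bottleneck, assuming a PyTorch DataLoader pipeline, an image-classification setup, and a single GPU; the function name, worker count, and step count are all arbitrary choices of mine.

    import time

    import torch
    from torch.utils.data import DataLoader

    def measure_throughput(model, dataset, batch_sizes, steps=20, device="cuda"):
        # Rough examples/sec at each batch size, *including* data loading.
        # If doubling the batch size does not roughly double throughput,
        # suspect the input pipeline or cross-device synchronization.
        model = model.to(device)
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        for bs in batch_sizes:
            loader = iter(DataLoader(dataset, batch_size=bs, shuffle=True, num_workers=4))
            torch.cuda.synchronize()
            start = time.time()
            for _ in range(steps):
                x, y = next(loader)
                x, y = x.to(device), y.to(device)
                opt.zero_grad()
                loss = torch.nn.functional.cross_entropy(model(x), y)
                loss.backward()
                opt.step()
            torch.cuda.synchronize()
            print(f"batch size {bs}: {bs * steps / (time.time() - start):.1f} examples/sec")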

The optimal values of most hyperparameters are very sensitive to batch size, which makes changing batch size very expensive in terms of re-tuning.

To choose the initial configuration, we need to find a simple, fast, low-resource baseline that achieves a reasonable result.

You probably also want a lower number of steps so you can tune faster, then do a final run that’s longer.

Incremental tuning has four repeating steps:

  1. Identify a goal for the next round of experiments. Examples: improving the pipeline (a new regularizer, a preprocessing choice), understanding the impact of a specific hyperparameter, or minimizing validation error.
  2. Design and run experiments to make progress towards that goal.
  3. Learn from the results.
  4. Decide whether to update the running “best” configuration.

Most of our goals should be to learn more about the problem, not to maximize validation performance.

Optimizer hyperparameters are typically nuisance hyperparameters (they need to be tuned for every experiment) because their optimal values depend heavily on all the other hyperparameters (architecture, batch size, number of training steps, etc.). We also have no a priori reason to prefer particular optimizer hyperparameter values anyway.

But the choice of optimizer is typically a scientific or fixed hyperparameter (under study or already chosen).

Hyperparameters introduced by a regularization technique are nuisance hyperparameters but whether we include the regularization technique is scientific or fixed. The same applies to architectural hyperparameters.

A study is a set of hyperparameter configurations to be run. Each configuration is a trial. Typically we choose the hyperparameters to vary across trials, choose the search space, choose a number of trials, and choose an algorithm to sample trials from the space. Just use quasi-random search since you can run jobs in parallel.
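
A sketch of such a study using Sobol sampling via scipy; the two hyperparameters, their ranges, and the trial count are just examples.

    import numpy as np
    from scipy.stats import qmc

    # 16 quasi-random trials over learning rate and weight decay, both log-uniform.
    sampler = qmc.Sobol(d=2, scramble=True, seed=0)
    unit = sampler.random(n=16)  # points in [0, 1)^2

    log_lo = np.log10([1e-5, 1e-4])  # lower bounds: lr, weight decay
    log_hi = np.log10([1e-2, 1e-1])  # upper bounds: lr, weight decay
    trials = 10 ** (log_lo + unit * (log_hi - log_lo))

    for lr, wd in trials:
        print(f"trial: lr={lr:.2e}, weight_decay={wd:.2e}")
    # Each trial is independent of the others, so they can all run in parallel.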

Questions you need to ask before you can draw insights from a set of experiments:

  1. Is the search space large enough? A search space is likely not large enough if the best point is near the search space boundary (see the sketch after this list).
  2. Have we sampled enough points from the search space?
  3. What fraction of the trials are infeasible (trials diverge, get really bad loss, fail to run, etc.)? If there are broad areas of the search space that are infeasible, we should avoid sampling from these areas. It might also indicate a bug in the code.
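
For the boundary question in point 1, a quick check like the one below works; the 5% threshold and the log-scale assumption are my own choices.

    import numpy as np

    def near_boundary(best, lo, hi, log_scale=True, tol=0.05):
        # Flags a best trial whose value falls in the outer 5% of the search
        # range, a hint that the range should be widened along that axis.
        if log_scale:
            best, lo, hi = np.log10(best), np.log10(lo), np.log10(hi)
        frac = (best - lo) / (hi - lo)
        return frac < tol or frac > 1 - tol

    # e.g. best learning rate 9.1e-3 from a search over [1e-5, 1e-2]:
    print(near_boundary(9.1e-3, lo=1e-5, hi=1e-2))  # True -> widen the upper bound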

Tidbits

Learnings from Large Vision Transformers (22B Parameters)


[Relevant link] [Source]

Sam Stevens, 2024