
Tips & Tricks for Training Neural Networks

Read Andrej Karpathy’s guide. Do everything he says. Visualize your inputs, write some code to check that you’re not mixing information across the batch dimension; do it all.
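
As an example of that last check, here is a minimal sketch in PyTorch. It assumes a model whose forward pass maps a single input tensor to per-example outputs along the batch dimension; the function name and the eval-mode choice are mine, not Karpathy’s.

    import torch

    def check_batch_independence(model: torch.nn.Module, batch: torch.Tensor, i: int = 0):
        # Backprop from example i's output only; if the batch dimension is
        # independent, every *other* example's input gradient is exactly zero.
        model.eval()  # BatchNorm in train mode legitimately mixes the batch
        batch = batch.clone().requires_grad_(True)
        out = model(batch)
        out[i].sum().backward()
        other = torch.arange(batch.shape[0], device=batch.device) != i
        leaked = batch.grad[other].abs().max().item()
        assert leaked == 0, f"information leaks across the batch dimension (max grad {leaked})"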

These are the other tips and tricks that I picked up over many, many failed attempts.

Notes on Google’s Tuning Playbook

I took these notes while applying Swin Transformers to the iNaturalist 2021 dataset.

Just use the standard model and optimizer (typically read the model’s original paper to find the standard optimizer; check recent citations of the model’s original paper to find any changes).
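
For the Swin experiments above, for example, the standard recipe is AdamW with a cosine learning-rate schedule. A minimal PyTorch sketch; the exact values are placeholders rather than tuned settings, so check the paper and its recent citations for the real ones.

    import torch
    from torchvision.models import swin_t

    model = swin_t(weights=None)
    # AdamW + cosine decay is the standard pairing for Swin; lr, weight decay,
    # and T_max below are placeholders, not recommendations.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)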

Don’t treat batch size as a hyperparameter for tuning validation performance; you can get the same effect by tuning the optimizer hyperparameters, regularization hyperparameters, and number of training steps instead.

But in my experience, you might have a minimum required global batch size.

If you can double batch size and training throughput does not double, then you likely have an I/O or node synchronization bottleneck. You should fix that first.

If throughput only increases up to some batch size N, then there is no benefit to using batch sizes above N (unless N is still below the minimum batch size you need for stable training).
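
A rough way to check for that bottleneck, assuming a PyTorch DataLoader pipeline, an image-classification setup, and a single GPU; the function name, worker count, and step count are all arbitrary choices of mine.

    import time

    import torch
    from torch.utils.data import DataLoader

    def measure_throughput(model, dataset, batch_sizes, steps=20, device="cuda"):
        # Rough examples/sec at each batch size, *including* data loading.
        # If doubling the batch size does not roughly double throughput,
        # suspect the input pipeline or cross-device synchronization.
        model = model.to(device)
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        for bs in batch_sizes:
            loader = iter(DataLoader(dataset, batch_size=bs, shuffle=True, num_workers=4))
            torch.cuda.synchronize()
            start = time.time()
            for _ in range(steps):
                x, y = next(loader)
                x, y = x.to(device), y.to(device)
                opt.zero_grad()
                loss = torch.nn.functional.cross_entropy(model(x), y)
                loss.backward()
                opt.step()
            torch.cuda.synchronize()
            print(f"batch size {bs}: {bs * steps / (time.time() - start):.1f} examples/sec")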

The optimal values of most hyperparameters are very sensitive to batch size, which makes changing batch size very expensive in terms of re-tuning.

To choose the initial configuration, we need to find a simple, fast, low-resource baseline that achieves a reasonable result.

You probably also want a lower number of steps so you can tune faster, then do a final run that’s longer.

Incremental tuning has four repeating steps:

  1. Identify a goal for the next round of experiments. Examples: improving the pipeline (a new regularizer, a preprocessing choice), understanding the impact of a specific hyperparameter, or minimizing validation error.
  2. Design and run experiments to make progress towards that goal.
  3. Learn from the results.
  4. Decide whether to update the running “best” configuration.

Most of our goals should be to learn more about the problem, not to maximize validation performance.

Optimizer hyperparameters are typically nuisance hyperparameters (they need to be tuned for every experiment) because their optimal values depend heavily on all the other hyperparameters (architecture, batch size, number of training steps, etc.). We also have no a priori reason to prefer particular optimizer hyperparameter values anyway.

But the choice of optimizer is typically a scientific or fixed hyperparameter (under study or already chosen).

Hyperparameters introduced by a regularization technique are nuisance hyperparameters but whether we include the regularization technique is scientific or fixed. The same applies to architectural hyperparameters.

A study is a set of hyperparameter configurations to be run. Each configuration is a trial. Typically we choose the hyperparameters to vary across trials, choose the search space, choose a number of trials, and choose an algorithm to sample trials from the space. Just use quasi-random search since you can run jobs in parallel.
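
A sketch of such a study using Sobol sampling via scipy; the two hyperparameters, their ranges, and the trial count are just examples.

    import numpy as np
    from scipy.stats import qmc

    # 16 quasi-random trials over learning rate and weight decay, both log-uniform.
    sampler = qmc.Sobol(d=2, scramble=True, seed=0)
    unit = sampler.random(n=16)  # points in [0, 1)^2

    log_lo = np.log10([1e-5, 1e-4])  # lower bounds: lr, weight decay
    log_hi = np.log10([1e-2, 1e-1])  # upper bounds: lr, weight decay
    trials = 10 ** (log_lo + unit * (log_hi - log_lo))

    for lr, wd in trials:
        print(f"trial: lr={lr:.2e}, weight_decay={wd:.2e}")
    # Each trial is independent of the others, so they can all run in parallel.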

Questions you need to ask before you can draw insights from a set of experiments:

  1. Is the search space large enough? A search space is likely not large enough if the best point is near the search space boundary (see the sketch after this list).
  2. Have we sampled enough points from the search space?
  3. What fraction of the trials are infeasible (trials diverge, get really bad loss, fail to run, etc.)? If there are broad areas of the search space that are infeasible, we should avoid sampling from these areas. It might also indicate a bug in the code.
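
For the boundary question in point 1, a quick check like the one below works; the 5% threshold and the log-scale assumption are my own choices.

    import numpy as np

    def near_boundary(best, lo, hi, log_scale=True, tol=0.05):
        # Flags a best trial whose value falls in the outer 5% of the search
        # range, a hint that the range should be widened along that axis.
        if log_scale:
            best, lo, hi = np.log10(best), np.log10(lo), np.log10(hi)
        frac = (best - lo) / (hi - lo)
        return frac < tol or frac > 1 - tol

    # e.g. best learning rate 9.1e-3 from a search over [1e-5, 1e-2]:
    print(near_boundary(9.1e-3, lo=1e-5, hi=1e-2))  # True -> widen the upper bound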

Tidbits

Learnings from Large Vision Transformers (22B Parameters)


[Relevant link] [Source]

Sam Stevens, 2024