TLDR
- Research papers provide value through artifacts (datasets, benchmarks, models) or insights (lessons applicable beyond one paper).
- Most papers will have an artifact and one to three insights. The insights are much more important; the artifact is typically used as evidence that the insights are true.
- Only the top artifacts will have long-term impact; even smaller insight papers can have lasting impact.
- You can use the artifact vs insight framework to improve both your paper structure and your experimental planning.
Artifacts vs Insights: Meta-Insights on AI Research
Papers typically provide two kinds of value: artifacts and insights.
Artifacts are datasets, models, codebases, and sometimes optimization algorithms. Insights are lessons and takeaways that later researchers will apply to their own work.
Examples of artifacts:
- GPT-3
- BERT
- CLIP (pre-trained checkpoints)
- RoBERTa
- ELECTRA, DeBERTa
- T5, Flan-T5
- Llama2
- OPT
- Mind2Web
- MMMU
- MagicBrush
Examples of insights:
- Pangu
- Chinchilla
- Bernal’s tokenization work
- Ron’s SQL framework
- CLIP (training on image-text pairs)
- RoBERTa (training for longer is better)
- OPT (logbook)
- Emergent abilities paper
- Three Things about Vision Transformers paper
Thinking about the difference between artifacts and insights can help you organize your research, both during the research process and within a paper, when presenting to outside audiences.
During the Research Process
At every step, you should be asking yourself what insights you’ve accumulated and how you can take advantage of them, typically to produce an artifact. You also need to think about what evidence you have for your insights, and whether that’s enough evidence to convince readers. You should only think about artifacts 10-20% of the time.
When Planning the Paper for Readers
You should think about what value your paper provides for readers. Is it a research artifact? Or research insights?
For both outcomes, you need to convince readers of the value of the artifact or insight. For insights, you also need to convince readers that your insight is true, typically with empirical evidence.
Artifacts
You need to convince readers that your artifact is valuable. How you do this depends on what kind of artifact you have, but it is typically straightforward.
For new models, you should demonstrate that your model outperforms existing research artifacts on relevant benchmarks. For new benchmarks, you will likely have to do a literature review and explain how existing benchmarks fail to accurately measure model capabilities. For new training datasets, you should train one model on your dataset and one model on existing datasets with the same amount of compute, and demonstrate that your dataset leads to better model weights (those model weights will likely be part of your research artifact as well).
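As a rough sketch of what "same amount of compute" means here (the dataset names, model size, and token budget below are hypothetical), you can pin both runs down in a config and check that the compute matches before training anything; the 6 × params × tokens estimate is the usual rough approximation for dense transformer training compute.

```python
from dataclasses import dataclass

@dataclass
class RunConfig:
    dataset: str                 # hypothetical dataset identifiers below
    params: int = 120_000_000    # same model size for both runs
    tokens: int = 2_400_000_000  # same token budget for both runs

def train_flops(cfg: RunConfig) -> float:
    # Common approximation for dense transformer training compute: ~6 * params * tokens.
    return 6 * cfg.params * cfg.tokens

ours = RunConfig(dataset="ours-v1")
baseline = RunConfig(dataset="existing-corpus")

# The comparison is only meaningful if both runs use the same compute.
assert train_flops(ours) == train_flops(baseline)

# You would then train one model per config (not shown) and compare them on
# held-out benchmarks; the resulting checkpoints are part of your artifact too.
```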
In general, convincing readers of your artifact's value is easy, assuming it is in fact valuable. Actually creating a valuable artifact is typically hard. Often, creating a model artifact requires more compute (the Falcon LMs, GPT-3), more (possibly proprietary) data (typical of image models like CLIP or Google's JFT-3B), or some novel insight.
Insights
Convincing readers that your insight is valuable is also typically pretty easy. Normally your insight has some real-world payoff: you can train models with less compute, less data, or less human supervision, or you get better results for the same cost. This is typically a compute multiplier (link the non_int blog).
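For concreteness, a compute multiplier is just the ratio of how much compute the baseline needs to reach some quality bar versus how much your method needs; the numbers below are made up purely to illustrate the arithmetic.

```python
# Toy, made-up numbers: training FLOPs each approach needs to reach the same
# validation loss on the same evaluation.
baseline_flops = 1.0e21
our_method_flops = 4.0e20

# The compute multiplier is how much cheaper your method makes hitting that bar.
multiplier = baseline_flops / our_method_flops
print(f"compute multiplier: {multiplier:.1f}x")  # prints "compute multiplier: 2.5x"
```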
Negative-results papers are definitely insight papers, but there the value lies in convincing other researchers not to go down the same route you tried. This typically means you tried something that intuitively should have led to an improvement, but you couldn't make it work.
The challenge for insights lies in convincing readers that your insight is actually true.
Typically, you will do lots of experiments, trying to answer any question that a skeptic might ask. Your evidence should be convincing.
The rule of thumb is that the more general your insight, the more useful it is. However, the more general your insight, the more evidence is required.
Furthermore, the farther from the accepted standard your insight is, the more evidence is required to convince readers.
What This Means for Research Directions as an Academic
Typically, the best artifacts come from the labs with the most compute, because scale is king. If you are a PhD student, you typically don't have the most compute, so it will be harder for you to produce valuable artifacts. You can pick specialized domains (LMs for legal text, for example) or find smaller niches (recent work uses GPT-style architectures as 3D mesh decoders). But you shouldn't try to compete with OpenAI or Meta on building LLMs, and that's probably already obvious.
But since industry labs know they can crush the artifact side of things (specifically models), academic labs can focus on non-model artifacts and insights; this is just specialization and comparative advantage from microeconomics. Academic labs can develop benchmarks (MMMU, Mind2Web), new training data (MagicBrush, Joel's legal-text benchmark or training data), or insights.
Insights Get Fewer Citations
How can I show this? Does it matter?
Insights Last Longer
- A paper with good insights is useful for many people.
- Most artifacts are a flash in the pan.
- Some artifacts matter for a very long time (BERT, GPT-2, T5, ImageNet, The Pile)
“Scale is King” is King
Perhaps the most consistent insight in AI is that bigger models trained on more data with more compute will outperform smaller models trained on less data with less compute. This insight is nearly a fact. It's been described in the Bitter Lesson by Richard Sutton. It was challenged by the Inverse Scaling Prize; then GPT-4 beat it. Unless scale is a core part of your paper's insight (like the emergent abilities paper), you should treat "scale is king" as a fact.
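One common way this gets formalized is with a Chinchilla-style parametric loss fit, where loss falls predictably as parameters and training tokens grow. The constants below are roughly the published Chinchilla fit; treat this as an illustration, since in practice you would refit them for your own setup.

```python
# Chinchilla-style loss fit: L(N, D) = E + A / N**alpha + B / D**beta,
# where N is parameter count and D is training tokens.
# Constants are roughly the fit reported by Hoffmann et al. (2022).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(params: float, tokens: float) -> float:
    return E + A / params**alpha + B / tokens**beta

print(predicted_loss(125e6, 2.5e9))  # small model, little data -> ~3.4
print(predicted_loss(7e9, 140e9))    # bigger model, more data  -> ~2.2
```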
This should impact your research in two ways:
First, you should do the majority of your research at small scale. As you accumulate empirical evidence to support your insights, you should do most experiments at small to medium scales. You should assume that as you get bigger, your results will get better. You do need to be wary that your insight might not scale as well as the baselines.
Maybe 50% of your experiments are at small scale, 30% at medium, and 20% at large scale. These scales depend on your problem: if you are an academic lab pretraining language models, you might start at 120M parameters, then try some things at 350M and 750M. Your final model should be 1.3B or even 7B parameters.
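As an illustration of how that 50/30/20 split can play out (run counts, sizes, and token budgets below are made up), most of your runs can be small while the few large runs still dominate the compute bill.

```python
# Hypothetical experiment plan: a 50/30/20 split by number of runs.
# Training FLOPs are approximated as 6 * params * tokens per run.
tiers = {
    "small (120M)":  dict(runs=10, params=120e6, tokens=2.4e9),
    "medium (350M)": dict(runs=6,  params=350e6, tokens=7.0e9),
    "large (1.3B)":  dict(runs=4,  params=1.3e9, tokens=26e9),
}
for name, t in tiers.items():
    total_flops = 6 * t["params"] * t["tokens"] * t["runs"]
    print(f"{name}: {t['runs']} runs, ~{total_flops:.1e} training FLOPs")
```

Even though half the runs are small, the handful of large runs account for most of the compute, which is another reason to have your insights nailed down before you scale up.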
Second, you should be aware that every reviewer will ask "does this scale?" It will always come up in reviews, and you should always think about how to answer it. The emergent abilities paper makes this especially challenging for GPU-poor labs, because the presence of phase changes in LMs means that results at small scale may not predict what happens at large scale. You should do your best to think about how to answer this question before submitting. I don't have any good thoughts on how to deal with this.
Finally, if you think you are going to challenge the idea that scale is king, you had better be absolutely certain that you're right.
Sam Stevens, 2024