LLMs for Code
Programming is a super popular application for LLMs.
The idea of using generative AI for autocomplete in a text editor is a genuinely significant innovation, and it is still my favorite example of a non-chat UI for interacting with models.
I think it’s popular because code has several properties that suit language models: individual pieces of code are designed to be separable (modular), they often have extremely clear goals (docstrings describe what a function does and why you might use it), and programming languages follow strict, exception-free rules (syntax and grammar have none of the irregularities of human languages). At the same time, code can be extremely complex and comparatively hard for humans to write, so even a non-AGI LLM can be incredibly useful for autocomplete.
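To make the “clear goals” point concrete, here is a small made-up example: the docstring alone pins down the behavior tightly enough that filling in the body is almost mechanical, which is exactly the kind of signal autocomplete models exploit.

```python
def rolling_mean(values: list[float], window: int) -> list[float]:
    """Return the mean of each consecutive `window`-sized slice of `values`.

    Raises ValueError if `window` is not between 1 and len(values).
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```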
However, I think that a lot of AI research does not take advantage of the non-AI tooling we already have to make programming easier for humans: type checkers, linters, language servers, build systems, and so on.
One exception to this statement is Aider. It uses a custom repo map, reads CONVENTIONS.md files, and has an editing-specific benchmark. In general, Aider embodies an excellent set of engineering practices for making code editing with LLMs easier.
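To give a flavor of what a repo map buys you, here is a toy sketch of my own (not Aider’s actual implementation, which uses tree-sitter and ranks symbols by relevance): list each file’s top-level definitions so the model sees the shape of the codebase without reading every file.

```python
# Toy repo map: list top-level functions and classes per Python file.
# A simplified sketch, not Aider's implementation.
import ast
import pathlib


def build_repo_map(root: str) -> str:
    lines = []
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text())
        except SyntaxError:
            continue  # skip files that don't parse
        symbols = [
            node.name
            for node in tree.body
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        ]
        if symbols:
            lines.append(f"{path}: {', '.join(symbols)}")
    return "\n".join(lines)


print(build_repo_map("."))
```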
The academic community is primarily focused on building more realistic benchmarks (a rough sketch of what one task instance looks like follows this list):
- SWE-Bench and its variants (SWE-Bench Multimodal, SWE-Bench+)
- Commit0
- DevEval
- TheAgentCompany
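For readers who have not looked at these datasets, a SWE-Bench-style task instance looks roughly like the dictionary below. The field names are from memory and every value is a placeholder; the point is that the system gets a real repository at a fixed commit plus the issue text, and is graded by whether its patch makes the designated failing tests pass.

```python
# Rough shape of a SWE-Bench-style task instance (field names approximate,
# values are placeholders, not real data).
task = {
    "repo": "owner/project",              # GitHub repository the issue comes from
    "base_commit": "deadbeef",            # commit the system starts from
    "problem_statement": "Text of the GitHub issue describing the bug.",
    "patch": "Gold diff from the original maintainers (hidden at test time).",
    "test_patch": "Diff adding the tests that check the fix.",
    "FAIL_TO_PASS": ["tests that fail before the fix and must pass after"],
    "PASS_TO_PASS": ["tests that must keep passing"],
}
```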
There is plenty of work developing individual systems to attack these benchmarks (primarily SWE-Bench or SWE-Bench Lite). The Agentless paper compares many approaches in Table 1.
However, I feel that these papers present individual systems, which will inevitably be improved upon by more complex, more expensive, or better-engineered systems. I would prefer to see papers that propose individual techniques, ideally adding a technique like static analysis-constrained decoding to many different existing systems and then recommending when the added complexity is worth it (a rough sketch of the idea follows).
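As a minimal sketch of the kind of technique I mean (my own simplification, not any particular paper’s method): real static analysis-constrained decoding masks disallowed tokens during sampling, but the cheapest version is rejection sampling against a static check, bolted onto whatever system you already have. `sample_completion` below is a stand-in for that existing system, and the “analysis” is only a parse check.

```python
# Crude approximation of static analysis-constrained decoding: instead of
# masking tokens during generation, reject whole completions that fail a
# static check and resample.
import ast
from typing import Callable


def passes_static_check(source: str) -> bool:
    """Cheapest possible static analysis: does the code parse at all?"""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def constrained_generate(
    sample_completion: Callable[[str], str],  # stand-in for any existing system
    prompt: str,
    max_tries: int = 8,
) -> str | None:
    """Return the first sampled completion that survives the static check."""
    for _ in range(max_tries):
        candidate = sample_completion(prompt)
        if passes_static_check(candidate):
            return candidate
    return None
```

The paper I want to read is the one that reports how much a check like this helps when added to several existing systems, and at what cost in latency and complexity.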
This relates to some preliminary thinking of mine on artifacts vs insights.
Sam Stevens, 2024