Towards end-to-end automation of AI research

Article Date = 25 March 2026
Article URL = https://www.nature.com/articles/s41586-026-10265-5
Article Title = Towards end-to-end automation of AI research
Article Image = (not provided)

Summary

This Nature paper describes “The AI Scientist”: an end-to-end pipeline that uses current foundation models to autonomously generate research ideas, run experiments (via code generation and an agentic tree search), analyse results, write manuscripts and pass them through an automated peer-review agent. The authors build two flavours: a template-based system that extends a human-provided codebase and a template-free system that writes code from scratch and uses a parallelised tree search for exploration.

Crucially, they also built The Automated Reviewer — an ensemble LLM reviewer that scores papers and makes accept/reject decisions, performing comparably to human reviewers on conference datasets. In a controlled experiment, one fully AI-generated manuscript produced by the system achieved review scores above the average acceptance threshold at an ICLR workshop; it would likely have been accepted had the authors not withdrawn it, as their protocol required.

Key Points

  • The AI Scientist automates the full research cycle: ideation, literature checks, experiment planning, execution, analysis, manuscript writing and review.
  • Two system modes: template-based (extends existing code with Aider) and template-free (open-ended code generation using Claude Sonnet 4 and a tree-search strategy).
  • Agentic tree search runs many parallel experimental nodes (hyperparameter, ablation, replication, aggregation) and prunes using LLM-guided evaluation.
  • Visual outputs are critiqued by a vision-language model to catch labelling and presentation issues before writing.
  • The Automated Reviewer aggregates five independent reviews into a meta-review; on conference data its agreement metrics (balanced accuracy and F1) are comparable to those of human reviewers.
  • Paper quality rises with stronger base models and with more test-time compute — both model capability and inference investment matter.
  • In a blinded workshop experiment, one of three AI-generated papers would have met the workshop acceptance bar; none met the main-conference standard.
  • Limitations include hallucinated or incorrect citations, coding errors, naive ideas, lack of deep methodological rigour and consistency; the system currently only runs computational experiments.
  • Ethical and social risks: reviewer overload, credential inflation, misattribution, job disruption, and potential for unethical experiments — the team obtained IRB approval and withdrew submissions after review.
  • Code and repositories are public: template-based and automated-reviewer code at https://github.com/SakanaAI/AI-Scientist and template-free at https://github.com/SakanaAI/AI-Scientist-v2 (Apache 2.0).
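The tree-search-and-prune loop described in the bullets above can be sketched roughly as follows. This is an illustrative stand-in, not the paper's implementation: the real system runs nodes in parallel, generates the experiment code itself, and uses an LLM to evaluate results, whereas here the scoring function is a toy deterministic heuristic so the sketch stays runnable.

```python
import random
from dataclasses import dataclass, field

@dataclass
class ExperimentNode:
    """One node in the search tree: an experiment config plus its evaluation."""
    config: dict
    score: float = 0.0
    children: list = field(default_factory=list)

def llm_score(config):
    # Stand-in for LLM-guided evaluation of an experiment's results.
    # In the real system a language model judges logs and metrics; here a
    # toy heuristic rewards learning rates near a hypothetical optimum 0.01.
    return 1.0 / (1.0 + abs(config["lr"] - 0.01) * 100)

def expand(node, n_children=3):
    # Propose variant experiments (hyperparameter tweaks, ablations, ...).
    for _ in range(n_children):
        cfg = dict(node.config)
        cfg["lr"] = cfg["lr"] * random.choice([0.5, 1.0, 2.0])
        child = ExperimentNode(config=cfg, score=llm_score(cfg))
        node.children.append(child)
    return node.children

def tree_search(root_config, depth=3, beam=2):
    """Expand experiment variants, prune to the top-`beam` by score, repeat."""
    random.seed(0)  # deterministic for the sketch
    root = ExperimentNode(config=root_config, score=llm_score(root_config))
    frontier, best = [root], root
    for _ in range(depth):
        candidates = []
        for node in frontier:
            candidates.extend(expand(node))
        # Prune: keep only the highest-scoring nodes (LLM-guided in the paper).
        frontier = sorted(candidates, key=lambda n: n.score, reverse=True)[:beam]
        if frontier and frontier[0].score > best.score:
            best = frontier[0]
    return best
```

The beam width, depth, and single `lr` hyperparameter are arbitrary choices for illustration; the paper's search also spawns replication and aggregation nodes, which this sketch omits.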

Why should I read this?

Short answer: because this is the first clear demonstration that AI can do more than assist with parts of research; it can attempt the whole loop, from idea to peer review. If you're in ML, research policy, ethics or infrastructure planning, this paper is a fast way to see what fully automated scientific workflows already look like and where they break. It's both exciting and a little alarming, so worth a skim (or a close read if you pick fights with the limitations section).

Author note

Punchy take: this is a milestone. The authors show that with current LLMs plus smart orchestration you can produce workshop-level papers autonomously. It’s not flawless, but the trend is clear — better models and more compute will rapidly narrow the gap to human-level outputs. Read this if you want to know how researchers are wiring LLMs into the scientific process now, not five years from now.

Context and relevance

Where this fits: The work builds on decades of automated-discovery research but leverages modern foundation models, VLMs and tool use to scale from narrow tasks to long, structured research workflows. The findings are relevant to ongoing debates about reproducibility, peer-review capacity, research workforce impacts and governance of automated science. For labs and institutions, it signals both opportunity (speeding exploratory work, automating routine experiments) and risk (flooding review pipelines, misuse).

Source

Source: https://www.nature.com/articles/s41586-026-10265-5