What if machines learned how to generate novel, plausible, and testable research hypotheses from an existing body of knowledge? That’s the entire promise behind Google DeepMind’s AI Co-Scientist, a multi-agent system that can survey existing literature and propose hypotheses that scientists can test and prove. In an era of information overload, Google’s AI Co-Scientist points towards a future where AI works alongside scientists.
Built with Gemini 2.0, the AI Co-Scientist aims to solve one of the greatest bottlenecks in modern research: hypothesis generation. Today’s researchers are overwhelmed by an ever-expanding volume of literature. At the same time, making sense of that data and turning it into testable, impactful research questions is increasingly time-consuming, cognitively demanding, and limited by human bandwidth.
That’s where the AI Co-Scientist comes in. It’s a system that simulates the scientific process, from hypothesis generation to peer review and experimental proposal, using a coordinated team of AI agents. The Co-Scientist proposes hypotheses, and the scientist’s job is to verify them and test their validity. In fact, independent researchers have already taken hypotheses generated by the Co-Scientist into the lab and validated them experimentally.
This research holds transformative potential for a wide range of scientific fields. Chemists can use it to explore new reaction pathways or catalysts. Pharmaceutical companies can accelerate preclinical drug discovery. Environmental scientists can uncover overlooked patterns in ecological data. Whether you’re in academia or industry, the AI Co-Scientist represents a powerful shift in how scientific knowledge is created, tested, and applied.
How the AI Co-Scientist Works in General
At its core, the AI Co-Scientist is designed to simulate the entire scientific method using a multi-agent large language model (LLM) system. Unlike single-output AI tools that just summarize papers or answer questions, this system mimics how scientists actually work: generating ideas, debating them, refining them, and proposing experiments, iteratively and autonomously.
The process starts with a research goal, written and explained by the scientist. This could be anything, from finding novel drugs for liver fibrosis to explaining the mechanisms behind antimicrobial resistance. From there, the system kicks off a structured reasoning loop: breaking the problem down, parsing related literature, generating hypotheses, and improving them over time through debate and collaboration among the individual agents.
Each agent has a specific role, coordinated by a central Supervisor Agent, and all outputs are stored in a context memory that enables self-improvement across iterations. The process continues until a set of high-quality, testable hypotheses is produced. The AI Co-Scientist multi-agent system includes the following:
- Generation Agent: The starting point of the whole agent system. Its job is to interpret the research goal, search for related literature, and generate novel but plausible hypotheses, grounded in prior knowledge.
- Reflection Agent: This agent acts as a peer reviewer, performing a multi-stage critique of each hypothesis generated by the Generation Agent. It checks correctness, novelty, and feasibility, with or without access to external literature. It also decomposes hypotheses into underlying assumptions to test their soundness and even simulates biological or experimental outcomes step by step.
- Ranking Agent: This agent organizes the hypotheses into tournament-style matchups, testing them in pairwise debates. Using an Elo rating system, it scores each hypothesis on how well it performs in terms of novelty, testability, and coherence.
- Proximity Agent: To avoid redundancy and promote diverse thinking, the Proximity Agent builds a similarity graph across all hypotheses. It clusters similar ideas, flags duplicates, and helps identify underexplored regions in the research space. This informs which ideas to develop further or compare directly.
- Evolution Agent: This agent picks promising hypotheses and applies refinement strategies like improving grounding, simplifying language, combining concepts, or introducing alternative assumptions. Importantly, it doesn’t overwrite ideas; it generates new, evolved variants that re-enter the process.
- Meta-Review Agent: This agent steps back and looks at the whole system’s performance. It identifies patterns in critiques, surfaces recurring issues, and provides high-level summaries of the current research direction.
- Supervisor Agent (Coordinator): As mentioned, the Supervisor Agent handles the overall process – distributing tasks, monitoring progress, allocating resources, and triggering system resets based on convergence criteria or time budgets. It also manages the shared memory, which allows the agents to continuously learn from past iterations.
Together, these agents form a self-improving research loop. Ideas get generated, critiqued, ranked, refined, and re-tested until the system converges on a set of high-Elo, diverse, testable, and novel hypotheses.
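To make that loop concrete, here is a minimal Python sketch of how such a coordination could look. It is purely illustrative: the real system runs on Gemini 2.0 and is not open source, so every name here (`Hypothesis`, `ContextMemory`, `supervisor_loop`, the `llm` callable) is a hypothetical stand-in for the roles described above, and the fixed ±16 rating nudge is a placeholder for the proper Elo updates discussed in the evaluation section.

```python
# Illustrative sketch only: the real AI Co-Scientist runs on Gemini 2.0 and is not
# open source. Every name below is hypothetical and mirrors the agent roles above.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Hypothesis:
    text: str
    elo: float = 1200.0                      # new hypotheses start at a neutral rating
    critiques: list[str] = field(default_factory=list)


@dataclass
class ContextMemory:
    """Shared store that lets agents build on earlier iterations."""
    hypotheses: list[Hypothesis] = field(default_factory=list)
    meta_reviews: list[str] = field(default_factory=list)


def supervisor_loop(goal: str, llm: Callable[[str], str], iterations: int = 5) -> list[Hypothesis]:
    memory = ContextMemory()
    for _ in range(iterations):
        # Generation Agent: propose candidates grounded in the goal and in the
        # recurring issues the Meta-Review Agent flagged last round.
        themes = memory.meta_reviews[-1] if memory.meta_reviews else ""
        raw = llm(f"Propose three testable hypotheses for: {goal}\nPrior critique themes: {themes}")
        new = [Hypothesis(text=line.strip()) for line in raw.splitlines() if line.strip()]
        if not new:
            continue  # nothing to review this round

        # Reflection Agent: peer-review each new hypothesis.
        for h in new:
            h.critiques.append(llm(f"Critique the correctness, novelty, and feasibility of: {h.text}"))
        memory.hypotheses.extend(new)

        # Proximity Agent: drop near-duplicates so the tournament stays diverse.
        unique, seen = [], set()
        for h in memory.hypotheses:
            if h.text.lower() not in seen:
                seen.add(h.text.lower())
                unique.append(h)
        memory.hypotheses = unique

        # Ranking Agent: pairwise debates nudge ratings up or down
        # (a proper Elo update rule is sketched in the evaluation section).
        for a, b in zip(memory.hypotheses, memory.hypotheses[1:]):
            verdict = llm(f"Which is more novel, testable, and coherent?\nA: {a.text}\nB: {b.text}")
            winner, loser = (a, b) if "A" in verdict else (b, a)
            winner.elo, loser.elo = winner.elo + 16, loser.elo - 16

        # Evolution Agent: refine the current leader into a new variant that re-enters the pool.
        best = max(memory.hypotheses, key=lambda h: h.elo)
        memory.hypotheses.append(Hypothesis(text=llm(f"Refine, simplify, and better ground: {best.text}")))

        # Meta-Review Agent: summarize recurring critique themes to steer the next round.
        memory.meta_reviews.append(llm("Summarize recurring issues in these critiques:\n" +
                                       "\n".join(c for h in new for c in h.critiques)))

    return sorted(memory.hypotheses, key=lambda h: h.elo, reverse=True)
```

In practice, `llm` would wrap calls to an actual model, and each agent would use far richer prompts, literature search, and tool access; the point is simply that the agents share one memory and keep cycling through generation, critique, ranking, and evolution until the hypotheses converge.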

Evaluating the Capability of the AI Co-Scientist
To assess whether the AI Co-Scientist produces high-quality scientific hypotheses, the researchers implemented a comprehensive evaluation framework. This included automated metrics, benchmark datasets, expert judgments, and comparisons with other language models and human researchers. The results show that the system not only generates plausible hypotheses but also improves them over time and outperforms strong baselines.
Elo Rating Correlates with Hypothesis Quality
The AI Co-Scientist uses an Elo-style rating system (as in chess) to rank its generated hypotheses based on pairwise comparisons. To test whether this metric is meaningful, researchers evaluated the system’s Elo scores on GPQA, a benchmark of graduate-level science questions across biology, chemistry, and physics.
The findings showed a clear correlation: hypotheses with higher Elo scores were significantly more accurate and reliable. In fact, the AI’s top-ranked answers achieved 78.4% accuracy on GPQA, outperforming random selection and indicating that Elo is a strong internal signal for hypothesis quality.
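For readers unfamiliar with Elo, the update rule itself is simple and comes straight from chess. The snippet below is a standard implementation; the paper does not publish the exact K-factor or starting ratings the Co-Scientist uses, so the constants here are illustrative rather than the system’s actual values.

```python
# Standard chess-style Elo update. The Co-Scientist's exact K-factor and starting
# ratings are not public, so the constants here are illustrative.
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one pairwise debate."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    return (rating_a + k * (score_a - exp_a),
            rating_b + k * ((1.0 - score_a) - (1.0 - exp_a)))


# Example: a 1200-rated hypothesis beats a 1300-rated one and gains about 20 points.
print(update_elo(1200, 1300, a_won=True))   # approx. (1220.5, 1279.5)
```

The key property is that upsets move ratings more than expected wins, so after enough pairwise debates the ratings settle into a ranking that reflects how consistently each hypothesis survives comparison.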

Test-Time Compute Improves Reasoning
Unlike static models, the AI Co-Scientist was built by DeepMind researchers to improve with time and through iteration. The researchers tested this by running the system longer across 203 research goals, allowing more rounds of generation, critique, ranking, and refinement.
As compute increased, so did performance. Hypotheses from later iterations consistently received higher Elo scores and improved in novelty and testability. This confirms that the system benefits from iterative self-improvement, becoming more powerful the longer it runs.

Comparison with Other LLMs and Experts
The AI Co-Scientist was also benchmarked against the following:
- Gemini 2.0 Pro Experimental
- OpenAI o1 and o3-mini-high
- DeepSeek R1
- Human experts tasked with writing their best hypothesis for the same biomedical problems.
On 15 expert-curated challenges, the AI Co-Scientist outperformed every other model and the human-written hypotheses, based on Elo ratings. Not only did it generate better ideas, but it also improved on the best expert suggestions after multiple iterations—highlighting the strength of its self-improving loop.

Expert Evaluation
Finally, the researchers brought in domain experts to judge the system’s outputs directly. Experts rated hypotheses from the Co-Scientist and three baseline LLMs across three criteria:
- Novelty – Is the idea original and previously unpublished?
- Impact – Could it lead to meaningful scientific advances?
- Overall Preference – Which output would they choose to pursue?
The Co-Scientist consistently scored higher in all categories:
- Novelty score: 3.64 out of 5
- Impact score: 3.09 out of 5
- Experts preferred Co-Scientist’s hypotheses over those from Gemini, OpenAI, and others.

Applying the AI Co-Scientist to Real-World Problems
To further validate the AI Co-Scientist, the researchers applied it to real-world biomedical research. They selected three major scientific challenges and used the system to generate hypotheses that were then passed on to human experts and validated in wet-lab experiments. The results demonstrate that the system is capable of contributing to real-world scientific discovery.
Drug Repurposing for Acute Myeloid Leukemia (AML)
In this task, the AI Co-Scientist was given a broad objective: identify new or overlooked drug candidates for treating AML, a complex and aggressive cancer. The system produced hypotheses that included both known and novel drug candidates:
- It correctly identified binimetinib and pacritinib, both of which are known to have AML-related activity.
- More impressively, it proposed KIRA6, a drug not previously associated with AML treatment.
KIRA6 was taken into in vitro validation, where it showed nanomolar-level efficacy in inhibiting AML cells. The AI identified KIRA6 in a hypothesis-driven context, and it performed successfully in lab testing.

Novel Targets for Liver Fibrosis
Liver fibrosis, a progressive and often irreversible scarring of liver tissue, currently lacks effective targeted therapies. The AI Co-Scientist was tasked with proposing new epigenetic targets that could reverse or slow the fibrotic process.
The system produced several hypotheses, three of which focused on understudied chromatin regulators. Two of the proposed targets were linked to available compounds that were then tested in human hepatic organoids (miniaturized models of human liver tissue).
The results showed that both compounds demonstrated clear anti-fibrotic effects, validating the AI’s ability to point toward therapeutically relevant molecular targets.

Mechanism of Antimicrobial Resistance
To push the limits of novelty, researchers posed a harder challenge involving an unpublished mechanism of bacterial gene transfer related to antimicrobial resistance, something known only to the human team, not present in the literature.
The AI Co-Scientist was tasked with explaining how antibiotic resistance could spread between bacterial species. After 2 days of iterative reasoning, the system produced a hypothesis remarkably similar to a recently discovered, unpublished mechanism involving cf-PICIs (phage-inducible chromosomal islands that carry resistance genes).
This result is particularly significant because it means the AI was able to infer a complex, multi-step biological mechanism that took humans a decade to uncover, using only open-access data and iterative reasoning.

Limitations & Challenges
While the AI Co-Scientist represents a major leap forward in using AI to accelerate scientific discovery, it’s not without limitations. Like any early-stage innovation, it comes with both technical and practical challenges that need to be addressed before widespread adoption is possible.
Limited Access to Scientific Literature
One of the most significant constraints is that the AI Co-Scientist is currently limited to open-access literature. That means it cannot access many of the most influential or cutting-edge papers locked behind paywalls. This poses a risk of missing critical prior work, generating redundant hypotheses, or overlooking important negative results that are often underreported in open literature.
Experimental, Not Yet Plug-and-Play
Although the system has demonstrated success in controlled biomedical settings, it’s still largely experimental. Deploying it in a general-purpose lab or research team would require:
- Careful prompt design,
- Custom agent configuration,
- Expert interpretation of outputs.
This isn’t yet a product researchers can use off-the-shelf. The system still depends on human review and intervention to validate outputs and guide its use responsibly.
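To make the first of these requirements concrete, here is a hedged illustration of the kind of research-goal specification a scientist might have to write, modeled loosely on the liver fibrosis example above. The actual input format the system accepts is not public, so the field names below are assumptions for illustration only.

```python
# Purely illustrative: the actual input format of the AI Co-Scientist is not public.
# This mirrors the kind of goal, constraints, and preferences a scientist might need
# to spell out before the agents can do useful work.
research_goal = {
    "goal": "Propose epigenetic targets that could reverse or slow liver fibrosis.",
    "constraints": [
        "Targets must be druggable with existing or near-term compounds.",
        "Hypotheses must be testable in human hepatic organoids.",
    ],
    "preferences": [
        "Prioritize novel targets over incremental refinements of known ones.",
        "Ground every proposed mechanism in citable open-access literature.",
    ],
    "deliverable": "A ranked list of testable hypotheses with proposed experiments.",
}
```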
No Specialized Equipment, But High Compute Costs
The AI Co-Scientist doesn’t need physical lab tools to run, but it does require significant computational resources, especially when running multiple iterations and agents in parallel.
While this is manageable for tech companies or research institutions, it could be a barrier for small academic labs, research teams in low-resource environments, and individual users without access to high-performance cloud computing.
As of now, cost and scalability remain obstacles to democratized access.
Evaluation Metrics Are Still Evolving
The system uses Elo scores based on simulated scientific debates to rank hypotheses. While this has shown a strong correlation with quality, it’s still a proxy, not a ground-truth measurement. There’s no universally accepted benchmark for hypothesis quality, and expert judgment, while helpful, introduces subjectivity.
Wider adoption will require more robust, transparent evaluation frameworks that can generalize across domains.
Generalization Outside Biomedicine Is Unproven
All current validations focus on biomedical research—drug discovery, disease mechanisms, and molecular targets. It remains unclear how well the system performs in other fields such as materials science, physics, and environmental science.
Transferring this architecture across disciplines will require domain-specific grounding, tailored databases, and new agent behaviors adapted to different types of data and experiments.
Risk of Overreliance or Misuse
As with all powerful AI systems, there’s a risk that users might over-trust the outputs or fail to critically evaluate them. Without proper guardrails, the system could propose:
- Biologically plausible but unsafe experiments,
- Ethically questionable ideas,
- Hypotheses that violate domain constraints if misused.
Wider adoption will depend not just on technical performance, but on responsible integration into human scientific workflows—with transparency, oversight, and accountability.
The Road Ahead
The biggest challenge before this becomes widely adopted is bridging the gap between lab performance and real-world usability. That means building user-friendly interfaces, reducing compute costs, integrating with real-world data systems, and developing field-specific adaptations. With continued development, the AI Co-Scientist could become a standard tool for hypothesis generation across disciplines, but it’s not quite there yet.
What do you think? Would you trust an AI system to help generate research hypotheses in your field? Would this change how you approach your next project? What would be its impact across the scientific community?
The possibilities are exciting and scary at the same time. But what we can do now is embrace the fact that AI is here to stay. We’ve opened Pandora’s box, and our job is to ensure that tools like the AI Co-Scientist remain ethical and humane.
Reference: This article is based on the findings and data presented in the original research study. For full details, methodologies, and supporting information, you may access the research here.