Date: 2026-06-08 09:37:55
If you ask a frontier artificial intelligence model to write an essay about the fall of Rome, it will spit out a flawless narrative in seconds.
But ask that same system to diagnose a rare disease or find a needle-in-a-haystack molecular structure for a new drug, and it will often freeze. It turns out that today’s AI is brilliant at answering questions, but catastrophically bad at asking them.
To probe this gap, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Harvard’s School of Engineering and Applied Sciences (SEAS) sat advanced AI models down to play a game of Battleship.
The results revealed a reality about the current state of artificial intelligence: size doesn’t equal curiosity.
“Today’s language models are primarily optimized to answer complex queries, but it’s less clear whether they learn to ask good questions for themselves,” said Gabriel Grand, an MIT PhD student and CSAIL researcher.
“Our work shows that asking informative questions depends on the ability to predict and simulate the world. We find that when we give agents access to a ‘world model,’ they ask better questions and make discoveries more efficiently,” the lead author added.
Question-asking skills
To test this, the team created “Collaborative Battleship.” In this natural-language version of the classic board game, one AI acts as the “captain,” guessing where the hidden vessels are by asking questions. Another AI plays the “spotter,” answering in real time.
Building the “BattleshipQA” dataset from over 40 human players, researchers compared human strategic thinking with that of language models such as GPT-5 and the smaller Llama 4 Scout.
When left to their own devices, language models (LMs) like OpenAI’s heavily anticipated GPT-5 performed decently, but smaller models behaved completely irrationally.
To close that gap, researchers equipped the models with a Monte Carlo inference strategy that re-estimates how likely each candidate option is to be correct after every response. This addition transformed the underperforming Llama 4 Scout, raising its win rate against human players from 8 percent to 82 percent.
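To make the idea concrete, here is a minimal Python sketch of Monte Carlo question selection. It is an illustration under stated assumptions, not the researchers' implementation: the 4x4 board, the single two-cell ship, the three candidate questions, and the use of binary entropy as the score for how informative a yes/no question is are all hypothetical choices made for this example.

```python
import math
import random

GRID = 4  # hypothetical 4x4 board, far smaller than a real game


def sample_board(rng):
    """Place one length-2 ship at random; return the set of cells it covers."""
    if rng.random() < 0.5:  # horizontal placement
        r, c = rng.randrange(GRID), rng.randrange(GRID - 1)
        return frozenset({(r, c), (r, c + 1)})
    r, c = rng.randrange(GRID - 1), rng.randrange(GRID)  # vertical placement
    return frozenset({(r, c), (r + 1, c)})


def binary_entropy(p):
    """Entropy in bits of a yes/no answer that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_info_gain(question, boards):
    """Estimate how informative a question is from sampled boards.

    Assumes the spotter always answers truthfully, so the expected
    information gain equals the entropy of the answer.
    """
    p_yes = sum(question(b) for b in boards) / len(boards)
    return binary_entropy(p_yes)


def best_question(questions, boards):
    """Return the text of the question with the highest estimated gain."""
    return max(questions, key=lambda text: expected_info_gain(questions[text], boards))


def update(boards, question, answer):
    """Keep only the sampled boards that agree with the spotter's answer."""
    return [b for b in boards if question(b) == answer]


# Hypothetical candidate questions, each written as a predicate on a board.
QUESTIONS = {
    "Is any part of the ship in the top half?": lambda b: any(r < GRID // 2 for r, _ in b),
    "Is the ship touching cell A1?": lambda b: (0, 0) in b,
    "Is the ship horizontal?": lambda b: len({r for r, _ in b}) == 1,
}

rng = random.Random(0)
boards = [sample_board(rng) for _ in range(2000)]
choice = best_question(QUESTIONS, boards)
remaining = update(boards, QUESTIONS[choice], True)
```

In this toy setup, a question that splits the sampled boards roughly in half scores close to one bit, while a long shot such as asking about a single cell scores far lower. After each answer, `update` discards the sampled boards that contradict it, so the estimates follow the game as it unfolds.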
Beyond asking better questions, the researchers also improved how language models answer them, closing a gap where smaller AI systems frequently gave incorrect responses about hidden ship locations.
By introducing a method in which the models automatically converted natural-language questions into code, the researchers forced the systems to explicitly verify their data before responding. This code-based verification strategy boosted the models’ answering accuracy by an average of 15 percent, helping even smaller systems act as more reliable teammates.
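The sketch below illustrates the general shape of code-based answering in Python. It is a hedged example rather than the team's actual pipeline: the board layout, the ship names, and the `TRANSLATIONS` table (a hard-coded stand-in for the step where a language model would write the code) are all assumptions made for illustration.

```python
# True board, visible only to the spotter: ship name -> set of (row, col) cells.
# The layout and ship names are hypothetical.
TRUE_BOARD = {
    "red": {(0, 0), (0, 1), (0, 2)},
    "blue": {(2, 3), (3, 3)},
}

# Stand-in for the language model: in a real system the model would generate
# this Python source from the question text. Here it is a fixed lookup table.
TRANSLATIONS = {
    "Is the red ship horizontal?":
        "len({r for r, _ in board['red']}) == 1",
    "Is any ship in column 4?":
        "any(c == 3 for cells in board.values() for _, c in cells)",
    "Is the blue ship longer than the red ship?":
        "len(board['blue']) > len(board['red'])",
}

# Only these built-in functions are exposed to the generated expressions.
ALLOWED_BUILTINS = {"len": len, "any": any, "all": all, "sum": sum}


def translate(question):
    """Return Python source for a question (hypothetical model call)."""
    return TRANSLATIONS[question]


def answer(question, board):
    """Answer by running code against the board instead of guessing.

    Warning: eval on model-written code is unsafe outside a sandbox.
    Restricting builtins, as done here, is not a real security boundary.
    """
    source = translate(question)
    namespace = {"__builtins__": ALLOWED_BUILTINS, "board": board}
    return "yes" if eval(source, namespace) else "no"


for q in TRANSLATIONS:
    print(q, "->", answer(q, TRUE_BOARD))
```

The design point is that the answer comes from executing a check against the actual board state, so a translation that is correct yields a correct answer every time, rather than depending on the model reading the board accurately in its head.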
Advancing scientific discoveries
To improve the AI “spotters,” the team had the models automatically translate natural-language questions into Python code, giving the systems precise instructions to verify the data before responding.
This combination allowed the captain to extract far more information while boosting answering accuracy across the board. It yielded a nearly 30 percent performance bump for the lightweight GPT-4o-mini and an eight-point jump for the large Claude 4 Opus.
“What I find most exciting about this work is that it opens up the possibility of using these techniques to generate better solutions in the first place, by improving LMs’ exploration and information-gathering capabilities. We are excited to scale this work up from scientific domains to applications like coding and mathematical problem-solving,” said Jacob Andreas, senior author.
When tested on the game “Guess Who?”, this approach boosted the success rate of the smaller Llama 4 Scout from 30 percent to over 72 percent and raised GPT-4o’s success rate from 62 percent to 90 percent.
This strategic exploration ability holds massive potential for real-world “needle-in-a-haystack” scientific discoveries, such as identifying molecular structures.