Four deep research APIs tested on quality and token cost
An internal research tool needed to take a messy question, read many sources, and return an answer that a person could act on, with citations that held up. Four hosted deep research APIs were run on the same task to decide which one to integrate. The task was designed to break shallow AI agents: the first three sources contradicted each other, one source was quietly out of date, and the correct answer required noticing that the question itself contained a false premise.
The scoring checked whether the final answer caught that false premise, whether every claim linked back to a real source, and how many tool calls and tokens were spent. The main difference was not how many pages each system could search. The real gap appeared when sources disagreed.
The weaker two systems followed whichever source they saw last and gave a confident wrong answer, while the better two called out the conflict and resolved it.
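The scoring criteria described above can be sketched as a small record per run. This is a minimal illustration, not the harness used in the test: the names `Claim`, `RunResult`, `citation_rate`, and `score` are hypothetical, and "real source" is approximated here as membership in a set of URLs the evaluator has verified by hand.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Claim:
    text: str
    source_url: Optional[str]  # None means the claim was left uncited


@dataclass
class RunResult:
    system: str                 # label for the API under test (hypothetical)
    caught_false_premise: bool  # judged by a human or a separate grader
    claims: List[Claim]
    tool_calls: int
    tokens: int


def citation_rate(run: RunResult, verified_sources: Set[str]) -> float:
    """Fraction of claims whose citation points at a verified source."""
    if not run.claims:
        return 0.0
    backed = sum(1 for c in run.claims if c.source_url in verified_sources)
    return backed / len(run.claims)


def score(run: RunResult, verified_sources: Set[str]) -> dict:
    """Collect the four measurements into one comparable row."""
    return {
        "system": run.system,
        "caught_false_premise": run.caught_false_premise,
        "citation_rate": citation_rate(run, verified_sources),
        "tool_calls": run.tool_calls,
        "tokens": run.tokens,
    }
```

Keeping the false premise check as a separate boolean, rather than folding it into one blended number, means a system that cites well but answers the wrong question still shows up as a failure.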
Key points
- Four deep research APIs were tested on one difficult multi-step task.
- The test measured false premise detection, real citations, tool calls, and token use.
- The weaker systems chose the last source they saw and answered wrongly with confidence.
- The stronger systems noticed source conflicts and worked through them.
- Cost comparisons for agents should weigh answer quality alongside tokens used.
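One way to combine quality and token use in a single figure is tokens spent per passing run. This metric is an assumption chosen for illustration, not something the test reports, and `tokens_per_pass` is a hypothetical helper. A run counts as passing only if the evaluator marked the answer correct.

```python
from typing import List, Optional, Tuple


def tokens_per_pass(runs: List[Tuple[bool, int]]) -> Optional[float]:
    """Total tokens across all runs divided by the number of passing runs.

    Each run is (passed, tokens). Failed runs still add to the token total,
    so a cheap system that is often wrong does not look cheap.
    Returns None when no run passed, since the cost is then undefined.
    """
    total_tokens = sum(tokens for _, tokens in runs)
    passes = sum(1 for passed, _ in runs if passed)
    if passes == 0:
        return None
    return total_tokens / passes
```

With made-up numbers, `tokens_per_pass([(True, 1000), (False, 3000)])` gives `4000.0`: the failed run's tokens are charged against the single correct answer.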
Quick term guide
- citations
- Citations show which source a claim or answer came from.
- deep research APIs
- Online services that let software search, read sources, and produce research-style answers.
- Deep Research
- An AI feature that produces longer research-style answers using sources or gathered information.
- AI agents
- AI agents are AI tools that can carry out steps toward a goal, not just answer once.
- tool calls
- Times when an AI agent uses a tool or another function, such as search, calculation, or file access.
- evaluation
- A process of testing and scoring how well an AI performed its specific task.