My name is Sarmad Qadri and I’m the creator of the open source project mcp-agent. My philosophy for agent development in 2025 can be summarized as: MCP is all you need. Or more verbosely: connect state-of-the-art LLMs to MCP servers, and leverage simple design patterns to let them make tool calls, gather context, and make decisions.
Over the past few months, I’ve been asked many times when mcp-agent would support deep research workflows. So I set out to build an open source general purpose agent that can work like Claude Code for complex tasks, including deep research, but also multi-step workflows that require making MCP tool calls.
It turns out this is a lot more complex than expected, even though the architectural underpinnings are conceptually simple. This post shares the lessons I learned along the way, in the hope that they help others build their own deep research agents.
You can find the open-source Deep Orchestrator agent here: https://github.com/lastmile-ai/mcp-agent/src/mcp_agent/workflows/deep_orchestrator/
The first deep research agents started out with access to the internet and filesystem only. The promise of MCP is to dramatically expand that list of tools while adhering to the same architecture. The goal is to be able to do deep research connected to an internal data warehouse, or any data source accessible via an MCP tool or resource. Plus, being able to mutate state by performing tool calls turns the agent from just a research agent to something much more powerful and general-purpose.
So following the Deep Research approach, I settled on the following requirements:
With these goals in mind, my first instinct was to build an Orchestrator to manage complex queries.
I implemented the Orchestrator pattern from Anthropic’s Building Effective Agents blog post.
Architecture Components:
The first attempt worked somewhat well. The Orchestrator usually did a good job defining and executing a plan, and the architecture was simple and elegant. It was particularly effective for tasks where a full plan could be determined upfront.
For example, for this objective, here’s how the Orchestrator would break it down into a Plan:
Load the student’s short story from short_story.md, and generate a report with feedback across proofreading, factuality/logical consistency, and style adherence. Use the style rules from https://owl.purdue.edu/owl/research_and_citation/apa_style/apa_formatting_and_style_guide/general_format.html. Write the graded report to graded_report.md in the same directory as short_story.md.
However, the Orchestrator approach also uncovered a number of challenges:
To make plans more dynamic, I tried implementing an “iterative” plan mode, whereby the planner would only think of the immediate next step, then re-evaluate once that was completed, and keep going until the objective was accomplished.
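The iterative mode can be sketched as a simple loop, with placeholder functions standing in for the LLM-backed calls (the function names and the `max_steps` cutoff below are illustrative, not mcp-agent’s API):

```python
# Hedged sketch of the "iterative" plan mode: plan only the next step,
# execute it, re-evaluate, and repeat until the planner says we're done.

def plan_next_step(objective, history, max_steps=3):
    # Placeholder planner: in the real system this is an LLM call that
    # looks at the objective plus history, and returns the next step,
    # or None once the objective is considered accomplished.
    if len(history) >= max_steps:
        return None
    return f"step {len(history) + 1} toward: {objective}"

def execute_step(step):
    # Placeholder for dispatching the step to a sub-agent / MCP tool calls.
    return f"result of {step}"

def run_iteratively(objective):
    history = []
    while (step := plan_next_step(objective, history)) is not None:
        history.append(execute_step(step))
    return history
```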
The most counterintuitive lesson was that asking the Orchestrator to identify only the next immediate step worked substantially worse. Instinctively, giving the LLM more granular reasoning tasks should produce higher-quality outputs. In practice, however, asking the Orchestrator for a full high-level plan up front dramatically improved the output.
I suspect the reason is that asking the LLM to think of all steps requires it to reason more deliberately.
This Orchestrator pattern is pretty useful for certain types of tasks, and is part of mcp-agent. But it certainly isn’t the general-purpose deep research agent that I was after.
Over the past few months, several AI companies came out with deep research agents. Some of them published detailed blogs on their approaches (Anthropic, Jina AI).
Armed with their learnings and my own from Take 1, I set out to build an “Orchestrator” workflow that would “adapt” its plan and subtasks based on the objective. I named it “AdaptiveWorkflow”, which felt appropriate.
Compared to Take 1, the main architectural updates were:
I was really excited for Adaptive Workflow to address the shortcomings that my original Orchestrator had uncovered. With better workflow planning, budget management, external memory, more dynamic agent selection, and mode detection, my deep research agent should be more versatile and efficient!
The hundreds of unit tests Claude and I wrote all passed, and the individual components were working correctly. Time to press the On button and try it out!
And lo and behold… it didn’t work on real-world examples:
All of the ingredients for a deep research agent were there, matching the theory and architecture I had read, but for some reason, the whole wasn’t greater than the sum of the parts.
The key observation came from cases where the original Orchestrator outperformed Adaptive Workflow despite all of the latter’s bells and whistles.
I started debugging those queries and unlocked a key insight: a simpler architecture consistently wins.
So I wiped the slate clean and started by running the original Orchestrator in a loop:
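That loop can be sketched as: plan fully upfront, execute the plan, check whether the objective is satisfied, and replan with the accumulated results if not. All function names below are hypothetical stand-ins for LLM-backed calls:

```python
# Minimal sketch of running the original Orchestrator in a loop.

def make_plan(objective, prior_results):
    # Placeholder planner: produces a refined plan once results exist.
    suffix = " (refined)" if prior_results else ""
    return [f"research {objective}{suffix}", f"synthesize {objective}{suffix}"]

def execute_plan(plan):
    # Placeholder for executing each planned task via sub-agents.
    return [f"done: {task}" for task in plan]

def is_done(objective, results, attempt, max_attempts=2):
    # Placeholder verifier: the real system would ask an LLM whether
    # the objective is satisfied; here we just cap the attempts.
    return attempt >= max_attempts

def orchestrate_in_loop(objective, max_attempts=2):
    results, attempt = [], 0
    while True:
        plan = make_plan(objective, results)
        results.extend(execute_plan(plan))
        attempt += 1
        if is_done(objective, results, attempt, max_attempts):
            return results
```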
Next, I rewrote all the components I had built for Adaptive Workflow, but with new insights from what worked well in the base Orchestrator:
I instructed the LLM to generate a full plan upfront, so the queue is built with multiple steps (not just the next TODO step). I also added parallelism within each step’s sub-tasks for performance, while keeping the steps themselves sequential.
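A hedged sketch of that execution shape, assuming steps run in order while the sub-tasks inside each step run concurrently (the names below are illustrative):

```python
import asyncio

async def run_subtask(name):
    # Stand-in for an LLM call or MCP tool call.
    await asyncio.sleep(0)
    return f"done: {name}"

async def run_plan(plan):
    """plan is a list of steps; each step is a list of independent sub-tasks."""
    results = []
    for step in plan:  # steps execute sequentially, in plan order
        # sub-tasks within a step have no interdependencies, so run them
        # concurrently; gather preserves the sub-task order in its results
        results.extend(await asyncio.gather(*(run_subtask(t) for t in step)))
    return results
```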
The original Orchestrator was token-inefficient but context-rich: each task received the full context from all previous steps.
I reused most of the architecture for external memory and knowledge extraction from Adaptive Workflow, but improved how tasks utilized memory. The planner now specifies dependencies between tasks when generating the plan; these form a dependency graph that determines when context and memory should be propagated to each task.
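The idea of dependency-scoped context can be sketched with the standard library’s topological sorter. The task names and the shape of the dependency map are assumptions for illustration, not mcp-agent’s actual models:

```python
from graphlib import TopologicalSorter

# Each task maps to the tasks whose outputs it depends on.
tasks = {
    "load_story": [],
    "proofread": ["load_story"],
    "check_style": ["load_story"],
    "write_report": ["proofread", "check_style"],
}

def run_with_dependencies(tasks):
    outputs = {}
    # static_order() yields tasks so that every dependency runs first.
    for name in TopologicalSorter(tasks).static_order():
        # Context is limited to the outputs of declared dependencies,
        # instead of the full transcript of every prior step.
        context = {dep: outputs[dep] for dep in tasks[name]}
        outputs[name] = f"{name} (given {sorted(context)})"
    return outputs
```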
Additions to the Task and Plan models between Take 2 and Take 3:
I also added a “full context propagation mode,” but this is much less token efficient and usually not necessary.
LLMs hallucinate. They’re better than they were before, but they’re still not perfect. So I added deterministic plan verification before the plan is executed. The verification validates:
If the plan verification fails, Deep Orchestrator generates an error message prompt with all the issues identified and asks the Planner LLM to create a new plan to address the issues (this is similar to the Evaluator-Optimizer pattern explained in Building Effective Agents, but involves deterministic verification).
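A minimal sketch of such a verifier, checking a few representative issues (duplicate task names, unknown dependencies, dependency cycles); the concrete checks in Deep Orchestrator may differ:

```python
from graphlib import TopologicalSorter, CycleError

def verify_plan(plan):
    """Return a list of human-readable issues; an empty list means valid."""
    issues = []
    names = [task["name"] for task in plan]
    if len(names) != len(set(names)):
        issues.append("duplicate task names in plan")
    known = set(names)
    graph = {}
    for task in plan:
        deps = task.get("depends_on", [])
        missing = [d for d in deps if d not in known]
        if missing:
            issues.append(f"{task['name']} depends on unknown tasks: {missing}")
        graph[task["name"]] = deps
    try:
        # Forcing a full topological order surfaces any dependency cycle.
        list(TopologicalSorter(graph).static_order())
    except CycleError:
        issues.append("dependency cycle detected")
    return issues
```

On failure, the returned issues can be formatted into an error prompt and sent back to the Planner LLM for a revised plan.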
Combining deterministic (code-based) validation with LLM execution was a powerful improvement to the architecture.
There’s more to agents than just LLMs. If we can check something deterministically with code, always prefer that over doing the same with an LLM.
Good prompting really matters. Just take a look at the reverse-engineered Claude Code system prompt. One thing I learned from Roo Code is to build up the prompts progressively in a functional/programmatic way instead of just long, giant strings.
So I added functions that can add sections to prompts, and to organize them all, I switched to XML tags (there are other approaches too).
You can see some of this in action in mcp-agent/src/mcp_agent/workflows/deep_orchestrator/prompts.py. For example, one function there builds up a prompt for synthesizing a set of knowledge artifacts.
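A simplified sketch of the pattern, with illustrative function names rather than the repo’s actual code:

```python
# Build prompts from named sections instead of one giant string.
# XML tags disambiguate each section for the model.

def xml_section(tag, body):
    return f"<{tag}>\n{body}\n</{tag}>"

def build_synthesis_prompt(objective, knowledge_items):
    sections = [
        xml_section("objective", objective),
        xml_section(
            "knowledge",
            "\n".join(f"- {item}" for item in knowledge_items),
        ),
        xml_section("instructions", "Synthesize the knowledge above into a report."),
    ]
    return "\n\n".join(sections)
```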
Using XML tags (or some other structured language within the string) to disambiguate sections really helped.
Finally, I got rid of the complex Mode Detection from Adaptive Workflow and instead built a simple policy engine that would decide whether to:
Having a module dedicated to decision-making helped simplify the architecture and showed me that my complex mode selection was a hack masquerading as a feature.
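As an illustration, such a policy engine can be a handful of deterministic rules. The signals, thresholds, and action names below are assumptions, not the actual decisions Deep Orchestrator chooses between:

```python
# Hypothetical policy engine: map simple, code-checkable signals to one action.

def decide(state):
    if state["budget_used"] >= 1.0:
        return "force_complete"       # out of budget: wrap up with what we have
    if state["consecutive_failures"] >= 3:
        return "replan"               # repeated failures: ask the planner again
    if state["queue_empty"]:
        return "verify_objective"     # nothing queued: check if we're done
    return "execute_next_task"        # default: keep working the queue
```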
Putting all these new pieces together worked pretty well! You can try it out in this AI Alliance example application, Deep Research Agent for Finance.
The full Deep Orchestrator flow (link to source):
I over-complicated things with Adaptive Workflow. Even though the base components were correct (TODO queue, external memory/knowledge, lead planner, etc.), they didn’t work well together because each component was individually too complex.
I didn’t implement anything specifically for deep research when building Deep Orchestrator. The fundamental building block of MCP-Agent is MCP servers, so the same agent that’s performing general tasks can also be used for deep research. This is the power of the generalizability of MCP.
It’s a little cliché, but the difference between an agent that works well and one that doesn’t lies in the many small decisions made along the way. The base components of Adaptive Workflow (attempt 2) and Deep Orchestrator (attempt 3) are really similar, but it took a lot of small tweaks to get Deep Orchestrator working well.
I’m really happy with the progress that brought us to Deep Orchestrator, but there are a few things I’m excited about that will take it to the next level.
For the latest improvements and projects with MCP-Agent, check out the open source repo: mcp-agent.
Also, check out the AI Alliance’s Deep Research Agent for Finance open-source project, which uses the Deep Research architecture for searching financial information and preparing reports.
Sarmad Qadri