You're building multi-agents wrong
How to design human-like multi-agent systems
October 20, 2025
In my time building agents at Stanford and Databricks, I've seen all kinds of architectures, frameworks, and approaches. Multi-agent systems have been incredibly tempting, but results have been mixed. In this post, I'll lay out some of the work I've done on multi-agents since January and explain how we're thinking about the problem differently. This post would not be possible without Charlotte Yan's incredible understanding and implementation of agent memory and the mentorship of Hao Zhu.
Why we don’t build multi-agents
Building multi-agent systems is one of the most compelling ideas we've seen in the world of agents. However, most attempts to build them so far have been extremely limited, as argued in Cognition's Don't Build Multi-Agents. That post lays out the temptation of decomposition but shows that the lack of coordination between agents nullifies any benefits gained. Most multi-agent systems we've seen have been academic in nature, requiring intense custom prompt engineering to design effective agents and failing to hold up in production. One of the few successful multi-agent systems we've seen is Claude Code, which uses sub-agents to explore a codebase without crowding context, but even this only scratches the surface of what we had imagined.
When humans collaborate in professional environments, we focus on tasks that play to each of our strengths and avoid those where we're weak. Each person brings a background of experience and knowledge that makes them particularly suited to certain tasks. We compose our team of the individuals best suited to our goals and invite or train new members as needed. We collaborate asynchronously, writing documents and notes to shared places where others can find them when relevant.
This model of collaboration is a complete contrast to how multi-agent systems are used today. Right now, we set up each agent as if it were a new hire, jammed with instructions in elaborate prompts but without any past experience or real specialization. We then talk to them only through an orchestrator that quickly gets overwhelmed by exploding amounts of context and has to play a game of telephone with its sub-agents. It should now be fairly obvious why we don't build multi-agents right now.
What is Agent Library
Agent Library is a new approach to building multi-agent systems based on a more human-like model of collaboration. It's named after the agent library, an ever-growing collection of saved specialists who can be drawn upon as needed. We keep a hierarchical structure with an orchestrator, but instead of making it all-powerful, we narrow its role to that of a project manager: it assigns tasks but doesn't dictate how they should be done. The orchestrator, like the manager of a human team, is responsible for choosing the agents on its team or hiring new ones when needed. The sub-agents collaborate just like humans, communicating asynchronously through a shared file system.
Specialized agents (Deep Research, Software Engineering, Product Designer, Data Analyst) with memory, tools, and prompts stored in the library.
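To make this concrete, here's a minimal sketch of what a library entry might look like. The names and fields (AgentSpec, AgentLibrary, hire, and so on) are illustrative assumptions, not a fixed API:

```python
from dataclasses import dataclass, field

@dataclass
class AgentSpec:
    """A saved specialist: everything needed to re-instantiate it later."""
    name: str                      # e.g. "deep-research"
    system_prompt: str             # accumulated, feedback-refined instructions
    tools: list[str] = field(default_factory=list)
    memory_dir: str = ""           # where its semantic/procedural/episodic stores live

class AgentLibrary:
    """Grows over time; the orchestrator draws specialists from here."""
    def __init__(self) -> None:
        self._agents: dict[str, AgentSpec] = {}

    def save(self, spec: AgentSpec) -> None:
        self._agents[spec.name] = spec

    def retrieve(self, name: str) -> AgentSpec | None:
        return self._agents.get(name)

    def hire(self, name: str, prompt: str, tools: list[str]) -> AgentSpec:
        """Create a new specialist when no existing one fits the task."""
        spec = AgentSpec(name=name, system_prompt=prompt, tools=tools)
        self.save(spec)
        return spec
```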
Orchestrating multi-agents
The orchestrator is adapted from the lead-agent pattern in Magentic-One, but with a few key differences. Instead of having the orchestrator control the entire flow of information throughout the system, we use it more like a human manager. Upon receiving a task, the orchestrator acts as a planner. After a simple deep-research run, it returns a single Markdown plan, akin to Cursor's Plan mode, explaining the intended tasks and outcomes for sub-agents as well as expected opportunities for parallelism. This plan is effectively a DAG defining how the agents should work, and it lets the orchestrator select the team of agents for the task in one step. We retrieve the selected agents from the library, and if an agent isn't present, the orchestrator creates a new one with the most relevant initial tools and instructions.
The orchestrator creates a plan as a DAG for agents to follow
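As a sketch, the plan can be represented as a list of tasks with dependencies; grouping tasks whose dependencies are already met yields the parallel waves the orchestrator dispatches. The structure below (PlanTask, parallel_waves) is an illustrative assumption, not our exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class PlanTask:
    id: str
    agent: str                      # which library agent runs this task
    description: str                # what to achieve, not how
    depends_on: list[str] = field(default_factory=list)

def parallel_waves(tasks: list[PlanTask]) -> list[list[PlanTask]]:
    """Group tasks into waves that can run concurrently (topological levels)."""
    done: set[str] = set()
    remaining = {t.id: t for t in tasks}
    waves: list[list[PlanTask]] = []
    while remaining:
        ready = [t for t in remaining.values() if set(t.depends_on) <= done]
        if not ready:
            raise ValueError("cycle in plan DAG")
        waves.append(ready)
        for t in ready:
            done.add(t.id)
            del remaining[t.id]
    return waves
```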
Beyond this, the orchestrator's role is simply evaluation. When an agent finishes, the orchestrator gives the sub-agent feedback on its performance and the artifacts it produced, which is then used to update the agent's system prompt and memory. By minimizing the orchestrator's role, we reduce the risk of it becoming a bottleneck and give the more specialized sub-agents room to work more autonomously.
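A minimal close-out step might look like the following, reusing the AgentSpec and PlanTask shapes from the sketches above; judge stands in for an LLM call, and the exact wiring is an assumption:

```python
def evaluate_and_close_out(agent_spec: AgentSpec, task: PlanTask,
                           artifacts: list[str], judge) -> str:
    """Orchestrator-side review once a sub-agent finishes. `judge` stands in
    for an LLM call; the returned feedback feeds the agent's prompt and
    memory updates (see the memory sketches below)."""
    feedback = judge(
        f"Task: {task.description}\n"
        f"Artifacts produced: {artifacts}\n"
        "Give concise, actionable feedback on this agent's performance."
    )
    # Simplification: append the note directly; in practice the agent folds
    # feedback into its procedural and episodic memory on its own.
    agent_spec.system_prompt += f"\n## Reviewer note\n{feedback}"
    return feedback
```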
Agents need memory
We base our agent memory system on CoALA (Cognitive Architectures for Language Agents), a framework from Shunyu Yao and colleagues at Princeton that aims to improve agent memory by mimicking how humans store information. CoALA uses a hierarchy of memory stores, each with a distinct, well-defined purpose.
Semantic
Semantic memory is the most basic element of the CoALA framework and the one most commonly seen in products like Supermemory and Letta. We implement it as a simple knowledge graph that lets agents retrieve previously seen facts and concepts relevant to the current task. After each run, we extract key facts and store them in the graph.
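A toy version of that store, with plain (subject, relation, object) triples standing in for a real knowledge-graph backend; the interface is an assumption:

```python
class SemanticMemory:
    """Facts stored as (subject, relation, object) triples; a stand-in
    for a real knowledge-graph backend."""
    def __init__(self) -> None:
        self.triples: set[tuple[str, str, str]] = set()

    def store(self, subject: str, relation: str, obj: str) -> None:
        self.triples.add((subject, relation, obj))

    def recall(self, entity: str) -> list[tuple[str, str, str]]:
        """Everything previously learned that mentions this entity."""
        return [t for t in self.triples if entity in (t[0], t[2])]

# After each run, extracted facts get written back:
mem = SemanticMemory()
mem.store("Agent Library", "uses", "shared file system")
print(mem.recall("Agent Library"))
```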
Procedural
Procedural memory is a more specialized store that allows agents to learn relevant skills and procedures that can be applied to specific tasks. After each verified successful trajectory, the agent summarizes the essential procedures and ideas used during its reasoning and updates its system prompt with the new knowledge, guided by the feedback it received from the orchestrator.
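Sketched out, the update is a single distillation step. Here summarize stands in for an LLM call, and the prompt-append format is an illustrative choice rather than our exact implementation:

```python
def update_procedural_memory(system_prompt: str, trajectory: str,
                             feedback: str, summarize) -> str:
    """After a verified success, distill the reusable procedures from a run
    and fold them into the agent's system prompt. `summarize` stands in for
    an LLM call."""
    lesson = summarize(
        "Extract the reusable steps and ideas from this successful run, "
        "in light of the reviewer's feedback.\n\n"
        f"Run:\n{trajectory}\n\nFeedback:\n{feedback}"
    )
    return system_prompt + "\n\n## Learned procedure\n" + lesson
```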
Episodic
Episodic memory is the simplest store but also the most powerful. After receiving feedback on a trajectory, the agent saves the entire episode. On later attempts, it retrieves the top-K most relevant episodes, which can contain exactly the previously seen information needed to improve its performance. Simply retrieving self-collected in-context examples has been shown to significantly improve agent performance on tasks like world navigation and complex problem solving.
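A minimal sketch of that store, with word overlap standing in for a real embedding-based similarity; the interface is assumed, not our exact one:

```python
class EpisodicMemory:
    """Whole past episodes, retrieved top-K by similarity to the new task.
    Word overlap stands in for a real embedding model here."""
    def __init__(self) -> None:
        self.episodes: list[tuple[str, str]] = []  # (task, trajectory + feedback)

    def save(self, task: str, episode: str) -> None:
        self.episodes.append((task, episode))

    def retrieve(self, task: str, k: int = 3) -> list[str]:
        def score(past_task: str) -> float:
            a, b = set(task.lower().split()), set(past_task.lower().split())
            return len(a & b) / max(len(a | b), 1)  # Jaccard similarity
        ranked = sorted(self.episodes, key=lambda e: score(e[0]), reverse=True)
        return [episode for _, episode in ranked[:k]]
```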
A shared file system
LLMs speak Markdown. They are also heavily post-trained to navigate a file system, read files, and write them. We use this to our advantage by letting our agents communicate through Markdown files written to a shared file system. This allows for asynchronous workflows and selective information sharing, preventing unnecessary context bloat. It's similar to how humans use tools like Notion; the pattern was initially popularized by tools like Manus and is being heavily reinforced in agents like Claude Code, which are quite glad to write Markdown files into your codebase.
Agents save markdown files as they go.
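In its simplest form, this is just agents writing and globbing Markdown under a shared directory. The layout and helper names below (publish, read_relevant, the workspace path) are illustrative assumptions:

```python
from pathlib import Path

WORKSPACE = Path("workspace")  # assumption: a directory all agents can see

def publish(agent: str, topic: str, body: str) -> Path:
    """An agent writes its findings as Markdown where teammates can find them."""
    path = WORKSPACE / agent / f"{topic}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"# {topic}\n\n{body}\n", encoding="utf-8")
    return path

def read_relevant(topic: str) -> str:
    """Another agent pulls in only the notes it needs, avoiding context bloat."""
    notes = sorted(WORKSPACE.glob(f"*/{topic}.md"))
    return "\n\n---\n\n".join(p.read_text(encoding="utf-8") for p in notes)

# e.g. the research agent publishes, the engineering agent later reads:
publish("deep-research", "competitor-analysis", "Key findings...")
print(read_relevant("competitor-analysis"))
```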
We are just scratching the surface of what we hope multi-agents can do. Cognition recently demoed sub-agents using a custom model for efficient code search. Anthropic is all in with Claude Code and its Research implementation. By designing our harnesses to play to the model's strengths and adopting a more human architecture, we can take a step toward a more unified future in which our agentic teams act just like our own.