This is a study journal of the blog post “Building effective agents”.
What are agents
We can separate systems built on LLM capabilities into two categories:
- Workflows: systems where LLMs and tools are orchestrated through predefined code paths.
- Agents: systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.
When (and when not) to use agents
- Find the simplest solution possible, which might mean not building agentic systems at all. Increase complexity only when needed: agentic systems often trade latency and cost for better task performance, so consider when this tradeoff makes sense.
- for many applications, optimizing single LLM calls with retrieval and in-context examples (RAG) is usually enough.
- When more complexity is warranted, workflows offer predictability and consistency for well-defined tasks.
- Agents are the better option when flexibility and model-driven decision-making are needed at scale.
When to use frameworks
Frameworks for agentic systems:
- LangGraph from LangChain
- Amazon Bedrock’s AI Agent framework
- Rivet, a drag and drop GUI LLM workflow builder
- Vellum, another GUI tool for building and testing complex workflows
- AutoGen, LlamaIndex, DSPy, OpenHands
- Start by using LLM APIs directly: many patterns can be implemented in a few lines of code.
- If you do use a framework, make sure you understand the underlying code. Incorrect assumptions about what’s under the hood are a common source of error.
Agentic system patterns
Building blocks - the augmented LLM
- The basic building block of agentic systems is an LLM enhanced with augmentations such as retrieval, tools, and memory.
- Current LLMs can actively use these capabilities:
- generating their own search queries
- selecting appropriate tools
- determining what information to retain
Two key aspects of implementation:
- tailoring these capabilities to the specific use case
- ensuring they provide an easy, well-documented interface for the LLM (for example, through MCP)
These augmented capabilities should be accessible for each LLM call.
Compositional Workflows
1. Prompt chaining
- Prompt chaining decomposes a task into a sequence of steps, where each LLM call processes the output of the previous one.
- Programmatic checks (the “gate” in the blog’s diagram) can be added at any intermediate step to ensure that the process is still on track.
When to use this workflow:
- Prompt chaining is ideal for situations where the task can be easily and cleanly decomposed into fixed subtasks.
- The main goal is to trade off latency for higher accuracy, by making each LLM call an easier task.
Examples where prompt chaining is useful:
- Generating marketing copy, then translating it into a different language.
- Writing an outline of a document, checking that the outline meets certain criteria, then writing the document based on the outline.
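The outline example can be sketched as a two-step chain with a gate in between. This is a minimal sketch, not the blog’s implementation: `fake_llm` is a hypothetical stand-in that returns canned text so the example runs offline, and `outline_gate` and `write_document` are names made up for illustration.

```python
from typing import Callable


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; returns canned text."""
    if prompt.startswith("OUTLINE:"):
        return "1. Intro\n2. Method\n3. Results"
    if prompt.startswith("DRAFT:"):
        outline = prompt[len("DRAFT:"):].strip()
        return "Document based on outline:\n" + outline
    return ""


def outline_gate(outline: str, min_sections: int = 3) -> bool:
    """Programmatic check: the outline needs enough numbered sections."""
    sections = [line for line in outline.splitlines() if line[:1].isdigit()]
    return len(sections) >= min_sections


def write_document(topic: str, llm: Callable[[str], str] = fake_llm) -> str:
    outline = llm(f"OUTLINE: {topic}")      # step 1: write the outline
    if not outline_gate(outline):           # gate: stop early if off track
        raise ValueError("outline failed the gate; stopping the chain")
    return llm(f"DRAFT: {outline}")         # step 2: consumes step 1's output


print(write_document("agents"))
```

Swapping `fake_llm` for a real API call keeps the structure unchanged: each step stays small, and the gate is ordinary code rather than another model call.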
2. Routing
- Routing classifies an input and directs it to a specialized followup task.
- This workflow allows for separation of concerns, and building more specialized prompts. Without this workflow, optimizing for one kind of input can hurt performance on other inputs.
When to use this workflow:
- Routing works well for complex tasks where there are distinct categories that are better handled separately,
- and where classification can be handled accurately, either by an LLM or a more traditional classification model/algorithm.
Examples where routing is useful:
- Directing different types of customer service queries (general questions, refund requests, technical support) into different downstream processes, prompts, and tools.
- Routing easy/common questions to smaller models, and hard questions to more capable models to optimize cost and speed.
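The customer service example can be sketched as a classifier plus a table of specialized prompts. This is a sketch under assumptions: `classify` uses keyword rules as a hypothetical stand-in for an LLM or a traditional classifier, and the prompt texts in `ROUTE_PROMPTS` are invented for illustration.

```python
def classify(query: str) -> str:
    """Hypothetical classifier: keyword rules standing in for an LLM or ML model."""
    text = query.lower()
    if "refund" in text:
        return "refund"
    if "error" in text or "crash" in text:
        return "technical"
    return "general"


# Each category gets its own specialized prompt (hypothetical wording).
ROUTE_PROMPTS = {
    "refund": "You handle refund requests. Ask for the order number.",
    "technical": "You are a support engineer. Ask for logs and versions.",
    "general": "You answer general product questions briefly.",
}


def route(query: str) -> tuple[str, str]:
    """Return the chosen category and the prompt to send downstream."""
    category = classify(query)
    return category, f"{ROUTE_PROMPTS[category]}\n\nUser: {query}"


category, prompt = route("The app crashes on start")
print(category)
print(prompt)
```

Because each prompt is only ever shown one kind of input, it can be tuned without affecting the other routes.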
3. Parallelization
LLMs can sometimes work simultaneously on a task and have their outputs aggregated programmatically.
This workflow, parallelization, manifests in two key variations:
- Sectioning: Breaking a task into independent subtasks run in parallel.
- Voting: Running the same task multiple times to get diverse outputs.
When to use this workflow:
- Parallelization is effective when the divided subtasks can be parallelized for speed,
- or when multiple perspectives or attempts are needed for higher confidence results.
For complex tasks with multiple considerations, LLMs generally perform better when each consideration is handled by a separate LLM call, allowing focused attention on each specific aspect.
Examples where parallelization is useful:
- Sectioning:
- Implementing guardrails, where one model instance processes user queries while another screens them for inappropriate content or requests.
- Automating evals for evaluating LLM performance, where each LLM call evaluates a different aspect of the model’s performance on a given prompt.
- Voting:
- Reviewing a piece of code for vulnerabilities, where several different prompts review and flag the code if they find a problem.
- Evaluating whether a given piece of content is inappropriate, with multiple prompts evaluating different aspects or requiring different vote thresholds to balance false positives and negatives.
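Both variations can be sketched with a thread pool. This is a rough sketch only: `screen_for_policy`, `answer`, and the checks in `REVIEWERS` are hypothetical stand-ins for separate LLM calls, implemented here as simple string tests so the code runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def screen_for_policy(query: str) -> bool:
    """Hypothetical guardrail check; a real one would be an LLM call."""
    return "password" not in query.lower()


def answer(query: str) -> str:
    """Hypothetical answering call."""
    return f"Answer to: {query}"


def guarded_answer(query: str) -> str:
    # Sectioning: the guardrail and the answer are independent subtasks
    # submitted to the pool at the same time.
    with ThreadPoolExecutor(max_workers=2) as pool:
        allowed = pool.submit(screen_for_policy, query)
        reply = pool.submit(answer, query)
        return reply.result() if allowed.result() else "Request declined."


# Voting: several reviewers inspect the same input (hypothetical checks).
REVIEWERS = [
    lambda code: "eval(" in code,
    lambda code: "os.system(" in code,
    lambda code: "pickle.loads(" in code,
]


def flag_code(code: str, threshold: int = 1) -> bool:
    """Flag the code when at least `threshold` reviewers report a problem."""
    with ThreadPoolExecutor() as pool:
        votes = list(pool.map(lambda check: check(code), REVIEWERS))
    return sum(votes) >= threshold


print(guarded_answer("What is RAG?"))
print(flag_code("eval(user_input)"))
```

Raising `threshold` trades false positives for false negatives, which is the balance the voting example describes.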
4. Orchestrator-workers
In the orchestrator-workers workflow, a central LLM dynamically breaks down tasks, delegates them to worker LLMs, and synthesizes their results.
When to use this workflow:
- This workflow is well-suited for complex tasks where you can’t predict the subtasks needed (in coding, for example, the number of files that need to be changed and the nature of the change in each file likely depend on the task).
- Although it is topographically similar to parallelization, the key difference is its flexibility: subtasks aren’t pre-defined, but determined by the orchestrator based on the specific input.
Examples where orchestrator-workers is useful:
- Coding products that make complex changes to multiple files each time.
- Search tasks that involve gathering and analyzing information from multiple sources for possibly relevant information.
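The difference from parallelization shows up in code as a subtask list that is computed from the input rather than fixed in advance. In this sketch, `plan_subtasks`, `worker`, and `synthesize` are hypothetical stand-ins for LLM calls; the planner simply splits the task text on " and " so the example runs offline.

```python
def plan_subtasks(task: str) -> list[str]:
    """Hypothetical orchestrator call: derives subtasks from the input itself."""
    # A real orchestrator would ask an LLM; here we split on " and ".
    return [part.strip() for part in task.split(" and ") if part.strip()]


def worker(subtask: str) -> str:
    """Hypothetical worker call handling a single subtask."""
    return f"done: {subtask}"


def synthesize(results: list[str]) -> str:
    """Hypothetical synthesis step combining the worker outputs."""
    return "; ".join(results)


def orchestrate(task: str) -> str:
    subtasks = plan_subtasks(task)            # count is not known in advance
    results = [worker(s) for s in subtasks]   # delegate each subtask
    return synthesize(results)


print(orchestrate("update parser and fix tests and bump version"))
```

The workers run sequentially here for simplicity; they could also be dispatched in parallel once the orchestrator has produced the list.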
5. Evaluator-optimizer
In the evaluator-optimizer workflow, one LLM call generates a response while another provides evaluation and feedback in a loop.
When to use this workflow:
- Evaluator-optimizer is particularly effective when we have clear evaluation criteria, and when iterative refinement provides measurable value.
- The two signs of a good fit are:
- LLM responses can be demonstrably improved when a human articulates their feedback;
- the LLM can provide such feedback itself.
Examples where evaluator-optimizer is useful:
- Literary translation where there are nuances that the translator LLM might not capture initially, but where an evaluator LLM can provide useful critiques.
- Complex search tasks that require multiple rounds of searching and analysis to gather comprehensive information, where the evaluator decides whether further searches are warranted.
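The loop can be sketched as a generator and an evaluator passing feedback back and forth until the evaluator accepts or a round limit is reached. This is a sketch under assumptions: `generate` and `evaluate` are hypothetical stand-ins for two LLM calls, and the round limit `max_rounds` is an added safeguard rather than something the blog specifies.

```python
def generate(task: str, feedback: str) -> str:
    """Hypothetical generator: revises the draft when feedback is given."""
    draft = f"Translation of '{task}'"
    if feedback:
        draft += f" (revised: {feedback})"
    return draft


def evaluate(draft: str) -> tuple[bool, str]:
    """Hypothetical evaluator: returns (accepted, feedback)."""
    if "revised" in draft:
        return True, ""
    return False, "keep the idiom natural"


def refine(task: str, max_rounds: int = 3) -> tuple[str, int]:
    """Run generate/evaluate rounds; return the last draft and rounds used."""
    draft, feedback = "", ""
    for round_number in range(1, max_rounds + 1):
        draft = generate(task, feedback)
        accepted, feedback = evaluate(draft)
        if accepted:
            return draft, round_number
    return draft, max_rounds


final_draft, rounds_used = refine("break a leg")
print(rounds_used, final_draft)
```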
Autonomous Agents
As LLMs mature in key capabilities (understanding complex inputs, engaging in reasoning and planning, using tools reliably, and recovering from errors), agents are increasingly being put to use.
- Agents begin their work with either a command from, or interactive discussion with, the human user.
- Once the task is clear, agents plan and operate independently, potentially returning to the human for further information or judgement.
- During execution, it’s crucial for agents to gain “ground truth” from the environment at each step (such as tool call results or code execution) to assess their progress.
- Agents can then pause for human feedback at checkpoints or when encountering blockers.
- The task often terminates upon completion, but it’s also common to include stopping conditions (such as a maximum number of iterations) to maintain control.
- Agents can handle sophisticated tasks, but their implementation is often straightforward. They are typically just LLMs using tools based on environmental feedback in a loop.
- It is crucial to design toolsets and their documentation clearly and thoughtfully.
When to use agents:
- Agents can be used for open-ended problems where it’s difficult or impossible to predict the required number of steps, and where you can’t hardcode a fixed path. The LLM will potentially operate for many turns, and you must have some level of trust in its decision-making.
- Agents’ autonomy makes them ideal for scaling tasks in trusted environments.
Examples where agents are useful:
- A coding agent to resolve SWE-bench tasks, which involve edits to many files based on a task description
- The Claude computer use reference implementation, where Claude uses a computer to accomplish tasks.
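The description of agents as “LLMs using tools based on environmental feedback in a loop” can be sketched in a few lines. This is a toy sketch, not a real agent: `choose_action` is a hypothetical rule-based policy standing in for the LLM’s decision, the `TOOLS` registry holds two invented arithmetic tools, and the task is simply to reach a target number.

```python
from typing import Callable

# Hypothetical tool registry; real tools would touch files, APIs, and so on.
TOOLS: dict[str, Callable[[int], int]] = {
    "increment": lambda n: n + 1,
    "double": lambda n: n * 2,
}


def choose_action(state: int, goal: int) -> str:
    """Hypothetical policy standing in for the LLM's decision at each turn."""
    if state == goal:
        return "finish"
    return "double" if 0 < state * 2 <= goal else "increment"


def run_agent(start: int, goal: int, max_iterations: int = 10) -> tuple[int, list[str]]:
    state, trace = start, []
    for _ in range(max_iterations):          # stopping condition keeps control
        action = choose_action(state, goal)
        if action == "finish":               # task complete
            break
        state = TOOLS[action](state)         # tool result is the ground truth
        trace.append(f"{action} -> {state}")
    return state, trace


final_state, steps = run_agent(1, 10)
print(final_state, steps)
```

The number of steps is not fixed in code; it depends on what the policy decides after seeing each tool result, and `max_iterations` bounds the loop if the goal is never reached.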
Combining and customizing these patterns
These building blocks aren’t prescriptive. They’re common patterns that developers can shape and combine to fit different use cases.
The key to success, as with any LLM features, is measuring performance and iterating on implementations.
To repeat: consider adding complexity only when it demonstrably improves outcomes.
Summary
Success in the LLM space isn’t about building the most sophisticated system. It’s about building the right system for your requirements.
Start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short.
When implementing agents, we try to follow three core principles:
- Maintain simplicity in the agent’s design.
- Prioritize transparency by explicitly showing the agent’s planning steps.
- Carefully craft the agent-computer interface (ACI) through thorough tool documentation and testing.
Frameworks can help you get started quickly, but don’t hesitate to reduce abstraction layers and build with basic components as you move to production.
By following these principles, you can create agents that are not only powerful but also reliable, maintainable, and trusted by their users.
Appendix 1: Agents in practice
Two particularly promising applications for AI agents demonstrate the practical value of the patterns discussed above.
Both applications illustrate how agents add the most value for tasks that require both conversation and action, have clear success criteria, enable feedback loops, and integrate meaningful human oversight.
A. Customer support
Customer support combines familiar chatbot interfaces with enhanced capabilities through tool integration. This is a natural fit for more open-ended agents because:
- Support interactions naturally follow a conversation flow while requiring access to external information and actions;
- Tools can be integrated to pull customer data, order history, and knowledge base articles;
- Actions such as issuing refunds or updating tickets can be handled programmatically;
- Success can be clearly measured through user-defined resolutions.
Several companies have demonstrated the viability of this approach through usage-based pricing models that charge only for successful resolutions, showing confidence in their agents’ effectiveness.
B. Coding agents
The software development space has shown remarkable potential for LLM features, with capabilities evolving from code completion to autonomous problem-solving. Agents are particularly effective because:
- Code solutions are verifiable through automated tests
- Agents can iterate on solutions using test results as feedback
- The problem space is well-defined and structured;
- Output quality can be measured objectively
In Anthropic’s own implementation, agents can now solve real GitHub issues in the SWE-bench Verified benchmark based on the pull request description alone. However, whereas automated testing helps verify functionality, human review remains crucial for ensuring solutions align with broader system requirements.
Appendix 2: Prompt engineering your tools
Tools will likely be an important part of any agentic system. They enable the LLM to interact with external services and APIs by specifying their exact structure and definition in the API.
When the LLM responds, it includes a tool use block in the API response if it plans to invoke a tool.
Tool definitions and specifications should be given just as much prompt engineering attention as the overall prompts.
This appendix describes how to prompt engineer our tools.
There are often several ways to specify the same action. For instance, we can specify a file edit by writing a diff, or by rewriting the entire file. For structured output, we can return code inside markdown or inside JSON.
In software engineering, differences like these are cosmetic and can be converted losslessly from one to the other.
However, some formats are much more difficult for an LLM to write than others. Writing a diff requires knowing how many lines are changing in the chunk header before the new code is written.
Writing code inside JSON (compared to markdown) requires extra escaping of newlines and quotes.
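The formatting overhead is easy to see with the standard library. The sketch below wraps the same two-line snippet in a markdown fence and in JSON, then builds a small unified diff with `difflib` to show the line counts in the hunk header; the snippet and the file names `old.py` and `new.py` are made up for illustration.

```python
import difflib
import json

code = 'print("hello")\nprint("world")\n'

# Markdown: the code sits verbatim between two fence lines.
fence = "`" * 3
as_markdown = f"{fence}python\n{code}{fence}"

# JSON: every newline and double quote inside the code must be escaped.
as_json = json.dumps({"code": code})

# Unified diff: the hunk header states line counts before the changed lines.
old_lines = ["a = 1\n", "b = 2\n"]
new_lines = ["a = 1\n", "b = 2\n", "c = 3\n"]
diff_lines = list(difflib.unified_diff(old_lines, new_lines, "old.py", "new.py"))
hunk_header = diff_lines[2]

print(as_markdown)
print(as_json)
print(hunk_header, end="")
```

A library produces the escapes and counts mechanically, but a model writing token by token has to get them right as it goes, which is why the format choice matters.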
The suggestions for deciding on tool formats are the following:
- Give the model enough tokens to “think” before it writes itself into a corner.
- Keep the format close to what the model has seen naturally occurring in text on the internet.
- Make sure there’s no formatting “overhead” such as having to keep an accurate count of thousands of lines of code, or string-escaping any code it writes.
One rule of thumb is to think about how much effort goes into human-computer interfaces (HCI), and plan to invest just as much effort in creating good agent-computer interfaces (ACI):
- A good tool definition often includes example usage, edge cases, input format requirements, and clear boundaries from other tools. It should be obvious how to use the tool based on its description and parameters.
- When using many similar tools, changing parameter names or descriptions to make things more obvious is especially important.
- Test how the model uses these tools: run many example inputs in the workbench to see what mistakes the model makes, and iterate.
- Poka-yoke the tools: change the arguments so that it is harder to make mistakes.
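As one illustration of these points, here is a hypothetical tool definition whose description states the input format and its boundary with other tools, paired with a validator that rejects an easy mistake (a relative path). Everything here is an assumption made for the sketch: the `read_file` tool, its JSON-Schema-style fields, and the `validate_read_file` helper are invented, and POSIX-style paths are assumed.

```python
import posixpath

# Hypothetical tool definition in a JSON-Schema style; names are illustrative.
READ_FILE_TOOL = {
    "name": "read_file",
    "description": (
        "Read a UTF-8 text file and return its contents. "
        "`path` must be absolute, for example /repo/src/main.py. "
        "Use this for reading only; it does not list directories."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Absolute file path."}
        },
        "required": ["path"],
    },
}


def validate_read_file(args: dict) -> str:
    """Poka-yoke: reject error-prone inputs with a message the model can act on."""
    path = args.get("path")
    if not isinstance(path, str) or not path:
        raise ValueError("`path` is required and must be a non-empty string")
    if not posixpath.isabs(path):
        raise ValueError(f"`path` must be absolute, got relative path: {path}")
    return posixpath.normpath(path)


print(validate_read_file({"path": "/repo/src/../main.py"}))
```

Requiring an absolute path removes a whole class of errors that appear once the working directory changes, and the error message tells the model how to correct its next call.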