Implementing complex workflows in AI Agents
Although AI agents are capable of independent reasoning, their capricious nature can make it difficult to implement detailed workflow processes. Agents’ ability to solve problems in novel and unpredictable ways doesn’t play so well when you are trying to implement rigorous, compliance-driven processes.
These processes tend to be implemented in workflow tools that involve time-consuming configuration and steep learning curves. Decision-making requires strict adherence to process and depends on a reasonable level of domain knowledge. LLMs can be unreliable in following strict instructions, so building agents that reliably negotiate this style of workflow can be frustrating.
A naive agent implementation typically uses a generic system prompt to plan how to solve a problem. This prompt provides an LLM with some general instructions and some tool definitions. It also implements some form of memory to embed any relevant details of prior interactions. The LLM returns with an execution plan that the agent can follow in working out a response.
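As a rough illustration, a naive implementation might look something like the sketch below. The `call_llm` and `run_tool` helpers are hypothetical placeholders for your model client and tool registry, and the shape of the returned plan is assumed.

```python
# A minimal sketch of a naive planning agent. call_llm() and run_tool() are
# hypothetical placeholders, and the returned plan's shape is assumed.

GENERIC_SYSTEM_PROMPT = """You are a helpful assistant.
You can call the following tools: {tool_definitions}
Given the user's request and the conversation history, respond with a
step-by-step plan, then carry it out one step at a time."""

def naive_agent(user_request: str, memory: list[dict], tools: dict) -> str:
    tool_definitions = "\n".join(f"- {name}: {t['description']}" for name, t in tools.items())
    messages = [
        {"role": "system", "content": GENERIC_SYSTEM_PROMPT.format(tool_definitions=tool_definitions)},
        *memory,                                    # relevant details of prior interactions
        {"role": "user", "content": user_request},
    ]
    # The LLM returns an execution plan (and any tool calls) for the agent to follow.
    plan = call_llm(messages)
    for step in plan.tool_calls:
        step_result = run_tool(tools, step)         # execute each planned tool call
        messages.append({"role": "tool", "content": step_result})
    return call_llm(messages).content               # synthesize the final response
```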
The difficulty with this approach is that generic planning prompts cannot solve more complex problems. Some solutions require a fixed set of workflow tasks that need to be followed in a very specific order. The more complex these tasks become, the less able an LLM is to reason its way to a reliable solution.
This implies that we need to provide more specialised instructions for some problems. There are many ways of accomplishing this, and it’s easy for any development to become mired in complexity as you chase down reliable workflow execution. A process of experimentation supported by a sound framework for evaluating outputs is essential here.
Understanding the trade-offs
When deciding how to implement more complex workflows, there are a set of trade-offs that need to be understood as part of any solution.
Firstly, there is accuracy, which can be a slippery concept for agents. Many problems don’t have clear and deterministic outputs, so you need some means of measuring what “good enough” looks like in terms of reliability. For example, frameworks such as RAGAS can assess outputs against a “ground truth” data set of questions and expected responses.
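A very rough evaluation harness might look like the sketch below, which compares agent outputs against a curated set of expected responses. The `run_agent` and `grade_similarity` helpers are hypothetical; frameworks such as RAGAS provide more rigorous metrics.

```python
# A minimal sketch of a "ground truth" evaluation loop. run_agent() and
# grade_similarity() are hypothetical; a framework such as RAGAS offers more
# rigorous metrics (faithfulness, answer correctness, etc.).

ground_truth = [
    {"question": "What is the refund window for online orders?",
     "expected": "Online orders can be refunded within 30 days."},
    # ... more curated question/answer pairs
]

def evaluate(threshold: float = 0.8) -> float:
    passed = 0
    for case in ground_truth:
        answer = run_agent(case["question"])
        score = grade_similarity(answer, case["expected"])  # e.g. embedding similarity or an LLM judge
        passed += score >= threshold
    return passed / len(ground_truth)   # proportion of cases that are "good enough"
```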
Latency is an important consideration, as agents are IO-bound systems where each decision requires a round trip to an expensive and slow LLM service. Added to that are equally slow calls to external data sources such as vector databases and product APIs. This is less of a problem for unattended execution, but if you want to provide a “conversational” interface, then you will need to optimise the external service calls.
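One common optimisation is to issue independent external calls concurrently rather than sequentially. The sketch below uses asyncio with hypothetical `search_vector_db` and `fetch_product` coroutines.

```python
import asyncio

# A sketch of overlapping independent I/O rather than awaiting each call in
# turn. search_vector_db() and fetch_product() are hypothetical async clients
# for a vector database and a product API.

async def gather_context(query: str, product_id: str) -> dict:
    # The calls are independent, so run them concurrently to cut latency.
    documents, product = await asyncio.gather(
        search_vector_db(query),
        fetch_product(product_id),
    )
    return {"documents": documents, "product": product}
```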
Although LLMs are getting progressively cheaper, cost is still a real concern. The more context you provide an LLM to improve reasoning, the more the agent will cost to run. It’s surprisingly easy for an agent to end up costing $0.50 or more per request.
Finally, reasoning errors are a consideration as models don’t always make reliable choices. It can be tempting to stuff prompts with context, but prompts with too much detail or unnecessary information can overwhelm or confuse an LLM. The tipping point varies between different models, but they all suffer from a similar drop in performance when presented with overly verbose context or too many options.
Patterns for implementing detailed workflows
LLM-enabled service
The simplest way to implement this style of workflow would be as a stand-alone service that executes a predetermined sequence of steps or tool calls. This might use an LLM in solving the problem somewhere down the line, but the workflow would be based on deterministic and testable code.
Note that this is not an “agent” as such because it doesn’t do any reasoning to work out how to solve the problem. This is included as a reminder that you don’t have to implement everything as an agent, and sometimes the most reliable solution demands a simpler implementation.
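A sketch of this style of service is shown below: a fixed, testable sequence of steps, where an LLM happens to be used for one of them. The claims domain and the helper functions are purely illustrative.

```python
# A sketch of an LLM-enabled service: the workflow is plain, deterministic
# code and the LLM is just one step within it. validate_claim(), fetch_policy(),
# call_llm(), apply_decision_rules() and record_decision() are hypothetical.

def process_claim(claim: dict) -> dict:
    validate_claim(claim)                           # deterministic validation, easy to unit test
    policy = fetch_policy(claim["policy_id"])       # deterministic data lookup

    # The only non-deterministic step: summarise the claim against the policy.
    summary = call_llm(
        f"Summarise this claim against the policy terms.\n"
        f"Claim: {claim}\nPolicy: {policy}"
    )

    decision = apply_decision_rules(claim, policy)  # deterministic business rules
    record_decision(claim["id"], decision, summary)
    return {"decision": decision, "summary": summary}
```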
Tools
A predetermined workflow can be implemented as a tool that is exposed to an agent. The tool will encapsulate all the work required to complete the workflow as a set of deterministic and testable steps. The agent will be responsible for deciding whether to invoke the tool and which arguments to pass into it.
This approach will only be useful if it allows the agent to use the workflow as part of a wider process of problem solving. The tool abstraction is normally a better choice for relatively small, discrete, and reusable operations. If tool definitions are too specific, they will not be useful to agents, and you could be left with a glut of narrowly defined tools that an agent struggles to choose from.
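A minimal sketch using the OpenAI function-calling format is shown below; the `process_refund_request` workflow and its schema are invented for illustration, and other providers use broadly similar tool definitions.

```python
# A sketch of exposing a predetermined workflow as a single tool, using the
# OpenAI function-calling format. The process_refund_request workflow and its
# parameters are hypothetical.

refund_workflow_tool = {
    "type": "function",
    "function": {
        "name": "process_refund_request",
        "description": (
            "Runs the full refund workflow: validates the order, checks "
            "eligibility against the refund policy, and issues the refund."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "The order to refund"},
                "reason": {"type": "string", "description": "The customer's stated reason"},
            },
            "required": ["order_id", "reason"],
        },
    },
}

def process_refund_request(order_id: str, reason: str) -> dict:
    # Deterministic, testable steps hidden behind the tool boundary.
    order = fetch_order(order_id)                 # hypothetical helpers
    eligibility = check_refund_policy(order, reason)
    return issue_refund(order, eligibility)
```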
Prompt routing
Agents tend to perform better when they receive prompts that are tailored to specific tasks. Prompt routing involves equipping an agent with a library of prompts that are tailored for specific use cases. The first thing an agent does on receiving a request is to select the most appropriate prompt to use in planning a response. Each prompt within the agent would use an appropriate selection of tools to solve a problem.
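A minimal routing sketch is shown below, assuming a hypothetical `call_llm` helper and a small, invented prompt library.

```python
# A sketch of prompt routing: a cheap first LLM call picks the most appropriate
# system prompt, which is then used to plan the actual response. call_llm() and
# the prompt library are hypothetical.

PROMPT_LIBRARY = {
    "billing":   "You are a billing specialist. Use the invoice tools to ...",
    "returns":   "You handle product returns. Follow the returns workflow ...",
    "technical": "You provide technical support. Consult the knowledge base ...",
}

ROUTER_PROMPT = (
    "Classify the user's request into exactly one of these categories: "
    + ", ".join(PROMPT_LIBRARY)
    + ". Respond with the category name only.\n\nRequest: {request}"
)

def route_and_respond(user_request: str) -> str:
    category = call_llm(ROUTER_PROMPT.format(request=user_request)).strip()
    system_prompt = PROMPT_LIBRARY.get(category, PROMPT_LIBRARY["technical"])  # fall back on a default
    return call_llm(system_prompt + "\n\nUser request: " + user_request)
```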
Although this approach could be used to address a wider range of use cases, it adds an extra round-trip to an LLM so the agent can decide which system prompt to use. This creates latency and there will also be a limit to the number of different prompts an LLM can be expected to choose from.
This approach is no guarantee of accuracy. Reasoning errors will inevitably occur, and the agent will be prone to making occasionally inexplicable choices of prompt. This approach may be a better fit for those scenarios where a different persona and approach to problem solving is required depending on the user request.
Few-shot prompting
Instead of managing a library of separate prompts, LLM performance can be improved by giving the model examples of how to solve problems. The technique of adding example inputs and expected outputs to a system prompt is referred to as “few-shot prompting”. This is particularly valuable when combined with “chain of thought” prompting, which explains the reasoning process for each example.
This approach requires a library of examples and a mechanism to select the most appropriate items based on the user request. These selected examples are then injected into the prompt. The techniques involved in selecting examples can get quite involved, though similarity search might be a reasonable place to start.
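The sketch below shows one way this could work, using embedding similarity to pick examples. The `embed` helper and the example library are hypothetical; any embedding model and similarity search would do as a starting point.

```python
import numpy as np

# A sketch of few-shot prompting with example selection. embed() and the
# example library are hypothetical.

EXAMPLES = [
    {"input": "Which warehouse stocks item 123?",
     "reasoning": "Inventory questions need the stock API, filtered by SKU.",
     "output": "Call get_stock(sku='123') and report the warehouse field."},
    # ... more curated examples with worked reasoning ("chain of thought")
]

def select_examples(user_request: str, k: int = 3) -> list[dict]:
    # Rank examples by embedding similarity to the request (re-embedding each
    # time for simplicity; a real system would pre-compute these vectors).
    query_vec = embed(user_request)
    scored = sorted(
        EXAMPLES,
        key=lambda ex: float(np.dot(embed(ex["input"]), query_vec)),
        reverse=True,
    )
    return scored[:k]

def build_prompt(user_request: str) -> str:
    shots = "\n\n".join(
        f"Input: {ex['input']}\nReasoning: {ex['reasoning']}\nOutput: {ex['output']}"
        for ex in select_examples(user_request)
    )
    return f"Solve the request following the style of these examples:\n\n{shots}\n\nInput: {user_request}"
```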
This approach can help to improve the scope and reliability of a generic system prompt, though may be more suitable for influencing a prompt rather than addressing complex workflows. For example, an agent may make better choices around how best to combine different data sources if provided with a small number of more targeted, domain-specific examples.
Prompt chaining
Models struggle to follow complex instructions that involve multiple steps and can fail to properly digest a large amount of supporting context. Prompt chaining mitigates this by breaking a complex task down into a sequence of smaller, more manageable steps.
Each task in a chain will be based on a dedicated prompt that has access to tools and memory. This allows for more focused and reliable processing where the output of one prompt serves as the input to the next. This creates a managed workflow of interactions that can guide an LLM through a task.
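The sketch below shows a simple three-step chain for an invented support-ticket scenario, with a hypothetical `call_llm` helper.

```python
# A sketch of prompt chaining: each step has a focused prompt, and the output
# of one step feeds the next. call_llm() and the triage scenario are hypothetical.

def triage_chain(ticket_text: str) -> str:
    # Step 1: extract the structured facts from the raw ticket.
    facts = call_llm(
        "Extract the product, the error message, and the customer impact "
        f"from this support ticket as a short bullet list:\n{ticket_text}"
    )

    # Step 2: classify severity using only the extracted facts.
    severity = call_llm(
        "Given these facts, classify the severity as P1, P2 or P3 and "
        f"explain why in one sentence:\n{facts}"
    )

    # Step 3: draft a response based on the previous two outputs.
    return call_llm(
        "Draft a reply to the customer. Facts:\n"
        f"{facts}\nSeverity assessment:\n{severity}"
    )
```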
This approach can improve accuracy and provide greater control over a model’s behaviour. It tends to be more effective for specialized tasks that can be meaningfully broken down into individual reasoning steps.
However, each step in the chain involves a separate call to an LLM, bringing extra latency and cost into play. Given the iterative and trial-and-error nature of prompt engineering, any mutually dependent prompts will be difficult to debug and prone to cascading failure.
If you want to compose agents from a series of chains, then you will also need a mechanism for deciding which chain to use. This is a similar problem to prompt routing, as it requires yet another call to an LLM, bringing extra latency and the potential for reasoning errors when there are too many choices.
Orchestrated agents
Another approach would involve implementing workflows as separate, stand-alone agents with their own dedicated prompts and tools. An “orchestrator” agent would serve as the starting point for requests, deciding which agents to pass the request to and how to synthesize the response.
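The sketch below shows the basic shape of an orchestrator. The `billing_agent` and `returns_agent` services and the `call_llm` helper are hypothetical; frameworks that manage this kind of delegation would hide much of the plumbing.

```python
# A sketch of an orchestrator delegating to stand-alone agents. The agents and
# call_llm() are hypothetical placeholders for separate services.

AGENTS = {
    "billing": billing_agent,    # each is its own service with dedicated prompts and tools
    "returns": returns_agent,
}

def orchestrate(user_request: str) -> str:
    # The orchestrator decides which agents to involve...
    chosen = call_llm(
        "Which of these agents should handle the request: "
        + ", ".join(AGENTS) + "? Reply with a comma-separated list.\n"
        + user_request
    )
    # ...invokes each one (a remote call that makes its own LLM round trips)...
    partial_answers = [AGENTS[name.strip()](user_request)
                       for name in chosen.split(",") if name.strip() in AGENTS]
    # ...and synthesizes the partial answers into a single response.
    return call_llm("Combine these answers into one response:\n" + "\n---\n".join(partial_answers))
```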
This is true “agentic behaviour” in that it allows an agent to solve problems in potentially novel ways. In the long run we are more likely to build systems based on collaborating agents, but the frameworks and protocols that manage agent collaboration are still very much in their infancy.
Note that the latency involved in this approach can be problematic. Every agent invocation will involve a call to an external service that in turn makes multiple calls to LLMs and expensive data services. The cascading latency and cost could render this approach unusable for conversational-style agents.
Despite these shortcomings, agent orchestration will become more viable as frameworks mature, costs continue to fall, and the latency challenges are overcome.
Fine-tuning
Fine-tuning is a process of adjusting an LLM with domain-specific examples to improve the accuracy of the results. By showing a model what “good” looks like, fine-tuning can help to improve accuracy and make an LLM better suited to a particular task.
Fine-tuning is an involved process that requires a very well-curated sample set to be effective. A fine-tuned model is also expensive to run, though there are indications that this will become cheaper over time. We may eventually be able to fine-tune a series of inexpensive models to create domain “experts”.
For the moment, fine-tuning should only be considered when cheaper and less complex techniques have failed.
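If you do reach that point, the mechanics are relatively straightforward. The sketch below shows the chat-style JSONL training format and job creation using the OpenAI client; the example data and the choice of base model are illustrative, so check your provider’s documentation for the current data format and supported models.

```python
import json
from openai import OpenAI

# A sketch of fine-tuning with the OpenAI client. The training examples and the
# base model are illustrative; consult the provider's documentation for details.

examples = [
    {"messages": [
        {"role": "system", "content": "You are a claims triage assistant."},
        {"role": "user", "content": "Windscreen cracked by road debris on the M4."},
        {"role": "assistant", "content": "Category: glass damage. Route to fast-track repair."},
    ]},
    # ... hundreds of well-curated examples showing what "good" looks like
]

with open("training.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

client = OpenAI()
training_file = client.files.create(file=open("training.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-4o-mini-2024-07-18")
```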
Mixed implementation
Finally, note that none of these implementations are mutually exclusive. For example, it should be possible to implement an LLM-enabled service while also exposing it as a tool for an agent.
Much depends on a good understanding of the “map” of use cases, so you can decide where to draw tool abstractions, agent boundaries, and prompt scope. You will also need to experiment freely with different styles of prompts and examples, as you seek to understand the trade-off between cost, complexity, latency, and reliability.