Introduction
As agents like OpenClaw and Claude Cowork move beyond chat and begin acting directly on real systems, the real challenge is no longer model intelligence but the design of the tooling layer beneath them.
The model is the brain, but without a structured nervous system and immune system, agents are inefficient, insecure, and unpredictable.
Agentic systems rely on a tooling layer that is efficient, secure, and production-ready. This document discusses the most important aspects of such a tooling layer, from execution environments to context management, from authorisation to guardrails.
Background: MCP and Tool Calling
Initially, large language models were used purely to generate text. A major milestone was reached with the introduction of the first "agents." An agent, in this context, is a piece of software that has three capabilities:
- a set of tools, essentially functions, which have effects in the real world: sending an email, updating a database, editing a document
- a memory: the ability to store previous model output as well as tool call results
- an "agent loop": the agent runs a language model inside a loop; the model produces output, that output is appended to the memory, and the model is invoked again, repeating until the task is considered complete.
This extremely simple agent model, known as the "ReAct Framework," was a huge breakthrough. Instead of generating only natural language, the model could produce a structured response indicating that a function should be invoked. The agent loop would execute that function, capture its result, and feed it back into the model as input for the next iteration. For the first time, LLMs were no longer limited to describing actions; they could initiate them.
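The loop described above can be sketched in a few lines. This is a minimal illustration, not any particular framework's API: the model here is a stub standing in for a real LLM call, and the email tool is hypothetical.

```python
# A minimal sketch of the agent loop: tools, memory, and a loop that feeds
# tool results back into the model until it declares the task complete.

def send_email_stub(to: str, body: str) -> str:
    """Hypothetical tool: pretend to send an email and report the effect."""
    return f"email sent to {to}"

TOOLS = {"send_email": send_email_stub}

def fake_model(memory: list) -> dict:
    """Stand-in for an LLM: request one tool call, then a final answer."""
    if not any(m["role"] == "tool" for m in memory):
        return {"type": "tool_call", "name": "send_email",
                "args": {"to": "dan@example.com", "body": "hello"}}
    return {"type": "final", "text": "Done: the email was sent."}

def agent_loop(task: str) -> str:
    memory = [{"role": "user", "content": task}]
    while True:
        step = fake_model(memory)
        if step["type"] == "final":                      # task complete
            return step["text"]
        result = TOOLS[step["name"]](**step["args"])     # execute the tool
        memory.append({"role": "tool", "content": result})  # feed result back
```

Everything else in this document is, in a sense, about what sits between `TOOLS[step["name"]]` and the function it ultimately invokes.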
Early public examples of this pattern emerged in 2022-2023: the LangChain framework, OpenAI's function calling in 2023, and later Google's Agent Development Kit, Vercel's AI SDK, and several more, all of which formalised structured tool invocation within model responses. These patterns quickly became foundational in agent development.
At the end of 2024, Anthropic released the Model Context Protocol (MCP), which standardised this mechanism. MCP defined a consistent interface for tools and, critically, allowed tools to be injected dynamically at runtime rather than being hard-coded into an agent. This seemingly small shift enabled the emergence of tool gateways: intermediary layers capable of handling authorisation, security enforcement, guardrails, and context-efficiency independently of the model or the agent.
This moves control from the prompt into infrastructure. Instead of trusting the model to behave correctly, we can enforce behaviour at the capability boundary, introducing scoped access, deterministic constraints, runtime tool specialisation, and token-efficient output handling. In other words, MCP transforms tool use from a convenience feature into systems architecture.
Wait, stop! Agents overstepping their bounds
A lot of the almost intoxicating appeal of modern agents comes from their ability to seemingly do everything.
Peter Steinberger, the inventor of OpenClaw, demonstrated an example where he was planning to add speech recognition to his agent. In preparation for this feature, Steinberger sent his agent a voice note on WhatsApp. When he got home, he noticed the agent had answered the note. It turns out that the agent, without any guidance or prompting, had found an OpenAI API key on his machine and sent the voice note to the OpenAI Whisper endpoint to transcribe it. It is exactly this kind of unprompted capability that draws people to agents in the first place.
But it is exactly that unconstrained power that prevents agents from being usable in serious industrial systems. Enterprise systems are built around deterministic constraints. Power without boundaries is not a feature in those environments, it's a liability.
Authorisation is the first fault line.
Ambient Authority: Is an agent your avatar, or your delegate?
When you log into Gmail in a browser, you can read, send, and delete your emails. You also implicitly gain access to Calendar, Drive, Sheets... your entire Google Workspace. There is no "read-only inbox but nothing else" mode in a browser session. That level of scoping only exists at the API layer, typically via OAuth.
OAuth operates on the concept of scopes, explicit permissions granted to a client. A tool might request read access to email but not send access. MCP supports OAuth, and any tool-calling mechanism can restrict access at that level. This is a good start. If you want your agent to read your emails, don't give it access to your browser. Aside from being context-inefficient (more on that later), it is a dramatic over-provisioning of access.
But even that is not enough.
There's a worrying trend emerging in MCP ecosystems: servers that expose a single, coarse-grained scope, effectively "use this MCP server," which grants full access to everything behind it.
That is no better than giving the agent your browser session or full shell access after logging in once.
Agents are frequently given full shell access as well. If they don't have a tool, they install one. If you are already authenticated on that machine, they can re-authenticate simply by clicking through. It's incredibly powerful, and also deeply unsafe. You don't usually want to wake up and discover your agent has registered your company for a dozen new SaaS services overnight.
If a system only allows that level of access, without deterministic and auditable constraints, it won't survive real-world deployment.
The principle at stake here is least privilege. Agents should receive only the permissions necessary for a specific task, and ideally those permissions should be progressively approved and constrained. That was one of the first things we built into Civic, mechanisms for progressive privilege and scoped tool access:
- An agent calls the "read email" tool
- An OAuth request is generated for the "read email" scope, and only that scope
- The agent is granted the permission, and executes the tool
- The grant is optionally stored for subsequent reuse
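That flow can be sketched as follows. The scope names and the grant store are illustrative, not Civic's actual API; the point is that permission is requested per tool, for exactly one scope, and optionally cached for reuse.

```python
# A sketch of progressive, scoped privilege at a tool gateway. Scope names
# and the consent mechanism are hypothetical stand-ins.

TOOL_SCOPES = {"read_email": "email.read", "send_email": "email.send"}

class GrantStore:
    """Remembers which scopes have already been granted."""
    def __init__(self):
        self._grants: set[str] = set()

    def has(self, scope: str) -> bool:
        return scope in self._grants

    def store(self, scope: str) -> None:
        self._grants.add(scope)

def request_oauth_grant(scope: str) -> bool:
    """Stand-in for the interactive OAuth flow for exactly one scope."""
    print(f"requesting consent for scope: {scope}")
    return True  # assume the user approves

def call_tool(name: str, grants: GrantStore, execute) -> str:
    scope = TOOL_SCOPES[name]            # only the scope this tool needs
    if not grants.has(scope):
        if not request_oauth_grant(scope):
            raise PermissionError(f"scope {scope} denied")
        grants.store(scope)              # optional reuse on later calls
    return execute()
```

Note that granting "read email" never implies "send email": each scope is requested at the moment the corresponding tool is first used.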
Beyond Coarse-Grained Permissions
But even that wasn't sufficient, because OAuth scopes are still too coarse for many agentic use cases.
Which brings me to a small but painful anecdote.
I built a very simple agent: once an hour, read my emails and send me a digest via Slack. Nothing exotic. After ingesting my emails into context, the agent had to decide where to send the summary. Despite being prompted to send it to me via DM, on one of its runs, it looked up the general channel in Slack and posted it there. So suddenly everyone in my organisation saw a digest of my inbox.
How best to solve this issue?
I had a choice: add stronger prompt instructions into a skill: "be extremely careful, only send to this channel ID," or move the constraint into code. I asked myself what would actually let me sleep at night.
Note, this isn't just a security problem. It is an efficiency problem. The agent should never have been reasoning about which channel to use in the first place.
The answer was deterministic enforcement at the tooling layer.
I built a guardrail mapping system at the tool gateway. Instead of exposing Slack's generic send_message tool, I rewrote it into a new tool: send_dm_to_dan. The gateway pre-binds the channel ID.
The agent sees only the constrained tool. It does not know it is a wrapper. It does not get to choose the channel. It has no decision to make.
Skills don't help with this, because skills are essentially just prompts. The only reliable place to enforce the rule was the capability boundary itself.
Importantly, the configuration lives in a database. This is not hard-coded into the agent loop or the functions it is given; it is a data-driven reduction of the decision space that the agent operates in. It is business logic encoded as data. If I attach that tool gateway to a different agent, the business logic comes with it. If the business logic changes later, or I want a different agent to have different behaviour, that is just a new row in the database.
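A minimal sketch of this alias-and-constrain pattern, with an in-memory dict standing in for the database rows and a stub standing in for Slack's generic tool (the channel ID is hypothetical):

```python
# Alias-and-constrain: expose a narrow tool that wraps a generic one,
# with certain arguments pre-bound by configuration, not by the agent.

def slack_send_message(channel_id: str, text: str) -> str:
    """Stand-in for Slack's generic send_message tool."""
    return f"sent to {channel_id}: {text}"

# Data-driven config: each row maps an alias to an underlying tool plus
# pre-bound arguments the agent can never override.
ALIASES = {
    "send_dm_to_dan": {
        "target": slack_send_message,
        "bound_args": {"channel_id": "D_DAN_DM"},  # hypothetical channel ID
    },
}

def call_aliased_tool(alias: str, **agent_args) -> str:
    row = ALIASES[alias]
    # Pre-bound args win the merge: even if the agent supplies a channel,
    # the configured value overwrites it.
    return row["target"](**{**agent_args, **row["bound_args"]})
```

Because `ALIASES` is data, swapping the dict for a database table changes nothing about the enforcement: the agent still only ever sees `send_dm_to_dan`.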
This becomes critical once agents operate in multi-tenant environments. When dozens of agents run across an organisation, acting on behalf of different users, the system must enforce capability boundaries centrally.
And of course I have exposed tools to allow an admin agent to help me configure all this.
And that pattern, moving constraints out of prompts and into the tool abstraction layer, is what I'll build on in the next section.
Encoding Business Logic in Agent Systems: Deterministic Guardrails and Tool Configuration
The previous example of rewriting and constraining a Slack tool is powerful in itself, but it's only one instance of a broader pattern: encoding business logic directly into the tooling layer.
Take database access as an example. I had an MCP server that exposes a general-purpose execute_sql tool. I wanted my agent to be able to generate a report on user activity in my system, without exposing any user PII to it. I had a few options for how to deal with this.
The heaviest option is to generate a new user or role in the database. Grant that user access to specific tables, even specific columns. That is often the right architectural move. It mirrors how we treat employees: separate IAM roles, separate mailboxes, separate access boundaries.
I wanted something more lightweight, however: something that doesn't require a visit to the DBA every time I want a small change.
I could also go down the skill route: dump the DB schema into a skill and let the model reason over it.
This is very powerful, and I would absolutely advocate for this as part of the solution. If I wanted an agent that could answer arbitrary questions about my data model, the skill route, coupled with the least privilege (i.e. limited read-only) concepts described above, would be all I needed.
But I wanted persistence and determinism. I wanted the report to work the same way every time. I also wanted context-efficiency (that word again): I didn't want my agent to have to figure out how to generate the report on every run. I used an agent once to work out what report I wanted, then encoded that knowledge permanently, without any risk of deviation.
So, I used the same alias-and-constrain pattern I used for Slack.
Instead of exposing execute_sql(query), I expose a new tool: lookup_user_activity(user_id). The SQL statement is encoded inside the tool definition. The tool gateway exposes a new parameter called user_id. And the value for this is interpolated into the SQL query. As another example, I created a tool: generate_weekly_report(week_start_date), where the entire report query is pre-encoded, and only the time window is adjustable.
One recurring need I found was augmenting OAuth with a domain-specific layer of constraints. OAuth scopes are blunt instruments. You can grant "read email" but not "read only emails with label X." You can grant "write to Drive" but not "only modify files in this folder." You can grant "update Jira issue," but not "only some issue transitions are allowed."
Some platforms have permissioning systems that support something close to this level of granularity. But practically none expose it cleanly through their MCP interfaces.
What I needed was effectively a DSL on top of OAuth.
- OAuth answers "may this client act at all?"
- Deterministic guardrails answer "under what conditions may it act?"
I found this feature vital, especially for enforcing two types of rules:
- "Block" rules: You cannot execute this tool with this set of parameters.
- "Ask" rules: Check with me before executing this tool.
Encoding these rules at the tooling layer moves business logic out of prompts and into enforceable capability boundaries. It enforces compliance, enables auditability and significantly reduces token usage.
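The two rule types above can be expressed as data evaluated at the gateway before a tool runs. The rule shape here is illustrative (a tool name plus a parameter predicate), not a real DSL:

```python
# "Block" and "ask" guardrail rules as data, checked before execution.
# Tool names, channel IDs, and the rule schema are hypothetical.

RULES = [
    {"type": "block", "tool": "send_message",
     "when": lambda args: args.get("channel_id") != "D_DAN_DM"},
    {"type": "ask", "tool": "delete_file",
     "when": lambda args: True},   # always confirm destructive actions
]

def check_rules(tool: str, args: dict) -> str:
    """Return 'allow', 'block', or 'ask' for a proposed tool call."""
    for rule in RULES:
        if rule["tool"] == tool and rule["when"](args):
            return rule["type"]
    return "allow"
```

An "ask" result would pause the agent loop and surface the call to a human; a "block" result returns a deterministic refusal regardless of what the model intended.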
On the subject of token usage, the next set of patterns goes into more examples of where a rules-based tool gateway can apply similar security patterns to tackle context bloat and increase both agent efficiency and cost-effectiveness.
Ignorance is bliss: Too much knowledge considered harmful
We've already seen that security constraints and determinism often have a side effect: they improve context efficiency. By context efficiency, I mean something very concrete: reducing the number of input tokens we send into the LLM.
The naive way MCP was first integrated into agents was simple: dump the entire toolset into the context. If you had 20 tools, all 20 went in. If you had a gateway in front of multiple MCP servers, everything went in. This created a clear mismatch of incentives:
An agent wants as little tooling context as possible. Every token spent describing unused capabilities is a token not spent reasoning about the task. But tool providers have the opposite incentive. They want their tools to be used correctly and frequently. That often leads to verbose descriptions, examples, comments, and defensive documentation embedded in the tool definition itself.
This leads to a predictable result: as the number of tools increases, performance degrades. The agent must decide what to use while holding all possible capabilities in memory.
This was a significant factor motivating the introduction of Skills. Skills allow tools to be loaded via a pattern called Progressive Disclosure. Instead of loading every tool into context, the system gives the agent an index of available skills, often with short descriptions, and only loads the full definition when the agent chooses to use one. This is an essential pattern, and I would absolutely recommend including something like this in agentic systems.
But it does have some weaknesses.
Skills don't implicitly solve tool filtering when you are consuming tools via MCP. If an MCP server exposes 30 tools and you load it into a skill, there is no built-in mechanism that says, "This skill should only expose one of those tools." You can bypass MCP and create one function per use case, read_gmail, write_gmail, and so on, and package those into skills. But then you've hardwired capabilities into the agent layer and coupled the skills very closely to an agent's execution environment. A skill designed to work in a Bash shell environment won't work in a Python execution environment or in a cloud chatbot.
What I found myself building instead was filtering at the gateway layer.
Rather than exposing all tools from an MCP server, I selectively expose only the subset relevant to a given use case. Those filters are encoded alongside skills but enforced at the gateway.
For example, Grafana exposes around 30 tools through its MCP server. These tools provide access to logs, dashboards, alert rules, annotations, data source configuration and more. When my agent is reading log files, it needs exactly one of them. So I created a log reader skill, and filtered the MCP server so that only the log query tool is visible to that agent in that context.
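A sketch of that filtering, with illustrative tool names (the real Grafana MCP server's tool names may differ). The filter lives alongside the skill as data but is enforced at the gateway:

```python
# Gateway-level tool filtering: each skill sees only its allowed subset.
# Tool and skill names here are hypothetical.

ALL_GRAFANA_TOOLS = {"query_loki_logs", "list_dashboards", "update_alert_rule"}

SKILL_FILTERS = {
    "log_reader": {"query_loki_logs"},   # this skill exposes exactly one tool
}

def visible_tools(skill: str) -> set[str]:
    """Return the tools the gateway advertises to an agent using this skill."""
    allowed = SKILL_FILTERS.get(skill)
    return ALL_GRAFANA_TOOLS & allowed if allowed else ALL_GRAFANA_TOOLS
```

From the agent's perspective, the other 29 tools simply do not exist in that context; nothing about them enters the prompt.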
Ignorance, in this case, is a feature.
Reducing capability surface area improves not just safety but reasoning quality and token efficiency. The agent performs better because it knows less. And that turns out to be a recurring pattern in agent system design: the less unnecessary power you expose, the more reliably the system behaves.
Cutting to the Chase: Avoiding tool response bloat
Another example of this pattern is the tool post-processor.
Most tools are extremely verbose in their responses. There are two reasons for that. First, they are usually thin wrappers around existing APIs, and those APIs are often verbose by design. Second, tool providers don't want to be opinionated about what information an agent might need. So they return everything.
That's reasonable from an API perspective, but it's disastrous from a context perspective. Aside from clogging up the context with distracting information, every token in a tool response is paid for by the agent's owner.
A concrete example for me is Playwright. I use the Playwright MCP server frequently for UI testing and lightweight browser automation. The base Playwright server is incredibly verbose. After every action, it can return large portions of the DOM and detailed execution metadata. That might be useful occasionally, but most of the time it's just noise.
So I built a wrapper around it. It can run entirely independently of Civic, by the way. It introduces an additional tool that executes Playwright commands but suppresses the verbose output.
You could try to solve this in a skill. You could write a script that extracts only the relevant fields and embed that script into the skill. But again, you run into the same problems we've already seen: platform dependence and leakage into the agent layer. A script written for a Bash execution environment won't work in a Python runtime or a cloud chatbot.
So instead, I built postprocessing into the gateway.
A postprocessor is applied after the tool executes but before the result reaches the LLM. It transforms the output deterministically. It can be combined with skills and with the alias-and-constrain pattern described above. For example, in the Grafana case, I have a read_logs_sparse tool. Under the hood, it calls the standard Grafana query tool. But before returning the result to the model, a postprocessor trims the JSON, removing specific paths and retaining only the relevant fields. It can also convert JSON to CSV for tabular data, stripping out repeated JSON keys that would otherwise consume tokens unnecessarily.
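Two of the transformations described above can be sketched simply: trimming a verbose JSON result down to retained fields, and flattening uniform rows to CSV so repeated keys are not paid for on every row. Field names are illustrative:

```python
# Deterministic postprocessors applied after tool execution, before the
# result reaches the LLM. Field names are hypothetical examples.
import csv, io

def trim(result: dict, keep: list[str]) -> dict:
    """Keep only the listed top-level fields of a tool result."""
    return {k: result[k] for k in keep if k in result}

def rows_to_csv(rows: list[dict]) -> str:
    """Flatten uniform dicts into CSV: each key appears once, in the header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

A real postprocessor would support nested path selection rather than top-level keys only, but the principle is the same: the transformation is code, not a prompt, so it behaves identically on every call.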
Applying post-processors to tools led to token savings of up to 80%. But aside from that, the reduced cognitive load measurably improved accuracy. The less irrelevant structure you expose to the model, the fewer degrees of freedom it has to misinterpret.
Again, ignorance is not a weakness here, but a design choice.
And what became clear when building these features into Civic, filtering, aliasing, constraints, and post-processing, is that context bloat and security have very similar causes. Excess capability and excess information both degrade reliability. The solution in both cases is the same: shape the capability surface before it reaches the model.
MCP or CLI? Choosing the Right Transport
One important shift in modern agents is that many now have access to a command line interface, sometimes sandboxed, sometimes with full system access.
Agents can now decide whether to execute tools "directly" or whether to execute them indirectly through a CLI command.
CLI commands have a number of advantages over direct tool calling. Much of the tool knowledge lives inside the model's training data. Commands like grep, git, jq and even service-specific commands like gh and aws are deeply embedded in LLM training corpora, which makes CLI interactions surprisingly efficient.
In addition, CLI commands are historically built to be composable. Tools can be efficiently chained together to provide powerful and complex functionality. MCP has no concept of tool chaining.
On the other hand, tools that are not in the training data are hard for LLMs to use directly. With MCP, the model receives explicit tool schemas at runtime. The server declares functions, parameters, and return formats. This eliminates ambiguity and improves reliability for these unfamiliar APIs, but it comes with a cost: those schemas must be injected into the model's context window.
There is therefore a trade-off.
CLI tools are typically better for:
- Local computation or file manipulation
- High-frequency loops and iterative data processing
- Workflows that benefit from Unix-style composability
MCP tools are typically better for:
- SaaS services that do not provide CLIs
- Multi-user environments with delegated authentication
- Systems requiring strong auditability and permission controls
In practice, production agent systems almost always use both.
A developer-oriented agent might analyse a local codebase using CLI tools while interacting with GitHub, Slack, or Jira through MCP servers. A business-facing agent might rely heavily on MCP for SaaS integrations but still execute local scripts or data-processing tasks via CLI tools behind the scenes.
The correct abstraction is to choose the transport per integration, based on where the tool runs, how it authenticates, and what the workflow requires.
Importantly, the transport layer should not be what the agent reasons about directly.
Just as we saw earlier with aliasing and constrained tools, the agent should ideally interact with a higher-level interface that exposes only the business capability required for the task. Whether the underlying implementation invokes a CLI command or an MCP tool should remain an implementation detail of the tooling layer.
In other words, MCP and CLI are not competing architectures. They are plumbing.
What I built at Civic was a "code-mode" layer, which allows agents to call MCP servers as if they were CLI tools, availing of control-flow constructs such as loops, branching, and tool chaining, while retaining the benefits of the MCP layer: structured tool information, auditing, authorisation, and guardrails. Progressive disclosure and skills helped to solve the context saturation problem.
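The idea can be illustrated with a sketch. Here a stub gateway stands in for the real one (which would authorise, audit, and route each call), and the function below represents the kind of small program an agent could emit in code-mode; the tool names are invented for illustration:

```python
# Code-mode sketch: the agent writes a program that calls MCP tools as
# ordinary functions, gaining loops and branching, while every call still
# passes through the gateway. Tool names and data are hypothetical.

def gateway_call(tool: str, **args):
    """Stand-in for the gateway: a real one enforces auth and guardrails."""
    if tool == "list_users":
        return ["alice", "bob"]
    if tool == "get_activity":
        return {"alice": 3, "bob": 0}[args["user"]]
    raise KeyError(tool)

def generated_agent_code():
    """A program an agent might emit: chained tool calls with control flow."""
    active = []
    for user in gateway_call("list_users"):               # tool call in a loop
        if gateway_call("get_activity", user=user) > 0:   # branch on the result
            active.append(user)
    return active
```

The key property is that the intermediate results of the loop never pass through the model's context at all; only the final, small answer does.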
Conclusion
As agents move from experiments into real production systems, the limiting factor is no longer the intelligence of the models themselves, but the structure of the capability layer that surrounds them.
The patterns explored here point to a consistent principle: reliable agent behaviour emerges when the capability surface is carefully shaped before it reaches the model. Guardrails, constrained tools, progressive disclosure, filtering, and post-processing all serve the same purpose: reducing the decision space the model must navigate.
Security, efficiency, and determinism are therefore not separate concerns in agent architecture. They reinforce each other. The same mechanisms that enforce least privilege also reduce token usage, improve reasoning quality, and make agent behaviour more predictable.
As agents become embedded deeper into production systems, this tooling layer will increasingly resemble traditional systems engineering rather than prompt engineering.
In the end, reliable agents are not built by asking the model to behave better, but by designing the environment in which it operates.
