Most discussions about agent skills assume the same pattern: more structure should produce better behavior, and a reusable workflow wrapper should outperform a smaller always-on rule set.
That assumption did not survive contact with a local benchmark.
In the Sigil workspace, we tested a narrow memory-discipline skill against a simpler setup:
- tiny always-on guidance
- MCP tools for actual recall, confirmation, logging, and todo capture
The result was not what the “more agentic is better” story would predict.
What We Tested
We ran repeated task comparisons in the same repository and model family.
The question was simple: does a dedicated skill improve memory-oriented behavior enough to justify the extra instruction layer?
The most revealing task was not infrastructure continuity. It was a summary/review task:
- review validation material
- produce a concise outcome summary
- decide whether a durable log or todo is warranted
That is exactly the kind of task where a memory-discipline skill sounds useful on paper.
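The comparison above can be sketched as a small A/B harness. Everything here is illustrative, not a real Sigil API: `run_task` stands in for whatever executes one agent run (with or without the skill layer) and grades its transcript against the behaviors we cared about.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class Scorecard:
    used_recall: bool   # called recall when recall was appropriate
    on_target: bool     # reviewed the actual materials, not the wrapper
    accurate: bool      # no over-compressed or inaccurate claims

def score(card: Scorecard) -> int:
    # One point per behavior; crude, but enough to rank two setups.
    return int(card.used_recall) + int(card.on_target) + int(card.accurate)

def compare(run_task: Callable[[bool], Scorecard],
            runs: int = 10) -> Tuple[float, float]:
    # Run the same task with the skill enabled, then disabled, and average.
    with_skill = sum(score(run_task(True)) for _ in range(runs)) / runs
    without_skill = sum(score(run_task(False)) for _ in range(runs)) / runs
    return with_skill, without_skill
```

The point of the rubric is that it scores the task, not the workflow: a run that dutifully follows the skill but reviews the wrong thing still loses points.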
What Happened
The skill did not consistently help.
In the most important failure cases, it did three things:
- It stopped using recall when recall would have been appropriate.
- It narrowed the task toward the skill/spec layer instead of the actual materials under review.
- It introduced inaccurate or over-compressed claims while doing so.
That is worse than “no gain.” That is a workflow layer distorting the reasoning target.
Why This Matters
Most teams worry about skills because of token cost. That concern is real, but it is not the main lesson here.
The bigger risk is this: a wrapper can quietly become the object of reasoning.
Once that happens, the model starts talking about the workflow instead of doing the work. That is exactly what happened in this case.
The Simpler Pattern Won
The cleaner pattern was:
- keep always-on guidance extremely small
- use MCP as the real capability layer
- benchmark optional workflow wrappers before normalizing them
That stack is less glamorous than a layered skills story, but it kept the model scoped to the materials under review rather than the wrapper.
In this case, “less prompt architecture” produced better behavior.
The Practical Rule
If you are evaluating skills in a real workspace, do not ask, “Do we have skills?” Ask, “Do skills improve task quality enough to justify their extra layer?”
And if the answer is no, keep them out of the default path.
The Better Takeaway
This is not an anti-skill argument. It is a pro-measurement argument.
If a skill helps in a narrow continuity workflow, keep it there. If it degrades summary, review, or validation work, remove it from that path immediately.
The right architecture is not the one that sounds the most agentic. It is the one that survives contact with the benchmark.