Grammar Over Vocabulary: Why CRUD Fails APIs for Agents

A developer who reads POST /reservations knows what it does. The HTTP verb, the resource path, fifteen years of REST convention, and a quick glance at the docs all converge on a single meaning: this books a table. The inference is so automatic that it feels like reading rather than reasoning.

A language model reading the same endpoint has none of those advantages. It reasons in natural language about a user’s goal, “book me a table for four at seven,” and has to bridge from that intent to a POST /reservations request across a gap the human developer never notices. The verb POST means “create a resource,” which is a statement about the server’s internal data model rather than about the user’s goal. The model has to infer that creating a reservation resource is the same act as booking a table. Usually it gets there. Sometimes it fails. And the rate at which it fails turns out to be measurable, large, and dependent on the one design choice almost nobody treats as a design choice: what you name the method.

We ran the experiment. 7,200 trials, four model families, eighteen conditions. The result is hard to argue with, because the data is the argument.

The headline number

The core test presented Agentic API and REST endpoints to the same model simultaneously. Each trial used a catalog of paired endpoints, some named in CRUD style (POST /reservations, GET /restaurants/search) and some named in intent-aligned style (BOOK, FIND, QUERY), with the paradigm assigned randomly per trial. The model had to pick the right endpoint for a natural-language task without any external hint about which style to prefer. This is the mixed-paradigm condition, the most realistic test of what happens when an agent encounters a catalog in the wild.
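As a rough illustration, one mixed-paradigm trial could be assembled along these lines. This is a sketch under assumptions, not the benchmark harness: the `EndpointPair` structure, the `build_trial` helper, the task wording, and the reading that each pair draws its style independently are all hypothetical. Only the endpoint names come from the text above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointPair:
    """One capability exposed under both naming styles (hypothetical structure)."""
    task: str     # natural-language task handed to the model
    crud: str     # CRUD-style name
    agentic: str  # intent-aligned name


# Endpoint names come from the article; the pairing and task text are assumed.
CATALOG = [
    EndpointPair("book me a table for four at seven", "POST /reservations", "BOOK"),
    EndpointPair("find an Italian restaurant nearby", "GET /restaurants/search", "FIND"),
]


def build_trial(catalog, target_index, rng):
    """Render each pair in a randomly chosen style and record the target's style."""
    styles = [rng.choice(("crud", "agentic")) for _ in catalog]
    candidates = [getattr(pair, style) for pair, style in zip(catalog, styles)]
    return {
        "task": catalog[target_index].task,
        "candidates": candidates,
        "correct": candidates[target_index],
        "target_paradigm": styles[target_index],
    }


trial = build_trial(CATALOG, target_index=0, rng=random.Random(0))
print(trial["target_paradigm"], trial["candidates"], "->", trial["correct"])
```

Recording the target's paradigm on each trial is what lets accuracy be split into a CRUD figure and an agentic figure afterwards.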

Across all three frontier models, intent-aligned method names beat CRUD by 10 to 29 percentage points. Claude Sonnet 4.6 went from 59 percent on CRUD to 88 percent on agentic naming, a 29-point gap. Grok-3 showed an 18-point gap. GPT-4o showed 10. Aggregated across the three frontier families, the gap was 18.5 points, with a z-statistic of 3.77 and a p-value below 0.001. Three models, built by three different organizations, on different architectures and different training data, all moved in the same direction by a wide margin.
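For readers who want to check a gap like this on their own runs, a pooled two-proportion z-test is one common way to get a z-statistic and p-value. Whether the benchmark used exactly this test is an assumption. The per-condition trial counts are not stated above, so the counts below are hypothetical and will not reproduce the 3.77 figure.

```python
import math


def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 100 trials per style, with 88 and 59 correct selections.
z, p = two_proportion_z(successes_a=88, n_a=100, successes_b=59, n_b=100)
print(f"z = {z:.2f}, p = {p:.3g}")
```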

That directional consistency is what makes the finding generalizable rather than a quirk of one lab’s model. When three independent systems agree this strongly, the effect is in the structure of the problem rather than in any one model’s idiosyncrasies.

The number that complicates the story

One model showed no effect. Llama 3.2, with 3 billion parameters, selected CRUD and agentic endpoints at identical accuracy in the mixed condition: 23 percent each, with a p-value of 0.95. The semantic signal in BOOK or FIND requires enough reasoning capacity to map the verb to a user’s intent, and at 3B parameters, that capacity is absent.

This is the most important boundary on the whole result, and it deserves to be stated plainly rather than buried. Semantic method naming is a frontier-scale benefit. It emerges somewhere between 3B and frontier scale, and organizations running small models for cost or privacy reasons should expect little from interface redesign alone. For them, scaffolding such as discovery-stage pre-filtering does more good than naming conventions. Honesty about the boundary condition is what makes the rest of the claim credible.

Why the names carry the signal

A skeptic has an obvious objection at this point. Maybe the agentic endpoints just had better documentation. Maybe the effect is about description quality, and the method names are incidental. We designed three independent ablations specifically to kill that objection, and all three converge on the same conclusion.

The sharpest one is the description swap. We took CRUD endpoints and gave them agentic-style descriptions. We took agentic endpoints and gave them CRUD-style descriptions. If descriptions were doing the work, both swaps would behave symmetrically. They did the opposite.

When CRUD names received agentic descriptions, accuracy collapsed. Grok fell 39 points. GPT-4o fell 43 points, from 73 percent to 30 percent, near-chance. The replacement descriptions were designed to be more informative and more intent-aligned than the originals, yet accuracy still cratered because the method name and the description were now telling different stories, and the model had no way to reconcile them.

When agentic names received CRUD descriptions, accuracy barely moved. Claude held steady at 86 percent, statistically indistinguishable from its baseline with matched descriptions. The semantic method name carried enough intent signal to fully absorb the description noise. Grok and GPT-4o each lost some ground but kept most of their advantage.

The asymmetry is the whole point. Semantic verb names encode intent densely enough to survive bad documentation. Generic HTTP verbs encode no intent at all, so they depend entirely on the description, and they fail catastrophically when the description misleads.
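The mechanics of a description swap can be sketched in a few lines. The `Endpoint` structure and both description strings are hypothetical, and swapping within a single pair is an assumption; how the replacement descriptions were actually authored is not specified here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Endpoint:
    name: str
    description: str


# Names come from the article; both descriptions are invented for illustration.
crud = Endpoint("POST /reservations", "Creates a reservation resource.")
agentic = Endpoint("BOOK", "Book a table for a party at a given time.")


def swap_descriptions(a, b):
    """Return copies of a and b with descriptions exchanged; names stay untouched."""
    return replace(a, description=b.description), replace(b, description=a.description)


crud_swapped, agentic_swapped = swap_descriptions(crud, agentic)
print(crud_swapped)
print(agentic_swapped)
```

Because only the description field changes, any accuracy difference between the swapped and unswapped conditions can be attributed to the interaction between name and description rather than to the name itself.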

A second ablation reinforced this from a different angle. We stripped documentation down to a single line, then to nothing. For Claude, removing verbose documentation and leaving only minimal descriptions raised endpoint selection accuracy from 75 percent to 88 percent. The verbose text had been competing with the cleaner signal in the method name. Take away the noise, and the name alone does better.

A third ablation guaranteed that the correct endpoint was always visible in the candidate set, isolating the selection step from any discovery problem. Even with the right answer present, agentic endpoints were selected more accurately than CRUD ones across all three frontier models. The advantage lives in the act of selection itself, separate from any question of whether the model can find the endpoint.

Three ablations, three different mechanisms, one conclusion. Method names carry the primary selection signal for frontier models. Descriptions are secondary.

The finding that should worry safety teams

There is a result here that matters beyond accuracy, and it is the one I would put in front of anyone building agent governance.

In the description-swap condition, where CRUD endpoints carried mismatched descriptions, the frontier models went beyond simply getting the answer wrong. They got it wrong while reporting high confidence. Grok and GPT-4o reported average confidence around 87 to 88 percent while achieving 30 to 33 percent accuracy. The calibration error spiked to 60-67 percentage points. The models were most certain precisely when they were most wrong.

Sit with what that means for any system that uses model-reported confidence to decide when to escalate to a human. Under documentation degradation, a CRUD-based agent will report high confidence for near-chance answers, meaning it will under-escalate when escalation matters most. The confidence signal becomes anti-informative. The safety mechanism fails silently.
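A toy version of that failure, assuming a simple threshold policy: the 0.80 threshold, the function names, and the ten-answer batch are hypothetical, with the numbers chosen to roughly echo the 87 percent confidence and 30 percent accuracy reported above. The gap computed here is plain mean confidence minus accuracy, which is not necessarily the calibration metric the benchmark used.

```python
def should_escalate(confidence, threshold=0.80):
    """Hypothetical policy: hand off to a human when reported confidence is low."""
    return confidence < threshold


def overconfidence_gap(confidences, correct):
    """Mean reported confidence minus observed accuracy (positive = overconfident)."""
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_confidence - accuracy


# Hypothetical batch: ten answers, each reported at 0.87 confidence, three correct.
confidences = [0.87] * 10
correct = [True] * 3 + [False] * 7

escalated = sum(should_escalate(c) for c in confidences)
gap = overconfidence_gap(confidences, correct)
print(f"escalated {escalated} of {len(confidences)}; overconfidence gap = {gap:.2f}")
```

Seven of the ten answers are wrong, yet none crosses the escalation threshold, which is the silent failure described above.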

Semantic method naming softened this, too. With agentic names under the same description noise, the calibration error was roughly 26 points for Claude rather than 41. The method name supplied a corroborating signal that kept confidence anchored even when the description lied. Interface design, in other words, has a direct effect on whether an agent’s confidence can be trusted as an escalation trigger. That moves API naming out of the realm of developer ergonomics and into the realm of safety engineering.

What separates selection from parameters

One more result is worth carrying forward, because it prevents an overcorrection. Stripping documentation improved endpoint selection, but it destroyed parametric accuracy. The model’s ability to fill in the correct parameters collapsed across all models and providers when descriptions were removed. Claude fell from 80 percent parametric accuracy to 38 percent. Grok fell from 85 percent to 25 percent.

This cleanly separates two design problems that get conflated. Semantic method naming optimizes which endpoint the agent selects. Documentation quality optimizes whether the agent populates that endpoint’s parameters correctly. Both matter in production, and they respond to different decisions. The lesson here is the opposite of “remove your documentation.” The takeaway is lean, precise descriptions that supply parameter context without burying the method-name signal under disambiguation noise.
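One way to picture a lean description is a spec with one short line for the method and one short hint per parameter. The field names, the parameter set, and the `render_for_agent` helper below are hypothetical; this is a sketch of the idea, not a format the benchmark defines.

```python
# Hypothetical spec: a one-line method description plus a short hint per parameter.
BOOK_SPEC = {
    "method": "BOOK",
    "description": "Reserve a table at a restaurant.",
    "parameters": {
        "restaurant_id": "string, ID of the restaurant",
        "party_size": "integer, number of guests",
        "time": "string, ISO 8601 local start time",
    },
}


def render_for_agent(spec):
    """Render the spec as compact text suitable for a model's tool catalog."""
    lines = [f'{spec["method"]}: {spec["description"]}']
    for name, hint in spec["parameters"].items():
        lines.append(f"  {name}: {hint}")
    return "\n".join(lines)


print(render_for_agent(BOOK_SPEC))
```

The method name stays the first thing the model reads, and the parameter hints carry the context needed to populate the call.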

Why this is an argument for grammar over vocabulary

Here is where the empirical result meets the design philosophy, and where the case for an agent-native protocol gets its strongest foundation.

The naive response to this data would be to build a giant registry of approved semantic verbs and require every agent API to use them. That is a vocabulary approach, and it fails to scale. Vocabularies fragment, drift, and ossify. The moment you fix a registry, the world produces a verb it lacks.

The better response is to constrain the grammar rather than enumerate the vocabulary. Define the shape and semantic class that a method identifier must take, an intent-aligned verb that maps to a recognizable user goal, and let the specific verb emerge from the domain. BOOK, FIND, REFUND, RESCHEDULE were never on a master list. They are well-formed because they fit the grammar, the same way a sentence can be grammatical without every sentence being pre-written. The data says the signal lives in the intent encoded by the name, so the design constraint should govern how names encode intent rather than which names are allowed.
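The difference between a registry and a grammar can be shown with a small validator. The shape rule below (a single uppercase word that is not a generic HTTP verb) is a hypothetical stand-in for whatever the real grammar specifies. A shape check alone also cannot confirm that a verb maps to a recognizable user goal; that semantic part would need review or a separate check.

```python
import re

# Generic HTTP verbs say nothing about user intent, so the grammar rejects them.
GENERIC_HTTP_VERBS = {"GET", "POST", "PUT", "PATCH", "DELETE", "HEAD", "OPTIONS"}

# Hypothetical shape rule: one uppercase word of two or more letters.
METHOD_SHAPE = re.compile(r"[A-Z]{2,}")


def is_well_formed(method):
    """Check the shape of a method identifier without consulting any verb registry."""
    return bool(METHOD_SHAPE.fullmatch(method)) and method not in GENERIC_HTTP_VERBS


for candidate in ["BOOK", "RESCHEDULE", "POST", "POST /reservations", "book"]:
    print(f"{candidate!r}: {is_well_formed(candidate)}")
```

Nothing in the validator lists BOOK or RESCHEDULE, so a new domain verb passes without anyone having to update a central list.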

This is the principle underneath the Runtime Contract Negotiation Substrate. RCNS lets an agent propose an endpoint into existence at the moment of need rather than selecting from a frozen catalog. An agent issues a PROPOSE for the capability it requires, expressed in intent-aligned terms, and the server evaluates and synthesizes a contract on demand. The grammar governs what a well-formed proposal looks like. The vocabulary is open because the agent and the server negotiate the specific terms in context.
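To make the propose-and-synthesize flow concrete, here is a toy model of it. Everything in it is hypothetical: the `Proposal` and `Contract` shapes, the `negotiate` function, and the server's acceptance rule are invented for illustration, since the actual RCNS message format is not described here.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass(frozen=True)
class Proposal:
    """What the agent asks for (hypothetical shape, not the RCNS wire format)."""
    verb: str  # intent-aligned verb chosen by the agent
    goal: str  # the user goal in natural language
    parameters: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Contract:
    """What the server agrees to expose for this interaction (hypothetical shape)."""
    method: str
    parameters: dict


def negotiate(proposal: Proposal,
              can_synthesize: Callable[[Proposal], bool]) -> Optional[Contract]:
    """Toy server side: check the proposal's shape, then ask domain logic to accept it."""
    if not (proposal.verb.isalpha() and proposal.verb.isupper()):
        return None  # malformed proposal: fails the grammar
    if not can_synthesize(proposal):
        return None  # well-formed, but this server cannot fulfil it
    return Contract(method=proposal.verb, parameters=dict(proposal.parameters))


def restaurant_server(proposal):
    """Stand-in for server-side evaluation; a real server would do far more."""
    return "table" in proposal.goal.lower()


contract = negotiate(
    Proposal("RESCHEDULE", "Move my table booking to 8 pm",
             {"booking_id": "string", "new_time": "ISO 8601"}),
    restaurant_server,
)
print(contract)
```

The server never consults a list of approved verbs. It checks that the proposal is well formed and then decides, from the stated goal, whether it can fulfil it.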

The benchmark explains why this works. If frontier models reason most accurately about intent-aligned method names, then a negotiation protocol whose proposals are expressed as intent-aligned verbs operates in exactly the regime where models perform best. RCNS is the architecture that the data recommends. The 10-to-29-point accuracy lift is empirical evidence for designing the negotiation surface around semantic grammar rather than a CRUD vocabulary built for a different kind of reader.

The reader changed

REST was designed around a human developer’s cognitive model, and for that reader, it is a fine design. The developer knows HTTP semantics, consults docs when uncertain, and brings domain knowledge to every ambiguous case. The contract can be frozen in advance because the developer’s intentions were knowable at design time.

The agent is a different reader. It reasons in natural language about goals it receives at runtime. It cannot bring fifteen years of REST convention to bear because it maps intent to the interface in a single inference step, and the data shows that step succeeds far more often when the interface speaks the language of intent.

This is the quiet reframe the benchmark forces. Method naming is usually treated as a matter of convention rather than a design decision. The data says it is a measurable engineering variable with direct accuracy and safety consequences at frontier scale. When the reader shifted from a human developer to a reasoning model, the optimal interface changed accordingly. CRUD was the right answer for the old reader. AGTP semantic grammar is the right answer for the new one.

The dataset, the task rubric, and the benchmark harness are open. The argument is reproducible, which is the only kind of argument worth making about something this consequential. Run it against your own models and your own catalog. The numbers are awaiting verification.

Chris Hood is an AI strategist and author of the #1 Amazon Best Seller Infailible and Customer Transformation, and has been recognized as one of the Top 30 Global Gurus for Customer Experience. His latest book, Unmapping Customer Journeys, is available now!
