LLM API Benchmark v2
API-Bench: A benchmark for agentic API integrations
Which LLMs can reliably build working integrations?
This is v2 of API-Bench. API-Bench evaluates how well language models can execute against APIs:
- Which LLMs can reliably build working integrations into your tech stack?
- Where do agents fail, and why?
AI agents are getting better at using tools. But "better" isn't the same as "reliable." API-Bench measures what really matters for production: can a model build working integrations into real-world APIs, end-to-end?
API-Bench evaluates agentic API execution: chaining requests, handling auth, following specs, paginating, transforming results, and recovering from errors across multi-step workflows. In short: Which models can actually work with your stack, where do agents fail, and why?
Best LLMs for Building Integrations
Average success rate across all tested API integration tasks:
| Rank | LLM | Success Rate | Successful Tasks |
|---|---|---|---|
| 1 | superglue [1] | 88% | 36/41 |
| 2 | Claude Sonnet 4.5 | 80% | 33/41 |
| 3 | GPT-5 | 63% | 26/41 |
| 4 | Claude Sonnet 4 | 63% | 26/41 |
| 5 | GPT-4.1 | 60% | 25/41 |
| 6 | Gemini 2.5 Pro | 58.5% | 24/41 |
| 7 | Gemini 2.5 Flash | 43% | 18/41 |
[1] superglue is an integration layer designed specifically for agent-API integrations, not a general-purpose LLM
What API-Bench Measures
TL;DR: We tested integration-building capabilities across 6 different LLMs.
- API spec adherence (correct paths, bodies, and headers)
- Auth handling (API keys, access tokens, legacy flows)
- Pagination & batching (single/multi-cursor, page- and offset-based; see the sketch after this list)
- Cross-API workflows (fetch (System 1) → enrich (System 2) → upload (System 3))
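As an illustration of the pagination category, the tasks require the kind of loop sketched below: follow a cursor until the API stops returning one. The endpoint and field names (cursor, next_cursor, items) are placeholders, not taken from any specific API in the benchmark.

```js
// Cursor-based pagination sketch; field names are illustrative placeholders.
async function fetchAll(baseUrl, apiKey) {
  const items = [];
  let cursor = null;
  do {
    const url = new URL(baseUrl);
    if (cursor) url.searchParams.set("cursor", cursor);   // pass the cursor from the previous page
    const res = await fetch(url, { headers: { Authorization: `Bearer ${apiKey}` } });
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    const page = await res.json();
    items.push(...page.items);
    cursor = page.next_cursor;                            // null/undefined once the last page is reached
  } while (cursor);
  return items;
}
```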
Task examples (no endpoints or credentials given):
Execution protocol. LLMs were prompted to output JavaScript code inside <<CODE>> tags, which we extracted and executed. Each model had 3 attempts. For superglue, we used sdk.buildWorkflow with the task instruction and the integration credentials as input; superglue processes documentation and applies self-healing loops that the plain LLMs do not perform. All validators and harness prompts were identical across conditions. The full evaluation is open-source for reproducibility: check out the benchmark implementation on GitHub to run your own tests or contribute new APIs.
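For context, the model-side harness boils down to a loop like the sketch below. model.generate and validate are illustrative stand-ins for the real prompt call and per-task validator, and the closing-tag format is assumed; see the GitHub repository for the actual implementation.

```js
// Simplified sketch of the per-task harness loop (not the benchmark's real code):
// extract the tagged JavaScript, execute it, and validate the result.
async function runTask(model, task, validate, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const completion = await model.generate(task.prompt);            // hypothetical model interface
    const match = completion.match(/<<CODE>>([\s\S]*?)<<\/CODE>>/);  // assumed tag format
    if (!match) continue;
    try {
      const result = await new Function(`return (async () => { ${match[1]} })();`)();
      if (await validate(result)) return { success: true, attempt };
    } catch (err) {
      // a runtime error simply consumes one of the attempts
    }
  }
  return { success: false };
}
```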
Note: superglue is an integration layer designed specifically for agent-API integrations, not a general-purpose LLM. We included it to show the performance gap between specialized agent systems and general language models.
Key Findings
LLMs are bad at writing working integration code. Here's why:
1. Trained on outdated documentation
For instance, in tasks requiring OpenAI's newer Responses API with file inputs, most LLMs attempted legacy multi-part uploads instead of passing the new input_file content object. This is a particularly frustrating failure mode because it looks "reasonable" in code but doesn't run.
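For reference, the newer pattern looks roughly like this with OpenAI's official Node SDK (the model name, file, and prompt are placeholders, and field names may shift between SDK versions):

```js
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Upload the file once, then reference it via an input_file content part
// instead of a legacy multi-part form upload.
const file = await client.files.create({
  file: fs.createReadStream("invoice.pdf"),
  purpose: "user_data",
});

const response = await client.responses.create({
  model: "gpt-4.1",
  input: [
    {
      role: "user",
      content: [
        { type: "input_file", file_id: file.id },
        { type: "input_text", text: "Extract the invoice total." },
      ],
    },
  ],
});

console.log(response.output_text);
```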
2. Missing training data for niche and long-tail systems
For industry-specific or lesser-known systems (e.g., legacy systems or startups with solid but sparsely discussed APIs), LLMs guessed endpoints, misread pagination styles, or confused core entities (e.g., "workspace" vs. "user" in CRMs like Attio). Without examples in training data, LLMs hallucinate.
3. Can't debug autonomously
LLMs failed to generate working multi-step flows, even when the individual steps were easy (download HTML → convert it to Markdown via the Jinja API → analyze with OpenAI). They struggled to understand failures and failed to auto-correct or debug follow-on steps; superglue's stepwise planning and self-healing isolated and repaired errors correctly.
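A minimal sketch of what that step isolation looks like, assuming hypothetical endpoints: each step is validated and retried on its own, so a broken step 2 never silently poisons step 3.

```js
// Retry a single step until its output passes a cheap validation check.
async function retryStep(name, fn, check, attempts = 3) {
  for (let i = 1; i <= attempts; i++) {
    try {
      const out = await fn();
      if (check(out)) return out;                 // only validated output reaches the next step
      console.warn(`${name}: output failed validation (attempt ${i})`);
    } catch (err) {
      console.warn(`${name}: ${err.message} (attempt ${i})`);
    }
  }
  throw new Error(`${name} failed after ${attempts} attempts`);
}

// Step 1: download the HTML.
const html = await retryStep(
  "download",
  () => fetch("https://example.com/post").then(r => r.text()),
  out => out.includes("<html")
);

// Step 2: convert to Markdown via a hypothetical conversion endpoint.
const markdown = await retryStep(
  "convert",
  () => fetch("https://converter.example.com/to-markdown", {
    method: "POST",
    headers: { "Content-Type": "text/html" },
    body: html,
  }).then(r => r.text()),
  out => out.trim().length > 0
);

// Step 3 would pass the validated Markdown to the analysis call (e.g. OpenAI).
```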
4. Don't handle Auth reliably
While modern services with API keys or bot tokens were fine, LLMs struggled badly with legacy and edge cases (e.g., OAuth1 or multi-step handshakes), because they can't persist and reason over intermediate auth state.
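To make the "intermediate state" point concrete, here is a compressed sketch of an OAuth 1.0a handshake against an illustrative provider: the token secret returned in step 1 has to survive until step 3, because it feeds into the later signature. Endpoints and credentials are placeholders.

```js
import crypto from "node:crypto";

const CONSUMER_KEY = process.env.CONSUMER_KEY;
const CONSUMER_SECRET = process.env.CONSUMER_SECRET;

// RFC 5849 percent-encoding (stricter than plain encodeURIComponent).
const enc = s =>
  encodeURIComponent(s).replace(/[!'()*]/g, c => "%" + c.charCodeAt(0).toString(16).toUpperCase());

// HMAC-SHA1 signature over the normalized request. tokenSecret comes from the
// previous step; dropping it is exactly the state-handling failure described above.
function oauth1Signature(method, url, params, consumerSecret, tokenSecret = "") {
  const normalized = Object.keys(params).sort()
    .map(k => `${enc(k)}=${enc(params[k])}`).join("&");
  const base = [method.toUpperCase(), enc(url), enc(normalized)].join("&");
  const key = `${enc(consumerSecret)}&${enc(tokenSecret)}`;
  return crypto.createHmac("sha1", key).update(base).digest("base64");
}

// Step 1: obtain a temporary request token (illustrative endpoint).
const reqUrl = "https://api.example.com/oauth/request_token";
const reqParams = {
  oauth_consumer_key: CONSUMER_KEY,
  oauth_nonce: crypto.randomBytes(16).toString("hex"),
  oauth_signature_method: "HMAC-SHA1",
  oauth_timestamp: Math.floor(Date.now() / 1000).toString(),
  oauth_version: "1.0",
  oauth_callback: "oob",
};
reqParams.oauth_signature = oauth1Signature("POST", reqUrl, reqParams, CONSUMER_SECRET);
const step1 = new URLSearchParams(
  await (await fetch(reqUrl, { method: "POST", body: new URLSearchParams(reqParams) })).text()
);
const requestToken = step1.get("oauth_token");
const requestTokenSecret = step1.get("oauth_token_secret"); // must be kept for step 3

// Step 2: the user authorizes requestToken and comes back with an oauth_verifier.
// Step 3: exchange requestToken + verifier for an access token, signing with
// requestTokenSecret from step 1 (same pattern as above, against /oauth/access_token).
```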
5. APIs were built for humans, not agents
When API design is informal or inconsistent, LLMs invent plausible but wrong parameters (e.g., the API expects a query parameter, but the LLM puts it in the request body) or path shapes. superglue retrieves the exact OpenAPI spec for an endpoint while it builds a tool and thereby knows which query parameters to pass.
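A trivial but representative example, using a made-up endpoint: the filter belongs in the query string, while the body-based call LLMs tend to produce looks equally plausible but fails.

```js
// Correct: the (hypothetical) endpoint expects "status" as a query parameter.
const url = new URL("https://api.example.com/v1/orders");
url.searchParams.set("status", "open");
const res = await fetch(url, { headers: { Authorization: `Bearer ${process.env.API_KEY}` } });
const orders = await res.json();

// Common failure mode: same data, wrong location. Most servers ignore or reject
// a GET body, so the filter silently does nothing or the request errors out.
// await fetch("https://api.example.com/v1/orders", {
//   method: "GET",
//   body: JSON.stringify({ status: "open" }),
// });
```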
Why This Matters
If your roadmap depends on agents wiring systems together, documentation coverage and error-recovery matter more than reasoning capabilities. API-Bench shows that:
- Specialization beats generality for integration reliability (self-healing, doc retrieval, spec anchoring).
- Agentic execution scaffolding (step isolation, retries) is as important as model choice.
- Niche systems remain the decisive gap for general LLMs.