LLM API Benchmark v2
API-Bench: A benchmark for agentic API integrations
Which LLMs can reliably build working integrations?
This is v2 of API-Bench. API-Bench evaluates how well language models can execute against APIs:
- Which LLMs can reliably build working integrations into your tech stack?
- Where do agents fail, and why?
AI agents are getting better at using tools. But "better" isn't the same as "reliable." API-Bench measures what really matters for production: can a model build working integrations into real-world APIs, end-to-end?
API-Bench evaluates agentic API execution: chaining requests, handling auth, following specs, paginating, transforming results, and recovering from errors across multi-step workflows. In short: Which models can actually work with your stack, where do agents fail, and why?
Best LLMs for Building Integrations
Average success rate across all tested API integration tasks:
| Rank | LLM | Success Rate | Successful Tasks |
|---|---|---|---|
| 1 | superglue [1] | 88% | 36/41 |
| 2 | Claude Sonnet 4.5 | 80% | 33/41 |
| 3 | GPT-5 | 63% | 26/41 |
| 4 | Claude Sonnet 4 | 63% | 26/41 |
| 5 | GPT-4.1 | 60% | 25/41 |
| 6 | Gemini 2.5 Pro | 58.5% | 24/41 |
| 7 | Gemini 2.5 Flash | 43% | 18/41 |
[1] superglue is an integration layer designed specifically for agent-API integrations, not a general-purpose LLM
What API-Bench Measures
TL;DR: We tested integration-building capabilities across 6 different LLMs.
- API spec adherence (correct paths, bodies, and headers)
- Auth handling (API keys, access tokens, legacy flows)
- Pagination & batching (single/multi-cursor, page- and offset-based; see the sketch after this list)
- Cross-API workflows (fetch (System 1) → enrich (System 2) → upload (System 3))
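As an illustration of the pagination category, the tasks require the kind of loop sketched below: follow a cursor until the API stops returning one. The endpoint and field names (cursor, next_cursor, items) are placeholders, not taken from any specific API in the benchmark.

```js
// Cursor-based pagination sketch; field names are illustrative placeholders.
async function fetchAll(baseUrl, apiKey) {
  const items = [];
  let cursor = null;
  do {
    const url = new URL(baseUrl);
    if (cursor) url.searchParams.set("cursor", cursor);   // pass the cursor from the previous page
    const res = await fetch(url, { headers: { Authorization: `Bearer ${apiKey}` } });
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    const page = await res.json();
    items.push(...page.items);
    cursor = page.next_cursor;                            // null/undefined once the last page is reached
  } while (cursor);
  return items;
}
```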
Task examples (no endpoints or credentials given):
Execution protocol. LLMs were prompted to output JavaScript code inside <<CODE>> tags, which we extracted and executed. Each model had 3 attempts. For superglue, we used sdk.buildWorkflow with the task instruction and the integration credentials as input; superglue processes documentation and applies self-healing loops that the plain LLMs do not perform. All validators and harness prompts were identical across conditions. The full evaluation is open-source for reproducibility: check out the benchmark implementation on GitHub to run your own tests or contribute new APIs.
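For context, the model-side harness boils down to a loop like the sketch below. model.generate and validate are illustrative stand-ins for the real prompt call and per-task validator, and the closing-tag format is assumed; see the GitHub repository for the actual implementation.

```js
// Simplified sketch of the per-task harness loop (not the benchmark's real code):
// extract the tagged JavaScript, execute it, and validate the result.
async function runTask(model, task, validate, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const completion = await model.generate(task.prompt);            // hypothetical model interface
    const match = completion.match(/<<CODE>>([\s\S]*?)<<\/CODE>>/);  // assumed tag format
    if (!match) continue;
    try {
      const result = await new Function(`return (async () => { ${match[1]} })();`)();
      if (await validate(result)) return { success: true, attempt };
    } catch (err) {
      // a runtime error simply consumes one of the attempts
    }
  }
  return { success: false };
}
```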
Note: superglue is an integration layer designed specifically for agent-API integrations, not a general-purpose LLM. We included it to show the performance gap between specialized agent systems and general language models.
Key Findings
LLMs are bad at writing working integration code. Here's why:
1. Trained on outdated documentation
For instance, in tasks requiring OpenAI's newer Responses API with file inputs, most LLMs attempted legacy multi-part uploads instead of passing the new input_file content object. This is a particularly frustrating failure mode because it looks "reasonable" in code but doesn't run.
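For reference, the newer pattern looks roughly like this with OpenAI's official Node SDK (the model name, file, and prompt are placeholders, and field names may shift between SDK versions):

```js
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Upload the file once, then reference it via an input_file content part
// instead of a legacy multi-part form upload.
const file = await client.files.create({
  file: fs.createReadStream("invoice.pdf"),
  purpose: "user_data",
});

const response = await client.responses.create({
  model: "gpt-4.1",
  input: [
    {
      role: "user",
      content: [
        { type: "input_file", file_id: file.id },
        { type: "input_text", text: "Extract the invoice total." },
      ],
    },
  ],
});

console.log(response.output_text);
```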
2. Missing training data for niche and long-tail systems
For industry-specific or lesser-known systems (e.g., legacy systems or startups with solid but sparsely discussed APIs), LLMs guessed endpoints, misread pagination styles, or confused core entities (e.g., "workspace" vs. "user" in CRMs like Attio). Without examples in training data, LLMs hallucinate.
3. Can't debug autonomously
LLMs failed to generate working multi-step flows, even when the individual steps were easy (download HTML → convert it to Markdown via the Jinja API → analyze with OpenAI). They struggled to understand failures and failed to auto-correct or debug follow-on steps; superglue's stepwise planning and self-healing isolated and repaired errors correctly.
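A minimal sketch of what that step isolation looks like, assuming hypothetical endpoints: each step is validated and retried on its own, so a broken step 2 never silently poisons step 3.

```js
// Retry a single step until its output passes a cheap validation check.
async function retryStep(name, fn, check, attempts = 3) {
  for (let i = 1; i <= attempts; i++) {
    try {
      const out = await fn();
      if (check(out)) return out;                 // only validated output reaches the next step
      console.warn(`${name}: output failed validation (attempt ${i})`);
    } catch (err) {
      console.warn(`${name}: ${err.message} (attempt ${i})`);
    }
  }
  throw new Error(`${name} failed after ${attempts} attempts`);
}

// Step 1: download the HTML.
const html = await retryStep(
  "download",
  () => fetch("https://example.com/post").then(r => r.text()),
  out => out.includes("<html")
);

// Step 2: convert to Markdown via a hypothetical conversion endpoint.
const markdown = await retryStep(
  "convert",
  () => fetch("https://converter.example.com/to-markdown", {
    method: "POST",
    headers: { "Content-Type": "text/html" },
    body: html,
  }).then(r => r.text()),
  out => out.trim().length > 0
);

// Step 3 would pass the validated Markdown to the analysis call (e.g. OpenAI).
```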
4. Don't handle Auth reliably
While modern services with API keys or bot tokens were fine, LLMs struggled badly with legacy and edge cases (e.g., OAuth1 or multi-step handshakes), because they can't persist and reason over intermediate auth state.
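To make the "intermediate state" point concrete, here is a compressed sketch of an OAuth 1.0a handshake against an illustrative provider: the token secret returned in step 1 has to survive until step 3, because it feeds into the later signature. Endpoints and credentials are placeholders.

```js
import crypto from "node:crypto";

const CONSUMER_KEY = process.env.CONSUMER_KEY;
const CONSUMER_SECRET = process.env.CONSUMER_SECRET;

// RFC 5849 percent-encoding (stricter than plain encodeURIComponent).
const enc = s =>
  encodeURIComponent(s).replace(/[!'()*]/g, c => "%" + c.charCodeAt(0).toString(16).toUpperCase());

// HMAC-SHA1 signature over the normalized request. tokenSecret comes from the
// previous step; dropping it is exactly the state-handling failure described above.
function oauth1Signature(method, url, params, consumerSecret, tokenSecret = "") {
  const normalized = Object.keys(params).sort()
    .map(k => `${enc(k)}=${enc(params[k])}`).join("&");
  const base = [method.toUpperCase(), enc(url), enc(normalized)].join("&");
  const key = `${enc(consumerSecret)}&${enc(tokenSecret)}`;
  return crypto.createHmac("sha1", key).update(base).digest("base64");
}

// Step 1: obtain a temporary request token (illustrative endpoint).
const reqUrl = "https://api.example.com/oauth/request_token";
const reqParams = {
  oauth_consumer_key: CONSUMER_KEY,
  oauth_nonce: crypto.randomBytes(16).toString("hex"),
  oauth_signature_method: "HMAC-SHA1",
  oauth_timestamp: Math.floor(Date.now() / 1000).toString(),
  oauth_version: "1.0",
  oauth_callback: "oob",
};
reqParams.oauth_signature = oauth1Signature("POST", reqUrl, reqParams, CONSUMER_SECRET);
const step1 = new URLSearchParams(
  await (await fetch(reqUrl, { method: "POST", body: new URLSearchParams(reqParams) })).text()
);
const requestToken = step1.get("oauth_token");
const requestTokenSecret = step1.get("oauth_token_secret"); // must be kept for step 3

// Step 2: the user authorizes requestToken and comes back with an oauth_verifier.
// Step 3: exchange requestToken + verifier for an access token, signing with
// requestTokenSecret from step 1 (same pattern as above, against /oauth/access_token).
```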
5. APIs were built for humans, not agents
When API design is informal or inconsistent, LLMs invent plausible but wrong parameters (e.g., the API expects a query parameter, but the LLM puts it in the request body) or path shapes. superglue retrieves the exact OpenAPI spec for an endpoint while it builds a tool and thereby knows which query parameters to pass.
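A trivial but representative example, using a made-up endpoint: the filter belongs in the query string, while the body-based call LLMs tend to produce looks equally plausible but fails.

```js
// Correct: the (hypothetical) endpoint expects "status" as a query parameter.
const url = new URL("https://api.example.com/v1/orders");
url.searchParams.set("status", "open");
const res = await fetch(url, { headers: { Authorization: `Bearer ${process.env.API_KEY}` } });
const orders = await res.json();

// Common failure mode: same data, wrong location. Most servers ignore or reject
// a GET body, so the filter silently does nothing or the request errors out.
// await fetch("https://api.example.com/v1/orders", {
//   method: "GET",
//   body: JSON.stringify({ status: "open" }),
// });
```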
Why This Matters
If your roadmap depends on agents wiring systems together, documentation coverage and error-recovery matter more than reasoning capabilities. API-Bench shows that:
- Specialization beats generality for integration reliability (self-healing, doc retrieval, spec anchoring).
- Agentic execution scaffolding (step isolation, retries) is as important as model choice.
- Niche systems remain the decisive gap for general LLMs.