Agent x API Benchmark

Which LLMs build the best integrations? Which APIs actually work with agents?

The Real Cost of Bad API Design

A major enterprise spent 6 months building AI agents to automate their vendor integrations. The project failed spectacularly. Not because the AI wasn't smart enough—GPT-4.1 scored 98% on their evaluation benchmarks. The real culprit? APIs that were impossible for agents to use.

Custom authentication schemes that required human interpretation. Endpoints that returned different response structures based on mysterious query parameters. Error messages like "Operation failed: Error code 4027" with no documentation. The AI agents were brilliant at understanding intent but helpless against APIs designed for humans clicking through UIs.

This isn't an isolated incident. As enterprises race to deploy AI agents, they're discovering a harsh truth: Most APIs weren't built for autonomous systems.

Why This Benchmark Exists

Current AI benchmarks tell you if a model can write Shakespeare or solve math problems. They don't tell you if it can actually integrate with your CRM, update your billing system, or sync data between your tools. We built this benchmark to answer the questions that actually matter:

  • Which LLMs can reliably build working integrations?
  • Which APIs are actually usable by autonomous agents?
  • Where do AI agents fail, and why?
  • What makes an API "agent-ready"?

We put 21 real-world APIs through multi-step integration tasks, 5 attempts each, across 6 AI platforms. No toy examples. No cherry-picked successes. Just real integration tasks that your agents will face in production.

The Problem

TL;DR: We tested 21 APIs across 6 platforms (5 general-purpose LLMs plus superglue). The results? Most APIs fail on at least one platform.

Out of 630 integration attempts (21 APIs × 6 platforms × 5 attempts each):

  • 23% of attempts failed - The agent couldn't even complete basic tasks
  • Only 6 APIs worked 100% of the time across all platforms
  • Custom query and request schemes are the biggest struggle - they usually require careful planning and prompt engineering
  • superglue beats general-purpose LLMs by 30+ points - Purpose-built wins

This isn't about AI capability—it's about API design. When an agent can write perfect code but can't figure out your authentication flow, that's not an AI problem. It's an API problem.

Note: superglue is our purpose-built AI agent platform designed specifically for API integrations, not a general-purpose LLM. We included it to show the performance gap between specialized agent systems and general language models.

Best LLMs for Building Integrations

Average success rate across all tested API integration tasks:

Rank | LLM | Success Rate

[1] superglue is a purpose-built intent-to-API-execution agent, not a general-purpose LLM

Best Agent-Ready APIs

Which APIs can agents figure out and use without human help?

Rank | API | Score | superglue | claude-4-sonnet | claude-4-opus | gpt-4.1 | o4-mini | gemini-2.5-flash

Key Findings

  • 84% vs 50-62%: The specialized agent platform outperforms general-purpose LLMs by up to 34 points
  • GitLab, Shopify, SendGrid: Among the APIs that work 100% across all platforms
  • Authentication kills agents: Custom auth schemes = guaranteed failure
  • Multi-step workflows expose weaknesses: Most LLMs can't chain API calls reliably

What Makes APIs Agent-Ready

  • Clear endpoints: /users/123 not /v2/entities?type=user&id=123
  • Standard auth: OAuth, Bearer tokens, API keys in headers
  • Real error messages: "User not found" not "Error 1047"
  • Consistent responses: Same structure every time
  • No custom query languages or weird filters
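To make the checklist above concrete, here is a minimal Python sketch of what an agent-friendly request and error body might look like. The host `api.example.com`, the helper names, and the error field names (`status`, `error`, `hint`) are assumptions made for illustration, not part of any API in the benchmark. The request is only constructed, never sent.

```python
import json
from urllib.request import Request

BASE_URL = "https://api.example.com"  # hypothetical host, for illustration only


def build_get_user_request(user_id: int, token: str) -> Request:
    """Build (but do not send) a request to a clear, resource-style endpoint."""
    return Request(
        url=f"{BASE_URL}/users/{user_id}",  # /users/123, not /v2/entities?type=user&id=123
        headers={
            "Authorization": f"Bearer {token}",  # standard auth carried in a header
            "Accept": "application/json",
        },
        method="GET",
    )


def error_body(status: int, message: str, hint: str) -> str:
    """Serialize a consistent, self-describing error (field names are assumptions)."""
    return json.dumps({"status": status, "error": message, "hint": hint})


req = build_get_user_request(123, "YOUR_TOKEN")
print(req.get_method(), req.full_url)
print(req.get_header("Authorization"))
print(error_body(404, "User not found", "Check the id in /users/{id}"))
```

An agent reading `"User not found"` alongside a hint can decide what to do next. An opaque code like "Error 1047" gives it nothing to act on without external documentation.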

Methodology

21 real integration tasks. 6 platforms. 5 attempts each. Pass or fail.

All evaluation code is open source. Check out the full benchmark implementation on GitHub to run your own tests or contribute new APIs.

Real Tasks

Multi-step workflows like "fetch users, filter by criteria, send notifications"
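As a rough illustration of the shape of such a workflow, the Python sketch below chains the three steps using in-memory stand-ins instead of real API calls. The `User` fields, the function names, and the canned data are all hypothetical and are not taken from the benchmark code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class User:
    id: int
    email: str
    active: bool


def fetch_users() -> list[User]:
    """Stand-in for a GET /users call; returns canned data instead of hitting an API."""
    return [
        User(1, "a@example.com", True),
        User(2, "b@example.com", False),
        User(3, "c@example.com", True),
    ]


def filter_users(users: list[User], predicate: Callable[[User], bool]) -> list[User]:
    """Step 2: keep only the users that match the criteria."""
    return [u for u in users if predicate(u)]


def send_notification(user: User, message: str) -> dict:
    """Stand-in for a POST to a notification endpoint; returns a fake receipt."""
    return {"to": user.email, "message": message, "status": "queued"}


def run_workflow() -> list[dict]:
    """Chain the steps: each one consumes the previous step's output."""
    users = fetch_users()
    active = filter_users(users, lambda u: u.active)
    return [send_notification(u, "Your account was synced") for u in active]


for receipt in run_workflow():
    print(receipt)
```

Each step depends on the output of the one before it, so a single misread response structure early in the chain breaks everything after it.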

No Retries

If it fails, it fails. No hand-holding or prompt engineering.

Same Prompts

Identical instructions across all platforms. Fair comparison.
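The pass/fail scoring described above can be tallied in a few lines. The Python sketch below is an assumption about how such results might be aggregated, not the benchmark's actual implementation. The tuple layout and the sample records are made up for illustration.

```python
from collections import defaultdict

# Size of the full grid described in the methodology.
print(21 * 6 * 5)  # 630 attempts


def success_rates(results):
    """Per-platform success rate from (platform, api, passed) records, one per attempt."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for platform, _api, passed in results:
        totals[platform] += 1
        if passed:
            passes[platform] += 1
    return {p: passes[p] / totals[p] for p in totals}


def fully_reliable_apis(results):
    """APIs for which every recorded attempt passed, on every platform."""
    all_apis = {api for _, api, _ in results}
    failed = {api for _, api, passed in results if not passed}
    return sorted(all_apis - failed)


# Made-up sample records; real runs would have 5 attempts per API per platform.
sample = [
    ("platform-a", "api-x", True),
    ("platform-a", "api-y", False),
    ("platform-b", "api-x", True),
    ("platform-b", "api-y", True),
]
print(success_rates(sample))
print(fully_reliable_apis(sample))
```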
