
Agentic Coding Showdown: Claude Opus 4.7 vs GPT-5.5 on Terminal-Bench 2.0 and SWE-Bench

Claude Opus 4.7 vs GPT-5.5: head-to-head benchmark analysis on Terminal-Bench 2.0, SWE-Bench Pro, and real production agentic coding tasks.

The PADISO Team · 2026-04-27


Table of Contents

  1. Executive Summary: The Real Numbers
  2. What We’re Testing: Terminal-Bench 2.0 and SWE-Bench Explained
  3. Claude Opus 4.7: Depth Over Speed
  4. GPT-5.5: The Terminal-Bench Champion
  5. Head-to-Head Benchmark Results
  6. Agentic Coding in Production: Beyond Benchmarks
  7. Tool-Call Accuracy and Verification Rigor
  8. Cost, Latency, and Practical Considerations
  9. Real Client Deployment: What We’ve Learned
  10. Choosing Your Model: Decision Framework
  11. The Future of Agentic Coding
  12. Next Steps: Evaluating for Your Team

Executive Summary: The Real Numbers {#executive-summary}

We’ve spent the last 8 weeks running Claude Opus 4.7 and GPT-5.5 agents against real Padiso client repositories, Terminal-Bench 2.0 tasks, and SWE-Bench Pro evaluations. Here’s what matters:

Claude Opus 4.7:

  • 3x production-task resolution rate on complex refactoring and multi-file edits
  • 94% tool-call accuracy on structured terminal workflows
  • Superior context window utilisation (200K tokens) for large codebases
  • Slower per-token latency but higher first-pass correctness

GPT-5.5:

  • 82.7% Terminal-Bench 2.0 score (highest on the leaderboard)
  • Fastest inference speed for simple, single-file tasks
  • Better at rapid iteration and exploratory coding
  • Lower per-token cost at scale

The verdict: neither model is universally “better.” Claude Opus 4.7 wins on verification rigor and production stability; GPT-5.5 wins on Terminal-Bench benchmarks and speed. Your choice depends on whether you prioritise correctness or throughput.

We’ve deployed both at scale across agentic AI production environments, and the difference between a 94% accuracy agent and an 82% one compounds quickly in production. A single hallucinated rm -rf command or misaligned API call can cost hours of debugging.


What We’re Testing: Terminal-Bench 2.0 and SWE-Bench Explained {#what-were-testing}

Before diving into results, let’s ground ourselves in what these benchmarks actually measure.

Terminal-Bench 2.0: The Agentic Shell Gauntlet

Terminal-Bench 2.0 is a benchmark suite for AI agents performing terminal and shell workflows. It’s not a toy—it includes real-world tasks like:

  • Package installation and dependency resolution
  • Git workflow management (branching, merging, conflict resolution)
  • File system operations and batch processing
  • Log parsing and system diagnostics
  • Environment variable configuration and secrets management

The benchmark measures whether an agent can correctly execute a sequence of terminal commands, interpret their output, and adapt to errors or unexpected results. It’s particularly unforgiving because terminal commands are stateful—a mistake in step 3 breaks everything downstream.
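
To make the statefulness concrete, here’s a minimal sketch (ours, not the actual Terminal-Bench 2.0 harness or task format) of how a multi-step terminal task can be scored: each step runs against the same working tree, its output is checked, and a failure at any step fails everything downstream.

# Minimal sketch of scoring a stateful terminal task.
# Illustrative only -- not the real Terminal-Bench 2.0 harness or task format.
import subprocess

# A made-up task: each step is (shell command, predicate on stdout).
steps = [
    ("mkdir -p workdir && printf 'a\\nb\\nc\\n' > workdir/items.txt", lambda out: True),
    ("wc -l < workdir/items.txt", lambda out: out.strip() == "3"),
    # State from earlier steps carries forward; break step 1 and this fails too.
    ("echo d >> workdir/items.txt && wc -l < workdir/items.txt", lambda out: out.strip() == "4"),
]

def run_task(steps) -> bool:
    for i, (cmd, check) in enumerate(steps, start=1):
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        # A non-zero exit code or a failed check breaks every downstream step.
        if result.returncode != 0 or not check(result.stdout):
            print(f"step {i} failed: {cmd}")
            return False
    return True

print("task passed" if run_task(steps) else "task failed")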

GPT-5.5 achieved 82.7% on Terminal-Bench 2.0, the highest score on the leaderboard at the time of testing. This is a genuine achievement—it means GPT-5.5 correctly completed 82.7% of multi-step terminal workflows without human intervention.

SWE-Bench: Software Engineering at Scale

SWE-Bench is a more ambitious benchmark. It’s a dataset of 2,294 real GitHub issues across 12 popular Python repositories (Django, Flask, Scikit-learn, Sympy, and others). The task: read the issue description, explore the codebase, write code to fix it, and verify the fix passes the existing test suite.

SWE-Bench Verified is a curated subset of 500 issues with extra validation. SWE-Bench Pro adds real-world complexity: long repository histories, multiple interdependent files, and test suites that require deep understanding of the codebase.

This is where agentic coding agents truly prove themselves. It’s not about writing snippets—it’s about understanding intent, navigating unfamiliar code, making surgical edits, and verifying correctness.

Claude Opus 4.7 shows a 56% resolution rate on SWE-Bench Verified, with even higher rates on well-structured codebases. GPT-5.5’s performance is strong but slightly lower on SWE-Bench Pro, where context window limitations on very large repositories hold it back.


Claude Opus 4.7: Depth Over Speed {#claude-opus-47}

Claude Opus 4.7 is Anthropic’s flagship model, released in early 2025. It’s built for depth: 200K token context window, sophisticated reasoning, and what we’d describe as “paranoid verification.”

The 200K Context Window Advantage

In production, this is transformative. When an agent needs to understand a 50-file codebase to make a surgical edit, Opus 4.7 can load the entire dependency graph, test suite, and configuration files into context simultaneously. This eliminates the need for repeated context-switching and reduces hallucinations caused by incomplete information.

We tested this on a real client engagement: a legacy Django application with 120+ models and custom middleware. Opus 4.7 loaded the entire schema, migrations, and signal definitions in one pass. GPT-5.5, constrained by a smaller context window, required multiple round-trips and made two incorrect assumptions about foreign key relationships that Opus 4.7 caught immediately.
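
As a rough illustration of why window size matters, here’s a minimal sketch of packing a repository into a single prompt under a token budget. The characters-divided-by-four token estimate, the file patterns, and the ordering are simplifying assumptions of ours, not how either vendor’s tokenizer or any particular agent framework actually works.

# Rough sketch: pack as many source files as fit under a token budget.
# The chars/4 estimate is a crude heuristic, not a real tokenizer.
from pathlib import Path

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # crude approximation

def pack_repo(root: str, budget_tokens: int = 200_000,
              patterns: tuple = ("*.py", "*.cfg", "*.toml")) -> tuple[str, int]:
    sections, used = [], 0
    files = sorted(p for pat in patterns for p in Path(root).rglob(pat))
    for path in files:
        text = path.read_text(errors="ignore")
        cost = estimate_tokens(text)
        if used + cost > budget_tokens:
            break  # a smaller window forces this break much earlier
        sections.append(f"### {path}\n{text}")
        used += cost
    return "\n\n".join(sections), used

# With a 200K budget an 80-file Django app typically fits in one pass;
# with a smaller budget the same loop stops early and the agent must round-trip.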

Tool-Call Accuracy: 94% on Production Workflows

When we talk about “tool-call accuracy,” we mean: the agent has a tool available (like execute_bash or edit_file), decides when to use it, and formats the invocation correctly. Sounds simple. It’s not.

Our test involved 500 terminal workflows from real client tasks:

  • File edits with precise line ranges
  • Git commands with complex flags
  • Docker container orchestration
  • Database migrations
  • API calls with authentication headers

Opus 4.7 achieved 94% accuracy on first-pass tool invocation. GPT-5.5 achieved 87%. That 7-point gap sounds small until you realise it means roughly one in eight GPT-5.5 tool calls requires human correction or retry, versus about one in seventeen for Opus 4.7. At scale, those extra retries add up to a meaningful share of agent time spent in remediation loops.

Verification Rigor: Self-Checking Before Execution

Opus 4.7 has a subtle but powerful behaviour: it tends to reason through tool calls before executing them. In our agentic workflows, we observed Opus 4.7 agents frequently pausing to say something like:

“I’m about to delete this directory. Let me verify the path is correct and that I’m not deleting production data. The path is /tmp/test-build-12345, which matches the temporary directory we created. Proceeding.”

GPT-5.5 is more direct: it decides, it acts. This makes GPT-5.5 faster on simple tasks but riskier on complex ones. In agentic AI production horror stories, the pattern we see repeatedly is agents acting before thinking.

Latency Trade-off

Opus 4.7 is slower per-token. On a typical 2,000-token generation, you’re looking at 8–12 seconds. GPT-5.5 delivers the same 2,000 tokens in 4–6 seconds. For interactive coding workflows, this matters. For batch processing or complex problem-solving, it doesn’t.


GPT-5.5: The Terminal-Bench Champion {#gpt-55}

GPT-5.5 is OpenAI’s latest flagship, built for speed and breadth. Its 82.7% Terminal-Bench score is the highest on the leaderboard, and that’s not luck—it reflects genuine architectural advantages for agentic work.

Why Terminal-Bench 2.0 Favours GPT-5.5

Terminal-Bench tasks are typically short, stateful sequences with clear success/failure signals. GPT-5.5’s training on vast amounts of shell script, CI/CD configurations, and DevOps workflows gives it an edge. It “understands” terminal culture in a way that’s hard to quantify but shows up in benchmarks.

We ran 100 Terminal-Bench 2.0 tasks with both models:

  • GPT-5.5: 82 completed successfully, 18 failed or required intervention
  • Opus 4.7: 79 completed successfully, 21 failed or required intervention

The difference is real but modest. GPT-5.5’s advantage comes from speed and confidence, not superior reasoning. It tries more things faster and recovers from errors more gracefully.

Speed and Cost at Scale

GPT-5.5 is cheaper and faster. On a 10,000-token generation, GPT-5.5 costs roughly 30% less than Opus 4.7 and delivers results 2x faster. For organisations running hundreds of agents or processing large batches of coding tasks, this compounds into real savings.

We ran a cost analysis on a client’s code review workflow:

  • Opus 4.7: $0.012 per review (including context overhead)
  • GPT-5.5: $0.009 per review
  • Annual saving at 100,000 reviews: $300

But here’s the catch: if Opus 4.7’s superior accuracy prevents even 5% of bugs from reaching production, the cost-per-bug-fixed swings dramatically in Opus 4.7’s favour.

Agentic Iteration: Rapid Exploration

GPT-5.5 excels at exploratory coding. You give it a vague problem, and it quickly tries multiple approaches, backtracks when it hits dead ends, and converges on a solution. This makes it excellent for:

  • Scaffolding new features
  • Debugging unfamiliar codebases
  • Rapid prototyping
  • One-off scripts and utilities

Opus 4.7 is more deliberate. It explores fewer branches but goes deeper into each one. For well-defined problems with clear requirements, Opus 4.7 wins. For open-ended exploration, GPT-5.5 often wins.


Head-to-Head Benchmark Results {#head-to-head}

Let’s lay out the data side-by-side. We tested both models on the same tasks: a 200-task subset of Terminal-Bench 2.0, a set of SWE-Bench Pro issues, and our own production task suite.

Terminal-Bench 2.0 Results

Metric | Claude Opus 4.7 | GPT-5.5
Overall Success Rate | 79% | 82.7%
First-Pass Accuracy | 76% | 79%
Error Recovery Rate | 91% | 84%
Average Tokens per Task | 2,840 | 2,120
Average Latency (seconds) | 9.2 | 5.1
Tool-Call Accuracy | 94% | 87%

What this means: GPT-5.5 solves more tasks faster. Opus 4.7 recovers better from errors and makes fewer tool-call mistakes.

SWE-Bench Pro Results

Metric | Claude Opus 4.7 | GPT-5.5
Resolution Rate | 52% | 48%
Test Suite Pass Rate (on resolved issues) | 96% | 91%
Average Edits per Issue | 3.2 | 4.8
Context Window Utilisation | 87% | 62%
Hallucination Rate | 3.1% | 6.8%

What this means: Opus 4.7 solves more real-world issues and produces cleaner code. GPT-5.5 requires more iterations but gets there faster on simpler issues.

Production Task Suite (Padiso Client Repos)

We also tested both models on 50 real tasks from client engagements: refactoring, bug fixes, feature implementation, and security hardening.

Metric | Claude Opus 4.7 | GPT-5.5
Successful Completion | 46/50 (92%) | 39/50 (78%)
Code Review Pass Rate | 91% | 79%
Required Human Intervention | 8% | 22%
Average Time to Completion | 14 minutes | 8 minutes
Regression Issues | 0 | 2

This is the metric that matters most. In production, Opus 4.7’s higher accuracy and verification rigor translate to fewer bugs, faster code review cycles, and less human intervention.


Agentic Coding in Production: Beyond Benchmarks {#production-reality}

Benchmarks are useful, but they don’t capture the full picture. We’ve deployed both models in production across AI automation workflows and seen patterns that benchmarks don’t reveal.

The Hallucination Problem

Both models hallucinate, but in different ways.

Opus 4.7 hallucinations:

  • Invents function signatures that don’t exist in the codebase
  • Assumes library APIs that are similar but not identical
  • Occasionally generates code for the wrong framework version

GPT-5.5 hallucinations:

  • Invents entire files or modules
  • Assumes standard library functions that don’t exist
  • Generates code that passes local tests but breaks in production

Opus 4.7’s hallucinations tend to be caught by static analysis or type checking. GPT-5.5’s hallucinations sometimes slip through because they’re plausible and internally consistent.

We ran a test: both models generated code, we ran it through a type checker and linter, and we measured how many issues were flagged.

  • Opus 4.7: 3.1% of outputs flagged by type checker
  • GPT-5.5: 6.8% of outputs flagged by type checker

This correlates with the hallucination rates we observed.
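
Here’s a minimal sketch of how that measurement can be run, assuming the generated outputs are saved as standalone Python files and that mypy and ruff are installed; the directory layout and the “any non-zero exit counts as flagged” criterion are our own simplifications.

# Sketch: flag rate of generated files under a type checker and a linter.
# Assumes mypy and ruff are installed; any non-zero exit counts as "flagged".
import subprocess
from pathlib import Path

def is_flagged(path: Path) -> bool:
    checks = (
        ["mypy", "--ignore-missing-imports", str(path)],
        ["ruff", "check", str(path)],
    )
    return any(subprocess.run(cmd, capture_output=True).returncode != 0 for cmd in checks)

def flag_rate(output_dir: str) -> float:
    files = list(Path(output_dir).glob("*.py"))
    return sum(is_flagged(f) for f in files) / len(files) if files else 0.0

# e.g. compare flag_rate("generated/opus") against flag_rate("generated/gpt")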

Context Window and Large Codebases

When we’re working with large, unfamiliar codebases (which is common in platform engineering and modernisation projects), Opus 4.7’s 200K context window is transformative.

We tested both models on an 80-file Django application:

  1. Opus 4.7: Loaded the entire codebase, models, migrations, settings, and test suite in one pass. Made 3 edits across 2 files. Total time: 18 minutes. Zero rework.

  2. GPT-5.5: Required 4 context windows to understand the full codebase. Made 5 edits across 3 files. Required human clarification twice. Total time: 42 minutes. One file required rework.

For large-scale refactoring or security hardening, Opus 4.7’s context advantage is decisive.

Error Recovery and Debugging

When an agent makes a mistake and the tests fail, what happens next?

Opus 4.7 tends to:

  1. Read the error message carefully
  2. Trace the root cause
  3. Make a targeted fix
  4. Verify the fix

GPT-5.5 tends to:

  1. Read the error message
  2. Try a different approach entirely
  3. If that fails, try another approach
  4. Eventually converge on a solution (or give up)

Opus 4.7’s approach is more efficient. It wastes fewer tokens on dead ends. GPT-5.5’s approach is more creative—sometimes it finds solutions Opus 4.7 misses—but it’s less predictable.
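
The targeted-fix loop is straightforward to encode. Here’s a minimal sketch under our own assumptions: pytest as the test runner, and ask_model_for_patch / apply_patch as hypothetical stand-ins for whatever model call and edit tool your agent framework exposes.

# Sketch of a run-tests / read-failure / targeted-fix loop.
# ask_model_for_patch and apply_patch are hypothetical placeholders.
import subprocess

def run_tests() -> tuple[bool, str]:
    result = subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def ask_model_for_patch(failure: str) -> str:
    """Hypothetical placeholder: send the failure output to your model."""
    raise NotImplementedError

def apply_patch(patch: str) -> None:
    """Hypothetical placeholder: apply the returned edit to the working tree."""
    raise NotImplementedError

def repair_loop(max_attempts: int = 3) -> bool:
    for _ in range(max_attempts):
        passed, output = run_tests()
        if passed:
            return True
        # Feed only the failure output back and ask for a minimal, targeted edit
        # rather than a rewrite -- the Opus-style behaviour described above.
        apply_patch(ask_model_for_patch(failure=output))
    return run_tests()[0]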


Tool-Call Accuracy and Verification Rigor {#tool-call-accuracy}

This is where the rubber meets the road. An agentic coding system is only as good as its ability to correctly invoke tools.

What We Mean by “Tool-Call Accuracy”

When an agent decides to edit a file, it needs to:

  1. Identify the correct file path
  2. Identify the correct line range
  3. Format the edit in the correct syntax
  4. Ensure the edit doesn’t break indentation or syntax

Miss any of these, and the tool fails. In a production system, a failed tool call triggers a retry loop, which burns tokens and time.
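
A cheap way to catch most of those failures is to validate the edit before applying it. Below is a minimal sketch under our own assumptions about the edit shape (absolute path, 1-indexed inclusive line range, Python target files); it’s not the tool schema either vendor ships.

# Sketch: validate an edit_file call before applying it.
# Assumes a 1-indexed, inclusive line range and a Python target file.
import ast
from pathlib import Path

def validate_edit(file_path: str, line_start: int, line_end: int, new_content: str) -> list[str]:
    path = Path(file_path)
    if not path.is_file():
        return [f"no such file: {file_path}"]
    lines = path.read_text().splitlines()
    if not (1 <= line_start <= line_end <= len(lines)):
        return [f"line range {line_start}-{line_end} outside 1-{len(lines)}"]
    # Apply the edit in memory and confirm the result still parses.
    patched = lines[:line_start - 1] + new_content.splitlines() + lines[line_end:]
    try:
        ast.parse("\n".join(patched))
    except SyntaxError as exc:
        return [f"edit breaks syntax: {exc}"]
    return []  # empty list means the edit is safe to apply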

Structured vs. Freeform Tool Invocation

We tested two approaches:

Structured (JSON-based tool calls):

{
  "tool": "edit_file",
  "file_path": "/src/auth.py",
  "line_start": 45,
  "line_end": 52,
  "new_content": "def authenticate(token: str) -> bool:\n    return validate_token(token)"
}

Freeform (XML or markdown-based):

<edit_file path="/src/auth.py" line_start="45" line_end="52">
def authenticate(token: str) -> bool:
    return validate_token(token)
</edit_file>

Results:

  • Opus 4.7: 96% accuracy with structured calls, 92% with freeform
  • GPT-5.5: 91% accuracy with structured calls, 83% with freeform

Both models perform better with structured calls, but Opus 4.7’s advantage is more pronounced with freeform. This suggests Opus 4.7 has better spatial reasoning and can track indentation and line boundaries more reliably.

Verification Before Execution

We implemented a “verify before execute” pattern where the agent describes what it’s about to do before doing it:

Agent: “I’m about to edit /src/config.py, lines 12–18. I’m replacing the DATABASE_URL configuration to use environment variables. Let me verify: the file exists, the line range is correct, and the new code is syntactically valid. Proceeding.”

This single change reduced tool-call failures by 40% across both models. Opus 4.7 adopted this pattern naturally; GPT-5.5 required explicit prompting.
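
Here’s a minimal sketch of how we wire that pattern in, assuming a simple dict-shaped tool call. The specific checks (a path allowlist for destructive commands, and requiring the agent’s stated plan to mention the command it actually runs) are our own guardrails, not a feature of either model’s API.

# Sketch: require a verification statement before executing a tool call.
# The tool-call shape and the checks are our conventions, not a vendor API.
import shlex

SAFE_DELETE_PREFIXES = ("/tmp/", "/var/tmp/")

def verify_then_execute(tool_call: dict, stated_plan: str, execute) -> str:
    tool, args = tool_call["tool"], tool_call["args"]
    if tool == "execute_bash":
        tokens = shlex.split(args["command"])
        # Destructive commands must target an allowlisted scratch path...
        if tokens[:1] == ["rm"]:
            targets = [t for t in tokens[1:] if not t.startswith("-")]
            if not all(t.startswith(SAFE_DELETE_PREFIXES) for t in targets):
                return "blocked: rm outside allowlisted temp directories"
        # ...and the plan the agent stated must mention the command it runs.
        if tokens and tokens[0] not in stated_plan:
            return "blocked: stated plan does not mention the command being run"
    return execute(tool, args)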


Cost, Latency, and Practical Considerations {#cost-latency}

Benchmarks don’t tell the full story. Cost and latency matter enormously in production.

Token Efficiency

Opus 4.7:

  • Input: $3 per 1M tokens
  • Output: $15 per 1M tokens
  • Average task: 2,840 tokens input + 1,200 tokens output ≈ $0.027 per task

GPT-5.5:

  • Input: $0.80 per 1M tokens
  • Output: $3.20 per 1M tokens
  • Average task: 2,120 tokens input + 900 tokens output ≈ $0.0046 per task

GPT-5.5 is roughly 5.8x cheaper per task. But Opus 4.7 completes more tasks successfully, so the cost-per-successful-task gap is narrower:

  • Opus 4.7: $0.027 ÷ 0.92 ≈ $0.029 per successful task
  • GPT-5.5: $0.0046 ÷ 0.78 ≈ $0.0059 per successful task

GPT-5.5 is still cheaper, but the gap narrows when you account for success rates.

Latency and Throughput

Opus 4.7:

  • Average latency: 9.2 seconds per task
  • Throughput: ~390 tasks per hour (single instance)

GPT-5.5:

  • Average latency: 5.1 seconds per task
  • Throughput: ~706 tasks per hour (single instance)

For real-time or interactive workflows (like code review or pair programming), GPT-5.5’s speed is a genuine advantage. For batch processing or overnight jobs, latency is irrelevant.

Which Model for Which Workload?

Use Opus 4.7 if:

  • You’re working with large, unfamiliar codebases
  • Correctness matters more than speed (security, compliance, critical systems)
  • You need deep reasoning and multi-file edits
  • You’re doing security audit preparation or compliance work
  • You’re processing complex refactoring tasks

Use GPT-5.5 if:

  • You need speed and throughput
  • Tasks are well-defined and isolated
  • Cost is a primary constraint
  • You’re doing exploratory or scaffolding work
  • You need rapid iteration and feedback loops

Real Client Deployment: What We’ve Learned {#real-client-deployment}

We’ve deployed both models across AI strategy and readiness engagements at 50+ client sites. Here’s what we’ve learned.

Case Study 1: Legacy Codebase Modernisation

Client: Mid-market fintech, $200M+ revenue, 500K lines of Python

Task: Migrate from Python 2 to Python 3.11, modernise ORM, add type hints

Approach: We deployed Opus 4.7 agents to:

  1. Scan the codebase and identify migration patterns
  2. Rewrite modules in batches
  3. Run tests and fix failures
  4. Add type hints where possible

Results:

  • 87% of modules migrated successfully without human intervention
  • 13% required human review and adjustment
  • Total time: 6 weeks (agent time + 1 engineer for review)
  • Estimated manual effort: 16 weeks
  • Saving: 10 weeks of engineering time, ~$50K in labour

We tried GPT-5.5 on a subset of the same codebase. It completed 71% of modules successfully but required more human oversight and made more mistakes on edge cases. The time-to-completion was similar because of rework cycles.

Case Study 2: Security Hardening and Compliance

Client: SaaS startup, Series B, pursuing SOC 2 Type II certification

Task: Add logging, error handling, input validation, and audit trails to 50+ endpoints

Approach: We deployed Opus 4.7 to:

  1. Identify gaps in logging and validation
  2. Add structured logging
  3. Implement input validation
  4. Add audit trail middleware

Results:

  • 94% of endpoints updated successfully
  • 6% required manual adjustment
  • Zero security regressions
  • Audit readiness achieved 3 weeks earlier

GPT-5.5 on the same task achieved 79% success rate but introduced 3 subtle security issues (missing input validation on one endpoint, incorrect error handling on another). These were caught in code review but required rework.

Case Study 3: Feature Scaffolding and Rapid Prototyping

Client: B2B SaaS, Series A, building new analytics dashboard

Task: Generate boilerplate for 20 new API endpoints, React components, and database migrations

Approach: We deployed GPT-5.5 to:

  1. Generate endpoint stubs
  2. Generate React components
  3. Generate migrations
  4. Wire everything together

Results:

  • 92% of generated code was usable as-is
  • 8% required minor tweaks
  • Total time: 2 days (agent time + 4 hours engineer review)
  • Estimated manual effort: 5 days
  • Saving: 3 days of engineering time, ~$1.5K in labour

Opus 4.7 on the same task was slightly more correct (96% usable) but took longer (3 days total). For scaffolding work, GPT-5.5’s speed advantage was decisive.

Deployment Pattern: Hybrid Approach

Our most successful deployments use both models:

  1. GPT-5.5 for scaffolding and exploration (fast, cheap, good enough)
  2. Opus 4.7 for verification and hardening (slow, expensive, very correct)
  3. Human engineers for review and integration (fast, expensive, essential)

This hybrid approach balances cost, speed, and correctness. Early-stage work moves fast with GPT-5.5. High-stakes work uses Opus 4.7. Humans catch edge cases and make final decisions.
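
In practice, the routing logic can stay very simple. A minimal sketch, with our own made-up task labels and model identifiers standing in for however you tag work and call each provider:

# Sketch: route tasks to a model by stakes and size.
# Task labels and model names are illustrative, not real API model IDs.
from dataclasses import dataclass

@dataclass
class Task:
    kind: str          # e.g. "scaffold", "refactor", "security", "prototype"
    files_touched: int
    high_stakes: bool  # security, compliance, customer data, etc.

def choose_model(task: Task) -> str:
    if task.high_stakes or task.kind in {"refactor", "security"} or task.files_touched > 5:
        return "opus-4.7"  # slower, pricier, verified more carefully
    return "gpt-5.5"       # fast, cheap, good enough for scaffolding

print(choose_model(Task(kind="scaffold", files_touched=2, high_stakes=False)))  # gpt-5.5
print(choose_model(Task(kind="security", files_touched=1, high_stakes=True)))   # opus-4.7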


Choosing Your Model: Decision Framework {#decision-framework}

You’re evaluating whether to use Opus 4.7, GPT-5.5, or both. Here’s a decision framework.

Step 1: Define Your Success Metric

Ask yourself: what matters most?

  • Correctness: Do bugs in production cost you money or reputation? (Security, compliance, fintech, healthcare) → Opus 4.7
  • Speed: Do you need results in hours, not days? (Scaffolding, prototyping, exploration) → GPT-5.5
  • Cost: Is budget your primary constraint? (Early-stage startup, high volume) → GPT-5.5
  • Throughput: Do you need to process thousands of tasks? (Batch processing, log analysis) → GPT-5.5
  • Context: Are you working with large, complex codebases? (Modernisation, migration) → Opus 4.7

Step 2: Estimate Your Workload

How many tasks per month? What’s the average task complexity?

  • Low volume, high complexity: Opus 4.7 (correctness matters)
  • High volume, low complexity: GPT-5.5 (speed matters)
  • Mixed: Hybrid approach (use both)

Step 3: Calculate True Cost

Don’t just look at per-token pricing. Calculate cost-per-successful-task:

Cost per task = (input tokens × input rate + output tokens × output rate) / success rate

Include the cost of human review, rework, and bug fixes.
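
The same formula as code, plugged with the per-token rates and success rates quoted earlier in this post; add your own human-review and rework cost per task where relevant.

# Cost per successful task = raw task cost / success rate.
def cost_per_successful_task(input_tokens: int, output_tokens: int,
                             input_rate_per_m: float, output_rate_per_m: float,
                             success_rate: float) -> float:
    raw = (input_tokens * input_rate_per_m + output_tokens * output_rate_per_m) / 1_000_000
    return raw / success_rate

# Figures from the cost and production sections above:
print(cost_per_successful_task(2_840, 1_200, 3.00, 15.00, 0.92))  # Opus 4.7 -> ~$0.029
print(cost_per_successful_task(2_120, 900, 0.80, 3.20, 0.78))     # GPT-5.5  -> ~$0.0059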

Step 4: Run a Pilot

Don’t commit to one model. Run a 2-week pilot with both on a representative subset of your workload.

Measure:

  • Success rate (tasks completed without human intervention)
  • Correctness (code review pass rate, test pass rate)
  • Cost per successful task
  • Time to completion
  • Hallucination rate (using a type checker or linter)

Step 5: Monitor and Iterate

Models improve. Benchmarks change. Re-evaluate quarterly. What worked in January might not work in April.


The Future of Agentic Coding {#future}

We’re at an inflection point. Agentic AI vs traditional automation is no longer a debate—agents are here and they’re productive. But they’re not perfect.

What’s Coming

Better context handling: Future models will use context more efficiently, loading only the relevant parts of a codebase.

Improved tool integration: Models will understand your specific tools (your internal APIs, your testing framework, your deployment pipeline) better.

Faster inference: Latency will drop. 5-second responses will become 1-second responses.

Better verification: Models will build in more verification and self-checking, reducing hallucinations.

Specialised models: We’ll see models optimised specifically for coding, security, and domain-specific tasks.

What Won’t Change

Humans will remain essential. Agents are powerful, but they’re not fully autonomous. They need:

  • Clear requirements and context
  • Human review before production deployment
  • Fallback plans when things go wrong
  • Regular monitoring and adjustment

The future isn’t “agents replace engineers.” It’s “engineers + agents = 10x more productive.”


Next Steps: Evaluating for Your Team {#next-steps}

If you’re considering deploying agentic coding agents, here’s what to do next.

1. Define Your Use Case

Are you doing:

  • Large-scale refactoring or legacy modernisation?
  • Security hardening and compliance work?
  • Feature scaffolding and rapid prototyping?
  • Code review, test generation, or documentation?

Each use case has different requirements.

2. Run a Structured Evaluation

Pick 10–20 representative tasks. Run them through both Opus 4.7 and GPT-5.5. Measure:

  • Success rate
  • Code review pass rate
  • Cost per task
  • Time to completion
  • Hallucination rate

Don’t rely on benchmarks alone. Your workload is unique.

3. Start Small, Scale Gradually

Don’t deploy agents to your production pipeline immediately. Start with:

  • Code review assistance
  • Documentation generation
  • Test scaffolding
  • Low-stakes refactoring

Build confidence and patterns. Then expand.

4. Implement Guardrails

Before deploying any agent to production:

  • Set up monitoring and alerting
  • Implement human review gates
  • Add rollback capabilities
  • Test failure scenarios
  • Document decision criteria
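
A human review gate can be as simple as pausing on anything irreversible. A minimal sketch, with the action categories and the stdin prompt standing in (as placeholders of ours) for whatever ticketing or chat approval flow you actually use:

# Sketch: block irreversible agent actions behind a human approval gate.
# Action categories and the stdin prompt are placeholders for a real review flow.
IRREVERSIBLE = {"deploy", "drop_table", "delete_branch", "rotate_secret"}

def gated_execute(action: str, payload: dict, execute) -> str:
    if action in IRREVERSIBLE:
        answer = input(f"Agent wants to run '{action}' with {payload}. Approve? [y/N] ")
        if answer.strip().lower() != "y":
            return "rejected by reviewer"
    return execute(action, payload)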

5. Partner with Experts

If you’re serious about agentic AI orchestration and production deployment, partner with a team that’s done this before. We’ve helped 50+ clients deploy agentic systems at PADISO, and we’ve learned patterns that matter.

We offer CTO as a Service and AI Strategy & Readiness engagements specifically designed to help teams evaluate, deploy, and scale agentic AI. We’ve also published extensive guides on agentic AI production horror stories and AI automation agency services that walk through real patterns and pitfalls.


Conclusion

Claude Opus 4.7 and GPT-5.5 are both excellent agentic coding models. They excel in different contexts:

  • Opus 4.7 wins on correctness, context handling, and production stability. Use it for high-stakes work, large codebases, and security-critical systems.
  • GPT-5.5 wins on speed, cost, and throughput. Use it for scaffolding, exploration, and high-volume tasks.

The best approach is often hybrid: use GPT-5.5 for fast iteration, then use Opus 4.7 for verification and hardening. Neither model replaces human engineers—they amplify them.

The benchmark scores matter, but they’re not the whole story. Terminal-Bench 2.0 and SWE-Bench are useful proxies for real-world performance, but your actual workload will be unique. Run your own evaluation, measure what matters to you, and iterate.

The future of software engineering is agentic. The question isn’t whether to use agents—it’s which agents to use, when to use them, and how to build teams that work effectively with them. That’s where the real value lies.


Appendix: Benchmark Data and Methodology

Testing Environment:

  • 50 real client repositories (Python, JavaScript, Go)
  • 200 Terminal-Bench 2.0 tasks
  • 500 SWE-Bench Pro issues
  • 50 custom production tasks from Padiso engagements

Evaluation Criteria:

  • First-pass success (task completed without human intervention)
  • Code review pass rate (passes linting, type checking, and manual review)
  • Hallucination rate (measured by type checker and linter violations)
  • Tool-call accuracy (correct invocation of file edits, terminal commands, API calls)
  • Cost per successful task (including token costs and human review time)

Timeline:

  • Testing period: January 2026 – February 2026
  • 8 weeks of continuous evaluation
  • Updated as new model versions released

Caveats:

  • Benchmarks change as models improve
  • Your workload may differ from ours
  • Prices and availability subject to change
  • This evaluation is current as of February 2026

For the latest benchmarks and updated comparisons, visit SWE-Bench and Terminal-Bench 2.0.