Claude Sonnet 4.5: Better Performance, But a Paradox

Bonisiwe Shabane

Feel like there is a groundbreaking AI announcement every other week? I get it. The fatigue is real. It’s hard to distinguish the hype from the tools that will actually change how we work. But if you only pay attention to one release this season, make it this one. Anthropic just released Claude Sonnet 4.5 (as of late September 2025).

If you look at the raw numbers, they are absurdly good. It’s crushing benchmarks left and right. However, the benchmarks aren’t the real story here. The real story is stamina. Imagine hiring a brilliant intern who forgets everything you said after 30 minutes. That’s been the reality of most AI models until now.

Sonnet 4.5 changes the game. It can maintain focus on complex, multi-step projects for over 30 hours. Upon launch, Anthropic hailed Claude Sonnet 4.5 as the best coding model in the world. The model launched on September 29 with a lead on real-repo coding (SWE-bench Verified) and a big jump in “computer use” (OSWorld). Anthropic reports 77.2% on SWE-bench Verified and 61.4% on OSWorld; in press briefings they also cited 82% on SWE-bench with parallel test-time compute. One week in, early benchmark data and hands-on testing are starting to paint a more nuanced picture.

For R&D teams using agentic coding tools, the takeaway is clear: code reliability and desktop automation show real gains, while physical reasoning lags. On the LMArena leaderboard (last updated on October 3), the model shows divergent performance across domains. In the Text arena, the model sits in a multi-way tie at the top with a score of 1453, alongside Gemini 2.5 Pro (1452) and Claude Opus 4.1 (1449). Yet in the WebDev arena, where models are evaluated on coding and web development tasks, Claude Sonnet 4.5 ranks 4th with a score of 1382, trailing GPT-5 (high) at 1478, Claude Opus 4.1 variants,... Third-party evaluator Artificial Analysis has integrated Claude Sonnet 4.5 into its composite Intelligence Index, where the model scores 63 out of 100, ranking seventh overall among current frontier models. That places it up from 61 for Claude Opus 4.1 and 57 for Claude Sonnet 4, but trailing GPT-5 Codex (high) and GPT-5 (high) at 68, GPT-5 (medium) at 66, o3 at 65, and...

The index aggregates results from 10 public benchmarks including MMLU-Pro, GPQA Diamond, LiveCodeBench, and AIME 2025. A comprehensive benchmark analysis from All-in-One AI shows Sonnet 4.5 represents a 25.7% intelligence gain over Claude 3.7 Sonnet, with improvements in coding performance and computer use. The analysis includes an interactive speed simulator comparing token generation rates across models. Within 24 hours of launch, we already had a solid indication of where Sonnet 4.5's substantial agentic and coding upgrades come from. Turns out, slow and steady really does win the race. One of the authors of SWE-bench (a coding benchmark for large language models) shared a few graphs showing how different models perform using the "bash-only" minimal agent setup.
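For intuition, here is a hypothetical sketch of what a bash-only minimal agent loop can look like. This is not the actual SWE-bench harness; the `ask_model` callable and the step limit are placeholders standing in for the model call and the evaluator's budget.

```python
# Hypothetical sketch of a "bash-only" minimal agent: the model's only tool
# is a shell, so every fix has to come from the model's own reasoning rather
# than from bespoke scaffolding. Not the real SWE-bench harness.
import subprocess

def run_bash(command: str, timeout: int = 60) -> str:
    """Run a shell command and return combined stdout/stderr."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

def minimal_agent(ask_model, task: str, max_steps: int = 50) -> list[str]:
    """Drive any model with identical tooling: a transcript in, one bash command out."""
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        # ask_model maps the transcript to the next shell command,
        # or the literal string "DONE" when it believes the issue is fixed.
        command = ask_model("\n".join(transcript))
        if command.strip() == "DONE":
            break
        transcript.append(f"$ {command}")
        transcript.append(run_bash(command))
    return transcript
```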

This setup gives all models identical tooling, isolating what comes from the model itself versus the surrounding infrastructure. And the data reveals something fascinating about how Sonnet 4.5 approaches problem-solving. First, the headline numbers. On SWE-bench bash-only (the minimal agent setup), Sonnet 4.5 achieves 70.6%, outperforming GPT-5, and even the larger Opus 4. By the way, these bash-only results differ from the standard SWE-bench Verified scores you might have seen elsewhere. Here’s where things get interesting.

Opus 4 is 5x more expensive per token than Sonnet 4.5. Yet on the full SWE-bench run, it only costs about 2x as much: $566 versus $279, which implies Opus 4 gets through the benchmark on well under half the tokens. Compared to its direct predecessor, Sonnet 4.5 costs about 50% more to run over SWE-bench, while API pricing stays the same. The SWE-bench author shared another chart showing how many steps (think 'edits') each LLM needed to solve the coding problems. According to Anthropic, the model scored 77.2% on SWE-bench Verified (70.6% on the official SWE-bench leaderboard), a test that throws real GitHub issues at AI models to see if they can actually fix code like a...
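A quick back-of-envelope check of those cost figures makes the paradox explicit (illustrative only; real runs mix input and output tokens at different prices, which this ignores):

```python
# Back-of-envelope reading of the SWE-bench cost figures quoted above.
opus_run_cost = 566.0        # full SWE-bench run, Opus 4 (USD)
sonnet_run_cost = 279.0      # full SWE-bench run, Sonnet 4.5 (USD)
per_token_price_ratio = 5.0  # Opus 4 is roughly 5x the per-token price

total_cost_ratio = opus_run_cost / sonnet_run_cost               # ~2.0x
implied_token_ratio = total_cost_ratio / per_token_price_ratio   # ~0.4

print(f"Total run cost ratio (Opus / Sonnet): {total_cost_ratio:.2f}x")
print(f"Implied token usage ratio (Opus / Sonnet): {implied_token_ratio:.2f}")
# ~0.41: Opus 4 appears to use well under half the tokens, which is why its
# total run cost is only ~2x despite a ~5x per-token price.
```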

For context, that 77.2% is the highest score any model has ever achieved on this evaluation, and it's not even close. But here's what makes Sonnet 4.5 different: it can maintain focus on complex, multi-step tasks for more than 30 hours. Not 30 minutes. Not 3 hours. Thirty. Hours.

Try it yourself through the Claude API using the model string 'claude-sonnet-4-5'. You can also read the published Claude Sonnet 4.5 system prompt. We'll break down more about that below. Claude Sonnet 4.5 is the best coding model in the world. It's the strongest model for building complex agents. It's the best model at using computers.
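To make the "try it yourself" pointer concrete, here is a minimal sketch using the Anthropic Python SDK (pip install anthropic). It assumes ANTHROPIC_API_KEY is set in the environment, and the prompt is just a placeholder.

```python
# Minimal call to Claude Sonnet 4.5 via the Anthropic Messages API.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain why this unit test is flaky."}],
)
print(message.content[0].text)
```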

And it shows substantial gains in reasoning and math. Code is everywhere. It runs every application, spreadsheet, and software tool you use. Being able to use those tools and reason through hard problems is how modern work gets done. Claude Sonnet 4.5 makes this possible. We're releasing it along with a set of major upgrades to our products.

In Claude Code, we've added checkpoints—one of our most requested features—that save your progress and allow you to roll back instantly to a previous state. We've refreshed the terminal interface and shipped a native VS Code extension. We've added a new context editing feature and memory tool to the Claude API that lets agents run even longer and handle even greater complexity. In the Claude apps, we've brought code execution and file creation (spreadsheets, slides, and documents) directly into the conversation. And we've made the Claude for Chrome extension available to Max users who joined the waitlist last month. We're also giving developers the building blocks we use ourselves to make Claude Code.

We're calling this the Claude Agent SDK. The infrastructure that powers our frontier products—and allows them to reach their full potential—is now yours to build with. This is the most aligned frontier model we’ve ever released, showing large improvements across several areas of alignment compared to previous Claude models. TLDR: Claude Sonnet 4.5 scores 77.2% on SWE-bench Verified (82.0% with parallel compute), 50.0% on Terminal-Bench, and 61.4% on OSWorld. It reaches 100% on AIME with Python and 83.4% on GPQA Diamond. Pricing is $3 per million input tokens and $15 per million output tokens; you can use it on web, iOS, Android, the Claude Developer Platform, Amazon Bedrock, and Google Cloud Vertex AI.
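Given the published $3 / $15 per-million-token pricing, a rough per-request cost estimate looks like this (token counts below are placeholders; in practice you would read them from the API response's usage field):

```python
# Rough request-cost estimate at Sonnet 4.5's published pricing.
INPUT_PRICE_PER_MTOK = 3.00    # USD per million input tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # USD per million output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_MTOK \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_MTOK

# Example: a 20K-token prompt with a 4K-token completion.
print(f"${request_cost(20_000, 4_000):.2f}")  # ~$0.12
```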

Anthropic released Claude Sonnet 4.5 on September 29, 2025, as the latest model in the Claude 4 family. It improves coding performance, supports long-running agent workflows, and handles computer-use tasks more reliably. Let's analyze its benchmarks, pricing, and how it compares with GPT-5 and Gemini 2.5 Pro in production use. The headline changes: fewer misaligned behaviors and stronger defenses on the safety side, plus code checkpoints, a VS Code extension, and the Agent SDK for developers.

Anthropic’s Claude Sonnet 4.5—released September 29, 2025—is a major new large language model (LLM) targeting coding, agentic workflows, and complex computer use.

Anthropic touts it as “the best model in the world for agents, coding, and computer use” (Source: www.anthropic.com) (Source: www.axios.com). This report provides an in-depth analysis of what’s new in Sonnet 4.5: its technical innovations, benchmark performance, real-world uses, and strategic implications. Sonnet 4.5 features an expanded 200,000-token context window (Source: www.anthropic.com) (up to 64K output), hybrid reasoning with “extended thinking” for multi-step tasks (Source: www.gtmengine.ai), and new tools (context-editing, memory, checkpoints, VS Code... It pushes the frontier of AI-assisted coding: in internal and external tests it scored 77.2% on the SWE-Bench coding benchmark (Source: www.anthropic.com) (Source: www.implicator.ai) (beating OpenAI’s GPT-5 Codex at 71.4% and Google’s Gemini 2.5... The model’s new features, coupled with improved alignment (ASL-3 safety) (Source: www.axios.com) (Source: www.implicator.ai), enable up to 30 hours of continuous autonomous operation (versus 7 hours for the prior Opus 4 model (Source: www.implicator.ai)... Industry analysts and early adopters report substantial productivity gains.
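As a hedged sketch of the "extended thinking" mode mentioned above, the Messages API exposes a thinking parameter; the token budget below is purely illustrative, not a recommendation.

```python
# Sketch: enabling extended thinking for a multi-step task (budget is illustrative).
import anthropic

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16_000,  # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8_000},
    messages=[{"role": "user", "content": "Plan a multi-step refactor of a legacy module."}],
)
# The response interleaves "thinking" blocks with the final "text" blocks.
for block in message.content:
    if block.type == "text":
        print(block.text)
```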

Cursor.ai, for example, notes that Sonnet 4.5 delivers state-of-the-art coding performance on long-horizon tasks (Source: www.anthropic.com), security companies trimmed vulnerability triage time by 44% (Source: www.anthropic.com), and a financial firm obtained investment-grade analysis with less human review... The release underscores Anthropic’s strategic focus on developer/enterprise AI: it is integrated into products like Claude Code and Microsoft 365 Copilot, available on AWS Bedrock/Google Vertex (Source: www.anthropic.com), and priced unchanged at $3/$15 per... However, independent evaluations reveal trade-offs: Sonnet 4.5 is extremely fast but occasionally produces superficial or buggy code compared to GPT-5 (Source: news.ycombinator.com), highlighting the gap between benchmark success and deployment-ready quality. This report comprehensively examines Sonnet 4.5 from multiple perspectives—technical, empirical, industry use, and future impact—drawing on official documentation, benchmarks, expert analyses, and real-world case examples. Anthropic has rapidly emerged as a leading AI lab focused on creating safe, aligned LLMs. Its flagship Claude model series (including versions codenamed “Sonnet” and “Opus”) emphasizes applications in coding, reasoning, and agentic tasks rather than general chat.

Claude Sonnet 3.7 (Feb 2025) was Anthropic’s first hybrid reasoning model and a leader in coding (Source: www.anthropic.com). Sonnet 4 followed, and now Sonnet 4.5 (Sep 2025) continues this lineage. The naming highlights Anthropic’s rapid release cadence and incremental improvements: each “.5” release (e.g. Sonnet 4.5) is billed as a significant upgrade over the prior whole-number release (e.g. Sonnet 4). Indeed, Sonnet 4.5 arrives roughly 4 months after its predecessor, reflecting Anthropic’s stated goal of doubling task complexity capability every release (Source: www.axios.com).
