Anthropic’s latest AI model delivers massive upgrades in coding, math and computer tasks
Anthropic unveiled Claude Sonnet 4.6 on Monday, and the release raises the baseline for what a mid-tier model should accomplish. The new version rivals the company’s most advanced offering from last fall, bringing substantial improvements across coding, spreadsheet navigation and complex document analysis. For free and pro plan users, Sonnet 4.6 became the default model on release day, with pricing holding steady at $3 per million input tokens and $15 per million output tokens.
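At those rates, the cost of a request is easy to estimate. A minimal sketch in Python; the per-million-token rates come from the pricing above, while the sample token counts are hypothetical:

```python
# Estimate a request's cost from Sonnet 4.6's per-million-token rates.
INPUT_RATE = 3.00 / 1_000_000    # $3 per million input tokens
OUTPUT_RATE = 15.00 / 1_000_000  # $15 per million output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a single API call."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Hypothetical example: a 50,000-token prompt with a 2,000-token reply.
print(f"${estimate_cost(50_000, 2_000):.4f}")  # → $0.1800
```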
Early testing revealed something significant: developers prefer Sonnet 4.6 over its predecessor by a substantial margin. The model demonstrates better consistency and follows instructions more reliably, which matters enormously when people depend on it for work tasks that previously required Anthropic’s smartest model. Box, an enterprise content management company that received early access, put the model through rigorous testing that revealed exactly why tech insiders are excited.
Heavy reasoning jumped 15 percentage points in accuracy
Box evaluated how well Sonnet 4.6 handles everyday enterprise work compared to Claude Sonnet 4.5. In heavy reasoning tests requiring deep thinking over extended periods rather than quick simple answers, Sonnet 4.6 achieved 77 percent accuracy. That’s a massive jump from the previous version’s 62 percent score. That gap matters because it measures the model’s ability to tackle genuinely difficult problems rather than execute routine tasks.
The testing revealed Sonnet 4.6 excels in five specific areas that matter for real-world applications. In public sector work, the model hit 88 percent accuracy when handling government-related tasks and complex bureaucratic requirements. Healthcare applications showed 78 percent accuracy in medical contexts where precision matters enormously. Retail operations peaked at 94 percent accuracy for tasks like analyzing sales data and customer patterns. Mathematical calculations achieved 89 percent accuracy, up sharply from the previous version’s 62 percent. Data extraction maintained over 80 percent accuracy pulling information from PDFs and Word documents while understanding context.
Real-world scenarios show what actually matters
Box’s evaluation highlighted practical wins that demonstrate genuine value beyond benchmark numbers. In one test, a retail company analyzing multi-year sales data found that Sonnet 4.5 stumbled on financial interpretation, causing cascading calculation errors. Sonnet 4.6 correctly computed investment-to-cost ratios and ranked the top items by price increase. The difference wasn’t marginal: it was the difference between getting answers right and getting them wrong.
Another scenario involved a teacher evaluating a learning program who needed to summarize student outcomes. Sonnet 4.5 miscounted the students who passed, resulting in flawed recommendations. Sonnet 4.6 got the numbers right and provided advice aligned with actual data. These aren’t abstract improvements. These are situations where getting it right versus getting it wrong changes how decisions get made.
Computer use reached human-level capability
Perhaps the most intriguing advancement involves how the model interacts with computers. Anthropic introduced the first general-purpose computer-using AI model in October 2024, though early versions were experimental and error-prone. Sixteen months later, Sonnet models show steady gains on benchmarks testing AI across real software like Chrome, LibreOffice and VS Code.
The model sees the computer and interacts with it like a person would, clicking a virtual mouse and typing on a virtual keyboard. Early Sonnet 4.6 users report human-level capability in tasks like navigating complex spreadsheets or completing multi-step web forms across multiple browser tabs. That capability means the model can do work humans typically do, following multi-step processes and adapting when things change.
Safety and broader capabilities
Beyond these specific improvements, Sonnet 4.6 brings upgraded skills across coding, long-context reasoning, agent planning and design work. The model features a 1 million token context window in beta, meaning it can hold entire codebases, lengthy contracts or dozens of research papers in a single request. Safety evaluations showed Sonnet 4.6 performs as safely as or better than other recent Claude models, with researchers describing it as having an honest and prosocial character with strong safety behaviors.
The model is available now on all Claude plans, Claude Cowork, Claude Code and major cloud platforms. Developers can access it immediately through the Claude API using the claude-sonnet-4-6 identifier.
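For developers, switching is essentially a one-line change. A minimal sketch using Anthropic’s Python SDK; the `claude-sonnet-4-6` identifier is the one cited above, while the prompt and `max_tokens` value are placeholders, and a valid `ANTHROPIC_API_KEY` is assumed for the actual call:

```python
import os

# Request parameters for the Messages API; the model id is the one named
# above, while the prompt and max_tokens are placeholder values.
request = {
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Summarize this quarter's sales data."}],
}

# Send only if credentials are configured (requires `pip install anthropic`).
if os.environ.get("ANTHROPIC_API_KEY"):
    import anthropic
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    reply = client.messages.create(**request)
    print(reply.content[0].text)
```

Code written against an earlier Sonnet model only needs the `model` field updated to pick up the new version.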

