Everything You Need to Know: Benchmark Dominance (144 Elo Over GPT-5.2, #1 on Terminal-Bench), New Features (Adaptive Thinking, Effort Controls, Context Compaction), Safety Leadership, and Partner Testimonials
Part I: The Performance Revolution
Benchmark Dominance Across Categories
GDPval-AA (Knowledge Work):
- The Evaluation: Real-world economically valuable tasks across finance, legal, and professional domains
- Claude Opus 4.6 Performance: Industry-leading
- Versus GPT-5.2: +144 Elo points (translates to winning ~70% of comparisons)
- Versus Opus 4.5: +190 Elo points
- Significance: Largest performance gap in the knowledge work category
- Independent Verification: Run by Artificial Analysis (see methodology)
Terminal-Bench 2.0 (Agentic Coding):
- The Evaluation: Real-world system tasks and coding in terminal environments
- Claude Opus 4.6 Score: Highest in industry
- Framework: Terminus-2 harness
- Resource Allocation: 1× guaranteed / 3× ceiling
- Samples: 5-15 per task across staggered batches
- What It Tests: Multi-file editing, system configuration, debugging, tool usage
Humanity's Last Exam (Reasoning):
- The Evaluation: Complex multidisciplinary reasoning test designed to challenge frontier models
- Claude Opus 4.6: Leads all frontier models
- Configuration: With tools (web search, web fetch, code execution, programmatic tool calling)
- Advanced Settings: Context compaction at 50k→3M tokens, max reasoning effort, adaptive thinking
- Domain Decontamination: Blocklist applied to ensure clean results
BrowseComp (Deep Search):
- The Evaluation: Locating hard-to-find information online through multi-step search
- Claude Opus 4.6: Best performance in industry
- With Multi-Agent Harness: 86.8% score
- Configuration: Web search + fetch, context compaction 50k→10M tokens, max effort, no thinking mode
- What It Measures: Information retrieval accuracy, search strategy quality, source synthesis
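The "~70% of comparisons" figure follows from the standard Elo expected-score formula. A quick self-contained check of the two deltas cited above (Python used here purely for illustration):

```python
def elo_win_prob(delta: float) -> float:
    """Expected win probability implied by a rating advantage of `delta` Elo points."""
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

# +144 Elo (vs GPT-5.2) and +190 Elo (vs Opus 4.5)
for delta in (144, 190):
    print(delta, round(elo_win_prob(delta), 3))
```

At +144 Elo this comes out just under 0.70, matching the stated ~70% win rate; at +190 it is close to 0.75.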
Comprehensive Benchmark Table
Agentic Coding:

| Benchmark | Opus 4.6 | Comparison | Notes |
|---|---|---|---|
| Terminal-Bench 2.0 | #1 Industry | Beats all competitors | Terminus-2 harness, 5-15 samples |
| SWE-bench Verified | 81.42% | With modifications | 25-trial average, prompt optimization |
| OpenRCA | Highest | Root cause analysis | Official methodology verification |
| Multilingual Coding | State-of-art | Cross-language issues | Multiple programming languages |

Reasoning and Knowledge Work:

| Benchmark | Opus 4.6 | Delta | Significance |
|---|---|---|---|
| GDPval-AA | Leading | +144 Elo vs GPT-5.2 | ~70% win rate |
| Humanity's Last Exam | #1 | Leads all frontier | Complex multidisciplinary |
| Life Sciences | 2× vs Opus 4.5 | Nearly double | Biology, chemistry, phylogenetics |

Long Context:

| Test | Opus 4.6 | Opus 4.5 | Sonnet 4.5 | Improvement |
|---|---|---|---|---|
| MRCR v2 (8-needle, 1M) | 76% | N/A | 18.5% | 4.1× vs Sonnet |
| Context Window | 1M tokens | 200k | 256k | First Opus with 1M |

Specialized Domains:

| Domain | Benchmark | Score | Notes |
|---|---|---|---|
| Legal | BigLaw Bench | 90.2% | 40% perfect scores, Harvey |
| Cybersecurity | CyberGym | #1 | Vulnerability detection, NBIM 38/40 wins |
| Long-term Focus | Vending-Bench 2 | +$3,050.53 vs Opus 4.5 | Sustained performance |
| Biology/Chemistry | Life Sciences Tests | 2× Opus 4.5 | Computational biology, organic chem |
What the Improvements Mean
Smarter Effort Allocation:
- Identifies most challenging parts without prompting
- Allocates cognitive resources intelligently
- Moves quickly through straightforward sections
- Handles ambiguous problems with better judgment
Sustained Long-Horizon Performance:
- Stays productive over extended sessions
- Maintains quality as context grows
- Doesn't drift or lose focus
- Handles multi-hour agentic workflows
Large-Codebase Fluency:
- Operates confidently in millions of lines of code
- Tracks dependencies across files
- Navigates unfamiliar architectures
- SentinelOne: "multi-million-line codebase migration like a senior engineer"
Self-Correction:
- Reviews own code before submitting
- Catches mistakes proactively
- Debugs autonomously
- Vercel: "frontier-level reasoning, especially with edge cases"
Part II: The 1M Context Window Breakthrough
Eliminating "Context Rot"
- The Previous Problem: AI models degrade as conversations exceed token limits, losing track of information, missing buried details, producing inconsistent responses
- Opus 4.6 Solution: First Opus-class model with a 1 million token context window (beta)
- Test: 8-needle, 1M variant, with information "hidden" in vast amounts of text
- Opus 4.6: 76% retrieval accuracy
- Sonnet 4.5: 18.5% retrieval accuracy
- Improvement: 4.1× better at maintaining performance over long contexts
- Qualitative Shift: From "struggles beyond 200k" to "reliably handles 1M"
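The retrieval numbers above come from a needle-in-a-haystack style metric: plant facts in a long context, then score what fraction the model can reproduce. MRCR v2's actual scoring is more involved; this sketch only shows the shape of the measurement, and the function name and verbatim-match rule are illustrative assumptions:

```python
def needle_score(needles: list[str], answer: str) -> float:
    """Fraction of planted needles that appear verbatim in the model's answer."""
    found = sum(1 for needle in needles if needle in answer)
    return found / len(needles)

# e.g. 8 needles planted across 1M tokens; score the model's recall answer
score = needle_score(["n1", "n2", "n3", "n4"], "the context mentioned n1 and n3")
```

A 76% score on the 8-needle variant means roughly 6 of 8 planted facts recovered on average, versus fewer than 2 of 8 for Sonnet 4.5.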
Long-Context Applications
- Entire legal briefs with exhibits
- Complete financial reports with appendices
- Technical documentation sets
- Multi-book literature review
- Full application codebases
- Dependency chains across projects
- Architecture documentation + code
- Historical commit context
- Dozens of research papers
- Conference proceedings
- Patent portfolios
- Scientific literature reviews
- Week-long project discussions
- Accumulated knowledge bases
- Historical decision contexts
- Evolving requirements specifications
The 128k Output Token Advantage
- Previous Limitation: Long outputs required breaking into multiple requests
- Opus 4.6 Capability: Up to 128,000 tokens in a single output
What Fits in One Response:
- Complete technical documentation
- Comprehensive reports
- Full application code
- Detailed analysis documents
- Multi-section deliverables in one response
- Partner Testimony: Bolt.new CEO Eric Simons: "one-shotted a fully functional physics engine, handling a large multi-scope task in a single pass"
Premium Pricing for Extended Context
Standard Pricing (up to 200k tokens):
- Input: $5 per million tokens
- Output: $25 per million tokens
Extended Context (200k-1M tokens):
- Input: $10 per million tokens
- Output: $37.50 per million tokens
- Multiplier: 2× for input, 1.5× for output
When Extended Context Is Worth It:
- Multi-document analysis requiring simultaneous access
- Codebase-scale operations
- Historical context critical to quality
- Single-pass complex deliverables
Part III: New Developer Platform Features
Adaptive Thinking
- The Previous Binary: Enable or disable extended thinking, with no middle ground
- How It Works: Model decides when deeper reasoning would be helpful based on task complexity
- Uses extended thinking when useful
- Skips it for straightforward tasks
- Balances quality and speed automatically
- Developer Control: Adjust effort level to make the model more or less selective
- Benefit: Optimal performance without manual micromanagement per query
- Simple query: Instant response without thinking
- Ambiguous problem: Activates extended reasoning
- Edge case detection: Automatically thinks deeper
- Routine task: Efficient execution
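The dispatch behavior described above can be pictured as a gate: cheap complexity signals decide whether a query gets extended reasoning, and the effort level moves the bar. The signals and thresholds below are invented for illustration; the real mechanism is learned by the model, not rule-based:

```python
# Illustrative-only sketch of adaptive thinking's dispatch decision.
EFFORT_BARS = {"low": 0.9, "medium": 0.5, "high": 0.3, "max": 0.0}

def should_think(query: str, effort: str = "high") -> bool:
    """Return True when the query looks complex enough to warrant extended reasoning."""
    signals = [
        len(query.split()) > 30,                                      # long prompt
        any(w in query.lower() for w in ("prove", "debug", "edge case")),
        query.count("\n") > 5,                                        # multi-part input
    ]
    complexity = sum(signals) / len(signals)
    return complexity >= EFFORT_BARS[effort]
```

Lowering effort raises the bar (thinks less often); max effort sets it to zero (always reasons deeply), matching the behavior table above.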
Effort Controls (Four Levels)
Low Effort:
- Speed: Fastest responses
- Cost: Lowest token usage
- Use When: Simple queries, routine tasks, time-critical operations
- Thinking: Minimal extended reasoning
Medium Effort:
- Speed: Balanced
- Cost: Moderate
- Use When: Standard tasks, most everyday work
- Thinking: Selective extended reasoning
High Effort (Default):
- Speed: Quality-optimized
- Cost: Standard pricing
- Use When: Complex tasks requiring careful consideration
- Thinking: Extended reasoning when beneficial (adaptive)
Max Effort:
- Speed: Deepest reasoning (may be slower)
- Cost: Highest token usage
- Use When: Most challenging problems, critical decisions, expert-level work
- Thinking: Maximum extended reasoning, 120k thinking budget
- Parameter Access: `/effort` in API
- Anthropic Recommendation: "If model is overthinking on a given task, dial effort down from high to medium"
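A request might set the effort level like the sketch below. The `effort` field name, its accepted values, and the thinking-budget shape are assumptions, not confirmed API parameters; check the platform docs before relying on them:

```python
# Hypothetical request builder for the four effort levels.
def build_request(prompt: str, effort: str = "high") -> dict:
    levels = ("low", "medium", "high", "max")
    if effort not in levels:
        raise ValueError(f"effort must be one of {levels}")
    request = {
        "model": "claude-opus-4-6",
        "max_tokens": 4096,
        "effort": effort,  # assumed parameter name
        "messages": [{"role": "user", "content": prompt}],
    }
    if effort == "max":
        # The document cites a 120k thinking budget at max effort (assumed field shape).
        request["thinking"] = {"budget_tokens": 120_000}
    return request
```

Following the recommendation above, dropping from `"high"` to `"medium"` is a one-argument change when a task is being overthought.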
Context Compaction (Beta)
- The Problem: Long conversations and agentic tasks hitting context limits
- The Solution: Automatic summarization and replacement of older context
- 1. Threshold Configuration: Developer sets token limit (e.g., 50k, 100k)
- 2. Automatic Trigger: When conversation approaches threshold
- 3. Intelligent Summarization: Model summarizes older context
- 4. Seamless Replacement: Summary replaces original detailed context
- 5. Continued Operation: Task proceeds without hitting limits
Configuration Options:
- Custom threshold settings
- Preservation rules for critical context
- Summary detail levels
- Maximum total context after compaction
Ideal For:
- Multi-day debugging sessions
- Iterative design discussions
- Long-running research projects
- Extended customer support threads
- Continuous monitoring workflows
- BrowseComp Example: Compaction at 50k→10M total tokens enabled deep search
- Humanity's Last Exam: Compaction at 50k→3M tokens for complex reasoning
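The five steps above reduce to a simple loop: measure, trigger at the threshold, summarize the older turns, splice the summary in, continue. This is a minimal sketch of that loop; the word-count "tokenizer" and the string summarizer are placeholders, since in the real feature the model writes the summary and the API manages the threshold:

```python
def rough_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def compact(history: list[str], threshold: int, keep_recent: int = 2) -> list[str]:
    """Once history nears the threshold, fold older turns into one summary."""
    total = sum(rough_tokens(turn) for turn in history)
    if total < threshold or len(history) <= keep_recent:
        return history  # under budget: nothing to do
    older, recent = history[:-keep_recent], history[-keep_recent:]
    summary = f"[summary of {len(older)} earlier turns]"  # model call in practice
    return [summary] + recent
```

This is how a 50k threshold can support a 10M-token run: the live window stays small while total processed tokens keep growing.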
US-Only Inference
- Requirement: Workloads must run in United States data centers
- Pricing: 1.1× token pricing (10% premium)
- Regulated industries (healthcare, finance, government)
- Data residency compliance requirements
- US government contracts
- Legal/contractual restrictions
- How to Enable: Specify in API call or platform settings
- Documentation: See Data Residency
Part IV: Product Updates
Agent Teams in Claude Code (Research Preview)
- The Innovation: Multiple agents working in parallel, coordinating autonomously
- When to Use: Tasks that split into independent, read-heavy work
- Example: Codebase reviews across multiple repositories
- Main agent decomposes task
- Sub-agents instantiated for independent pieces
- Parallel execution
- Autonomous coordination
- Results synthesis
- Take over any subagent: Shift+Up/Down
- Tmux integration support
- Monitor all agent activities
- Intervene when needed
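The decompose/parallelize/synthesize flow above is a classic fan-out/fan-in pattern. A toy sketch, where `run_subagent` stands in for a full agent episode (model calls, tool use) and all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(subtask: str) -> str:
    return f"report({subtask})"  # placeholder for an independent agent run

def run_team(task: str, subtasks: list[str]) -> list[str]:
    """Main agent fans subtasks out to parallel sub-agents, then gathers reports."""
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        reports = list(pool.map(run_subagent, subtasks))  # parallel execution
    return [f"task: {task}"] + reports                    # input to the synthesis step
```

The pattern pays off exactly when the subtasks are independent and read-heavy, as the "When to Use" guidance above notes; write-heavy subtasks would need coordination this sketch omits.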
- Replit President Michele Catasta: "Breaks complex tasks into independent subtasks, runs tools and subagents in parallel, identifies blockers with real precision"
- Rakuten GM Yusuke Kaji: "Autonomously closed 13 issues and assigned 12 issues to right team members in single day, managing ~50-person org across 6 repositories"
Claude in Excel (Substantial Upgrades)
- Long-Running Tasks: Handles complex multi-step spreadsheet operations
- Harder Problems: Solves challenging data analysis and modeling
- Plan Before Acting: Thinks through approach before executing
- Unstructured Data Ingestion: Processes messy data, infers proper structure automatically
- Multi-Step Changes: Executes complex modifications in one pass
- Financial model construction with Pivot Tables
- Data cleaning and structuring
- Complex formula creation
- Multi-sheet coordination
- Automated reporting
- Shortcut.ai CTO Nico Christie: "Performance jump feels almost unbelievable. Real-world tasks challenging for Opus [4.5] suddenly became easy. Watershed moment for spreadsheet agents."
- Now Available: Max, Team, and Enterprise plans
Claude in PowerPoint (Research Preview)
- Layout Reading: Understands your existing templates
- Font Recognition: Matches brand typography
- Slide Master Awareness: Stays on brand automatically
- Template Building: Constructs presentations from templates
- Full Deck Generation: Creates complete presentations from descriptions
- Visual Intelligence: Brings data to life with appropriate charts, graphics
Excel-to-PowerPoint Workflow:
- Process and structure data in Excel
- Transfer insights to PowerPoint
- Automatic visual generation
- Brand-consistent output
- Professional presentation ready
- Figma CDO Loredana Crisan: "Translates detailed designs and multi-layered tasks into code on first try, powerful starting point for teams to explore ideas"
Part V: Safety and Alignment Leadership
Lowest Misaligned Behavior Rate
- Opus 4.6: Lowest misaligned behavior score of any frontier model
- Versus Opus 4.5: Equal or better alignment (Opus 4.5 was previous best)
- Deception and dishonesty
- Sycophancy (excessive agreement)
- Encouragement of user delusions
- Cooperation with misuse requests
- Harmful instruction following
- Over-Refusal Rate: Lowest of recent Claude models
- Balance Achieved: High safety without excessive caution on benign queries
Most Comprehensive Safety Evaluation Ever
- User wellbeing assessments
- Complex refusal testing
- Surreptitious harmful action detection
- Interpretability experiments (understanding why model behaves certain ways)
- Enhanced dangerous request scenarios
- More sophisticated misuse attempts
- Multi-step harmful task detection
- Context-dependent safety evaluation
- Interpretability Integration: Using the science of AI model inner workings to catch problems standard testing might miss
- System Card: Full details in the Claude Opus 4.6 System Card
Cybersecurity-Specific Safeguards
- The Context: Opus 4.6 shows enhanced cybersecurity abilities
- Dual-Use Recognition: Capabilities helpful for defense, potentially harmful for offense
- Six New Cybersecurity Probes: Methods detecting harmful responses across:
- Exploit development
- Vulnerability discovery misuse
- Attack methodology guidance
- Malicious code generation
- System intrusion assistance
- Data exfiltration techniques
- Defensive Acceleration: Using Opus 4.6 to find and patch vulnerabilities in open-source software (see Anthropic cybersecurity blog)
- Future Plans: Real-time intervention to block abuse as threats evolve
- Philosophy: "Critical that cyberdefenders use AI models like Claude to level the playing field"
Part VI: Partner Testimonials and Real-World Validation
Development Tools and Platforms
"Delivering on complex, multi-step coding work developers face daily—especially agentic workflows demanding planning and tool calling. Unlocking long-horizon tasks at frontier."
"Stands out on harder problems. Stronger tenacity, better code review, stays on long-horizon tasks where others drop off. Team really excited."
"Huge leap for agentic planning. Breaks complex tasks into independent subtasks, runs tools and subagents in parallel, identifies blockers with real precision." (Replit President Michele Catasta)
"Noticeably better than Opus 4.5 in Windsurf, especially on tasks requiring careful exploration like debugging and understanding unfamiliar codebases. Thinks longer, which pays off."
"Meaningful improvement for design systems and large codebases—enormous enterprise value use cases. One-shotted fully functional physics engine, handling large multi-scope task single pass." (Bolt.new CEO Eric Simons)
"Only ship models developers genuinely feel difference. Opus 4.6 passed that bar with ease. Frontier-level reasoning with edge cases helps v0 elevate ideas from prototype to production." (Vercel)
Enterprise and Knowledge Work
"Strongest model Anthropic shipped. Takes complicated requests, actually follows through, breaks into concrete steps, executes, produces polished work even when ambitious. Feels like capable collaborator."
"Clear step up. Code, reasoning, planning excellent. Ability to navigate large codebase and identify right changes feels state-of-the-art."
"Meaningful leap in long-context performance. Handles much larger information bodies with consistency level strengthening how we design complex research workflows. More powerful building blocks for expert-grade systems."
"Achieved highest BigLaw Bench score of any Claude model at 90.2%. With 40% perfect scores and 84% above 0.8, remarkably capable for legal reasoning." (Harvey)
"Excels in high-reasoning tasks like multi-source analysis across legal, financial, technical content. Box eval showed 10% lift in performance, reaching 68% vs 58% baseline, near-perfect scores in technical domains." (Box)
Product and Creative Tools
"Best Anthropic model we've tested. Understands intent with minimal prompting, went above and beyond, exploring and creating details I didn't know I wanted until I saw them. Felt like working with model, not waiting on it."
"Generates complex, interactive apps and prototypes in Figma Make with impressive creative range. Translates detailed designs and multi-layered tasks into code first try—powerful starting point." (Figma CDO Loredana Crisan)
"Uplift in design quality. Works beautifully with our design systems and more autonomous, core to Lovable's values. People should create things that matter, not micromanage AI."
Security and Infrastructure
"Across 40 cybersecurity investigations, Opus 4.6 produced best results 38 of 40 times in blind ranking against Claude 4.5 models. Each model ran end-to-end on same agentic harness with up to 9 subagents and 100+ tool calls." (NBIM)
"Handled multi-million-line codebase migration like senior engineer. Planned up front, adapted strategy as learned, finished in half the time." (SentinelOne)
"Biggest leap seen in months. More comfortable giving sequence of tasks across stack and letting run. Smart enough to use subagents for individual pieces."
Specialized Applications
"Autonomously closed 13 issues and assigned 12 to right team members in single day, managing ~50-person organization across 6 repositories. Handled product and organizational decisions while synthesizing context across domains, knew when to escalate to human." (Rakuten GM Yusuke Kaji)
"Reasons through complex problems at level we haven't seen before. Considers edge cases other models miss, consistently lands on more elegant, well-considered solutions."
"Performance jump almost unbelievable. Real-world tasks challenging for Opus [4.5] suddenly became easy. Watershed moment for spreadsheet agents on Shortcut." (Shortcut.ai CTO Nico Christie)
Part VII: How to Use Claude Opus 4.6
Access Points
Claude.ai (Web/App):
- Direct access for all users
- Max, Team, Enterprise plans
- Integrated with Claude in Excel, PowerPoint
- Cowork autonomous multitasking
API (Developers):
- Model string: `claude-opus-4-6`
- Full documentation: platform.claude.com/docs
- All major cloud platforms supported
Claude Code (Terminal):
- Agent teams feature
- IDE integration (Xcode support announced)
- Autonomous coding workflows
Pricing Structure
Standard Pricing (up to 200k tokens):
- Input: $5 per million tokens
- Output: $25 per million tokens
- Unchanged: Same pricing as Opus 4.5
Extended Context (200k-1M tokens):
- Input: $10 per million tokens
- Output: $37.50 per million tokens
US-Only Inference:
- Multiplier: 1.1× standard pricing
- Use For: Compliance requirements
- Full Details: claude.com/pricing
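These tiers compose into a simple cost estimate. One assumption in this sketch: when input exceeds 200k tokens, the extended-context rates apply to the whole request (actual billing granularity isn't specified here), and the US-only premium multiplies the total:

```python
def request_cost_usd(input_tokens: int, output_tokens: int, us_only: bool = False) -> float:
    """Estimate one request's cost under the listed Opus 4.6 prices."""
    if input_tokens > 200_000:
        in_rate, out_rate = 10.00, 37.50   # extended context (200k-1M)
    else:
        in_rate, out_rate = 5.00, 25.00    # standard (up to 200k)
    cost = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    return cost * 1.1 if us_only else cost  # US-only inference premium
```

For example, 100k in / 10k out costs $0.75 at standard rates, while 500k in / 10k out costs $5.375 at extended-context rates.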
Configuration Best Practices
- Low: Quick answers, simple queries, time-critical → Fast, cheap, minimal thinking
- Medium: Standard tasks, most work → Balanced, selectively thinks
- High (Default): Complex problems, quality-critical → Adaptive thinking, optimal balance
- Max: Hardest challenges, expert-level → Maximum reasoning, 120k budget
- Enable Compaction: For long-running tasks → Set threshold (e.g., 50k tokens) → Automatic summarization prevents limits
- Use 1M Window: For multi-document analysis → Be aware of premium pricing past 200k tokens → Worth it when simultaneous access is critical
- Adaptive Thinking: Leave enabled at default (high effort) → Model decides when to think deeply → Dial down if overthinking simple tasks
Conclusion: The New Standard for AI Intelligence
What Opus 4.6 Achieves
- 144 Elo over GPT-5.2 on knowledge work
- #1 on Terminal-Bench 2.0 agentic coding
- Leads Humanity's Last Exam reasoning test
- Best BrowseComp search performance
- 1M token window (first Opus-class)
- 76% on MRCR v2 (4.1× better than Sonnet)
- 128k output tokens
- Context rot effectively eliminated
- Adaptive thinking intelligence
- Four-level effort controls
- Context compaction for long tasks
- Agent teams in Claude Code
- US-only inference option
- Upgraded Claude in Excel
- New Claude in PowerPoint
- Cowork autonomous multitasking
- Improved everyday work capabilities
- Lowest misaligned behavior rate
- Most comprehensive evaluation ever
- Specialized cybersecurity safeguards
- Lowest over-refusal rate
Who Benefits Most
- Developers: Agentic coding, codebase navigation, debugging, system tasks
- Knowledge Workers: Financial analysis, legal research, document creation, presentations
- Enterprises: Long-context analysis, compliance workflows, multi-repository management
- Researchers: Literature synthesis, data analysis, expert-level reasoning
- Creative Professionals: Design systems, prototyping, autonomous content creation
The Competitive Landscape
- Versus GPT-5.2: +144 Elo on GDPval-AA, wins ~70% of comparisons
- Versus Previous Opus: +190 Elo on knowledge work, 2× on life sciences
- Versus Industry: State-of-the-art across most benchmarks
- Safety Profile: Best alignment of any frontier model
Getting Started
- Access at claude.ai or via API (`claude-opus-4-6`)
- Start with default high effort, adaptive thinking
- Enable context compaction for long tasks
- Experiment with effort levels for your use cases
- Try agent teams in Claude Code for parallel work
- System card for full technical details
- Developer documentation for API features
- Partner case studies for real-world examples
- Support center for implementation help
