Everything You Need to Know: Benchmark Dominance (144 Elo Over GPT-5.2, #1 on Terminal-Bench), New Features (Adaptive Thinking, Effort Controls, Context Compaction), Safety Leadership, and Partner Testimonials
Claude Opus 4.6 is Anthropic's most significant upgrade to its smartest model, delivering state-of-the-art performance across agentic coding, knowledge work, and expert reasoning while introducing major new productivity features.
Benchmark Dominance: Outperforms GPT-5.2 by 144 Elo points on GDPval-AA (economically valuable knowledge work), posts the highest industry score on Terminal-Bench 2.0 (agentic coding), leads on Humanity's Last Exam (complex multidisciplinary reasoning), and scores best on BrowseComp (hard-to-find information search), consistently beating all frontier models.
Coding Excellence: Improved planning, longer sustained agentic tasks, reliable operation in larger codebases, and better code review and debugging that catches its own mistakes; 90.2% on BigLaw Bench (Harvey, legal), 81.42% on SWE-bench Verified (with modifications), and autonomous closure of 13 issues plus delegation of 12 assignments in a single day (Rakuten).
Context Breakthrough: First Opus-class model with a 1M-token context window (beta); 76% on MRCR v2 8-needle versus Sonnet 4.5's 18.5%, a qualitative shift that effectively eliminates "context rot"; 128k output tokens for large completions; superior long-context retrieval and reasoning.
New Features: Adaptive thinking (the model decides when extended reasoning helps), effort controls (low/medium/high/max to balance intelligence, speed, and cost), context compaction (auto-summarizes older context so long tasks can continue), agent teams in Claude Code (parallel coordination), and Claude in PowerPoint (research preview).
Safety Leadership: Lowest misaligned-behavior rate among frontier models, the most extensive safety evaluations Anthropic has run, new cybersecurity probes, and the lowest over-refusal rate of recent Claude models.
Partner Validation: 20 major companies (GitHub, Replit, Cursor, Notion, Asana, Thomson Reuters, and more) report a "huge leap," a "clear step up," the "biggest leap in months," and an "almost unbelievable performance jump."
Availability: Live now on claude.ai, the API (claude-opus-4-6), and major cloud platforms; pricing unchanged at $5/$25 per million tokens.
Part I: The Performance Revolution
Benchmark Dominance Across Categories
Knowledge Work (GDPval-AA):
The Evaluation: Real-world economically valuable tasks across finance, legal, and professional domains
Claude Opus 4.6 Performance: Industry-leading
Versus GPT-5.2: +144 Elo points (translates to winning ~70% of comparisons)
Versus Opus 4.5: +190 Elo points
Significance: Largest performance gap in knowledge work category
Independent Verification: Run by Artificial Analysis (see methodology)
Agentic Coding (Terminal-Bench 2.0):
The Evaluation: Real-world system tasks and coding in terminal environments
Claude Opus 4.6 Score: Highest in industry
Framework: Terminus-2 harness
Resource Allocation: 1× guaranteed / 3× ceiling
Samples: 5-15 per task across staggered batches
What It Tests: Multi-file editing, system configuration, debugging, tool usage
Expert Reasoning (Humanity's Last Exam):
The Evaluation: Complex multidisciplinary reasoning test designed to challenge frontier models
Claude Opus 4.6: Leads all frontier models
Configuration: With tools (web search, web fetch, code execution, programmatic tool calling)
Advanced Settings: Context compaction at 50k→3M tokens, max reasoning effort, adaptive thinking
Domain Decontamination: Blocklist applied to ensure clean results
Agentic Search (BrowseComp):
The Evaluation: Locating hard-to-find information online through multi-step search
Claude Opus 4.6: Best performance in industry
With Multi-Agent Harness: 86.8% score
Configuration: Web search + fetch, context compaction 50k→10M tokens, max effort, no thinking mode
What It Measures: Information retrieval accuracy, search strategy quality, source synthesis
Comprehensive Benchmark Table
Coding Benchmarks:
| Benchmark | Opus 4.6 | Comparison | Notes |
|---|---|---|---|
| Terminal-Bench 2.0 | #1 Industry | Beats all competitors | Terminus-2 harness, 5-15 samples |
| SWE-bench Verified | 81.42% | With modifications | 25-trial average, prompt optimization |
| OpenRCA | Highest | Root cause analysis | Official methodology verification |
| Multilingual Coding | State-of-art | Cross-language issues | Multiple programming languages |
Knowledge Work & Reasoning:
| Benchmark | Opus 4.6 | Delta | Significance |
|---|---|---|---|
| GDPval-AA | Leading | +144 Elo vs GPT-5.2 | ~70% win rate |
| Humanity's Last Exam | #1 | Leads all frontier | Complex multidisciplinary |
| Life Sciences | 2× vs Opus 4.5 | Nearly double | Biology, chemistry, phylogenetics |
Long-Context Performance:
| Test | Opus 4.6 | Opus 4.5 | Sonnet 4.5 | Improvement |
|---|---|---|---|---|
| MRCR v2 (8-needle, 1M) | 76% | N/A | 18.5% | 4.1× vs Sonnet |
| Context Window | 1M tokens | 200k | 256k | First Opus with 1M |
Specialized Domains:
| Domain | Benchmark | Score | Notes |
|---|---|---|---|
| Legal | BigLaw Bench | 90.2% | 40% perfect scores, Harvey |
| Cybersecurity | CyberGym | #1 | Vulnerability detection, NBIM 38/40 wins |
| Long-term Focus | Vending-Bench 2 | +$3,050.53 vs Opus 4.5 | Sustained performance |
| Biology/Chemistry | Life Sciences Tests | 2× Opus 4.5 | Computational biology, organic chem |
What the Improvements Mean
Better Planning:
- Identifies most challenging parts without prompting
- Allocates cognitive resources intelligently
- Moves quickly through straightforward sections
- Handles ambiguous problems with better judgment
Longer Sustainability:
- Stays productive over extended sessions
- Maintains quality as context grows
- Doesn't drift or lose focus
- Handles multi-hour agentic workflows
Larger Codebase Reliability:
- Operates confidently in millions of lines of code
- Tracks dependencies across files
- Navigates unfamiliar architectures
- SentinelOne: “multi-million-line codebase migration like a senior engineer”
Enhanced Self-Correction:
- Reviews own code before submitting
- Catches mistakes proactively
- Debugs autonomously
- Vercel: “frontier-level reasoning, especially with edge cases”
Part II: The 1M Context Window Breakthrough
Eliminating “Context Rot”
The Previous Problem: AI models degrade as conversations exceed token limits—losing track of information, missing buried details, producing inconsistent responses
Opus 4.6 Solution: First Opus-class model with 1 million token context window (beta)
The MRCR v2 Proof:
Test: 8-needle, 1M variant—information “hidden” in vast amounts of text
Opus 4.6: 76% retrieval accuracy
Sonnet 4.5: 18.5% retrieval accuracy
Improvement: 4.1× better at maintaining performance over long contexts
Qualitative Shift: From “struggles beyond 200k” to “reliably handles 1M”
Long-Context Applications
Document Analysis:
- Entire legal briefs with exhibits
- Complete financial reports with appendices
- Technical documentation sets
- Multi-book literature review
Codebase Understanding:
- Full application codebases
- Dependency chains across projects
- Architecture documentation + code
- Historical commit context
Research Synthesis:
- Dozens of research papers
- Conference proceedings
- Patent portfolios
- Scientific literature reviews
Extended Conversations:
- Week-long project discussions
- Accumulated knowledge bases
- Historical decision contexts
- Evolving requirements specifications
The 128k Output Token Advantage
Previous Limitation: Long outputs required breaking into multiple requests
Opus 4.6 Capability: Up to 128,000 tokens in single output
Enabled Use Cases:
- Complete technical documentation
- Comprehensive reports
- Full application code
- Detailed analysis documents
- Multi-section deliverables in one response
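For large single-pass outputs, streaming is the practical pattern. Below is a minimal sketch using the Anthropic Python SDK; the model string and the 128k output ceiling come from this announcement, and any beta headers or account limits that may apply are not shown (check the API docs):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Request a large single-pass deliverable and stream it as it is generated,
# rather than splitting the work across multiple requests.
with client.messages.stream(
    model="claude-opus-4-6",
    max_tokens=128000,  # up to 128k output tokens in one response, per this announcement
    messages=[{
        "role": "user",
        "content": "Write complete technical documentation for the attached API spec.",
    }],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```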
Partner Testimony: Bolt.new CEO Eric Simons: "one-shotted a fully functional physics engine, handling a large multi-scope task in a single pass"
Premium Pricing for Extended Context
Standard Pricing (up to 200k tokens):
- Input: $5 per million tokens
- Output: $25 per million tokens
Extended Context (200k-1M tokens):
- Input: $10 per million tokens
- Output: $37.50 per million tokens
- Multiplier: 2× for input, 1.5× for output
When It's Worth It:
- Multi-document analysis requiring simultaneous access
- Codebase-scale operations
- Historical context critical to quality
- Single-pass complex deliverables
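To make the tiered pricing concrete, here is a rough cost estimator for the rates quoted above. It is a sketch of the arithmetic only: it assumes the whole request bills at the extended rate once input exceeds 200k tokens, which may differ from the actual billing rules (see the pricing page):

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Rough cost estimate using the rates quoted above.

    <=200k input tokens: $5 / $25 per million (input / output).
    >200k input tokens:  $10 / $37.50 per million (2x input, 1.5x output).
    """
    if input_tokens <= 200_000:
        in_rate, out_rate = 5.00, 25.00
    else:
        in_rate, out_rate = 10.00, 37.50
    return (input_tokens / 1e6) * in_rate + (output_tokens / 1e6) * out_rate

# Example: a 500k-token multi-document analysis producing a 20k-token report.
print(f"${estimate_cost_usd(500_000, 20_000):.2f}")  # $5.75
```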
Part III: New Developer Platform Features
Adaptive Thinking
The Previous Binary: Enable or disable extended thinking—no middle ground
The New Intelligence:
How It Works: Model decides when deeper reasoning would be helpful based on task complexity
At Default (High Effort):
- Uses extended thinking when useful
- Skips it for straightforward tasks
- Balances quality and speed automatically
Developer Control: Adjust effort level to make model more/less selective
Benefit: Optimal performance without manual micromanagement per query
Example Behavior:
- Simple query: Instant response without thinking
- Ambiguous problem: Activates extended reasoning
- Edge case detection: Automatically thinks deeper
- Routine task: Efficient execution
Effort Controls (Four Levels)
Low Effort:
- Speed: Fastest responses
- Cost: Lowest token usage
- Use When: Simple queries, routine tasks, time-critical operations
- Thinking: Minimal extended reasoning
Medium Effort:
- Speed: Balanced
- Cost: Moderate
- Use When: Standard tasks, most everyday work
- Thinking: Selective extended reasoning
High Effort (Default):
- Speed: Quality-optimized
- Cost: Standard pricing
- Use When: Complex tasks requiring careful consideration
- Thinking: Extended reasoning when beneficial (adaptive)
Max Effort:
- Speed: Deepest reasoning (may be slower)
- Cost: Highest token usage
- Use When: Most challenging problems, critical decisions, expert-level work
- Thinking: Maximum extended reasoning, 120k thinking budget
Parameter Access: via the effort setting in the API
Anthropic Recommendation: "If the model is overthinking on a given task, dial effort down from high to medium"
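As a sketch of how this control might be set from the Anthropic Python SDK: the effort levels are from this announcement, but the exact request field is an assumption here (passed via the SDK's extra_body escape hatch), so consult the API documentation for the real parameter shape:

```python
import anthropic

client = anthropic.Anthropic()

# Dial effort down for a routine task, per the guidance above. The "effort"
# field below is a hypothetical wire format, not a confirmed API parameter.
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=2048,
    extra_body={"effort": "medium"},  # assumed field; levels are low/medium/high/max
    messages=[{"role": "user", "content": "Summarize this changelog in three bullets."}],
)
print(response.content[0].text)
```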
Context Compaction (Beta)
The Problem: Long conversations and agentic tasks hitting context limits
The Solution: Automatic summarization and replacement of older context
How It Works:
1. Threshold Configuration: Developer sets token limit (e.g., 50k, 100k)
2. Automatic Trigger: When conversation approaches threshold
3. Intelligent Summarization: Model summarizes older context
4. Seamless Replacement: Summary replaces original detailed context
5. Continued Operation: Task proceeds without hitting limits
Configuration Options:
- Custom threshold settings
- Preservation rules for critical context
- Summary detail levels
- Maximum total context after compaction
Use Cases:
- Multi-day debugging sessions
- Iterative design discussions
- Long-running research projects
- Extended customer support threads
- Continuous monitoring workflows
BrowseComp Example: Compaction at 50k→10M total tokens enabled deep search
Humanity's Last Exam: Compaction 50k→3M tokens for complex reasoning
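The hosted beta performs compaction server-side; as a mental model of the threshold → summarize → replace loop described above, here is a minimal client-side sketch. The helper function, the four-message keep window, and the 50k default are illustrative choices, not the API:

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-6"

def compact_if_needed(messages: list[dict], threshold_tokens: int = 50_000) -> list[dict]:
    """Summarize and replace older turns once the conversation nears the threshold."""
    if len(messages) <= 4:
        return messages
    used = client.messages.count_tokens(model=MODEL, messages=messages).input_tokens
    if used < threshold_tokens:
        return messages

    old, recent = messages[:-4], messages[-4:]  # keep the newest turns verbatim
    summary = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=old + [{
            "role": "user",
            "content": "Summarize the conversation so far, preserving every decision, "
                       "constraint, and open question needed to continue the task.",
        }],
    )
    # The summary replaces the detailed older context; the task continues below the limit.
    return [{"role": "user",
             "content": "[Compacted context]\n" + summary.content[0].text}] + recent
```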
US-Only Inference
Requirement: Workloads must run in United States data centers
Pricing: 1.1× token pricing (10% premium)
Use Cases:
- Regulated industries (healthcare, finance, government)
- Data residency compliance requirements
- US government contracts
- Legal/contractual restrictions
How to Enable: Specify in API call or platform settings
Documentation: See Data Residency
Part IV: Product Updates
Agent Teams in Claude Code (Research Preview)
The Innovation: Multiple agents working in parallel, coordinating autonomously
When to Use: Tasks splitting into independent, read-heavy work
Example: Codebase reviews across multiple repositories
How It Works:
- Main agent decomposes task
- Sub-agents instantiated for independent pieces
- Parallel execution
- Autonomous coordination
- Results synthesis
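Conceptually this is a fan-out/fan-in pattern. The asyncio sketch below is illustrative only (run_subagent is a hypothetical stand-in for executing one independent subtask, not the Claude Code API):

```python
import asyncio

async def run_subagent(subtask: str) -> str:
    # Hypothetical: call the model with tools scoped to this one subtask
    # (e.g., reviewing a single repository).
    return f"findings for {subtask}"

async def review_repositories(repos: list[str]) -> str:
    # Main agent decomposes the task into independent, read-heavy pieces...
    subtasks = [f"review {repo}" for repo in repos]
    # ...sub-agents execute in parallel...
    results = await asyncio.gather(*(run_subagent(t) for t in subtasks))
    # ...and the main agent synthesizes the results.
    return "\n\n".join(results)

print(asyncio.run(review_repositories(["api", "web", "infra"])))
```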
Developer Control:
- Take over any subagent: Shift+Up/Down
- Tmux integration support
- Monitor all agent activities
- Intervene when needed
Replit President Michele Catasta: “Breaks complex tasks into independent subtasks, runs tools and subagents in parallel, identifies blockers with real precision”
Rakuten GM Yusuke Kaji: “Autonomously closed 13 issues and assigned 12 issues to right team members in single day, managing ~50-person org across 6 repositories”
Claude in Excel (Substantial Upgrades)
Improved Capabilities:
Long-Running Tasks: Handles complex multi-step spreadsheet operations
Harder Problems: Solves challenging data analysis and modeling
Plan Before Acting: Thinks through approach before executing
Unstructured Data Ingestion: Processes messy data, infers proper structure automatically
Multi-Step Changes: Executes complex modifications in one pass
Use Cases:
- Financial model construction with Pivot Tables
- Data cleaning and structuring
- Complex formula creation
- Multi-sheet coordination
- Automated reporting
Shortcut.ai CTO Nico Christie: “Performance jump feels almost unbelievable. Real-world tasks challenging for Opus [4.5] suddenly became easy. Watershed moment for spreadsheet agents.”
Claude in PowerPoint (Research Preview)
Now Available: Max, Team, and Enterprise plans
Core Capabilities:
Layout Reading: Understands your existing templates
Font Recognition: Matches brand typography
Slide Master Awareness: Stays on brand automatically
Template Building: Constructs presentations from templates
Full Deck Generation: Creates complete presentations from descriptions
Visual Intelligence: Brings data to life with appropriate charts, graphics
Excel + PowerPoint Workflow:
- Process and structure data in Excel
- Transfer insights to PowerPoint
- Automatic visual generation
- Brand-consistent output
- Professional presentation ready
Part V: Safety and Alignment Leadership
Lowest Misaligned Behavior Rate
Automated Behavioral Audit Results:
Opus 4.6: Lowest misaligned behavior score of any frontier model
Versus Opus 4.5: Equal or better alignment (Opus 4.5 was previous best)
Misalignment Categories Tested:
- Deception and dishonesty
- Sycophancy (excessive agreement)
- Encouragement of user delusions
- Cooperation with misuse requests
- Harmful instruction following
Over-Refusal Rate: Lowest of recent Claude models
Balance Achieved: High safety without excessive caution on benign queries
Most Comprehensive Safety Evaluation Ever
New Evaluation Types:
- User wellbeing assessments
- Complex refusal testing
- Surreptitious harmful action detection
- Interpretability experiments (understanding why model behaves certain ways)
Upgraded Existing Tests:
- Enhanced dangerous request scenarios
- More sophisticated misuse attempts
- Multi-step harmful task detection
- Context-dependent safety evaluation
Interpretability Integration: Using science of AI model inner workings to catch problems standard testing might miss
System Card: Full details in Claude Opus 4.6 System Card
Cybersecurity-Specific Safeguards
The Context: Opus 4.6 shows enhanced cybersecurity abilities
Dual-Use Recognition: Capabilities helpful for defense, potentially harmful for offense
Six New Cybersecurity Probes: Methods detecting harmful responses across:
- Exploit development
- Vulnerability discovery misuse
- Attack methodology guidance
- Malicious code generation
- System intrusion assistance
- Data exfiltration techniques
Defensive Acceleration: Using Opus 4.6 to find and patch vulnerabilities in open-source software (see Anthropic cybersecurity blog)
Future Plans: Real-time intervention to block abuse as threats evolve
Philosophy: “Critical that cyberdefenders use AI models like Claude to level the playing field”
Part VI: Partner Testimonials and Real-World Validation
Development Tools and Platforms
GitHub CPO Mario Rodriguez:
“Delivering on complex, multi-step coding work developers face daily—especially agentic workflows demanding planning and tool calling. Unlocking long-horizon tasks at frontier.”
Cursor Co-founder Michael Truell:
“Stands out on harder problems. Stronger tenacity, better code review, stays on long-horizon tasks where others drop off. Team really excited.”
Replit President Michele Catasta:
“Huge leap for agentic planning. Breaks complex tasks into independent subtasks, runs tools and subagents in parallel, identifies blockers with real precision.”
Windsurf CEO Jeff Wang:
“Noticeably better than Opus 4.5 in Windsurf, especially on tasks requiring careful exploration like debugging and understanding unfamiliar codebases. Thinks longer, which pays off.”
Bolt.new CEO Eric Simons:
“Meaningful improvement for design systems and large codebases—enormous enterprise value use cases. One-shotted fully functional physics engine, handling large multi-scope task single pass.”
Vercel GM Zeb Hermann (v0):
“Only ship models developers genuinely feel difference. Opus 4.6 passed that bar with ease. Frontier-level reasoning with edge cases helps v0 elevate ideas from prototype to production.”
Enterprise and Knowledge Work
Notion AI Lead Sarah Sachs:
“Strongest model Anthropic shipped. Takes complicated requests, actually follows through, breaks into concrete steps, executes, produces polished work even when ambitious. Feels like capable collaborator.”
Asana Interim CTO Amritansh Raghav:
“Clear step up. Code, reasoning, planning excellent. Ability to navigate large codebase and identify right changes feels state-of-the-art.”
Thomson Reuters CTO Joel Hron:
“Meaningful leap in long-context performance. Handles much larger information bodies with consistency level strengthening how we design complex research workflows. More powerful building blocks for expert-grade systems.”
Harvey Head of AI Niko Grupen:
“Achieved highest BigLaw Bench score of any Claude model at 90.2%. With 40% perfect scores and 84% above 0.8, remarkably capable for legal reasoning.”
Box Head of AI Yashodha Bhavnani:
“Excels in high-reasoning tasks like multi-source analysis across legal, financial, technical content. Box eval showed 10% lift in performance, reaching 68% vs 58% baseline, near-perfect scores in technical domains.”
Product and Creative Tools
Shopify Staff Engineer Paulo Arruda:
“Best Anthropic model we've tested. Understands intent with minimal prompting, went above and beyond, exploring and creating details I didn't know I wanted until I saw them. Felt like working with model, not waiting on it.”
Figma CDO Loredana Crisan:
“Generates complex, interactive apps and prototypes in Figma Make with impressive creative range. Translates detailed designs and multi-layered tasks into code first try—powerful starting point.”
Lovable Co-founder Fabian Hedin:
“Uplift in design quality. Works beautifully with our design systems and more autonomous, core to Lovable's values. People should create things that matter, not micromanage AI.”
Security and Infrastructure
NBIM Head of AI & ML Stian Kirkeberg:
“Across 40 cybersecurity investigations, Opus 4.6 produced best results 38 of 40 times in blind ranking against Claude 4.5 models. Each model ran end-to-end on same agentic harness with up to 9 subagents and 100+ tool calls.”
SentinelOne Chief AI Officer Gregor Stewart:
“Handled multi-million-line codebase migration like senior engineer. Planned up front, adapted strategy as learned, finished in half the time.”
Ramp Staff Software Engineer Jerry Tsui:
“Biggest leap seen in months. More comfortable giving sequence of tasks across stack and letting run. Smart enough to use subagents for individual pieces.”
Specialized Applications
Rakuten GM AI Yusuke Kaji:
“Autonomously closed 13 issues and assigned 12 to right team members in single day, managing ~50-person organization across 6 repositories. Handled product and organizational decisions while synthesizing context across domains, knew when to escalate to human.”
Cognition Co-founder Scott Wu:
“Reasons through complex problems at level we haven't seen before. Considers edge cases other models miss, consistently lands on more elegant, well-considered solutions.”
Shortcut.ai CTO Nico Christie:
“Performance jump almost unbelievable. Real-world tasks challenging for Opus [4.5] suddenly became easy. Watershed moment for spreadsheet agents on Shortcut.”
Part VII: How to Use Claude Opus 4.6
Access Points
Claude.ai (Web/App):
- Direct access for all users
- Max, Team, Enterprise plans
- Integrated with Claude in Excel, PowerPoint
- Cowork autonomous multitasking
API (Developers):
- Model string: claude-opus-4-6
- Full documentation: platform.claude.com/docs
- All major cloud platforms supported
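A minimal first call, assuming the official Anthropic Python SDK (only the model string comes from this section):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-6",  # model string from this section
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello, Opus 4.6."}],
)
print(response.content[0].text)
```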
Claude Code (Terminal):
- Agent teams feature
- IDE integration (Xcode support announced)
- Autonomous coding workflows
Pricing Structure
Standard API:
- Input: $5 per million tokens
- Output: $25 per million tokens
- Unchanged: Same pricing as Opus 4.5
Extended Context (200k-1M tokens):
- Input: $10 per million tokens
- Output: $37.50 per million tokens
US-Only Inference:
- Multiplier: 1.1× standard pricing
- Use For: Compliance requirements
Full Details: claude.com/pricing
Configuration Best Practices
Effort Selection:
Low: Quick answers, simple queries, time-critical → Fast, cheap, minimal thinking
Medium: Standard tasks, most work → Balanced, selectively thinks
High (Default): Complex problems, quality-critical → Adaptive thinking, optimal balance
Max: Hardest challenges, expert-level → Maximum reasoning, 120k budget
Context Management:
Enable Compaction: For long-running tasks → Set threshold (e.g., 50k tokens) → Automatic summarization prevents limits
Use 1M Window: For multi-document analysis → Be aware of premium pricing 200k+ → Worth it when simultaneous access critical
Adaptive Thinking: Leave enabled at default (high effort) → Model decides when to think deeply → Dial down if overthinking simple tasks
Conclusion: The New Standard for AI Intelligence
What Opus 4.6 Achieves
Performance Leadership:
- 144 Elo over GPT-5.2 on knowledge work
- #1 on Terminal-Bench 2.0 agentic coding
- Leads Humanity's Last Exam reasoning test
- Best BrowseComp search performance
Context Breakthrough:
- 1M token window (first Opus-class)
- 76% on MRCR v2 (4.1× better than Sonnet)
- 128k output tokens
- Context rot effectively eliminated
Developer Empowerment:
- Adaptive thinking intelligence
- Four-level effort controls
- Context compaction for long tasks
- Agent teams in Claude Code
- US-only inference option
Product Integration:
- Upgraded Claude in Excel
- New Claude in PowerPoint
- Cowork autonomous multitasking
- Improved everyday work capabilities
Safety Excellence:
- Lowest misaligned behavior rate
- Most comprehensive evaluation ever
- Specialized cybersecurity safeguards
- Lowest over-refusal rate
Who Benefits Most
Developers: Agentic coding, codebase navigation, debugging, system tasks
Knowledge Workers: Financial analysis, legal research, document creation, presentations
Enterprises: Long-context analysis, compliance workflows, multi-repository management
Researchers: Literature synthesis, data analysis, expert-level reasoning
Creative Professionals: Design systems, prototyping, autonomous content creation
The Competitive Landscape
Versus GPT-5.2: +144 Elo on GDPval-AA, wins ~70% comparisons
Versus Previous Opus: +190 Elo on knowledge work, 2× on life sciences
Versus Industry: State-of-the-art across most benchmarks
Safety Profile: Best alignment of any frontier model
Getting Started
Immediate Steps:
- Access at claude.ai or via API (claude-opus-4-6)
- Start with default high effort, adaptive thinking
- Enable context compaction for long tasks
- Experiment with effort levels for your use cases
- Try agent teams in Claude Code for parallel work
Learning Resources:
- System card for full technical details
- Developer documentation for API features
- Partner case studies for real-world examples
- Support center for implementation help
The smartest AI just got smarter. And safer. And more autonomous. Same price.