
GPT-5.6 Sol Benchmarks: Terminal-Bench 2.1 Scores, Pricing, and Release Status

> date: PUBLISHED ON JUN 30, 2026
> decoder: VERTU SIGNALS

Why it matters

GPT-5.6 Sol scored 88.8% on Terminal-Bench 2.1 and 91.9% in Sol Ultra mode. OpenAI says access is limited to a small trusted-partner preview before broader availability in the coming weeks.

Quick facts

What happened

OpenAI released GPT-5.6 in limited preview on June 26, 2026, giving an OpenAI flagship a visible, if narrow, lead over Claude on Terminal-Bench 2.1. The benchmark measures agentic coding and is a more useful signal for agentic coding workflows than older single-shot coding benchmarks such as HumanEval.

The launch came with three SKUs — Sol, Terra, and Luna — each priced and positioned differently. Sol targets agentic coding, long-horizon reasoning, and security; Terra is positioned as a daily workhorse that matches GPT-5.5 at roughly half the cost; Luna is the speed-first tier for classification and routing workloads.

OpenAI says access is limited to a small group of trusted partners whose participation has been shared with the US government. Secondary reporting puts the cohort at roughly 20 partners, and OpenAI says public availability will follow "in the coming weeks." The federal review pattern that began with Anthropic Mythos 5 on June 12 is an early signal that public launches are now subject to a coordination window, although OpenAI has not officially described its own gating sequence that way.

Why people are searching for it

Three search intents are spiking on the GPT-5.6 Sol benchmarks topic as of June 30:

Benchmark hunters searching for the actual Terminal-Bench 2.1 numbers and the leaderboard ranking
Enterprise IT procurement teams checking if Sol is worth a Q3 2026 eval budget
AI policy watchers tracking the federal cybersecurity-review pattern that began with Anthropic Mythos 5 on June 12

The first cohort is the highest-volume; the second is the highest-value; the third is the slowest-burn but the most strategic for any regulated-vertical enterprise.

Key numbers

A 5.4-point jump from GPT-5.5 to Sol in eight months is a material single-generation lift on this benchmark. Sol's narrow edge over Claude Mythos 5 gives OpenAI a visible lead on Terminal-Bench 2.1, though the margin is small and benchmark-specific.

Cached input tokens are billed at roughly 10% of standard input across all three tiers.
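The cached-input discount can be folded into a quick cost estimate. The sketch below is illustrative only: the function name, its parameters, and the per-million price used in the example are hypothetical placeholders, not OpenAI's published rates. The only figure taken from the article is the roughly 10% ratio.

```python
def estimate_input_cost(total_input_tokens, cached_tokens,
                        price_per_million, cached_ratio=0.10):
    """Estimate input-token cost when cached tokens bill at a fraction of standard.

    cached_ratio=0.10 reflects the article's "roughly 10%" figure.
    price_per_million is a placeholder supplied by the caller; it is NOT
    an actual GPT-5.6 price.
    """
    if not 0 <= cached_tokens <= total_input_tokens:
        raise ValueError("cached_tokens must be between 0 and total_input_tokens")
    uncached_tokens = total_input_tokens - cached_tokens
    # Cached tokens count as a fraction of a standard token for billing.
    billable_tokens = uncached_tokens + cached_tokens * cached_ratio
    return billable_tokens / 1_000_000 * price_per_million


# Hypothetical example: 1M input tokens, 80% served from cache, $10 per 1M tokens.
example_cost = estimate_input_cost(1_000_000, 800_000, price_per_million=10.0)
```

For workloads that resend a long, stable prefix (system prompt, repository context), the cached share dominates, so the effective input price approaches the cached rate.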

What changed since the last update

Compared to GPT-5.5:

Agentic coding: +5.4 points on Terminal-Bench 2.1, and +8.5 points in Sol Ultra mode
Cybersecurity: New ExploitBench score roughly matches Claude Mythos 5 Preview at one-third the output tokens
Pricing model: Three-tier structure replaces the previous "main model + separate cheap router" pattern
Availability: Launch access is gated by a federal review window, a first for an OpenAI release
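The point gaps quoted in this article can be cross-checked with a few lines of arithmetic. In the sketch below, the Sol, Sol Ultra, and Mythos 5 scores and the 5.4-point gain are the figures reported in the article; the GPT-5.5 baseline is not stated directly and is derived here as an assumption (Sol score minus the reported gain). Variable and function names are illustrative.

```python
# Terminal-Bench 2.1 figures as reported in this article (percent).
SOL = 88.8
SOL_ULTRA = 91.9
MYTHOS_5 = 88.0
SOL_GAIN_OVER_GPT_55 = 5.4

# Assumption: GPT-5.5's score is implied by the reported gain, not quoted directly.
gpt_55_implied = round(SOL - SOL_GAIN_OVER_GPT_55, 1)


def point_gap(score_a, score_b):
    """Difference in percentage points, rounded to one decimal place."""
    return round(score_a - score_b, 1)


ultra_gain = point_gap(SOL_ULTRA, gpt_55_implied)  # Sol Ultra vs implied GPT-5.5
mythos_margin = point_gap(SOL, MYTHOS_5)           # Sol vs Claude Mythos 5
```

The derived values agree with the +8.5-point Ultra figure and the roughly 0.8-point margin over Mythos 5, which suggests the reported numbers are internally consistent.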

What it means

For most enterprise IT teams, the headline number — 88.8% on Terminal-Bench 2.1 — is a procurement signal, not a deployment trigger. The actual question is whether Sol is worth a Q3 2026 eval budget, given that the Mythos 5 precedent suggests procurement teams may want to model a 2–6 week delay.

For teams operating in regulated verticals (finance, healthcare, defense-adjacent), the procurement memo may now need to model three tracks: a cloud model (federal-gated), a local model (no gating), and a wait-and-see track for the public rollout. The Mythos 5 / GPT-5.6 pattern is a useful planning signal for 2026, not yet an established baseline.

Benchmark limitations

A few caveats on the headline numbers are worth flagging before a procurement decision:

Terminal-Bench 2.1 was introduced with GPT-5.6. The comparison figures are genuine measurements rather than retrofitted marketing, but the benchmark is new and lacks the long-term predictive track record of older evals.
Sol Ultra pricing is unconfirmed. OpenAI has published Sol / Terra / Luna pricing but not a separate Sol Ultra tier as of June 30, 2026. Treat Ultra-mode cost as unconfirmed until OpenAI publishes billing details.
Cohort size "~20 partners" is from secondary reporting. OpenAI's official language is "a small group of trusted partners whose participation has been shared with the US government." The ~20 figure comes from security press reporting and may shift.
SWE-bench is where Sol does not lead. Public leaderboards still show Claude Fable 5 (95.00% on SWE-bench Verified) and Claude Opus 4.8 (88.60%) ahead. The GPT-5.6 lift is on agentic coding-from-a-shell, not on file-editing-agent benchmarks.

FAQ

Is GPT-5.6 Sol available to the public? No. As of June 30, 2026, access is limited to a small trusted-partner preview. OpenAI has stated the public rollout is "in the coming weeks," with no specific date.

How does GPT-5.6 Sol compare to Claude Mythos 5? On Terminal-Bench 2.1, Sol (88.8%) is roughly 0.8 points ahead of Mythos 5 (88.0%). On ExploitBench, Sol is comparable to Mythos 5 Preview while using approximately one-third the output tokens.

What is Terminal-Bench 2.1? A real-world command-line agentic-coding benchmark introduced with GPT-5.6. The agent is dropped into a container with a goal and has to plan, invoke tools, recover from failures, and iterate until the task succeeds. It is a more useful signal for agentic coding workflows than older single-shot coding benchmarks such as HumanEval.
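To make the plan, act, recover, iterate loop concrete, here is a minimal toy sketch. It is not the Terminal-Bench harness: `run_episode`, `toy_execute`, and `toy_propose` are hypothetical names, the "shell" is a simulated in-memory file set rather than a real container, and the scripted policy stands in for a model.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, int, str]  # (command, exit code, output)


def run_episode(propose: Callable[[List[Step]], str],
                execute: Callable[[str], Tuple[int, str]],
                is_done: Callable[[], bool],
                max_steps: int = 10) -> Tuple[bool, List[Step]]:
    """Ask the policy for a command, run it, record the result, repeat until done."""
    history: List[Step] = []
    for _ in range(max_steps):
        command = propose(history)           # plan the next action from past results
        exit_code, output = execute(command)  # invoke the tool
        history.append((command, exit_code, output))
        if is_done():                         # task-level success check
            return True, history
    return False, history


# Toy stand-ins (assumptions for illustration only).
files = set()


def toy_execute(command: str) -> Tuple[int, str]:
    """Simulate 'mkdir' and 'touch' against an in-memory set of paths."""
    verb, _, arg = command.partition(" ")
    if verb == "mkdir":
        files.add(arg)
        return 0, ""
    if verb == "touch":
        parent = arg.rsplit("/", 1)[0]
        if parent not in files:
            return 1, f"touch: {parent}: no such directory"
        files.add(arg)
        return 0, ""
    return 127, f"{verb}: command not found"


def toy_propose(history: List[Step]) -> str:
    """Scripted policy: if the last command failed, create the missing directory."""
    if history and history[-1][1] != 0:
        return "mkdir out"
    return "touch out/report.txt"


solved, trace = run_episode(toy_propose, toy_execute,
                            lambda: "out/report.txt" in files)
```

In this toy run the first command fails, the policy recovers by creating the directory, and the retry succeeds. A benchmark of this kind scores whether the final state meets the goal, not whether every intermediate command worked.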

Should my team run a Sol eval now? If your team is in the trusted-partner cohort: yes, the eval window is open. If not, the realistic eval window is 2–6 weeks after public rollout. Plan a procurement memo with both tracks.

What does Sol Ultra cost? OpenAI has not published separate Sol Ultra pricing as of June 30, 2026. Treat Ultra-mode cost as unconfirmed until OpenAI publishes billing details.

Sources checked

OpenAI — Previewing GPT-5.6 Sol (launch announcement)
OpenAI Help Center — A preview of GPT-5.6 Sol, Terra, and Luna (pricing)
Vals AI — Terminal-Bench 2.1 leaderboard
Cybersecurity Dive — OpenAI model government limit request
SecurityWeek — OpenAI and Anthropic limit new AI models to Trump-approved customers
Lush Binary — GPT-5.6 Sol benchmarks deep dive
Epoch AI — ExploitBench benchmark methodology
Vals AI — SWE-bench leaderboard

For a Different Kind of Audience

If your workload is "summarize a doc, draft an email, route a ticket," the GPT-5.6 cloud is the right tool. If your workload is "an attorney-client conversation that cannot leave the room, a board memo that cannot be logged, or sensitive personal data that should never be aggregated to a third party," a different kind of device exists: see luxury phones with on-device AI assistants for hardware designed for more private, local-first AI workflows.
