Quick facts
What happened
OpenAI shipped GPT-5.6 on June 26, 2026. It is one of the clearest recent examples of an OpenAI flagship taking the lead over Claude on Terminal-Bench 2.1, an agentic-coding benchmark that is a more useful signal for agentic coding workflows than older single-shot coding benchmarks such as HumanEval.
The launch came with three SKUs — Sol, Terra, and Luna — each priced and positioned differently. Sol targets agentic coding, long-horizon reasoning, and security; Terra is positioned as a daily workhorse that matches GPT-5.5 at roughly half the cost; Luna is the speed-first tier for classification and routing workloads.
OpenAI says access is limited to a small group of trusted partners whose participation has been shared with the US government. Secondary reporting has described the cohort as roughly 20 partners, with public availability "in the coming weeks." The federal review pattern that began with Anthropic Mythos 5 on June 12 is an early signal that public launches may now be subject to a coordination window, although OpenAI has not officially characterized its own gating sequence this way.
Why people are searching for it
Three search intents are spiking on the GPT-5.6 Sol benchmarks topic as of June 30:
- Benchmark hunters searching for the actual Terminal-Bench 2.1 numbers and the leaderboard ranking
- Enterprise IT procurement teams checking if Sol is worth a Q3 2026 eval budget
- AI policy watchers tracking the federal cybersecurity-review pattern that began with Anthropic Mythos 5 on June 12
The first cohort is the highest-volume; the second is the highest-value; the third is the slowest-burn but the most strategic for any regulated-vertical enterprise.
Key numbers
A 5.4-point jump from GPT-5.5 to Sol in eight months is a material single-generation lift on this benchmark. Sol's narrow edge over Claude Mythos 5 (88.8% versus 88.0%) gives OpenAI a visible lead on Terminal-Bench 2.1, though the margin is small and benchmark-specific.
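The arithmetic behind these figures can be checked with a short sketch. It uses only numbers quoted in this article (Sol at 88.8%, Mythos 5 at 88.0%, lifts of +5.4 and +8.5 points over GPT-5.5). The GPT-5.5 and Sol Ultra scores it produces are derived from those lifts, not quoted from OpenAI, and the function name is hypothetical.

```python
# Figures quoted in this article (Terminal-Bench 2.1, percent).
SOL = 88.8         # GPT-5.6 Sol
MYTHOS_5 = 88.0    # Claude Mythos 5
SOL_LIFT = 5.4     # Sol's gain over GPT-5.5, in points
ULTRA_LIFT = 8.5   # Sol Ultra's gain over GPT-5.5, in points


def implied_scores() -> dict:
    """Derive the scores the article implies but does not state directly."""
    gpt55 = round(SOL - SOL_LIFT, 1)  # implied GPT-5.5 baseline
    return {
        "gpt55_implied": gpt55,
        "sol_ultra_implied": round(gpt55 + ULTRA_LIFT, 1),
        "sol_minus_mythos": round(SOL - MYTHOS_5, 1),
    }


if __name__ == "__main__":
    for name, value in implied_scores().items():
        print(f"{name}: {value}")
```

Read this way, the lead over Mythos 5 is under one point, while the generation-over-generation lift is several times larger, which is why the margin is best treated as benchmark-specific.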
Cached input tokens are billed at roughly 10% of standard input across all three tiers.
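To see what the cached-input discount does to a bill, here is a minimal sketch. The 10% rate comes from the sentence above. The per-million-token price is a placeholder the caller supplies, not a published OpenAI price, and the function name is hypothetical.

```python
CACHED_RATE = 0.10  # cached input billed at roughly 10% of standard input (per the article)


def input_cost(total_tokens: int, cached_fraction: float, price_per_million: float) -> float:
    """Blended input cost, in the same currency as price_per_million.

    price_per_million is a caller-supplied placeholder, not a real list price.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = total_tokens * cached_fraction
    fresh = total_tokens - cached
    # Fresh tokens pay the full rate; cached tokens pay the discounted rate.
    billable = fresh + cached * CACHED_RATE
    return billable * price_per_million / 1_000_000


if __name__ == "__main__":
    # Placeholder price of 10.0 per million tokens, half of the input cached.
    print(input_cost(1_000_000, 0.5, 10.0))
```

With half of the input served from cache, the blended input cost lands at about 55% of the uncached cost, whatever the actual list price turns out to be.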
What changed since the last update
Compared to GPT-5.5:
- Agentic coding: +5.4 points on Terminal-Bench 2.1, with Sol Ultra pushing +8.5 points
- Cybersecurity: New ExploitBench score roughly matches Claude Mythos 5 Preview at one-third the output tokens
- Pricing model: Three-tier structure replaces the previous "main model + separate cheap router" pattern
- Availability: Federal review window; this is the first time an OpenAI model has been gated at launch
What it means
For most enterprise IT teams, the headline number — 88.8% on Terminal-Bench 2.1 — is a procurement signal, not a deployment trigger. The actual question is whether Sol is worth a Q3 2026 eval budget, given that the Mythos 5 precedent suggests procurement teams may want to model a 2–6 week delay.
For teams operating in regulated verticals (finance, healthcare, defense-adjacent), the procurement memo should probably model three tracks: cloud model (federal-gated), local model (no gating), and a wait-and-see track for the public rollout. The Mythos 5 / GPT-5.6 pattern is a useful planning signal for 2026, not yet an established baseline.
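For planning purposes, the 2 to 6 week delay can be turned into calendar dates once a public rollout date is known. The sketch below assumes a rollout date supplied by the caller. The date in the example is a placeholder, since OpenAI has not announced one, and the function name is hypothetical.

```python
from datetime import date, timedelta


def eval_window(public_rollout: date, min_weeks: int = 2, max_weeks: int = 6):
    """Return (earliest, latest) realistic eval start dates after a public rollout."""
    if min_weeks > max_weeks:
        raise ValueError("min_weeks must not exceed max_weeks")
    earliest = public_rollout + timedelta(weeks=min_weeks)
    latest = public_rollout + timedelta(weeks=max_weeks)
    return earliest, latest


if __name__ == "__main__":
    # Placeholder rollout date for illustration only; no date has been announced.
    earliest, latest = eval_window(date(2026, 7, 20))
    print(earliest.isoformat(), latest.isoformat())
```

Rerunning this with each candidate rollout date gives the memo a concrete range to weigh against the Q3 2026 budget cycle.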
Benchmark limitations
A few caveats on the headline numbers worth flagging before a procurement decision:
- Terminal-Bench 2.1 was introduced with GPT-5.6. The comparison numbers appear to be genuine results rather than retrofitted marketing, but the benchmark is new and lacks the long-term predictive track record of older evals.
- Sol Ultra pricing is unconfirmed. OpenAI has published Sol / Terra / Luna pricing but not a separate Sol Ultra tier as of June 30, 2026. Treat Ultra-mode cost as unconfirmed until OpenAI publishes billing details.
- Cohort size "~20 partners" is from secondary reporting. OpenAI's official language is "a small group of trusted partners whose participation has been shared with the US government." The ~20 figure comes from security press reporting and may shift.
- Sol does not lead on SWE-bench. Public leaderboards still show Claude Fable 5 (95.00% on SWE-bench Verified) and Claude Opus 4.8 (88.60%) ahead. The GPT-5.6 lift is on agentic coding from a shell, not on file-editing-agent benchmarks.
FAQ
Is GPT-5.6 Sol available to the public? No. As of June 30, 2026, access is limited to a small trusted-partner preview. OpenAI has stated the public rollout is "in the coming weeks," with no specific date.
How does GPT-5.6 Sol compare to Claude Mythos 5? On Terminal-Bench 2.1, Sol (88.8%) is roughly 0.8 points ahead of Mythos 5 (88.0%). On ExploitBench, Sol is comparable to Mythos 5 Preview while using approximately one-third the output tokens.
What is Terminal-Bench 2.1? A real-world command-line agentic-coding benchmark introduced with GPT-5.6. The agent is dropped into a container with a goal and has to plan, invoke tools, recover from failures, and iterate until the task succeeds. It is a more useful signal for agentic coding workflows than older single-shot coding benchmarks such as HumanEval.
Should my team run a Sol eval now? If your team is in the trusted-partner cohort: yes, the eval window is open. If not, the realistic eval window is 2–6 weeks after public rollout. Plan a procurement memo with both tracks.
What does Sol Ultra cost? OpenAI has not published separate Sol Ultra pricing as of June 30, 2026. Treat Ultra-mode cost as unconfirmed until OpenAI publishes billing details.
Sources checked
- OpenAI — Previewing GPT-5.6 Sol (launch announcement)
- OpenAI Help Center — A preview of GPT-5.6 Sol, Terra, and Luna (pricing)
- Vals AI — Terminal-Bench 2.1 leaderboard
- Cybersecurity Dive — OpenAI model government limit request
- SecurityWeek — OpenAI and Anthropic limit new AI models to Trump-approved customers
- Lush Binary — GPT-5.6 Sol benchmarks deep dive
- Epoch AI — ExploitBench benchmark methodology
- Vals AI — SWE-bench leaderboard