o10Last updated 2026-06-14b

GPT OSS 120B — model profile

GPT OSS 120B (openai/gpt-oss-120b) is a text model via the unified inference gateway at ~$0.350/M input with 131K tokens context. o10 routes GPT OSS 120B only when per-use-case evals clear at your quality floor — not as a default frontier choice.

Dashboards observe.
o10 enforces.

Cost dashboards tell you what you spent. o10 sits in the request path and changes what you spend — shadow first, then enforce.

SummaryKey takeaways

What you need to know

Short, self-contained answers with cited stats — read the sections below for full context.

What is GPT OSS 120B?

This model excels at efficient reasoning across science, math, and coding applications. It’s ideal for real-time coding assistance, processing large documents for Q&A and summarization, agentic research workflows, and regulated on-premises workloads.

Up to 638× spread between most and least expensive compliant routes for identical workloads at the same quality floor (o10 State of Inference Spend 2026).

How much does GPT OSS 120B cost via the inference gateway?

GPT OSS 120B via unified inference gateway: ~$0.350/M input, ~$0.750/M output, ~$0.250/M cache read. Endpoint pricing may vary by provider backend. Up to 638× spread between most and least expensive compliant routes for identical workloads at the same quality floor (o10 State of Inference Spend 2026).

Up to 638× spread between most and least expensive compliant routes for identical workloads at the same quality floor (o10 State of Inference Spend 2026).

What is the context window for GPT OSS 120B?

GPT OSS 120B supports 131K tokens with up to 131,000 max output tokens via the unified inference gateway. Context capacity affects RAG and agent workloads — set per-use-case eval floors independently.

Which providers serve GPT OSS 120B?

GPT OSS 120B is served by baseten, bedrock, cerebras, fireworks, groq, nebius, parasail, togetherai via the unified inference gateway (8 endpoints). o10 compares latency, uptime, and price across all venues — not just within one gateway.

What capabilities does GPT OSS 120B support?

GPT OSS 120B capabilities via unified inference gateway: reasoning, tool-use, implicit-caching. Type: text. Route to cheaper compliant tiers when evals pass — marginal productivity should cost marginal tokens.

When should o10 route to GPT OSS 120B?

Evaluate GPT OSS 120B in shadow mode against cheaper compliant alternatives at your quality floor. o10 enforces eval-gated routing in the request path — frontier models for frontier problems only. Up to 638× spread between most and least expensive compliant routes for identical workloads at the same quality floor (o10 State of Inference Spend 2026).

01Deep dive

Specifications

GPT OSS 120B technical profile in the unified inference model catalog.

GPT OSS 120B specifications

Field	Value
Gateway ID	openai/gpt-oss-120b
Provider	openai
Type	text
Context window	131K tokens
Max output tokens	131,000
Released	2025-08-05
Tags	reasoning, tool-use, implicit-caching
Modality	text→text

02Deep dive

Token pricing

List pricing for GPT OSS 120B via the unified inference gateway (per 1M tokens unless noted).

Unified inference gateways pass through provider list prices with zero markup on tokens.

Endpoint-specific pricing may differ — see provider endpoints below.

GPT OSS 120B gateway pricing

Unit	Rate
Input	$0.350/M
Output	$0.750/M
Cache read	$0.250/M

03Deep dive

Provider endpoints

Backends serving GPT OSS 120B via the unified inference gateway with live latency and uptime signals.

Multiple provider endpoints may serve the same model ID with different latency, throughput, and pricing.

o10 can select the cheapest compliant endpoint across all venues — not only within one gateway.

GPT OSS 120B provider endpoints

Provider	Input	Latency p50	Throughput p50	Uptime 24h
baseten	$0.100/M	0.1s	279 tps	100.0%
bedrock	$0.150/M	0.4s	168 tps	98.7%
cerebras	$0.350/M	0.2s	1073 tps	99.8%
fireworks	$0.150/M	0.1s	264 tps	99.1%
groq	$0.150/M	0.3s	488 tps	100.0%
nebius	$0.150/M	0.5s	299.5 tps	100.0%
parasail	$0.100/M	0.3s	222 tps	100.0%
togetherai	$0.150/M	0.6s	127 tps	97.9%

04Deep dive

Capabilities & compliance

GPT OSS 120B feature tags and routing notes.

Supported capabilities: reasoning, tool-use, implicit-caching.

Eval-gated routing ensures this model serves only workloads that clear your quality floor.

reasoning
tool use
implicit caching

05Deep dive

o10 routing guidance

When to route production traffic to GPT OSS 120B.

GPT OSS 120B is in the unified inference model catalog. Compare against cheaper compliant tiers at your eval floor before defaulting production traffic.

Marginal cost of productivity equals marginal cost of tokens when routing is enforced in the path.

Eval-gated selection
Shadow mode baseline
Enforce mode holds envelopes

SourceMethodology

Gateway catalog snapshot 2026-06-13. Model ID: openai/gpt-oss-120b. Verify pricing against your gateway provider's published pricing.

FAQFrequently asked questions

Common questions

What is GPT OSS 120B in the model catalog?

GPT OSS 120B (openai/gpt-oss-120b) is a text model via the unified inference gateway — one endpoint to access baseten, bedrock, cerebras, fireworks, groq, nebius, parasail, togetherai provider backends. The unified inference gateway lists GPT OSS 120B at ~$0.35/1M input tokens and ~$0.75/1M output tokens. Gateways simplify API access; o10 sits above the gateway to enforce eval-gated routing and CFO-grade spend envelopes per use case.

How much does GPT OSS 120B cost?

The unified inference gateway lists GPT OSS 120B at ~$0.35/1M input tokens and ~$0.75/1M output tokens. Per-provider endpoint pricing may differ — see the provider endpoints table on this page. Cache read pricing applies when supported. o10 records fully loaded cost per call in an immutable ledger so finance sees unit economics, not blended invoices. Savings are verified against your own shadow baseline per use case — not industry averages or vendor marketing claims. o10 mirrors a week or more of production traffic, segments by workload, and compares what you actually spent versus what you would have spent on the cheapest eval-passing route at the same quality floor. Finance signs off on the delta before enforce mode flips. Gainshare pricing ties o10 fees to this verified number, so savings must be real and auditable.

What is the context window for GPT OSS 120B?

GPT OSS 120B supports a 131K tokens context window via the unified inference gateway with capabilities: reasoning, tool-use, implicit-caching. Context limits affect RAG and agent workloads — segment by use case and set eval floors independently. Long context does not justify defaulting to frontier pricing on every call.

Which providers serve GPT OSS 120B?

GPT OSS 120B is available via baseten, bedrock, cerebras, fireworks, groq, nebius, parasail, togetherai through the unified inference gateway. Multi-provider routing lets you compare latency, uptime, and per-endpoint pricing. o10 selects the cheapest compliant endpoint across all venues — gateway, OpenRouter, Bedrock committed capacity, and owned models — not just within one gateway.

When should you route production traffic to GPT OSS 120B?

o10 can route GPT OSS 120B via the unified inference gateway when evals clear at your quality floor. Compare against cheaper compliant tiers in shadow mode before enforcing — The 638× figure is the observed ratio between the most and least expensive compliant routing options for identical enterprise workloads at the same per-use-case quality floor across venues — not a guarantee for every team. o10 measured this across unified inference gateway, OpenRouter, Amazon Bedrock committed capacity, and owned open-weight in June 2026. Actual savings depend on your venue mix, volumes, and eval floors; shadow mode proves your organization's number against your baseline.

Does o10 replace the inference gateway for GPT OSS 120B?

No. o10 does not replace your AI gateway or developer-facing APIs. It sits above gateways and clouds, adding spend enforcement, eval-gated routing, policy, and CFO-grade ledger — not proxy compatibility. Teams keep their per-token API gateway, OpenRouter, or LiteLLM for access; o10 changes which model and venue serve each request based on cost, eval floor, and governance rules. The split is intentional: gateways provide doors; control planes enforce economics. Keep your per-token API gateway for developer API access to GPT OSS 120B; add o10 above it for spend enforcement and KYI governance.

What is shadow mode?

Shadow mode mirrors live inference traffic through o10 without changing production routes. For every request, o10 evaluates candidate models against your per-use-case quality floors and records which route would have been cheapest and compliant — along with the cost delta — while the original provider still serves the response. Engineering sees proof without production risk; finance gets a verified savings figure tied to your traffic, not industry averages. Most teams run shadow for 7–14 days segmented by use case (support, RAG, code, batch) before flipping enforce mode.

What is enforce mode?

Enforce mode places o10 in the request path. On every call, o10 selects the cheapest model and venue that clears your eval-defined quality floor, holds the budget envelope, and applies residency and retention policy before the request reaches the provider. Failed eval candidates are never routed. Each enforced call writes an immutable ledger entry: model, venue, policy, jurisdiction, and fully loaded cost. Enforce without shadow proof is possible but discouraged — shadow establishes trust with engineering and finance first.

What is a quality floor?

A quality floor is the minimum eval score a model must achieve for a specific use case before o10 routes production traffic to it. Floors are per workload — support, RAG, code, and batch clear at different bars — and measured by replaying representative traffic through eval suites, not assumed from vendor benchmarks. Once a cheaper candidate passes the floor, o10 can route to it in shadow (proof) or enforce (live). Floors without evals are hopes; evals without floors are expensive defaults.

What is the 638× spread?

The 638× figure is the observed ratio between the most and least expensive compliant routing options for identical enterprise workloads at the same per-use-case quality floor across venues — not a guarantee for every team. o10 measured this across unified inference gateway, OpenRouter, Amazon Bedrock committed capacity, and owned open-weight in June 2026. Actual savings depend on your venue mix, volumes, and eval floors; shadow mode proves your organization's number against your baseline.

How are savings verified?

Savings are verified against your own shadow baseline per use case — not industry averages or vendor marketing claims. o10 mirrors a week or more of production traffic, segments by workload, and compares what you actually spent versus what you would have spent on the cheapest eval-passing route at the same quality floor. Finance signs off on the delta before enforce mode flips. Gainshare pricing ties o10 fees to this verified number, so savings must be real and auditable.

How do I call GPT OSS 120B via the inference gateway?

Use model ID `openai/gpt-oss-120b` with your gateway's OpenAI-compatible `/v1` endpoint. Compatible with AI SDK, OpenAI Chat Completions, and Anthropic Messages formats. o10 intercepts the routing decision above the gateway — pointing your app at o10 in enforce mode changes which model ID is selected per request based on eval floors and budget envelopes.

o10Set the envelope. o10 holds it.

See what you're overpaying.

Paste a week of traffic. Get the number that books the audit.

See what you're overpaying →

verified savings methodology · State of Inference Spend 2026