API reference
Streaming
How responses arrive — complete on the hosted API today, and token-by-token when you self-host.
Hosted API
The managed API currently returns a single complete response. You may send "stream": true for OpenAI-SDK compatibility; it is accepted and ignored, and you get the full completion in one object. These models are small and fast, so end-to-end latency is low for typical front-desk turns.
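As a rough illustration of the hosted behaviour, the sketch below builds a request body that includes "stream": true and reads the text out of a single complete response. It assumes the response follows the OpenAI chat.completion shape (choices[0].message.content), which this page implies but does not spell out. The helper names build_request and read_completion, the reuse of the model name "fitness", and the sample response body are all hypothetical.

```python
import json


def build_request(user_text: str, model: str = "fitness") -> dict:
    """Build a chat request body. The hosted API accepts "stream" and ignores it."""
    return {
        "model": model,  # assumption: same model name as the self-hosted example
        "messages": [{"role": "user", "content": user_text}],
        "stream": True,
    }


def read_completion(body: str) -> str:
    """Return the assistant text from one complete response body.

    Assumes the OpenAI chat.completion shape: choices[0].message.content.
    """
    data = json.loads(body)
    return data["choices"][0]["message"]["content"]


# Hypothetical response body, for illustration only.
example_body = json.dumps({
    "object": "chat.completion",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "We open at 6 am."},
        "finish_reason": "stop",
    }],
})

print(read_completion(example_body))
```

Because the whole reply arrives in one object, there is no delta loop to write; the same request body can be sent to a streaming endpoint later without changes.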
When hosted streaming lands, stream=True will keep working unchanged.
When self-hosting
Run the weights with vLLM or llama.cpp and you get native, real-time streaming today — both expose an OpenAI-compatible streaming endpoint. Set stream=True and consume deltas:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
stream = client.chat.completions.create(
    model="fitness",
    messages=[{"role": "user", "content": "Beginner full-body workout?"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

See Self-hosting to stand up that endpoint.
The SSE format
When streaming, the response is text/event-stream: a sequence of data: lines, each carrying a chat.completion.chunk object whose choices[0].delta.content holds the next piece of text. The stream ends with a data: [DONE] sentinel. This is byte-for-byte the OpenAI streaming contract, so any OpenAI streaming client parses it without changes.
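For readers who are not using an OpenAI client, the format described above can be parsed by hand. The following is a minimal sketch, not a full SSE implementation: it assumes each event fits on one data: line, and the function name iter_sse_text and the sample lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_sse_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text pieces carried by the data: lines of an event stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue  # tolerate chunks that carry no choices
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            yield text


# Illustrative stream body, split into lines (not captured from a real server).
sample = [
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "Start with "}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "squats."}}]}',
    "",
    "data: [DONE]",
]

print("".join(iter_sse_text(sample)))  # prints: Start with squats.
```

In practice the lines would come from an HTTP client that iterates over the response body line by line; the parsing logic stays the same.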