API reference

Chat completions

Create a model response for a conversation. The request and response formats follow OpenAI's Chat Completions API; OpenAI fields this API does not support are accepted and ignored.

Endpoint

POST /api/v1/chat/completions

Authenticate by sending your fw_live_ key as a bearer token (Authorization: Bearer <key>; see Authentication), and set Content-Type: application/json.
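A minimal Bash pre-flight check, assuming the key is stored in FLYWHEEL_API_KEY as in the curl example on this page. require_fw_key is a hypothetical helper, not part of the API. It only checks that the key is present and has the documented fw_live_ prefix; the server still decides whether the key is valid.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of the API): fail fast before any
# request if the key is missing or lacks the documented fw_live_ prefix.
require_fw_key() {
  if [[ -z ${FLYWHEEL_API_KEY:-} ]]; then
    echo "FLYWHEEL_API_KEY is not set" >&2
    return 1
  fi
  if [[ $FLYWHEEL_API_KEY != fw_live_* ]]; then
    echo "FLYWHEEL_API_KEY does not look like a fw_live_ key" >&2
    return 1
  fi
  # Passing here is only a format check; the server still validates the key.
}
```

Calling require_fw_key || exit 1 at the top of a script stops it before the first request is sent.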

Request body

model (string, required)
The niche to answer as: either the bare slug (fitness) or the served id (flywheel-fitness). An optional :tag suffix is ignored. The catalog lists every available slug. An unknown model returns 404.
messages (array, required)
The conversation so far, as OpenAI message objects of the form { "role", "content" }, where role is system, user, or assistant. The API is stateless, so send the full history on every call.
max_tokens (integer, optional)
Maximum number of tokens to generate. Defaults to 512 and is clamped to a server ceiling of 1024. Generation can also stop earlier, at the model’s natural end of turn.
temperature (number, optional)
Sampling temperature. Lower values are more deterministic; higher values are more varied. Omit it to use the model’s tuned default.
stream (boolean, optional)
Accepted for OpenAI SDK compatibility. The hosted API currently returns a single complete response; incremental token streaming is on the roadmap (see Streaming).
Note: Unsupported OpenAI fields (e.g. tools, logprobs, n) are ignored rather than rejected, so existing OpenAI payloads keep working. See OpenAI compatibility for the full matrix.
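Because the API is stateless, a shell client has to keep the conversation itself and resend it on every call. The pure-Bash sketch below stores each turn, serializes the full messages array, and builds a request body with model and max_tokens. Every function name here is a hypothetical helper, not part of the API. json_escape handles only backslashes, double quotes, newlines, carriage returns, and tabs. effective_max_tokens merely mirrors the documented default (512) and ceiling (1024) so the body states the value that will apply; the server clamps either way, and its handling of zero or negative values isn't documented.

```shell
#!/usr/bin/env bash
# Sketch of a stateless chat client's bookkeeping (all names hypothetical).
# The API keeps no conversation state, so the client stores every turn and
# resends the whole list on each request.
chat_history=()

# Escape a string for use inside JSON quotes. Handles backslashes, double
# quotes, newlines, carriage returns, and tabs only; other control
# characters are NOT escaped.
json_escape() {
  local s=$1 bs='\' q='"'
  s=${s//"$bs"/"$bs$bs"}
  s=${s//"$q"/"$bs$q"}
  s=${s//$'\n'/"${bs}n"}
  s=${s//$'\r'/"${bs}r"}
  s=${s//$'\t'/"${bs}t"}
  printf '%s' "$s"
}

# add_message ROLE CONTENT: append one turn; ROLE is system, user, or assistant.
add_message() {
  chat_history+=("{\"role\":\"$1\",\"content\":\"$(json_escape "$2")\"}")
}

# messages_json: print all stored turns as one JSON array.
messages_json() {
  local IFS=,
  printf '[%s]' "${chat_history[*]}"
}

# effective_max_tokens [N]: apply the documented rules (512 when omitted,
# ceiling 1024). Assumes N, if given, is a positive integer.
effective_max_tokens() {
  local requested=${1:-512}
  if (( requested > 1024 )); then
    requested=1024
  fi
  printf '%s\n' "$requested"
}

# build_body MODEL [MAX_TOKENS]: print a request body carrying the full history.
build_body() {
  printf '{"model":"%s","max_tokens":%s,"messages":%s}' \
    "$(json_escape "$1")" "$(effective_max_tokens "${2:-}")" "$(messages_json)"
}
```

For example, after add_message user "Beginner full-body workout?", pipe build_body fitness into the curl command on this page in place of its -d '...' argument by using --data-binary @-. Once the reply arrives, add_message assistant "$reply" keeps the history complete for the next turn.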

Example

```shell
curl https://gyld.dev/api/v1/chat/completions \
  -H "Authorization: Bearer $FLYWHEEL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "fitness",
    "messages": [
      { "role": "system", "content": "You are a gym front desk." },
      { "role": "user", "content": "Beginner full-body workout?" }
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'
```

The response

A standard OpenAI chat.completion object. The assistant’s reply is at choices[0].message.content, and usage reports token counts for billing.

```json
{
  "id": "chatcmpl-…",
  "object": "chat.completion",
  "created": 1735689600,
  "model": "fitness",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "A solid beginner full-body plan is …" },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 28, "completion_tokens": 96, "total_tokens": 124 }
}
```
choices[].message.content (string, required)
The model’s reply text.
choices[].finish_reason (string, required)
Why generation stopped: stop (natural end of turn) or length (the max_tokens limit was reached).
usage (object, required)
prompt_tokens, completion_tokens, and total_tokens: the token counts this request charges against your plan.
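A reply that ends with finish_reason length was cut off, so a client may want to handle it differently from a natural stop. A minimal Bash sketch; check_finish_reason and its exit codes (0 complete, 2 truncated, 1 anything else) are hypothetical conventions, and pulling the field out of the JSON is left to a separate tool.

```shell
#!/usr/bin/env bash
# check_finish_reason REASON (hypothetical helper)
# Exit codes: 0 = natural end of turn, 2 = truncated by max_tokens,
# 1 = a value this page does not document.
check_finish_reason() {
  case $1 in
    stop)
      return 0
      ;;
    length)
      echo "reply truncated at max_tokens (server ceiling 1024)" >&2
      return 2
      ;;
    *)
      echo "unexpected finish_reason: $1" >&2
      return 1
      ;;
  esac
}
```

If jq is installed, check_finish_reason "$(jq -r '.choices[0].finish_reason' response.json)" checks a saved response. On exit code 2, raising max_tokens only helps while it is still below the 1024 ceiling.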

Limits

  • Up to 200 messages per request.
  • 24,000 characters per message, 60,000 characters across all messages.
  • 4 concurrent requests per key (one model loads at a time on a worker).
  • Each upstream attempt is capped at 60 seconds.
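Since an oversized payload costs a round trip only to come back as 400, a client can pre-check the documented content caps. A minimal Bash sketch, assuming each argument is one message's content string; check_limits is hypothetical. This page doesn't say exactly how the server counts characters (code points or bytes, and whether role strings count), so ${#var}, which counts characters in the current locale, is only an approximation.

```shell
#!/usr/bin/env bash
# check_limits CONTENT... (hypothetical): one argument per message content.
# Returns 0 if within the documented caps, otherwise prints why and returns 1.
# ${#var} counts characters in the current locale; the server's exact
# counting rule is not documented, so treat this as an approximate check.
check_limits() {
  local max_messages=200 max_per_message=24000 max_total=60000
  local total=0 content
  if (( $# > max_messages )); then
    echo "too many messages: $# > $max_messages" >&2
    return 1
  fi
  for content in "$@"; do
    if (( ${#content} > max_per_message )); then
      echo "a message has ${#content} characters (limit $max_per_message)" >&2
      return 1
    fi
    total=$(( total + ${#content} ))
  done
  if (( total > max_total )); then
    echo "messages total $total characters (limit $max_total)" >&2
    return 1
  fi
}
```

The 4-concurrent-request cap is separate: when fanning out many requests, keep parallelism at 4 or fewer per key (for example, with xargs -P 4 in GNU or BSD xargs).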

Exceeding a content cap returns 400; too many in-flight requests return 429. See Errors & rate limits for the full status-code reference.
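Of these two statuses, only 429 is worth retrying; a 400 means the payload breaks a cap and will fail again unchanged. A minimal Bash sketch of retrying with exponential backoff, assuming curl is installed and the key is in FLYWHEEL_API_KEY. post_with_retry, its default of five attempts, and the 1, 2, 4, 8 second waits are illustrative choices rather than documented guidance, and this page doesn't say whether 429 responses include a Retry-After header.

```shell
#!/usr/bin/env bash
# post_with_retry BODY_FILE OUT_FILE [MAX_ATTEMPTS] (hypothetical helper)
# Posts the JSON in BODY_FILE, writes the response body to OUT_FILE, and
# prints the final HTTP status. Only 429 is retried, with the wait doubling
# each time (1s, 2s, 4s, ...); a 400 would fail the same way again.
# Returns 0 only for a 2xx status. curl prints 000 if no response arrived.
post_with_retry() {
  local body_file=$1 out_file=$2 max_attempts=${3:-5}
  local attempt=1 delay=1 status
  while true; do
    status=$(curl -s -o "$out_file" -w '%{http_code}' \
      -H "Authorization: Bearer $FLYWHEEL_API_KEY" \
      -H "Content-Type: application/json" \
      --data-binary @"$body_file" \
      https://gyld.dev/api/v1/chat/completions)
    if [[ $status != 429 ]] || (( attempt >= max_attempts )); then
      printf '%s\n' "$status"
      [[ $status == 2?? ]]   # becomes the return status: 0 only for 2xx
      return
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

For example, status=$(post_with_retry body.json response.json) stores the final status and leaves the reply in response.json.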