The Chat Completions endpoint is the primary way to interact with AI models through NexLLM. It follows the OpenAI Chat Completions format exactly, so any code or library already written for the OpenAI API works without modification: point your client at https://www.nexllm.ai/v1 and swap in your NexLLM key.

Endpoint

POST https://www.nexllm.ai/v1/chat/completions
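
For readers not using an SDK, the following is a minimal sketch of how a raw request to this endpoint might be assembled with Python's standard library. It assumes Bearer-token authentication in the Authorization header, as the OpenAI API uses; this page does not state the header format explicitly. The helper name build_request is hypothetical, and the request is built but not sent.

```python
import json
import urllib.request

NEXLLM_URL = "https://www.nexllm.ai/v1/chat/completions"


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for the Chat Completions endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        NEXLLM_URL,
        data=body,
        method="POST",
        headers={
            # Assumption: Bearer-token auth, mirroring the OpenAI API.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_request(
    "sk-xxxxxxxxxxxxxxxx",  # placeholder key
    {"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]},
)
print(request.get_method(), request.full_url)
```

To send the request, pass it to urllib.request.urlopen and decode the JSON body of the result.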

Request Parameters

model
string
required
The ID of the model to use. NexLLM routes your request to the correct provider automatically. Examples: gpt-4o, aws/claude-haiku-4-5, gemini-2.5-flash.
messages
array
required
An array of message objects that make up the conversation history. Each object must include a role (system, user, or assistant) and a content string.
[
  { "role": "system", "content": "You are a helpful assistant." },
  { "role": "user", "content": "Write a short welcome message for a new user." }
]
max_tokens
integer
The maximum number of tokens the model should generate in its response. Defaults to the model’s configured maximum if omitted.
stream
boolean
When set to true, the API streams the response as Server-Sent Events (SSE) instead of returning a single JSON response. Defaults to false.
temperature
number
Controls the randomness of the output. Accepts a value between 0 and 2. Lower values (e.g. 0.2) produce more deterministic responses; higher values (e.g. 1.5) produce more varied output. Defaults to 1.
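
The constraints listed above can be checked on the client before a request is sent. The sketch below is one possible way to do that; build_payload is a hypothetical helper, not part of NexLLM or the OpenAI SDK, and it only enforces the rules stated on this page (required model and messages, the three roles, string content, temperature between 0 and 2).

```python
VALID_ROLES = {"system", "user", "assistant"}


def build_payload(model, messages, max_tokens=None, stream=False, temperature=None):
    """Assemble a request body, validating the documented constraints."""
    if not model:
        raise ValueError("model is required")
    if not messages:
        raise ValueError("messages is required and must not be empty")
    for message in messages:
        if message.get("role") not in VALID_ROLES:
            raise ValueError(f"invalid role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError("each message needs a content string")

    payload = {"model": model, "messages": list(messages)}
    # Optional parameters are left out so the documented defaults apply.
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    if stream:
        payload["stream"] = True
    if temperature is not None:
        if not 0 <= temperature <= 2:
            raise ValueError("temperature must be between 0 and 2")
        payload["temperature"] = temperature
    return payload


payload = build_payload(
    "gemini-2.5-flash",
    [{"role": "user", "content": "Write a short welcome message for a new user."}],
    max_tokens=100,
    temperature=0.2,
)
```

Omitting the optional keys, rather than sending explicit defaults, keeps the request body small and leaves default handling to the API.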

Response Fields

id
string
A unique identifier for this completion request, useful for logging and debugging.
choices
array
An array of generated response objects. Most requests return a single choice.
choices[].message.content
string
The text generated by the model for this choice.
usage.prompt_tokens
integer
The number of tokens consumed by the input messages.
usage.completion_tokens
integer
The number of tokens generated in the model’s response.
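
As a rough illustration of how these fields fit together, the sketch below reads them from a response that has already been decoded into a Python dictionary. The sample response is invented for illustration and contains only the fields documented above; real responses may carry additional fields. The helper name summarize_completion is hypothetical.

```python
def summarize_completion(response: dict) -> dict:
    """Pull the documented fields out of a decoded Chat Completions response."""
    usage = response.get("usage", {})
    prompt_tokens = usage.get("prompt_tokens", 0)
    completion_tokens = usage.get("completion_tokens", 0)
    return {
        "id": response["id"],
        # Most requests return a single choice, so read the first one.
        "text": response["choices"][0]["message"]["content"],
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }


# Illustrative values only, not real API output.
sample_response = {
    "id": "chatcmpl-example",
    "choices": [{"message": {"role": "assistant", "content": "Welcome aboard!"}}],
    "usage": {"prompt_tokens": 24, "completion_tokens": 4},
}

summary = summarize_completion(sample_response)
print(summary["id"], summary["text"], summary["total_tokens"])
```

Logging the id alongside the token counts makes it easier to trace a specific request later.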

Code Examples

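from openai import OpenAI

# Point the OpenAI client at NexLLM instead of the default OpenAI endpoint.
client = OpenAI(
    api_key="sk-xxxxxxxxxxxxxxxx",  # replace with your NexLLM key
    base_url="https://www.nexllm.ai/v1"
)

response = client.chat.completions.create(
    model="aws/claude-haiku-4-5",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short welcome message for a new user."}
    ],
    max_tokens=100  # cap the length of the generated reply
)

# Most requests return a single choice; print its generated text.
print(response.choices[0].message.content)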

Streaming Responses

Set stream: true in your request body to receive the response as a stream of Server-Sent Events. Each event contains a partial delta of the generated text. This is useful for displaying output to users in real time as the model generates it. The OpenAI Python SDK handles SSE streaming automatically when you pass stream=True to the create call.
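# Reuses the client configured in the example above.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True
)

# Each chunk carries a partial delta; print the text as it arrives.
for chunk in response:
    if not chunk.choices:
        continue  # skip chunks that carry no choices
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)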
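
Without an SDK, the event stream has to be parsed by hand. The sketch below shows one way to do that, assuming the OpenAI-style wire format: each event is a line beginning with "data:" followed by a JSON chunk, and the stream ends with a "data: [DONE]" sentinel. This page does not spell out the wire format, so treat those details as assumptions. The sample lines are invented for illustration, and iter_deltas is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text deltas found in a sequence of SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta", {}).get("content")
        if delta:
            yield delta


# Illustrative stream, not captured from the API.
sample_stream = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Once"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": " upon a time"}}]}',
    "",
    "data: [DONE]",
]

story = "".join(iter_deltas(sample_stream))
print(story)
```

In a real client, the lines would come from iterating over the HTTP response body rather than from a list.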