Chat Completions API: Generate AI Responses via NexLLM

The Chat Completions endpoint is the primary way to interact with AI models through NexLLM. It follows the OpenAI Chat Completions format exactly, so any code or library already written for the OpenAI API works out of the box — just point your client at https://www.nexllm.ai/v1 and swap in your NexLLM key.

Endpoint

POST https://www.nexllm.ai/v1/chat/completions
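For readers who are not using an SDK, the request can be sketched with the Python standard library alone. This is a minimal illustration, not NexLLM's documented client code: the `build_request` helper is hypothetical, and the `Authorization: Bearer` header is an assumption based on the OpenAI-compatible format described above.

```python
import json
import urllib.request

API_URL = "https://www.nexllm.ai/v1/chat/completions"


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for the endpoint above."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: Bearer-token auth, as in the OpenAI API.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_request(
    "sk-xxxxxxxxxxxxxxxx",  # placeholder key
    {"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]},
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; the response body is JSON and can be read with `json.load`.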

Request Parameters

model (string, required)

The ID of the model to use. NexLLM routes your request to the correct provider automatically. Examples: gpt-4o, aws/claude-haiku-4-5, gemini-2.5-flash.

messages (array, required)

An array of message objects that make up the conversation history. Each object must include a role (system, user, or assistant) and a content string.

[
  { "role": "system", "content": "You are a helpful assistant." },
  { "role": "user", "content": "Write a short welcome message for a new user." }
]
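Because the array carries the whole conversation history, a multi-turn exchange is built by appending each new turn to the same list. The sketch below shows one way to do that; `make_message` is a hypothetical helper, and it accepts only the three roles listed above.

```python
VALID_ROLES = {"system", "user", "assistant"}


def make_message(role: str, content: str) -> dict:
    """Return one message object, rejecting roles not listed in this reference."""
    if role not in VALID_ROLES:
        raise ValueError(f"unsupported role: {role!r}")
    return {"role": role, "content": content}


# Start with the system prompt, then alternate user and assistant turns.
history = [make_message("system", "You are a helpful assistant.")]
history.append(make_message("user", "Write a short welcome message for a new user."))
# Illustrative earlier reply; in practice this is the content returned by the API.
history.append(make_message("assistant", "Welcome aboard! We're glad you're here."))
history.append(make_message("user", "Make it more formal."))

print([message["role"] for message in history])
```

Sending `history` as `messages` on the next request gives the model the full context of the exchange.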

max_tokens (integer)

The maximum number of tokens the model should generate in its response. Defaults to the model’s configured maximum if omitted.

stream (boolean)

When set to true, the API streams the response as Server-Sent Events (SSE) instead of returning a single JSON response. Defaults to false.

temperature (number)

Controls the randomness of the output. Accepts a value between 0 and 2. Lower values (e.g. 0.2) produce more deterministic responses; higher values (e.g. 1.5) produce more varied output. Defaults to 1.
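The parameters above can be assembled and checked on the client before a request is sent. The following is a sketch under stated assumptions: `build_payload` is a hypothetical helper, the 0 to 2 temperature range is treated as inclusive, and `max_tokens` is assumed to require a positive value. Optional parameters are left out of the payload when not supplied, so the documented defaults apply.

```python
from typing import Optional


def build_payload(
    model: str,
    messages: list,
    max_tokens: Optional[int] = None,
    temperature: Optional[float] = None,
    stream: bool = False,
) -> dict:
    """Assemble a request body, validating the optional parameters."""
    payload = {"model": model, "messages": messages}
    if max_tokens is not None:
        if max_tokens < 1:  # assumption: must be a positive integer
            raise ValueError("max_tokens must be at least 1")
        payload["max_tokens"] = max_tokens
    if temperature is not None:
        if not 0 <= temperature <= 2:  # assumption: range is inclusive
            raise ValueError("temperature must be between 0 and 2")
        payload["temperature"] = temperature
    if stream:
        payload["stream"] = True
    return payload


payload = build_payload(
    "gemini-2.5-flash",
    [{"role": "user", "content": "Summarize this in one line."}],
    max_tokens=100,
    temperature=0.2,
)
print(payload)
```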

Response Fields

id (string)

A unique identifier for this completion request, useful for logging and debugging.

choices (array)

An array of generated response objects. Most requests return a single choice.

choices[].message.content (string)

The text generated by the model for this choice.

usage.prompt_tokens (integer)

The number of tokens consumed by the input messages.

usage.completion_tokens (integer)

The number of tokens generated in the model’s response.
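To show where these fields sit in a raw JSON response, here is a small sketch that pulls them out of a response body. The sample body is illustrative rather than captured from a real request, `summarize_completion` is a hypothetical helper, and the `id` key for the unique identifier is assumed from the OpenAI format.

```python
import json

# Illustrative response body; the values are made up for this example.
SAMPLE_BODY = """
{
  "id": "chatcmpl-example",
  "choices": [
    {"message": {"role": "assistant", "content": "Welcome aboard!"}}
  ],
  "usage": {"prompt_tokens": 24, "completion_tokens": 3}
}
"""


def summarize_completion(body: str) -> dict:
    """Extract the documented response fields from a JSON response body."""
    data = json.loads(body)
    usage = data.get("usage", {})
    prompt_tokens = usage.get("prompt_tokens", 0)
    completion_tokens = usage.get("completion_tokens", 0)
    return {
        "id": data.get("id"),  # assumption: identifier key as in the OpenAI format
        "text": data["choices"][0]["message"]["content"],
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        # Handy for logging; computed locally from the two documented counters.
        "total_tokens": prompt_tokens + completion_tokens,
    }


summary = summarize_completion(SAMPLE_BODY)
print(summary["text"])
print(summary["total_tokens"])
```

With the OpenAI Python SDK the same values are available as attributes, for example `response.usage.prompt_tokens`.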

Code Examples

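from openai import OpenAI

# Point the OpenAI SDK at NexLLM instead of the default OpenAI base URL.
client = OpenAI(
    api_key="sk-xxxxxxxxxxxxxxxx",  # placeholder: replace with your NexLLM key
    base_url="https://www.nexllm.ai/v1"
)

# Send a system prompt plus one user turn, capping the reply at 100 tokens.
response = client.chat.completions.create(
    model="aws/claude-haiku-4-5",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short welcome message for a new user."}
    ],
    max_tokens=100
)

# The generated text is on the first (usually only) choice.
print(response.choices[0].message.content)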

Streaming Responses

Set stream: true in your request body to receive the response as a stream of Server-Sent Events. Each event contains a partial delta of the generated text. This is useful for displaying output to users in real time as the model generates it. The OpenAI Python SDK handles SSE streaming automatically when you pass stream=True to the create call.

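response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True
)

# Each chunk carries a partial delta; print the pieces as they arrive.
for chunk in response:
    if not chunk.choices:
        continue  # defensive: skip any chunk that carries no choices
    delta = chunk.choices[0].delta.content
    if delta:  # the content is None on chunks that carry no text
        print(delta, end="", flush=True)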
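For clients that read the event stream without an SDK, the parsing step can be sketched as follows. This assumes the stream uses the OpenAI wire format, where each event is a `data:` line holding a JSON chunk and the stream ends with a `data: [DONE]` sentinel; `iter_deltas` and the sample lines are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text deltas found in a sequence of SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumption: OpenAI-style end-of-stream sentinel
            break
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta", {}).get("content")
        if delta:
            yield delta


# Illustrative event lines, not captured from a real stream.
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Once "}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "upon a time"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_deltas(sample_lines)))  # prints "Once upon a time"
```

In a real client, `lines` would be the decoded lines of the HTTP response body, read incrementally as they arrive.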
