Audio API: Speech-to-Text and Text-to-Speech via NexLLM

NexLLM’s audio endpoints give you access to speech-to-text transcription and text-to-speech synthesis through a single, OpenAI-compatible interface. Use the transcriptions endpoint to convert audio recordings into text, and the speech endpoint to turn written content into natural-sounding audio — all with the same API key you use for chat and embeddings.

Endpoints

Endpoint	Path	Description
Speech-to-Text	`POST /v1/audio/transcriptions`	Transcribe an audio file to text
Text-to-Speech	`POST /v1/audio/speech`	Convert text to spoken audio

Speech-to-Text: Transcriptions

Parameters

model

string

required

The transcription model to use. Use `whisper-1` for Whisper-compatible transcription.

file

required

The audio file to transcribe. Accepted formats include mp3, mp4, mpeg, mpga, m4a, wav, and webm. The file must be under 25 MB.

language

string

The ISO-639-1 language code of the audio (e.g. `en`, `es`, `fr`). Providing this improves accuracy and speed. If omitted, the model detects the language automatically.

prompt

string

Optional text to guide the model’s transcription style or provide context about the audio content.
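The optional parameters above can be passed alongside `model` and `file`. The following is a minimal sketch, not an official client: `check_audio_file` and `transcribe` are hypothetical helper names, the `NEXLLM_API_KEY` environment variable is an assumed convention, and treating "25 MB" as 25 × 1024 × 1024 bytes is also an assumption. It checks the documented format and size limits locally before uploading.

```python
import os
from pathlib import Path

# Limits taken from the parameter descriptions above.
ACCEPTED_FORMATS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}
MAX_FILE_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" is read as 25 MiB


def check_audio_file(path):
    """Return a list of problems with the file; an empty list means it looks OK."""
    p = Path(path)
    problems = []
    if p.suffix.lower().lstrip(".") not in ACCEPTED_FORMATS:
        problems.append(f"unsupported format: {p.suffix or '(none)'}")
    # The size is only checked when the file is actually present on disk.
    if p.exists() and p.stat().st_size >= MAX_FILE_BYTES:
        problems.append("file is not under 25 MB")
    return problems


def transcribe(path, language=None, prompt=None):
    """Hypothetical helper: validate locally, then call the transcriptions endpoint."""
    from openai import OpenAI  # imported lazily so the local check needs no SDK

    problems = check_audio_file(path)
    if problems:
        raise ValueError("; ".join(problems))

    client = OpenAI(
        api_key=os.environ["NEXLLM_API_KEY"],  # assumed environment variable name
        base_url="https://www.nexllm.ai/v1",
    )

    # Only send the optional parameters that were actually provided.
    optional = {}
    if language:
        optional["language"] = language
    if prompt:
        optional["prompt"] = prompt

    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file, **optional
        )
    return result.text
```

A call such as `transcribe("interview.mp3", language="en", prompt="Interview about NexLLM")` would then pass the language hint and context text to the model; the file name and prompt here are purely illustrative.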

Text-to-Speech

Parameters

model

string

required

The text-to-speech model to use. Use `tts-1` for standard quality or `tts-1-hd` for higher-quality audio.

input

string

required

The text to convert to speech. Maximum length is 4,096 characters.

voice

string

required

The voice to use for synthesis. Available options: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`.

response_format

string

The audio format of the output. Supported values: `mp3`, `opus`, `aac`, `flac`. Defaults to `mp3`.
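As an illustration of the text-to-speech parameters above, here is a minimal sketch using the OpenAI Python SDK. The helper names `build_speech_request` and `synthesize` are hypothetical, and the `NEXLLM_API_KEY` environment variable is an assumed convention. The request is validated against the documented voices, formats, and 4,096-character limit before anything is sent.

```python
import os

# Values taken from the parameter descriptions above.
VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}
FORMATS = {"mp3", "opus", "aac", "flac"}
MAX_INPUT_CHARS = 4096


def build_speech_request(text, voice="alloy", model="tts-1", response_format="mp3"):
    """Validate arguments against the documented limits and return request kwargs."""
    if not text or len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input must be 1 to {MAX_INPUT_CHARS} characters")
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice}")
    if response_format not in FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    return {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }


def synthesize(text, out_path, **options):
    """Hypothetical helper: call the speech endpoint and save the audio to out_path."""
    from openai import OpenAI  # imported lazily so validation needs no SDK

    client = OpenAI(
        api_key=os.environ["NEXLLM_API_KEY"],  # assumed environment variable name
        base_url="https://www.nexllm.ai/v1",
    )
    response = client.audio.speech.create(**build_speech_request(text, **options))
    response.write_to_file(out_path)  # writes the returned audio bytes to disk
    return out_path
```

For example, `synthesize("Welcome to NexLLM.", "welcome.mp3", voice="nova")` would write an MP3 file using the `nova` voice; the text and file name are illustrative only.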

Code Examples

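from openai import OpenAI

# Point the OpenAI SDK at NexLLM; replace the placeholder with your own API key.
client = OpenAI(
    api_key="sk-xxxxxxxxxxxxxxxx",
    base_url="https://www.nexllm.ai/v1"
)

# Open the audio file in binary mode and send it to the transcriptions endpoint.
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

# The transcribed text is available on the response's `text` attribute.
print(transcript.text)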

For long-form content like articles or podcast scripts, split the text into segments of at most 4,096 characters (the input limit) before calling the speech endpoint. You can then process the segments in parallel and combine the output files, which reduces overall latency.
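One way to approach the splitting step is sketched below. This is an assumption-laden example rather than a NexLLM feature: `split_text`, `synthesize_segments`, and `speech_to_file` are hypothetical names, the sentence-boundary regex is a simple heuristic, and `NEXLLM_API_KEY` is an assumed environment variable. Segments are capped at the 4,096-character input limit and sent through a thread pool, with results returned in the original order.

```python
import os
import re
from concurrent.futures import ThreadPoolExecutor

MAX_INPUT_CHARS = 4096  # input limit of the speech endpoint


def split_text(text, limit=MAX_INPUT_CHARS):
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single over-long sentence is hard-split at the character limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def synthesize_segments(text, synthesize_one, max_workers=4):
    """Call synthesize_one(index, chunk) for every chunk in parallel, keeping order."""
    chunks = split_text(text)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(synthesize_one, range(len(chunks)), chunks))


def speech_to_file(index, chunk):
    """Hypothetical worker: synthesize one chunk and save it as a numbered MP3 file."""
    from openai import OpenAI  # imported lazily so the splitter needs no SDK

    client = OpenAI(
        api_key=os.environ["NEXLLM_API_KEY"],  # assumed environment variable name
        base_url="https://www.nexllm.ai/v1",
    )
    path = f"segment_{index:03d}.mp3"
    response = client.audio.speech.create(model="tts-1", voice="alloy", input=chunk)
    response.write_to_file(path)
    return path
```

Calling `synthesize_segments(article_text, speech_to_file)` would return the segment file paths in reading order. How the files are then combined depends on the output format; an audio tool such as ffmpeg is one option.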
