Audio Understanding

Authorizations

Authorization

string

header

required

All endpoints require Bearer Token authentication. Add the following request header:

Authorization: Bearer YOUR_API_KEY

Replace YOUR_API_KEY with your API token (format: sk-...).
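
As a minimal sketch of the header described above, the helper below builds the headers for a JSON request. The function name build_auth_headers is hypothetical and not part of the API; the Content-Type value follows the application/json body type listed under Body.

```python
def build_auth_headers(api_key: str) -> dict[str, str]:
    """Return request headers carrying the Bearer token (hypothetical helper)."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    return {
        # The header format comes from the Authorizations section.
        "Authorization": f"Bearer {api_key}",
        # The request body is documented as application/json.
        "Content-Type": "application/json",
    }
```

The returned dictionary can be passed to whichever HTTP client you already use.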

Body

application/json

model

string

default:gemini-2.5-pro

required

Model name. Common audio models:

gemini-2.5-pro
nemotron-3-nano-omni (wav/mp3 only)

Examples:

"gemini-2.5-pro"

"nemotron-3-nano-omni"

audio_url

string

required

Audio source. Accepts one of the following two forms:

Publicly reachable HTTP/HTTPS URL
data:audio/<type>;base64,<payload> data URI (base64 inline)

Audio format support per model:

gemini-2.5-pro: wav/mp3/aiff/aac/ogg/flac/m4a; total request body (prompt + system + inline files) ≤ 20 MB
nemotron-3-nano-omni: .wav / .mp3 only; data URI must use audio/wav or audio/mpeg, otherwise returns 422

Base64 data is not size-validated up front; oversized payloads may still be rejected with a 422.

Minimum string length: 1

Example:

"https://storage.googleapis.com/cloud-samples-tests/speech/brooklyn.flac"
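
The two accepted audio_url forms can be illustrated with a short sketch. The helper names to_data_uri and is_nemotron_compatible are hypothetical, and the URL check assumes the file extension appears at the end of the URL path, which the documentation does not state.

```python
import base64
from urllib.parse import urlparse

# MIME types that nemotron-3-nano-omni accepts in a data URI, per the notes above.
NEMOTRON_MIME_TYPES = {"audio/wav", "audio/mpeg"}


def to_data_uri(audio_bytes: bytes, mime_type: str) -> str:
    """Encode raw audio bytes as a data:audio/<type>;base64,<payload> URI."""
    payload = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:{mime_type};base64,{payload}"


def is_nemotron_compatible(audio_url: str) -> bool:
    """Rough client-side check against the nemotron-3-nano-omni format rules."""
    if audio_url.startswith("data:"):
        header = audio_url.split(",", 1)[0]  # e.g. "data:audio/wav;base64"
        mime_type = header[len("data:"):].split(";", 1)[0]
        return mime_type in NEMOTRON_MIME_TYPES
    # Assumption: for plain URLs the extension is visible in the path.
    path = urlparse(audio_url).path.lower()
    return path.endswith((".wav", ".mp3"))
```

A check like this only avoids an obvious 422 before sending; the server remains the authority on what it accepts.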

prompt

string | null

User prompt. When omitted, defaults to 'Please transcribe this audio file', which suits the transcription use case.

Maximum string length: 100000

Example:

"Identify the speakers and emotion in this audio."

sync

boolean

default:false

Synchronous mode. When true, the endpoint blocks until the upstream model completes and returns the full response (or an SSE stream, if stream=true is also set). When false, the endpoint returns the task ID immediately, and results are fetched via GET /v1/tasks/{task_id} or the SSE endpoint.

Example:

false

stream

boolean

default:false

Whether to stream. When true, the submit response includes stream.url, which points to the SSE subscription path. Streamed chunks use the OpenAI chat.completion.chunk format.

Example:

false
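
The sync and stream flags combine into four delivery modes. The sketch below simply restates the two field descriptions as a lookup; the function name delivery_mode and the wording of the returned strings are illustrative only.

```python
def delivery_mode(sync: bool = False, stream: bool = False) -> str:
    """Summarize how results are delivered for a given sync/stream combination."""
    if sync and stream:
        return "submit call returns an SSE stream"
    if sync:
        return "submit call blocks and returns the full response"
    if stream:
        return "submit call returns a task ID plus stream.url for SSE"
    return "submit call returns a task ID; fetch via GET /v1/tasks/{task_id}"
```

With both flags left at their default of false, the last branch applies.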

max_tokens

integer | null

Generation token limit. Optional.

Required range: x >= 1

Example:

256

temperature

number | null

Sampling temperature, range [0, 2]. Optional.

Required range: 0 <= x <= 2

system_prompt

string | null

System instruction. Optional.

Maximum string length: 10000

reasoning

boolean | null

Whether to include reasoning tokens. Some thinking models require this to be set to true.
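
To tie the body fields together, here is a sketch that assembles a request body and applies the documented limits on the client side. The helper name build_body is hypothetical; the limits and defaults are taken from the field descriptions above, and omitting null-valued optional fields is a design choice of this sketch, not a documented requirement.

```python
from typing import Any, Optional


def build_body(
    audio_url: str,
    model: str = "gemini-2.5-pro",
    prompt: Optional[str] = None,
    sync: bool = False,
    stream: bool = False,
    max_tokens: Optional[int] = None,
    temperature: Optional[float] = None,
    system_prompt: Optional[str] = None,
    reasoning: Optional[bool] = None,
) -> dict[str, Any]:
    """Build the JSON body, checking the documented constraints first."""
    if len(audio_url) < 1:
        raise ValueError("audio_url must be at least 1 character long")
    if prompt is not None and len(prompt) > 100_000:
        raise ValueError("prompt exceeds 100000 characters")
    if system_prompt is not None and len(system_prompt) > 10_000:
        raise ValueError("system_prompt exceeds 10000 characters")
    if max_tokens is not None and max_tokens < 1:
        raise ValueError("max_tokens must be >= 1")
    if temperature is not None and not 0 <= temperature <= 2:
        raise ValueError("temperature must be in [0, 2]")

    body: dict[str, Any] = {
        "model": model,
        "audio_url": audio_url,
        "sync": sync,
        "stream": stream,
    }
    optional = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "system_prompt": system_prompt,
        "reasoning": reasoning,
    }
    # Leave out optional fields that were not set (a choice made by this sketch).
    body.update({key: value for key, value in optional.items() if value is not None})
    return body
```

Serializing the result with json.dumps gives a payload suitable for the application/json body.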

Response

Task created

Submit response, conforming to the unified task shape. results and error are always null at submit time; they are returned via GET /v1/tasks/{task_id} after the task completes or fails.

string

required

Task ID, formatted as task-llm-{timestamp}-{8random}.

Example:

"task-llm-1776874565-yq3szvcu"
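
The task ID format can be split into its parts with a small parser. This is a sketch: the function name parse_task_id is hypothetical, and the assumption that the random suffix consists of eight lowercase letters or digits is inferred from the single example above, not stated by the documentation.

```python
import re

# Assumption: the 8-character suffix is lowercase alphanumeric, as in the example.
TASK_ID_PATTERN = re.compile(r"^task-llm-(\d+)-([a-z0-9]{8})$")


def parse_task_id(task_id: str) -> tuple[int, str]:
    """Split a task ID into (timestamp, random suffix), or raise ValueError."""
    match = TASK_ID_PATTERN.match(task_id)
    if match is None:
        raise ValueError(f"unexpected task ID format: {task_id!r}")
    return int(match.group(1)), match.group(2)
```

Treat the ID as opaque in production code; parsing it is mainly useful for logging and sanity checks.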

object

enum<string>

required

Available options:

llm.generation.task

Example:

"llm.generation.task"

type

enum<string>

required

Available options:

llm

Example:

"llm"

model

string

required

The model name submitted by the client (echoed verbatim).

Example:

"gemini-2.5-pro"

status

enum<string>

required

Available options:

pending

Example:

"pending"

progress

integer

required

Example:

0

created

integer

required

Example:

1776874565

stream

object

Returns {url: ...} when stream=true; null when stream=false.
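
Because streamed chunks follow the OpenAI chat.completion.chunk format, the text can be reassembled from the delta.content fields. The sketch below works on already-received SSE lines; the function name collect_stream_text is hypothetical, and the "data:" line prefix and "[DONE]" terminator are OpenAI conventions that this service is assumed, not confirmed, to follow.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Concatenate delta.content from chat.completion.chunk SSE lines."""
    parts: list[str] = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream marker
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)
```

The lines would typically come from iterating over the HTTP response opened against stream.url.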


results

object[] | null

Fixed at null during submit. After the task completes, GET /v1/tasks/{task_id} returns the results: results[0] is the full OpenAI ChatCompletion response, with the audio transcription or understanding output in message.content.

Example:

null

error

object

Fixed at null during submit; returned via GET /v1/tasks/{task_id} when the task fails.

Example:

null
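
Since results and error stay null until the task finishes, a client in asynchronous mode can poll until one of them is populated. The sketch below assumes a caller-supplied fetch_task function that performs GET /v1/tasks/{task_id} and returns the parsed JSON; the names wait_for_task and transcript_of, the polling interval, and the attempt limit are all illustrative. The choices[0].message.content path follows the standard OpenAI ChatCompletion shape.

```python
import time
from typing import Any, Callable


def wait_for_task(
    fetch_task: Callable[[], dict[str, Any]],
    interval: float = 2.0,
    max_attempts: int = 30,
) -> dict[str, Any]:
    """Poll until the task reports results or an error."""
    for _ in range(max_attempts):
        task = fetch_task()
        if task.get("error") is not None:
            raise RuntimeError(f"task failed: {task['error']}")
        if task.get("results") is not None:
            return task
        time.sleep(interval)  # still pending; wait before the next poll
    raise TimeoutError("task did not finish within the polling budget")


def transcript_of(task: dict[str, Any]) -> str:
    """Extract the text output from results[0], an OpenAI ChatCompletion."""
    return task["results"][0]["choices"][0]["message"]["content"]
```

Passing the fetch step in as a callable keeps the polling logic independent of any particular HTTP client.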
