## Base URL
## Authentication
All inference endpoints accept two authentication methods:

| Method | Header | Format |
|---|---|---|
| API Key | x-api-key | pk_live_... |
| JWT | Authorization | `Bearer <token>` |
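As a quick sketch, the two schemes above map onto request headers like this (`auth_headers` is a hypothetical helper for illustration, not part of the SDK):

```python
def auth_headers(api_key=None, jwt=None):
    """Build auth headers for Lunar inference endpoints.

    Exactly one of the two documented schemes is used:
    an API key via x-api-key, or a JWT via Authorization: Bearer.
    """
    if api_key is not None:
        return {"x-api-key": api_key}
    if jwt is not None:
        return {"Authorization": f"Bearer {jwt}"}
    raise ValueError("provide api_key or jwt")

print(auth_headers(api_key="pk_live_example"))
```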
API keys are created in the Lunar Console. The SDK reads from the `LUNAR_API_KEY` environment variable by default.

## POST /v1/chat/completions
Create a chat completion. OpenAI-compatible.

Request Body:

| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID (e.g. gpt-4o-mini) or provider/model to force a provider (e.g. openai/gpt-4o-mini) |
| messages | array | Yes | Array of {role, content} objects. Roles: system, user, assistant |
| stream | boolean | No | Enable SSE streaming (default: false) |
| max_tokens | integer | No | Maximum tokens to generate |
| temperature | float | No | Sampling temperature (0-2, default: 0.7) |
| top_p | float | No | Nucleus sampling (default: 1.0) |
| stop | array | No | Stop sequences |
| presence_penalty | float | No | Presence penalty (default: 0.0) |
| frequency_penalty | float | No | Frequency penalty (default: 0.0) |
| n | integer | No | Number of choices (default: 1) |
| provider | string | No | Force a specific provider |
| fallbacks | array | No | Fallback models if primary fails (SDK only) |
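The request body can be assembled directly from the table above. A minimal sketch in Python, assuming a placeholder base URL (the real value belongs in the Base URL section) and the `x-api-key` scheme; the helper names are illustrative:

```python
import json
import urllib.request

BASE_URL = "https://api.lunar.example/v1"  # placeholder; substitute your real base URL

def chat_request(messages, model="gpt-4o-mini", **opts):
    """Assemble the documented request body for POST /v1/chat/completions."""
    allowed = {"stream", "max_tokens", "temperature", "top_p", "stop",
               "presence_penalty", "frequency_penalty", "n", "provider", "fallbacks"}
    unknown = set(opts) - allowed
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    return {"model": model, "messages": messages, **opts}

def post(path, body, api_key):
    """Send a JSON POST with API-key auth and return the parsed response."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode(),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

body = chat_request(
    [{"role": "user", "content": "Hello"}],
    model="openai/gpt-4o-mini",  # provider/model pins the provider
    temperature=0.2,
)
```

Setting `model` to `provider/model` (e.g. `openai/gpt-4o-mini`) pins the provider, as described in the table.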
## POST /v1/completions

Create a text completion (legacy endpoint).

Request Body:

| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID or provider/model |
| prompt | string | Yes | Text prompt |
| max_tokens | integer | No | Maximum tokens (default: 1024) |
| temperature | float | No | Sampling temperature (default: 0.7) |
| stop | array | No | Stop sequences |
| stream | boolean | No | Enable SSE streaming (default: false) |
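The legacy body is a flat prompt rather than a message array. A sketch, using a hypothetical builder that mirrors the table above:

```python
def completion_request(prompt, model="gpt-4o-mini", **opts):
    """Assemble the documented body for the legacy POST /v1/completions."""
    allowed = {"max_tokens", "temperature", "stop", "stream"}
    unknown = set(opts) - allowed
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    return {"model": model, "prompt": prompt, **opts}

body = completion_request("Once upon a time", max_tokens=64, stream=True)
```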
## POST /v1/router

Intelligent routing endpoint. Analyzes the prompt semantically and selects the optimal model using the UniRoute algorithm.

Request Body:

| Field | Type | Required | Description |
|---|---|---|---|
| messages | array | Yes | Array of {role, content} objects (min 1) |
| execute | boolean | No | Execute after routing (default: true). Set false to only get the routing decision |
| models | array | No | Restrict routing to specific models (min 2). If omitted, uses all available models |
| cost_weight | float | No | Cost penalty weight (default: 0.0). 0 = ignore cost, higher = prefer cheaper models |
| stream | boolean | No | Enable SSE streaming when execute=true (default: false) |
| max_tokens | integer | No | Maximum tokens when execute=true |
| temperature | float | No | Temperature when execute=true (0-2) |
When `models` is omitted, the router selects among all available models and, with `execute` left at its default of true, runs the request against the chosen model.
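A decision-only call sets `execute: false`, and `cost_weight` biases selection toward cheaper models. A sketch of the request body (the helper name is illustrative):

```python
def router_request(messages, execute=True, models=None, cost_weight=0.0, **opts):
    """Assemble the documented body for POST /v1/router."""
    if not messages:
        raise ValueError("messages requires at least 1 entry")
    if models is not None and len(models) < 2:
        raise ValueError("models must list at least 2 entries")
    body = {"messages": messages, "execute": execute, "cost_weight": cost_weight, **opts}
    if models is not None:
        body["models"] = models
    return body

decision_only = router_request(
    [{"role": "user", "content": "Summarize this paragraph."}],
    execute=False,    # routing decision only, no model call
    cost_weight=0.5,  # lean toward cheaper models
)
```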
## GET /v1/router/profiles

List models that have semantic routing profiles available.

## GET /v1/providers
List providers available for a model.

Query Parameters:

| Field | Type | Required | Description |
|---|---|---|---|
| model | string | No | Model ID (defaults to gpt-4o-mini) |
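Since `model` is the only query parameter, the request URL is simple to build (base URL is a placeholder; see the Base URL section):

```python
from urllib.parse import urlencode

BASE_URL = "https://api.lunar.example/v1"  # placeholder

def providers_url(model=None):
    """Build the GET /v1/providers URL; model defaults server-side to gpt-4o-mini."""
    url = BASE_URL + "/providers"
    if model:
        url += "?" + urlencode({"model": model})
    return url
```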
## Response Objects

### Usage
Included in all completion responses. Extends the OpenAI format with cost and latency data.

| Field | Type | Description |
|---|---|---|
| prompt_tokens | int | Input token count |
| completion_tokens | int | Output token count |
| total_tokens | int | Total tokens |
| input_cost_usd | float | Input cost in USD |
| output_cost_usd | float | Output cost in USD |
| cache_input_cost_usd | float | Cached input cost in USD |
| total_cost_usd | float | Total cost in USD |
| latency_ms | float | Total request latency in ms |
| ttft_ms | float | Time to first token in ms |
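A sketch of consuming these fields from a response; the numbers below are illustrative, not real pricing:

```python
# Example Usage object shaped like the table above (values are made up).
usage = {
    "prompt_tokens": 120, "completion_tokens": 48, "total_tokens": 168,
    "input_cost_usd": 0.000018, "output_cost_usd": 0.0000288,
    "cache_input_cost_usd": 0.0, "total_cost_usd": 0.0000468,
    "latency_ms": 812.4, "ttft_ms": 143.9,
}

def cost_summary(u):
    """One-line summary of the extended Usage fields in a completion response."""
    return (f"{u['total_tokens']} tokens, "
            f"${u['total_cost_usd']:.6f}, "
            f"ttft {u['ttft_ms']:.0f} ms")

print(cost_summary(usage))
```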
### Error Response
## Rate Limits
Rate limits are applied per API key. When exceeded, you receive a 429 response with a `retry_after` field in the body.
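A client can honor `retry_after` with a small wrapper (a sketch; `send` stands in for whatever function performs one request and returns a status code plus the parsed body):

```python
import time

def with_retry(send, max_attempts=3):
    """Retry `send` on HTTP 429, sleeping for the body's retry_after seconds.

    `send` performs one request and returns (status_code, body_dict).
    Falls back to a 1-second wait if retry_after is absent.
    """
    for _ in range(max_attempts):
        status, body = send()
        if status != 429:
            break
        time.sleep(body.get("retry_after", 1))
    return status, body
```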
## SDK Usage
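The SDK's own call signatures are not reproduced here. As a stand-in, a raw HTTP sketch that mirrors the SDK's documented default of reading `LUNAR_API_KEY` from the environment (base URL is a placeholder):

```python
import json
import os
import urllib.request

# The SDK reads LUNAR_API_KEY by default; a raw HTTP client can do the same.
API_KEY = os.environ.get("LUNAR_API_KEY", "")
BASE_URL = "https://api.lunar.example/v1"  # placeholder

def chat(messages, model="gpt-4o-mini"):
    """POST /v1/chat/completions with API-key auth; returns the parsed response."""
    req = urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps({"model": model, "messages": messages}).encode(),
        headers={"x-api-key": API_KEY, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```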