Base URL

https://api.pureai-api.com

Authentication

All inference endpoints accept two authentication methods:
| Method  | Header          | Format           |
|---------|-----------------|------------------|
| API Key | `x-api-key`     | `pk_live_...`    |
| JWT     | `Authorization` | `Bearer <token>` |
# API Key
curl -H "x-api-key: pk_live_YOUR_KEY" ...

# JWT (from Cognito)
curl -H "Authorization: Bearer eyJhbG..." ...
API keys are created in the Lunar Console. The SDK reads from the LUNAR_API_KEY environment variable by default.
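For HTTP clients outside the SDK, the two header styles above can be sketched in Python. The `auth_headers` helper below is illustrative, not part of any library; it mirrors the SDK's fallback to the `LUNAR_API_KEY` environment variable.

```python
import os

def auth_headers(api_key=None, jwt=None):
    """Build request headers for either auth method.

    Prefers a JWT when given; otherwise uses the API key,
    falling back to the LUNAR_API_KEY environment variable.
    """
    api_key = api_key or os.environ.get("LUNAR_API_KEY")
    if jwt is not None:
        return {"Authorization": f"Bearer {jwt}"}
    if api_key is not None:
        return {"x-api-key": api_key}
    raise ValueError("Provide an API key or a JWT")

headers = auth_headers(api_key="pk_live_YOUR_KEY")
```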

POST /v1/chat/completions

Create a chat completion. OpenAI-compatible.

Request Body:

| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | Model ID (e.g. `gpt-4o-mini`) or `provider/model` to force a provider (e.g. `openai/gpt-4o-mini`) |
| `messages` | array | Yes | Array of `{role, content}` objects. Roles: `system`, `user`, `assistant` |
| `stream` | boolean | No | Enable SSE streaming (default: `false`) |
| `max_tokens` | integer | No | Maximum tokens to generate |
| `temperature` | float | No | Sampling temperature (0-2, default: `0.7`) |
| `top_p` | float | No | Nucleus sampling (default: `1.0`) |
| `stop` | array | No | Stop sequences |
| `presence_penalty` | float | No | Presence penalty (default: `0.0`) |
| `frequency_penalty` | float | No | Frequency penalty (default: `0.0`) |
| `n` | integer | No | Number of choices to generate (default: `1`) |
| `provider` | string | No | Force a specific provider |
| `fallbacks` | array | No | Fallback models tried if the primary fails (SDK only) |
curl -X POST "https://api.pureai-api.com/v1/chat/completions" \
  -H "x-api-key: pk_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "system", "content": "You are helpful."},
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7
  }'
Response:
{
  "id": "chatcmpl_a1b2c3d4",
  "object": "chat.completion",
  "model": "gpt-4o-mini",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 9,
    "total_tokens": 21,
    "input_cost_usd": 0.000018,
    "output_cost_usd": 0.000027,
    "cache_input_cost_usd": 0.0,
    "total_cost_usd": 0.000045,
    "latency_ms": 523.4,
    "ttft_ms": 215.2
  }
}
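Given a decoded response like the one above, the extended usage fields can be read with plain dict access. A minimal sketch (the `response` dict is abridged from the sample):

```python
response = {  # abridged from the sample response above
    "choices": [{"index": 0,
                 "message": {"role": "assistant",
                             "content": "Hello! How can I help you today?"},
                 "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 9, "total_tokens": 21,
              "total_cost_usd": 0.000045, "latency_ms": 523.4, "ttft_ms": 215.2},
}

content = response["choices"][0]["message"]["content"]
usage = response["usage"]
# Blended cost per 1K tokens, derived from the extended usage block
cost_per_1k = 1000 * usage["total_cost_usd"] / usage["total_tokens"]
print(f"{content} (${usage['total_cost_usd']:.6f}, ${cost_per_1k:.6f}/1K tokens)")
```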

POST /v1/completions

Create a text completion (legacy endpoint).

Request Body:

| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | Model ID or `provider/model` |
| `prompt` | string | Yes | Text prompt |
| `max_tokens` | integer | No | Maximum tokens (default: `1024`) |
| `temperature` | float | No | Sampling temperature (default: `0.7`) |
| `stop` | array | No | Stop sequences |
| `stream` | boolean | No | Enable SSE streaming (default: `false`) |
curl -X POST "https://api.pureai-api.com/v1/completions" \
  -H "x-api-key: pk_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "prompt": "The capital of France is",
    "max_tokens": 50,
    "temperature": 0.7
  }'
Response:
{
  "id": "cmpl_a1b2c3d4",
  "object": "text_completion",
  "model": "gpt-4o-mini",
  "choices": [
    {
      "index": 0,
      "text": " Paris, the city of light.",
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 7,
    "total_tokens": 12,
    "input_cost_usd": 0.000008,
    "output_cost_usd": 0.000021,
    "total_cost_usd": 0.000029,
    "latency_ms": 312.1,
    "ttft_ms": 180.5
  }
}

POST /v1/router

Intelligent routing endpoint. Analyzes the prompt semantically and selects the optimal model using the UniRoute algorithm.

Request Body:

| Field | Type | Required | Description |
|---|---|---|---|
| `messages` | array | Yes | Array of `{role, content}` objects (min 1) |
| `execute` | boolean | No | Execute after routing (default: `true`). Set `false` to only get the routing decision |
| `models` | array | No | Restrict routing to specific models (min 2). If omitted, uses all available models |
| `cost_weight` | float | No | Cost penalty weight (default: `0.0`). 0 = ignore cost; higher values prefer cheaper models |
| `stream` | boolean | No | Enable SSE streaming when `execute=true` (default: `false`) |
| `max_tokens` | integer | No | Maximum tokens when `execute=true` |
| `temperature` | float | No | Temperature when `execute=true` (0-2) |
Routes among all available models and executes:
curl -X POST "https://api.pureai-api.com/v1/router" \
  -H "x-api-key: pk_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a haiku about programming"}
    ]
  }'
Response:
{
  "id": "router-abc123",
  "object": "chat.completion",
  "model": "gpt-4o-mini",
  "routing": {
    "selected_model": "gpt-4o-mini",
    "selected_provider": "openai",
    "expected_error": 0.12,
    "cost_adjusted_score": 0.12,
    "cluster_id": 3,
    "reasoning": "Simple creative task — lightweight model sufficient"
  },
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Lines of code align\nBugs emerge from morning mist\nFixed by afternoon"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 18,
    "total_tokens": 28,
    "total_cost_usd": 0.00002
  }
}
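To inspect the routing decision without paying for a completion, set `execute` to `false`. A sketch of building that request body (the `router_payload` helper is illustrative, not part of the API):

```python
import json

def router_payload(messages, execute=True, models=None, cost_weight=0.0):
    """Build a /v1/router request body.

    With execute=False the endpoint returns only the "routing"
    object (selected model, expected error, reasoning) and skips
    the completion itself.
    """
    body = {"messages": messages, "execute": execute}
    if models is not None:
        if len(models) < 2:
            raise ValueError("models must list at least 2 candidates")
        body["models"] = models
    if cost_weight:
        body["cost_weight"] = cost_weight
    return json.dumps(body)

payload = router_payload(
    [{"role": "user", "content": "Summarize this article"}],
    execute=False,
    models=["gpt-4o", "gpt-4o-mini"],
    cost_weight=0.5,
)
```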

GET /v1/router/profiles

List models that have semantic routing profiles available.
curl "https://api.pureai-api.com/v1/router/profiles" \
  -H "x-api-key: pk_live_YOUR_KEY"
Response:
{
  "profiles": [
    "gpt-4o",
    "gpt-4o-mini",
    "gpt-4-turbo",
    "gpt-3.5-turbo",
    "mistral-large-latest",
    "mistral-small-latest"
  ],
  "count": 6
}

GET /v1/providers

List providers available for a model.

Query Parameters:

| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | No | Model ID (defaults to `gpt-4o-mini`) |
curl "https://api.pureai-api.com/v1/providers?model=gpt-4o-mini" \
  -H "x-api-key: pk_live_YOUR_KEY"
Response:
[
  {
    "id": "openai",
    "type": "primary",
    "enabled": true,
    "params": {}
  },
  {
    "id": "groq",
    "type": "backup",
    "enabled": true,
    "params": {}
  }
]
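The response distinguishes `primary` and `backup` providers. A small sketch of selecting one client-side (the `pick_provider` helper is hypothetical, not an SDK method):

```python
def pick_provider(providers):
    """Return the first enabled primary provider, else the first
    enabled backup, else None."""
    enabled = [p for p in providers if p["enabled"]]
    primaries = [p for p in enabled if p["type"] == "primary"]
    return (primaries or enabled or [None])[0]

providers = [
    {"id": "openai", "type": "primary", "enabled": True, "params": {}},
    {"id": "groq", "type": "backup", "enabled": True, "params": {}},
]
chosen = pick_provider(providers)  # -> the "openai" entry
```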

Response Objects

Usage

Included in all completion responses. Extends the OpenAI format with cost and latency data.
| Field | Type | Description |
|---|---|---|
| `prompt_tokens` | int | Input token count |
| `completion_tokens` | int | Output token count |
| `total_tokens` | int | Total tokens |
| `input_cost_usd` | float | Input cost in USD |
| `output_cost_usd` | float | Output cost in USD |
| `cache_input_cost_usd` | float | Cached input cost in USD |
| `total_cost_usd` | float | Total cost in USD |
| `latency_ms` | float | Total request latency in ms |
| `ttft_ms` | float | Time to first token in ms |
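The cost fields are additive: `total_cost_usd` is the sum of the input, output, and cached-input components, and `total_tokens` is the sum of the token counts. A small sanity-check sketch (the `check_usage` helper is illustrative):

```python
def check_usage(usage, tol=1e-9):
    """Sanity-check a usage block: component costs sum to the total,
    and token counts are consistent."""
    parts = (usage["input_cost_usd"] + usage["output_cost_usd"]
             + usage.get("cache_input_cost_usd", 0.0))
    assert abs(parts - usage["total_cost_usd"]) < tol
    assert usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]

# Verifies the sample /v1/chat/completions usage block above
check_usage({"prompt_tokens": 12, "completion_tokens": 9, "total_tokens": 21,
             "input_cost_usd": 0.000018, "output_cost_usd": 0.000027,
             "cache_input_cost_usd": 0.0, "total_cost_usd": 0.000045})
```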

Error Response

{
  "detail": "All providers failed for model 'invalid-model': Model not found"
}

Rate Limits

Rate limits are applied per API key. When exceeded, you receive a 429 response with a retry_after field in the body.
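A client can honor the `retry_after` value with a simple retry loop. A sketch under the assumption that the HTTP layer surfaces a 429 as an exception carrying that field (the `RateLimitError` class here is illustrative, not provided by the SDK):

```python
import time

class RateLimitError(Exception):
    """Illustrative 429 error carrying the body's retry_after value."""
    def __init__(self, retry_after):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after

def with_retries(call, max_attempts=3):
    """Invoke a request callable, sleeping for retry_after seconds
    after each 429 until max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError as err:
            if attempt == max_attempts - 1:
                raise
            time.sleep(err.retry_after)
```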

SDK Usage

from lunar import Lunar

client = Lunar(api_key="pk_live_YOUR_KEY")

# Chat completions
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
print(f"Cost: ${response.usage.total_cost_usd}")

# With fallbacks
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
    fallbacks=["claude-3-haiku", "llama-3.1-8b"]
)

# Streaming
for chunk in client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
):
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

# Text completions
response = client.completions.create(
    model="gpt-4o-mini",
    prompt="Hello"
)

# List models
models = client.models.list()

# List providers
providers = client.providers.list(model="gpt-4o-mini")