# Streaming
Streaming allows you to receive responses token by token as they’re generated, enabling real-time output in your applications.
## Basic Streaming
```python
from lunar import Lunar

client = Lunar()

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a short story about a robot."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
```typescript
import { Lunar } from "lunar";

const client = new Lunar();

const stream = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Write a short story about a robot." }],
  stream: true,
});

for await (const chunk of stream) {
  if (chunk.choices[0].delta.content) {
    process.stdout.write(chunk.choices[0].delta.content);
  }
}
```
## Stream Response Structure
Each chunk is a `ChatCompletionChunk` object:

| Field | Type | Description |
|---|---|---|
| `id` | `str` | Completion identifier |
| `object` | `str` | Always `"chat.completion.chunk"` |
| `created` | `int` | Unix timestamp |
| `model` | `str` | Model used |
| `choices` | `list` | List of streaming choices |
### Choice Delta
```python
chunk.choices[0].index          # Position in the choices list
chunk.choices[0].delta.role     # Role (only in the first chunk)
chunk.choices[0].delta.content  # Content fragment
chunk.choices[0].finish_reason  # None until the final chunk, then "stop"
```
```typescript
chunk.choices[0].index          // Position in the choices list
chunk.choices[0].delta.role     // Role (only in the first chunk)
chunk.choices[0].delta.content  // Content fragment
chunk.choices[0].finish_reason  // null until the final chunk, then "stop"
```
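To see how these fields fit together, here is a minimal sketch that assembles a full message from mock chunks. The plain dicts below stand in for `ChatCompletionChunk` objects (the real SDK uses attribute access); they follow the field layout described above:

```python
# Mock chunks standing in for ChatCompletionChunk objects.
# The first chunk carries the role; the last carries finish_reason.
chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "Hello"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": ", world!"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
]

role = None
parts = []
finish_reason = None

for chunk in chunks:
    choice = chunk["choices"][0]
    delta = choice["delta"]
    if "role" in delta:            # present only in the first chunk
        role = delta["role"]
    if delta.get("content"):       # content fragment; may be missing or empty
        parts.append(delta["content"])
    if choice["finish_reason"]:    # set only on the final chunk
        finish_reason = choice["finish_reason"]

message = "".join(parts)           # "Hello, world!"
```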
## Collecting Full Response
```python
full_response = ""

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        full_response += content
        print(content, end="", flush=True)

print()  # New line
print(f"Complete: {full_response}")
```
```typescript
let fullResponse = "";

for await (const chunk of stream) {
  const content = chunk.choices[0].delta.content;
  if (content) {
    fullResponse += content;
    process.stdout.write(content);
  }
}

console.log(); // New line
console.log(`Complete: ${fullResponse}`);
```
## Async Streaming (Python)
```python
import asyncio

from lunar import AsyncLunar

async def stream_response():
    async with AsyncLunar() as client:
        stream = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Tell me a joke."}],
            stream=True,
        )
        async for chunk in stream:
            if chunk.choices[0].delta.content:
                print(chunk.choices[0].delta.content, end="", flush=True)

asyncio.run(stream_response())
```
The TypeScript SDK is async by default — use `for await...of` to iterate over the stream, as shown in the examples above.
## Aborting a Stream (TypeScript)
The TypeScript SDK returns a `Stream` object with an `abort()` method:
```typescript
const stream = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Write a long essay." }],
  stream: true,
});

// Abort the stream after 5 seconds
setTimeout(() => stream.abort(), 5000);

for await (const chunk of stream) {
  if (chunk.choices[0].delta.content) {
    process.stdout.write(chunk.choices[0].delta.content);
  }
}
```
## Detecting Stream End
```python
for chunk in stream:
    # The final chunk carries finish_reason instead of content
    if chunk.choices[0].finish_reason:
        print(f"\nStream ended: {chunk.choices[0].finish_reason}")
        break
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
```
```typescript
for await (const chunk of stream) {
  // The final chunk carries finish_reason instead of content
  if (chunk.choices[0].finish_reason) {
    console.log(`\nStream ended: ${chunk.choices[0].finish_reason}`);
    break;
  }
  const content = chunk.choices[0].delta.content;
  if (content) {
    process.stdout.write(content);
  }
}
```
## With Fallbacks
Streaming works with fallbacks. If the primary model fails, the SDK automatically tries the next model:
```python
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    fallbacks=["mistral-small-latest", "gpt-4o"],
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
```typescript
const stream = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
  fallbacks: ["mistral-small-latest", "gpt-4o"],
});

for await (const chunk of stream) {
  if (chunk.choices[0].delta.content) {
    process.stdout.write(chunk.choices[0].delta.content);
  }
}
```
## Streaming with Error Handling
```python
from lunar import Lunar, RateLimitError, ServerError

client = Lunar()

try:
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello!"}],
        stream=True,
    )
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)
except RateLimitError as e:
    print(f"Rate limited. Retry after: {e.retry_after}s")
except ServerError as e:
    print(f"Server error: {e}")
```
```typescript
import { Lunar, RateLimitError, ServerError } from "lunar";

const client = new Lunar();

try {
  const stream = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Hello!" }],
    stream: true,
  });
  for await (const chunk of stream) {
    const content = chunk.choices[0].delta.content;
    if (content) {
      process.stdout.write(content);
    }
  }
} catch (e) {
  if (e instanceof RateLimitError) {
    console.log(`Rate limited. Retry after: ${e.retryAfter}s`);
  } else if (e instanceof ServerError) {
    console.log(`Server error: ${e}`);
  } else {
    throw e; // re-throw errors we don't handle here
  }
}
```
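When a stream fails with a rate limit, a common pattern is to wait for `retry_after` seconds and retry the whole request (a stream cannot be resumed mid-way). The sketch below shows the idea; the `with_retry` helper, the stand-in `RateLimitError` class, and the retry limit are illustrative, not part of the SDK:

```python
import time

class RateLimitError(Exception):
    """Stand-in for the SDK's RateLimitError (exposes retry_after)."""
    def __init__(self, retry_after):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after

def with_retry(fn, max_retries=3, sleep=time.sleep):
    """Run fn(), sleeping for retry_after and retrying on RateLimitError."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError as e:
            if attempt == max_retries:
                raise
            sleep(e.retry_after)

# Example: a call that fails twice with a 1-second retry hint, then succeeds.
attempts = {"n": 0}
waits = []

def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError(retry_after=1)
    return "ok"

result = with_retry(flaky_call, sleep=waits.append)  # records waits instead of sleeping
```

In real use, `fn` would create the stream and consume it to completion, so a mid-stream failure retries the whole generation.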
## When to Use Streaming
| Use Case | Stream? |
|---|---|
| Chat interfaces | Yes |
| Real-time output | Yes |
| Long-form generation | Yes |
| Batch processing | No |
| API integrations | Depends |
### Performance

- **TTFT (time to first token):** streaming delivers the first token sooner, so perceived response time is much lower.
- **Total latency:** roughly the same as a non-streaming request.
- **Memory:** streaming avoids buffering the entire response, which matters for long outputs.
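These trade-offs are easy to demonstrate with a simulated stream. The generator below stands in for a real streaming response, and the token count and delay are illustrative:

```python
import time

def fake_stream(n_tokens=5, delay=0.01):
    """Simulated streaming response: one token every `delay` seconds."""
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"token-{i} "

start = time.monotonic()
first_token_at = None

for token in fake_stream():
    if first_token_at is None:
        first_token_at = time.monotonic() - start  # TTFT

total = time.monotonic() - start  # total latency

# With streaming, the caller sees output after roughly one delay;
# without it, nothing is visible until `total` has elapsed.
```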