# Evaluations
The Lunar SDK includes a built-in evaluation framework to test and measure the quality of your LLM outputs. Evaluations help you understand how well your models perform and identify areas for improvement.

## Why Evaluate?
- Quality Assurance: Verify outputs meet your requirements
- Model Comparison: Compare different models objectively
- Regression Testing: Detect quality degradation over time
- Prompt Optimization: Measure impact of prompt changes
## Key Concepts

### Dataset
A collection of test cases, each with an input and optionally an expected output:
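The exact dataset type isn't shown here, but a dataset is typically just a list of input/expected pairs. The `TestCase` name and field names below are illustrative stand-ins, not necessarily the SDK's actual types:

```typescript
// Hypothetical shape of a dataset entry; the Lunar SDK's actual type may differ.
interface TestCase {
  input: string;
  expected?: string; // optional: some scorers don't need a reference answer
}

const dataset: TestCase[] = [
  { input: "What is the capital of France?", expected: "Paris" },
  { input: "Summarize this paragraph." }, // no expected output
];
```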
### Task
A function that takes an input and returns an output (typically by calling an LLM):

### Scorers

Functions that evaluate the output and return a score (0.0 to 1.0):

## Quick Example
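Since the SDK's exact imports and entry points aren't shown here, the sketch below defines self-contained stand-ins for each piece (a dataset, a task, a scorer, and a runner). Names like `evaluate` are assumptions, and the task stubs out the LLM call; with the real SDK you would import its equivalents and call your model inside the task.

```typescript
// Minimal self-contained sketch of an evaluation run.
// All names here are illustrative; the Lunar SDK's real API may differ.

type TestCase = { input: string; expected?: string };
type Scorer = (output: string, expected?: string) => number; // 0.0 to 1.0

const dataset: TestCase[] = [
  { input: "2 + 2", expected: "4" },
  { input: "capital of France", expected: "Paris" },
];

// Task: in real use this would call an LLM; stubbed here with canned answers.
async function task(input: string): Promise<string> {
  const canned: Record<string, string> = {
    "2 + 2": "4",
    "capital of France": "Paris",
  };
  return canned[input] ?? "";
}

// Scorer in the style of the built-in exactMatch: 1.0 on an exact match, else 0.0.
const exactMatch: Scorer = (output, expected) => (output === expected ? 1.0 : 0.0);

// Runner: applies the task and every scorer to each test case.
async function evaluate(cases: TestCase[], scorers: Scorer[]) {
  const results: { input: string; output: string; scores: number[] }[] = [];
  for (const c of cases) {
    const output = await task(c.input);
    results.push({
      input: c.input,
      output,
      scores: scorers.map((s) => s(output, c.expected)),
    });
  }
  return results;
}

evaluate(dataset, [exactMatch]).then((results) => console.log(results));
```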
## Scorer Types
| Type | Description | Example |
|---|---|---|
| Built-in | Pre-instantiated scorers | `exactMatch`, `jsonValid` |
| Factory | Parameterized scorers | `regex(pattern)`, `llmJudge(...)` |
| Custom | Your own scoring logic | `@Scorer` decorator |
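The `@Scorer` decorator in the table suggests a custom scorer is a plain function registered with the SDK. Setting the registration syntax aside, the scoring logic itself is just a function from an output (and optional expected value) to a number in [0, 1]. A hypothetical length-based scorer, with the SDK-specific decorator omitted:

```typescript
// Hypothetical custom scorer: full credit for outputs inside a target length
// band, partial credit that decays with distance from the band.
// In the real SDK this would presumably be registered via @Scorer.
function lengthInRange(output: string, min = 10, max = 200): number {
  if (output.length >= min && output.length <= max) return 1.0;
  const dist = output.length < min ? min - output.length : output.length - max;
  return Math.max(0, 1 - dist / max);
}

console.log(lengthInRange("a reasonably sized answer")); // in-band: 1.0
```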
## Evaluation Flow
1. Dataset: Provide your test cases with inputs and expected outputs
2. Task: Your function calls the LLM with each input
3. Scorers: Each output is evaluated by one or more scorers
4. Results: Get aggregated scores and detailed per-case results
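The aggregation in the final step can be sketched as a mean over per-case scores, grouped by scorer name. The result shape below is a simplified stand-in for whatever summary the SDK actually reports:

```typescript
// Aggregate per-case scores into a mean per scorer.
// The record shapes here are illustrative, not the SDK's actual result types.
type CaseResult = { scores: Record<string, number> };

function aggregate(results: CaseResult[]): Record<string, number> {
  const totals: Record<string, { sum: number; n: number }> = {};
  for (const r of results) {
    for (const [name, score] of Object.entries(r.scores)) {
      const t = (totals[name] ??= { sum: 0, n: 0 });
      t.sum += score;
      t.n += 1;
    }
  }
  return Object.fromEntries(
    Object.entries(totals).map(([name, t]) => [name, t.sum / t.n])
  );
}

const summary = aggregate([
  { scores: { exactMatch: 1.0, jsonValid: 1.0 } },
  { scores: { exactMatch: 0.0, jsonValid: 1.0 } },
]);
console.log(summary); // { exactMatch: 0.5, jsonValid: 1 }
```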