Fine-Tune Evaluation
Eval Datasets
Create reusable test prompt sets for consistent evaluation across model versions.
Eval datasets are collections of prompts you use to test your models. By using the same dataset across base and fine-tuned models, you ensure a fair comparison.
Dataset Structure
{
  "name": "customer-support-scenarios",
  "description": "50 customer support prompts across various categories",
  "prompts": [
    {
      "id": "complaint-1",
      "messages": [
        { "role": "user", "content": "I've been waiting 3 weeks for my order!" }
      ],
      "category": "complaint"
    },
    {
      "id": "inquiry-1",
      "messages": [
        { "role": "user", "content": "What's your return policy?" }
      ],
      "category": "inquiry"
    }
  ]
}

If the last message in a prompt is an assistant response, Lindr treats it as a pre-collected output. Otherwise, pass a monitor_id when running batch evals to generate responses from the model.
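For example, the first prompt below carries a pre-collected assistant output and would be scored as-is, while the second ends on a user turn and needs a model to generate the response. This is a minimal sketch: client.datasets.create and monitor_id come from this page, but client.evals.run_batch is an assumed method name and is shown only as a commented-out placeholder.

# Prompt with a pre-collected output: the trailing assistant message is evaluated directly.
precollected = {
    "id": "complaint-1-precollected",
    "messages": [
        {"role": "user", "content": "I've been waiting 3 weeks for my order!"},
        {"role": "assistant", "content": "I'm sorry for the delay. Let me check your order status."}
    ],
    "category": "complaint"
}

# Prompt ending on a user turn: a model (referenced by monitor_id) must generate the response.
open_ended = {
    "id": "inquiry-1",
    "messages": [{"role": "user", "content": "What's your return policy?"}],
    "category": "inquiry"
}

dataset = client.datasets.create(name="mixed-scenarios", prompts=[precollected, open_ended])

# Hypothetical batch-eval call; the method name is an assumption, monitor_id comes from the docs above.
# client.evals.run_batch(dataset_id=dataset.id, monitor_id="mon_123")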
Creating a Dataset
dataset = client.datasets.create(
    name="support-scenarios-v1",
    prompts=[
        {
            "id": "complaint-1",
            "messages": [{"role": "user", "content": "..."}],
            "category": "complaint"
        },
        # ... more prompts
    ]
)

print(f"Created dataset: {dataset.id}")
print(f"Prompt count: {dataset.prompt_count}")

Best Practices
- Diversity: Include prompts from different categories and edge cases
- Size: Aim for at least 50 prompts for statistically meaningful results
- Consistency: Use the same dataset for baseline and fine-tune evals
- Versioning: Create new dataset versions as your test cases evolve (see the sketch after this list)
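A minimal sketch of the versioning practice, using only the client.datasets.create call shown above: keep a version suffix in the dataset name and create a new dataset instead of editing the old one, so earlier eval runs stay comparable. Here v1_prompts is a stand-in for the prompt list behind support-scenarios-v1.

# Existing prompts plus a newly discovered edge case; create a new version rather than mutating v1.
updated_prompts = v1_prompts + [
    {
        "id": "refund-edge-1",
        "messages": [{"role": "user", "content": "I was charged twice and want a refund to a closed card."}],
        "category": "refund"
    }
]

dataset_v2 = client.datasets.create(
    name="support-scenarios-v2",  # bump the version suffix
    prompts=updated_prompts,
)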
JSON Import
Upload a JSON file with your prompts directly via the API or dashboard.
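If you keep prompt files in the structure shown under Dataset Structure, one way to import them through the SDK is to load the file and pass its fields to client.datasets.create. This is a sketch of that path only; the dashboard upload and any dedicated file-upload endpoint are not shown here.

import json

# Load a prompt file saved in the Dataset Structure format shown above.
with open("customer-support-scenarios.json") as f:
    spec = json.load(f)

dataset = client.datasets.create(
    name=spec["name"],
    prompts=spec["prompts"],
)
print(f"Imported {dataset.prompt_count} prompts as dataset {dataset.id}")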
Programmatic Creation
Build datasets dynamically from your existing test suites or production logs.
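As a sketch, assume production logs are available as a list of dicts with ticket_id, user_message, and topic fields (a hypothetical shape, not a Lindr API); you can map them onto the prompt structure and create a dataset with the same call as above.

# production_logs is a hypothetical list of dicts,
# e.g. {"ticket_id": "t-42", "user_message": "...", "topic": "billing"}
prompts = [
    {
        "id": f"log-{log['ticket_id']}",
        "messages": [{"role": "user", "content": log["user_message"]}],
        "category": log.get("topic", "uncategorized"),
    }
    for log in production_logs
]

dataset = client.datasets.create(name="prod-logs-2024-06", prompts=prompts)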