Fine-Tune Evaluation
Running Batch Evals
Evaluate hundreds of responses in a single API call. Get aggregated personality scores and drift metrics.
Batch evaluation allows you to analyze a large set of LLM responses at once. Run your eval dataset through a monitor to generate outputs, or send pre-collected responses directly for analysis.
API Endpoint
POST /api/v1/evals/batch
{
  "personaId": "uuid",
  "name": "my-eval-run",
  "datasetId": "dataset-uuid",
  "monitorId": "monitor-uuid",
  "modelName": "gpt-4o"
}

To analyze pre-collected responses, omit datasetId and send samples instead:
POST /api/v1/evals/batch
{
  "personaId": "uuid",
  "name": "my-eval-run",
  "samples": [
    {
      "id": "1",
      "content": "Response text from LLM...",
      "messages": [{ "role": "user", "content": "Original prompt" }]
    }
  ]
}

When using monitorId, the monitor should expose an OpenAI-compatible /v1/chat/completions endpoint.
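
If you are not using the Python SDK, either request can be sent with any HTTP client. The sketch below posts pre-collected samples with the requests library; the base URL and the bearer-token Authorization header are assumptions, so substitute whatever your deployment and API reference specify. The response fields it reads are described under Response Format below.

import requests

API_KEY = "lnd_..."                    # your API key
BASE_URL = "https://api.example.com"   # assumption: replace with your actual API base URL

payload = {
    "personaId": "your-persona-id",
    "name": "finetune-v1-eval",
    "samples": [
        {"id": "1", "content": "Response text from LLM..."},
    ],
}

# POST the batch to /api/v1/evals/batch
resp = requests.post(
    f"{BASE_URL}/api/v1/evals/batch",
    headers={"Authorization": f"Bearer {API_KEY}"},  # assumption: bearer-token auth
    json=payload,
)
resp.raise_for_status()

# Pull the aggregate metrics from the summary block of the response
result = resp.json()
print(result["summary"]["avgDrift"], result["summary"]["flaggedCount"])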
Response Format
{
  "evalRun": {
    "id": "uuid",
    "status": "completed",
    "avgScores": {
      "openness": 72,
      "agreeableness": 68,
      // ... all 10 dimensions
    },
    "avgDrift": 12.0,
    "flaggedCount": 3
  },
  "summary": {
    "sampleCount": 100,
    "successCount": 98,
    "errorCount": 2,
    "avgScores": { ... },
    "avgDrift": 12.0,
    "flaggedCount": 3
  }
}

Python Example
import lindr

client = lindr.Client(api_key="lnd_...")

# Run a batch evaluation over pre-collected responses
# (your_responses is a placeholder for your list of model output strings)
eval_run, summary = client.evals.batch(
    persona_id="your-persona-id",
    name="finetune-v1-eval",
    samples=[
        {"id": str(i), "content": response}
        for i, response in enumerate(your_responses)
    ],
)

# Check results
print(f"Average drift: {summary.avg_drift}")
print(f"Flagged responses: {summary.flagged_count}")