Fine-Tune Evaluation

Running Batch Evals

Evaluate hundreds of responses in a single API call. Get aggregated personality scores and drift metrics.

Batch evaluation allows you to analyze a large set of LLM responses at once. Run your eval dataset through a monitor to generate outputs, or send pre-collected responses directly for analysis.

API Endpoint

POST /api/v1/evals/batch

{
  "personaId": "uuid",
  "name": "my-eval-run",
  "datasetId": "dataset-uuid",
  "monitorId": "monitor-uuid",
  "modelName": "gpt-4o"
}
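
To send this request from code, here is a minimal sketch using Python's requests library. The base URL and Bearer-token header are assumptions; substitute your actual endpoint and authentication scheme.

import requests

# Assumed base URL and Bearer-token auth; adjust to your deployment.
API_BASE = "https://api.example.com"
API_KEY = "lnd_..."

payload = {
    "personaId": "your-persona-id",
    "name": "my-eval-run",
    "datasetId": "your-dataset-id",
    "monitorId": "your-monitor-id",
    "modelName": "gpt-4o",
}

resp = requests.post(
    f"{API_BASE}/api/v1/evals/batch",
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=300,
)
resp.raise_for_status()
result = resp.json()
print(result["evalRun"]["status"])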

To analyze pre-collected responses, omit datasetId and monitorId and send samples instead:

POST /api/v1/evals/batch

{
  "personaId": "uuid",
  "name": "my-eval-run",
  "samples": [
    {
      "id": "1",
      "content": "Response text from LLM...",
      "messages": [{ "role": "user", "content": "Original prompt" }]
    }
  ]
}
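
Each sample pairs the model's output (content) with the prompt that produced it (messages). One way to collect those responses from a fine-tuned model is sketched below using the OpenAI Python SDK; the model name and prompts are placeholders.

from openai import OpenAI

openai_client = OpenAI()
prompts = ["Original prompt", "Another prompt"]  # your eval prompts
model = "your-finetuned-model"  # placeholder model identifier

samples = []
for i, prompt in enumerate(prompts):
    completion = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    samples.append({
        "id": str(i),
        "content": completion.choices[0].message.content,
        "messages": [{"role": "user", "content": prompt}],
    })

The resulting samples list can then be sent as the samples field in the request above.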

When using monitorId, the monitor must expose an OpenAI-compatible /v1/chat/completions endpoint; the eval run sends each dataset prompt to the monitor and analyzes the generated responses.
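
In practice that means the monitor accepts the standard chat-completions request shape and returns the standard response shape. Below is a minimal FastAPI sketch of such a monitor, proxying to the provider that hosts your fine-tuned model; everything beyond the route path is illustrative.

from fastapi import FastAPI, Request
from openai import OpenAI

app = FastAPI()
upstream = OpenAI()  # client for the provider hosting your fine-tuned model

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    # Forward the OpenAI-shaped request and return the OpenAI-shaped response.
    body = await request.json()
    completion = upstream.chat.completions.create(
        model=body["model"],
        messages=body["messages"],
    )
    return completion.model_dump()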

Response Format

{
  "evalRun": {
    "id": "uuid",
    "status": "completed",
    "avgScores": {
      "openness": 72,
      "agreeableness": 68,
      // ... all 10 dimensions
    },
    "avgDrift": 12.0,
    "flaggedCount": 3
  },
  "summary": {
    "sampleCount": 100,
    "successCount": 98,
    "errorCount": 2,
    "avgScores": { ... },
    "avgDrift": 12.0,
    "flaggedCount": 3
  }
}
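
A small sketch of reading these fields from the parsed response body follows; the drift threshold of 15 is an arbitrary example, not a documented default.

def review_eval(result: dict, max_drift: float = 15.0) -> bool:
    """Return True if the run looks acceptable; the threshold is illustrative."""
    summary = result["summary"]
    if summary["errorCount"]:
        print(f"{summary['errorCount']} of {summary['sampleCount']} samples failed")
    if summary["avgDrift"] > max_drift or summary["flaggedCount"] > 0:
        print("Personality drift detected; review flagged responses before shipping.")
        return False
    return True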

Python Example

import lindr

client = lindr.Client(api_key="lnd_...")

# Run a batch evaluation over pre-collected responses
# (your_responses is a list of output strings from your fine-tuned model)
eval_run, summary = client.evals.batch(
    persona_id="your-persona-id",
    name="finetune-v1-eval",
    samples=[
        {"id": str(i), "content": response}
        for i, response in enumerate(your_responses)
    ]
)

# Check results
print(f"Average drift: {summary.avg_drift}")
print(f"Flagged responses: {summary.flagged_count}")