Fine-Tune Evaluation

A/B Model Comparison

Compare personality profiles between two models. See exactly what changed and whether it moved toward your target.

After running batch evaluations on your base model and fine-tuned model, you can create a comparison to see the dimension-by-dimension personality shift.

Creating a Comparison

comparison = lindr.comparisons.create(
    baseline_eval_id="uuid-of-base-model-eval",
    candidate_eval_id="uuid-of-finetuned-eval",
    name="llama-base-vs-finetune-v1"
)

Understanding the Diff Report

The comparison returns a detailed diff showing how each dimension changed:

Example Comparison Report

Dimension          Baseline  Candidate  Change  Status
Agreeableness      52        70         +18%    ✓ Closer
Assertiveness      45        40         -5%     ✓ Closer
Neuroticism        25        28         +3%     ⚠ Regression
Conscientiousness  65        78         +13%    ✓ Closer
Openness           60        62         +2%     ✓ Closer

Overall Improvement: +12% toward target
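
To make the status labels concrete, here is a minimal sketch of how a per-dimension diff could be classified. It is not Lindr's implementation. The baseline and candidate scores come from the example report, but the target values, the `DimensionDiff` class, and the `diff_report` helper are all hypothetical, since the report does not show the targets. The sketch assumes a dimension counts as "closer" when the candidate's distance to the target is smaller than the baseline's, which explains why a drop in Assertiveness can still be an improvement.

```python
from dataclasses import dataclass


@dataclass
class DimensionDiff:
    """One row of a comparison report (hypothetical structure)."""
    name: str
    baseline: float
    candidate: float
    target: float

    @property
    def delta(self) -> float:
        # Signed change from the baseline score to the candidate score.
        return self.candidate - self.baseline

    @property
    def status(self) -> str:
        # Compare distance to target before and after fine-tuning.
        before = abs(self.baseline - self.target)
        after = abs(self.candidate - self.target)
        if after < before:
            return "closer"
        if after > before:
            return "regression"
        return "unchanged"


# (baseline, candidate) scores taken from the example report.
EXAMPLE_SCORES = {
    "Agreeableness": (52, 70),
    "Assertiveness": (45, 40),
    "Neuroticism": (25, 28),
    "Conscientiousness": (65, 78),
    "Openness": (60, 62),
}

# ASSUMPTION: these targets are invented for illustration only.
HYPOTHETICAL_TARGETS = {
    "Agreeableness": 75,
    "Assertiveness": 40,
    "Neuroticism": 20,
    "Conscientiousness": 80,
    "Openness": 65,
}


def diff_report(scores, targets):
    """Build one DimensionDiff per dimension, in the order given."""
    return [
        DimensionDiff(name, base, cand, targets[name])
        for name, (base, cand) in scores.items()
    ]


if __name__ == "__main__":
    for row in diff_report(EXAMPLE_SCORES, HYPOTHETICAL_TARGETS):
        print(f"{row.name:<18} {row.delta:+} {row.status}")
```

With these assumed targets, the sketch labels Neuroticism as a regression and the other four dimensions as closer, matching the example report.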

Recommendations

Based on the comparison, Lindr provides one of three recommendations:

  • Ship - Overall improvement >10%, no major regressions
  • Review - Mixed results, some dimensions improved while others regressed
  • Reject - Overall regression or critical dimensions moved away from target
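
The three rules above can be sketched as a small decision function. This is an illustration under stated assumptions, not Lindr's actual logic: the `recommend` function, its parameters, and the 5-point cutoff for a "major" regression are all hypothetical, because the documentation does not define how large a regression must be to count as major or which dimensions are critical.

```python
def recommend(overall_improvement, regressions, critical=frozenset(),
              major_threshold=5.0):
    """Return "Ship", "Review", or "Reject" for a comparison.

    overall_improvement: overall movement toward target, in percent.
    regressions: dict of dimension name -> size of the move away from target.
    critical: dimension names that must not regress (ASSUMPTION).
    major_threshold: regression size treated as major (ASSUMPTION).
    """
    # Reject: overall regression, or a critical dimension moved away from target.
    if overall_improvement < 0 or any(name in critical for name in regressions):
        return "Reject"

    # Ship: overall improvement above 10% and no major regressions.
    has_major = any(size >= major_threshold for size in regressions.values())
    if overall_improvement > 10 and not has_major:
        return "Ship"

    # Review: everything else, i.e. mixed results.
    return "Review"


if __name__ == "__main__":
    # The example report: +12% overall, Neuroticism regressed by 3.
    print(recommend(12, {"Neuroticism": 3}))
```

Under these assumptions the example report would come out as Ship, but marking Neuroticism as critical would turn the same numbers into Reject, so the choice of critical dimensions and threshold matters more than the overall score.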