Fine-Tune Evaluation

A/B Model Comparison

Compare personality profiles between two models. See exactly what changed and whether it moved toward your target.

After running batch evaluations on your base model and fine-tuned model, you can create a comparison to see the dimension-by-dimension personality shift.

Creating a Comparison

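# `lindr` is assumed to be an already-initialized Lindr client.
comparison = lindr.comparisons.create(
    baseline_eval_id="uuid-of-base-model-eval",    # batch evaluation of the base model
    candidate_eval_id="uuid-of-finetuned-eval",    # batch evaluation of the fine-tuned model
    name="llama-base-vs-finetune-v1",
)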

Understanding the Diff Report

The comparison returns a detailed diff showing how each dimension changed:

Example Comparison Report

Dimension           Baseline → Candidate   Change   Status
Agreeableness       52 → 70                +18%     ✓ Closer
Assertiveness       45 → 40                -5%      ✓ Closer
Neuroticism         25 → 28                +3%      ⚠ Regression
Conscientiousness   65 → 78                +13%     ✓ Closer
Openness            60 → 62                +2%      ✓ Closer

Overall Improvement: +12% toward target
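
To make the "Closer" and "Regression" labels concrete, here is a minimal sketch of how a per-dimension diff could be classified. It is not Lindr's implementation: the DimensionDiff class, the diff_report helper, the target scores, and the rule "closer means smaller absolute distance to the target" are all assumptions made for illustration. Only the baseline and candidate scores come from the example report above.

```python
from dataclasses import dataclass


@dataclass
class DimensionDiff:
    name: str
    baseline: float
    candidate: float
    target: float  # hypothetical target score, not part of the example report

    @property
    def delta(self) -> float:
        """Signed change from baseline to candidate."""
        return self.candidate - self.baseline

    @property
    def status(self) -> str:
        """Classify the move by absolute distance to the target (assumed rule)."""
        before = abs(self.baseline - self.target)
        after = abs(self.candidate - self.target)
        if after < before:
            return "Closer"
        if after > before:
            return "Regression"
        return "Unchanged"


def diff_report(baseline, candidate, target):
    """Build one DimensionDiff per dimension present in the baseline."""
    return [
        DimensionDiff(name, baseline[name], candidate[name], target[name])
        for name in baseline
    ]


# Scores from the example report.
baseline = {"Agreeableness": 52, "Assertiveness": 45, "Neuroticism": 25,
            "Conscientiousness": 65, "Openness": 60}
candidate = {"Agreeableness": 70, "Assertiveness": 40, "Neuroticism": 28,
             "Conscientiousness": 78, "Openness": 62}
# Hypothetical targets, chosen only so the sketch has something to compare against.
target = {"Agreeableness": 75, "Assertiveness": 35, "Neuroticism": 20,
          "Conscientiousness": 80, "Openness": 65}

for d in diff_report(baseline, candidate, target):
    print(f"{d.name:<18} {d.baseline:g} -> {d.candidate:g}  {d.delta:+g}  {d.status}")
```

With these assumed targets, the sketch labels Neuroticism as a regression and the other four dimensions as closer, matching the statuses in the example report.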

Recommendations

Based on the comparison, Lindr provides one of three recommendations:

Ship - Overall improvement >10%, no major regressions
Review - Mixed results, some dimensions improved while others regressed
Reject - Overall regression or critical dimensions moved away from target
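
The three outcomes above can be read as a small decision rule. The sketch below is one possible interpretation, not Lindr's actual logic: the recommend function, its parameters, and the 5-point cutoff for a "major" regression are hypothetical. Only the >10% overall-improvement threshold comes from the list above.

```python
def recommend(overall_improvement, regressions, critical=frozenset(),
              major_threshold=5.0):
    """Return "Ship", "Review", or "Reject" for a comparison.

    overall_improvement: overall movement toward the target, in percent.
    regressions: dimension name -> how far it moved away from the target.
    critical: dimensions that must not regress (hypothetical parameter).
    major_threshold: size at which a regression counts as major (assumed value).
    """
    critical_regressed = any(name in critical for name in regressions)
    if overall_improvement < 0 or critical_regressed:
        return "Reject"

    has_major_regression = any(
        size >= major_threshold for size in regressions.values()
    )
    if overall_improvement > 10 and not has_major_regression:
        return "Ship"

    # Everything else is a mixed result that needs a human look.
    return "Review"


# The example report: +12% overall, Neuroticism regressed by 3 points.
print(recommend(12, {"Neuroticism": 3}))
```

Under these assumptions the example report would be a "Ship", because the only regression is below the assumed cutoff. Marking Neuroticism as critical, or a smaller overall improvement, would change the outcome to "Reject" or "Review" respectively.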
