
The Personality of Open Source: How Llama, Mistral, and Qwen Compare to GPT-5.2 and Claude

We evaluated 6 language models across 13,825 personality assessments. Here's what we found about the personality profiles of open-weight models vs. frontier closed models.

15 min read
Lindr Research

TL;DR

  • Large effect sizes between frontier models — Claude vs GPT shows Hedges' g = 1.39 for Openness (95% CI: 1.32–1.46), a large and statistically robust difference
  • Open-weight models are statistically indistinguishable — Llama 8B vs 70B shows g = 0.08 (negligible), and all open-model comparisons have g < 0.2
  • Prompt choice matters more than model choice — Only 7.7% of variance is explained by model identity; 35% comes from prompt content
  • Three latent personality factors emerge — Factor analysis reveals “Drive,” “Emotional Responsiveness,” and “Social Engagement” explaining 78.9% of variance

The Models

We benchmarked 6 models across the major model families:

Model           | Provider   | Parameters | Type
GPT-5.2         | OpenAI     | Unknown    | Closed
Claude Opus 4.5 | Anthropic  | Unknown    | Closed
Llama 3.3 70B   | Meta       | 70B        | Open
Llama 3.1 8B    | Meta       | 8B         | Open
Mistral Large 3 | Mistral AI | 675B MoE   | Open
Qwen 2.5 72B    | Alibaba    | 72B        | Open

Each model responded to 500 personality-probing prompts across 5 context conditions (professional, casual, customer support, sales, technical), yielding approximately 2,300-2,400 scored responses per model.

Results

The Big Picture: Personality Profiles

Here's how each model scores across our 10 personality dimensions:

Radar chart comparing personality profiles of 6 LLMs
Heatmap of personality scores by model and dimension

Finding 1: Open-Weight Models Have a Shared “Personality Type”

The most striking finding is how similar the open-weight models are to each other. Llama 3.3 70B, Llama 3.1 8B, Mistral Large 3, and Qwen 2.5 72B all cluster within a narrow band:

  • Openness: 67.5–69.5 (range: 2.0 pts)
  • Conscientiousness: 53.2–55.3 (range: 2.1 pts)
  • Extraversion: 56.0–58.1 (range: 2.1 pts)
  • Neuroticism: 57.4–58.9 (range: 1.5 pts)

Comparison of closed vs open-weight model personalities

This suggests at least one of three explanations: (1) open-weight models share similar training objectives and data, (2) there is convergence toward a "default" LLM personality, or (3) the differences between frontier closed models are deliberately engineered.

Finding 2: Claude is the Most “Intellectually Curious”

Claude Opus 4.5 leads in both Openness (70.8) and Curiosity (63.3) — the two traits most associated with intellectual engagement, creativity, and willingness to explore ideas.

Implication: If you're building applications that require creative exploration, brainstorming, or intellectual discourse, Claude's higher openness/curiosity profile may be advantageous.

Finding 3: GPT-5.2 is the “Get Things Done” Model

GPT-5.2 scores highest on Conscientiousness (55.6) and Ambition (63.1). It's the most organized, goal-directed, and task-focused of the bunch.

Interestingly, it also has the lowest Openness score (66.3). This creates a personality that's more focused on execution than exploration.

Implication: For structured tasks, following instructions precisely, and maintaining focus on objectives, GPT-5.2's profile is well-suited.

Finding 4: Model Size Doesn't Predict Personality

One surprising result: Llama 3.1 8B and Llama 3.3 70B have nearly identical personality profiles despite an 8x difference in parameters.

Comparison of Llama 8B vs 70B personality scores

The maximum difference is just 0.7 points. This suggests personality is determined more by training methodology and RLHF than by raw model capacity.

Score Distributions

Box plots showing score distributions by model

Statistical Analysis

Effect Sizes: Measuring Practical Significance

Raw score differences can be misleading. A 4-point gap might be huge or trivial depending on score variance. We use Hedges' g, a bias-corrected standardized mean difference, with 95% bootstrap confidence intervals to measure whether differences are practically meaningful.
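
For readers who want to reproduce the statistics, here is a minimal NumPy sketch of Hedges' g with a percentile-bootstrap confidence interval. The array names (claude_openness, gpt_openness) are illustrative placeholders, not our actual pipeline code.

import numpy as np

def hedges_g(a, b):
    """Bias-corrected standardized mean difference between two score samples."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2))
    d = (a.mean() - b.mean()) / pooled_sd        # Cohen's d
    correction = 1 - 3 / (4 * (na + nb) - 9)     # Hedges' small-sample correction
    return d * correction

def bootstrap_ci(a, b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Hedges' g (10,000 resamples, as in this study)."""
    rng = np.random.default_rng(seed)
    gs = [hedges_g(rng.choice(a, len(a), replace=True),
                   rng.choice(b, len(b), replace=True))
          for _ in range(n_boot)]
    return np.percentile(gs, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Example (placeholder arrays of per-response Openness scores):
# g = hedges_g(claude_openness, gpt_openness)
# lo, hi = bootstrap_ci(claude_openness, gpt_openness)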

Effect Size Interpretation:

|g| < 0.2       → Negligible
0.2 ≤ |g| < 0.5 → Small
0.5 ≤ |g| < 0.8 → Medium
|g| ≥ 0.8       → Large

Forest plot showing Hedges' g with 95% bootstrap CIs (10,000 resamples). Error bars not crossing zero indicate statistical significance at p < 0.05.

Key Effect Size Findings:

  • Claude vs GPT-5.2: Openness g = 1.39 (large), Conscientiousness g = -1.19 (large), Curiosity g = 1.03 (large)
  • Llama 8B vs Llama 70B: All dimensions g < 0.16 (negligible) — scale doesn't change personality
  • Open-weight models: Pairwise comparisons consistently show g < 0.2, confirming clustering

Variance Decomposition: What Drives Personality Scores?

We decomposed total variance into four sources: model identity, prompt content, context condition, and residual (unexplained variation).

Stacked bar chart showing variance decomposition by dimension

Dimension         | Model % | Prompt % | Context % | Residual %
Openness          | 15.9    | 34.2     | 32.7      | 17.2
Conscientiousness | 14.5    | 38.6     | 27.9      | 19.0
Neuroticism       | 8.9     | 38.2     | 34.5      | 18.4
Curiosity         | 7.7     | 35.0     | 38.5      | 18.8
Agreeableness     | 2.3     | 41.8     | 35.7      | 20.2

Key insight: Model choice explains only 7.7% of variance on average. The prompt you use (35%) and the context condition (32%) have far greater impact on personality scores. This means how you evaluate matters more than which model you evaluate.
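
For a concrete sense of how such a decomposition can be computed, here is a sketch of an ANOVA-based partition using statsmodels. The DataFrame layout and column names (score, model, prompt_id, context) are assumptions about how per-response data might be organized, not our exact pipeline.

import statsmodels.api as sm
from statsmodels.formula.api import ols

# df: one row per scored response, with columns 'score', 'model', 'prompt_id', 'context'
def variance_shares(df):
    fit = ols("score ~ C(model) + C(prompt_id) + C(context)", data=df).fit()
    anova = sm.stats.anova_lm(fit, typ=2)
    total_ss = anova["sum_sq"].sum()
    # Fraction of total variance attributable to each factor (plus the residual)
    return anova["sum_sq"] / total_ss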

Factor Analysis: Latent Personality Structure

Principal Component Analysis with varimax rotation reveals three underlying factors that explain 78.9% of total variance (KMO = 0.64, Bartlett's test p < 0.001).
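
An analysis of this kind can be approximated with the factor_analyzer package, as sketched below. The input X is assumed to be a DataFrame of per-response scores with one column per personality dimension, and method="principal" is used to mimic PCA-style extraction; treat this as an illustration rather than our exact code.

import pandas as pd
from factor_analyzer import FactorAnalyzer, calculate_kmo, calculate_bartlett_sphericity

def latent_factors(X, n_factors=3):
    chi2, p_value = calculate_bartlett_sphericity(X)  # Bartlett's test (we observed p < 0.001)
    _, kmo_total = calculate_kmo(X)                   # sampling adequacy (we observed KMO = 0.64)
    fa = FactorAnalyzer(n_factors=n_factors, rotation="varimax", method="principal")
    fa.fit(X)
    loadings = pd.DataFrame(fa.loadings_, index=X.columns)
    _, proportion_var, cumulative_var = fa.get_factor_variance()
    return loadings, proportion_var, cumulative_var, kmo_total, p_value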

Heatmap showing factor loadings for each personality dimension

Factor 1: Drive (42.5%)

High loadings: Ambition, Resilience, Conscientiousness. Captures task-orientation and goal-directedness.

Factor 2: Emotional Responsiveness (21.4%)

High loadings: Neuroticism, Curiosity, Openness. Captures emotional depth and intellectual engagement.

Factor 3: Social Engagement (15.0%)

High loadings: Extraversion, Agreeableness, Assertiveness. Captures interpersonal interaction style.

Complete Results Table

Dimension         | GPT-5.2 | Claude | Llama 70B | Llama 8B | Mistral | Qwen
Openness          | 66.3    | 70.8*  | 69.5      | 69.2     | 67.5    | 68.9
Conscientiousness | 55.6*   | 50.3   | 53.2      | 53.9     | 55.3    | 53.5
Extraversion      | 55.3    | 57.3   | 56.0      | 56.2     | 58.1*   | 58.1*
Agreeableness     | 54.9    | 56.7*  | 56.1      | 56.0     | 56.3    | 56.6
Neuroticism       | 58.0    | 61.4*  | 58.2      | 57.7     | 57.4    | 58.9
Assertiveness     | 50.7    | 49.6   | 50.5      | 50.6     | 51.2*   | 51.2*
Ambition          | 63.1*   | 61.5   | 62.3      | 62.4     | 62.0    | 62.2
Resilience        | 60.4    | 59.0   | 60.5      | 60.8*    | 60.5    | 60.5
Integrity         | 51.2    | 50.7   | 50.8      | 51.2     | 51.8*   | 51.5
Curiosity         | 59.6    | 63.3*  | 60.2      | 59.7     | 60.4    | 60.1
* = highest score in that dimension

What This Means for Model Selection

Use Case            | Recommended Model | Why
Creative writing    | Claude Opus 4.5   | Highest openness + curiosity
Task execution      | GPT-5.2           | Highest conscientiousness + ambition
Customer support    | Mistral / Qwen    | High extraversion + agreeableness
Technical docs      | GPT-5.2           | Low neuroticism + high conscientiousness
Empathetic coaching | Claude Opus 4.5   | High neuroticism + agreeableness
General-purpose     | Llama 3.3 70B     | Balanced profile, cost-effective

Methodology

  • Prompts: 500 unique prompts targeting 10 personality dimensions
  • Contexts: 5 conditions (professional, casual, customer support, sales, technical)
  • Evaluations: ~2,300-2,400 scored responses per model (13,825 total successful)
  • Scoring: Lindr personality analysis API (10-dimensional, 0-100 scale)
  • Generation: Temperature 0.7, max 1,024 tokens
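
To make the generation settings concrete, here is a minimal sketch of a collection loop. The client setup, system-prompt wording, and the assumption that every model is served behind an OpenAI-compatible endpoint are illustrative; this is not our exact harness.

import itertools
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # or any OpenAI-compatible endpoint

CONTEXTS = ["professional", "casual", "customer support", "sales", "technical"]

def collect_responses(model_id, prompts, contexts=CONTEXTS):
    """Yield one response per (prompt, context) pair using the settings above."""
    for prompt, context in itertools.product(prompts, contexts):
        resp = client.chat.completions.create(
            model=model_id,
            messages=[
                {"role": "system", "content": f"You are responding in a {context} setting."},
                {"role": "user", "content": prompt},
            ],
            temperature=0.7,
            max_tokens=1024,
        )
        yield prompt, context, resp.choices[0].message.content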

Statistical Methods

  • Effect sizes: Hedges' g (bias-corrected) with 10,000-sample bootstrap 95% CIs
  • Variance decomposition: ANOVA-based partitioning (model, prompt, context, residual)
  • Factor analysis: PCA with varimax rotation; sampling adequacy verified via KMO (0.64) and Bartlett's test (p < 0.001)
  • Distance metrics: Cosine similarity, Mahalanobis distance (accounts for correlation structure)
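
As an illustration of the distance metrics, here is a SciPy sketch. The inputs are assumptions about data layout: profiles maps model names to 10-dimensional mean score vectors, and per_response is a matrix of individual response scores used to estimate the covariance for the Mahalanobis distance.

import numpy as np
from scipy.spatial.distance import cosine, mahalanobis

def profile_distances(profiles, per_response):
    """Pairwise cosine similarity and Mahalanobis distance between model profiles."""
    inv_cov = np.linalg.pinv(np.cov(per_response, rowvar=False))  # accounts for correlated dimensions
    names = list(profiles)
    out = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            out[(a, b)] = {
                "cosine_similarity": 1 - cosine(profiles[a], profiles[b]),
                "mahalanobis": mahalanobis(profiles[a], profiles[b], inv_cov),
            }
    return out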

Why Do Frontier and Open-Weight Models Differ?

Why do GPT-5.2 and Claude have distinct personalities while Llama, Mistral, and Qwen converge? We explore this in depth in our analysis post, but here are the key hypotheses:

1. RLHF Divergence

GPT-5.2 and Claude have undergone extensive, proprietary RLHF with different objectives. OpenAI optimizes for task completion (high conscientiousness, ambition). Anthropic optimizes for intellectual engagement (high openness, curiosity). Open-weight models use more generic RLHF based on public preference datasets, converging toward a “median” personality.

2. Baked-In System Prompts

Frontier models have sophisticated default system prompts that shape personality before you start. Open-weight models ship “blank”—designed to be fine-tuned or prompted by users, so they don't impose a strong default personality.

3. Training Data Overlap

Llama, Mistral, and Qwen train on largely overlapping public datasets (Common Crawl, Wikipedia, books, code). Their personality convergence may reflect “the personality of the internet.” GPT-5.2 and Claude likely have significant proprietary data that differentiates them.

Read the full analysis: Why Do LLM Personalities Differ?

Conclusion

Our analysis reveals three key findings with strong statistical support:

  1. Frontier models have genuinely different personalities. Claude and GPT-5.2 differ by up to 1.4 standard deviations (Hedges' g) on key traits — a large, practically significant gap.
  2. Open-weight models are statistically indistinguishable. All pairwise effect sizes fall below 0.2, suggesting they've converged on a shared “default” personality profile.
  3. Prompt and context design matter more than model selection. With model identity explaining only 7.7% of variance, the way you evaluate (and deploy) LLMs has more impact than which model you choose.

See also: GPT-5.2 vs Claude Opus 4.5 Benchmark (our original 2-model study)

Monitor Your LLM Personality in Production

Route your LLM traffic through the Lindr gateway to continuously monitor personality drift, enforce brand consistency, and get real-time alerts when your AI's behavior changes.

import os
from openai import OpenAI

# Replace your OpenAI base URL with Lindr
client = OpenAI(
    base_url="https://gateway.lindr.io/v1",
    api_key=os.environ["LINDR_API_KEY"]
)

# Your existing code works unchanged
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "..."}]
)
#llm-benchmark #open-source #llama #mistral #qwen #personality #research #effect-size #statistics