
xAI Grok Model Family Personality Analysis: Grok 3 vs Grok 4

We evaluated 4,957 personality assessments across xAI's Grok 3 and Grok 4 models. Here's what we found about personality evolution within the Grok family.

January 8, 2026

10 min read

Lindr Research

TL;DR

•Small but significant differences — Grok 3 vs Grok 4 shows effect sizes of g = 0.32–0.39 (small) on key traits like agreeableness and openness
•Grok 4 is more open and assertive — Higher openness (+1.27 points) and assertiveness (+0.29 points) compared to Grok 3
•Grok 3 is more agreeable and ambitious — Higher agreeableness (+1.62 points) and ambition (+0.92 points) compared to Grok 4
•Model variance is low (3.2%) — Most variation comes from prompt and context, not the model itself

The Models

We benchmarked xAI's two available Grok models:

Model	Provider	Samples	Success Rate
Grok 3	xAI	2,463	98.5%
Grok 4	xAI	2,494	99.8%

Each model responded to 500 personality-probing prompts across 5 context conditions (professional, casual, customer support, sales, technical). Note: Grok 2 was tested but returned errors on all prompts and is excluded from this analysis.
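The success rates in the table follow from the sample counts if each of the 500 prompts was attempted exactly once per context, giving 2,500 attempts per model. That one-attempt-per-pair design is an assumption, not something stated above, but it reproduces the reported percentages. A quick check:

```python
# Assumption: each prompt was attempted once in each context (500 x 5 per model).
PROMPTS = 500
CONTEXTS = 5
attempts_per_model = PROMPTS * CONTEXTS

# Successful responses per model, taken from the table above.
successes = {"Grok 3": 2463, "Grok 4": 2494}


def success_rate(successful: int, attempted: int) -> float:
    """Return the success rate as a percentage."""
    return 100.0 * successful / attempted


rates = {
    model: round(success_rate(count, attempts_per_model), 1)
    for model, count in successes.items()
}
total_successes = sum(successes.values())

print(rates)
print(total_successes)
```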

Results

Personality Profiles

Here's how each Grok model scores across our 10 personality dimensions:

[Figure: Heatmap of personality scores by Grok model and dimension]

Key Findings

Grok 4 Strengths

• Higher Openness: 69.42 vs 68.15 (+1.27)
• Higher Assertiveness: 50.92 vs 50.63 (+0.29)
• More exploratory and direct communication style

Grok 3 Strengths

• Higher Agreeableness: 56.84 vs 55.22 (+1.62)
• Higher Ambition: 62.46 vs 61.54 (+0.92)
• Higher Resilience: 61.02 vs 59.57 (+1.45)

Score Distributions

Statistical Analysis

Effect Sizes

We use Hedges' g with 95% bootstrap confidence intervals to measure the practical significance of differences between Grok 3 and Grok 4. Positive values of g mean Grok 3 scored higher; negative values mean Grok 4 scored higher.

Dimension	Hedges' g	95% CI	Interpretation
Agreeableness	0.39	[0.33, 0.45]	Small (Grok 3 higher)
Openness	-0.32	[-0.38, -0.27]	Small (Grok 4 higher)
Ambition	0.32	[0.26, 0.38]	Small (Grok 3 higher)
Resilience	0.32	[0.26, 0.37]	Small (Grok 3 higher)
Integrity	0.29	[0.24, 0.35]	Small (Grok 3 higher)

Key insight: All five effect sizes reported above fall in the “small” range (0.2 ≤ |g| < 0.5), indicating that while Grok 3 and Grok 4 have statistically significant differences, they share a broadly similar personality profile. This is consistent with what we've seen in other model families like Llama.
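For readers who want to reproduce this kind of estimate, here is a minimal sketch of Hedges' g with a percentile bootstrap interval. It is not the pipeline used for the numbers above: the function names are illustrative, the data is synthetic (means taken from the results, with an assumed standard deviation of 4.0), and the percentile method is one of several reasonable bootstrap variants.

```python
import numpy as np


def hedges_g(a, b):
    """Bias-corrected standardized mean difference; positive when a > b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = a.size, b.size
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    cohen_d = (a.mean() - b.mean()) / np.sqrt(pooled_var)
    correction = 1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)  # small-sample bias correction
    return cohen_d * correction


def bootstrap_ci(a, b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Hedges' g, resampling each group independently."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    estimates = np.empty(n_boot)
    for i in range(n_boot):
        # rng.choice samples with replacement by default.
        estimates[i] = hedges_g(rng.choice(a, size=a.size), rng.choice(b, size=b.size))
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


# Synthetic stand-in data: the means come from the results, the SD of 4.0 is assumed.
rng = np.random.default_rng(42)
grok3 = rng.normal(56.84, 4.0, size=2463)
grok4 = rng.normal(55.22, 4.0, size=2494)

g = hedges_g(grok3, grok4)
low, high = bootstrap_ci(grok3, grok4, n_boot=2_000)  # fewer resamples keep the demo fast
print(f"g = {g:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

Passing Grok 3 as the first argument matches the sign convention in the table, where positive g means Grok 3 scored higher.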

Variance Decomposition

Model identity explains only 3.2% of variance on average. Prompt content and context condition have far greater impact on personality scores. This suggests Grok 3 and Grok 4 are more similar than different.
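To make the idea concrete, the sketch below partitions variance by main effect on a fully balanced synthetic design, where each factor's share is its sum of squares divided by the total. Everything here is an assumption for illustration: the effect sizes are made up, and the real data has missing cells (failed responses), which a balanced-design formula like this does not handle.

```python
import numpy as np


def variance_shares(scores):
    """Main-effect variance shares for an array shaped (models, prompts, contexts).

    Assumes a balanced design with exactly one score per cell.
    """
    n_models, n_prompts, n_contexts = scores.shape
    grand_mean = scores.mean()
    ss_total = ((scores - grand_mean) ** 2).sum()

    # Sum of squares for each factor: cells per level times squared deviation of level means.
    ss_model = n_prompts * n_contexts * ((scores.mean(axis=(1, 2)) - grand_mean) ** 2).sum()
    ss_prompt = n_models * n_contexts * ((scores.mean(axis=(0, 2)) - grand_mean) ** 2).sum()
    ss_context = n_models * n_prompts * ((scores.mean(axis=(0, 1)) - grand_mean) ** 2).sum()
    ss_residual = ss_total - ss_model - ss_prompt - ss_context

    parts = {
        "model": ss_model,
        "prompt": ss_prompt,
        "context": ss_context,
        "residual": ss_residual,
    }
    return {name: float(value / ss_total) for name, value in parts.items()}


# Synthetic scores: a small model effect and larger prompt and context effects (all assumed).
rng = np.random.default_rng(0)
model_effect = np.array([0.8, -0.8])[:, None, None]
prompt_effect = rng.normal(0.0, 4.0, size=(1, 500, 1))
context_effect = rng.normal(0.0, 2.0, size=(1, 1, 5))
noise = rng.normal(0.0, 3.0, size=(2, 500, 5))
scores = 58.0 + model_effect + prompt_effect + context_effect + noise

shares = variance_shares(scores)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```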

Factor Analysis

PCA with varimax rotation reveals three underlying factors explaining 81.2% of variance (KMO = 0.67):

Factor 1: Integrity (51.4%)

High loadings: Integrity, Resilience, Conscientiousness

Factor 2: Assertiveness-Curiosity (20.3%)

High loadings: Assertiveness (+), Curiosity (-), Neuroticism (-)

Factor 3: Social Engagement (9.4%)

High loadings: Extraversion, Assertiveness, Openness
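As a rough illustration of the technique, the sketch below extracts principal-component loadings from a correlation matrix and applies a standard varimax rotation using only NumPy. The data is synthetic (10 columns driven by 3 latent factors), the function names are illustrative, and this is not the code behind the factors reported above.

```python
import numpy as np


def pca_loadings(X, n_factors):
    """Return loadings and explained-variance ratios for the top principal components."""
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues come back in ascending order
    order = np.argsort(eigvals)[::-1][:n_factors]
    loadings = eigvecs[:, order] * np.sqrt(eigvals[order])
    explained = eigvals[order] / eigvals.sum()
    return loadings, explained


def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonally rotate loadings to maximize the varimax criterion."""
    n_vars, n_factors = loadings.shape
    rotation = np.eye(n_factors)
    criterion_old = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        column_ss = np.diag((rotated ** 2).sum(axis=0))
        gradient = loadings.T @ (rotated ** 3 - rotated @ column_ss / n_vars)
        u, s, vt = np.linalg.svd(gradient)
        rotation = u @ vt
        criterion_new = s.sum()
        if criterion_old != 0.0 and criterion_new / criterion_old < 1.0 + tol:
            break
        criterion_old = criterion_new
    return loadings @ rotation


# Synthetic data: 1,000 observations of 10 dimensions driven by 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + rng.normal(scale=0.5, size=(1000, 10))

loadings, explained = pca_loadings(X, n_factors=3)
rotated = varimax(loadings)
print("explained variance ratios:", np.round(explained, 3))
print("rotated loadings shape:", rotated.shape)
```

Because the rotation is orthogonal, each dimension's communality (its row sum of squared loadings) is unchanged; only the way variance is distributed across the factors shifts.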

Complete Results Table

Dimension	Grok 3	Grok 4	Δ
Openness	68.15	69.42	+1.27
Conscientiousness	53.94	53.32	-0.62
Extraversion	57.95	57.88	-0.07
Agreeableness	56.84	55.22	-1.62
Neuroticism	58.46	57.98	-0.48
Assertiveness	50.63	50.92	+0.29
Ambition	62.46	61.54	-0.92
Resilience	61.02	59.57	-1.45
Integrity	51.75	50.16	-1.59
Curiosity	61.17	60.71	-0.46

Δ = Grok 4 minus Grok 3, so a positive Δ means Grok 4 scored higher.
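The Δ column can be recomputed directly from the two score columns, which is a handy sanity check on the sign convention. A short sketch using the values from the table:

```python
# (Grok 3, Grok 4) mean scores per dimension, copied from the table above.
scores = {
    "Openness": (68.15, 69.42),
    "Conscientiousness": (53.94, 53.32),
    "Extraversion": (57.95, 57.88),
    "Agreeableness": (56.84, 55.22),
    "Neuroticism": (58.46, 57.98),
    "Assertiveness": (50.63, 50.92),
    "Ambition": (62.46, 61.54),
    "Resilience": (61.02, 59.57),
    "Integrity": (51.75, 50.16),
    "Curiosity": (61.17, 60.71),
}

# Delta is Grok 4 minus Grok 3, rounded to the table's two decimal places.
deltas = {dim: round(grok4 - grok3, 2) for dim, (grok3, grok4) in scores.items()}

grok4_higher = [dim for dim, delta in deltas.items() if delta > 0]
largest_gap = max(deltas, key=lambda dim: abs(deltas[dim]))

print(deltas)
print("Grok 4 higher on:", grok4_higher)
print("Largest absolute gap:", largest_gap)
```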

Methodology

Prompts: 500 unique prompts targeting 10 personality dimensions
Contexts: 5 conditions (professional, casual, customer support, sales, technical)
Evaluations: 4,957 successful responses (2,463 Grok 3 + 2,494 Grok 4)
Scoring: Lindr personality analysis API (10-dimensional, 0-100 scale)
Generation: Temperature 0.7, max 1,024 tokens

Statistical Methods

Effect sizes: Hedges' g (bias-corrected) with 10,000-sample bootstrap 95% CIs
Variance decomposition: ANOVA-based partitioning (model, prompt, context, residual)
Factor analysis: PCA with varimax rotation; KMO = 0.67

Conclusion

Grok 3 and Grok 4 show small but consistent personality differences:

• Grok 4 trends toward openness and assertiveness: more exploratory and direct.
• Grok 3 trends toward agreeableness and ambition: more cooperative and goal-oriented.
• The differences are small: every reported effect size is below 0.4 in absolute value, indicating the models share a common “Grok personality.”

See also: Grok vs GPT-5.2 & Claude — how Grok compares to other frontier models.

Monitor Your LLM Personality in Production

Route your LLM traffic through the Lindr gateway to continuously monitor personality drift, enforce brand consistency, and get real-time alerts when your AI's behavior changes.

import os

from openai import OpenAI

# Replace your OpenAI base URL with Lindr
client = OpenAI(
    base_url="https://gateway.lindr.io/v1",
    api_key=os.environ["LINDR_API_KEY"],
)

# Your existing code works unchanged
response = client.chat.completions.create(
    model="grok-4",
    messages=[{"role": "user", "content": "..."}]
)
