Research

xAI Grok Model Family Personality Analysis: Grok 3 vs Grok 4

We evaluated 4,957 personality assessments across xAI's Grok 3 and Grok 4 models. Here's what we found about personality evolution within the Grok family.

10 min read
Lindr Research

TL;DR

  • Small but significant differences — Grok 3 vs Grok 4 shows effect sizes of g = 0.32–0.39 (small) on key traits like agreeableness and openness
  • Grok 4 is more open and assertive — Higher openness (+1.27 points) and assertiveness (+0.29 points) compared to Grok 3
  • Grok 3 is more agreeable and ambitious — Higher agreeableness (+1.62 points) and ambition (+0.92 points) compared to Grok 4
  • Model variance is low (3.2%) — Most variation comes from prompt and context, not the model itself

The Models

We benchmarked xAI's two available Grok models:

| Model | Provider | Samples | Success Rate |
|---|---|---|---|
| Grok 3 | xAI | 2,463 | 98.5% |
| Grok 4 | xAI | 2,494 | 99.8% |

Each model responded to 500 personality-probing prompts across 5 context conditions (professional, casual, customer support, sales, technical). Note: Grok 2 was tested but returned errors on all prompts and is excluded from this analysis.

Results

Personality Profiles

Here's how each Grok model scores across our 10 personality dimensions:

Radar chart comparing personality profiles of Grok 3 and Grok 4
Heatmap of personality scores by Grok model and dimension

Key Findings

Grok 4 Strengths

  • Higher Openness: 69.42 vs 68.15 (+1.27)
  • Higher Assertiveness: 50.92 vs 50.63 (+0.29)
  • More exploratory and direct communication style

Grok 3 Strengths

  • Higher Agreeableness: 56.84 vs 55.22 (+1.62)
  • Higher Ambition: 62.46 vs 61.54 (+0.92)
  • Higher Resilience: 61.02 vs 59.57 (+1.45)

Score Distributions

Box plots showing score distributions by Grok model

Statistical Analysis

Effect Sizes

We use Hedges' g with 95% bootstrap confidence intervals to measure the practical significance of differences between Grok 3 and Grok 4.

Forest plot showing Hedges' g effect sizes for Grok 3 vs Grok 4

| Dimension | Hedges' g | 95% CI | Interpretation |
|---|---|---|---|
| Agreeableness | 0.39 | [0.33, 0.45] | Small (Grok 3 higher) |
| Openness | -0.32 | [-0.38, -0.27] | Small (Grok 4 higher) |
| Ambition | 0.32 | [0.26, 0.38] | Small (Grok 3 higher) |
| Resilience | 0.32 | [0.26, 0.37] | Small (Grok 3 higher) |
| Integrity | 0.29 | [0.24, 0.35] | Small (Grok 3 higher) |

Key insight: All five effect sizes reported above fall in the “small” range (0.2 ≤ |g| < 0.5). The differences between Grok 3 and Grok 4 are statistically significant (no confidence interval includes zero), but the two models share a broadly similar personality profile. This is consistent with what we've seen in other model families like Llama.
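For readers who want to reproduce this kind of estimate on their own scores, here is a minimal sketch of Hedges' g with a percentile bootstrap interval. It is an illustration under stated assumptions, not the pipeline used for this post: the function names are hypothetical, the scores are synthetic, the correction factor is the common approximation J = 1 - 3/(4·df - 1), and the resample count is kept at 1,000 for speed (the analysis here used 10,000).

```python
import math
import random


def _mean(xs):
    return sum(xs) / len(xs)


def _var(xs):
    """Unbiased sample variance."""
    m = _mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def hedges_g(a, b):
    """Bias-corrected standardized difference of means, mean(a) - mean(b)."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_sd = math.sqrt(((n_a - 1) * _var(a) + (n_b - 1) * _var(b)) / df)
    d = (_mean(a) - _mean(b)) / pooled_sd
    # Small-sample correction (common approximation).
    return d * (1 - 3 / (4 * df - 1))


def bootstrap_ci(a, b, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample each group with replacement."""
    rng = random.Random(seed)
    estimates = sorted(
        hedges_g(rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
        for _ in range(n_resamples)
    )
    lower = estimates[int(n_resamples * alpha / 2)]
    upper = estimates[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


# Synthetic scores for illustration only (NOT the study's data).
rng = random.Random(42)
group_a = [rng.gauss(57.0, 4.0) for _ in range(200)]
group_b = [rng.gauss(55.5, 4.0) for _ in range(200)]

g = hedges_g(group_a, group_b)
low, high = bootstrap_ci(group_a, group_b)
print(f"g = {g:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

With this argument order a positive g means the first group scores higher, which matches the sign convention in the table above (positive = Grok 3 higher).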

Variance Decomposition

Variance decomposition showing model vs prompt vs context contributions

Model identity explains only 3.2% of variance on average. Prompt content and context condition have far greater impact on personality scores. This suggests Grok 3 and Grok 4 are more similar than different.
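To make the idea of a variance share concrete, the sketch below partitions the total sum of squares into main effects for model, prompt, and context, and treats everything else (interactions and noise) as residual. It is a simplified illustration, not the exact procedure behind the 3.2% figure: the function name, the row format, and the toy scores are all assumptions, and the main-effect partition is only cleanly additive when the design is balanced.

```python
from collections import defaultdict
from itertools import product


def variance_shares(rows, factors=("model", "prompt", "context"), value="score"):
    """Fraction of the total sum of squares attributed to each main effect."""
    scores = [r[value] for r in rows]
    grand_mean = sum(scores) / len(scores)
    ss_total = sum((s - grand_mean) ** 2 for s in scores)

    shares = {}
    for factor in factors:
        groups = defaultdict(list)
        for r in rows:
            groups[r[factor]].append(r[value])
        # Between-group sum of squares for this factor.
        ss_factor = sum(
            len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
        )
        shares[factor] = ss_factor / ss_total
    shares["residual"] = 1 - sum(shares.values())
    return shares


# Toy balanced design with purely additive effects (hypothetical numbers).
model_effect = {"model-a": -1.0, "model-b": 1.0}
prompt_effect = {"p1": -6.0, "p2": 0.0, "p3": 6.0}
context_effect = {"casual": -3.0, "technical": 3.0}

rows = [
    {"model": m, "prompt": p, "context": c,
     "score": 50.0 + model_effect[m] + prompt_effect[p] + context_effect[c]}
    for m, p, c in product(model_effect, prompt_effect, context_effect)
]

for name, share in variance_shares(rows).items():
    print(f"{name:<9} {share:.1%}")
```

In this toy design the model effect is deliberately small relative to the prompt and context effects, so the model share comes out near 3% while prompt and context account for the rest. That is the same qualitative pattern described above.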

Factor Analysis

PCA with varimax rotation reveals three underlying factors explaining 81.2% of variance (KMO = 0.67):

Factor loadings heatmap

Factor 1: Integrity (51.4%)

High loadings: Integrity, Resilience, Conscientiousness

Factor 2: Assertiveness-Curiosity (20.3%)

High loadings: Assertiveness (+), Curiosity (-), Neuroticism (-)

Factor 3: Social Engagement (9.4%)

High loadings: Extraversion, Assertiveness, Openness

Complete Results Table

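| Dimension | Grok 3 | Grok 4 | Δ |
|---|---|---|---|
| Openness | 68.15 | **69.42** | +1.27 |
| Conscientiousness | **53.94** | 53.32 | -0.62 |
| Extraversion | **57.95** | 57.88 | -0.07 |
| Agreeableness | **56.84** | 55.22 | -1.62 |
| Neuroticism | **58.46** | 57.98 | -0.48 |
| Assertiveness | 50.63 | **50.92** | +0.29 |
| Ambition | **62.46** | 61.54 | -0.92 |
| Resilience | **61.02** | 59.57 | -1.45 |
| Integrity | **51.75** | 50.16 | -1.59 |
| Curiosity | **61.17** | 60.71 | -0.46 |

Bold = higher score. Δ = Grok 4 minus Grok 3.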
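One way to connect the raw point gaps in this table to the effect sizes reported earlier is to divide each gap by its |g|, which gives the pooled standard deviation implied by the two numbers. The snippet below is a back-of-envelope check, not part of the original analysis. It uses only the rounded figures published in this post, ignores the small-sample correction (negligible at roughly 2,500 samples per model), and works with absolute values because Δ is Grok 4 minus Grok 3 while positive g means Grok 3 is higher. The helper name is hypothetical.

```python
# (Grok 3 mean, Grok 4 mean, reported |g|), copied from the tables in this post.
REPORTED = {
    "Agreeableness": (56.84, 55.22, 0.39),
    "Openness": (68.15, 69.42, 0.32),
    "Ambition": (62.46, 61.54, 0.32),
    "Resilience": (61.02, 59.57, 0.32),
    "Integrity": (51.75, 50.16, 0.29),
}


def implied_pooled_sd(grok3_mean, grok4_mean, abs_g):
    """Approximate pooled SD implied by a mean gap and its effect size."""
    return abs(grok4_mean - grok3_mean) / abs_g


for dim, (g3, g4, abs_g) in REPORTED.items():
    delta = g4 - g3  # same sign convention as the Δ column
    sd = implied_pooled_sd(g3, g4, abs_g)
    print(f"{dim:<14} delta = {delta:+.2f}  implied SD ~ {sd:.1f}")
```

The implied standard deviations land at roughly 3 to 5.5 points on the 0-100 scale. Because both inputs are rounded, treat these as rough magnitudes rather than exact values.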

Methodology

  • Prompts: 500 unique prompts targeting 10 personality dimensions
  • Contexts: 5 conditions (professional, casual, customer support, sales, technical)
  • Evaluations: 4,957 successful responses (2,463 Grok 3 + 2,494 Grok 4)
  • Scoring: Lindr personality analysis API (10-dimensional, 0-100 scale)
  • Generation: Temperature 0.7, max 1,024 tokens
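The design above can be sketched as a simple grid of requests. This is an illustration under assumptions, not the harness we ran: it assumes every prompt is issued once per context (which matches the roughly 2,500 samples per model), the prompt texts are placeholders, `"grok-3"` is an assumed model identifier, and `max_tokens` is the OpenAI-style parameter name.

```python
from itertools import product

MODELS = ["grok-3", "grok-4"]  # "grok-3" is an assumed identifier
CONTEXTS = ["professional", "casual", "customer support", "sales", "technical"]
GENERATION = {"temperature": 0.7, "max_tokens": 1024}


def build_jobs(prompts, models=MODELS, contexts=CONTEXTS):
    """One job per (model, context, prompt) combination."""
    return [
        {"model": m, "context": c, "prompt_id": i, "prompt": p, **GENERATION}
        for m, c, (i, p) in product(models, contexts, enumerate(prompts))
    ]


def success_rate(successes, attempts):
    return successes / attempts


# Placeholder prompts; the real prompt set is not reproduced here.
prompts = [f"placeholder prompt {i}" for i in range(500)]
jobs = build_jobs(prompts)
per_model = len(jobs) // len(MODELS)

print(len(jobs), per_model)                     # 5000 2500
print(f"{success_rate(2463, per_model):.1%}")   # 98.5%
print(f"{success_rate(2494, per_model):.1%}")   # 99.8%
```

Under that assumption, 500 prompts × 5 contexts gives 2,500 attempts per model, and the success rates in the table at the top follow directly from the sample counts.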

Statistical Methods

  • Effect sizes: Hedges' g (bias-corrected) with 10,000-sample bootstrap 95% CIs
  • Variance decomposition: ANOVA-based partitioning (model, prompt, context, residual)
  • Factor analysis: PCA with varimax rotation; KMO = 0.67

Conclusion

Grok 3 and Grok 4 show small but consistent personality differences:

  1. Grok 4 trends toward openness and assertiveness — more exploratory and direct.
  2. Grok 3 trends toward agreeableness and ambition — more cooperative and goal-oriented.
  3. The differences are small — all effect sizes fall below 0.4, indicating the models share a common “Grok personality.”

See also: Grok vs GPT-5.2 & Claude — how Grok compares to other frontier models.

Monitor Your LLM Personality in Production

Route your LLM traffic through the Lindr gateway to continuously monitor personality drift, enforce brand consistency, and get real-time alerts when your AI's behavior changes.

import os

from openai import OpenAI

# Replace your OpenAI base URL with Lindr
client = OpenAI(
    base_url="https://gateway.lindr.io/v1",
    api_key=os.environ["LINDR_API_KEY"],  # read the key from the environment
)

# Your existing code works unchanged
response = client.chat.completions.create(
    model="grok-4",
    messages=[{"role": "user", "content": "..."}],
)