Tutorials · Featured

How to Monitor AI Behavior in Production

Your LLM passed all the evals. It shipped to production. Three weeks later, users are complaining about tone. Here's how to catch problems before your customers do.

10 min read
Lindr Engineering

Most teams treat LLM deployment like traditional software: write tests, check them in CI, ship when green. The problem is that LLMs don't fail like normal software. They don't crash or throw exceptions. They just... get weird. Responses become slightly off-brand. The tone shifts. Users can't pinpoint what changed, but something feels wrong.

Traditional monitoring tools (Datadog, New Relic, Prometheus) track the wrong things. They'll tell you latency is fine, error rates are zero, and tokens are within budget. Meanwhile, your support bot has started responding to frustrated customers with passive-aggressive politeness, and nobody knows until the NPS scores tank.

Behavioral monitoring is different. It treats the model's output as a signal to be measured, not just a payload to be delivered. Here's how to set it up.

What to Measure (And What Not To)

Every observability vendor has a dashboard full of LLM metrics. Time to first token. Completion tokens per request. Cost per conversation. These are useful for capacity planning and billing, but they don't tell you if your AI sounds right.

Behavioral monitoring requires a different set of metrics. Here's what actually predicts user experience:

Personality dimension scores

Track each dimension (agreeableness, assertiveness, etc.) as a time series. You're looking for drift from baseline, not absolute values. A 5-point shift in emotional stability over two weeks matters more than whether the current score is 70 or 75.
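
Here's a rough sketch of that calculation with pandas, assuming you already have a `scores` DataFrame of per-completion dimension scores indexed by timestamp (the dates are placeholders for whatever known-good baseline window you pick):

import pandas as pd

# Drift from a frozen baseline window vs. the most recent two weeks
baseline = scores.loc["2024-01-01":"2024-01-31"].mean()
recent = scores.loc[scores.index.max() - pd.Timedelta(days=14):].mean()

drift = recent - baseline
print(drift.round(1))  # e.g. emotional_stability: -5.2 is the signal, not the raw 70 vs. 75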

Response variance

High variance means inconsistent user experience. If the same question gets wildly different tones on different runs, users will notice. Standard deviation of personality scores per prompt type is a better signal than mean scores.
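
Assuming the same logged scores live in a DataFrame with a prompt_type label per row, the calculation is short:

# Spread of tone per prompt type: a high std means an inconsistent experience
spread = df.groupby("prompt_type")["agreeableness"].agg(["mean", "std"])
print(spread.sort_values("std", ascending=False))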

Conversation-level degradation

Does personality drift within a single conversation? Many models regress toward their base behavior after 5-8 turns as the persona prompt's influence gets diluted by the growing context. Track scores at turn 1, turn 5, and turn 10 separately.
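
A quick sketch of that comparison, assuming each row also records the assistant's turn number:

# Compare score distributions early vs. late in the conversation
checkpoints = df[df["turn"].isin([1, 5, 10])]
by_turn = checkpoints.groupby("turn")["agreeableness"].agg(["mean", "std"])
print(by_turn)  # a mean sliding back toward the base model at turn 10 = persona decay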

Stress response patterns

When users are angry, does the bot maintain composure? Segment your metrics by user sentiment (you can infer this from the input). Personality scores on negative-sentiment inputs reveal edge case behavior.
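
A sketch of that segmentation, assuming a hypothetical classify_sentiment() helper and a user_message column on the same DataFrame:

# The 'negative' bucket is the stress test
df["user_sentiment"] = df["user_message"].apply(classify_sentiment)
under_stress = df.groupby("user_sentiment")["emotional_stability"].agg(["mean", "count"])
print(under_stress)  # a large gap between 'negative' and 'neutral' rows = composure problem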

What you don't need: hallucination detection scores (useful, but orthogonal to personality), semantic similarity to reference answers (measures correctness, not tone), or user satisfaction surveys (too lagging and too noisy).

Architecture: Where to Put the Sensors

You have two options for capturing behavior data: inline (in the request path) or async (via log processing).

Option A: Proxy-based capture

Route all LLM traffic through a proxy that logs requests and responses. Tools like Helicone work this way. Change your base URL, and every call gets logged without code changes.

from openai import OpenAI

# Before: requests go straight to the provider
client = OpenAI(base_url="https://api.openai.com/v1")

# After: same requests, routed through the logging proxy
client = OpenAI(base_url="https://oai.helicone.ai/v1")

Adds roughly 50-80ms of latency per request. Works with any provider the proxy supports. The only application change is the base URL.

Option B: Async log processing

Log completions to a queue (Kafka, SQS, etc.) and process them out-of-band. Adds zero latency to the request path. Personality analysis happens seconds later, not milliseconds.

from datetime import datetime, timezone

response = client.chat.completions.create(...)

# Fire and forget: `queue` is whatever producer wraps Kafka, SQS, etc.
queue.send({
    "session_id": session_id,
    "prompt": messages,
    "completion": response.choices[0].message.content,
    "timestamp": datetime.now(timezone.utc).isoformat(),
})

Zero latency impact. Requires more infrastructure. Alerts are slightly delayed.

Most production systems use async processing. The latency hit from inline analysis (even fast analysis) compounds across millions of requests. And for personality monitoring, near-real-time is good enough. You're detecting drift over hours, not responding to individual bad completions.
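
On the consumer side, the worker that drains the queue can stay simple. Here's a rough sketch, where score_personality and metrics_store stand in for whatever scoring model and time-series store you actually use:

def handle_event(event: dict) -> None:
    """Turn one logged completion into personality metrics."""
    scores = score_personality(event["completion"])  # hypothetical scorer: {"agreeableness": 72, ...}
    metrics_store.write(                             # hypothetical time-series sink
        timestamp=event["timestamp"],
        session_id=event["session_id"],
        **scores,
    )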

Setting Up Alerts That Don't Cry Wolf

The hardest part of behavioral monitoring isn't collecting data. It's setting thresholds that catch real problems without flooding your Slack channel with noise.

Here's what works:

1. Alert on deviation, not absolute values

Don't alert when agreeableness drops below 70. Alert when it drops more than 2 standard deviations from the rolling 7-day mean. This catches drift while ignoring normal variance.

# Deviation alerting: compare the latest score to its 7-day rolling stats
# (`scores` is a pandas Series of one dimension's scores, indexed by timestamp)
rolling_mean = scores.rolling(window="7d").mean()
rolling_std = scores.rolling(window="7d").std()
z_score = (scores.iloc[-1] - rolling_mean.iloc[-1]) / rolling_std.iloc[-1]

if abs(z_score) > 2.0:
    fire_alert("personality_drift", dimension, z_score)

2. Use different thresholds for different dimensions

A 10% swing in curiosity (does the bot ask follow-up questions?) might be fine. A 10% swing in emotional stability for a crisis helpline bot is a five-alarm fire. Tune thresholds to business impact, not statistical uniformity.
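
One way to encode that, reusing the z-score check from step 1 (the dimension names and values below are illustrative, not recommendations):

# Looser thresholds where swings are tolerable, tighter where they aren't
ALERT_THRESHOLDS = {
    "curiosity": 3.0,            # follow-up questions can swing a lot
    "agreeableness": 2.0,        # default
    "emotional_stability": 1.5,  # composure is the whole product for a support bot
}

if abs(z_score) > ALERT_THRESHOLDS[dimension]:
    fire_alert("personality_drift", dimension, z_score)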

3. Require sustained deviation before alerting

A single batch of unusual requests can spike scores temporarily. Require that the deviation persists for N consecutive measurement windows (e.g., 3 hours) before firing an alert. This filters out transient noise.
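
A sketch of that debounce, building on the thresholds above (in practice the consecutive counters would be persisted between runs rather than recreated in-process):

from collections import defaultdict

REQUIRED_WINDOWS = 3            # e.g. three consecutive hourly windows
consecutive = defaultdict(int)  # per-dimension count of out-of-band windows

if abs(z_score) > ALERT_THRESHOLDS[dimension]:
    consecutive[dimension] += 1
else:
    consecutive[dimension] = 0

if consecutive[dimension] >= REQUIRED_WINDOWS:
    fire_alert("personality_drift", dimension, z_score)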

4. Segment alerts by traffic source

If you serve multiple use cases (web chat, mobile app, API integrations), personality profiles might legitimately differ. Alert on per-segment drift, not global averages. A change in API traffic patterns shouldn't trigger alerts for your consumer chat experience.
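
In code, that just means running the same check once per segment. Here scores_df is assumed to carry a segment column, and compute_z_score stands in for the rolling z-score logic from step 1:

# One drift check per (segment, dimension), never on the global average
for segment, seg_scores in scores_df.groupby("segment"):
    z = compute_z_score(seg_scores[dimension])
    if abs(z) > ALERT_THRESHOLDS[dimension]:
        fire_alert("personality_drift", dimension, z, segment=segment)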

Guardrails vs. Monitoring: Different Problems

Teams often confuse guardrails with monitoring. They're complementary, but they solve different problems.

Guardrails

• Block bad outputs in real-time
• Binary decisions (allow/block)
• Focus on safety and compliance
• Add latency to every request
• Catch individual failures

Monitoring

• Detect trends over time
• Continuous measurements
• Focus on quality and consistency
• Runs async, no latency impact
• Catch systemic drift

A guardrail might block a single response that contains profanity. Monitoring catches the fact that your bot has become 15% more assertive over the past month, even though no individual response crossed any safety threshold.

Tools like NeMo Guardrails handle the blocking side well. But they don't answer the question: "Is my AI's personality stable over time?" That requires a monitoring layer on top.

Sample Monitoring Dashboard

Here's a minimal dashboard layout that surfaces the metrics that matter:

Row 1: Current State

• Sessions (24h): 12,847
• Drift Score: 0.08
• Alert Status: Healthy
• Last Anomaly: 6d ago

Row 2: Dimension Trends (7d)

Sparklines for each personality dimension. Show baseline (dotted) vs. current (solid). Red highlighting when outside tolerance bands.

Row 3: Variance by Conversation Turn

Box plots showing score distributions at turn 1, 5, and 10. Widening boxes indicate increasing unpredictability later in conversations.

Row 4: Recent Alerts

Table of last 10 alerts with dimension, severity, duration, and resolution status. Link to session samples that triggered each alert.

When Things Go Wrong: Incident Response

You got an alert. Now what? Here's a runbook:

1. Verify it's real. Check if the deviation is sustained or if it's a single batch of weird requests. Look at the raw sessions that contributed to the alert.
2. Check for upstream changes. Did the model provider push an update? Did someone change the system prompt? Did a new feature ship that sends different types of requests?
3. Isolate the traffic segment. Is it all traffic or just mobile users? Just one region? Narrow the scope before reacting.
4. Decide whether to intervene. Small drift on low-stakes dimensions might be acceptable. Large drift on high-stakes dimensions (support bot getting aggressive) requires immediate action.
5. If intervening: adjust prompts, not guardrails. Tightening guardrails treats symptoms. Updating the persona prompt addresses the root cause. Roll out gradually and monitor the correction.
Document every incident. Over time, you'll build a library of drift patterns and fixes that lets you respond faster.

The Bottom Line

LLM behavior monitoring isn't about preventing catastrophic failures. Guardrails handle that. It's about noticing the slow drift that erodes user trust over weeks. The changes are subtle enough that no single response triggers a complaint, but the aggregate effect is real.

Start simple: pick 2-3 personality dimensions that matter most for your use case, log a sample of completions, and chart the scores over time. You'll be surprised how much the numbers move, even when nothing "breaks."
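
A minimal version of that, assuming a hypothetical score_dimensions() scorer and a completion_logs iterable of logged completions (plotting needs matplotlib installed):

import random
import pandas as pd

sampled = [log for log in completion_logs if random.random() < 0.05]  # ~5% sample
rows = [{"timestamp": log["timestamp"], **score_dimensions(log["completion"])} for log in sampled]

scores = pd.DataFrame(rows)
scores["timestamp"] = pd.to_datetime(scores["timestamp"])
scores = scores.set_index("timestamp").sort_index()
scores.resample("1D").mean().plot()  # daily mean per dimension, charted over time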

Need behavioral monitoring out of the box?

Lindr provides continuous personality monitoring with drift alerts, dimension tracking, and incident reports. One integration, and you'll see exactly how your AI behaves in production.

#production-monitoring #observability #ai-monitoring #brand-consistency #drift-detection #alerting