GoofyCubes

Chapter 23

Production Monitoring and Observability

Learning Objective

Learn what to monitor after a GenAI system goes live.

What it means

Observability is the ability to understand what a system is doing in production from the signals it emits, such as logs, metrics, and traces. For GenAI, monitoring must cover application health, model behavior, retrieval quality, cost, latency, security events, and user feedback.

Why it matters

Model outputs can drift, documents can become outdated, costs can spike, and users can receive poor responses. Monitoring allows teams to detect and fix issues quickly.

Healthcare Example

If users frequently correct AI summaries for missing medications, the team should review retrieval, prompt instructions, and evaluation data. This feedback loop improves quality over time.
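The feedback loop above can be made measurable by counting how often reviewers correct each part of a summary. The following is a minimal sketch, assuming feedback is stored as one record per reviewed summary; the `corrected_fields` key and the field names are hypothetical and not taken from any particular system.

```python
from collections import Counter


def correction_rate_by_field(feedback_events):
    """Return, per field, the fraction of reviewed summaries that were corrected."""
    total = len(feedback_events)
    if total == 0:
        return {}
    counts = Counter()
    for event in feedback_events:
        # Count each corrected field at most once per summary.
        for field in set(event.get("corrected_fields", [])):
            counts[field] += 1
    return {field: n / total for field, n in counts.items()}


# Hypothetical feedback records for four reviewed summaries.
events = [
    {"summary_id": 1, "corrected_fields": ["medications"]},
    {"summary_id": 2, "corrected_fields": []},
    {"summary_id": 3, "corrected_fields": ["medications", "allergies"]},
    {"summary_id": 4, "corrected_fields": []},
]
rates = correction_rate_by_field(events)
print(rates)
```

A persistently high rate for one field, such as medications, points the team at where to look first: retrieval, prompt instructions, or evaluation data.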

Metrics to Monitor

  • Latency: user experience and batch throughput.
  • Token usage: cost control.
  • Retrieval score: RAG quality.
  • JSON validity: integration reliability.
  • Human review rate: confidence and risk routing.
  • Correction rate: quality feedback.
  • Safety blocks: security and policy enforcement.
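Several of these metrics can be captured together as one record per model call. The sketch below is illustrative only: the `RequestMetrics` fields, the `build_metrics` helper, and the `review_threshold` of 0.5 are assumptions chosen for the example, and a real system would take token counts and retrieval scores from its own model and retriever responses.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RequestMetrics:
    """One monitoring record per model call (field names are illustrative)."""
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    top_retrieval_score: float
    json_valid: bool
    needs_human_review: bool
    safety_blocked: bool


def is_valid_json(text):
    """Return True if text parses as JSON."""
    try:
        json.loads(text)
        return True
    except (json.JSONDecodeError, TypeError):
        return False


def build_metrics(raw_output, latency_ms, prompt_tokens, completion_tokens,
                  top_retrieval_score, safety_blocked=False,
                  review_threshold=0.5):
    """Assemble a metrics record; review_threshold is an assumed cutoff."""
    json_valid = is_valid_json(raw_output)
    # Route to human review when the output is unusable or retrieval was weak.
    needs_review = (not json_valid) or top_retrieval_score < review_threshold
    return RequestMetrics(
        latency_ms=latency_ms,
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        top_retrieval_score=top_retrieval_score,
        json_valid=json_valid,
        needs_human_review=needs_review,
        safety_blocked=safety_blocked,
    )


record = build_metrics('{"summary": "ok"}', 412.7, 850, 120, 0.82)
# Emit as a structured log line that a dashboard or log pipeline can aggregate.
print(json.dumps(asdict(record)))
```

Emitting one flat record per call keeps aggregation simple: human review rate, JSON validity rate, and token usage all become averages or sums over the same log stream.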

Code: Response Timing

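import time

def timed_call(func, *args, **kwargs):
    """Call func and return (result, elapsed time in milliseconds)."""
    # perf_counter is monotonic and high-resolution, so the measurement
    # is not affected by system clock adjustments the way time.time() is.
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = round((time.perf_counter() - start) * 1000, 2)
    return result, elapsed_ms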
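Individual timings become useful once they are aggregated, because averages hide slow outliers. As a rough sketch, assuming latencies in milliseconds have been collected into a list, percentiles can be computed with the standard library; the `latency_summary` name and the choice of p50 and p95 are illustrative.

```python
from statistics import quantiles


def latency_summary(latencies_ms):
    """Summarize collected latencies as median, 95th percentile, and maximum."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index 49 is p50 and index 94 is p95.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "max": max(latencies_ms)}


# Hypothetical sample: 101 calls with latencies of 1 ms to 101 ms.
samples = list(range(1, 102))
summary = latency_summary(samples)
print(summary)
```

Tracking p95 alongside the median shows whether a minority of requests is much slower than the typical one, which matters for interactive use.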

Common Mistakes

  • Monitoring only server uptime.
  • No model quality metrics.
  • No cost dashboards.
  • No user feedback loop.
  • No alerting on abnormal usage.
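The last mistake in the list, missing alerts on abnormal usage, can be addressed with even a very simple check. The sketch below flags a value that sits far from recent history using a z-score; the function name, the threshold of 3.0, and the minimum of 5 samples are assumptions for illustration, and production systems often rely on the alerting rules of their monitoring platform instead.

```python
from statistics import mean, stdev


def is_abnormal(history, current, z_threshold=3.0, min_samples=5):
    """Return True if current deviates strongly from recent history."""
    if len(history) < min_samples:
        # Too little data to judge; do not alert.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # History is constant, so any change counts as abnormal.
        return current != mu
    return abs(current - mu) / sigma > z_threshold


# Hypothetical hourly token usage for the last six hours.
hourly_tokens = [1000, 1100, 950, 1050, 1020, 980]
spike_alert = is_abnormal(hourly_tokens, 5000)
normal_alert = is_abnormal(hourly_tokens, 1030)
print(spike_alert, normal_alert)
```

The same check can be applied to other per-hour series such as cost, safety blocks, or error counts.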

Interview Q&A

Q: What do you monitor in a GenAI system?

A: Latency, cost, token usage, model errors, retrieval quality, output validity, safety events, confidence score, human review rate, and user corrections.

Q: Why is monitoring different for GenAI?

A: A GenAI system can be technically healthy and still return wrong, ungrounded, or unsafe answers. In addition to technical health, we must monitor answer quality, grounding, safety, and cost.

Architect Takeaway

Production GenAI monitoring must track both system performance and answer quality.