Chapter 23
Production Monitoring and Observability
Learning Objective
Learn what to monitor after a GenAI system goes live.
What it means
Observability means understanding what the system is doing in production. For GenAI, monitoring must cover application health, model behavior, retrieval quality, cost, latency, security events, and user feedback.
Why it matters
Model outputs can drift, documents can become outdated, costs can spike, and users can receive poor responses. Monitoring allows teams to detect and fix issues quickly.
Healthcare Example
If users frequently correct AI summaries for missing medications, the team should review retrieval, prompt instructions, and evaluation data. This feedback loop improves quality over time.
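One way to make this feedback loop measurable is to compute a correction rate per category, so that a recurring problem such as missing medications stands out. The following is a minimal sketch, not a definitive implementation: the event fields ("corrected", "category"), the category names, and the 25% review threshold are all assumptions made for illustration.

```python
from collections import Counter


def correction_rates(feedback_events):
    """Return the share of all summaries that were corrected, per category.

    Each event is a dict with hypothetical keys:
      "corrected": bool
      "category":  str or None, e.g. "missing_medication"
    """
    total = len(feedback_events)
    if total == 0:
        return {}
    counts = Counter(
        e["category"]
        for e in feedback_events
        if e["corrected"] and e.get("category")
    )
    return {category: n / total for category, n in counts.items()}


# Made-up feedback events, for illustration only.
events = [
    {"corrected": True, "category": "missing_medication"},
    {"corrected": False, "category": None},
    {"corrected": True, "category": "missing_medication"},
    {"corrected": True, "category": "wrong_dosage"},
]

rates = correction_rates(events)
# Flag categories whose rate exceeds an illustrative 25% threshold.
needs_review = sorted(c for c, r in rates.items() if r > 0.25)
print(needs_review)
```

A category that stays above the threshold is a signal to review retrieval, prompt instructions, and evaluation data for that kind of error.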
Metrics to Monitor
- Latency: user experience and batch throughput
- Token usage: cost control
- Retrieval score: RAG quality
- JSON validity: integration reliability
- Human review rate: confidence and risk routing
- Correction rate: quality feedback
- Safety blocks: security and policy enforcement
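The metrics above could be captured as one record per request and sent to whatever logging or metrics system the team already uses. The sketch below is illustrative only: the RequestMetrics fields, the build_metrics helper, the 0.7 review threshold, and the per-token prices are assumptions, not values from any particular provider.

```python
import json
from dataclasses import dataclass, asdict

# Illustrative prices only; real prices depend on the model and provider.
PRICE_PER_1K_INPUT_TOKENS = 0.001
PRICE_PER_1K_OUTPUT_TOKENS = 0.002


@dataclass
class RequestMetrics:
    latency_ms: float
    input_tokens: int
    output_tokens: int
    estimated_cost_usd: float
    top_retrieval_score: float
    json_valid: bool
    sent_to_human_review: bool
    safety_blocked: bool


def is_valid_json(text):
    """Return True if the model output parses as JSON."""
    try:
        json.loads(text)
        return True
    except (json.JSONDecodeError, TypeError):
        return False


def estimate_cost(input_tokens, output_tokens):
    """Estimate request cost in USD from token counts."""
    cost = (input_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
            + output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS)
    return round(cost, 6)


def build_metrics(raw_output, latency_ms, input_tokens, output_tokens,
                  top_retrieval_score, confidence,
                  safety_blocked=False, review_threshold=0.7):
    """Assemble one metrics record; low confidence routes to human review."""
    return RequestMetrics(
        latency_ms=latency_ms,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        estimated_cost_usd=estimate_cost(input_tokens, output_tokens),
        top_retrieval_score=top_retrieval_score,
        json_valid=is_valid_json(raw_output),
        sent_to_human_review=confidence < review_threshold,
        safety_blocked=safety_blocked,
    )


metrics = build_metrics('{"summary": "ok"}', latency_ms=850.0,
                        input_tokens=1200, output_tokens=300,
                        top_retrieval_score=0.82, confidence=0.65)
# One JSON line per request is easy to ship to a log pipeline.
print(json.dumps(asdict(metrics)))
```

Keeping all fields in a single record makes it possible to relate them later, for example to check whether low retrieval scores coincide with a higher human review rate.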
Code: Response Timing
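import time

def timed_call(func, *args, **kwargs):
    """Call func and return (result, elapsed time in milliseconds)."""
    # perf_counter is a monotonic clock intended for measuring intervals;
    # time.time() can jump if the system clock is adjusted.
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = round((time.perf_counter() - start) * 1000, 2)
    return result, elapsed_ms
Common Mistakes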
- Monitoring only server uptime.
- No model quality metrics.
- No cost dashboards.
- No user feedback loop.
- No alerting on abnormal usage.
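As one possible approach to the last point, abnormal usage can be flagged by comparing the current value with a recent baseline. The sketch below uses a simple z-score check on hourly token counts; the is_abnormal function, the threshold of 3 standard deviations, the minimum of 5 data points, and the sample numbers are all assumptions chosen for illustration. A production system would more likely rely on the alerting rules of its monitoring platform.

```python
from statistics import mean, stdev


def is_abnormal(history, current, z_threshold=3.0, min_points=5):
    """Return True if `current` is far above the recent baseline.

    history: recent values of the metric (e.g. tokens used per hour).
    current: the newest value to check.
    """
    if len(history) < min_points:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any increase counts as abnormal.
        return current > baseline
    return (current - baseline) / spread > z_threshold


# Made-up hourly token counts, for illustration only.
hourly_tokens = [10_200, 9_800, 10_500, 10_100, 9_900, 10_300]

print(is_abnormal(hourly_tokens, 10_400))  # within the normal range
print(is_abnormal(hourly_tokens, 25_000))  # a sharp spike
```

The same check can be applied to cost, latency, or safety block counts; only upward deviations are flagged here because spikes are the usual concern for usage and cost.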
Interview Q&A
Q: What do you monitor in a GenAI system?
A: Latency, cost, token usage, model errors, retrieval quality, output validity, safety events, confidence score, human review rate, and user corrections.
Q: Why is monitoring different for GenAI?
A: A GenAI system can be technically healthy and still return poor, ungrounded, or unsafe answers. In addition to technical health, we must monitor answer quality, grounding, safety, and cost.
Architect Takeaway
Production GenAI monitoring must track both system performance and answer quality.