Chapter 16
Error Handling, Logging, and Evaluation
Learning Objective
Learn how to make GenAI behavior measurable and supportable in production.
What it means
Production systems must handle failures gracefully. GenAI services can fail due to model timeouts, invalid outputs, retrieval issues, token limits, network errors, or unsafe content detection. Evaluation measures whether outputs are accurate, grounded, safe, and useful.
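The failure modes above can be handled in one wrapper that retries transient errors and falls back otherwise. This is a minimal sketch, not a definitive implementation: the exception classes `ModelTimeout` and `InvalidOutput`, the function `call_with_retry`, and the result shape are all hypothetical names chosen for illustration.

```python
import time


class ModelTimeout(Exception):
    """Hypothetical error for a model call that exceeded its deadline."""


class InvalidOutput(Exception):
    """Hypothetical error for a response that failed validation."""


def call_with_retry(call_model, max_attempts=3, base_delay=0.5):
    """Retry timeouts with exponential backoff; otherwise use a fallback."""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"status": "ok", "output": call_model(), "attempts": attempt}
        except ModelTimeout:
            # Treated as transient: wait, then try again if attempts remain.
            if attempt < max_attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))
        except InvalidOutput as ex:
            # In this sketch invalid output is not retried; it goes straight
            # to the fallback path (for example, human review).
            return {"status": "fallback", "reason": str(ex), "attempts": attempt}
    # Every attempt timed out.
    return {"status": "fallback", "reason": "timeout", "attempts": max_attempts}
```

The caller always receives a dictionary with a `status` field, so a timeout or malformed response becomes a routing decision rather than an unhandled exception.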
Why it matters
Without logging and evaluation, teams cannot debug hallucinations, cost spikes, latency issues, or bad outputs. Every important request should have a trace ID and enough metadata to investigate later without exposing sensitive data.
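One way to keep a request traceable without storing its raw text is to log a fingerprint of the prompt instead of the prompt itself. The sketch below assumes a hypothetical helper `build_log_record` and made-up version strings. A plain hash does not anonymize short or guessable inputs, so treat this as an illustration of the idea rather than a privacy control.

```python
import hashlib
import json
import uuid


def build_log_record(prompt, prompt_version, model_version):
    """Return log-safe metadata: a fingerprint of the prompt, not its text."""
    return {
        "trace_id": str(uuid.uuid4()),
        "prompt_version": prompt_version,
        "model_version": model_version,
        # The hash lets two log entries be matched to the same prompt
        # without writing the prompt text to the log.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_chars": len(prompt),
    }


# Illustrative values only; the prompt and version labels are invented.
record = build_log_record(
    "Summarize the chart for patient Jane Doe.", "v3", "model-a"
)
print(json.dumps(record))
```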
Healthcare Example
A case reviewer reports that a summary missed an important medication. Logs should show document version, prompt version, retrieved chunks, model version, confidence score, and route decision.
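The fields listed above could be captured in a single record per summary. The following is a sketch under assumptions: the class `SummaryTrace`, the helper `chunk_was_retrieved`, and the chunk ID format are hypothetical. It shows how such a record helps separate a retrieval problem (the medication chunk was never fetched) from a generation problem (it was fetched but left out of the summary).

```python
from dataclasses import asdict, dataclass, field


@dataclass
class SummaryTrace:
    """Hypothetical per-request record holding the fields named above."""
    trace_id: str
    document_version: str
    prompt_version: str
    model_version: str
    retrieved_chunk_ids: list = field(default_factory=list)
    confidence: float = 0.0
    route: str = "review"


def chunk_was_retrieved(trace, chunk_id):
    """True if the chunk holding the missing fact reached the model."""
    return chunk_id in trace.retrieved_chunk_ids


# Illustrative values only.
trace = SummaryTrace(
    trace_id="t-001",
    document_version="doc-v4",
    prompt_version="v3",
    model_version="model-a",
    retrieved_chunk_ids=["doc-v4#c1", "doc-v4#c5"],
    confidence=0.91,
    route="review",
)

# Suppose the medication list lives in chunk c2 of the document.
if chunk_was_retrieved(trace, "doc-v4#c2"):
    print("Chunk was retrieved: inspect the prompt and model output.")
else:
    print("Chunk was not retrieved: inspect the retrieval step.")
```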
Code: Structured Logging with Trace ID
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("genai-service")

def process_case(case_id, prompt_version):
    # One trace ID per request ties every log line for this case together.
    trace_id = str(uuid.uuid4())
    base = {"trace_id": trace_id, "case_id": case_id}
    try:
        # Serialize to JSON so log tooling can parse the individual fields.
        logger.info(json.dumps({**base, "event": "start", "prompt_version": prompt_version}))
        # Placeholder result; a real service would call the model here.
        result = {"confidence": 0.91, "route": "review"}
        logger.info(json.dumps({**base, "event": "complete", **result}))
        return result
    except Exception as ex:
        # logger.exception also records the stack trace.
        logger.exception(json.dumps({**base, "event": "failed", "error": str(ex)}))
        raise
Common Mistakes
- Logging sensitive raw prompts.
- No trace ID.
- No prompt versioning.
- No evaluation dataset.
- No regression testing after prompt changes.
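The last two mistakes can be addressed together with a small evaluation dataset that is rerun whenever the prompt changes. The sketch below is illustrative only: `EVAL_CASES`, `summarize`, and `run_regression` are hypothetical names, the cases are invented, and the stand-in summarizer simply echoes its input so the example runs without a model.

```python
# Invented evaluation cases: each lists terms the summary must mention.
EVAL_CASES = [
    {"input": "Patient takes metformin 500 mg daily.",
     "must_mention": ["metformin"]},
    {"input": "Prescribed lisinopril; allergic to penicillin.",
     "must_mention": ["lisinopril", "penicillin"]},
]


def summarize(text, prompt_version):
    """Stand-in for the real model call; it echoes the input unchanged."""
    return text


def run_regression(summarizer, prompt_version, cases=EVAL_CASES):
    """Return the cases whose output is missing a required term."""
    failures = []
    for case in cases:
        output = summarizer(case["input"], prompt_version).lower()
        missing = [t for t in case["must_mention"] if t not in output]
        if missing:
            failures.append({"input": case["input"], "missing": missing})
    return failures


failures = run_regression(summarize, prompt_version="v2")
print("failures:", failures)
```

A keyword check like this is deliberately crude. It catches omissions such as a dropped medication, but it does not measure groundedness or overall quality.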
Interview Q&A
Q: How do you evaluate a GenAI system?
A: I evaluate accuracy, groundedness, citation quality, safety, latency, cost, JSON validity, and human correction rate using test datasets and production monitoring.
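Some of the metrics in this answer can be computed directly from logged records. The following sketch assumes a hypothetical record shape with `output`, `corrected`, and `latency_ms` keys, and covers only JSON validity, human correction rate, and latency. Accuracy, groundedness, and safety need labeled data or separate checks and are not shown.

```python
import json


def is_valid_json(text):
    """True if the model output string parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def summarize_metrics(records):
    """Aggregate simple rates over a list of logged records."""
    n = len(records)
    if n == 0:
        return {}
    latencies = [r["latency_ms"] for r in records]
    return {
        "json_validity_rate": sum(is_valid_json(r["output"]) for r in records) / n,
        "human_correction_rate": sum(bool(r["corrected"]) for r in records) / n,
        "mean_latency_ms": sum(latencies) / n,
        "max_latency_ms": max(latencies),
    }


# Invented records for illustration.
records = [
    {"output": '{"route": "review"}', "corrected": False, "latency_ms": 100},
    {"output": '{"route": "auto"}', "corrected": True, "latency_ms": 200},
    {"output": "route: review", "corrected": False, "latency_ms": 300},
    {"output": '{"route": "auto"}', "corrected": False, "latency_ms": 400},
]
print(summarize_metrics(records))
```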
Q: What should be logged?
A: Trace ID, timestamp, prompt version, model version, retrieval metadata, confidence score, route, latency, and errors, while avoiding sensitive raw data unless approved.
Architect Takeaway
If you cannot measure it, you cannot trust it. Evaluation and observability are core architecture components, not optional add-ons.