Chapter 16

Error Handling, Logging, and Evaluation

Learning Objective

Learn how to make GenAI behavior measurable and supportable in production.

What it means

Production systems must handle failures gracefully. GenAI services can fail due to model timeouts, invalid outputs, retrieval issues, token limits, network errors, or unsafe content detection. Evaluation measures whether outputs are accurate, grounded, safe, and useful.
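
The failure modes above can be handled with a small retry-and-validate wrapper. The following is a minimal sketch, not a definitive implementation: ModelTimeout, InvalidOutput, parse_output, call_with_retry, the required keys, and the "manual_review" route are all hypothetical names chosen for illustration, and call_model stands in for whatever client the service actually uses.

```python
import json
import time


class ModelTimeout(Exception):
    """Hypothetical error raised when the model call exceeds its deadline."""


class InvalidOutput(Exception):
    """Hypothetical error raised when the model output fails validation."""


def parse_output(raw, required_keys=("summary", "confidence")):
    # Reject output that is not a JSON object or is missing expected fields.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as ex:
        raise InvalidOutput(f"not valid JSON: {ex}") from ex
    if not isinstance(data, dict):
        raise InvalidOutput("expected a JSON object")
    missing = [k for k in required_keys if k not in data]
    if missing:
        raise InvalidOutput(f"missing keys: {missing}")
    return data


def call_with_retry(call_model, prompt, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    # Retry transient failures (timeouts, invalid output) with exponential backoff.
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return parse_output(call_model(prompt))
        except (ModelTimeout, InvalidOutput) as ex:
            last_error = ex
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    # Fail safe: route to a human instead of returning an unvalidated answer.
    return {"route": "manual_review", "error": type(last_error).__name__}
```

Only the two error types assumed to be transient are retried; any other exception propagates to the caller so it can be logged and investigated.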

Why it matters

Without logging and evaluation, teams cannot debug hallucinations, cost spikes, latency issues, or bad outputs. Every important request should have a trace ID and enough metadata to investigate later without exposing sensitive data.
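
One way to keep enough metadata for investigation without storing sensitive text is to log a fingerprint of the prompt instead of the prompt itself. This is a sketch under assumptions: build_log_record and its field names are hypothetical, and a plain hash may not be sufficient protection for short or guessable inputs, so an approved redaction policy should take precedence.

```python
import hashlib
import uuid
from datetime import datetime, timezone


def build_log_record(prompt_text, prompt_version, model_version, trace_id=None):
    # Store a fingerprint and the size of the prompt, never the raw text,
    # so identical prompts can be matched later without exposing their content.
    digest = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()
    return {
        "trace_id": trace_id or str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt_version": prompt_version,
        "model_version": model_version,
        "prompt_sha256": digest,
        "prompt_chars": len(prompt_text),
    }
```

Two requests with the same prompt text produce the same prompt_sha256, which helps group related failures during an investigation.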

Healthcare Example

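A case reviewer reports that a summary missed an important medication. The logs for that request should show the document version, prompt version, retrieved chunks, model version, confidence score, and route decision, so the team can tell whether the medication was absent from the retrieved chunks or was retrieved and then dropped by the model.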

Code: Structured Logging with Trace ID

import json
import logging
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("genai-service")

def process_case(case_id, prompt_version):
    # One trace ID per request links every log line for this case.
    trace_id = str(uuid.uuid4())
    base = {"trace_id": trace_id, "case_id": case_id}
    try:
        # Emit JSON strings so log tools can parse the fields, not a Python dict repr.
        logger.info(json.dumps({**base, "event": "start", "prompt_version": prompt_version}))
        # Placeholder result standing in for the model call.
        result = {"confidence": 0.91, "route": "review"}
        logger.info(json.dumps({**base, "event": "complete", **result}))
        return result
    except Exception as ex:
        # logger.exception records the stack trace; re-raise so the caller sees the failure.
        logger.exception(json.dumps({**base, "event": "failed", "error": str(ex)}))
        raise

Common Mistakes

Logging sensitive raw prompts.
No trace ID.
No prompt versioning.
No evaluation dataset.
No regression testing after prompt changes.
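
The last two mistakes can be addressed with a small evaluation dataset and a regression check that runs whenever a prompt changes. The sketch below is illustrative only: EVAL_SET is made-up data, evaluate and check_regression are hypothetical helpers, and the generate argument stands in for the real prompt-plus-model call. Checking for required terms is a deliberately simple scoring rule, not a full groundedness metric.

```python
# Made-up evaluation cases; a real dataset would come from reviewed examples.
EVAL_SET = [
    {"id": "case-1", "input": "Patient takes metformin and lisinopril.",
     "must_include": ["metformin", "lisinopril"]},
    {"id": "case-2", "input": "Patient takes warfarin.",
     "must_include": ["warfarin"]},
]


def evaluate(generate, dataset):
    # Score each case by whether every required term appears in the output.
    results = []
    for case in dataset:
        output = generate(case["input"]).lower()
        missing = [t for t in case["must_include"] if t.lower() not in output]
        results.append({"id": case["id"], "passed": not missing, "missing": missing})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results


def check_regression(baseline_rate, candidate_rate, tolerance=0.0):
    # A prompt change passes only if it does not score below the baseline.
    return candidate_rate >= baseline_rate - tolerance
```

Running evaluate for both the current and the candidate prompt version, then passing the two pass rates to check_regression, gives a simple gate that can block a prompt change before release.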

Interview Q&A

Q: How do you evaluate a GenAI system?

A: I evaluate accuracy, groundedness, citation quality, safety, latency, cost, JSON validity, and human correction rate using test datasets and production monitoring.
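
Several of these measures can be computed directly from logged records. The following sketch assumes a hypothetical record shape with the fields raw_output, corrected, latency_ms, and cost_usd; the summarize function and its output keys are illustrative names, not an established API.

```python
import json
import statistics


def summarize(records):
    # Each record is assumed to hold: raw_output (str), corrected (bool),
    # latency_ms (number), and cost_usd (number).
    def is_valid_json(text):
        try:
            json.loads(text)
            return True
        except json.JSONDecodeError:
            return False

    n = len(records)
    latencies = sorted(r["latency_ms"] for r in records)
    return {
        "json_validity_rate": sum(is_valid_json(r["raw_output"]) for r in records) / n,
        "human_correction_rate": sum(r["corrected"] for r in records) / n,
        "median_latency_ms": statistics.median(latencies),
        "max_latency_ms": latencies[-1],
        "total_cost_usd": round(sum(r["cost_usd"] for r in records), 6),
    }
```

Accuracy, groundedness, and safety usually need labeled data or human review, so they are left out of this sketch.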

Q: What should be logged?

A: Trace ID, timestamp, prompt version, model version, retrieval metadata, confidence score, route, latency, and errors, while avoiding sensitive raw data unless approved.

Architect Takeaway

If you cannot measure it, you cannot trust it. Evaluation and observability are core architecture components, not optional add-ons.
