Chapter 21

Cost Analysis and Optimization

Learning Objective

Learn how to estimate and control GenAI project costs.

What it means

GenAI cost includes model input and output tokens, embedding generation, vector database usage, infrastructure, monitoring, storage, networking, development effort, and human review. The total is driven mainly by model choice, context size, request volume, and architecture design.

Why it matters

A prototype may be cheap with a few test users but expensive at production volume. Architects must estimate cost per request, cost per document, monthly volume, peak usage, and savings from automation or decision support.
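The estimates listed above can be sketched in a few lines. This is a minimal illustration, not a pricing model: the function names, the per-1K token prices, the minutes saved per request, the labor rate, and the fixed platform cost are all hypothetical values chosen for the example.

```python
def cost_per_request(input_tokens, output_tokens, input_cost_per_1k, output_cost_per_1k):
    """Model cost in dollars for one request (token charges only)."""
    return (input_tokens / 1000) * input_cost_per_1k + (output_tokens / 1000) * output_cost_per_1k


def monthly_net_value(requests_per_month, unit_cost, minutes_saved_per_request,
                      hourly_labor_rate, fixed_monthly_cost):
    """Labor savings minus model cost and fixed cost, in dollars per month."""
    model_cost = requests_per_month * unit_cost
    savings = requests_per_month * (minutes_saved_per_request / 60) * hourly_labor_rate
    return savings - model_cost - fixed_monthly_cost


# Hypothetical inputs: all prices, rates, and volumes are assumptions.
unit = cost_per_request(2500, 500, 0.005, 0.015)

# The same unit cost looks very different at prototype and production volume.
for label, volume in [("prototype", 200), ("production", 50_000)]:
    net = monthly_net_value(volume, unit, minutes_saved_per_request=2,
                            hourly_labor_rate=30.0, fixed_monthly_cost=3000.0)
    print(f"{label}: model cost ${volume * unit:,.2f}, net value ${net:,.2f}")
```

Peak usage is not modeled here. In practice it affects rate limits and provisioned capacity more than per-token charges, so it is usually estimated separately.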

Healthcare Example

A document summarization system processing 100 documents per day has a different cost profile than a policy assistant processing 100,000 queries per month. High-risk cases may also require human review, and the cost of that review should be included in the ROI calculation.
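The comparison above can be made concrete with a rough sketch. Only the two volumes (100 documents per day, 100,000 queries per month) come from the text; the token counts, review rates, review cost, and token prices below are assumptions for illustration, and the `Workload` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    requests_per_month: int
    avg_input_tokens: int
    avg_output_tokens: int
    review_rate: float   # fraction of outputs routed to human review (assumed)
    review_cost: float   # dollars per reviewed output (assumed)


def monthly_cost(w, input_cost_per_1k=0.005, output_cost_per_1k=0.015):
    """Split monthly cost into model charges and human review."""
    per_request = (w.avg_input_tokens / 1000) * input_cost_per_1k \
        + (w.avg_output_tokens / 1000) * output_cost_per_1k
    model = w.requests_per_month * per_request
    review = w.requests_per_month * w.review_rate * w.review_cost
    return {"model": round(model, 2), "review": round(review, 2),
            "total": round(model + review, 2)}


# 100 documents/day over a 30-day month: few requests, long inputs, more review.
summarizer = Workload("document summarizer", 3000, 12000, 800, 0.20, 4.0)
# 100,000 queries/month: many requests, short inputs, little review.
assistant = Workload("policy assistant", 100_000, 1500, 300, 0.01, 4.0)

for w in (summarizer, assistant):
    print(w.name, monthly_cost(w))
```

Under these assumed numbers, human review costs more than the model in both workloads, which is why leaving it out of ROI can be misleading.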

Cost Drivers

Model selection: Use smaller models for simple classification and larger models for reasoning.

Tokens: Use chunking, retrieval, summaries, and output limits.

Embeddings: Embed only approved and useful content.

Vector DB: Use metadata filters and lifecycle management.

Human review: Route only uncertain or high-risk outputs.

Latency: Cache repeated results and avoid unnecessary model calls.
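Two of these levers, model selection and caching, can be sketched together. This is a simplified illustration under stated assumptions: the model names `small-model` and `large-model`, the task labels, the `CachedClient` class, and the `fake_call_model` stub are all hypothetical and stand in for whatever provider client a project actually uses.

```python
import hashlib

# Hypothetical routing table: cheap model for simple tasks, larger one for reasoning.
MODEL_BY_TASK = {
    "classification": "small-model",
    "extraction": "small-model",
    "reasoning": "large-model",
}


def pick_model(task_type):
    """Unknown task types fall back to the larger model."""
    return MODEL_BY_TASK.get(task_type, "large-model")


class CachedClient:
    """Exact-match cache in front of a model call, keyed by model and prompt."""

    def __init__(self, call_model):
        self._call_model = call_model   # any callable taking (model, prompt)
        self._cache = {}
        self.model_calls = 0            # number of paid calls actually made

    def complete(self, task_type, prompt):
        model = pick_model(task_type)
        key = hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._call_model(model, prompt)
            self.model_calls += 1
        return self._cache[key]


def fake_call_model(model, prompt):
    # Stub standing in for a real provider call.
    return f"[{model}] response to: {prompt}"


client = CachedClient(fake_call_model)
client.complete("classification", "Is this claim urgent?")
client.complete("classification", "Is this claim urgent?")  # served from cache
print("Paid model calls:", client.model_calls)
```

Exact-match caching only helps when identical prompts recur and a stored answer is still valid, so a real deployment would likely add expiry and exclude requests that contain user-specific or time-sensitive data.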

Code: Monthly Cost Estimator

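def estimate_monthly_cost(requests_per_month, avg_input_tokens, avg_output_tokens,
                          input_cost_per_1k, output_cost_per_1k):
    """Return the estimated monthly model cost in dollars (token charges only)."""
    # Prices are quoted per 1,000 tokens, so convert token totals to thousands.
    input_cost = (requests_per_month * avg_input_tokens / 1000) * input_cost_per_1k
    output_cost = (requests_per_month * avg_output_tokens / 1000) * output_cost_per_1k
    return round(input_cost + output_cost, 2)

# Example: 50,000 requests per month, 2,500 input and 500 output tokens per request.
monthly = estimate_monthly_cost(50000, 2500, 500, 0.005, 0.015)
print(f"Estimated monthly model cost: ${monthly:,.2f}")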

Common Mistakes

Ignoring token growth from retrieved context.
Using the largest model for every task.
No caching strategy.
No usage monitoring.
No budget alerts.
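The last two mistakes, missing usage monitoring and missing budget alerts, can be addressed with very little code. The sketch below is one possible shape, not a production design: the `BudgetTracker` class, its threshold values, and the token prices are hypothetical, and a real system would persist spend and send alerts to an on-call or messaging channel rather than return strings.

```python
class BudgetTracker:
    """Accumulates token spend and reports each budget threshold once."""

    def __init__(self, monthly_budget, alert_thresholds=(0.5, 0.8, 1.0)):
        self.monthly_budget = monthly_budget
        self.alert_thresholds = sorted(alert_thresholds)
        self.spend = 0.0
        self._fired = set()   # thresholds that have already produced an alert

    def record(self, input_tokens, output_tokens, input_cost_per_1k, output_cost_per_1k):
        """Add one request's cost and return any newly triggered alerts."""
        self.spend += (input_tokens / 1000) * input_cost_per_1k \
            + (output_tokens / 1000) * output_cost_per_1k
        alerts = []
        for threshold in self.alert_thresholds:
            crossed = self.spend >= threshold * self.monthly_budget
            if crossed and threshold not in self._fired:
                self._fired.add(threshold)
                alerts.append(
                    f"Spend reached {threshold:.0%} of budget "
                    f"(${self.spend:.2f} of ${self.monthly_budget:.2f})"
                )
        return alerts


# Hypothetical usage: a $100 monthly budget and assumed token prices.
tracker = BudgetTracker(monthly_budget=100.0)
for alert in tracker.record(11_000_000, 0, 0.005, 0.015):
    print(alert)
```

Recording input tokens per request also makes growth from retrieved context visible, since a rising average input size shows up in spend before it shows up on an invoice.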

Interview Q&A

Q: Does reducing hallucination increase cost?

A: Often yes, because better models, RAG, validation, monitoring, and human review add cost. The architect balances accuracy, cost, and latency based on business risk.

Q: How do you optimize GenAI cost?

A: Combine token budgeting, model routing, caching, prompt compression, better retrieval quality so that fewer chunks are sent, output limits, and smaller models for simple tasks.
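Token budgeting, the first item in the answer above, can be illustrated with a small sketch that keeps the highest-scoring retrieved chunks until a context budget is used up. The function names are hypothetical, and the word-count token estimate is a deliberate simplification; a real system would count tokens with the tokenizer of the model it calls.

```python
def approx_tokens(text):
    # Rough assumption: one token per whitespace-separated word.
    # Real tokenizers usually produce more tokens than this.
    return len(text.split())


def fit_to_budget(chunks, max_context_tokens):
    """Select chunks by descending relevance score without exceeding the budget.

    chunks: list of (score, text) pairs from a retriever.
    Returns (selected_texts, tokens_used).
    """
    selected, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = approx_tokens(text)
        if used + cost > max_context_tokens:
            continue  # skip chunks that would overflow; a smaller one may still fit
        selected.append(text)
        used += cost
    return selected, used


# Hypothetical retrieved chunks with relevance scores.
retrieved = [
    (0.91, "Prior authorization is required for imaging services"),
    (0.55, "General plan overview and member contact information for all regions"),
    (0.78, "Imaging requests need a referral"),
]
context, tokens_used = fit_to_budget(retrieved, max_context_tokens=12)
print(tokens_used, context)
```

Capping context this way bounds input cost per request, at the risk of dropping a relevant chunk, so the budget is best tuned against answer quality rather than set once.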

Architect Takeaway

Cost optimization is an architecture decision, not a finance afterthought.
