Chapter 21
Cost Analysis and Optimization
Learning Objective
Learn how to estimate and control GenAI project costs.
What it means
GenAI cost includes model input/output tokens, embeddings, vector database, infrastructure, monitoring, storage, networking, development effort, and human review. Cost is influenced by model choice, context size, request volume, and architecture design.
Why it matters
A prototype may be cheap with a few test users but expensive at production volume. Architects must estimate cost per request, cost per document, monthly volume, peak usage, and savings from automation or decision support.
Healthcare Example
A document summarization system processing 100 documents per day has a different cost profile than a policy assistant processing 100,000 queries per month. High-risk cases may also require human review, and that review cost should be included in the ROI calculation.
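To make the comparison concrete, the sketch below folds human review into a simple monthly net-value estimate for the two workloads. The function name `monthly_net_value` and every price, review rate, and savings figure are hypothetical placeholders, not measured values; substitute your own numbers.

```python
def monthly_net_value(volume, model_cost_per_item, review_rate,
                      review_cost_per_item, savings_per_item):
    """Return (total_cost, net_value) for one month.

    review_rate is the fraction of items sent to human review (0.0 to 1.0).
    """
    model_cost = volume * model_cost_per_item
    review_cost = volume * review_rate * review_cost_per_item
    total_cost = model_cost + review_cost
    net_value = volume * savings_per_item - total_cost
    return round(total_cost, 2), round(net_value, 2)


# Hypothetical workloads: every rate and price below is a placeholder.
summarization = monthly_net_value(
    volume=100 * 30,            # 100 documents per day over a 30-day month
    model_cost_per_item=0.05,
    review_rate=0.20,           # assume 20% of summaries are high-risk
    review_cost_per_item=2.00,
    savings_per_item=4.00,
)
policy_assistant = monthly_net_value(
    volume=100_000,             # 100,000 queries per month
    model_cost_per_item=0.01,
    review_rate=0.02,           # assume 2% of answers are escalated
    review_cost_per_item=2.00,
    savings_per_item=0.25,
)
print("summarization (cost, net):", summarization)
print("policy assistant (cost, net):", policy_assistant)
```

With these placeholder inputs, human review costs more than model usage in both workloads, which is why it belongs in the ROI estimate.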
Cost Drivers
Model selection: Use smaller models for simple classification and larger models for reasoning
Tokens: Use chunking, retrieval, summaries, and output limits
Embeddings: Embed only approved and useful content
Vector DB: Use metadata filters and lifecycle management
Human review: Route only uncertain/high-risk outputs
Latency: Cache repeated results and avoid unnecessary model calls
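Two of the drivers above, model selection and human review, come down to routing rules. The following is a minimal sketch under stated assumptions: the tier names, the per-1K-token prices in `PRICES`, the `SIMPLE_TASKS` set, and the 0.8 confidence threshold are all illustrative and do not come from any vendor.

```python
# Hypothetical per-1K-token prices (input, output) for two model tiers.
PRICES = {
    "small": (0.0005, 0.0015),
    "large": (0.005, 0.015),
}

# Task types assumed simple enough for the small tier.
SIMPLE_TASKS = {"classification", "extraction"}


def choose_model(task_type):
    """Route simple tasks to the small tier and everything else to the large tier."""
    return "small" if task_type in SIMPLE_TASKS else "large"


def request_cost(model, input_tokens, output_tokens):
    """Cost of one request at the tier's per-1K-token prices."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1000 * in_price + output_tokens / 1000 * out_price


def needs_human_review(confidence, high_risk, threshold=0.8):
    """Send an output to review only if it is high-risk or low-confidence."""
    return high_risk or confidence < threshold


# (task type, input tokens, output tokens) for a small sample of requests.
requests = [
    ("classification", 800, 20),
    ("classification", 800, 20),
    ("reasoning", 2500, 500),
]
routed = sum(request_cost(choose_model(t), i, o) for t, i, o in requests)
all_large = sum(request_cost("large", i, o) for _, i, o in requests)
print(f"routed: ${routed:.5f}  all-large: ${all_large:.5f}")
```

Running the same sample through both paths shows how much of the bill comes from sending simple tasks to the large tier.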
Code: Monthly Cost Estimator
def estimate_monthly_cost(requests_per_month, avg_input_tokens, avg_output_tokens,
                          input_cost_per_1k, output_cost_per_1k):
    # Prices are quoted per 1,000 tokens, so convert token counts to thousands.
    input_cost = (requests_per_month * avg_input_tokens / 1000) * input_cost_per_1k
    output_cost = (requests_per_month * avg_output_tokens / 1000) * output_cost_per_1k
    return round(input_cost + output_cost, 2)

# Example: 50,000 requests per month, 2,500 input and 500 output tokens per request.
monthly = estimate_monthly_cost(50000, 2500, 500, 0.005, 0.015)
print(f"Estimated monthly model cost: ${monthly:,.2f}")
Common Mistakes
- Ignoring token growth from retrieved context.
- Using the largest model for every task.
- Running without a caching strategy.
- Skipping usage monitoring.
- Setting no budget alerts.
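Two of these mistakes, unchecked context growth and missing budget alerts, can be guarded against with a few lines of code. This is a rough sketch, not a production implementation: `estimate_tokens` assumes about 4 characters per token, which is only a rule of thumb for English text, and the function names and the 80% alert ratio are illustrative choices.

```python
def estimate_tokens(text):
    """Rough token estimate: about 4 characters per token (an assumption;
    use the model's real tokenizer for billing-grade numbers)."""
    return max(1, len(text) // 4)


def fit_context(chunks, max_context_tokens):
    """Keep retrieved chunks in ranked order until the token budget is used up."""
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > max_context_tokens:
            break  # stop at the first chunk that would exceed the budget
        kept.append(chunk)
        used += cost
    return kept, used


def budget_status(spend_to_date, monthly_budget, alert_ratio=0.8):
    """Return 'ok', 'alert', or 'over' for the current month's spend."""
    if spend_to_date >= monthly_budget:
        return "over"
    if spend_to_date >= alert_ratio * monthly_budget:
        return "alert"
    return "ok"


# Three retrieved chunks of 400 characters each, capped at 250 context tokens.
chunks = ["a" * 400, "b" * 400, "c" * 400]
kept, used = fit_context(chunks, max_context_tokens=250)
print(len(kept), "chunks kept,", used, "estimated tokens")
print(budget_status(spend_to_date=850, monthly_budget=1000))
```

Capping context at retrieval time keeps input tokens predictable, and the status check can feed whatever alerting channel the team already uses.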
Interview Q&A
Q: Does reducing hallucination increase cost?
A: Often yes, because better models, RAG, validation, monitoring, and human review add cost. The architect balances accuracy, cost, and latency based on business risk.
Q: How do you optimize GenAI cost?
A: Combine token budgeting, model routing, caching, prompt compression, better retrieval quality, and output limits, and use smaller models for simple tasks.
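Caching, one of the techniques named in the answer above, can be illustrated with an exact-match cache keyed on a normalized prompt. This is a sketch under assumptions: `CachedModelClient` and `fake_model` are hypothetical names, and `fake_model` stands in for a real model call. Exact-match caching only suits answers that do not depend on the individual user or on data that changes often.

```python
import hashlib


class CachedModelClient:
    """Wraps a model-calling function and reuses answers for repeated prompts."""

    def __init__(self, call_model):
        self._call_model = call_model   # hypothetical callable: prompt -> answer
        self._cache = {}
        self.model_calls = 0            # counts only calls that reached the model

    @staticmethod
    def _key(prompt):
        # Normalize case and whitespace so trivially different prompts share a key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def ask(self, prompt):
        key = self._key(prompt)
        if key not in self._cache:
            self.model_calls += 1
            self._cache[key] = self._call_model(prompt)
        return self._cache[key]


def fake_model(prompt):
    """Stand-in for a paid model call."""
    return f"answer to: {prompt}"


client = CachedModelClient(fake_model)
first = client.ask("What is the copay policy?")
second = client.ask("what is  the copay policy?")   # differs only in case and spacing
print("model calls:", client.model_calls)
```

The second request is served from the cache, so only one billable model call is made for the two prompts.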
Architect Takeaway
Cost optimization is an architecture decision, not a finance afterthought.