Chapter 4
LLMs, Tokens, Context Window, and Model Parameters
Learning Objective
Understand what an LLM is and how tokens, context window, temperature, latency, and cost affect production design.
What it means
An LLM is a model trained to process and generate language. It does not read text exactly like a human; it processes tokens. The context window is the maximum amount of information the model can consider in one request. Model parameters such as temperature influence consistency and creativity.
Why it matters
Tokens drive cost and performance. Large prompts, long documents, chat history, and retrieved context all consume tokens. A poor design sends too much text to the model, increasing latency and cost while sometimes reducing answer quality.
Healthcare Example
A 150-page medical record should not be sent directly to the model. The system should chunk the document, retrieve only relevant sections, summarize where needed, and send focused context to the model.
Key Terms
| Term | Meaning | Architect Impact |
|---|---|---|
| Token | Unit of text processed by the model | Affects cost and context usage |
| Context Window | Maximum tokens in a request | Limits how much data can be sent |
| Temperature | Controls randomness | Low for healthcare and compliance use cases |
| Latency | Time taken to respond | Impacts user experience and batch throughput |
| Max Output Tokens | Limit for response length | Controls cost and prevents verbose output |
Code: Estimate Simple Token Budget
def estimate_tokens(text: str) -> int:
# Rough estimate: 1 token is about 4 characters in English text.
return max(1, len(text) // 4)
system_prompt = "You are a healthcare document assistant."
user_question = "Summarize missing clinical information."
retrieved_context = "Patient note text..." * 500
total_tokens = sum(estimate_tokens(x) for x in [system_prompt, user_question, retrieved_context])
print("Estimated input tokens:", total_tokens)Common Mistakes
- Sending full documents instead of retrieved chunks.
- Using high temperature for regulated workflows.
- Ignoring output token limits.
- Keeping unnecessary chat history in every prompt.
Interview Q&A
Q: What is a context window?
A: It is the maximum amount of information an LLM can process in a single request, including system prompt, user prompt, history, tool outputs, and retrieved documents.
Q: What temperature would you use for healthcare extraction?
A: A low temperature such as 0 to 0.2 because consistency is more important than creativity.
Architect Takeaway
Context is expensive real estate. A good GenAI architecture sends the minimum relevant information needed for a high-quality answer.