Chapter 4

LLMs, Tokens, Context Window, and Model Parameters

Learning Objective

Understand what an LLM is and how tokens, context window, temperature, latency, and cost affect production design.

What it means

An LLM is a model trained to process and generate language. It does not read text exactly like a human; it processes tokens. The context window is the maximum amount of information the model can consider in one request. Model parameters such as temperature influence consistency and creativity.

Why it matters

Tokens drive cost and performance. Large prompts, long documents, chat history, and retrieved context all consume tokens. A poor design sends too much text to the model, increasing latency and cost while sometimes reducing answer quality.

Healthcare Example

A 150-page medical record should not be sent directly to the model. The system should chunk the document, retrieve only relevant sections, summarize where needed, and send focused context to the model.

Key Terms

Term	Meaning	Architect Impact
Token	Unit of text processed by the model	Affects cost and context usage
Context Window	Maximum tokens in a request	Limits how much data can be sent
Temperature	Controls randomness	Low for healthcare and compliance use cases
Latency	Time taken to respond	Impacts user experience and batch throughput
Max Output Tokens	Limit for response length	Controls cost and prevents verbose output

Code: Estimate Simple Token Budget

def estimate_tokens(text: str) -> int:
    # Rough estimate: 1 token is about 4 characters in English text.
    return max(1, len(text) // 4)

system_prompt = "You are a healthcare document assistant."
user_question = "Summarize missing clinical information."
retrieved_context = "Patient note text..." * 500

total_tokens = sum(estimate_tokens(x) for x in [system_prompt, user_question, retrieved_context])
print("Estimated input tokens:", total_tokens)

Common Mistakes

Sending full documents instead of retrieved chunks.
Using high temperature for regulated workflows.
Ignoring output token limits.
Keeping unnecessary chat history in every prompt.

Interview Q&A

Q: What is a context window?

A: It is the maximum amount of information an LLM can process in a single request, including system prompt, user prompt, history, tool outputs, and retrieved documents.

Q: What temperature would you use for healthcare extraction?

A: A low temperature such as 0 to 0.2 because consistency is more important than creativity.

Architect Takeaway

Context is expensive real estate. A good GenAI architecture sends the minimum relevant information needed for a high-quality answer.

Ch 3: Tools Required for a GenAI Project

Ch 5: Hallucinations, Confidence, and Validation