GoofyCubes
4

Chapter 4

LLMs, Tokens, Context Window, and Model Parameters

Learning Objective

Understand what an LLM is and how tokens, context window, temperature, latency, and cost affect production design.

What it means

An LLM is a model trained to process and generate language. It does not read text exactly like a human; it processes tokens. The context window is the maximum amount of information the model can consider in one request. Model parameters such as temperature influence consistency and creativity.

Why it matters

Tokens drive cost and performance. Large prompts, long documents, chat history, and retrieved context all consume tokens. A poor design sends too much text to the model, increasing latency and cost while sometimes reducing answer quality.

Healthcare Example

A 150-page medical record should not be sent directly to the model. The system should chunk the document, retrieve only relevant sections, summarize where needed, and send focused context to the model.

Key Terms

TermMeaningArchitect Impact
TokenUnit of text processed by the modelAffects cost and context usage
Context WindowMaximum tokens in a requestLimits how much data can be sent
TemperatureControls randomnessLow for healthcare and compliance use cases
LatencyTime taken to respondImpacts user experience and batch throughput
Max Output TokensLimit for response lengthControls cost and prevents verbose output

Code: Estimate Simple Token Budget

def estimate_tokens(text: str) -> int:
    # Rough estimate: 1 token is about 4 characters in English text.
    return max(1, len(text) // 4)

system_prompt = "You are a healthcare document assistant."
user_question = "Summarize missing clinical information."
retrieved_context = "Patient note text..." * 500

total_tokens = sum(estimate_tokens(x) for x in [system_prompt, user_question, retrieved_context])
print("Estimated input tokens:", total_tokens)

Common Mistakes

  • Sending full documents instead of retrieved chunks.
  • Using high temperature for regulated workflows.
  • Ignoring output token limits.
  • Keeping unnecessary chat history in every prompt.

Interview Q&A

Q: What is a context window?

A: It is the maximum amount of information an LLM can process in a single request, including system prompt, user prompt, history, tool outputs, and retrieved documents.

Q: What temperature would you use for healthcare extraction?

A: A low temperature such as 0 to 0.2 because consistency is more important than creativity.

Architect Takeaway

Context is expensive real estate. A good GenAI architecture sends the minimum relevant information needed for a high-quality answer.