Phi-4

Microsoft

Visit Microsoft Find tools using this model

Open weightstext

Small open model focused on reasoning quality relative to its size.

Developer: Microsoft
Release date: Dec 11, 2024
Parameters: 14B
Corpus size: Synthetic + filtered web (Microsoft research)
License: MIT
Context window: 16K tokens
Modalities: text

Learn this model

Tutorial tailored to Phi-4—cost, capabilities, API setup, and production patterns based on this model's specs (not generic copy for every LLM).

Cost & access

Phi-4 weights are available under MIT. Direct API cost may be $0 if you self-host; budget for GPUs, storage, and engineering instead. Hosted endpoints (Together, Fireworks, Groq, etc.) charge per token—shop providers for phi-4 latency and region. With a 16K tokens context window, long PDFs or chat histories increase input tokens quickly—trim history or summarize older turns in production.

Functional understanding

Small open model focused on reasoning quality relative to its size.
Modalities: text · License: MIT · Released 2024-12-12.
Best-fit workflows for this model:
• Drafting, summarization, and structured extraction from long documents.
• On-prem or VPC deployment when data cannot leave your network.

Technical foundation

Microsoft reports 14B parameters; training data: Synthetic + filtered web (Microsoft research).
Context: 16K tokens. Open weights: yes.
Phi-4 is positioned as a general-purpose model in the Microsoft lineup.

First API call

Run Phi-4 locally with Ollama or Hugging Face transformers (weights under MIT).

# Ollama (if model is published there)
# ollama run phi-4

# Or Hugging Face transformers:
from transformers import pipeline

pipe = pipeline("text-generation", model="microsoft/phi-4", device_map="auto")
print(pipe("Hello from Phi-4", max_new_tokens=80)[0]["generated_text"])

Important technical topics

Prompting Phi-4: be explicit about output format. Weak: "Analyze this." Better: "Return JSON with fields id, total, date for Microsoft billing data."
Temperature: use 0–0.3 for extraction and compliance on Phi-4; 0.7–1.0 for brainstorming.
Tokens: Phi-4 bills by tokens (~¾ word each). 14B parameters affect capability; your bill is driven by context length and call volume.
Context window (16K tokens): everything in one request—system prompt, tools, RAG chunks, and history—must fit. Truncate or summarize when approaching the limit for Phi-4.

Real enterprise patterns

RAG with Phi-4: retrieve from your vector DB, cite sources in the prompt.
Tool calling: define JSON schemas; let Phi-4 request functions, not free-form SQL.
Eval suite: regression prompts before each model or prompt change.
Cost routing: default to Phi-4 for hard tasks; smaller sibling model for triage.

Production & security

Secrets: never commit keys for Phi-4; use vault + per-environment rotation.
PII: mask before inference; log redacted prompts only.
Observability: trace id per request; log model=phi-4, tokens in/out, latency.
GPU monitoring: VRAM, batch queue depth, and model revision hash on each deploy.
Guardrails: schema-validate JSON; block disallowed topics; cross-check numbers against source docs.

Mini projects with this model

Support copilot: Phi-4 drafts replies from KB snippets.
Contract clause extractor with human approval.
Weekly metrics narrative from SQL + CSV exports.
Agent that files expenses from receipt photos (if multimodal).

Suggested stack

Language: Python 3.11+
Model: Phi-4 via Ollama, vLLM, or Hugging Face
Hardware: NVIDIA GPU with enough VRAM for quantization level
API wrapper: FastAPI or LiteLLM proxy
UI: Streamlit or Next.js for internal tools
APIs: FastAPI
Vector DB (RAG): Pinecone / Chroma / pgvector

Learning path

Python basics
HTTP/REST and environment variables
Microsoft authentication and Phi-4 model id (phi-4)
First successful call to Phi-4
Prompt design and JSON / structured outputs
RAG
Tool use / function calling
Evals and regression sets
Production deploy + monitoring