Chapter 14
Healthcare Data Protection, PHI, and Responsible AI
Learning Objective
Understand how to protect sensitive healthcare data in GenAI systems.
What it means
Healthcare GenAI systems may process protected health information (PHI) such as names, dates of birth, member IDs, diagnoses, medications, clinical notes, and provider details. Protecting this data requires technical controls, governance, access control, monitoring, and careful vendor selection.
Why it matters
AI systems can unintentionally expose sensitive information through prompts, logs, outputs, vector stores, or debugging tools. A responsible architecture minimizes data exposure and preserves auditability.
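Logs are one of the exposure paths listed above. The following is a minimal sketch of redacting log messages before they are written, assuming Python's standard logging module and two illustrative regex patterns. The names PhiRedactionFilter, redact, and the logger name genai.prompts are hypothetical, and real PHI detection would need far broader coverage than these patterns.

```python
import logging
import re

# Illustrative patterns only; real PHI detection needs much broader coverage.
PHI_PATTERNS = [
    (re.compile(r"\b[A-Z]{1,3}\d{6,12}\b"), "[MEMBER_ID]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "[DATE]"),
]

def redact(text):
    """Replace every match of the illustrative patterns with its placeholder."""
    for pattern, placeholder in PHI_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

class PhiRedactionFilter(logging.Filter):
    """Hypothetical filter that redacts the formatted message before output."""

    def filter(self, record):
        # Format the message first so values passed as arguments are covered too.
        record.msg = redact(record.getMessage())
        record.args = ()  # the message is already fully formatted
        return True

logger = logging.getLogger("genai.prompts")
handler = logging.StreamHandler()
handler.addFilter(PhiRedactionFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Prompt sent: %s", "Member ID ABC123456789, DOB 05/12/1978")
```

In this sketch the filter is attached to a single handler, so any other handler that can receive the same records would need the same filter.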
Healthcare Example
Before sending a clinical note to an LLM, a system may mask direct identifiers and keep only the minimum information needed for the task. Full identifiers remain in a secure internal system.
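One way to keep full identifiers internal is to swap them for stable tokens before the LLM call and restore them afterward. The sketch below assumes a member ID format of one to three uppercase letters followed by six to twelve digits. The functions pseudonymize and reidentify are hypothetical, and the vault dictionary stands in for whatever secure internal store a real system would use.

```python
import re

# Assumed member ID format, for illustration only.
MEMBER_ID_RE = re.compile(r"\b[A-Z]{1,3}\d{6,12}\b")

def pseudonymize(text, vault):
    """Swap each member ID for a stable token; record token -> original in vault."""
    def replace(match):
        original = match.group(0)
        # Reuse the existing token if this identifier was already seen.
        for token, value in vault.items():
            if value == original:
                return token
        token = f"[MEMBER_{len(vault) + 1}]"
        vault[token] = original
        return token
    return MEMBER_ID_RE.sub(replace, text)

def reidentify(text, vault):
    """Restore the original identifiers in text returned by the model."""
    for token, original in vault.items():
        text = text.replace(token, original)
    return text

vault = {}  # stand-in for a secure internal store; never sent to the LLM
note = "Member ID ABC123456789 has diabetes. Follow up with ABC123456789."
masked = pseudonymize(note, vault)
print(masked)                     # only this masked text would go to the LLM
print(reidentify(masked, vault))  # identifiers restored inside the secure boundary
```

Using the same token for repeated identifiers lets the model treat them as one entity without ever seeing the real value.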
Architecture Flow
Code: Simple PHI Masking
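import re

def mask_member_id(text):
    # Replace member IDs (1-3 uppercase letters followed by 6-12 digits).
    return re.sub(r"\b[A-Z]{1,3}\d{6,12}\b", "[MEMBER_ID]", text)

def mask_dob(text):
    # Replace dates written as M/D/YYYY or MM/DD/YYYY.
    return re.sub(r"\b\d{1,2}/\d{1,2}/\d{4}\b", "[DATE]", text)

note = "Patient DOB 05/12/1978. Member ID ABC123456789 has diabetes."
print(mask_member_id(mask_dob(note)))
# Prints: Patient DOB [DATE]. Member ID [MEMBER_ID] has diabetes.
Common Mistakes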
- Logging raw prompts with PHI.
- Storing sensitive text in vector databases without encryption.
- Not defining data retention rules.
- Omitting role-based access control.
- Skipping human review of high-risk outputs.
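Two of the mistakes above, missing retention rules and missing role-based access control, can be illustrated with a short sketch. The role names, permission names, and the 30-day window are assumptions chosen for the example, not regulatory or recommended values.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical role-to-permission mapping; a real system would load this from policy.
ROLE_PERMISSIONS = {
    "clinician": {"read_masked", "read_full"},
    "analyst": {"read_masked"},
}

RETENTION = timedelta(days=30)  # assumed retention window, for illustration only

def can_access(role, permission):
    """Return True only if the role explicitly grants the permission."""
    return permission in ROLE_PERMISSIONS.get(role, set())

def purge_expired(records, now):
    """Keep only the records created within the retention window."""
    return [r for r in records if now - r["created_at"] <= RETENTION]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
records = [
    {"id": 1, "created_at": now - timedelta(days=45)},
    {"id": 2, "created_at": now - timedelta(days=5)},
]
print(can_access("analyst", "read_full"))             # False: not granted to analysts
print([r["id"] for r in purge_expired(records, now)])  # [2]: record 1 is past retention
```

Unknown roles receive an empty permission set, so access is denied by default.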
Interview Q&A
Q: How do you secure healthcare data in a GenAI system?
A: I apply minimum necessary data sharing, masking, encryption, role-based access, private endpoints, audit logging, retention controls, and output validation.
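Output validation, the last control in that answer, can be sketched as a check that scans model output for residual identifiers before release. The patterns mirror the illustrative ones in this chapter's masking example. The functions validate_output and release_or_hold and the status values are hypothetical.

```python
import re

# Illustrative patterns only; real validation would cover many more identifier types.
RESIDUAL_PHI = {
    "member_id": re.compile(r"\b[A-Z]{1,3}\d{6,12}\b"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

def validate_output(text):
    """Return the names of the patterns that still match the model output."""
    return [name for name, pattern in RESIDUAL_PHI.items() if pattern.search(text)]

def release_or_hold(text):
    """Release clean output; hold anything with findings for human review."""
    findings = validate_output(text)
    if findings:
        return {"status": "held_for_review", "findings": findings}
    return {"status": "released", "text": text}

print(release_or_hold("Summary: patient has type 2 diabetes."))
print(release_or_hold("Summary for ABC123456789, born 05/12/1978."))
```

Held outputs deliberately omit the text from the returned result so that downstream code cannot display it by accident.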
Q: Should PHI be sent to an external LLM?
A: Only if approved by legal/security policies, covered by proper agreements, and protected by enterprise controls. Otherwise use de-identification or private deployment options.
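The decision in that answer can be expressed as a small policy gate. This is a sketch under stated assumptions: the LlmTarget fields, the route function, and the returned labels are hypothetical, and the actual approval criteria would come from an organization's legal and security teams.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LlmTarget:
    """Hypothetical description of an LLM endpoint and its approval status."""
    name: str
    external: bool
    policy_approved: bool     # legal/security sign-off has been recorded
    agreement_in_place: bool  # a signed agreement covers PHI for this vendor

def route(contains_phi, target):
    """Decide how a request may be sent to the given target."""
    # No PHI, or a private deployment: no external-sharing decision is needed.
    if not contains_phi or not target.external:
        return "send_as_is"
    # External target with PHI: require both approval and an agreement.
    if target.policy_approved and target.agreement_in_place:
        return "send_with_enterprise_controls"
    return "deidentify_or_use_private_deployment"

unapproved = LlmTarget("vendor-llm", external=True,
                       policy_approved=False, agreement_in_place=False)
approved = LlmTarget("vendor-llm", external=True,
                     policy_approved=True, agreement_in_place=True)
print(route(True, unapproved))
print(route(True, approved))
```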
Architect Takeaway
Healthcare AI architecture must treat prompts, embeddings, logs, and outputs as sensitive data surfaces.