Day 1d · Security & Data Safety
You will break things.
Possibly badly.
Not a scare talk. A map.
The patterns that keep client data safe — and keep you out of the news.
30 min
Katherine
Prompt injection · Data hygiene · Cowork risks
What came up in pre-session research
You already knew
something felt off.
- ·Data security and client confidentiality was the most-raised concern
- ·Multiple people asked: what's our policy on what we share vs. keep private?
- ·Concern about AI tools training on company and client data
- ·No formal company-wide AI policy yet — people are navigating it solo
- ·Worry about accuracy and blind trust in AI outputs
These instincts are right. 30 minutes to build the map. You leave with three things to never do, one attack pattern to recognize, and three habits that actually help.
The attack surface
Everything in context
is fair game
📁
Files you share
Docs, spreadsheets, briefs, research — whatever you give it, Claude reads completely. Every word.
e.g. You paste a campaign brief — Claude reads the client name, budget, and strategy. All of it is now in context.
🌐
Web pages it visits
In Cowork with Chrome: every page it loads is read in full — including anything hidden in the page source.
e.g. You ask Claude to summarise a competitor's site. That full page — including anything hidden in the source — is in context.
💬
The conversation
Everything you've said, every tool result — all visible, all influential on what Claude does next.
e.g. You pasted a vendor contract 10 turns ago. Claude still has it. Nothing leaves context until the session ends.
The key insight: Claude doesn't distinguish your instructions from instructions planted inside a document someone else created. It reads everything with equal attention.
Data sanitization
Some things never
go in a prompt. Ever.
✕
Client data — names, strategies, briefs
✕
Social security numbers, financial data, personal info
✕
NDA-protected content, proprietary strategies
✕
Passwords, API keys, access credentials — of any kind
Check your client agreements first
Some clients have signed AI usage addendums that permit specific tools. Others have NDAs that restrict them. If a client has a signed agreement AND you're on an enterprise account: it's not never — it's proceed carefully. Without both? Treat it as confidential by default.
The fix: synthetic data
Describe your dataset structure. Have Claude generate fake rows with realistic shapes. Build and test your workflow with synthetic data — only use real data when proven, on an enterprise account.
Alt fix: Have Claude write the formula, not read the data
Describe what you want to calculate. Claude writes the Excel or Google Sheets formula using column letters and cell references. You paste it into your own file — your actual data never enters the conversation. Cowork's spreadsheet integration makes this especially smooth.
Simon Willison · Security researcher · June 2025
Three conditions.
All three = catastrophe.
🔐
Leg 1
Access to
private data
Files, emails, client records, financials — anything sensitive in context
+
☣️
Leg 2
Untrusted
content exposure
Web pages, documents from external sources, third-party content Claude reads
+
📡
Leg 3
Ability to
communicate out
API calls, file uploads, link clicks, email sends — any outbound channel
The mitigation: break one leg. Don't mount sensitive folders when browsing untrusted sites. Don't load untrusted pages when sensitive data is in context. You don't need to avoid the tools — you need to be intentional about what you combine.
The trifecta in practice
The AI reads
what you can't see.
What prompt injection is
Malicious instructions hidden inside content Claude reads — a web page, a document, an email — that hijack what Claude does next.
Claude can't tell the difference between your instructions and instructions planted by someone else.
PromptArmor · Jan 15, 2026
A Word document with 1-point white text — invisible to you — told Claude Cowork to upload accessible files to an attacker-controlled account via Anthropic's own file-sharing service.
Financial information and PII exfiltrated. Anthropic had known about the underlying vulnerability for ~3 months before launch.
Concealment: 1pt font · white text on white background · 0.1 line spacing. Unreadable to humans. Perfectly legible to an AI reading raw document content.
About the Cowork sandbox
The attack worked because Claude had legitimate access to those files. Cowork's sandboxed environment protected your host machine — but not against data leaving through a door Claude was already allowed to open.
USENIX Security 2024 · Peer-reviewed research
The risk is real
and measurable.
10 defences were tested — here's what they actually look like
Keep rules separate
Put your instructions in a labelled zone, external content (docs, web pages) in another — so the AI knows which voice to trust
Filter known attack phrases
Block content that looks like a hijacking attempt — like "ignore previous instructions" — before it reaches the AI
Check the output before it runs
Before the AI takes an action — sending an email, uploading a file — a safety check scans what it's about to do
Repeat safety rules constantly
Remind the AI of your instructions at the end of every message, so they can't be pushed aside mid-conversation
Use a second AI as a watchdog
A separate model trained to spot injection attacks screens every input before it reaches the main AI
1 in 3
attacks still get through the best of these — across all 10 models tested
Turning on a safety setting doesn't mean you're covered. Every published defence has a failure rate. The next slide shows why.
Anthropic · OpenAI · Google DeepMind · October 2025
The defences break
when attackers try hard.
The Label Trick
99% bypass
Wraps external content in special markers so Claude knows which voice is yours.
The attack: put the closing marker inside the page itself — Claude reads the attacker's instructions as if they're your system rules.
The Repeat Rule
95% bypass
Repeats your safety instructions at the end of every message, so they can't be overridden.
The attack: "ignore everything after this line." Claude follows the attacker.
The Spam Filter
94% bypass
Blocks known injection phrases like "ignore previous instructions."
The attack: rephrase it. "Before we continue, let me clarify my role…" — no blocked phrase, same result.
Checks if Claude's output looks dangerous before allowing it to run.
The attack: craft output that looks like innocent documentation fetching. Tripwire doesn't fire. Your data leaves anyway.
500 researchers. $20,000 in prizes. Every single defence: broken.
The "1% attack rate" you've seen quoted? Measured against automated bots, not humans.
The takeaway: No single defence holds. Layer defences, reduce your attack surface, and don't rely on any one protection to save you.
What to actually do
Three habits.
Real protection.
1
Test with synthetic data, alwaysDescribe your dataset structure. Have Claude generate fake rows with realistic shapes. Build and test your workflow with synthetic data — only use real client data when the process is proven, on an enterprise account.
2
Pre-sort your file access — mount only what you needWhen setting up Cowork, only connect the specific folder for the task at hand. Narrow scope = smaller blast radius. Don't give it your whole drive by default, and don't use Chrome access for sensitive accounts while files are mounted.
3
Enterprise accounts only for client workConsumer and enterprise tiers have different legal data agreements — not just different features. Enterprise means Anthropic is contractually bound on data use and retention. With a signed client AI agreement + an enterprise account, you can use real client data — carefully. Without both, it's a hard no. If your company doesn't have a business account, that's a conversation to start today.
The point of all this
Not a reason to stop.
A reason to go carefully.
You're going to use these tools with client data.
The question isn't whether — it's how.
You now know
- ✓What Claude actually sees — the attack surface
- ✓What to never paste, the legal caveat, and the formula fix
- ✓Willison's lethal trifecta — and how to break it
- ✓How the PromptArmor attack actually worked
- ✓The research: defences reduce risk — none eliminate it
- ✓Three habits to reduce your attack surface starting today
The mindset shift
Before: "I hope nothing goes wrong"
After: "I know the risks, I've structured my workflow to minimize them, and I know what to do if something looks off."
Report suspicious behavior: usersafety@anthropic.com · Questions?
If you remember nothing else from the last 30 minutes
The short version.
DO ✓
DON'T ✗
✓Test your campaign workflow with a fake brief — have Claude generate dummy client names, goals, and budgets first
✗Paste the real brief into Claude.ai on a personal or free account
✓Mount only the specific project folder you need before starting a Cowork session
✗Give Cowork access to your whole Documents folder by default
✓Check there's a signed client AI agreement AND an enterprise account — then proceed carefully
✗Assume the same rules apply to every client
✓Ask Claude to write the formula using column letters; apply it to your own file
✗Upload the real client spreadsheet to get a formula written