Security & Data Safety

Day 1d · Security & Data Safety

You will break things.
Possibly badly.

Not a scare talk. A map.
The patterns that keep client data safe — and keep you out of the news.

30 min Katherine Prompt injection · Data hygiene · Cowork risks

What came up in pre-session research

You already knew
something felt off.

·Data security and client confidentiality was the most-raised concern
·Multiple people asked: what's our policy on what we share vs. keep private?
·Concern about AI tools training on company and client data
·No formal company-wide AI policy yet — people are navigating it solo
·Worry about accuracy and blind trust in AI outputs

These instincts are right. 30 minutes to build the map. You leave with three things to never do, one attack pattern to recognize, and three habits that actually help.

The attack surface

Everything in context
is fair game

📁

Files you share

Docs, spreadsheets, briefs, research — whatever you give it, Claude reads completely. Every word.

e.g. You paste a campaign brief — Claude reads the client name, budget, and strategy. All of it is now in context.

🌐

Web pages it visits

In Cowork with Chrome: every page it loads is read in full — including anything hidden in the page source.

e.g. You ask Claude to summarise a competitor's site. That full page — including anything hidden in the source — is in context.

💬

The conversation

Everything you've said, every tool result — all visible, all influential on what Claude does next.

e.g. You pasted a vendor contract 10 turns ago. Claude still has it. Nothing leaves context until the session ends.

The key insight: Claude doesn't distinguish your instructions from instructions planted inside a document someone else created. It reads everything with equal attention.

Data sanitization

Some things never
go in a prompt. Ever.

✕

Client data — names, strategies, briefs

✕

Social security numbers, financial data, personal info

✕

NDA-protected content, proprietary strategies

✕

Passwords, API keys, access credentials — of any kind

Check your client agreements first

Some clients have signed AI usage addendums that permit specific tools. Others have NDAs that restrict them. If a client has a signed agreement AND you're on an enterprise account: it's not never — it's proceed carefully. Without both? Treat it as confidential by default.

The fix: synthetic data

Describe your dataset structure. Have Claude generate fake rows with realistic shapes. Build and test your workflow with synthetic data — only use real data when proven, on an enterprise account.

Alt fix: Have Claude write the formula, not read the data

Describe what you want to calculate. Claude writes the Excel or Google Sheets formula using column letters and cell references. You paste it into your own file — your actual data never enters the conversation. Cowork's spreadsheet integration makes this especially smooth.

Simon Willison · Security researcher · June 2025

Three conditions.
All three = catastrophe.

🔐

Leg 1

Access to
private data

Files, emails, client records, financials — anything sensitive in context

+

☣️

Leg 2

Untrusted
content exposure

Web pages, documents from external sources, third-party content Claude reads

+

📡

Leg 3

Ability to
communicate out

API calls, file uploads, link clicks, email sends — any outbound channel

The mitigation: break one leg. Don't mount sensitive folders when browsing untrusted sites. Don't load untrusted pages when sensitive data is in context. You don't need to avoid the tools — you need to be intentional about what you combine.

The trifecta in practice

The AI reads
what you can't see.

What prompt injection is

Malicious instructions hidden inside content Claude reads — a web page, a document, an email — that hijack what Claude does next.

Claude can't tell the difference between your instructions and instructions planted by someone else.

PromptArmor · Jan 15, 2026

A Word document with 1-point white text — invisible to you — told Claude Cowork to upload accessible files to an attacker-controlled account via Anthropic's own file-sharing service.

Financial information and PII exfiltrated. Anthropic had known about the underlying vulnerability for ~3 months before launch.

Concealment: 1pt font · white text on white background · 0.1 line spacing. Unreadable to humans. Perfectly legible to an AI reading raw document content.

About the Cowork sandbox

The attack worked because Claude had legitimate access to those files. Cowork's sandboxed environment protected your host machine — but not against data leaving through a door Claude was already allowed to open.

Read the full story → The Register

USENIX Security 2024 · Peer-reviewed research

The risk is real
and measurable.

10 defences were tested — here's what they actually look like

Keep rules separate

Put your instructions in a labelled zone, external content (docs, web pages) in another — so the AI knows which voice to trust

Filter known attack phrases

Block content that looks like a hijacking attempt — like "ignore previous instructions" — before it reaches the AI

Check the output before it runs

Before the AI takes an action — sending an email, uploading a file — a safety check scans what it's about to do

Repeat safety rules constantly

Remind the AI of your instructions at the end of every message, so they can't be pushed aside mid-conversation

Use a second AI as a watchdog

A separate model trained to spot injection attacks screens every input before it reaches the main AI

1 in 3

attacks still get through the best of these — across all 10 models tested

Turning on a safety setting doesn't mean you're covered. Every published defence has a failure rate. The next slide shows why.

Read the paper → arXiv 2310.12815

Anthropic · OpenAI · Google DeepMind · October 2025

The defences break
when attackers try hard.

The Label Trick

99% bypass

Wraps external content in special markers so Claude knows which voice is yours.

The attack: put the closing marker inside the page itself — Claude reads the attacker's instructions as if they're your system rules.

The Repeat Rule

95% bypass

Repeats your safety instructions at the end of every message, so they can't be overridden.

The attack: "ignore everything after this line." Claude follows the attacker.

The Spam Filter

94% bypass

Blocks known injection phrases like "ignore previous instructions."

The attack: rephrase it. "Before we continue, let me clarify my role…" — no blocked phrase, same result.

The Tripwire

100% bypass

Checks if Claude's output looks dangerous before allowing it to run.

The attack: craft output that looks like innocent documentation fetching. Tripwire doesn't fire. Your data leaves anyway.

500 researchers. $20,000 in prizes. Every single defence: broken.

The "1% attack rate" you've seen quoted? Measured against automated bots, not humans.

The takeaway: No single defence holds. Layer defences, reduce your attack surface, and don't rely on any one protection to save you.

Simon Willison's coverage

What to actually do

Three habits.
Real protection.

1

Test with synthetic data, alwaysDescribe your dataset structure. Have Claude generate fake rows with realistic shapes. Build and test your workflow with synthetic data — only use real client data when the process is proven, on an enterprise account.

2

Pre-sort your file access — mount only what you needWhen setting up Cowork, only connect the specific folder for the task at hand. Narrow scope = smaller blast radius. Don't give it your whole drive by default, and don't use Chrome access for sensitive accounts while files are mounted.

3

Enterprise accounts only for client workConsumer and enterprise tiers have different legal data agreements — not just different features. Enterprise means Anthropic is contractually bound on data use and retention. With a signed client AI agreement + an enterprise account, you can use real client data — carefully. Without both, it's a hard no. If your company doesn't have a business account, that's a conversation to start today.

The point of all this

Not a reason to stop.
A reason to go carefully.

You're going to use these tools with client data.
The question isn't whether — it's how.

You now know

✓What Claude actually sees — the attack surface
✓What to never paste, the legal caveat, and the formula fix
✓Willison's lethal trifecta — and how to break it
✓How the PromptArmor attack actually worked
✓The research: defences reduce risk — none eliminate it
✓Three habits to reduce your attack surface starting today

The mindset shift

Before: "I hope nothing goes wrong"

After: "I know the risks, I've structured my workflow to minimize them, and I know what to do if something looks off."

Report suspicious behavior: usersafety@anthropic.com · Questions?

If you remember nothing else from the last 30 minutes

The short version.

DO ✓

DON'T ✗

✓Test your campaign workflow with a fake brief — have Claude generate dummy client names, goals, and budgets first

✗Paste the real brief into Claude.ai on a personal or free account

✓Mount only the specific project folder you need before starting a Cowork session

✗Give Cowork access to your whole Documents folder by default

✓Check there's a signed client AI agreement AND an enterprise account — then proceed carefully

✗Assume the same rules apply to every client

✓Ask Claude to write the formula using column letters; apply it to your own file

✗Upload the real client spreadsheet to get a formula written