Speaker Notes
Day 2a · Data Exploration · 30 min
AI-Assisted Data
Exploration & Analysis
How do I actually use AI to work with data without making mistakes?
30 min + demo · Katherine · verification · decision-making · data safety
Putting Day 1 together
60%
of companies using AI generate
no material value
Only 5% create substantial value at scale
BCG, "The Widening AI Value Gap," Sep 2025
The difference isn't tool choice
It's verification and process. Same principle as 1b — except numbers feel more trustworthy than words. That makes data hallucinations more dangerous, not less.
For data work, context = your dataset + your question + your business knowledge. Missing any one → bad output.
Beyond client analysis
Three uses you
might not expect

💰 Make the business case

  • You KNOW a process is wasteful, but you don't have time to build the argument
  • Claude does the math, shows assumptions, you adjust inputs
  • Data-backed pitch in 2 min instead of "trust me"

🎯 Prioritize your work

  • "Which of my 5 projects has the highest impact/effort ratio?"
  • Upload a simple spreadsheet with your estimates
  • AI makes your thinking visible and challengeable

🔍 Understand unfamiliar data

  • Client sends a dataset you've never seen
  • "Describe this dataset. What are the quality issues?"
  • "What questions could I answer — and what needs more data?"
The rule: Always show your assumptions. If you can't state your three biggest assumptions, don't present the estimate.
Data in / Data out · Silent breakage
Where things break
under the hood
These don't throw errors. You get clean-looking results that are just... wrong.
📅
Dates read as text
What happens: Claude reads "1/2/2025" — is that Jan 2 or Feb 1? Sorts alphabetically, so "12/1" comes before "2/1".
Your time series is scrambled.
Trend line is nonsense.
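A minimal pandas sketch of the failure and the fix. The day-first format here is an assumption; confirm the real convention with whoever exported the data:

```python
import pandas as pd

# Ambiguous strings: is "1/2/2025" Jan 2 or Feb 1?
raw = pd.Series(["1/2/2025", "12/1/2025", "2/1/2025"])

# Sorted as text, "12/1" lands before "2/1" ('/' sorts before '2')
print(sorted(raw))

# Pin the format explicitly so nothing is guessed
# (day-first assumed here — verify against the source system)
dates = pd.to_datetime(raw, format="%d/%m/%Y")
print(dates.sort_values().tolist())
```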
Blanks → zero or missing?
What happens: Blank cell = "no conversions" or "not tracked that day"? Claude guesses. Wrong guess → averages off 20-30%.
avg_conv: 4.2 (real: 3.1)
Dropped 18 "empty" rows.
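A quick sketch of how much the guess matters, using hypothetical daily conversion counts:

```python
import pandas as pd

# Hypothetical daily conversions; a blank can mean "zero" or "not tracked"
conv = pd.Series([3, 4, None, 2, None, 5, 4])

as_missing = conv.mean()          # NaN ignored: 18 / 5 = 3.6
as_zero = conv.fillna(0).mean()   # NaN as 0:    18 / 7 ≈ 2.57

# Same column, two defensible averages — state which meaning you intend
print(as_missing, as_zero)
```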
👯
Duplicate rows from exports
What happens: HubSpot, Salesforce, GA all produce duplicate rows in exports. Claude counts them all.
"Total leads: 1,247"
Actual: 940. Inflated 33%.
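The fix is one line once you know to look. A sketch with hypothetical lead IDs:

```python
import pandas as pd

# Hypothetical CRM export where the same lead repeats across rows
leads = pd.DataFrame({
    "lead_id": [101, 102, 102, 103, 103, 103],
    "source":  ["email", "paid", "paid", "seo", "seo", "seo"],
})

raw_count = len(leads)                                  # 6 — inflated
real_count = leads.drop_duplicates("lead_id").shape[0]  # 3 unique leads
print(raw_count, real_count)
```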
Summing percentages
What happens: Claude sees a CTR column and sums it instead of weighted average. Presents "total CTR: 247%" with confidence.
Should be: 3.8% weighted avg.
You'd catch this — would a PM?
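The correct version is an average weighted by volume. A sketch with hypothetical campaign numbers:

```python
import pandas as pd

# Hypothetical per-campaign stats: a ratio column must not be summed
df = pd.DataFrame({
    "impressions": [100_000, 5_000, 1_000],
    "clicks":      [3_000, 400, 120],
})
df["ctr"] = df["clicks"] / df["impressions"]

wrong = df["ctr"].sum()                               # adds up ratios: 23%
right = df["clicks"].sum() / df["impressions"].sum()  # weighted: ~3.3%
print(f"{wrong:.0%} vs {right:.1%}")
```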
🔗
Silent join drops
What happens: Two datasets spell the client name differently ("Acme Corp" vs "ACME"). Claude drops unmatched rows silently.
Lost 15% of your data.
Totals just... don't add up.
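pandas' merge indicator makes the silent drop visible. A sketch with hypothetical client tables:

```python
import pandas as pd

spend = pd.DataFrame({"client": ["Acme Corp", "Globex"], "spend": [10_000, 7_500]})
leads = pd.DataFrame({"client": ["ACME", "Globex"], "leads": [120, 90]})

# An inner join drops the "Acme Corp" / "ACME" mismatch with no error
joined = spend.merge(leads, on="client")
print(len(joined))  # 1 row survived out of 2

# indicator=True flags which side each row came from, exposing the loss
audit = spend.merge(leads, on="client", how="outer", indicator=True)
print(audit[audit["_merge"] != "both"])
```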
🪟
Wrong time window
What happens: You say "Q4 performance." Claude uses calendar Q4. Your client's fiscal Q4 is Feb–Apr. Entirely different data.
Right analysis,
completely wrong quarter.
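A sketch of the same dates landing in different quarters, assuming a hypothetical fiscal year ending in April:

```python
import pandas as pd

dates = pd.to_datetime(["2024-11-15", "2025-02-20", "2025-03-10", "2025-04-05"])

# Calendar Q4 = Oct–Dec
calendar_q4 = [d for d in dates if d.quarter == 4]

# Hypothetical fiscal year ending in April → fiscal Q4 = Feb–Apr
fiscal = dates.to_period("Q-APR")
fiscal_q4 = [d for d, q in zip(dates, fiscal) if q.quarter == 4]

print(len(calendar_q4), len(fiscal_q4))  # different rows entirely
```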
None of these throw an error. That's what makes them dangerous. The output looks clean, professional, and definitive — while the underlying data has been silently mangled.
Hallucination rates · The trend
Getting better fast —
not yet trustworthy alone
Best model · Vectara Hallucination Leaderboard 
21.8% · ~2021 (est.)
3.0% · Nov 2023 · GPT-4
0.8% · Feb 2025 · o3-mini
0.7% · Feb 2025 · Gemini 2.0 Flash
[Chart: hallucination rate, 2021 → early 2025]
Task: summarize articles without inventing information · Vectara HHEM · Note: leaderboard refreshed Nov 2025 with harder benchmark
But the average model?
~11%
across 60+ LLMs
Even at 0.7%, roughly 1 in 140 claims may be wrong. Simple math is nearly perfect — but multi-step analysis is where errors compound. Each step multiplies the risk.
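The compounding claim is just geometric decay. A two-line sketch:

```python
# If each reasoning step is right 99.3% of the time (0.7% error rate),
# the chance a multi-step chain is entirely right decays geometrically
p_step = 0.993
for n in (1, 5, 10, 20):
    print(f"{n:2d} steps → {p_step ** n:.1%} chance of no errors")
```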
Capability benchmarks · 2024–2026
It's not just hallucination —
everything is improving
Three independent benchmarks. Same timeline. Same trajectory.
SimpleQA — Factual Accuracy: GPT-4o · 38% → o1 · 47% → GPT-4.5 · 63%
SWE-Bench — Real-World Coding: 3.5 Sonnet · 49% → Opus 4.5 · 74% → Opus 4.6 · 81%
GDPVal — Professional Knowledge: Pre-5.2 best · 39% → GPT-5.2 · 71%
[Chart: scores 30–90%, mid-2024 → early 2026]
Independent, third-party evaluations — not marketing claims. Sources: OpenAI Research ↗ · Princeton NLP SWE-Bench ↗ · Vals.ai ↗
Data analysis accuracy · What the tests show
Benchmarks show the ceiling.
Your results depend on you.
Think of benchmarks like standardized tests for AI — the SAT, but for language models.

📊 How AI scores on standardized tests

  • >85% accuracy on graduate-level science questions
    (PhD-level physics, biology, chemistry)
  • >90% accuracy on general knowledge across 57 subjects
    (accounting to world religions — everything)
  • 53% → 23% error rate with prompt-based mitigation
    (adding skepticism instructions to the prompt)

⚠️ The "leaked test" problem

Imagine a student gets the exact exam questions leaked before the test. They ace it — but give them a new test on the same material and their score drops.
That's what's happening with AI. Models may have "seen" benchmark questions during training. When researchers create brand new questions the models haven't seen:
Scores drop up to 13%
Published scores are the best case. Worst-case drop seen in smaller models; frontier models less affected.
Live Demo · 10 minutes
Decision-Making
With AI
Watch how asking better questions changes the answer.
Setup: 3 months of campaign performance data across 4 clients.
Client X is asking which channel deserves more budget next quarter.
Live demo · Steps 1–2
The Naive Ask → The Challenge

1️⃣ The Naive Ask

"What was the ROI of each channel for Client X?"

Claude returns clean numbers. Looks definitive — numbers, percentages, a clear winner.

⚠ But is it the right answer?

2️⃣ The Challenge

"Paid social ran heavier in December. Email ran heavier in January. Control for seasonal baselines."

Claude re-analyzes, adjusts for seasonality. Different result.

✓ Caught a confound in 30 seconds
Teaching moment: A data analyst would catch this in a 2-day review. We caught it in 30 seconds because we asked.
Live demo · Steps 3–4
Go Deeper → Verify

3️⃣ Going Deeper

"Paid social had higher ROI but we spent 3x more. What if equal budgets?"

Claude normalizes for budget → reveals true relative performance.

Same data, different conclusion.

4️⃣ The Verification Ask

"What are the 3 biggest assumptions? What would change the conclusion?"

Claude surfaces: seasonality assumption, budget scaling linearity, audience overlap between channels.

✓ Claude didn't volunteer these — I asked.
Live demo · Step 5
The Adversarial Pass
Prompt: "Pretend you're a skeptical data analyst reviewing this for the first time. What are the biggest weaknesses?"
🔬 Two modes of Claude
1️⃣
Analyze · Generate the findings
2️⃣
Challenge · Critique the findings
✓ The takeaway

This is adversarial verification in practice. You don't need to be technical. You just need to ask the right follow-up.

Three questions
on a sticky note
Before trusting ANY AI-generated finding:
  1. Is this real or is this noise?
    How many data points? Does the pattern hold if I remove one month? Is the effect size big enough to matter?
  2. What else could explain this?
    What's one alternative explanation? What data would tell the difference? List 3, not just the one you like.
  3. Would I change my decision based on this?
    If the analysis were completely wrong, would my decision change? If "no, I'd do the same thing anyway" → why analyze?
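Question 1's "remove one month" check can be sketched as a leave-one-out pass over hypothetical monthly lift figures:

```python
# Hypothetical monthly lift numbers — one suspicious spike
monthly_lift = [0.4, 0.6, 3.1, 0.5, 0.7, 0.5]

def holds_without_each(values, threshold=0.8):
    """Re-average leaving one point out each time; False = fragile finding."""
    results = []
    for i in range(len(values)):
        rest = values[:i] + values[i + 1:]
        results.append(sum(rest) / len(rest) > threshold)
    return results

# If the conclusion flips when one month is removed, it's riding on noise
print(holds_without_each(monthly_lift))
```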
Common traps
Five ways data
analysis goes wrong
⚖️
Authority Bias
AI output looks professional → you assume it's correct. A polished chart of garbage data is still garbage.
Anchoring
AI returns 5 insights. The first one becomes "the answer" even though the order is arbitrary.
📊
Big Data Fallacy
1M data points doesn't mean the pattern is real. Large datasets make noise look like signal.
🔗
Correlation ≠ Causation
"High engagement correlates with high spend" — could be either direction, a third variable, or survivorship bias.
🔬
Exploratory ≠ Confirmatory
Found a pattern? That's a hypothesis to test — not a conclusion to present. Label your work.
How to trust AI data output · 1 of 3
Dashboard Mirroring
"Trust the graph, not the chat"
📊
Have AI design a dashboard in your actual BI tool: Tableau, Looker, Google Data Studio, even Excel pivot charts. AI walks you through the setup: fields, aggregation, visualization type.
If the same graph appears in the non-AI tool → trust the data. The dashboard pulls from your real data source. Same result = verified result.
👥
The dashboard lives past your Claude session. Anyone on the team can reference it. This is Phase 3: codify the methodology into something shared.
How to trust AI data output · 2 of 3
Code, Not Prose +
Show Your Work

💻 Programmatic enforcement

  • Have Claude write code (Python, R, Excel formulas) instead of "telling you" the answer
  • Code is inspectable, reproducible, and testable with known values
  • "I can read the formula. I can't read Claude's thinking."

📓 METHODOLOGY.md

  • Document every query, transformation, join, and assumption
  • Like a lab notebook — if you can't reproduce it, the result isn't science
  • Advanced: Claude Code hook auto-logs methodology (enforced, not hoped for)
Good topic for Kyle's office hours
How to trust AI data output · 3 of 3
Multi-Agent, Spot-Check
& Observability

🤖 Multi-agent verification

  • One session to analyze, another to critique
  • Ask Claude to attack the work, not confirm it
  • The demo's adversarial pattern — formalized

🧪 Spot-check with known values

  • Insert records with known outcomes before analysis
  • If Claude finds those correctly → confidence in the rest
  • Like a control group in an experiment
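A minimal sketch of the canary idea, with hypothetical rows and known ROI values:

```python
import pandas as pd

# Hypothetical canary rows with outcomes we already know
canaries = pd.DataFrame({
    "campaign": ["__canary_a__", "__canary_b__"],
    "spend":    [1_000, 2_000],
    "revenue":  [3_000, 2_000],   # known ROI: 3.0x and 1.0x
})

def check_roi(result: pd.DataFrame) -> bool:
    """Trust the rest of the analysis only if the canaries come back right."""
    got = result.set_index("campaign")["roi"]
    return bool(got["__canary_a__"] == 3.0 and got["__canary_b__"] == 1.0)

# Simulate an analysis step (computed by us here; by the AI in practice)
out = canaries.assign(roi=canaries["revenue"] / canaries["spend"])
print(check_roi(out))  # True → proceed; False → stop and inspect
```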

📡 Continuous observability

  • 89% of orgs with AI agents have observability
  • Track drift, anomalies, refusal patterns
  • Different answers on different days = something changed
Data safety · Working without agreements
Four practical approaches
1. Synthetic data first ⭐

Describe your dataset → Claude generates realistic fake rows. Build and test your workflow on synthetic data. Swap real data when agreements are in place.

75% of data in AI projects will be synthetic by 2026 — Gartner
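A minimal sketch of what "describe your dataset → fake rows" might produce; the schema and value ranges here are hypothetical placeholders:

```python
import random

random.seed(7)  # reproducible fake data

# Hypothetical schema mirrored from the real export — values are invented
CHANNELS = ["email", "paid_social", "seo"]

def synthetic_rows(n: int) -> list[dict]:
    """Generate realistic-looking campaign rows; no client data involved."""
    rows = []
    for i in range(n):
        impressions = random.randint(1_000, 50_000)
        rows.append({
            "campaign_id": f"C{i:04d}",
            "channel": random.choice(CHANNELS),
            "impressions": impressions,
            "clicks": int(impressions * random.uniform(0.005, 0.06)),
        })
    return rows

sample = synthetic_rows(100)
print(sample[0])
```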
2. Code-first / Formula-first

Claude writes analysis code using column references. You run the code on real data locally — Claude never sees the data. "Claude builds the microscope; you look through it."
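A sketch of the pattern, with hypothetical column names; only the code crosses the boundary, never the data:

```python
import pandas as pd

# AI writes this against column NAMES only — it never sees the rows.
def channel_roi(df: pd.DataFrame) -> pd.DataFrame:
    """Per-channel ROI = revenue / spend, computed on YOUR machine."""
    g = df.groupby("channel")[["spend", "revenue"]].sum()
    g["roi"] = g["revenue"] / g["spend"]
    return g.sort_values("roi", ascending=False)

# You run it locally, e.g. channel_roi(pd.read_csv("client_export.csv")).
# Stand-in frame here so the sketch is self-contained:
local = pd.DataFrame({
    "channel": ["email", "email", "paid_social"],
    "spend":   [100, 150, 500],
    "revenue": [400, 350, 900],
})
print(channel_roi(local))
```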

3. Aggregation + anonymization

Use aggregated metrics instead of individual records. Strip identifying info, keep the patterns.

4. Get the agreement signed

Pitch the ROI. Enterprise accounts only — data NOT used for training, contractual protections, ZDR available.

Rule: Client data = enterprise account only. No exceptions. Consumer plans: data may be used for training.
Cowork sandbox · Deep dive
Four layers of protection
Think of it like a building with four security checkpoints.

🖥️ Layer 1 · Separate computer

  • Claude doesn't run on your Mac — it boots a mini virtual computer inside your computer
  • Even if something goes wrong, it can't touch your real machine
  • Same technology that powers Docker
VM isolation · Apple VZVirtualMachine · hardware-level sandboxing

🔒 Layer 2 · Locked room

  • Inside that virtual computer, Claude is in a locked room — it can only do pre-approved actions
  • Tries to do something unauthorized? Blocked before it even starts
  • Every conversation gets its own separate room
Process sandbox · bubblewrap + seccomp · kernel-level restrictions

📁 Layer 3 · You choose the filing cabinet

  • By default Claude can see nothing on your machine
  • You explicitly hand it specific folders — it can only open those
  • Can't browse around, can't go up to parent folders
Filesystem whitelist · default-deny · mounted directories only

🌐 Layer 4 · Approved websites only

  • Claude can only visit a pre-approved list of websites
  • A security guard checks every outbound request — blocks anything not on the list
  • Login credentials are controlled too — only services you've allowed
Network allowlist · SOCKS5 proxy · domain-level filtering
Cowork sandbox · Transparency
What's NOT solved yet
⚠️ Indirect exfiltration via prompt injection

Researchers discovered that attackers can construct prompts that use the Anthropic API endpoint (on the allowlist) to exfiltrate files to attacker-controlled accounts.

Prompt Armor, Jan 14 2026 · Simon Willison · Anthropic

✅ What the sandbox does

  • Files you don't share are invisible
  • Destructive actions require permission
  • Network traffic is restricted

⚠️ Practical guidance

  • Don't grant access to credential folders
  • Monitor for suspicious actions
  • Strong safety net — not a magic shield
Skills callback · From Session 1f
/performance-report
Raw data → structured executive report in ~28 seconds
Marketing Plugin v1.1.0 8-section output Reusable skill View on GitHub ↗
Instead of re-explaining your analysis process every time, the skill encodes best practices.
⏭ Skippable · Reference slide
/performance-report
8-Section Output
1 · Executive Summary
2 · Key Metrics Dashboard
3 · Trend Analysis
4 · What Worked
5 · What Needs Improvement
6 · Insights & Observations
7 · Recommendations
8 · Next Period Focus
Channels covered: Email · Social · Paid (Search + Social) · SEO · Content · Pipeline
Underlying methodology: Attribution modeling (last-touch, first-touch, linear, U-shaped, data-driven), forecasting, optimization cadence (daily → quarterly). Consider building your own /data-analyst skill — good project for Kyle's office hours.
Your process · Phases 1–2
Start specific. Iterate.

1️⃣ Pick a specific direction

  • "What patterns are in our Q4 data?" → too broad
  • Start with a hypothesis, go depth-first
  • Write question + hypothesis BEFORE starting
"I think email open rates are declining because subject lines are stale, not audience fatigue. If open rates are steady for new subscribers but dropping for 6+ month subscribers, that points to fatigue instead, and my hypothesis is wrong."

2️⃣ Iterate to a conclusion

  • First pass: rough results, observe what Claude gets wrong
  • Second pass: clarify your question based on what you learned
  • Third pass: consistent results — if not, question needs more work
⚠ When Claude gets something wrong, don't correct mid-session. Restart with a better question.
Your process · Phases 3–4 + Cross-domain
Verify, Share &
Borrow from Experts

3️⃣ Verify

  • Run the Three Questions
  • Adversarial pass
  • Check methodology
  • Colleague reviews methodology (not just output)

4️⃣ Share

  • Static HTML page with data + JS viz
  • Reusable skill (1f callback)
  • Dashboard in your BI tool
  • Overlay events onto timelines

🧠 Cross-domain safeguards

  • Intel: List 3 competing explanations (ACH)
  • Science: Label exploratory vs. confirmatory
  • Actuarial: Confidence intervals + sensitivity analysis
  • Math: Where does the pattern break?
Know your limits
When to call an analyst
Red flags that mean "stop and bring in an expert":
🚩
You can't state your assumptions clearly. If you can't articulate what you're assuming, you can't evaluate the output.
🔍
Effect size is tiny but "feels" important. May be noise amplified by a large dataset.
Finding contradicts expectations — and you can't explain why. There's a confound you're not seeing.
📉
Dataset has >5% missing values or quality issues. Garbage in, garbage out — no amount of AI fixes bad data.
🏢
Need to defend in C-suite — can't explain methodology in 2 min. If you can't explain it simply, you don't understand it well enough to present it.
If 2+ of these apply: the Three Questions framework is telling you to get help, not push through.
Session 2a · Closing
Your judgment is the bottleneck.
That's the point.
AI handles the grunt work. You handle the thinking.
📋 Three Questions
  1. Real or noise?
  2. What else could explain this?
  3. Would I change my decision?
🔄 The Process
  1. Start with a decision, not a dataset
  2. Depth-first on 1-2 hypotheses
  3. Challenge the first answer
  4. Adversarial pass
  5. Label: exploratory vs. confirmatory
🔒 Data Safety
  1. Synthetic data first
  2. Code-first / formula-first
  3. Aggregate + anonymize
  4. Enterprise accounts only
🚩 Call an Analyst
  · Can't state assumptions
  · Tiny effect "feels" important
  · Contradicts expectations
  · Need C-suite defense
Day 2a · Complete
What's next
Kyle's role-specific hands-on office hours — build automation for a high-impact use case using skills.
1a
Level Setting
1b
Critical Thinking
1c–1f
Plan · Security · Context · Skills
NOW
2a
Data Exploration
Office Hours
Kyle · Hands-on
📎
2b+
Remaining Sessions