Speaker Notes
Day 2a · Data Exploration · 30 min
AI-Assisted Data
Exploration & Analysis
How do I actually use AI to work with data without making mistakes?
30 min + demo · Katherine · verification · decision-making · data safety
Putting Day 1 together
60%
of companies using AI generate
no material value
Only 5% create substantial value at scale
BCG, "The Widening AI Value Gap," Sep 2025
The difference isn't tool choice
It's verification and process. Same principle as 1b — except numbers feel more trustworthy than words. That makes data hallucinations more dangerous, not less.
For data work, context = your dataset + your question + your business knowledge. Missing any one → bad output.
Beyond client analysis
Three uses you
might not expect

💰 Make the business case

  • You KNOW a process is wasteful, but you don't have time to build the argument
  • Claude does the math, shows assumptions, you adjust inputs
  • Data-backed pitch in 2 min instead of "trust me"

🎯 Prioritize your work

  • "Which of my 5 projects has the highest impact/effort ratio?"
  • Upload a simple spreadsheet with your estimates
  • AI makes your thinking visible and challengeable

🔍 Understand unfamiliar data

  • Client sends a dataset you've never seen
  • "Describe this dataset. What are the quality issues?"
  • "What questions could I answer — and what needs more data?"
The rule: Always show your assumptions. If you can't state your three biggest assumptions, don't present the estimate.
Data in / Data out · Silent breakage
Where things break
under the hood
These don't throw errors. You get clean-looking results that are just... wrong.
📅
Dates read as text
What happens: Claude reads "1/2/2025" — is that Jan 2 or Feb 1? Sorts alphabetically, so "12/1" comes before "2/1".
Your time series is scrambled.
Trend line is nonsense.
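A minimal pandas sketch of the failure and the fix. The day-first format here is an assumption; confirm the real convention with whoever exported the data:

```python
import pandas as pd

# Ambiguous strings: is "1/2/2025" Jan 2 or Feb 1?
raw = pd.Series(["1/2/2025", "12/1/2025", "2/1/2025"])

# Sorted as text, "12/1" lands before "2/1" ('/' sorts before '2')
print(sorted(raw))

# Pin the format explicitly so nothing is guessed
# (day-first assumed here — verify against the source system)
dates = pd.to_datetime(raw, format="%d/%m/%Y")
print(dates.sort_values().tolist())
```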
Blanks → zero or missing?
What happens: Blank cell = "no conversions" or "not tracked that day"? Claude guesses. Wrong guess → averages off 20-30%.
avg_conv: 4.2 (real: 3.1)
Dropped 18 "empty" rows.
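A quick sketch of how much the guess matters, using hypothetical daily conversion counts:

```python
import pandas as pd

# Hypothetical daily conversions; a blank can mean "zero" or "not tracked"
conv = pd.Series([3, 4, None, 2, None, 5, 4])

as_missing = conv.mean()          # NaN ignored: 18 / 5 = 3.6
as_zero = conv.fillna(0).mean()   # NaN as 0:    18 / 7 ≈ 2.57

# Same column, two defensible averages — state which meaning you intend
print(as_missing, as_zero)
```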
👯
Duplicate rows from exports
What happens: HubSpot, Salesforce, GA all produce duplicate rows in exports. Claude counts them all.
"Total leads: 1,247"
Actual: 940. Inflated 33%.
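The fix is one line once you know to look. A sketch with hypothetical lead IDs:

```python
import pandas as pd

# Hypothetical CRM export where the same lead repeats across rows
leads = pd.DataFrame({
    "lead_id": [101, 102, 102, 103, 103, 103],
    "source":  ["email", "paid", "paid", "seo", "seo", "seo"],
})

raw_count = len(leads)                                  # 6 — inflated
real_count = leads.drop_duplicates("lead_id").shape[0]  # 3 unique leads
print(raw_count, real_count)
```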
Summing percentages
What happens: Claude sees a CTR column and sums it instead of weighted average. Presents "total CTR: 247%" with confidence.
Should be: 3.8% weighted avg.
You'd catch this — would a PM?
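The correct version is an average weighted by volume. A sketch with hypothetical campaign numbers:

```python
import pandas as pd

# Hypothetical per-campaign stats: a ratio column must not be summed
df = pd.DataFrame({
    "impressions": [100_000, 5_000, 1_000],
    "clicks":      [3_000, 400, 120],
})
df["ctr"] = df["clicks"] / df["impressions"]

wrong = df["ctr"].sum()                               # adds up ratios: 23%
right = df["clicks"].sum() / df["impressions"].sum()  # weighted: ~3.3%
print(f"{wrong:.0%} vs {right:.1%}")
```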
🔗
Silent join drops
What happens: Two datasets spell the client name differently ("Acme Corp" vs "ACME"). Claude drops unmatched rows silently.
Lost 15% of your data.
Totals just... don't add up.
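pandas' merge indicator makes the silent drop visible. A sketch with hypothetical client tables:

```python
import pandas as pd

spend = pd.DataFrame({"client": ["Acme Corp", "Globex"], "spend": [10_000, 7_500]})
leads = pd.DataFrame({"client": ["ACME", "Globex"], "leads": [120, 90]})

# An inner join drops the "Acme Corp" / "ACME" mismatch with no error
joined = spend.merge(leads, on="client")
print(len(joined))  # 1 row survived out of 2

# indicator=True flags which side each row came from, exposing the loss
audit = spend.merge(leads, on="client", how="outer", indicator=True)
print(audit[audit["_merge"] != "both"])
```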
🪟
Wrong time window
What happens: You say "Q4 performance." Claude uses calendar Q4. Your client's fiscal Q4 is Feb–Apr. Entirely different data.
Right analysis,
completely wrong quarter.
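A sketch of the same dates landing in different quarters, assuming a hypothetical fiscal year ending in April:

```python
import pandas as pd

dates = pd.to_datetime(["2024-11-15", "2025-02-20", "2025-03-10", "2025-04-05"])

# Calendar Q4 = Oct–Dec
calendar_q4 = [d for d in dates if d.quarter == 4]

# Hypothetical fiscal year ending in April → fiscal Q4 = Feb–Apr
fiscal = dates.to_period("Q-APR")
fiscal_q4 = [d for d, q in zip(dates, fiscal) if q.quarter == 4]

print(len(calendar_q4), len(fiscal_q4))  # different rows entirely
```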
None of these throw an error. That's what makes them dangerous. The output looks clean, professional, and definitive — while the underlying data has been silently mangled.
Hallucination rates · The trend
Getting better fast —
not yet trustworthy alone
Best model · Vectara Hallucination Leaderboard 
21.8% · ~2021 (est.)
3.0% · Nov 2023 · GPT-4
0.8% · Feb 2025 · o3-mini
0.7% · Feb 2025 · Gemini 2.0 Flash
[Chart: hallucination rate, 2021 → early 2025]
Task: summarize articles without inventing information · Vectara HHEM · Note: leaderboard refreshed Nov 2025 with harder benchmark
But the average model?
~11%
across 60+ LLMs
Even at 0.7%, roughly 1 in 140 claims may be wrong. Simple math is nearly perfect — but multi-step analysis is where errors compound. Each step multiplies the risk.
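The compounding claim is just geometric decay. A two-line sketch:

```python
# If each reasoning step is right 99.3% of the time (0.7% error rate),
# the chance a multi-step chain is entirely right decays geometrically
p_step = 0.993
for n in (1, 5, 10, 20):
    print(f"{n:2d} steps → {p_step ** n:.1%} chance of no errors")
```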
Capability benchmarks · 2024–2026
It's not just hallucination —
everything is improving
Three independent benchmarks. Same timeline. Same trajectory.
SimpleQA — Factual Accuracy: GPT-4o · 38% → o1 · 47% → GPT-4.5 · 63%
SWE-Bench — Real-World Coding: 3.5 Sonnet · 49% → Opus 4.5 · 74% → Opus 4.6 · 81%
GDPVal — Professional Knowledge: Pre-5.2 best · 39% → GPT-5.2 · 71%
[Chart: scores 30–90%, mid-2024 → early 2026]
Independent, third-party evaluations — not marketing claims. Sources: OpenAI Research ↗ · Princeton NLP SWE-Bench ↗ · Vals.ai ↗
Data analysis accuracy · What the tests show
Benchmarks show the ceiling.
Your results depend on you.
Think of benchmarks like standardized tests for AI — the SAT, but for language models.

📊 How AI scores on standardized tests

  • >85% accuracy on graduate-level science questions
    (PhD-level physics, biology, chemistry)
  • >90% accuracy on general knowledge across 57 subjects
    (accounting to world religions — everything)
  • 53% → 23% error rate with prompt-based mitigation
    (adding skepticism instructions to the prompt)

⚠️ The "leaked test" problem

Imagine a student gets the exact exam questions leaked before the test. They ace it — but give them a new test on the same material and their score drops.
That's what's happening with AI. Models may have "seen" benchmark questions during training. When researchers create brand new questions the models haven't seen:
Scores drop up to 13%
Published scores are the best case. Worst-case drop seen in smaller models; frontier models less affected.
Live Demo · 10 minutes
Decision-Making
With AI
Watch how asking better questions changes the answer.
Setup: 3 months of campaign performance data across 4 clients.
Client X is asking which channel deserves more budget next quarter.
Live demo · Steps 1–2
The Naive Ask → The Challenge

1️⃣ The Naive Ask

"What was the ROI of each channel for Client X?"

Claude returns clean numbers. Looks definitive — numbers, percentages, a clear winner.

⚠ But is it the right answer?

2️⃣ The Challenge

"Paid social ran heavier in December. Email ran heavier in January. Control for seasonal baselines."

Claude re-analyzes, adjusts for seasonality. Different result.

✓ Caught a confound in 30 seconds
Teaching moment: A data analyst would catch this in a 2-day review. We caught it in 30 seconds because we asked.
Live demo · Steps 3–4
Go Deeper → Verify

3️⃣ Going Deeper

"Paid social had higher ROI but we spent 3x more. What if equal budgets?"

Claude normalizes for budget → reveals true relative performance.

Same data, different conclusion.

4️⃣ The Verification Ask

"What are the 3 biggest assumptions? What would change the conclusion?"

Claude surfaces: seasonality assumption, budget scaling linearity, audience overlap between channels.

✓ Claude didn't volunteer these — I asked.
Live demo · Step 5
The Adversarial Pass
Prompt: "Pretend you're a skeptical data analyst reviewing this for the first time. What are the biggest weaknesses?"
🔬 Two modes of Claude
1️⃣
Analyze · Generate the findings
2️⃣
Challenge · Critique the findings
✓ The takeaway

This is adversarial verification in practice. You don't need to be technical. You just need to ask the right follow-up.

Three questions
on a sticky note
Before trusting ANY AI-generated finding:
  1. Is this real or is this noise?
    How many data points? Does the pattern hold if I remove one month? Is the effect size big enough to matter?
  2. What else could explain this?
    What's one alternative explanation? What data would tell the difference? List 3, not just the one you like.
  3. Would I change my decision based on this?
    If the analysis were completely wrong, would my decision change? If "no, I'd do the same thing anyway" → why analyze?
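Question 1's "remove one month" check can be sketched as a leave-one-out pass over hypothetical monthly lift figures:

```python
# Hypothetical monthly lift numbers — one suspicious spike
monthly_lift = [0.4, 0.6, 3.1, 0.5, 0.7, 0.5]

def holds_without_each(values, threshold=0.8):
    """Re-average leaving one point out each time; False = fragile finding."""
    results = []
    for i in range(len(values)):
        rest = values[:i] + values[i + 1:]
        results.append(sum(rest) / len(rest) > threshold)
    return results

# If the conclusion flips when one month is removed, it's riding on noise
print(holds_without_each(monthly_lift))
```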
Common traps
Five ways data
analysis goes wrong
⚖️
Authority Bias
AI output looks professional → you assume it's correct. A polished chart of garbage data is still garbage.
Anchoring
AI returns 5 insights. The first one becomes "the answer" even though the order is arbitrary.
📊
Big Data Fallacy
1M data points doesn't mean the pattern is real. Large datasets make noise look like signal.
🔗
Correlation ≠ Causation
"High engagement correlates with high spend" — could be either direction, a third variable, or survivorship bias.
🔬
Exploratory ≠ Confirmatory
Found a pattern? That's a hypothesis to test — not a conclusion to present. Label your work.
How to trust AI data output · 1 of 3
Dashboard Mirroring
"Trust the graph, not the chat"
📊
Have AI design a dashboard in your actual BI tool: Tableau, Looker, Google Data Studio, even Excel pivot charts. AI walks you through the setup: fields, aggregation, visualization type.
If the same graph appears in the non-AI tool → trust the data. The dashboard pulls from your real data source. Same result = verified result.
👥
The dashboard lives past your Claude session. Anyone on the team can reference it. This is Phase 3: codify the methodology into something shared.
How to trust AI data output · 2 of 3
Code, Not Prose +
Show Your Work

💻 Programmatic enforcement

  • Have Claude write code (Python, R, Excel formulas) instead of "telling you" the answer
  • Code is inspectable, reproducible, and testable with known values
  • "I can read the formula. I can't read Claude's thinking."

📓 METHODOLOGY.md

  • Document every query, transformation, join, and assumption
  • Like a lab notebook — if you can't reproduce it, the result isn't science
  • Advanced: Claude Code hook auto-logs methodology (enforced, not hoped for)
Good topic for Kyle's office hours
How to trust AI data output · 3 of 3
Multi-Agent, Spot-Check
& Observability

🤖 Multi-agent verification

  • One session to analyze, another to critique
  • Ask Claude to attack the work, not confirm it
  • The demo's adversarial pattern — formalized

🧪 Spot-check with known values

  • Insert records with known outcomes before analysis
  • If Claude finds those correctly → confidence in the rest
  • Like a control group in an experiment
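A minimal sketch of the canary idea, with hypothetical rows and known ROI values:

```python
import pandas as pd

# Hypothetical canary rows with outcomes we already know
canaries = pd.DataFrame({
    "campaign": ["__canary_a__", "__canary_b__"],
    "spend":    [1_000, 2_000],
    "revenue":  [3_000, 2_000],   # known ROI: 3.0x and 1.0x
})

def check_roi(result: pd.DataFrame) -> bool:
    """Trust the rest of the analysis only if the canaries come back right."""
    got = result.set_index("campaign")["roi"]
    return bool(got["__canary_a__"] == 3.0 and got["__canary_b__"] == 1.0)

# Simulate an analysis step (computed by us here; by the AI in practice)
out = canaries.assign(roi=canaries["revenue"] / canaries["spend"])
print(check_roi(out))  # True → proceed; False → stop and inspect
```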

📡 Continuous observability

  • 89% of orgs with AI agents have observability
  • Track drift, anomalies, refusal patterns
  • Different answers on different days = something changed
Data safety · Working without agreements
Four practical approaches
1. Synthetic data first ⭐

Describe your dataset → Claude generates realistic fake rows. Build and test your workflow on synthetic data. Swap real data when agreements are in place.

75% of data in AI projects will be synthetic by 2026 — Gartner
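A minimal sketch of what "describe your dataset → fake rows" might produce; the schema and value ranges here are hypothetical placeholders:

```python
import random

random.seed(7)  # reproducible fake data

# Hypothetical schema mirrored from the real export — values are invented
CHANNELS = ["email", "paid_social", "seo"]

def synthetic_rows(n: int) -> list[dict]:
    """Generate realistic-looking campaign rows; no client data involved."""
    rows = []
    for i in range(n):
        impressions = random.randint(1_000, 50_000)
        rows.append({
            "campaign_id": f"C{i:04d}",
            "channel": random.choice(CHANNELS),
            "impressions": impressions,
            "clicks": int(impressions * random.uniform(0.005, 0.06)),
        })
    return rows

sample = synthetic_rows(100)
print(sample[0])
```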
2. Code-first / Formula-first

Claude writes analysis code using column references. You run the code on real data locally — Claude never sees the data. "Claude builds the microscope; you look through it."
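A sketch of the pattern, with hypothetical column names; only the code crosses the boundary, never the data:

```python
import pandas as pd

# AI writes this against column NAMES only — it never sees the rows.
def channel_roi(df: pd.DataFrame) -> pd.DataFrame:
    """Per-channel ROI = revenue / spend, computed on YOUR machine."""
    g = df.groupby("channel")[["spend", "revenue"]].sum()
    g["roi"] = g["revenue"] / g["spend"]
    return g.sort_values("roi", ascending=False)

# You run it locally, e.g. channel_roi(pd.read_csv("client_export.csv")).
# Stand-in frame here so the sketch is self-contained:
local = pd.DataFrame({
    "channel": ["email", "email", "paid_social"],
    "spend":   [100, 150, 500],
    "revenue": [400, 350, 900],
})
print(channel_roi(local))
```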

3. Aggregation + anonymization

Use aggregated metrics instead of individual records. Strip identifying info, keep the patterns.

4. Get the agreement signed

Pitch the ROI. Enterprise accounts only — data NOT used for training, contractual protections, ZDR available.

Rule: Client data = enterprise account only. No exceptions. Consumer plans: data may be used for training.
Cowork sandbox · Deep dive
Four layers of protection
Think of it like a building with four security checkpoints.

🖥️ Layer 1 · Separate computer

  • Claude doesn't run on your Mac — it boots a mini virtual computer inside your computer
  • Even if something goes wrong, it can't touch your real machine
  • Same technology that powers Docker
VM isolation · Apple VZVirtualMachine · hardware-level sandboxing

🔒 Layer 2 · Locked room

  • Inside that virtual computer, Claude is in a locked room — it can only do pre-approved actions
  • Tries to do something unauthorized? Blocked before it even starts
  • Every conversation gets its own separate room
Process sandbox · bubblewrap + seccomp · kernel-level restrictions

📁 Layer 3 · You choose the filing cabinet

  • By default Claude can see nothing on your machine
  • You explicitly hand it specific folders — it can only open those
  • Can't browse around, can't go up to parent folders
Filesystem whitelist · default-deny · mounted directories only

🌐 Layer 4 · Approved websites only

  • Claude can only visit a pre-approved list of websites
  • A security guard checks every outbound request — blocks anything not on the list
  • Login credentials are controlled too — only services you've allowed
Network allowlist · SOCKS5 proxy · domain-level filtering
Cowork sandbox · Transparency
What's NOT solved yet
⚠️ Indirect exfiltration via prompt injection

Researchers discovered that attackers can construct prompts that use the Anthropic API endpoint (on the allowlist) to exfiltrate files to attacker-controlled accounts.

Prompt Armor, Jan 14 2026 · Simon Willison · Anthropic

✅ What the sandbox does

  • Files you don't share are invisible
  • Destructive actions require permission
  • Network traffic is restricted

⚠️ Practical guidance

  • Don't grant access to credential folders
  • Monitor for suspicious actions
  • Strong safety net — not a magic shield
Skills callback · From Session 1f
/performance-report
Raw data → structured executive report in ~28 seconds
Marketing Plugin v1.1.0 8-section output Reusable skill View on GitHub ↗
Instead of re-explaining your analysis process every time, the skill encodes best practices.
⏭ Skippable · Reference slide
/performance-report
8-Section Output
1 · Executive Summary
2 · Key Metrics Dashboard
3 · Trend Analysis
4 · What Worked
5 · What Needs Improvement
6 · Insights & Observations
7 · Recommendations
8 · Next Period Focus
Channels covered: Email · Social · Paid (Search + Social) · SEO · Content · Pipeline
Underlying methodology: Attribution modeling (last-touch, first-touch, linear, U-shaped, data-driven), forecasting, optimization cadence (daily → quarterly). Consider building your own /data-analyst skill — good project for Kyle's office hours.
Your process · Phases 1–2
Start specific. Iterate.

1️⃣ Pick a specific direction

  • "What patterns are in our Q4 data?" → too broad
  • Start with a hypothesis, go depth-first
  • Write question + hypothesis BEFORE starting
"I think email open rates are declining because subject lines are stale, not audience fatigue. If open rates are steady for new subscribers but dropping for 6+ month subscribers, that points to fatigue instead, and my hypothesis is wrong."

2️⃣ Iterate to a conclusion

  • First pass: rough results, observe what Claude gets wrong
  • Second pass: clarify your question based on what you learned
  • Third pass: consistent results — if not, question needs more work
⚠ When Claude gets something wrong, don't correct mid-session. Restart with a better question.
Your process · Phases 3–4 + Cross-domain
Verify, Share &
Borrow from Experts

3️⃣ Verify

  • Run the Three Questions
  • Adversarial pass
  • Check methodology
  • Colleague reviews methodology (not just output)

4️⃣ Share

  • Static HTML page with data + JS viz
  • Reusable skill (1f callback)
  • Dashboard in your BI tool
  • Overlay events onto timelines

🧠 Cross-domain safeguards

  • Intel: List 3 competing explanations (ACH)
  • Science: Label exploratory vs. confirmatory
  • Actuarial: Confidence intervals + sensitivity analysis
  • Math: Where does the pattern break?
Know your limits
When to call an analyst
Red flags that mean "stop and bring in an expert":
🚩
You can't state your assumptions clearly. If you can't articulate what you're assuming, you can't evaluate the output.
🔍
Effect size is tiny but "feels" important. May be noise amplified by a large dataset.
Finding contradicts expectations — and you can't explain why. There's a confound you're not seeing.
📉
Dataset has >5% missing values or quality issues. Garbage in, garbage out — no amount of AI fixes bad data.
🏢
Need to defend in C-suite — can't explain methodology in 2 min. If you can't explain it simply, you don't understand it well enough to present it.
If 2+ of these apply: the Three Questions framework is telling you to get help, not push through.
Session 2a · Closing
Your judgment is the bottleneck.
That's the point.
AI handles the grunt work. You handle the thinking.
📋 Three Questions
  1. Real or noise?
  2. What else could explain this?
  3. Would I change my decision?
🔄 The Process
  1. Start with a decision, not a dataset
  2. Depth-first on 1-2 hypotheses
  3. Challenge the first answer
  4. Adversarial pass
  5. Label: exploratory vs. confirmatory
🔒 Data Safety
  1. Synthetic data first
  2. Code-first / formula-first
  3. Aggregate + anonymize
  4. Enterprise accounts only
🚩 Call an Analyst
  · Can't state assumptions
  · Tiny effect "feels" important
  · Contradicts expectations
  · Need C-suite defense
Day 2a · Complete
What's next
Kyle's role-specific hands-on office hours — build automation for a high-impact use case using skills.
1a
Level Setting
1b
Critical Thinking
1c–1f
Plan · Security · Context · Skills
NOW
2a
Data Exploration
Office Hours
Kyle · Hands-on
📎
2b+
Remaining Sessions