It's verification and process. Same principle as 1b — except numbers feel more trustworthy than words. That makes data hallucinations more dangerous, not less.
For data work, context = your dataset + your question + your business knowledge. Missing any one → bad output.
Beyond client analysis
Three uses you might not expect
💰 Make the business case
You KNOW a process is wasteful, but you don't have time to build the argument
Claude does the math, shows assumptions, you adjust inputs
Data-backed pitch in 2 min instead of "trust me"
🎯 Prioritize your work
"Which of my 5 projects has the highest impact/effort ratio?"
Upload a simple spreadsheet with your estimates
AI makes your thinking visible and challengeable
🔍 Understand unfamiliar data
Client sends a dataset you've never seen
"Describe this dataset. What are the quality issues?"
"What questions could I answer — and what needs more data?"
The rule: Always show your assumptions. If you can't state your three biggest assumptions, don't present the estimate.
Data in / Data out · Silent breakage
Where things break under the hood
These don't throw errors. You get clean-looking results that are just... wrong.
📅
Dates read as text
What happens: Claude reads "1/2/2025" — is that Jan 2 or Feb 1? Sorts alphabetically, so "12/1" comes before "2/1".
Your time series is scrambled. Trend line is nonsense.
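A minimal sketch of the failure, using Python's standard library with made-up dates (the specific values are illustrative, not from the slide's data):

```python
from datetime import datetime

dates_as_text = ["12/1/2025", "2/1/2025", "1/2/2025"]

# Alphabetical sort: "1/2/2025" lands first and "2/1/2025" last. Scrambled.
text_order = sorted(dates_as_text)

# Parsing with an explicit format (here: month/day/year) restores real order
real_order = sorted(datetime.strptime(d, "%m/%d/%Y") for d in dates_as_text)
```

The fix is always the same: state the date format explicitly instead of letting the tool guess.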
⬜
Blanks → zero or missing?
What happens: Blank cell = "no conversions" or "not tracked that day"? Claude guesses. Wrong guess → averages off 20-30%.
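A toy example of how much the guess matters, assuming pandas and invented conversion counts: skipping the blank gives one average, treating it as zero gives another, 25% lower.

```python
import pandas as pd

conversions = pd.Series([10, 12, None, 11])  # one blank day in the export

mean_if_missing = conversions.mean()         # NaN skipped: (10+12+11)/3 = 11.0
mean_if_zero = conversions.fillna(0).mean()  # blank read as 0: 33/4 = 8.25
```

Neither answer is "wrong" in general; which one is right depends on what the blank meant, and only you know that.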
🔁
Duplicate rows
What happens: HubSpot, Salesforce, GA all produce duplicate rows in exports. Claude counts them all.
"Total leads: 1,247" Actual: 940. Inflated 33%.
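A sketch of the dedupe check in pandas, with a hypothetical five-row export (the emails and sources are invented):

```python
import pandas as pd

# CRM exports often repeat the same lead across rows
leads = pd.DataFrame({
    "email":  ["a@x.com", "b@x.com", "a@x.com", "c@x.com", "b@x.com"],
    "source": ["ads", "email", "ads", "organic", "email"],
})

raw_count = len(leads)                                   # 5: duplicates counted
true_count = len(leads.drop_duplicates(subset="email"))  # 3 unique leads
```

One-line insurance: always ask "how many rows are duplicates?" before trusting any total.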
➕
Summing percentages
What happens: Claude sees a CTR column and sums it instead of taking a weighted average. Presents "total CTR: 247%" with confidence.
Should be: 3.8% weighted avg. You'd catch this — would a PM?
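The two computations side by side in pandas, with made-up campaign numbers (smaller than the slide's, same shape of error):

```python
import pandas as pd

campaigns = pd.DataFrame({
    "impressions": [100_000, 50_000, 10_000],
    "clicks":      [3_000,   2_500,    900],
})
ctr = campaigns["clicks"] / campaigns["impressions"]  # 3%, 5%, 9% per campaign

wrong_total = ctr.sum()  # 17% "total CTR": percentages don't add
weighted_ctr = campaigns["clicks"].sum() / campaigns["impressions"].sum()
# 6,400 clicks / 160,000 impressions = 4% weighted CTR
```

Rule of thumb: rates get re-derived from the underlying counts, never summed.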
🔗
Silent join drops
What happens: Two datasets spell the client name differently ("Acme Corp" vs "ACME"). Claude drops unmatched rows silently.
Lost 15% of your data. Totals just... don't add up.
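A sketch of catching the drop with pandas' `merge`, using the slide's "Acme Corp" vs "ACME" mismatch and invented deal/revenue figures:

```python
import pandas as pd

crm     = pd.DataFrame({"client": ["Acme Corp", "Globex", "Initech"], "deals": [4, 2, 7]})
billing = pd.DataFrame({"client": ["ACME", "Globex", "Initech"], "revenue": [90, 40, 120]})

# Inner join: "Acme Corp" and "ACME" never match, so the row vanishes silently
joined = crm.merge(billing, on="client")

# indicator=True on an outer join makes the dropped rows visible
audit = crm.merge(billing, on="client", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]
```

Asking "which rows failed to match?" after every join turns a silent loss into a visible list.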
🪟
Wrong time window
What happens: You say "Q4 performance." Claude uses calendar Q4. Your client's fiscal Q4 is Feb–Apr. Entirely different data.
Right analysis, completely wrong quarter.
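A sketch of filtering both windows explicitly in pandas, assuming the slide's Feb–Apr fiscal Q4 and a handful of invented spend rows:

```python
import pandas as pd

perf = pd.DataFrame({
    "date":  pd.to_datetime(["2025-11-15", "2025-12-10",
                             "2026-02-05", "2026-03-20", "2026-04-12"]),
    "spend": [100, 200, 300, 400, 500],
})

calendar_q4 = perf[perf["date"].dt.quarter == 4]                      # Oct-Dec
fiscal_q4   = perf[perf["date"].between("2026-02-01", "2026-04-30")]  # client's Feb-Apr
```

Spelling out the start and end dates in the prompt (or the code) is the cheapest way to make sure you and Claude mean the same quarter.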
None of these throw an error. That's what makes them dangerous. The output looks clean, professional, and definitive — while the underlying data has been silently mangled.
Even at a 0.7% hallucination rate, roughly 1 in 140 claims is wrong. Simple math is nearly perfect, but multi-step analysis is where errors compound: each step multiplies the risk.
Capability benchmarks · 2024–2026
It's not just hallucination — everything is improving
Three independent benchmarks. Same timeline. Same trajectory.
SimpleQA — Factual Accuracy · SWE-Bench — Real-World Coding · GDPVal — Professional Knowledge
Benchmarks show the ceiling. Your results depend on you.
Think of benchmarks like standardized tests for AI — the SAT, but for language models.
📊 How AI scores on standardized tests
>85% accuracy on graduate-level science questions (PhD-level physics, biology, chemistry)↗
>90% accuracy on general knowledge across 57 subjects (accounting to world religions — everything)
53% → 23% error rate with prompt-based mitigation (adding skepticism instructions to the prompt)↗
⚠️ The "leaked test" problem
Imagine a student gets the exact exam questions leaked before the test. They ace it — but give them a new test on the same material and their score drops.
That's what's happening with AI. Models may have "seen" benchmark questions during training. When researchers create brand new questions the models haven't seen:
Scores drop up to 13%
Published scores are the best case. Worst-case drop seen in smaller models; frontier models less affected.↗
Live Demo · 10 minutes
Decision-Making With AI
Watch how asking better questions changes the answer.
Setup: 3 months of campaign performance data across 4 clients. Client X is asking which channel deserves more budget next quarter.
Live demo · Steps 1–2
The Naive Ask → The Challenge
1️⃣ The Naive Ask
"What was the ROI of each channel for Client X?"
Claude returns clean numbers. Looks definitive — numbers, percentages, a clear winner.
⚠ But is it the right answer?
2️⃣ The Challenge
"Paid social ran heavier in December. Email ran heavier in January. Control for seasonal baselines."
Claude re-analyzes, adjusts for seasonality. Different result.
✓ Caught a confound in 30 seconds
Teaching moment: A data analyst would catch this in a 2-day review. We caught it in 30 seconds because we asked.
Live demo · Steps 3–4
Go Deeper → Verify
3️⃣ Going Deeper
"Paid social had higher ROI but we spent 3x more. What if equal budgets?"
Claude normalizes for budget → reveals true relative performance.
Same data, different conclusion.
4️⃣ The Verification Ask
"What are the 3 biggest assumptions? What would change the conclusion?"
Claude surfaces: seasonality assumption, budget scaling linearity, audience overlap between channels.
✓ Claude didn't volunteer these — I asked.
Live demo · Step 5
The Adversarial Pass
Prompt: "Pretend you're a skeptical data analyst reviewing this for the first time. What are the biggest weaknesses?"
🔬 Two modes of Claude
1️⃣
Analyze · Generate the findings
2️⃣
Challenge · Critique the findings
✓ The takeaway
This is adversarial verification in practice. You don't need to be technical. You just need to ask the right follow-up.
Three questions on a sticky note
Before trusting ANY AI-generated finding:
1.
Is this real or is this noise?
How many data points? Does the pattern hold if I remove one month? Is the effect size big enough to matter?
2.
What else could explain this?
One alternative explanation? What data would tell the difference? List 3, not just the one you like.
3.
Would I change my decision based on this?
If this were completely wrong, would the decision change? If you'd do the same thing anyway, why analyze?
Common traps
Five ways data analysis goes wrong
⚖️
Authority Bias
AI output looks professional → you assume it's correct. A polished chart of garbage data is still garbage.
⚓
Anchoring
AI returns 5 insights. The first one becomes "the answer" even though the order is arbitrary.
📊
Big Data Fallacy
A million data points don't make a pattern real. Large datasets make noise look like signal.
🔗
Correlation ≠ Causation
"High engagement correlates with high spend" — could be either direction, a third variable, or survivorship bias.
🔬
Exploratory ≠ Confirmatory
Found a pattern? That's a hypothesis to test — not a conclusion to present. Label your work.
How to trust AI data output · 1 of 3
Dashboard Mirroring
"Trust the graph, not the chat"
📊
Have AI design a dashboard in your actual BI tool
Tableau, Looker, Google Data Studio, even Excel pivot charts. AI walks you through the setup: fields, aggregation, visualization type.
✅
If the same graph appears in the non-AI tool → trust the data
The dashboard pulls from your real data source. Same result = verified result.
👥
The dashboard lives past your Claude session
Anyone on the team can reference it. This is Phase 3: codify the methodology into something shared.
How to trust AI data output · 2 of 3
Code, Not Prose + Show Your Work
💻 Programmatic enforcement
Have Claude write code (Python, R, Excel formulas) instead of "telling you" the answer
Code is inspectable, reproducible, and testable with known values
"I can read the formula. I can't read Claude's thinking."
📓 METHODOLOGY.md
Document every query, transformation, join, and assumption
Like a lab notebook — if you can't reproduce it, the result isn't science
Advanced: Claude Code hook auto-logs methodology (enforced, not hoped for)
Good topic for Kyle's office hours
How to trust AI data output · 3 of 3
Multi-Agent, Spot-Check & Observability
🤖 Multi-agent verification
One session to analyze, another to critique
Ask Claude to attack the work, not confirm it
The demo's adversarial pattern — formalized
🧪 Spot-check with known values
Insert records with known outcomes before analysis
If Claude finds those correctly → confidence in the rest
Different answers on different days = something changed
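A sketch of the sentinel pattern in pandas, with invented campaigns and a deliberately simple "analysis" (a CTR calculation) so the known answers are easy to verify:

```python
import pandas as pd

# Plant sentinel rows whose answers you already know before the analysis runs
sentinels = pd.DataFrame({
    "campaign":    ["SENTINEL_A", "SENTINEL_B"],
    "clicks":      [50, 0],
    "impressions": [1_000, 1_000],
})
real = pd.DataFrame({
    "campaign":    ["spring_sale", "fall_promo"],
    "clicks":      [30, 45],
    "impressions": [600, 500],
})
data = pd.concat([real, sentinels], ignore_index=True)

# After the analysis step, check that the sentinels came back correct
indexed = data.set_index("campaign")
ctr = indexed["clicks"] / indexed["impressions"]
assert ctr["SENTINEL_A"] == 0.05  # known: 50 / 1,000
assert ctr["SENTINEL_B"] == 0.0   # known: 0 / 1,000
```

If the sentinels come out right, the same pipeline probably handled the real rows right too; if they don't, stop trusting everything downstream.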
Data safety · Working without agreements
Four practical approaches
1. Synthetic data first ⭐
Describe your dataset → Claude generates realistic fake rows. Build and test your workflow on synthetic data. Swap real data when agreements are in place.
75% of data in AI projects will be synthetic by 2026 — Gartner
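A minimal stdlib sketch of generating stand-in rows: the schema (date, channel, spend, conversions) and value ranges here are invented placeholders you'd replace with a description of your real dataset.

```python
import csv
import io
import random

random.seed(7)  # reproducible fake rows

# Describe the real dataset's schema, then generate plausible stand-ins
channels = ["email", "paid_social", "paid_search", "organic"]
rows = [
    {
        "date": f"2025-{random.randint(1, 12):02d}-{random.randint(1, 28):02d}",
        "channel": random.choice(channels),
        "spend": round(random.uniform(100, 5000), 2),
        "conversions": random.randint(0, 120),
    }
    for _ in range(50)
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["date", "channel", "spend", "conversions"])
writer.writeheader()
writer.writerows(rows)
synthetic_csv = buf.getvalue()  # safe to share: no real client data inside
```

In practice you'd simply ask Claude to generate rows like these from a schema description; the point is that the whole workflow can be built and debugged before any real record leaves your machine.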
2. Code-first / Formula-first
Claude writes analysis code using column references. You run the code on real data locally — Claude never sees the data. "Claude builds the microscope; you look through it."
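A sketch of what that code-first handoff can look like: a hypothetical `roi_by_channel` function written purely against column names (`channel`, `revenue`, `spend` are assumed here), which you then run locally on the real file.

```python
import pandas as pd

def roi_by_channel(path: str) -> pd.Series:
    """Analysis written against column names only: whoever wrote this
    function never needs to see the rows inside the real file."""
    df = pd.read_csv(path)
    totals = df.groupby("channel")[["revenue", "spend"]].sum()
    return (totals["revenue"] - totals["spend"]) / totals["spend"]
```

Claude ships the function; you supply the path. The data stays on your side of the glass.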
3. Aggregation + anonymization
Use aggregated metrics instead of individual records. Strip identifying info, keep the patterns.
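A sketch of the pattern in pandas with invented rows: hash the identifier (pseudonymization, not full anonymization), drop the raw column, and share only the aggregate.

```python
import hashlib
import pandas as pd

raw = pd.DataFrame({
    "email":   ["a@client.com", "b@client.com", "a@client.com"],
    "channel": ["email", "paid", "email"],
    "revenue": [100, 250, 50],
})

# Pseudonymize the identifier, then drop the raw column entirely
raw["user"] = raw["email"].map(lambda e: hashlib.sha256(e.encode()).hexdigest()[:8])
safe = raw.drop(columns="email")

# Share only the aggregate: the pattern survives, the identities don't
summary = safe.groupby("channel")["revenue"].agg(["sum", "count"])
```

For most "which channel performs better" questions, the grouped summary is all the AI ever needed anyway.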
4. Get the agreement signed
Pitch the ROI. Enterprise accounts only — data NOT used for training, contractual protections, ZDR available.
Rule: Client data = enterprise account only. No exceptions. Consumer plans: data may be used for training.
Cowork sandbox · Deep dive
Four layers of protection
Think of it like a building with four security checkpoints.
🖥️ Layer 1 · Separate computer
Claude doesn't run on your Mac — it boots a mini virtual computer inside your computer
Even if something goes wrong, it can't touch your real machine
The same approach Docker Desktop uses to run containers on a Mac (a lightweight VM)
VM isolation · Apple VZVirtualMachine · hardware-level sandboxing
Researchers discovered that attackers can construct prompts that use the Anthropic API endpoint (on the allowlist) to exfiltrate files to attacker-controlled accounts.
Prompt Armor, Jan 14 2026↗ Simon Willison↗ Anthropic↗
✅ What the sandbox does
Files you don't share are invisible
Destructive actions require permission
Network traffic is restricted
⚠️ Practical guidance
Don't grant access to credential folders
Monitor for suspicious actions
Strong safety net — not a magic shield
Skills callback · From Session 1f
/performance-report
Raw data → structured executive report in ~28 seconds
Marketing Plugin v1.1.0 · 8-section output · Reusable skill · View on GitHub ↗
Instead of re-explaining your analysis process every time, the skill encodes best practices.
⏭ Skippable · Reference slide
/performance-report 8-Section Output
1. Executive Summary
2. Key Metrics Dashboard
3. Trend Analysis
4. What Worked
5. What Needs Improvement
6. Insights & Observations
7. Recommendations
8. Next Period Focus
Channels covered: Email · Social · Paid (Search + Social) · SEO · Content · Pipeline
Underlying methodology: Attribution modeling (last-touch, first-touch, linear, U-shaped, data-driven), forecasting, optimization cadence (daily → quarterly). Consider building your own /data-analyst skill — good project for Kyle's office hours.
Your process · Phases 1–2
Start specific. Iterate.
1️⃣ Pick a specific direction
"What patterns are in our Q4 data?" → too broad
Start with a hypothesis, go depth-first
Write question + hypothesis BEFORE starting
"I think email open rates are declining because subject lines are stale, not audience fatigue. If open rates are steady for new subscribers but dropping for 6+ month subscribers, that points to fatigue and falsifies my hypothesis."
2️⃣ Iterate to a conclusion
First pass: rough results, observe what Claude gets wrong
Second pass: clarify your question based on what you learned
Third pass: consistent results — if not, question needs more work
⚠ When Claude gets something wrong, don't correct mid-session. Restart with a better question.