Most AI SOC demos are scripted against scripted data

Theo H. · Security Researcher & Systems Thinker

AI in Security Operations · June 15, 2026 · 11 min read

A detection engineer's take on why the AI SOC demo always looks clean, and what to do about it. Theo Hartley breaks down the six incentives that make curated demos the rational default, why POC numbers don't survive contact with production data, and how to run an evaluation the vendor can't script against.

I've spent years looking at how detection and response systems behave when you pull them out of controlled environments and put them in front of real data. The gap between designed performance and live performance is a systems problem, and it shows up in every layer of the stack: in SIEM ingestion pipelines, in EDR rulesets, in orchestration logic.

The AI SOC demo format produces a version of that same gap, and it does so through incentives that are entirely rational rather than dishonest.

That's what I kept hearing when I spent the first half of this year talking to SOC teams evaluating these tools. The demo looked great in the room. The vendor's investigation ran clean, the alert queue emptied in minutes, and the coverage claims mapped tidily to MITRE ATT&CK. When I asked what happened after the team connected its own telemetry, the story usually shifted.

That gap between the demo room and the production console is the throughline of this piece. The core claims are worth surfacing up front.

In brief:

AI SOC demos run on curated telemetry and pre-scripted attack scenarios, so every metric reflects best-case conditions the tool was built to handle.
The demo format rewards curated data because vendors fund pre-sale engineering at their own expense, with no revenue guarantee.
Scripted scenarios test known tactics, techniques, and procedures (TTPs) while excluding the novel, ambiguous attacks that cause the most damage.
Most AI SOC pilots produce no measurable improvement unless they're structured around clear baselines and success criteria from the start.

From those claims, the next few sections work outward in order: first the demo's controlled conditions, then the incentives that produce them, then the scenarios layered on top, and finally what a buyer can do about it.

An AI SOC demo shows the tool's best day on data it already knows

The capability featured most prominently in every AI SOC demo I've observed — the autonomous investigation workflow — is also the capability least reliable under live conditions. That's the structural tension worth naming at the start, because the demo format is designed to hide it.

The standard demo runs on a vendor-controlled environment with pre-loaded telemetry and a set of canned attack scenarios, where the alerts are clean and the attack path is linear.

Once real client data hits the pipeline, three engineering problems tend to emerge that this clean environment conceals: inconsistent terminology and ambiguous field names that need normalization, the requirement for dedicated quality-assurance infrastructure, and the need for specialized rather than general-purpose agents.

The demo environment was built so that none of these surface, which is why none of them do.
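To make the first of those problems concrete, here is a minimal sketch of field-name normalization. The alias table, the canonical field names, and the `normalize_event` helper are all assumptions made for illustration and are not taken from any vendor's schema. A real pipeline would need a much larger mapping, and the unmapped fields are usually where the work is.

```python
# Hypothetical alias table: canonical field -> spellings seen across log sources.
FIELD_ALIASES = {
    "src_ip": {"src_ip", "source_ip", "sourceaddress", "srcaddr", "client_ip"},
    "user": {"user", "username", "user_name", "accountname", "subjectusername"},
    "host": {"host", "hostname", "computername", "device_name"},
}

# Invert the table once so each lookup is a single dict access.
ALIAS_TO_CANONICAL = {
    alias: canonical
    for canonical, aliases in FIELD_ALIASES.items()
    for alias in aliases
}


def normalize_event(raw):
    """Split one raw event into (normalized, unmapped) dicts.

    Keys are matched case-insensitively. If two raw keys map to the same
    canonical field, the first one seen wins and the later one is dropped.
    """
    normalized = {}
    unmapped = {}
    for key, value in raw.items():
        canonical = ALIAS_TO_CANONICAL.get(key.strip().lower())
        if canonical is None:
            # Ambiguous or unknown field: keep it visible instead of guessing.
            unmapped[key] = value
        elif canonical not in normalized:
            normalized[canonical] = value
    return normalized, unmapped


if __name__ == "__main__":
    event = {"SourceAddress": "10.0.0.5", "AccountName": "svc-backup", "evt_cd": 4624}
    print(normalize_event(event))
```

Returning the unmapped fields separately is a deliberate choice in this sketch: on demo data that dictionary stays empty, and on production data its size is a rough measure of how much normalization work the clean environment concealed.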

Curated data is what the demo format rewards

If none of those problems surface in demos, the next question is why vendors so consistently engineer the conditions that keep them hidden. Having watched how these tools are built and integrated, my read is that curated demos become the rational default through six incentives that require no malice to explain — and each one rewards curated data independently of the others.

The products genuinely perform better on clean data: AI SOC tools are architecturally tuned for complete telemetry and linear attack paths. Showing them on messy production data is closer to test-driving a car on a potholed road, and vendors reasonably show the road where the car drives well.
Vendors fund POC engineering at their own expense, with no revenue guarantee: A reusable demo environment with scripted scenarios can help spread proof-of-concept costs across multiple prospects, while bespoke live-environment integration for each prospect doesn't amortize.
Practitioners consistently prioritize fewer false positives over catching more true positives: Pre-tuning demo scenarios against known attacks eliminates the false positive failure mode entirely, because the attack was designed to be detectable by that specific product.
Rebranding existing automation as agentic AI, which Gartner describes as agent washing, affects vendor claims across the category: A scripted demo against known attack paths is the format in which a rebranded product can most easily appear to fulfill the agentic claim.
When a tool falls short in production, vendors have a direct incentive to frame the problem as buyer readiness rather than product capability: The demo remains the authoritative capability reference regardless of what happens next.
Canned demos eliminate the failure mode around handling production data volumes: A vendor whose product degrades under production data volumes has a financial incentive to keep evaluations on controlled datasets.

Stack those incentives together and they all point the same way. A vendor acting rationally within them will keep evaluations on curated data, and the demo format will keep producing the same shape of result.

A scripted scenario tests known TTPs while novel attacks stay outside the test

Curated data is one half of what the demo format rewards; scripted attack scenarios are the other. Vendors build demo scenarios around historically common techniques, which lets them claim broad real-world relevance against MITRE ATT&CK methodology while leaving most of the technique space untested.

The shape of that untested space looks a lot like the gaps already present in production tooling. Production endpoint detection and response (EDR) coverage is itself partial against ATT&CK: EDR rulesets cover only 48 to 55% of ATT&CK techniques, with 27.7% of techniques having no corresponding detection rule in any of the three major rulesets studied, according to a USENIX Security 2024 analysis.

Production SIEM coverage shows the same pattern, and detection engineering rather than collection is the limiting factor — enterprises typically ingest enough telemetry to cover most of ATT&CK but write detections against only a fraction of it.

Inside even the techniques that are covered, the same selectivity recurs. A vendor can claim T1003 (OS Credential Dumping) coverage while its detection targets only one procedure variant, which leaves many real-world execution paths undetected while the box is checked. MITRE's own design documentation makes clear that ATT&CK should not be treated as a completeness checklist.
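The difference between a checked box and real coverage is easy to compute once test results are recorded per procedure variant instead of per technique. The sketch below is illustrative only: `coverage_report` is a hypothetical helper, and the variant labels are placeholders, not real ATT&CK procedure names.

```python
from collections import defaultdict


def coverage_report(tested, detected):
    """Compare technique-level and variant-level coverage.

    tested and detected are iterables of (technique_id, variant) pairs:
    what was executed in the evaluation, and what the tool detected.
    """
    tested_by_tech = defaultdict(set)
    for technique, variant in tested:
        tested_by_tech[technique].add(variant)

    detected_by_tech = defaultdict(set)
    for technique, variant in detected:
        detected_by_tech[technique].add(variant)

    report = {}
    for technique, variants in tested_by_tech.items():
        hits = variants & detected_by_tech.get(technique, set())
        report[technique] = {
            # What a coverage matrix shows: at least one variant was detected.
            "box_checked": bool(hits),
            # What the buyer needs: the share of tested variants detected.
            "variant_coverage": len(hits) / len(variants),
            "missed": sorted(variants - hits),
        }
    return report


if __name__ == "__main__":
    # Placeholder variant labels, for illustration only.
    tested = [("T1003", v) for v in ("variant_a", "variant_b", "variant_c", "variant_d")]
    detected = [("T1003", "variant_a")]
    print(coverage_report(tested, detected))
```

With one of four tested variants detected, `box_checked` is `True` while `variant_coverage` is 0.25, which is the gap a technique-level coverage claim hides.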

The demo's numbers come from POCs

The same selectivity that shapes scenario coverage also shapes the numbers a buyer carries out of the demo. POC-derived numbers, collected under controlled conditions, often get carried forward into purchase decisions as if they were production benchmarks. That's the underlying reason the piloting-to-value gap persists, and it's the part of the evaluation process I'd push hardest on redesigning.

Gartner makes that distinction explicit. Its October 2025 report, "Validate the Promises of AI SOC Agents With These Key Questions" (Craig Lawson and Andrew Davies, via BleepingComputer), instructs buyers to ask whether benchmarks were collected during a proof of concept or in sustained production use. The report also warns that most large SOCs will pilot AI agents in the coming years without seeing measurable improvements unless the pilots are structured around clear baselines.

That demand for baselines is also where practitioner voices have converged. Anton Chuvakin's RSA 2026 commentary was direct about what buyers should demand from vendors: specific, falsifiable performance claims tied to conditions and baselines. That same critique surfaces repeatedly in conversations with practitioners who have gone through the evaluation cycle — the demo looked good, the POC passed, and the production numbers told a different story.
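One way to make a claim falsifiable is to write the baseline and the target down before the pilot starts, then compare mechanically afterward. The following is a sketch under stated assumptions: the `Criterion` structure, the metric names, and every number are hypothetical and stand in for whatever a team measures in its own SOC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One success criterion, fixed before the pilot begins."""
    metric: str
    baseline: float  # measured in the buyer's environment before the pilot
    target: float    # agreed threshold for calling the pilot a success
    lower_is_better: bool = True


def evaluate_pilot(criteria, observed):
    """Label each criterion using the metrics observed during the pilot."""
    results = {}
    for c in criteria:
        value = observed.get(c.metric)
        if value is None:
            # A metric nobody measured cannot support a claim either way.
            results[c.metric] = "not measured"
            continue
        if c.lower_is_better:
            met, improved = value <= c.target, value < c.baseline
        else:
            met, improved = value >= c.target, value > c.baseline
        if met:
            results[c.metric] = "met"
        elif improved:
            results[c.metric] = "improved, target missed"
        else:
            results[c.metric] = "no improvement"
    return results


if __name__ == "__main__":
    # All numbers below are made up for illustration.
    criteria = [
        Criterion("median_minutes_to_triage", baseline=45.0, target=20.0),
        Criterion("miss_rate", baseline=0.30, target=0.10),
        Criterion("true_positive_rate", baseline=0.70, target=0.90, lower_is_better=False),
    ]
    observed = {"median_minutes_to_triage": 18.0, "miss_rate": 0.25}
    print(evaluate_pilot(criteria, observed))
```

The "not measured" label matters as much as the other three: a pilot that never collected a metric has no basis for claiming improvement on it.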

How to make an AI SOC vendor demo unscriptable

If demo-derived numbers are unreliable, the practical question is how a buyer changes the conditions they were collected under. Two constraints carry most of the weight: the data the tool sees, and the way the results are scored. Each constraint a buyer adds removes a variable the vendor controls and replaces it with one from the buyer's environment.

Bring your own telemetry and a window the vendor hasn't seen

The first constraint takes the data away from the vendor. Use your real alert sources in the evaluation and leave the vendor's demo environment out of scope. Before any vendor touches your infrastructure, identify three to five cases from the last 90 days where activity was detected late or wasn't surfaced at all.

Those cases become your unscripted test scenarios, and because they come from your own incident response record, the vendor cannot have pre-tuned for them. Layer on your own red-team exercises using tools like MITRE CALDERA or Red Canary's Atomic Red Team library. If the vendor proposes the test cases, they have pre-tuned for them. If you need to verify what this looks like after deployment, the audit-trail question is covered in our piece on AI SOC auditability.
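Pulling those three to five cases out of an incident record can be as simple as a filter and a sort. The sketch below assumes a hypothetical record format (`id`, `occurred_on`, `hours_to_detect`) and a 24-hour cutoff for what counts as late. Both are illustrative choices and should be replaced with your own fields and your own definition of late.

```python
from datetime import date, timedelta


def select_test_cases(incidents, today, window_days=90, late_after_hours=24.0, limit=5):
    """Pick recent incidents that were detected late or never surfaced.

    Each incident is a dict with 'id', 'occurred_on' (a date) and
    'hours_to_detect' (a float, or None if tooling never surfaced it).
    """
    cutoff = today - timedelta(days=window_days)
    candidates = []
    for incident in incidents:
        if incident["occurred_on"] < cutoff:
            continue  # outside the evaluation window
        delay = incident.get("hours_to_detect")
        if delay is None or delay > late_after_hours:
            candidates.append(incident)

    # Never-surfaced cases first, then the longest detection delays.
    candidates.sort(
        key=lambda i: (
            i.get("hours_to_detect") is not None,
            -(i.get("hours_to_detect") or 0.0),
        )
    )
    return candidates[:limit]


if __name__ == "__main__":
    # Made-up incident records for illustration.
    incidents = [
        {"id": "A", "occurred_on": date(2026, 6, 1), "hours_to_detect": None},
        {"id": "B", "occurred_on": date(2026, 5, 10), "hours_to_detect": 72.0},
        {"id": "C", "occurred_on": date(2026, 5, 20), "hours_to_detect": 2.0},
        {"id": "D", "occurred_on": date(2026, 1, 1), "hours_to_detect": None},
    ]
    for case in select_test_cases(incidents, today=date(2026, 6, 15)):
        print(case["id"])
```

Fixing this list before the vendor sees your environment, and not sharing it until the evaluation runs, is what keeps the scenarios unscripted.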

Score the misses and unknowns first

Changing the data is only half the work; the scoring rubric matters just as much. The most informative SOC metrics in an AI evaluation are true positives, false negatives, and incorrect case closures. Weight your scoring toward the miss rate by asking what the tool failed to detect from your injected scenarios, and what determinations changed after human review.

Once miss rate is the primary lens, the false positive numbers vendors lead with start to look different. False positives often originate from vendor-provided rules, so a vendor demonstrating low false positive rates against their own content library is showing a controlled result that is separate from how the tool will behave against your custom detection logic in production.
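A miss-first rubric can be sketched in a few lines. The record format here is an assumption: each scenario carries a ground-truth flag, the tool's verdict (`None` meaning the tool never surfaced it), and the verdict after human review. `score_evaluation` is a hypothetical helper, not part of any product.

```python
def score_evaluation(scenarios):
    """Summarize an evaluation with misses reported ahead of false positives.

    Each scenario is a dict with:
      'is_malicious'     ground truth from the buyer (bool)
      'tool_verdict'     'malicious', 'benign', or None if never surfaced
      'reviewed_verdict' 'malicious' or 'benign' after human review
    """
    malicious = [s for s in scenarios if s["is_malicious"]]
    true_positives = sum(1 for s in malicious if s["tool_verdict"] == "malicious")
    # A miss is any malicious scenario the tool did not call malicious.
    false_negatives = len(malicious) - true_positives
    # Incorrect closure: the tool saw the activity and closed it as benign.
    incorrect_closures = sum(1 for s in malicious if s["tool_verdict"] == "benign")
    # Determinations that human review overturned.
    changed_after_review = sum(
        1
        for s in scenarios
        if s["tool_verdict"] is not None and s["tool_verdict"] != s["reviewed_verdict"]
    )
    false_positives = sum(
        1 for s in scenarios if not s["is_malicious"] and s["tool_verdict"] == "malicious"
    )
    return {
        "miss_rate": false_negatives / len(malicious) if malicious else None,
        "false_negatives": false_negatives,
        "incorrect_closures": incorrect_closures,
        "changed_after_review": changed_after_review,
        "true_positives": true_positives,
        "false_positives": false_positives,
    }


if __name__ == "__main__":
    # Made-up results for illustration.
    results = [
        {"is_malicious": True, "tool_verdict": "malicious", "reviewed_verdict": "malicious"},
        {"is_malicious": True, "tool_verdict": None, "reviewed_verdict": "malicious"},
        {"is_malicious": True, "tool_verdict": "benign", "reviewed_verdict": "malicious"},
        {"is_malicious": False, "tool_verdict": "malicious", "reviewed_verdict": "benign"},
    ]
    print(score_evaluation(results))
```

Keeping incorrect closures separate from scenarios that were never surfaced is a design choice worth preserving in any real rubric: the first is a reasoning failure and the second is a visibility failure, and they call for different follow-up questions to the vendor.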

A scripted demo only proves a vendor can pass its own test

That separation between vendor-controlled and buyer-controlled scoring is the heart of the argument. A scripted demo establishes only that the vendor can pass the test it wrote for itself. The more useful question for the buyer is whether the same tool can pass the buyer's test: on the buyer's data, against attacks it hasn't rehearsed, and scored against criteria the buyer set. Only when the buyer controls the telemetry, scenarios, and scoring criteria should the exercise be treated as a product evaluation.

The practitioner signal I've tracked lines up with that conclusion. Pilots can validate a narrative without translating into broader production use, and Chuvakin's RSA 2026 peer discussion surfaced what he described as very limited enthusiasm about AI SOC startups from practitioners who had specifically shown up to discuss the topic. The strongest positive reaction was a preference to wait for an existing SIEM, SOAR, or Managed Detection and Response (MDR) vendor to build the capability natively.

That skepticism tracks with the structural observation underneath this piece: when curated conditions around known attack paths produce the metrics, those conditions become a purchasing risk the moment the buyer treats demo performance as a production forecast.

Frequently asked questions about AI SOC demos

Are AI SOC demos realistic in production SOCs?

AI SOC demos run on curated telemetry with clean alerts and linear attack paths. In controlled environments, the products perform well, but live SOC environments include incomplete or inconsistently formatted data and nonlinear attack paths — production conditions that typically do not manifest during demonstrations.

What telemetry do AI SOC demos usually run on?

Most AI SOC demos run on vendor-controlled environments with pre-loaded, well-formatted sample data and pre-scripted attack scenarios. The telemetry is complete, the alerts are clean, and the detection outcomes are predetermined. Production data, with its ambiguous field names and missing context, is structurally absent from the standard demo format.

How do I run an AI SOC proof of concept on my own data?

Identify three to five incidents from your last 90 days where detection was late or absent. Connect the candidate tool to your actual alert sources, and run adversary emulation using MITRE CALDERA or Atomic Red Team with scenarios the vendor hasn't seen. Define success criteria before the POC begins, and weight scoring toward miss rate rather than false positive reduction.

What should I ask for in an AI SOC demo?

Ask whether benchmarks were collected during a proof of concept or sustained production use. Ask for the production false positive rate and the full evidence chain for any verdict. If the vendor can't produce specific, falsifiable metrics tied to conditions and baselines, the numbers aren't meaningful.

About the author

Theo H. focuses on how security operations are evolving as data, automation, and AI reshape the way teams detect and respond to threats. With a background spanning security engineering and platform design, Theo has worked on building and integrating systems that connect telemetry, detection logic, and response workflows across modern security stacks. His work has centered on improving how security teams use data: not just collecting it, but turning it into actionable context for investigations and decisions. He writes about the structural challenges in today’s security operations models, including the limits of traditional SOC architectures, the gap between automation and real-world execution, and the emerging role of AI in augmenting human analysts. His perspective focuses on what is changing, and what isn’t, as organizations attempt to move from tool-driven operations to more adaptive, system-level approaches to security.
