What the Data on AI in the SOC Actually Supports, and What It Doesn't (Part 1 of 5)

“But ChatGPT says it’s fine.” “Our vendor promises 70 percent less headcount.” “Autonomous SOCs are right around the corner.”

Three sentences. I’ve been hearing them for the past eighteen months in nearly every conversation about AI in the SOC. Here is my position. Unvarnished. If you see it differently and have data, I’ll listen. Until then, mine stands.

TL;DR: The solid research supports a productivity gain of about 22 percent on tightly defined SOC tasks (Microsoft RCTs, methodologically transparent, vendor-funded, not replicated). The 70-percent promises and the “autonomous SOC” from marketing decks don’t hold up: no sample, no methodology, no baseline. Invest on the documented numbers and you have an argument. Build on marketing numbers and you take on the vendor’s replication risk.

A clarification up front. When I say “AI in the SOC” in this series, I mean large language models and ML classifiers embedded in triage, correlation, and detection-engineering workflows. Not intelligence in the classical sense. Not the agentic-AGI visions from vendor roadmaps. Pattern matching on large corpora. Sometimes useful. Often overrated.

What follows is my position. Sourced, unvarnished, against the dominant marketing tone.

What I make of the three sentences

The first one comes from the detection engineer: “ChatGPT wrote me this Sigma rule, looks fine to me.” Sure, it looks fine. Until it breaks in the goodware test. Until the condition never matches in real traffic. By now, whenever I hear “looks fine,” I reflexively ask about the goodware test. Saves time.

The second I hear from the CFO corner: “The vendor promises us 70 percent less headcount.” On what data? What sample? What methodology? What baseline? None of it is published anywhere. A slide is a slide, not a study. Whoever sells one as the other has passed a marketing statement off as evidence.

The third sits in the CISO briefing: “We’re hearing that autonomous SOCs are around the corner.” Microsoft Threat Intelligence itself admitted in March 2026 that agentic AI on the threat-actor side is “not yet observed at scale”. In my view the same holds for the defensive side. When the biggest player in the market phrases it that cautiously, it should give you pause. It does for me. “Around the corner” is marketing language. Not a product status.

I challenge these sentences every time. The next sections show the data I use to do it.

What the research actually supports

The only solid numbers right now come from Microsoft. That annoys me. I’d prefer more players publishing their methodology. As of today: three studies, all Microsoft.

In the RCT with experienced security analysts: 22 percent faster, 7 percent more accurate than the control group, averaged across four defined tasks. Per task the effect ranges from 14 to 49 percent (Edelman et al. 2024).

In the live-operations evaluation over 180 days: 30.13 percent MTTR reduction in Microsoft Defender XDR. Observational, not causal. The authors themselves write that “unobserved confounders inhibit causal identification” (Bono/Grana/Xu 2024). One of the most honest lines in the entire AI-in-SOC literature. It sits in Microsoft’s own paper.

On the phishing triage agent: up to 6.5× more true positives per analyst minute, with 53 percent more attention on malicious mail (Bono 2025). Reallocation of human attention. Not rubber-stamping.

What these numbers support is limited. All of them were measured on the Microsoft stack with Microsoft subjects: methodologically transparent and well-documented, but vendor-funded and not independently replicated. Whoever cites one in a CXO briefing without the caveat sells the result without the method. That has happened to me often enough that I now add the caveat reflexively. Without it, the number on the table is no longer an honest one.

A second number that makes its way into CXO briefings: IBM’s Cost of a Data Breach Report 2024. USD 1.88 million lower breach costs and 98 days shorter lifecycle for organizations with extensive AI use (IBM 2024). Correlation, not causation, and the caveat sits in the original report. Whoever resells it as “AI saves money” simplifies a multivariate reality into a causal path the data doesn’t support.

Where the vendor pitch leaves the ground

Stellar Cyber advertises 70 percent faster threat detection without extra headcount, an eightfold improvement in MTTD and a twentyfold improvement in MTTR. Dropzone AI advertises an MSSP case with alert reduction from 144,000 to 200, i.e. about 99.86 percent. None of these publications includes a methodology. I haven’t asked them; if you’ve seen the sample or baseline, send it my way.

Honestly, what frustrates me most is exactly this asymmetry. I spend a meaningful part of many client conversations explaining to a SOC lead why the 70 percent isn’t worth anything, because there’s no sample behind it. If the methodology is included, the SOC lead can transfer the study to her own stack or reject it. If only the headline number is included, all that remains is belief or disbelief. That’s a lottery, not an investment basis.

Microsoft itself, in its own “autonomous SOC” blog posts, writes with notable caution. “Can”. “Will”. “Moves toward”. Future-tense language. In its March 2026 update, Microsoft Threat Intelligence explicitly admits that agentic AI on the threat-actor side is “not yet observed at scale and limited by reliability and operational risk” (Microsoft 2026). The most methodologically honest spot in the marketing world. Inside the same conglomerate that sells the “autonomous SOC” on the conference slide. Keep that contradiction in mind when someone tries to sell you the “autonomous SOC”.

What works in between

In hunt engagements and detection-rule crafting for client environments, one pattern works reproducibly. AI as a refinement layer. Not as a generator.

The engineer writes the skeleton: YARA, Sigma, KQL, Snort. The AI refines on concrete hints: expand strings, adjust format, propose a first-draft correlation. Research, context, precision stay with the human. yarGen with the --ai flag is an example that shows this logic in code: the human builds the scaffold, the AI fills in the detail. Works.
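
One way to picture the refinement-layer idea is a small gate in Python. This is a sketch under my own assumptions, not how yarGen or any other tool works: the Rule type, the apply_refinement function, and the example strings are all hypothetical, and the policy that the AI may add strings but never touch the name or the condition is just one possible way to keep the logic with the human.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical, heavily simplified detection rule."""
    name: str
    strings: tuple   # literal strings the rule searches for
    condition: str   # detection logic, owned by the human engineer


def apply_refinement(skeleton: Rule, proposal: Rule) -> Rule:
    """Accept an AI proposal only if it adds strings and changes nothing else."""
    if (proposal.name, proposal.condition) != (skeleton.name, skeleton.condition):
        raise ValueError("proposal touches human-owned name or condition")
    missing = [s for s in skeleton.strings if s not in proposal.strings]
    if missing:
        raise ValueError(f"proposal drops human-written strings: {missing}")
    # dict.fromkeys removes duplicates while keeping first-seen order
    merged = tuple(dict.fromkeys(proposal.strings))
    return Rule(skeleton.name, merged, skeleton.condition)


# Invented example values, for illustration only.
skeleton = Rule("office_spawns_shell", ("powershell.exe",),
                "parent_is_office and any_string")
expanded = Rule("office_spawns_shell",
                ("powershell.exe", "pwsh.exe", "powershell_ise.exe"),
                "parent_is_office and any_string")
rewritten = Rule("office_spawns_shell", ("powershell.exe",), "any_string")

refined = apply_refinement(skeleton, expanded)
print(refined.strings)

try:
    apply_refinement(skeleton, rewritten)
except ValueError as err:
    print("rejected:", err)
```

The expanded proposal passes because it only adds strings. The rewritten one is rejected because it loosens the condition, which in this sketch stays with the engineer.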

What works less well in the same pattern: AI as an end-to-end rule generator for anything beyond IOC matching. An LLM-generated detection rule that doesn’t rest on a known hash, a known C2 IP, or a known filename mostly looks plausible. Then it breaks in testing: false positives in the goodware test, invented values, conditions that can never match, fabricated fields.
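
To make those failure modes concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: a rule is reduced to a flat mapping of field name to substring (real Sigma conditions are richer), KNOWN_FIELDS stands in for the log source’s schema, and the GOODWARE and KNOWN_BAD events are invented, not real telemetry.

```python
# Hypothetical schema: the only fields our imaginary process-creation log emits.
KNOWN_FIELDS = {"Image", "CommandLine", "ParentImage"}


def unknown_fields(rule):
    """Fields the rule references that the log source never emits."""
    return sorted(set(rule) - KNOWN_FIELDS)


def matches(rule, event):
    """True if every rule value is a case-insensitive substring of its field."""
    return all(
        needle.lower() in event.get(field, "").lower()
        for field, needle in rule.items()
    )


def false_positives(rule, goodware):
    """Goodware test: every benign event the rule fires on is a false positive."""
    return [event for event in goodware if matches(rule, event)]


def is_dead(rule, known_bad):
    """True if the rule fires on none of the known-bad samples."""
    return not any(matches(rule, event) for event in known_bad)


PS = r"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe"

# Invented illustration data.
GOODWARE = [
    {"Image": PS, "CommandLine": "powershell.exe -File backup.ps1",
     "ParentImage": r"C:\Windows\System32\svchost.exe"},
    {"Image": r"C:\Windows\System32\notepad.exe",
     "CommandLine": "notepad.exe notes.txt",
     "ParentImage": r"C:\Windows\explorer.exe"},
]
KNOWN_BAD = [
    {"Image": PS, "CommandLine": "powershell.exe -enc SQBFAFgA",
     "ParentImage": r"C:\Program Files\Microsoft Office\root\Office16\WINWORD.EXE"},
]

RULES = {
    "too_broad": {"Image": "powershell.exe"},
    "fabricated": {"Image": "powershell.exe", "ProcessEntropy": "high"},
    "scoped": {"Image": "powershell.exe", "CommandLine": " -enc ",
               "ParentImage": "winword.exe"},
}

for name, rule in RULES.items():
    print(name, unknown_fields(rule),
          len(false_positives(rule, GOODWARE)), is_dead(rule, KNOWN_BAD))
```

Note what the fabricated rule does here: it produces zero false positives on goodware and still never fires on the known-bad sample, because it filters on a field the log source does not emit. A clean goodware run alone says nothing about whether a rule can match at all, which is why the schema check and the known-bad check sit next to it.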

I don’t find that surprising. The models were trained on a corpus that consists, in large part, of mediocre detection-engineering output from the past decade. Garbage in. Garbage out. At the rule level. Whoever still sells this as a Tier-1 replacement sells the false-positive spike along with it. My position. If anyone has data showing the opposite, I’ll look at it. But please with a sample.

The DACH anchor

The BSI assessed AI in April 2024 as a tool that lowers entry barriers and increases the speed and volume of offensive operations, but not as one that leads to full attack automation. “Not even in the near future” (BSI 30.04.2024). BACS in Switzerland and ENISA at the EU level offer similar assessments.

From a US vendor perspective that sounds cautious. From an authority perspective it’s methodologically consistent. Authority threat landscapes work with reporting data, not with sales claims. That caution isn’t a sign of lagging behind. It’s the result of a different data source. I use the DACH situation reports in almost every conversation where a US vendor pitch is on the table.

What sticks

One number that should stick in every conversation: 22 percent. Speed gain, slight accuracy uplift, reallocation of human attention to relevant cases. Verifiable, methodologically documented, vendor-funded. Not 70 percent, and definitely not an “autonomous SOC”.

Whoever builds on the documented base has an argument that holds up in an audit. Whoever builds on marketing numbers finances a claim the vendor itself hasn’t made testable. And takes on the replication risk the vendor left open. My gut, methodologically supported: don’t buy what can’t be verified.

In Part 2 we look at where augmentation stops. Adversarial ML, out-of-distribution detection, and what happens when an LLM-based triage agent ingests a log file the attacker has written into.

Part 1 of 5 in this series on AI in defensive cyber (augmentation, not replacement):

Part 1, What the data supports (current)
Part 2, Where augmentation stops
Part 3, What it means for SOC teams
Part 4, AI vs AI
Part 5, How it could actually work