AI vs AI: What Attackers Actually Do, and What They Don't (Part 4 of 5)

So far this series has been about the defensive side. What augmentation delivers, where it stops, what it means organizationally. Sitting next to that is a question rarely answered honestly in CISO briefings: what do attackers actually do with AI? The answer is more sober than the tabloid coverage and more nuanced than a one-sentence BSI position. If you have a different answer, I’ll listen. But please bring data, not headlines.

TL;DR: Attackers use AI in real terms for phishing scaling (hybrid human+LLM is the effective middle ground, pure LLM mail is increasingly filterable) and for variant analysis in vuln discovery (Big Sleep, with caveats). Fully autonomous attack AI remains mostly hype. BSI, ENISA, and Microsoft Threat Intel rate it through May 2026 as not observable at operational scale.

Phishing: real, but with granularity

The most honest data on LLM phishing impact comes from two academic studies. Heiding et al. measure click rates: control templates 19 to 28 percent, GPT-generated phishing mails 30 to 44 percent, human-with-psychology-model (V-Triad) 69 to 79 percent, hybrid human-plus-LLM 43 to 81 percent (n=112). Bethany et al., in an eleven-month university study with around 9,000 recipients, documented a credential-entry rate of around 10 percent for LLM-generated phishing mails (arXiv:2401.09727). In that study, LLM mail matched the effectiveness of human-crafted spear-phishing, and an ML detector with an F1 of 98.96 percent remained feasible.

TU Berlin confirms this in a 2026 paper (mlsec.tu-berlin.de). It puts LLM phishing above a 30 percent click rate, in smaller firms sometimes exceeding human baselines. Real, measurable, and filterable.

CrowdStrike published a different number in the Global Threat Report 2025: a 442 percent vishing increase in H2/2024 vs H1, and a 54 percent click rate on LLM phishing vs 12 percent on human-written mail. The methodology isn’t in the publication. I accept CrowdStrike’s numbers as a trend indicator, because they’re consistent with the academic evidence. In an investment case or a customer briefing I wouldn’t cite them as primary evidence. Whoever does buys the marketing volume, not the methodological statement.

The more important observation hides in the Heiding data. Human-crafted spear-phishing beats pure LLM mail by a factor of two. The interesting middle ground is hybrid: human plus LLM. That matches what I see in hunt engagements. The most effective spear-phishing campaigns combine LLM scaling with human psycho-crafting of the first one or two mails per target. Pure LLM volume-phishing is increasingly filterable. Hybrid phishing is what really challenges detection pipelines. That’s where we’re building the pipelines right now.
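The factor of two can be checked with a few lines of arithmetic. The sketch below is only an illustration: it takes the click-rate ranges quoted above from Heiding et al. and compares range midpoints. Using the midpoint as a point estimate is my assumption, not the study's method, and the dictionary keys are hypothetical labels. With n=112 the ranges are wide, so read the ratio as an order of magnitude, not a measurement.

```python
# Click-rate ranges in percent, as quoted above from Heiding et al.
# The keys are hypothetical labels chosen for this sketch.
CLICK_RATES = {
    "control": (19, 28),
    "llm_only": (30, 44),
    "human_vtriad": (69, 79),
    "hybrid": (43, 81),
}


def midpoint(rate_range):
    """Midpoint of a (low, high) range. Assumption: used as a rough point estimate."""
    low, high = rate_range
    return (low + high) / 2


def spread(rate_range):
    """Width of the range in percentage points."""
    low, high = rate_range
    return high - low


def midpoint_ratio(numerator, denominator):
    """Ratio of two conditions' midpoints, e.g. human-crafted vs pure LLM."""
    return midpoint(CLICK_RATES[numerator]) / midpoint(CLICK_RATES[denominator])


if __name__ == "__main__":
    for name, rate_range in CLICK_RATES.items():
        print(f"{name:13s} midpoint {midpoint(rate_range):5.1f} %  spread {spread(rate_range):2d} pts")
    print(f"human vs pure LLM: {midpoint_ratio('human_vtriad', 'llm_only'):.1f}x")
```

The hybrid range is the widest of the four (38 points), which fits the reading that hybrid results depend heavily on how much human crafting goes into each mail.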

Vuln discovery: first real successes, with two caveats

The headline candidates are real. Big Sleep, the LLM agent project from Google Project Zero and DeepMind, identified the first “previously unknown exploitable memory-safety issue” in widely-used software in October 2024: a stack-buffer underflow in SQLite, pre-release (Project Zero Blog). In August 2025 the team reported 20 additional bugs in FFmpeg, ImageMagick, and other open-source projects. The DARPA AI Cyber Challenge documented in the same month that autonomous cyber-reasoning systems found 86 percent of synthetic vulnerabilities and patched 68 percent of them, at an average of USD 152 compute per task (DARPA).
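The two DARPA percentages combine into one end-to-end number worth having in mind. The following is a minimal sketch under one assumption: that the 68 percent refers to the vulnerabilities that were found, as the phrasing "of them" suggests. The function names and the task count in the usage lines are hypothetical.

```python
# Figures as quoted above from the DARPA AI Cyber Challenge.
FOUND_RATE = 0.86         # share of synthetic vulnerabilities found
PATCHED_OF_FOUND = 0.68   # assumption: share of the *found* vulnerabilities that were patched
COST_PER_TASK_USD = 152   # average compute cost per task


def end_to_end_patch_rate(found_rate, patched_of_found):
    """Share of all seeded vulnerabilities that end up both found and patched."""
    return found_rate * patched_of_found


def compute_budget_usd(task_count, cost_per_task=COST_PER_TASK_USD):
    """Rough compute budget for a hypothetical number of tasks."""
    return task_count * cost_per_task


if __name__ == "__main__":
    rate = end_to_end_patch_rate(FOUND_RATE, PATCHED_OF_FOUND)
    print(f"found and patched: {rate:.1%}")                      # 58.5%
    print(f"100 hypothetical tasks: USD {compute_budget_usd(100)}")
```

Under that reading, a bit under six in ten seeded vulnerabilities were closed end to end, in a sandbox with known ground truth.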

Both examples are solid. Both have caveats that get lost in the marketing reading. Big Sleep operates in variant-analysis mode. The agent gets a known vulnerability pattern and searches code bases for variants. The Project Zero team itself phrases it drily: “target-specific fuzzer would be at least as effective at present”. Not open-ended vulnerability hunting. Pattern extension. The DARPA contest operates on synthetic challenge projects, i.e. open-source code bases with embedded vulnerabilities. A defined sandbox, not an adversarial end-to-end setting.
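To make the difference between pattern extension and open-ended hunting concrete, here is a deliberately crude sketch of the variant-analysis shape: a known pattern goes in, candidate sites come out. It is not how Big Sleep works (that is an LLM agent, not a regex). The seed pattern, the Candidate type, and the function name are hypothetical and exist only for this illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One code location that resembles the known-bad pattern."""
    path: str
    line_no: int
    text: str


# Hypothetical seed: the call family behind an already-known bug.
# A real pipeline would use something far richer than a regex.
SEED_PATTERN = re.compile(r"\b(strcpy|strcat|sprintf)\s*\(")


def find_variants(sources, pattern=SEED_PATTERN):
    """Scan {path: source_text} for lines matching the seed pattern."""
    hits = []
    for path, code in sorted(sources.items()):
        for line_no, line in enumerate(code.splitlines(), start=1):
            if pattern.search(line):
                hits.append(Candidate(path, line_no, line.strip()))
    return hits


if __name__ == "__main__":
    demo_sources = {
        "a.c": "void f(char *s) {\n  char buf[16];\n  strcpy(buf, s);\n}\n",
        "b.c": "void g(char *s) {\n  char buf[16];\n  snprintf(buf, sizeof buf, \"%s\", s);\n}\n",
    }
    for hit in find_variants(demo_sources):
        print(f"{hit.path}:{hit.line_no}: {hit.text}")
```

The point of the sketch is the starting condition. Without SEED_PATTERN there is nothing to search for, and that is the gap between variant analysis and hypothesis-free discovery.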

The Mouzopoulos paper from 2025 summarizes this systematically. LLM cyber evaluations underestimate the real-world risk components (maintenance, scaling, detection avoidance) compared to lab settings (arXiv:2502.00072). The reality behind the “LLM finds zero-days” headline is more bounded. Variant analysis with a clear starting point works productively. Open-ended vulnerability discovery without a hypothesis doesn’t work. Whoever sells one as the other sells a different claim than the one the data supports.

Fully autonomous attack AI: mostly hype

The DACH authority position is conservative and consistent. The BSI assessed fully autonomous attack AI in April 2024 as “not available and unlikely in the near future” (BSI 30.04.2024). ENISA Threat Landscape 2024 and 2025 document AI-augmented operations (phishing campaigns, simple malware mutation) at a “limited, evolving scale”. Microsoft Threat Intel in March 2026: agentic AI on the threat-actor side “not yet observed at scale and limited by reliability and operational risk”.

The counter-data point is GTG-1002. In November 2025 Anthropic published a report on an espionage operation attributed to a presumed Chinese state actor, in which Claude Code via MCP tooling, according to Anthropic telemetry, executed 80 to 90 percent of tactical operations autonomously (Anthropic). It is the most aggressive autonomy claim on the market. It comes from an AI vendor about its own product. And in the same publication Anthropic documents AI hallucinations and result validation as friction factors.

Even in the case that an AI vendor pushes assertively as “AI-orchestrated”, the vendor itself documents hallucinations as the bottleneck. That is not modesty. It is a structural property of the architecture, the same one that constrains us on the defensive side. A model release doesn’t make it disappear on the offensive side either.

Whoever overweights the GTG-1002 number in a risk assessment without citing the hallucination caveat buys the marketing side of a vendor self-disclosure. Conversely, ignoring the BSI and ENISA assessments because they sound less sensational gives up a data source that is methodologically cleaner than any vendor threat report. My clear position on this: the BSI, BACS, and ENISA assessments aren’t cautious authority politeness. They are the only available data source that isn’t based on a commercial selection bias. Whoever wants to refute this should show me the vendor threat report with disclosed methodology and independent replication. I’m waiting.

Wabi-sabi: why the detection reflex shifts the equation

A last observation that doesn’t come from cyber research, but from human-AI interaction. AI-generated content develops recognizable style markers. An overly smooth rhythmic structure, fake-contrast constructions, exaggerated certainty. People who see many AI outputs develop a “sounds-like-AI” reflex and automatically discount the content.

Two detection-engineering implications follow. First, pure-AI phishing content becomes more detectable over time, not less. The Heiding data already shows this: hybrid (human+LLM) stays effective, pure LLM loses ground. Second, this also applies to our own editorial discipline. For defensive content pipelines (threat-intel briefings, customer reports, detection documentation), a “sounds-like-AI” output is a credibility risk, even if the content is correct. Whoever delivers CTI reports in the standard AI style teaches their clients to stop reading the reports.

I test my own reports against the “sounds-like-AI” reflex. If a client briefing sounds too smooth, I rewrite it. It’s credibility maintenance, not marketing. Whoever doesn’t actively do this lets their reports be written by the same model family their clients are learning to filter out.

What this means for threat modeling

LLM-augmented phishing is real and requires updated content-pattern detection on the defensive side. Hybrid spear-phish recognition becomes the more demanding subtask. LLM vuln discovery is real for variant analysis and should be replicated as a defensive pattern. Running your own variant-analysis pipelines on known vulnerability classes surfaces findings in your own stack before disclosure. Fully autonomous attack AI remains mostly hype and doesn’t belong in the top tier of the risk assessment. The DACH authority assessment is the anchor here. Not the spectacular vendor reports.

In Part 5, the closing part of this series, we look at the constructive question. After four parts of methodological skepticism: what do I actually recommend? Which patterns work, which prerequisites need to be in place, and which decision framework do I apply in engagements?

Part 4 of 5 in this series on AI in defensive cyber, augmentation, not replacement:

Part 1, What the data holds up
Part 2, Where augmentation stops
Part 3, What it means for SOC teams
Part 4, AI vs AI (current)
Part 5, How it could actually work