Ground Truth Is a Security Control: Why Self-Driving Security Needs Crash Tests

Illustration of a SOC lab styled as a crash-test facility: an armored vehicle hits a barrier surrounded by analysts, dashboards, and a neural-network module feeding into a shield.

The industry wants a self-driving SOC. But nobody certifies a self-driving car from accident reports and insurance paperwork alone. It gets test tracks, controlled failures, and crash tests. In security, we are trying to build autonomy from something much weaker: alert verdicts, analyst notes, and retrospective incident data. Those labels often describe how work ended, not what was true.

A "false positive" may mean benign. It may also mean "known admin activity," "duplicate," "not enough context," or simply "we ran out of time." Train on that history and you do not learn attacker truth. You learn queue behavior.

This post is not about AI that can pentest faster. Pentesting answers an offensive search question: can an agent find a path to compromise? Defenders need something else. We need a way to generate trustworthy evidence about what our systems should have seen, what they actually saw, and what failed silently. Put plainly: if you cannot generate that evidence on demand, you do not really know whether your defenses work.

Past SOC tickets are a bad training set

Most cyber AI today learns from text: threat reports, ATT&CK pages, blog posts, and rules. Defenders, on the other hand, make decisions from telemetry. That gap matters, and the easy fix, training on historical alerts and analyst dispositions, has a quiet trap inside it.

SOC labels are noisy, delayed, and shaped by local incentives. A ticket closed as benign tells you the analyst stopped looking. It does not tell you the activity was safe. A ticket marked true positive may reflect a real incident, or it may reflect a careful analyst writing things up the way the metrics demand. Multiply that across years of queue management and you get a dataset that encodes how a particular team handles work, not what attackers actually did inside the environment.

Autonomy trained on that history will be very good at one thing: behaving like the queue.

Pentesting, purple teaming, and BAS are not the same thing

These three get blended together in marketing decks, but they answer different questions and produce different outputs.

Pentesting asks an offensive search question. Given a goal, can an attacker reach it? The output is a path and a report. It is valuable, but it is not designed to validate whether a specific detection works against a specific behavior in a specific environment.

Breach and attack simulation runs a library of canned actions and reports a coverage score. That is useful as a baseline, but the canned nature is also the limitation. Coverage against a fixed catalog is not the same as evidence that your telemetry pipeline, detection logic, and response workflow actually fired when they should have.

Purple teaming is different in intent. It is defender-first. The point is not to win or lose. The point is to execute a specific attacker behavior, in scope, with clear success criteria, and check what the defenses did. Done well, it produces a test record, not a score.

Crash tests for the SOC

Borrow the framing from safety engineering. A crash test is not a real accident. It is a controlled, repeatable scenario with instruments everywhere, designed to produce evidence about how the system behaves at the edges. The point is not to prove the car is invincible. The point is to know exactly where it bends, where it breaks, and what the sensors and restraints did during the event.

Autonomous purple teaming is the equivalent for the SOC: controlled, repeatable, defender-first execution of specific attacker behaviors, with the environment instrumented so we can compare expected outcomes to observed outcomes. The output is a test record. The behavior we executed. The logs we expected. The logs we actually got. Which detections should have fired. Which did fire. Which benign lookalikes confused the system.
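To make the test record concrete, here is a minimal sketch of what one could look like as a data structure. The class name, field names, and sample values are all hypothetical, chosen only to mirror the list above; a real record would carry far more detail.

```python
from dataclasses import dataclass, field


@dataclass
class CrashTestRecord:
    """One controlled execution of an attacker behavior (hypothetical schema)."""

    behavior: str                       # the behavior we executed
    expected_logs: set[str]             # the logs we expected
    observed_logs: set[str]             # the logs we actually got
    expected_detections: set[str]       # detections that should have fired
    fired_detections: set[str]          # detections that did fire
    confused_lookalikes: set[str] = field(default_factory=set)

    def missing_logs(self) -> set[str]:
        # Telemetry that should exist but never arrived: a silent failure.
        return self.expected_logs - self.observed_logs

    def missed_detections(self) -> set[str]:
        # Detections that should have fired and did not.
        return self.expected_detections - self.fired_detections


# Illustrative values only, not taken from any real environment.
record = CrashTestRecord(
    behavior="encoded PowerShell command run from a service account",
    expected_logs={"process_creation", "script_block", "network_connection"},
    observed_logs={"process_creation", "network_connection"},
    expected_detections={"suspicious_encoded_powershell"},
    fired_detections=set(),
    confused_lookalikes={"nightly_patch_script"},
)

print(sorted(record.missing_logs()))
print(sorted(record.missed_detections()))
```

The useful part is the set difference. Expected minus observed is exactly the "what failed silently" question, and it can only be asked when the expected side was written down before the test ran.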

That record is the artifact a self-driving SOC actually needs. It is also the artifact a copilot or agent can be evaluated against, regression-tested with, and trained on safely.

Two example failure modes

The argument is easier to feel with concrete examples.

First case. A PowerShell alert fires. The analyst checks, recognizes the source as IT automation, and closes it as a false positive. Months later, a model trained on those dispositions learns to suppress the same pattern. The route the analyst dismissed is now the route an attacker can borrow with confidence, because the trained model will quietly suppress the alert. The label was operationally reasonable and analytically wrong, and the wrongness compounded once it became training data.

Second case. A detection written with AI assistance looks great on historical incident data. Precision is high, recall is high, the eval table is clean. It ships. Within a week it is firing constantly on software deployments, backup jobs, scheduled maintenance, and a long tail of normal enterprise activity it was never tested against. The model was not stupid. The benchmark was. There was no controlled, environment-specific source of benign truth to test against, so benign reality became the production incident.
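The second case can be sketched in a few lines. Everything below is an assumption made for illustration: the rule, the event fields, and the sample events are invented, and the numbers only show the shape of the problem. The same detection is scored twice, once against a thin historical benign set and once against a controlled set of benign lookalikes.

```python
from typing import Callable, Iterable

Event = dict[str, str]


def precision_recall(
    detect: Callable[[Event], bool],
    malicious: Iterable[Event],
    benign: Iterable[Event],
) -> tuple[float, float]:
    """Score a detection against events whose ground truth is known."""
    malicious, benign = list(malicious), list(benign)
    tp = sum(1 for e in malicious if detect(e))   # attacks caught
    fp = sum(1 for e in benign if detect(e))      # benign events flagged
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / len(malicious) if malicious else 0.0
    return precision, recall


def encoded_powershell(event: Event) -> bool:
    # Hypothetical rule: flag any PowerShell run with an encoded command.
    return event["process"] == "powershell.exe" and "-enc" in event["cmdline"]


attacks = [{"process": "powershell.exe", "cmdline": "-enc AAAA"}] * 4

# Historical data happened to contain no benign lookalikes.
historical_benign = [{"process": "chrome.exe", "cmdline": "--new-window"}]

# Controlled benign runs from the same environment (illustrative).
controlled_benign = [
    {"process": "powershell.exe", "cmdline": "-enc deploy"},     # software deployment
    {"process": "powershell.exe", "cmdline": "-enc backup"},     # backup job
    {"process": "powershell.exe", "cmdline": "-enc maintain"},   # scheduled maintenance
    {"process": "powershell.exe", "cmdline": "-enc inventory"},  # asset inventory
    {"process": "robocopy.exe", "cmdline": "/MIR"},
    {"process": "msiexec.exe", "cmdline": "/i app.msi"},
]

print(precision_recall(encoded_powershell, attacks, historical_benign))  # (1.0, 1.0)
print(precision_recall(encoded_powershell, attacks, controlled_benign))  # (0.5, 1.0)
```

Nothing about the rule changed between the two lines of output. Only the benign side of the benchmark did.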

In both cases the gap was not model intelligence. It was trustworthy feedback.

From validation to supervision

Once you have an engine that can execute scoped attacker behaviors on demand and capture what the environment did in response, you get more than a validation tool. You get a supervision signal.

You can generate environment-specific traces of real techniques. You can pair them with environment-specific benign lookalikes that come from the same hosts, the same tools, the same identities. You can label the pair with what was actually executed, not what an analyst wrote down at 2 a.m. That is the kind of data that makes a security model aware of the logs, tools, and controls of a real environment, not just security-flavored in the abstract.
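One way to picture that pairing, as a rough sketch: group executed traces by host, tool, and identity, then match each attack trace with the benign traces that share its context. The Trace type, its fields, and the sample data are hypothetical. The one thing that matters is that the label comes from what the engine executed, not from a ticket disposition.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    host: str
    tool: str
    identity: str
    events: tuple[str, ...]
    label: str  # "attack" or "benign", recorded at execution time


def pair_lookalikes(traces: list[Trace]) -> list[tuple[Trace, Trace]]:
    """Pair each attack trace with benign traces from the same host, tool, and identity."""
    by_context = defaultdict(lambda: {"attack": [], "benign": []})
    for trace in traces:
        by_context[(trace.host, trace.tool, trace.identity)][trace.label].append(trace)

    pairs = []
    for group in by_context.values():
        for attack in group["attack"]:
            for benign in group["benign"]:
                pairs.append((attack, benign))
    return pairs


# Illustrative traces only.
traces = [
    Trace("ws-01", "powershell", "svc-deploy", ("enc_cmd", "outbound_https"), "attack"),
    Trace("ws-01", "powershell", "svc-deploy", ("enc_cmd", "file_write"), "benign"),
    Trace("ws-02", "powershell", "svc-backup", ("enc_cmd",), "benign"),
]

pairs = pair_lookalikes(traces)
print(len(pairs))  # 1
```

The unpaired benign trace on ws-02 is still usable data. It just has no matching attack run yet, which is itself a gap worth seeing.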

Those traces can also become the regression tests that future copilots and agents are measured against before anyone trusts them with a response action.
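A regression gate over those traces could be as small as the sketch below. The names, the candidate classifier, and the suite are hypothetical, and a real gate would track far more than a pass or fail. The idea is only that a candidate earns response permissions by agreeing with executed ground truth on every trace in the suite.

```python
from typing import Callable, Iterable, NamedTuple


class LabeledTrace(NamedTuple):
    name: str
    events: tuple[str, ...]
    is_attack: bool  # ground truth from controlled execution


Classifier = Callable[[tuple[str, ...]], bool]


def regression_failures(classify: Classifier, suite: Iterable[LabeledTrace]) -> list[str]:
    """Return the names of traces where the candidate disagrees with ground truth."""
    return [t.name for t in suite if classify(t.events) != t.is_attack]


def may_take_response_actions(classify: Classifier, suite: Iterable[LabeledTrace]) -> bool:
    # Hypothetical policy: any disagreement blocks response permissions.
    return not regression_failures(classify, suite)


# Illustrative suite: one attack and one benign lookalike.
suite = [
    LabeledTrace("credential_dump", ("lsass_access", "file_write"), True),
    LabeledTrace("edr_memory_scan", ("lsass_access",), False),
]


def naive_candidate(events: tuple[str, ...]) -> bool:
    # Flags anything that touches LSASS, including the benign scanner.
    return "lsass_access" in events


print(regression_failures(naive_candidate, suite))       # ['edr_memory_scan']
print(may_take_response_actions(naive_candidate, suite))  # False
```

The naive candidate catches the attack and still fails the gate, because the benign lookalike is part of the suite. That is the failure you want to find here rather than in production.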

Done with discipline, this is closer to a controlled experiment than to a red team engagement. That is the point.

Trustworthy autonomous defense needs test infrastructure, not just better prompts, better agents, or bigger models. Autonomous purple teaming turns attacker behavior into repeatable experiments that validate defenses and generate the environment-specific evidence security models need to learn.

Ground truth is a security control. Treat it like one.