This article presents a comparative empirical evaluation of large language models (LLMs) as the reasoning engine for autonomous and hybrid penetration testing. The study uses YAGA, an artificial intelligence agent for penetration testing developed by HackerSec, which conducts the complete cycle of reconnaissance, exploitation, and post-exploitation without human intervention. Once the autonomous run completes, the results undergo a human validation step that verifies the accuracy of the findings, eliminates false positives, and contextualizes the impact of the identified vulnerabilities in the client's environment.
The benchmark covers 124 scenarios distributed across black-box, gray-box, and white-box configurations, of which 40 require multi-stage exploitation chains to achieve objectives such as RCE, chained SSRF, and privilege escalation. Five models are evaluated as the cognitive engine of the agent: Claude Opus 4.8, Claude Opus 4.7, Claude Opus 4.6, Claude Sonnet 4.6, and GPT-5.5. Multi-agent coordination follows a stigmergic model based on a shared blackboard with pheromone semantics, in which attack chains emerge from the indirect interaction between specialized agents rather than being prescribed by a central orchestrator.
An LLM classifier coupled with retrieval augmented generation (RAG) enables the dynamic selection of playbooks, while intrinsic curiosity via PPO with Random Network Distillation (RND) guides the exploration of novel adversarial states. Results indicate that Claude Opus 4.8 achieves a 91.2% success rate in complex chains, followed by GPT-5.5 with 87.8%.
Keywords: autonomous penetration testing, offensive AI agent, attack chains, stigmergy, RAG, reinforcement learning, intrinsic curiosity, attack graph
I. Introduction
Automating penetration testing through agents based on large language models (LLMs) represents an emerging paradigm in offensive security. While traditional tools like Metasploit and Burp Suite automate individual exploits, they lack the strategic reasoning capability necessary to chain vulnerabilities into multi-stage attack paths — a skill that distinguishes experienced pentesters from automated scanners.
YAGA is an artificial intelligence agent for penetration testing developed by HackerSec. Unlike monolithic frameworks or pipelines, YAGA operates as an autonomous agent with adversarial reasoning, strategic planning, and tactical execution capabilities. The agent conducts the complete pentest cycle autonomously: reconnaissance, enumeration, exploitation, post-exploitation, and report generation, making strategic decisions based on the state of the environment and accumulated knowledge.
After the autonomous execution is completed, a human validation step is conducted to verify the accuracy of the reported findings, discard false positives, assess the real impact of vulnerabilities in the specific context of the client, and ensure that remediation recommendations are applicable and appropriately prioritized. This model preserves the efficiency and scalability of total automation while ensuring the reliability of the delivered results.
Recent works demonstrate that LLMs can perform offensive security tasks when equipped with appropriate tools [1, 2]. However, most existing evaluations focus on isolated vulnerabilities, neglecting scenarios where achieving an objective — such as RCE on an internal server — requires the sequential composition of multiple individual low-severity vulnerabilities, which we term emerging attack chains.
In this work, we present three main contributions: (1) an extensive benchmark of LLMs across 124 pentest scenarios with graduated complexity, including 40 scenarios requiring exploitation chains; (2) the multi-agent coordination architecture of YAGA, based on stigmergy, which allows the emergence of attack chains without central prescription; and (3) formal stopping criteria that combine structural, epistemic, and reinforcement learning metrics to determine when exploration has reached saturation.
II. Related Work
PentestGPT [1] demonstrated that GPT-4 can interactively guide a human tester through penetration scenarios. ReaperAI [2] proposed a framework for an autonomous agent with hierarchical planning. HackTheBox Benchmark [3] evaluated models in isolated CTFs. None of these works explicitly address emerging attack chains or multi-agent coordination without a central orchestrator.
The concept of computational stigmergy, originating from swarm intelligence [4], has been applied in multi-robot systems but, to our knowledge, remains unexplored in offensive security. AutoAttacker [5] introduced an LLM-driven pipeline for automated pentesting, but it relies on a monolithic planner that does not scale to environments with multiple simultaneous attack vectors. Our approach differs fundamentally: it distributes intelligence among specialized agents that coordinate through shared artifacts rather than direct communication.
III. Multi-Agent Coordination Architecture
A. LLM Classifier with Retrieval Augmentation (RAG)
The first component of the architecture is a strategist agent that receives the output from the reconnaissance phase and determines which attack categories are applicable to the target. This agent uses a fine-tuned LLM classifier to map reconnaissance artifacts (open ports, detected technologies, response headers, software versions) to a set of relevant MITRE ATT&CK tactics.
The plan generated by the strategist is cross-referenced with a RAG system that indexes a database of attack playbooks. The playbooks are stored as vector embeddings in an HNSW index, where each playbook contains preconditions, action sequences, success indicators, and fallback criteria. Retrieval utilizes cosine similarity between the embedding of the reconnaissance context and the embeddings of the playbooks, with an adaptive threshold based on the classifier's confidence:
sim(q, p_i) = (E_q · E_{p_i}) / (||E_q|| · ||E_{p_i}||) > τ · σ(c)
where E_q is the embedding of the reconnaissance query, E_{p_i} is the embedding of playbook i, τ is the base threshold, and σ(c) is the sigmoid confidence of the classifier. With high confidence, the threshold rises and the system retrieves only the more specific playbooks; with low confidence, the threshold falls and broader matches are accepted.
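The adaptive threshold can be illustrated with a minimal sketch. It assumes that c is a raw classifier logit passed through a sigmoid, and it replaces the HNSW index with a linear scan over toy three-dimensional embeddings. The function names, playbook names, and vectors are hypothetical and serve only to show how the confidence-scaled cut-off changes what is retrieved.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def sigmoid(c):
    return 1.0 / (1.0 + math.exp(-c))

def retrieve_playbooks(query_emb, playbooks, tau, confidence_logit):
    """Return (name, similarity) pairs with similarity > tau * sigmoid(c)."""
    threshold = tau * sigmoid(confidence_logit)
    hits = []
    for name, emb in playbooks.items():
        sim = cosine_similarity(query_emb, emb)
        if sim > threshold:
            hits.append((name, sim))
    # Most similar playbook first.
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Hypothetical toy embeddings; real ones would come from an embedding model.
playbooks = {
    "ssrf-cloud-metadata": [0.9, 0.1, 0.0],
    "generic-web-recon": [0.6, 0.6, 0.5],
    "smb-relay": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]

# High confidence gives a strict threshold, low confidence a permissive one.
confident = retrieve_playbooks(query, playbooks, tau=0.9, confidence_logit=4.0)
uncertain = retrieve_playbooks(query, playbooks, tau=0.9, confidence_logit=-1.0)
```

With these toy values, the confident query keeps only the closest playbook, while the uncertain query also admits the generic reconnaissance playbook.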
The strategist also prioritizes the execution order of the retrieved playbooks using a utility function that considers the estimated probability of success, potential impact (based on CVSS when available), and the operational cost of the attempt:
U(p_i) = w_1 · P(success | context) + w_2 · impact(p_i) − w_3 · cost(p_i)
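As an illustration of the ranking step, the sketch below evaluates the utility function for three hypothetical candidates. The weights, the normalization of impact (CVSS base score divided by 10), and the cost scale are assumptions; the source does not state the values used in YAGA.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    p_success: float  # estimated P(success | context), in [0, 1]
    impact: float     # assumed: CVSS base score divided by 10
    cost: float       # assumed: normalized operational cost, in [0, 1]

def utility(c, w1=0.5, w2=0.4, w3=0.1):
    """U(p_i) = w1 * P(success | context) + w2 * impact - w3 * cost."""
    return w1 * c.p_success + w2 * c.impact - w3 * c.cost

def prioritize(candidates, **weights):
    """Order playbooks from highest to lowest utility."""
    return sorted(candidates, key=lambda c: utility(c, **weights), reverse=True)

# Hypothetical candidates with illustrative scores.
candidates = [
    Candidate("sqli-login-form", p_success=0.7, impact=0.9, cost=0.2),
    Candidate("brute-force-ssh", p_success=0.2, impact=0.8, cost=0.9),
    Candidate("reflected-xss", p_success=0.8, impact=0.4, cost=0.1),
]
ranked = prioritize(candidates)
```

Under these assumed weights the SQL injection playbook ranks first: it is slightly less likely to succeed than the XSS playbook, but its higher impact outweighs the difference.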
B. Scatter-Gather with Cross-Deduplication
In the implemented scatter-gather pattern, exploration tasks are distributed to multiple specialized agents in parallel: a SQL injection agent, an XSS agent, an SSRF agent, etc. Each agent operates independently on the same set of discovered endpoints, producing findings with evidence and severity.
In the gather phase, an orchestrator deduplicates the findings using semantic similarity between the evidence and cross-references results from different agents. An SSRF finding identified by one agent may, when cross-referenced with an open redirect finding from another agent, reveal an attack chain that no individual agent would have found on its own.
Deduplication operates on two layers: (1) exact deduplication by hashing technical evidence (endpoint + payload + response signature), and (2) semantic deduplication using embedding distance to group findings that describe the same underlying vulnerability with superficial variations.
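The two layers can be sketched as follows. This is a simplified illustration, not the YAGA implementation: the finding fields, the greedy grouping strategy, and the distance cut-off of 0.05 are assumptions, and the embeddings are toy two-dimensional vectors assumed to be non-zero.

```python
import hashlib
import math

def evidence_hash(finding):
    """Layer 1 key: hash of endpoint + payload + response signature."""
    raw = "\x00".join(
        [finding["endpoint"], finding["payload"], finding["response_sig"]]
    )
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def deduplicate(findings, max_distance=0.05):
    # Layer 1: drop exact duplicates by evidence hash.
    seen, unique = set(), []
    for f in findings:
        h = evidence_hash(f)
        if h not in seen:
            seen.add(h)
            unique.append(f)
    # Layer 2: greedily attach each finding to the first group whose
    # representative (its first member) is within max_distance.
    groups = []
    for f in unique:
        for g in groups:
            if cosine_distance(f["embedding"], g[0]["embedding"]) <= max_distance:
                g.append(f)
                break
        else:
            groups.append([f])
    return groups

# Hypothetical findings from different agents.
sqli_a = {"endpoint": "/search?q=", "payload": "' OR 1=1--",
          "response_sig": "sql-error", "embedding": [1.0, 0.0]}
sqli_b = {"endpoint": "/search?q=", "payload": "' OR '1'='1",
          "response_sig": "sql-error", "embedding": [0.99, 0.05]}
ssrf = {"endpoint": "/fetch?url=", "payload": "http://169.254.169.254/",
        "response_sig": "metadata-body", "embedding": [0.0, 1.0]}

groups = deduplicate([sqli_a, dict(sqli_a), sqli_b, ssrf])
```

The verbatim copy of the first finding is removed by the hash layer, the two SQL injection variants are merged by the semantic layer, and the SSRF finding remains in its own group.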
C. Stigmergy and Shared Blackboard
The most significant architectural contribution is the stigmergic coordination model. Instead of a central orchestrator prescribing sequential actions, each agent has a trigger predicate — a condition on the state of the blackboard that activates it. Agents coordinate by reading and writing findings to a shared blackboard with vector search support, where each finding carries a pheromone weight.
The pheromone is a scalar in the range [0, 1] that represents the temporal relevance and quality of the finding. The weight decays exponentially over time:
φ(t) = φ_0 · e^{−λ(t − t_0)}
where φ_0 is the initial pheromone (proportional to the severity of the finding), λ is the decay rate, and t_0 is the creation timestamp. Because the weight shrinks toward zero as a finding ages, obsolete paths drop out without manual intervention.
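A minimal sketch of the decay rule is shown below. The decay rate, the time unit (seconds), and the pruning cut-off are illustrative assumptions; the source specifies only the exponential form.

```python
import math

def pheromone(phi0, decay_rate, created_at, now):
    """phi(t) = phi0 * exp(-decay_rate * (t - t0)), defined for t >= t0."""
    if now < created_at:
        raise ValueError("now must not precede created_at")
    return phi0 * math.exp(-decay_rate * (now - created_at))

def half_life(decay_rate):
    """Time after which a weight has dropped to half its initial value."""
    return math.log(2) / decay_rate

def prune(findings, decay_rate, now, min_weight=0.05):
    """Keep findings whose current weight is still at least min_weight.

    The cut-off min_weight is an assumption made for this sketch.
    """
    return [
        f for f in findings
        if pheromone(f["phi0"], decay_rate, f["t0"], now) >= min_weight
    ]

# Hypothetical findings: a severe SSRF and a low-value banner grab.
findings = [
    {"id": "ssrf", "phi0": 0.9, "t0": 0.0},
    {"id": "banner", "phi0": 0.2, "t0": 0.0},
]
# After 30 minutes with an assumed decay rate of 0.001 per second.
alive = prune(findings, decay_rate=0.001, now=1800.0)
```

Both findings decay at the same rate, but the low-severity one starts lower and therefore crosses the cut-off first.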
Attack chains emerge organically: a reconnaissance finding awakens the classifier; a high-severity classification awakens the exploit agent; exploit results return to the board and awaken the reporting agent. The order is not prescribed — it emerges from the state of the blackboard. Examples of trigger predicates:
- SQLi Agent: activates when the blackboard contains endpoints with untested query parameters
- PrivEsc Agent: activates when there are shells obtained without root privileges
- Lateral Movement Agent: activates when credentials or tokens have been extracted from a compromised host
- Report Agent: activates when no exploitation agent is active and the blackboard has stabilized
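The trigger predicates listed above can be expressed as pure functions of the blackboard state. The sketch below is a simplified illustration with a hypothetical finding schema; it omits vector search, pheromone weights, and the Report Agent's stabilization check.

```python
from dataclasses import dataclass, field

@dataclass
class Blackboard:
    findings: list = field(default_factory=list)

    def post(self, kind, **data):
        """Deposit a finding; agents never message each other directly."""
        self.findings.append({"kind": kind, **data})

    def of_kind(self, kind):
        return [f for f in self.findings if f["kind"] == kind]

# Each predicate inspects only the blackboard (hypothetical field names).
def sqli_trigger(bb):
    return any(
        bool(e.get("params")) and not e.get("tested", False)
        for e in bb.of_kind("endpoint")
    )

def privesc_trigger(bb):
    return any(not s.get("root", False) for s in bb.of_kind("shell"))

def lateral_trigger(bb):
    return bool(bb.of_kind("credential"))

TRIGGERS = {
    "sqli": sqli_trigger,
    "privesc": privesc_trigger,
    "lateral": lateral_trigger,
}

def active_agents(bb):
    """Names of agents whose trigger predicate currently holds."""
    return sorted(name for name, pred in TRIGGERS.items() if pred(bb))

bb = Blackboard()
bb.post("endpoint", url="/search", params=["q"])
after_recon = active_agents(bb)

bb.post("shell", host="10.0.0.5", root=False)
bb.post("credential", host="10.0.0.5", user="svc")
after_exploit = active_agents(bb)
```

No component decides the order of activation: the set of active agents is recomputed from the blackboard after every write, so a new finding is enough to wake the agents that depend on it.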
IV. Exploration Driven by Intrinsic Curiosity
To overcome the problem of sparse rewards inherent in autonomous pentesting — where an agent may execute hundreds of actions before obtaining a shell — we incorporated intrinsic curiosity into the reinforcement learning framework. The agent is trained with Proximal Policy Optimization (PPO) combined with Random Network Distillation (RND).
RND uses two neural networks: a fixed, randomly initialized target network f that produces deterministic embeddings of environment states, and a trainable predictor network f̂ (with parameters θ) that attempts to predict the target's embeddings. The prediction error constitutes the curiosity bonus:
r_curiosity(s_t) = ||f̂(s_t; θ) − f(s_t)||^2
New states, which the agent has never visited, produce high prediction error (high curiosity), while already explored states produce low error. The total combined reward is:
r_total(s_t, a_t) = α · r_ext(s_t, a_t) + β · r_curiosity(s_t)
where r_ext is the extrinsic reward (successful exploits, information collected) and α, β are balancing coefficients. The coefficient β is annealed downward over training, so that curiosity dominates during initial exploration and the extrinsic reward dominates in the mature phase.
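The curiosity bonus and the combined reward can be sketched without a deep learning framework. The example below substitutes small linear maps for the two neural networks, uses the squared prediction error as the bonus, and assumes a linear annealing schedule for β; none of these choices is confirmed by the source, and the PPO policy update itself is omitted.

```python
import random

class LinearNet:
    """Tiny linear map R^n -> R^m; stands in for a neural network."""
    def __init__(self, n_in, n_out, rng):
        self.w = [[rng.uniform(-1, 1) for _ in range(n_in)]
                  for _ in range(n_out)]

    def forward(self, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]

class RND:
    def __init__(self, n_in, n_out=4, lr=0.05, seed=0):
        rng = random.Random(seed)
        self.target = LinearNet(n_in, n_out, rng)     # fixed f, never updated
        self.predictor = LinearNet(n_in, n_out, rng)  # trainable f_hat
        self.lr = lr

    def curiosity(self, state):
        """Squared prediction error between predictor and target."""
        t = self.target.forward(state)
        p = self.predictor.forward(state)
        return sum((pi - ti) ** 2 for pi, ti in zip(p, t))

    def update(self, state):
        """One gradient-descent step on the predictor for this state."""
        t = self.target.forward(state)
        p = self.predictor.forward(state)
        for j, row in enumerate(self.predictor.w):
            err = p[j] - t[j]
            for i in range(len(row)):
                row[i] -= self.lr * 2 * err * state[i]

def total_reward(r_ext, r_cur, step, total_steps, alpha=1.0, beta0=0.5):
    """alpha * r_ext + beta * r_cur, with beta annealed linearly (assumed)."""
    beta = beta0 * max(0.0, 1.0 - step / total_steps)
    return alpha * r_ext + beta * r_cur

rnd = RND(n_in=3)
visited = [1.0, 0.0, 0.0]  # a state the agent keeps returning to
novel = [0.0, 1.0, 0.0]    # a state the agent has never seen

before = rnd.curiosity(visited)
for _ in range(200):
    rnd.update(visited)
after = rnd.curiosity(visited)
```

After repeated visits the predictor matches the target on the familiar state, so its bonus collapses toward zero, while the unseen state keeps a positive bonus because the weights it depends on were never trained.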
V. Emerging Attack Chains
We define an attack chain as an ordered sequence of actions a_1, a_2, ..., a_n in which each action a_i is viable only given the result of the preceding action, and whose composition achieves an objective G that no individual action would reach. Formally, with pre(a) and post(a) denoting the preconditions and postconditions of action a:
Chain(G) = {(a_1, ..., a_n) | ∀i > 1: pre(a_i) ⊆ post(a_{i−1}) ∧ post(a_n) ⊇ G}
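The membership condition in this definition translates directly into a set-based check. In the sketch below, preconditions, postconditions, and the goal are modeled as Python sets of hypothetical capability labels. Because the definition compares each action only with its immediate predecessor, a capability needed later in the chain must be carried forward in the intermediate postconditions.

```python
def is_valid_chain(actions, goal):
    """Check pre(a_i) is a subset of post(a_{i-1}) for every i > 1,
    and that post(a_n) is a superset of the goal."""
    if not actions:
        return False
    for prev, cur in zip(actions, actions[1:]):
        if not cur["pre"] <= prev["post"]:
            return False
    return goal <= actions[-1]["post"]

# Hypothetical actions for an SSRF -> internal discovery -> RCE chain.
ssrf = {
    "name": "ssrf",
    "pre": set(),
    "post": {"internal_http_access"},
}
discovery = {
    "name": "internal-service-discovery",
    "pre": {"internal_http_access"},
    # Access is carried forward so that the next action can rely on it.
    "post": {"internal_http_access", "unpatched_service_known"},
}
rce = {
    "name": "rce",
    "pre": {"internal_http_access", "unpatched_service_known"},
    "post": {"rce_internal_host"},
}
goal = {"rce_internal_host"}

full_chain_ok = is_valid_chain([ssrf, discovery, rce], goal)
skipped_step_ok = is_valid_chain([ssrf, rce], goal)
```

The full chain satisfies the definition, whereas dropping the discovery step breaks it: the RCE action then requires knowledge that the SSRF action alone does not provide.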
In the evaluated benchmark, the most frequent attack chains included:
- SSRF → Internal Service Discovery → RCE: external SSRF used to map internal services, followed by exploit on an unpatched service accessible only internally
- SQL Injection → Credential Extraction → Lateral Movement → PrivEsc: SQL injection to extract hashes, offline cracking, reuse in SSH, escalation via sudo misconfiguration
- Open Redirect → OAuth Token Theft → Account Takeover → Admin RCE: redirect to capture OAuth token, access to admin panel, RCE via unrestricted upload
- XXE → SSRF → Cloud Metadata → IAM PrivEsc: XXE to internal SSRF, access to metadata service, escalation of IAM role
The emergence of these chains in the stigmergic model occurs without prescription: the reconnaissance agent deposits an SSRF finding on the blackboard; the internal exploitation agent — whose trigger predicate checks for confirmed SSRFs — awakens and uses the SSRF as a transport primitive to scan the internal network. Internal findings are deposited back on the blackboard, activating new specialized agents.
VI. Formal Stopping Criteria
A. Attack Graph Coverage
The agent maintains a real-time attack graph G = (V, E) where nodes represent hosts, services, and credentials, and edges represent possible actions. The agent tracks three sets: discovered nodes V_d, attempted edges E_t, and pending edges E_p. The structural coverage metric is:
C_struct = |E_t| / (|E_t| + |E_p|)
When C_struct exceeds a configurable threshold (typically 0.90), the structural criterion is satisfied. Pure coverage has a limitation, however: it is computed only over edges the agent has already discovered, so an agent may reach 100% coverage on obvious paths and still miss non-trivial lateral movement.
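A short sketch of the structural criterion follows. Edges are modeled as hypothetical (source, target, action) tuples, and returning zero coverage for an empty graph is an assumption made here to avoid division by zero.

```python
def structural_coverage(attempted, pending):
    """C_struct = |E_t| / (|E_t| + |E_p|)."""
    total = len(attempted) + len(pending)
    if total == 0:
        return 0.0  # assumption: nothing discovered yet counts as uncovered
    return len(attempted) / total

def structural_criterion_met(attempted, pending, threshold=0.90):
    """True when coverage strictly exceeds the threshold."""
    return structural_coverage(attempted, pending) > threshold

# Hypothetical edge sets: 19 attempted actions, 1 still pending.
attempted = {("web01", f"svc{i}", "probe") for i in range(19)}
pending = {("web01", "db01", "sqli")}

coverage = structural_coverage(attempted, pending)
done = structural_criterion_met(attempted, pending)
```

Discovering a new host adds edges to the pending set and lowers the ratio again, which is why the metric can only certify saturation relative to what has been found so far.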
B. Information Gain and Decreasing Entropy
To capture the epistemic dimension of exploration, we measure the rate of discovery of new information per action. For each window of N actions, the agent calculates the information gain:
IG(w_k) = H(M_{k−1}) − H(M_k)
where H(M_k) is the entropy of the environment model at the end of window k. The epistemic criterion is satisfied when the normalized information gain stays below a threshold ε for the most recent consecutive windows:
IG(w_k) / H(M_0) < ε, ∀k ∈ [K−δ, K]
where K is the index of the latest window and δ sets how many preceding windows must also fall below the threshold. At that point, exploration is no longer significantly reducing the agent's uncertainty about the environment.
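The epistemic criterion can be checked from a list of per-window entropies. The sketch below assumes Shannon entropy in bits over a discrete belief distribution; how YAGA actually models the environment and computes H(M_k) is not specified in the source, and the entropy values are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def epistemic_criterion_met(entropies, epsilon, delta):
    """entropies[k] holds H(M_k). Returns True when
    IG(w_k) / H(M_0) < epsilon for every k in [K - delta, K]."""
    K = len(entropies) - 1
    if K - delta < 1 or entropies[0] <= 0:
        return False  # not enough windows, or nothing to normalize by
    for k in range(K - delta, K + 1):
        gain = entropies[k - 1] - entropies[k]
        if gain / entropies[0] >= epsilon:
            return False
    return True

# Hypothetical entropy trace: fast learning early, then a plateau.
trace = [8.0, 5.0, 3.0, 2.9, 2.85, 2.84]

plateau = epistemic_criterion_met(trace, epsilon=0.02, delta=2)
too_strict = epistemic_criterion_met(trace, epsilon=0.02, delta=3)
```

Requiring three windows (delta=2) accepts the plateau, while requiring four reaches back into a window where the agent was still learning quickly, so the criterion is not yet met.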
C. Diminishing Returns in RL
The most elegant criterion connects directly with the RL reward design. We monitor the rate of change of accumulated reward:
ΔR(t) = R(t) − R(t − Δt),  dR/dt ≈ ΔR(t) / Δt → 0
When dR/dt remains below ε_R for a sustained period, the agent's policy has saturated: additional actions no longer generate significant marginal value.
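A finite-difference version of this check might look as follows. The parameter names and the interpretation of "a sustained period" as a fixed number of consecutive steps (patience) are assumptions made for this sketch.

```python
def reward_saturated(cumulative, dt, eps_r, patience):
    """cumulative[i] is the accumulated reward R at step i.

    Returns True when the slope (R(t) - R(t - dt)) / dt stays below
    eps_r for each of the last `patience` steps.
    """
    needed = dt + patience - 1
    if len(cumulative) <= needed:
        return False  # not enough history to evaluate the window
    last = len(cumulative) - 1
    for t in range(last - patience + 1, last + 1):
        slope = (cumulative[t] - cumulative[t - dt]) / dt
        if slope >= eps_r:
            return False
    return True

# Hypothetical accumulated reward: steep gains, then a flat tail.
history = [0.0, 5.0, 9.0, 12.0, 12.5, 12.6, 12.6, 12.6, 12.6]

saturated = reward_saturated(history, dt=2, eps_r=0.1, patience=3)
early = reward_saturated(history[:5], dt=2, eps_r=0.1, patience=3)
```

Evaluated on the full history the criterion holds, while the first five steps alone still show a steep slope and do not trigger it.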
D. Hard and Soft Goals
YAGA operates with two types of goals. Hard goals are explicit objectives of the scope (e.g., "gain domain admin", "exfiltrate data X", "reach subnet Y"). If all hard goals have been achieved or declared infeasible with evidence, this criterion is satisfied. Soft goals represent coverage of vulnerability categories (OWASP Top 10, MITRE ATT&CK tactics). When the coverage of applicable tactics exceeds a configurable threshold (default 85%), the soft goal is satisfied.
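Both goal types reduce to simple predicates. The sketch below uses a hypothetical goal record with a status and optional evidence field, and models the soft goal as set coverage over tactic names; the field names and the handling of an empty tactic set are assumptions.

```python
def hard_goals_satisfied(goals):
    """True when every hard goal is achieved, or declared infeasible
    with evidence attached."""
    for goal in goals:
        if goal["status"] == "achieved":
            continue
        if goal["status"] == "infeasible" and goal.get("evidence"):
            continue
        return False
    return True

def soft_goal_satisfied(applicable, covered, threshold=0.85):
    """True when the covered share of applicable tactics exceeds the threshold."""
    if not applicable:
        return False  # assumption: nothing applicable means nothing to certify
    return len(applicable & covered) / len(applicable) > threshold

# Hypothetical scope with one achieved and one infeasible goal.
hard_goals = [
    {"name": "gain domain admin", "status": "achieved"},
    {"name": "reach subnet Y", "status": "infeasible",
     "evidence": "no route from any compromised host"},
]

applicable = {
    "initial-access", "execution", "persistence", "privilege-escalation",
    "credential-access", "discovery", "lateral-movement", "collection",
    "exfiltration", "impact",
}
covered = applicable - {"impact"}

hard_ok = hard_goals_satisfied(hard_goals)
soft_ok = soft_goal_satisfied(applicable, covered)
```

A goal marked infeasible without evidence keeps the hard criterion open, which mirrors the requirement that infeasibility be demonstrated rather than assumed.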
E. Budget Constraints
Every real pentest operation has operational limits. The framework implements configurable hard limits: maximum total (wall-clock) time, maximum number of actions or requests (to prevent detection and abuse), and maximum exploration depth (hops from the entry point). It also applies adaptive rate limiting: if the target starts to throttle responses, the agent backs off exponentially with jitter.
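The limits and the backoff behavior can be sketched as below. The "full jitter" variant (a uniform draw between zero and the capped exponential delay) is one common choice and is an assumption here, as are the field names, base delay, and cap.

```python
import random
from dataclasses import dataclass

@dataclass
class Budget:
    max_seconds: float  # wall-clock limit
    max_actions: int    # total actions or requests
    max_depth: int      # hops from the entry point

    def allows(self, elapsed_seconds, actions_taken, depth):
        """True while the next action stays within every hard limit."""
        return (
            elapsed_seconds < self.max_seconds
            and actions_taken < self.max_actions
            and depth <= self.max_depth
        )

def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Full-jitter exponential backoff: uniform in
    [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))

budget = Budget(max_seconds=3600.0, max_actions=5000, max_depth=4)

# Seeded generator so the example is reproducible.
rng = random.Random(1)
delays = [backoff_delay(attempt, rng=rng) for attempt in range(10)]
```

The jitter spreads retries over the whole interval instead of clustering them at the exponential boundary, and the cap keeps the delay bounded once throttling persists.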
VII. Experimental Evaluation
A. Benchmark Setup
The benchmark comprises 124 scenarios distributed across three access categories: Black-Box