Automated Penetration Testing Using Deep Reinforcement Learning
Notes by: Z3r0Yu. Paper by: Z. Hu, R. Beuran, Y. Tan, Japan Advanced Institute of Science and Technology (JAIST). Project link: github.com/crond-jaist… IEEE European Symposium on Security and Privacy Workshops 2020.
0x01 Main Content
Based on an attack tree, a simplified transition matrix is generated, and a penetration testing strategy is then produced automatically using deep reinforcement learning. In the future, penetration testing tools could be invoked automatically according to this strategy to complete the attack.
PS: Deep reinforcement learning (DRL) is the combination of deep learning and reinforcement learning. It integrates the strong perceptual abilities of deep learning on vision and similar problems with the decision-making ability of reinforcement learning to realize end-to-end learning. The emergence of DRL has made reinforcement learning practical for complex problems in real-world scenarios.
Main methods:
- Use the Shodan search engine to gather relevant server data and build a realistic network topology
- Use multi-host, multi-stage vulnerability analysis (MulVAL) to generate an attack tree for the topology
- Use a traditional search algorithm (depth-first search) to find all possible attack paths in the tree, and build a matrix representation matching the needs of the deep reinforcement learning algorithm
- Use a deep Q-learning network (DQN) to find the easiest attack path among the possible candidates, with an accuracy of 0.86
Personal thought: Why automated penetration testing?
- The flow of a penetration test is basically fixed: information gathering -> vulnerability exploitation -> system compromise
- The process is laborious, time-consuming, and complex; it requires a lot of tacit knowledge, is not easy to formalize, and is prone to human error
- Therefore, automated penetration testing can, on the one hand, carry out attacks more quickly, and on the other hand standardize operator behavior and reduce mistakes in real engagements
Current work:
- Core IMPACT, a commercial tool, is a framework for attack generation based on a model of the target system; it generates attack plans using variants of the Metric-FF planning system.
PS: (1) Metric-FF is a domain-independent intelligent planning system. It extends the FF planner (combined with the Action Description Language, ADL) by adding support for numerical state variables, specifically for a subset of the level-2 features of PDDL 2.1. Metric-FF was developed in C; it participated in the 3rd International Planning Competition (IPC) in 2002 and performed very competitively. (2) Blackbox is an intelligent planning system based on satisfiability, developed by Henry Kautz. It first converts planning problems into Boolean satisfiability problems and then solves them with various advanced satisfiability engines. The front end of Blackbox is a modified version of GraphPlan. Blackbox is very flexible and can employ multiple planning engines, so it runs effectively on a very wide range of planning problems. (3) LPG (Local search for Planning Graphs) is a planner based on local search and planning graphs, compatible with PDDL 2.1. It can handle both plan generation and plan modification. The search space of LPG consists of action graphs, specific subgraphs of the planning graph that represent partial plans.
PS: PDDL 2.1 is a language that supports temporal planning and metric planning. Reference: [A review of temporal planning and its research status] ml-data.org/GDGYDXXB/html/1615527673997-884046199.htm
- In 1999, Schneier proposed a method to model the security threats to a particular system and represent attacks against it as a tree structure. By analyzing the attack tree, the relationships between the various attack methods can be better understood.
- In 2018, Yousefi et al. applied reinforcement learning (RL) to analyze attack trees, using Q-learning to discover attack paths, but that approach still suffers from action and sample spaces that are too small.
- Compared with plain reinforcement learning, this paper adopts DRL, which is better suited to analyzing attack trees: it combines deep learning with reinforcement learning and uses trial and error to find the optimal solution.
0x02 Background
A. Shodan
Shodan is an online search engine for Internet-connected devices. This study uses Shodan to gather the information needed to build a realistic network topology; the topology information is then used as input to the MulVAL algorithm described below.
B. Attack Tree
1) Attack tree model: a model that expresses the interdependence between attack behaviors and attack steps, and that normalizes and documents the attack process. Each node of the tree represents an attack or a sub-goal; the root node represents the final goal of the attack, and each child node represents an attack or sub-goal that must be completed before its parent node can be attacked.
The nodes of the attack tree are divided into "AND" nodes and "OR" nodes. The "AND" tag means that a node cannot be attacked until all of its children have been attacked. The "OR" tag means that a parent node can be attacked as long as one of its children has been attacked. Figure 1 shows graphical and textual representations of the two possible node types, "AND" and "OR".
2) MulVAL: MulVAL is an open-source tool for generating the actual attack tree corresponding to a given network topology. Depending on the characteristics of the network topology, the complexity of the attack-tree generation algorithm has been shown to lie between O(n^2) and O(n^3), which is reasonable for a typical SME network with dozens of hosts. In this framework, MulVAL is used to find all possible paths for a given input network topology; a matrix is built from all the paths found and then simplified using a depth-first search (DFS) algorithm to make it more suitable for the DRL algorithm.
PS: MulVAL uses Datalog as its modeling language (covering vulnerability descriptions, rule descriptions, configuration descriptions, the permission system, etc.). It takes Nessus/OVAL scanner reports, network topology information provided by firewall management tools, network management policies provided by network administrators, and other facts encoded in Datalog as input, and hands them to an internal inference engine to deduce the attack process. The inference engine consists of Datalog rules that capture operating-system behavior and the interactions of the various components in the network. Finally, the attack tree produced by the inference engine is rendered by a visualization tool to form the attack graph.
In addition, there is Topological Vulnerability Analysis (TVA), an attack-graph generation tool with polynomial time complexity (O(n^2)) that can be used for automated analysis of network penetration. Its output is a state attack graph composed of attack steps and attack conditions. The tool requires manual input to create its rule library, and it does not solve the state-explosion problem inherent in state attack graphs: the attack graphs generated for complex networks are very large, which is not conducive to analysis.
NetSPA (Network Security Planning Architecture), proposed by Lippmann, is a graph-theory-based tool for generating attack graphs, now commercially available with the support of the US government. It uses firewall rules and vulnerability scanning results to build a network model and computes network reachability and attack paths; its complexity is O(n log n). Because it lacks an attack-pattern learning function, building its rule base depends on manual input, and the generated attack graphs contain loops, which makes them hard for users to understand. Since the displayed attack graphs are not intuitive, two tools with enhanced visualization, NAVIGATOR and GARNET, were later built on top of NetSPA.
Reference: Attack Graph Technique (mekakuactor.cn)
C. Reinforcement Learning
Reinforcement learning (RL) is a learning method that maps environmental states to actions so that agents gain maximum cumulative rewards from interacting with the environment. As an interactive learning method, RL is characterized by trial and error and delayed returns. RL agents attempt to identify strategies that can provide actions corresponding to any given state of the environment, with the ultimate goal of discovering an optimal strategy in order to achieve maximum cumulative rewards.
Figure 2 shows the interaction between the agent and the environment. At each time step t, the agent observes the environment to obtain state S_t and then performs action A_t. Depending on A_t, the environment produces a new state S_{t+1} and a reward R_t. Such a process can be described by a Markov Decision Process (MDP). An MDP consists of four parts, also known as a quadruple, represented as (S, A, P, R), where S is the state set; A is the action set; P(s'|s, a) is the probability of transitioning to state s' after taking action a in state s; and R(s, a) is the reward obtained by taking action a in state s.
The goal of the strategy is to maximize the future cumulative reward; that is, the "quality" of the current state can be measured by the future cumulative reward obtainable from that state. Reinforcement learning introduces a reward function to calculate the cumulative reward at any given time t:

$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

where $\gamma \in [0, 1]$ is the discount coefficient. Since the uncertainty of a reward increases the further it lies from the current state, $\gamma$ is used to account for this uncertainty.
In addition, the concept of a value function is used to represent the "value" of a state, that is, the expectation of the future cumulative reward from that state:

$$V(s) = \mathbb{E}\left[R_t \mid s_t = s\right]$$

Similarly, the action (a)-state (s) value function expresses the future cumulative reward conditioned on both state and action:

$$Q(s, a) = \mathbb{E}\left[R_t \mid s_t = s, a_t = a\right]$$
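To make the discounting concrete, here is a tiny Python sketch (not from the paper) that computes the discounted cumulative reward for a finite reward sequence; the example rewards anticipate the node scores used later in these notes (0.01 for the start node, 1.5 for execCode/accessFile, 100 for the target):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R_t = sum_k gamma^k * r_{t+k} for a finite reward sequence."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# Example: a short attack path whose final step yields the big target reward.
print(discounted_return([0.01, 1.5, 1.5, 100.0], gamma=0.9))
```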
D. Deep Reinforcement Learning
Deep reinforcement learning (DRL) refers to a class of techniques that combine deep learning and reinforcement learning. DRL is a general learning method in which “agents” obtain state information from the environment, choose appropriate actions according to their own strategies, change the state of the environment, and then obtain rewards or incentives according to the new state of the environment to express the effectiveness of the action for the agent. When DRL is applied to penetration testing, the agent plays the role of Pentester and selects the most efficient path for maximum reward.
DRL algorithms fall into three main categories: value-function-based methods, policy-search-based methods, and model-based methods. The deep Q-learning network (DQN) is a representative value-based method proposed by Mnih et al. It combines a convolutional neural network (CNN) with the Q-learning algorithm of traditional reinforcement learning to establish the DQN model. The input to the DQN model here is a simplified matrix; after a nonlinear transformation through three convolutional layers and two fully connected layers, a Q value is generated for each action in the output layer. Figure 3 shows the structure of the DQN model.
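A minimal PyTorch sketch of such a network follows. Only the layer counts (three convolutional, two fully connected) and the one-Q-value-per-action output come from the paper; the input size, channel widths, and kernel sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Three conv layers + two fully connected layers, one Q value per action.
    Input size (10x10) and layer widths are illustrative assumptions."""
    def __init__(self, matrix_size=10, n_actions=10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * matrix_size * matrix_size, 128), nn.ReLU(),
            nn.Linear(128, n_actions),  # one Q value per action
        )

    def forward(self, x):
        return self.fc(self.conv(x))

# A batch of one simplified 10x10 matrix (1 input channel).
q_values = DQN()(torch.zeros(1, 1, 10, 10))
print(q_values.shape)  # torch.Size([1, 10])
```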
Figure 4 shows the training process of DQN, which improves on the traditional Q-learning algorithm to mitigate the instability of nonlinear network function approximation. For example, DQN uses experience replay to process transition samples: at each time step t, the transition sample obtained from the agent's interaction with the environment is stored in a replay buffer. During training, a minibatch of transition samples is drawn at random and the network parameters θ are updated using the stochastic gradient descent (SGD) algorithm.
DQN also modifies the way the Q value is computed. In DQN, $Q(s, a \mid \theta_i)$ denotes the output of the current value network and is used to evaluate the value function of the current state-action pair, while $Q(s, a \mid \theta_i^-)$ denotes the output of the target network; the target Q value is generally taken as $y_i = r + \gamma \max_{a'} Q(s', a' \mid \theta_i^-)$.
The parameters θ of the current value network are updated in real time; after every N iterations, they are copied to the target value network. The network parameters are then updated by minimizing the mean square error between the current Q value and the target Q value. The error function is:

$$L(\theta_i) = \mathbb{E}_{s, a, r, s'}\left[\left(y_i - Q(s, a \mid \theta_i)\right)^2\right]$$

The gradient can be calculated as follows:

$$\nabla_{\theta_i} L(\theta_i) = \mathbb{E}_{s, a, r, s'}\left[\left(y_i - Q(s, a \mid \theta_i)\right) \nabla_{\theta_i} Q(s, a \mid \theta_i)\right]$$
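A hedged sketch of this training loop, reusing the DQN class sketched above, is shown below; the buffer size, batch size, learning rate, and copy interval are illustrative assumptions, not the paper's hyperparameters:

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

buffer = deque(maxlen=10_000)          # replay buffer; filled elsewhere with (s, a, r, s') tensors
policy_net, target_net = DQN(), DQN()  # current value network and target network
target_net.load_state_dict(policy_net.state_dict())
optimizer = torch.optim.SGD(policy_net.parameters(), lr=0.01)
gamma, batch_size, copy_every = 0.9, 32, 100

def train_step(step):
    if len(buffer) < batch_size:
        return
    # Randomly sample a minibatch of transitions (experience replay).
    s, a, r, s2 = map(torch.stack, zip(*random.sample(buffer, batch_size)))
    q = policy_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a | theta)
    with torch.no_grad():                                    # y = r + gamma * max_a' Q(s', a' | theta^-)
        y = r + gamma * target_net(s2).max(dim=1).values
    loss = F.mse_loss(q, y)                                  # MSE between current and target Q
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    if step % copy_every == 0:                               # copy parameters every N iterations
        target_net.load_state_dict(policy_net.state_dict())
```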
0x03 Framework Structure
The framework has three main parts:
- Training data generation: produces the training data required by the DQN algorithm, used as its input
- DQN module: trains with the DQN algorithm, then uses the trained model to automatically give penetration test suggestions
- Penetration tools: encapsulates external tools to perform penetration testing operations on real systems
A. Training Data
The key to using deep learning is to have training data. The method used in this article to create training data to feed into the DQN model consists of three steps.
- Use Shodan to gather network information and model a real-world network environment
- Use MulVAL to create an attack tree corresponding to the network environment
- Preprocess the data to make it suitable for the DQN model
1) Host data set. To build the training data, Shodan is first used to collect information about real network devices (the Shodan API returns data about real network servers, including actual IP addresses, ports and protocols used, known vulnerabilities, and so on). Figure 6 shows an example of raw data collected by Shodan (with sensitive information, such as IP addresses, anonymized).
For each different service, a separate service dataset file is created that contains all the relevant information about the particular network hosts running that service. To protect privacy, only open ports and service names are kept when displaying the collected data. Table 1 shows some examples of network server configuration files.
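As a rough illustration of this collection step, the sketch below uses the official shodan Python library to build one dataset file per service; the API key, service queries, and retained fields are placeholders rather than the paper's actual script:

```python
import json
import shodan

API_KEY = "YOUR_SHODAN_API_KEY"  # placeholder; not from the paper
api = shodan.Shodan(API_KEY)

# One dataset file per service, as described above (service names are examples).
for service in ["apache", "nginx", "vsftpd"]:
    hosts = []
    try:
        results = api.search(service)  # query Shodan for hosts running the service
        for match in results["matches"]:
            hosts.append({
                "port": match.get("port"),
                "product": match.get("product"),
                "vulns": list(match.get("vulns", {})),  # known CVE identifiers, if present
            })
    except shodan.APIError as e:
        print(f"Shodan error for {service}: {e}")
        continue

    # Keep only non-sensitive fields (no IP addresses), mirroring the privacy note above.
    with open(f"{service}_dataset.json", "w") as f:
        json.dump(hosts, f, indent=2)
```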
2) Vulnerability data set. In addition to the host dataset, a vulnerability dataset file is created. For a set of known vulnerabilities, it includes the CVE and Microsoft vulnerability identification numbers, as well as the vulnerability type and the base score and exploitability score components of the CVSS score. To include all the relevant information needed, information from the National Vulnerability Database (NVD) and the Microsoft security database (MS) was combined to create a new dataset. Table 2 shows some of the entries in the resulting vulnerability dataset.
3) DQN data set. The DQN dataset is the training dataset for the deep Q-learning network, built from the host and vulnerability datasets collected via Shodan. To generate the dataset required by the DQN algorithm, an attack tree must first be created for each given network topology.
To do this, a series of network topology templates are created here, populated with actual data from the host dataset. Figure 7 shows an example template for a network server vulnerability configuration.
From this example, we can see that vulExists comes from Shodan and shows that the target is a web server with the CAN-2002-0392 vulnerability. vulProperty, from the vulnerability dataset, shows that CAN-2002-0392 is a remote-exploit type vulnerability that can cause privilege escalation. This detailed network topology information is used by MulVAL to generate the attack tree corresponding to the topology. Figures 8 and 9 later in these notes show an example network topology and the attack tree generated for it, respectively.
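The facts in such a template could be generated programmatically; the following sketch is a hypothetical glue step that emits vulExists/vulProperty facts in the shape quoted above (the exact MulVAL predicate arities should be checked against the MulVAL documentation):

```python
# Hypothetical glue code: turn Shodan / vulnerability-dataset records into
# MulVAL Datalog facts like those in the Figure 7 template.
def mulval_facts(host, service, vuln):
    return [
        f"vulExists({host}, '{vuln['id']}', {service}).",
        f"vulProperty('{vuln['id']}', {vuln['type']}, {vuln['impact']}).",
    ]

vuln = {"id": "CAN-2002-0392", "type": "remoteExploit", "impact": "privEscalation"}
for fact in mulval_facts("webServer", "httpd", vuln):
    print(fact)
# vulExists(webServer, 'CAN-2002-0392', httpd).
# vulProperty('CAN-2002-0392', remoteExploit, privEscalation).
```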
Next, the attack tree must be transformed into a transition matrix. A method for converting attack trees into transition matrices was first proposed by Yousefi et al., but it is not suitable for penetration testing, where steps other than vulnerability exploitation, such as file access and command execution, also matter. Therefore, this paper proposes an improved method that transforms the attack tree into a simplified transition matrix. In the first step, all the nodes in the attack tree are mapped into matrix form, including CVSS score information for vulnerabilities as well as predefined scores for other actions, such as accessing files. In the next step, a depth-first search (DFS) algorithm is used to simplify the matrix rather than feeding it directly into the DQN algorithm. The idea is that the full transition matrix shows all possible actions, but it can be simplified by selecting only those actions that can be used to reach the target of the attack.
Q: An attack chain can only be established if the attack behaviors are related to one another; the authors use this fact to simplify the matrix.
Therefore, the DFS algorithm is used to find all possible paths to the destination, and a simplified matrix is then created that includes (i) the score of the starting node in the first column; (ii) the total score of the intermediate steps in the middle columns; and (iii) the score of the target node in the last column. These scores are later used as reward scores in the DQN algorithm.
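A minimal sketch of this simplification, assuming the full transition matrix is given as an adjacency structure with per-node scores (the toy graph and scores below are illustrative, not Figure 9):

```python
def all_paths(adj, start, goal, path=None):
    """Depth-first search for every path from start to goal in the attack tree."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    paths = []
    for nxt in adj.get(start, []):
        if nxt not in path:  # avoid revisiting nodes (no cycles in a path)
            paths.extend(all_paths(adj, nxt, goal, path))
    return paths

def simplified_matrix(paths, score):
    """One row per path: start score, summed intermediate scores, target score."""
    return [
        [score[p[0]], sum(score[n] for n in p[1:-1]), score[p[-1]]]
        for p in paths
    ]

# Toy example (structure and scores are illustrative).
adj = {26: [24, 17], 24: [17], 17: [1], 1: []}
score = {26: 0.01, 24: 1.5, 17: 4.9, 1: 100}
print(simplified_matrix(all_paths(adj, 26, 1), score))
```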
B. DQN
In this framework, DQN is used to determine the most feasible attack path through continuous training of the DQN model. The input to the model is the simplified matrix described above, and the Softmax function is used as the activation function. The output of the DQN model is the best attack path. In the learning process, the agent in the DQN model represents an attacker, and the target environment is simulated by the simplified attack matrix. The attacker can move from node to node in the attack matrix until reaching the final goal: the target server.
The reward for exploiting each vulnerability used in the DQN model is calculated from vulnerability scores determined by us based on components of the Common Vulnerability Scoring System (CVSS).
In CVSS, the base score reflects the severity of the vulnerability itself, while the exploitability score reflects the feasibility of exploiting that specific vulnerability. Therefore, the maximum exploitability score of 10 is used to weight the base score, so that the reward accounts for both the importance of a vulnerability and the feasibility of exploiting it.
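Based on this description, the vulnerability score used as a reward (Formula 6 in the paper, not reproduced in these notes) plausibly has the following form; this is a reconstruction from the text, not a quotation:

```latex
% Hedged reconstruction of Formula 6 from the surrounding description:
% the exploitability score (max 10) weights the base score.
Score_{VUL} = BaseScore_{VUL} \times \frac{ExploitabilityScore_{VUL}}{10}
```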
C. Penetration Tools
For an automated penetration testing framework to be usable against real systems, it needs to interact with the real network environment, for example by running commands and exploiting vulnerabilities. At this stage, the authors decided to leverage existing penetration testing tools such as Metasploit rather than build their own, similar to the approach taken by CyPROM.
While the implementation of this feature is still a work in progress, we will release it along with other parts of the framework in the near future. Our approach is to create a wrapper for penetration testing tools so that the results of the DQN model trained as described above can be used to send commands to these penetration tools that will operate on the real target system. The results of these actions are received as feedback and used to determine how to proceed with a particular attack path.
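As a sketch of what such a wrapper might look like (the authors' implementation is unreleased), one option is to drive msfconsole non-interactively with a generated resource script; the run_exploit helper, module path, and option names below are hypothetical placeholders:

```python
import subprocess
import tempfile

def run_exploit(module, target_ip, extra_options=None):
    """Drive Metasploit non-interactively via a generated resource script.
    The module/option values would come from a DQN attack-path step."""
    lines = [f"use {module}", f"set RHOSTS {target_ip}"]
    for key, value in (extra_options or {}).items():
        lines.append(f"set {key} {value}")
    lines += ["run", "exit"]
    with tempfile.NamedTemporaryFile("w", suffix=".rc", delete=False) as rc:
        rc.write("\n".join(lines) + "\n")
        rc_path = rc.name
    # msfconsole -q (quiet) -r <file> executes the resource script.
    result = subprocess.run(["msfconsole", "-q", "-r", rc_path],
                            capture_output=True, text=True)
    return result.stdout  # feedback used to decide how to proceed on the path

# Hypothetical call for a node exploiting CVE-2012-0053 (module name is a placeholder).
# print(run_exploit("exploit/some/module", "192.0.2.10"))
```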
0x04 Experimental Results
A. Experiment Scenario
Figure 8 shows the network topology model used in the experiment. The model represents a small corporate network consisting of a web server, a file server, and a workstation. The file server and workstation are on the same subnet, connected to a firewall through a router. The web server is in a separate subnet, connected to the Internet through the firewall.
From an attack point of view, they made the following assumptions about attack methods.
- The attacker starts from the Internet and can access the web server over HTTP and HTTPS.
- The file server and the web server can communicate through file-sharing protocols such as NFS and FTP.
- The file server and the workstations can communicate through file-sharing protocols such as NFS and Samba.
- The file server and the workstations can access the Internet through HTTP and HTTPS.
To determine the actual protocol details, data collected by Shodan was used to initialize the information about vulnerabilities, open ports, products, and protocols of the web server and file server. The workstation is assumed not to run any services, only to use various transport protocols. Table 3 shows examples of the network information configuration for these hosts.
The vulnerability data set is used to attach details to each vulnerability, including its type and impact (such as the level of permissions that can be achieved with the vulnerability). For each different vulnerability type, the attack tree algorithm will produce different attack paths. In Table 4, some examples of the types of vulnerabilities encountered in this scenario are given.
Q: In fact, the authors have simplified the problem. The analysis rests on the assumption that Shodan has already collected all the information, and focuses on the selection of the attack path.
B. DQN Dataset Generation
MulVAL is used to generate an attack tree based on the network topology and configuration information discussed earlier. Figure 9 shows the main portion of the attack tree generated for the sample scenario, with each node marked with a short description of how to exploit a vulnerability. In this example, the attacker starts on the Internet, represented by node 26, which is in the top center of the figure. The attacker’s goal is to execute code on the workstation represented by node 1 in the upper left of the figure. We assume that the attacker can move in the direction of the arrow until the target is reached.
To build the transition matrix required by the DQN algorithm, each node must be assigned a reward score, as follows.
1) The start node (node 26 in this example) gets a reward score of 0.01, and the target node (node 1 in this example) gets a reward score of 100.
2) Each node that exploits a vulnerability (in this example, node 16 exploits the CVE-2012-0053 vulnerability of node 28) gets the Score VUL value defined in Formula 6 as its reward score.
3) Each node that executes code or accesses files (denoted by execCode and accessFile in this example) gets a reward score of 1.5, because such behavior is important during penetration testing (target node 1 is not covered by this rule).
4) Any other node in the tree gets a reward score of 0, or -1 if there is no path between the two nodes.
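These rules translate directly into code; a minimal sketch, assuming each node carries an id, a kind field, and CVSS components, and using the Formula 6 reconstruction from earlier (all field names are assumptions):

```python
def node_reward(node, start_id=26, target_id=1):
    """Assign reward scores per the four rules above (field names are assumptions)."""
    if node["id"] == start_id:
        return 0.01
    if node["id"] == target_id:
        return 100.0
    if node["kind"] == "vulExists":          # vulnerability node: CVSS-derived score
        return node["base"] * node["exploitability"] / 10.0
    if node["kind"] in ("execCode", "accessFile"):
        return 1.5
    return 0.0  # any other node; unreachable node pairs get -1 in the matrix itself

# Illustrative CVSS components, not the scores of CVE-2012-0053.
print(node_reward({"id": 16, "kind": "vulExists", "base": 6.8, "exploitability": 8.6}))
```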
C. DQN Training Results
The simplified matrix described above becomes the input to the DQN model, which is then trained to determine the total return of all possible paths. For the network topology proposed in the paper, the authors used Shodan data with different vulnerabilities in the training template, creating a total of 2000 different attack trees as training data and 1000 different attack trees as validation data. For validation, the best path in the simplified matrix was defined as the one that reaches the highest reward in the minimum number of steps. Table 5 gives the model accuracy and the average number of steps for the optimal attack path. These results show that the model performs well at finding the most feasible path for this sample network topology.
Figure 10 shows the change in the DQN model's average reward for an attack path in the experimental scenario over a total of 100 training iterations. The reward is small at first and gradually increases after about 30 iterations; after about 60 iterations it becomes stable, meaning the DQN model has found the optimal path. In this case, the path is the sequence of steps "26→24→17→16→15→13→10→9→8→6→5→3→2→1" (see Figure 9). This path not only takes full advantage of the web server and file server vulnerabilities, but also adopts a simple attack method that is easy for penetration testing tools to execute.
Figure 10. Reward variation for an attack path in the experimental scenario.
D. Discussion
Although the DQN algorithm appears to converge relatively quickly when using several sample network topologies, we believe that more work is needed to determine the correct parameter selection of the DQN model in a large number of scenarios with different attributes.
Although the framework designed in this paper is intended for automated penetration testing of real systems, so far they have only trained on generic network topology models. As a next step, they consider feeding the DQN algorithm the results of scanning a specific actual network topology as a training model, so that it can evaluate the particularity of that topology and propose the best attack path for that topology.
Each node on the most feasible attack path determined by the DQN algorithm contains all the information needed to carry out that attack step effectively. Therefore, the path and node information can be used to drive penetration testing tools such as Metasploit to carry out the attack. For example, knowing that node 28 represents the CVE-2012-0053 vulnerability, the framework can send a command to Metasploit to exploit the web server vulnerability using the corresponding internal MSF exploit module. The authors plan to take advantage of this in the penetration tools module mentioned in Section III-C.
Meaning of this framework:
- For attack training activities: the framework can suggest attack paths, and trainees can then conduct their own experiments in a guided learning approach across the whole network.
- Realistic attacks can be performed in an automated manner (once the penetration tool module is complete). This will reduce the cost of organizing defense training campaigns, as the support of white-hat hackers will no longer be needed to carry out such attacks. Repeated training with realistic attack methods should significantly improve participants' defensive skills.
0x05 Conclusion
This paper proposes an automated penetration testing framework based on deep reinforcement learning techniques, in particular the deep Q-learning network (DQN). The approach innovatively combines the Shodan search engine with various vulnerability datasets to gather real host and vulnerability data for building realistic training and validation scenarios. The attack tree method is then used to generate attack information for each training scenario, which is processed and used to train the DQN model. Training uses the reward score assigned to each node, based mainly on CVSS scoring information, to determine the most feasible attack path in a given network scenario.
Their experiment populates a given network topology with real host and vulnerability data, constructing 2000 training scenarios and 1000 validation scenarios. Even with this modest amount of training data, DQN achieved an accuracy of 0.86 in determining the best attack path, and in the remaining cases it provided viable attack paths that met the authors' expectations.
Future work:
- Expand the training dataset with more network topologies to improve the versatility and stability of the DQN model.
- Consider incorporating network service scanning capabilities into the framework so that real target environment information can be provided to the DQN model automatically, making the results for actual network topologies more accurate.