Runner, 2016/03/07
0x00 Preface
Some time ago I read an interesting article in the cloud knowledge library about classifying malicious code with machine learning. That article describes the method used by the winning team in a malicious code classification contest on Kaggle and demonstrates the application and potential of machine learning in the security field. However, the contest covered only the classification of malicious code and did not go on to malicious code detection. In addition, its samples were limited to the PE format on the Windows platform, with no coverage of mobile applications. Inspired by this, we tried using machine learning to detect malicious code on the Android platform and obtained reasonably good detection results.
0x01 Background
Android malicious code detection methods
At present, malware detection methods fall into two main groups: signature-based detection and behavior-based detection. Signature-based detection checks whether a file contains the signatures (such as a particular code fragment or string) of known malware. It is fast and accurate and has a low false positive rate, but it cannot detect unknown malicious code. Behavior-based detection matches the observed behavior of a program against known malicious behavior patterns to judge whether the target file is malicious. It can detect unknown variants of malicious code, but it has a high false positive rate.
Behavior-based analysis is divided into dynamic analysis and static analysis. Dynamic analysis runs the program in a sandbox or emulator and analyzes its behavior through monitoring or interception, which costs resources and time. Static analysis extracts features from the program by reverse engineering and analyzes its instruction sequences. This article uses static analysis to detect malicious code.
Weka and machine learning classification algorithms
Weka (Waikato Environment for Knowledge Analysis) is free, non-commercial, open source machine learning and data mining software written in Java. Weka stores data in Attribute-Relation File Format (ARFF) files, which are ASCII text files. In this article, the feature data are written to ARFF files, and Weka’s built-in classification algorithms are used to train and test the model.
Machine learning is broadly divided into supervised and unsupervised learning. In supervised learning, a learning algorithm fits a model to a training set, and a test set is then used to evaluate the model’s accuracy and performance. Classification is a supervised learning task, so a model must be built first. Common classification algorithms include random forest and support vector machine (SVM).
Basic format of APK
A description of the APK (Android application package) format is available on Wikipedia:
The APK file format is a ZIP-based format that is constructed in a similar way to JAR files. Its Internet media type is application/vnd.android.package-archive.
An APK file usually contains the following files:
- classes.dex: Dalvik bytecode, which can be executed by the Dalvik virtual machine
- AndroidManifest.xml: the Android manifest file, in XML format, which describes the application’s name, version number, required permissions, registered services, and other linked applications
- META-INF: a folder containing three files
  - MANIFEST.MF: manifest information
  - CERT.RSA: the certificate and authorization information of the application
  - CERT.SF: the list of resources with their SHA-1 digests
- res: the resource folder required by the APK
- assets: directory of raw resource files that do not need to be compiled
- resources.arsc: the compiled binary resource file
- lib: library file directory
Of all these files, the one to focus on is classes.dex, which holds the compiled and packaged executable code of the Android application.
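Because an APK is a ZIP archive, its entries can be inspected with any ZIP library. The following is a minimal sketch using Python’s standard zipfile module; the function names list_apk_entries and has_dex are hypothetical helpers for illustration and are not part of the original tooling.

```python
import zipfile


def list_apk_entries(apk_path):
    """Return the names of all entries in an APK, which is a ZIP archive."""
    with zipfile.ZipFile(apk_path) as apk:
        return apk.namelist()


def has_dex(apk_path):
    """Check whether the archive contains a top-level classes.dex entry."""
    return "classes.dex" in list_apk_entries(apk_path)
```

Calling list_apk_entries on a real APK would typically return names such as classes.dex, AndroidManifest.xml, and entries under META-INF/ and res/.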
Dalvik virtual machine and disassembly
Unlike the Java virtual machine (JVM), the Android virtual machine is called the Dalvik virtual machine (DVM). The JVM runs Java bytecode, while the DVM runs Dalvik bytecode. The JVM is based on a stack architecture, whereas the DVM is based on a register architecture.
The DVM has its own executable file format, DEX, and its own instruction set. smali and baksmali are the assembler and disassembler for the DEX format; disassembling a DEX file produces smali files. smali code has a specific format and syntax, and the smali language is a human-readable representation of Dalvik VM bytecode.
Apktool wraps and extends the smali tools: besides assembling and disassembling DEX files, it can also decompile and recompile the resource files that were compiled into binary form in the APK. Instead of using smali and baksmali directly, this article uses Apktool to disassemble APK files.
java -jar apktool.jar d D:\drebin\The_Drebin_Dataset\set\apk\DroidKungFu\xyz.apk
When the command succeeds, the output directory contains the following top-level structure:
- AndroidManifest.xml: configuration file
- apktool.yml: file generated during decompilation, used by Apktool
- assets/: directory of asset files that do not need decompilation
- lib/: directory of library files that do not need decompilation
- res/: directory of decompiled resource files
- smali/: directory of smali source files generated by disassembly
The structure of the smali directory mirrors the src directory of the original Java source.
0x02 Feature Engineering
Classification and description of Dalvik instructions
smali is a representation of DVM bytecode, and although it is not an official standard language, all of its statements follow a consistent syntax. For details on Dalvik opcodes, see pallergabor.uw.hu/androidblog…, which lists the meaning, usage, and examples of each Dalvik opcode.
Since there are more than 200 Dalvik instructions, they need to be classified and simplified: irrelevant instructions are removed, leaving only a core instruction set in seven categories, and for each instruction only the opcode field is kept while the operands are dropped. The seven categories M, R, G, I, T, P and V stand for move, return, jump (goto), conditional branch (if), get data, put data, and method invocation, respectively. Each retained instruction is classified and mapped to its category letter; see the following figure for details:
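A minimal sketch of this simplification step is given here. It assumes that the categories can be recognized from the opcode mnemonic alone (move*, return*, goto*, if-*, the aget/iget/sget family, the aput/iput/sput family, and invoke-*); the exact opcode lists used by the authors are in the figure and may differ. The function names are hypothetical.

```python
# Assumed prefix-to-category mapping; the authors' exact opcode lists may differ.
CATEGORY_PREFIXES = [
    ("move", "M"),
    ("return", "R"),
    ("goto", "G"),
    ("if-", "I"),
    ("invoke-", "V"),
]


def classify_opcode(opcode):
    """Map a Dalvik opcode mnemonic to a category letter, or None to drop it."""
    for prefix, letter in CATEGORY_PREFIXES:
        if opcode.startswith(prefix):
            return letter
    # aget/iget/sget (and typed variants such as iget-object) read data;
    # aput/iput/sput (and their variants) write data.
    if opcode[:1] in ("a", "i", "s") and opcode[1:4] == "get":
        return "T"
    if opcode[:1] in ("a", "i", "s") and opcode[1:4] == "put":
        return "P"
    return None


def simplify_smali(smali_text):
    """Return the category-letter string for the instructions in smali text."""
    letters = []
    for line in smali_text.splitlines():
        stripped = line.strip()
        # Skip blank lines, directives (.method), comments (#) and labels (:cond_0).
        if not stripped or stripped[0] in ".#:":
            continue
        letter = classify_opcode(stripped.split()[0])
        if letter is not None:
            letters.append(letter)
    return "".join(letters)
```

For example, a method body consisting of const/4, iget-object, invoke-virtual, move-result, if-eqz, iput and return-void is reduced to the string TVMIPR: const/4 is dropped and the operands of the remaining instructions are discarded.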
OpCode N-gram
N-gram is a concept from natural language processing, but it is also often used in the analysis of malicious code. An opcode n-gram feature is an n-gram extracted from the opcode fields of the instruction sequence; N can be 2, 3, 4, and so on. The opcode n-grams of a smali assembly file are shown below:
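As a minimal sketch of n-gram extraction, assume the instructions of a file have already been reduced to a string of category letters. The function name opcode_ngrams is hypothetical.

```python
from collections import Counter


def opcode_ngrams(sequence, n):
    """Count every contiguous n-gram in a string of category letters."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
```

With the sequence TVMIPR and N = 3, the sliding window yields the four 3-grams TVM, VMI, MIP and IPR, each with a count of 1. A sequence shorter than N produces no n-grams.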
0x03 System Design and Implementation
The system has two parts: building the malicious code detection model and testing it.
The malicious code detection model is built as follows:
Several programs were written in C++ to process the data during model building:
- Total.exe: merges all smali files in the project directory produced by disassembling a single APK into one file
- Simplication.exe: extracts the instructions and maps them to their category letters
- Ngramgen.exe: generates the n-gram sequences for a specified N
- Arff.exe: counts the occurrences of each feature and generates an ARFF file suitable for Weka
The malicious code detection model is tested as follows:
The machine learning tool Weka was used to test the model and measure its accuracy. A model with high accuracy can then be used to predict whether unknown Android code is malicious.
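The last step of the pipeline above, turning per-sample n-gram counts into an ARFF file, might look like the following sketch. It is not the authors’ C++ implementation: the relation name, the attribute naming, the class labels benign and malware, and the function name build_arff are all assumptions for illustration. The feature space is taken to be every possible n-gram over the seven category letters.

```python
from itertools import product

LETTERS = "MRGITPV"


def build_arff(samples, n, relation="opcode_ngram"):
    """Render (ngram_counts, label) pairs as ARFF text.

    samples: list of (mapping from n-gram to count, label) pairs, where label
    is "benign" or "malware" (these label names are an assumption).
    """
    # One numeric attribute per possible n-gram, e.g. 7**3 = 343 for n = 3.
    features = ["".join(p) for p in product(LETTERS, repeat=n)]
    lines = ["@relation " + relation, ""]
    lines += ["@attribute " + f + " numeric" for f in features]
    lines += ["@attribute class {benign,malware}", "", "@data"]
    for counts, label in samples:
        row = [str(counts.get(f, 0)) for f in features]
        lines.append(",".join(row + [label]))
    return "\n".join(lines) + "\n"
```

Each sample becomes one comma-separated row of counts followed by its class label, so the resulting text can be saved with an .arff extension and loaded in Weka.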
0x04 Experimental evaluation
Experimental data source
The experimental data consist of malicious code samples and benign code samples. The benign samples were downloaded from the Android Market; the malicious samples came from the Drebin project, which collected 5,560 APK samples from 178 malware families between August 2010 and October 2012. The distribution of samples across the 178 malicious code families is shown in the figure below:
Experimental results
This article used 540 malicious samples and 560 benign samples, 1100 samples in 2 classes in total. The classifier is a random forest with 150 decision trees, the n-gram length N is 3, and 10-fold cross-validation is used.
The accuracy is shown in the figure below. Of the 1100 samples, 1045 were classified correctly and 55 were misclassified. For the malware class, TPR (true positive rate) = 0.981, FPR (false positive rate) = 0.08, and precision = 0.922.
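The reported rates can be checked against a confusion matrix. The source does not give the individual counts, so the values in this sketch are reconstructed from the sample sizes (540 malicious, 560 benign) and the reported rates, and should be read as an assumption. They are consistent with the 1045 correctly classified samples (530 + 515).

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute TPR, FPR, precision and accuracy from confusion-matrix counts."""
    return {
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Reconstructed counts (an assumption, not stated in the source):
# 540 malicious samples -> 530 detected, 10 missed;
# 560 benign samples -> 45 flagged as malware, 515 passed.
metrics = binary_metrics(tp=530, fn=10, fp=45, tn=515)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these counts the true positive rate is 530/540 ≈ 0.981, the false positive rate is 45/560 ≈ 0.080, the precision is 530/575 ≈ 0.922, and the overall accuracy is 1045/1100 = 0.95.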
The ROC curve is shown below. The receiver operating characteristic (ROC) curve and the area under it (AUC) are often used to evaluate the quality of a binary classifier. For background on AUC, see en.wikipedia.org/wiki/Area_u…
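To make the AUC concrete: it equals the probability that a randomly chosen malicious sample receives a higher score than a randomly chosen benign one, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name auc and the example scores are hypothetical and are not taken from the experiment.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: predicted malware scores; labels: 1 for malware, 0 for benign.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

A classifier that ranks every malicious sample above every benign one has an AUC of 1.0, while random scoring gives about 0.5. The pairwise form is quadratic in the number of samples, which is fine for a data set of this size.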
0x05 Summary and Outlook
- Overall, the experimental results show that this method detects malicious code with high accuracy (1045 of 1100 samples classified correctly)
- Combining the opcode n-grams with other features should improve accuracy further
- Besides random forest, other classification algorithms, such as support vector machines, are also worth trying
0x06 References
- Using machine learning to classify malicious code
- Kaggle’s Malware classification
- Weka software download
- Description of Dalvik Opcodes
- Random forest algorithm
- Apktool tools
- Smali study notes
- Drebin project introduction and download