Malware Detection Visulization

by Bo Zhang, Xuanyu Wu and Shuliang Mai

Introduction

Nowadays, Android is dominating the smartphone market as an open source and customizable operating system. Many hackers targeted Android applications by disseminating malwares, posing serious threats to users. Historically, mobile security products such as Norton and Lookout, are heavily relied upon as major defense against such threats. Recently, many machine learning based methods have been invented for malware detection. A successful one of them is creating features from a Heterogeneous Information Network (HinDroid). However, it is confined in such a way that it ignores more comprehensive information which can be extracted from graph representation. In this project, we will explore different meta-paths and incorporate various graph embedding methods in the task of malware prediction. We propose to build upon our previous work in HinDroid replication, more specifically we will attempt to use deep learning graph embedding techniques. The success of this project will not only be determined by the performance of our final model but more importantly the evaluation of different graph embedding methods and further reasoning on the result.

Code Repository Our Presentation Android Logo

Research on Hindroid Paper

Jump to Hindroid Paper

We have replicated the major parts of the Hindroid paper and trained the model with our own benign apps fetched from Apkpure[2]

Our replication results showed Below.

The malware smali codes were obtained from the dataset provided by Prof. Aaron Fraenkel. Other than aforementioned data, the process to build the A, B, P matrix can be referred to Hindorid Paper[2]. Our plan is to develop based on the features from the HinDroid[1] method which uses a SVM with different customized kernels for classification, we try to extract the latent embeddings of APPs and APIs before building a classification model. The latent embeddings are generated from the metapath graphs using the same matrices (A, B, P) with node2vec and metapath2vec. Based on the EDA and plots with reduced dimensions below, we can see that the embedding method is on the right track and the source data is of high quality in classification. Before we jump into the latent embedding generated by node2vec, we first analyze how the HinDroid[1] method performs differently on different types of malwares.

Our Test Accuracy Rate on Different Malware Types Showed as Below

With the verification of the different accuracy on different malware types. We proceed to research on the performances of graph embedding methods.

Node2vec

Jump to Node2vec Paper

Node2vec is an algorithmic framework for learning continuous feature representations for nodes. In the algorithm, biased random walks are generated from the network, after which word2vec algorithm is applied to get embeddings for each node. Specifically, two parameters p and q are used while generating the random walks. If a random walk just traversed edge (t,v), the probability of transiting to the current node v’s neighbor x is 1/p if x is the same as t; 1 if there is an edge between t and x; 1/q if there is no direct relationship between t and x.

With the HinDroid graph, we can try applying node2vec to get node embeddings. However, in order to incorporate the existing metapath information, we decided to generate random walks in a way that slightly differs from the original node2vec definition. In particular, at each node, depending on the current metapath and our position in the metapath, we only look at edges of a certain type and sample from certain neighbors. For example, if the metapath is ABAT, we should start from an APP, then from neighbors that have an edge type A with the current APP, randomly select an neighbor, in this case an API, based on the probability distribution defined in node2vec paper. After that, randomly select another API from the current API’s neighbors that connect with the current API with edge type B. Lastly, we select another APP from the API’s neighbors with edge type A. We can continue generating nodes and add them to the random walk.

To understand the results better, we further did dimension reduction of APP embeddings and plot the 2-dimensional vectors to see how clustered malware and benign application embeddings are. As shown below, malware and benign APP embeddings are well-separated. This means node2vec algorithm has successfully captured the difference between malware and benign applications from the HinDroid graph and the random walks we generated.

Node2vec Embeddings Visualization with Different Meta-paths and Parameters

Moreover, from the figure, we can see small clusters of malwares, and our initial guess is that each cluster corresponds to a certain type of malware.

Metapath2Vec

Traditionally, conventional network embedding models, such as node2vec, work with homogeneous networks, which means that they treat all different types of nodes and relations the same. This will make heterogeneous nodes, such as App and API in our case, indistinguishable. metapath2vec is a network embedding model which aims to generate latent embeddings based on heterogeneous networks. Specifically, “metapath2vec maximizes the network probability in consideration of multiple types of nodes and edges”.[3] Similar to node2vec, the first step is to generate random walks in the heterogeneous network. Then, skip-gram model[4] will be applied to model the semantically and geographically close nodes. Similar to the logic of random walk for node2vec, meta-path-based random walks will be generated for the application of skip-gram models.

In order to make comparison with node2vec, we generated random walks for the same three meta-paths: AAT, ABAT, and ABPBTAT under the same dataset. Also, to make sure every APP appears at least once, we start from each APP and generate 200 random walks of max length 100. Then, metapath2vec is applied to these random walks to generate all types of node embeddings. By using these embeddings as feature representations and applying SVM linear kernel with “scale” gamma, the classification results are summarized in the following table.

By using the same t-SNE dimension reduction as stated last part, the 2-dimensional visualization of benign and malwares Apps are shown below. We can see from the figure that malwares and benign wares are mostly well separated and some clusters of malwares were formed.

Conclusion

Our metapath2vec can beat HinDroid in ABA metapath.

Graph-based method generate embeddings with customized dimension.

Advanced machine learning method can apply to embeddings.

More EDA is available for embeddings. (e.g. Cluster Analysis)

Back to Top

Guided By Aaron and Shivam