APIGraph

A framework to collect and utilize API semantics to enhance Android malware classifiers.

About

We introduce APIGraph, which uses a new concept named API semantics to tackle model aging, a long-standing problem for machine learning-based malware detection systems, by enhancing feature space abstractions. Our paper, Enhancing State-of-the-art Classifiers with API Semantics to Detect Evolved Android Malware, was published at ACM CCS 2020. Here's the trailer of our talk:

Features

We show that although Android malware evolves over time, many semantics are preserved across different variations, which gives us an opportunity to detect malware after it has evolved.
We propose APIGraph, a framework to collect and utilize API semantics by extracting API relations from documents and building and vectorizing an API relation graph.
We built and open-sourced a large-scale (> 320K apps) and evolutionary (7 years, from 2012 to 2018) dataset, and evaluated APIGraph with 4 SOTA Android malware classifiers, strictly following spatial and temporal consistency.
We released our code and dataset at https://github.com/seclab-fudan/APIGraph

The Model Aging Problem

Model aging, or model degradation, is the phenomenon in which the performance of a trained classifier drops significantly over time. It is a long-standing problem in the machine learning literature. When ML techniques are applied to security tasks such as malware detection, the problem becomes even worse, because the problem space can be far more complicated than in traditional ML tasks such as image classification. Recent academic and industry experience demonstrates how severe model aging can be:

A Kaspersky 2019 white paper[1] shows the detection rate of an ML-based classifier drops from 100% to below 80%, or even 60% under another configuration, in only three months.
A USENIX Security 2019 paper[2] tests 3 state-of-the-art ML-based Android malware classifiers and they all suffer from model aging.

[1] https://media.kaspersky.com/en/enterprise-security/Kaspersky-Lab-Whitepaper-Machine-Learning.pdf, Kaspersky 2019

[2] TESSERACT: Eliminating experimental bias in malware classification across space and time, USENIX Security 2019

The existing solution to model aging is to retrain and update aged models with newly labeled samples, i.e. a data-perspective solution. However, this approach has two shortcomings: 1) It comes at a high cost: many new samples must be labeled, and retraining opens a time window in which malware may infect many users. 2) More importantly, retraining with new data is still constrained by the training data itself, so by nature the retrained models remain blind to malware evolution.

Idea of APIGraph

Observation: Many semantics are preserved during evolution, while the implementation may differ across variations.

A real-world example: XLoader is a family of spyware and banking trojans that steals personally identifiable information (PII) and financial data. It was reported by TrendMicro in April 2018 and kept evolving afterwards, producing several variations until late 2019. During this evolution, the implementations of different versions changed a lot. However, we observe that one piece of core logic is preserved across versions. As shown in the following figure, an early version, V1, collects the IMEI and sends it to its server through HTTP. A later version, V2, additionally collects the IMSI and ICCID and sends them to its server through a socket. Although different APIs are used, the overall goal is the same, i.e. collecting PII and sending it to the server.

Therefore, our idea is to let machine learning models capture such preserved semantics during malware evolution. Our framework, APIGraph, addresses two main challenges:

Where and how to extract such semantics?
How to let ML models utilize such knowledge?

The short answer is that we extract API relations from Android API documents using NLP techniques and pre-defined relation templates. We then build a relation graph and use a graph embedding algorithm to vectorize each API while preserving its semantics and relations. After that, we cluster APIs into semantically close groups and use these groups to stabilize the feature vectors of evolving malware. For a more detailed explanation, please refer to our paper or join our talk at CCS 2020.
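The last step, using API groups to stabilize feature vectors, can be illustrated with a minimal Python sketch. The cluster assignment below is written by hand for illustration only; in APIGraph it would come from clustering the API embeddings. The cluster names and the choice of APIs for the two variants are assumptions that loosely mirror the XLoader example.

```python
# Hypothetical cluster assignment, hand-written for illustration.
# In APIGraph the groups would come from clustering API embeddings.
API_TO_CLUSTER = {
    "android.telephony.TelephonyManager.getDeviceId": "device_identifier",
    "android.telephony.TelephonyManager.getSubscriberId": "device_identifier",
    "android.telephony.TelephonyManager.getSimSerialNumber": "device_identifier",
    "java.net.HttpURLConnection.connect": "network_send",
    "java.net.Socket.getOutputStream": "network_send",
}
CLUSTERS = sorted(set(API_TO_CLUSTER.values()))


def cluster_features(apis_used):
    """Return a binary vector with one entry per API cluster."""
    present = {API_TO_CLUSTER[a] for a in apis_used if a in API_TO_CLUSTER}
    return [1 if c in present else 0 for c in CLUSTERS]


# Two illustrative variants that use different APIs for the same goal.
v1 = [
    "android.telephony.TelephonyManager.getDeviceId",
    "java.net.HttpURLConnection.connect",
]
v2 = [
    "android.telephony.TelephonyManager.getSubscriberId",
    "android.telephony.TelephonyManager.getSimSerialNumber",
    "java.net.Socket.getOutputStream",
]

print(cluster_features(v1))  # [1, 1]
print(cluster_features(v2))  # [1, 1]
```

At the API level the two variants share no feature, but at the cluster level their vectors are identical, which is the kind of stability the grouping is meant to provide.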

Workflow of APIGraph

Dataset

We build a large-scale, evolutionary dataset that contains more than 320K Android apps spanning 7 years. As far as we know, it is the largest dataset for studying and evaluating Android malware classifiers. Furthermore, to make the evaluation fair, we strictly follow the recently proposed best practice (Tesseract-Security2019) to satisfy both temporal and spatial consistency:

Temporal consistency requires that all training samples strictly precede the testing ones in time, and that in each test, malware and goodware come from the same time period.
Spatial consistency requires that the malware ratio be close to that in the real world, i.e. 10% for Android malware.
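The two constraints can be sketched in a few lines of Python. This is a minimal illustration on made-up samples, not the dataset construction code: the record layout (`date`, `label`) and both helper functions are assumptions. It enforces only the train-before-test part of temporal consistency; keeping malware and goodware in the same period per test would additionally need per-month bucketing.

```python
import random
from datetime import date
from fractions import Fraction


def temporal_split(samples, cutoff):
    """Split so that every training sample strictly precedes the cutoff."""
    train = [s for s in samples if s["date"] < cutoff]
    test = [s for s in samples if s["date"] >= cutoff]
    return train, test


def enforce_malware_ratio(samples, ratio=Fraction(1, 10), seed=0):
    """Downsample malware so it makes up at most `ratio` of the result."""
    rng = random.Random(seed)
    goodware = [s for s in samples if s["label"] == 0]
    malware = [s for s in samples if s["label"] == 1]
    # Solve m / (m + g) = ratio for m, given g goodware samples.
    max_malware = int(len(goodware) * ratio / (1 - ratio))
    if len(malware) > max_malware:
        malware = rng.sample(malware, max_malware)
    return goodware + malware


# Made-up samples: label 0 is goodware, label 1 is malware.
samples = (
    [{"date": date(2012, 6, 1), "label": 0} for _ in range(90)]
    + [{"date": date(2012, 6, 1), "label": 1} for _ in range(40)]
    + [{"date": date(2013, 1, 1), "label": 0} for _ in range(45)]
    + [{"date": date(2013, 1, 1), "label": 1} for _ in range(20)]
)

train, test = temporal_split(samples, cutoff=date(2013, 1, 1))
train = enforce_malware_ratio(train)
test = enforce_malware_ratio(test)
print(len(train), sum(s["label"] for s in train))
print(len(test), sum(s["label"] for s in test))
```

`Fraction` is used for the ratio so that the cap on malware samples is computed exactly rather than with floating-point rounding.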

The dataset details are available at APIGraph-code-database.

Baselines

We tested four state-of-the-art Android malware classifiers as the baselines, as listed below.

Classifiers	Publication	API feature format	Algorithms	Reproduction
MamaDroid	NDSS 2017	Markov Chain of API Calls	Random Forest	source code
DroidEvolver	Euro S&P 2019	API Occurrence	Model Pool	source code
Drebin	NDSS 2014	Selected API Occurrence	SVM	re-implemented
Drebin-DL	ESORICS 2017	Selected API Occurrence	DNN	re-implemented

These four classifiers were published at top venues, and their source code is either publicly available or could be re-implemented by us, sometimes with the help of their authors.
In particular, we thank the authors of DroidEvolver for their help.
We strictly follow their configurations to make sure our reproductions achieve the results stated in their papers.

Experiment 1: Slowing down model aging

The first experiment shows how APIGraph helps slow down model aging.
We use the AUT (Area Under Time) metric, proposed by the TESSERACT paper[2], which is the area under a performance curve over a time period.
As shown in this table, after being enhanced by APIGraph, the four classifiers achieve improvements ranging from 8.7% to 19.6%.
We also draw the detailed figures for models trained on 2012 data and tested on 2013 data.
The red curves show the F1 score of the original four classifiers, while the blue ones show the F1 score of the enhanced classifiers.
These figures clearly show that the performance decline is slowed down after using APIGraph.
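As a rough illustration of the metric, the following Python sketch computes AUT with the trapezoidal rule over equally spaced test periods and normalizes it to [0, 1], which is how we understand the TESSERACT definition. The F1 values are made up for illustration and are not results from our experiments.

```python
def aut(scores):
    """Area Under Time via the trapezoidal rule, normalized to [0, 1].

    `scores` holds one performance value (e.g. F1) per test period,
    and the periods are assumed to be equally spaced.
    """
    if len(scores) < 2:
        raise ValueError("need at least two test periods")
    area = sum((a + b) / 2 for a, b in zip(scores, scores[1:]))
    return area / (len(scores) - 1)


# Made-up monthly F1 scores, for illustration only.
baseline = [0.90, 0.80, 0.70, 0.60]
enhanced = [0.90, 0.85, 0.80, 0.75]

print(round(aut(baseline), 3), round(aut(enhanced), 3))
```

A classifier whose score never drops from 1.0 gets an AUT of 1.0, so a slower decline shows up directly as a larger AUT.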

Experiment 2: Reducing retraining cost

In the second experiment, we show how APIGraph helps reduce retraining cost, measured by two metrics:

retraining frequency
number of new samples to label

The experiment is done as follows:

A classifier is trained on 2012 samples and tested month by month from 2013 to 2018.
When F1 drops below 𝑇_𝑙 (e.g. 0.8), the classifier is retrained with active learning until F1 reaches 𝑇_ℎ (e.g. 0.9).
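The procedure above can be sketched as a small simulation. Everything here is an assumption made for illustration: `ToyModel` is not a real classifier, its linear F1 decay and the batch size are invented, and the slower decay of the second model merely stands in for a classifier that ages more slowly. Only the control flow (retrain when F1 falls below the lower threshold, keep labeling until it reaches the upper one) reflects the experiment.

```python
class ToyModel:
    """Stand-in for a classifier: F1 falls linearly with staleness."""

    def __init__(self, decay):
        self.decay = decay      # F1 lost per month of staleness (made up)
        self.fresh_month = 0    # month up to which the model is current

    def evaluate(self, month):
        return 0.95 - self.decay * (month - self.fresh_month)

    def retrain_step(self, month, batch_size=50):
        # Pretend that labeling one batch brings the model one month
        # closer to the current month; returns the labels spent.
        self.fresh_month = min(month, self.fresh_month + 1)
        return batch_size


def simulate(model, months, t_low=0.8, t_high=0.9):
    """Return (number of retraining events, number of labeled samples)."""
    retrain_events, labeled = 0, 0
    for month in months:
        if model.evaluate(month) >= t_low:
            continue  # still good enough, no retraining this month
        retrain_events += 1
        # Active-learning loop: label batch by batch until F1 recovers.
        while model.evaluate(month) < t_high:
            labeled += model.retrain_step(month)
    return retrain_events, labeled


baseline_result = simulate(ToyModel(decay=0.04), range(1, 13))
slower_aging_result = simulate(ToyModel(decay=0.02), range(1, 13))
print(baseline_result, slower_aging_result)
```

The two returned numbers correspond to the two metrics listed above: how often retraining is triggered and how many samples have to be labeled.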

As shown in this table, APIGraph reduces the retraining frequency by 22% to 76% and the number of samples to label by 33% to 96%.

We also draw the detailed figures here.

Publications

Xiaohan Zhang, Yuan Zhang, Ming Zhong, Daizong Ding, Yinzhi Cao, Yukun Zhang, Mi Zhang, Min Yang. "Enhancing State-of-the-art Classifiers with API Semantics to Detect Evolved Android Malware." 27th ACM Conference on Computer and Communications Security (ACM CCS 2020).

If you find our paper interesting, you can cite it using the following BibTeX entry:

@inproceedings{zhang2020enhancing,
  title={Enhancing State-of-the-art Classifiers with API Semantics to Detect Evolved Android Malware},
  author={Zhang, Xiaohan and Zhang, Yuan and Zhong, Ming and Ding, Daizong and Cao, Yinzhi and Zhang, Yukun and Zhang, Mi and Yang, Min},
  booktitle={Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security},
  pages={757--770},
  year={2020}
}