AnDetect
Content:
- Introduction
- What’s AnDetect?
- Machine Learning
- AnDetect Workflow
- Conclusion
Introduction
Android has become much more popular since its first release in 2007. It revolutionized the way smartphones and mobile devices work, and user reliance on these devices has doubled since 2015. On the other hand, the attack surface keeps growing as more users adopt Android and make it their primary device for storing private data. Researchers and security analysts keep doing their best to stop those who want to steal or disclose user data. To date, almost 1 billion malware samples have been discovered, with 560,000 new ones detected every day, and around 10.5 million Android malware apps were found in 2019. So the question here is: how do we prevent those malicious apps from spreading all around the world?
My focus in this blog post is to give you a high-level overview of what I did in my graduation project. I’ll explain the idea without going too deep into the details of the actual paper, and without being superficial, so that anyone with or even without previous knowledge (but at least with a CS mindset :D) can follow what I’m going to write here.
What’s AnDetect
The first thing that came to my mind while researching was to find a name for the project and give it an identity. In a nutshell, AnDetect is a blend of two words (Android and Detect): a system that can detect Android malware apps. Actually, it’s more than just a system. It’s a next-gen detection system, which means it combines AI or machine learning with information security to prevent or detect malicious activity. It’s called next-gen because older generations of such systems rely on old-fashioned techniques that attackers already know and can bypass.
AnDetect takes a new approach to detecting malware applications on Android. It uses machine learning and Jimple code, together with a large amount of data, to detect malicious apps. Let’s first define Jimple. Unlike stack-based Java bytecode, Jimple is a typed three-address intermediate representation that was originally designed to make Java applications easier to optimize. Later on, researchers used Jimple not only for optimization but also for analysis and for building CFGs (Control Flow Graphs) with a framework called Soot. Before we dive deep into my approach, I’ll show you the most common methods for detecting Android malware.
The first one is permission-based detection. This method analyzes patterns in the permissions an Android application requests in order to decide whether it’s malware or not. Its pros are that it is simple and lightweight, but it works with little information, which is usually not enough. The second method is code-based. It’s all about decompiling the application (to Java/Assembly) and then, instead of analyzing the code manually, passing the decompiled code to a machine learning algorithm that classifies it as malware or not (teaching a machine to analyze code like a human analyst). Its main pro is that it has a lot of data to base the decision on, but that can be too much information, and the decompilation process is sometimes time-consuming.
So, the approach I used is a mix of the two above. It’s a Jimple-based technique, which means the application is decompiled to its Jimple format (instead of Java/Assembly). First, I extract the permissions used in the application along with the component names (Activity, Service, etc.) from the AndroidManifest.xml file. After that, I use the FlowDroid tool to extract the Jimple code of the app to a new directory. Finally, I extract the APIs corresponding to the extracted permissions used in each of the app components. In other words, I look for the function calls related to each extracted permission in each of the components.
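To make the first step more concrete, here is a minimal Python sketch of pulling permissions and component names out of a manifest. It is not the code from the project. It assumes the manifest has already been decoded to plain-text XML (inside an APK it is stored in a binary format), and the function name `parse_manifest` and the sample manifest are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Attributes such as android:name live in the Android XML namespace.
ANDROID_NS = "http://schemas.android.com/apk/res/android"
NAME_ATTR = f"{{{ANDROID_NS}}}name"
COMPONENT_TAGS = ("activity", "service", "receiver", "provider")


def parse_manifest(xml_text):
    """Return (permissions, components) from a decoded, plain-text manifest."""
    root = ET.fromstring(xml_text.strip())
    permissions = [
        el.get(NAME_ATTR)
        for el in root.iter("uses-permission")
        if el.get(NAME_ATTR)
    ]
    components = {tag: [] for tag in COMPONENT_TAGS}
    application = root.find("application")
    if application is not None:
        for tag in COMPONENT_TAGS:
            for el in application.findall(tag):
                if el.get(NAME_ATTR):
                    components[tag].append(el.get(NAME_ATTR))
    return permissions, components


# Hypothetical manifest, used only to exercise the function.
SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.app">
  <uses-permission android:name="android.permission.RECORD_AUDIO"/>
  <uses-permission android:name="android.permission.RECEIVE_SMS"/>
  <application>
    <activity android:name=".MainActivity"/>
    <service android:name=".SyncService"/>
  </application>
</manifest>
"""

permissions, components = parse_manifest(SAMPLE_MANIFEST)
print(permissions)
print(components["activity"], components["service"])
```

The component names collected here are what the later steps use to group the API calls per component.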
The following table shows an example of some requested permissions and the corresponding API calls that can be used in Android.
Permission | API calls |
---|---|
android.permission.ACCESS_COARSE_LOCATION | retrieveLocation(), computeMostGranularCommonLocation(), startReceivingLocationUpdates() |
android.permission.RECEIVE_SMS | startRecording(), addMessage(), clearConversation() |
android.permission.RECORD_AUDIO | getAllThreads(), startListening(), setRecognitionListener() |
After extracting this information from the application, the result is an array of all the permissions and APIs used in each component (i.e. permission::component::API).
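As a rough illustration of that output format, the sketch below builds permission::component::API strings from Jimple text. Everything in it is an assumption made for the example: the regular expression only covers the usual `<Class: returnType method(args)>` form of a Jimple method reference, the permission-to-API map is a tiny hand-written sample, and the Jimple snippets are invented rather than taken from a real app.

```python
import re

# Matches Jimple method references such as
#   <com.example.Foo: void bar(java.lang.String)>
# and captures the method name. Field references have no parentheses,
# so they are not matched.
METHOD_REF_RE = re.compile(r"<[\w.$]+: [\w.$\[\]]+ (\w+)\([^)]*\)>")

# Hypothetical sample map from a permission to API names of interest.
PERMISSION_APIS = {
    "android.permission.ACCESS_COARSE_LOCATION": [
        "retrieveLocation",
        "startReceivingLocationUpdates",
    ],
}


def build_features(permissions, component_jimple, permission_apis):
    """Return sorted permission::component::API strings found in the Jimple."""
    features = []
    for component, jimple in component_jimple.items():
        called = set(METHOD_REF_RE.findall(jimple))
        for permission in permissions:
            for api in permission_apis.get(permission, ()):
                if api in called:
                    features.append(f"{permission}::{component}::{api}")
    return sorted(features)


# Invented Jimple fragments, one string per component.
COMPONENT_JIMPLE = {
    "MainActivity": (
        "virtualinvoke $r1.<com.example.LocationHelper: void retrieveLocation()>();\n"
        'staticinvoke <android.util.Log: int d(java.lang.String,java.lang.String)>("t", "m");\n'
    ),
    "LocationService": (
        "virtualinvoke $r3.<com.example.LocationHelper: "
        "void startReceivingLocationUpdates()>();\n"
    ),
}

features = build_features(
    ["android.permission.ACCESS_COARSE_LOCATION"],
    COMPONENT_JIMPLE,
    PERMISSION_APIS,
)
for feature in features:
    print(feature)
```

Matching on the bare method name is the simplest thing that works for a sketch. A stricter version would also compare the declaring class, so that unrelated methods with the same name are not counted.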
Machine Learning
Machine learning has become one of the most in-demand skills in the tech industry over the last decade. In simple words, it combines data and mathematics with CS to provide an elegant way of predicting the future, within some range of error. The lower the error, the better the model predicts. So the goal here is to develop and train a machine learning model on a large amount of data (APKs) so that it can predict whether a new APK is malware or not. Our model is trained on ~400 APK files, split 50-50 between malware and good-ware. We evaluated 3 different classifiers (machine learning algorithms) in order to select the best model for our detection system. The following table shows the results of the 3 models with their accuracy.
Model | Accuracy | Error |
---|---|---|
SVM | 96.2% | 3.8% |
RF | 98.7% | 1.3% |
MLP | 97.5% | 2.5% |
As we can see, the results show pretty good accuracy with low error rates. The model with the highest accuracy, RF at 98.7%, is the one we selected for our system to detect malicious Android applications. Next, I’m going to show how all these parts fit together to build the AnDetect system and start detecting new applications.
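For readers who want to see what such a comparison can look like in Scikit-Learn, here is a small sketch. It is not the training code from the paper. It assumes SVM, RF and MLP stand for a support vector machine, a random forest and a multi-layer perceptron, it picks its own hyperparameters, and it runs on a made-up toy dataset, so its scores say nothing about the numbers in the table above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC


def compare_classifiers(feature_lists, labels, folds=2):
    """Return the mean cross-validated accuracy of each candidate model."""
    # One binary column per distinct feature string.
    X = MultiLabelBinarizer().fit_transform(feature_lists)
    models = {
        "SVM": SVC(kernel="linear"),
        "RF": RandomForestClassifier(n_estimators=50, random_state=0),
        "MLP": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
    }
    return {
        name: float(cross_val_score(model, X, labels, cv=folds).mean())
        for name, model in models.items()
    }


# Toy data: each app is a list of feature strings (hypothetical values),
# labeled 1 for malware and 0 for good-ware.
SMS = "android.permission.RECEIVE_SMS::Receiver::addMessage"
AUDIO = "android.permission.RECORD_AUDIO::Service::startListening"
LOCATION = "android.permission.ACCESS_COARSE_LOCATION::Activity::retrieveLocation"

apps = [
    [SMS, AUDIO], [SMS, AUDIO, LOCATION], [SMS], [AUDIO, SMS],
    [LOCATION], [], [LOCATION], [LOCATION],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

scores = compare_classifiers(apps, labels)
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

The same shape carries over to a real dataset: swap the toy lists for the extracted permission::component::API arrays and raise the number of folds.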
AnDetect Workflow
AnDetect is built with various tools and technologies: Python, JavaScript, Flask, Scikit-Learn, and React-Native. I use Python as the main programming language for the detection process. As shown in the figure below, the system has three main components:
- End-user UI application.
- Backend with RESTful API.
- The AnDetect core system.
On the user side, React-Native is used to build the UI, which is just a single-page application. On the other side, the backend server is built in Python, using the Flask framework for the RESTful API. The user selects an APK file from the device storage and uploads it to the server. Once the upload finishes, the core system starts the analysis, first running FlowDroid to extract the Jimple code. Next, the feature extraction step builds the feature set and passes it to the classification model. Finally, the server gets the verdict on whether the provided APK is malware or not and sends that result back to the user.
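The upload step could be sketched in Flask roughly like this. The route `/analyze`, the form field name `apk` and the function `analyze_apk` are all hypothetical names chosen for the example, and `analyze_apk` is only a placeholder for the real core system (FlowDroid, feature extraction and the classifier).

```python
import os
import tempfile

from flask import Flask, jsonify, request

app = Flask(__name__)


def analyze_apk(apk_path):
    """Hypothetical stand-in for the core system.

    A real version would extract the Jimple code, build the feature set
    and ask the trained model for a verdict. This placeholder always
    answers "not malware".
    """
    return {"malware": False}


@app.route("/analyze", methods=["POST"])
def analyze():
    upload = request.files.get("apk")
    if upload is None or not upload.filename.lower().endswith(".apk"):
        return jsonify(error="expected an .apk file in form field 'apk'"), 400
    # Keep the upload in a temporary directory that is removed afterwards.
    with tempfile.TemporaryDirectory() as workdir:
        apk_path = os.path.join(workdir, "upload.apk")
        upload.save(apk_path)
        result = analyze_apk(apk_path)
    return jsonify(result)
```

The mobile app would then send the selected file as a multipart form upload and show the JSON verdict to the user.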
Conclusion
In this blog post, I illustrated how AnDetect works, starting from the analysis process that uses Jimple code and extracts the important features from the applications. I also covered how I used machine learning to build the detection process, and finally the architecture of the whole system. Along the way, I avoided very technical terms to make it easier for you to follow what I’ve done so far. I’ve also added the actual paper for those who want to know all the details under the hood that are not mentioned here. This isn’t the end of the story: the method I used may not be efficient enough, and there’s still so much work to do. As we know, the infosec world is like a cat-and-mouse game that never ends. Every time hackers find new ways to bypass security measures, researchers discover new ways to stop them from compromising user data.
Download the full paper PDF.