This project analyzed Android applications with the aim of developing a machine learning model to identify malicious ones. We focused on the permissions each application requests; finding the sensitive permissions was the key to identifying malicious applications.
Malware, or malicious software, is software designed to disrupt mobile or computer applications, collect sensitive information, and perform other malicious operations. These operations include gaining access to private information, covertly exfiltrating it from the system, displaying unwanted advertisements, and spying on users’ activities.
The Google Play Store is fast evolving into the largest app store in the world. With Android commanding over 88% of the smartphone market and more than 60% of daily device activations, it is important to ensure the security of the operating system. Android malware is on the rise, with one report stating that more than 8,000 new samples are detected every day (Lueg, 2017). Malware on mobile devices can monitor and intercept private communication. A newer breed of malware mines cryptocurrency without the user’s explicit permission, draining the battery.
Computer security has always been a serious concern, and as mobile devices become predominant, mobile security is equally noteworthy. Mobile phones are continually being replaced by smartphones, most of which run Android. Users rely on smartphones for two kinds of functions. First, the device is used personally, to store pictures, contacts, emails and schedules. Second, the very same device is used to access an organization’s IT infrastructure, for example under policies such as Bring Your Own Device. At the same time, the number of malware samples attacking these devices is increasing.
Researchers have established two methods of malware detection. The first is static analysis, which studies malicious applications without executing them. Techniques used in static analysis include decompilation, pattern matching and decryption. Because obfuscation and encryption can give unknown applications distinct signatures, these methods are generally unable to identify unknown malware. The second method, dynamic analysis, was proposed for this reason. It monitors an application’s behavior at run time, such as network access, phone calls and message sending. This type of analysis is done in a sandbox environment to prevent the malware from infecting production systems; many such sandboxes are virtual systems that can easily be rolled back to a clean state after the analysis is complete. Owing to its openness and popularity, Android attracts many malicious applications, and an attacker can easily embed their own code into the code of a benign application.
Malware attacking Android applications is therefore growing at an alarming rate, putting at stake the security of the devices and of the assets they give access to. In addition, Android has some distinct characteristics and limitations due to its mobile nature. Detecting malware with conventional methods is therefore cumbersome, which creates the need for a novel and efficient approach to detecting malicious applications. An effective way of ensuring security on these devices is still a matter of research and investigation, and there is a lot of room for improvement.
Android malware has been a hot research topic and has presented challenges for users and companies, which face the constant threat of data loss, privacy violation, intrusive ads, and even bitcoin mining. Research in this field includes deep learning (Yuan, et al., 2014), convolutional neural networks (Zhang, et al., 2018), cloud-based Android malware analysis services (Zhang, et al., 2016) and many more.
From this body of work, we know that detecting malware on Android may require focusing on three aspects:
- Declared permissions found in the AndroidManifest.xml file of an APK.
- Sensitive API calls made by an APK.
- Dynamic behavior.
In our analysis, we focused on requested permissions, which are the most general means by which malware implements its strategy. We used permission-related data, i.e. static analysis, to collate information about a package’s permission requirements, and we also used Google Play Store metadata to obtain information about each application. The dataset relies on VirusTotal to detect the presence of malware in an Android Package (APK) file.
As detailed in the earlier section, we collected data from a GitHub repository which can be found here. The dataset contains information about 17231 APK files, exhaustive Android permissions and other Google Play metadata of applications as of June 2014. The author of the dataset processed some Google Play features, and continuous attributes, using the Naive Bayes Algorithm. The details about how the data were collected and preprocessed can be found here.
The author collected the dataset with a custom iMacros web scraping script that gathered Google Play metadata and classified files using VirusTotal. Additionally, APK files were downloaded from a third-party website, and static analysis was performed to extract all the permissions requested by each application.
Our analysis proceeded in distinct stages, explained below. The idea is to follow the data lifecycle from collection to practical use in generating insights.
- Data Collection: As detailed in the earlier section, we collected data from a GitHub repo which can be found here. The collection methodology is explained in the thesis document here. It primarily involves fetching APK files from trusted third-party websites and market metadata from the Google Play Store.
- Data Cleaning: Data cleaning is performed to remove inaccuracies and make the data consistent in format. Some of the steps we undertook are:
  - Standardization: Deduplicating information encoded in string format. This requires a thorough evaluation of all columns to check for duplicate information and to apply a standardized naming convention.
  - Normalization: Many machine learning algorithms implicitly give higher priority to numeric columns containing larger numbers, which inflates the importance of those columns and distorts the modelling phase. We use standard scaling to center the data around the mean and transform it to have unit standard deviation. This allows our algorithm, during the modelling phase, to better capture feature importance and interactions with the target variable.
  - Missing Values: The goal was to ensure that the dataset contains as much information as possible. We checked every column for missing values and tried to use imputation to complete each APK record that had missing values. For numeric columns we used the mean or the median, while for categorical columns we used the mode.
- Data Modelling: For data modelling we explored multiple machine learning algorithms; we could not use deep learning algorithms because they were not computationally feasible for us. While this was one of the most important phases of the project, it is important to note that a model is only as good as its data. For this reason, we carried out feature engineering to try to capture interaction effects or to project high-dimensional features onto a lower dimension. We used the following models to classify whether an Android package is malicious:
  - Random Forest: The core idea is to train multiple decision trees on random subsets of the data and features and, depending on the target variable, combine their outputs by averaging (regression) or by voting (classification). It is a highly robust and fast algorithm and was our primary choice going into the modelling phase.
  - Extreme Gradient Boosting: Gradient Boosting, or Gradient Boosted Trees, builds a primary decision tree and then a sequence of further trees, each fitted to correct the error of the previous ones. Extreme Gradient Boosting is a more recent variant that takes advantage of parallelization by spreading the work of tree construction across multiple threads. It is slower than Random Forest but, with the right tuning, it may outperform other tree-based algorithms.
  - Support Vector Machines: Support Vector Machines (SVMs) generate highly flexible decision boundaries that adapt well to the data, and they are often strong classifiers. This accuracy comes at the cost of training time, which grows polynomially with the number of samples and makes SVMs impractical for large datasets.
  - Decision Tree: A simple model that generates decision rules for classification and achieves fair accuracy in most situations. We prefer it for its simplicity and expressive power. By carefully pruning the tree and randomizing column selection, we try to reduce overfitting and bias respectively. We gave up some accuracy in favor of understandability, since the accuracy lost in the trade-off was small.
- Model Validation and Tuning: Towards the end of the model development phase, we used the F1 score as the metric for evaluating model performance. We used scores from 10-fold cross-validation to check that the model did not overfit the training dataset and to obtain a fair estimate of performance. Additionally, we used randomized search to tune certain hyperparameters and improve performance further.
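The cleaning, scaling and tuning stages listed above can be chained so that imputation and scaling are re-fitted inside every cross-validation fold. The following is a minimal sketch using scikit-learn, not the project's actual code: the data is synthetic (`make_classification`), the parameter grid is an assumption chosen for illustration, and the median imputer stands in for whichever imputation the authors applied.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the APK feature matrix (assumption: not the real dataset).
X, y = make_classification(n_samples=400, n_features=12, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan  # blank out roughly 5% of cells

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # fill missing numeric values
    ("scale", StandardScaler()),                    # zero mean, unit standard deviation
    ("model", DecisionTreeClassifier(random_state=0)),
])

# Hypothetical search space; the source does not list the tuned parameters.
param_distributions = {
    "model__max_depth": [3, 5, 8, None],
    "model__min_samples_leaf": [1, 5, 10, 20],
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=5,          # number of random parameter combinations to try
    scoring="f1",      # the metric used in this report
    cv=10,             # 10-fold cross-validation
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Scaling has little effect on tree-based models but matters for distance- and margin-based ones such as KNN and SVM, so keeping it in the pipeline lets the same preprocessing be reused across all candidate models.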
- Data Cleaning
- Detecting columns with no information: The exhaustive analysis of Android permissions resulted in columns containing no information, that is, columns holding the same value for every APK file analyzed. These columns took up space without adding any information. The code below identifies them:
# Find columns with no information (fewer than two distinct values)
useless = []
for i in data.columns:
    l = len(data[i].unique())
    if l < 2:
        useless.append(i)
        print("Column Name: ", i, "Uniques: ", l)
Additionally, columns that contain nearly unique values do not let us draw generalizations. Detecting and removing such columns was therefore important to keep the data manageable and the insights generalizable.
#Removing columns with the following names due to their uniqueness and low information quality
tmp = tmp.drop(columns = ['APK_NAME',' DETECTION_DATE','DEVELOPER_NAME_BINARY'])
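The columns dropped above were chosen by the authors; the source does not say how they were identified. One possible way to flag such columns, sketched here under that assumption, is to compare the number of distinct values in each column with the number of rows. The helper `near_unique_columns` and the 0.9 threshold are hypothetical.

```python
import pandas as pd

def near_unique_columns(frame, threshold=0.9):
    """Return columns whose share of distinct values exceeds `threshold`."""
    n_rows = len(frame)
    if n_rows == 0:
        return []
    return [c for c in frame.columns if frame[c].nunique() / n_rows > threshold]

# Toy data for illustration only.
toy = pd.DataFrame({
    "APK_NAME": ["a.apk", "b.apk", "c.apk", "d.apk"],
    "CATEGORY": ["Tools", "Tools", "Games", "Games"],
    "MALWARE": [0, 1, 0, 1],
})
flagged = near_unique_columns(toy)
print(flagged)
```

A check like this is best limited to identifier-like or text columns, since continuous numeric features are also nearly unique yet can still be informative.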
We also dropped some columns that had negative values, as well as columns containing values the machine learning algorithms cannot process, such as dates and non-Latin characters.
- Removing NULL values
Certain columns contained missing values. We analyzed each column for missing values and, when they were detected, checked whether they could be imputed. Records that could not be imputed were eliminated using:
tmp = tmp.dropna(how='any',inplace=False)
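The imputation that precedes this drop is described only in prose (mean or median for numeric columns, mode for categorical ones). A minimal pandas sketch of that idea follows; the toy frame and the `SIZE_MB` column are hypothetical, and the median is used as one of the two numeric options mentioned.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "SIZE_MB": [4.0, np.nan, 10.0, 6.0],
    "CATEGORY": ["Tools", "Games", None, "Tools"],
})

numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.columns.difference(numeric_cols)

imputed = toy.copy()
for col in numeric_cols:
    # Median is robust to outliers; the mean could be used instead.
    imputed[col] = imputed[col].fillna(imputed[col].median())
for col in categorical_cols:
    # mode() returns all most-frequent values; take the first one.
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

# Rows that still contain missing values after imputation are dropped.
remaining = imputed.dropna(how="any")
print(remaining)
```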
- Feature Engineering: It is important to incorporate real-world logic in the data to allow better generalization. To do so, we began by incorporating the length of time the APK had been available on the Google Play Store.
Our dataset contained information about the APK publish date on Google Play Store and also the date of malware detection. We used the information contained in these columns to compute a variable capturing the ‘age’ of the APK on the Google Play Store.
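As a rough sketch of how such an age variable could be derived, the snippet below subtracts the publish date from the detection date. `DETECTION_DATE` and `DAYS_FROM_PUBLISH` appear elsewhere in this report; `PUBLISH_DATE` is a hypothetical column name, and the date format of the real dataset is not known.

```python
import pandas as pd

# Toy data for illustration only; PUBLISH_DATE is an assumed column name.
toy = pd.DataFrame({
    "PUBLISH_DATE": ["2014-01-15", "2013-11-01"],
    "DETECTION_DATE": ["2014-06-15", "2014-06-01"],
})

published = pd.to_datetime(toy["PUBLISH_DATE"])
detected = pd.to_datetime(toy["DETECTION_DATE"])

# Age of the APK on the store, in whole days, at the time of detection.
toy["DAYS_FROM_PUBLISH"] = (detected - published).dt.days
print(toy)
```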
- One-Hot Encoding: The dataset contained a column indicating the application’s category on the Google Play Store. Categorical data cannot be handled directly by most machine learning algorithms; hence we converted it to numeric values. We created dummy variables, each indicating whether the application falls in a particular category.
dummies=pd.get_dummies(tmp['CATEGORY'])
cleaned = pd.concat([tmp,dummies],axis=1)
cleaned=cleaned.drop(columns = ['CATEGORY'])
cleaned.columns = cleaned.columns.str.strip()
- Removing highly correlated features
To make sure our data did not contain any redundant information, we removed highly correlated features.
import numpy as np

corr_matrix = cleaned.corr().abs()
# Select the upper triangle of the correlation matrix (k=1 excludes the diagonal)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find feature columns with correlation greater than 0.75
to_drop = [column for column in upper.columns if any(upper[column] > 0.75)]
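To see which column of a correlated pair this procedure flags, here is the same technique applied to a small toy frame, followed by the drop itself. The data and column names are invented for illustration; the 0.75 threshold matches the listing above.

```python
import numpy as np
import pandas as pd

# Toy data: "b" is almost a scaled copy of "a", while "c" is unrelated.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = toy.corr().abs()
# Keep only entries above the diagonal so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.75).any()]

reduced = toy.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

Because only the upper triangle is inspected, the later column of each highly correlated pair is the one removed, so column order influences which feature survives.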
- Model development: We used machine learning models to classify Android applications. We partitioned our dataset into training and test sets.
from sklearn.model_selection import train_test_split
feature_train, feature_test, label_train, label_test = train_test_split(final.drop('MALWARE',axis=1),final['MALWARE'],test_size=0.3,random_state=3462)
We used multiple classification algorithms to find the most accurate one: RandomForest, SVC, Decision Tree, XGB and KNN.
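A comparison of these candidates could be set up along the following lines. This is a sketch on synthetic data, not the project's training code: hyperparameters are defaults or arbitrary choices, and scikit-learn's `GradientBoostingClassifier` is used as a stand-in for XGB so that the example depends on a single library.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned APK features (assumption: not the real data).
X, y = make_classification(n_samples=500, n_features=15, random_state=3462)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=3462
)

models = {
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVC": SVC(),
    "DecisionTree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),  # stand-in for XGB
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Fit every model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test))

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: F1 = {score:.3f}")
```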
- Model performance: We compared model performance using a combined metric called the F1 score, the harmonic mean of precision and recall. Based on the best F1 score, we selected the Decision Tree. It allowed us to understand the decision process without sacrificing accuracy.
The confusion matrix generated by our decision tree model
- Distribution of malicious and benign files
Distribution of malicious and benign files in the dataset
The dataset contains a balanced sample of malicious and benign APKs, achieved through random sampling.
- Android applications by categories
Distribution of APKs by category in the dataset
Android applications have been representatively sourced from all Google Play Store categories.
- Applications by Android versions
Distribution of APKs, benign and malicious, by Android API version
We see a dominant majority of applications targeting API version 10, i.e. Android Gingerbread (2.3.3). We also see the Android API version codenamed KitKat in the dataset. It would be safe to assume that the data collection occurred between June 2014 and October 2014.
- Frequently used permissions
The most frequently used Android permissions in the dataset
A dominant majority of applications have invoked the internal handler, which enables them to send messages to threads and runnable objects using the Android system interface. This is followed by applications seeking to access Norton Security and to read network buffers.
- Decision Tree Representation
The first split on a permission ‘ACCESS_NORTON_SECURITY’
The following splits in the tree are based on the application’s size and whether it communicates with the internet using Android OS network APIs
The entire tree can be found here.
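For readers who want to inspect a tree's splits without a rendered figure, scikit-learn can print the learned rules as text and report per-feature importance scores. The sketch below uses synthetic data and hypothetical feature labels; it does not reproduce the tree described in this report.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with hypothetical feature labels (not the report's dataset).
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["PERMISSION_A", "PERMISSION_B", "APK_SIZE", "DAYS_FROM_PUBLISH"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Text rendering of the splits, from the root down to the leaves.
rules = export_text(tree, feature_names=feature_names)
print(rules)

# Importance score per feature; the scores of a fitted tree sum to 1.
importances = dict(zip(feature_names, tree.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```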
- Confusion Matrix for Decision Tree
Confusion matrix generated for the Decision Tree model
The model can be evaluated by metrics such as precision and recall. Furthermore, we use a metric called the F1 score to combine precision and recall into a single score. Our model’s precision is 0.5993, the recall is 0.7517 and the F1 score is 0.6647. Recall can be interpreted as the ability of our model to detect malicious applications, i.e. the number of malicious applications detected out of all malicious applications.
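The relationship between these three metrics can be made concrete with a few lines of arithmetic. The confusion-matrix counts below are invented for illustration, since the actual counts are not given in the text. Plugging the reported precision and recall into the harmonic-mean formula gives roughly 0.667, close to but not identical with the reported 0.6647; the source does not say how the three figures were aggregated (for example, whether they were averaged across folds), so the gap is left unexplained here.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of flagged apps that are truly malicious
    recall = tp / (tp + fn)      # share of malicious apps that were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 3 true positives, 1 false positive, 2 false negatives.
p, r, f1 = precision_recall_f1(tp=3, fp=1, fn=2)
print(p, r, round(f1, 4))

# Harmonic mean of the precision and recall reported in this section.
reported = f1_from(0.5993, 0.7517)
print(round(reported, 4))
```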
- Most Important Features
The importance scores generated by the Decision Tree model for every variable in the dataset
It is evident from the decision tree that ‘ACCESS_NORTON_SECURITY’ is important for the model to begin its predictions. This is followed by ‘DAYS_FROM_PUBLISH’, which is the age of the Android application in the Google Play Store.
Implications and suggestions
Our project leads to the following set of implications:
- Not all permissions are necessary to understand the behavior of an Android application.
- Google Play Store’s application metadata may be a good source of information, but effort must be taken to sift relevant information and preprocess data in a consistent format.
- Feature engineering is an important step in enriching the quality of the data. It enables us to impart real-world logic which in turn may allow for better generalizations by the model.
- Static analysis of Android package files may not be entirely accurate, as the current state-of-the-art static analysis methods classify correctly about 90% of the time. It is important to include dynamic analysis, as it may enable a better representation of an application’s true behavior.
- Static analysis, though imperfect, enables classifiers to detect many malicious applications across different application categories. Attributes are stable and reproducible.
- Certain permissions are more likely to be called by malicious applications than by benign applications.
- Feature engineering can be extensively used to generate synthetic features that better capture the application behavior.
- Dynamic analysis can be combined with processing of application review data on the Google Play Store to get a better understanding of the user experience and the application’s behavior. This may help improve the classification accuracy substantially.
We have a few suggestions which may assist others in future research in this domain:
- Feature engineering and feature selection are a must to ensure that the data is representative of the application; they also allow us to train the model in a computationally efficient way.
- Collating data from multiple sources may result in better data quality. By using datasets from market intelligence firms, we may get deeper insights about an application and about the journey of an application through its lifecycle.
- Models trained on such datasets must be constantly updated to account for the evolving behavior of malware and to detect newer strains.
This project was aimed at raising awareness about Android malware and the staggering rate at which it is distributed. The lack of security patches and Android updates for many devices has left them vulnerable to exploits; while the latest updates may be unavailable, researchers and companies can develop tools to protect devices from malware.
The ever-evolving security landscape enables malicious applications to enter the Android ecosystem through the Google Play Store. It is important to protect devices against such malicious applications and we propose a machine learning solution to this problem.
The features we used in this research are numerous, but it would be unreasonable to say they are sufficient. Attackers are improving their strategies as well. Retraining the model and incorporating new features is the best way to keep detecting malicious applications.
It is important to observe application behavior more closely in order to understand its true intentions. Using state-of-the-art algorithms and capturing more data points may go a long way in protecting our devices.
Future Work
The dataset that we used contains static features of the Android package files. While this approach alone is not adequate, we can improve the data quality by using dynamic file analyzers to better understand malware behavior by capturing finer data points. Additionally, researchers can use deep learning on a much larger dataset scraped from multiple websites and use the latest malware repositories to develop highly accurate classifiers.
This will greatly assist in reducing the spread of malware and in moving current detection methods from a static, on-device, rule-based approach to a more dynamic, cloud-based one. Not only will it enable companies to continuously update the engine, it will also allow them to support a much wider variety of low-powered devices that do not possess enough computing power to scan files on their own.
Lueg, C., 2017. 8,400 new Android malware samples every day. [Online] Available at: https://www.gdatasoftware.com/blog/2017/04/29712-8-400-new-android-malware-samples-every-day [Accessed November 2018].
Yuan, Z., Lu, Y., Wang, Z. & Xue, Y., 2014. Droid-Sec: deep learning in android malware detection. [Online] Available at: https://dl.acm.org/citation.cfm?id=2631434 [Accessed November 2018].
Zhang, H. et al., 2016. ScanMe mobile: a cloud-based Android malware analysis service. ACM SIGAPP Applied Computing Review, 16(1), pp. 36-49.
Zhang, Y., Yang, Y. & Wang, X., 2018. A Novel Android Malware Detection Approach Based on Convolutional Neural Network. ACM New York, NY, USA, pp. 144-149.