This repository contains the winning solution (2nd place) of the Microsoft Malware Prediction Challenge on Kaggle. For details on our approach, see the overview of our solution.
Link to the competition homepage: https://www.kaggle.com/c/microsoft-malware-prediction
- Stephan Michaels
- Florian Imorde
The competition "Microsoft Malware Prediction" was based on the question of whether or not a computer is infected by malware. Based on different properties and features provided by Microsoft's Windows Defender, an algorithm had to be created that predicts the probability of such an infection.
To solve the malware prediction problem, two models were trained:
- Model 1 (later called M1): The data set was cleaned and string values were encoded. Afterwards, a LightGBM model was trained.
- Model 2 (later called M2): The preprocessed data from model 1 was extended with new features. Next, important features were selected and a LightGBM model was trained.
Finally, the predictions of both models were averaged.
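The averaging step can be sketched as follows. This is a minimal illustration, not the code from the notebooks: it assumes both models wrote submission files in the competition format (columns `MachineIdentifier` and `HasDetections`), that "average" means a simple unweighted mean, and the function names and file paths are hypothetical. It uses only the standard library so that it runs without the pinned package versions.

```python
import csv


def read_submission(path):
    """Read a submission CSV into a {MachineIdentifier: probability} dict."""
    with open(path, newline="") as handle:
        return {row["MachineIdentifier"]: float(row["HasDetections"])
                for row in csv.DictReader(handle)}


def average_submissions(path_m1, path_m2, path_out):
    """Write the unweighted mean of two submissions to path_out."""
    pred_m1 = read_submission(path_m1)
    pred_m2 = read_submission(path_m2)
    # Both files must score exactly the same machines.
    if pred_m1.keys() != pred_m2.keys():
        raise ValueError("Both submissions must cover the same machines.")
    with open(path_out, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["MachineIdentifier", "HasDetections"])
        for machine_id, p1 in pred_m1.items():
            writer.writerow([machine_id, (p1 + pred_m2[machine_id]) / 2])
```

A call such as `average_submissions("submissions/m1.csv", "submissions/m2.csv", "submissions/final.csv")` would then produce the blended file; the file names here are placeholders.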
- code
  - 1_Data_Cleaning_train_set.ipynb
    Cleaning the train set.
  - 2_Data_Cleaning_test_set.ipynb
    Cleaning the test set.
  - 3_Data_Encoding_M1.ipynb
    Encoding the train and test data for model 1.
  - 4_Submission_M1.ipynb
    Building model 1.
  - 5_Feature_Engineering_M2.ipynb
    Creating new features for model 2.
  - 6_Data_Encoding_M2.ipynb
    Encoding the train and test data for model 2.
  - 7_Submission_M2.ipynb
    Building model 2.
  - 8_Submission_Solution.ipynb
    Building the final solution by averaging the solutions from model 1 and model 2.
  - Optional_Feature_Selection_M2.ipynb
    Selecting the most important features for building model 2.
  - Optional_Submission_Simple_Model
    Building a simplified model.
- data
  - Data_Description.xlsx
    Feature information: relevant for the future, type, description.
  - encoding_dictionary.p
    Dictionary containing the encoder for relabeling the values of features.
- feature_importance
  - Featureimportance_M1.csv
    List of all features used in model 1 with corresponding importance.
  - Featureimportance_Feature_Selection_M2.csv
    List of all features in model 2 after feature engineering with corresponding importance.
  - Featureimportance_M2.csv
    List of all features used in model 2 with corresponding importance.
  - FeatureImportance_Simple.csv
    List of all features used in the simplified model.
- models
  - Placeholder
    Models are saved here.
- submissions
  - Placeholder
    Final submissions are stored here.
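The idea behind `encoding_dictionary.p` (a pickled dictionary of encoders that relabel feature values) might look roughly like the sketch below. The actual structure of the file is not documented here, so the layout (one value-to-integer mapping per feature), the helper names, the `-1` code for unseen values, and the sample values are all assumptions for illustration.

```python
import pickle


def build_encoder(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    encoder = {}
    for value in values:
        if value not in encoder:
            encoder[value] = len(encoder)
    return encoder


def encode(values, encoder, unknown=-1):
    """Relabel values with their codes; unseen values get the `unknown` code."""
    return [encoder.get(value, unknown) for value in values]


# Hypothetical column values, for illustration only.
train_column = ["win8defender", "mse", "win8defender"]
test_column = ["mse", "fep"]

# One encoder per feature, keyed by feature name (assumed layout).
encoding_dictionary = {"ProductName": build_encoder(train_column)}

# Round-trip through pickle, the format suggested by the ".p" extension.
restored = pickle.loads(pickle.dumps(encoding_dictionary))
encoded_test = encode(test_column, restored["ProductName"])
```

Fitting the encoder on the train set and reusing it for the test set keeps the integer codes consistent between both; how the notebooks treat values that appear only in the test set is not stated here.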
The original train and test data can be downloaded from the competition homepage.
Link to the data: https://www.kaggle.com/c/microsoft-malware-prediction/data
The datasets have to be stored in the data folder.
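Before running the notebooks, a quick check that the downloads are in place can save a failed run. The sketch below assumes the files are named `train.csv` and `test.csv` and that the working directory is the repository root; both the file names and the helper name are assumptions, so adjust them to match your download.

```python
from pathlib import Path


def missing_data_files(data_dir="data", expected=("train.csv", "test.csv")):
    """Return the expected file names that are not present in data_dir."""
    folder = Path(data_dir)
    return [name for name in expected if not (folder / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing from the data folder:", ", ".join(missing))
    else:
        print("All expected data files were found.")
```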
Notebook (laptop) with:
- Intel(R) Core(TM) i7-8850H
- 16GB RAM
- Windows 10 Pro, 64 Bit (Version: 1809)
- Anaconda 1.9.6
- Python 3.7.1
- Jupyter Notebook 5.7.4
The following libraries are required:
- numpy (Version 1.15.4)
- pandas (Version 0.23.4)
- dask (Version 1.0.0)
- scikit-learn (Version 0.20.1)
- tqdm (Version 4.28.1)
- lightgbm (Version 2.2.1)
- pickle (Version 4.0; part of the Python standard library, no separate installation needed)
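To compare an existing environment against the versions listed above, a small report like the following may help. It is only a sketch: the helper names are hypothetical, and it relies on each package exposing a `__version__` attribute, which is a convention rather than a guarantee.

```python
import importlib

# Import names for the libraries listed above; note that scikit-learn
# is imported as "sklearn".
REQUIRED = {
    "numpy": "1.15.4",
    "pandas": "0.23.4",
    "dask": "1.0.0",
    "sklearn": "0.20.1",
    "tqdm": "4.28.1",
    "lightgbm": "2.2.1",
}


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except (ImportError, OSError):
        # OSError covers packages whose native libraries fail to load.
        return None
    return getattr(module, "__version__", "unknown")


def version_report(required=REQUIRED):
    """Return {module: (listed_version, installed_version)} for each library."""
    return {name: (listed, installed_version(name))
            for name, listed in required.items()}
```

Calling `version_report()` returns one `(listed, installed)` pair per library, with `None` marking a library that is not installed. Newer versions may work as well, but only the listed ones are stated in this README.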
Our code is released under the MIT license.