This repository contains the winning solution (2nd place) of the Microsoft Malware Prediction Challenge on Kaggle.

imor-de/microsoft_malware_prediction_kaggle_2nd

Microsoft Malware Prediction on Kaggle

This repository contains the winning solution (2nd place) of the Microsoft Malware Prediction Challenge on Kaggle. For details on our approach, see the overview of our solution.

Link to the competition homepage: https://www.kaggle.com/c/microsoft-malware-prediction

Team:

  • Stephan Michaels
  • Florian Imorde

Competition Description:

The competition "Microsoft Malware Prediction" was based on the question of whether or not a computer is infected by malware. Based on different properties and features provided by Microsoft's Windows Defender, an algorithm had to be created that predicts the probability of such an infection.

Solution Description:

To solve the malware prediction problem, two models were trained:

  • Model 1 (later called M1): The data set was cleaned and string values were encoded. Afterwards, a LightGBM model was trained.
  • Model 2 (later called M2): The preprocessed data from model 1 was extended with new features. Next, important features were selected and a LightGBM model was trained.
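
The feature selection step for M2 is only described in outline here. As a rough illustration, the sketch below keeps the top share of features ranked by an importance score. The function name `select_top_features`, the `keep_fraction` cut-off, and the feature names are hypothetical; the criterion actually used in the notebooks may differ.

```python
import math


def select_top_features(importances, keep_fraction=0.5):
    """Return the names of the highest-importance features.

    importances: dict mapping feature name -> importance score.
    keep_fraction: hypothetical cut-off, the share of features to keep.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Rank features from most to least important.
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    # Keep at least one feature, rounding the share up.
    n_keep = max(1, math.ceil(len(ranked) * keep_fraction))
    return [name for name, _ in ranked[:n_keep]]


# Illustrative scores only, not taken from the feature importance files.
example_importances = {"feature_a": 10, "feature_b": 3, "feature_c": 7, "feature_d": 0}
selected = select_top_features(example_importances, keep_fraction=0.5)
```

The same function could be fed with the rows of one of the feature importance CSV files, assuming they hold one name and one score per feature.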

Finally, the predictions of both models were averaged.
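
A minimal sketch of this blending step, assuming an unweighted mean and predictions keyed by a machine identifier. The function name and the example identifiers are illustrative and not taken from the notebooks.

```python
def average_predictions(preds_m1, preds_m2):
    """Blend two models by taking the unweighted mean per machine.

    Both arguments map a machine identifier to a predicted probability.
    """
    if preds_m1.keys() != preds_m2.keys():
        raise ValueError("both models must score the same set of machines")
    return {key: (preds_m1[key] + preds_m2[key]) / 2 for key in preds_m1}


# Made-up identifiers and probabilities, for illustration only.
m1 = {"machine_1": 0.25, "machine_2": 0.75}
m2 = {"machine_1": 0.75, "machine_2": 0.25}
blended = average_predictions(m1, m2)
```

Checking that both models scored the same machines guards against silently averaging misaligned rows.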

Archive Contents

  • code
    • 1_Data_Cleaning_train_set.ipynb
      Cleaning the train set.
    • 2_Data_Cleaning_test_set.ipynb
      Cleaning the test set.
    • 3_Data_Encoding_M1.ipynb
      Encoding the train and test data for model 1.
    • 4_Submission_M1.ipynb
      Building model 1.
    • 5_Feature_Engineering_M2.ipynb
      Creating new features for model 2.
    • 6_Data_Encoding_M2.ipynb
      Encoding the train and test data for model 2.
    • 7_Submission_M2.ipynb
      Building model 2.
    • 8_Submission_Solution.ipynb
      Building the final solution by averaging the solutions from model 1 and model 2.
    • Optional_Feature_Selection_M2.ipynb
      Selecting the most important features for building model 2.
    • Optional_Submission_Simple_Model
      Building a simplified model.
  • data
    • Data_Description.xlsx
      Feature information: relevant for the future, type, description.
    • encoding_dictionary.p
      Dictionary containing the encoders for relabeling feature values.
  • feature_importance
    • Featureimportance_M1.csv
      List of all features used in model 1 with corresponding importance.
    • Featureimportance_Feature_Selection_M2.csv
      List of all features in model 2 after Feature Engineering with corresponding importance.
    • Featureimportance_M2.csv
      List of all features used in model 2 with corresponding importance.
    • FeatureImportance_Simple.csv
      List of all features used in the simplified model.
  • models
    • Placeholder
      Models are saved here.
  • submissions
    • Placeholder
      Final submissions are stored here.
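
The archive lists encoding_dictionary.p as a dictionary of encoders for relabeling feature values. The sketch below shows one way such a dictionary could be built and round-tripped through pickle. The structure (one mapping per column), the column name `os_version`, the values, and the `-1` code for unseen values are all assumptions made for illustration, not the layout of the actual file.

```python
import pickle


def fit_encoder(values):
    """Map each distinct string to an integer code (sorted for determinism)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def encode(values, encoder, unknown=-1):
    """Relabel values; anything not seen while fitting gets `unknown`."""
    return [encoder.get(value, unknown) for value in values]


# Hypothetical column name and values, for illustration only.
train_column = ["win10", "win8", "win10", "win7"]
encoding_dictionary = {"os_version": fit_encoder(train_column)}

# Round-trip through pickle, as storing the dictionary in a .p file would.
restored = pickle.loads(pickle.dumps(encoding_dictionary))

# Apply the train-set encoder to test values; "win11" was never seen.
encoded_test = encode(["win8", "win11"], restored["os_version"])
```

Fitting the encoder once and reusing it keeps the integer codes consistent between the train and test sets.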

Data

The original train and test data can be downloaded from the competition homepage.

Link to the data: https://www.kaggle.com/c/microsoft-malware-prediction/data

The datasets have to be stored in the data folder.
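
Before running the notebooks, it can help to confirm that the downloaded files are where they are expected. The sketch below is a hypothetical helper; the file names `train.csv` and `test.csv` are assumptions and should be adjusted to match the actual download.

```python
from pathlib import Path

# Assumed file names; adjust to whatever the competition download contains.
EXPECTED_FILES = ("train.csv", "test.csv")


def missing_data_files(data_dir="data", expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    folder = Path(data_dir)
    return [name for name in expected if not (folder / name).is_file()]
```

An empty list from `missing_data_files()` means every expected file was found in the data folder.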

Hardware:

Notebook with:

  • Intel(R) Core(TM) i7-8850H
  • 16GB RAM

Software

  • Windows 10 Pro, 64 Bit (Version: 1809)
  • Anaconda 1.9.6
  • Python 3.7.1
  • Jupyter Notebook 5.7.4

Libraries

The following libraries are required:

  • numpy (Version 1.15.4)
  • pandas (Version 0.23.4)
  • dask (Version 1.0.0)
  • scikit-learn (Version 0.20.1)
  • tqdm (Version 4.28.1)
  • lightgbm (Version 2.2.1)
  • pickle (part of the Python standard library, format version 4.0)
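
To compare an environment against the versions listed above, a small check along the following lines could be used. This is a sketch: the helper names are hypothetical, scikit-learn is looked up under its import name `sklearn`, and pickle is left out because it ships with Python.

```python
import importlib

# Versions copied from the list above, keyed by import name.
EXPECTED_VERSIONS = {
    "numpy": "1.15.4",
    "pandas": "0.23.4",
    "dask": "1.0.0",
    "sklearn": "0.20.1",
    "tqdm": "4.28.1",
    "lightgbm": "2.2.1",
}


def installed_version(module_name):
    """Return the module's __version__, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def compare_versions(expected, lookup=installed_version):
    """Return {name: (wanted, found, status)} for each expected library."""
    report = {}
    for name, wanted in expected.items():
        found = lookup(name)
        if found is None:
            status = "missing"
        elif found == wanted:
            status = "ok"
        else:
            status = "differs"
        report[name] = (wanted, found, status)
    return report
```

Calling `compare_versions(EXPECTED_VERSIONS)` returns one entry per library. A "differs" status does not necessarily mean the notebooks will fail, only that the environment is not the one listed here.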

License

Our code is released under the MIT license.
