Microsoft Malware Prediction

W207 Final Project

  • Kevin Hartman
  • Gunnar Mein
  • Andrew Morris

The inspiration for this project came from a recent Kaggle competition, Microsoft Malware Prediction.

The motivation and prompt:

The malware industry continues to be a well-organized, well-funded market dedicated to evading traditional security measures. Once a computer is infected by malware, criminals can hurt consumers and enterprises in many ways. With more than one billion enterprise and consumer customers, Microsoft takes this problem very seriously and is deeply invested in improving security.

As one part of their overall strategy for doing so, Microsoft is challenging the data science community to develop techniques to predict if a machine will soon be hit with malware. As with their previous Malware Challenge (2015), Microsoft is providing Kagglers with an unprecedented malware dataset to encourage open-source progress on effective techniques for predicting malware occurrences.

Can you help protect more than one billion machines from damage BEFORE it happens?

Contents

  1. Definitions and Setup
  2. Loading the Data
  3. Exploratory Data Analysis
  4. Data Wrangling and Cleaning
  5. Encoding and Transformations
  6. Feature Selection
  7. Model Fitting and Evaluation
  8. Neural Network
  9. Review and Conclusion

We used the following classifiers that were covered in the course:

  • k Nearest Neighbors (week 2)
  • Decision Trees, as well as Random Forests, Extra Trees, AdaBoost, and Gradient Boosting (week 4)
  • Logistic Regression (week 5)
  • Neural Networks (week 7)
  • Support Vector Machines (week 8)*
  • PCA and Gaussian Mixture Models (weeks 9 and 10)

We also investigated scikit-learn's experimental HistGradientBoosting estimators and took a deeper dive into PyTorch, two topics that were not covered in the course.

We did not pursue Naive Bayes (week 3) or Stochastic Gradient Descent (week 6) due to time constraints, and because our EDA did not provide enough support to investigate these approaches.

*Of note, we did some limited modeling with LinearSVC, but the model essentially predicted all positives or all negatives. This yielded low accuracy, so we quickly discarded it.
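One quick way to spot the single-class behavior described in the note is to count the distinct predicted labels and compare accuracy with the majority-class baseline. The sketch below uses made-up labels and a hypothetical helper, `summarize_predictions`; it illustrates the idea and is not the exact check used in this notebook.

```python
from collections import Counter

def summarize_predictions(y_true, y_pred):
    """Report whether a classifier collapsed to one class (hypothetical helper)."""
    predicted_counts = Counter(y_pred)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # Accuracy obtained by always predicting the most common true label
    baseline = Counter(y_true).most_common(1)[0][1] / len(y_true)
    return {
        "predicted_counts": dict(predicted_counts),
        "single_class": len(predicted_counts) == 1,
        "accuracy": accuracy,
        "majority_baseline": baseline,
    }

# Made-up labels: the model predicts 1 for every machine
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [1] * len(y_true)
report = summarize_predictions(y_true, y_pred)
print(report)
```

A model whose accuracy does not exceed the majority baseline and whose `single_class` flag is set has learned nothing useful from the features.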

Setup

Below we import the required libraries and set the global flags that control how the notebook is run.

In [8]:
# Import required libraries for subsequent operations

import re
import time
import gc
import torch
import math


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch.nn as nn

from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.mixture import GaussianMixture
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble._hist_gradient_boosting.utils import (get_equivalent_estimator)
from sklearn.metrics import zero_one_loss
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix

from torch.utils import data

%matplotlib inline

import warnings
warnings.filterwarnings('ignore') #hide warnings that arise from missing glyphs, deprecations, etc.

# set required global flags  

load_from_encoded_files = True # skip EDA, cleaning and encoding, and load from files
use_mini_files = True # load from mini sets when loading from encoded files
do_EDA = False    # EDA portion can be skipped if working further downstream
debug = False      # use small files to check basic functionality
save_data = False # not saving encoded files can save a lot of time
generate_ids = False # if True, saves test_ids.csv
run_all_models = True # If True, run all models

# here is where we decide what to load, and trigger the process

if load_from_encoded_files:
    if use_mini_files:
        filename_encoded_train = "data/mini_train_encoded.csv"
        filename_encoded_dev = "data/mini_dev_encoded.csv"
        filename_encoded_validate = "data/mini_validate_encoded.csv"
    else:
        filename_encoded_train = "data/train_encoded.csv"
        filename_encoded_dev = "data/dev_encoded.csv"
        filename_encoded_validate = "data/validate_encoded.csv"
else:
    if debug:
        filename_train = "data/debug/mini_initial_train.csv"
        filename_test = "data/debug/mini_initial_test.csv"
    else:
        filename_train = "data/train.csv"
        filename_test = "data/test.csv"
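The branching above can be summarized as one function that maps the three flags to file paths. This is only a sketch: `resolve_data_files` is a hypothetical helper that does not appear in the notebook, and its paths simply mirror the ones hard-coded above.

```python
def resolve_data_files(load_from_encoded_files, use_mini_files, debug):
    """Return the file paths implied by the flags (hypothetical helper)."""
    if load_from_encoded_files:
        # Encoded files come in full and "mini" variants
        prefix = "mini_" if use_mini_files else ""
        return {
            split: f"data/{prefix}{split}_encoded.csv"
            for split in ("train", "dev", "validate")
        }
    if debug:
        # Small raw files for checking basic functionality
        return {
            "train": "data/debug/mini_initial_train.csv",
            "test": "data/debug/mini_initial_test.csv",
        }
    return {"train": "data/train.csv", "test": "data/test.csv"}

files = resolve_data_files(load_from_encoded_files=True, use_mini_files=True, debug=False)
print(files["train"])
```

Note that `use_mini_files` only matters when encoded files are loaded, and `debug` only matters when they are not.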

Loading

Below are utility functions for loading the raw files and the encoded versions of those files. The global flags set above determine whether the small debug versions (~60 MB) or the real 4 GB files are loaded.

In [2]:
# load with pre-informed data types for faster loading

def load_from_raw_data(filename):
   
    dtypes = {
        'MachineIdentifier':                                    'str',
        'ProductName':                                          'str',
        'EngineVersion':                                        'str',
        'AppVersion':                                           'str',
        'AvSigVersion':                                         'str',
        'IsBeta':                                               'int8',
        'RtpStateBitfield':                                     'float64',
        'IsSxsPassiveMode':                                     'int8',
        'DefaultBrowsersIdentifier':                            'float32',
        'AVProductStatesIdentifier':                            'float32',
        'AVProductsInstalled':                                  'float16',
        'AVProductsEnabled':                                    'float16',
        'HasTpm':                                               'int8',
        'CountryIdentifier':                                    'int16',
        'CityIdentifier':                                       'float32',
        'OrganizationIdentifier':                               'float16',
        'GeoNameIdentifier':                                    'float16',
        'LocaleEnglishNameIdentifier':                          'int16',
        'Platform':                                             'category',
        'Processor':                                            'category',
        'OsVer':                                                'category',
        'OsBuild':                                              'int16',
        'OsSuite':                                              'int16',
        'OsPlatformSubRelease':                                 'category',
        'OsBuildLab':                                           'category',
        'SkuEdition':                                           'category',
        'IsProtected':                                          'float16',
        'AutoSampleOptIn':                                      'int8',
        'PuaMode':                                              'category',
        'SMode':                                                'float16',
        'IeVerIdentifier':                                      'float16',
        'SmartScreen':                                          'str',
        'Firewall':                                             'float16',
        'UacLuaenable':                                         'float64', 
        'Census_MDC2FormFactor':                                'category',
        'Census_DeviceFamily':                                  'category',
        'Census_OEMNameIdentifier':                             'float32', 
        'Census_OEMModelIdentifier':                            'float32',
        'Census_ProcessorCoreCount':                            'float16',
        'Census_ProcessorManufacturerIdentifier':               'float16',
        'Census_ProcessorModelIdentifier':                      'float32', 
        'Census_ProcessorClass':                                'category',
        'Census_PrimaryDiskTotalCapacity':                      'float64', 
        'Census_PrimaryDiskTypeName':                           'category',
        'Census_SystemVolumeTotalCapacity':                     'float64', 
        'Census_HasOpticalDiskDrive':                           'int8',
        'Census_TotalPhysicalRAM':                              'float32',
        'Census_ChassisTypeName':                               'str',
        'Census_InternalPrimaryDiagonalDisplaySizeInInches':    'float32', 
        'Census_InternalPrimaryDisplayResolutionHorizontal':    'float32', 
        'Census_InternalPrimaryDisplayResolutionVertical':      'float32', 
        'Census_PowerPlatformRoleName':                         'category',
        'Census_InternalBatteryType':                           'str',
        'Census_InternalBatteryNumberOfCharges':                'float64', 
        'Census_OSVersion':                                     'category',
        'Census_OSArchitecture':                                'category',
        'Census_OSBranch':                                      'category',
        'Census_OSBuildNumber':                                 'int16',
        'Census_OSBuildRevision':                               'int32',
        'Census_OSEdition':                                     'str',
        'Census_OSSkuName':                                     'category',
        'Census_OSInstallTypeName':                             'category',
        'Census_OSInstallLanguageIdentifier':                   'float16',
        'Census_OSUILocaleIdentifier':                          'int16',
        'Census_OSWUAutoUpdateOptionsName':                     'category',
        'Census_IsPortableOperatingSystem':                     'int8',
        'Census_GenuineStateName':                              'category',
        'Census_ActivationChannel':                             'category',
        'Census_IsFlightingInternal':                           'float16',
        'Census_IsFlightsDisabled':                             'float16',
        'Census_FlightRing':                                    'category',
        'Census_ThresholdOptIn':                                'float16',
        'Census_FirmwareManufacturerIdentifier':                'float16',
        'Census_FirmwareVersionIdentifier':                     'float32',
        'Census_IsSecureBootEnabled':                           'int8',
        'Census_IsWIMBootEnabled':                              'float16',
        'Census_IsVirtualDevice':                               'float16',
        'Census_IsTouchEnabled':                                'int8',
        'Census_IsPenCapable':                                  'int8',
        'Census_IsAlwaysOnAlwaysConnectedCapable':              'float16',
        'Wdft_IsGamer':                                         'float16',
        'Wdft_RegionIdentifier':                                'float16',
        'HasDetections':                                        'int8'
        }

    df = pd.read_csv(filename, dtype=dtypes, engine='c')
    return df

def load_from_encoded_data(filename):
   
    dtypes = {
        'MachineIdentifier':                                    'int64',
        'ProductName':                                          'int64',
        'EngineVersion':                                        'int64',
        'AppVersion':                                           'int64',
        'AvSigVersion':                                         'int64',
        'RtpStateBitfield':                                     'int64',
        'Platform':                                             'int64',
        'Processor':                                            'int64',
        'OsVer':                                                'int64',
        'OsPlatformSubRelease':                                 'int64',
        'OsBuildLab':                                           'int64',
        'SkuEdition':                                           'int64',
        'SmartScreen':                                          'int64',
        'Census_MDC2FormFactor':                                'int64',
        'Census_DeviceFamily':                                  'int64',
        'Census_PrimaryDiskTypeName':                           'int64',
        'Census_ChassisTypeName':                               'int64',
        'Census_PowerPlatformRoleName':                         'int64',
        'Census_InternalBatteryType':                           'int64',
        'Census_OSVersion':                                     'int64',
        'Census_OSArchitecture':                                'int64',
        'Census_OSBranch':                                      'int64',
        'Census_OSEdition':                                     'int64',
        'Census_OSSkuName':                                     'int64',
        'Census_OSInstallTypeName':                             'int64',
        'Census_OSWUAutoUpdateOptionsName':                     'int64',
        'Census_GenuineStateName':                              'int64',
        'Census_ActivationChannel':                             'int64',
        'Census_FlightRing':                                    'int64',
        'RtpStateBitfield_wasna':                               'int64',
        'DefaultBrowsersIdentifier_wasna':                      'int64',
        'AVProductStatesIdentifier_wasna':                      'int64',
        'AVProductsInstalled_wasna':                            'int64',
        'AVProductsEnabled_wasna':                              'int64',
        'CityIdentifier_wasna':                                 'int64',
        'OrganizationIdentifier_wasna':                         'int64',
        'GeoNameIdentifier_wasna':                              'int64',
        'IsProtected_wasna':                                    'int64',
        'SMode_wasna':                                          'int64',
        'IeVerIdentifier_wasna':                                'int64',
        'Firewall_wasna':                                       'int64',
        'UacLuaenable_wasna':                                   'int64',
        'Census_OEMNameIdentifier_wasna':                       'int64',
        'Census_OEMModelIdentifier_wasna':                      'int64',
        'Census_ProcessorCoreCount_wasna':                      'int64',
        'Census_ProcessorManufacturerIdentifier_wasna':         'int64',
        'Census_ProcessorModelIdentifier_wasna':                'int64',
        'Census_PrimaryDiskTotalCapacity_wasna':                'int64',
        'Census_SystemVolumeTotalCapacity_wasna':               'int64',
        'Census_TotalPhysicalRAM_wasna':                        'int64',
        'Census_InternalPrimaryDiagonalDisplaySizeInInches_wasna': 'int64',
        'Census_InternalPrimaryDisplayResolutionHorizontal_wasna': 'int64',
        'Census_InternalPrimaryDisplayResolutionVertical_wasna': 'int64',
        'Census_InternalBatteryNumberOfCharges_wasna':          'int64',
        'Census_OSInstallLanguageIdentifier_wasna':             'int64',
        'Census_IsFlightingInternal_wasna':                     'int64',
        'Census_IsFlightsDisabled_wasna':                       'int64',
        'Census_ThresholdOptIn_wasna':                          'int64',
        'Census_FirmwareManufacturerIdentifier_wasna':          'int64',
        'Census_IsWIMBootEnabled_wasna':                        'int64',
        'Census_IsVirtualDevice_wasna':                         'int64',
        'Census_IsAlwaysOnAlwaysConnectedCapable_wasna':        'int64',
        'Wdft_IsGamer_wasna':                                   'int64',
        'Wdft_RegionIdentifier_wasna':                          'int64',
        'Census_FirmwareVersionIdentifier_wasna':               'int64',
        'OsBuildLab_platform':                                  'float64',
        'OsBuildLab_release':                                   'float64',
        'IsBeta':                                               'int8',
        'IsSxsPassiveMode':                                     'int8',
        'HasTpm':                                               'int8',
        'AutoSampleOptIn':                                      'int8',
        'Census_HasOpticalDiskDrive':                           'int8',
        'Census_IsPortableOperatingSystem':                     'int8',
        'Census_IsSecureBootEnabled':                           'int8',
        'Census_IsTouchEnabled':                                'int8',
        'Census_IsPenCapable':                                  'int8',
        'HasDetections':                                        'int8',
        'CountryIdentifier':                                    'float64',
        'LocaleEnglishNameIdentifier':                          'float64',
        'OsBuild':                                              'float64',
        'OsSuite':                                              'int16',
        'Census_OSBuildNumber':                                 'float64',
        'Census_OSUILocaleIdentifier':                          'float64',
        'EngineVersion_major':                                  'int16',
        'EngineVersion_minor':                                  'int16',
        'EngineVersion_build1':                                 'int16',
        'EngineVersion_build2':                                 'int16',
        'AppVersion_major':                                     'int16',
        'AppVersion_minor':                                     'int16',
        'AppVersion_build1':                                    'int16',
        'AppVersion_build2':                                    'int16',
        'AvSigVersion_major':                                   'int16',
        'AvSigVersion_minor':                                   'int16',
        'AvSigVersion_build1':                                  'int16',
        'AvSigVersion_build2':                                  'int16',
        'Census_OSVersion_major':                               'int16',
        'Census_OSVersion_minor':                               'int16',
        'Census_OSVersion_build1':                              'int16',
        'Census_OSVersion_build2':                              'int16',
        'OsVer_major':                                          'int16',
        'OsVer_minor':                                          'int16',
        'OsVer_build1':                                         'int16',
        'OsVer_build2':                                         'int16',
        'OsBuildLab_major':                                     'float64',
        'OsBuildLab_minor':                                     'float64',
        'Census_OSBuildRevision':                               'int32',
        'OsBuildLab_build1':                                    'int32',
        'OsBuildLab_build2':                                    'float64',
        'AVProductsInstalled':                                  'float16',
        'AVProductsEnabled':                                    'float16',
        'OrganizationIdentifier':                               'float16',
        'GeoNameIdentifier':                                    'float64',
        'IsProtected':                                          'float16',
        'SMode':                                                'float16',
        'IeVerIdentifier':                                      'float64',
        'Firewall':                                             'float16',
        'Census_ProcessorCoreCount':                            'float64',
        'Census_ProcessorManufacturerIdentifier':               'float16',
        'Census_OSInstallLanguageIdentifier':                   'float64',
        'Census_IsFlightingInternal':                           'float16',
        'Census_IsFlightsDisabled':                             'float16',
        'Census_ThresholdOptIn':                                'float16',
        'Census_FirmwareManufacturerIdentifier':                'float64',
        'Census_IsWIMBootEnabled':                              'float16',
        'Census_IsVirtualDevice':                               'float16',
        'Census_IsAlwaysOnAlwaysConnectedCapable':              'float16',
        'Wdft_IsGamer':                                         'float16',
        'Wdft_RegionIdentifier':                                'float64',
        'DefaultBrowsersIdentifier':                            'float32',
        'AVProductStatesIdentifier':                            'float64',
        'CityIdentifier':                                       'float64',
        'Census_OEMNameIdentifier':                             'float64',
        'Census_OEMModelIdentifier':                            'float64',
        'Census_ProcessorModelIdentifier':                      'float64',
        'Census_TotalPhysicalRAM':                              'float64',
        'Census_InternalPrimaryDiagonalDisplaySizeInInches':    'float64',
        'Census_InternalPrimaryDisplayResolutionHorizontal':    'float64',
        'Census_InternalPrimaryDisplayResolutionVertical':      'float64',
        'Census_FirmwareVersionIdentifier':                     'float64',
        'UacLuaenable':                                         'float64',
        'Census_PrimaryDiskTotalCapacity':                      'float64',
        'Census_SystemVolumeTotalCapacity':                     'float64',
        'Census_InternalBatteryNumberOfCharges':                'float64',
        'EngineVersion_combined':                               'float64',
        'AppVersion_combined':                                  'float64',
        'AvSigVersion_combined':                                'float64',
        'Census_OSVersion_combined':                            'float64',
        'OsVer_combined':                                       'float64',
        'OsBuildLab_combined':                                  'float64'
        }

    df = pd.read_csv(filename, dtype=dtypes, engine='c', header=0)
    return df
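The effect of passing explicit dtypes, as the loaders above do, can be checked on a small in-memory CSV. The sketch below borrows three column names from the dataset but uses made-up values; the saving on the real 4 GB file will differ.

```python
import io
import pandas as pd

# Tiny made-up CSV using three column names from the dataset
csv_text = "IsBeta,HasTpm,OsBuild\n" + "\n".join(
    f"{i % 2},1,{10000 + i}" for i in range(1000)
)

# Without hints, pandas infers a 64-bit integer for every column
df_default = pd.read_csv(io.StringIO(csv_text))

# With hints, each column is stored in the narrowest type that fits
dtypes = {"IsBeta": "int8", "HasTpm": "int8", "OsBuild": "int16"}
df_typed = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

bytes_default = df_default.memory_usage(index=False).sum()
bytes_typed = df_typed.memory_usage(index=False).sum()
print(bytes_default, bytes_typed)
```

A narrow type is only safe when every value in the column fits its range, which is why identifier columns with large values keep wider types in the dictionaries above.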
In [3]:
if load_from_encoded_files:
    df_train = load_from_encoded_data(filename_encoded_train)
    train_labels = df_train['HasDetections']
    df_dev = load_from_encoded_data(filename_encoded_dev)
    dev_labels = df_dev['HasDetections']
    df_validate = load_from_encoded_data(filename_encoded_validate)
    validate_labels = df_validate['HasDetections']
else:
    train_df = load_from_raw_data(filename_train) 
    test_df = load_from_raw_data(filename_test)
    test_df['HasDetections'] = np.int8(2)
    full_df = pd.concat([train_df, test_df]) # make full, big dataframe for analysis  
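In the raw-data branch above, test rows receive the sentinel label 2 before the two frames are concatenated, which makes it possible to split them apart again later. The sketch below shows that round trip on made-up frames. Passing `ignore_index=True` is an assumption added here to keep row labels unique; the cell above does not pass it.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the raw train and test frames
train_df = pd.DataFrame({"OsBuild": [17134, 16299, 17134], "HasDetections": [1, 0, 1]})
test_df = pd.DataFrame({"OsBuild": [17134, 15063]})

# Mark test rows with the sentinel label 2, as in the cell above
test_df["HasDetections"] = np.int8(2)
full_df = pd.concat([train_df, test_df], ignore_index=True)

# Recover the two parts later by filtering on the sentinel
is_test = full_df["HasDetections"] == 2
train_part = full_df[~is_test]
test_part = full_df[is_test].drop(columns="HasDetections")
print(len(train_part), len(test_part))
```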

EDA

Below we have an Analyze class that generates useful summary information about each variable. It also plots the frequencies of each variable's values, with hue indicating the target. We use this information to gain insights for cleaning and ideas for feature selection and engineering.

In short, we do this to "get to know" our data a little better.
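A numeric counterpart to the hue-split frequency plots is a table with the count and detection rate for each value of a variable. The sketch below uses a made-up sample with one column name from the dataset; it is not part of the Analyze class.

```python
import pandas as pd

# Made-up sample with one dataset column and the target
df = pd.DataFrame({
    "SmartScreen": ["Off", "Off", "RequireAdmin", "RequireAdmin", "RequireAdmin", "Warn"],
    "HasDetections": [1, 1, 0, 1, 0, 0],
})

# Count and detection rate for each value, most frequent first
summary = (
    df.groupby("SmartScreen")["HasDetections"]
      .agg(count="count", detection_rate="mean")
      .sort_values("count", ascending=False)
)
print(summary)
```

Values whose detection rate sits far from the overall rate, and which occur often enough to matter, are the ones most likely to carry signal.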

In [7]:
# Custom-made class to assist with EDA on this dataset
# The code is generalizable. However, specific decisions on plot types were made because
# all our features are categorical
class Analyze:
    def __init__(self, df):
        self.df = df.copy()
    
    def remove_df(self):
        self.df = None
        gc.collect()
        
    def print_eda_summary(self):
        #sns.set(rc={'figure.figsize':(10*2,16*8)})
        sns.set()
        i=0
        fig, ax = plt.subplots(nrows=round(len(self.df.columns)), ncols=2, figsize=(16,5*round(len(self.df.columns))))
        all_cols=[]
        for col in self.df.columns:
            #if col == 'MachineIdentifier': continue
            if self.df[col].dtype.name == 'object'  or self.df[col].dtype.name == 'category': 
                self.df[col] = self.df[col].astype('str')
            all_cols.append(col)
            max_len = self.df[col].nunique()
            if max_len > 10:
                max_len = 10
            g=sns.countplot(y=self.df[col].fillna(-1), hue=self.df['HasDetections'], order=self.df[col].fillna(-1).value_counts(dropna=False).iloc[:max_len].index, ax=ax[i][0])
            g.set_xlim(0,self.df.shape[0])
            plt.tight_layout()
            ax[i][0].title.set_text(col)
            ax[i][0].xaxis.label.set_visible(False)
            xlabels = ['{:,.0f}'.format(x) + 'K' for x in g.get_xticks()/1000]
            g.set_xticklabels(xlabels)
            ax[i][1].axis("off")
            # Basic info
            desc = self.df[col].describe()
            summary = "DESCRIPTION\n   Name: {:}\n   Type: {:}\n  Count: {:}\n Unique: {:}\nMissing: {:}\nPercent: {:2.3f}".format(
                desc.name.ljust(50), str(self.df[col].dtype).ljust(10), self.df[col].count(), self.df[col].nunique(),
                ('yes' if self.df[col].hasnans else 'no'), (1-self.df[col].count()/self.df.shape[0])*100)
            ax[i][1].text(0, 1, summary, verticalalignment="top", family='monospace', fontsize=12)
            analysis=[]
            if self.df[col].dtype.name == 'object': 
                # additional analysis for categorical variables
                if len(self.df[col].str.lower().unique()) != len(self.df[col].unique()):
                    analysis.append("- duplicates from case\n")
                # look for HTML escape characters (&#x..;)
                # and unicode characters (searching for: anything not printable)
                self.df_bad = self.df[col][self.df[col].str.contains(r'[\x00-\x1f]|&#x\d\d;', regex=True, na=True)]
                # na=True also matches missing values, so subtract them to count only real hits
                n_missing = self.df.shape[0] - self.df[col].count()
                n_illegal = len(self.df_bad) - n_missing
                if n_illegal > 0:
                    analysis.append("- illegal chars: {:}\n".format(n_illegal))
                # find different capitalizations of "unknown"
                # if more than one present, need to read as string, turn to lowercase, then make categorical
                self.df_unknown = self.df[col].str.lower() == 'unknown'
                unknowns = self.df[col][self.df_unknown].unique()
                if len(unknowns) > 1:
                    analysis.append("- unknowns\n  {:}\n".format(unknowns))
                if len(''.join(analysis)) > 0:
                    ax[i][1].text(.5, .85, 'FINDINGS\n'+''.join(analysis), verticalalignment="top", family='monospace', fontsize=12)
            else:
                # Stats for numeric variables, read from the describe() summary of the column
                statistics = "STATS\n   Mean: {:5.4g}\n    Std: {:5.4g}\n    Min: {:5.4g}\n    25%: {:5.4g}\n    50%: {:5.4g}\n    75%: {:5.4g}\n    Max: {:5.4g}".format(
                    desc['mean'], desc['std'], desc['min'], desc['25%'], desc['50%'], desc['75%'], desc['max'])
                ax[i][1].text(.5, .85, statistics, verticalalignment="top", family='monospace', fontsize=12)

            # Top 5 and bottom 5 unique values or all unique values if < 10
            if self.df[col].nunique() <= 10:
                values = pd.DataFrame(list(zip(self.df[col].value_counts(dropna=False).keys().tolist(),
                                         self.df[col].value_counts(dropna=False).tolist())),
                                columns=['VALUES', 'COUNTS'])
                values = values.to_string(index=False)
                ax[i][1].text(0, .6, values, verticalalignment="top", family='monospace', fontsize=12)
            else:
                values = pd.DataFrame(list(zip(self.df[col].value_counts(dropna=False).iloc[:5].keys().tolist(),
                                         self.df[col].value_counts(dropna=False).iloc[:5].tolist())),
                                columns=['VALUES', 'COUNTS'])
                mid_row = pd.DataFrame({'VALUES':[":"],
                                        'COUNTS':[":"]})
                bot_values = pd.DataFrame(list(zip(self.df[col].value_counts(dropna=False).iloc[-5:].keys().tolist(),
                                         self.df[col].value_counts(dropna=False).iloc[-5:].tolist())),
                                columns=['VALUES', 'COUNTS'])
                # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
                values = pd.concat([values, mid_row, bot_values])
                values = values.to_string(index=False)
                ax[i][1].text(0, .6, values, verticalalignment="top", family='monospace', fontsize=12)
            i=i+1
        fig.show()
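The string checks inside the class (case-only duplicates, non-printable or HTML-escaped characters, and multiple spellings of "unknown") can be sketched on their own for a plain list of values. `check_string_values` is a hypothetical helper, and the pattern here accepts hex digits in the escape, which is an assumption that differs slightly from the pattern in the class.

```python
import re

# Control characters or HTML hex escapes such as &#x01;
BAD_CHARS = re.compile(r"[\x00-\x1f]|&#x[0-9A-Fa-f]{2};")

def check_string_values(values):
    """Return data-quality findings for a list of strings (hypothetical helper)."""
    present = [v for v in values if v is not None]
    distinct = set(present)
    lowered = {v.lower() for v in distinct}
    return {
        # Fewer distinct values after lowercasing means case-only duplicates
        "case_duplicates": len(lowered) < len(distinct),
        # Number of non-missing values containing a suspicious character
        "illegal_chars": sum(1 for v in present if BAD_CHARS.search(v)),
        # Every capitalization of "unknown" found in the data
        "unknown_variants": sorted(v for v in distinct if v.lower() == "unknown"),
    }

# Made-up values for illustration
sample = ["Desktop", "desktop", "UNKNOWN", "Unknown", "Note&#x01;book", None]
findings = check_string_values(sample)
print(findings)
```

Any finding here suggests reading the column as a string and normalizing it before converting it to a categorical type.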

Analyze the datasets

Below are the results of our inline EDA plots and summary information. We show information for both the train and test sets so we can observe the directionality of variable frequencies. With more time, we would have investigated this directionality further.

In [8]:
if do_EDA and not load_from_encoded_files:  # full_df only exists when the raw files were loaded
    analyzer = Analyze(full_df)
    analyzer.print_eda_summary()
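One way to look at how value frequencies differ between the train and test sets is to normalize the counts in each set and subtract them. The sketch below is only an illustration with made-up values: `frequency_shift` is a hypothetical helper, and this comparison was not part of the analysis above.

```python
import pandas as pd

def frequency_shift(train_col, test_col):
    """Compare normalized value frequencies between two Series (hypothetical helper)."""
    freq = pd.concat(
        [
            train_col.value_counts(normalize=True).rename("train"),
            test_col.value_counts(normalize=True).rename("test"),
        ],
        axis=1,
    ).fillna(0.0)
    # Positive shift: the value is more common in test than in train
    freq["shift"] = freq["test"] - freq["train"]
    return freq.sort_values("shift", ascending=False)

# Made-up OS build values for illustration
train_col = pd.Series([17134, 17134, 16299, 16299])
test_col = pd.Series([17134, 17134, 17134, 17763])
result = frequency_shift(train_col, test_col)
print(result)
```

Large shifts point to variables, such as version numbers, whose distribution changes between the two sets and that may generalize poorly.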