Microsoft Malware Prediction¶

W207 Final Project¶

• Kevin Hartman
• Gunnar Mein
• Andrew Morris

The inspiration for this project came from a recent Kaggle competition, Microsoft Malware Prediction.

The motivation and prompt:

The malware industry continues to be a well-organized, well-funded market dedicated to evading traditional security measures. Once a computer is infected by malware, criminals can hurt consumers and enterprises in many ways. With more than one billion enterprise and consumer customers, Microsoft takes this problem very seriously and is deeply invested in improving security.

As one part of their overall strategy for doing so, Microsoft is challenging the data science community to develop techniques to predict if a machine will soon be hit with malware. As with their previous, Malware Challenge (2015), Microsoft is providing Kagglers with an unprecedented malware dataset to encourage open-source progress on effective techniques for predicting malware occurrences.

Can you help protect more than one billion machines from damage BEFORE it happens?

Contents¶

We used the following classifiers that were covered in the course:

• k Nearest Neighbors (week 2)
• Decision Trees, as well as Random Forests, Extra Trees, AdaBoost, and Gradient Boosting (week 4)
• Logistic Regression (week 5)
• Neural Networks (week 7)
• Support Vector Machines (week 8)*
• PCA and Gaussian Mixture Models (weeks 9 and 10)

We also investigated the experimental HistGradientBoosting estimators and took a deeper dive into PyTorch, two topics that were not covered in the course.

We did not pursue Naive Bayes (week 3) or Stochastic Gradient Descent (week 6) due to time constraints, and because our EDA did not provide enough support to investigate these approaches.

*Of note, we performed limited LinearSVC() modeling, but the model essentially predicted all positives or all negatives. This yielded low accuracy, so we quickly discarded it.
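A quick way to spot this failure mode is to look at the confusion counts on a held-out set: a model that predicts a single class leaves one side of the confusion matrix empty. The sketch below uses made-up labels and hypothetical helper names (`confusion_counts`, `is_degenerate`); it is not our actual model output.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    return tn, fp, fn, tp

def is_degenerate(y_pred):
    """True when every prediction belongs to the same class."""
    return len(np.unique(y_pred)) <= 1

# Illustrative data only: balanced labels, model predicts all positives
y_true = np.array([0, 1, 0, 1, 0, 1, 0, 1])
y_pred = np.ones_like(y_true)

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tn + tp) / len(y_true)
print(tn, fp, fn, tp, accuracy, is_degenerate(y_pred))
```

On a roughly balanced target, a single-class predictor like this lands near 50% accuracy, which matches the behavior we saw.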

Setup¶

Below we import the required libraries and set the global flags that control how the notebook runs.

In [8]:
# Import required libraries for subsequent operations

import re
import time
import gc
import torch
import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch.nn as nn

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.mixture import GaussianMixture
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import zero_one_loss
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix

from torch.utils import data

%matplotlib inline

import warnings
warnings.filterwarnings('ignore') #hide warnings that arise from missing glyphs, deprecations, etc.

# set required global flags

load_from_encoded_files = True # skip EDA, cleaning and encoding, and load from files
do_EDA = False    # EDA portion can be skipped if working further downstream
debug = False      # use small files to check basic functionality
save_data = False # not saving encoded files can save a lot of time
generate_ids = False # if True, saves test_ids.csv
run_all_models = True # If True, run all models

# here is where we decide what to load, and trigger the process

if load_from_encoded_files:
    if debug:
        filename_encoded_train = "data/mini_train_encoded.csv"
        filename_encoded_dev = "data/mini_dev_encoded.csv"
        filename_encoded_validate = "data/mini_validate_encoded.csv"
    else:
        filename_encoded_train = "data/train_encoded.csv"
        filename_encoded_dev = "data/dev_encoded.csv"
        filename_encoded_validate = "data/validate_encoded.csv"
else:
    if debug:
        filename_train = "data/debug/mini_initial_train.csv"
        filename_test = "data/debug/mini_initial_test.csv"
    else:
        filename_train = "data/train.csv"
        filename_test = "data/test.csv"


Below is the utility code for loading the raw files and their encoded versions. The loading process chooses between small debug versions (~60MB) and the real 4GB files, as indicated by the global flags set above.
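As a minimal sketch of the idea, passing a dtype map to `pd.read_csv` spares pandas from inferring types and lets small integer and category columns stay compact. The function name `load_csv_with_dtypes`, the in-memory sample file, and the four-column dtype map are assumptions for illustration only, not the loader used in this notebook.

```python
import io
import pandas as pd

def load_csv_with_dtypes(source, dtypes):
    """Read a CSV file (path or file-like object) with explicit column dtypes."""
    return pd.read_csv(source, dtype=dtypes)

# Tiny in-memory stand-in for the real multi-gigabyte file
sample_csv = io.StringIO(
    "MachineIdentifier,IsBeta,Platform,HasDetections\n"
    "a1,0,windows10,1\n"
    "b2,0,windows8,0\n"
)

# A four-column subset of a dtype map, for illustration
sample_dtypes = {
    'MachineIdentifier': 'str',
    'IsBeta': 'int8',
    'Platform': 'category',
    'HasDetections': 'int8',
}

df = load_csv_with_dtypes(sample_csv, sample_dtypes)
print(df.dtypes)
```

Category and narrow integer types matter most on files of this size, where the default `int64` and `object` columns can multiply memory use.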

In [2]:
# load with pre-informed data types for faster loading

dtypes = {
'MachineIdentifier':                                    'str',
'ProductName':                                          'str',
'EngineVersion':                                        'str',
'AppVersion':                                           'str',
'AvSigVersion':                                         'str',
'IsBeta':                                               'int8',
'RtpStateBitfield':                                     'float64',
'IsSxsPassiveMode':                                     'int8',
'DefaultBrowsersIdentifier':                            'float32',
'AVProductStatesIdentifier':                            'float32',
'AVProductsInstalled':                                  'float16',
'AVProductsEnabled':                                    'float16',
'HasTpm':                                               'int8',
'CountryIdentifier':                                    'int16',
'CityIdentifier':                                       'float32',
'OrganizationIdentifier':                               'float16',
'GeoNameIdentifier':                                    'float16',
'LocaleEnglishNameIdentifier':                          'int16',
'Platform':                                             'category',
'Processor':                                            'category',
'OsVer':                                                'category',
'OsBuild':                                              'int16',
'OsSuite':                                              'int16',
'OsPlatformSubRelease':                                 'category',
'OsBuildLab':                                           'category',
'SkuEdition':                                           'category',
'IsProtected':                                          'float16',
'AutoSampleOptIn':                                      'int8',
'PuaMode':                                              'category',
'SMode':                                                'float16',
'IeVerIdentifier':                                      'float16',
'SmartScreen':                                          'str',
'Firewall':                                             'float16',
'UacLuaenable':                                         'float64',
'Census_MDC2FormFactor':                                'category',
'Census_DeviceFamily':                                  'category',
'Census_OEMNameIdentifier':                             'float32',
'Census_OEMModelIdentifier':                            'float32',
'Census_ProcessorCoreCount':                            'float16',
'Census_ProcessorManufacturerIdentifier':               'float16',
'Census_ProcessorModelIdentifier':                      'float32',
'Census_ProcessorClass':                                'category',
'Census_PrimaryDiskTotalCapacity':                      'float64',
'Census_PrimaryDiskTypeName':                           'category',
'Census_SystemVolumeTotalCapacity':                     'float64',
'Census_HasOpticalDiskDrive':                           'int8',
'Census_TotalPhysicalRAM':                              'float32',
'Census_ChassisTypeName':                               'str',
'Census_InternalPrimaryDiagonalDisplaySizeInInches':    'float32',
'Census_InternalPrimaryDisplayResolutionHorizontal':    'float32',
'Census_InternalPrimaryDisplayResolutionVertical':      'float32',
'Census_PowerPlatformRoleName':                         'category',
'Census_InternalBatteryType':                           'str',
'Census_InternalBatteryNumberOfCharges':                'float64',
'Census_OSVersion':                                     'category',
'Census_OSArchitecture':                                'category',
'Census_OSBranch':                                      'category',
'Census_OSBuildNumber':                                 'int16',
'Census_OSBuildRevision':                               'int32',
'Census_OSEdition':                                     'str',
'Census_OSSkuName':                                     'category',
'Census_OSInstallTypeName':                             'category',
'Census_OSInstallLanguageIdentifier':                   'float16',
'Census_OSUILocaleIdentifier':                          'int16',
'Census_OSWUAutoUpdateOptionsName':                     'category',
'Census_IsPortableOperatingSystem':                     'int8',
'Census_GenuineStateName':                              'category',
'Census_ActivationChannel':                             'category',
'Census_IsFlightingInternal':                           'float16',
'Census_IsFlightsDisabled':                             'float16',
'Census_FlightRing':                                    'category',
'Census_ThresholdOptIn':                                'float16',
'Census_FirmwareManufacturerIdentifier':                'float16',
'Census_FirmwareVersionIdentifier':                     'float32',
'Census_IsSecureBootEnabled':                           'int8',
'Census_IsWIMBootEnabled':                              'float16',
'Census_IsVirtualDevice':                               'float16',
'Census_IsTouchEnabled':                                'int8',
'Census_IsPenCapable':                                  'int8',
'Census_IsAlwaysOnAlwaysConnectedCapable':              'float16',
'Wdft_IsGamer':                                         'float16',
'Wdft_RegionIdentifier':                                'float16',
'HasDetections':                                        'int8'
}

return df

dtypes = {
'MachineIdentifier':                                    'int64',
'ProductName':                                          'int64',
'EngineVersion':                                        'int64',
'AppVersion':                                           'int64',
'AvSigVersion':                                         'int64',
'RtpStateBitfield':                                     'int64',
'Platform':                                             'int64',
'Processor':                                            'int64',
'OsVer':                                                'int64',
'OsPlatformSubRelease':                                 'int64',
'OsBuildLab':                                           'int64',
'SkuEdition':                                           'int64',
'SmartScreen':                                          'int64',
'Census_MDC2FormFactor':                                'int64',
'Census_DeviceFamily':                                  'int64',
'Census_PrimaryDiskTypeName':                           'int64',
'Census_ChassisTypeName':                               'int64',
'Census_PowerPlatformRoleName':                         'int64',
'Census_InternalBatteryType':                           'int64',
'Census_OSVersion':                                     'int64',
'Census_OSArchitecture':                                'int64',
'Census_OSBranch':                                      'int64',
'Census_OSEdition':                                     'int64',
'Census_OSSkuName':                                     'int64',
'Census_OSInstallTypeName':                             'int64',
'Census_OSWUAutoUpdateOptionsName':                     'int64',
'Census_GenuineStateName':                              'int64',
'Census_ActivationChannel':                             'int64',
'Census_FlightRing':                                    'int64',
'RtpStateBitfield_wasna':                               'int64',
'DefaultBrowsersIdentifier_wasna':                      'int64',
'AVProductStatesIdentifier_wasna':                      'int64',
'AVProductsInstalled_wasna':                            'int64',
'AVProductsEnabled_wasna':                              'int64',
'CityIdentifier_wasna':                                 'int64',
'OrganizationIdentifier_wasna':                         'int64',
'GeoNameIdentifier_wasna':                              'int64',
'IsProtected_wasna':                                    'int64',
'SMode_wasna':                                          'int64',
'IeVerIdentifier_wasna':                                'int64',
'Firewall_wasna':                                       'int64',
'UacLuaenable_wasna':                                   'int64',
'Census_OEMNameIdentifier_wasna':                       'int64',
'Census_OEMModelIdentifier_wasna':                      'int64',
'Census_ProcessorCoreCount_wasna':                      'int64',
'Census_ProcessorManufacturerIdentifier_wasna':         'int64',
'Census_ProcessorModelIdentifier_wasna':                'int64',
'Census_PrimaryDiskTotalCapacity_wasna':                'int64',
'Census_SystemVolumeTotalCapacity_wasna':               'int64',
'Census_TotalPhysicalRAM_wasna':                        'int64',
'Census_InternalPrimaryDiagonalDisplaySizeInInches_wasna': 'int64',
'Census_InternalPrimaryDisplayResolutionHorizontal_wasna': 'int64',
'Census_InternalPrimaryDisplayResolutionVertical_wasna': 'int64',
'Census_InternalBatteryNumberOfCharges_wasna':          'int64',
'Census_OSInstallLanguageIdentifier_wasna':             'int64',
'Census_IsFlightingInternal_wasna':                     'int64',
'Census_IsFlightsDisabled_wasna':                       'int64',
'Census_ThresholdOptIn_wasna':                          'int64',
'Census_FirmwareManufacturerIdentifier_wasna':          'int64',
'Census_IsWIMBootEnabled_wasna':                        'int64',
'Census_IsVirtualDevice_wasna':                         'int64',
'Census_IsAlwaysOnAlwaysConnectedCapable_wasna':        'int64',
'Wdft_IsGamer_wasna':                                   'int64',
'Wdft_RegionIdentifier_wasna':                          'int64',
'Census_FirmwareVersionIdentifier_wasna':               'int64',
'OsBuildLab_platform':                                  'float64',
'OsBuildLab_release':                                   'float64',
'IsBeta':                                               'int8',
'IsSxsPassiveMode':                                     'int8',
'HasTpm':                                               'int8',
'AutoSampleOptIn':                                      'int8',
'Census_HasOpticalDiskDrive':                           'int8',
'Census_IsPortableOperatingSystem':                     'int8',
'Census_IsSecureBootEnabled':                           'int8',
'Census_IsTouchEnabled':                                'int8',
'Census_IsPenCapable':                                  'int8',
'HasDetections':                                        'int8',
'CountryIdentifier':                                    'float64',
'LocaleEnglishNameIdentifier':                          'float64',
'OsBuild':                                              'float64',
'OsSuite':                                              'int16',
'Census_OSBuildNumber':                                 'float64',
'Census_OSUILocaleIdentifier':                          'float64',
'EngineVersion_major':                                  'int16',
'EngineVersion_minor':                                  'int16',
'EngineVersion_build1':                                 'int16',
'EngineVersion_build2':                                 'int16',
'AppVersion_major':                                     'int16',
'AppVersion_minor':                                     'int16',
'AppVersion_build1':                                    'int16',
'AppVersion_build2':                                    'int16',
'AvSigVersion_major':                                   'int16',
'AvSigVersion_minor':                                   'int16',
'AvSigVersion_build1':                                  'int16',
'AvSigVersion_build2':                                  'int16',
'Census_OSVersion_major':                               'int16',
'Census_OSVersion_minor':                               'int16',
'Census_OSVersion_build1':                              'int16',
'Census_OSVersion_build2':                              'int16',
'OsVer_major':                                          'int16',
'OsVer_minor':                                          'int16',
'OsVer_build1':                                         'int16',
'OsVer_build2':                                         'int16',
'OsBuildLab_major':                                     'float64',
'OsBuildLab_minor':                                     'float64',
'Census_OSBuildRevision':                               'int32',
'OsBuildLab_build1':                                    'int32',
'OsBuildLab_build2':                                    'float64',
'AVProductsInstalled':                                  'float16',
'AVProductsEnabled':                                    'float16',
'OrganizationIdentifier':                               'float16',
'GeoNameIdentifier':                                    'float64',
'IsProtected':                                          'float16',
'SMode':                                                'float16',
'IeVerIdentifier':                                      'float64',
'Firewall':                                             'float16',
'Census_ProcessorCoreCount':                            'float64',
'Census_ProcessorManufacturerIdentifier':               'float16',
'Census_OSInstallLanguageIdentifier':                   'float64',
'Census_IsFlightingInternal':                           'float16',
'Census_IsFlightsDisabled':                             'float16',
'Census_ThresholdOptIn':                                'float16',
'Census_FirmwareManufacturerIdentifier':                'float64',
'Census_IsWIMBootEnabled':                              'float16',
'Census_IsVirtualDevice':                               'float16',
'Census_IsAlwaysOnAlwaysConnectedCapable':              'float16',
'Wdft_IsGamer':                                         'float16',
'Wdft_RegionIdentifier':                                'float64',
'DefaultBrowsersIdentifier':                            'float32',
'AVProductStatesIdentifier':                            'float64',
'CityIdentifier':                                       'float64',
'Census_OEMNameIdentifier':                             'float64',
'Census_OEMModelIdentifier':                            'float64',
'Census_ProcessorModelIdentifier':                      'float64',
'Census_TotalPhysicalRAM':                              'float64',
'Census_InternalPrimaryDiagonalDisplaySizeInInches':    'float64',
'Census_InternalPrimaryDisplayResolutionHorizontal':    'float64',
'Census_InternalPrimaryDisplayResolutionVertical':      'float64',
'Census_FirmwareVersionIdentifier':                     'float64',
'UacLuaenable':                                         'float64',
'Census_PrimaryDiskTotalCapacity':                      'float64',
'Census_SystemVolumeTotalCapacity':                     'float64',
'Census_InternalBatteryNumberOfCharges':                'float64',
'EngineVersion_combined':                               'float64',
'AppVersion_combined':                                  'float64',
'AvSigVersion_combined':                                'float64',
'Census_OSVersion_combined':                            'float64',
'OsVer_combined':                                       'float64',
'OsBuildLab_combined':                                  'float64'
}

return df

In [3]:
if load_from_encoded_files:
    train_labels = df_train['HasDetections']
    dev_labels = df_dev['HasDetections']
    validate_labels = df_validate['HasDetections']
else:
    # test rows have no label; mark them with a placeholder value of 2
    test_df['HasDetections'] = np.int8(2)
    full_df = pd.concat([train_df, test_df]) # make full, big dataframe for analysis


EDA¶

Below is our Analyze class, which generates summary information about each variable. It also plots the frequency of each value, with the hue indicating how the values relate to the target. We use this information to gain insights for cleaning and ideas for feature selection and engineering.

In short, we do this to "get to know" our data a little better.
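The same view can be produced as a table. The sketch below is a simplified tabular analogue of the hue-split count plots: for each of the most frequent values of a column, it reports the row count and the share of rows with `HasDetections == 1`. The helper name `detection_rate_by_value` and the toy data are assumptions for illustration.

```python
import pandas as pd

def detection_rate_by_value(df, col, target='HasDetections', top_n=10):
    """Row count and mean target rate for the top_n most frequent values of col."""
    # Give missing values their own visible level instead of dropping them
    keys = df[col].fillna('missing').astype('str')
    summary = df.groupby(keys)[target].agg(count='size', detection_rate='mean')
    return summary.sort_values('count', ascending=False).head(top_n)

# Toy data only; values do not come from the real dataset
toy = pd.DataFrame({
    'SmartScreen': ['Off', 'Off', 'RequireAdmin', 'RequireAdmin', 'RequireAdmin', None],
    'HasDetections': [1, 1, 0, 1, 0, 1],
})

rates = detection_rate_by_value(toy, 'SmartScreen')
print(rates)
```

Values whose detection rate sits far from the overall base rate are the ones worth a closer look during cleaning and feature engineering.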

In [7]:
# Custom-made class to assist with EDA on this dataset
# The code is generalizable. However, specific decisions on plot types were made because
# all our features are categorical
class Analyze:
    def __init__(self, df):
        self.df = df.copy()

    def remove_df(self):
        # release the copied dataframe and reclaim its memory
        self.df = None
        gc.collect()

    def print_eda_summary(self):
        sns.set()
        n_features = len(self.df.columns)
        n_rows = self.df.shape[0]
        text_style = dict(verticalalignment="top", family='monospace', fontsize=12)
        # one row per feature: count plot on the left, text summary on the right
        fig, ax = plt.subplots(nrows=n_features, ncols=2, figsize=(16, 5*n_features), squeeze=False)
        for i, col in enumerate(self.df.columns):
            #if col == 'MachineIdentifier': continue
            is_text = self.df[col].dtype.name in ('object', 'category')
            if is_text:
                self.df[col] = self.df[col].astype('str')
            series = self.df[col]

            # count plot of (at most) the 10 most frequent values, split by target
            max_len = min(series.nunique(), 10)
            plot_order = series.fillna(-1).value_counts(dropna=False).iloc[:max_len].index
            g = sns.countplot(y=series.fillna(-1), hue=self.df['HasDetections'], order=plot_order, ax=ax[i][0])
            g.set_xlim(0, n_rows)
            plt.tight_layout()
            ax[i][0].title.set_text(col)
            ax[i][0].xaxis.label.set_visible(False)
            xlabels = ['{:,.0f}'.format(x) + 'K' for x in g.get_xticks()/1000]
            g.set_xticklabels(xlabels)
            ax[i][1].axis("off")

            # Basic info
            desc = series.describe()
            summary = "DESCRIPTION\n   Name: {:}\n   Type: {:}\n  Count: {:}\n Unique: {:}\nMissing: {:}\nPercent: {:2.3f}".format(
                str(col).ljust(50), str(series.dtype).ljust(10), series.count(), series.nunique(),
                ('yes' if series.hasnans else 'no'), (1 - series.count()/n_rows)*100)
            ax[i][1].text(0, 1, summary, **text_style)

            analysis = []
            if is_text:
                # additional analysis for categorical variables
                if len(series.str.lower().unique()) != len(series.unique()):
                    analysis.append("- duplicates from case\n")
                # look for HTML escape characters (&#x..;)
                # and unicode characters (searching for: anything not printable)
                # NOTE: this pattern is an assumption, reconstructed from the comment above
                n_bad = int(series.str.contains(r'&#x[0-9A-Fa-f]+;|[^\x20-\x7E]', regex=True, na=False).sum())
                if n_bad > 0:
                    analysis.append("- illegal chars: {:}\n".format(n_bad))
                # find different capitalizations of "unknown"
                # if more than one present, need to read as string, turn to lowercase, then make categorical
                unknowns = series[series.str.lower() == 'unknown'].unique()
                if len(unknowns) > 1:
                    analysis.append("- unknowns\n  {:}\n".format(unknowns))
                if len(''.join(analysis)) > 0:
                    ax[i][1].text(.5, .85, 'FINDINGS\n'+''.join(analysis), **text_style)
            else:
                # Stats for numeric variables, read from the describe() output
                statistics = "STATS\n   Mean: {:5.4g}\n    Std: {:5.4g}\n    Min: {:5.4g}\n    25%: {:5.4g}\n    50%: {:5.4g}\n    75%: {:5.4g}\n    Max: {:5.4g}".format(
                    desc['mean'], desc['std'], desc['min'], desc['25%'], desc['50%'], desc['75%'], desc['max'])
                ax[i][1].text(.5, .85, statistics, **text_style)

            # Top 5 and bottom 5 unique values or all unique values if <= 10
            counts = series.value_counts(dropna=False)
            def as_table(c):
                return pd.DataFrame({'VALUES': c.index.tolist(), 'COUNTS': c.tolist()})
            if series.nunique() <= 10:
                values = as_table(counts)
            else:
                mid_row = pd.DataFrame({'VALUES': [":"], 'COUNTS': [":"]})
                # DataFrame.append was removed in pandas 2.0, so use pd.concat
                values = pd.concat([as_table(counts.iloc[:5]), mid_row, as_table(counts.iloc[-5:])])
            ax[i][1].text(0, .6, values.to_string(index=False), **text_style)
        fig.show()


Analyze the datasets¶

Below are the results of our inline EDA plots and summary information. We show information for both train and test sets so we can observe the directionality of variable frequencies. With more time, we would have investigated this directionality further.
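One way to quantify that directionality, sketched here under our own assumptions, is to compare the relative frequency of each value in the train and test sets and look at the difference. The helper name `compare_frequencies` and the toy `OsBuild` values are hypothetical and chosen only to show the mechanics.

```python
import pandas as pd

def compare_frequencies(train, test, col):
    """Relative frequency of each value of col in train vs. test, plus the shift."""
    freq = pd.DataFrame({
        'train': train[col].value_counts(normalize=True),
        'test': test[col].value_counts(normalize=True),
    }).fillna(0.0)  # a value absent from one set has frequency 0 there
    freq['shift'] = freq['test'] - freq['train']
    return freq.sort_values('shift')

# Toy data only: the test set skews toward newer builds
toy_train = pd.DataFrame({'OsBuild': [17134, 17134, 16299, 16299]})
toy_test = pd.DataFrame({'OsBuild': [17134, 17134, 17134, 17763]})

freq = compare_frequencies(toy_train, toy_test, 'OsBuild')
print(freq)
```

A positive shift means the value is more common in test than in train. Large shifts on version-like features would suggest the two sets were collected at different times.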

In [8]:
if do_EDA:
    analyzer = Analyze(full_df)
    analyzer.print_eda_summary()