Shad Griffin

Summary

The provided content outlines a comprehensive approach to creating mean encoded features for predictive modeling using Backblaze's hard drive failure data.

Abstract

The content details the second part of a series on solving the Backblaze machine failure problem on Kaggle, focusing specifically on Step Two: the creation of mean encoded features. It guides the reader through setting up the environment, reading in data from previous years, and creating meaningful features that can be used to predict hard drive failures. The process involves establishing a global mean failure rate, aggregating failure rates by manufacturer and model, and handling irrelevant aggregations by replacing them with the global average when the sample size is too small. The article emphasizes the importance of using IBM Watson Studio and cloud object storage for the analysis and demonstrates how to export and upload the resulting data frames to IBM Cloud Object Storage. The author, Shad Griffin, shares his insights and code snippets to help data scientists understand and apply these techniques to real-world equipment failure prediction problems.

Opinions

  • The author, Shad Griffin, expresses gratitude to Backblaze for providing a valuable dataset for teaching machine failure prediction techniques.
  • Griffin suggests that the goal is not to achieve the highest AUC but to build realistic and transitionable solutions from the lab to the real world.
  • The author recommends working through a related article for further details on the techniques used, indicating its relevance and usefulness.
  • Griffin emphasizes the need to ensure a large number of observations for meaningful aggregations, using a threshold of 10,000 in the context of the Backblaze data.
  • The author provides a rationale for the choice of runtime configuration on IBM Cloud, noting that some notebooks may take hours to run and should be scheduled as jobs.
  • Griffin advises on the importance of avoiding irrelevant aggregations when building mean encoded features, which is crucial for accurate predictive modeling.
  • The author highlights the usefulness of mean encoded features by example, such as when a disk fails consistently under certain conditions.
  • Griffin acknowledges that the data used in the notebook is the property of Backblaze and directs readers to Backblaze's and Kaggle's websites for questions regarding data use.

A Complete Solution to the Backblaze Machine Failure Kaggle Problem, II

Step Two, Creating Mean Encoded Features

1.0 Introduction

Backblaze, you are the “GOAT.” You are the “cat’s meow.” You “Rock the House.” In case you don’t know why Backblaze is so totally “kick-ass,” they open-sourced a vast set of hard drive information a few years ago and continue updating it each quarter. What a treasure trove of superb data. Backblaze, thank you from the bottom of my heart.

The Backblaze data includes operational metrics from hard drives with an indicator of a hard-drive failure. It is an excellent source for teaching techniques related to machine failure. Again, thank you for making this available to the open-source community.

Here is a link to the data.

My goal in this series of articles is not to give the best solution with the highest AUC. My goal is to show you how to approach equipment failure problems and build solutions that reflect realistic accuracy, and provide an easy transition from the lab to the real world.

I will use a Spark/Python Jupyter notebook inside IBM’s Watson Studio on the cloud as a tool in this discussion.

I will also be using cloud object storage on the IBM cloud.

This second article is about designing features for a predictive model. Specifically, we use data from 2017, 2018 and 2019 to build features for our model, which is based on 2020. For more information on mean encoded features, please see the following article.

https://readmedium.com/leveraging-value-from-postal-codes-naics-codes-area-codes-and-other-funky-arse-categorical-be9ce75b6d5a

I created these notebooks with a runtime using 1 driver with 1 vCPU and 4 GB RAM, and 2 executors each with 1 vCPU and 4 GB RAM. This is available for free on the IBM Cloud. Some of the notebooks take a few hours to run, so you'll need to schedule them to run as jobs.

I recommend working through this article if you have not previously done so. It will provide further details on the techniques used to understand the data from Backblaze.

2.0 Establish environment and parameters

from functools import reduce
from pyspark.sql import DataFrame

import pyspark.sql.functions as F

from pyspark.sql.functions import when

from pyspark.sql.functions import rand
from pyspark.sql.functions import lit

#Define cloud object storage parameters
import ibmos2spark, os
# @hidden_cell

if os.environ.get('RUNTIME_ENV_LOCATION_TYPE') == 'external':
    endpoint_ae0ee98cbce04bcbb3163be1d0955096 = 'https://s3-api.us-geo.objectstorage.softlayer.net'
else:
    endpoint_ae0ee98cbce04bcbb3163be1d0955096 = 'https://s3-api.us-geo.objectstorage.service.networklayer.com'

credentials = {
    'endpoint': XXXXXX,
    'service_id': 'XXXXXX',
    'iam_service_endpoint': 'XXXXX',
    'api_key': 'XXXXXX'
}

configuration_name = 'os_ae0ee98cbce04bcbb3163be1d0955096_configs'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')

3.0 Read in data created in step 1

Read in the data from 2017–2019 we created in step 1. There are 107,909,839 observations and 16 fields.

df = spark.read.parquet(cos.url('data2019_2017.parquet', 'backblazedata-donotdelete-pr-cij57grgkoctem'))

#df=df.show(200)
#print((df.count(), len(df.columns)))

4.0 Create mean encoded features.

In the next few steps we will create mean encoded features and export them to a format to be used in our predictive model.

One thing you want to avoid when building mean encoded features is using irrelevant aggregations. This can occur when you don't have a big enough sample for the aggregation to be meaningful. The failure rate is tiny in this exercise, so we must ensure a large number of observations are available for the aggregations to be meaningful. In the code below, I use 10,000 as a threshold. There is no magic number, but I picked 10,000 based on the overall average failure rate across all disk drives.
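To make the sample-size concern concrete, here is a minimal sketch of how noisy an estimated failure rate is at different sample sizes. It treats each observation as an independent Bernoulli trial, which is a simplification, and the rate `ASSUMED_RATE` is a hypothetical value picked for illustration, not a figure from the Backblaze data.

```python
import math


def relative_standard_error(n_obs: int, failure_rate: float) -> float:
    """Standard error of an estimated failure rate, divided by the rate itself.

    Assumes independent Bernoulli observations (a simplification).
    """
    return math.sqrt((1.0 - failure_rate) / (n_obs * failure_rate))


# Hypothetical failure rate, chosen only for illustration.
ASSUMED_RATE = 0.0001

for n_obs in (100, 10_000, 1_000_000):
    rse = relative_standard_error(n_obs, ASSUMED_RATE)
    print(f"{n_obs:>9,} observations -> relative standard error {rse:.2f}")
```

The smaller the group, the larger the error relative to the rate being estimated, which is why small groups get the global average instead of their own rate.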

4.1 Establish a global mean

Having a global mean in the data frame allows us to easily replace irrelevant values with an average.

Create a Dummy field to use in aggregations.

df = df.withColumn("wookie", lit(1))

Create a global failure rate across all three years of data. We will use this when we create the features. The result is a Spark data frame that expresses the average failure rate across all disks for the last three years.

#aggregate the data
total = df.groupBy('wookie').agg(F.mean("failure").alias('avg_failure')).collect()
#convert output to rdd
rdd = spark.sparkContext.parallelize(total)
#convert output to spark
zz=rdd.toDF()
#rename the column
zz=zz.withColumnRenamed("avg_failure","GLOBAL_AVG_FAILURE")
#multiply by 10,000, for formatting purposes.
zz = zz.withColumn("GLOBAL_AVG_FAILURE", zz.GLOBAL_AVG_FAILURE*10000)
#zz.show(200)

Join the global average to the original data frame.

df=df.join(zz,(df.wookie) == (zz.wookie),"inner")
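For readers without a Spark session handy, the same idea can be sketched in plain Python: compute one global rate, scale it by 10,000, and attach it to every record. The toy records and the lower-case field names are made up for illustration.

```python
# Toy records standing in for daily drive observations (values are made up).
records = [
    {"model": "A", "failure": 0},
    {"model": "A", "failure": 1},
    {"model": "B", "failure": 0},
    {"model": "B", "failure": 0},
]

# Global failure rate across every record, scaled by 10,000 as in the article.
global_avg_failure = 10_000 * sum(r["failure"] for r in records) / len(records)

# Attaching the same value to each record mirrors the join on the constant key.
records_with_global = [
    {**r, "GLOBAL_AVG_FAILURE": global_avg_failure} for r in records
]
```

Every record now carries the same global value, so any later aggregation can fall back to it.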

4.2 Aggregate the failure rate by Manufacturer.

#Calculate the summaries.
total = df.groupBy('MANUFACTURER').agg(F.mean("failure").alias('avg_failure'),F.count("failure").alias('count_failure'),\
                                       F.sum("failure").alias('sum_failure'),F.mean("GLOBAL_AVG_FAILURE").alias('GLOBAL_AVG_FAILURE')).collect()
#Convert to RDD
rdd = spark.sparkContext.parallelize(total)
#convert to spark
zz=rdd.toDF()
#rename columns
zz=zz.withColumnRenamed("avg_failure","MANU_FAIL_RATE")
zz=zz.withColumnRenamed("sum_failure","MANU_FAIL_TOTAL")
zz=zz.withColumnRenamed("count_failure","MANU_FAIL_CNT")
#multiply by 10,000 to make them easier to read and deal with
zz = zz.withColumn("MANU_FAIL_RATE", zz.MANU_FAIL_RATE*10000)

#zz.show(200)

We want to avoid situations where aggregations are based on a small number of records. In the next step we replace values with the global failure average if the total number of records used to calculate the value is less than 10,000. Again, 10,000 is reasonable based on the overall failure rate.

df_manu = zz.withColumn("MANU_FAIL_RATE", when(zz.MANU_FAIL_CNT<10000,zz.GLOBAL_AVG_FAILURE).otherwise(zz.MANU_FAIL_RATE))

#df_manu.show(200)
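The substitution step can also be illustrated without Spark. The following is a minimal sketch, not the notebook's implementation: the helper `encode_failure_rate`, the toy records, and the tiny `min_count` are all hypothetical, chosen so the fallback is visible on a dozen rows.

```python
from collections import defaultdict


def encode_failure_rate(records, key, global_avg, min_count):
    """Per-group failure rate (x10,000), or the global average for small groups."""
    counts = defaultdict(int)
    failures = defaultdict(int)
    for r in records:
        counts[r[key]] += 1
        failures[r[key]] += r["failure"]

    encoded = {}
    for group, n in counts.items():
        rate = 10_000 * failures[group] / n
        # Groups with too few records fall back to the global average.
        encoded[group] = rate if n >= min_count else global_avg
    return encoded


# Made-up data: manufacturer X has 10 records, manufacturer Y only 2.
records = (
    [{"manufacturer": "X", "failure": 0}] * 9
    + [{"manufacturer": "X", "failure": 1}]
    + [{"manufacturer": "Y", "failure": 1}] * 2
)
global_avg = 10_000 * sum(r["failure"] for r in records) / len(records)
encoded = encode_failure_rate(records, "manufacturer", global_avg, min_count=5)
```

Here manufacturer Y would otherwise show a 100% failure rate based on two records. The fallback replaces that with the global average.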

Convert the aggregated data frame to pandas.

df_manup = df_manu.toPandas()

Define credentials for object storage

import os, types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0
# @hidden_cell
# The following code contains the credentials for a file in your IBM Cloud Object Storage.
# You might want to remove those credentials before you share your notebook.
credentials = {
    'IAM_SERVICE_ID': 'XXXXXX',
    'IBM_API_KEY_ID': 'XXXXXX',
    'ENDPOINT': 'XXXXX',
    'IBM_AUTH_ENDPOINT': 'XXXXXX',
    'BUCKET': 'XXXXXX',
    'FILE': 'XXXXXX'
}
from ibm_botocore.client import Config
import ibm_boto3
cos = ibm_boto3.client(service_name='s3',
    ibm_api_key_id=credentials['IBM_API_KEY_ID'],
    ibm_service_instance_id=credentials['IAM_SERVICE_ID'],
    ibm_auth_endpoint=credentials['IBM_AUTH_ENDPOINT'],
    config=Config(signature_version='oauth'),
    endpoint_url=credentials['ENDPOINT'])

Export the pandas dataframe to csv and upload to cloud object storage.

#to_csv writes the file and returns None, so do not reassign the result
df_manup.to_csv('manufacturer.csv',index=False)
cos.upload_file(Filename='manufacturer.csv',Bucket=credentials['BUCKET'],Key='manufacturer.csv')

4.3 Aggregate the failure rate by Model.

#Calculate the summaries.
total = df.groupBy('MODEL').agg(F.mean("failure").alias('avg_failure'),F.count("failure").alias('count_failure'),\
                                F.sum("failure").alias('sum_failure'),F.mean("GLOBAL_AVG_FAILURE").alias('GLOBAL_AVG_FAILURE')).collect()
#Convert to RDD
rdd = spark.sparkContext.parallelize(total)
#convert output to spark
zz=rdd.toDF()
#rename columns
zz=zz.withColumnRenamed("avg_failure","MODEL_FAIL_RATE")
zz=zz.withColumnRenamed("sum_failure","MODEL_FAIL_TOTAL")
zz=zz.withColumnRenamed("count_failure","MODEL_FAIL_CNT")
#multiply by 10,000 to make them easier to read and deal with
zz = zz.withColumn("MODEL_FAIL_RATE", zz.MODEL_FAIL_RATE*10000)
#replace values when total for a summary is less than 10,000
df_model = zz.withColumn("MODEL_FAIL_RATE", when(zz.MODEL_FAIL_CNT<10000,zz.GLOBAL_AVG_FAILURE).otherwise(zz.MODEL_FAIL_RATE))
#convert to Pandas
df_modelp = df_model.toPandas()
#export to csv (to_csv returns None, so do not reassign the result)
df_modelp.to_csv('model.csv',index=False)
#upload to object storage
cos.upload_file(Filename='model.csv',Bucket=credentials['BUCKET'],Key='model.csv')
#zz.show(200)

4.4 Calculate the values of predictors when a disk fails by model.

This aggregation could be a useful predictor when compared to other non-failure values. For example, if a disk fails every time a field is equal to 76.4, you should probably take note.

Select disks that failed

df_failure=df.filter(df.FAILURE == 1)

Aggregate the fields by model

#Calculate the summaries.
total = df_failure.groupBy('MODEL').agg(F.mean("REAllOCATED_SECTOR_COUNT_N").alias('REALLOCATED_SECTOR_COUNT_N_MOD'),\
                                F.mean("REPORTED_UNCORRECTABLE_ERRORS_N").alias('REPORTED_UNCORRECTABLE_ERRORS_N_MOD'),\
                                F.mean("COMMAND_TIMEOUT_N").alias('COMMAND_TIMEOUT_N_MOD'),\
                                F.mean("CURRENT_PENDING_SECTOR_COUNT_N").alias('CURRENT_PENDING_SECTOR_COUNT_N_MOD'),\
                                F.mean("POWER_ON_HOURS_N").alias('POWER_ON_HOURS_N_MOD'),\
                                F.mean("REAllOCATED_SECTOR_COUNT_R").alias('REALLOCATED_SECTOR_COUNT_R_MOD'),\
                                F.mean("REPORTED_UNCORRECTABLE_ERRORS_R").alias('REPORTED_UNCORRECTABLE_ERRORS_R_MOD'),\
                                F.mean("COMMAND_TIMEOUT_R").alias('COMMAND_TIMEOUT_R_MOD'),\
                                F.mean("CURRENT_PENDING_SECTOR_COUNT_R").alias('CURRENT_PENDING_SECTOR_COUNT_R_MOD'),\
                                F.mean("POWER_ON_HOURS_R").alias('POWER_ON_HOURS_R_MOD')).collect()
#Convert to RDD
rdd = spark.sparkContext.parallelize(total)

#convert to spark
df_avg_by_model=rdd.toDF()
#convert to pandas
df_avg_by_model = df_avg_by_model.toPandas()
#export to csv
df_avg_by_model.to_csv('df_avg_by_model.csv',index=False)
#upload to cloud object storage
cos.upload_file(Filename='df_avg_by_model.csv',Bucket=credentials['BUCKET'],Key='df_avg_by_model.csv')
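The idea behind this aggregation can be sketched in a few lines of plain Python: keep only the failed records, then average a predictor within each group. The helper `failed_means_by_group`, the field name `power_on_hours`, and the records are hypothetical stand-ins for the Spark columns used above.

```python
from collections import defaultdict
from statistics import mean


def failed_means_by_group(records, group_key, field):
    """Average value of `field` per group, using failed records only."""
    values = defaultdict(list)
    for r in records:
        if r["failure"] == 1:
            values[r[group_key]].append(r[field])
    return {group: mean(vals) for group, vals in values.items()}


# Made-up observations: model A failed twice, model B never failed.
records = [
    {"model": "A", "failure": 1, "power_on_hours": 100.0},
    {"model": "A", "failure": 1, "power_on_hours": 300.0},
    {"model": "A", "failure": 0, "power_on_hours": 900.0},
    {"model": "B", "failure": 0, "power_on_hours": 500.0},
]
result = failed_means_by_group(records, "model", "power_on_hours")
```

A group with no failures, such as model B here, does not appear in the result at all, so a later join against these averages would need to handle missing groups.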

4.5 Calculate the values of predictors when a disk fails by manufacturer.

#Calculate the summaries.
total = df_failure.groupBy('MANUFACTURER').agg(F.mean("REAllOCATED_SECTOR_COUNT_N").alias('REALLOCATED_SECTOR_COUNT_N_MAN'),\
                                F.mean("REPORTED_UNCORRECTABLE_ERRORS_N").alias('REPORTED_UNCORRECTABLE_ERRORS_N_MAN'),\
                                F.mean("COMMAND_TIMEOUT_N").alias('COMMAND_TIMEOUT_N_MAN'),\
                                F.mean("CURRENT_PENDING_SECTOR_COUNT_N").alias('CURRENT_PENDING_SECTOR_COUNT_N_MAN'),\
                                F.mean("POWER_ON_HOURS_N").alias('POWER_ON_HOURS_N_MAN'),\
                                F.mean("REAllOCATED_SECTOR_COUNT_R").alias('REALLOCATED_SECTOR_COUNT_R_MAN'),\
                                F.mean("REPORTED_UNCORRECTABLE_ERRORS_R").alias('REPORTED_UNCORRECTABLE_ERRORS_R_MAN'),\
                                F.mean("COMMAND_TIMEOUT_R").alias('COMMAND_TIMEOUT_R_MAN'),\
                                F.mean("CURRENT_PENDING_SECTOR_COUNT_R").alias('CURRENT_PENDING_SECTOR_COUNT_R_MAN'),\
                                F.mean("POWER_ON_HOURS_R").alias('POWER_ON_HOURS_R_MAN')).collect()
#Convert to RDD
rdd = spark.sparkContext.parallelize(total)
#convert to spark
df_avg_by_manu=rdd.toDF()
#convert to pandas
df_avg_by_manu = df_avg_by_manu.toPandas()
#export to csv (to_csv returns None, so do not reassign the result)
df_avg_by_manu.to_csv('df_avg_by_manu.csv',index=False)
#upload to cloud object storage
cos.upload_file(Filename='df_avg_by_manu.csv',Bucket=credentials['BUCKET'],Key='df_avg_by_manu.csv')
#df_avg_by_manu.show(10)

All data used in this notebook is the property of Backblaze.

For questions regarding use of data please see the following website.

Mean Encoding
Equipment Failure
Predictive Maintenance
Kaggle
Spark