A Complete Solution to the Backblaze Machine Failure Kaggle Problem, II

Step Two, Creating Mean Encoded Features

backblazedata/Backblaze.com -- Step Two -- Mean Encoded Features.ipynb at master ·…

github.com

A Complete Solution to the BackBaze.com Kaggle Problem

Step Five, Modeling the Data

jshadgriffin.medium.com

A Complete Solution to the BackBaze.com Kaggle Problem

Part Six, Evaluating the Model

jshadgriffin.medium.com

A Complete Solution to the BackBaze.com Kaggle Problem

Step Four, Adding Features

jshadgriffin.medium.com

A Complete Solution to the BackBaze.com Kaggle Problem

Step Three, Shaping and cleaning the data

jshadgriffin.medium.com

A Complete Solution to the BackBaze.com Kaggle Problem

Step One, File Processing

jshadgriffin.medium.com

1.0 Introduction

Backblaze, you are the “GOAT.” You are the “cat’s meow.” You “Rock the House.” In case you don’t know why Backblaze is so totally “kick-ass,” they open-sourced a vast set of hard drive information a few years ago and continue updating it each quarter. What a treasure trove of superb data. Backblaze, thank you from the bottom of my heart.

The Backblaze data includes operational metrics from hard drives with an indicator of a hard-drive failure. It is an excellent source for teaching techniques related to machine failure. Again, thank you for making this available to the open-source community.

Here is a link to the data.

Backblaze Hard Drive Stats

Each day in the Backblaze data center, we take a snapshot of each operational hard drive. This snapshot includes basic…

www.backblaze.com

My goal in this series of articles is not to give the best solution with the highest AUC. My goal is to show you how to approach equipment failure problems and build solutions that reflect realistic accuracy and transition easily from the lab to the real world.

I will use a Spark/Python Jupyter notebook inside IBM’s Watson Studio on the cloud as a tool in this discussion.

IBM Watson Studio - Overview

Build, run and manage AI models. Prepare data and build models on any cloud using open source code or visual modeling…

www.ibm.com

I will also be using cloud object storage on the IBM cloud.

IBM Cloud Docs

Find documentation, API & SDK references, tutorials, FAQs, and more resources for IBM Cloud products and services.

cloud.ibm.com

This second article is about designing features for a predictive model. Specifically, we use data from 2017, 2018 and 2019 to build features for our model based on 2020. For more information on mean encoded features, please see the following article.

https://readmedium.com/leveraging-value-from-postal-codes-naics-codes-area-codes-and-other-funky-arse-categorical-be9ce75b6d5a

I created these notebooks with a runtime using 1 driver with 1 vCPU and 4 GB RAM, and 2 executors each with 1 vCPU and 4 GB RAM. This is available for free on the IBM Cloud. Some of the notebooks take a few hours to run, so you’ll need to schedule your notebooks to run as jobs.

Scheduling a notebook

You can create a job to run your notebook at periodic intervals.

dataplatform.cloud.ibm.com

I recommend working through this article if you have not previously done so. It will provide further details on the techniques used to understand the data from Backblaze.

Machine Learning for Equipment Failure Prediction and Predictive Maintenance (PM)

I spent roughly four years of my life studying equipment failure problems as a Data Scientist. This article includes…

medium.com

2.0 Establish environment and parameters

from functools import reduce
from pyspark.sql import DataFrame

import pyspark.sql.functions as F

from pyspark.sql.functions import when

from pyspark.sql.functions import rand
from pyspark.sql.functions import lit

#Define cloud object storage parameters
import ibmos2spark, os
# @hidden_cell

if os.environ.get('RUNTIME_ENV_LOCATION_TYPE') == 'external':
    endpoint_ae0ee98cbce04bcbb3163be1d0955096 = 'https://s3-api.us-geo.objectstorage.softlayer.net'
else:
    endpoint_ae0ee98cbce04bcbb3163be1d0955096 = 'https://s3-api.us-geo.objectstorage.service.networklayer.com'

credentials = {
    'endpoint': XXXXXX,
    'service_id': 'XXXXXX',
    'iam_service_endpoint': 'XXXXX',
    'api_key': 'XXXXXX'
}

configuration_name = 'os_ae0ee98cbce04bcbb3163be1d0955096_configs'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')

3.0 Read in data created in step 1

Read in the data from 2017–2019 we created in step 1. There are 107,909,839 observations and 16 fields.

df = spark.read.parquet(cos.url('data2019_2017.parquet', 'backblazedata-donotdelete-pr-cij57grgkoctem'))

#df=df.show(200)
#print((df.count(), len(df.columns)))

4.0 Create mean encoded features.

In the next few steps we will create mean encoded features and export them in a format that can be used in our predictive model.

One thing you want to avoid when building mean encoded features is using irrelevant aggregations. This can occur when you don’t have a big enough sample for the aggregation to be meaningful. The failure rate is tiny in this exercise, so we must ensure a large number of observations is available for each aggregation. In the code below, I use 10,000 as a threshold. There is no magic number, but I picked 10,000 based on the overall average failure rate across all disk drives.
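
To make the sample-size concern concrete, here is a rough back-of-the-envelope sketch. The expected number of failures in a group is simply its record count times the failure rate. The rate used below (0.5 failures per 10,000 daily records) is an assumption chosen for illustration, not a figure measured from the Backblaze data, and the function name is hypothetical.

```python
# Rough check of how many failures a group of a given size is expected to contain.
# ASSUMPTION: the rate below is illustrative only, not measured from the Backblaze data.
ILLUSTRATIVE_RATE_PER_10K = 0.5  # failures per 10,000 daily records


def expected_failures(record_count, rate_per_10k=ILLUSTRATIVE_RATE_PER_10K):
    """Expected number of failure records in a group of `record_count` rows."""
    return record_count * rate_per_10k / 10_000


for n in (100, 1_000, 10_000, 100_000):
    print(f"{n:>7} records -> about {expected_failures(n):.3f} expected failures")
```

When the expected count is well below one, a single observed failure swings the group's rate wildly, which is why small groups are better served by the global average.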

4.1 Establish a global mean

Having a global mean in the data frame allows us to easily replace irrelevant values with an average.

Create a Dummy field to use in aggregations.

df = df.withColumn("wookie", lit(1))

Create a global failure rate across all three years of data. We will use this when we create the features. The result is a Spark dataframe that expresses the average failure rate across all disks for the last three years.

#aggregate the data
total = df.groupBy('wookie').agg(F.mean("failure").alias('avg_failure')).collect()
#convert output to rdd
rdd = spark.sparkContext.parallelize(total)
#convert output to spark
zz=rdd.toDF()
#rename the column
zz=zz.withColumnRenamed("avg_failure","GLOBAL_AVG_FAILURE")
#multiply by 10,000, for formatting purposes.
zz = zz.withColumn("GLOBAL_AVG_FAILURE", zz.GLOBAL_AVG_FAILURE*10000)
#zz.show(200)

Join the global average to the original data frame.

df=df.join(zz,(df.wookie) == (zz.wookie),"inner")

4.2 Aggregate the failure rate by Manufacturer.

#Calculate the summaries.
total = df.groupBy('MANUFACTURER').agg(F.mean("failure").alias('avg_failure'),F.count("failure").alias('count_failure'),\
                                       F.sum("failure").alias('sum_failure'),F.mean("GLOBAL_AVG_FAILURE").alias('GLOBAL_AVG_FAILURE')).collect()
#Convert to RDD
rdd = spark.sparkContext.parallelize(total)
#convert to spark
zz=rdd.toDF()
#rename columns
zz=zz.withColumnRenamed("avg_failure","MANU_FAIL_RATE")
zz=zz.withColumnRenamed("sum_failure","MANU_FAIL_TOTAL")
zz=zz.withColumnRenamed("count_failure","MANU_FAIL_CNT")
#multiply by 10,000 to make them easier to read and deal with
zz = zz.withColumn("MANU_FAIL_RATE", zz.MANU_FAIL_RATE*10000)

#zz.show(200)

We want to avoid situations where aggregations are based on a small number of records. In the next step we replace values with the global failure average if the total number of records used to calculate the value is less than 10,000. Again, 10,000 is reasonable based on the overall failure rate.

df_manu = zz.withColumn("MANU_FAIL_RATE", when(zz.MANU_FAIL_CNT<10000,zz.GLOBAL_AVG_FAILURE).otherwise(zz.MANU_FAIL_RATE))

#df_manu.show(200)
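
The replace-with-global-average logic can be seen in miniature without Spark. The following is a minimal pure-Python sketch under stated assumptions: the function name `mean_encode`, the constants, and the toy rows are all hypothetical and exist only to illustrate the idea of falling back to the global rate when a category has too few records.

```python
from collections import defaultdict

MIN_RECORDS = 10_000   # hypothetical threshold, mirroring the one discussed above
SCALE = 10_000         # express rates per 10,000 records, as in the notebook


def mean_encode(rows, key, min_records=MIN_RECORDS):
    """Return {category: failure rate per 10,000 records}.

    `rows` is an iterable of dicts with a 0/1 "failure" field.
    Categories with fewer than `min_records` rows get the global rate instead.
    """
    counts = defaultdict(int)
    failures = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
        failures[row[key]] += row["failure"]

    total = sum(counts.values())
    global_rate = sum(failures.values()) / total * SCALE

    return {
        category: (failures[category] / n * SCALE if n >= min_records else global_rate)
        for category, n in counts.items()
    }


# Toy data: manufacturer "A" has plenty of rows, "B" has too few to trust.
toy = (
    [{"MANUFACTURER": "A", "failure": 0}] * 19_998
    + [{"MANUFACTURER": "A", "failure": 1}] * 2
    + [{"MANUFACTURER": "B", "failure": 1}] * 1
    + [{"MANUFACTURER": "B", "failure": 0}] * 4
)
encoded = mean_encode(toy, "MANUFACTURER")
print(encoded)
```

In the toy data, "B" has only five rows, so it receives the global rate rather than its own misleading one-in-five rate.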

Convert the aggregated data frame to pandas.

df_manup = df_manu.toPandas()

Define credentials for object storage

import os, types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0
# @hidden_cell
# The following code contains the credentials for a file in your IBM Cloud Object Storage.
# You might want to remove those credentials before you share your notebook.
credentials = {
    'IAM_SERVICE_ID': 'XXXXXX',
    'IBM_API_KEY_ID': 'XXXXXX',
    'ENDPOINT': 'XXXXX',
    'IBM_AUTH_ENDPOINT': 'XXXXXX',
    'BUCKET': 'XXXXXX',
    'FILE': 'XXXXXX'
}

from ibm_botocore.client import Config
import ibm_boto3
cos = ibm_boto3.client(service_name='s3',
    ibm_api_key_id=credentials['IBM_API_KEY_ID'],
    ibm_service_instance_id=credentials['IAM_SERVICE_ID'],
    ibm_auth_endpoint=credentials['IBM_AUTH_ENDPOINT'],
    config=Config(signature_version='oauth'),
    endpoint_url=credentials['ENDPOINT'])

Export the pandas dataframe to csv and upload to cloud object storage.

#to_csv returns None when given a file path, so do not reassign the dataframe
df_manup.to_csv('manufacturer.csv',index=False)
cos.upload_file(Filename='manufacturer.csv',Bucket=credentials['BUCKET'],Key='manufacturer.csv')

4.3 Aggregate the failure rate by Model.

#Calculate the summaries.
total = df.groupBy('MODEL').agg(F.mean("failure").alias('avg_failure'),F.count("failure").alias('count_failure'),\
                                F.sum("failure").alias('sum_failure'),F.mean("GLOBAL_AVG_FAILURE").alias('GLOBAL_AVG_FAILURE')).collect()
#Convert to RDD
rdd = spark.sparkContext.parallelize(total)
#convert output to spark
zz=rdd.toDF()
#rename columns
zz=zz.withColumnRenamed("avg_failure","MODEL_FAIL_RATE")
zz=zz.withColumnRenamed("sum_failure","MODEL_FAIL_TOTAL")
zz=zz.withColumnRenamed("count_failure","MODEL_FAIL_CNT")
#multiply by 10,000 to make them easier to read and deal with
zz = zz.withColumn("MODEL_FAIL_RATE", zz.MODEL_FAIL_RATE*10000)

#replace values when the record count for a summary is less than 10,000
df_model = zz.withColumn("MODEL_FAIL_RATE", when(zz.MODEL_FAIL_CNT<10000,zz.GLOBAL_AVG_FAILURE).otherwise(zz.MODEL_FAIL_RATE))

#convert to Pandas
df_modelp = df_model.toPandas()
#export to csv (to_csv returns None when given a file path, so do not reassign)
df_modelp.to_csv('model.csv',index=False)
#upload to object storage
cos.upload_file(Filename='model.csv',Bucket=credentials['BUCKET'],Key='model.csv')
#zz.show(200)

4.4 Calculate the values of predictors when a disk fails by model.

This aggregation could be a useful predictor when compared to other non-failure values. For example, if a disk fails every time a field is equal to 76.4, you should probably take note.

Select disks that failed

df_failure=df.filter(df.FAILURE == 1)

Aggregate the fields by model

#Calculate the summaries.
total = df_failure.groupBy('MODEL').agg(F.mean("REAllOCATED_SECTOR_COUNT_N").alias('REALLOCATED_SECTOR_COUNT_N_MOD'),\
                                F.mean("REPORTED_UNCORRECTABLE_ERRORS_N").alias('REPORTED_UNCORRECTABLE_ERRORS_N_MOD'),\
                                F.mean("COMMAND_TIMEOUT_N").alias('COMMAND_TIMEOUT_N_MOD'),\
                                F.mean("CURRENT_PENDING_SECTOR_COUNT_N").alias('CURRENT_PENDING_SECTOR_COUNT_N_MOD'),\
                                F.mean("POWER_ON_HOURS_N").alias('POWER_ON_HOURS_N_MOD'),\
                                F.mean("REAllOCATED_SECTOR_COUNT_R").alias('REALLOCATED_SECTOR_COUNT_R_MOD'),\
                                F.mean("REPORTED_UNCORRECTABLE_ERRORS_R").alias('REPORTED_UNCORRECTABLE_ERRORS_R_MOD'),\
                                F.mean("COMMAND_TIMEOUT_R").alias('COMMAND_TIMEOUT_R_MOD'),\
                                F.mean("CURRENT_PENDING_SECTOR_COUNT_R").alias('CURRENT_PENDING_SECTOR_COUNT_R_MOD'),\
                                F.mean("POWER_ON_HOURS_R").alias('POWER_ON_HOURS_R_MOD')).collect()
#Convert to RDD
rdd = spark.sparkContext.parallelize(total)

#convert to spark
df_avg_by_model=rdd.toDF()
#convert to pandas
df_avg_by_model = df_avg_by_model.toPandas()
#export to csv
df_avg_by_model.to_csv('df_avg_by_model.csv',index=False)
#upload to cloud object storage
cos.upload_file(Filename='df_avg_by_model.csv',Bucket=credentials['BUCKET'],Key='df_avg_by_model.csv')
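
One possible way to use these "typical value at failure" aggregates is to compare a drive's current reading with the average reading its model showed on failure days. The sketch below is a small pure-Python illustration of that idea, not the approach taken later in this series. The function names, model names, and readings are all made up for the example.

```python
from collections import defaultdict
from statistics import mean


def failure_profile(rows, group_key, field):
    """Mean of `field` over failed rows only, per value of `group_key`."""
    values = defaultdict(list)
    for row in rows:
        if row["failure"] == 1:
            values[row[group_key]].append(row[field])
    return {group: mean(vals) for group, vals in values.items()}


def distance_from_profile(row, profile, group_key, field):
    """Difference between a reading and the typical value at failure for its group.

    Returns None when the group has no recorded failures.
    """
    typical = profile.get(row[group_key])
    return None if typical is None else row[field] - typical


# Toy readings; the model names and values are made up for illustration.
history = [
    {"MODEL": "M1", "failure": 1, "POWER_ON_HOURS_N": 70},
    {"MODEL": "M1", "failure": 1, "POWER_ON_HOURS_N": 80},
    {"MODEL": "M1", "failure": 0, "POWER_ON_HOURS_N": 99},
    {"MODEL": "M2", "failure": 0, "POWER_ON_HOURS_N": 95},
]
profile = failure_profile(history, "MODEL", "POWER_ON_HOURS_N")
today = {"MODEL": "M1", "failure": 0, "POWER_ON_HOURS_N": 78}
print(profile, distance_from_profile(today, profile, "MODEL", "POWER_ON_HOURS_N"))
```

A small distance means the drive currently looks like its model's drives did when they failed. Models with no failures in the history have no profile, so the sketch returns None for them.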

4.5 Calculate the values of predictors when a disk fails by manufacturer.

#Calculate the summaries.
total = df_failure.groupBy('MANUFACTURER').agg(F.mean("REAllOCATED_SECTOR_COUNT_N").alias('REALLOCATED_SECTOR_COUNT_N_MAN'),\
                                F.mean("REPORTED_UNCORRECTABLE_ERRORS_N").alias('REPORTED_UNCORRECTABLE_ERRORS_N_MAN'),\
                                F.mean("COMMAND_TIMEOUT_N").alias('COMMAND_TIMEOUT_N_MAN'),\
                                F.mean("CURRENT_PENDING_SECTOR_COUNT_N").alias('CURRENT_PENDING_SECTOR_COUNT_N_MAN'),\
                                F.mean("POWER_ON_HOURS_N").alias('POWER_ON_HOURS_N_MAN'),\
                                F.mean("REAllOCATED_SECTOR_COUNT_R").alias('REALLOCATED_SECTOR_COUNT_R_MAN'),\
                                F.mean("REPORTED_UNCORRECTABLE_ERRORS_R").alias('REPORTED_UNCORRECTABLE_ERRORS_R_MAN'),\
                                F.mean("COMMAND_TIMEOUT_R").alias('COMMAND_TIMEOUT_R_MAN'),\
                                F.mean("CURRENT_PENDING_SECTOR_COUNT_R").alias('CURRENT_PENDING_SECTOR_COUNT_R_MAN'),\
                                F.mean("POWER_ON_HOURS_R").alias('POWER_ON_HOURS_R_MAN')).collect()
#Convert to RDD
rdd = spark.sparkContext.parallelize(total)

#convert to spark
df_avg_by_manu=rdd.toDF()

#convert to pandas
df_avg_by_manu = df_avg_by_manu.toPandas()
#export to csv (to_csv returns None when given a file path, so do not reassign)
df_avg_by_manu.to_csv('df_avg_by_manu.csv',index=False)
#upload to cloud object storage
cos.upload_file(Filename='df_avg_by_manu.csv',Bucket=credentials['BUCKET'],Key='df_avg_by_manu.csv')

#df_avg_by_manu.show(10)

All data used in this notebook is the property of Backblaze.

For questions regarding use of data please see the following website.

Backblaze Hard Drive Stats

Each day in the Backblaze data center, we take a snapshot of each operational hard drive. This snapshot includes basic…

www.backblaze.com
