Summary

The website content describes a tutorial on using deep learning for root cause analysis in a data center environment, focusing on predicting the root causes of service failures based on telemetry and error data.

Abstract

The tutorial outlines a method for applying artificial neural networks to the problem of root cause analysis within a data center. It details the process of building a deep learning model that uses binary features indicating CPU load, memory leaks, network delays, and specific error codes to predict the root cause of service disruptions. The model aims to assist the data center team in quickly identifying the underlying issues among three possible classes: memory leaks, network delays, or database issues. The tutorial includes steps for preprocessing the data, such as normalizing features and encoding the target variable, as well as constructing and training a neural network with two hidden layers using TensorFlow and Keras. The model's performance is evaluated on test data, and it achieves an accuracy of 76% after 20 epochs of training. Additionally, the tutorial provides guidance on making predictions for single and multiple incidents, demonstrating how to interpret the model's output to determine the likely root cause.

Opinions

The tutorial expresses confidence in the ability of deep learning models to effectively predict root causes in data center service failures.
It suggests that the preprocessing steps are straightforward due to the binary nature of the input features.
The authors imply that the model's accuracy could be further improved by adjusting the architecture, such as adding more layers or nodes.
There is an assumption that the provided dataset is sufficient and representative for training a robust predictive model.
The use of one-hot encoding for the target variable indicates a preference for categorical cross-entropy as the loss function for multi-class classification problems.

Root Cause Analysis

using Deep Learning

In this tutorial we will use artificial neural network to solve one of the common problem that we call Root Cause Analysis. The use case to solve relates to root cause analysis of problems found in a data center.

We have a data center that runs a number of software services. Service failures happen sometimes, so the data center team have to quickly troubleshoot and identify the root cause. The team wants to create a model that can predict root causes reported by customers based on the telemetry generated and errors noticed. They already have a system monitoring tool that tracks CPU, memory, and application latency characteristics of their servers. In addition, they also track errors reported by their applications.

Can we use this information to predict root causes of the issues noticed?

Problem statement

Using features about CPU loads, memory load, network delays, and three types of errors observed, we want to build a deep learning model to predict the root cause of the error.

The available data set has available has one record for each incident to indicate if any of the load issues or errors were noticed when the problem happened.

You can download the data set that we use on this tutorial using the following link.

Each record in the file has a unique identifier ID that represents the incident. There are seven feature variables, namely CPU_LOAD, MEMORY_LEAK_LOAD, network DELAY, ERROR_1000, ERROR_1001, ERROR_1002, and ERROR_1003. Each of feature is a either a 1 or 0.

The target variable is ROOT_CAUSE. It has three classes, MEMORY_LEAK, NETWORK_DELAY, or DATABASE_ISSUE.

Snapchat of the root cause analysis file

We want to build a model to predict the root cause based on the other values provided.

Here you can find the list of steps to follow in order to build the model and predict using new data.

Preprocess data

Load the data into a pandas data frame and print the content;

import pandas as pd
import os
import tensorflow as tf

#Load the data file into a Pandas Dataframe
symptom_data = pd.read_csv("root_cause_analysis.csv")

#Explore the data loaded
print(symptom_data.dtypes)
symptom_data.head()

Use the label encoder to transform the target variable into numeric values;

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

label_encoder = preprocessing.LabelEncoder()
symptom_data['ROOT_CAUSE'] = label_encoder.fit_transform(
                                symptom_data['ROOT_CAUSE'])

Given that all the values for the feature variables are either one or zero, it is already normalized so we don’t have to do any further processing.

Convert the data to a Numpy array of float numbers;

#Convert Pandas DataFrame to a numpy vector
np_symptom = symptom_data.to_numpy().astype(float)

Split the features variables from the target variable;

#Extract the feature variables (X)
X_data = np_symptom[:,1:8]

#Extract the target variable (Y), conver to one-hot-encodign
Y_data=np_symptom[:,8]

Use one hot encoding to encode the target variable;

Y_data = tf.keras.utils.to_categorical(Y_data,3)

Split the data into training and test sets;

#Split training and test data
X_train,X_test,Y_train,Y_test = train_test_split( X_data, Y_data, test_size=0.10)

print("Shape of feature variables :", X_train.shape)
print("Shape of target variable :",Y_train.shape)

Now the data ready for deep learning.

Build the model

Now we proceed to build a model for root cause analysis.

Set the EPOCHS, BATCH SIZE, the number of nodes on the hidden layer N_HIDDEN, and the part of data to be used on validation VALIDATION_SPLIT;

from tensorflow import keras
from tensorflow.keras import optimizers
from tensorflow.keras.regularizers import l2

#Setup Training Parameters
EPOCHS=20
BATCH_SIZE=64
VERBOSE=1
OUTPUT_CLASSES=len(label_encoder.classes_)
N_HIDDEN=128
VALIDATION_SPLIT=0.2

Use a model with two hidden layers of and an output layer;

#Create a Keras sequential model
model = tf.keras.models.Sequential()
#Add a Dense Layer
model.add(keras.layers.Dense(N_HIDDEN,
                             input_shape=(7,),
                              name='Dense-Layer-1',
                              activation='relu'))

#Add a second dense layer
model.add(keras.layers.Dense(N_HIDDEN,
                              name='Dense-Layer-2',
                              activation='relu'))

#Add a softmax layer for categorial prediction
model.add(keras.layers.Dense(OUTPUT_CLASSES,
                             name='Final',
                             activation='softmax'))
#Compile the model
model.compile(
              loss='categorical_crossentropy',
              metrics=['accuracy'])


model.summary()

Build the model, then fit the training data;

#Build the model
model.fit(X_train,
          Y_train,
          batch_size=BATCH_SIZE,
          epochs=EPOCHS,
          verbose=VERBOSE,
          validation_split=VALIDATION_SPLIT)

Evaluate the model against the test data;

#Evaluate the model against the test dataset and print results
print("\nEvaluation against Test Dataset :\n------------------------------------")
model.evaluate(X_test,Y_test)

The overall accuracy during training is around 86% and the accuracy against test data is 76%

If you wish to improve the model you can continue to increase the number of layers or the number of nodes.

Make predictions

Now that we built the model we can proceed to predict root causes. We will first predict the root cause for a single input . So we will:

Set the values for the seven feature variables;

#Pass individual flags to Predict the root cause
import numpy as np

CPU_LOAD=1
MEMORY_LOAD=0
DELAY=0
ERROR_1000=0
ERROR_1001=1
ERROR_1002=1
ERROR_1003=0

Use the model to predict the probabilities for the various classes and use the argmax function to extract the position with the highest probability;

prediction=np.argmax(model.predict(
    [[CPU_LOAD,MEMORY_LOAD,DELAY,
      ERROR_1000,ERROR_1001,ERROR_1002,ERROR_1003]]), axis=1 )

Use the label and coder to retrieve the string class value for the root cause;

print(label_encoder.inverse_transform(prediction))

The prediction results shows that the cause is a database issue.

We can also perform a batch prediction in a similar fashion for multiple incidents in one shot, as shown here

#Predicting as a Batch
print(label_encoder.inverse_transform(np.argmax(
        model.predict([[1,0,0,0,1,1,0],
                                [0,1,1,1,0,0,0],
                                [1,1,0,1,1,0,1],
                                [0,0,0,0,0,1,0],
                                [1,0,1,0,1,1,1]]), axis=1 )))