arison operators, and bitwise operators on this type too.</p><p id="c4da">It’s important to note that there is another type, called <code>bytes</code> which is different from the above in that it is a dynamically sized array, and not a value type but a reference type. It is basically shorthand for <code>byte[]</code>.</p><p id="ad39">When you can limit the length of your data to a predefined amount of bytes, it is always good practice to use some of <code>bytes1</code> to <code>bytes32</code> because it is much cheaper.</p><h2 id="0118">Enums</h2><p id="aea7"><b>Enums</b> in Solidity are a way to create user-defined types. Enums are explicitly convertible to integer types, but not implicitly. Enum values are numbered in the order they are defined, starting from 0.</p><p id="ed00">Enums are not part of the ABI (Application Binary Interface — more on this in a later lesson, but it’s basically how you encode Solidity code for the Ethereum Virtual Machine, and how you get data back). This means that if your function returns an <code>enum</code> for example, it will be automatically converted to a <code>uint8</code> behind the scenes. The integer returned is just large enough to hold all enum values. 
With more values, the size gets increased too (<code>uint16</code> and up).</p><p id="cdb6">The below code, taken from the <a href="https://docs.soliditylang.org/en/v0.4.24/index.html">Solidity docs</a>, defines an enum with four possible values, creates a variable of that enum named <code>choice</code> and a constant called <code>defaultChoice</code>that will hold a default value.</p><div id="cf29"><pre><span class="hljs-keyword">enum</span> <span class="hljs-title class_">ActionChoices</span> { GoLeft, GoRight, GoStraight, SitStill } ActionChoices choice; ActionChoices <span class="hljs-type">constant</span> <span class="hljs-variable">defaultChoice</span> <span class="hljs-operator">=</span> ActionChoices.GoStraight;</pre></div><p id="66a9">Now we can define some functions to interact with our <code>enum</code>.</p><div id="c0bb"><pre><span class="hljs-title function_"><span class="hljs-keyword">function</span> <span class="hljs-title">setGoStraight</span></span>() <span class="hljs-keyword">public</span> { choice = ActionChoices.GoStraight; }

<span class="hljs-title function_"><span class="hljs-keyword">function</span> <span class="hljs-title">setChoice</span></span>(ActionChoices <span class="hljs-keyword">new</span><span class="hljs-type">Choice</span>) <span class="hljs-keyword">public</span> { choice = <span class="hljs-keyword">new</span><span class="hljs-type">Choice</span>; }</pre></div><p id="6bc2">The first one simply sets the <code>choice</code> to <code>GoStraight</code> while the second one sets it to the choice that the caller passes into the function. As we can see after deployment, the <code>setChoice</code> function expects a <code>uint8</code> value, which corresponds to the <code>enum</code> value declared at that number.</p><figure id="e997"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*1pKNPVy4UUBCSLi2-SIckg.png"><figcaption>Testing enums in Remix</figcaption></figure><p id="7917">If we want to get the value of <code>choice</code> and <code>defaultChoice</code>, we can define the following functions:</p><div id="1f02"><pre><span class="hljs-keyword">function</span> <span class="hljs-title">getChoice</span>() public view returns (ActionChoices) { <span class="hljs-keyword">return</span> <span class="hljs-type">choice</span>; }</pre></div><div id="43e7"><pre><span class="hljs-function">function <span class="hljs-title">getDefaultChoice</span>() <span class="hljs-keyword">public</span> pure <span class="hljs-title">returns</span> (<span class="hljs-params"><span class=


"hljs-built_in">uint</span></span>)</span> { <span class="hljs-keyword">return</span> <span class="hljs-built_in">uint</span>(defaultChoice); }</pre></div><p id="c2f2">As we can see if we try this out in Remix, the first function returns a <code>uint8</code> while the second returns a <code>uint256</code>.</p><figure id="e514"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*jmaOFb9GhXz7FWC4ONMa_A.png"><figcaption>Testing enums in Remix</figcaption></figure><h2 id="3c7c">Fixed point numbers</h2><p id="2ecc"><b>Fixed point numbers </b>represent fractional numbers by storing a fixed number of digits of their fractional part. No matter how large or small the fractional part is, it will always use the same number of bits.</p><p id="cdcd" type="7">Fixed point numbers are not fully supported by Solidity yet. They can be declared, but cannot be assigned to or from.</p><p id="f872">We can differentiate between signed fixed point numbers, declared with the <code>fixed</code> keyword, and unsigned fixed point numbers, declared with the <code>ufixed</code> keyword.</p><p id="3c1c">It can also be declared as <code>fixedMxN</code> or <code>ufixedMxN</code> where <code>M</code> represents the number of bits the type takes, and <code>N</code> represents the number of decimal points. <code>M</code> has to be divisible by 8 and a number between 8 and 256. 
<code>N</code> has to be a number between 0 and 80.</p><p id="96e1">They function with the following operators:</p><ul><li>Comparisons: <code><=</code>, <code><</code>, <code>==</code>, <code>!=</code>, <code>>=</code>, <code>></code> (evaluate to <code>bool</code>)</li><li>Arithmetic operators: <code>+</code>, <code>-</code>, unary <code>-</code>, unary <code>+</code>, <code>*</code>, <code>/</code>, <code>%</code> (remainder)</li></ul><h2 id="09b7">Conclusion</h2><p id="bd3a">In this lesson, we looked at what value types are available in Solidity and how each one works.</p><p id="28de">Thank you for staying with us till the end. If you enjoyed reading this piece please keep in touch and follow Solidify to keep up with our lessons on Solidity. In the upcoming articles, we will deep dive into the intricacies of the language, progressing from beginner to advanced level.</p><p id="067c">If you are new to Solidity, check out the previous lessons about setting up a local development environment and writing your first smart contract.</p><div id="6b76" class="link-block"> <a href="https://readmedium.com/how-to-setup-your-local-solidity-development-environment-c4c8195810f3"> <div> <div> <h2>How to Setup Your Local Solidity Development Environment</h2> <div><h3>Get started with smart contract development</h3></div> <div><p>medium.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/1*HHko-o9m1sVngmTeRVYgKA.jpeg)"></div> </div> </div> </a> </div><div id="3ad1" class="link-block"> <a href="https://readmedium.com/lesson-1-your-first-solidity-smart-contract-1ba7e641f9a3"> <div> <div> <h2>Lesson 1: Your First Solidity Smart Contract</h2> <div><h3>In the previous lesson, we looked at how to set up your local Solidity development environment. 
Here we will continue…</h3></div> <div><p>medium.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/1*7r7HSYkbn73NrmR_skvh5w.jpeg)"></div> </div> </div> </a> </div></article></body>

A Complete Anomaly Detection Algorithm From Scratch in Python: Step by Step Guide

Anomaly Detection Algorithm Using the Probabilities

Anomaly detection can be treated as a statistical task, a form of outlier analysis. But if we develop a machine learning model for it, the process can be automated and, as usual, save a lot of time. There are many use cases for anomaly detection: credit card fraud detection, detection of faulty machines or hardware systems based on their anomalous features, and disease detection based on medical records are some good examples. And the use of anomaly detection will only grow.

In this article, I will explain the process of developing an anomaly detection algorithm from scratch in Python.

The Formulas and Process

This will be much simpler than the other machine learning algorithms I have explained before. The algorithm uses the mean and variance to calculate a probability for each training example.

If the probability is high for a training example, it is normal. If the probability is low for a certain training example it is an anomalous example. The definition of high and low probability will be different for the different training sets. We will talk about how to determine that later.

The working process of anomaly detection is very simple.

1. Calculate the mean using this formula:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i$$

Here m is the length of the dataset, that is, the number of training examples, and x_i is a single training example. If you have several training features, which is the case most of the time, the mean needs to be calculated for each feature.

2. Calculate the variance using this formula:

$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)^2$$

Here, mu is the mean calculated in the previous step.

3. Now, calculate the probability for each training example with this probability formula:

$$p(x) = \frac{1}{(2\pi)^{k/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{T}\,\Sigma^{-1}\,(x-\mu)\right)$$

Here k is the number of features. Don’t be confused by the summation sign in this formula! The Σ here is not a sum. It is a matrix with the variances on its diagonal and zeros everywhere else.

You will see how it looks later when we will implement the algorithm.
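Steps 1 to 3 can be tried out on a tiny array before touching the real dataset. The sketch below uses made-up numbers (not the article's data) and also checks one consequence of the summation sign (Σ) being a diagonal matrix of variances: the matrix formula gives the same result as multiplying one univariate Gaussian density per feature.

```python
import numpy as np

# Toy data, made up for illustration: 5 examples, 2 features.
# The last row is deliberately far away from the others.
X_toy = np.array([[1.0, 2.0],
                  [2.0, 1.5],
                  [1.5, 2.5],
                  [2.5, 2.0],
                  [8.0, 9.0]])

m_toy, k_toy = X_toy.shape
mu_toy = X_toy.sum(axis=0) / m_toy                      # step 1: mean per feature
var_toy = ((X_toy - mu_toy) ** 2).sum(axis=0) / m_toy   # step 2: variance per feature
sigma_toy = np.diag(var_toy)                            # variances on the diagonal

# Step 3: density from the matrix form of the formula
diff = X_toy - mu_toy
norm_const = 1 / ((2 * np.pi) ** (k_toy / 2) * np.linalg.det(sigma_toy) ** 0.5)
p_matrix = norm_const * np.exp(
    -0.5 * np.sum(diff @ np.linalg.inv(sigma_toy) * diff, axis=1)
)

# Same density as a product of per-feature univariate Gaussians
p_product = np.prod(
    np.exp(-diff ** 2 / (2 * var_toy)) / np.sqrt(2 * np.pi * var_toy), axis=1
)

print(np.allclose(p_matrix, p_product))  # True
print(p_matrix.argmin())                 # 4, the far-away row has the lowest density
```

The names with the `_toy` suffix are chosen only so that this sketch does not overwrite variables used later in the article.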

4. We need to find the threshold of the probability now. As I mentioned before if the probability is low for a training example, that is an anomalous example.

How much probability is low probability?

There is no universal limit for that. We need to find that out for our training dataset.

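We take a range of probability values from the output of step 3. Treating each value as a candidate threshold, we label every example as anomalous or normal.

Then calculate precision, recall, and f1 score for each candidate threshold.

Precision can be calculated using the following formula:

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

Recall can be calculated by the following formula:

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$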

Here, True positives are the number of cases where the algorithm detects an example as an anomaly and in reality, it is an anomaly.

False Positives occur when the algorithm detects an example as anomalous but in the ground truth, it is not.

False Negative means the algorithm detects an example as not anomalous but in reality, it is an anomalous example.

From the formulas above you can see that higher precision and higher recall are both desirable, because they mean we have more true positives relative to false positives and false negatives. But the two usually trade off against each other, so there needs to be a balance. Based on your industry, you need to decide which kind of error is more tolerable for you.

A good way to balance the two is to take an average. There is a special formula for this average, called the f1 score, which is the harmonic mean of precision and recall:

$$F_1 = \frac{2PR}{P + R}$$

Here, P and R are precision and recall respectively.
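As a small worked example of precision, recall, and f1 score, the sketch below computes all three from raw counts. The function name `precision_recall_f1` and the counts are hypothetical, chosen only for illustration. The guards against division by zero are an addition of this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) computed from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        # No true positives at all: report an f1 of 0 instead of dividing by zero
        return precision, recall, 0.0
    f1_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f1_score


# Hypothetical counts: 6 anomalies caught, 2 false alarms, 3 anomalies missed
prec, rec, score = precision_recall_f1(tp=6, fp=2, fn=3)
print(prec)             # 0.75
print(round(rec, 4))    # 0.6667
print(round(score, 4))  # 0.7059
```

Note that the f1 score (about 0.706) sits between precision and recall but closer to the lower of the two, which is what makes it a cautious kind of average.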

I am not going into detail on why the formula has this particular form, because this article is about anomaly detection. If you are interested in learning more about precision, recall, and f1 score, I have a detailed article on that topic here:

A Complete Understanding of Precision, Recall, and F Score Concepts

How to Deal with a Skewed Dataset in Machine Learning

towardsdatascience.com

Based on the f1 score, you need to choose your threshold probability.

1 is the perfect f1 score and 0 is the worst.

Anomaly Detection Algorithm

I will use a dataset from Andrew Ng’s machine learning course which has two training features. I am not using a real-world dataset for this article because this dataset is perfect for learning. It has only two features. In any real-world dataset, it is unlikely to have only two features.

The good thing about having two features is you can visualize the data which is great for learners. Feel free to download the dataset from this link and follow along:

Machine-Learning-With-Python/ex8data1.xlsx at master · rashida048/Machine-Learning-With-Python


Let’s start the mission!

First, import the necessary packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Import the dataset. This is an Excel dataset, in which the training data and the cross-validation data are stored in separate sheets. So, let’s load the training data first.

df = pd.read_excel('ex8data1.xlsx', sheet_name='X', header=None)
df.head()

Let’s plot column 0 against column 1.

plt.figure()
plt.scatter(df[0], df[1])
plt.show()

You probably know by looking at this graph which data are anomalous.

Check how many training examples are in this dataset:

m = len(df)

Calculate the mean for each feature. Here we have only two features: 0 and 1.

s = np.sum(df, axis=0)
mu = s/m
mu

Output:

0    14.112226
1    14.997711
dtype: float64

From the formula described in the ‘Formulas and Process’ section above, let’s calculate the variance:

vr = np.sum((df - mu)**2, axis=0)
variance = vr/m
variance

Output:

0    1.832631
1    1.709745
dtype: float64

Now put the variances on the diagonal of a matrix. As I explained in the ‘Formulas and Process’ section after the probability formula, that summation sign (Σ) actually stands for the diagonal matrix of variances.

var_dia = np.diag(variance)
var_dia

Output:

array([[1.83263141, 0.        ],
       [0.        , 1.70974533]])

Calculate the probability:

k = len(mu)
X = df - mu
p = 1/((2*np.pi)**(k/2)*(np.linalg.det(var_dia)**0.5))* np.exp(-0.5* np.sum(X @ np.linalg.pinv(var_dia) * X,axis=1))
p

The training part is done.

Let’s put all these calculations for probability into a function for future use.

def probability(df):
    s = np.sum(df, axis=0)
    m = len(df)
    mu = s/m
    vr = np.sum((df - mu)**2, axis=0)
    variance = vr/m
    var_dia = np.diag(variance)
    k = len(mu)
    X = df - mu
    p = 1/((2*np.pi)**(k/2)*(np.linalg.det(var_dia)**0.5))* np.exp(-0.5* np.sum(X @ np.linalg.pinv(var_dia) * X,axis=1))
    return p
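Note that the function above estimates the mean and variance from whatever data it is given. A common variant, which is not what this article does, estimates the parameters once from the training data and reuses them to score other data. A minimal sketch of that variant follows. The names `fit_gaussian` and `gaussian_density` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd


def fit_gaussian(train):
    """Estimate the per-feature mean and variance from the training data."""
    mu_hat = train.mean(axis=0)
    var_hat = ((train - mu_hat) ** 2).mean(axis=0)  # divides by m, as in the article
    return mu_hat, var_hat


def gaussian_density(data, mu_hat, var_hat):
    """Score `data` with parameters that were estimated elsewhere."""
    k = len(mu_hat)
    diff = data - mu_hat
    # With a diagonal covariance matrix the exponent is a sum over features
    exponent = -0.5 * ((diff ** 2) / var_hat).sum(axis=1)
    norm_const = 1 / ((2 * np.pi) ** (k / 2) * np.sqrt(np.prod(var_hat)))
    return norm_const * np.exp(exponent)


# Toy frames standing in for df and cvx (values are made up)
train_toy = pd.DataFrame([[1.0, 2.0], [2.0, 1.0], [1.5, 1.5], [2.5, 2.5]])
val_toy = pd.DataFrame([[1.8, 1.7], [9.0, 9.0]])

mu_hat, var_hat = fit_gaussian(train_toy)
p_val_toy = gaussian_density(val_toy, mu_hat, var_hat)
print(p_val_toy)  # the second row, far from the training data, gets a much lower value
```

Whether to fit on the training data or on the data being scored is a design choice. The rest of this article keeps the original `probability` function, so its outputs correspond to that function.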

The next step is to find out the threshold probability. If the probability is lower than the threshold probability, the example data is anomalous data. But we need to find out that threshold for our particular case.

For this step, we use cross-validation data and also the labels. In this dataset, we have the cross-validation data and also the labels in separate sheets.

For your case, you can simply keep a portion of your original data for cross-validation.
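One way to keep such a portion aside is to sample rows at random, as in this sketch. The 30% fraction, the seed, and the toy frame are arbitrary assumptions. Keep in mind that you still need ground-truth labels for the held-out rows in order to pick a threshold.

```python
import numpy as np
import pandas as pd

# Stand-in for your own data: 10 rows, 2 features (values are made up)
data_toy = pd.DataFrame(np.arange(20, dtype=float).reshape(10, 2))

# Keep a random 30% of the rows for cross-validation, the rest for training
cv_part = data_toy.sample(frac=0.3, random_state=0)
train_part = data_toy.drop(cv_part.index)

print(len(train_part), len(cv_part))  # 7 3
```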

Now import the cross-validation data and the labels:

cvx = pd.read_excel('ex8data1.xlsx', sheet_name='Xval', header=None)
cvx.head()

Here are the labels:

cvy = pd.read_excel('ex8data1.xlsx', sheet_name='y', header=None)
cvy.head()

The purpose of the cross-validation data is to calculate the threshold probability. We will then use that threshold probability to find the anomalous data in df.

Now call the probability function we defined before to find the probability for our cross-validation data ‘cvx’:

p1 = probability(cvx)

I will convert ‘cvy’ to a NumPy array just because I like working with arrays. DataFrames are also fine though.

y = np.array(cvy)

Output:

#Part of the array
array([[0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],

Here, a ‘y’ value of 0 means the example is normal, and a ‘y’ value of 1 means it is anomalous.

Now, how to select a threshold?

I do not want to check every probability in our list as a candidate threshold. That may be unnecessary. Let’s examine the probability values some more.

p.describe()

Output:

count    3.070000e+02
mean     5.378568e-02
std      1.928081e-02
min      1.800521e-30
25%      4.212979e-02
50%      5.935014e-02
75%      6.924909e-02
max      7.864731e-02
dtype: float64

As you can see in the output above, most of the probabilities are of a similar size and only a few are extremely small, so we do not have too many anomalous data. Anomalies sit at the low end, so there is no need to test the higher values as thresholds. I will start the range from the mean.

So, we will take a range of probabilities from the mean value and lower. We will check the f1 score for each probability of this range.

First, define a function to calculate the true positives, false positives, and false negatives:

def tpfpfn(ep, p):
    tp, fp, fn = 0, 0, 0
    for i in range(len(y)):
        if p[i] <= ep and y[i][0] == 1:
            tp += 1
        elif p[i] <= ep and y[i][0] == 0:
            fp += 1
        elif p[i] > ep and y[i][0] == 1:
            fn += 1
    return tp, fp, fn
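The same counts can also be computed without an explicit loop. The sketch below is an alternative using boolean masks, not the version used in the rest of the article. The name `tpfpfn_vectorized` is hypothetical, and it takes the labels as a third argument instead of reading the global `y`.

```python
import numpy as np


def tpfpfn_vectorized(ep, p, y):
    """Count true positives, false positives and false negatives with boolean masks."""
    p = np.asarray(p).ravel()
    y = np.asarray(y).ravel()
    predicted = p <= ep   # flagged as anomalous by the threshold
    actual = y == 1       # anomalous in the ground truth
    tp = int(np.sum(predicted & actual))
    fp = int(np.sum(predicted & ~actual))
    fn = int(np.sum(~predicted & actual))
    return tp, fp, fn


# Made-up probabilities and labels, only to exercise the function
p_toy = np.array([0.05, 0.001, 0.06, 0.002, 0.07])
y_toy = np.array([[0], [1], [0], [0], [1]])
print(tpfpfn_vectorized(0.01, p_toy, y_toy))  # (1, 1, 1)
```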

Make a list of the probabilities that are lower than or equal to the mean probability.

eps = [i for i in p1 if i <= p1.mean()]

Check the length of the list:

len(eps)

Output:

128

Define a function to calculate the ‘f1’ score as per the formula we discussed before:

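def f1(ep, p):
    # Pass p through to tpfpfn, which needs both arguments
    tp, fp, fn = tpfpfn(ep, p)
    if tp == 0:
        # No true positives: precision and recall are both 0, so report 0
        return 0.0
    prec = tp/(tp + fp)
    rec = tp/(tp + fn)
    return 2*prec*rec/(prec + rec)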

All the functions are ready!

Now calculate the f1 score for all the epsilon or the range of probability values we selected before.

f = []
for i in eps:
    f.append(f1(i, p1))
f

Output:

[0.16470588235294117,
 0.208955223880597,
 0.15384615384615385,
 0.3181818181818182,
 0.15555555555555556,
 0.125,
 0.56,
 0.13333333333333333,
 0.16867469879518074,
 0.12612612612612614,
 0.14583333333333331,
 0.22950819672131148,
 0.15053763440860213,
 0.16666666666666666,
 0.3888888888888889,
 0.12389380530973451,

This is a part of the f score list. The length should be 128. F scores range between 0 and 1, where 1 is the perfect f score. The higher the f1 score the better. So, we need to take the highest f score from the list of ‘f’ scores we just calculated.

Now, use the ‘argmax’ function to determine the index of the maximum f score value.

np.array(f).argmax()

Output:

127

And now use this index to get the threshold probability.

e = eps[127]
e

Output:

0.00014529639061630078
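The whole threshold search can also be wrapped in a single function. This is a compact sketch of the same procedure (candidates at or below the mean probability, best f1 wins, first maximum kept). The name `select_threshold` and the toy inputs are hypothetical.

```python
import numpy as np


def select_threshold(p_val, y_val):
    """Return (best_epsilon, best_f1) over candidates at or below the mean probability."""
    p_val = np.asarray(p_val).ravel()
    y_val = np.asarray(y_val).ravel()
    best_eps, best_f1 = None, -1.0
    for ep in p_val[p_val <= p_val.mean()]:
        predicted = p_val <= ep
        tp = np.sum(predicted & (y_val == 1))
        fp = np.sum(predicted & (y_val == 0))
        fn = np.sum(~predicted & (y_val == 1))
        if tp == 0:
            continue  # f1 is 0 without any true positive, so it cannot be the best
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        score = 2 * prec * rec / (prec + rec)
        if score > best_f1:
            best_eps, best_f1 = ep, score
    return best_eps, best_f1


# Made-up probabilities and labels, only to exercise the function
p_toy = np.array([0.06, 0.05, 0.0004, 0.07, 0.0001, 0.055, 0.002])
y_toy = np.array([0, 0, 1, 0, 1, 0, 0])
best_eps, best_f1 = select_threshold(p_toy, y_toy)
print(best_eps, best_f1)  # 0.0004 1.0
```

With the article's data you would call it as `select_threshold(p1, y)`.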

Find out the Anomalous Examples

We have the threshold probability. We can find out the labels of our training data from it.

If the probability value is lower than or equal to this threshold value, the data is anomalous; otherwise, it is normal. We will denote the normal and anomalous data as 0 and 1 respectively:

label = []
for i in range(len(df)):
    if p[i] <= e:
        label.append(1)
    else:
        label.append(0)
label

Output:

[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,

This is part of the label list.
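The same labels can be produced in one line by comparing the whole array against the threshold. The sketch below uses a made-up probability array and the threshold value found above.

```python
import numpy as np

# Made-up probabilities, only for illustration
p_toy = np.array([0.06, 0.0001, 0.05, 0.00002])
threshold = 0.00014529639061630078  # the threshold probability found above

# True where the probability is at or below the threshold, converted to 0/1
labels_toy = (p_toy <= threshold).astype(int)
print(labels_toy)  # [0 1 0 1]
```

With the article's variables this would be `(p <= e).astype(int)`.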

I will add these calculated labels in the training dataset above:

df['label'] = np.array(label)
df.head()

I plotted the data where the label is 1 in red color and where the label is zero in black. Here is the plot.
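The plotting code itself is not shown here. One way to draw such a plot is sketched below, assuming a frame laid out like df after the ‘label’ column was added (feature columns named 0 and 1). The function name `plot_labels` and the toy frame are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_labels(frame, label_col="label"):
    """Scatter column 0 against column 1, anomalies in red and normal points in black."""
    normal = frame[frame[label_col] == 0]
    anomalous = frame[frame[label_col] == 1]
    fig, ax = plt.subplots()
    ax.scatter(normal[0], normal[1], color="black", label="normal")
    ax.scatter(anomalous[0], anomalous[1], color="red", label="anomalous")
    ax.legend()
    return ax


# Made-up frame with the same layout as df after adding the 'label' column
toy = pd.DataFrame({0: [13.0, 14.5, 15.0, 23.0],
                    1: [15.0, 14.0, 15.5, 8.0],
                    "label": [0, 0, 0, 1]})
ax = plot_labels(toy)
plt.show()
```

With the article's data the call would be `plot_labels(df)`.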

Does it make sense?

It does, right? The data in red are clearly anomalous.

Conclusion

I tried to explain the process of developing an anomaly detection algorithm step by step. I did not leave any steps hidden here. I hope it is understandable. If you are having trouble understanding it just by reading, I suggest running every piece of code yourself in a notebook. That will make it very clear.

Please do not hesitate to share, if you are doing some cool projects using this algorithm.

Feel free to follow me on Twitter and like my Facebook page.