Ke Gui

📈Python for finance series

Identifying Outliers — Part One

How to find and visualize outliers in your dataset by Pandas

Photo by Dave Gandy

Update 10/28/2020 : Winsorization added at the end of this article. It is one of the common ways to limit/remove extreme values in financial data.

Warning: There is no magic formula or Holy Grail here, though this series may open the door to a new world for you.

📈Python For Finance Series

  1. Identifying Outliers
  2. Identifying Outliers — Part Two
  3. Identifying Outliers — Part Three
  4. Stylized Facts
  5. Feature Engineering & Feature Selection
  6. Data Transformation
  7. Fractionally Differentiated Features
  8. Data Labelling
  9. Meta-labeling and Stacking

Pandas has quite a few handy methods for cleaning up messy data, such as dropna and drop_duplicates. However, it has no built-in method for finding and removing outliers, even though that is something we often need. Here I would like to show you how to do it, step by step.

The key to defining an outlier lies in the boundary we choose. Here I will give three different ways to define the boundary: the mean, the moving average, and the exponentially weighted moving average.
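The three boundary definitions can be sketched in pandas as follows. This is only an illustration on synthetic returns: the series `returns`, the look-back `window`, and the use of `span=window` for the exponential weighting are assumptions, not choices made in this series.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns stand in for real data (an assumption for illustration).
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0, 0.02, 500))

window = 21  # hypothetical look-back of about one trading month

# 1. Mean and std of the whole series: one fixed band for all dates.
global_mu, global_sigma = returns.mean(), returns.std()

# 2. Moving average: the band follows the last `window` observations.
rolling_mu = returns.rolling(window).mean()
rolling_sigma = returns.rolling(window).std()

# 3. Exponentially weighted moving average: recent observations weigh more.
ewm_mu = returns.ewm(span=window).mean()
ewm_sigma = returns.ewm(span=window).std()

print(global_mu, rolling_mu.iloc[-1], ewm_mu.iloc[-1])
```

The first approach gives a single pair of numbers, while the other two give one boundary per date, so the band can widen and narrow as volatility changes.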

1. Data preparation

Here I use Apple's stock price history and returns from Yahoo Finance (2000 through 2010) as an example; of course, you can use any data.

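import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
# The bare 'seaborn' style name was removed in recent Matplotlib releases;
# 'seaborn-v0_8' is its replacement (requires Matplotlib 3.6 or later).
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.dpi'] = 300
# auto_adjust=False keeps the 'Adj Close' column, which recent yfinance
# versions may drop by default.
df = yf.download('AAPL',
                 start='2000-01-01',
                 end='2010-12-31',
                 auto_adjust=False)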

As we only care about the returns, a new DataFrame (d1) is created to hold the adjusted price and returns.

d1 = pd.DataFrame(df['Adj Close'])
d1.rename(columns={'Adj Close':'adj_close'}, inplace=True)
d1['simple_rtn']=d1.adj_close.pct_change()
d1.head()
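For readers new to pct_change, the simple return is just today's price divided by yesterday's price, minus one. A minimal sketch on three made-up prices (the numbers are hypothetical) shows that the manual formula and pct_change agree:

```python
import pandas as pd

# Three made-up prices, for illustration only.
prices = pd.Series([100.0, 110.0, 99.0])

# Simple return by hand: P_t / P_{t-1} - 1
by_hand = prices / prices.shift(1) - 1

# The same thing with the built-in method
built_in = prices.pct_change()

# The first value is NaN because there is no previous price;
# the others are roughly +10% and -10%.
print(built_in.tolist())
```

This is also why the first row of simple_rtn is always NaN.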

2. Using mean and standard deviation as the boundary.

Calculate the mean and std of the simple_rtn:

d1_mean = d1['simple_rtn'].agg(['mean', 'std'])
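The mean and std define a band of mean ± n × std. The sketch below computes that band for n = 1, 2, 3 and reports the share of points falling outside it. The helper name `sigma_band` and the synthetic, normally distributed returns are assumptions made for illustration; real stock returns usually have fatter tails, so the shares will differ.

```python
import numpy as np
import pandas as pd

def sigma_band(returns, n_sigmas=3):
    """Return (lower, upper) = mean -/+ n_sigmas * std of a return series."""
    mu, sigma = returns.mean(), returns.std()
    return mu - n_sigmas * sigma, mu + n_sigmas * sigma

# Synthetic returns, for illustration only.
rng = np.random.default_rng(42)
returns = pd.Series(rng.normal(0.001, 0.02, 2500))

for n in (1, 2, 3):
    lower, upper = sigma_band(returns, n)
    share_outside = ((returns < lower) | (returns > upper)).mean()
    print(f"{n} std: [{lower:.4f}, {upper:.4f}], share outside = {share_outside:.3%}")
```

For normally distributed data, roughly 32%, 5% and 0.3% of points fall outside 1, 2 and 3 std respectively, which is why one std is far too tight to serve as an outlier boundary.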

If we use the mean plus or minus one std as the boundary, the result looks like this:

fig, ax = plt.subplots(figsize=(10,6))
d1['simple_rtn'].plot(label='simple_rtn', legend=True, ax=ax)
# The band is centered on the mean: mean - std and mean + std.
plt.axhline(y=d1_mean.loc['mean'], c='r', label='mean')
plt.axhline(y=d1_mean.loc['mean'] + d1_mean.loc['std'], c='c', linestyle='-.', label='mean + std')
plt.axhline(y=d1_mean.loc['mean'] - d1_mean.loc['std'], c='c', linestyle='-.', label='mean - std')
plt.legend(loc='lower right')

What happens if I use 3 times std instead?

Looks good! Now is the time to look for those outliers:

mu = d1_mean.loc['mean']
sigma = d1_mean.loc['std']
def get_outliers(row, mu=mu, sigma=sigma, n_sigmas=3):
    '''
    row: one row of the DataFrame (passed in by apply with axis=1)
    mu: mean
    sigma: std
    n_sigmas: number of std as boundary
    '''
    x = row['simple_rtn']

    # Flag returns outside mean +/- n_sigmas * std
    if (x > mu + n_sigmas*sigma) or (x < mu - n_sigmas*sigma):
        return 1
    else:
        return 0

After applying get_outliers to the stock returns, a new column is created:

d1['outlier'] = d1.apply(get_outliers, axis=1)
d1.head()

✍Tip!

# The above code snippet can be refactored as follows:
import numpy as np
cond = (d1['simple_rtn'] > mu + sigma * 3) | (d1['simple_rtn'] < mu - sigma * 3)
d1['outlier'] = np.where(cond, 1, 0)
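To check that a row-by-row rule and a vectorized condition give the same flags, here is a self-contained sketch on synthetic data. The frame `demo`, the helper `flag_row`, and the two planted extreme returns are assumptions for illustration; both versions use 3 std so that they are directly comparable.

```python
import numpy as np
import pandas as pd

# Synthetic returns with two planted extremes, for illustration only.
rng = np.random.default_rng(1)
demo = pd.DataFrame({"simple_rtn": rng.normal(0, 0.02, 1000)})
demo.loc[[10, 500], "simple_rtn"] = [0.25, -0.30]

mu, sigma, n_sigmas = demo["simple_rtn"].mean(), demo["simple_rtn"].std(), 3

def flag_row(row):
    """Row-wise rule: 1 if the return lies outside mean +/- n_sigmas * std."""
    x = row["simple_rtn"]
    return int(x > mu + n_sigmas * sigma or x < mu - n_sigmas * sigma)

# Row-by-row version (slow on large frames)
row_wise = demo.apply(flag_row, axis=1)

# Vectorized version: one boolean mask over the whole column
cond = (demo["simple_rtn"] > mu + n_sigmas * sigma) | (demo["simple_rtn"] < mu - n_sigmas * sigma)
vectorized = pd.Series(np.where(cond, 1, 0), index=demo.index)

print((row_wise == vectorized).all())
```

The vectorized form avoids calling a Python function once per row, which usually makes it much faster on long price histories.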

Let’s have a look at the outliers. We can check how many outliers we found by doing a value count.

d1.outlier.value_counts()

We found 30 outliers with a boundary of 3 times std. We can pull those outliers out into another DataFrame and show them in a graph:

outliers = d1.loc[d1['outlier'] == 1, ['simple_rtn']]
fig, ax = plt.subplots()
ax.plot(d1.index, d1.simple_rtn, 
        color='blue', label='Normal')
ax.scatter(outliers.index, outliers.simple_rtn, 
           color='red', label='Anomaly')
ax.set_title("Apple's stock returns")
ax.legend(loc='lower right')
plt.tight_layout()

plt.show()

In the plot above, the outliers are marked with red dots.

3. Winsorization

Winsorization is the process of replacing extreme values with less extreme ones, typically the nearest value inside a chosen percentile range. It is named after the engineer-turned-biostatistician Charles P. Winsor (1895–1951). The effect is the same as clipping in signal processing.

A typical strategy is to set all outliers to a specified percentile of the data; for example, a 90% winsorization would see all data below the 5th percentile set to the 5th percentile, and data above the 95th percentile set to the 95th percentile. It can be done in pandas with the clip() method.

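outlier_cutoff = 0.01
d = d1.copy()  # keep the original values for the comparison plot
# Clip every column to its own 1st and 99th percentiles
d1 = d1.pipe(lambda x: x.clip(lower=x.quantile(outlier_cutoff),
                              upper=x.quantile(1-outlier_cutoff),
                              axis=1))
d1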
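The effect of clipping at percentiles is easier to see on a tiny example. The sketch below uses ten made-up numbers and a 10% cutoff on each side; both the values and the cutoff are assumptions chosen so the result can be read at a glance.

```python
import pandas as pd

# Ten made-up values with one extreme at each end, for illustration only.
values = pd.Series([-50.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 90.0])

cutoff = 0.1
lower = values.quantile(cutoff)        # 10th percentile
upper = values.quantile(1 - cutoff)    # 90th percentile

# Values outside [lower, upper] are pulled back to the nearest bound;
# everything in between is left untouched.
winsorized = values.clip(lower=lower, upper=upper)

print(pd.DataFrame({"original": values, "winsorized": winsorized}))
```

Only the first and last values change, and the series keeps its length. Note that pandas interpolates between neighbouring data points when computing quantiles, so the bounds need not be values that occur in the data.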

Note that the shape of the DataFrame remains the same. With a cutoff of 0.01, values below the 1st percentile are set to the 1st percentile, and values above the 99th percentile are set to the 99th percentile. We can visualize the difference in a plot.

# d holds the returns before winsorization (a copy of d1 taken before clipping)
fig, ax = plt.subplots()
ax.plot(d.index, d.simple_rtn, 
        color='red', label='Original')
ax.plot(d1.index, d1.simple_rtn, 
        color='blue', label='Winsorized')
ax.set_title("Stock returns: original vs. winsorized")
ax.legend(loc='lower right');

The reason I prefer Winsorization is that no rows are dropped, so no information is removed by accident when your machine learning model uses more than one feature alongside the simple return.

In the next post, I will show you how to use the moving average and moving standard deviation as the boundary.
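A small sketch makes the difference between dropping and capping concrete. The two-column frame, the `volume` feature, and the fixed threshold `bound` are all hypothetical; a fixed threshold is used instead of percentiles only to keep the example short.

```python
import pandas as pd

features = pd.DataFrame({
    "simple_rtn": [0.01, -0.02, 0.45, 0.00, -0.01],  # 0.45 is a planted extreme
    "volume":     [100, 120, 900, 110, 105],          # hypothetical second feature
})

bound = 0.10  # hypothetical fixed threshold, for illustration only

# Option 1: drop rows with an extreme return.
# The volume observed on that day is thrown away as well.
dropped = features[features["simple_rtn"].abs() <= bound]

# Option 2: cap only the return column.
# Every row, and every value of the other feature, survives.
capped = features.copy()
capped["simple_rtn"] = capped["simple_rtn"].clip(lower=-bound, upper=bound)

print(len(features), len(dropped), len(capped))  # 5 4 5
```

After dropping, the day with a volume of 900 is gone entirely; after capping, it is still there with its return limited to 0.10.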

Happy learning, happy coding!
