avatar(Joseph) Cho Hang Ng

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

3580

Abstract

in HTML format. All elements are systemically arranged in classes.</p><figure id="0363"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*4xFcpZD2rbt6qz6SQYQFCw.png"><figcaption></figcaption></figure><h1 id="4555">Data Manipulation</h1><p id="b16e">But how do know which classes the relevant data are stored in?</p><p id="d845">After a few searches, we know that they are stored in“div”, we can create an empty list and use a for loop to find all elements and append them to the list.</p><div id="69a5"><pre>ls= [] # <span class="hljs-keyword">Create</span> empty list <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> soup_is.find_all(‘div’): #Find <span class="hljs-keyword">all</span> data structure that <span class="hljs-keyword">is</span> ‘div’ ls.append(l.<span class="hljs-keyword">string</span>) # add <span class="hljs-keyword">each</span> <span class="hljs-keyword">element</span> one <span class="hljs-keyword">by</span> one <span class="hljs-keyword">to</span> the list

ls = [<span class="hljs-built_in">e</span> <span class="hljs-keyword">for</span> <span class="hljs-built_in">e</span> <span class="hljs-keyword">in</span> ls <span class="hljs-keyword">if</span> <span class="hljs-built_in">e</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> (‘Operating Expenses’,’Non-recurring Events’)] # <span class="hljs-keyword">Exclude</span> those columns</pre></div><p id="87bc">You will find that there are a lot of “none” elements in ls because not all “div” has an element. We just need to filter those out.</p><blockquote id="63e5"><p><code><i>new_ls = list(filter(None,ls))</i></code></p></blockquote><p id="3725">And now it looks like this.</p><figure id="6c63"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*mDVpyF0yN-VL75nabV4KNw.png"><figcaption></figcaption></figure><p id="6595">If we take a step further and start reading the list starting in the 12th position.</p><blockquote id="6815"><p><code>new_ls = new_ls[12:]</code></p></blockquote><p id="b0db">Well, now we have a list. But how do we turn it into a data frame? First, we need to iterate 6 items at a time and store them in tuples. However, we want a list so that the pandas library can read it into a data frame.</p><blockquote id="925b"><p><code>is_data = list(zip(*[iter(new_ls)]*6))</code></p></blockquote><figure id="ba2b"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*_q3gV3uNyc8wFUveCEhSyA.png"><figcaption></figcaption></figure><p id="2d72">Perfect, that is exactly what we want. Now, we just have to read it into a data frame.</p><blockquote id="f986"><p><code>Income_st = pd.DataFrame(is_data[0:])</code></p></blockquote><figure id="27f9"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*wOpKI5EjO2LRmvs3srF8KQ.png"><figcaption></figcaption></figure><h1 id="399f">Data Cleaning</h1><p id="7315">Sweet. It is almost done. We just need to read the first row as the column and the first column as the row index. Here are some clean-up.</p><div id="0e5b"><pre>Income_st.<span class="hljs-keyword">columns</span> = Income_st.iloc[<span class="hljs-number">0</span>] # <span class="hljs-type">Name</span> <span class="hljs-keyword">columns</span> <span class="hljs-keyword">to</span> first <span class="hljs-keyword">row</span> <span class="hljs-keyword">of</span> dataframe</pre></div><div id="5d48"><pre><span class="hljs-attr">Income_st</span> = Income_st.iloc[<span class="hljs-number">1</span>:,] <span class="hljs-comment"># start to read 1st row</span><

Options

/pre></div><div id="b6b9"><pre><span class="hljs-attr">Income_st</span> = Income_st.T <span class="hljs-comment"># transpose dataframe</span></pre></div><div id="e5a3"><pre>Income_st.<span class="hljs-keyword">columns</span> = Income_st.iloc[<span class="hljs-number">0</span>] #<span class="hljs-type">Name</span> <span class="hljs-keyword">columns</span> <span class="hljs-keyword">to</span> first <span class="hljs-keyword">row</span> <span class="hljs-keyword">of</span> dataframe</pre></div><div id="5e12"><pre>Income_st.<span class="hljs-keyword">drop</span>(Income_st.<span class="hljs-keyword">index</span>[<span class="hljs-number">0</span>],inplace=<span class="hljs-keyword">True</span>) #<span class="hljs-keyword">Drop</span> first <span class="hljs-keyword">index</span> <span class="hljs-keyword">row</span></pre></div><div id="facd"><pre>Income_st.<span class="hljs-keyword">index</span>.<span class="hljs-keyword">name</span> = ‘’ # Remove the <span class="hljs-keyword">index</span> <span class="hljs-keyword">name</span></pre></div><div id="644b"><pre>Income_st.<span class="hljs-keyword">rename</span>(<span class="hljs-keyword">index</span>={‘ttm’: ‘<span class="hljs-number">12</span>/<span class="hljs-number">31</span>/<span class="hljs-number">2019</span>’},inplace=<span class="hljs-keyword">True</span>) #<span class="hljs-keyword">Rename</span> ttm <span class="hljs-keyword">in</span> <span class="hljs-keyword">index</span> <span class="hljs-keyword">columns</span> <span class="hljs-keyword">to</span> <span class="hljs-keyword">end</span> <span class="hljs-keyword">of</span> the year</pre></div><div id="aa81"><pre>Income_st = Income_st[Income_st.<span class="hljs-built_in">columns</span>[:-<span class="hljs-number">5</span>]] # <span class="hljs-built_in">remove</span> <span class="hljs-built_in">last</span> <span class="hljs-number">5</span> irrelevant <span class="hljs-built_in">columns</span></pre></div><p id="cc8f">After using the same techniques for the Income statement, balance sheet and cash flow, your Data Frames should look like the following.</p><figure id="c149"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*RPI76f9YR04JMcDr3At3zA.png"><figcaption>Income Statement</figcaption></figure><figure id="e038"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*TGyyywxLoEysrThe8bx_sQ.png"><figcaption>Balance Sheet</figcaption></figure><figure id="54f6"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*ivjlHjcCvaD_lvVTwv22fw.png"><figcaption>Cash flow statement</figcaption></figure><p id="2924">After doing a transpose on the DataFrames, DateTime is turned into a row index and the features become column names.</p><p id="92a5">Here are some afterthought questions:</p><ol><li>How are the features correlated to the stock price of a company? How do you find out if they are correlated?</li><li>If so, which time period of stock price are related to the features from our financial statements?</li><li>What else can you do with the extracted data on developing algorithmic trading model?</li></ol><p id="1f55">Feel free to leave your answers and comments below at the response to check if you can come up with some unique answers.</p><p id="6a81">Thank you for reading this article, feel free to share it if you find it useful. <a href="https://readmedium.com/replace-traditional-accounting-analysis-with-python-advanced-dupont-9b24f7719ee0">Here</a> is another article on utilizing the new DataFrame and do further financial accounting analysis with Python.</p></article></body>

Web scraping for financial statements with Python — 1

I replicated an Excel-based accounting work I used to spend hours on and this task can be done in seconds now.

Background

There are not many data science use cases for accounting. However, accounting jobs are one of the tedious categories that need to be automated. One of the tasks I used to do is to get financial statements for individual companies so that I can analyze their performance against the industry. Basically, I would search their names on Yahoo Finance and gather the key indicators on Excel.

Manipulating web data can be tricky sometimes especially when the website gets updated, but mastering these few steps will save you a tremendous amount of time in the future. Here is an example of getting financial statements from Yahoo Finance using Python.

Import libraries

Urlib.request is an open source library that parse content from a webpage. When you call that, you basically ask the website to get the data from that website. Here is more information if interested.

https://docs.python.org/3/library/urllib.request.html

Another library is called Beautiful Soup that makes reading data stored in XML format so much easier. XML is a format similar to HTML that has store values between tags. It looks kinda messy if you open it. Just like when you get the source code of a webpage.

import pandas as pd
from bs4 import BeautifulSoup
import urllib.request as ur

Processing

Here is a simple trick you can flexibly adjust the stock symbol and plug it into the URL link. It will come in handy later if you want to extract hundreds of company’s financial statements.

# Enter a stock symbol
index= ‘MSFT’
# URL link 
url_is = ‘https://finance.yahoo.com/quote/' + index + ‘/financials?p=’ + index
url_bs = ‘https://finance.yahoo.com/quote/' + index +’/balance-sheet?p=’ + index
url_cf = ‘https://finance.yahoo.com/quote/' + index + ‘/cash-flow?p=’+ index

Now we have the URL link saved. If you manually open them on a Web browser, it will look like this.

Read the URL

Next, we just need to open the link and read it into a proper format called lxml. Simple enough.

read_data = ur.urlopen(url_is).read() 
soup_is= BeautifulSoup(read_data,’lxml’)

Well, if you open soup_is, it will look like a mess because the elements were originally in HTML format. All elements are systemically arranged in classes.

Data Manipulation

But how do know which classes the relevant data are stored in?

After a few searches, we know that they are stored in“div”, we can create an empty list and use a for loop to find all elements and append them to the list.

ls= [] # Create empty list
for l in soup_is.find_all(‘div’): 
  #Find all data structure that is ‘div’
  ls.append(l.string) # add each element one by one to the list
 
ls = [e for e in ls if e not in (‘Operating Expenses’,’Non-recurring Events’)] # Exclude those columns

You will find that there are a lot of “none” elements in ls because not all “div” has an element. We just need to filter those out.

new_ls = list(filter(None,ls))

And now it looks like this.

If we take a step further and start reading the list starting in the 12th position.

new_ls = new_ls[12:]

Well, now we have a list. But how do we turn it into a data frame? First, we need to iterate 6 items at a time and store them in tuples. However, we want a list so that the pandas library can read it into a data frame.

is_data = list(zip(*[iter(new_ls)]*6))

Perfect, that is exactly what we want. Now, we just have to read it into a data frame.

Income_st = pd.DataFrame(is_data[0:])

Data Cleaning

Sweet. It is almost done. We just need to read the first row as the column and the first column as the row index. Here are some clean-up.

Income_st.columns = Income_st.iloc[0] # Name columns to first row of dataframe
Income_st = Income_st.iloc[1:,] # start to read 1st row
Income_st = Income_st.T # transpose dataframe
Income_st.columns = Income_st.iloc[0] #Name columns to first row of dataframe
Income_st.drop(Income_st.index[0],inplace=True) #Drop first index row
Income_st.index.name = ‘’ # Remove the index name
Income_st.rename(index={‘ttm’: ‘12/31/2019’},inplace=True) #Rename ttm in index columns to end of the year
Income_st = Income_st[Income_st.columns[:-5]] # remove last 5 irrelevant columns

After using the same techniques for the Income statement, balance sheet and cash flow, your Data Frames should look like the following.

Income Statement
Balance Sheet
Cash flow statement

After doing a transpose on the DataFrames, DateTime is turned into a row index and the features become column names.

Here are some afterthought questions:

  1. How are the features correlated to the stock price of a company? How do you find out if they are correlated?
  2. If so, which time period of stock price are related to the features from our financial statements?
  3. What else can you do with the extracted data on developing algorithmic trading model?

Feel free to leave your answers and comments below at the response to check if you can come up with some unique answers.

Thank you for reading this article, feel free to share it if you find it useful. Here is another article on utilizing the new DataFrame and do further financial accounting analysis with Python.

Data Science
Machine Learning
Accounting
Web Scraping
Financial Analysis
Recommended from ReadMedium