Mr. Q

Summary

The provided content outlines an introductory guide to using Python, particularly the Pandas library, for financial analysis, with a focus on data manipulation and visualization.

Abstract

The article is the first in a series aimed at financial analysts learning to apply Python tools in their work. It introduces the Python packages Pandas, Matplotlib, and TA, emphasizing their utility in financial data analysis. The author explains the concept of DataFrames in Pandas, drawing parallels with Excel to facilitate understanding for analysts familiar with spreadsheets. The tutorial covers creating DataFrames, importing data from CSV files, previewing data, handling missing values, and removing duplicates. Practical examples include retrieving financial data from SimFin and using Jupyter Notebooks (specifically Google Colab) for code execution. The article also provides links to additional resources and the author's shared code and files for hands-on learning.

Opinions

  • The author believes that Python, especially with the Pandas package, offers more powerful data manipulation capabilities than traditional Excel.
  • Pandas' DataFrame is presented as a superior alternative to Excel worksheets for financial analysts, with the ability to handle larger datasets and more complex operations.
  • The use of Jupyter Notebooks, specifically Google Colab, is recommended for their ease of use and accessibility for beginners.
  • The author suggests that the skills and tools covered in the series are essential for financial analysts to stay competitive in the industry.
  • The author values hands-on practice and encourages readers to follow along with the provided code and datasets.
  • There is an emphasis on the importance of data visualization in financial analysis, with Matplotlib being a key tool for this purpose.
  • The author acknowledges the challenge of dealing with missing data and suggests practical methods for handling such cases.

Python for Financial Analysis Series — Python Tools Day 1

Highlight:

  • This article gives an overview of popular Python packages such as Pandas, Matplotlib and TA and how to use them.

In our last Python learning phase, Python Core, we created a simple stock recommendation program and kept improving the scripts while learning the key Python features and concepts. In the second phase, Python Tools, we will start looking into popular Python packages such as Pandas, Matplotlib and TA (a simple technical indicator package).

If you need to refresh your knowledge of Python Core, please feel free to check out the following article series.

Python Tools is our last compulsory phase before we can effectively apply Python to our daily work. It is a six-day sub-series that covers the essential knowledge about the Python packages most used by financial analysts.

Day 1: Pandas I (Introduction to DataFrame)

Day 2: Pandas II (Data Retrieval)

Day 3: Pandas III (Column based computations + filtering)

Day 4: Pandas IV (Aggregation + working with multiple DataFrames)

Day 5: Pandas Series + TA (Technical Indicator Package)

Day 6: Matplotlib (Data visualization)

Pandas is one of the most popular Python packages for financial analysts. It essentially gives us Excel in Python (and obviously much more than traditional Excel), which is why we will spend most of the days learning Pandas. We will extend Pandas with the TA package to get a peek at technical analysis too. Finally, seeing the data is important, so on Day 6 we will use one of the most widely used data visualization packages, Matplotlib.

IMPORTANT UPDATE

Since Microsoft Azure Notebooks has moved to a GitHub-driven model, which may be challenging for a beginner to start with, I have put all the material on Google Colab. It is a very similar Jupyter Notebook environment and you may check it out by clicking here.

All my code and files are shared, and you may click here to open them.

Let’s start

To follow the example code, let’s create a new notebook under the same project and call it “Python Tools”. Please check Python Core Day 1 if you haven’t yet seen how to create a new notebook with Microsoft Azure Notebooks.

As I said, Pandas essentially provides us with Excel in Python. In Excel, one of the most important components is the worksheet, which we usually treat as a data table. In Pandas, the concept is very similar: we have the DataFrame, and we can consider a DataFrame a data table too.

DataFrame Key Parts

As financial analysts, Excel is probably our best friend so far. That is why, when I explain Pandas’ DataFrame, I usually link the key concepts to Excel, which has been working quite well. We will do the same here.

Have a look at an Excel worksheet. What we have here are mainly two things: rows and columns. Rows are referenced from number one, and columns are referenced from letter “A” (there is also a setting to switch to number referencing). To locate a cell, we combine the column letter and the row number, such as B2. We usually add row headers and column headers when we create a data table.

Let’s manually create a Pandas DataFrame and have a look. In the first cell of the new notebook “Python Tools”, enter the following code and run it.
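A minimal version of that cell might look like this (the prices and currencies below are illustrative placeholders):

    import pandas as pd

    # A nested list: the outer list holds the rows, the inner lists hold the column values
    data = [[100, 'USD'], [200, 'USD'], [300, 'HKD']]

    # Create a DataFrame from the nested list
    df = pd.DataFrame(data)

    # The last line of a cell is displayed automatically by Jupyter, so no print() is needed
    df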

We first imported Pandas and gave it the short name “pd”, which is the standard alias you will find almost everywhere online. Second, we created a nested Python list, which is similar to a table: the outer list contains the rows and the inner lists store the column values. Third, we used the nested list to create a DataFrame object “df” with the class “pd.DataFrame”. Finally, we displayed it. One trick to learn here is that the last line of a cell is automatically displayed nicely by Jupyter, so we do not need to use the function “print”. But notice that only the last line will be displayed.

As we can see, the DataFrame looks almost the same as an Excel table. Probably the only difference is that, by default, a DataFrame references both rows and columns starting from number zero. In Excel, we usually add row headers and column headers because they are easier to read and to use. We can do the same thing with a DataFrame: we call the row headers the “index” and the column headers the “columns”. Try the following code.
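As a sketch, with three illustrative tickers and made-up prices standing in for the Python Core stocks:

    import pandas as pd

    # Each inner list holds one stock's price and currency (values are made up for illustration)
    data = [[135.0, 'USD'], [2750.0, 'USD'], [310.0, 'HKD']]

    # This time we pass row headers (index) and column headers (columns)
    df = pd.DataFrame(data,
                      index=['AAPL', 'GOOG', '0700.HK'],
                      columns=['Price', 'Currency'])
    df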

We took the data for the three stocks we used in Python Core and created a nested list storing their prices and currencies. The list was then used to create the DataFrame. However, this time we specified the parameters “index” (the row headers) and “columns” (the column headers). Not only is this easier to read, it also makes it much easier to retrieve data from the table, as we will see in the next session.

Read data to DataFrame

In real life, we are more likely to have the data ready to use in data files. They can be CSV files (comma-separated values) or Excel sheets. Luckily, Pandas provides functions to create a DataFrame by reading those files.

To try it out, let’s prepare a CSV file with financial data first. One interesting website I found is “SimFin” (https://simfin.com/). We can download historical financial data for free, and it is good enough for learning purposes.

Go to the bulk data download page (https://simfin.com/data/bulk) and download the Trailing Twelve Months Income Statement for US companies as a CSV file. Trailing Twelve Months numbers are good because we do not need to worry about company reporting frequency and can compare the numbers directly.

Now upload the file to our project in Microsoft Azure Notebooks so that we can use it in our code. Go back to the project page. On the right, there is an up-arrow icon button. By clicking on it, we can choose to upload “From Computer”. Follow the wizard and we should see the CSV file appear in the project.

Once it’s uploaded, come back to the notebook and create a new cell below to try the following code.
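A sketch of that cell, assuming the downloaded file was saved as “us-income-ttm.csv” (substitute your actual file name):

    import pandas as pd

    # SimFin bulk files separate columns with a semicolon rather than a comma
    df = pd.read_csv('us-income-ttm.csv', sep=';')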

We used the function “read_csv” to load the data from the file into a DataFrame, stored in the variable “df”; we will view it in a minute. One thing to notice is that we used the parameter “sep” to indicate that, within this CSV file, columns are separated by a semicolon “;” (we can tell by opening the downloaded file in Windows Notepad). By default, the separator is a comma “,”.

Remember using TAB to explore the functions? If we type “pd.read” and then press TAB, we will see many functions starting with the keyword “read”. They are for different types of files, including “read_excel”.

Preview the DataFrame

The data table is usually large. We could try to display the whole table by putting the variable “df” at the end of the cell. However, usually we only need to preview the table to understand the data. Try the following code in new cells.
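For example, with each expression in its own cell:

    # In one cell: preview the first five rows (the default)
    df.head()

    # In another cell: preview the last three rows by passing in a number
    df.tail(3)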

We have the functions “head” and “tail” to preview the first and last several rows. By default, they show five rows, but we can control this by passing in a number.

See all columns

Notice the “…” between columns when we scroll to the right. This is because we have many columns, and Pandas avoids displaying too many columns or rows. Remember that we specified the parameter “columns” when we manually created the DataFrame; we can retrieve the column headers by using the attribute of the same name, “columns”.
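For example:

    # The column headers, returned as a pandas Index object
    df.columns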

It is good that we can see all the columns, but we also find that the attribute “columns” is not a normal Python list. To convert it into a Python list, we can do the following, using the function “tolist”. Almost every part of a DataFrame can be converted into a Python list with the function “tolist”.
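For example:

    # A chained call: take the column headers, then convert them to a Python list
    df.columns.tolist()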

In Python, especially with Pandas DataFrames, we tend to use many chained function calls, like what we did above. It is quite intuitive, as we read the operations in order and think of them being done step by step from left to right.

One more trick to learn about Python lists is that we can slice a list by using the colon “:”. For example, if we want to see the second to the fifth element of a list, we do the following. Remember that in Python everything starts from zero.
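Here we use the list of column names from the previous step (any Python list behaves the same way):

    cols = df.columns.tolist()

    # Positions 1 to 4, i.e. the second to the fifth element
    cols[1:5]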

If we want the slice to start from the beginning of the list, we can leave the first part blank.
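For example, on the same “cols” list:

    # From the beginning of the list up to (but not including) position 5
    cols[:5]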

We can count from the end of the list too. Remember that we can use a negative number to specify an element from the end of the list (rather than from the beginning).
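Again on the same “cols” list:

    # The last three elements of the list
    cols[-3:]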

Know the data table

Sometimes we need to know the size of the table, and the following are the most common ways.
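For example:

    # Run these in separate cells (only the last line of a cell is displayed)
    df.shape      # a tuple: (number of rows, number of columns)
    df.shape[0]   # the number of rows
    df.shape[1]   # the number of columns
    len(df)       # also the number of rows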

The attribute “shape” of a DataFrame returns a tuple (very similar to a list, except that it is read-only). The first element is the number of rows and the second is the number of columns. We can retrieve values from the tuple in the same way as from a list, i.e. by using square brackets “[]”. Finally, we can also apply the function “len” to the DataFrame to get the length of the table, i.e. the number of rows.

If the data does not come from a well-prepared data source, we may have duplicated rows. Pandas provides a good function to handle this.
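A minimal sketch, using “df_unique” as an illustrative name for the deduplicated table:

    # Build a new DataFrame with duplicated rows removed
    df_unique = df.drop_duplicates()

    # If both lengths match, the original table had no duplicated rows
    len(df), len(df_unique)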

Very straightforward: we used the function “drop_duplicates” to remove any potential duplicates and generate a new DataFrame with unique rows. We also verified that our data does not have duplicates, because both tables have the same length.

How about data availability? In a DataFrame, unavailable data is usually shown as a null value, NaN. When companies do not report certain numbers, data providers may use null values. To know how many rows have actual values, we can use the function “count”.
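For example:

    # Number of non-null (non-NaN) values in each column
    df.count()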

As we can see, some columns come with null values, because their count is less than the size of the table, 45230. For some financial items, such as Abnormal Gains (Losses), we can assume a null value means zero; we will use the function “fillna” in our next session to clean those columns. However, for some columns, such as Shares (Basic), we cannot simply replace nulls with zero and need to handle them in other ways; we may need to manually find other data sources to fill the gap, or simply remove the item from our model.

Good! That is all for the basics. As we can see, a DataFrame is a data table, like the Excel worksheets we are already familiar with. We have seen some useful functions to view the table, such as “head”, “tail”, “count” and “drop_duplicates”.

In the next session, we will look into basic data retrievals, i.e. getting the values from the table.

Data Science
Finance
Python
Programming