Jason Huynh

Photo by Kari Shea on Unsplash

I Tried to Automate Value Investing in Python and Failed. Here’s What I Learnt.

I’m a big fan of Warren Buffett. It’s not that I want to be as wealthy as he is. Instead, I really like his lifestyle. He lives a simple life, has a cushy job and is never worried about being fired, primarily because he’s the largest shareholder of Berkshire Hathaway. Most importantly, he has the capacity and the willingness to give his money away.

I would like to do something similar with my limited time on Earth. It doesn’t need to be quite as large as Berkshire Hathaway, but I’d like to build something that can support others and benefit society.

Below is a high-level overview of what I tried. I don’t go into detail about my code in this article, but if you’re interested, feel free to ask me.

After reading ‘The Warren Buffett Way’ by Robert G. Hagstrom, I had the idea that I could automate value investing. As a data analyst with a machine learning background (I’m not quite a data scientist), I felt that it could be possible.

The main issue with value investing is that it can be tedious to read annual reports. By that I mean, all of the information and data are in PDFs. Having to update Excel sheets manually can be a bit of a waste of time, especially if the company turns out to be a dud.

Since companies tend to write summaries of how their business functions, I didn’t really need to text mine paragraphs to better understand how a business works. Just reading the business section would have been sufficient.

However, I did need to PDF mine financial statements since I preferred to look at 10 years’ worth of financial data at once. Personally, I believe you can judge the performance of the management team better over a 10-year horizon than over a 2-year block.

But, before I go on, I need to point out that I live in Australia, where we don’t have the luxury of an SEC-filings-style API, and other APIs such as the Yahoo Finance API aren’t as accurate as I’d like them to be. Furthermore, annual report financial statements are backed by professional auditors, whereas API data usually is not.

First Issue — PDF Tables

The first issue with PDF mining is that the financial data tables provided aren’t really tables. Usually, they do start as actual tables in Excel, but once they are copied and pasted into a Word document, the lines are usually whited out before exporting as a PDF. For us as humans, this gives the illusion of a table, but for a computer this is nothing more than words separated by spaces.
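To make the "words separated by spaces" point concrete, here is a minimal sketch of how a single line of extracted text might be split back into a label and its numeric columns. The sample line and the `parse_row` and `to_number` helpers are hypothetical illustrations, not code from this project, and real statements have many more edge cases (multi-line labels, footnote markers, years in headings).

```python
import re

# Matches figures such as 1,234  (567)  12.5  -89
NUMBER = re.compile(r"\(?-?\d[\d,]*(?:\.\d+)?\)?")


def to_number(token):
    """Convert '1,234' or '(567)' to a float; parentheses mean negative."""
    negative = token.startswith("(") and token.endswith(")")
    value = float(token.strip("()").replace(",", ""))
    return -value if negative else value


def parse_row(line):
    """Split a text line into (label, list of trailing numeric columns)."""
    tokens = line.split()
    values = []
    # Walk backwards: trailing tokens that look like numbers are data columns.
    while tokens and NUMBER.fullmatch(tokens[-1]):
        values.append(to_number(tokens.pop()))
    values.reverse()
    return " ".join(tokens), values


label, values = parse_row("Revenue from contracts 1,234 (567) 12.5")
print(label, values)
```

This only works when each statement row arrives as one line of text, which is exactly what breaks down when an extractor puts each word in its own row.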

The first effort was to try Tabula, which is a Java-based program with a Python API that extracts ‘text’ from a PDF file. This is not an OCR (optical character recognition) solution, but I will address OCR later. Tabula works great when the financial tables have ‘lines’, but works horrendously when they have none. For example, when there are no ‘lines’, each word ends up in its own row.

The second effort was to use OCR, extract strings and put the strings into a pandas data frame. For OCR, I was using pytesseract. This worked quite well until I realised that OCR sometimes can’t read certain characters correctly, regardless of how high the DPI (dots per inch) is. For example, ‘1’ would become ‘i’ and vice versa. Ultimately, this becomes a painful data cleansing exercise.
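As an illustration of what that cleansing can look like, the sketch below swaps digit look-alikes back to digits, but only inside tokens that already look numeric. The look-alike map and the helper names are assumptions made for this example, not part of pytesseract, and the map is far from complete.

```python
import re

# Letters that OCR commonly confuses with digits (illustrative, incomplete).
LOOKALIKES = str.maketrans(
    {"i": "1", "I": "1", "l": "1", "O": "0", "o": "0", "S": "5"}
)

# A token is treated as numeric if it has at least one real digit and
# otherwise contains only look-alike letters or number punctuation.
NUMERIC_LIKE = re.compile(r"[\diIlOoS,.()\-]*\d[\diIlOoS,.()\-]*")


def fix_numeric_token(token):
    """Replace look-alike letters with digits inside numeric-looking tokens."""
    if NUMERIC_LIKE.fullmatch(token):
        return token.translate(LOOKALIKES)
    return token


def clean_line(line):
    """Apply the fix token by token, leaving ordinary words untouched."""
    return " ".join(fix_numeric_token(t) for t in line.split())


print(clean_line("Total liabilities i,2O4 (3l5)"))
```

The opposite error, a ‘1’ landing inside a word, is not handled here, and a figure misread entirely as letters (with no surviving digit) slips through. That is part of why this turns into a long tail of special cases.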

The third effort was to use computer vision. For machine learning nerds, I specifically used the cv2 package. My main weakness here is that I’m not that great at computer vision. In the end, I was able to detect a table but my computer vision skills are not where they need to be to scale this process across many PDFs.

The final effort was to use Google’s Document AI API. This is a combined OCR, text extraction and computer vision solution. Even though it performed the best, it still didn’t always give me the table I wanted. Sometimes it would just group data into a single row without any spaces in between words. Cleaning this data would have been very difficult. Ultimately, I am doubtful that I could create a ‘super’ script that would perfectly pull out financial data.

As a side note, I also tried Adobe’s OCR solution and, like Google Document AI, it performed quite well but not as well as I wanted it to.

Second Issue — Lack of Consistency

Among all of the efforts stated earlier, I found that using OCR to extract strings was the easiest to do. In general, it worked for all PDFs and produced the fewest errors. Not all characters could be read properly, but this is something I could live with.

However, a lack of consistency among annual reports was the biggest issue, and it made automation virtually impossible. I would have had to hard code every potential formatting inconsistency. For example, prior to 2020, Tesla was reporting its finances in millions but in the 2020 annual report, it was reporting in the billions. Another issue is that descriptions such as operating income, operating profit, EBIT etc. all vary between annual reports. These are small issues but can produce complicated code when they each need to be hard coded. Ultimately, this would lead me to create a Python package that would need constant maintenance, which is fairly unproductive for my purposes.
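A minimal sketch of the kind of hard coding this implies is shown below. The alias table, the unit multipliers and the `normalise` helper are all hypothetical, and treating operating income, operating profit and EBIT as one line item is a simplification made only for illustration.

```python
# Hypothetical alias table: reports use different names for the same line item.
LABEL_ALIASES = {
    "operating income": "operating_income",
    "operating profit": "operating_income",
    "ebit": "operating_income",
}

# Multipliers that convert reported figures to a common base unit (millions).
UNIT_TO_MILLIONS = {"thousands": 0.001, "millions": 1.0, "billions": 1000.0}


def normalise(label, value, unit):
    """Map a reported label to a canonical name and rescale the value to millions."""
    key = label.strip().lower()
    if key not in LABEL_ALIASES:
        # Every unseen wording needs a new hand-written entry.
        raise KeyError(f"Unknown label {label!r}: add it to LABEL_ALIASES")
    return LABEL_ALIASES[key], value * UNIT_TO_MILLIONS[unit]


print(normalise("Operating profit", 1.5, "billions"))
print(normalise("EBIT", 850, "millions"))
```

Each new company or reporting year tends to add entries to both tables, which is where the constant maintenance comes from.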

What I ended up doing

In the end, I took the easy route. Instead of automatically grabbing data out of annual reports with Python, I just set up a Google Form and quickly entered the data into it. I knew exactly which values I wanted, could easily discern ambiguous words and knew when the financial statements were trying to be ‘tricky’.

What surprised me was that the ‘manual’ method took no longer than a Python script would take to run. The reason is that OCR slows down the whole process; it’s a resource-heavy procedure, to say the least. For example, it took my cloud process several minutes to process a single PDF, from finding the financial statements automatically to producing a data frame.

Thereafter, I still used Python to pull data from Google Sheets, which was the landing point of my Google Form inputs, to manipulate the data and model intrinsic value. Ultimately, this is far easier than trying to make a program that automatically pulls data out of a PDF.
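The sketch below shows one way that last step could look. It is not the model used here: the CSV text stands in for an export of the Google Sheet, the column names and numbers are made up, and the valuation is a plain discounted cash flow with a constant-growth terminal value, chosen only because it is a common textbook approach.

```python
import csv
import io

# Stand-in for a CSV export of the Google Sheet behind the form (made-up numbers).
SHEET_CSV = """year,free_cash_flow
2019,100
2020,110
2021,121
"""


def load_cash_flows(csv_text):
    """Read (year, free_cash_flow) rows into a list of floats, oldest first."""
    rows = sorted(
        csv.DictReader(io.StringIO(csv_text)), key=lambda r: int(r["year"])
    )
    return [float(r["free_cash_flow"]) for r in rows]


def intrinsic_value(last_fcf, growth, discount, years=10, terminal_growth=0.02):
    """Discounted cash flow: explicit forecast plus a constant-growth terminal value."""
    if discount <= terminal_growth:
        raise ValueError("discount rate must exceed terminal growth")
    value, fcf = 0.0, last_fcf
    for t in range(1, years + 1):
        fcf *= 1 + growth                      # grow the cash flow one year
        value += fcf / (1 + discount) ** t     # discount it back to today
    terminal = fcf * (1 + terminal_growth) / (discount - terminal_growth)
    return value + terminal / (1 + discount) ** years


flows = load_cash_flows(SHEET_CSV)
print(round(intrinsic_value(flows[-1], growth=0.05, discount=0.10), 1))
```

The growth and discount rates are placeholder inputs. Because the form entries are already clean and consistently labelled, the Python side stays this small.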

Conclusions

At the moment, I don’t have my dream product that automatically pulls data out of annual reports and into a dashboard. But, after a year’s worth of experimentation, I can say for sure that I am definitely more agile in my value investing than I was at the start.
