Scollay Petry



Accelerate Your Stock Market Modelling, Reporting & Development with Pandas

Experience 10x faster development with pandas: 89% less memory usage, 98% faster disk reads, and 72% less space.

Image by Kevin Butz | Unsplash

I’ve long been interested in the financial markets, trading, and investing. Over the years I’ve toyed with stock charting and technical analysis, usually with various online charting, stock analysis, and portfolio building services.

While these services provided scripting “languages” to allow me to customize whatever analysis I was doing, they all had limitations that prevented me from doing exactly what I wanted to do. Or, they were more black box in nature, which hindered my understanding of how metrics were calculated and how stocks were selected. Or sadly, in the case of Quantopian, the service was discontinued, and while they generously left behind their open-source engine Zipline, it has since fallen badly out of date.

Last year, with a lot more downtime and home time than expected, and with a few years of casual Python programming under my belt, I decided to build a backtesting and trading system that would suit my needs. Then, if I happened on an interesting book or white paper with a trading idea, I’d have no limits to testing because I controlled the code, the market universe, and the computing resources.

Frustration and Little Progress

After months of “organic”, ad-hoc data gathering and testing, I realized that I lacked even the most basic data processing capabilities and the infrastructure needed to efficiently test my trading and portfolio building ideas on a broad range of stocks and markets.

Image by Aleksandrs Samuilovs | Dreamstime

So when I wanted to push the envelope with some analysis, my computer pushed back, and in trying to “fix” problems, I just created more of a mess! While it’s a lot of fun imagining algorithms to do this or that, it’s impossible to develop models effectively if the data behind them isn’t organized and memory-optimized, or can’t be updated, accessed, processed, and tested quickly.

I had also fallen into the rabbit hole that is “the Internet”, which is an amazing tool for learning and exchanging information, but it’s also a challenging one because you do have to make sure you’re getting the best and most accurate information.

For example, Stack Overflow is a programmer’s dream, especially when one is learning. But it’s also littered with years-old answers that no longer work because Python and Pandas evolve quickly. Or, because Python is such a dynamic language, radically different approaches can be taken to solve a problem, and it’s really up to you, dear programmer, to determine which is optimal for you! Only with rigorous, timed testing can one arrive at an “ideal” solution.

Problems and Solutions

When I paused and took stock of the situation, I discovered at least four productivity killers that were holding me back. I took each issue and made it a research project, tested various libraries and solutions, and developed what I think are reasonable approaches to each. I’ve written an article for each, and I’ve posted sample scripts in my GitHub repo (https://github.com/scollay/caffeinated-pandas).

1. Running out of memory. I have a MacBook Pro with 16 GB of RAM, but I found that loading just 3 GB of stock quotes data from disk sent my computer into a tailspin: mouse movements became jittery, the fan was blowing like mad, and processing slowed down considerably. It turns out that Pandas is a bit of a memory hog in its default state, and a DataFrame can easily consume way more memory than what’s read from disk. Fortunately, there’s an easy fix that can “squeeze” and reduce Pandas memory consumption by 89% (seriously!) in seconds. The article is here: https://readmedium.com/squeezing-pandas-89-less-ram-consumption-4d91a0eb9c08

2. Disk reading and writing taking way too long. Stock quotes and other data are typically offered for download in a comma-delimited file, or “.csv file”. Sometimes they are zipped to make them smaller. After download, though, these are among the worst formats for data storage and retrieval. There’s a better way to read and write files that will cut read and write times from minutes to just 2–3 seconds and reduce disk space by 65–84%. A full battery of tests to help you find the best format for your needs is here: https://readmedium.com/storing-pandas-98-faster-disk-reads-and-72-less-space-208e2e2be8bb

3. Computer’s full processing power not being used. A Python program will, by default, use only one processor core. If you’ve got a “Quad Core” like I do, it seems a shame to not be using the full capacity of the computer when enhancing data with technical indicators and machine learning features. I’ve created an easy way to use all of your cores with just a single line of code using standard Python. The method is here: https://readmedium.com/multiprocessing-pandas-46-95-faster-dataframe-enhancements-c65ef29f03b1

4. Taking too long to develop, iterate, and test code. Running a model requires an entire dataset as input to get to the final conclusion(s). But if you’re in developer/programming “mode” and constantly testing, tweaking, and re-running using an entire dataset, it’s probably taking you way too long to make progress. Develop and test instead with this simple sampling trick: https://readmedium.com/processing-pandas-10x-faster-coding-and-iteration-with-rough-samples-78b75b7d5b0b
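
To give a flavor of the sampling idea in item 4, here is a minimal sketch. It is not necessarily the approach taken in the linked article. It develops against a random subset of tickers while keeping every row for the chosen tickers, so per-stock calculations such as moving averages still see complete price histories. The column names (`ticker`, `date`, `close`), the toy data, and the helper `sample_tickers` are all assumptions made up for this example.

```python
import numpy as np
import pandas as pd


def sample_tickers(df, n_tickers, ticker_col="ticker", seed=42):
    """Return all rows for a random subset of tickers (hypothetical helper)."""
    tickers = df[ticker_col].drop_duplicates().to_numpy()
    rng = np.random.default_rng(seed)  # fixed seed: the same sample on every run
    chosen = rng.choice(tickers, size=min(n_tickers, len(tickers)), replace=False)
    return df[df[ticker_col].isin(chosen)].reset_index(drop=True)


# Toy data standing in for a real quotes table: 100 tickers x 250 trading days.
dates = pd.bdate_range("2021-01-04", periods=250)
quotes = pd.DataFrame({
    "ticker": np.repeat([f"T{i:03d}" for i in range(100)], len(dates)),
    "date": np.tile(dates, 100),
    "close": np.random.default_rng(0).uniform(5, 500, size=100 * len(dates)),
})

dev_sample = sample_tickers(quotes, n_tickers=10)
print(len(quotes), "rows in the full set;", len(dev_sample), "rows in the dev sample")
```

Sampling whole tickers rather than random rows is a deliberate choice here: random rows would punch holes in each stock’s time series and break any rolling calculation.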

The Right Tools for the Job

Largish Data

My baseline dataset consists of a bit over 20 years of daily stock quotes from Sharadar via the Nasdaq Data Link (formerly Quandl) service. In all, the full database of listed and delisted stocks and ETFs is about 50 million rows and the file size is 3.3 GB. In the year 2021, this doesn’t qualify as “big data”; maybe it’s approaching “largish”? It’s not trivial by any means and presents challenges, but this plus quite a bit more is fully manageable on a decent laptop or cloud server using the solutions posted in this series of articles.
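
To see why a dataset of this size can strain a laptop, it helps to measure what a DataFrame actually occupies in RAM. The sketch below uses a common technique, downcasting numeric columns and turning repetitive strings into categoricals. That is an assumption on my part and may differ from the method in the linked “squeezing” article. The helper `squeeze`, the column names, and the toy data are hypothetical, and the savings on real data will vary.

```python
import numpy as np
import pandas as pd


def squeeze(df):
    """Return a copy of df with smaller dtypes where possible (hypothetical helper)."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if pd.api.types.is_integer_dtype(s):
            out[col] = pd.to_numeric(s, downcast="integer")
        elif pd.api.types.is_float_dtype(s):
            # float32 keeps roughly 7 significant digits; confirm that is enough.
            out[col] = pd.to_numeric(s, downcast="float")
        elif pd.api.types.is_object_dtype(s) or pd.api.types.is_string_dtype(s):
            # Few distinct values repeated many times (e.g. tickers) compress well.
            if s.nunique() < 0.5 * len(s):
                out[col] = s.astype("category")
    return out


# Toy data standing in for a real quotes table: 100 tickers x 250 trading days.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2021-01-04", periods=250)
n = 100 * len(dates)
quotes = pd.DataFrame({
    "ticker": np.repeat([f"T{i:03d}" for i in range(100)], len(dates)),
    "date": np.tile(dates, 100),
    "close": rng.uniform(5, 500, size=n).round(2),
    "volume": rng.integers(0, 5_000_000, size=n),
})

before = quotes.memory_usage(deep=True).sum()
small = squeeze(quotes)
after = small.memory_usage(deep=True).sum()
print(f"before: {before:,} bytes, after: {after:,} bytes")
```

Passing `deep=True` matters: without it, Pandas reports only the pointer size for string columns and understates the real footprint.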

Bigger Data

If you’re dealing with billions of rows of tick or other data, the solutions I’m presenting could work with some fancy footwork, but there are certainly much better “bigger data” solutions, starting with the likes of Dask or Vaex, which can read across disk drives and computers, and extending to more enterprise server-side solutions. These tools are great at dealing with sizable data sets in a familiar DataFrame format, but even though they are modeled on Pandas, they don’t allow some basic functions like sorting, grouping, or pivoting on more than one column without a bit of preprocessing.

Small Data

Alternatively, if you’re downloading daily stock quotes from Yahoo on a few dozen stocks, most of what I’m presenting here is overkill. Please use standard Pandas, save as .csv files, and don’t add the complexity. Running “small data” on multiple CPUs might take longer than on a single core because of the inherent processing overhead in Python’s multiprocessing module.
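
One way to act on that overhead trade-off is to fall back to a plain loop for small inputs and use a process pool only when the data is large enough. The sketch below is an illustration using the standard-library `multiprocessing.Pool`, not the single-line method from the linked article. The function names `add_sma` and `enhance`, the column names, and the 1,000,000-row threshold are assumptions, and the threshold in particular should be tuned with timed tests on your own machine.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def add_sma(group, window=20):
    """Add a simple moving average of close to one ticker's rows."""
    group = group.sort_values("date").copy()
    group["sma"] = group["close"].rolling(window).mean()
    return group


def enhance(df, processes=1, min_rows_for_pool=1_000_000):
    """Apply add_sma per ticker, using a process pool only for larger inputs.

    min_rows_for_pool is an arbitrary placeholder, not a measured break-even point.
    """
    groups = [g for _, g in df.groupby("ticker", sort=False)]
    if processes == 1 or len(df) < min_rows_for_pool:
        parts = [add_sma(g) for g in groups]        # serial: no pool overhead
    else:
        with mp.Pool(processes=processes) as pool:  # one task per ticker
            parts = pool.map(add_sma, groups)
    return pd.concat(parts, ignore_index=True)


# Toy "small data": 12 tickers x 60 trading days, handled serially.
dates = pd.bdate_range("2021-01-04", periods=60)
quotes = pd.DataFrame({
    "ticker": np.repeat([f"T{i:02d}" for i in range(12)], len(dates)),
    "date": np.tile(dates, 12),
    "close": np.random.default_rng(0).uniform(5, 500, size=12 * len(dates)),
})

enhanced = enhance(quotes)
print(enhanced.tail(3))
```

To exercise the parallel path, call something like `enhance(quotes, processes=mp.cpu_count(), min_rows_for_pool=0)` from inside an `if __name__ == "__main__":` block. On platforms that start worker processes by spawning, such as macOS and Windows, that guard is required.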

Beyond Stocks

While I’m focused on stock quotes in this series of articles, the principles will certainly work on any columnar, Pandas-based data for all types of models and analysis.

Reasons for Writing and Publishing

It’s taken me a long time with many false starts to get to this point where I can confidently process large swaths of data. Going into this I had a notion that I’d like to write about it. Just knowing that others would be reading it, I’ve dramatically improved my code and its performance, sometimes by an order of magnitude as I searched for “better” ways. Also, by sharing, I’m hoping others will provide constructive feedback on what I could have done better!

Image by Doodkoalex | Dreamstime

Thanks for reading Caffeinated Pandas! Please FOLLOW ME if you’d like to be alerted about new content.

More content at plainenglish.io

Python
Data Science
Stock Market
Programming
Pandas