Jeremy DiBattista


Building a Budget News-Based Algorithmic Trader? Well then You Need Hard-To-Find Data — Part 2

My story of building an algorithmic trader for $0, analyzing free APIs, datasets, and web scrapers. Part 2: Datasets and Web Scrapers

Photo by Stephen Dawson on Unsplash

Finding data for a news-based trader is an exceptionally challenging task. Getting historical data going back more than a year is either a daunting challenge or one that will cost hundreds to thousands of dollars. In part 1, I analyzed the best news APIs for accessing data. If you have not read that article, I recommend reading it first. To summarize what I discovered about free APIs: they are excellent for providing real-time news for traders but fall exceptionally short in providing a backlog of data. In fact, I could not find one whose data went back further than a year. They also limit the number of requests a user can make in any given month. Luckily, we can take steps to mitigate the drawbacks of APIs. In this article, I will be analyzing free datasets and web scrapers to see how they can provide the necessary data for creating algorithmic traders.

Datasets

Datasets are a favorite for accessing mass data quickly. If the correct data is available, a dataset provides an invaluable speedup in algorithm development because you can download and use masses of data right away. I searched dozens of archives, from Google's dataset service to Kaggle. Surprisingly, the only source that provided truly useful datasets was Kaggle, and it actually had several!

Pros:

Lots of information — The more information available, the easier it is to learn and discover trends. That is one reason the classic dataset has never fallen out of style!

Quick compilation — When the dataset is downloaded, it is incredibly fast to access, train on, and use the dataset, leading to fast development times.

Cons:

Hard to update — Updating a dataset you did not create is a challenging task, and even if you manage it, you may end up stitching together sources that do not quite match. Otherwise you are at the mercy of the dataset creator to release a new version, or you are stuck with old data in perpetuity. Both are significant drawbacks of the classic dataset.

Hard to find — It is much easier to find a news API than a news dataset. Even if you do find a dataset, finding one that exactly matches the problem you are trying to solve is unlikely. This may make using a dataset an impossible option.

Kaggle US Financial News Articles

This dataset contains articles from Bloomberg, CNBC, Reuters, WSJ, and Fortune from January to May of 2018. The total size of the dataset is over 1 gigabyte, containing thousands upon thousands of articles and metadata.

Pros:

Ample Data — 1 Gigabyte is by far the largest dataset I found. This means whether you only want specific tickers or general news, any user should have no problem extracting the information they need from this dataset. It also has tons of metadata including what entity the article is about and the sentiment towards that entity.

Reliable Data — The dataset contains reputable sources only, providing reliable news coverage to base your algorithm on.

Cons:

Short time span — 5 months of data is a small sample size. The market was stable and doing well over this period of time, which could cause unreliable learning.

Messy data — The data is sorted by article while the date and associated entities are lodged in the metadata. This means there is likely substantial data-wrangling required before this dataset could be usable.

Overall:

This dataset is a great starting point for data collection. It has the significant drawback of lacking a large timespan but if you take it as a starting point, and fill in supplemental data since May 2018, this dataset could prove valuable!
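The "messy data" drawback above is mostly a flattening job: pull the date and the associated entities out of each article's metadata so you end up with one row per date and entity. Here is a minimal sketch of that idea. The field names (`title`, `published`, `entities`, `organizations`, `sentiment`) are assumptions made up for illustration, so check them against the actual files before relying on this.

```python
import json
from datetime import datetime

# Hypothetical article layout; the real files may use different field names.
SAMPLE_ARTICLE = json.dumps({
    "title": "Apple unveils new product",
    "published": "2018-03-14T09:30:00.000+02:00",
    "entities": {
        "organizations": [
            {"name": "Apple", "sentiment": "positive"},
            {"name": "Samsung", "sentiment": "none"},
        ]
    },
})


def flatten_article(raw):
    """Turn one article into one row per (date, organization) pair."""
    article = json.loads(raw)
    # Keep only the calendar date so rows can later be joined to daily prices.
    day = datetime.fromisoformat(article["published"]).date().isoformat()
    rows = []
    for org in article.get("entities", {}).get("organizations", []):
        rows.append({
            "date": day,
            "entity": org["name"],
            "sentiment": org.get("sentiment", "none"),
            "title": article["title"],
        })
    return rows


rows = flatten_article(SAMPLE_ARTICLE)
for row in rows:
    print(row)
```

Once every article is flattened this way, filtering for a single ticker or grouping by day becomes a simple loop or a one-line query in whatever table library you prefer.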

Kaggle Impact of News on Share Closing Value

This dataset is pretty lightweight but is by far the most intriguing dataset on this list. It includes articles spanning 2006 to 2016 for Microsoft and Apple only. Each date contains the open and close prices, as well as a string of all headlines from the New York Times that dealt with said company. The dataset contains sentiment analysis on the combined headline string indicating if a positive or negative sentiment is detected.

Pros:

Long Timespan — 10 years of headlines are ample data to train, test, and validate an algorithm, and this can be even further improved by adding additional data in a similar methodology.

Supplemental Information — The built-in stock prices and sentiment analysis columns make this dataset training-ready! A lot of additional steps, like natural language processing, are done for you!

Reliable Data — Data comes directly from the New York Times, and while this isn’t a diverse source of data, it is a reliable and consistent source.

Cons:

Only 2 tickers — It could be dangerous to learn off of 2 tickers and extrapolate to other stocks. It is a shame this dataset does not contain 20+ tickers from different sectors! Apple and Microsoft are also both successful companies, which could introduce unwanted survivor bias.

Data is getting old — Having data only as recent as 2016 could hurt when you want to create an algorithm to trade today. You may need to fill in a decent amount of missing recent data before it is usable.

Lack of Metadata — The information provided is only strings of headlines. This lacks in-depth metadata and article content that could prove useful.

Overall:

This dataset is great for learning how to build an algorithmic trader. It provides a good amount of data on 2 tickers and provides extra analysis. If you want to grab a dataset and begin training, there is no better option than this one! I would be cautious about using this as your only data source, however, especially if you are looking to build a comprehensive algorithm. The drawbacks of older data and limited information hold back what is otherwise a great dataset.
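To make "grab a dataset and begin training" a little more concrete, here is a rough sketch of two first steps: deriving an up/down label from the open and close prices, and splitting the rows into train, validate, and test sets. The row layout and column names are hypothetical, and splitting in date order (rather than randomly) is my own design choice so the model is never evaluated on days that come before its training data.

```python
from datetime import date, timedelta

# Hypothetical rows: the column names are illustrative, not the dataset's own.
START = date(2016, 1, 4)
CLOSES = [10.5, 10.2, 10.9, 11.0, 10.7, 10.8, 11.3, 11.1, 11.6, 11.4]
ROWS = [
    {"date": START + timedelta(days=i), "open": 10.6, "close": close}
    for i, close in enumerate(CLOSES)
]


def add_label(row):
    """Return a copy of the row with label 1 if it closed above its open, else 0."""
    return {**row, "label": int(row["close"] > row["open"])}


def chronological_split(rows, train_pct=60, validate_pct=20):
    """Split rows by date: oldest for training, newest for testing."""
    ordered = sorted(rows, key=lambda r: r["date"])
    train_end = len(ordered) * train_pct // 100
    validate_end = train_end + len(ordered) * validate_pct // 100
    return ordered[:train_end], ordered[train_end:validate_end], ordered[validate_end:]


labeled = [add_label(row) for row in ROWS]
train_rows, validate_rows, test_rows = chronological_split(labeled)
print(len(train_rows), len(validate_rows), len(test_rows))
```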

Kaggle Daily News for Stock Market Prediction

This dataset contains the 25 most-upvoted world news headlines retrieved each day from Reddit's world news forum, spanning 2008 to 2016. It also contains the Dow Jones Industrial Average data as well as a boolean label: 0 if the Dow closed lower that day, and 1 if it closed higher.

Pros:

Long Timespan — 8+ years with 25 headlines per day is ample data to train, test, and validate an algorithm, and this can be even further improved by adding additional data in a similar methodology.

Well made — the dataset is well organized and ready to be utilized for algorithm development. The dataset was produced by a professor for use in a deep learning course, so it is naturally made easy to use.

Cons:

Data Validity — Pulling headlines based on what users upvoted and downvoted can introduce bias into the algorithm. Reddit also does not vet the upvoted news sources for validity.

Data is getting old — Having data only as recent as 2016 could hurt when you want to create an algorithm to trade today. You may need to fill in a decent amount of missing recent data before it is usable.

Not specific — The data is only from world news, not financial news or individual symbols, so extracting specific financial articles is not possible.

Overall:

This is the most well-rounded dataset of the three. It provides ample data, a great timespan, and the opportunity for a user to easily add to it, augment it with techniques like NLP, or use it to get an algorithm developed quickly. This convenience and ample free data, however, come at the cost of data reliability.
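As a small illustration of working with this layout, the sketch below joins a day's headlines into one string, counts words as a very basic NLP feature, and builds the 0/1 label from two closing prices. The function names are mine, and treating an unchanged close as 1 is an assumption, since the description above only covers "lower" and "higher".

```python
import re
from collections import Counter


def combine_headlines(headlines):
    """Join one day's headlines into a single string, skipping blanks."""
    cleaned = (h.strip() for h in headlines if h)
    return " ".join(h for h in cleaned if h)


def bag_of_words(text):
    """Count lowercase word occurrences; a stand-in for real NLP features."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def dow_label(previous_close, close):
    """1 if the index did not fall versus the previous close, else 0."""
    return 1 if close >= previous_close else 0


day = ["  Markets rally as markets recover ", "", "Oil prices fall"]
text = combine_headlines(day)
counts = bag_of_words(text)
print(text)
print(counts.most_common(3))
print(dow_label(17500.0, 17620.5))
```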

Overall Impression of Datasets

All of these datasets provide ample free data incredibly quickly. However, none of them is perfect. Each suffers from its own drawbacks that could limit its usefulness, and none contains data more recent than 2018. The benefit of datasets, however, is that they provide a great starting point for adding some historical context to your free API or web scraper!
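Using a dataset as historical context for an API or scraper mostly comes down to appending newer records to the old ones without creating duplicates. A minimal sketch, assuming both sources have already been reduced to dictionaries with an ISO `date` string and a `headline` (a layout I made up for this example):

```python
def merge_records(historical, fresh):
    """Combine two record lists in date order, dropping repeated headlines."""
    seen = set()
    merged = []
    # sorted() is stable, so on equal dates the historical record wins.
    for record in sorted(historical + fresh, key=lambda r: r["date"]):
        key = (record["date"], record["headline"].strip().lower())
        if key not in seen:
            seen.add(key)
            merged.append(record)
    return merged


historical = [{"date": "2016-07-01", "headline": "Dow closes higher"}]
fresh = [
    {"date": "2016-07-01", "headline": "dow closes higher "},
    {"date": "2024-01-02", "headline": "Markets open flat"},
]
merged = merge_records(historical, fresh)
print(merged)
```

Matching on a normalized headline is deliberately crude. Two outlets wording the same story differently will both survive, which may or may not be what you want.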

Web Scrapers

Web scraping involves creating a program that systematically extracts data from a source site and saves that data for later use. Scrapers are a favorite of many for being free to create and fully customizable. For this article, I will be covering two approaches to web scraping!

Pros:

Fully customizable — You are writing the code to extract the data, so you are storing exactly the data you want, exactly how you want it.

Unlimited free access — Your bot will run for as long as you allow it, with no paywalls or limits on the data you receive.

Cons:

Can be slow — Since you are programmatically accessing a website, gathering mass data is much slower than using an API.

Limited by the website — You are limited to what is visible and accessible on the site you are scraping, which means going far back in time (like getting tweets from years ago) could be programmatically challenging.

High development cost — Creating a web scraper is more challenging and will take longer than using an API or downloading a dataset.
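To show what "writing the code to extract the data" looks like at its simplest, here is a sketch that pulls headline text out of a page using only Python's standard library. It assumes headlines live in `<h2>` tags, which is a made-up convention for this example, since every real site needs its own rules. It also runs on a hard-coded HTML string rather than a live page.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text of every <h2> element on a page."""

    def __init__(self):
        super().__init__()
        self._in_headline = False
        self._parts = []
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_headline = True
            self._parts = []

    def handle_data(self, data):
        if self._in_headline:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_headline:
            # Collapse whitespace left over from nested tags and line breaks.
            text = " ".join("".join(self._parts).split())
            if text:
                self.headlines.append(text)
            self._in_headline = False


def extract_headlines(html):
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.headlines


SAMPLE = """<html><body>
<h2>Apple shares <em>climb</em> after earnings</h2>
<p>Article body goes here.</p>
<h2>Microsoft announces cloud deal</h2>
</body></html>"""

headlines = extract_headlines(SAMPLE)
print(headlines)
```

For a live page you would fetch the HTML first (for example with `urllib.request.urlopen`) and pass the decoded text to `extract_headlines`. Pages that build their content with JavaScript will need a browser-driven tool instead.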

Selenium

Selenium is a toolkit, available in several popular programming languages, for programmatically controlling a web browser using scripts. It is widely popular, widely supported, easy to use, and currently used by hundreds of companies for web scraping, automation, and testing of systems.

Pros:

Wide Support — With development support for Python, Java, Ruby, JavaScript, and C#, it is easy to take your favorite language and get started automating a web browser!

Fully Customizable — You fully control the web browser. Anything you can do in a browser can be automated, saved, and used to programmatically store data!

Cons:

High Development Cost — Creating your perfect scraper will take time, and in addition, there is always a possibility the site changes, requiring maintenance.

Time-Consuming — waiting for the scraper to gather enough data to create any useful algorithm may be prohibitive versus utilizing other sources like APIs.

Overall:

Selenium truly is the “You want it? Build it!” option for web scraping. It is an incredibly expansive and useful tool, with the only drawback being that you can only accomplish what is accessible via a browser. If you need specific data and have the programming prowess to create your own solution, there is no better platform than Selenium.
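For a feel of what a Selenium scraper looks like in Python, here is a rough sketch. It assumes the `selenium` package and a Chrome installation are available, and the default `h2` selector is a placeholder rather than a rule for any real news site. The browser function is only defined here, not called, so the cleanup helper can be tried on its own.

```python
def clean_headlines(texts):
    """Drop blanks and duplicates while keeping page order."""
    seen = set()
    cleaned = []
    for text in texts:
        text = " ".join(text.split())
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


def scrape_headlines(url, css_selector="h2"):
    """Open the page in headless Chrome and return the matching element text."""
    # Imported lazily so clean_headlines works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        elements = driver.find_elements(By.CSS_SELECTOR, css_selector)
        return clean_headlines(element.text for element in elements)
    finally:
        driver.quit()  # always release the browser, even on errors


sample = clean_headlines(["Apple  climbs", "", "Apple climbs", "Dow slips"])
print(sample)
```

Calling `scrape_headlines("https://example.com")` would start a real browser session, so expect it to be noticeably slower than parsing raw HTML.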

Getdata.io

An intriguing and possibly quicker solution to web scraping comes in getdata.io. The platform allows the creation of “recipes” using its query language, which systematically grab data from a webpage as it changes. The great part, however, is that you do not need to be versed in the query language, as there is a Chrome extension that works as a point-and-click solution for generating a recipe! Unlike Selenium, which is built to perform any web-based task, this is tailored and built to scrape data!

Pros:

Quick to Learn — It will not take nearly as long to begin scraping a webpage when using getdata.io. The added Chrome extension makes scraping data simple!

Built for extracting data — The platform is built to detect changes in data and update accordingly. You can then use a simple GET request to pull any data into your algorithm!

A community of data — The recipes you create and the data you extract can become public. This means you are given a great community of people who are already scraping for data! The recipes I found for financial news, however, were too weak to report on.

Cons:

Lack of historic feeds — Since the scraper is only looking for changes, it is not good at crawling backward for historic data, meaning you will need to wait for enough data to accumulate!

More limited than Selenium — The recipe creation system, while it simplifies the process, also limits the capabilities of what can be extracted. Complex extractions may be more difficult, which could make Selenium easier in the long run.

Overall:

For someone who wants an easy introduction to web scraping and easy access to tailored data, this is an attractive option. What it lacks in depth it more than makes up for in a community of data and a fast start-up time!
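Pulling recipe output into your own code with a GET request could look roughly like the sketch below. I am not reproducing the platform's actual endpoint or response format here: the URL is a placeholder and the `{"data": [...]}` shape is an assumption, so adapt both to whatever your recipe really returns. A fake opener stands in for the network so the example runs offline.

```python
import io
import json
import urllib.request


def fetch_records(url, opener=urllib.request.urlopen):
    """GET a JSON document and return its records as a list of dicts."""
    with opener(url) as response:
        payload = json.loads(response.read().decode("utf-8"))
    # Assumed shape: either a bare list or an object with a "data" list.
    if isinstance(payload, dict):
        payload = payload.get("data", [])
    return [record for record in payload if isinstance(record, dict)]


def fake_opener(url):
    # Stand-in for the network so the sketch can be run without a real feed.
    return io.BytesIO(b'{"data": [{"headline": "Apple beats estimates"}]}')


records = fetch_records("https://example.com/recipe-output.json", opener=fake_opener)
print(records)
```

Dropping the `opener` argument makes the same function perform a real request against whatever URL you pass in.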

Overall Impression of Web Scrapers

Web scrapers provide the best option for people who need to create their own data. While they may take longer to gather the necessary data, the amount of customizability is unmatched. Web scraping can be used to augment an existing dataset or API, or can be used on its own to wondrous success. Overall, the amount of work you put into your web scraper will determine the success it has at creating the perfect dataset!

There you have it! If you followed me through both parts, I hope I have provided a great resource on where to get started in finding news for algorithmic trading. My overall recommendations for creating your free dataset are:

  1. Don't be afraid to mix sources. Datasets and APIs/Scrapers will work very well together as long as you are trying to maintain consistency.
  2. Use the sources that fit your use case. None of these sources are perfect, but they are all on here for a reason. Each of these 10 items provides at least one unique aspect not covered by the other 9!
  3. Be careful when trading with your own money!

If you have a favorite API, dataset, or web scraper that I missed, please let me know, and if you like this article, please follow me for more content in the field of data science!
