Yeung WONG

Summary

The web content provides a step-by-step guide on how to scrape 1-minute price data for cryptocurrencies, specifically "The CoinDesk 20," using Python and the CoinDesk API.

Abstract

The article details a method for extracting timely and precise 1-minute interval cryptocurrency price data from CoinDesk, focusing on "The CoinDesk 20" list of cryptocurrencies. It emphasizes the importance of such high-frequency data for day-to-day trading strategies and pattern recognition. The guide includes an understanding of the CoinDesk API structure, handling of missing data, and the eventual creation of a comprehensive cryptocurrency dataframe. The process involves setting up the correct API parameters, including the cryptocurrency symbol, start and end times for data retrieval, and whether to retrieve open-high-low-close (OHLC) data. The Python code provided utilizes libraries such as requests, numpy, and pandas to fetch, process, and store the data in a CSV file for further analysis.

Opinions

  • The author believes that minute-based data is crucial for traders who engage in day-to-day trading and need to analyze trends and recognize patterns quickly.
  • "The CoinDesk 20" is suggested as a reliable starting point for scraping cryptocurrency prices due to its focus on assets with significant market cap and liquidity.
  • The author notes that CoinDesk's API provides both 1-minute and 1-hour based data, with the former being particularly useful for short-term analysis.
  • There is an acknowledgment that CoinDesk's data may have missing timestamps, which necessitates a data imputation method to ensure data continuity.
  • The tutorial advocates for the use of Python and its libraries for efficient data scraping and manipulation, highlighting the language's utility in financial data analysis.

Web Scraping Cryptocurrency 1-Minute Price Data (Python)

Step-by-step guide to scraping different cryptocurrency prices from CoinDesk.

Goal

This article shows how to use Python to scrape 1-minute cryptocurrency prices from CoinDesk. In general, you can only pull hour-based or day-based data, which is fine if you are working on a long-term investment strategy. However, timely, short-interval data is a big benefit for anyone doing day-to-day trading, because it makes it easier to analyze the upcoming trend and recognize patterns. Therefore, in the following, I will show you how to pull 1-minute price data from a well-known cryptocurrency information platform, CoinDesk.

CoinDesk — Cryptocurrency Information Platform

Cryptocurrency List

To scrape cryptocurrency prices, a list of cryptocurrencies is needed. There are nearly 6000 cryptocurrencies as of August 2021, and not all of them are good for trading in terms of market cap and liquidity. “The CoinDesk 20” is a good starting point. It filters the larger universe of thousands of cryptocurrencies and digital assets down to a core group of 20. In the following sections, I will scrape the prices of “The CoinDesk 20” as the demonstration, and the upcoming tutorial will be based on this set of assets as well. (Please note that ‘MATIC’ is not included, so the list below has 19 symbols: CoinDesk only provides ‘MATIC’ price data starting from July 2021, which is insufficient for the analysis.) Also, if you are interested in which other cryptocurrencies CoinDesk supports, you may refer to its website.

The CoinDesk 20
Example of Other Cryptocurrencies Supported by CoinDesk
# The CoinDesk 20
coindesk20_list = ['BTC', 'ETH', 'XRP', 'ADA', 'USDT', 'DOGE', 'XLM', 'DOT', 'UNI', 'LINK', 'USDC', 'BCH', 'LTC', 'GRT', 'ETC', 'FIL', 'AAVE', 'ALGO', 'EOS']

CoinDesk API

Before we start scraping the price, we first have to understand the CoinDesk API. The easiest way is to observe what its price charts show.

Left: 12h (1-minute) Right: 1w (1-hour)

From the charts, we can see that the ‘12h’ chart uses 1-minute data while the ‘1w’ chart uses 1-hour data. I have summarized this in the table below.

Since I am interested in 1-minute price data, I will use the ‘12h’ chart as the basis. Taking Bitcoin as an illustration, below is the API structure for CoinDesk.

https://production.api.coindesk.com/v2/price/values/BTC?start_date=2021-08-20T15:42&end_date=2021-08-21T03:42&ohlc=true

Basically, 4 parameters can be set in this URL.

  1. Cryptocurrency Symbol (BTC, in the path)
  2. Price Data Starting Time (start_date)
  3. Price Data End Time (end_date)
  4. Open-High-Low-Close (ohlc)

Please note that (1) the times are in UTC+0, (2) the start time and end time should be no more than 12 hours apart if you are interested in minute-based data, and (3) if ohlc is set to false, only the closing price is returned.
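As a minimal sketch of the notes above, the URL can be assembled from the four parameters, and a longer period can be split into windows of at most 12 hours. The helper names build_price_url and twelve_hour_windows are hypothetical and introduced only for illustration; the URL layout is copied from the example above.

```python
from datetime import datetime, timedelta

BASE_URL = 'https://production.api.coindesk.com/v2/price/values/'

def build_price_url(symbol, start, end, ohlc=True):
    """Assemble the request URL from the four parameters (hypothetical helper)."""
    fmt = '%Y-%m-%dT%H:%M'  # times are interpreted as UTC+0
    return (f'{BASE_URL}{symbol}?start_date={start.strftime(fmt)}'
            f'&end_date={end.strftime(fmt)}&ohlc={str(ohlc).lower()}')

def twelve_hour_windows(start, end):
    """Yield (window_start, window_end) pairs, newest first, each at most 12 hours long."""
    window_end = end
    while window_end > start:
        window_start = max(start, window_end - timedelta(hours=12))
        yield window_start, window_end
        window_end = window_start

url = build_price_url('BTC', datetime(2021, 8, 20, 15, 42), datetime(2021, 8, 21, 3, 42))
windows = list(twelve_hour_windows(datetime(2021, 7, 31), datetime(2021, 8, 1)))
print(url)
print(windows)
```

Walking backwards from the end time mirrors the order used in the scraping loop in the next section.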

Data Scraping

After having the cryptocurrencies list and truly understanding the API structure, we can now start scraping the price.

# Import Libraries
import requests
import numpy as np
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
# The CoinDesk 20
coindesk20_list = ['BTC', 'ETH', 'XRP', 'ADA', 'USDT', 'DOGE', 'XLM', 'DOT', 'UNI', 'LINK', 'USDC', 'BCH', 'LTC', 'GRT', 'ETC', 'FIL', 'AAVE', 'ALGO', 'EOS']
raw_df = pd.DataFrame()
for coin in coindesk20_list:
    coin_df = pd.DataFrame()
    df = pd.DataFrame(index=[0])
    
    # Define the Start Date and End Date
    end_datetime = datetime(2021, 8, 1, 0, 0)
    datetime_checkpt = datetime(2021, 7, 1, 0, 0)
    
    while len(df) > 0:
        if end_datetime == datetime_checkpt:
            break
        start_datetime = end_datetime - relativedelta(hours = 12)
        url = 'https://production.api.coindesk.com/v2/price/values/' + coin + '?start_date=' + start_datetime.strftime("%Y-%m-%dT%H:%M") + '&end_date=' + end_datetime.strftime("%Y-%m-%dT%H:%M") + '&ohlc=true'
        temp_data_json = requests.get(url)
        temp_data = temp_data_json.json()
        # Name the columns up front so an empty response gives an empty frame instead of an error
        df = pd.DataFrame(temp_data['data']['entries'], columns=['Timestamp', 'Open', 'High', 'Low', 'Close'])
        
        # Handle the Missing Data
        insert_idx_list = [np.nan]
        while len(insert_idx_list) > 0:
            timestamp_checking_array = np.array(df['Timestamp'][1:]) - np.array(df['Timestamp'][:-1])
            insert_idx_list = np.where(timestamp_checking_array != 60000)[0]
            if len(insert_idx_list) > 0:
                print('There are ' + str(len(insert_idx_list)) + ' timestamp mismatched.')
                insert_idx = insert_idx_list[0]
                temp_df = df.iloc[insert_idx.repeat(int(timestamp_checking_array[insert_idx]/60000)-1)].reset_index(drop=True)
                temp_df['Timestamp'] = [temp_df['Timestamp'][0] + i*60000 for i in range(1, len(temp_df)+1)]
                # DataFrame.append was removed in pandas 2.0, so use pd.concat
                df = pd.concat([df.loc[:insert_idx], temp_df, df.loc[insert_idx+1:]]).reset_index(drop=True)
                insert_idx_list = insert_idx_list[1:]
        
        df = df.drop(['Timestamp'], axis=1)
        df['Datetime'] = [end_datetime - relativedelta(minutes=len(df)-i) for i in range(0, len(df))]
        coin_df = pd.concat([df, coin_df])
        end_datetime = start_datetime
    coin_df['Symbol'] = coin
    raw_df = pd.concat([raw_df, coin_df])
raw_df = raw_df[['Datetime', 'Symbol', 'Open', 'High', 'Low', 'Close']].reset_index(drop=True)
raw_df.to_csv('raw_df.csv', index=False)

Simply put, we can divide the code into 4 parts.

1. Get the JSON data from API

temp_data_json = requests.get(url)
temp_data = temp_data_json.json()
df = pd.DataFrame(temp_data['data']['entries'], columns=['Timestamp', 'Open', 'High', 'Low', 'Close'])

The requests package lets us easily pull the JSON data from the API. After that, we store it in a pandas data frame with readable column names.
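To make the parsing step concrete without calling the API, here is a small sketch that runs on a hand-written payload. The payload shape (a 'data' object holding a list of 'entries', each with five values) is inferred from the code above, the numbers are made up, and entries_to_frame is a hypothetical helper name.

```python
import pandas as pd

COLUMNS = ['Timestamp', 'Open', 'High', 'Low', 'Close']

def entries_to_frame(payload):
    """Turn the assumed JSON payload into a data frame; empty entries give an empty frame."""
    entries = payload.get('data', {}).get('entries', [])
    return pd.DataFrame(entries, columns=COLUMNS)

# Made-up sample values, only to illustrate the assumed structure
sample_payload = {'data': {'entries': [
    [1629474120000, 100.0, 101.0, 99.5, 100.5],
    [1629474180000, 100.5, 102.0, 100.0, 101.5],
]}}

frame = entries_to_frame(sample_payload)
empty = entries_to_frame({'data': {'entries': []}})
print(frame)
```

Passing the column names to the constructor keeps the empty case from raising a length-mismatch error.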

2. Handle the missing data

insert_idx_list = [np.nan]
while len(insert_idx_list) > 0:
    # Difference between consecutive timestamps; 60000 means exactly one minute apart
    timestamp_checking_array = np.array(df['Timestamp'][1:]) - np.array(df['Timestamp'][:-1])
    insert_idx_list = np.where(timestamp_checking_array != 60000)[0]
    if len(insert_idx_list) > 0:
        print('There are ' + str(len(insert_idx_list)) + ' timestamp mismatched.')
        insert_idx = insert_idx_list[0]
        # Repeat the row before the gap once for every missing minute
        temp_df = df.iloc[insert_idx.repeat(int(timestamp_checking_array[insert_idx]/60000)-1)].reset_index(drop=True)
        temp_df['Timestamp'] = [temp_df['Timestamp'][0] + i*60000 for i in range(1, len(temp_df)+1)]
        # DataFrame.append was removed in pandas 2.0, so use pd.concat
        df = pd.concat([df.loc[:insert_idx], temp_df, df.loc[insert_idx+1:]]).reset_index(drop=True)
        insert_idx_list = insert_idx_list[1:]

This is the trickiest part, because I found that CoinDesk does not always capture every minute of data. By observation, in the normal situation, consecutive rows differ by 60000 in the Timestamp column, which corresponds to 1 minute. Therefore, whenever the difference between two consecutive rows is larger than 60000, that gap is a missing period. To deal with it, a hot-deck imputation is applied. In other words, the row immediately before the gap is repeated to fill each missing minute.
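The same idea can be sketched more compactly with pandas reindexing and forward fill. This is an alternative to the loop above, not the code used in this article. It assumes the timestamps are in milliseconds and that every row sits on an exact 60000 step from the first row; fill_minute_gaps is a hypothetical helper name and the sample values are made up.

```python
import numpy as np
import pandas as pd

STEP_MS = 60_000  # one minute, assuming millisecond timestamps

def fill_minute_gaps(df):
    """Insert a row for every missing minute, copying the last row seen before the gap."""
    first = df['Timestamp'].iloc[0]
    last = df['Timestamp'].iloc[-1]
    full_index = np.arange(first, last + STEP_MS, STEP_MS)
    filled = df.set_index('Timestamp').reindex(full_index).ffill()
    return filled.rename_axis('Timestamp').reset_index()

# Made-up sample with two missing minutes (120000 and 180000)
sample = pd.DataFrame({
    'Timestamp': [0, 60_000, 240_000],
    'Open': [1.0, 2.0, 5.0],
    'High': [1.5, 2.5, 5.5],
    'Low': [0.5, 1.5, 4.5],
    'Close': [1.2, 2.2, 5.2],
})

filled = fill_minute_gaps(sample)
print(filled)
```

Rows that do not fall on the 60000 grid would be dropped by the reindex, so the explicit loop is the safer choice when the timestamps are irregular.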

3. Add the Datetime and Symbol to the coin_df

df = df.drop(['Timestamp'], axis=1)
df['Datetime'] = [end_datetime - relativedelta(minutes=len(df)-i) for i in range(0, len(df))]

Since the Timestamp column is defined by CoinDesk and is not easy to read, instead of writing a time conversion function, I simply derive the Datetime column by counting back one minute per row from the end time of the window, so that each price has a date and time.
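If the Timestamp values are Unix epoch milliseconds, they could also be converted directly. That is an assumption: the 60000 step per minute suggests milliseconds, but the article does not confirm the epoch. The sketch below shows this alternative with pandas; add_datetime_column is a hypothetical helper name.

```python
import pandas as pd

def add_datetime_column(df):
    """Add a UTC Datetime column, assuming Timestamp holds Unix epoch milliseconds."""
    out = df.copy()
    out['Datetime'] = pd.to_datetime(out['Timestamp'], unit='ms')
    return out

# Two consecutive minutes, written as assumed epoch milliseconds
sample = pd.DataFrame({'Timestamp': [1629474120000, 1629474180000],
                       'Close': [100.5, 101.5]})
converted = add_datetime_column(sample)
print(converted)
```

Converting from the timestamp does not depend on the number of rows returned, whereas counting back from the end time assumes the last row sits one minute before it.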

coin_df['Symbol'] = coin

Also, the cryptocurrency symbol is added to the coin_df as well.

4. Merge the coin_df into raw_df

raw_df = pd.concat([raw_df, coin_df])

Lastly, each coin_df is merged into a consolidated dataset called raw_df.

Raw Dataframe (raw_df.csv)

Cryptocurrency Dataframe

Finally, we can transform the data into the cryptocurrency data frame.

cryptocurrency_df = pd.DataFrame(raw_df['Close'].values.reshape(len(coindesk20_list), -1).transpose(), index=raw_df['Datetime'][:int(len(raw_df) / len(coindesk20_list))], columns=coindesk20_list)
cryptocurrency_df.to_csv('cryptocurrency_df.csv')
Cryptocurrency Dataframe (cryptocurrency_df.csv)
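The reshape above relies on every symbol having the same number of rows in the same order. As a hedged alternative sketch, a pivot on the Datetime and Symbol columns builds the same wide layout by matching on values instead of position. The helper name to_wide_close and the sample numbers are made up for illustration.

```python
import pandas as pd

def to_wide_close(raw_df, symbols):
    """One row per Datetime, one column per symbol, holding the closing price."""
    wide = raw_df.pivot(index='Datetime', columns='Symbol', values='Close')
    return wide.reindex(columns=symbols)  # keep the column order of the symbol list

# Made-up sample in the same long layout as raw_df
sample_raw = pd.DataFrame({
    'Datetime': pd.to_datetime(['2021-07-01 00:00', '2021-07-01 00:01'] * 2),
    'Symbol': ['BTC', 'BTC', 'ETH', 'ETH'],
    'Close': [10.0, 11.0, 1.0, 2.0],
})

wide = to_wide_close(sample_raw, ['BTC', 'ETH'])
print(wide)
```

With a pivot, a symbol that lacks some minutes shows up as missing values in its column instead of silently shifting the other prices.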

That brings us to the end of the tutorial. Now you can move on to the next part to see how to analyze the risk and return of the cryptocurrencies. =)
