Zach Quinn

Summary

This article describes how to use Python, the Reddit API, and BigQuery to build a batch pipeline that loads data from multiple sources in one simple script.

Abstract

The article explains how to create an app and authenticate on Reddit, extract data using the Reddit API, transform the data into a data frame, and load it into BigQuery. The author provides code snippets and tips for each step, as well as a link to the full script on GitHub. The goal is to reduce scrolling time on Reddit by creating a daily news pipeline that can be exported to a Google sheet or prettified using SQL.

Opinions

  • The author emphasizes the importance of using a well-documented API, such as Reddit's, to extract data.
  • The author suggests using a Google Cloud Storage (GCS) bucket to hide authentication keys when displaying code in a public portfolio.
  • The author notes that the JSON output from the Reddit API is not ideal and recommends converting the data to a data frame.
  • The author recommends using logging statements to ensure transparency and preempt frustration when checking each table during the loading process.
  • The author suggests trying to convert the script to a Google Cloud Function and scheduling it for daily updates.

Use the Reddit API and BigQuery to build a Daily News Pipeline

Thanks to a well-documented API, we can use Python’s requests library to retrieve the top news stories as well as associated user data. Despite being known as a thread-based social network, Reddit is one of the leading aggregators of news, with communities like r/news garnering over 20 million members.

Below, we’ll walk through how to combine Python with BigQuery to build a batch pipeline that can load data from multiple sources in one simple script.

Create App & Authenticate

Before we can code a single request, you’ll need to log into your Reddit account and create an app.

Photo courtesy of Reddit Developers.

Tip: Choose ‘script’ and, since this will only be run locally, use a stock URL like ‘reddit.com’ in the URL fields.

After you’ve registered a developer account, you can authenticate with your username and password.

import os
import requests

# Point Google client libraries at a service account key (used later for BigQuery).
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = '<GOOGLE_CRED>'
# Credentials from the Reddit app you created above.
auth = requests.auth.HTTPBasicAuth('CLIENT_ID', 'SECRET_TOKEN')
# Script apps use the 'password' grant along with your Reddit login.
data = {
    'grant_type': 'password',
    'username': 'username',
    'password': 'password'
}

Tip: If you ever display this code in a public portfolio be sure to hide these keys. You can use a Google Cloud Storage (GCS) bucket for this task.
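
If you go the GCS route, a minimal sketch might look like the following; the bucket name ‘my-pipeline-secrets’ and object ‘reddit_keys.json’ are hypothetical, and the google-cloud-storage client library is assumed to be installed.

import json
from google.cloud import storage

# Sketch: read the Reddit keys from a private GCS object instead of hard-coding them.
# 'my-pipeline-secrets' and 'reddit_keys.json' are hypothetical names.
storage_client = storage.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS set above
blob = storage_client.bucket('my-pipeline-secrets').blob('reddit_keys.json')
keys = json.loads(blob.download_as_text())
auth = requests.auth.HTTPBasicAuth(keys['client_id'], keys['secret_token'])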

Once your app is created, Reddit will provide you with a client ID and secret key. We’ll use these in a POST call to retrieve our access token. All of this information can be saved in the ‘headers’ variable, which we’ll plug into our GET request.

headers = {'User-Agent': 'News/0.0.1'}
# Exchange the app credentials and Reddit login for an OAuth access token.
request = requests.post('https://www.reddit.com/api/v1/access_token', auth=auth, data=data, headers=headers)
token = request.json()['access_token']
# Merge the Authorization header into the existing headers dictionary.
headers = {**headers, **{'Authorization': f"bearer {token}"}}

FYI, the ‘**’ syntax unpacks each dictionary so the two can be merged into one. In other words, it folds the Authorization header carrying our token into the headers we’ll send with every request.
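
A quick illustration (not from the original script) of what that merge does:

# Dictionary unpacking with **: the key/value pairs of both dicts land in one new dict.
defaults = {'User-Agent': 'News/0.0.1'}
extra = {'Authorization': 'bearer abc123'}
merged = {**defaults, **extra}
print(merged)  # {'User-Agent': 'News/0.0.1', 'Authorization': 'bearer abc123'}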

Extract

We’ll be making all requests from this endpoint: ‘https://oauth.reddit.com/’.

We can modify this endpoint to target the communities we’d like to examine. For r/news, we append ‘r/news’ and choose a sort order with the suffix ‘/hot’, which returns the featured posts. We can also limit the number of results we receive by passing a ‘limit’ parameter of 100.

news_requests = requests.get('https://oauth.reddit.com/r/news/hot', headers=headers, params={'limit': '100'})

In order to access the response, we’ll use the .json() method and make sure we get all the data by passing the ‘data’ key. To reach descriptive fields like ‘title’ we drill into the ‘children’ key, where each post lives.
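
As a small peek at that structure before we transform it (field names come from Reddit’s listing response):

# Each post sits under data -> children -> data in the listing response.
response_json = news_requests.json()
first_post = response_json['data']['children'][0]['data']
print(first_post['title'])         # headline of the top 'hot' post
print(first_post['num_comments'])  # one of the descriptive fields we'll keep below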

Transform

Since the JSON output isn’t ideal, we’ll want to convert this data to a data frame. Moreover, we’ll want to ensure we create columns for each field in our final output. We can use a simple for loop over the JSON’s ‘data’ and ‘children’ keys, appending each post’s fields as a new row.

import pandas as pd

# Collect one row per post, then build the data frame in a single pass.
# (DataFrame.append was removed in newer versions of pandas.)
rows = []
for post in news_requests.json()['data']['children']:
    rows.append({
        'title': post['data']['title'],
        'upvote_ratio': post['data']['upvote_ratio'],
        'score': post['data']['score'],
        'ups': post['data']['ups'],
        'domain': post['data']['domain'],
        'num_comments': post['data']['num_comments']
    })
df = pd.DataFrame(rows)

Tip: Be sure you add the data key to each child field so you don’t end up with an empty output.

This step can be repeated for as many communities as you want to request. In my case, since I’m a news junkie, I requested data from nine communities (see the full script here).
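
The full script shows the exact approach; as a rough sketch of how the repetition could be avoided, the request-and-transform steps can be wrapped in a function and looped over a list of subreddit names (the names below are just examples):

# Sketch: reuse the extract/transform steps for any list of communities.
FIELDS = ['title', 'upvote_ratio', 'score', 'ups', 'domain', 'num_comments']

def fetch_subreddit(name, headers, limit=100):
    url = f'https://oauth.reddit.com/r/{name}/hot'
    response = requests.get(url, headers=headers, params={'limit': str(limit)})
    rows = [{field: post['data'][field] for field in FIELDS}
            for post in response.json()['data']['children']]
    return pd.DataFrame(rows)

frames = {name: fetch_subreddit(name, headers) for name in ['news', 'worldnews', 'technology']}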

Load

The final step is to load the data into BigQuery. Since news doesn’t change by the second, a batch load job is fine. I created a dataset, ‘reddit_news’, to hold the output and created individual tables for each community.
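
The full script linked above has the exact load code; a minimal sketch with the google-cloud-bigquery client might look like this, where ‘my-project’ is a placeholder project ID and ‘reddit_news’ is the dataset named above:

from google.cloud import bigquery

# Batch-load one community's data frame into its own table in the reddit_news dataset.
bq_client = bigquery.Client()
table_id = 'my-project.reddit_news.news'   # placeholder project ID
load_job = bq_client.load_table_from_dataframe(df, table_id)
load_job.result()  # block until the batch load finishes

If the job runs daily, passing bigquery.LoadJobConfig(write_disposition='WRITE_TRUNCATE') to the load call will replace yesterday’s headlines instead of appending to them.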

Verify

While you’re welcome to wait until the last step of the process to see if everything loads, I find it far less frustrating to check each table.

To ensure transparency (and preempt frustration), I suggest writing logging statements. In the development stage, print statements will suffice.
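
For instance, Python’s built-in logging module covers this with a couple of lines (the message format is just one option):

import logging

# One log line per table load makes it obvious where a failure happened.
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
logging.info('Loaded %s rows into %s', len(df), table_id)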

Next Steps

Since my goal was to reduce my scrolling time on Reddit I created a batch load job I can trigger daily to get the top headlines in BigQuery. From there, I can export to a Google sheet or use SQL to prettify the output.

For an additional challenge, I suggest trying to convert this script to a Google Cloud Function and schedule it so that you’re always in the know.
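
If you try that, an HTTP-triggered Cloud Function entry point might look roughly like the following; main() here is a hypothetical wrapper around the extract, transform, and load steps above:

# Sketch of an HTTP-triggered Cloud Function entry point, scheduled via Cloud Scheduler.
def reddit_news_pipeline(request):
    main()  # hypothetical wrapper around the extract/transform/load steps above
    return 'Reddit news pipeline run complete', 200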

Create a job-worthy data portfolio. Learn how with my free project guide.
