Use the Reddit API and BigQuery to Build a Daily News Pipeline
Thanks to a well-documented API, we can use Python’s requests library to retrieve the top news stories as well as associated user data. Despite being known as a thread-based social network, Reddit is one of the leading news aggregators, with communities like r/news garnering over 20 million members.
Below, we’ll walk through how to combine Python with BigQuery to build a batch pipeline that can load data from multiple sources in one simple script.
Create App & Authenticate
Before we can make a single request, you’ll need to log into your Reddit account and create an app.
Tip: Choose ‘script’ and, since this will only be run locally, use a stock URL like ‘reddit.com’ in the URL fields.
Once you’ve registered the app under your developer account, you can authenticate with your Reddit username and password.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]='<GOOGLE_CRED>'
auth = requests.auth.HTTPBasicAuth('CLIENT_ID', 'SECRET_TOKEN')
data = {
'grant_type': 'client_credentials',
'username': 'username',
'password': 'password'
}
Tip: If you ever display this code in a public portfolio, be sure to hide these keys. You can store them in a Google Cloud Storage (GCS) bucket and read them in at runtime.
Once you’ve created the app, Reddit will display a client ID and secret key. We’ll use these in a POST call to retrieve an access token. All of this information can be saved in the ‘headers’ variable, which we’ll plug into every GET request.
headers = {'User-Agent': 'News/0.0.1'}
request = requests.post('https://www.reddit.com/api/v1/access_token', auth=auth, data=data, headers=headers)
token = request.json()['access_token']
headers = {**headers, **{'Authorization': f"bearer {token}"}}
FYI, the ‘**’ operator here is dictionary unpacking: it spreads the key-value pairs of each dictionary into a new one, merging our original headers with the Authorization header that carries the token. In other words, it gets our token into a format the API can easily understand.
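As a quick illustration of dictionary unpacking on its own (the values here are just placeholders):
base = {'User-Agent': 'News/0.0.1'}
extra = {'Authorization': 'bearer abc123'}
merged = {**base, **extra}  # keys from both dictionaries end up in one
print(merged)  # {'User-Agent': 'News/0.0.1', 'Authorization': 'bearer abc123'}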
Extract
We’ll be making all requests from this endpoint: ‘https://oauth.reddit.com/’.
We can modify this endpoint to point at the communities we’d like to examine. For r/news, we append ‘r/news’ and choose a sort order with the suffix ‘/hot’, which returns the currently featured posts. We can also cap the number of results we receive by passing a ‘limit’ parameter of 100.
news_requests = requests.get('https://oauth.reddit.com/r/news/hot', headers=headers, params={'limit': '100'})
To access the response, we call the .json() method; the ‘data’ key holds the payload, and its ‘children’ key contains the list of posts, each of which exposes descriptive fields like ‘title.’
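For example, here is a sketch of pulling the title of the first post out of the response, assuming the request above succeeded:
payload = news_requests.json()           # full JSON response
posts = payload['data']['children']      # list of posts
first_title = posts[0]['data']['title']  # each post's fields live under its own 'data' key
print(first_title)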
Transform
Since the JSON output isn’t ideal for analysis, we’ll convert this data to a data frame, with a column for each field we want in the final output. A simple for loop walks through the ‘data’ and ‘children’ keys, and each post is appended as a new row.
import pandas as pd

# Collect one dictionary per post, then build the data frame in a single step
# (DataFrame.append was removed in newer versions of pandas)
rows = []
for post in news_requests.json()['data']['children']:
    rows.append({
        'title': post['data']['title'],
        'upvote_ratio': post['data']['upvote_ratio'],
        'score': post['data']['score'],
        'ups': post['data']['ups'],
        'domain': post['data']['domain'],
        'num_comments': post['data']['num_comments']
    })
df = pd.DataFrame(rows)
Tip: Be sure to index into each child’s ‘data’ key when pulling fields so you don’t end up with an empty output.
This step can be repeated for as many communities as you want to request. In my case, since I’m a news junkie, I requested data from nine communities (see the full script here). A sketch of how that loop might look is below.
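One way to wrap the extraction in a reusable function and loop over several subreddits; the fetch_subreddit name and the subreddit list are just illustrative, not part of the original script:
def fetch_subreddit(name, headers, limit=100):
    """Return a data frame of 'hot' posts for a single subreddit."""
    resp = requests.get(f'https://oauth.reddit.com/r/{name}/hot',
                        headers=headers, params={'limit': str(limit)})
    rows = [{
        'title': post['data']['title'],
        'upvote_ratio': post['data']['upvote_ratio'],
        'score': post['data']['score'],
        'ups': post['data']['ups'],
        'domain': post['data']['domain'],
        'num_comments': post['data']['num_comments']
    } for post in resp.json()['data']['children']]
    return pd.DataFrame(rows)

# Example: iterate over a few communities (swap in whichever ones you follow)
frames = {sub: fetch_subreddit(sub, headers) for sub in ['news', 'worldnews', 'technology']}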
Load
The final step is to load the data into BigQuery. Since news doesn’t change by the second, a batch load job is fine. I created a dataset, ‘reddit_news,’ to hold the output, with an individual table for each community.
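Here is a minimal sketch of a batch load with the google-cloud-bigquery client, reusing the per-community data frames sketched above; the project name ‘your-project’ is a placeholder:
from google.cloud import bigquery

client = bigquery.Client()

# Write each community's data frame to its own table in the reddit_news dataset
for sub, frame in frames.items():
    table_id = f'your-project.reddit_news.{sub}'
    job = client.load_table_from_dataframe(frame, table_id)
    job.result()  # wait for the batch load job to finish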
Verify
While you’re welcome to wait until the last step of the process to see whether everything loads, I find it far less frustrating to check each table as I go.
To ensure transparency (and preempt frustration), I suggest writing logging statements. In the development stage, print statements will suffice.
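For instance, a quick sanity check that each table actually received rows, assuming the client and table naming from the load step above:
for sub in frames:
    table = client.get_table(f'your-project.reddit_news.{sub}')
    print(f'{sub}: {table.num_rows} rows loaded')  # one line of output per community table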
Next Steps
Since my goal was to reduce my scrolling time on Reddit, I created a batch load job I can trigger daily to land the top headlines in BigQuery. From there, I can export to a Google Sheet or use SQL to prettify the output.
For an additional challenge, try converting this script to a Google Cloud Function and scheduling it so that you’re always in the know.
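As a very rough sketch, a Cloud Functions entry point using the Python Functions Framework HTTP signature might look like the following; the run_pipeline helper is hypothetical and stands in for the extract, transform, and load steps above:
import functions_framework

@functions_framework.http
def load_reddit_news(request):
    # run_pipeline() would wrap the extract, transform, and load steps above
    run_pipeline()
    return 'Reddit news loaded to BigQuery', 200
You could then trigger the deployed function on a daily schedule with Cloud Scheduler.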
Create a job-worthy data portfolio. Learn how with my free project guide.