Zoumana Keita

How to Scrape Data From Reddit Using Python — With Code

This article is a comprehensive overview of the data collection process from Reddit for free.

Photo by Christopher Gower on Unsplash

Introduction

In today’s fast-growing data-driven environment, many institutions often face a lot of difficulties when collecting diverse and relevant data.

The challenges they face include, but are not limited to, high costs and time-consuming processes, which make the overall task complex and resource-intensive.

At the same time, millions of pieces of data of different types (text, image, audio, video) are created on social media platforms every day, which makes those platforms invaluable resources for overcoming these challenges.

For instance, Twitter has 330 million monthly active users, 134 million daily active users, 460k new accounts per day, and 140 million daily Tweets.

Furthermore, Reddit has 330 million monthly active users, 14 billion views per month, and 25 million daily votes.

Reddit and Twitter statistics reported in 2023 by Dustin Stout (Customized by Author)

These statistics are compelling and leave little doubt that these platforms are data-generating machines!

Failing to utilize these golden resources is like being thirsty in the middle of the sea.

In this article, you will learn how to efficiently and easily extract data from these social media platforms with a specific focus on Reddit, using the Python programming language.

Data Collection Process

Below is the overall workflow for collecting data from Reddit.

  • The user specifies the topic to collect data about, say Data Science, along with the total number of posts to retrieve on that topic, for example 600.
  • Then the App Creator module creates an instance of the Python Reddit API Wrapper (PRAW for short), which provides access to Reddit for data collection.
  • Next, the Data Preprocessor collects the posts according to the user's inputs.
  • Finally, the collected data is exported as a pandas data frame containing the Title, the Date, the URL, and the Comments of each post.
General workflow from topic definition to data acquisition (Image by Author)

Pre-requisites

There are two main requirements to successfully complete this tutorial: (1) create a Reddit App, and (2) configure the App instance.

#1 Create Reddit App

The first step is to access the Reddit login page, which will allow the creation of the API credentials in five main steps as highlighted below:

Reddit Credentials Generation Steps (Image by Author)

The credentials required to configure the Reddit App are generated after Step 5. They are personal information, so make sure to keep them private!

#2 Configure the App Instance

Next, an instance of the app can be created by filling in the client_id, client_secret, and user_agent arguments of the Reddit class below, after installing and importing the praw library.

$ pip install praw
# Import the PRAW API
from praw import Reddit

# Configure an instance of your app
my_Reddit_App = Reddit(client_id = 'my_client_ID',
                       client_secret = 'my_secret',
                       user_agent = 'my_user_agent')
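
Since the credentials are personal, one option is to keep them out of the source code and read them from environment variables instead. The block below is a minimal sketch of that idea, not part of the original tutorial: the variable names REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET, and REDDIT_USER_AGENT and the helper load_reddit_credentials are assumptions chosen for illustration.

```python
import os

def load_reddit_credentials(env=os.environ):
    """Collect the three PRAW settings from environment variables.

    The variable names below are hypothetical; pick any names you like.
    """
    keys = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    # Fail early with a readable message if something is not set
    missing = [name for name in keys.values() if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {param: env[name] for param, name in keys.items()}

# Demonstration with a stand-in mapping; omit the argument to read os.environ
demo_env = {
    "REDDIT_CLIENT_ID": "my_client_ID",
    "REDDIT_CLIENT_SECRET": "my_secret",
    "REDDIT_USER_AGENT": "my_user_agent",
}
credentials = load_reddit_credentials(demo_env)
print(sorted(credentials))  # ['client_id', 'client_secret', 'user_agent']
```

The resulting dictionary can then be unpacked into the constructor shown above with Reddit(**credentials).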

Data Collection

The data collection is performed on a subreddit, which is a Reddit forum dedicated to a specific topic such as Artificial Intelligence, Data Science, Deep Learning, and more.

For simplicity's sake, we will stick to only one topic. The example below collects posts from the Data Science subreddit, limited to the first 100 results returned by the .hot() function.

data_science_subreddit = my_Reddit_App.subreddit('DataScience').hot(limit=100)

print(data_science_subreddit)

Output:

<praw.models.listing.generator.ListingGenerator object at 0x7f4f4a68a410>

During the data collection, the 100 posts are not all downloaded immediately. Instead, a ListingGenerator instance is returned, which can then be used to access each post with the next() function.

next_reddit = next(data_science_subreddit)
print(type(next_reddit))

Output:

<class 'praw.models.reddit.submission.Submission'>
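
To make the lazy behaviour more concrete, here is a small sketch using a plain Python generator as a stand-in for the listing. It does not call Reddit or PRAW; fake_listing and fetch_log are hypothetical names used only to show that items are produced on demand and that a generator can be consumed only once.

```python
fetch_log = []  # records which items have actually been produced

def fake_listing(limit):
    """Yield placeholder posts one at a time, like a lazy listing."""
    for index in range(limit):
        fetch_log.append(index)  # runs only when this item is requested
        yield f"post-{index}"

listing = fake_listing(limit=100)
print(len(fetch_log))              # 0: nothing has been produced yet

first_post = next(listing)
print(first_post, len(fetch_log))  # post-0 1

remaining = list(listing)          # consumes everything that is left
print(len(remaining))              # 99: the first item is not replayed
```

The same caution applies to the real listing: once next() has been called on it, a later loop over the same object starts from the second post.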

We can see that the returned object is of type Submission, which corresponds to a post on Reddit. Each Submission typically includes properties such as the title, URL, and creation date.

To get an exhaustive list of all the attributes, we can use the dir() function combined with the helper function print_attributes_in_table() as illustrated below:

all_attributes = dir(next_reddit) 

# Helper function to print all the attributes
def print_attributes_in_table(data, columns):
    for i in range(0, len(data), columns):
        print(',\t'.join(data[i:i+columns]))

# Run the function
print_attributes_in_table(all_attributes, 5)
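
The output of dir() also contains many names starting with an underscore, which are internal by convention. A possible refinement, sketched below on a stand-in object rather than a real Submission, is to keep only the public names. FakeSubmission and its attributes are assumptions made up for this example.

```python
class FakeSubmission:
    """Stand-in with a few attributes, used instead of a real Submission."""

    def __init__(self):
        self.title = "Example title"
        self.url = "https://example.com"
        self.num_comments = 3
        self._fetched = False  # private by convention

post = FakeSubmission()

# dir() returns names in alphabetical order; drop the underscore-prefixed ones
public_attributes = [name for name in dir(post) if not name.startswith("_")]
print(public_attributes)  # ['num_comments', 'title', 'url']

# Individual values can then be read by name
print(getattr(post, "title"))  # Example title
```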

The result below shows all the attributes in a table of five columns, and some of the attributes are highlighted in an orange rectangle:

  • created : the creation time of the post, stored as a Unix timestamp
  • comments : the list of all the comments related to the post
  • num_comments : number of comments this post had
  • title : the title of the post
  • url : the URL of the post
The list of all the attributes (Image by Author)

The comments variable is of type praw.models.comment_forest.CommentForest, so we need to iterate through each comment in the CommentForest to get the values. This is done with the extract_comments_from_forest function.

def extract_comments_from_forest(submission):

    all_comments = []

    # Drop the "load more comments" placeholders (limit=0 removes them without fetching more)
    submission.comments.replace_more(limit=0)
    # Flatten the comment tree into a single list of comments
    comments = submission.comments.list()

    for comment in comments:
        all_comments.append(comment.body)

    return all_comments
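
To picture what flattening a comment forest means, the sketch below walks a toy tree of comments and replies in breadth-first order. It is only an illustration on stand-in objects: FakeComment and flatten_comments are hypothetical names, and the code is not PRAW's implementation.

```python
from collections import deque

class FakeComment:
    """Toy comment with a body and a list of replies."""

    def __init__(self, body, replies=None):
        self.body = body
        self.replies = replies or []

def flatten_comments(top_level_comments):
    """Return every comment body, visiting the tree level by level."""
    flat, queue = [], deque(top_level_comments)
    while queue:
        comment = queue.popleft()
        flat.append(comment.body)
        queue.extend(comment.replies)  # replies are visited after the current level
    return flat

forest = [
    FakeComment("A", [FakeComment("A1", [FakeComment("A1a")]), FakeComment("A2")]),
    FakeComment("B", [FakeComment("B1")]),
]
print(flatten_comments(forest))  # ['A', 'B', 'A1', 'A2', 'B1', 'A1a']
```

Top-level comments come first, followed by their replies, so nested replies end up in the same flat list as everything else.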

Now, we can build the pandas data frame containing all the posts, as implemented in the extract_top_N_posts function, which returns the first N=100 posts by default.

import pandas as pd
import datetime as dt

def extract_top_N_posts(topic_of_interest, N = 100):

  topic_of_interest = topic_of_interest.replace(' ', '')
  final_list_of_dict = []
  dict_result = {}

  submissions = my_Reddit_App.subreddit(topic_of_interest).hot(limit=N)

  for submission in submissions:
    dict_result["title"] = submission.title
    dict_result["creation_date"] = dt.datetime.fromtimestamp(submission.created)
    dict_result["id"] = submission.id
    dict_result["url"] = submission.url
    dict_result["comments"] = extract_comments_from_forest(submission)

    final_list_of_dict.append(dict_result)
    dict_result = {}

  # Create the dataframe
  df = pd.DataFrame(final_list_of_dict)

  return df

From the function:

  • We start by defining the relevant variables.
  • Then, in the for loop, we store the attributes of each post in a dictionary and append that dictionary to a list.
  • Finally, we convert the resulting list of dictionaries into a pandas data frame.
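
One detail worth illustrating is the creation date. The function passes submission.created to datetime.fromtimestamp, which interprets it as seconds since the Unix epoch and, without a tz argument, returns a time in the local time zone of the machine running the code. The sketch below shows the difference on a made-up timestamp; the value 1700000000 is an assumption for the example, not data from Reddit.

```python
import datetime as dt

created_timestamp = 1700000000  # hypothetical value, in seconds since the epoch

# Naive datetime in the local time zone of the machine (as in the function above)
local_time = dt.datetime.fromtimestamp(created_timestamp)

# Time-zone-aware datetime in UTC, identical on every machine
utc_time = dt.datetime.fromtimestamp(created_timestamp, tz=dt.timezone.utc)

print(utc_time.isoformat())  # 2023-11-14T22:13:20+00:00
```

If the data frame is shared across machines or time zones, the UTC variant may be the safer choice.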

Let’s see the function in action 🚀 by extracting the posts from the DataScience subreddit.

data_science_reddits_df = extract_top_N_posts('DataScience')

By checking the size of the dataset, we can see that there are 100 rows and 5 columns. Also, the first five rows are displayed with the display function.

print(data_science_reddits_df.shape)
# => (100, 5) #--> 100 rows and 5 columns

display(data_science_reddits_df.head())
First five rows of the dataframe (Image by Author)

Wonderful, you did it! But before wrapping up, let’s have a quick data visualization!

Quick Data Visualization

The goal here is to analyze the comments and get a broad overview of the topics being discussed in DataScience. Because the data is collected in real time, your results may differ from mine.

Some data cleaning is required prior to visualizing the comments, and the overall logic is implemented in the clean_text() function.

Data cleaning

The data cleaning function leverages the following libraries:

  • NLTK : one of the most used packages for text preprocessing
  • re : the regular expression package for data cleaning and extraction
  • emoji : used to detect and convert emojis into their textual format
from nltk.corpus import stopwords
import nltk 
nltk.download("stopwords")


import re 
from emoji import demojize

STOPWORDS = set(stopwords.words('english'))
MIN_LEN = 2

def clean_text(text):

  # Remove all closing and opening brackets
  text = re.sub(r"[\([{})\]]", "", text)

  # Remove URLs
  text = re.sub(r"http\S+", "", text)

  # Remove numeric values
  text = re.sub(r"[0-9]", "", text)

  # Remove stopwords
  text = " ".join([word for word in text.split() if word not in STOPWORDS])

  # Remove words whose length is <= MIN_LEN
  text = " ".join([word for word in text.split() if len(word) > MIN_LEN])

  # Convert emojis to textual format
  text = demojize(text)

  return text

The main text-cleaning steps performed in the function are: removing all opening and closing brackets, URLs, numeric values, stopwords, and all words of two characters or fewer, and finally converting emojis to text.
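
To see the effect of these steps on a concrete sentence, here is a simplified sketch that uses only the standard library. It is not the article's clean_text: the stopword set SAMPLE_STOPWORDS is a tiny hard-coded assumption instead of the NLTK list, the stopword comparison is lowercased, and the emoji step is left out.

```python
import re

SAMPLE_STOPWORDS = {"the", "is", "a", "to", "of", "and", "in", "for"}  # tiny stand-in list
MIN_LEN = 2

def clean_text_sketch(text):
    """Apply the bracket, URL, digit, stopword, and length filters in order."""
    text = re.sub(r"[\[\](){}]", "", text)  # remove brackets
    text = re.sub(r"http\S+", "", text)     # remove URLs
    text = re.sub(r"[0-9]", "", text)       # remove digits
    words = [w for w in text.split() if w.lower() not in SAMPLE_STOPWORDS]
    words = [w for w in words if len(w) > MIN_LEN]  # drop words of length <= MIN_LEN
    return " ".join(words)

sample = "The course (2023) is at https://example.com and it covers ML for beginners"
print(clean_text_sketch(sample))  # course covers beginners
```

Note that short but meaningful tokens such as "ML" are dropped by the length filter, which is worth keeping in mind when interpreting the word cloud.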

Visualization function

The visualization function generates the word cloud of the comments and the main modules used are:

  • wordcloud : the library that provides the WordCloud class for creating an instance of the wordcloud
  • matplotlib : used to generate the result in a graphical format
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def show_wordcloud(comments):

  all_comments = ' '.join(comments)

  wordcloud = WordCloud(width=5000, height=4000, 
                           background_color='black', 
                           min_font_size=10).generate(all_comments)

  plt.figure(figsize=(12, 12), facecolor='k', edgecolor='k')
  plt.imshow(wordcloud)
  plt.axis("off")
  plt.tight_layout(pad=0)
  plt.show()

The function creates a word cloud of the comments with a black background, a width of 5000 pixels, and a height of 4000 pixels. The final result is displayed in a 12x12-inch figure.

Show the final result

By combining all the above functions we get the following code, which provides all the necessary comments for better understanding the overall process.

import itertools

# Get all the comments
list_all_comments = data_science_reddits_df['comments'].tolist()

# Remove all the empty lists (empty comments)
list_all_comments = [list_of_comments for list_of_comments in list_all_comments if list_of_comments != []]

# Flatten all the comments into a single list
all_comments = list(itertools.chain.from_iterable(list_all_comments))

# Clean the comments
cleaned_comments = [clean_text(comment) for comment in all_comments]

# Show the wordcloud
show_wordcloud(cleaned_comments)
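
The flattening step can be tried on toy data without any Reddit access. The sketch below uses a made-up comments_per_post list as a stand-in for the comments column; it also shows that itertools.chain.from_iterable already skips empty lists, so the filtering step is optional.

```python
import itertools

# Stand-in for the 'comments' column: one list of comments per post
comments_per_post = [["great post", "thanks"], [], ["useful link"]]

# With the explicit filtering step, as in the code above
non_empty = [comments for comments in comments_per_post if comments != []]
all_comments = list(itertools.chain.from_iterable(non_empty))
print(all_comments)  # ['great post', 'thanks', 'useful link']

# Without filtering: empty lists simply contribute nothing
same_result = list(itertools.chain.from_iterable(comments_per_post))
print(same_result == all_comments)  # True
```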

Output:

Word cloud of the comments (Image by Author)

The bigger a word in the cloud, the more frequently it appears in the comments. We can see that data, work, data science, people, model, etc. are the most represented.

Additional preprocessing such as stemming and n-grams can be performed for a more meaningful word cloud.
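
As a rough idea of what the n-gram step could look like, the sketch below counts bigrams (pairs of consecutive words) with the standard library. The helper count_bigrams and the sample comments are assumptions for illustration; a real pipeline might rely on NLTK or another library instead.

```python
from collections import Counter

def count_bigrams(comments):
    """Count pairs of consecutive words across all comments."""
    counts = Counter()
    for comment in comments:
        words = comment.lower().split()
        counts.update(zip(words, words[1:]))  # pair each word with the next one
    return counts

sample_comments = [
    "data science is fun",
    "learn data science online",
    "data engineering jobs",
]
bigram_counts = count_bigrams(sample_comments)
print(bigram_counts.most_common(1))  # [(('data', 'science'), 2)]
```

Counting pairs this way keeps expressions such as "data science" together instead of splitting them into two unrelated words.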

Conclusion

Congratulations!!!🎉

I hope this article helped you acquire the skills needed to achieve your goal. Check out my article explaining how to collect data from Twitter using the Tweepy library.

The source code of the article is available on my GitHub.

Also, if you enjoy reading my stories and wish to support my writing, consider becoming a Medium member. It’s $5 a month, giving you unlimited access to thousands of Python guides and Data science articles.

By signing up using my link, I will earn a small commission at no extra cost to you.

Feel free to follow me on Twitter and YouTube, or say hi on LinkedIn.
