William Firth


Determine The Best Time To Post Content On Social Media Using Python

How to use Python and Reddit’s API to collect hourly active user data and determine the highest traffic times.

Custom Image from Data Set

Reasoning

Not long ago, I started woodworking and wanted to share my projects with the woodworking communities on Reddit and other social media platforms. I noticed that some posts did better than others even though the content and quality were roughly equivalent. This led me to believe that the timing of my posts must play a role, and that I could use my skills as a data scientist to optimize my posting schedule.

I chose to do my analysis on Reddit instead of Instagram, YouTube, or TikTok for a few reasons. First and foremost, Reddit has subreddits, which are communities of people centered around a specific topic. This makes it easy to narrow my analysis to, say, just the woodworking community instead of the entire platform user base. Other social media platforms do not have this level of compartmentalization for their communities. You can look at posts with certain hashtags, but you can’t easily look at a group of users who follow that hashtag.

The second key reason I chose to do my analysis on Reddit is that Reddit shows current active users for any given subreddit. You can see in real time how many people are browsing any particular subreddit.

The third and final reason is that Reddit has an easy-to-use API, which should make collecting this data simple compared to trying to scrape it from the web.

Assumptions

There are a couple of assumptions I will be making during this analysis.

  • Higher active user numbers correlate to a better time to post. However, a counterargument could be made that the more active users there are, the more posts will be made and the more competition for attention there will be. Either way, you can still use this guide to collect the data and then analyze it however you wish.
  • People browse Reddit at the same time that they browse other social media platforms, and thus this analysis applies to other social media. This one is anecdotal, but I find that I make the rounds between all social media in the same sitting. I’ll scroll through Instagram, then TikTok, then Reddit, one after the other.

Getting a Reddit API Client ID and Secret

There are many tutorials online for getting a Reddit client ID and secret. I will quickly go over the steps here, but if you run into trouble I suggest you Google ‘Reddit API’.

  • You will need a Reddit account. Create one if you don’t already have one.
  • Go to Preferences -> Apps -> ‘are you a developer…’
  • Name it whatever you want
  • Select the ‘Script’ option
  • Give it whatever description you want
  • Leave ‘About URL’ blank
  • Enter ‘http://localhost:8080’ in ‘redirect URL’
  • Copy your client ID and secret for use later

Collecting the Data

It’s finally time to do some coding.

Create a new directory where you will save a few Python and CSV files. This will make cross-referencing the files easier down the road. I named mine ‘active_user_data’.

Open up a new .py file in your favorite text editor and save it inside the ‘active_user_data’ directory. I named mine ‘active_users.py’.

Import the following libraries.

import praw  # Reddit API
import csv
import time  # used for the run duration and the hourly sleep
from datetime import datetime

Use ‘PRAW’ to log into Reddit. You’ll need the client_id and client_secret from setting up your API key earlier, as well as your username and password.

reddit = praw.Reddit(
    client_id="XXXXXXXXXXXX",
    client_secret="XXXXXXXXXXXXXXXXXXXXXXXX",
    user_agent="Users",
    username="XXXXXXXX",
    password="XXXXXXXX",
)
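If you would rather not keep your secret and password in the script itself, one option is to read them from environment variables. The following is only a sketch: the REDDIT_* variable names and the reddit_credentials helper are hypothetical names chosen for illustration, not part of PRAW.

```python
import os


def reddit_credentials(env=os.environ):
    """Build the keyword arguments for praw.Reddit from environment variables.

    The REDDIT_* names are hypothetical; use whatever names you prefer.
    """
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        # Fall back to the user agent used in this tutorial.
        "user_agent": env.get("REDDIT_USER_AGENT", "Users"),
        "username": env["REDDIT_USERNAME"],
        "password": env["REDDIT_PASSWORD"],
    }


# Usage (needs praw installed and the variables above set in your shell):
# reddit = praw.Reddit(**reddit_credentials())
```

A missing variable raises a KeyError immediately, which is easier to debug than a failed login later on.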

Define the subreddits you want to get data for. This can be just one subreddit or multiple. Later you will normalize all the data. For my purposes, I chose various woodworking and maker subreddits.

woodworking = reddit.subreddit("woodworking")
beginnerwoodworking = reddit.subreddit("beginnerwoodworking")
woodworkingplans = reddit.subreddit("woodworkingplans")
woodworkingvideos = reddit.subreddit("woodworkingvideos")
garageporn = reddit.subreddit("garageporn")
somethingimade = reddit.subreddit("somethingimade")
toolporn = reddit.subreddit("toolporn")
workbenches = reddit.subreddit("workbenches")

The data will be collected on a loop at some user-specified time interval. But before the loop can be set up, you need a way to store the data you are going to fetch. So first, create a list of field names that will act as the column names of your data and then an empty dictionary.

Field names:

field_names = [
    'woodworking active users',
    'beginnerwoodworking active users',
    'woodworkingplans active users',
    'woodworkingvideos active users',
    'garageporn active users',
    'somethingimade active users',
    'toolporn active users',
    'workbenches active users',
    'date',
]

Empty Dictionary:

users_dict = {
    "woodworking active users": [],
    "beginnerwoodworking active users": [],
    "woodworkingplans active users": [],
    "woodworkingvideos active users": [],
    "garageporn active users": [],
    "somethingimade active users": [],
    "toolporn active users": [],
    "workbenches active users": [],
    "date": [],
}
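Since the field names and the dictionary keys have to match exactly, it may be less error-prone to generate both from a single list of subreddit names. This is a sketch of that idea, not the way the original script does it; subreddit_names and column_name are names made up for this example.

```python
# The subreddits tracked in this tutorial.
subreddit_names = [
    "woodworking",
    "beginnerwoodworking",
    "woodworkingplans",
    "woodworkingvideos",
    "garageporn",
    "somethingimade",
    "toolporn",
    "workbenches",
]


def column_name(subreddit_name):
    """Return the column label used for one subreddit's counts."""
    return f"{subreddit_name} active users"


# One column per subreddit, plus the timestamp column.
field_names = [column_name(name) for name in subreddit_names] + ["date"]

# A separate empty list for every column.
users_dict = {name: [] for name in field_names}
```

Adding or removing a subreddit then only means editing subreddit_names.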

Set the amount of time you want this loop to run. The value is in seconds: 60 * 43801 is 43,801 minutes, or a little over 30 days from when you start the script. Change it to suit your own collection period.

t_end = time.time() + 60 * 43801
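Because the end time is just a number of seconds added to the current time, it can help to spell the arithmetic out. The sketch below uses timedelta from the standard library to show how long 60 * 43801 seconds really is, and what the value would be for exactly 28 days; the variable names are only for illustration.

```python
from datetime import timedelta

# The duration used in the script: 43,801 minutes expressed in seconds.
original_run = timedelta(seconds=60 * 43801)
print(original_run)  # 30 days, 10:01:00

# Exactly four weeks, if that is the period you want instead.
four_weeks = timedelta(days=28)
four_weeks_seconds = int(four_weeks.total_seconds())
print(four_weeks_seconds)  # 2419200

# This could then be used as: t_end = time.time() + four_weeks_seconds
```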

Now create the loop.

while time.time() < t_end:

For each iteration of the loop, you will do the following. All of the code below belongs inside the while loop, so it is indented one level:

  • Gather the date and time:
    now = datetime.now()
  • Fetch each of the subreddits
    woodworking._fetch()
    beginnerwoodworking._fetch()
    woodworkingplans._fetch()
    woodworkingvideos._fetch()
    garageporn._fetch()
    somethingimade._fetch()
    toolporn._fetch()
    workbenches._fetch()
  • Append the ‘active_user_count’ and date onto the corresponding dictionary entry
    users_dict["woodworking active users"].append(woodworking.active_user_count)
    users_dict["beginnerwoodworking active users"].append(beginnerwoodworking.active_user_count)
    users_dict["woodworkingplans active users"].append(woodworkingplans.active_user_count)
    users_dict["woodworkingvideos active users"].append(woodworkingvideos.active_user_count)
    users_dict["garageporn active users"].append(garageporn.active_user_count)
    users_dict["somethingimade active users"].append(somethingimade.active_user_count)
    users_dict["toolporn active users"].append(toolporn.active_user_count)
    users_dict["workbenches active users"].append(workbenches.active_user_count)
    users_dict["date"].append(now)
  • Export/overwrite the dictionary as a ‘userdata.csv’ file. This code block creates or opens the ‘userdata.csv’ file and writes the information from our dictionary into a standard dataframe format that you will use later in the analysis section.
    keys = sorted(users_dict.keys())
    # newline="" stops the csv module from writing blank lines on Windows
    with open("userdata.csv", "w", newline="") as outfile:
        writer = csv.writer(outfile, delimiter="\t")
        writer.writerow(keys)
        # zip(*...) turns the column lists into one row per reading
        writer.writerows(zip(*[users_dict[key] for key in keys]))
  • Print one of the counts and date/time to the terminal
    print("woodworking active user count:", woodworking.active_user_count)
    print(now)
  • Wait 1 hour before repeating. This can be changed to whatever time interval you want.
    time.sleep(3600)
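The append and export steps above can be sketched as two small functions, which makes the row/column transposition easier to see. This is an illustration with made-up counts rather than the script itself: record_sample, write_snapshot, the example file name, and the numbers are all assumptions for this example, and the file is written to the system temp directory so it does not touch your real data.

```python
import csv
import os
import tempfile
from datetime import datetime


def record_sample(users_dict, counts, now):
    """Append one reading per subreddit column, plus the timestamp."""
    for column, count in counts.items():
        users_dict[column].append(count)
    users_dict["date"].append(now)


def write_snapshot(users_dict, path):
    """Overwrite `path` with the whole dictionary as a tab-separated table."""
    keys = sorted(users_dict.keys())
    with open(path, "w", newline="") as outfile:
        writer = csv.writer(outfile, delimiter="\t")
        writer.writerow(keys)
        # zip(*columns) transposes the column lists into one tuple per row.
        writer.writerows(zip(*[users_dict[key] for key in keys]))


# Two made-up hourly readings for two subreddits.
example = {"woodworking active users": [], "toolporn active users": [], "date": []}
record_sample(example, {"woodworking active users": 120, "toolporn active users": 8},
              datetime(2022, 3, 4, 20, 0))
record_sample(example, {"woodworking active users": 95, "toolporn active users": 5},
              datetime(2022, 3, 4, 21, 0))

example_path = os.path.join(tempfile.gettempdir(), "userdata_example.csv")
write_snapshot(example, example_path)
```

Because the keys are sorted, ‘date’ ends up as the first column in the file, followed by the subreddit columns in alphabetical order.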

Let this script run until completion. I ran it on an extra Raspberry Pi I had lying around so I didn’t have to worry about it on my main machine. Come back to the next section once it's complete.

You can find this script as well as the following analysis script on my GitHub in the links below.

Analyze Your Data

Create a new file in your ‘active_user_data’ directory called ‘analysis.py’. This must be in the same folder/directory as your ‘active_users.py’ and ‘userdata.csv’ files.

Import the following libraries:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Read in your ‘userdata.csv’ file as a Pandas dataframe:

df = pd.read_csv('userdata.csv', sep = '\t')

If you print ‘df’ it should look something like this:

Figure: output of print(df). Note: this was printed in a Jupyter notebook.

The ‘date’ column is currently stored as strings. Convert it to datetime values and create ‘day’ and ‘hour’ columns:

df['date'] = pd.to_datetime(df['date'])
df['hour'] = df['date'].dt.hour
df['day'] = df['date'].dt.dayofweek
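To see what these numeric columns mean, here is the same conversion for a single timestamp using only the standard library. The timestamp is made up, but it is in the format that datetime.now() produces when written to the CSV file. Python's weekday() uses the same Monday=0 convention as the pandas dayofweek accessor.

```python
from datetime import datetime

# A made-up timestamp in the same format the collection script writes.
stamp = "2022-03-04 20:15:00.123456"

parsed = datetime.fromisoformat(stamp)
hour = parsed.hour      # hour of the day, 0 to 23
day = parsed.weekday()  # Monday=0 ... Sunday=6

print(hour, day)  # 20 4  (8 PM on a Friday)
```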

If you print ‘df’ again, you should see two new columns, ‘day’ and ‘hour’, represented numerically:

‘hour’ and ‘day’ columns added to the far right-hand side

Take notice that the user counts between different subreddits can vary drastically. Normalizing these values will allow for direct comparison. If you only collected data from one subreddit, you can skip this step.

Create a list of the dataframe columns excluding those that you do not want to include in the normalization: ‘date’, ‘hour’, and ‘day’.

my_col = list(df.columns.values)
my_col.remove('date')
my_col.remove('day')
my_col.remove('hour')

Iterate through the columns and replace the user count values with the normalized values.

for column in my_col:
    maximum = df[column].max()
    minimum = df[column].min()
    my_range = maximum - minimum
    df[column] = (df[column] - minimum) / my_range

Now get a sum of all the normalized values for each row. Add this as a new column:

df['normalized total'] = df[my_col].sum(axis = 1)

Printing out df again, you should see all the user values now between 0 and 1 as well as a new ‘normalized total’ column on the far right.
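The arithmetic behind the normalization and the total is easier to follow on a tiny example. The sketch below uses plain Python lists instead of pandas, with made-up counts for one large and one small subreddit. The guard for a column whose values are all identical is an addition for this example; the pandas loop above does not include it.

```python
def min_max_normalize(values):
    """Rescale values to the 0-1 range: (value - minimum) / (maximum - minimum)."""
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        # Assumption for this sketch: a constant column maps to all zeros
        # rather than dividing by zero.
        return [0.0 for _ in values]
    return [(value - lowest) / span for value in values]


# Made-up hourly counts for a large and a small subreddit.
big = [400, 800, 600]
small = [2, 4, 6]

normalized = [min_max_normalize(big), min_max_normalize(small)]
print(normalized)  # [[0.0, 1.0, 0.5], [0.0, 0.5, 1.0]]

# Sum across subreddits for each reading, like df[my_col].sum(axis=1).
totals = [sum(reading) for reading in zip(*normalized)]
print(totals)  # [0.0, 1.5, 1.5]
```

After normalization both subreddits contribute on the same 0 to 1 scale, so the small one is no longer drowned out by the large one in the total.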

First, plot just a single subreddit. I chose three different plots to look at: a box plot showing the user count by the hour of the day, a violin plot showing the user count by day of the week, and a scatter plot that shows the entire time series. Note that if you ran the normalization step above, this column now holds values between 0 and 1 rather than raw counts.

fig, axes = plt.subplots(3, 1)
fig.set_size_inches(12, 8)
fig.suptitle("Woodworking Subreddits Active User Data")
axes[0].set_title('User Count by Hour')
axes[1].set_title('User Count by Day - Mon=0')
axes[2].set_title('User Count by Day - raw')
sns.boxplot(ax=axes[0], data=df, x='hour', y='woodworking active users')
sns.violinplot(ax=axes[1], data=df, x='day', y='woodworking active users')
sns.scatterplot(ax=axes[2], data=df, x='date', y='woodworking active users')
fig.tight_layout()
fig.savefig('userdata.png')  # save first so the file is written before the window opens
plt.show()

From the scatter plot, you can see that there are high points and low points throughout the week. The violin plot shows the distribution of user counts for each day of the week. While Friday (day 4) does not have the highest median user count (the white dot in the center of the violin plot), it has a very tight distribution. Finally, the box plot shows the times of day that have the highest user counts. Mid to late evening looks to be the most active time. This leads to the conclusion that Friday evening is one of the best times to make a new post on r/woodworking!

This was just one subreddit though. Re-plotting using the normalized totals will give a more holistic view of the woodworking communities.

fig, axes = plt.subplots(3, 1)
fig.set_size_inches(12, 8)
fig.suptitle("Woodworking Subreddits Active User Data")
axes[0].set_title('User Count by Hour')
axes[1].set_title('User Count by Day - Mon=0')
axes[2].set_title('User Count by Day - raw')
sns.boxplot(ax=axes[0], data=df, x='hour', y='normalized total')
sns.violinplot(ax=axes[1], data=df, x='day', y='normalized total')
sns.scatterplot(ax=axes[2], data=df, x='date', y='normalized total')
fig.tight_layout()
fig.savefig('userdatanormalizedtotal.png')  # save first so the file is written before the window opens
plt.show()

Here a similar analysis can be done. Sunday may have the highest typical number of users, but they are spread throughout the day. Monday has the tightest distribution, and around 8 PM looks to be the most active hour. So Monday evening looks to be the best time to make posts in general!
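Reading the best hour or day off a plot is somewhat subjective, so it may be worth backing it up with numbers. The sketch below groups readings by hour or by day and takes the median of each group using only the standard library. The rows are made up for illustration and median_by is a hypothetical helper; with the real dataframe, a pandas groupby would be the more natural tool.

```python
from collections import defaultdict
from statistics import median


def median_by(rows, key):
    """Group rows by `key` and return the median 'normalized total' per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["normalized total"])
    return {group: median(values) for group, values in groups.items()}


# Made-up readings: day uses Monday=0, hour is 0 to 23.
rows = [
    {"day": 0, "hour": 20, "normalized total": 5.0},
    {"day": 0, "hour": 8, "normalized total": 2.0},
    {"day": 6, "hour": 20, "normalized total": 6.0},
    {"day": 6, "hour": 8, "normalized total": 1.0},
    {"day": 6, "hour": 12, "normalized total": 3.0},
]

by_hour = median_by(rows, "hour")
by_day = median_by(rows, "day")

# The hour and day with the highest median activity in this toy data.
best_hour = max(by_hour, key=by_hour.get)
best_day = max(by_day, key=by_day.get)
print(best_hour, best_day)  # 20 0
```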

Conclusion

This data was only collected over a couple of weeks. A more robust analysis would use months to years of data.

Further, this brief analysis has provided a hypothesis on the best times to post content on social media, but a lot of questionable assumptions were made to get to that hypothesis. Do more active users mean a better time to post? Is a tighter distribution better than a higher total number of users? Can results from this Reddit analysis be applied to other social media platforms? In Part II, I’ll take a deeper look at these assumptions and try to test the hypothesis.

Finally, I hope this tutorial has helped you on your quest for maximal likes, upvotes, and shares and has given you the tools to collect your own data and make your own analysis.

Links

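GitHub - firthfabrications/active_user_data_public: Collect active user data using reddit API
https://github.com/firthfabrications/active_user_data_public

Firth Fabrications on YouTube
https://www.youtube.com/channel/UC13QznG1Kk2HPJuKhStFsOA

https://www.instagram.com/firthfabrications/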
