avatarAnmol Tomar

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

5480

Abstract

<pre><span class="hljs-comment"># find nulls </span> df_movies.isnull().<span class="hljs-built_in">sum</span>()</pre></div><figure id="b398"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*t8VTg9oDF3fkG2c5.png"><figcaption>Image by Author</figcaption></figure><p id="0728">If majority of the values(&gt;70%) within a column are nulls, we can drop the column, else if only a few values are nulls(&lt;20%), we can do missing value treatment.</p><h2 id="cd34">STEP 4: Missing values treatment</h2><p id="f5de">Once we know the count of missing values, the next step is to treat the columns with missing values.</p><p id="8887">For illustration purposes, I am filling the nulls with the mean value of the columns, although there are more sophisticated methods of missing value treatment.</p><div id="67e6"><pre><span class="hljs-comment"># let's replace the nulls with mean </span>

df_movies[<span class="hljs-string">'Meta_score'</span>].fillna(df_movies[<span class="hljs-string">'Meta_score'</span>].mean()) df_movies[<span class="hljs-string">'Gross'</span>].fillna(df_movies[<span class="hljs-string">'Gross'</span>].mean()) df_movies.head()</pre></div><h2 id="22db">STEP 5: Outliers</h2><p id="71b5">The step 5 is to check for outliers. There are multiple ways of checking the outliers, we will be using the graphical method in this example.</p><p id="b225">We will pick one continuous variable (meta score) and check for the outlier by looking at the histogram.</p><div id="7178"><pre><span class="hljs-comment"># distribution of meta scores </span> plt.hist(df_movies[<span class="hljs-string">'Meta_score'</span>],bins = <span class="hljs-number">15</span>) plt.show()</pre></div><figure id="29b5"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*9LO-TjrDFBNTe-JV.png"><figcaption>Distribution of Meta Score (Image by Author)</figcaption></figure><p id="47d6">In this example we don’t see any outliers in the meta score as it ranges from 0–100.</p><h2 id="bea3">STEP 6: Outlier Treatment</h2><p id="c23f">Step 6 is to treat the outlier detected in step 5. There are different ways of treating the outliers such as 1) Capping the min and max value limits 2) Removing the rows with outlier values.</p><p id="449e">Although there is nothing off with the distribution of meta scores, for illustration purposes, let’s cap the minimum meta score value to 40.</p><div id="f01a"><pre><span class="hljs-comment"># capping the minimum meta score to 40</span> df_movies.loc[df_movies[<span class="hljs-string">'Meta_score'</span>] < <span class="hljs-number">40</span>,<span class="hljs-string">'Meta_score'</span>] = <span class="hljs-number">40</span>

<span class="hljs-comment">#check the minimum score </span> df_movies[<span class="hljs-string">'Meta_score'</span>].<span class="hljs-built_in">min</span>() <span class="hljs-comment"># output : 40.0</span></pre></div><h2 id="109c">STEP 7: Who — A person, member etc.</h2><p id="abf6">Step 7 is to answer the questions related to a person, member, etc. For example in our use case, we have actors and directors and we can formulate and answer the following question related to them.</p><ul><li>Who has directed the most number of top IMDB movies? (univariate)</li><li>Who has acted in most top IMDB movies? (univariate)</li><li>Which Actor-Director combination gave most top IMDB movies? (bivariate)</li><li>Who gave music in most top IMDB movies ? (Data not available)</li><li>And More …..</li></ul><p id="85be">Now, let’s answer these questions.</p><div id="2355"><pre><span class="hljs-comment">## Who has directed the most number of top IMDB movies ?</span> df_movies.groupby([<span class="hljs-string">'Director'</span>]).agg({<span class="hljs-string">'Series_Title'</span>:<span class="hljs-string">'count'</span>}).reset_index().rename(columns = {<span class="hljs-string">'Series_Title'</span>:<span class="hljs-string">'count'</span>}).
sort_values(<span class="hljs-string">'count'</span>,ascending = <span class="hljs-literal">False</span>).head(<span class="hljs-number">5</span>)

<span class="hljs-comment">## Who has acted in the most number of top movies </span> df_movies.groupby([<span class="hljs-string">'Star1'</span>]).agg({<span class="hljs-string">'Series_Title'</span>:<span class="hljs-string">'count'</span>}).reset_index().rename(columns = {<span class="hljs-string">'Series_Title'</span>:<span class="hljs-string">'count'</span>}).
sort_values(<span class="hljs-string">'count'</span>,ascending = <span class="hljs-literal">False</span>).head(<span class="hljs-number">5</span>)

<span class="hljs-comment">## Director - Actor works best </span> df_movies.groupby([<span class="hljs-string">'Director'</span>,<span class="hljs-string">'Star1'</span>])[<span class="hljs-string">'Series_Title'</span>].count().reset_index().
rename(columns = {<span class="hljs-string">'Series_Title'</span>:<span class="hljs-string">'Count'</span>}).sort_values(<span class="hljs-string">'Count'</span>,ascending = <span class="hljs-literal">False</span>).head(<span class="hljs-number">5</span>)</pre></div><figure id="c604"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*rW7sRQH_FEgWr7Mk.png"><figcaption>Directors with most top IMDB movies</figcaption></figure><figure id="866e"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*4f_SILB9hQMsAI-O.png"><figcaption>Actors with most top IMDB movies</figcaption></figure><figure id="1e71"><im

Options

g src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*fdXyfbb51971i7nI.png"><figcaption>Director-Actor combination with most top movies</figcaption></figure><h2 id="9aa1">STEP 8: When — time related questions!</h2><p id="3c83">Step 8 is to answer questions related to the time aspect- year, month, week, etc. In the context of our data, we can find the following :</p><ul><li>Find the years with most movies in IMDB top 1000 ? (univariate)</li></ul><div id="8243"><pre><span class="hljs-comment"># finding years with most movies in top 1000</span> year_dis = df_movies.groupby(<span class="hljs-string">'Released_Year'</span>)[<span class="hljs-string">'Series_Title'</span>].count().reset_index().
rename(columns = {<span class="hljs-string">'Series_Title'</span>:<span class="hljs-string">'Count'</span>}).sort_values(<span class="hljs-string">'Count'</span>,ascending = <span class="hljs-literal">False</span>).head(<span class="hljs-number">10</span>)

plt.bar(year_dis[<span class="hljs-string">'Released_Year'</span>].astype(<span class="hljs-built_in">str</span>), year_dis[<span class="hljs-string">'Count'</span>], width = <span class="hljs-number">0.5</span>) plt.xlabel(<span class="hljs-string">'Years'</span>) plt.ylabel(<span class="hljs-string">'Number of Movies'</span>) plt.title(<span class="hljs-string">'Years with most movies in IMDB top 1000'</span>) plt.show()</pre></div><figure id="6dac"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*LMefqDir-_qbwuBF.png"><figcaption></figcaption></figure><h2 id="5776">STEP 9: Where — Place related questions!</h2><p id="22e4">Step 9 is to look at the things from the “place” perspective, for example, country, state, regions etc. In context to our data set, we can find the following :</p><ul><li>Find countries with most movies in IMDB top 1000.</li></ul><p id="399d">Currently, we don’t have the data to answer this question.</p><p id="8116">While formulating these questions, I would recommend you to be as exhaustive as possible and don’t limit the thought process based on data availability because the data that is not available now could be obtained later.</p><h2 id="a6e6">STEP 10: What/Which</h2><p id="538b">Step 10 is about formulating questions about things that are not covered above. These are not related to people, place, time but everything apart from these. This is a bit subjective and takes some time to get used to.</p><ul><li>Which genres are featured most in the top 1000?</li><li>What is the duration of the top movies?</li><li>What is the correlation between the rating and gross earning?</li><li>and more…</li></ul><p id="17c6">For illustration purposes, we are answering the first question using the following code:</p><div id="fe40"><pre><span class="hljs-comment">### Which genres are featured most in top 1000 ? </span> genre_dis = df_movies.groupby(<span class="hljs-string">'Genre'</span>)[<span class="hljs-string">'Series_Title'</span>].count().reset_index().
rename(columns = {<span class="hljs-string">'Series_Title'</span>:<span class="hljs-string">'Count'</span>}).sort_values(<span class="hljs-string">'Count'</span>,ascending = <span class="hljs-literal">False</span>).head(<span class="hljs-number">5</span>) fig, ax = plt.subplots() plt.bar(genre_dis[<span class="hljs-string">'Genre'</span>], genre_dis[<span class="hljs-string">'Count'</span>], width = <span class="hljs-number">0.5</span>) plt.setp(ax.get_xticklabels(), rotation=<span class="hljs-number">30</span>, horizontalalignment=<span class="hljs-string">'right'</span>) plt.show()</pre></div><figure id="9cd2"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*cAekez3QqUJ2jjI8.png"><figcaption></figcaption></figure><p id="82dd">I hope you get an idea of how to implement the 10-step process to perform the data analysis.</p><h2 id="7b74">Conclusion</h2><p id="9175">In conclusion, a structured approach to data analysis is essential for anyone looking to make informed decisions based on data. By following the 10-step approach outlined in this blog, you’ll be well equipped to tackle even the most complex data analysis projects with confidence. Whether you’re a seasoned data analyst or just starting out, this approach will help you to avoid common pitfalls and maximize your productivity.</p><h2 id="6bff">Thank You!</h2><p id="c4ad"><i>If you find my blogs useful, then you can <a href="https://anmol3015.medium.com/subscribe"><b>follow me</b></a> to get direct notifications whenever I publish a story.</i></p><p id="aecf"><i>If you like to access all the amazing stories on Medium, consider supporting me and thousands of other writers by <a href="https://anmol3015.medium.com/membership"><b>signing up for a membership</b></a>. It only costs $5 per month, it supports us, writers, greatly.</i></p><div id="4a9c" class="link-block"> <a href="https://anmol3015.medium.com/membership"> <div> <div> <h2>Join Medium with my referral link - Anmol Tomar</h2> <div><h3>Read every story from Anmol Tomar (and thousands of other writers on Medium). Join now! You'll support me with a…</h3></div> <div><p>anmol3015.medium.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*x0DE9rrKhffqZHL8)"></div> </div> </div> </a> </div></article></body>

Use This 10-Step Approach to ACE Any Data Analysis

Adding structure to your data analysis

Pic Credits: Unsplash

Data analysis is the key to unlocking insights from vast amounts of data. But with more data being generated every day, it’s becoming increasingly challenging to manage and analyze it effectively.

What if I tell you that there was a simpler and more efficient way to analyze your data?

An approach only helps to minimize errors and inconsistencies but also ensures that you achieve your goals with greater accuracy and efficiency. In this blog, we’ll explore a 10-step approach to ace any data analysis project. Whether you’re a seasoned data analyst or just starting out, this approach will provide you with a roadmap to success. So, get ready to take your data analysis skills to the next level!

If you are new to Data Science and want to learn python for data analysis then you can check out my course and get certified: https://bit.ly/3148Qq6

IMDB Dataset

For the illustration purpose, we will pick the IMDB dataset for the top 1000 movies to understand the features/traits of top IMDB movies by applying the 10 steps process. The dataset has been take from kaggle, you can download it here.

Let’s start with the process:

STEP 1: Summary

The first step is to get the summary of columns present in the dataset, but before that, we will import the packages and read our IMDB dataset.

The summary helps us to understand the data by showing the statistics (count, number of unique values, top, mean, etc) for the columns as shown below:

# import our packages 
import pandas as pd 
import matplotlib.pyplot as plt

# reading the data
df_movies =  pd.read_csv('imdb_top_1000.csv')

# summary of the columns   
df_movies.describe(include = 'all')
Image by Author

Some takeaways:

  • Drishyam occurs twice in the top 1000 IMDB movies (One of the movies is in Malayalam and another is in Hindi).
  • Most of the movies(32) are from 2014.
  • The drama genre has most(85) movies.
  • Alfred Hitchcock has most(14) movies in the top 1000.
  • We can also see the distribution(min,25%,50%…etc) of IMDB ratings across these movies.

STEP 2: Data Types

Step 2 is to do a sanity checks of the data types of the columns of the dataframe. If we find some incorrect data types then we will correct them in this step.

print(df_movies.dtypes)
Image by Author

We can see that Released_Year, Runtime, Gross are object data types and should have been int64. Let’s convert the data types into int64:

# convert release year into int
df_movies.loc[df_movies.Released_Year == 'PG','Released_Year'] = '9999'
df_movies['Released_Year'] = df_movies['Released_Year'].astype(int)

# Gross into int
df_movies['Gross'] = df_movies['Gross'].str.replace(',','')
df_movies['Gross'] = df_movies['Gross'].fillna(0).astype(int)

# runtime into int 
df_movies['Runtime'] = df_movies['Runtime'].str.replace(' min','').astype(int)
df_movies.dtypes
Image by Author

STEP 3: Missing values

The 3rd step is to find the number of missing values across the columns of the dataframe. It’s important to understand the count of nulls so that we can gauge whether we need to treat them.

# find nulls 
df_movies.isnull().sum()
Image by Author

If majority of the values(>70%) within a column are nulls, we can drop the column, else if only a few values are nulls(<20%), we can do missing value treatment.

STEP 4: Missing values treatment

Once we know the count of missing values, the next step is to treat the columns with missing values.

For illustration purposes, I am filling the nulls with the mean value of the columns, although there are more sophisticated methods of missing value treatment.

# let's replace the nulls with mean 
df_movies['Meta_score'].fillna(df_movies['Meta_score'].mean())
df_movies['Gross'].fillna(df_movies['Gross'].mean())
df_movies.head()

STEP 5: Outliers

The step 5 is to check for outliers. There are multiple ways of checking the outliers, we will be using the graphical method in this example.

We will pick one continuous variable (meta score) and check for the outlier by looking at the histogram.

# distribution of meta scores 
plt.hist(df_movies['Meta_score'],bins = 15)
plt.show()
Distribution of Meta Score (Image by Author)

In this example we don’t see any outliers in the meta score as it ranges from 0–100.

STEP 6: Outlier Treatment

Step 6 is to treat the outlier detected in step 5. There are different ways of treating the outliers such as 1) Capping the min and max value limits 2) Removing the rows with outlier values.

Although there is nothing off with the distribution of meta scores, for illustration purposes, let’s cap the minimum meta score value to 40.

# capping the minimum meta score to 40
df_movies.loc[df_movies['Meta_score'] < 40,'Meta_score'] = 40

#check the minimum score 
df_movies['Meta_score'].min()
# output : 40.0

STEP 7: Who — A person, member etc.

Step 7 is to answer the questions related to a person, member, etc. For example in our use case, we have actors and directors and we can formulate and answer the following question related to them.

  • Who has directed the most number of top IMDB movies? (univariate)
  • Who has acted in most top IMDB movies? (univariate)
  • Which Actor-Director combination gave most top IMDB movies? (bivariate)
  • Who gave music in most top IMDB movies ? (Data not available)
  • And More …..

Now, let’s answer these questions.

## Who has directed the most number of top IMDB movies ?
df_movies.groupby(['Director']).agg({'Series_Title':'count'}).reset_index().rename(columns = {'Series_Title':'count'}).\
sort_values('count',ascending = False).head(5)

## Who has acted in the most number of top movies 
df_movies.groupby(['Star1']).agg({'Series_Title':'count'}).reset_index().rename(columns = {'Series_Title':'count'}).\
sort_values('count',ascending = False).head(5)

## Director - Actor works best 
df_movies.groupby(['Director','Star1'])['Series_Title'].count().reset_index().\
rename(columns = {'Series_Title':'Count'}).sort_values('Count',ascending = False).head(5)
Directors with most top IMDB movies
Actors with most top IMDB movies
Director-Actor combination with most top movies

STEP 8: When — time related questions!

Step 8 is to answer questions related to the time aspect- year, month, week, etc. In the context of our data, we can find the following :

  • Find the years with most movies in IMDB top 1000 ? (univariate)
# finding years with most movies in top 1000
year_dis = df_movies.groupby('Released_Year')['Series_Title'].count().reset_index().\
rename(columns = {'Series_Title':'Count'}).sort_values('Count',ascending = False).head(10)

plt.bar(year_dis['Released_Year'].astype(str), year_dis['Count'], width = 0.5)
plt.xlabel('Years')
plt.ylabel('Number of Movies')
plt.title('Years with most movies in IMDB top 1000')
plt.show()

STEP 9: Where — Place related questions!

Step 9 is to look at the things from the “place” perspective, for example, country, state, regions etc. In context to our data set, we can find the following :

  • Find countries with most movies in IMDB top 1000.

Currently, we don’t have the data to answer this question.

While formulating these questions, I would recommend you to be as exhaustive as possible and don’t limit the thought process based on data availability because the data that is not available now could be obtained later.

STEP 10: What/Which

Step 10 is about formulating questions about things that are not covered above. These are not related to people, place, time but everything apart from these. This is a bit subjective and takes some time to get used to.

  • Which genres are featured most in the top 1000?
  • What is the duration of the top movies?
  • What is the correlation between the rating and gross earning?
  • and more…

For illustration purposes, we are answering the first question using the following code:

### Which genres are featured most in top 1000 ? 
genre_dis = df_movies.groupby('Genre')['Series_Title'].count().reset_index().\
rename(columns = {'Series_Title':'Count'}).sort_values('Count',ascending = False).head(5)
fig, ax = plt.subplots()
plt.bar(genre_dis['Genre'], genre_dis['Count'], width = 0.5)
plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right')
plt.show()

I hope you get an idea of how to implement the 10-step process to perform the data analysis.

Conclusion

In conclusion, a structured approach to data analysis is essential for anyone looking to make informed decisions based on data. By following the 10-step approach outlined in this blog, you’ll be well equipped to tackle even the most complex data analysis projects with confidence. Whether you’re a seasoned data analyst or just starting out, this approach will help you to avoid common pitfalls and maximize your productivity.

Thank You!

If you find my blogs useful, then you can follow me to get direct notifications whenever I publish a story.

If you like to access all the amazing stories on Medium, consider supporting me and thousands of other writers by signing up for a membership. It only costs $5 per month, it supports us, writers, greatly.

Data Science
Data Analysis
Python
Programming
Data
Recommended from ReadMedium