Egor Howell

Summary

The provided web content offers a comprehensive guide on the Chi-Square Test of Independence, detailing its purpose, assumptions, hypothesis testing steps, and an example problem to illustrate its application in data science.

Abstract

The article delves into the Chi-Square Test of Independence, a statistical method used to determine if two categorical variables are related. It begins with an introduction that assumes prior knowledge of the Chi-Square Distribution and the Chi-Square goodness of fit test, providing links to previous articles for readers to catch up. The author emphasizes the importance of understanding this test as the categorical equivalent of correlation in continuous variables. The article outlines the assumptions necessary for the test, such as categorical variables, independent observations, and sufficient counts per category. It then walks through the hypothesis testing procedure, including stating the null and alternate hypotheses, setting a significance level, calculating the test statistic, and interpreting the results. An example problem is presented, demonstrating the test's application with synthetic data on age groups and political party preferences. The author concludes by discussing the implications of the test in data science, particularly for feature selection, and invites readers to subscribe to a YouTube channel and newsletter for further learning.

Opinions

  • The author highly recommends reading over the articles on Chi-Square Distribution and the Chi-Square goodness of fit test before diving into the test of independence.
  • The article suggests that the Chi-Square Test of Independence is a fundamental concept in data science, akin to the correlation analysis used for continuous variables.
  • The author provides additional resources for readers unfamiliar with hypothesis testing concepts, indicating a considerate approach to diverse audience knowledge levels.
  • By creating a YouTube channel and a newsletter, the author conveys a commitment to continuous education and community engagement in the field of data science.
  • The use of synthetic data in the example problem implies the author's belief in learning through practical application and real-world scenarios, despite the fictional nature of the data.

Chi-Square Test of Independence

A simple and concise explanation of the Chi-Square Test of Independence

Photo by Tra Nguyen on Unsplash

Introduction

So far we have covered the Chi-Square Distribution and have used that to explain the Chi-Square goodness of fit test. You can check out both of these articles here:

I would highly recommend reading over those articles before this one!

In this post we will cover the other well-known Chi-Square test, the test of independence. This test determines whether two categorical variables are related, i.e. whether they are independent or dependent. You can think of this loosely as the categorical version of the correlation between two continuous variables.

In this article we will run through the procedure for carrying out the test of independence and end with an example problem to show how to implement it in practice!

Assumptions

  • Both variables are CATEGORICAL
  • Observations are INDEPENDENT
  • The expected COUNT in each cell is AT LEAST 5 (a common rule of thumb)
  • Categories are MUTUALLY EXCLUSIVE: each observation falls into exactly one cell
  • Data is chosen RANDOMLY

Hypothesis Testing Steps

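Here we will lay out the basic steps involved in almost every hypothesis test:

  • State the null and alternate hypotheses
  • Set a significance level
  • Calculate the test statistic
  • Interpret the results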

These are the basic, surface-level steps for any hypothesis test. I haven’t explained every topic in detail as that would make this article very long! However, for the unfamiliar reader, I have linked sites for each step so you can explore these ideas in more depth.

I also have other posts that cover the concepts in hypothesis testing in a more broken down format that you can check out here:

Chi-Square Test Statistic and Degrees of Freedom

For the Chi-Square Test, the test statistic we need to compute is:

$$\chi^2_v = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

  • v is the degrees of freedom
  • O_i is the observed count in cell i
  • E_i is the expected count in cell i
  • n is the number of terms in the sum; for the test of independence this is the number of cells in the contingency table

Note: squaring the numerator is where the link to the Chi-Square distribution comes from, and it also ensures every term is non-negative, so each cell can only ‘add’ to the statistic.
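As a minimal sketch of the statistic described above, the sum can be written in a few lines of Python. The function name `chi_square_statistic` and the counts are hypothetical, invented purely for illustration, and are not the data used later in this article.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over paired observed and expected counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    # Each term is non-negative because the numerator is squared.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts, made up for illustration only.
observed = [10, 20, 30]
expected = [15, 15, 30]
statistic = chi_square_statistic(observed, expected)
print(round(statistic, 4))
```

For a contingency table, the observed and expected counts would simply be flattened cell by cell before being passed in.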

The degrees of freedom, v, is computed as:

$$v = (r - 1)(c - 1)$$

  • r is the number of rows in the contingency table (the number of categories in variable 1)
  • c is the number of columns in the contingency table (the number of categories in variable 2)
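The degrees of freedom depend only on the shape of the contingency table, which the short sketch below makes explicit. The helper name `degrees_of_freedom` and the placeholder table are assumptions for illustration; the table is stored as a list of rows.

```python
def degrees_of_freedom(table):
    """Return (r - 1) * (c - 1) for a contingency table given as a list of rows."""
    n_rows = len(table)
    n_cols = len(table[0])
    if any(len(row) != n_cols for row in table):
        raise ValueError("every row must have the same number of columns")
    return (n_rows - 1) * (n_cols - 1)


# Hypothetical 2 x 3 table; only its shape matters here.
placeholder_table = [[12, 8, 10], [9, 11, 10]]
print(degrees_of_freedom(placeholder_table))
```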

Both of these formulas will make much more sense when we go over an example problem next.

Example Problem

We want to see if age has an impact on what political party you vote for.

Data

We collect a random sample of 135 people and display it in the following contingency table broken down by age and political party:

Table created by author.

Note: This is purely synthetic data I made up myself and is of no relation to any real political party.

Hypothesis

Let's start by stating our hypotheses:

  • H_0: Age has no impact on the political party you vote for. The two variables are independent.
  • H_1: Age does have an impact on the political party. The two variables are dependent.

Significance Level and Critical Value

For this example we will use a 5% significance level. As we have 2 degrees of freedom (using the formula above):

$$v = (r - 1)(c - 1) = 2$$

Using the significance level, the degrees of freedom and a Chi-Square probability table, we find our critical value to be 5.991. This means our Chi-Square statistic needs to be greater than 5.991 for us to reject the null hypothesis and conclude that the variables are not independent.
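The critical value can also be checked without a table in this particular case. With exactly 2 degrees of freedom the Chi-Square distribution has the closed-form CDF 1 - exp(-x / 2), so the critical value at significance level α is -2 ln(α). The sketch below relies on that special case only; the function name is hypothetical.

```python
import math


def critical_value_two_dof(alpha):
    """Critical value of the Chi-Square distribution with 2 degrees of freedom.

    Valid ONLY for 2 degrees of freedom, where the CDF is 1 - exp(-x / 2).
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    return -2.0 * math.log(alpha)


print(round(critical_value_two_dof(0.05), 3))  # 5.991
```

For other degrees of freedom there is no such shortcut, and a probability table or a statistics library (for example `scipy.stats.chi2.ppf`) would be needed instead.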

Calculating Expected Counts

We now need to determine the expected count for each cell in our contingency table. These are the values we would expect if the null hypothesis were true, and they are calculated using the following formula:

$$E = \frac{n_r \times n_c}{n_T}$$

Where n_r and n_c are the row and column totals for certain categories and n_T is the total number of counts.
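The expected-count formula can be applied to every cell at once. Below is a minimal sketch, assuming the table is a list of rows; the function name `expected_counts` and the counts are hypothetical and are not the article's data.

```python
def expected_counts(table):
    """Expected count per cell under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]  # zip(*table) iterates over columns
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical observed counts, made up for illustration only.
observed = [[10, 20, 30], [20, 20, 20]]
for row in expected_counts(observed):
    print(row)
```

Note that the expected counts keep the same row and column totals as the observed table, which is a handy sanity check.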

For example, the expected count for ages 18–30 who voted Liberals is:

Equation generated by author in LaTeX.

We can then populate the contingency table with these expected values (in brackets):

Table produced by author.

Chi-Square Statistic

It is now time to calculate the Chi-Square statistic using the formula above:

Equation generated by author in LaTeX.

This equals 37.2!

Therefore, our statistic is much greater than the critical value and so we can reject the null hypothesis!
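The decision step can be sketched using only the numbers reported above (a statistic of 37.2, a critical value of 5.991 and a 5% significance level). As an extra check, the p-value is computed with the closed form exp(-x / 2), which holds only because this example has 2 degrees of freedom; the variable names are assumptions for illustration.

```python
import math

chi_square = 37.2       # statistic reported in this example
critical_value = 5.991  # critical value at the 5% level with 2 degrees of freedom
alpha = 0.05

# Reject the null hypothesis when the statistic exceeds the critical value.
reject_null = chi_square > critical_value

# Upper-tail probability; this closed form is valid for 2 degrees of freedom only.
p_value = math.exp(-chi_square / 2)

print(reject_null)
print(p_value < alpha)
```

Both checks agree: the statistic lies far beyond the critical value, and the p-value is far below the significance level.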

Conclusion

In this article we have described the Chi-Square test of independence and worked through an example. The test assesses whether two categorical variables are dependent on each other. In Data Science it is used for feature selection, where we only want to keep modelling features that have an effect on the target.

Another Thing!

I have a free newsletter, Dishing the Data, where I share weekly tips for becoming a better Data Scientist. There is no “fluff” or “clickbait,” just pure actionable insights from a practicing Data Scientist.
