avatarAmit Chauhan

Summary

The web content provides an overview of fundamental statistical concepts, including Z, T, and P statistics, and their applications in data science and machine learning.

Abstract

The article demystifies the complexities of Z-Statistics, T-Statistics, and P-Statistics, which are essential for data inference in machine learning. It breaks down the definitions and concepts in statistics, starting with the types of data such as numerical and categorical, and the importance of visualizing data through histograms and scatter plots. The text delves into measures of central tendency and spread, including mean, median, mode, range, mean deviation, variance, and standard deviation. It also covers covariance and correlation, which are crucial for understanding the relationship between variables. The article explains normal distribution and the significance of Z-Distributions and T-Distributions in standardizing data and making inferences from small samples. It introduces the concept of confidence intervals and hypothesis testing, emphasizing the role of P-Distribution in determining the significance of results. The conclusion underscores the practical value of statistics in data science, encouraging readers to engage with the subject through real-world examples.

Opinions

  • The author believes that understanding statistics is crucial for making inferences and solving problems in data science.
  • Statistics is presented not as a boring subject but as a valuable tool that can reveal insights from data when applied with practical examples.
  • The article suggests that learning statistics is often overlooked but is a necessary daily dose for those in data science and machine learning fields.
  • The author implies that visual representation of data, such as histograms and scatter plots, is indispensable for interpreting large datasets.
  • There is an emphasis on the importance of standardizing data through Z and T distributions to facilitate comparison and inference.
  • The author advocates for the use of confidence intervals as a practical approach to estimating population parameters.
  • Hypothesis testing is highlighted as a decision-making tool, with the P-value being critical for rejecting or accepting the null hypothesis.
  • The article concludes with the opinion that a basic understanding of statistics can make data science more interesting and magical for the practitioner.

Statistics

Z- Statistics, T-Statistics, P-Statistics are Still Confusing you?

Definitions and concepts in Statistics for machine learning

Photo by Ruthson Zimmerman on Unsplash

Understanding statistics look like a side parallel road for data science and machine learning people. But learning statistics is worth making the inferences and solutions from the data. The road not to be taken by many people should have to accept the statistics with their daily dose. Well, in this article we will discuss the Z, T and P statistics distribution and will try to learn why we use them in data science. Before diving into this concept we will discuss some basic definition and terms as shown below:

Topics to be covered:

Section 1: Types of Data, Histogram and Scatter plot

Section 2: Central Measures values and Measures of Spread

Section 3: Covariance and Correlation

Section 4: Z, T — Distributions and confidence intervals

Section 5: Hypothesis and P — Distribution

Section 1:

Types of Data

Acquiring knowledge about statistics is a need in a data science career. Before jumping we should know with whom we are dealing with oh! that is obvious “DATA”. It is not like that data will come to you and tell all the inferences, we need to find a way to deal with different types of data.

Data is comprised of numbers and words that can be in the measurable or observational form. We cannot do the same operation with all different types of data, for this we need to identify and according to that, we need to test and visualize.

Numerical Data: This type of data deals with numbers which can be in a quantitative form either discrete or in continuous.

  • Discrete data is like a whole number ( 2, 10, 20, 15, etc) which tells a direct specification of the quantity that we count easily.
  • Continuous data is like a range means the value which falls in some particular measuring range ( Kg, Km, cm, etc ).

Categorical Data: This type deals with qualitative data that we can describe. It comes in a group of two or more types of a different description.

Example: Binary values ( 0 and 1 ),

Nominal Data: They are not in order but shows some groups or category like seasons, brand names, flower names, etc.

Ordinal Data: Well these types of data are those where we give them some rating or in some order.

Histogram and Scatter plot

If data is very big then we cannot sit all day and check every line of data records to take out some information. That’s where graphs and plots of data come into the picture.

Different types of plots are Bar, Line and Pie Charts, Histogram and Scatter plots, etc.

Histogram

Histograms although similar to bar charts but in the histogram, the bar falls in some range i.e. continuous type. The bar charts can have a gap between two bars but histogram doesn’t.

Difference between Bar and Histogram Chart. A photo from mathisfun

Scatter Plot

The scatter plot shows the relationship between two variable points in the record.

Scatter plot between Time and Marks. A photo by Author

Section 2:

Central Measures value

When we want to know about something from a big record we choose one thing which is similar to all others. So, in numbers, we choose a common and around value through Mean, Median and Mode.

Measures of Spread

The Range

When we arrange the data in ascending order and make a difference from the big number to a small number is called range.

Mean Deviation

This spread tells us how far the values from the measured central value.

  • First, take a mean of the values.
  • Then make a difference from the mean value to all values.
  • Then calculates the spread distance from the mean.

Example:

Mean Deviation. A photo by Author

Mean Deviation = (4+3+2+2+1+3+7)/7 = 3.14

The mean deviation tells that on average the values are 3.14 far away from the middle.

Variance and Standard Deviation

Variance is the average of the sum of the squared of absolute values. It tells how the data is spread around the mean.

Standard deviation is how to spread if from the simple mean.

Variance and Standard Deviation. A photo by Author

If we see in the photo the 6+3.6 is 9.6 and 6-3.6 is 2.4 so, we can say that numbers from 2.4 to 9.6 are in standard but if we see that the value 13 is outside the standard value range. From this, we can observe which numbers are very small or very big.

Section 3:

Covariance

Covariance is used to calculate the movement of the two variables. With the result, we can say that both variables are moving together, opposite or independent of each other. A positive value means moving together, negative values moving opposite and equal to zero means both are independent of each other.

Covariance > 0 → moving together

Covariance < 0 → moving opposite

Covariance = 0 → independent to each other

The formula to calculate covariance is

Covariance ( sample ) = Sum of (( x-x mean ) * ( y-y mean ))/n-1

The value can come in any scale from 0.000824 to 23,434,000 to interpret these types of values we do correlation coefficient.

Correlation

Correlation is used to know the relationship between two variables.

The following types are the correlation shown in the photo below:

Correlation relationship with different types of data. A photo from mathisfun

Section 4:

Distribution is the probability that we measure and see the probability distribution through graphs that’s how we see inferential statistics.

Normal Distribution

The normal distribution is sometimes called Gaussian Distribution in statistics terms. Many people also say this distribution is a bell curve because the mean, median and mode are equal and symmetrical with no skew.

Normal Distribution Curve. A photo by Author

The spread of the curve is defined by the standard deviation. We can observe that more number of the observations or values are in the middle means around the mean. If we change the mean to a smaller and larger value then we see the distribution will shift towards right and left. If we change the standard deviation then the distribution becomes wider or narrower. The narrower means most of the observation around the center.

Z— Distributions

The changes we see in the distribution by changing mean and standard deviation are not stable. So, we need to standardize it so that we can calculate and make inferences. To make it standardize we have to make the mean to “0” and Standard deviation to “1” i.e ( 0, 1 ). To standardize we calculate the z score as shown below:

z = (value — mean)/standard deviation

When we calculate the mean and standard deviation of all z values it comes around 0 and 1 only obtaining a standard normal distribution.

T — Distribution

It is generally called Student’s t distribution used to inference from the small sample with unknown variance. The difference between the normal distribution and t-distribution through the graph is t-distribution has a higher dispersion on the left and right sides of the distribution.

N Distribution and t-distribution. A photo by Author

The formula for t-distribution is

T-Distribution formula. A photo by Author

In a common thumb table, we use a t-distribution table up to 50 samples, if the sample size more than 50 then use a z-distribution table.

The degree of freedom is the one less of a total number of samples. Alpha is the percentage left after the confidence level percentage.

Confidence interval

When we do some estimation value for a population to be exact but with a practical approach, it comes wrong. So, we go for the confidence interval for a particular value to be in that range.

We see in the photo that we can choose a confidence interval with these common values and with respect to their alpha values.

Section 5:

Hypothesis

When we estimated the correct confidence interval, we try to find the value in that range, this need to make a decision. The hypothesis test to be done based on some points.

  • Formulate a hypothesis
  • Need the right test to be done
  • Execute the test
  • Make a decision

A hypothesis is just think or idea on which we do testing. There are two types of hypotheses.

  1. Null hypothesis
  2. Alternate hypothesis

The null hypothesis is the one which we think about to be a parameter and the rest is an alternate hypothesis. When we calculate the Z value for the hypothesis then we also calculate the significance value also based on alpha. The confidence interval on the distribution and divide the alpha value for both left and right sides.

Region for the hypothesis. A photo by Author

P — Distribution

P- value gives the smaller significance level on which we can reject the null hypothesis on sample data.

Conclusion:

A basic understanding of statistics can give valuable information in data science. Believe me statistics in not boring. when you try to learn with practical examples it will do magic for you. As statistics is a very vast subject so, I tried to give some basic ideas. Hope this article will give a little interesting for readers.

I hope you like the article. Reach me on my LinkedIn and twitter.

Recommended Articles

  1. NLP — Zero to Hero with Python

2. Python Data Structures Data-types and Objects

3. MySQL: Zero to Hero

Data Science
Data Visualization
Machine Learning
Artificial Intelligence
Programming
Recommended from ReadMedium