Python
Master Data Analysis using Trustworthy Databases
A novice’s guide to Python that lets you unravel the mysteries of data analysis.
Python is an excellent programming language for beginners who want to handle and analyze data. We’ll also talk about how to find relevant data sources and a bit about statistical testing. Let’s break this down into a few steps:
Step 1: Finding Relevant and Trustworthy Databases
Depending on your field, there are many places where you can find datasets. For instance, some common data repositories are UCI Machine Learning Repository, Kaggle, Google Dataset Search, and Government databases (e.g., data.gov, Eurostat).
Make sure to verify the credibility of the data source and its relevance to your field. I will tell you how to do that in the last section of this article.
Also, note the data format (CSV, JSON, SQL, etc.). CSV is the simplest and most common format; as we’ll see in Step 2, pandas can read all of these.
Step 2: Collecting the Data in a Table
Let’s use Python’s pandas library to handle our data.
First, let’s import pandas. If you don’t have it, install it via pip:
pip install pandas
Then, in your Python script, do the following:
import pandas as pd
Assuming we have a CSV file named my_data.csv, we can load this into a pandas DataFrame (which is essentially a table) like so:
df = pd.read_csv('my_data.csv')
You can view the first five rows of your DataFrame using:
print(df.head())
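If your data comes in JSON or a SQL database rather than CSV (the other formats mentioned in Step 1), pandas has readers for those too. Here is a minimal sketch; the file names, database file, and table name are placeholders you would replace with your own:
import pandas as pd
import sqlite3
# CSV: the most common case
df_csv = pd.read_csv('my_data.csv')
# JSON: a simple records-style file loads the same way
df_json = pd.read_json('my_data.json')
# SQL: run a query through a database connection
# (a local SQLite file and table name are assumed here purely for illustration)
conn = sqlite3.connect('my_data.db')
df_sql = pd.read_sql('SELECT * FROM my_table', conn)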
Step 3: Running Traditional Data Analyses
Let’s start by computing some basic statistics.
You can easily get mean, median, mode, standard deviation, etc., with pandas’ built-in functions:
mean = df['YourColumnName'].mean()
median = df['YourColumnName'].median()
mode = df['YourColumnName'].mode()[0]
std_dev = df['YourColumnName'].std()
Replace 'YourColumnName' with the actual name of your column.
To get a quick statistical summary of all numeric columns, you can use the describe function:
print(df.describe())
It gives the count, mean, std, min, 25%, 50%, 75%, and max of each numerical column in your DataFrame.
For more advanced analyses, you may need libraries such as numpy, scipy, or statsmodels. For example, if we’re comparing two groups of data, we may use a t-test from the scipy library.
Step 4: Applying Statistical Tests and Interpreting Them
Let’s use the scipy library to conduct a t-test. This test checks whether two group means are significantly different from each other. First, we’ll import the library:
from scipy import stats
Let’s say our DataFrame has a 'group' column that labels each row as 'group1' or 'group2' and a numeric 'score' column, and we want to check if the two groups’ mean scores are significantly different. We can do this with a t-test:
group1 = df[df['group']=='group1']['score']
group2 = df[df['group']=='group2']['score']
t_stat, p_val = stats.ttest_ind(group1, group2)
The t-statistic is a measure of the difference between the two means relative to the variability in the data. The p-value is a measure of the probability that an effect as large as the observed effect could occur if there was no actual difference between the groups.
If the p-value is small (commonly, less than 0.05), we reject the null hypothesis, meaning the group means are significantly different.
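Putting that together, here is a small interpretation sketch using the t_stat and p_val computed above (the 0.05 threshold is just the conventional cut-off, not a law):
alpha = 0.05  # conventional significance threshold
if p_val < alpha:
    print(f"t = {t_stat:.3f}, p = {p_val:.3f}: the group means differ significantly")
else:
    print(f"t = {t_stat:.3f}, p = {p_val:.3f}: no significant difference detected")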
Note: Always ensure that the assumptions of each statistical test are met. For a t-test, assumptions include independent observations, normally distributed data, and equal variances among groups (although the test is somewhat robust to this assumption). You may need to conduct additional tests or use different statistical tests if these assumptions are violated.
Validating Assumptions
Let’s cover some of the key assumptions that many statistical tests have, and how to test them:
1. Independence of observations
This is often a given based on study design. For example, if you randomly sample individuals from a population, then measure a variable of interest, the assumption of independence is reasonable. However, if your sampling is not independent (e.g., if you are studying the effect of a treatment on patients within the same family), then more complex statistical methods may be needed.
2. Normality
This assumes your data is distributed normally. You can visually inspect this assumption with a histogram or a Q-Q plot, or you can use a statistical test like the Shapiro-Wilk test.
For instance, to create a histogram in Python:
import matplotlib.pyplot as plt
plt.hist(df['YourColumnName'])
plt.show()
For a Q-Q plot:
import statsmodels.api as sm
import matplotlib.pyplot as plt
sm.qqplot(df['YourColumnName'], line='s')
plt.show()
For the Shapiro-Wilk test:
from scipy.stats import shapiro
stat, p = shapiro(df['YourColumnName'])
if p > 0.05:
    print('Probably Gaussian')
else:
    print('Probably not Gaussian')
This test gives you a p-value. If p-value > 0.05, we fail to reject the null hypothesis, i.e., the data seems normally distributed.
3. Equality (or “homogeneity”) of variances
This assumption can be checked with Levene’s test or Bartlett’s test.
For instance, to conduct Levene’s test in Python:
from scipy.stats import levene
stat, p = levene(group1, group2)
if p > 0.05:
    print('Variances are probably equal')
else:
    print('Variances are probably not equal')
If p-value > 0.05, we fail to reject the null hypothesis, i.e., the variances seem equal.
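If the variances turn out not to be equal, one common option is Welch’s t-test, which drops the equal-variance assumption. scipy exposes it through the equal_var argument of ttest_ind; a small sketch reusing group1 and group2 from earlier:
# Welch's t-test does not assume equal variances
t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=False)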
These are some ways you can validate the assumptions of your statistical tests. There may be more depending on the specific test you’re using.
Remember that if your data violates these assumptions, it doesn’t mean you can’t analyze it — it just means you might have to use different methods or techniques. For instance, non-parametric tests do not assume normality and equal variances, and there are ways to transform your data to better meet these assumptions.
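For example, the Mann-Whitney U test is a common non-parametric alternative to the two-sample t-test. A minimal sketch, again reusing group1 and group2:
from scipy.stats import mannwhitneyu
# compares the two groups without assuming normality
u_stat, p_val = mannwhitneyu(group1, group2)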
Also, modern practices often value practical significance (effect sizes, confidence intervals) over strict adherence to these assumptions.
How to Transform Your Data
There are several techniques for data transformation that can help meet the assumptions of normality and equal variances.
1. Logarithmic Transformation (log)
This is a commonly used transformation that can handle skewness towards large values. Note that it only works on strictly positive values.
import numpy as np
df['log_transformed_column'] = np.log(df['YourColumnName'])
2. Square Root Transformation (sqrt)
This transformation is used for data that follows a Poisson distribution (where variance is proportional to mean). It can also be useful for dealing with skewness towards large values.
df['sqrt_transformed_column'] = np.sqrt(df['YourColumnName'])
3. Square Transformation
It can be used when the data is skewed towards small values.
df['square_transformed_column'] = np.square(df['YourColumnName'])
4. Inverse Transformation (1/x)
This transformation can strongly reduce skewness for positive values. It reverses the ordering of the values (large values become small and vice versa) and can also stabilize variance. Note that it is undefined for zeros.
df['inverse_transformed_column'] = 1/df['YourColumnName']
5. Box-Cox Transformation
This is a parametrized transformation method that seeks the power transformation of the data that best reduces skewness. It is a bit more complex than the above transformations, as it requires computing the lambda (λ) parameter, and it only works on strictly positive data. This can be computed in Python using scipy.stats.boxcox.
from scipy import stats
df['boxcox_transformed_column'], fitted_lambda = stats.boxcox(df['YourColumnName'])
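The fitted lambda is worth keeping: it records which power transformation was applied, and scipy.special.inv_boxcox can map the transformed values back to the original scale. A small sketch using the variables from above:
from scipy.special import inv_boxcox
# the power parameter that Box-Cox selected
print(f"Fitted lambda: {fitted_lambda:.3f}")
# map the transformed values back to the original scale
original_values = inv_boxcox(df['boxcox_transformed_column'], fitted_lambda)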
What is Data Skewness?
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. In simpler terms, skewness tells you the amount and direction of skew (departure from horizontal symmetry).
The skewness value can be:
- Positive: If the tail on the right side (larger values) of the distribution is longer or fatter. The data are said to be right-skewed, also known as positively skewed. For instance, income distribution in an economy is usually positively skewed because a small number of people earn significantly more than the average.
- Negative: If the tail on the left side (smaller values) of the distribution is longer or fatter. The data are said to be left-skewed, also known as negatively skewed. For instance, age at death from natural causes is usually negatively skewed, since most people live to old age and die then rather than dying young.
- Zero: If the tails on both sides of the mean have approximately the same length, the distribution is symmetric. This is the case for a perfectly normal distribution.
You can calculate skewness in Python using scipy.stats.skew:
from scipy.stats import skew
data_skewness = skew(df['YourColumnName'])
print(f"The skewness of the data is: {data_skewness}")
When we say that a dataset has skewness “towards large values” or “towards small values”, we’re talking about the direction of the skew.
- Skewness towards large values means that there’s a long tail in the distribution on the right, towards the larger values. This is also called right skewness or positive skewness. In this case, the mean of the dataset is typically greater than the median, because the mean is influenced more by these large values.
- Skewness towards small values means that there’s a long tail in the distribution on the left, towards the smaller values. This is also called left skewness or negative skewness. In this case, the mean of the dataset is typically less than the median, because the mean is dragged down by these smaller values.
In other words, the direction of skewness is determined by the direction of the tail. If the tail of the distribution is extended towards larger numbers, it’s right-skewed; if it’s extended towards smaller numbers, it’s left-skewed.
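To see the mean/median relationship concretely, here is a small sketch that simulates a right-skewed sample with numpy (the exponential distribution is chosen purely for illustration):
import numpy as np
rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=10_000)  # right-skewed by construction
# for a right-skewed sample the mean typically exceeds the median
print(f"mean = {sample.mean():.3f}, median = {np.median(sample):.3f}")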
This is important because skewness can affect many aspects of data analysis, from exploratory analysis to the choice of model. If a data set has high skewness, then certain statistical techniques might be less effective or inappropriate, because many techniques assume that the data is normally distributed.
How to Verify the Credibility of the Data Source?
Verifying the credibility and relevance of a data source can be a bit subjective and can depend on the specifics of the field you’re working in, but here are some general steps you can take:
- Consider the source: Databases from well-known academic institutions, governmental databases, or reputable research firms are usually trustworthy. Be skeptical of data from sources that aren’t well-known or don’t have a good reputation.
- Check the methodology: If possible, look at how the data was collected. Was the methodology sound and unbiased? Was the sample size large enough? The more you know about how the data was gathered, the better you can judge its quality.
- Look for peer reviews or citations: Data that has been used in peer-reviewed research is more likely to be reliable. If the dataset is often cited in other works, it’s a good sign that it’s credible.
- Check if the data is up-to-date: In many fields, the relevancy of the data can decrease over time. Make sure the data isn’t outdated, especially if you’re working on a topic that evolves rapidly (like technology-related fields).
- Relevance to your field: The data should directly or indirectly relate to your study’s variables of interest. Review the descriptions or metadata associated with the dataset to ensure it contains the information you need.
- Cross-verify the data: If there are other similar datasets available, you can cross-verify the information. This may not always be possible but can be a good way to ensure accuracy when it is.
Conclusion
Thank you for making it this far. I hope you enjoyed the journey and found what you were looking for. It is important to remember that it is possible to analyze any data out there and draw almost any conclusion from it. This is why the databases you use must be credible and relevant, and so must your analysis.