avatarB. Chen

Summary

The web content provides a comprehensive guide on six practical Pandas tricks to enhance data analysis efficiency, covering techniques for column selection, data type conversion, missing value handling, feature transformation, data importing from clipboard, and merging multiple files into a single DataFrame.

Abstract

The article titled "6 Pandas Tricks You Should Know to Speed Up Your Data Analysis" offers insights into advanced data manipulation techniques using the Pandas library in Python. It begins by illustrating how to select columns based on data types, which is crucial for focusing on relevant data subsets. The guide then delves into methods for converting strings to numbers, addressing common issues such as invalid characters and missing values. It emphasizes the importance of detecting and handling missing data effectively, providing various strategies to either drop or replace missing values. The article also demonstrates how to transform continuous numerical features into categorical ones, which can be pivotal in certain machine learning algorithms. For quick data entry, the article introduces the read_clipboard() function for creating DataFrames from copied data. Lastly, it presents a memory-efficient approach to combine multiple datasets into one DataFrame using the glob module and concat() function, differentiating between row-wise and column-wise data concatenation.

Opinions

  • The author suggests that selecting columns by data types is a fundamental step in streamlining data analysis.
  • Converting strings to numbers is presented

6 Pandas tricks you should know to speed up your data analysis

Some of the most helpful Pandas tricks

Photo by Alvaro Reyes on Unsplash

In this article, you’ll learn some of the most helpful Pandas tricks to speed up your data analysis.

  1. Select columns by data types
  2. Convert strings to numbers
  3. Detect and handle missing values
  4. Convert a continuous numerical feature into a categorical feature
  5. Create a DataFrame from the clipboard
  6. Build a DataFrame from multiple files

Please check out my Github repo for the source code.

1. Select columns by data types

Here are the data types of the Titanic DataFrame

df.dtypes
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Let’s say you need to select the numeric columns.

df.select_dtypes(include='number').head()

This includes both int and float columns. You could also use this method to

  • select just object columns
  • select multiple data types
  • exclude certain data types
# select just object columns
df.select_dtypes(include='object')
# select multiple data types
df.select_dtypes(include=['int', 'datetime', 'object'])
# exclude certain data types
df.select_dtypes(exclude='int')

2. Convert strings to numbers

There are two methods to convert a string into numbers in Pandas:

  • the astype() method
  • the to_numeric() method

Let’s create an example DataFrame to have a look at the difference.

df = pd.DataFrame({ 'product': ['A','B','C','D'], 
                   'price': ['10','20','30','40'],
                   'sales': ['20','-','60','-']
                  })

The price and sales columns are stored as strings and so result in object columns:

df.dtypes
product    object
price      object
sales      object
dtype: object

We can use the first method astype() to perform the conversion on the price column as follows

# Use Python type
df['price'] = df['price'].astype(int)
# alternatively, pass { col: dtype }
df = df.astype({'price': 'int'})

However, this would have resulted in an error if we tried to use it on the sales column. To fix that, we can use to_numeric() with argument errors='coerce'

df['sales'] = pd.to_numeric(df['sales'], errors='coerce')

Now, invalid values - get converted into NaN and the data type is float.

3. Detect and handle missing values

One way to detect missing values is by using info() method and take a look at the column Non-Null Count.

df.info()
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

When the dataset is large, we can count the number of missing values instead. df.isnull().sum() returns the number of missing values for each column

df.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

df.isnull().sum().sum() returns the total number of missing values.

df.isnull().sum().sum()
886

In addition, we can also find out the percentage of values that are missing by running df.isna().mean()

ufo.isna().mean()
PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64

Dropping missing values

To drop rows if any NaN values are present

df.dropna(axis = 0)

To drop columns if any NaN values are present

df.dropna(axis = 1)

To drop columns in which more than 10% of values are missing

df.dropna(thresh=len(df)*0.9, axis=1)

Replacing missing values

To replace all NaN values with a scalar

df.fillna(value=10)

To replace NaN values with the values in the previous row.

df.fillna(axis=0, method='ffill')

To replace NaN values with the values in the previous column.

df.fillna(axis=1, method='ffill')

The same, you can also replace NaN values with the values in the next row or column.

# Replace with the values in the next row
df.fillna(axis=0, method='bfill')
# Replace with the values in the next column
df.fillna(axis=1, method='bfill')

The other common replacement is to replace NaN values with the mean. For example to replace NaN values in column Age with the mean.

df['Age'].fillna(value=df['Age'].mean(), inplace=True)

For more about missing values in Pandas, please check out Working with missing values in Pandas.

4. Convert a continuous numerical feature into a categorical feature

In the step of data preparation, it is quite common to combine or transform existing features to create a more useful one. One of the most popular ways is to create a categorical feature from a continuous numerical feature.

Let’s take a look at the Age column from the Titanic dataset

df['Age'].head(8)
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
Name: Age, dtype: float64

Age is a continuous numerical attribute, but what if you want to convert it into a categorical attribute, for example, convert ages to groups of age ranges: ≤12, Teen (≤18), Adult (≤60), and Older (>60)

The best way to do this is by using the Pandas cut() function:

import sys
df['ageGroup']=pd.cut(
    df['Age'], 
    bins=[0, 13, 19, 61, sys.maxsize], 
    labels=['<12', 'Teen', 'Adult', 'Older']
)

And calling head() on the ageGroup column should also display the column information.

df['ageGroup'].head(8)
0    Adult
1    Adult
2    Adult
3    Adult
4    Adult
5      NaN
6    Adult
7      <12
Name: ageGroup, dtype: category
Categories (4, object): [<12 < Teen < Adult < Older]

5. Create a DataFrame from the clipboard

Pandas read_clipboard() function is a very handy way to get data into a DataFrame as quickly as possible.

Suppose we have the following data and we want to create a data frame from it:

     product   price
0    A         10
1    B         20
2    C         30
4    D         40

We just need to select the data and copy it to the clipboard. Then, we can use the function to read it into a DataFrame.

df = pd.read_clipboard()
df

6. Build a DataFrame from multiple files

Your dataset might spread across multiple files, but you want to read the dataset into a single DataFrame.

One way to this is to read each file into its own DataFrame, combine them together, and then delete the original DataFrame, but that would be memory inefficient.

A better solution is to use the built-in glob module (Thanks to Data School Pandas Tricks).

In this case, glob() is looking in the data directory for all CSV files that begin with the word “data_row_”. glob() returns filenames in an arbitrary order, which is why we sorted the list using sort() the function.

For row-wise data

Let’s say that our dataset is spread across 2 files, data_row_1.csv and data_row_2.csv, in row-wise

To create a DataFrame from the 2 files.

files = sorted(glob('data/data_row_*.csv'))
pd.concat((pd.read_csv(file) for file in files), ignore_index=True)

sorted(glob('data/data_row_*.csv')) returns filenames. After that, we read each of the files using read_csv() and pass the results to the concat() function, which will concatenate the rows into a single DataFrame. In addition, to avoid duplicate value in the index, we tell the concat() to ignore the index (ignore_index=True) and instead use the default integer index.

For column-wise data

Let’s say that our dataset is spread across 2 files, data_col_1.csv and data_col_2.csv, in column-wise.

To create a DataFrame from the 2 files.

files = sorted(glob('data/data_col_*.csv'))
pd.concat((pd.read_csv(file) for file in files), axis=1)

This time, we tell the concat() function to concatenate along the columns axis.

That’s it

Thanks for reading.

Please checkout the notebook on my Github for the source code.

Stay tuned if you are interested in the practical aspect of machine learning.

Python
Data Science
Pandas
Machine Learning
Pandas Dataframe
Recommended from ReadMedium