avatarB. Chen

Summary

The web content provides essential tips for optimizing data analysis in Python using Pandas' read_csv() function, covering character encoding, header customization, column selection, date parsing, data type setting, invalid value handling, data appending, and chunked loading of large CSV files.

Abstract

The article "Pandas read_csv() tricks you should know to speed up your data analysis" offers a comprehensive guide to efficiently handle CSV data using Pandas in Python. It addresses common issues encountered during the initial data import phase, such as dealing with various character encodings to avoid decoding errors, managing headers for datasets with missing or non-standard headers, selecting specific columns to improve performance, correctly parsing date columns, setting appropriate data types to prevent type errors, identifying and locating invalid values, appending data to existing CSV files, and processing huge CSV files in chunks to conserve memory. The author, Bindi Chen, provides practical examples and links to a GitHub repository with notebooks for readers to follow along and implement these tricks in their own data science projects.

Opinions

  • The author emphasizes the importance of specifying the correct encoding when reading CSV files to avoid UnicodeDecodeErrors.
  • It is suggested that using the header=None parameter is beneficial when dealing with CSV files lacking headers, allowing Pandas to generate default integer headers.
  • The usecols parameter is highlighted as a performance booster when working with datasets that have numerous columns, as it allows for the selection of necessary columns during the import process.
  • The article advocates for the use of parse_dates to properly handle date columns, ensuring they are recognized as datetime objects rather than generic objects.
  • The author points out that setting data types with dtype can lead to TypeErrors if there are invalid values, and provides methods to identify these values.
  • The practice of appending data to existing CSV files is presented as a useful feature, particularly when using mode='a' and header=None in the to_csv() function.
  • For handling large datasets, the author recommends using the chunksize parameter to read data in manageable chunks, demonstrating how to perform aggregations on each chunk and concatenate the results.
  • The author's opinion is that these tricks are essential for anyone working with CSV data in Python, and they encourage readers to apply these techniques to enhance their data analysis workflows.

Pandas read_csv() tricks you should know to speed up your data analysis

Some of the most helpful Pandas tricks to speed up your data analysis

Search result from https://www.flaticon.com/search?word=csv%20file

Importing data is the first step in any data science project. Often, you’ll work with data in CSV files and run into problems at the very beginning. In this article, you’ll see how to use the Pandas read_csv() function to deal with the following common problems.

  1. Dealing with different character encodings
  2. Dealing with headers
  3. Dealing with columns
  4. Parsing date columns
  5. Setting data type for columns
  6. Finding and locating invalid value
  7. Appending data to an existing CSV file
  8. Loading a huge CSV file with chunksize

Please check out my Github repo for the source code.

1. Dealing with different character encodings

Character encodings are specific sets of rules for mapping from raw binary byte strings to characters that make up the human-readable text [1]. Python has built-in support for a list of standard encodings.

Character encoding mismatches are less common today as UTF-8 is the standard text encoding in most of the programming languages including Python. However, it is definitely still a problem if you are trying to read a file with a different encoding than the one it was originally written. You are most likely to end up with something like below or DecodeError when that happens:

Source from Kaggle character encoding

The Pandas read_csv() function has an argument call encoding that allows you to specify an encoding to use when reading a file.

Let’s take a look at an example below:

First, we create a DataFrame with some Chinese characters and save it with encoding='gb2312' .

df = pd.DataFrame({'name': '一 二 三 四'.split(), 'n': [2, 0, 2, 3]})
df.to_csv('data/data_1.csv', encoding='gb2312', index=False)

Then, you should get an UnicodeDecodeError when trying to read the file with the default utf8 encoding.

# Read it with default encoding='utf8'
# You should get an error 
pd.read_csv('data/data_1.csv')

In order to read it correctly, you should pass the encoding that the file was written.

pd.read_csv('data/data_1.csv', encoding='gb2312')

2. Dealing with headers

Headers refer to the column names. For some datasets, the headers may be completely missing, or you might want to consider a different row as headers. The read_csv() function has an argument called header that allows you to specify the headers to use.

No headers

If your CSV file does not have headers, then you need to set the argument header to None and the Pandas will generate some integer values as headers

For example to import data_2_no_headers.csv

pd.read_csv('data/data_2_no_headers.csv', header=None)

Consider different row as headers

Let’s take a look at data_2.csv

x1,       x2,      x3,     x4
product,  price,   cost,   profit
a,        10,      5,      1
b,        20,      12,     2
c,        30,      20,     3
d,        40,      30,     4

It seems like more sensible columns name would be product, price, … profit, but they are not in the first row. The argument header also allows you to specify the row number to use as the column names and the start of data. In this case, we would like to skip the first row and use the 2nd row as headers:

pd.read_csv('data/data_2.csv', header=1)

3. Dealing with columns

When your input dataset contains a large number of columns, and you want to load a subset of those columns into a DataFrame, then usecols will be very useful.

Performance-wise, it is better because instead of loading an entire DataFrame into memory and then deleting the spare columns, we can select the columns we need while loading the dataset.

Let’s use the same dataset data_2.csv and select the product and cost columns.

pd.read_csv('data/data_2.csv',
            header=1,
            usecols=['product', 'cost'])

We can also pass the column index to usecols:

pd.read_csv('data/data_2.csv',
            header=1,
            usecols=[0, 1])

4. Parsing date columns

Date columns are represented as objects by default when loading data from a CSV file.

df = pd.read_csv('data/data_3.csv')
df.info()
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   date     4 non-null      object
 1   product  4 non-null      object
 2   price    4 non-null      int64 
 3   cost     4 non-null      int64 
 4   profit   4 non-null      int64 
dtypes: int64(3), object(2)
memory usage: 288.0+ bytes

To read the date column correctly, we can use the argument parse_dates to specify a list of date columns.

df = pd.read_csv('data/data_3.csv', parse_dates=['date'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   date     4 non-null      datetime64[ns]
 1   product  4 non-null      object        
 2   price    4 non-null      int64         
 3   cost     4 non-null      int64         
 4   profit   4 non-null      int64         
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 288.0+ bytes

Sometime date is split up into multiple columns, for example, year, month, and day. To combine them into a datetime, we can pass a nested list to parse_dates.

df = pd.read_csv('data/data_4.csv',
                 parse_dates=[['year', 'month', 'day']])
df.info()
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   year_month_day  4 non-null      datetime64[ns]
 1   product         4 non-null      object        
 2   price           4 non-null      int64         
 3   cost            4 non-null      int64         
 4   profit          4 non-null      int64         
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 288.0+ bytes

To specify a custom column name instead of the auto-generated year_month_day, we can pass a dictionary instead.

df = pd.read_csv('data/data_4.csv',
                 parse_dates={ 'date': ['year', 'month', 'day'] })
df.info()

If your date column is in a different format, then you can customize a date parser and pass it to the argument date_parser:

from datetime import datetime
custom_date_parser = lambda x: datetime.strptime(x, "%Y %m %d %H:%M:%S")
pd.read_csv('data/data_6.csv',
             parse_dates=['date'],
             date_parser=custom_date_parser)

For more about parsing date columns, please check out this article

5. Setting data type

If you want to set the data type for the DataFrame columns, you can use the argument dtype , for example

pd.read_csv('data/data_7.csv',
                 dtype={
                     'Name': str,
                     'Grade': int
                 })

6. Finding and locating invalid values

You might get the TypeError when setting data type using the argument dtype

It is always useful to find and locate the invalid values when this error happens. Here is how you can find them:

df = pd.read_csv('data/data_8.csv')
is_error = pd.to_numeric(df['Grade'], errors='coerce').isna()
df[is_error]

7. Appending data to an existing CSV file

You can specify a Python write mode in the Pandas to_csv() function. For appending data to an existing CSV file, we can use mode='a':

new_record = pd.DataFrame([['New name', pd.to_datetime('today')]],
                      columns=['Name', 'Date'])
new_record.to_csv('data/existing_data.csv',
              mode='a',
              header=None,
              index=False)

8. Loading a huge CSV file with chunksize

By default, Pandas read_csv() function will load the entire dataset into memory, and this could be a memory and performance issue when importing a huge CSV file.

read_csv() has an argument called chunksize that allows you to retrieve the data in a same-sized chunk. This is especially useful when reading a huge dataset as part of your data science project.

Let’s take a look at an example below:

First, let’s make a huge dataset with 400,000 rows and save it to big_file.csv

# Make up a huge dataset
nums = 100_000
for name in 'a b c d'.split():
    df = pd.DataFrame({
        'col_1': [1]*nums,
        'col_2': np.random.randint(100, 2000, size=nums)
    })
    df['name'] = name
    df.to_csv('data/big_file.csv',
              mode='a',
              index=False,
              header= name=='a')

Next, let’s specify a chucksize of 50,000 when loading data with read_csv()

dfs = pd.read_csv('data/big_file.csv',
                  chunksize=50_000,
                  dtype={
                      'col_1': int,
                      'col_2': int,
                      'name': str
                  })

Let’s perform some aggregations on each chunk and then concatenate the result into a single DataFrame.

res_dfs = []
for chunk in dfs:
    res = chunk.groupby('name').col_2.agg(['count', 'sum'])
    res_dfs.append(res)
pd.concat(res_dfs).groupby(level=0).sum()

Let’s validate the result against a solution without chunksize

pd.read_csv('data/big_file.csv',
              dtype={
                  'col_1': int,
                  'col_2': int,
                  'name': str
              }).groupby('name').col_2.agg(['count', 'sum'])

And you should get the same output.

That’s it

Thanks for reading.

Please checkout the notebook on my Github for the source code.

Stay tuned if you are interested in the practical aspect of machine learning.

Here are some related articles

References

Python
Data Science
Pandas
Machine Learning
Read Csv
Recommended from ReadMedium