Ramesh Nelluri - Ideas to Life

Summary

The web content provides a Python-based solution for splitting large CSV files into smaller files, either randomly or equally, based on user-defined parameters.

Abstract

The article "How to Split a Large CSV File with Python" offers a practical guide for data professionals who need to divide large CSV files into manageable segments. It presents a Python script that leverages the pandas library to read, split, and write smaller CSV files according to the user's specifications. The script allows for flexibility in determining the number of rows per file, enabling both random splitting for testing purposes and equal splitting for uniform distribution of data. The process involves finding the total number of rows in the original file, prompting the user for the desired range of rows for each split file, and then reading and writing the data accordingly, while also handling the last file's size if it differs from the specified range.

Opinions

  • The author emphasizes the importance of splitting large data files for various data processing tasks.
  • Using pandas is recommended for its powerful data manipulation capabilities, which simplify the process of reading and writing CSV files.
  • The article suggests that providing the same minimum and maximum row count results in an equal split of the CSV file, which can be particularly useful for consistent data segmentation.
  • The author provides a subjective view on the ease of implementing the split functionality using Python, highlighting the simplicity of the code.
  • Encouragement for reader engagement is evident, with the author inviting feedback and sharing, as well as suggesting support via a coffee donation.
  • The article promotes the author's expertise and the value of following their work on various platforms, including Medium, LinkedIn, and Twitter.
  • A cost-benefit analysis is presented for becoming a Medium member through the author's referral link, emphasizing the value of access to a wide range of articles, including the author's future work.

How to Split a Large CSV File with Python

A guide to splitting a large CSV file based on input parameters.

Photo by Joshua Brown on Unsplash

As a Data Engineer, Data Scientist, or Data Analyst, each of us has likely needed a split utility to break a CSV or other data file into smaller files in order to investigate data or test scenarios.

This piece of Python code helps to split CSV files randomly or equally based on input parameters.

It is easy to split files using pandas in Python. pandas can pick a given number of rows (nrows) and skip a given number of rows (skiprows) when reading a file, and this piece of code orchestrates how many rows to skip and pick based on the total number of rows in the file.
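For example, here is a minimal illustration of these two parameters; the file name example.csv and the row counts are assumptions for illustration only:

import pandas as pd

## Read 100 lines, starting after the first 200 lines, from an
## illustrative file (example.csv is an assumed name)
chunk = pd.read_csv("example.csv", header=None, nrows=100, skiprows=200)
print(len(chunk))  # 100, provided the file has at least 300 lines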

After the user enters the input file path and name, the code proceeds through the following steps:

Step 1 (Using Pandas): Find the number of rows in the file.

In this piece of code, pandas reads the CSV and the row count is taken from the DataFrame index (the + 1 accounts for the header line, so the result is the total number of lines in the file):

## find number of lines using pandas
import pandas as pd

pd_dataframe = pd.read_csv(split_source_file, header=0)
number_of_rows = len(pd_dataframe.index) + 1  # data rows + header line

Step 1 (Using Traditional Python): Find the number of rows in the file.

Alternatively, we open the file and enumerate its lines in a loop to find the number of rows:

## find number of lines using traditional python
with open(split_source_file, 'r') as fh:
    for count, line in enumerate(fh):
        pass
py_number_of_rows = count + 1  # enumerate starts at 0

Step 2: The user inputs the number of lines per file (a range), and the code generates a random number.

Here the code asks the user for the minimum and maximum number of rows per file, and generates a random number between them:

## In case of an equal split, provide the same number for min and max
import random

min_rows = int(input("Minimum Number of rows per file? : "))
max_rows = int(input("Maximum Number of rows per file? : "))
## This value changes per file
number_of_rows_perfile = random.randint(min_rows, max_rows)

If you want an equal split, provide the same number for min and max. For example, if you want all files to contain 5,000 rows, enter 5000 for both min_rows and max_rows.
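A quick illustration of why this yields an equal split (purely illustrative values):

import random

## With min == max, randint always returns that value,
## so every split file gets exactly 5000 rows
min_rows = max_rows = 5000
number_of_rows_perfile = random.randint(min_rows, max_rows)  # always 5000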

Step 3: Read data from the file to pick and skip a number of rows using pandas.

This is the tricky part. Using pandas, we create a dataframe from the source file. The number of rows to read (nrows) is the randomly generated number from the previous step, and the number of rows to skip (skiprows) is the running total of rows already written to previous files.

## Read CSV file with number of rows and skip respective number of lines
df = pd.read_csv(split_source_file, header=None,
                 nrows=number_of_rows_perfile, skiprows=skip_rows)

Step 4: Write data from the dataframe to a CSV file using pandas.

In this step, we write the data from the dataframe created in Step 3 to a file. The filename is generated from the source file name and a running count of the files created (a sketch of one possible naming scheme follows the snippet below).

## Write to csv
df.to_csv(split_target_file, index=False, header=False, mode='a', chunksize=number_of_rows_perfile)
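The naming logic itself is not shown above; here is a minimal sketch of one possible scheme, where the values of split_source_file and file_count are assumed examples rather than the author's originals:

import os

## Hypothetical naming scheme: <source name>_part_<N>.csv
split_source_file = "large_data.csv"   # assumed example path
file_count = 1                          # incremented for each split file
base_name = os.path.splitext(split_source_file)[0]
split_target_file = f"{base_name}_part_{file_count}.csv"
print(split_target_file)  # large_data_part_1.csv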

Step 5: Update the skip count and generate the row count for the next file after each small file is written.

After every small file is generated, the code advances the skip counter so the next read starts where the previous file ended. In addition, the last portion of the source file usually holds fewer rows than the randomly generated value, so it needs special handling:

## Last file handler: if the next random chunk would run past the end
## of the file, the final file gets just the remaining rows
number_of_rows_perfile = random.randint(min_rows, max_rows)
if skip_rows + number_of_rows_perfile > number_of_rows:
    number_of_rows_perfile = number_of_rows - skip_rows

Now, let's look at all of these pieces together.
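The following is a minimal end-to-end sketch that assembles the snippets above into one script. The loop structure, the "_part_N.csv" naming, and the prompt for split_source_file are assumptions for illustration, not necessarily the author's exact original code:

## Minimal end-to-end sketch assembling the steps above
import os
import random
import pandas as pd

split_source_file = input("Source CSV file path and name? : ")

## Step 1: total number of lines in the source file
pd_dataframe = pd.read_csv(split_source_file, header=0)
number_of_rows = len(pd_dataframe.index) + 1

## Step 2: row range per file
min_rows = int(input("Minimum Number of rows per file? : "))
max_rows = int(input("Maximum Number of rows per file? : "))

skip_rows = 0
file_count = 0
base_name = os.path.splitext(split_source_file)[0]

while skip_rows < number_of_rows:
    ## Steps 2 and 5: rows for this file; the last file gets whatever remains
    number_of_rows_perfile = random.randint(min_rows, max_rows)
    if skip_rows + number_of_rows_perfile > number_of_rows:
        number_of_rows_perfile = number_of_rows - skip_rows

    ## Step 3: read the next slice of rows
    df = pd.read_csv(split_source_file, header=None,
                     nrows=number_of_rows_perfile, skiprows=skip_rows)

    ## Step 4: write the slice to its own file (hypothetical naming scheme)
    file_count += 1
    split_target_file = f"{base_name}_part_{file_count}.csv"
    df.to_csv(split_target_file, index=False, header=False, mode='a')

    ## Step 5: advance the skip counter
    skip_rows += number_of_rows_perfile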

And there you have it. Thank you for reading.

We hope you find this article insightful. If you do, please share the link with your friends, family, and colleagues.

Would you like to encourage us to share insights on more topics? Please support us with a cup of coffee.

It takes just a moment to follow us. Let's help each other spread the knowledge: follow us on Medium, Insights and Data, LinkedIn, and Twitter to stay up to date with our latest articles.

It costs just 16 cents per day to become a referred Medium member through us (Subscribe) and unlock full access to our articles and many more on Medium.

Pandas
Data Engineering
Data Science
Python
Programming