avatarGiorgos Myrianthous

Summary

This article discusses how to efficiently concatenate two string columns into a new column in pandas DataFrames, considering dataset size and the inclusion of non-string columns.

Abstract

In this guide, the author explains how to create new columns by combining the content of existing columns in pandas DataFrames. The focus is on concatenating string columns, with different approaches considered based on dataset size. For small datasets (up to 100-150 rows), the pandas.Series.str.cat() method or list comprehension can be used. However, for larger datasets, the more efficient method is to use the + operator. The author also covers concatenating string with non-string columns using the astype() method or map() functions in pandas. The guide provides code examples throughout, demonstrating the techniques discussed.

Bullet points

  • The article explains how to concatenate columns in pandas DataFrames.
  • It suggests different methods based on dataset size:
    • For small datasets: pandas.Series.str.cat() or list comprehension.
    • For larger datasets: use the + operator.
  • The author discusses concatenating string with non-string columns using astype() or map().
  • The guide includes code examples illustrating the techniques.
  • The article aims to help users choose the most efficient method for their specific use case.

How to Combine Two String Columns in Pandas

Understanding how to concatenate two string columns into a new column more efficiently in pandas DataFrames

Photo by Pascal Müller on Unsplash

Introduction

Creating new columns by concatenating other columns is a fairly common task. In today’s short guide we will showcase how to concatenate the content of string DataFrame columns into a new column. We will explore a few different options that you should always consider based on the dataset size you are working on.

Some of these approaches tend to be more efficient when applied to small datasets whilst other approaches may be faster to execute when applied over larger datasets.

Additionally, we will also explore how to concatenate string with non-string (e.g. integer) columns.

First, let’s create an example DataFrame that we’ll reference throughout this article in order to demonstrate a few concepts.

import pandas as pd
df = pd.DataFrame(
  [
    (1, '2017', 10, 'Q1'),
    (2, '2017', 20, 'Q2'),
    (3, '2016', 35, 'Q4'),
    (4, '2019', 25, 'Q2'),
    (5, '2020', 44, 'Q3'),
    (6, '2021', 51, 'Q3'),
  ], 
  columns=['colA', 'colB', 'colC', 'colD']
)
print(df)
   colA  colB  colC colD
0     1  2017    10   Q1
1     2  2017    20   Q2
2     3  2016    35   Q4
3     4  2019    25   Q2
4     5  2020    44   Q3
5     6  2021    51   Q3

Concatenating string columns in small datasets

For relatively small datasets (up to 100–150 rows) you can use pandas.Series.str.cat() method that is used to concatenate strings in the Series using the specified separator (by default the separator is set to '').

For example, if we wanted to concatenate columns colB and colD and then store the output into a new column called colE, the following statement would do the trick:

df['colE'] = df.colB.str.cat(df.colD) 
print(df)
   colA  colB  colC colD    colE
0     1  2017    10   Q1  2017Q1
1     2  2017    20   Q2  2017Q2
2     3  2016    35   Q4  2016Q4
3     4  2019    25   Q2  2019Q2
4     5  2020    44   Q3  2020Q3
5     6  2021    51   Q3  2021Q3

Now if we wanted to specify a separator that will be placed between the concatenated columns, then we simply need to pass sep argument:

df['colE'] = df.colB.str.cat(df.colD, sep='-')
print(df)
   colA  colB  colC colD     colE
0     1  2017    10   Q1  2017-Q1
1     2  2017    20   Q2  2017-Q2
2     3  2016    35   Q4  2016-Q4
3     4  2019    25   Q2  2019-Q2
4     5  2020    44   Q3  2020-Q3
5     6  2021    51   Q3  2021-Q3

Alternatively, you can also use a list comprehension which is a bit more verbose but slightly faster:

df['colE'] = [''.join(i) for i in zip(df['colB'], df['colD'])]
print(df)
   colA  colB  colC colD    colE
0     1  2017    10   Q1  2017Q1
1     2  2017    20   Q2  2017Q2
2     3  2016    35   Q4  2016Q4
3     4  2019    25   Q2  2019Q2
4     5  2020    44   Q3  2020Q3
5     6  2021    51   Q3  2021Q3

Concatenating string columns in larger datasets

Now if you are working with large datasets, the more efficient way to concatenate two columns is using the + operator.

df['colE'] = df['colB'] + df['colD']
print(df)
   colA  colB  colC colD    colE
0     1  2017    10   Q1  2017Q1
1     2  2017    20   Q2  2017Q2
2     3  2016    35   Q4  2016Q4
3     4  2019    25   Q2  2019Q2
4     5  2020    44   Q3  2020Q3
5     6  2021    51   Q3  2021Q3

If you want to include a separator then simply place it as a string in-between the two columns:

df['colE'] = df['colB'] + '-' + df['colD']

Concatenating string with non-string columns

Now let’s assume that one of the columns you are trying to concatenate is not in string format:

import pandas as pd
df = pd.DataFrame(
  [
    (1, 2017, 10, 'Q1'),
    (2, 2017, 20, 'Q2'),
    (3, 2016, 35, 'Q4'),
    (4, 2019, 25, 'Q2'),
    (5, 2020, 44, 'Q3'),
    (6, 2021, 51, 'Q3'),
  ], 
  columns=['colA', 'colB', 'colC', 'colD']
)
print(df.dtypes)
colA     int64
colB     int64
colC     int64
colD    object
dtype: object

In this case, you can simply cast the column using pandas.DataFrame.astype() or map() methods.

# Option 1
df['colE'] = df.colB.astype(str).str.cat(df.colD)
# Option 2
df['colE'] = df['colB'].astype(str) + '-' + df['colD']
# Option 3
df['colE'] = [
  ''.join(i) for i in zip(df['colB'].map(str), df['colD'])
]

Final Thoughts

In today’s short guide we discussed about concatenating string columns in pandas DataFrames. Depending on the size of the dataset you are working with, you may have to select the most appropriate method that will be executed more efficiently.

Furthermore, we also showcased how to concatenate string with non-string columns by making use of the astype() method in pandas.

Become a member and read every story on Medium. Your membership fee directly supports me and other writers you read. You’ll also get full access to every story on Medium.

You may also like

Python
Programming
Software Development
Data Science
Technology
Recommended from ReadMedium