Tutorial – Pandas Concat, Pandas Append, Pandas Merge, Pandas Join

Tutorial - Pandas Concat, Pandas Append, Pandas Merge, Pandas Join
Pandas Concat, Pandas Append, Pandas Merge, Pandas Join


In our machine learning or data science projects, when we work with pandas library, there are instances when we have to use data from different dataframes, different lists and other such different data containers. For making this operation of merging or adding two different data containers, pandas has some functions such as concat(), append(), merge() and join(). In this tutorial, we will walk through these different pandas functions which are helping in joining different dataframes and other similar data containers.

Importing Pandas Library

As always the very first step is to load the pandas library.

In [1]:
import pandas as pd

Commencing the article by introducing pandas concat() function. Let’s see what it does and how it can be used.

Pandas Concat : pd.concat()

Concat() function helps in concatenating i.e. joining two different pandas objects on different axes.



Deep Learning Specialization on Coursera

objs : Series or Dataframe objects – This parameter takes the series or dataframe objects for performing concatenation operation.

axis : {‘0′ for Index,’1’ for Columns} – By providing axis, we tell whether concatenation is to be performed over index or columns. The default is 0.

ignore_index : bool – This parameter either helps in removing the index or in adding index to daatframes. The default value is false.

There are some other parameters which you can tweak within concat() function. You can have a look here

When a series is concatenated, then a series object is returned. When there are multiple objects consisting of at least 1 dataframe, then resulting object is a dataframe.

Example 1: Simple concat() function example with ignore_index parameter

Here, we are creating two different series and then concatenating them.

In [2]:
s1 = pd.Series(['P', 'Q'])
In [3]:
s2 = pd.Series(['R', 'S'])

We now concat them below. As it can be seen below, the indexing is not been proper. This is because both the series are using their own indexing.

In [4]:
pd.concat([s1, s2])
0    P
1    Q
0    R
1    S
dtype: object

We will add ignore_index parameter to pandas concat() function. This help in indexing the series properly.

Here the indexing has started from 0th index and has ended at 3rd index.

In [5]:
pd.concat([s1, s2], ignore_index=True)
0    P
1    Q
2    R
3    S
dtype: object

Example 2: Adding hierarchical index by using keys parameter

We can make the indexing hierarchical by using keys parameter in pandas concat function.

In [6]:
pd.concat([s1, s2], keys=['s1', 's2'])
s1  0    P
    1    Q
s2  0    R
    1    S
dtype: object

Example 3: Concat two dataframes objects with identical columns

Two different dataframes that have identical columns are concatenated using concat() function of pandas.

In [7]:
df1 = pd.DataFrame([['Apple', 'Jammu & Kashmir'], ['Banana', 'Kerala']],
                    columns=['Fruit', 'State_Grown'])

Fruit State_Grown
0 Apple Jammu & Kashmir
1 Banana Kerala
In [8]:
df2 = pd.DataFrame([['Watermelon', 'Maharashtra'], ['Strawberry', 'Madhya Pradesh']],
                  columns=['Fruit', 'State_Grown'])

Fruit State_Grown
0 Watermelon Maharashtra
1 Strawberry Madhya Pradesh
In [9]:
pd.concat([df1, df2],ignore_index=True)
Fruit State_Grown
0 Apple Jammu & Kashmir
1 Banana Kerala
2 Watermelon Maharashtra
3 Strawberry Madhya Pradesh

Example 4: Concat DataFrame objects horizontally

In this example we have added dataframes horizontally which means by providing axis=1, the two different dataframes are added.

In [10]:
df3 = pd.DataFrame([['Orange', 'Maharashtra'], ['Pineapple', 'Gujarat']],
                  columns=['Fruit', 'State_Grown'])

Fruit State_Grown
0 Orange Maharashtra
1 Pineapple Gujarat
In [11]:
pd.concat([df1, df3], axis=1)
Fruit State_Grown Fruit State_Grown
0 Apple Jammu & Kashmir Orange Maharashtra
1 Banana Kerala Pineapple Gujarat

The next function which we’ll look at is append function.

Pandas Append : append()

With the help of pandas append() function, we can append rows of one object to the rows of the calling object. With the help of append(), columns can also be appended.



self : DataFrame or Series/dict-like object, or list of these – This is the main object on the tail of this the other object will be appended.

other : DataFrame or Series/dict-like object, or list of these – This consists the other object which gets appended to main object.

ignore_index : bool – This parameter either helps in removing the index or in adding index to daatframes. The default value is false.

sort : bool – If required, we can sort the columns of self and other if they are not aligned.

This append() function returns a dataframe as an output.

Example 1: Append two dataframes

Appending a dataframe to another dataframe with ignore_index set as True.

In [12]:
df = pd.DataFrame([[7, 9], [8, 4]], columns=list('AB'))

0 7 9
1 8 4
In [13]:
df2 = pd.DataFrame([[2, 1], [6, 3]], columns=list('AB'))

0 2 1
1 6 3
In [14]:
df.append(df2, ignore_index=True)
0 7 9
1 8 4
2 2 1
3 6 3

Example 2: Building a dataframe using concat() and append() function

In this example, we will see how to build a dataframe using concat() and append() function.

It is advised to use concat for this purpose, this is because it has higher efficiency.

In [15]:
pd.concat([pd.DataFrame([i], columns=['A']) for i in range(7)],
0 0
1 1
2 2
3 3
4 4
5 5
6 6

Here as we can see append() can also be used for building a dataframe, but this is less efficient, thus it is recommended to use pandas concat() function.

In [16]:
df = pd.DataFrame(columns=['A'])

for i in range(7):
    df = df.append({'A': i}, ignore_index=True)
In [17]:
0 0
1 1
2 2
3 3
4 4
5 5
6 6

Moving onto the next function, we will explore more about merge() function.

Pandas Merge : merge()

The merge() function has a high utility, as it can merge dataframe or series objects. The merge function has taken its inspiration from the traditional database join.

Using pandas merge() either columns or indexes of different dataframes can be merged.



self : DataFrame or named Series – This consists the main object participating in merging operations.

right : DataFrame or named Series – This is the other object participating in merging operations.

how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’ – The how parameter specifies the kind of merge operation which will be performed. These values of how are inspired by SQL join operations.

left_on : label or list, or array-like – It consists the column or index level names to join on in the left DataFrame.

right_on : label or list, or array-like – It consists the column or index level names to join on in the right DataFrame.

suffixes : tuple of (str, str), default (‘_x’, ‘_y’) – This is used for adding suffix to overlapping column names in the left and right side, respectively.

The output obtained is a dataframe of merged objects.

Example 1: Merge two dataframes

Here the merging of different dataframes is performed using on parameter of merge() function. The key used for merging is id.

In [18]:
df = pd.DataFrame({
   'Player': ['Virat', 'Rohit', 'Shreyas', 'Rahul', 'Shikhar'],

id Player IPL_Team
0 1 Virat RCB
1 2 Rohit MI
2 3 Shreyas DC
3 4 Rahul KXIP
4 5 Shikhar SRH
In [19]:
df1 = pd.DataFrame(
   'Name': ['Lynn', 'De Kock', 'Dhoni', 'Gayle', 'Warner'],

id Name IPL_Team
0 1 Lynn MI
1 2 De Kock DC
2 3 Dhoni CSK
3 4 Gayle KXIP
4 5 Warner SRH
In [20]:
id Player IPL_Team_x Name IPL_Team_y
0 1 Virat RCB Lynn MI
1 2 Rohit MI De Kock DC
2 3 Shreyas DC Dhoni CSK
3 4 Rahul KXIP Gayle KXIP
4 5 Shikhar SRH Warner SRH

Example 2: Fetching common entries from two different dataframes

By using merge() function, entries that are common to both the dataframes are fetched. For this, we provide ‘id’ and ‘IPL_Team’ as keys for fetching common data.

In [21]:
id Player IPL_Team Name
0 4 Rahul KXIP Gayle
1 5 Shikhar SRH Warner

Lastly, we will be going over the join() function of pandas.

Pandas Join: join()

The pandas join() function helps in joining columns of different dataframe. For joining dataframes, we can either use indexes or columns as keys.



self : DataFrame, Series, or list of DataFrame – This is the column of the main dataframe used for joining.

other : DataFrame, Series, or list of DataFrame – This is the column of another dataframe, this will be joined with column of main dataframe.

on : str, list of str, or array-like, optional – Used for specifying the index or column from where joining will take place.

lsuffix : str – This is used as a suffix from left frame.

rsuffix : str – This is used as a suffix from right frame

sort – bool – This helps in ordering resulting dataframe lexicographically.

The result of this join() function is a dataframe containing columns of both dataframes.

Example 1: Using join() function over two dataframes

With the help of join function in pandas, we can perform joining of two different dataframes.

In [22]:
df = pd.DataFrame({'Student_ID': ['S1', 'S2', 'S3', 'S4', 'S5', 'S6'],
                'Student_Name': ['George', 'Michael', 'Dennis', 'David', 'Steve', 'Gary']})

Student_ID Student_Name
0 S1 George
1 S2 Michael
2 S3 Dennis
3 S4 David
4 S5 Steve
5 S6 Gary
In [23]:
df1 = pd.DataFrame({'Student_ID': ['S1', 'S2', 'S3'],
                  'Student_Name': ['Brett', 'Adam', 'Matthew']})

Student_ID Student_Name
0 S1 Brett
1 S2 Adam
2 S3 Matthew

Here we specify the lsuffix and rsuffix which are basically column headers of the joined dataframe.

In [24]:
df.join(df1, lsuffix='_caller', rsuffix='_df1')
Student_ID_caller Student_Name_caller Student_ID_df1 Student_Name_df1
0 S1 George S1 Brett
1 S2 Michael S2 Adam
2 S3 Dennis S3 Matthew
3 S4 David NaN NaN
4 S5 Steve NaN NaN
5 S6 Gary NaN NaN

Pandas Concat vs Append vs Merge vs Join

  • We have covered the four joining functions of pandas, namely concat(), append(), merge() and join(). Since these functions operate quite similar to each other. I will tell you the fundamental difference used for distinguishing them and their usage.
  • Pandas append function has limited functionality. This is because we can use it for appending rows only. This is the reason it is used less.
  • Pandas concat has better functionality as it can be used for joining multiple dataframes through both columns or rows. It is considered to be the most efficient method of joining dataframes.
  • By using pandas merge function we can perform joins similar to SQL. With the help of merge() we can merge similar values using a common column found in two dataframes.
  • Lastly, the pandas join function is performing also similar operations like pandas merge, the only major difference is the usage of left-side index by join() whereas merge() uses new index for showing results.

At last, we have reached to the concluding part of this tutorial on joining functions of pandas. In this tutorial, we looked at pandas concat(), append(), merge() and join() functions. Along with the understanding of syntax and its parameters, we went over some examples of these functions used in pandas for a better understanding of the usage and learning the different scenarios where we can perform concatenation, appending, merging and joining with the help of these functions.

Reference – https://pandas.pydata.org/docs/


Please enter your comment!
Please enter your name here