real panda
June 17, 2019 * Python Programming

Pandas - Concatenate or vertically merge dataframes

Consider that there are two or more dataframes that have identical column structure. We just need to stitch up each piece one after the other to create one big dataframe. This end to end vertical concatenation can be done in a few different ways. The tutorial shows a couple of methods that are versatile and safe to use.

Two data files will be used here namely 'data_deposits.csv' and 'data_deposits_extra.csv', both of which are available in the data library.

Vertically concatenate rows from two dataframes

The code below shows that two data files are imported individually into separate dataframes. The columns and data types are identical for both files. The row count and actual data is different.

The first method appends dataframe #2 to #1 to create a 3rd combined dataframe. This method will only work for two dataframes at a time. To combine multiple files, an iteration loop has to be set up.

Note that the combined data is sorted by default. This can be a waste of time, and so consider the option 'sort=False' when calling for appending the dataframes.

Combine two files

import pandas as pd

#load files separately
df1 = pd.read_csv(
  'data_deposits.csv'
)
print(df1.head(8))

df2 = pd.read_csv(
  'data_deposits_extra.csv'
)
print(df2.head(5))

#combine files
df3 = df1.append(
  df2, 
  sort = False
)
print(df3.head(13))
--[df1 first]--------------------------------
  firstname lastname     city  age  deposit
0    Herman  Sanchez    Miami   52     9300
1      Phil   Parker    Miami   45     5010
2    Bradie  Garnett   Denver   36     6300
3    Rudolf   Crooks    Miami   33     5800
4    Marcos   Miller  Seattle   66     4300
5      Chad  Garnett    Miami   38     7420
6     Sally    Evans   Denver   25     3170
7      Chad   Parker  Seattle   55    12600

--[df2 second]-------------------------------
  firstname lastname     city  age  deposit
0      Elli  Laforge   Denver   53     2038
1      Elli  Laforge    Miami   53     6310
2     Sally    Evans   Denver   25     3170
3      Phil   Parker   Denver   49     9010
4       Joe  Laforge   Seatle   51    15380

--[df3 merged]--------------------------------
  firstname lastname     city  age  deposit
0    Herman  Sanchez    Miami   52     9300
1      Phil   Parker    Miami   45     5010
2    Bradie  Garnett   Denver   36     6300
3    Rudolf   Crooks    Miami   33     5800
4    Marcos   Miller  Seattle   66     4300
5      Chad  Garnett    Miami   38     7420
6     Sally    Evans   Denver   25     3170
7      Chad   Parker  Seattle   55    12600
0      Elli  Laforge   Denver   53     2038
1      Elli  Laforge    Miami   53     6310
2     Sally    Evans   Denver   25     3170
3      Phil   Parker   Denver   49     9010
4       Joe  Laforge   Seatle   51    15380
---------------------------------------------

As expected, we have a combined file with record count to be the sum of the individual files. This means there have been no checks for duplicates. Note that the index numbers are also preserved from each file. This method is slightly slower, and also has a drawback of operating on just a pair of dataframes.

Combine a list of two or more dataframes

The second method takes a list of dataframes and concatenates them along axis=0, or vertically. Multiple files can be operated on all at once, or a single file can be repeated.

Concatenate multiple dataframes

import pandas as pd

#load files separately
df1 = pd.read_csv(
  'data_deposits.csv'
)
df2 = pd.read_csv(
  'data_deposits_extra.csv'
)

#combine files (faster)
df4 = pd.concat(
  [df1, df2, df2], 
  axis = 0,
  sort = False
)

#make the index pretty
df4.reset_index(
  drop = True,
  inplace = True
)

print(df4.head(18))
--[ df4 combined df1, df2, df2 ]-------------
   firstname lastname     city  age  deposit
0     Herman  Sanchez    Miami   52     9300
1       Phil   Parker    Miami   45     5010
2     Bradie  Garnett   Denver   36     6300
3     Rudolf   Crooks    Miami   33     5800
4     Marcos   Miller  Seattle   66     4300
5       Chad  Garnett    Miami   38     7420
6      Sally    Evans   Denver   25     3170
7       Chad   Parker  Seattle   55    12600
8       Elli  Laforge   Denver   53     2038
9       Elli  Laforge    Miami   53     6310
10     Sally    Evans   Denver   25     3170
11      Phil   Parker   Denver   49     9010
12       Joe  Laforge   Seatle   51    15380
13      Elli  Laforge   Denver   53     2038
14      Elli  Laforge    Miami   53     6310
15     Sally    Evans   Denver   25     3170
16      Phil   Parker   Denver   49     9010
17       Joe  Laforge   Seatle   51    15380
---------------------------------------------

The results look promising. We were able to concatenate df1 and two times df2. To make the index look pretty, and unique a reset_index() was done to create a new regular index from 0 to 17. The old index which was a patchwork from individual components was dropped. The operation was done inplace, so that another copy of the data was not created.

References