June 17, 2019 * Python Programming

Pandas - Copying dataframes

Python has some behaviors that need to be spelled out explicitly. Otherwise, assumptions carried over from other programming languages can cause trouble. Copying dataframes is one important example. The data file used here can be obtained from the data library.

Pitfalls of variable assignment for Pandas dataframes

We need to be careful when assigning a dataframe to a new variable. We might expect the new variable to hold a copy of the current dataframe, with the two independent of each other. Look closely at what happens here.

Assigning dataframe

import pandas as pd

#read data into dataframe
df1 = pd.read_csv(
    'data_deposits.csv'
)
print(df1.head(3))

#assign the dataframe to a new variable
#then add a constant-valued column
df2 = df1
df2['student'] = False

print(df1.head(3))
print(df2.head(3))
--[df1 original data]-------------------------------
  firstname lastname    city  age  deposit
0    Herman  Sanchez   Miami   52     9300
1      Phil   Parker   Miami   45     5010
2    Bradie  Garnett  Denver   36     6300

--[df1 after assignment and addition of column]---------
  firstname lastname    city  age  deposit student
0    Herman  Sanchez   Miami   52     9300   False
1      Phil   Parker   Miami   45     5010   False
2    Bradie  Garnett  Denver   36     6300   False

--[df2 after assignment and addition of column]---------
  firstname lastname    city  age  deposit student
0    Herman  Sanchez   Miami   52     9300   False
1      Phil   Parker   Miami   45     5010   False
2    Bradie  Garnett  Denver   36     6300   False
----------------------------------------------------

df1 is the original dataframe, and we assign it to the variable df2. However, when we add a column to df2, df1 is affected as well. Assigning df1 to df2 did not create an independent copy; both names refer to the same dataframe object. So how do we create a copy of a dataframe?
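
The shared identity can be checked directly. The following is a minimal sketch that assumes a small hand-built dataframe in place of data_deposits.csv; the variable names same_object, same_id and column_in_df1 are only illustrative.

```python
import pandas as pd

# Small stand-in for the first rows of data_deposits.csv (assumed data)
df1 = pd.DataFrame({
    "firstname": ["Herman", "Phil", "Bradie"],
    "deposit": [9300, 5010, 6300],
})

df2 = df1  # binds a second name to the same object; nothing is copied

same_object = df2 is df1        # both names point at one dataframe
same_id = id(df1) == id(df2)    # same check, expressed through id()

df2["student"] = False          # modifies the single shared object
column_in_df1 = "student" in df1.columns

print(same_object, same_id, column_in_df1)  # True True True
```

Since `df2 is df1` is True, any change made through one name is visible through the other.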

Dataframe copy() function exists for a reason

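Pandas dataframes have an explicit copy() method. It takes a deep parameter, which defaults to True. With deep=True, the new dataframe gets its own copy of the data and index, so changes to one do not show up in the other. With deep=False, a new dataframe object is created that still refers to the original's data and index rather than duplicating them; the result is not an empty dataframe. How changes propagate between a shallow copy and the original depends on the pandas version and on whether Copy-on-Write is enabled, so use a deep copy when independence matters.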
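
To make the two modes concrete, here is a minimal sketch, again assuming a small hand-built dataframe instead of the CSV file. It only checks behavior that does not depend on the pandas version: both copies are new objects with the same contents, and a deep copy is independent of the original.

```python
import pandas as pd

# Small stand-in for the first rows of data_deposits.csv (assumed data)
df1 = pd.DataFrame({
    "firstname": ["Herman", "Phil", "Bradie"],
    "deposit": [9300, 5010, 6300],
})

deep = df1.copy()               # deep=True is the default
shallow = df1.copy(deep=False)  # new object, data not duplicated

# Both copies are distinct objects holding the same rows as df1.
print(deep is df1, shallow is df1)            # False False
print(deep.equals(df1), shallow.equals(df1))  # True True

# Changing a value in the deep copy leaves df1 untouched.
deep.loc[0, "deposit"] = 0
print(df1.loc[0, "deposit"])                  # 9300
```

Writing values into the shallow copy is deliberately left out, because whether that write reaches df1 differs between pandas versions.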

Copying dataframes explicitly

import pandas as pd

#read data into dataframe
df1 = pd.read_csv(
    'data_deposits.csv'
)
print(df1.head(3))

#copy dataframe structure and contents
df3 = df1.copy(deep=True)

#add a new column
df3['job'] = True

print(df3.head(3))
print(df1.head(3))
--[df1 original data]---------------------------
  firstname lastname    city  age  deposit
0    Herman  Sanchez   Miami   52     9300
1      Phil   Parker   Miami   45     5010
2    Bradie  Garnett  Denver   36     6300


--[df3 post copy and new column creation]-------
  firstname lastname    city  age  deposit   job
0    Herman  Sanchez   Miami   52     9300  True
1      Phil   Parker   Miami   45     5010  True
2    Bradie  Garnett  Denver   36     6300  True

--[df1 after df3 is modified]-------------------
  firstname lastname    city  age  deposit
0    Herman  Sanchez   Miami   52     9300
1      Phil   Parker   Miami   45     5010
2    Bradie  Garnett  Denver   36     6300

-------------------------------------------------

Once the deep copy has been made into df3, we can add columns to it without affecting df1. df3 is a true, independent copy of the dataframe.
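
The same aliasing applies when a dataframe is passed to a function, since the parameter is just another name for the caller's object. Below is a minimal sketch of a helper that copies before modifying. The function add_job_flag is hypothetical and not part of pandas, and the dataframe is a hand-built stand-in for the CSV data.

```python
import pandas as pd

def add_job_flag(df):
    """Hypothetical helper: return a copy of df with a 'job' column added."""
    result = df.copy()    # work on an independent copy, not the caller's object
    result["job"] = True
    return result

# Small stand-in for the first rows of data_deposits.csv (assumed data)
df1 = pd.DataFrame({
    "firstname": ["Herman", "Phil", "Bradie"],
    "deposit": [9300, 5010, 6300],
})

df3 = add_job_flag(df1)

print(list(df1.columns))  # ['firstname', 'deposit']
print(list(df3.columns))  # ['firstname', 'deposit', 'job']
```

Without the copy() call inside the helper, the 'job' column would appear in df1 as well.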
