real panda
June 17, 2019 * Python Programming

Pandas - Purge duplicate rows

Duplicate rows can be a big headache in data science. Conditions can be set up to search for matches, so that all duplicated rows can be selected and grouped. The question then becomes: what to do with the duplicates.
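
As a quick illustration, the duplicated() method in pandas returns a boolean Series that marks repeated rows. Below is a minimal sketch using a small made-up dataframe (not the demonstration files used later):

import pandas as pd

#tiny made-up dataframe for illustration only
df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Ann'],
    'age':  [30, 41, 30]
})

#keep=False flags every member of a duplicate group
print( df.duplicated(keep=False) )
#0     True
#1    False
#2     True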

The data files needed for this demonstration are available in the data collection.

Remove duplicated rows in a dataframe

Once we have detected the duplicated rows, we can decide to remove them and keep only the unique data points. This choice can follow a rule that all duplication stems from errors during data collection. We will take two dataframes and concatenate them to create a dataframe that contains duplicate rows.

Remove duplicate rows from a dataframe

import pandas as pd

#load selected columns from two files
#concatenate data
load_cols = [
    'lastname',
    'firstname',
    'city',
    'age'
]
df1 = pd.read_csv(
  'data_deposits.csv', 
  usecols = load_cols
)
df2 = pd.read_csv(
  'data_deposits_extra.csv',
  usecols = load_cols
)
df3 = pd.concat([df1,df2], sort=False)
print( df3.head(13) )

#----------------------------------------
#match for last and first names both
match_cols = [
    'lastname',
    'firstname'
]

#keep only the unique rows
df4 = df3.drop_duplicates(
  subset = match_cols, 
  keep = False
)
print( df4.head(13) )
--[df3 with duplicates]--------------
  firstname lastname     city  age
0    Herman  Sanchez    Miami   52
1      Phil   Parker    Miami   45
2    Bradie  Garnett   Denver   36
3    Rudolf   Crooks    Miami   33
4    Marcos   Miller  Seattle   66
5      Chad  Garnett    Miami   38
6     Sally    Evans   Denver   25
7      Chad   Parker  Seattle   55
0      Elli  Laforge   Denver   53
1      Elli  Laforge    Miami   53
2     Sally    Evans   Denver   25
3      Phil   Parker   Denver   49
4       Joe  Laforge   Seatle   51

--[df4 duplicates removed]-----------
  firstname lastname     city  age
0    Herman  Sanchez    Miami   52
2    Bradie  Garnett   Denver   36
3    Rudolf   Crooks    Miami   33
4    Marcos   Miller  Seattle   66
5      Chad  Garnett    Miami   38
7      Chad   Parker  Seattle   55
4       Joe  Laforge   Seatle   51

-------------------------------------

The initial part of the code takes a few columns from two data files and combines them into a single dataframe. This dataframe contains duplicated data, detected by matching the combined values of two columns: the first and last names.

All the duplicated data is removed because of the keep=False argument. Removing every copy may be extreme; we may want to preserve one copy of each duplicated record. However, if the remaining columns differ between matched rows, the question shifts to which row to preserve.
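
For reference, drop_duplicates() accepts three values for its keep parameter. A minimal sketch on a made-up toy dataframe shows the difference:

import pandas as pd

toy = pd.DataFrame({
    'name': ['Ann', 'Ann', 'Bob'],
    'age':  [30, 31, 41]
})

#keep='first' retains the first row of each duplicate group
print( toy.drop_duplicates(subset=['name'], keep='first') )

#keep='last' retains the last row of each group
print( toy.drop_duplicates(subset=['name'], keep='last') )

#keep=False removes every member of a duplicate group
print( toy.drop_duplicates(subset=['name'], keep=False) )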

Preserve the first instance of duplicated rows in a dataframe

As discussed earlier, we need to decide which duplicated row to preserve within each set of duplicates. One method is to keep the first or the last item as it appears in the dataframe. This order can be pre-determined by sorting the dataframe.
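
For example, if the rule were to keep the oldest record within each duplicate group, the dataframe could first be sorted by age in descending order. A minimal sketch with hypothetical inline data:

import pandas as pd

people = pd.DataFrame({
    'name': ['Ann', 'Ann', 'Bob'],
    'age':  [30, 35, 41]
})

#sort so the desired row comes first,
#then keep='first' preserves it per name
oldest = people.sort_values(
    by = 'age',
    ascending = False
).drop_duplicates(
    subset = ['name'],
    keep = 'first'
)
print( oldest )
#  name  age
#2  Bob   41
#1  Ann   35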

Keep only the first instance of duplicates instead of removing them

import pandas as pd

#load selected columns from two files
#concatenate data
load_cols = [
    'lastname',
    'firstname',
    'city',
    'age'
]
df1 = pd.read_csv(
  'data_deposits.csv', 
  usecols = load_cols
)
df2 = pd.read_csv(
  'data_deposits_extra.csv',
  usecols = load_cols
)
df3 = pd.concat([df1,df2], sort=False)

#----------------------------------------
#match for last and first names both
match_cols = [
    'lastname',
    'firstname'
]

#keep first row within duplicates
df4 = df3.drop_duplicates(
  subset = match_cols, 
  keep = 'first'
)
df4.reset_index(
    drop = True,
    inplace = True
)
print( df4.head(13) )
--[ df4 keep first duplicate ]------
  firstname lastname     city  age
0    Herman  Sanchez    Miami   52
1      Phil   Parker    Miami   45
2    Bradie  Garnett   Denver   36
3    Rudolf   Crooks    Miami   33
4    Marcos   Miller  Seattle   66
5      Chad  Garnett    Miami   38
6     Sally    Evans   Denver   25
7      Chad   Parker  Seattle   55
8      Elli  Laforge   Denver   53
9       Joe  Laforge   Seatle   51
-----------------------------------

A few extra rows show up compared to the previous example, where all duplicates were removed. A reset_index() call was added to keep the result organized and tidy. We can alternatively preserve the last row within each set of duplicates, which is explored in the next example.

Label duplicates, create subset and filter

This example takes a slightly different approach: instead of detecting and operating on the duplicates directly, we first label them. Then we separate the duplicated and non-duplicated parts.

Separate out unique rows and keep the last of each set of duplicated rows

import pandas as pd

#load selected columns from two files
#concatenate data
load_cols = [
    'lastname',
    'firstname',
    'city',
    'age'
]
df1 = pd.read_csv(
  'data_deposits.csv', 
  usecols = load_cols
)
df2 = pd.read_csv(
  'data_deposits_extra.csv',
  usecols = load_cols
)
df3 = pd.concat([df1,df2], sort=False)
df3.reset_index(
  drop = True, 
  inplace = True
)

#----------------------------------------
#match for last and first names both
match_cols = [
    'lastname',
    'firstname'
]

#create label for duplicates
df3['duplicate'] = df3.duplicated(
    subset = match_cols,
    keep = False
)
print(df3.head(13))

#boolean mask from the duplicate label
mask = df3['duplicate']

#subset for unique rows
df_unq = (df3.loc[~mask]).copy(deep = True)
print(df_unq.head(13))    

#subset for repeated rows
df_rep = (df3.loc[mask]).copy(deep = True)
print( df_rep.head(13) )

#keep last of repeated
df_rep.drop_duplicates(
  subset = match_cols, 
  keep = 'last', 
  inplace = True
)
print( df_rep.head(13) )
--[df3 with labels for duplicates]--------------
   firstname lastname     city  age  duplicate
0     Herman  Sanchez    Miami   52      False
1       Phil   Parker    Miami   45       True
2     Bradie  Garnett   Denver   36      False
3     Rudolf   Crooks    Miami   33      False
4     Marcos   Miller  Seattle   66      False
5       Chad  Garnett    Miami   38      False
6      Sally    Evans   Denver   25       True
7       Chad   Parker  Seattle   55      False
8       Elli  Laforge   Denver   53       True
9       Elli  Laforge    Miami   53       True
10     Sally    Evans   Denver   25       True
11      Phil   Parker   Denver   49       True
12       Joe  Laforge   Seatle   51      False

--[df_unq non-duplicates]------------------------
   firstname lastname     city  age  duplicate
0     Herman  Sanchez    Miami   52      False
2     Bradie  Garnett   Denver   36      False
3     Rudolf   Crooks    Miami   33      False
4     Marcos   Miller  Seattle   66      False
5       Chad  Garnett    Miami   38      False
7       Chad   Parker  Seattle   55      False
12       Joe  Laforge   Seatle   51      False

--[df_rep duplicate elements]--------------------
   firstname lastname    city  age  duplicate
1       Phil   Parker   Miami   45       True
6      Sally    Evans  Denver   25       True
8       Elli  Laforge  Denver   53       True
9       Elli  Laforge   Miami   53       True
10     Sally    Evans  Denver   25       True
11      Phil   Parker  Denver   49       True

--[df_rep last duplicate elements]---------------
   firstname lastname    city  age  duplicate
9       Elli  Laforge   Miami   53       True
10     Sally    Evans  Denver   25       True
11      Phil   Parker  Denver   49       True
-------------------------------------------------

The output shows a method to separate out the unique and the duplicated parts of the dataframe. This gives more control over what to do with each part. At the end of the data treatments, the dataframes can be concatenated back into a single entity.
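
As a sketch, the two treated parts from the listing above can be stitched back together. This assumes df_unq and df_rep are still in scope, and drops the helper label once it has served its purpose:

#recombine the treated subsets into a single dataframe
df_final = pd.concat([df_unq, df_rep], sort=False)

#drop the helper label and restore the original row order
df_final = df_final.drop(columns=['duplicate'])
df_final = df_final.sort_index().reset_index(drop=True)
print( df_final.head(13) )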

Let us look at another example where different columns are used for duplicate matching. We can also limit which columns are kept in the filtered dataframe.

Subset rows for unique column values

import pandas as pd

#load selected columns from two files
#concatenate data
load_cols = [
    'lastname',
    'firstname',
    'city',
    'age'
]
df1 = pd.read_csv(
  'data_deposits.csv', 
  usecols = load_cols
)
df2 = pd.read_csv(
  'data_deposits_extra.csv',
  usecols = load_cols
)
df3 = pd.concat([df1,df2], sort=False)
df3.reset_index(
  drop = True, 
  inplace = True
)
print( df3.head(13))
#----------------------------------------
#match on city, and keep first row of each
#subset data columns for matching rows
df4 = df3[['city','lastname']].drop_duplicates(
    subset=['city'], 
    keep='first'
)
print( df4.head(13))
--[df3 merged dataframe ]-------------
   firstname lastname     city  age
0     Herman  Sanchez    Miami   52
1       Phil   Parker    Miami   45
2     Bradie  Garnett   Denver   36
3     Rudolf   Crooks    Miami   33
4     Marcos   Miller  Seattle   66
5       Chad  Garnett    Miami   38
6      Sally    Evans   Denver   25
7       Chad   Parker  Seattle   55
8       Elli  Laforge   Denver   53
9       Elli  Laforge    Miami   53
10     Sally    Evans   Denver   25
11      Phil   Parker   Denver   49
12       Joe  Laforge   Seatle   51

--[df4 filtered subset]--------------
       city lastname
0     Miami  Sanchez
2    Denver  Garnett
4   Seattle   Miller
12   Seatle  Laforge
-------------------------------------

There should be only three cities in the unique list. The typographical error 'Seatle' is intentional, planted to create a defect in the data; such errors will have their use case in other examples.
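
If the defect needed repairing right away, one option would be a plain value substitution with Series.replace(); a one-line sketch:

#repair the misspelled city name with a simple substitution
df3['city'] = df3['city'].replace('Seatle', 'Seattle')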
