real panda
June 6, 2019 * Python Programming

Pandas - Read, skip and customize column headers for read_csv

Pandas read_csv() function automatically parses the header while loading a csv file. It assumes that the top row (rowid = 0) contains the column name information. It is possible to change this default behavior to customize the column names.

Header information at the top row

Row# 0 is the first row of a csv file. So pointing the header to row# 0 assigns column name information to be parsed from the top row.

Header at row 0

import pandas as pd

#header at row 0
df = pd.read_csv(
  'data_deposits.csv',
  header = 0,
  sep = ','
)

print(df.columns)
print(df.head(3))
Code output:
--[ header/column names ]--------------------------------------------------
Index(['firstname', 'lastname', 'city', 'age', 'deposit'], dtype='object')

--[ df top 3 rows with header ]--------------------------------------------
  firstname lastname    city  age  deposit
0    Herman  Sanchez   Miami   52     9300
1      Phil   Parker   Miami   45     5010
2    Bradie  Garnett  Denver   36     6300
---------------------------------------------------------------------------

No header, all data rows

If the data file has no header information, and the intent is treat all the rows as data - then header=None is used.

Assign no header from file

import pandas as pd

#no header
df = pd.read_csv(
'data_deposits.csv',
header = None,
sep = ','
)

print(df.columns)
print(df.head(3))
Code output:
--[ header / column names ]-------------------
Int64Index([0, 1, 2, 3, 4], dtype='int64')

--[ df top 3 rows ]---------------------------
           0         1      2    3        4
0  firstname  lastname   city  age  deposit
1     Herman   Sanchez  Miami   52     9300
2       Phil    Parker  Miami   45     5010
----------------------------------------------

The columns have no names, and are just identified by numbers starting from 0. Assigning no header makes the top row to be treated as data. For data_deposits.csv this is not ideal. The top row has header information, but we want to ignore it - not add it to the data values. This means we need to skip the top row.

Read no header, but skip the top row

import pandas as pd

#no header
#skip a row
df = pd.read_csv(
  'data_deposits.csv',
  header = None,
  skiprows = 1,
  sep = ','
)

print(df.columns)
print(df.head(3))
Code output:
--[ header / column names ]-------------------
Int64Index([0, 1, 2, 3, 4], dtype='int64')

--[ df top 3 rows ]---------------------------
        0        1       2   3     4
0  Herman  Sanchez   Miami  52  9300
1    Phil   Parker   Miami  45  5010
2  Bradie  Garnett  Denver  36  6300
----------------------------------------------

Using both header=None, and skiprows=1 does the trick. Our data is not corrupted, and we have clean slate for column names.

Assign custom column names

To name the loaded data columns as needed, we can pass the column names as a list in the same order as loaded columns.

Add custom column names to dataframe

import pandas as pd

#no header
#skip a row
df = pd.read_csv(
  'data_deposits.csv',
  header = None,
  skiprows = 1,
  sep = ',',
  names = [
    'name_first',
    'name_last',
    'city',
    'trips',
    'miles'
  ]
)

print(df.columns)
print(df.head(3))
Code output:
--[ df column headers ]------------------------------------------------------
Index(['name_first', 'name_last', 'city', 'trips', 'miles'], dtype='object')

--[ df top 3 rows ]---------------------------
  name_first name_last    city  trips  miles
0     Herman   Sanchez   Miami     52   9300
1       Phil    Parker   Miami     45   5010
2     Bradie   Garnett  Denver     36   6300
----------------------------------------------

It is also possible to rename the columns after normal data import. A dictionary of current to new column names is all that is needed.

Column rename using dict

import pandas as pd

df = pd.read_csv(
  'data_deposits.csv',
  header = 0,
  sep = ',',
)

#1:1 mapping to new names
new_names = {
  'firstname': 'name_first',
  'lastname': 'name_last',
  'city': 'city',
  'age': 'trips',
  'deposit': 'miles'
}

#do rename
df.rename(
  columns = new_names, 
  inplace = True
)

print(df.columns)
print(df.head(3))
Code output:
--[ df column headers ]------------------------------------------------------
Index(['name_first', 'name_last', 'city', 'trips', 'miles'], dtype='object')

--[ df top 3 rows ]---------------------------
  name_first name_last    city  trips  miles
0     Herman   Sanchez   Miami     52   9300
1       Phil    Parker   Miami     45   5010
2     Bradie   Garnett  Denver     36   6300
----------------------------------------------

It is also possible to assign column names directly, overwriting existing column names. This is risky, as the order of the existing columns need to be noted, else wrong column names would get assigned. It is always better to use a dictionary mapping.

Simple overwriting of column names

import pandas as pd

df = pd.read_csv(
  'data_deposits.csv',
  header = 0,
  sep = ',',
)

df.columns = [
  'name_first',
  'name_last',
  'city',
  'trips',
  'miles'
]

print(df.columns)
print(df.head(3))
Code output:
--[ df column headers ]------------------------------------------------------
Index(['name_first', 'name_last', 'city', 'trips', 'miles'], dtype='object')

--[ df top 3 rows ]---------------------------
  name_first name_last    city  trips  miles
0     Herman   Sanchez   Miami     52   9300
1       Phil    Parker   Miami     45   5010
2     Bradie   Garnett  Denver     36   6300
----------------------------------------------

Basic header customization during and after data import was covered. To read more about data loading, row and column selection check out the related tutorials.

References