Pandas read_csv() function automatically parses the header while loading a csv file. It assumes that the top row (rowid = 0) contains the column name information. It is possible to change this default behavior to customize the column names.
Row# 0 is the first row of a csv file. So pointing the header to row# 0 assigns column name information to be parsed from the top row.
Header at row 0
import pandas as pd #header at row 0 df = pd.read_csv( 'data_deposits.csv', header = 0, sep = ',' ) print(df.columns) print(df.head(3))
Code output: --[ header/column names ]-------------------------------------------------- Index(['firstname', 'lastname', 'city', 'age', 'deposit'], dtype='object') --[ df top 3 rows with header ]-------------------------------------------- firstname lastname city age deposit 0 Herman Sanchez Miami 52 9300 1 Phil Parker Miami 45 5010 2 Bradie Garnett Denver 36 6300 ---------------------------------------------------------------------------
If the data file has no header information, and the intent is treat all the rows as data - then header=None is used.
Assign no header from file
import pandas as pd #no header df = pd.read_csv( 'data_deposits.csv', header = None, sep = ',' ) print(df.columns) print(df.head(3))
Code output: --[ header / column names ]------------------- Int64Index([0, 1, 2, 3, 4], dtype='int64') --[ df top 3 rows ]--------------------------- 0 1 2 3 4 0 firstname lastname city age deposit 1 Herman Sanchez Miami 52 9300 2 Phil Parker Miami 45 5010 ----------------------------------------------
The columns have no names, and are just identified by numbers starting from 0. Assigning no header makes the top row to be treated as data. For data_deposits.csv this is not ideal. The top row has header information, but we want to ignore it - not add it to the data values. This means we need to skip the top row.
Read no header, but skip the top row
import pandas as pd #no header #skip a row df = pd.read_csv( 'data_deposits.csv', header = None, skiprows = 1, sep = ',' ) print(df.columns) print(df.head(3))
Code output: --[ header / column names ]------------------- Int64Index([0, 1, 2, 3, 4], dtype='int64') --[ df top 3 rows ]--------------------------- 0 1 2 3 4 0 Herman Sanchez Miami 52 9300 1 Phil Parker Miami 45 5010 2 Bradie Garnett Denver 36 6300 ----------------------------------------------
Using both header=None, and skiprows=1 does the trick. Our data is not corrupted, and we have clean slate for column names.
To name the loaded data columns as needed, we can pass the column names as a list in the same order as loaded columns.
Add custom column names to dataframe
import pandas as pd #no header #skip a row df = pd.read_csv( 'data_deposits.csv', header = None, skiprows = 1, sep = ',', names = [ 'name_first', 'name_last', 'city', 'trips', 'miles' ] ) print(df.columns) print(df.head(3))
Code output: --[ df column headers ]------------------------------------------------------ Index(['name_first', 'name_last', 'city', 'trips', 'miles'], dtype='object') --[ df top 3 rows ]--------------------------- name_first name_last city trips miles 0 Herman Sanchez Miami 52 9300 1 Phil Parker Miami 45 5010 2 Bradie Garnett Denver 36 6300 ----------------------------------------------
It is also possible to rename the columns after normal data import. A dictionary of current to new column names is all that is needed.
Column rename using dict
import pandas as pd df = pd.read_csv( 'data_deposits.csv', header = 0, sep = ',', ) #1:1 mapping to new names new_names = { 'firstname': 'name_first', 'lastname': 'name_last', 'city': 'city', 'age': 'trips', 'deposit': 'miles' } #do rename df.rename( columns = new_names, inplace = True ) print(df.columns) print(df.head(3))
Code output: --[ df column headers ]------------------------------------------------------ Index(['name_first', 'name_last', 'city', 'trips', 'miles'], dtype='object') --[ df top 3 rows ]--------------------------- name_first name_last city trips miles 0 Herman Sanchez Miami 52 9300 1 Phil Parker Miami 45 5010 2 Bradie Garnett Denver 36 6300 ----------------------------------------------
It is also possible to assign column names directly, overwriting existing column names. This is risky, as the order of the existing columns need to be noted, else wrong column names would get assigned. It is always better to use a dictionary mapping.
Simple overwriting of column names
import pandas as pd df = pd.read_csv( 'data_deposits.csv', header = 0, sep = ',', ) df.columns = [ 'name_first', 'name_last', 'city', 'trips', 'miles' ] print(df.columns) print(df.head(3))
Code output: --[ df column headers ]------------------------------------------------------ Index(['name_first', 'name_last', 'city', 'trips', 'miles'], dtype='object') --[ df top 3 rows ]--------------------------- name_first name_last city trips miles 0 Herman Sanchez Miami 52 9300 1 Phil Parker Miami 45 5010 2 Bradie Garnett Denver 36 6300 ----------------------------------------------
Basic header customization during and after data import was covered. To read more about data loading, row and column selection check out the related tutorials.
# Python - Delete multiple elements from a list
# SEO Google page rank and web traffic
# Python: Random access generator for multi value sublist yield
# Python: Enumerate counter for loops over list, tuple, string
# Pandas - Read, skip and customize column headers for read_csv
# Pandas - Selecting data rows and columns using read_csv
# Pandas - Space, tab and custom data separators
# Sample data for Python tutorials
# Pandas - Purge duplicate rows
# Pandas - Concatenate or vertically merge dataframes
# Pandas - Search and replace values in columns
# Pandas - Count rows and columns in dataframe
# Pandas - Adding new static columns
# Python - Hardware and operating system information
# Pandas - Remove or drop columns from Pandas dataframe