real panda
May 2, 2019 * Python Programming

Pandas - Selecting data rows and columns using read_csv

For serious data science applications the data size can be huge. It becomes necessary to load only the few necessary columns for to complete a specific job. Sampling data is a way to limit the number of rows of unique data points are loaded into memory, or to create training and test data sets for machine learning. Pandas read_csv() provides multiple options to configure what data is read from a file.

We will be using data_deposits.csv to demonstrate various techniques to select the required data.

Skipping rows

All available data rows on file may not be needed, in which case certain rows can be skipped. Just provide read_csv with a list of rows to skip to limit what is loaded.

Choosing rows to skip using a list for read_csv

import pandas as pd

#generate a list
skip_list = [n for n in range(1,6,2)]
print( skip_list )

#load all data
df_all = pd.read_csv(
  'data_deposits.csv',
  sep = ','
)
print( df_all.head(10))

#skip rows in list
df_skp = pd.read_csv(
    'data_deposits.csv', 
    sep = ',', 
    skiprows = skip_list
)
print( df_skp.head(10))

Skipped dataframe has fewer rows. The odd rows were skipped successfully.

--[ list to skip ]---------------------------
[1, 3, 5]

--[ df_all everything ]----------------------
  firstname lastname     city  age  deposit
0    Herman  Sanchez    Miami   52     9300
1      Phil   Parker    Miami   45     5010
2    Bradie  Garnett   Denver   36     6300
3    Rudolf   Crooks    Miami   33     5800
4    Marcos   Miller  Seattle   66     4300
5      Chad  Garnett    Miami   38     7420
6     Sally    Evans   Denver   25     3170
7      Chad   Parker  Seattle   55    12600

--[ df_skp skipped rows ]--------------------
  firstname lastname     city  age  deposit
0      Phil   Parker    Miami   45     5010
1    Rudolf   Crooks    Miami   33     5800
2      Chad  Garnett    Miami   38     7420
3     Sally    Evans   Denver   25     3170
4      Chad   Parker  Seattle   55    12600
---------------------------------------------

There is no need to create a skip list. A function to generate the list can be passed on to skiprows. Here a Lambda function neatly checks if a row is even by determining the remainder for division by two.

Skip even data rows

import pandas as pd

#skip even data rows
df_odd = pd.read_csv(
    'data_deposits.csv', 
    sep = ',', 
    skiprows = lambda x: x % 2 != 0
)
print( df_odd.head(10))

Comparing with the entire 8 rows from the full file, it is clear that only the odd rows have been imported.

--[ df_odd ]--------------------------------
  firstname lastname     city  age  deposit
0      Phil   Parker    Miami   45     5010
1    Rudolf   Crooks    Miami   33     5800
2      Chad  Garnett    Miami   38     7420
3      Chad   Parker  Seattle   55    12600
--------------------------------------------

There can be cases where the end of the file has comments, and the last few rows need to be skipped. There is an option for that to using skipfooter = #rows.

Skip rows at the end of file

import pandas as pd

#skip three end rows
df = pd.read_csv(
    'data_deposits.csv', 
    sep = ',', 
    skipfooter = 3,
    engine = 'python'
)
print( df.head(10))

Note that the last three rows have not been read. Also note that an additional parameter has been added which explicitly requests the use of the 'python' engine. The default 'c' engine does not support skipfooter. Python throws a non-fatal warning if engine is not specified.

ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skipfooter; you can avoid this warning by specifying engine='python'.

--[ df 3 less rows at the end ]--------------
  firstname lastname     city  age  deposit
0    Herman  Sanchez    Miami   52     9300
1      Phil   Parker    Miami   45     5010
2    Bradie  Garnett   Denver   36     6300
3    Rudolf   Crooks    Miami   33     5800
4    Marcos   Miller  Seattle   66     4300
---------------------------------------------

It is also possible to skip rows which start with a specific character like % or # which often means that the contents of the line is a comment. Here any line starting with 'C' will be treated as a comment.

Pandas read_csv with comment character = 'C'

import pandas as pd

#skip comments
df_com = pd.read_csv(
    'data_deposits.csv', 
    sep = ',', 
    comment = 'C'
)
print( df_com.head(10))

The resulting dataframe skips two rows that start with 'C' namely Chad... However, while reading Rudolf Crooks, the parsing suddenly stops for the line once we reach 'C' of Crooks. Rest of the line is ignored and filled in with NaN. This is most unfortunate outcome, which shows that the comment option should be used with care. The unique comment character should only be at the beginning of the line, and should have no use within the valid data. Else, the parser would stop parsing the line if it encounters the comment character.

--[ df with C=comment ]----------------------
  firstname lastname     city   age  deposit
0    Herman  Sanchez    Miami  52.0   9300.0
1      Phil   Parker    Miami  45.0   5010.0
2    Bradie  Garnett   Denver  36.0   6300.0
3    Rudolf      NaN      NaN   NaN      NaN
4    Marcos   Miller  Seattle  66.0   4300.0
5     Sally    Evans   Denver  25.0   3170.0
----------------------------------------------

Selecting data columns

It is an unnecessary burden to load unwanted data columns into computer memory. If the columns needed are already determined, then we can use read_csv() to import only the data columns which are absolutely needed. If the names of the columns are not known, then we can address them numerically.

By specifying header=0 we are specifying that the first row is to be treated as header information.

Read selected column id(s) from csv

import pandas as pd

#load specific columns only by column_id
#first line is a header
df = pd.read_csv(
    'data_deposits.csv',
    sep = ',',
    header = 0,
    usecols = [0, 1]
)
print( df.head(10))

The first two columns namely firstname and lastname have been imported into dataframe.

--[ df cols=0,1]-----
  firstname lastname
0    Herman  Sanchez
1      Phil   Parker
2    Bradie  Garnett
3    Rudolf   Crooks
4    Marcos   Miller
5      Chad  Garnett
6     Sally    Evans
7      Chad   Parker
---------------------

It is also possible to match the column names. To be certain of match, the column names are converted to a definite case (lower in this example).

Load data for specific columns by name

import pandas as pd

#load specifc columns only by name
df = pd.read_csv(
    'data_deposits.csv',
    header = 0,
    sep = ',',
    usecols = lambda x:x.lower() in ["lastname", "age"]
)
print( df.head(10))
--[ df name, age]--
  lastname  age
0  Sanchez   52
1   Parker   45
2  Garnett   36
3   Crooks   33
4   Miller   66
5  Garnett   38
6    Evans   25
7   Parker   55
-------------------

Selectively loading data rows and columns is essential when working on projects with very large volume of data, or while testing some data-centric code.

References