For serious data science applications the data size can be huge. It often becomes necessary to load only the few columns that are needed to complete a specific job. Sampling is a way to limit how many rows are loaded into memory, or to create training and test data sets for machine learning. Pandas read_csv() provides multiple options to configure what data is read from a file.
We will be using data_deposits.csv to demonstrate various techniques to select the required data.
Not all data rows in the file may be needed, in which case specific rows can be skipped. Just provide read_csv() with a list of rows to skip to limit what is loaded.
Choosing rows to skip using a list for read_csv
import pandas as pd

#generate a list
skip_list = [n for n in range(1,6,2)]
print( skip_list )

#load all data
df_all = pd.read_csv( 'data_deposits.csv', sep = ',' )
print( df_all.head(10))

#skip rows in list
df_skp = pd.read_csv( 'data_deposits.csv', sep = ',', skiprows = skip_list )
print( df_skp.head(10))
The skipped dataframe has fewer rows; the odd-numbered rows listed in skip_list were skipped successfully.
--[ list to skip ]---------------------------
[1, 3, 5]

--[ df_all everything ]----------------------
  firstname lastname     city  age  deposit
0    Herman  Sanchez    Miami   52     9300
1      Phil   Parker    Miami   45     5010
2    Bradie  Garnett   Denver   36     6300
3    Rudolf   Crooks    Miami   33     5800
4    Marcos   Miller  Seattle   66     4300
5      Chad  Garnett    Miami   38     7420
6     Sally    Evans   Denver   25     3170
7      Chad   Parker  Seattle   55    12600

--[ df_skp skipped rows ]--------------------
  firstname lastname     city  age  deposit
0      Phil   Parker    Miami   45     5010
1    Rudolf   Crooks    Miami   33     5800
2      Chad  Garnett    Miami   38     7420
3     Sally    Evans   Denver   25     3170
4      Chad   Parker  Seattle   55    12600
---------------------------------------------
There is no need to create a skip list in advance. A callable can be passed to skiprows instead; it is evaluated against each row index and the row is skipped whenever the callable returns True. Here a lambda function skips every file line with an odd line number by checking the remainder of division by two; since line 0 is the header, this removes the even-indexed data rows.
Skip even data rows
import pandas as pd

#skip even data rows
df_odd = pd.read_csv( 'data_deposits.csv', sep = ',', skiprows = lambda x: x % 2 != 0 )
print( df_odd.head(10))
Comparing with the 8 rows of the full file, it is clear that only the odd-indexed data rows have been imported.
--[ df_odd ]--------------------------------
  firstname lastname     city  age  deposit
0      Phil   Parker    Miami   45     5010
1    Rudolf   Crooks    Miami   33     5800
2      Chad  Garnett    Miami   38     7420
3      Chad   Parker  Seattle   55    12600
--------------------------------------------
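The callable form of skiprows also lends itself to the sampling use case mentioned at the beginning. The sketch below keeps roughly half of the data rows at load time; the 50% fraction and the fixed seed are illustrative assumptions, not part of the original example.

Sample rows at load time with a callable skiprows (sketch)

import random
import pandas as pd

random.seed(42)   #illustrative seed, for a reproducible sample

#keep roughly 50% of the data rows; line 0 is the header and is never skipped
df_sample = pd.read_csv( 'data_deposits.csv', sep = ',',
                         skiprows = lambda x: x > 0 and random.random() < 0.5 )
print( df_sample.head(10))

Re-seeding and inverting the condition would yield the complementary rows, which is one way to sketch a simple train/test split while reading the file.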
There can be cases where the end of the file contains comments and the last few rows need to be skipped. There is an option for that too: skipfooter = &lt;number of rows&gt;.
Skip rows at the end of file
import pandas as pd

#skip three end rows
df = pd.read_csv( 'data_deposits.csv', sep = ',', skipfooter = 3, engine = 'python' )
print( df.head(10))
Note that the last three rows have not been read. Also note that an additional parameter has been added which explicitly requests the 'python' engine. The default 'c' engine does not support skipfooter, and pandas issues a non-fatal warning if the engine is not specified.
ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skipfooter; you can avoid this warning by specifying engine='python'.
--[ df 3 less rows at the end ]--------------
  firstname lastname     city  age  deposit
0    Herman  Sanchez    Miami   52     9300
1      Phil   Parker    Miami   45     5010
2    Bradie  Garnett   Denver   36     6300
3    Rudolf   Crooks    Miami   33     5800
4    Marcos   Miller  Seattle   66     4300
---------------------------------------------
It is also possible to skip rows which start with a specific character like % or #, which usually means the line is a comment. Here any line starting with 'C' will be treated as a comment.
Pandas read_csv with comment character = 'C'
import pandas as pd

#skip comments
df_com = pd.read_csv( 'data_deposits.csv', sep = ',', comment = 'C' )
print( df_com.head(10))
The resulting dataframe skips the two rows that start with 'C', namely the Chad Garnett and Chad Parker rows. However, while reading the Rudolf Crooks row, parsing suddenly stops once the 'C' of Crooks is reached: the rest of the line is ignored and filled in with NaN. This is a most unfortunate outcome, and it shows that the comment option should be used with care. The comment character should appear only at the beginning of a line and should have no use within the valid data; otherwise the parser stops parsing a line as soon as it encounters the comment character.
--[ df with C=comment ]----------------------
  firstname lastname     city   age  deposit
0    Herman  Sanchez    Miami  52.0   9300.0
1      Phil   Parker    Miami  45.0   5010.0
2    Bradie  Garnett   Denver  36.0   6300.0
3    Rudolf      NaN      NaN   NaN      NaN
4    Marcos   Miller  Seattle  66.0   4300.0
5     Sally    Evans   Denver  25.0   3170.0
----------------------------------------------
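If the comment character can be chosen freely, a symbol such as '#' that never occurs inside the valid fields is a much safer choice. Below is a minimal sketch, assuming a hypothetical file data_deposits_commented.csv whose annotation lines start with '#'.

Pandas read_csv with comment character = '#' (sketch)

import pandas as pd

#hypothetical file: same columns as data_deposits.csv,
#plus a few annotation lines that start with '#'
df_safe = pd.read_csv( 'data_deposits_commented.csv', sep = ',', comment = '#' )
print( df_safe.head(10))

Because '#' does not appear inside any field, no data line is truncated.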
It is an unnecessary burden to load unwanted data columns into memory. If the required columns are already known, then read_csv() can import only the columns that are actually needed. If the names of the columns are not known, they can be addressed numerically.
By specifying header = 0 we indicate that the first row is to be treated as the header.
Read selected column id(s) from csv
import pandas as pd

#load specific columns only by column_id
#first line is a header
df = pd.read_csv( 'data_deposits.csv', sep = ',', header = 0, usecols = [0, 1] )
print( df.head(10))
The first two columns, namely firstname and lastname, have been imported into the dataframe.
--[ df cols=0,1]-----
  firstname lastname
0    Herman  Sanchez
1      Phil   Parker
2    Bradie  Garnett
3    Rudolf   Crooks
4    Marcos   Miller
5      Chad  Garnett
6     Sally    Evans
7      Chad   Parker
---------------------
It is also possible to match the column names. To be certain of a match, the column names are converted to a consistent case (lower case in this example).
Load data for specific columns by name
import pandas as pd

#load specific columns only by name
df = pd.read_csv( 'data_deposits.csv', header = 0, sep = ',',
                  usecols = lambda x: x.lower() in ["lastname", "age"] )
print( df.head(10))
--[ df name, age]--
  lastname  age
0  Sanchez   52
1   Parker   45
2  Garnett   36
3   Crooks   33
4   Miller   66
5  Garnett   38
6    Evans   25
7   Parker   55
-------------------
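When the exact header spelling is known, a plain list of column names can also be passed to usecols directly; the lambda form above is only needed for case-insensitive or more elaborate matching. A minimal sketch on the same file:

Load specific columns by exact name (sketch)

import pandas as pd

#select columns by their exact names
df = pd.read_csv( 'data_deposits.csv', header = 0, sep = ',',
                  usecols = ["lastname", "age"] )
print( df.head(10))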
Selectively loading data rows and columns is essential when working on projects with very large volumes of data, or while testing data-centric code.
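As a closing sketch, the row and column options can be combined in a single read_csv() call; the particular combination below is illustrative only.

Combine row and column selection (sketch)

import pandas as pd

#keep only lastname and deposit, and skip the even data rows
df_subset = pd.read_csv( 'data_deposits.csv', sep = ',', header = 0,
                         usecols = ["lastname", "deposit"],
                         skiprows = lambda x: x % 2 != 0 )
print( df_subset.head(10))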