real panda
May 1, 2019

Pandas - Space, tab and custom data separators

Data files are not always comma separated. Spaces, tabs, semicolons or other custom separators may be needed. Consider storing addresses, where commas may appear within the data itself, making a comma unsuitable as the separator. Let us examine the default behavior of read_csv(), and then make changes to accommodate custom separators.
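Before switching separators, note that a plain CSV can still hold commas inside a field as long as the field is quoted; read_csv() honors such quoting by default. A minimal sketch, using made-up address data fed through io.StringIO in place of a file:

```python
import io
import pandas as pd

# a field containing a comma must be quoted in a plain CSV
csv_text = 'name,address\nHerman,"12 Ocean Dr, Miami"\n'
df = pd.read_csv(io.StringIO(csv_text))
print(df.loc[0, 'address'])  # 12 Ocean Dr, Miami
```

Quoting works, but it is easy to get wrong when files are produced by hand, which is one reason to pick a separator that never appears in the data.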

Comma separator used explicitly

The default separator for read_csv() is the comma. Specifying it explicitly does not change the normal behavior, but it does remind us which separator is in use. This is helpful when multiple files with different separators are involved.

Using sep="," explicitly

import pandas as pd

#explicit comma separator
df = pd.read_csv(
  'data_deposits.csv', 
  sep = ","
)
print( df.head(3))
Output for code:
--[ df head 3 ]-----------------------------
  firstname lastname    city  age  deposit
0    Herman  Sanchez   Miami   52     9300
1      Phil   Parker   Miami   45     5010
2    Bradie  Garnett  Denver   36     6300
--------------------------------------------

The output is the same with or without the sep="," option. However, by changing this parameter we can now load other files, such as space and tab separated data.

Tab separated and space separated data

Tab separated data works well when both spaces and commas are part of the data. Tab is a special character, and should not be visually confused with a space. We can represent a tab using "\t". Both single and double quotes work.

Load single tab separated text file (tsv)

import pandas as pd

#tab separated file 
df = pd.read_csv(
  'data_deposits.tsv',
  sep = '\t'
)

print( df.head(3))
Output for code:
--[ df head 3 ]-----------------------------
  firstname lastname    city  age  deposit
0    Herman  Sanchez   Miami   52     9300
1      Phil   Parker   Miami   45     5010
2    Bradie  Garnett  Denver   36     6300
--------------------------------------------

The output above shows that, with sep='\t', a tsv file behaves just like a csv.
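To see why the sep parameter matters, compare what happens when the same tab separated text is loaded with the default comma separator. A small sketch with made-up inline data via io.StringIO, rather than the data_deposits.tsv file:

```python
import io
import pandas as pd

tsv_text = 'firstname\tlastname\nHerman\tSanchez\n'

# default comma separator: no commas found, everything lands in one column
wrong = pd.read_csv(io.StringIO(tsv_text))
# explicit tab separator: columns split correctly
right = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(wrong.shape[1], right.shape[1])  # 1 2
```

With the wrong separator pandas does not fail; it silently produces a single mashed-together column, which can be harder to spot than an error.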

For space separated files, let us make the situation more challenging by allowing a variable number of consecutive spaces to act as the separator, instead of a single space character. This complex separator can be represented in regex notation as "\s+", meaning one or more consecutive whitespace characters are treated as a single separator. That was concise, but let's try it out.

Loading variable space separated values (ssv)

import pandas as pd

#variable space separator
df = pd.read_csv(
  'data_deposits.ssv',
  sep = '\s+'
)
print( df.head(3))
Output for code:
--[ df head 3 ]-----------------------------
  firstname lastname    city  age  deposit
0    Herman  Sanchez   Miami   52     9300
1      Phil   Parker   Miami   45     5010
2    Bradie  Garnett  Denver   36     6300
--------------------------------------------

Custom separator

We have used a single tab and one or more consecutive spaces as separators. How about a multi-character string as the delimiter? We will use the sequence of letters 'abc' as the separator.

Custom sequence of characters as separator

import pandas as pd

#custom separator
df = pd.read_csv(
  'data_deposits.abc',
  sep = 'abc',
  engine = 'python'
)
print( df.head(3))
Output for code:
--[ df head 3 ]-----------------------------
  firstname lastname    city  age  deposit
0    Herman  Sanchez   Miami   52     9300
1      Phil   Parker   Miami   45     5010
2    Bradie  Garnett  Denver   36     6300
--------------------------------------------

The output is the desired outcome. Note that an additional option, engine='python', has been added. This prevents pandas from issuing a harmless warning which reads:

"ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'."

As the warning explains, pandas interprets any separator longer than one character (other than '\s+') as a regular expression, and the default C engine does not support regex separators, so the Python engine must parse the data instead. The Python engine is noticeably slower when working with huge amounts of data.
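Since multi-character separators are treated as regular expressions anyway, we can also use a deliberate regex as the separator. A sketch with made-up inline data via io.StringIO, where the columns are split by a mix of semicolons and commas:

```python
import io
import pandas as pd

# '[;,]' is a regex matching either ';' or ',' as the separator
text = 'firstname;lastname,city\nHerman;Sanchez,Miami\n'
df = pd.read_csv(
  io.StringIO(text),
  sep = '[;,]',
  engine = 'python'
)
print(list(df.columns))  # ['firstname', 'lastname', 'city']
```

This is handy for messy files where the separator is not perfectly consistent, at the cost of the slower Python engine.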
