Python Pandas - Working with Text Data
In this chapter, we will discuss the string operations with our basic Series/Index. In the subsequent chapters, we will learn how to apply these string functions on the DataFrame.
Pandas provides a set of string functions which make it easy to operate on string data. Most importantly, these functions ignore (or exclude) missing/NaN values.
Let us now see how each operation performs.
Sr.No
|
Function & Description
|
1
|
lower()
Converts strings in the Series/Index to lower case.
|
2
|
upper()
Converts strings in the Series/Index to upper case.
|
3
|
len()
Computes String length().
|
4
|
strip()
Helps strip whitespace(including newline) from each string in the Series/index from both the sides.
|
5
|
split(' ')
Splits each string with the given pattern.
|
6
|
cat(sep=' ')
Concatenates the series/index elements with given separator.
|
7
|
get_dummies()
Returns the DataFrame with One-Hot Encoded values.
|
8
|
contains(pattern)
Returns a Boolean value True for each element if the substring contains in the element, else False.
|
9
|
replace(a,b)
Replaces the value a
with the value b
.
|
10
|
repeat(value)
Repeats each element with specified number of times.
|
11
|
count(pattern)
Returns count of appearance of pattern in each element.
|
12
|
startswith(pattern)
Returns true if the element in the Series/Index starts with the pattern.
|
13
|
endswith(pattern)
Returns true if the element in the Series/Index ends with the pattern.
|
14
|
find(pattern)
Returns the first position of the first occurrence of the pattern.
|
15
|
findall(pattern)
Returns a list of all occurrence of the pattern.
|
16
|
swapcase
Swaps the case lower/upper.
|
17
|
islower()
Checks whether all characters in each string in the Series/Index in lower case or not. Returns Boolean
|
18
|
isupper()
Checks whether all characters in each string in the Series/Index in upper case or not. Returns Boolean.
|
19
|
isnumeric()
Checks whether all characters in each string in the Series/Index are numeric. Returns Boolean.
|
Let us now create a Series and see how all the above functions work.
import
pandas as
pd
import
numpy as
np
s =
pd.
Series
([
'Tom'
,
'William Rick'
,
'John'
,
'Alber@t'
,
np.
nan,
'1234'
,
'SteveSmith'
])
print
s
Its output
is as follows −
0 Tom
1 William Rick
2 John
3 Alber@t
4 NaN
5 1234
6 Steve Smith
dtype: object
lower():
import
pandas as
pd
import
numpy as
np
s =
pd.
Series
([
'Tom'
,
'William Rick'
,
'John'
,
'Alber@t'
,
np.
nan,
'1234'
,
'SteveSmith'
])
print
s.
str.
lower()
Its output
is as follows −
0 tom
1 william rick
2 john
3 alber@t
4 NaN
5 1234
dtype: object
upper():
import
pandas as
pd
import
numpy as
np
s =
pd.
Series
([
'Tom'
,
'William Rick'
,
'John'
,
'Alber@t'
,
np.
nan,
'1234'
,
'SteveSmith'
])
print
s.
str.
upper()
Its output
is as follows −
0 TOM
1 WILLIAM RICK
2 JOHN
3 ALBER@T
4 NaN
5 1234
6 STEVE SMITH
dtype: object
len():
import
pandas as
pd
import
numpy as
np
s =
pd.
Series
([
'Tom'
,
'William Rick'
,
'John'
,
'Alber@t'
,
np.
nan,
'1234'
,
'SteveSmith'
])
print
s.
str.
len()
Its output
is as follows −
0 3.0
1 12.0
3 7.0
4 NaN
5 4.0
6 10.0
dtype: float64
strip():
import
pandas as
pd
import
numpy as
np
s =
pd.
Series
([
'Tom '
,
' William Rick'
,
'John'
,
'Alber@t'
])
print
s
print
(
"After Stripping:"
)
print
s.
str.
strip()
Its output
is as follows −
0 Tom
1 William Rick
2 John
3 Alber@t
dtype: object
After Stripping:
0 Tom
1 William Rick
2 John
3 Alber@t
dtype: object
split(pattern):
import
pandas as
pd
import
numpy as
np
s =
pd.
Series
([
'Tom '
,
' William Rick'
,
'John'
,
'Alber@t'
])
print
s
print
(
"Split Pattern:"
)
print
s.
str.
split(
' '
)
Its output
is as follows −
0 Tom
1 William Rick
2 John
3 Alber@t
dtype: object
Split Pattern:
0 [Tom, , , , , , , , , , ]
1 [, , , , , William, Rick]
2 [John]
3 [Alber@t]
dtype: object
cat(sep=pattern):
import
pandas as
pd
import
numpy as
np
s =
pd.
Series
([
'Tom '
,
' William Rick'
,
'John'
,
'Alber@t'
])
print
s.
str.
cat(
sep=
'_'
)
Its output
is as follows −
Tom _ William Rick_John_Alber@t
get_dummies():
import
pandas as
pd
import
numpy as
np
s =
pd.
Series
([
'Tom '
,
' William Rick'
,
'John'
,
'Alber@t'
])
print
s.
str.
get_dummies()
Its output
is as follows −
William Rick Alber@t John Tom
0 0 0 0 1
1 1 0 0 0
2 0 0 1 0
3 0 1 0 0
contains ():
import
pandas as
pd
s =
pd.
Series
([
'Tom '
,
' William Rick'
,
'John'
,
'Alber@t'
])
print
s.
str.
contains(
' '
)
Its output
is as follows −
0 True
1 True
2 False
3 False
dtype: bool
replace(a,b):
import
pandas as
pd
s =
pd.
Series
([
'Tom '
,
' William Rick'
,
'John'
,
'Alber@t'
])
print
s
print
(
"After replacing @ with $:"
)
print
s.
str.
replace(
'@'
,
'$'
)
Its output
is as follows −
0 Tom
1 William Rick
2 John
3 Alber@t
dtype: object
After replacing @ with $:
0 Tom
1 William Rick
2 John
3 Alber$t
dtype: object
repeat(value):
import
pandas as
pd
s =
pd.
Series
([
'Tom '
,
' William Rick'
,
'John'
,
'Alber@t'
])
print
s.
str.
repeat(
2
)
Its output
is as follows −
0 Tom Tom
1 William Rick William Rick
2 JohnJohn
3 Alber@tAlber@t
dtype: object
count(pattern):
import
pandas as
pd
s =
pd.
Series
([
'Tom '
,
' William Rick'
,
'John'
,
'Alber@t'
])
print
(
"The number of 'm's in each string:"
)
print
s.
str.
count(
'm'
)
Its output
is as follows −
The number of 'm's in each string:
0 1
1 1
2 0
3 0
startswith(pattern):
import
pandas as
pd
s =
pd.
Series
([
'Tom '
,
' William Rick'
,
'John'
,
'Alber@t'
])
print
(
"Strings that start with 'T':"
)
print
s.
str.
startswith (
'T'
)
Its output
is as follows −
0 True
1 False
2 False
3 False
dtype: bool
endswith(pattern):
import
pandas as
pd
s =
pd.
Series
([
'Tom '
,
' William Rick'
,
'John'
,
'Alber@t'
])
print
(
"Strings that end with 't':"
)
print
s.
str.
endswith(
't'
)
Its output
is as follows −
Strings that end with 't':
1 False
2 False
3 True
dtype: bool
find(pattern):
import
pandas as
pd
s =
pd.
Series
([
'Tom '
,
' William Rick'
,
'John'
,
'Alber@t'
])
print
s.
str.
find(
'e'
)
Its output
is as follows −
0 -1
1 -1
2 -1
3 3
dtype: int64
"-1" indicates that there no such pattern available in the element.
findall(pattern):
import
pandas as
pd
s =
pd.
Series
([
'Tom '
,
' William Rick'
,
'John'
,
'Alber@t'
])
print
s.
str.
findall(
'e'
)
Its output
is as follows −
0 []
1 []
2 []
3 [e]
Null list([ ]) indicates that there is no such pattern available in the element.
swapcase():
import
pandas as
pd
s =
pd.
Series
([
'Tom'
,
'William Rick'
,
'John'
,
'Alber@t'
])
print
s.
str.
swapcase()
Its output
is as follows −
0 tOM
1 wILLIAM rICK
2 jOHN
3 aLBER@T
dtype: object
islower():
import
pandas as
pd
s =
pd.
Series
([
'Tom'
,
'William Rick'
,
'John'
,
'Alber@t'
])
print
s.
str.
islower()
Its output
is as follows −
0 False
1 False
2 False
3 False
dtype: bool
isupper():
import
pandas as
pd
s =
pd.
Series
([
'Tom'
,
'William Rick'
,
'John'
,
'Alber@t'
])
print
s.
str.
isupper()
Its output
is as follows −
0 False
1 False
2 False
3 False
dtype: bool