PANDAS
Data analysis refers to the process of evaluating big data sets using analytical and statistical tools so as to discover useful information and conclusions that support business decision-making. Pandas, or Python Pandas, is Python's library for data analysis. Pandas derives its name from "panel data system", an econometrics term for multi-dimensional, structured data sets.
Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Python with Pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, analytics, etc.
Pandas is the most popular library in the scientific Python ecosystem for doing data analysis. Pandas is capable of many tasks, including:
• It can read and write data in many different formats and data types (integer, float, double, etc.).
• It can calculate in all the ways data is organized, i.e., across rows and down columns.
• It can easily select subsets of data from bulky data sets and even combine multiple data sets together.
• It has functionality to find and fill missing data.
• It allows you to apply operations to independent groups within the data.
• It supports reshaping of data into different forms.
• It supports advanced time-series functionality (time series forecasting is the use of a model to predict future values based on previously observed values).
• It supports visualization by integrating with libraries such as matplotlib and seaborn.
Key Features of Pandas
• Fast and efficient DataFrame object with default and customized indexing.
• Tools for loading data into in-memory data objects from different file formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, indexing and subsetting of large data sets.
• Columns from a data structure can be deleted or inserted.
• Group by data for aggregation and transformations.
• High performance merging and joining of data.
• Time series functionality.
In other words, Pandas is best at handling huge tabular data sets comprising different data formats. The Pandas library also supports the simplest of tasks needed with data, such as loading data or doing feature engineering on time-series data.
A data structure is a particular way of storing and organizing data in a computer to suit a specific purpose, so that it can be accessed and worked with in appropriate ways. Data structures refer to specialized ways of storing data so as to apply a specific type of functionality to them. Out of the many data structures of Pandas, two basic data structures - Series and DataFrame - are universally popular.
Data Structure   Dimensions   Description
Series           1            1D labeled homogeneous array; size-immutable.
DataFrame        2            General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.
Panel            3            General 3D labeled, size-mutable array.
Mutability - All Pandas data structures are value-mutable (their values can be changed), and all except Series are size-mutable. Series is size-immutable.
Series Data Structure - A Series is a Pandas data structure that represents a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.
Key points: homogeneous data, size-immutable, values of data mutable.
A Series type object has two main components: an array of actual data, and an associated array of indexes or data labels. A Series type object can be created in many ways using the pandas library's Series().
i. Create an empty Series object by using just Series() with no parameter - This gives a Series object with no values and the default datatype, which is float64 (in older pandas versions; recent versions default to object). To create an empty object, i.e., one having no values, you can just use Series() as: <Series Object> = pandas.Series()
import pandas as pd
em = pd.Series()
print(em)
output : Series([], dtype: float64)
ii. Creating non-empty Series objects - To create non-empty Series objects, you need to specify arguments for data and indexes as per the following syntax:
<Series object> = pd.Series(data, index=idx)
where data is the data part of the Series object; it can be one of the following: 1) a sequence, 2) an ndarray, 3) a dictionary, or 4) a scalar value.
(1) Specify data as a Python sequence. Give a sequence of values as the data argument to Series(), i.e., as: <Series Object> = pd.Series(<any Python sequence>)
obj1 = pd.Series(range(5))
print(obj1)
0    0
1    1
2    2
3    3
4    4
dtype: int64
(2) Specify data as an ndarray. The data attribute can be an ndarray also.
nda1 = np.arange(3, 13, 3.5)
ser1 = pd.Series(nda1)
print(ser1)
0     3.0
1     6.5
2    10.0
dtype: float64
(3) Specify data as a Python dictionary. data can be any sequence, including dictionaries.
obj5 = pd.Series({'Jan': 31, 'Feb': 28, 'Mar': 31})
Here it is noteworthy that when you create a Series object from a dictionary, the indexes are created from the dictionary's keys. In older versions of pandas the keys were sorted, so the indexes were not necessarily in the same order as you typed them; recent versions preserve the insertion order.
(4) Specify data as a scalar / single value. BUT if data is a scalar value, then the index must be provided. There can be one or more entries in the index sequence. The scalar value (given as data) will be repeated to match the length of the index.
medalsWon = pd.Series(10, index=range(0, 1))
medals2 = pd.Series(15, index=range(1, 6, 2))
ser2 = pd.Series('Yet to start', index=['Indore', 'Delhi', 'Shimla'])
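To see the repetition of the scalar, a short sketch using medals2 from above:
import pandas as pd
# The scalar 15 is repeated once for every index label (1, 3, 5 here)
medals2 = pd.Series(15, index=range(1, 6, 2))
print(medals2)
# 1    15
# 3    15
# 5    15
# dtype: int64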
Specifying/Adding NaN values in a Series object - Sometimes you need to create a Series object with missing values. In such cases, you can fill the missing data with a NaN (Not a Number) value. The legal empty value NaN is defined in the NumPy module, and hence you can use np.NaN to specify a missing value.
obj3 = pd.Series([6.5, np.NaN, 2.34])
(ii) Specify index(es) as well as data with Series(). While creating a Series type object, you can provide indexes along with the values. Both values and indexes are sequences. The syntax is:
<Series Object> = pandas.Series(data=None, index=None)
Both data and index have to be sequences; None is taken by default if you skip these parameters.
arr = [31, 28, 31, 30]
mon = ['Jan', 'Feb', 'Mar', 'Apr']
obj3 = pd.Series(data=arr, index=mon)
obj4 = pd.Series(data=[32, 34, 35], index=['A', 'B', 'C'])
You may use a loop for defining the index sequence also, e.g.,
s1 = pd.Series(range(1, 15, 3), index=[x for x in 'abcde'])
output :
a     1
b     4
c     7
d    10
e    13
dtype: int64
(iii) Specify data type along with data and index. You can also specify the data type along with data and index with Series() as per the following syntax:
<Series Object> = pandas.Series(data=None, index=None, dtype=None)
obj4 = pd.Series(data=[32, 34, 35], index=['A', 'B', 'C'], dtype=float)
print(obj4)
A    32.0
B    34.0
C    35.0
dtype: float64
NOTE : A Series object's indexes need not always be 0 to n-1.
(iv) Using a mathematical function/expression to create the data array in Series(). Series() allows you to define a function or expression that calculates values for the data sequence:
<Series Object> = pd.Series(index=None, data=<function | expression>)
NumPy array :
a = np.arange(9, 13)
obj7 = pd.Series(index=a, data=a*2)
print(obj7)
Python list :
Lst = [9, 10, 11, 12]
obj8 = pd.Series(data=(2 * Lst))
print(obj8)
It is important to understand that if we apply the operation/expression on a NumPy array, then the given operation is carried out in a vectorized way, i.e., applied on each element of the NumPy array, and the newly generated sequence is taken as the data array. BUT if you apply a similar operation on a Python list, the result will be entirely different (2 * Lst repeats the list rather than doubling each element).
NOTE : While creating a Series object, when you give the index array as a sequence, there is no compulsion for the uniqueness of indexes. That is, you can have duplicate entries in the index array and Python won't raise any error. Indices need not be unique in a Pandas Series. This will only cause an error if/when you perform an operation that requires unique indices.
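A minimal sketch of duplicate indexes in action (the names here are illustrative):
import pandas as pd
# Duplicate index labels are accepted without error
s = pd.Series([10, 20, 30], index=['a', 'a', 'b'])
print(s['a'])    # returns BOTH rows labelled 'a'
# a    10
# a    20
# dtype: int64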
Common attributes of Series objects
Attribute          Description
Series.index       The index (axis labels) of the Series.
Series.values      Return Series as ndarray or ndarray-like depending on the dtype.
Series.dtype       Return the dtype object of the underlying data.
Series.shape       Return a tuple of the shape of the underlying data.
Series.nbytes      Return the number of bytes in the underlying data.
Series.ndim        Return the number of dimensions of the underlying data.
Series.size        Return the number of elements in the underlying data.
Series.itemsize    Return the size of the dtype of the item of the underlying data (in bytes).
Series.hasnans     Return True if there are any NaN values; otherwise return False.
Series.empty       Return True if the Series object is empty, False otherwise.
Series.head()      Returns the first n rows.
Series.tail()      Returns the last n rows.
Series.axes        Returns a list of the row axis labels.
NOTE : If you use len( ) on a series object, then it returns total elements in it including NaNs but
<series>.count( ) returns only the count of non-NaN values in a Series object.
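A quick sketch of the difference (obj3 here is the NaN example from earlier):
import pandas as pd
import numpy as np
obj3 = pd.Series([6.5, np.NaN, 2.34])
print(len(obj3))       # 3 -> includes the NaN
print(obj3.count())    # 2 -> non-NaN values only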
Accessing Individual Elements : To access individual elements of a Series object, you can give its index in square brackets along with its name, as you have been doing with other Python sequences. For example: <Series Object name>[<valid index>], e.g.,
obj5['Feb']  # BUT if you try to give an index which is not a legal index, it will give you an error.
Extracting Slices from a Series Object : Like other sequences, you can extract slices from a Series object. Slicing is a powerful way to retrieve subsets of data from a pandas object.
Slicing takes place position-wise and not index-wise in a Series object. All individual elements have position numbers starting from 0 onwards, i.e., 0 for the first element, 1 for the 2nd element and so on.
A slice is created from a Series object using the syntax <Object>[start : end : step], where start and end signify the positions of elements, not the indexes. The slice of a Series object is also a pandas Series type object.
import pandas as pd
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
print(s[:3])
a    1
b    2
c    3
dtype: int64
Operations on Series Objects :
1. Modifying Elements of a Series Object : The data values of a Series object can be easily modified through item assignment, i.e., <SeriesObject>[<index>] = <new_data_value>
The above assignment will change the data value of the given index in the Series object.
obj4 = pd.Series([2,4,6,8,10])
obj4[1:3] = -1
print(obj4)
0     2
1    -1
2    -1
3     8
4    10
dtype: int64
The above assignment will replace all the values falling in the given slice. Please note that a Series object's values can be modified but its size cannot. So you can say that Series objects are value-mutable but size-immutable objects.
2. The head() and tail() functions : The head() function is used to fetch the first n rows from a pandas object and the tail() function returns the last n rows from a pandas object.
obj4 = pd.Series([2,4,6,8,10])
print(obj4.head(2))
0    2
1    4
dtype: int64
obj4 = pd.Series([2,4,6,8,10])
print(obj4.tail(2))
3     8
4    10
dtype: int64
3. Vector Operations on Series Objects : Vector operations mean that if you apply a function or expression, it is individually applied on each item of the object. Since Series objects are built upon NumPy arrays (ndarrays), they also support vectorized operations, just like ndarrays.
obj4 = pd.Series([2,4,6])
print(obj4 ** 2)
0     4
1    16
2    36
dtype: int64
4. Arithmetic on Series Objects : You can perform arithmetic like addition, subtraction, division etc. with two Series objects, and it will calculate the result on the corresponding items of the two objects given in the expression, BUT with a caveat - the operation is performed only on the matching indexes. Also, if the data items of two matching indexes are not compatible for the operation, it will return NaN (Not a Number) as the result.
When you perform arithmetic operations on two Series type objects, the data is aligned on the basis of matching indexes (this is called data alignment in pandas objects) and then the arithmetic is performed; for non-overlapping indexes, the arithmetic operations result in NaN (Not a Number). You can store the result of object arithmetic in another object, which will also be a Series object: ob6 = ob1 + ob3, as sketched below.
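A minimal sketch of such object arithmetic (ob1 and ob3 here are illustrative; only 'b' and 'c' match):
import pandas as pd
ob1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
ob3 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])
ob6 = ob1 + ob3        # aligned on matching indexes
print(ob6)
# a     NaN    <- 'a' occurs only in ob1
# b    21.0    <- 20 + 1
# c    32.0    <- 30 + 2
# d     NaN    <- 'd' occurs only in ob3
# dtype: float64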
5. Filtering Entries : You can filter out entries from a Series object using expressions that are of Boolean type, as shown in the sketch below.
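A minimal sketch of Boolean filtering:
import pandas as pd
obj4 = pd.Series([2, 4, 6, 8, 10])
# keep only the entries for which the condition is True
print(obj4[obj4 > 5])
# 2     6
# 3     8
# 4    10
# dtype: int64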
6. Re-indexing - Sometimes you need to create a similar object but with a different order of the same indexes. You can use reindexing for this purpose as per this syntax:
<Series Object> = <Object>.reindex(<sequence with new order of indexes>)
obj4 = pd.Series([2,4,6], index=[0,1,2])
obj4 = obj4.reindex([1,2,3])
print(obj4)
1    4.0
2    6.0
3    NaN
dtype: float64
With this, the same data values and their indexes will be stored in the new object as per the order of indexes defined in reindex(); an index absent from the original object (3 here) gets NaN.
7. Dropping Entries from an Axis - Sometimes you do not need a data value at a particular index. You can remove that entry from a Series object using drop() as per this syntax:
<Series Object>.drop(<index or indexes to be removed>)
obj4 = pd.Series([2,4,6], index=[0,1,2])
obj4 = obj4.drop([1,2])
print(obj4)
0    2
dtype: int64
Difference between NumPy Arrays and Series Objects
(i) In the case of ndarrays, you can perform vectorized operations only if the shapes of the two ndarrays match; otherwise it raises an error. But with Series objects, in vectorized operations the data of the two Series objects is aligned as per matching indexes and the operation is performed on them; for non-matching indexes, NaN is returned.
(ii) In ndarrays, the indexes are always numeric, starting from 0 onwards, BUT Series objects can have any type of indexes, including numbers (not necessarily starting from 0), letters, labels, strings etc. A sketch of this difference follows.
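A short sketch contrasting the two behaviours:
import numpy as np
import pandas as pd
n1 = np.array([1, 2, 3])
n2 = np.array([1, 2])
# n1 + n2              # ValueError: shapes (3,) and (2,) do not match
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2], index=['b', 'c'])
print(s1 + s2)         # aligned by index; 'a' has no match -> NaN
# a    NaN
# b    3.0
# c    5.0
# dtype: float64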
Accessing Data from a Series by Position
Example 1 - Retrieve the first element. As we already know, counting starts from zero for the array, which means the first element is stored at the zeroth position and so on.
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
#retrieve the first element
print(s[0])
1
Example 2 - Retrieve the first three elements in the Series. If a : is inserted in front of a position, all items from that position onwards will be extracted. If two parameters (with : between them) are used, items between the two positions (not including the stop position) are extracted.
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
#retrieve the first three elements
print(s[:3])
a    1
b    2
c    3
dtype: int64
Example 3 - Retrieve the last three elements.
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
#retrieve the last three elements
print(s[-3:])
c    3
d    4
e    5
dtype: int64
Retrieve Data Using Label (Index)
A Series is like a fixed-size dict in that you can get and set values by index label.
Example 1 - Retrieve a single element using an index label value.
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
#retrieve a single element
print(s['a'])
1
Example 2 - Retrieve multiple elements using a list of index label values.
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
#retrieve multiple elements
print(s[['a','c','d']])
a    1
c    3
d    4
dtype: int64
Example 3 - If a label is not contained, an exception is raised.
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
#retrieve missing element
print(s['f'])
…
KeyError: 'f'
DataFrame Data Structure : A DataFrame is another pandas structure, which stores data in a two-dimensional way. It is actually a two-dimensional (tabular, spreadsheet-like) labeled array, which is an ordered collection of columns where columns may store different types of data, e.g., numeric, string, floating point or Boolean type etc. A two-dimensional array is an array in which each element is itself an array.
Features of DataFrame
1. Potentially columns are of different types
2. Size – Mutable
3. Labeled axes (rows and columns)
4. Can Perform Arithmetic operations on rows and columns
Major characteristics of a DataFrame data structure can be listed as:
(i) It has two indexes, or we can say two axes - a row index (axis=0) and a column index (axis=1).
(ii) Conceptually it is like a spreadsheet where each value is identifiable with the combination of row index and column index. The row index is known as the index in general and the column index is called the column name.
(iii) The indexes can be of numbers or letters or strings.
(iv) There is no condition of having all data of same type across columns; its columns can have
data of different types.
(v) You can easily change its values, i.e., it is value-mutable.
(vi) You can add or delete rows/columns in a DataFrame. In other words, it is size-mutable.
NOTE : DataFrames are both, value-mutable and size-mutable, i.e., you can change both its
values and size.
Creating and Displaying a DataFrame : You can create a DataFrame object by passing data in
many different ways, such as:
(i) Two-dimensional dictionaries i.e., dictionaries having lists or dictionaries or ndarrays or Series
objects etc.
(ii) Two-dimensional ndarrays (NumPy array)
(iii) Series type object
(iv) Another DataFrame object
(i) Two-dimensional dictionaries, i.e., dictionaries having lists or dictionaries or ndarrays or Series objects etc. A two-dimensional dictionary is a dictionary having items as (key : value) where the value part is a data structure of any type: another dictionary, an ndarray, a Series object, a list etc. Here the value parts of all the keys should have similar structure and equal lengths.
(a) Creating a dataframe from a 2D dictionary having values as lists / ndarrays :
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'], 'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)
   Age   Name
0   28    Tom
1   34   Jack
2   29  Steve
3   42  Ricky
The keys of the 2D dictionary have become columns, and indexes have been generated 0 onwards (a RangeIndex). You can specify your own indexes too by specifying a sequence by the name index in the DataFrame() function.
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'], 'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)
       Age   Name
rank1   28    Tom
rank2   34   Jack
rank3   29  Steve
rank4   42  Ricky
(b) Creating a dataframe from a 2D dictionary having values as dictionary objects : A 2D dictionary can have values as dictionary objects too. You can also create a dataframe object using such a 2D dictionary object:
import pandas as pd
yr2015 = {'Qtr1' : 34500, 'Qtr2' : 56000, 'Qtr3' : 47000, 'Qtr4' : 49000}
yr2016 = {'Qtr1' : 44900, 'Qtr2' : 46100, 'Qtr3' : 57000, 'Qtr4' : 59000}
yr2017 = {'Qtr1' : 54500, 'Qtr2' : 51000, 'Qtr3' : 57000, 'Qtr4' : 58500}
disales = {2015 : yr2015, 2016 : yr2016, 2017 : yr2017}
df1 = pd.DataFrame(disales)
print(df1)
Its output is as follows −
       2015   2016   2017
Qtr1  34500  44900  54500
Qtr2  56000  46100  51000
Qtr3  47000  57000  57000
Qtr4  49000  59000  58500
NOTE : While creating a dataframe with a nested or 2D dictionary, Python interprets the outer dict keys as the columns and the inner keys as the row indices.
Now, had there been a situation where the inner dictionaries had non-matching keys, then Python would have done the following:
(i) There would be a total number of indexes equal to the number of unique inner keys across all the inner dictionaries.
(ii) For a key that has no matching keys in other inner dictionaries, the value NaN would be used to depict the missing values.
yr2015 = {'Qtr1' : 34500, 'Qtr2' : 56000, 'Qtr3' : 47000, 'Qtr4' : 49000}
yr2016 = {'Qtr1' : 44900, 'Qtr2' : 46100, 'Qtr3' : 57000, 'Qtr4' : 59000}
yr2017 = {'Qtr1' : 54500, 'Qtr2' : 51000, 'Qtr3' : 57000}
diSales = {2015 : yr2015, 2016 : yr2016, 2017 : yr2017}
df3 = pd.DataFrame(diSales)
print(df3)
Its output is as follows −
       2015   2016     2017
Qtr1  34500  44900  54500.0
Qtr2  56000  46100  51000.0
Qtr3  47000  57000  57000.0
Qtr4  49000  59000      NaN
NOTE : The total number of indexes in a DataFrame object is equal to the total unique inner keys of the 2D dictionary passed to it, and it would use NaN values to fill missing data, i.e., where the corresponding values for a key are missing in any inner dictionary.
(c) Create a DataFrame from a List of Dicts - A list of dictionaries can be passed as input data to create a DataFrame. The dictionary keys are by default taken as column names.
Example 1
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)
   a   b     c
0  1   2   NaN
1  5  10  20.0
Note − Observe, NaN (Not a Number) is appended in missing areas.
Example 2
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)
        a   b     c
first   1   2   NaN
second  5  10  20.0
Example 3
import pandas as pd
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
#With two column indices, values same as dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
#With two column indices with one index with other name
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)
Its output is as follows −
#df1 output
        a   b
first   1   2
second  5  10
#df2 output
        a   b1
first   1  NaN
second  5  NaN
2. Creating a DataFrame Object from a 1-D or 2-D ndarray – You can also pass a one- or two-dimensional NumPy array or list to DataFrame() to create a dataframe object.
Example 1
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)
   0
0  1
1  2
2  3
3  4
4  5
Example 2
data = [['Alex',10], ['Bob',12], ['Clarke',13]]
df = pd.DataFrame(data, columns=['Name','Age'])
print(df)
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13
Example 3
data = [['Alex',10], ['Bob',12], ['Clarke',13]]
df = pd.DataFrame(data, columns=['Name','Age'], dtype=float)
print(df)
     Name   Age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0
NOTE : By giving an index sequence, you can specify your own index names or labels.
If, however, the rows of the ndarray differ in length, i.e., if the number of elements in each row differs, then Python will create just a single column in the dataframe object and the type of the column will be considered as object.
narr3 = np.array([[101.5, 201.2], [400, 50, 600, 700], [212.3, 301.5, 405.2]])
dtf4 = pd.DataFrame(narr3)
3. Creating a DataFrame object from a 2D dictionary with values as Series objects
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
NOTE : The column names should be valid Python identifiers if you want to access them with attribute syntax (e.g., df.one).
4. Creating a DataFrame Object from another DataFrame Object
data = [['Alex',10], ['Bob',12], ['Clarke',13]]
df = pd.DataFrame(data, columns=['Name','Age'])
df1 = pd.DataFrame(df)
print(df1)
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13
DataFrame Attributes:
Getting the count of non-NA values in a dataframe - Like Series, you can use count() with a dataframe too to get the count of non-NaN values, but count() with a dataframe is a little more elaborate:
(i) If you do not pass any argument or pass 0 (the default is 0), then it returns the count of non-NA values for each column.
(ii) If you pass the argument as 1, then it returns the count of non-NA values for each row.
(iii) To get the count of non-NA values from rows/columns, you can explicitly specify the argument to count() as axis='index' or axis='columns', as shown in the sketch below.
NumPy Representation of a DataFrame - You can get the values of a dataframe object as a NumPy array using the values attribute. E.g.
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df.values)
[['Alex' 10]
['Bob' 12]
['Clarke' 13]]
Selecting / Accessing a Column –
<DataFrame object>[<column name>] or
<DataFrame object>.<column name>
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df['one'])
# or
print(df.one)
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64
Column Addition
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print("Adding a new column by passing as Series:")
df['three'] = pd.Series([10,20,30], index=['a','b','c'])
print(df)
print("Adding a new column using the existing columns in DataFrame:")
df['four'] = df['one'] + df['three']
print(df)
Adding a new column by passing as Series:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Adding a new column using the existing columns in DataFrame:
   one  two  three  four
a  1.0    1   10.0  11.0
b  2.0    2   20.0  22.0
c  3.0    3   30.0  33.0
d  NaN    4    NaN   NaN
Column Deletion : Example
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three' : pd.Series([10,20,30], index=['a','b','c'])}
df = pd.DataFrame(d)
print("Our dataframe is:")
print(df)
print("Deleting the first column using DEL function:")
del df['one']   # using del
print(df)
print("Deleting another column using POP function:")
df.pop('two')   # using pop
print(df)
Our dataframe is:
   one  three  two
a  1.0   10.0    1
b  2.0   20.0    2
c  3.0   30.0    3
d  NaN    NaN    4
Deleting the first column using DEL function:
   three  two
a   10.0    1
b   20.0    2
c   30.0    3
d    NaN    4
Deleting another column using POP function:
   three
a   10.0
b   20.0
c   30.0
d    NaN
Row Selection, Addition, and Deletion
Selection by Label - Rows can be selected by passing a row label to the loc function.
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.loc['b'])
one 2.0
two 2.0
Name: b, dtype: float64
The result is a series with labels as column names of the DataFrame. And, the
Name of the series is the label with which it is retrieved.
Selection by Integer Location - Rows can be selected by passing an integer location to the iloc function.
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.iloc[2])
one    3.0
two    3.0
Name: c, dtype: float64
Slice Rows – Multiple rows can be selected using the ':' operator.
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df[2:4])
   one  two
c  3.0    3
d  NaN    4
Addition of Rows - The append() function helps to append rows at the end. (Note: DataFrame.append() was deprecated and then removed in pandas 2.0; pd.concat() is the current way to do this - see the sketch after this example.)
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a','b'])
df = df.append(df2)
print(df)
   a  b
0  1  2
1  3  4
0  5  6
1  7  8
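A sketch of the same row addition with pd.concat(), the documented replacement for append():
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])
# concat stacks the frames; original row labels are kept,
# so labels 0 and 1 appear twice, just as with append()
df = pd.concat([df, df2])
print(df)
#    a  b
# 0  1  2
# 1  3  4
# 0  5  6
# 1  7  8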
Deletion of Rows - Use an index label to delete or drop rows from a DataFrame. If the label is duplicated, then multiple rows will be dropped. If you observe, in the above example, the labels are duplicated. Let us drop a label and see how many rows get dropped.
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a','b'])
df = df.append(df2)
df = df.drop(0)
print(df)
   a  b
1  3  4
1  7  8
Selecting / Accessing Multiple Columns –
<DataFrame object>[[<column name>, <column name>, <column name>, ...]]
import pandas as pd
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
print(df[['Name','City']])
     Name    City
0    Alex  Jaipur
1     Bob    Kota
2  Clarke   Ajmer
NOTE : Columns appear in the order of column names given in the list inside the square brackets.
Selecting / Accessing a Subset from a Dataframe using Row / Column Names –
<DataFrameObject>.loc[<startrow> : <endrow>, <startcolumn> : <endcolumn>]
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
print(df.loc[0:1, :])
   Name  Age    City
0  Alex   10  Jaipur
1   Bob   12    Kota
To access a single row: <DataFrameObject>.loc[<row>]
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
print(df.loc[0])
Name      Alex
Age         10
City    Jaipur
Name: 0, dtype: object
To access multiple rows: <DataFrameObject>.loc[<startrow> : <endrow>, :]
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
print(df.loc[0:1])
   Name  Age    City
0  Alex   10  Jaipur
1   Bob   12    Kota
To access selective columns, use :
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
print(df.loc[:, 'Name':'Age'])
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13
To access a range of columns from a range of rows, use:
<DF object>.loc[<startrow> : <endrow>, <startcolumn> : <endcolumn>]
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
print(df.loc[0:1, 'Name':'Age'])
   Name  Age
0  Alex   10
1   Bob   12
Obtaining a Subset/Slice from a Dataframe using Row/Column Numeric Index/Position:
You can extract a subset from a dataframe using the row and column numeric index/position, but this time you will use iloc instead of loc. Unlike loc, the end positions in iloc are excluded.
NOTE : iloc means integer location.
<DFobject>.iloc[<start row index> : <end row index>, <start col index> : <end column index>]
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data)
print(df.iloc[0:2, 0:2])
      0   1
0  Alex  10
1   Bob  12
Selecting/Accessing Individual Values
(i) Either give the name of the row or its numeric index in square brackets with the column, i.e., as:
<DF object>.<column>[<row name or row numeric index>]
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
print(df.Age[0:2])
0    10
1    12
Name: Age, dtype: int64
(ii) You can use the at or iat attribute with the DF object as shown below:
<DF object>.at[<row name>, <column name>]
Or
<DF object>.iat[<numeric row index>, <numeric column index>]
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
print(df.at[0, 'Age'])
10
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data)
print(df.iat[0, 0])
Alex
Assigning/Modifying Data Values in Dataframes:
(a) To change or add a column, use the syntax:
<DF object>.<column name>[<row label>] = <new value>
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
df.Age[0] = 11
print(df)
     Name  Age    City
0    Alex   11  Jaipur
1     Bob   12    Kota
2  Clarke   13   Ajmer
(b) Similarly, to change or add a row, use the syntax:
<DF object>.at[<row label>, <column label>] = <new value>
Or
<DF object>.loc[<row label>, <column label>] = <new value>
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
df.at[3, 'Name'] = 'NONAME'
print(df)
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
df.loc[3, 'Name'] = 'NONAME'
print(df)
Both print the same result; a new row with label 3 is added, and the unassigned columns get NaN (which also changes Age to a float column):
     Name   Age    City
0    Alex  10.0  Jaipur
1     Bob  12.0    Kota
2  Clarke  13.0   Ajmer
3  NONAME   NaN     NaN
(c) To change or modify a single data value, use the syntax:
<DF>.<column name>[<row name/label>] = <value>
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
df.Age[0] = 11
print(df)
     Name  Age    City
0    Alex   11  Jaipur
1     Bob   12    Kota
2  Clarke   13   Ajmer
Adding Columns in DataFrames
<DF object>[<column name>] = <values for column>
Or
<DF object>.loc[:, <column name>] = <values for column>
Or
<DF object> = <DF object>.assign(<column name> = <values for column>)
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
df['State'] = ['Raj','Raj','Raj']
print(df)
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
df.loc[:, 'State'] = ['Raj','Raj','Raj']
print(df)
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
df = df.assign(State=['Raj','Raj','Raj'])
print(df)
All three print:
     Name  Age    City State
0    Alex   10  Jaipur   Raj
1     Bob   12    Kota   Raj
2  Clarke   13   Ajmer   Raj
NOTE : When you assign something to a column of a dataframe, then for an existing column it will change the data values, and for a non-existing column it will add a new column.
Deleting Columns in DataFrames
del <DF object>[<column name>]
<DF object>.drop(<column name or sequence of column names>, axis=1)
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three' : pd.Series([10,20,30], index=['a','b','c'])}
df = pd.DataFrame(d)
print("Our dataframe is:")
print(df)
print("Deleting the first column using DEL function:")
del df['one']   # using del
print(df)
print("Deleting another column using Drop function:")
df = df.drop('two', axis=1)   # using drop
print(df)
Our dataframe is:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Deleting the first column using DEL function:
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN
Deleting another column using Drop function:
   three
a   10.0
b   20.0
c   30.0
d    NaN
To delete rows from a dataframe, you can use <DF>.drop(<index or sequence of indexes>); by default the axis value is 0.
Attribute/Method   Description
T                  Transposes rows and columns.
axes               Returns a list with the row axis labels and column axis labels as the only members.
dtypes             Returns the dtypes in this object.
empty              True if the NDFrame is entirely empty (no items) or if any of the axes are of length 0.
ndim               Number of axes / array dimensions.
shape              Returns a tuple representing the dimensionality of the DataFrame.
size               Number of elements in the NDFrame.
values             NumPy representation of the NDFrame.
head()             Returns the first n rows.
tail()             Returns the last n rows.
Examples
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
     'Age':pd.Series([25,26,25,23,30,29,23]),
     'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
#Create a DataFrame
df = pd.DataFrame(d)
print("Our data series is:")
print(df)
Our data series is:
   Age   Name  Rating
0   25    Tom    4.23
1   26  James    3.24
2   25  Ricky    3.98
3   23    Vin    2.56
4   30  Steve    3.20
5   29  Smith    4.60
6   23   Jack    3.80
T (Transpose) - Returns the transpose of the DataFrame. The rows and columns will interchange.
print("The transpose of the data series is:")
print(df.T)
The transpose of the data series is:
           0      1      2     3      4      5     6
Age       25     26     25    23     30     29    23
Name     Tom  James  Ricky   Vin  Steve  Smith  Jack
Rating  4.23   3.24   3.98  2.56    3.2    4.6   3.8
axes - Returns the list of row axis labels and column axis labels.
print("Row axis labels and column axis labels are:")
print(df.axes)
Row axis labels and column axis labels are:
[RangeIndex(start=0, stop=7, step=1), Index(['Age', 'Name', 'Rating'], dtype='object')]
dtypes - Returns the data type of each column.
print("The data types of each column are:")
print(df.dtypes)
The data types of each column are:
Age         int64
Name       object
Rating    float64
dtype: object
empty - Returns a Boolean value saying whether the object is empty or not; True indicates that the object is empty.
print("Is the object empty?")
print(df.empty)
Is the object empty?
False
ndim - Returns the number of dimensions of the object. By definition, a DataFrame is a 2D object.
print("The dimension of the object is:")
print(df.ndim)
The dimension of the object is:
2
shape - Returns a tuple representing the dimensionality of the DataFrame: tuple (a, b), where a represents the number of rows and b represents the number of columns.
print("The shape of the object is:")
print(df.shape)
The shape of the object is:
(7, 3)
size - Returns the number of elements in the DataFrame.
print("The total number of elements in our object is:")
print(df.size)
The total number of elements in our object is: 21
values - Returns the actual data in the DataFrame as an ndarray.
print("The actual data in our data frame is:")
print(df.values)
The actual data in our data frame is:
[[25 'Tom' 4.23]
[26 'James' 3.24]
[25 'Ricky' 3.98]
[23 'Vin' 2.56]
[30 'Steve' 3.2]
[29 'Smith' 4.6]
[23 'Jack' 3.8]]
Head & Tail - To view a small sample of a DataFrame object, use the head() and tail() methods. head() returns the first n rows (observe the index values). The default number of elements to display is five, but you may pass a custom number.
print("The first two rows of the data frame is:")
print(df.head(2))
The first two rows of the data frame is:
   Age   Name  Rating
0   25    Tom    4.23
1   26  James    3.24
tail() returns the last n rows (observe the index values). The default number of elements to display is five, but you may pass a custom number.
print("The last two rows of the data frame is:")
print(df.tail(2))
The last two rows of the data frame is:
   Age   Name  Rating
5   29  Smith     4.6
6   23   Jack     3.8
Iterating over a DataFrame
<DFobject>.iterrows() - The iterrows() method iterates over the dataframe row-wise, where each horizontal subset is in the form of (row-index, Series), and the Series contains all column values for that row index.
<DFobject>.iteritems() - The iteritems() method iterates over the dataframe column-wise, where each vertical subset is in the form of (col-index, Series), and the Series contains all row values for that column index. (Note: iteritems() was renamed to items(), and the old name was removed in pandas 2.0.)
NOTE : <DF>.iteritems() iterates over vertical subsets in the form of (col-index, Series) pairs and <DF>.iterrows() iterates over horizontal subsets in the form of (row-index, Series) pairs.
for (row, rowSeries) in df1.iterrows():
Each row is taken one at a time in the form of (row, rowSeries), where row stores the row index and rowSeries stores all the values of that row in the form of a Series object.
for (col, colSeries) in df1.iteritems():
Each column is taken one at a time in the form of (col, colSeries), where col stores the column index and colSeries stores all the values of that column in the form of a Series object.
To iterate over the rows and columns of the DataFrame, we can use the following functions −
• iteritems() − iterate over the (key, value) pairs
• iterrows() − iterate over the rows as (index, series) pairs
iteritems() - Iterates over each column as a key-value pair, with the label as the key and the column values as a Series object.
df = pd.DataFrame(np.arange(1,13).reshape(4,3), columns=['col1','col2','col3'])
for key, value in df.iteritems():
    print(key, value)
col1 0     1
1     4
2     7
3    10
Name: col1, dtype: int32
col2 0     2
1     5
2     8
3    11
Name: col2, dtype: int32
col3 0     3
1     6
2     9
3    12
Name: col3, dtype: int32
Observe, each column is iterated separately as a key-value pair, with the column values in a Series.
iterrows() - iterrows() returns an iterator yielding each index value along with a Series containing the data in each row.
df = pd.DataFrame(np.arange(1,13).reshape(4,3), columns=['col1','col2','col3'])
for key, value in df.iterrows():
    print(key, value)
0 col1    1
col2    2
col3    3
Name: 0, dtype: int32
1 col1    4
col2    5
col3    6
Name: 1, dtype: int32
2 col1    7
col2    8
col3    9
Name: 2, dtype: int32
3 col1    10
col2    11
col3    12
Name: 3, dtype: int32
Note − Because iterrows() iterates over the rows, it doesn't preserve the data type across the row. 0, 1, 2, 3 are the row indices and col1, col2, col3 are column indices.
Note − Do not try to modify any object while iterating. Iterating is meant for reading, and the iterator returns a copy of the original object, thus the changes will not reflect on the original object.
df = pd.DataFrame(np.arange(1,13).reshape(4,3), columns=['col1','col2','col3'])
for index, row in df.iterrows():
    row['a'] = 10
print(df)
   col1  col2  col3
0     1     2     3
1     4     5     6
2     7     8     9
3    10    11    12
Binary Operations on DataFrames : Binary operations are operations requiring two values, and these values are picked element-wise. In a binary operation, the data from the two dataframes are aligned on the basis of their row and column indexes; for matching row/column indexes the given operation is performed, and for non-matching row/column indexes the NaN value is stored in the result. So, just like with Series objects, the data of two dataframes is aligned on the basis of matching row and column indexes, arithmetic is then performed, and for non-overlapping indexes the arithmetic operations result in NaN.
You can perform the add binary operation on two dataframe objects using either the + operator or add() as per the syntax <DF1>.add(<DF2>), which means <DF1> + <DF2>, or by using radd(), i.e., reverse add, as per the syntax <DF1>.radd(<DF2>), which means <DF2> + <DF1>.
You can perform the subtract binary operation on two dataframe objects using either the - (minus) operator or sub() as per the syntax <DF1>.sub(<DF2>), which means <DF1> - <DF2>, or by using rsub(), i.e., reverse subtract, as per the syntax <DF1>.rsub(<DF2>), which means <DF2> - <DF1>.
You can perform the multiply binary operation on two dataframe objects using either the * operator or mul() as per the syntax <DF1>.mul(<DF2>).
You can perform the division binary operation on two dataframe objects using either the / operator or div() as per the syntax <DF1>.div(<DF2>).
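A minimal sketch of these methods (df1 and df2 here are illustrative):
import pandas as pd
df1 = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
df2 = pd.DataFrame({'x': [10, 20], 'y': [30, 40]})
print(df1.add(df2))    # same as df1 + df2
#     x   y
# 0  11  33
# 1  22  44
print(df2.rsub(df1))   # reverse subtract: df1 - df2
#     x   y
# 0  -9 -27
# 1 -18 -36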
NOTE : Python integer types cannot store NaN values. To store a NaN value in a column, the datatype of the column is changed to a suitable non-integer type.
NOTE : If you are performing subtraction on two dataframes, make sure the data types of the values are subtraction-compatible (e.g., you cannot subtract two strings), otherwise Python will raise an error.
Some Other Essential Functions:
1. Inspection functions info() and describe() : To inspect broadly, or to get basic information about your dataframe object, you can use the info() and describe() functions.
<DF>.info() - info() gives the following information for a dataframe object:
• Its type. Obviously, it is an instance of a DataFrame.
• Index values. As each row of a dataframe object has an index, this information shows the assigned indexes.
• Number of rows in the dataframe object.
• Data columns and values in them. It lists the number of columns and the count of only non-NA values in them.
• Datatypes of each column. The listed datatypes are not necessarily in the corresponding order to the listed columns. You can, however, use the dtypes attribute to get the datatype of each column.
• Memory usage. Approximate amount of RAM used to hold the DataFrame.
<DF>.describe() - describe() gives the following information for a dataframe object having numeric columns:
• Count. Count of non-NA values in a column.
• Mean. Computed mean of values in a column.
• Std. Standard deviation of values in a column.
• Min. Minimum value in a column.
• 25%, 50%, 75%. Percentiles of values in that column.
• Max. Maximum value in a column.
The information returned by describe() for string columns includes:
• Count - the number of non-NA entries in the column.
• Unique - the number of unique entries in the column.
• Top - the most common entry in the column, i.e., the one with the highest frequency. If, however, multiple values have the same highest count, then the count and most common (i.e., top) pair will be arbitrarily chosen from among those with the highest count.
• Freq - the frequency of the most common element displayed as top above.
The default behavior of describe( ) is to only provide a summary for the numerical columns. You
can give include= 'all' as argument to describe( ) to list summary for all columns.
d = {'Name':pd.Series(['Tom','James','Ricky','Vin']),
     'Age':pd.Series([25,26,25,23]),
     'Rating':pd.Series([2.98,4.80,4.10,3.65])}
df = pd.DataFrame(d)
print(df.describe())
             Age    Rating
count   4.000000  4.000000
mean   24.750000  3.882500
std     1.258306  0.765436
min    23.000000  2.980000
25%    24.500000  3.482500
50%    25.000000  3.875000
75%    25.250000  4.275000
max    26.000000  4.800000
print(df.describe(include='all'))
        Name        Age    Rating
count      4   4.000000  4.000000
unique     4        NaN       NaN
top      Tom        NaN       NaN
freq       1        NaN       NaN
mean     NaN  24.750000  3.882500
std      NaN   1.258306  0.765436
min      NaN  23.000000  2.980000
25%      NaN  24.500000  3.482500
50%      NaN  25.000000  3.875000
75%      NaN  25.250000  4.275000
max      NaN  26.000000  4.800000
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
Name      4 non-null object
Age       4 non-null int64
Rating    4 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 120.0+ bytes
None
The describe() function gives the mean, std and IQR values, and by default excludes the character columns, giving a summary about the numeric columns. 'include' is the argument used to pass the necessary information regarding which columns need to be considered for summarizing. It takes a list of values; by default, 'number'.
• object − summarizes string columns
• number − summarizes numeric columns
• all − summarizes all columns together (should not be passed as a list value)
2. Retrieve head and tail rows using head() and tail() - You can use head() and tail() to retrieve the N top or N bottom rows respectively of a dataframe object. These functions are used as: <DF>.head([n=5]) or <DF>.tail([n=5]). You can give any value of N as per your need (as many rows as you want to list).
3. Cumulative Calculation Functions : You can use these functions for cumulative calculations on dataframe objects. cumsum() calculates the cumulative sum, i.e., in the output of this function, the value of each row is replaced by the sum of all prior rows including this row. String-value rows use concatenation. It is used as: <DF>.cumsum([axis=None])
df = pd.DataFrame([[25,26],[25,23]])
print(df.cumsum())
    0   1
0  25  26
1  50  49
In the same manner you can use cumprod() to get the cumulative product, cummax() to get the cumulative maximum and cummin() to get the cumulative minimum value from a dataframe object.
4. Index of Maximum and Minimum Values - You can get the index of the maximum and minimum values in columns using the idxmax() and idxmin() functions: <DF>.idxmax() or <DF>.idxmin()
df = pd.DataFrame([[52,26,54],[25,72,78],[25,2,82]])
print(df.idxmax())
print(df.idxmin())
0    0
1    1
2    2
dtype: int64
0    1
1    2
2    0
dtype: int64
Python Pandas – Sorting - There are two kinds of sorting available in Pandas:
1. By label
2. By actual value
By Label - Using the sort_index() method, by passing the axis argument and the order of sorting, a DataFrame can be sorted. By default, sorting is done on row labels in ascending order.
df = pd.DataFrame([[34,23],[76,43],[76,34],[78,99]], index=[1,4,6,2], columns=['col2','col1'])
print(df)
sortdf = df.sort_index()
print(sortdf)
   col2  col1
1    34    23
4    76    43
6    76    34
2    78    99
   col2  col1
1    34    23
2    78    99
4    76    43
6    76    34
Order of Sorting - By passing a Boolean value to the ascending parameter, the order of the sorting can be controlled. Let us consider the following example.
sortdf = df.sort_index(ascending=False)
print(sortdf)
   col2  col1
6    76    34
4    76    43
2    78    99
1    34    23
Sort the Columns - By passing the axis argument with a value 0 or 1, the sorting can be done on the column labels. By default, axis=0: sort by rows. Let us consider the following example.
sortdf = df.sort_index(axis=1)
print(sortdf)
   col1  col2
1    23    34
4    43    76
6    34    76
2    99    78
By Value - Like index sorting, sort_values() is the method for sorting by values. It accepts a 'by' argument which takes the column name(s) of the DataFrame with which the values are to be sorted.
df = pd.DataFrame({'name':['aman','raman','lucky','pawan'],
                   'city':['ajmer','ludhiana','jaipur','jalandhar'],
                   'state':['raj','pb','raj','pb']})
print(df)
sortdf = df.sort_values(by='state')
print(sortdf)
sortdf = df.sort_values(by=['state','city'])
print(sortdf)
    name       city state
0   aman      ajmer   raj
1  raman   ludhiana    pb
2  lucky     jaipur   raj
3  pawan  jalandhar    pb
    name       city state
1  raman   ludhiana    pb
3  pawan  jalandhar    pb
0   aman      ajmer   raj
2  lucky     jaipur   raj
    name       city state
3  pawan  jalandhar    pb
1  raman   ludhiana    pb
0   aman      ajmer   raj
2  lucky     jaipur   raj
Matching and Broadcasting Operations - You have read earlier that when you perform arithmetic operations on two Series type objects, the data is aligned on the basis of matching indexes and then the arithmetic is performed; for non-overlapping indexes, the arithmetic operations result in NaN (Not a Number). This is called data alignment in pandas objects. While performing arithmetic operations on dataframes, the same thing happens: whenever you add two dataframes, the data is aligned on the basis of matching indexes, the arithmetic is performed, and for non-overlapping indexes the result is NaN. This default behaviour of data alignment on the basis of matching indexes is called MATCHING.
While performing arithmetic operations, enlarging the smaller object operand by replicating its elements so as to match the shape of the larger object operand is called BROADCASTING.
<DF>.add(<DF>, axis='rows')
<DF>.div(<DF>, axis='rows')
<DF>.rdiv(<DF>, axis='rows')
<DF>.mul(<DF>, axis='rows')
<DF>.rsub(<DF>, axis='rows')
You can specify the matching axis for these operations (the default matching is on columns, i.e., when you do not give the axis argument).
Broadcasting using a scalar value
s = pd.Series(np.arange(5))
print(s * 10)
0     0
1    10
2    20
3    30
4    40
dtype: int32
df = pd.DataFrame({'a':[10,20], 'b':[5,15]})
print(df*10)
     a    b
0  100   50
1  200  150
So what is technically happening here is that the scalar value has been broadcast along the same dimensions of the Series and DataFrame above.
Broadcasting using a 1-D array - Say we have a 2-D dataframe; we can perform an operation along the x-axis by using a 1-D Series that is the same length as the row length:
df = pd.DataFrame({'a' : [10, 20], 'b' : [5, 15]})
print(df)
print(df.iloc[0])
print(df + df.iloc[0])
output
    a   b
0  10   5
1  20  15
a    10
b     5
Name: 0, dtype: int64
    a   b
0  20  10
1  30  20
The general rule is this: in order to broadcast, the size of the trailing axes for both arrays in an operation must either be the same, or one of them must be one.
So if I tried to add a 1-D array that didn't match in length, then unlike numpy, which would raise a ValueError, in Pandas you'll get a dataframe full of NaN values:
dt = pd.DataFrame([1])
print(df + dt)
Output:
     a   b   0
0  NaN NaN NaN
1  NaN NaN NaN
Now, one of the great things about pandas is that it will try to align using existing column names and row labels; this can get in the way of trying to perform a fancier broadcast like this:
print(df[['a']] + df.iloc[0])
    a   b
0  20 NaN
1  30 NaN
In the above we can see the problem when trying to broadcast using the first row: the alignment only aligns on the first column. To get the same form of broadcasting to occur as described above, we have to decompose to numpy arrays, which then become anonymous data:
print(df[['a']].values + df.iloc[0].values)
[[20 15]
 [30 25]]
Generally speaking, the thing to remember is that, aside from scalar values (which are simple), for n-D arrays the minor/trailing axes' lengths must match or one of them must be 1.
Handling Missing Data : Missing values are values that cannot contribute to any computation, or we can say that missing values are values that carry no computational significance. The Pandas library is designed to deal with huge amounts of data. In such volumes of data, there may be some values that are NA values, such as NULL, NaN or None. These values cannot participate in computation constructively. You can handle missing data in many ways; the most common ones are:
(i) Dropping missing data
(ii) Filling missing data (imputation)
You can use the isnull() and notnull() functions to detect missing values in a pandas object; they return True or False for each value in a pandas object according to whether it is a missing value or not. They can be used as:
<PandasObject>.isnull()
<PandasObject>.notnull()
<PandasObject> means it is applicable to both Series as well as DataFrame objects.
df = pd.DataFrame(np.arange(0,12).reshape(4,3), index=['a', 'c', 'e', 'f'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f'])
print(df['one'].isnull())
print("-------------------")
print(df['one'].notnull())
a    False
b     True
c    False
d     True
e    False
f    False
Name: one, dtype: bool
-------------------
a     True
b    False
c     True
d    False
e     True
f     True
Name: one, dtype: bool
Calculations with Missing Data
• When summing data, NA will be treated as zero.
• If the data are all NA, then the result will be zero.
df = pd.DataFrame(np.arange(0,12).reshape(4,3), index=['a', 'c', 'e', 'f'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df['one'].sum())
18.0
df = pd.DataFrame(index=[0,1,2,3,4,5], columns=['one','two'])
print(df['one'].sum())
0
Handling Missing Data - Dropping Missing Values : To drop missing values you can use dropna() in the following three ways:
(a) <PandaObject>.dropna(). This will drop all the rows that have NaN values in them, even a row with a single NaN value in it.
(b) <PandaObject>.dropna(how='all'). With the argument how='all', it will drop only those rows that have all NaN values, i.e., no value is non-null in those rows (see the sketch after the examples below).
(c) <PandaObject>.dropna(axis=1). With the argument axis=1, it will drop columns that have any NaN values in them. Using the argument how='all' along with axis=1 will drop columns with all NaN values.
df = pd.DataFrame(np.arange(0,12).reshape(4,3), index=['a', 'c', 'e', 'f'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna())
   one   two  three
a  0.0   1.0    2.0
c  3.0   4.0    5.0
e  6.0   7.0    8.0
f  9.0  10.0   11.0
df = pd.DataFrame(np.arange(0,12).reshape(4,3), index=['a', 'c', 'e', 'f'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna(axis=1))
Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]
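A short sketch of the how='all' form from point (b) above, using the same dataframe (only the all-NaN rows b, d, g and h are dropped):
df = pd.DataFrame(np.arange(0,12).reshape(4,3), index=['a', 'c', 'e', 'f'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna(how='all'))
#    one   two  three
# a  0.0   1.0    2.0
# c  3.0   4.0    5.0
# e  6.0   7.0    8.0
# f  9.0  10.0   11.0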
Handling Missing Data - Filling Missing Values : Though dropna() removes the null values, you also lose other non-null data with them. To avoid this, you may want to fill the missing data with some appropriate value of your choice. For this purpose you can use fillna() in the following ways:
(a) <PandaObject>.fillna(<n>). This will fill all NaN values with the given <n> value.
(b) Using a dictionary with fillna() to specify fill values for each column separately. You can create a dictionary that defines fill values for each of the columns, and then pass this dictionary as an argument to fillna(); Pandas will fill the specified value for each column defined in the dictionary. It will leave untouched those columns that are not in the dictionary. The syntax of fillna() is: <DF>.fillna(<dictionary having fill values for columns>)
df = pd.DataFrame(np.arange(0,12).reshape(4,3), index=['a', 'c', 'e', 'f'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f'])
print(df.fillna(-1))
print("--------------------")
print(df.fillna({'one':-1, 'two':-2, 'three':-3}))
   one   two  three
a  0.0   1.0    2.0
b -1.0  -1.0   -1.0
c  3.0   4.0    5.0
d -1.0  -1.0   -1.0
e  6.0   7.0    8.0
f  9.0  10.0   11.0
--------------------
   one   two  three
a  0.0   1.0    2.0
b -1.0  -2.0   -3.0
c  3.0   4.0    5.0
d -1.0  -2.0   -3.0
e  6.0   7.0    8.0
f  9.0  10.0   11.0
Missing data / operations with fill values
In Series and DataFrame, the arithmetic functions have the option of inputting a fill_value, namely a value to substitute when at most one of the values at a location is missing. For example, when adding two DataFrame objects, you may wish to treat NaN as 0 unless both DataFrames are missing that value, in which case the result will be NaN.
df1 = pd.DataFrame([[10,20,np.NaN], [np.NaN,30,40]], index=['A','B'], columns=['One', 'Two', 'Three'])
df2 = pd.DataFrame([[np.NaN,20,30], [50,40,30]], index=['A','B'], columns=['One', 'Two', 'Three'])
print(df1)
print(df2)
print(df1 + df2)
print(df1.add(df2, fill_value=0))
    One  Two  Three
A  10.0   20    NaN
B   NaN   30   40.0
    One  Two  Three
A   NaN   20     30
B  50.0   40     30
   One  Two  Three
A  NaN   40    NaN
B  NaN   70   70.0
    One  Two  Three
A  10.0   40   30.0
B  50.0   70   70.0
Flexible Comparisons
Series and DataFrame have the binary comparison methods eq, ne, lt, gt, le, and ge whose behavior is
analogous to the binary arithmetic operations described above:
print(df1.gt(df2))
     One    Two  Three
A  False  False  False
B  False  False   True
print(df2.ne(df1))
    One    Two  Three
A  True  False   True
B  True   True   True
NOTE : eq = equal to, ne = not equal to, lt = less than, le = less than or equal to,
gt = greater than, ge = greater than or equal to.
Comparisons of Pandas Objects – Normal comparison operators ( == ) do not produce an
accurate result when comparing two similar objects that contain NaN values, because NaN never
compares equal to NaN. To compare objects having NaN values, it is better to use equals( ),
which returns True even when two NaN values are compared for equality :
<expression 1 yielding a Pandas object>.equals( <expression 2 yielding a Pandas object> )
NOTE : The equals( ) method tests two objects for equality, with NaNs in corresponding locations
treated as equal.
NOTE : The Series or DataFrame indexes of the two objects need to be the same, and in the same
order, for equality to be True. Also, the two Pandas objects being compared should be of the same
length. Trying to compare two dataframes with different numbers of rows, or two series objects
with different lengths, will result in a ValueError.
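A minimal sketch of the difference (illustrative values, not from the text) :
import pandas as pd
import numpy as np

s1 = pd.Series([1, np.nan, 3])
s2 = pd.Series([1, np.nan, 3])
print((s1 == s2).all())   # False : NaN == NaN is always False
print(s1.equals(s2))      # True  : NaNs in matching positions treated as equal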
Boolean Reductions - Pandas offers Boolean reductions that summarize a Boolean result for
an axis. That is, with a Boolean reduction, you can get the overall result of a row or column as a
single True or False. Boolean reductions are a way to summarize all comparison results of a
dataframe's individual elements in the form of a single overall Boolean result per column or per
row. For this purpose Pandas offers the following Boolean reduction functions and attributes.
• empty. This attribute indicates whether a DataFrame is empty. It is used as : <DF>.empty
• any( ). This function returns True if any element is True over the requested axis. By default it
checks columns (the default axis is 0); if any of the values along the specified axis is True, it
returns True. It is used as per the following syntax :
<DataFrame comparison result object>.any( axis=None )
• all( ). Unlike any( ), all( ) returns True only if all the values on an axis are True according to
the given comparison. It is used as per the syntax :
<DataFrame comparison result object>.all( axis=None )
You can apply the reductions empty, any( ), all( ) and bool( ) to provide a way to summarize a
Boolean result.
df1 = pd.DataFrame([[10,20,np.NaN], [np.NaN,30,40]], index=['A','B'], columns=['One', 'Two', 'Three'])
df2 = pd.DataFrame([[np.NaN,20,30], [50,40,30]], index=['A','B'], columns=['One', 'Two', 'Three'])
print((df1 > 0).all())
One      False
Two       True
Three    False
dtype: bool
print((df1 > 0).any())
One      True
Two      True
Three    True
dtype: bool
You can reduce to a final boolean value :
print((df1 > 0).any().any())
True
You can test if a pandas object is empty, via the empty property :
print(df1.empty)
False
Combining DataFrames
combine_first( ) - The combine_first( ) method combines two dataframes by patching the
data : if a certain cell of the first dataframe has missing data and the corresponding cell (the
one with the same index and column id) in the other dataframe has some valid data, this method
will pick the valid data from the second dataframe and patch it into the first dataframe so that it
too has valid data in that cell.
The combine_first( ) is used as per the syntax :
<DF>.combine_first(<DF2>)
df1 = pd.DataFrame({'A': 10, 'B': 20, 'C':np.nan, 'D':np.nan}, index=[0])
df2 = pd.DataFrame({'B': 30, 'C': 40, 'D': 50}, index=[0])
df1 = df1.combine_first(df2)
print("\n------------ combine_first ----------------\n")
print(df1)
    A   B     C     D
0  10  20  40.0  50.0
concat( ) - The concat( ) can concatenate two dataframes along the rows or along the columns.
This method is useful if the two dataframes have similar structures. It is used as per the
following syntax :
pd.concat( [<df1>, <df2>] )
pd.concat( [<df1>, <df2>], ignore_index = True )
pd.concat( [<df1>, <df2>], axis = 1 )
If you skip the axis = 1 argument, it will join the two dataframes along the rows, i.e., the result
will be the union of rows from both the dataframes.
If you do not want the original row indexes to be retained and instead want new row indexes
generated from 0 to n - 1, give the argument ignore_index = True.
By default it concatenates along the rows ; to concatenate along the columns, give the
argument axis = 1.
Pandas provides various facilities for easily combining together Series, DataFrame,
and Panel objects.
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False)
• objs − This is a sequence or mapping of Series, DataFrame, or Panel objects.
• axis − {0, 1, ...}, default 0. This is the axis to concatenate along.
• join − {'inner', 'outer'}, default 'outer'. How to handle indexes on the other axis(es). Outer
for union and inner for intersection.
• ignore_index − boolean, default False. If True, do not use the index values on the
concatenation axis. The resulting axis will be labeled 0, ..., n - 1.
• join_axes − This is the list of Index objects. Specific indexes to use for the other (n-1)
axes instead of performing inner/outer set logic.
Concatenating Objects - The concat function does all of the heavy lifting of performing
concatenation operations along an axis. Let us create different objects and do concatenation.
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,87]}, index=[1,2,3])
two = pd.DataFrame({'Name': ['Billy', 'Brian', 'Bran'], 'sub_id':['sub2','sub4','sub3'], 'Score':[89,80,79]}, index=[1,2,3])
print(pd.concat([one,two]))
    Name  Score sub_id
1   Alex     98   sub1
2    Amy     90   sub2
3  Allen     87   sub4
1  Billy     89   sub2
2  Brian     80   sub4
3   Bran     79   sub3
Suppose we wanted to associate specific keys with each of the pieces of the
chopped up DataFrame. We can do this by using the keys argument −
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,87]}, index=[1,2,3])
two = pd.DataFrame({'Name': ['Billy', 'Brian', 'Bran'], 'sub_id':['sub2','sub4','sub3'], 'Score':[89,80,79]}, index=[1,2,3])
print(pd.concat([one,two], keys=['x','y']))
      Name  Score sub_id
x 1   Alex     98   sub1
  2    Amy     90   sub2
  3  Allen     87   sub4
y 1  Billy     89   sub2
  2  Brian     80   sub4
  3   Bran     79   sub3
The index of the resultant is duplicated; each index is repeated.
If the resultant object has to follow its own indexing, set ignore_index to True.
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,87]}, index=[1,2,3])
two = pd.DataFrame({'Name': ['Billy', 'Brian', 'Bran'], 'sub_id':['sub2','sub4','sub3'], 'Score':[89,80,79]}, index=[1,2,3])
print(pd.concat([one,two], keys=['x','y'], ignore_index=True))
    Name  Score sub_id
0   Alex     98   sub1
1    Amy     90   sub2
2  Allen     87   sub4
3  Billy     89   sub2
4  Brian     80   sub4
5   Bran     79   sub3
Observe, the index changes completely and the keys are also overridden.
If two objects need to be concatenated along axis=1, the new columns will be appended.
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,87]}, index=[1,2,3])
two = pd.DataFrame({'Name': ['Billy', 'Brian', 'Bran'], 'sub_id':['sub2','sub4','sub3'], 'Score':[89,80,79]}, index=[1,2,3])
print(pd.concat([one,two], keys=['x','y'], axis=1))
       x                   y
    Name Score sub_id   Name Score sub_id
1   Alex    98   sub1  Billy    89   sub2
2    Amy    90   sub2  Brian    80   sub4
3  Allen    87   sub4   Bran    79   sub3
Concatenating Using append - A useful shortcut to concat is the append instance method
on Series and DataFrame. These methods actually predate concat. They concatenate along
axis=0, namely the index. (Note : append( ) was deprecated in pandas 1.4 and removed in
pandas 2.0 ; in recent versions use pd.concat( ) instead.)
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,87]}, index=[1,2,3])
two = pd.DataFrame({'Name': ['Billy', 'Brian', 'Bran'], 'sub_id':['sub2','sub4','sub3'], 'Score':[89,80,79]}, index=[1,2,3])
print(one.append(two))
    Name  Score sub_id
1   Alex     98   sub1
2    Amy     90   sub2
3  Allen     87   sub4
1  Billy     89   sub2
2  Brian     80   sub4
3   Bran     79   sub3
The append function can take multiple objects as well −
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,87]}, index=[1,2,3])
two = pd.DataFrame({'Name': ['Billy', 'Brian', 'Bran'], 'sub_id':['sub2','sub4','sub3'], 'Score':[89,80,79]}, index=[1,2,3])
print(one.append([two,one]))
    Name  Score sub_id
1   Alex     98   sub1
2    Amy     90   sub2
3  Allen     87   sub4
1  Billy     89   sub2
2  Brian     80   sub4
3   Bran     79   sub3
1   Alex     98   sub1
2    Amy     90   sub2
3  Allen     87   sub4
merge( ) - The merge( ) function lets you specify the field on the basis of which you want to
combine the two dataframes. It is used as per the syntax :
pd.merge( <DF1>, <DF2> )
or
pd.merge( <DF1>, <DF2>, on = <field_name> )
If you skip the argument on = <field_name>, the merge will take place on the common fields of
the two dataframes, but you can explicitly specify the field on the basis of which you want to
merge the two dataframes.
Pandas has full-featured, high performance in-memory join operations idiomatically very similar
to relational databases like SQL.
Pandas provides a single function, merge, as the entry point for all standard database join
operations between DataFrame objects −
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False,
right_index=False, sort=True)
Here, we have used the following parameters −
• left − A DataFrame object.
• right − Another DataFrame object.
• on − Columns (names) to join on. Must be found in both the left and right DataFrame
objects.
• left_on − Columns from the left DataFrame to use as keys. Can either be column names or
arrays with length equal to the length of the DataFrame.
• right_on − Columns from the right DataFrame to use as keys. Can either be column names
or arrays with length equal to the length of the DataFrame.
• left_index − If True, use the index (row labels) from the left DataFrame as its join key(s).
In case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match
the number of join keys from the right DataFrame.
• right_index − Same usage as left_index for the right DataFrame.
• how − One of 'left', 'right', 'outer', 'inner'. Defaults to inner. Each method has been
described below.
• sort − Sort the result DataFrame by the join keys in lexicographical order. Defaults to True,
setting to False will improve the performance substantially in many cases.
Let us now create two different DataFrames and perform the merging operations on it.
left = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,87]}, index=[1,2,3])
right = pd.DataFrame({'Name': ['Billy', 'Brian', 'Bran'], 'sub_id':['sub2','sub4','sub3'], 'Score':[89,80,79]}, index=[1,2,3])
print(left)
print(right)
    Name  Score sub_id
1   Alex     98   sub1
2    Amy     90   sub2
3  Allen     87   sub4
    Name  Score sub_id
1  Billy     89   sub2
2  Brian     80   sub4
3   Bran     79   sub3
Merge Two DataFrames on a Key
print(pd.merge(left, right, on='sub_id'))
  Name_x  Score_x sub_id Name_y  Score_y
0    Amy       90   sub2  Billy       89
1  Allen       87   sub4  Brian       80
Merge Two DataFrames on Multiple Keys – Pass a list of column names to on ; each name
must exist in both dataframes, and rows are matched only where all the key values are equal.
For example :
print(pd.merge(left, right, on=['Name', 'sub_id']))
(With the data above this returns an empty dataframe, since no row has the same Name and
sub_id in both dataframes.)
Merge Using 'how' Argument - The how argument to merge specifies how to determine
which keys are to be included in the resulting table. If a key combination does not appear in
either the left or the right tables, the values in the joined table will be NA. Here is a summary of
the how options and their SQL equivalent names −
Merge Method   SQL Equivalent     Description
left           LEFT OUTER JOIN    Use keys from left object
right          RIGHT OUTER JOIN   Use keys from right object
outer          FULL OUTER JOIN    Use union of keys
inner          INNER JOIN         Use intersection of keys
Left Join
print(pd.merge(left, right, on='sub_id', how='left'))
  Name_x  Score_x sub_id Name_y  Score_y
0   Alex       98   sub1    NaN      NaN
1    Amy       90   sub2  Billy     89.0
2  Allen       87   sub4  Brian     80.0
Right Join
print(pd.merge(left, right, on='sub_id', how='right'))
  Name_x  Score_x sub_id Name_y  Score_y
0    Amy     90.0   sub2  Billy       89
1  Allen     87.0   sub4  Brian       80
2    NaN      NaN   sub3   Bran       79
Outer Join
print(pd.merge(left, right, on='sub_id', how='outer'))
  Name_x  Score_x sub_id Name_y  Score_y
0   Alex     98.0   sub1    NaN      NaN
1    Amy     90.0   sub2  Billy     89.0
2  Allen     87.0   sub4  Brian     80.0
3    NaN      NaN   sub3   Bran     79.0
Inner Join
print(pd.merge(left, right, on='sub_id', how='inner'))
  Name_x  Score_x sub_id Name_y  Score_y
0    Amy       90   sub2  Billy       89
1  Allen       87   sub4  Brian       80
Joining is performed on the index. The join operation honors the object on which it is called,
so a.join(b) is not equal to b.join(a), as the sketch below shows.
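A minimal illustration of this asymmetry (hypothetical frames, not from the text) :
import pandas as pd

left = pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])
right = pd.DataFrame({'B': [3, 4]}, index=['y', 'z'])
# join() matches rows on the index; by default it is a left join,
# so the calling object's index drives the result
print(left.join(right))    # index x, y : B is NaN for x
print(right.join(left))    # index y, z : A is NaN for z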
Statistics Functions & Description - Let us now understand the functions under Descriptive
Statistics in Python Pandas. The following table lists the important functions −
Function     Description
count()      Number of non-null observations
sum()        Sum of values
mean()       Mean of values
median()     Median of values
mode()       Mode of values
std()        Standard deviation of the values
min()        Minimum value
max()        Maximum value
abs()        Absolute value
prod()       Product of values
cumsum()     Cumulative sum
cumprod()    Cumulative product
var()        Variance of a given set of numbers
quantile()   Return values at the given quantile over the requested axis
df = pd.DataFrame({'Name':['abc', 'lmn', 'xyz'], 'Science':[67,78,56], 'IP':[97,98,99]}, index=['a', 'b', 'c'],
columns=['Name', 'Science', 'IP'])
print('sum ', df['IP'].sum())
print('count ', df['IP'].count())
print('mean ', df['IP'].mean())
print('median ', df['IP'].median())
print('mode ', df['IP'].mode())
print('std ', df['IP'].std())
print('min ', df['IP'].min())
print('max ', df['IP'].max())
print('abs ', df['IP'].abs())
print('prod ', df['IP'].prod())
print('cumsum ', df['IP'].cumsum())
print('cumprod ', df['IP'].cumprod())
print('var ', df['IP'].var())
print('quantile ', df['IP'].quantile())
sum  294
count  3
mean  98.0
median  98.0
mode  0    97
1    98
2    99
dtype: int64
std  1.0
min  97
max  99
abs  a    97
b    98
c    99
Name: IP, dtype: int64
prod  941094
cumsum  a     97
b    195
c    294
Name: IP, dtype: int64
cumprod  a        97
b      9506
c    941094
Name: IP, dtype: int64
var  1.0
quantile  98.0
min() and max() - The min( ) and max( ) functions find out the minimum or maximum value
respectively from a given set of data. The syntax is :
<dataframe>.min(axis = None, skipna = None, numeric_only = None)
<dataframe>.max(axis = None, skipna = None, numeric_only = None)
axis           {index (0), columns (1)} ; by default, the minimum or maximum is calculated
               along axis 0.
skipna         (True or False) Exclude NA/null values when computing the result.
numeric_only   (True or False) Include only float, int, boolean columns. If None, will attempt
               to use everything, then use only numeric data.
mode( ), mean( ), median( ) - Function mode() returns the mode value (i.e., the value that
appears most often) from a set of values. Syntax :
<dataframe>.mode(axis = 0, numeric_only = False)
Function mean() returns the computed mean (average) from a set of values. Function
median() returns the middle number from a set of numbers. Syntax :
<dataframe>.mean(axis = None, skipna = None, numeric_only = None)
<dataframe>.median(axis = None, skipna = None, numeric_only = None)
The function count() counts the non-NA entries for each row or column. The values None, NaN,
NaT etc., are considered as NA in pandas. Syntax :
<dataframe>.count(axis = 0, numeric_only = False)
The function sum() returns the sum of the values for the requested axis. Syntax :
<dataframe>.sum(axis = None, skipna = None, numeric_only = None, min_count = 0)
min_count - int, default 0 ; the required number of valid values to perform the operation. If
fewer than min_count non-NA values are present, the result will be NA. (With the default of 0,
the sum of an all-NA or empty series is 0 ; setting min_count = 1 makes it NaN instead.)
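A quick sketch of the min_count behaviour (illustrative values) :
import pandas as pd
import numpy as np

s = pd.Series([np.nan, np.nan])
print(s.sum())              # 0.0 : default min_count=0
print(s.sum(min_count=1))   # nan : fewer than 1 non-NA value present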
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,np.NaN]}, index=[1,2,3])
print('----count----')
print(one.count())
print('----skipna-False----')
print(one.sum(skipna=False))
print('----skipna-True----')
print(one.sum(skipna=True))
print('----numeric-only----')
print(one.count(numeric_only=True))
----count----
Name      3
Score     2
sub_id    3
dtype: int64
----skipna-False----
Name      AlexAmyAllen
Score              NaN
sub_id    sub1sub2sub4
dtype: object
----skipna-True----
Name      AlexAmyAllen
Score              188
sub_id    sub1sub2sub4
dtype: object
----numeric-only----
Score    2
dtype: int64
quantile( ) and var( ) - The quantile( ) function returns the values at the given quantiles
over the requested axis (axis 0 or 1). Quantiles are points in a distribution that relate to the rank
order of values in that distribution. The quantile of a value is the fraction of observations less
than or equal to the value. The quantile of the median is 0.5, by definition. The 0.25 quantile
(also known as the 25th percentile ; percentiles are just quantiles multiplied by 100) and the
0.75 quantile are known as quartiles, and the difference between them is the interquartile
range. Syntax :
<dataframe>.quantile(q = 0.5, axis = 0, numeric_only = True)
The var( ) function computes variance and returns the unbiased variance over the requested axis.
<dataframe>.var(axis = None, skipna = None, numeric_only = None)
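A small sketch of both functions on a single column (illustrative values) :
import pandas as pd

df = pd.DataFrame({'IP': [97, 98, 99]})
print(df['IP'].quantile())      # 98.0 : default q=0.5, i.e. the median
print(df['IP'].quantile(0.25))  # 97.5 : first quartile
print(df['IP'].var())           # 1.0  : unbiased variance (divides by n-1)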
Applying Functions on a Subset of a Dataframe :
Applying Functions on a Column of a DataFrame :
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,np.NaN]}, index=[1,2,3])
print(one['Score'].sum())
188.0
Applying Functions on Multiple Columns of a DataFrame :
print(one[['Name','Score']].count())
Name     3
Score    2
dtype: int64
Applying Functions on a Row of a DataFrame :
<dataframe>.loc[ <row index>, : ]
sal_df.loc['Qtr2', :].max()
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,np.NaN]}, index=['a','b','c'])
print(one.loc['a', :].count())
3
Applying Functions on a Range of Rows of a DataFrame :
<dataframe>.loc[ <start row> : <end row>, : ]
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,np.NaN]}, index=['a','b','c'])
print(one.loc['a':'b', :].count())
Name      2
Score     2
sub_id    2
dtype: int64
Applying Functions to a Subset of the DataFrame :
<dataframe>.loc[ <start row> : <end row>, <start column> : <end column> ]
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,np.NaN]}, index=['a','b','c'])
print(one.loc['a':'b', 'Name':'Score'].count())
Name     2
Score    2
dtype: int64
Python lambda - We can create anonymous functions, known as lambda functions. Lambda
functions are different from normal Python functions ; they originate from lambda calculus and
allow you to write very short functions. This code shows the use of a lambda function :
f = lambda x : 2 * x
print(f(3))
6
A return statement is never used in a lambda function ; a lambda always returns the value of
its expression. A lambda function may also contain a conditional expression :
f = lambda x: x > 10
print(f(2))
print(f(12))
False
True
Transferring Data between .csv Files and DataFrames
The acronym CSV is short for Comma-Separated Values. The CSV format refers to tabular
data that has been saved as plaintext where data is separated by commas.
The CSV format is popular as it offers the following advantages :
• A simple, compact and ubiquitous format for data storage.
• A common format for data interchange.
• It can be opened in popular spreadsheet packages like MS-Excel, Calc etc.
• Nearly all spreadsheets and databases support import/export to the CSV format.
Loading data from CSV to Dataframe - Python's Pandas library offers two functions,
read_csv() and to_csv(), that help you bring data from a CSV file into a dataframe and write a
dataframe's data to a CSV file.
Reading From a CSV File to Dataframe - You can use the read_csv() function to read data from
a CSV file into your dataframe as per the following syntax :
<DF> = pandas.read_csv( <filepath> )
df = pd.read_csv("c:\\data\\sample.csv")
Reading a CSV File and Specifying Own Column Names - You may have a CSV file that does
not have a top row containing column headers. For such a situation, you can specify your own
column headings in read_csv( ) using the names argument as per the following syntax :
<DF> = pandas.read_csv( <filepath>, names = <sequence> )
df2 = pd.read_csv("c:\\data\\mydata.csv", names=["Rollno", "First_Name", "Last_Name"])
If you want the first row not to be used as a header, and at the same time you do not want to
specify column headings but rather go with the default column headings 0, 1, 2, 3, ..., then
simply give the argument header = None in read_csv( ), i.e., as :
df3 = pd.read_csv("c:\\data\\mydata.csv", header=None)
If you want to skip row 1 while fetching data from the CSV file, you can use :
df5 = pd.read_csv("c:\\data\\mydata.csv", names=["Rollno", "Name", "Marks"], skiprows=1)
Reading a Specified Number of Rows from a CSV file – Giving the argument nrows = <n> in
read_csv( ) will read the specified number of rows from the CSV file.
df6 = pd.read_csv("c:\\data\\mydata.csv", names=["Rollno", "Name", "Surname"], nrows=3)
Reading from CSV files having a Separator Different from Comma – You need to specify an
additional argument sep = <separator character>. If you skip this argument, the default
separator character (comma) is assumed.
dfNew = pd.read_csv("c:\\data\\match.csv", sep=';', names=["Country1", "Stat", "Country2"])
We can summarize read_csv( ) as follows :
<DF> = pandas.read_csv( <filepath>, sep = <separator character>, names = <column names
sequence>, header = None, skiprows = <n>, nrows = <n> )
Storing a Dataframe's Data to a CSV File - Sometimes we have data available in a dataframe
and we want to save that data in a CSV file. For this purpose, Python Pandas provides the
to_csv() function, which saves the data of a dataframe in a CSV file.
<DF>.to_csv( <filepath> )
or
<DF>.to_csv( <filepath>, sep = <separator_character> )
The separator character must be a one-character string only.
df7.to_csv("c:\\data\\new2.csv", sep="|")
Handling NaN Values with to_csv( ) - By default, missing/NaN values are stored as empty
strings in the CSV file. You can specify your own string to be written for missing/NaN values
by giving the argument na_rep = <string>.
df7.to_csv("c:\\data\\new3.csv", sep="|", na_rep="NULL")
NOTE : Note that the function read_csv( ) is a pandas function while to_csv( ) is a dataframe
object's function.
We can summarise to_csv( ) as follows, where the sep and na_rep arguments are optional :
<DF>.to_csv( <filepath>, sep = <separator character>, na_rep = <string> )
Transferring Data Between Dataframes and SQL Databases - An SQL database is a
relational database having data in tables called relations, and it uses a special type of query
language, Structured Query Language, to query and manipulate data and to communicate with
the database.
Brief Introduction to SQLite Database - SQLite is an embedded SQL database engine, which
implements a relational database management system that is self-contained, serverless and
requires zero configuration. In other words, SQLite does not have a separate server process
and it implements SQL databases in a very compact way. The SQLite database is in the public
domain and is freely available for any type of use.
Bringing Data from an SQL Database Table into a Dataframe – In order to read data from an
SQL database such as SQLite, Python comes equipped with the sqlite3 library.
(i) Import the sqlite3 library by giving the Python statement :
import sqlite3 as sq
(ii) Make a connection to the SQL database : conn = sq.connect("C:\\sqlite3\\new.db")
(iii) Read data from a table into a dataframe using read_sql( ) as per the following syntax :
df = pd.read_sql("SELECT * FROM Stud;", conn)
(iv) Print the values : print(df)
NOTE : You can also give the database name as ":memory:" while creating the connection. This
database will reside in RAM (i.e., a temporary database) rather than on the hard disk.
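Putting the steps together as one runnable sketch (it assumes the database file new.db and the
table Stud from the steps above exist) :
import sqlite3 as sq
import pandas as pd

conn = sq.connect("C:\\sqlite3\\new.db")       # assumed database file
df = pd.read_sql("SELECT * FROM Stud;", conn)  # assumed table name
print(df)
conn.close()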
Storing a Dataframe's Data as a Table in an SQL Database –
(i) Import the sqlite3 library by giving the Python statement :
import sqlite3 as sq
(ii) Make a connection to the SQL database : conn = sq.connect("C:\\sqlite3\\new.db")
(iii) Write the dataframe in the form of a table by using to_sql( ) as per the following syntax :
dft4.to_sql("metros", conn)
If you run to_sql( ) with a table name that already exists, you must specify the argument
if_exists = "append" or if_exists = "replace", otherwise Python will give an ERROR. If you set
the value as "append", then the new data will be appended to the existing table ; if you set the
value as "replace", then the new data will replace the old data in the given table. For example,
if we have another dataframe dtf5 with additional rows :
dtf5.to_sql("metros", conn, if_exists="append")
Advanced Operations on DataFrame :
PIVOTING - The pivoting technique rearranges the data from rows and columns, possibly
aggregating data from multiple sources, in a report form (with rows transferred to columns) so
that data can be viewed from a different perspective. Pivoting is actually a summary technique
that works on tabular data.
Syntax : <dataframe>.pivot(index=<columnname>, columns=<columnname>, values=<columnname>)
The result of the pivot() function has the index-rows as per the index argument, columns as per
the values of the columns argument, and the values created from the values argument (see
above). Cells in the pivoted table which do not have a matching entry in the original one are set
to NaN.
The pivot() function returns the result in the form of a newly created dataframe, i.e., you may
store the result in a dataframe.
NOTE : With pivot( ), if there are multiple entries for a columns value for the same value of the
index (row), it leads to an error. Hence, before you use pivot( ), you should ensure that the data
does not have rows with duplicate values for the specified columns.
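A minimal sketch of pivot( ) on a small long-format table (illustrative data in the spirit of the sale
report used below, not the book's dataset) :
import pandas as pd

df = pd.DataFrame({'NAME': ['ALTO', 'ALTO', 'ZEN', 'ZEN'],
                   'YEAR': [2016, 2017, 2016, 2017],
                   'SALE': [45, 43, 30, 35]})
# one row per NAME, one column per YEAR; each (NAME, YEAR) pair must be unique
wide = df.pivot(index='NAME', columns='YEAR', values='SALE')
print(wide)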
Using the pivot_table( ) Function - For data having multiple values for the same row and
column combination, you can use another pivoting function - the pivot_table() function.
The pivot_table( ) is also a pivoting function, which, like pivot( ), produces a pivoted table,
BUT it is different from the pivot( ) function in the following ways :
(i) It does not raise errors for multiple entries of a row, column combination.
(ii) It aggregates the multiple entries present for a row-column combination ; you need to
specify what type of aggregation you want (sum, mean etc.).
pandas.pivot_table(<dataframe>, values=None, index=None, columns=None, aggfunc='mean')
or
<dataframe>.pivot_table(values=None, index=None, columns=None, aggfunc='mean')
Where : the index argument contains the column name for rows.
the columns argument contains the column name for columns.
the values argument contains the column names for the data of the pivoted table.
the aggfunc argument contains the function as per which the data is to be aggregated ; if
skipped, it will by default compute the mean of the multiple entries for the same row-column
combination.
Being able to quickly summarize hundreds of rows and columns can save you a lot of time and
frustration. A simple tool you can use to achieve this is a pivot table, which helps you slice, filter,
and group data at the speed of inquiry and represent the information in a visually appealing way.
Introducing our data set: Maruti Sale Report - Some interesting questions we might like to
answer are :
• Which are the most liked and least liked cars according to regions in India?
• Is sale affected by region?
• Did the sale change significantly over the past five years?
Let's import our data and take a quick first look :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# reading the data
data = pd.read_csv('c:\\python\\pivotsale.csv', index_col=0)
# sort the df by ascending name and descending zone
data.sort_values(["NAME", "ZONE"], ascending=[True, False])
# display first 10 rows
print(data.head(10))
       NAME  YEAR   ZONE  SALE
S.NO.
1      ALTO  2016   EAST    45
2      ALTO  2017   EAST    43
3      ALTO  2018   EAST    76
4      ALTO  2019   EAST    23
5      ALTO  2016   WEST    56
6      ALTO  2017   WEST    34
7      ALTO  2018   WEST    65
8      ALTO  2019   WEST    87
9      ALTO  2016  NORTH    34
10     ALTO  2017  NORTH    67
print(pd.pivot_table(data, index='NAME', values="SALE"))
           SALE
NAME
800     56.3125
ALTO    57.0000
BALENO  60.6875
ESTEEM  57.3750
K10     61.7500
ZEN     54.8125
Creating a multi-index pivot table
print(pd.pivot_table(data, index=['NAME','YEAR'], values="SALE"))
This is one way to look at the data, but we can use the columns parameter to get a better
display :
• columns is the column, grouper, array, or list of the previous you'd like to group your data by.
Using it will spread the different values horizontally.
Using YEAR as the columns argument will display the different values for year and will make
for a much better display, like so :
print(pd.pivot_table(data, index='NAME', columns='YEAR', values="SALE"))
Visualizing the pivot table using plot() - If you want to look at a visual representation of the
previous pivot table we created, all you need to do is add plot() at the end of the pivot_table
function call.
pd.pivot_table(data, index='NAME', columns='YEAR', values="SALE").plot(kind='bar')
plt.ylabel("SALE")
plt.show()
The visual representation helps reveal that the differences are minor.
Manipulating the data using aggfunc
Up until now we've used the average to get insights about the data, but there are other important
values to consider. Time to experiment with the aggfunc parameter :
• aggfunc (optional) accepts a function or list of functions you'd like to use on your group
(default: numpy.mean). If a list of functions is passed, the resulting pivot table will have
hierarchical columns whose top level are the function names.
Let's add the median, minimum, maximum, and the standard deviation for each "NAME". This
can help us evaluate how accurate the average is, and whether it's really representative of the
real picture.
print(pd.pivot_table(data, index='NAME', values="SALE", aggfunc=[np.mean, np.median, min, max, np.std]))
Categorizing using string manipulation - Up until now we've grouped our data according to
the categories in the original table. However, we can search the strings in the categories to
create our own groups.
table = pd.pivot_table(data, index='NAME', columns='YEAR', values="SALE")
print(table[table.index.str.endswith('O')])
Adding total rows/columns
The last two parameters are both optional and mostly useful to improve display :
• margins is type boolean and allows you to add an all row / column, e.g. for subtotals / grand
totals (default False)
• margins_name is type string and accepts the name of the row / column that will contain the
totals when margins is True (default 'All')
Let's use these to add a total to our last table.
print(pd.pivot_table(data, index=['NAME', 'ZONE'], columns='YEAR', values="SALE", aggfunc='sum', fill_value=0, margins=True, margins_name='Total count'))
Let's summarize
If you're looking for a way to inspect your data from a different perspective then pivot_table is the
answer. It's easy to use, it's useful for both numeric and categorical values, and it can get you
results in one line of code.
Sorting - Sorting refers to arranging values in a particular order. The values can be sorted on
the basis of a specific column or columns, in ascending or descending order.
Pandas makes available the sort_values() function for this purpose, which can be used as per
the following syntax :
<dataframe>.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
by            str or list of str ; name or list of names to sort by. If axis is 0 or 'index' then by
              may contain index levels and/or column labels ; if axis is 1 or 'columns' then by
              may contain column levels and/or index labels.
axis          {0 or 'index', 1 or 'columns'}, default 0 ; axis to be sorted.
ascending     bool or list of bool, default True ; sort ascending vs. descending. Specify a list
              for multiple sort orders ; if this is a list of bools, it must match the length of the
              by argument.
inplace       bool, default False ; if True, perform the operation in-place (on the dataframe itself).
na_position   {'first', 'last'}, default 'last' ; 'first' puts NaNs at the beginning, 'last' puts NaNs
              at the end.
• You can use the inplace parameter if you want to store the sorted data in the dataframe itself.
• Use the na_position parameter to specify the position of NaN values, which is by default 'last'
but can be set to 'first'. A short sketch follows below.
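A minimal sort_values( ) sketch (illustrative values, not from the text) :
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Amy', 'Alex', 'Allen'],
                   'Score': [90, 98, np.nan]})
# sort by Score, highest first, keeping the NaN row at the top
print(df.sort_values(by='Score', ascending=False, na_position='first'))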
Aggregation - With large amounts of data, most often we need to aggregate data so as to
analyse it effectively. Pandas offers many aggregate functions, using which you can aggregate
data and get summary statistics of the data.
S.No.   Aggregation         Description
1.      count()             Total number of items
2.      sum()               Sum of all items
3.      mean(), median()    Mean and median
4.      min(), max()        Minimum and maximum
5.      std(), var()        Standard deviation and variance
6.      mad()               Mean absolute deviation
The mad( ) function is used to calculate the mean absolute deviation of the values for the
requested axis. The Mean Absolute Deviation (MAD) of a set of data is the average distance
between each data value and the mean. You can use mad( ) as per the following syntax :
<dataframe>.mad(axis = None, skipna = None)
axis     {index (0), columns (1)}, default 0
skipna   boolean, default True ; exclude NA/null values. If an entire row/column is NA, the
         result will be NA.
The std() function calculates the standard deviation of a given set of numbers ; it can be
applied to a whole dataframe, to a single column, or across rows. A small sketch of both
functions follows.
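A minimal sketch of std( ) and mad( ) on one column (illustrative values ; note that mad( ) was
removed in pandas 2.0, where (s - s.mean()).abs().mean() gives the same result) :
import pandas as pd

s = pd.Series([97, 98, 99])
print(s.std())   # 1.0      : unbiased standard deviation
print(s.mad())   # 0.666... : mean absolute deviation from the mean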
Creating a Histogram - A histogram is a plot that lets you discover, and show, the underlying
frequency distribution (shape) of a set of continuous data - for example, a histogram computed
from a dataset containing the ages of 20 people.
To create a histogram from a dataframe, you can use the hist( ) function of the dataframe, which
draws one histogram of the DataFrame's columns. This function calls the PyPlot library's hist( )
on each series in the DataFrame, resulting in one histogram per column. Syntax :
DataFrame.hist(column = None, by = None, grid = True, bins = 10)
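A runnable sketch with made-up ages (the 20 values are illustrative, not the book's dataset) :
import pandas as pd
import matplotlib.pyplot as plt

ages = pd.DataFrame({'Age': [21, 22, 25, 25, 30, 31, 33, 35, 36, 40,
                             41, 42, 45, 47, 50, 52, 55, 60, 62, 65]})
ages.hist(column='Age', bins=10, grid=True)  # one histogram for the Age column
plt.show()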
Function Application - By function application, it means that a function (a library function or
a user-defined function) may be applied on a dataframe in multiple ways :
(a) on the whole dataframe - pipe( )
(b) row-wise or column-wise - apply( )
(c) on individual elements, i.e., element-wise - applymap( )
Other than the above three, there are two more function-application mechanisms : aggregation
through groupby( ) and transform( ).
The piping of functions through pipe( ) basically means the chaining of functions in the order
they are executed. Syntax : <DataFrame>.pipe(func, *args)
func   function name to be applied on the dataframe with the provided args
args   iterable, optional ; positional arguments passed into func.
Pipe function - Suppose you want to apply a function to a dataframe or series, and then apply
another function to the result, and another after that. One way would be to perform the
operations in a "sandwich"-like fashion :
df = sub(div(add(df, 10), 2), 1)
In the long run, this notation becomes fairly messy and error prone. What you want to use here
is pipe(). Pipe can be thought of as function chaining. This is how you'd perform the same task
as before with pipe() :
df = df.pipe(add, 10).pipe(div, 2).pipe(sub, 1)
This is a cleaner way that helps keep track of the order in which the functions and their
corresponding arguments are applied.
Suppose, for a moment, that you want to apply the following three functions to a dataframe or
series : the first function adds a number to the data, the second function divides the data by a
given parameter, and the third function subtracts a given number from the data.
Here is the data set :
   Col1
A     1
B     2
C     3
def add(df, num):
    return df[:] + num
def div(df, num):
    return df[:] / num
def sub(df, num):
    return df[:] - num
dt = {'Col1':[1,2,3]}
df = pd.DataFrame(dt, index=['A','B','C'])
print(df)
df = df.pipe(add, 10).pipe(div, 2).pipe(sub, 1)
print(df)
output :
   Col1
A   4.5
B   5.0
C   5.5
Note : To apply pipe(), the first argument of the function must be the data set. For example,
add() above accepts two arguments, add(df, num). As df is the first parameter and takes in the
data set, we can use pipe() directly. What if this is not the case? There's a way around it : we
only need to tell pipe the name of the argument in the function that refers to the data set.
Suppose, now, that the functions are specified with the data second, e.g. add(num, df). As the
data is not the first argument, we need to pass it to pipe as a (function, argument-name) tuple :
data_set.pipe((add, "df"), 2)
def add(num, df):
    return df[:] + num
def div(num, df):
    return df[:] / num
def sub(num, df):
    return df[:] - num
dt = {'Col1':[1,2,3]}
df = pd.DataFrame(dt, index=['A','B','C'])
print(df)
df = df.pipe((add, "df"), 10).pipe((div, "df"), 2).pipe((sub, "df"), 1)
print(df)
output :
   Col1
A   4.5
B   5.0
C   5.5
The apply( ) and applymap( ) functions –
• apply() is a series function, so it applies the given function to one row or one column
of the dataframe (a single row/column of a dataframe is equivalent to a series). Arbitrary
functions can be applied along the axes of a DataFrame or Panel using the apply() method,
which, like the descriptive statistics methods, takes an optional axis argument. By default, the
operation is performed column-wise, taking each column as an array-like.
The syntax for using apply() in minimalist form is : <dataframe>.apply(<funcname>, axis=0)
<funcname>   the function to be applied on the series inside the dataframe, i.e., on rows and
             columns. It should be a function that works with series and similar objects.
axis         0 or 1, default 0 ; the axis along which the function is applied.
             If axis is 0 or 'index' : the function is applied on each column.
             If axis is 1 or 'columns' : the function is applied on each row.
df = pd.DataFrame(np.arange(0,15).reshape(5,3), columns=['col1','col2','col3'])
print(df.apply(np.mean))
print(df.apply(np.mean, axis=1))
col1    6.0
col2    7.0
col3    8.0
dtype: float64
By passing the axis parameter, operations can be performed row-wise :
0     1.0
1     4.0
2     7.0
3    10.0
4    13.0
dtype: float64
df = pd.DataFrame([[2,6,9],[23,56,32],[11,12,13]], columns=['col1','col2','col3'])
print(df.apply(lambda x: x.max() - x.min()))
col1    21
col2    50
col3    23
dtype: int64
• applymap( ) is an element function, so it applies the given function to each
individual element, separately - without taking into account other elements. Since not all
functions can be vectorized (accept NumPy arrays and return another array or value), the
method applymap() on DataFrame, and analogously map() on Series, accepts any Python
function taking a single value and returning a single value.
The syntax for using applymap( ) is : <dataframe>.applymap(<funcname>)
df = pd.DataFrame([[2,6,9],[23,56,32],[11,12,13]], columns=['col1','col2','col3'])
print(df.applymap(lambda x: x*100))
   col1  col2  col3
0   200   600   900
1  2300  5600  3200
2  1100  1200  1300
df = pd.DataFrame([[2,6,9],[23,56,32],[11,12,13]], columns=['col1','col2','col3'])
print(df['col1'].map(lambda x: x*100))
0     200
1    2300
2    1100
Name: col1, dtype: int64
NOTE : The apply( ) will apply the function on individual columns/rows, only if the passed
function name is a Series function. If you pass a single value function, then apply( ) will behave
like applymap( ).
Function groupby( ) - Within a dataframe, based on a field's values, you may want to group
the data. To create such groups, Pandas provides the groupby( ) function. The syntax is :
<dataframe>.groupby(by=None, axis=0)
by     label, or list of labels, to be used for grouping
axis   {0 or 'index', 1 or 'columns'}, default 0 ; split along rows or columns.
df1.groupby('tutor')
The result of groupby() is also an object, the DataFrameGroupBy object.
You can store the GroupBy object in a variable name and then use the following attributes and
functions to get information about the groups or to display the groups :
<GroupByObject>.groups                   lists the groups created
<GroupByObject>.get_group(<value>)       lists the group created for the passed value
<GroupByObject>.size()                   lists the size of the groups created
<GroupByObject>.count()                  lists the count of non-NA values for each column in
                                         the groups created
<GroupByObject>[<columnname>].head()     lists the specified column from the grouped object
                                         created
Grouping on Multiple Columns - df.groupby(['Tutor', 'Country'])
Aggregation via groupby( ) - Often in data science, you need to have summary statistics in
the same table. You can achieve this using the agg() method on the groupby object created
using the groupby( ) method. The agg( ) method aggregates the data of the dataframe using
one or more operations over the specified axis. The syntax for using agg( ) is :
<dataframe>.agg( func, axis = 0 )
func   function, str, list or dict
axis   {0 or 'index', 1 or 'columns'}, default 0 ; if 0 or 'index' : apply the function to each
       column. If 1 or 'columns' : apply the function to each row.
Any groupby operation involves one of the following operations :
1. Splitting the object
2. Applying a function
3. Combining the results
In many situations, we split the data into sets and we apply some functionality on each subset.
In the apply functionality, we can perform the following operations −
• Aggregation − computing a summary statistic
• Transformation − performing some group-specific operation
• Filtration − discarding the data with some condition
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
print(df)
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
4      741     3   Kings  2014
5      812     4   kings  2015
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015
11     690     2  Riders  2017
Split Data into Groups - A Pandas object can be split into groups in multiple ways, e.g.
obj.groupby('key') or obj.groupby(['key1','key2']) or obj.groupby(key, axis=1).
df = pd.DataFrame(ipl_data)
print(df.groupby('Team'))
<pandas.core.groupby.DataFrameGroupBy object at 0x7fa46a977e50>
View Groups
df = pd.DataFrame(ipl_data)
print(df.groupby('Team').groups)
{'Kings': Int64Index([4, 6, 7], dtype='int64'),
 'Devils': Int64Index([2, 3], dtype='int64'),
 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'),
 'Royals': Int64Index([9, 10], dtype='int64'),
 'kings': Int64Index([5], dtype='int64')}
Example - Group by with multiple columns −
df = pd.DataFrame(ipl_data)
print(df.groupby(['Team','Year']).groups)
{('Kings', 2014): Int64Index([4], dtype='int64'),
 ('Royals', 2014): Int64Index([9], dtype='int64'),
 ('Riders', 2014): Int64Index([0], dtype='int64'),
 ('Riders', 2015): Int64Index([1], dtype='int64'),
 ('Kings', 2016): Int64Index([6], dtype='int64'),
 ('Riders', 2016): Int64Index([8], dtype='int64'),
 ('Riders', 2017): Int64Index([11], dtype='int64'),
 ('Devils', 2014): Int64Index([2], dtype='int64'),
 ('Devils', 2015): Int64Index([3], dtype='int64'),
 ('kings', 2015): Int64Index([5], dtype='int64'),
 ('Royals', 2015): Int64Index([10], dtype='int64'),
 ('Kings', 2017): Int64Index([7], dtype='int64')}
Iterating through Groups - With the groupby object in hand, we can iterate through the
groups, similarly to iterating with itertools :
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
for name, group in grouped:
    print(name)
    print(group)
2014
   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014
2015
    Points  Rank    Team  Year
1      789     2  Riders  2015
3      673     3  Devils  2015
5      812     4   kings  2015
10     804     1  Royals  2015
2016
   Points  Rank    Team  Year
6     756     1   Kings  2016
8     694     2  Riders  2016
2017
    Points  Rank    Team  Year
7      788     1   Kings  2017
11     690     2  Riders  2017
NOTE : By default, the groupby object has the same label name as the group name.
Select a Group - Using the get_group() method, we can select a single group.
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
print(grouped.get_group(2014))
   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014
Aggregations - An aggregation function returns a single aggregated value for each group.
Once the groupby object is created, several aggregation operations can be performed on
the grouped data. An obvious one is aggregation via the aggregate or equivalent agg method –
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
print(grouped['Points'].agg(np.mean))
Year
2014    795.25
2015    769.50
2016    725.00
2017    739.00
Name: Points, dtype: float64
Another way to see the size of each group is by applying the size() function –
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Team')
print(grouped.agg(np.size))
        Points  Rank  Year
Team
Devils       2     2     2
Kings        3     3     3
Riders       4     4     4
Royals       2     2     2
kings        1     1     1
Applying Multiple Aggregation Functions at Once
With a grouped Series, you can also pass a list or dict of functions to do aggregation with,
and generate a DataFrame as output –
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Team')
print(grouped['Points'].agg([np.sum, np.mean, np.std]))
         sum        mean         std
Team
Devils  1536  768.000000  134.350288
Kings   2285  761.666667   24.006943
Riders  3049  762.250000   88.567771
Royals  1505  752.500000   72.831998
kings    812  812.000000         NaN
The transform( ) function - The groupby( ) function rearranges data into groups based
on some criteria and stores the rearranged data in a new groupby object. You can apply
aggregate functions on the groupby object using agg( ). The transform( ) function
transforms the aggregated data by repeating the summary result for each row of the group,
making the result have the same shape as the original data.
Transform is an operation used in conjunction with groupby. While aggregation must
return a reduced version of the data, transformation can return some transformed version
of the full data to recombine. For such a transformation, the output is the same shape as
the input. A common example is to center the data by subtracting the group-wise mean.
Problem Set - For this example, we will analyze some fictitious sales data. In order to
keep the dataset small, here is a sample of 12 sales transactions for our company.
(transform.xlsx).
First Approach - Merging
If you are familiar with pandas, your first inclination is going to be trying to group the data into a
new dataframe and combine it in a multi-step process. Here’s what that approach would
look like.
Import all the modules we need and read in our data:
import pandas as pd
df = pd.read_excel("sales_transactions.xlsx")
df.groupby('order')["ext price"].sum()
order
10001     576.12
10005    8185.49
10006    3724.49
Name: ext price, dtype: float64
The tricky part is figuring out how to combine this data back with the original dataframe. The first instinct
is to create a new dataframe with the totals by order and merge it back with the original. We could do
something like this:
order_total = df.groupby('order')["ext price"].sum().rename("Order_Total").reset_index()
df_1 = df.merge(order_total)
df_1["Percent_of_Order"] = df_1["ext price"] / df_1["Order_Total"]
    account           name  order       sku  quantity  unit price  ext price  Order_Total  Percent_of_Order
0    383080       Will LLC  10001  B1-20000         7       33.69     235.83       576.12            0.4093
1    383080       Will LLC  10001  S1-27722        11       21.12     232.32       576.12            0.4032
2    383080       Will LLC  10001  B1-86481         3       35.99     107.97       576.12            0.1874
3    412290  Jerde-Hilpert  10005  S1-06532        48       55.82    2679.36      8185.49            0.3273
4    412290  Jerde-Hilpert  10005  S1-82801        21       13.62     286.02      8185.49            0.0349
5    412290  Jerde-Hilpert  10005  S1-06532         9       92.55     832.95      8185.49            0.1018
6    412290  Jerde-Hilpert  10005  S1-47412        44       78.91    3472.04      8185.49            0.4242
7    412290  Jerde-Hilpert  10005  S1-27722        36       25.42     915.12      8185.49            0.1118
8    218895      Kulas Inc  10006  S1-27722        32       95.66    3061.12      3724.49            0.8219
9    218895      Kulas Inc  10006  B1-33087        23       22.55     518.65      3724.49            0.1393
10   218895      Kulas Inc  10006  B1-33364         3       72.30     216.90      3724.49            0.0582
11   218895      Kulas Inc  10006  B1-20000        -1       72.18     -72.18      3724.49           -0.0194
This certainly works but there are several steps needed to get the data combined in the manner we need.
Second Approach - Using Transform
Using the original data, let’s try using “transform” and “groupby” and see what we get:
df.groupby('order')["ext price"].transform('sum')
0      576.12
1      576.12
2      576.12
3     8185.49
4     8185.49
5     8185.49
6     8185.49
7     8185.49
8     3724.49
9     3724.49
10    3724.49
11    3724.49
Name: ext price, dtype: float64
You will notice how this returns a differently sized data set from our normal groupby functions.
Instead of only showing the totals for the 3 orders, we retain the same number of items as the
original data set. That is the unique feature of using transform.
The final step is pretty simple:
df["Order_Total"] = df.groupby('order')["ext price"].transform('sum')
print(df)
df["Percent_of_Order"] = df["ext price"] / df["Order_Total"]
print(df)
    account           name  order       sku  quantity  unit price  ext price  Order_Total  Percent_of_Order
0    383080       Will LLC  10001  B1-20000         7       33.69     235.83       576.12            0.4093
1    383080       Will LLC  10001  S1-27722        11       21.12     232.32       576.12            0.4032
2    383080       Will LLC  10001  B1-86481         3       35.99     107.97       576.12            0.1874
3    412290  Jerde-Hilpert  10005  S1-06532        48       55.82    2679.36      8185.49            0.3273
4    412290  Jerde-Hilpert  10005  S1-82801        21       13.62     286.02      8185.49            0.0349
5    412290  Jerde-Hilpert  10005  S1-06532         9       92.55     832.95      8185.49            0.1018
6    412290  Jerde-Hilpert  10005  S1-47412        44       78.91    3472.04      8185.49            0.4242
7    412290  Jerde-Hilpert  10005  S1-27722        36       25.42     915.12      8185.49            0.1118
8    218895      Kulas Inc  10006  S1-27722        32       95.66    3061.12      3724.49            0.8219
9    218895      Kulas Inc  10006  B1-33087        23       22.55     518.65      3724.49            0.1393
10   218895      Kulas Inc  10006  B1-33364         3       72.30     216.90      3724.49            0.0582
11   218895      Kulas Inc  10006  B1-20000        -1       72.18     -72.18      3724.49           -0.0194
As an added bonus, you could combine into one statement if you did not want to show the individual
order totals:
df["Percent_of_Order"] = df["ext price"] / df.groupby('order')["ext price"].transform('sum')
NOTE : The aggregation function agg() returns a reduced version of the data by producing one
summary result per group. The transform() function, on the other hand, returns a transformed
version of the summary data by repeating the group's result for each of its rows, making it the
same shape as the full data ; thus the result of transform can be combined with the dataframe
easily.
Reindexing and Altering Labels – When you create a dataframe object, it gets its row labels
(the indexes) and column labels automatically. But sometimes we are not satisfied with the row
and column labels of a dataframe. For this, Pandas offers a major functionality : you may change
the row indexes and column labels as and when you require.
Recall that index refers to the labels of axis 0, i.e., row labels, and columns refers to the labels of
axis 1, i.e., column labels. There are several similar methods provided by the Pandas library that
help you change, rearrange or rename indexes or column labels. Thus, you should read the
following lines carefully to know the difference between the working of these methods.
The methods provided by Pandas for reindexing and relabelling are :
(i) rename( ). A method that simply renames the index and/or column labels in a dataframe.
(ii) reindex( ). A method that can specify the new order of existing indexes and column labels,
and/or also create new indexes/column labels.
(iii) reindex_like( ). A method for creating indexes/column labels based on another dataframe
object.
(i) The rename( ) method - The rename() function renames the existing indexes/column labels
in a dataframe. The old and new index/column labels are to be provided in the form of a
dictionary where the keys are the old index/column labels and the values are the new names for
the same. Syntax :
<dataframe>.rename(mapper = None, axis = None, inplace = False)
<dataframe>.rename(index = None, columns = None, inplace = False)
mapper, index, columns   dict-like (dictionary-like)
axis                     int (0 or 1) or str ('index' or 'columns') ; the default is 0 or 'index'.
inplace                  boolean, default False (which returns a new dataframe with renamed
                         index/labels) ; if True, changes are made in the current dataframe and
                         a new dataframe is not returned.
ndf.rename( {'Qtr1':1, 'Qtr2':2, 'Qtr3':3, 'Qtr4':4}, axis=0 )
NOTE : You either use a mapper dictionary with the axis argument, or use the dictionary with
the index = or columns = keyword arguments.
<DF>.rename( columns = { <dictionary with old and new labels> } ) or
<DF>.rename( { <dictionary with old and new labels> }, axis = 1 )
The rename() method allows you to relabel an axis based on some mapping (a dict or Series)
or an arbitrary function.
df1 = pd.DataFrame([[50,60,70],[90,80,70]], columns=['Sub1','Sub2','Sub3'])
print(df1)
print("After renaming the rows and columns:")
print(df1.rename(columns={'Sub1':'Eng', 'Sub2':'Hin', 'Sub3':'Evs'}, index={0:11, 1:21, 2:51}))
   Sub1  Sub2  Sub3
0    50    60    70
1    90    80    70
After renaming the rows and columns:
    Eng  Hin  Evs
11   50   60   70
21   90   80   70
The rename() method provides an inplace named parameter, which by default is False and copies the
underlying data. Pass inplace=True to rename the data in place.
(ii) The reindex( ) method - The function reindex( ) is used to change the order of, or create,
indexes/labels. It is used as per the following syntaxes in minimalist form :
DataFrame.reindex(index = None, columns = None, method = None, fill_value = nan)
DataFrame.reindex(labels = None, axis = None, method = None, fill_value = nan)
labels           array-like, optional ; new labels/index to conform the axis specified by 'axis' to.
index, columns   array-like, optional ; new labels/index to conform to, specified using keywords.
                 Preferably an Index object, to avoid duplicating data.
axis             int (0 or 1) or str ('index' or 'columns'), optional ; axis to target. Default 0 or 'index'.
fill_value       the value to be filled in the newly added rows/columns.
NOTE : Like rename( ), in reindex( ) too you either use the labels sequence with the axis
argument, or use it with the index = or columns = keyword arguments.
(a) Reordering the existing indexes using reindex( ) - By default (axis = 0), reindex( ) reorders
the row indexes of the dataframe as per the given order.
(b) Reordering as well as adding/deleting indexes/labels - Existing row indexes/column labels
are reordered as per the given order, and non-existing row indexes/column labels create new
rows/columns in which, by default, NaN values are filled.
(c) Specifying fill values for new rows/columns - By using the argument fill_value, you can
specify the value to be filled in the newly added rows/columns, as sketched below. In the
absence of the fill_value argument, the new row/column is filled with NaN.
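A small sketch of fill_value (illustrative frame) :
import pandas as pd

df1 = pd.DataFrame([[50, 60, 70], [90, 80, 70]], columns=['Sub1', 'Sub2', 'Sub3'])
# new row 2 and new column 'Sub4' are filled with 0 instead of NaN
print(df1.reindex(index=[0, 1, 2], columns=['Sub1', 'Sub2', 'Sub4'], fill_value=0))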
df1 = pd.DataFrame([[50,60,70],[90,80,70]], columns=['Sub1','Sub2','Sub3'])
print(df1)
print("After reindexing the rows and columns:")
print(df1.reindex(columns=['Sub1', 'Sub2', 'Sub4'], index=[0, 1, 3]))
   Sub1  Sub2  Sub3
0    50    60    70
1    90    80    70
After reindexing the rows and columns:
   Sub1  Sub2  Sub4
0  50.0  60.0   NaN
1  90.0  80.0   NaN
3   NaN   NaN   NaN
(iii) The reindex_like( ) method - The reindex_like( ) function works on a dataframe and
reindexes its data as per the argument dataframe passed to it. This function does the following
things :
(a) If the current dataframe has some matching row-indexes/column-labels with the passed
dataframe, then it retains those indexes/labels and their data.
(b) If the current dataframe has some row-indexes/column-labels in it which are not in the
passed dataframe, it drops them.
(c) If the current dataframe does not have some row-indexes/column-labels which are in the
passed dataframe, then it adds them to the current dataframe with the value NaN.
(d) The reindex_like( ) thus ensures that the current dataframe object conforms to the same
indexes/labels on all axes.
The syntax for using reindex_like( ) is : <dataframe>.reindex_like(other)
other - name of the dataframe as per which the current <dataframe> is to be reindexed.
df1 = pd.DataFrame([[50,60,34],[90,80,44]], columns=['Sub1','Sub2','Sub3'])
df2 = pd.DataFrame([[78,76],[67,98]], columns=['Sub1','Sub2'])
print("DF1 After reindexing like DF2")
print(df1.reindex_like(df2))
DF1 After reindexing like DF2
   Sub1  Sub2
0    50    60
1    90    80
Note − Here, df1 is reindexed like df2 : the returned dataframe has df2's shape, while df1 itself
is not modified. The column names should match, or else NaN will be added for the entire
column label.
Plotting with PyPlot : Bar Graphs and Scatter Plots
Data visualization basically refers to the graphical or visual representation of information and
data using visual elements like charts, graphs, and maps etc. Data visualization is immensely useful in
decision making. Data visualization unveils patterns, trends, outliers, correlations etc. in the data, and
thereby helps decision makers understand the meaning of data to drive business decisions.
Matplotlib is a Python library that provides many interfaces and functionality for 2D graphics similar to MATLAB's, in various forms. In short, you can call matplotlib a high-quality plotting library of Python. It provides both a very quick way to visualize data from Python and publication-quality figures in many formats. The matplotlib library offers many different named collections of methods; PyPlot is one such interface.
PyPlot is a collection of methods within matplotlib which allows the user to construct 2D plots easily and interactively. PyPlot essentially reproduces the plotting functions and behaviour of MATLAB.
After downloading matplotlib, you need to install it by giving the following commands on the command prompt:
python -m pip install -U pip
python -m pip install -U matplotlib
Importing PyPlot - import matplotlib.pyplot as pl
You can create many different types of graphs and charts using PyPlot. Some commonly used chart types are:
• Line Chart. A line chart or line graph is a type of chart which displays information as a series of data points called 'markers' connected by straight line segments.
• Bar Chart. A bar chart or bar graph is a chart that presents categorical data with rectangular bars whose heights or lengths are proportional to the values they represent. The bars can be plotted vertically or horizontally.
• Scatter Plot. A scatter plot is similar to a line chart; the major difference is that while a line graph connects the data points with a line, a scatter chart simply plots the data points to show the trend in the data.
NOTE : Data points are called markers.
Line Chart using plot( ) function –
import matplotlib.pyplot as pl
a = [1, 2, 3, 4]
b = [2, 4, 6, 8]
pl.plot(a, b)
pl.show()
You can set the x-axis and y-axis labels using the functions xlabel( ) and ylabel( ) respectively, i.e. :
<matplotlib.pyplot or its alias>.xlabel( <str> )
<matplotlib.pyplot or its alias>.ylabel( <str> )
import matplotlib.pyplot as plt
a= [1, 2, 3, 4]; b = [3, 4, 9, 8];
plt.plot(a, b)
plt.xlabel('X axis - Values of A')
plt.ylabel('Y axis - Values of B')
plt.show()
Applying Various Settings in plot( ) Function - The plot( ) function allows you to specify multiple settings for your chart/graph such as : • color (line color / marker color) • marker type • marker size etc.
Changing Line Color - <matplotlib.pyplot>.plot(<data1> [, <data2>], <color code>)
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0., 10, 0.1)
a = np.cos(x)
b = np.sin(x)
plt.plot(x, a, 'b')
plt.plot(x, b, 'r')
plt.show()
To change the line style, you can add the following additional optional argument in the plot( ) function: linestyle (or ls) = 'solid' | 'dashed' | 'dashdot' | 'dotted'
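For instance, a minimal sketch (reusing x and a from the example above) that draws the cosine curve as a dashed blue line:
plt.plot(x, a, 'b', linestyle='dashed')   # ls='dashed' works the same way
plt.show()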
Changing Marker Type, Size and Color –
marker=<valid marker type>, markersize=<in points>, markeredgecolor=<valid color>
plt.plot(p, q, 'k', marker='d', markersize=5, markeredgecolor='red')
plt.plot(p, q, 'r+', marker='d', linestyle='solid')
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0., 10, 0.1)
a = np.cos(x)
b = np.sin(x)
plt.plot(x, a, 'bo', markersize=5)
plt.plot(x, b, 'r', marker='D', markersize=5)
plt.show()
When you do not specify markeredgecolor separately in plot( ), the marker takes the same color as the line. Also, if you do not specify linestyle separately along with a line-color-and-marker-style combination string (e.g., 'r+' above), Python will only plot the markers and not the line.
Creating Scatter Charts - Scatter charts can be created through two functions of the pyplot library:
(i) the plot( ) function
(ii) the scatter( ) function
In the plot( ) function, whenever you specify a marker type/style, whether with color or without color, and do not give the linestyle argument, plot( ) will create a scatter chart.
The full useful syntax of plot( ) can be summarised as :
plot( <data1> [, <data2>] [, <color code and marker type>] [, <linewidth>] [, <linestyle>] [, <marker>] [, <markersize>] [, <markeredgecolor>] )
• data1, data2 are sequences of values to be plotted on the x-axis and y-axis respectively.
• The rest of the arguments affect the look and format of the line/scatter chart as given below:
color code with markertype : color and marker symbol for the line chart
linewidth : width of the line in points (a float value)
linestyle : can be 'solid' | 'dashed' | 'dashdot' | 'dotted'
marker : a symbol denoting a valid marker style such as '.', 'o', 'x', '+', 'd', and others
markersize : size of marker in points (a float value)
markeredgecolor : color for markers; should be a valid color identified by PyPlot
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 30)
y = np.sin(x)
plt.plot(x, y, 'o', color='r')
plt.show()
Scatter Charts using scatter( ) Function - It is a more powerful method of creating scatter plots than the plot( ) function. In its simplest form, the scatter( ) function is used as –
matplotlib.pyplot.scatter( <array1>, <array2> )
or
<pyplot alias>.scatter( <array1>, <array2> )
Specifying marker type - You can specify the marker type using the marker argument of the scatter( ) function, e.g.,
pl.scatter(a1, a4, marker="x")
Specifying size of the markers - Using the argument s, you can specify the size of the markers.
Specifying color of the markers - Using the argument c, you can specify the color of the markers.
matplotlib.pyplot.scatter(x, y, s=None, c=None, marker=None)
x, y : the data positions.
s : the marker size in points**2; optional argument.
c : marker color, sequence, or sequence of colors; optional argument.
marker : MarkerStyle; optional argument.
The primary difference of scatter( ) from plot( ) is that it can be used to create scatter plots where the
properties of each individual point (size, face color, edge color, etc.) can be individually controlled or
mapped to data.
arr1 = np.linspace(-1, 1, 5)          # arr1 with 5 data points created
arr2 = np.exp(arr1)                   # arr2 also has 5 data points
colarr = ['r', 'b', 'm', 'g', 'k']    # colarr is a sequence of colors with same shape as arr1
sarr = [20, 60, 100, 45, 25]          # sarr is a sequence of sizes with same shape as arr1
plt.scatter(arr1, arr2, c=colarr, s=sarr)
plt.show()
Creating Bar Charts - A Bar Graph or a Bar Chart is a graphical display of data using bars of different heights. A bar chart can be drawn vertically or horizontally using rectangles or bars of different heights/widths. PyPlot offers the bar( ) function to create a bar chart.
a, b = [1, 2, 3, 4], [2, 4, 6, 8]
matplotlib.pyplot.bar(a, b)
Notice, the first sequence given in bar( ) forms the x-axis and the second sequence's values are plotted on the y-axis. If you want to specify the x-axis label and y-axis label, then you need to give the commands:
matplotlib.pyplot.xlabel(<label string>)
matplotlib.pyplot.ylabel(<label string>)
NOTE : The simple bar plot is best used when there is just one level of grouping to your variable.
Changing Widths of the Bars in a Bar Chart – By default, a bar chart draws bars with equal widths, having a default width of 0.8 units. That is, all bars have the same width as the default width. But you can always change the widths of the bars.
• You can specify a different width (other than the default width) for all the bars of a bar chart.
• You can also specify different widths for different bars of a bar chart.
(i) To specify a common width (other than the default width) for all bars, you can specify the width argument having a scalar float value in the bar( ) function, i.e., as :
<matplotlib.pyplot>.bar( <x-sequence>, <y-sequence>, width=<float value> )
(ii) To specify different widths for different bars of a bar chart, you can specify the width argument having a sequence (such as a list or tuple) containing widths for each of the bars, in the bar( ) function, i.e., as :
<matplotlib.pyplot>.bar( <x-sequence>, <y-sequence>, width=<width values sequence> )
Changing Colors of the Bars in a Bar Chart –
(i) To specify a common color (other than the default color) for all bars, you can specify the color argument having a valid color code/name in the bar( ) function, i.e., as :
<matplotlib.pyplot>.bar( <x-sequence>, <y-sequence>, color=<color code/name> )
(ii) To specify different colors for different bars of a bar chart, you can specify the color argument having a sequence (such as a list or tuple) containing colors for each of the bars, in the bar( ) function, i.e., as :
<matplotlib.pyplot>.bar( <x-sequence>, <y-sequence>, color=<color names/codes sequence> )
a, b = ['a','b','c','d'], [2, 6, 4, 8]
plt.bar(a, b, width=[.05, .1, .05, .1], color=['red', 'b', 'g', 'black'])
plt.plot(a, b)
plt.show()
Creating Multiple Bars Chart - As such, PyPlot does not provide a specific function for this, but you can always create one by exploiting the width and color arguments of bar( ).
import numpy as np
import matplotlib.pyplot as plt
Val = [[5., 25., 45., 20.], [4., 23., 49., 17.], [6., 22., 47., 19.]]
X = np.arange(4)
plt.bar(X + 0.00, Val[0], color='b', width=0.25)
plt.bar(X + 0.25, Val[1], color='g', width=0.25)
plt.bar(X + 0.50, Val[2], color='r', width=0.25)
plt.title('Bar Chart')
plt.show()
Creating a Horizontal Bar Chart - To create a horizontal bar chart, you need to use
barh() function (bar horizontal), in place of bar(). Also, you need to give x and y axis labels
carefully - the label that you gave to x axis in bar( ), will become y-axis label in barh( ) and
vice-versa.
Val = [[5., 25., 45., 20.], [4., 23., 49., 17.], [6., 22., 47., 19.]]
X = np.arange(1, 15, 4)
plt.barh(X + 0, Val[0], color='b')
plt.barh(X + 1, Val[1], color='g')
plt.barh(X + 2, Val[2], color='r')
plt.show()
Anatomy of a Chart - Any graph or chart that you create using matplotlib's PyPlot interface is created as per a specific structure of a plot, or shall we say a specific anatomy. As anatomy generally refers to the study of the bodily structure (or parts) of something, here we shall discuss the various parts of a plot that you can create using PyPlot.
PyPlot charts have hierarchical structures, or in simple words, they are actually like containers containing multiple items/things inside them.
[Figure : anatomy of a PyPlot chart]
The terms given below describe its parts.
• Figure. PyPlot by default plots every chart into an area called the Figure. A figure contains the other elements of the plot in it.
• Axes. The axes define the area (mostly rectangular in shape for simple plots) on which the actual plot (line or bar or graph etc.) will appear. Axes have properties like label, limits and tick marks on them. There are two axes in a plot: (i) X-axis, the horizontal axis, (ii) Y-axis, the vertical axis.
  • Axis label. It defines the name for an axis. It is individually defined for the X-axis and Y-axis each.
  • Limits. These define the range of values and the number of values marked on the X-axis and Y-axis.
  • Tick marks. The tick marks are individual points marked on the X-axis or Y-axis.
• Title. This is the text that appears on the top of the plot. It defines what the chart is about.
• Legends. These are the different colors or marks that identify different sets of data plotted on the plot. The legends are shown in a corner of the plot.
Adding a Title - To add a title to your plot, you need to call the function title( ) before you show your plot. The syntax of the title( ) function is : plt.title("A Bar chart")
You can use the title( ) function for all types of plots, i.e., for plot( ), for bar( ) and for pie( ) as well.
Setting Xlimits and Ylimits – You can use the xlim( ) and ylim( ) functions to set limits for the X-axis and Y-axis respectively.
<matplotlib.pyplot>.xlim(<xmin>, <xmax>) # set the X-axis limits as xmin to xmax
<matplotlib.pyplot>.ylim(<ymin>, <ymax>) # set the Y-axis limits as ymin to ymax
NOTE : Only the data values falling within the X-limits and Y-limits will get plotted. If no data value maps to the X-limits or Y-limits, nothing will show on the plot.
Val = [[5., 25., 45., 20.], [4., 23., 49., 17.], [6., 22., 47., 19.]]
X = np.arange(1, 15, 4)
plt.bar(X + 0, Val[0], color='b')
plt.bar(X + 1, Val[1], color='g')
plt.bar(X + 2, Val[2], color='r')
plt.title('Bar Chart')
plt.xlim(-2, 18)
plt.show()
You can use decreasing axes by flipping the normal order of the axis limits i.e., if you swap
the limits (min, max) as (max, min), then the plot gets flipped.
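For example, a minimal sketch (reusing the bar chart above) that flips the x-axis so values run from 18 down to -2:
plt.xlim(18, -2)   # (max, min) instead of (min, max) flips the x-axis
plt.show()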
Setting Ticks for Axes - By default, PyPlot will automatically decide which data points will have ticks on the axes, but you can also decide which data points will have tick marks on the X- and Y-axes. To set your own tick marks:
• for the X-axis, you can use the xticks( ) function as per the format:
xticks( <sequence containing tick data points> [, <optional sequence containing tick labels>] )
• for the Y-axis, you can use the yticks( ) function as per the format:
yticks( <sequence containing tick data points> [, <optional sequence containing tick labels>] )
val = [5, 25, 45, 20]
x = [1, 2, 3, 4]
plt.bar(x, val, color='b')
plt.title('Bar Chart')
plt.xlim(0, 5)
plt.xticks([0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5])
plt.show()
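The optional second sequence replaces the tick values with text labels; a minimal sketch (reusing the same x and val, with hypothetical quarter labels):
plt.bar(x, val, color='b')
plt.xticks([1, 2, 3, 4], ['Q1', 'Q2', 'Q3', 'Q4'])   # label the four bars
plt.show()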
Adding Legends - A legend is a color or mark linked to a specific data range plotted. To
plot a legend, you need to do two things –
(i) In the plotting functions like plot( ), bar( ) etc., give a specific label to the data range using the argument label.
(ii) Add the legend to the plot using legend( ) as per the format : <matplotlib.pyplot>.legend(loc=<position number or string>)
The loc argument can take values 1, 2, 3, 4 signifying the position strings 'upper right', 'upper left', 'lower left', 'lower right' respectively. If loc is not given, matplotlib places the legend at the 'best' position automatically.
import numpy as np
import matplotlib.pyplot as plt
Data = [[5., 25., 45., 20.], [8., 13., 29., 27.], [9., 29., 27., 39.]]
X = np.arange(4)
plt.plot(X, Data[0], color='b', label='range1')
plt.plot(X, Data[1], color='g', label='range2')
plt.plot(X, Data[2], color='r', label='range3')
plt.legend(loc='upper left')
plt.title("MultiRange Line chart")
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
Saving a Figure - If you want to save a plot created using pyplot functions for later use or for keeping records, you can use savefig( ) to save the plot. You can use pyplot's savefig( ) as per the format :
<matplotlib.pyplot>.savefig( <string with filename and path> )
• You can save figures in popular formats like .pdf, .png, .eps etc. Specify the format as the file extension.
• Also, while specifying the path, use double backslashes to suppress the special meaning of the single backslash character.
Consider the following examples :
plt.savefig("multibar.pdf")             # save the plot in the current directory
plt.savefig("c:\\data\\multibar.pdf")   # save the plot at the given path
plt.savefig("c:\\data\\multibar.png")   # save the plot at the given path in png format
Creating Histograms with Pyplot - A histogram is a summarization tool for discrete or
continuous data. A histogram provides a visual interpretation of numerical data by showing
the number of data points that fall within a specified range of values (called bins). It is
similar to a vertical bar graph. However, a histogram, unlike a vertical bar graph, shows no
gaps between the bars.
Histograms are a great way to show results of continuous data, such as weight, height, time taken, and so forth. But when the data is in categories (such as Country or Subject etc.), one should use a bar chart.
Histogram using hist( ) Function –
matplotlib.pyplot.hist( x, bins=None, cumulative=False, histtype='bar', align='mid', orientation='vertical' )
x : (n,) array or sequence of (n,) arrays to be plotted on the histogram.
bins : int, optional; if an integer is given, bins + 1 bin edges are calculated and returned. The default value is provided automatically internally.
cumulative : bool, optional; if True, a histogram is computed where each bin gives the counts in that bin plus all bins for smaller values. The last bin gives the total number of datapoints. Default is False.
histtype : {'bar', 'barstacked', 'step', 'stepfilled'}, optional; the type of histogram to draw. 'bar' is a traditional bar-type histogram; if multiple data are given, the bars are arranged side by side. 'barstacked' is a bar-type histogram where multiple data are stacked on top of each other. 'step' generates a lineplot that is by default unfilled. 'stepfilled' generates a lineplot that is by default filled. Default is 'bar'.
orientation : {'horizontal', 'vertical'}, optional; if 'horizontal', barh will be used for bar-type histograms.
a = np.array([22,87,5,43,56,73,55,54,71,20,51,45,79,51,27])
1. plt.hist(a, bins=[0,20,40,60,80,100])
2. plt.hist(a, bins=20)
   plt.xticks(np.arange(0,100,5))
3. plt.hist(a, bins=[0,20,40,60,80,100], cumulative=True)
4. plt.hist(a, bins=[0,20,40,60,80,100], histtype='step')
a = np.array([22,87,5,43,56,73,55,54,71,20,51,45,79,51,27])
b = np.array([27,92,10,53,60,79,60,60,79,20,59,51,80,59,33])
5. plt.hist([a,b], bins=[0,20,40,60,80,100])
6. plt.hist([a,b], bins=[0,20,40,60,80,100], histtype='barstacked')
7. plt.hist([a,b], bins=[0,20,40,60,80,100], histtype='barstacked', cumulative=True)
8. plt.hist([a,b], bins=[0,20,40,60,80,100], orientation='horizontal')
Creating Frequency Polygons - A frequency polygon is a type of frequency distribution graph. In a frequency polygon, the number of observations is marked with a single point at the midpoint of an interval. A straight line then connects each set of points. Frequency polygons make it easy to compare two or more distributions on the same set of axes.
Python's pyplot module of matplotlib provides no separate function for creating a frequency polygon. Therefore, to create a frequency polygon, what you can do is (see the sketch after the example below):
(i) Plot a histogram from the data.
(ii) Mark a single point at the midpoint of an interval/bin.
(iii) Draw straight lines to connect the adjacent points.
(iv) Connect the first data point to the midpoint of the previous interval on the x-axis.
pl.hist(com, bins=10, histtype='step')
Join the midpoints of each set of adjacent bins to create the frequency polygon.
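As a minimal sketch of these steps (the data array com here is hypothetical sample data; np.histogram( ) is used to obtain the bin counts and edges):
import numpy as np
import matplotlib.pyplot as plt
com = np.random.randint(0, 100, 50)          # hypothetical sample data
counts, edges = np.histogram(com, bins=10)   # bin counts and bin edges
mids = (edges[:-1] + edges[1:]) / 2          # midpoint of each bin
plt.hist(com, bins=10, histtype='step')      # step histogram as the base
plt.plot(mids, counts, 'r')                  # joining the midpoints gives the frequency polygon
plt.show()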
Creating Box Plots - The box plot has become the standard technique for presenting the 5-number summary, which consists of : (i) the minimum range value, (ii) the maximum range value, (iii) the upper quartile, (iv) the lower quartile, and (v) the median.
A box plot is used to show the range and middle half of ranked data. Ranked data is numerical data arranged in order. The middle half of the data is represented by the box. The highest and lowest scores are joined to the box by straight lines. The regions above the upper quartile and below the lower quartile each contain 25% of the data.
The box plot uses five important numbers of a data range : the extremes (the highest and the lowest numbers), the median, and the upper and lower quartiles, making up the five-number summary.
[Figure : diagram of the five-number summary]
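These five numbers can also be computed directly; a minimal sketch using NumPy (with the sample data ary used in the examples below):
import numpy as np
ary = [5, 20, 30, 45, 60, 80, 100, 140, 150, 200, 240]
print(np.min(ary),              # minimum
      np.percentile(ary, 25),   # lower quartile
      np.median(ary),           # median
      np.percentile(ary, 75),   # upper quartile
      np.max(ary))              # maximum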
matplotlib.pyplot.boxplot(x, notch=None, vert=None, meanline=None, showmeans=None, showbox=None)
x : array or a sequence of vectors; the input data.
notch : bool, optional (False); if True, will produce a notched box plot. Otherwise, a rectangular boxplot is produced.
vert : bool, optional (True); if True (default), makes the boxes vertical. If False, everything is drawn horizontally.
meanline : bool, optional (False); if True (and showmeans is True), will try to render the mean as a line spanning the full width of the box.
showmeans : bool, optional (False); show the arithmetic means.
showbox : bool, optional (True); show the central box.
ary = [5, 20, 30, 45, 60, 80, 100, 140, 150, 200, 240]
1. Draw the plain boxplot - pl.boxplot(ary)
2. Using the above sample data (ary), draw the boxplot with the mean shown - pl.boxplot(ary, showmeans=True)
3. Draw a notched boxplot for the same - pl.boxplot(ary, notch=True, showmeans=True)
4. Draw the boxplot with the above data without the central box - pl.boxplot(ary, showbox=False)
Using vert argument of boxplot( ), you can change the orientation of the boxplot.
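For example, a minimal sketch (reusing ary from above) that draws the same boxplot horizontally:
pl.boxplot(ary, vert=False)   # False draws the box and whiskers horizontally
pl.show()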
Customizing/Adding Details to the Plots - You have read about these in the previous chapter, so here we shall not cover them again. Instead, we quickly revise them here. The anatomy of all plot types is the same, so you can customise them using the same functions:
• Use <matplotlib.pyplot>.title( ) to add a title to your plot.
• Use <matplotlib.pyplot>.xticks( ) / yticks( ) for setting x-ticks and y-ticks.
• Use <matplotlib.pyplot>.xlim( ) / ylim( ) for setting the x-limit / y-limit.
• Use <matplotlib.pyplot>.xlabel( ) / ylabel( ) for setting the x-axis label / y-axis label.
• Use <matplotlib.pyplot>.legend( ) to add legends to your plot.