PANDAS
Data analysis refers to the process of evaluating big data sets using analytical and statistical tools so as to discover useful information and conclusions that support business decision-making. Pandas, or Python Pandas, is Python's library for data analysis. Pandas derives its name from "panel data system", an econometrics term for multi-dimensional, structured data sets.
Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Python with Pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, analytics, etc.
Pandas is the most popular library in the scientific Python ecosystem for doing data analysis. Pandas is capable of many tasks, including:
• It can read and write data in many different formats and data types (integer, float, double, etc.).
• It can calculate in all the ways data is organized, i.e., across rows and down columns.
• It can easily select subsets of data from bulky data sets and even combine multiple data sets together.
• It has functionality to find and fill missing data.
• It allows you to apply operations to independent groups within the data.
• It supports reshaping of data into different forms.
• It supports advanced time-series functionality (time series forecasting is the use of a model to predict future values based on previously observed values).
• It supports visualization by integrating with libraries such as matplotlib and seaborn.
Key Features of Pandas
• Fast and efficient DataFrame object with default and customized indexing.
• Tools for loading data into in-memory data objects from different file formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, indexing and subsetting of large data sets.
• Columns from a data structure can be deleted or inserted.
• Group by data for aggregation and transformations.
• High performance merging and joining of data.
• Time series functionality.
In other words, Pandas is best at handling huge tabular data sets comprising different data formats. The Pandas library also supports the simplest of tasks needed with data, such as loading data or doing feature engineering on time-series data.
A data structure is a particular way of storing and organizing data in a computer to suit a specific purpose, so that it can be accessed and worked with in appropriate ways. Data structures refer to specialized ways of storing data so as to apply a specific type of functionality to them. Out of the many data structures of Pandas, two basic data structures - Series and DataFrame - are universally popular.
Data Structure   Dimensions   Description
Series           1            1D labeled homogeneous array; size-immutable.
DataFrame        2            General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.
Panel            3            General 3D labeled, size-mutable array.
Mutability - All Pandas data structures are value-mutable (their values can be changed), and all except Series are size-mutable. Series is size-immutable.
Series Data Structure - A Series is a Pandas data structure that represents a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.
Key points: homogeneous data, size-immutable, values of data mutable.
A Series type object has two main components: an array of actual data, and an associated array of indexes or data labels. A Series type object can be created in many ways using the pandas library's Series().
i. Create an empty Series object by using just Series() with no parameter - This gives a Series object with no values and the default datatype, which is float64 (in older pandas versions; recent versions default to object). To create an empty object, i.e., one having no values, you can just use Series() as: <Series Object> = pandas.Series()
import pandas as pd
em = pd.Series()
print(em)
output : Series([], dtype: float64)
ii. Creating non-empty Series objects - To create non-empty Series objects, you need to specify arguments for data and indexes as per the following syntax:
<Series object> = pd.Series(data, index=idx)
where data is the data part of the Series object; it can be one of the following: 1) a sequence, 2) an ndarray, 3) a dictionary, or 4) a scalar value.
(1) Specify data as a Python sequence. Give a sequence of values as the data argument to Series(), i.e., as: <Series Object> = pd.Series(<any Python sequence>)
obj1 = pd.Series(range(5))
print(obj1)
0    0
1    1
2    2
3    3
4    4
dtype: int64
(2) Specify data as an ndarray. The data attribute can be an ndarray also.
nda1 = np.arange(3, 13, 3.5)
ser1 = pd.Series(nda1)
print(ser1)
0     3.0
1     6.5
2    10.0
dtype: float64
(3) Specify data as a Python dictionary. data can be any sequence, including dictionaries.
obj5 = pd.Series({'Jan': 31, 'Feb': 28, 'Mar': 31})
Here it is noteworthy that when you create a Series object from a dictionary, the indexes are created from the dictionary's keys. In older versions of pandas the keys were sorted, so the indexes were not necessarily in the same order as you typed them; recent versions preserve the insertion order.
(4) Specify data as a scalar / single value. BUT if data is a scalar value, then the index must be provided. There can be one or more entries in the index sequence. The scalar value (given as data) will be repeated to match the length of the index.
medalsWon = pd.Series(10, index=range(0, 1))
medals2 = pd.Series(15, index=range(1, 6, 2))
ser2 = pd.Series('Yet to start', index=['Indore', 'Delhi', 'Shimla'])
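To see the repetition of the scalar, a short sketch using medals2 from above:
import pandas as pd
# The scalar 15 is repeated once for every index label (1, 3, 5 here)
medals2 = pd.Series(15, index=range(1, 6, 2))
print(medals2)
# 1    15
# 3    15
# 5    15
# dtype: int64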
Specifying/Adding NaN values in a Series object - Sometimes you need to create a Series object with missing values. In such cases, you can fill the missing data with a NaN (Not a Number) value. The legal empty value NaN is defined in the NumPy module, and hence you can use np.NaN to specify a missing value.
obj3 = pd.Series([6.5, np.NaN, 2.34])
(ii) Specify index(es) as well as data with Series(). While creating a Series type object, you can provide indexes along with the values. Both values and indexes are sequences. The syntax is:
<Series Object> = pandas.Series(data=None, index=None)
Both data and index have to be sequences; None is taken by default if you skip these parameters.
arr = [31, 28, 31, 30]
mon = ['Jan', 'Feb', 'Mar', 'Apr']
obj3 = pd.Series(data=arr, index=mon)
obj4 = pd.Series(data=[32, 34, 35], index=['A', 'B', 'C'])
You may use a loop for defining the index sequence also, e.g.,
s1 = pd.Series(range(1, 15, 3), index=[x for x in 'abcde'])
output :
a     1
b     4
c     7
d    10
e    13
dtype: int64
(iii) Specify data type along with data and index. You can also specify the data type along with data and index with Series() as per the following syntax:
<Series Object> = pandas.Series(data=None, index=None, dtype=None)
obj4 = pd.Series(data=[32, 34, 35], index=['A', 'B', 'C'], dtype=float)
print(obj4)
A    32.0
B    34.0
C    35.0
dtype: float64
NOTE : A Series object's indexes need not always be 0 to n-1.
(iv) Using a mathematical function/expression to create the data array in Series(). Series() allows you to define a function or expression that calculates values for the data sequence:
<Series Object> = pd.Series(index=None, data=<function | expression>)
NumPy array :
a = np.arange(9, 13)
obj7 = pd.Series(index=a, data=a*2)
print(obj7)
Python list :
Lst = [9, 10, 11, 12]
obj8 = pd.Series(data=(2 * Lst))
print(obj8)
It is important to understand that if we apply the operation/expression on a NumPy array, then the given operation is carried out in a vectorized way, i.e., applied on each element of the NumPy array, and the newly generated sequence is taken as the data array. BUT if you apply a similar operation on a Python list, the result will be entirely different (2 * Lst repeats the list rather than doubling each element).
NOTE : While creating a Series object, when you give the index array as a sequence, there is no compulsion for the uniqueness of indexes. That is, you can have duplicate entries in the index array and Python won't raise any error. Indices need not be unique in a Pandas Series. This will only cause an error if/when you perform an operation that requires unique indices.
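A minimal sketch of duplicate indexes in action (the names here are illustrative):
import pandas as pd
# Duplicate index labels are accepted without error
s = pd.Series([10, 20, 30], index=['a', 'a', 'b'])
print(s['a'])    # returns BOTH rows labelled 'a'
# a    10
# a    20
# dtype: int64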
Common attributes of Series objects
Attribute          Description
Series.index       The index (axis labels) of the Series.
Series.values      Return Series as ndarray or ndarray-like depending on the dtype.
Series.dtype       Return the dtype object of the underlying data.
Series.shape       Return a tuple of the shape of the underlying data.
Series.nbytes      Return the number of bytes in the underlying data.
Series.ndim        Return the number of dimensions of the underlying data.
Series.size        Return the number of elements in the underlying data.
Series.itemsize    Return the size of the dtype of the item of the underlying data (in bytes).
Series.hasnans     Return True if there are any NaN values; otherwise return False.
Series.empty       Return True if the Series object is empty, False otherwise.
Series.head()      Returns the first n rows.
Series.tail()      Returns the last n rows.
Series.axes        Returns a list of the row axis labels.
NOTE : If you use len( ) on a series object, then it returns total elements in it including NaNs but
<series>.count( ) returns only the count of non-NaN values in a Series object.
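A quick sketch of the difference (obj3 here is the NaN example from earlier):
import pandas as pd
import numpy as np
obj3 = pd.Series([6.5, np.NaN, 2.34])
print(len(obj3))       # 3 -> includes the NaN
print(obj3.count())    # 2 -> non-NaN values only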
Accessing Individual Elements : To access individual elements of a Series object, you can give its index in square brackets along with its name, as you have been doing with other Python sequences. For example: <Series Object name>[<valid index>], e.g.,
obj5['Feb']  # BUT if you try to give an index which is not a legal index, it will give you an error.
Extracting Slices from a Series Object : Like other sequences, you can extract slices from a Series object. Slicing is a powerful way to retrieve subsets of data from a pandas object.
Slicing takes place position-wise and not index-wise in a Series object. All individual elements have position numbers starting from 0 onwards, i.e., 0 for the first element, 1 for the 2nd element and so on.
A slice is created from a Series object using the syntax <Object>[start : end : step], where start and end signify the positions of elements, not the indexes. The slice of a Series object is also a pandas Series type object.
import pandas as pd
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
print(s[:3])
a    1
b    2
c    3
dtype: int64
Operations on Series Objects :
1. Modifying Elements of a Series Object : The data values of a Series object can be easily modified through item assignment, i.e., <SeriesObject>[<index>] = <new_data_value>
The above assignment will change the data value of the given index in the Series object.
obj4 = pd.Series([2,4,6,8,10])
obj4[1:3] = -1
print(obj4)
0     2
1    -1
2    -1
3     8
4    10
dtype: int64
The above assignment will replace all the values falling in the given slice. Please note that a Series object's values can be modified but its size cannot. So you can say that Series objects are value-mutable but size-immutable objects.
2. The head() and tail() functions : The head() function is used to fetch the first n rows from a pandas object and the tail() function returns the last n rows from a pandas object.
obj4 = pd.Series([2,4,6,8,10])
print(obj4.head(2))
0    2
1    4
dtype: int64
obj4 = pd.Series([2,4,6,8,10])
print(obj4.tail(2))
3     8
4    10
dtype: int64
3. Vector Operations on Series Objects : Vector operations mean that if you apply a function or expression, it is individually applied on each item of the object. Since Series objects are built upon NumPy arrays (ndarrays), they also support vectorized operations, just like ndarrays.
obj4 = pd.Series([2,4,6])
print(obj4 ** 2)
0     4
1    16
2    36
dtype: int64
4. Arithmetic on Series Objects : You can perform arithmetic like addition, subtraction, division etc. with two Series objects, and it will calculate the result on the corresponding items of the two objects given in the expression, BUT with a caveat - the operation is performed only on the matching indexes. Also, if the data items of two matching indexes are not compatible for the operation, it will return NaN (Not a Number) as the result.
When you perform arithmetic operations on two Series type objects, the data is aligned on the basis of matching indexes (this is called data alignment in pandas objects) and then the arithmetic is performed; for non-overlapping indexes, the arithmetic operations result in NaN (Not a Number). You can store the result of object arithmetic in another object, which will also be a Series object: ob6 = ob1 + ob3, as sketched below.
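A minimal sketch of such object arithmetic (ob1 and ob3 here are illustrative; only 'b' and 'c' match):
import pandas as pd
ob1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
ob3 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])
ob6 = ob1 + ob3        # aligned on matching indexes
print(ob6)
# a     NaN    <- 'a' occurs only in ob1
# b    21.0    <- 20 + 1
# c    32.0    <- 30 + 2
# d     NaN    <- 'd' occurs only in ob3
# dtype: float64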
5. Filtering Entries : You can filter out entries from a Series object using expressions that are of Boolean type, as shown in the sketch below.
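A minimal sketch of Boolean filtering:
import pandas as pd
obj4 = pd.Series([2, 4, 6, 8, 10])
# keep only the entries for which the condition is True
print(obj4[obj4 > 5])
# 2     6
# 3     8
# 4    10
# dtype: int64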
6. Re-indexing - Sometimes you need to create a similar object but with a different order of the same indexes. You can use reindexing for this purpose as per this syntax:
<Series Object> = <Object>.reindex(<sequence with new order of indexes>)
obj4 = pd.Series([2,4,6], index=[0,1,2])
obj4 = obj4.reindex([1,2,3])
print(obj4)
1    4.0
2    6.0
3    NaN
dtype: float64
With this, the same data values and their indexes will be stored in the new object as per the order of indexes defined in reindex(); an index absent from the original object (3 here) gets NaN.
7. Dropping Entries from an Axis - Sometimes you do not need a data value at a particular index. You can remove that entry from a Series object using drop() as per this syntax:
<Series Object>.drop(<index or indexes to be removed>)
obj4 = pd.Series([2,4,6], index=[0,1,2])
obj4 = obj4.drop([1,2])
print(obj4)
0    2
dtype: int64
Difference between NumPy Arrays and Series Objects
(i) In the case of ndarrays, you can perform vectorized operations only if the shapes of the two ndarrays match; otherwise it raises an error. But with Series objects, in vectorized operations the data of the two Series objects is aligned as per matching indexes and the operation is performed on them; for non-matching indexes, NaN is returned.
(ii) In ndarrays, the indexes are always numeric, starting from 0 onwards, BUT Series objects can have any type of indexes, including numbers (not necessarily starting from 0), letters, labels, strings etc. A sketch of this difference follows.
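A short sketch contrasting the two behaviours:
import numpy as np
import pandas as pd
n1 = np.array([1, 2, 3])
n2 = np.array([1, 2])
# n1 + n2              # ValueError: shapes (3,) and (2,) do not match
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2], index=['b', 'c'])
print(s1 + s2)         # aligned by index; 'a' has no match -> NaN
# a    NaN
# b    3.0
# c    5.0
# dtype: float64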
Accessing Data from a Series by Position
Example 1 - Retrieve the first element. As we already know, counting starts from zero for the array, which means the first element is stored at the zeroth position and so on.
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
#retrieve the first element
print(s[0])
1
Example 2 - Retrieve the first three elements in the Series. If a : is inserted in front of a position, all items from that position onwards will be extracted. If two parameters (with : between them) are used, items between the two positions (not including the stop position) are extracted.
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
#retrieve the first three elements
print(s[:3])
a    1
b    2
c    3
dtype: int64
Example 3 - Retrieve the last three elements.
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
#retrieve the last three elements
print(s[-3:])
c    3
d    4
e    5
dtype: int64
Retrieve Data Using Label (Index)
A Series is like a fixed-size dict in that you can get and set values by index label.
Example 1 - Retrieve a single element using an index label value.
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
#retrieve a single element
print(s['a'])
1
Example 2 - Retrieve multiple elements using a list of index label values.
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
#retrieve multiple elements
print(s[['a','c','d']])
a    1
c    3
d    4
dtype: int64
Example 3 - If a label is not contained, an exception is raised.
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
#retrieve missing element
print(s['f'])
…
KeyError: 'f'
DataFrame Data Structure : A DataFrame is another pandas structure, which stores data in a two-dimensional way. It is actually a two-dimensional (tabular, spreadsheet-like) labeled array, which is an ordered collection of columns where columns may store different types of data, e.g., numeric, string, floating point or Boolean type etc. A two-dimensional array is an array in which each element is itself an array.
Features of DataFrame
1. Potentially columns are of different types
2. Size – Mutable
3. Labeled axes (rows and columns)
4. Can Perform Arithmetic operations on rows and columns
Major characteristics of a DataFrame data structure can be listed as:
(i) It has two indexes, or we can say two axes - a row index (axis=0) and a column index (axis=1).
(ii) Conceptually it is like a spreadsheet where each value is identifiable with the combination of row index and column index. The row index is known as the index in general and the column index is called the column name.
(iii) The indexes can be of numbers or letters or strings.
(iv) There is no condition of having all data of same type across columns; its columns can have
data of different types.
(v) You can easily change its values, i.e., it is value-mutable.
(vi) You can add or delete rows/columns in a DataFrame. In other words, it is size-mutable.
NOTE : DataFrames are both, value-mutable and size-mutable, i.e., you can change both its
values and size.
Creating and Displaying a DataFrame : You can create a DataFrame object by passing data in
many different ways, such as:
(i) Two-dimensional dictionaries i.e., dictionaries having lists or dictionaries or ndarrays or Series
objects etc.
(ii) Two-dimensional ndarrays (NumPy array)
(iii) Series type object
(iv) Another DataFrame object
(i) Two-dimensional dictionaries, i.e., dictionaries having lists or dictionaries or ndarrays or Series objects etc. A two-dimensional dictionary is a dictionary having items as (key : value) where the value part is a data structure of any type: another dictionary, an ndarray, a Series object, a list etc. Here the value parts of all the keys should have similar structure and equal lengths.
(a) Creating a dataframe from a 2D dictionary having values as lists / ndarrays :
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'], 'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)
   Age   Name
0   28    Tom
1   34   Jack
2   29  Steve
3   42  Ricky
The keys of the 2D dictionary have become columns, and indexes have been generated 0 onwards (a RangeIndex). You can specify your own indexes too by specifying a sequence by the name index in the DataFrame() function.
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'], 'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)
       Age   Name
rank1   28    Tom
rank2   34   Jack
rank3   29  Steve
rank4   42  Ricky
(b) Creating a dataframe from a 2D dictionary having values as dictionary objects : A 2D dictionary can have values as dictionary objects too. You can also create a dataframe object using such a 2D dictionary object:
import pandas as pd
yr2015 = {'Qtr1' : 34500, 'Qtr2' : 56000, 'Qtr3' : 47000, 'Qtr4' : 49000}
yr2016 = {'Qtr1' : 44900, 'Qtr2' : 46100, 'Qtr3' : 57000, 'Qtr4' : 59000}
yr2017 = {'Qtr1' : 54500, 'Qtr2' : 51000, 'Qtr3' : 57000, 'Qtr4' : 58500}
disales = {2015 : yr2015, 2016 : yr2016, 2017 : yr2017}
df1 = pd.DataFrame(disales)
print(df1)
Its output is as follows −
       2015   2016   2017
Qtr1  34500  44900  54500
Qtr2  56000  46100  51000
Qtr3  47000  57000  57000
Qtr4  49000  59000  58500
NOTE : While creating a dataframe with a nested or 2D dictionary, Python interprets the outer dict keys as the columns and the inner keys as the row indices.
Now, had there been a situation where the inner dictionaries had non-matching keys, then Python would have done the following:
(i) There would be a total number of indexes equal to the number of unique inner keys across all the inner dictionaries.
(ii) For a key that has no matching keys in other inner dictionaries, the value NaN would be used to depict the missing values.
yr2015 = {'Qtr1' : 34500, 'Qtr2' : 56000, 'Qtr3' : 47000, 'Qtr4' : 49000}
yr2016 = {'Qtr1' : 44900, 'Qtr2' : 46100, 'Qtr3' : 57000, 'Qtr4' : 59000}
yr2017 = {'Qtr1' : 54500, 'Qtr2' : 51000, 'Qtr3' : 57000}
diSales = {2015 : yr2015, 2016 : yr2016, 2017 : yr2017}
df3 = pd.DataFrame(diSales)
print(df3)
Its output is as follows −
       2015   2016     2017
Qtr1  34500  44900  54500.0
Qtr2  56000  46100  51000.0
Qtr3  47000  57000  57000.0
Qtr4  49000  59000      NaN
NOTE : The total number of indexes in a DataFrame object is equal to the total unique inner keys of the 2D dictionary passed to it, and it would use NaN values to fill missing data, i.e., where the corresponding values for a key are missing in any inner dictionary.
(c) Create a DataFrame from a List of Dicts - A list of dictionaries can be passed as input data to create a DataFrame. The dictionary keys are by default taken as column names.
Example 1
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)
   a   b     c
0  1   2   NaN
1  5  10  20.0
Note − Observe, NaN (Not a Number) is appended in missing areas.
Example 2
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)
        a   b     c
first   1   2   NaN
second  5  10  20.0
Example 3
import pandas as pd
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
#With two column indices, values same as dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
#With two column indices with one index with other name
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)
Its output is as follows −
#df1 output
        a   b
first   1   2
second  5  10
#df2 output
        a   b1
first   1  NaN
second  5  NaN
2. Creating a DataFrame Object from a 1-D or 2-D ndarray – You can also pass a one- or two-dimensional NumPy array or list to DataFrame() to create a dataframe object.
Example 1
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)
   0
0  1
1  2
2  3
3  4
4  5
Example 2
data = [['Alex',10], ['Bob',12], ['Clarke',13]]
df = pd.DataFrame(data, columns=['Name','Age'])
print(df)
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13
Example 3
data = [['Alex',10], ['Bob',12], ['Clarke',13]]
df = pd.DataFrame(data, columns=['Name','Age'], dtype=float)
print(df)
     Name   Age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0
NOTE : By giving an index sequence, you can specify your own index names or labels.
If, however, the rows of the ndarray differ in length, i.e., if the number of elements in each row differs, then Python will create just a single column in the dataframe object and the type of the column will be considered as object.
narr3 = np.array([[101.5, 201.2], [400, 50, 600, 700], [212.3, 301.5, 405.2]])
dtf4 = pd.DataFrame(narr3)
3. Creating a DataFrame object from a 2D dictionary with values as Series objects
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
NOTE : The column names should be valid Python identifiers if you want to access them with attribute syntax (e.g., df.one).
4. Creating a DataFrame Object from another DataFrame Object
data = [['Alex',10], ['Bob',12], ['Clarke',13]]
df = pd.DataFrame(data, columns=['Name','Age'])
df1 = pd.DataFrame(df)
print(df1)
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13
DataFrame Attributes:
Getting the count of non-NA values in a dataframe - Like Series, you can use count() with a dataframe too to get the count of non-NaN values, but count() with a dataframe is a little more elaborate:
(i) If you do not pass any argument or pass 0 (the default is 0), then it returns the count of non-NA values for each column.
(ii) If you pass the argument as 1, then it returns the count of non-NA values for each row.
(iii) To get the count of non-NA values from rows/columns, you can explicitly specify the argument to count() as axis='index' or axis='columns', as shown in the sketch below.
NumPy Representation of a DataFrame - You can get the values of a dataframe object as a NumPy array using the values attribute. E.g.
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df.values)
[['Alex' 10]
['Bob' 12]
['Clarke' 13]]
Selecting / Accessing a Column –
<DataFrame object>[<column name>] or
<DataFrame object>.<column name>
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df['one'])
# or
print(df.one)
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64
Column Addition
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print("Adding a new column by passing as Series:")
df['three'] = pd.Series([10,20,30], index=['a','b','c'])
print(df)
print("Adding a new column using the existing columns in DataFrame:")
df['four'] = df['one'] + df['three']
print(df)
Adding a new column by passing as Series:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Adding a new column using the existing columns in DataFrame:
   one  two  three  four
a  1.0    1   10.0  11.0
b  2.0    2   20.0  22.0
c  3.0    3   30.0  33.0
d  NaN    4    NaN   NaN
Column Deletion : Example
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three' : pd.Series([10,20,30], index=['a','b','c'])}
df = pd.DataFrame(d)
print("Our dataframe is:")
print(df)
print("Deleting the first column using DEL function:")
del df['one']   # using del
print(df)
print("Deleting another column using POP function:")
df.pop('two')   # using pop
print(df)
Our dataframe is:
   one  three  two
a  1.0   10.0    1
b  2.0   20.0    2
c  3.0   30.0    3
d  NaN    NaN    4
Deleting the first column using DEL function:
   three  two
a   10.0    1
b   20.0    2
c   30.0    3
d    NaN    4
Deleting another column using POP function:
   three
a   10.0
b   20.0
c   30.0
d    NaN
Row Selection, Addition, and Deletion
Selection by Label - Rows can be selected by passing a row label to the loc function.
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.loc['b'])
one 2.0
two 2.0
Name: b, dtype: float64
The result is a series with labels as column names of the DataFrame. And, the
Name of the series is the label with which it is retrieved.
Selection by Integer Location - Rows can be selected by passing an integer location to the iloc function.
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.iloc[2])
one    3.0
two    3.0
Name: c, dtype: float64
Slice Rows – Multiple rows can be selected using the ':' operator.
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df[2:4])
   one  two
c  3.0    3
d  NaN    4
Addition of Rows - The append() function helps to append rows at the end. (Note: DataFrame.append() was deprecated and then removed in pandas 2.0; pd.concat() is the current way to do this - see the sketch after this example.)
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a','b'])
df = df.append(df2)
print(df)
   a  b
0  1  2
1  3  4
0  5  6
1  7  8
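A sketch of the same row addition with pd.concat(), the documented replacement for append():
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])
# concat stacks the frames; original row labels are kept,
# so labels 0 and 1 appear twice, just as with append()
df = pd.concat([df, df2])
print(df)
#    a  b
# 0  1  2
# 1  3  4
# 0  5  6
# 1  7  8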
Deletion of Rows - Use an index label to delete or drop rows from a DataFrame. If the label is duplicated, then multiple rows will be dropped. If you observe, in the above example, the labels are duplicated. Let us drop a label and see how many rows get dropped.
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a','b'])
df = df.append(df2)
df = df.drop(0)
print(df)
   a  b
1  3  4
1  7  8
Selecting / Accessing Multiple Columns –
<DataFrame object>[[<column name>, <column name>, <column name>, ...]]
import pandas as pd
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
print(df[['Name','City']])
     Name    City
0    Alex  Jaipur
1     Bob    Kota
2  Clarke   Ajmer
NOTE : Columns appear in the order of column names given in the list inside the square brackets.
Selecting / Accessing a Subset from a Dataframe using Row / Column Names –
<DataFrameObject>.loc[<startrow> : <endrow>, <startcolumn> : <endcolumn>]
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
print(df.loc[0:1, :])
   Name  Age    City
0  Alex   10  Jaipur
1   Bob   12    Kota
To access a single row: <DataFrameObject>.loc[<row>]
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
print(df.loc[0])
Name      Alex
Age         10
City    Jaipur
Name: 0, dtype: object
To access multiple rows: <DataFrameObject>.loc[<startrow> : <endrow>, :]
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
print(df.loc[0:1])
   Name  Age    City
0  Alex   10  Jaipur
1   Bob   12    Kota
To access selective columns, use :
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
print(df.loc[:, 'Name':'Age'])
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13
To access a range of columns from a range of rows, use:
<DF object>.loc[<startrow> : <endrow>, <startcolumn> : <endcolumn>]
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
print(df.loc[0:1, 'Name':'Age'])
   Name  Age
0  Alex   10
1   Bob   12
Obtaining a Subset/Slice from a Dataframe using Row/Column Numeric Index/Position:
You can extract a subset from a dataframe using the row and column numeric index/position, but this time you will use iloc instead of loc. Unlike loc, the end positions in iloc are excluded.
NOTE : iloc means integer location.
<DFobject>.iloc[<start row index> : <end row index>, <start col index> : <end column index>]
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data)
print(df.iloc[0:2, 0:2])
      0   1
0  Alex  10
1   Bob  12
Selecting/Accessing Individual Values
(i) Either give the name of the row or its numeric index in square brackets with the column, i.e., as:
<DF object>.<column>[<row name or row numeric index>]
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
print(df.Age[0:2])
0    10
1    12
Name: Age, dtype: int64
(ii) You can use the at or iat attribute with the DF object as shown below:
<DF object>.at[<row name>, <column name>]
Or
<DF object>.iat[<numeric row index>, <numeric column index>]
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
print(df.at[0, 'Age'])
10
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data)
print(df.iat[0, 0])
Alex
Assigning/Modifying Data Values in Dataframes:
(a) To change or add a column, use the syntax:
<DF object>.<column name>[<row label>] = <new value>
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
df.Age[0] = 11
print(df)
     Name  Age    City
0    Alex   11  Jaipur
1     Bob   12    Kota
2  Clarke   13   Ajmer
(b) Similarly, to change or add a row, use the syntax:
<DF object>.at[<row label>, <column label>] = <new value>
Or
<DF object>.loc[<row label>, <column label>] = <new value>
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
df.at[3, 'Name'] = 'NONAME'
print(df)
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
df.loc[3, 'Name'] = 'NONAME'
print(df)
Both print the same result; a new row with label 3 is added, and the unassigned columns get NaN (which also changes Age to a float column):
     Name   Age    City
0    Alex  10.0  Jaipur
1     Bob  12.0    Kota
2  Clarke  13.0   Ajmer
3  NONAME   NaN     NaN
(c) To change or modify a single data value, use the syntax:
<DF>.<column name>[<row name/label>] = <value>
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
df.Age[0] = 11
print(df)
     Name  Age    City
0    Alex   11  Jaipur
1     Bob   12    Kota
2  Clarke   13   Ajmer
Adding Columns in DataFrames
<DF object>[<column name>] = <values for column>
Or
<DF object>.loc[:, <column name>] = <values for column>
Or
<DF object> = <DF object>.assign(<column name> = <values for column>)
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
df['State'] = ['Raj','Raj','Raj']
print(df)
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
df.loc[:, 'State'] = ['Raj','Raj','Raj']
print(df)
data = [['Alex',10,'Jaipur'], ['Bob',12,'Kota'], ['Clarke',13,'Ajmer']]
df = pd.DataFrame(data, columns=['Name','Age','City'])
df = df.assign(State=['Raj','Raj','Raj'])
print(df)
All three print:
     Name  Age    City State
0    Alex   10  Jaipur   Raj
1     Bob   12    Kota   Raj
2  Clarke   13   Ajmer   Raj
NOTE : When you assign something to a column of a dataframe, then for an existing column it will change the data values, and for a non-existing column it will add a new column.
Deleting Columns in DataFrames
del <DF object>[<column name>]
<DF object>.drop(<column name or sequence of column names>, axis=1)
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three' : pd.Series([10,20,30], index=['a','b','c'])}
df = pd.DataFrame(d)
print("Our dataframe is:")
print(df)
print("Deleting the first column using DEL function:")
del df['one']   # using del
print(df)
print("Deleting another column using Drop function:")
df = df.drop('two', axis=1)   # using drop
print(df)
Our dataframe is:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Deleting the first column using DEL function:
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN
Deleting another column using Drop function:
   three
a   10.0
b   20.0
c   30.0
d    NaN
To delete rows from a dataframe, you can use <DF>.drop(<index or sequence of indexes>); by default the axis value is 0.
Attribute/Method   Description
T                  Transposes rows and columns.
axes               Returns a list with the row axis labels and column axis labels as the only members.
dtypes             Returns the dtypes in this object.
empty              True if the NDFrame is entirely empty (no items) or if any of the axes are of length 0.
ndim               Number of axes / array dimensions.
shape              Returns a tuple representing the dimensionality of the DataFrame.
size               Number of elements in the NDFrame.
values             NumPy representation of the NDFrame.
head()             Returns the first n rows.
tail()             Returns the last n rows.
Examples
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
     'Age':pd.Series([25,26,25,23,30,29,23]),
     'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
#Create a DataFrame
df = pd.DataFrame(d)
print("Our data series is:")
print(df)
Our data series is:
   Age   Name  Rating
0   25    Tom    4.23
1   26  James    3.24
2   25  Ricky    3.98
3   23    Vin    2.56
4   30  Steve    3.20
5   29  Smith    4.60
6   23   Jack    3.80
T (Transpose) - Returns the transpose of the DataFrame. The rows and columns will interchange.
print("The transpose of the data series is:")
print(df.T)
The transpose of the data series is:
           0      1      2     3      4      5     6
Age       25     26     25    23     30     29    23
Name     Tom  James  Ricky   Vin  Steve  Smith  Jack
Rating  4.23   3.24   3.98  2.56    3.2    4.6   3.8
axes - Returns the list of row axis labels and column axis labels.
print("Row axis labels and column axis labels are:")
print(df.axes)
Row axis labels and column axis labels are:
[RangeIndex(start=0, stop=7, step=1), Index(['Age', 'Name', 'Rating'], dtype='object')]
dtypes - Returns the data type of each column.
print("The data types of each column are:")
print(df.dtypes)
The data types of each column are:
Age         int64
Name       object
Rating    float64
dtype: object
empty - Returns a Boolean value saying whether the object is empty or not; True indicates that the object is empty.
print("Is the object empty?")
print(df.empty)
Is the object empty?
False
ndim - Returns the number of dimensions of the object. By definition, a DataFrame is a 2D object.
print("The dimension of the object is:")
print(df.ndim)
The dimension of the object is:
2
shape - Returns a tuple representing the dimensionality of the DataFrame: tuple (a, b), where a represents the number of rows and b represents the number of columns.
print("The shape of the object is:")
print(df.shape)
The shape of the object is:
(7, 3)
size - Returns the number of elements in the DataFrame.
print("The total number of elements in our object is:")
print(df.size)
The total number of elements in our object is: 21
values - Returns the actual data in the DataFrame as an ndarray.
print("The actual data in our data frame is:")
print(df.values)
The actual data in our data frame is:
[[25 'Tom' 4.23]
[26 'James' 3.24]
[25 'Ricky' 3.98]
[23 'Vin' 2.56]
[30 'Steve' 3.2]
[29 'Smith' 4.6]
[23 'Jack' 3.8]]
Head & Tail - To view a small sample of a DataFrame object, use the head() and tail() methods. head() returns the first n rows (observe the index values). The default number of elements to display is five, but you may pass a custom number.
print("The first two rows of the data frame is:")
print(df.head(2))
The first two rows of the data frame is:
   Age   Name  Rating
0   25    Tom    4.23
1   26  James    3.24
tail() returns the last n rows (observe the index values). The default number of elements to display is five, but you may pass a custom number.
print("The last two rows of the data frame is:")
print(df.tail(2))
The last two rows of the data frame is:
   Age   Name  Rating
5   29  Smith     4.6
6   23   Jack     3.8
Iterating over a DataFrame
<DFobject>.iterrows() - The iterrows() method iterates over the dataframe row-wise, where each horizontal subset is in the form of (row-index, Series), and the Series contains all column values for that row index.
<DFobject>.iteritems() - The iteritems() method iterates over the dataframe column-wise, where each vertical subset is in the form of (col-index, Series), and the Series contains all row values for that column index. (Note: iteritems() was renamed to items(), and the old name was removed in pandas 2.0.)
NOTE : <DF>.iteritems() iterates over vertical subsets in the form of (col-index, Series) pairs and <DF>.iterrows() iterates over horizontal subsets in the form of (row-index, Series) pairs.
for (row, rowSeries) in df1.iterrows():
Each row is taken one at a time in the form of (row, rowSeries), where row stores the row index and rowSeries stores all the values of that row in the form of a Series object.
for (col, colSeries) in df1.iteritems():
Each column is taken one at a time in the form of (col, colSeries), where col stores the column index and colSeries stores all the values of that column in the form of a Series object.
To iterate over the rows and columns of the DataFrame, we can use the following functions −
• iteritems() − iterate over the (key, value) pairs
• iterrows() − iterate over the rows as (index, series) pairs
iteritems() - Iterates over each column as a key-value pair, with the label as the key and the column values as a Series object.
df = pd.DataFrame(np.arange(1,13).reshape(4,3), columns=['col1','col2','col3'])
for key, value in df.iteritems():
    print(key, value)
col1 0     1
1     4
2     7
3    10
Name: col1, dtype: int32
col2 0     2
1     5
2     8
3    11
Name: col2, dtype: int32
col3 0     3
1     6
2     9
3    12
Name: col3, dtype: int32
Observe, each column is iterated separately as a key-value pair, with the column values in a Series.
iterrows() - iterrows() returns an iterator yielding each index value along with a Series containing the data in each row.
df = pd.DataFrame(np.arange(1,13).reshape(4,3), columns=['col1','col2','col3'])
for key, value in df.iterrows():
    print(key, value)
0 col1    1
col2    2
col3    3
Name: 0, dtype: int32
1 col1    4
col2    5
col3    6
Name: 1, dtype: int32
2 col1    7
col2    8
col3    9
Name: 2, dtype: int32
3 col1    10
col2    11
col3    12
Name: 3, dtype: int32
Note − Because iterrows() iterates over the rows, it doesn't preserve the data type across the row. 0, 1, 2, 3 are the row indices and col1, col2, col3 are column indices.
Note − Do not try to modify any object while iterating. Iterating is meant for reading, and the iterator returns a copy of the original object, thus the changes will not reflect on the original object.
df = pd.DataFrame(np.arange(1,13).reshape(4,3), columns=['col1','col2','col3'])
for index, row in df.iterrows():
    row['a'] = 10
print(df)
   col1  col2  col3
0     1     2     3
1     4     5     6
2     7     8     9
3    10    11    12
Binary Operations on DataFrames : Binary operations are operations requiring two values, and these values are picked element-wise. In a binary operation, the data from the two dataframes are aligned on the basis of their row and column indexes; for matching row/column indexes the given operation is performed, and for non-matching row/column indexes the NaN value is stored in the result. So, just like with Series objects, the data of two dataframes is aligned on the basis of matching row and column indexes, arithmetic is then performed, and for non-overlapping indexes the arithmetic operations result in NaN.
You can perform the add binary operation on two dataframe objects using either the + operator or add() as per the syntax <DF1>.add(<DF2>), which means <DF1> + <DF2>, or by using radd(), i.e., reverse add, as per the syntax <DF1>.radd(<DF2>), which means <DF2> + <DF1>.
You can perform the subtract binary operation on two dataframe objects using either the - (minus) operator or sub() as per the syntax <DF1>.sub(<DF2>), which means <DF1> - <DF2>, or by using rsub(), i.e., reverse subtract, as per the syntax <DF1>.rsub(<DF2>), which means <DF2> - <DF1>.
You can perform the multiply binary operation on two dataframe objects using either the * operator or mul() as per the syntax <DF1>.mul(<DF2>).
You can perform the division binary operation on two dataframe objects using either the / operator or div() as per the syntax <DF1>.div(<DF2>).
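A minimal sketch of these methods (df1 and df2 here are illustrative):
import pandas as pd
df1 = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
df2 = pd.DataFrame({'x': [10, 20], 'y': [30, 40]})
print(df1.add(df2))    # same as df1 + df2
#     x   y
# 0  11  33
# 1  22  44
print(df2.rsub(df1))   # reverse subtract: df1 - df2
#     x   y
# 0  -9 -27
# 1 -18 -36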
NOTE : Python integer types cannot store NaN values. To store a NaN value in a column, the datatype of the column is changed to a suitable non-integer type.
NOTE : If you are performing subtraction on two dataframes, make sure the data types of the values are subtraction-compatible (e.g., you cannot subtract two strings), otherwise Python will raise an error.
Some Other Essential Functions:
1. Inspection functions info() and describe() : To inspect broadly, or to get basic information about your dataframe object, you can use the info() and describe() functions.
<DF>.info() - info() gives the following information for a dataframe object:
• Its type. Obviously, it is an instance of a DataFrame.
• Index values. As each row of a dataframe object has an index, this information shows the assigned indexes.
• Number of rows in the dataframe object.
• Data columns and values in them. It lists the number of columns and the count of only non-NA values in them.
• Datatypes of each column. The listed datatypes are not necessarily in the corresponding order to the listed columns. You can, however, use the dtypes attribute to get the datatype of each column.
• Memory usage. Approximate amount of RAM used to hold the DataFrame.
<DF>.describe() - describe() gives the following information for a dataframe object having numeric columns:
• Count. Count of non-NA values in a column.
• Mean. Computed mean of values in a column.
• Std. Standard deviation of values in a column.
• Min. Minimum value in a column.
• 25%, 50%, 75%. Percentiles of values in that column.
• Max. Maximum value in a column.
The information returned by describe() for string columns includes:
• Count - the number of non-NA entries in the column.
• Unique - the number of unique entries in the column.
• Top - the most common entry in the column, i.e., the one with the highest frequency. If, however, multiple values have the same highest count, then the count and most common (i.e., top) pair will be arbitrarily chosen from among those with the highest count.
• Freq - the frequency of the most common element displayed as top above.
The default behavior of describe( ) is to only provide a summary for the numerical columns. You
can give include= 'all' as argument to describe( ) to list summary for all columns.
d = {'Name':pd.Series(['Tom','James','Ricky','Vin']),
     'Age':pd.Series([25,26,25,23]),
     'Rating':pd.Series([2.98,4.80,4.10,3.65])}
df = pd.DataFrame(d)
print(df.describe())
             Age    Rating
count   4.000000  4.000000
mean   24.750000  3.882500
std     1.258306  0.765436
min    23.000000  2.980000
25%    24.500000  3.482500
50%    25.000000  3.875000
75%    25.250000  4.275000
max    26.000000  4.800000
print(df.describe(include='all'))
        Name        Age    Rating
count      4   4.000000  4.000000
unique     4        NaN       NaN
top      Tom        NaN       NaN
freq       1        NaN       NaN
mean     NaN  24.750000  3.882500
std      NaN   1.258306  0.765436
min      NaN  23.000000  2.980000
25%      NaN  24.500000  3.482500
50%      NaN  25.000000  3.875000
75%      NaN  25.250000  4.275000
max      NaN  26.000000  4.800000
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
Name      4 non-null object
Age       4 non-null int64
Rating    4 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 120.0+ bytes
None
The describe() function gives the mean, std and IQR values, and by default excludes the character columns, giving a summary about the numeric columns. 'include' is the argument used to pass the necessary information regarding which columns need to be considered for summarizing. It takes a list of values; by default, 'number'.
• object − summarizes string columns
• number − summarizes numeric columns
• all − summarizes all columns together (should not be passed as a list value)
2. Retrieve head and tail rows using head() and tail() - You can use head() and tail() to retrieve the N top or N bottom rows respectively of a dataframe object. These functions are used as: <DF>.head([n=5]) or <DF>.tail([n=5]). You can give any value of N as per your need (as many rows as you want to list).
3. Cumulative Calculation Functions : You can use these functions for cumulative calculations on dataframe objects. cumsum() calculates the cumulative sum, i.e., in the output of this function, the value of each row is replaced by the sum of all prior rows including this row. String-value rows use concatenation. It is used as: <DF>.cumsum([axis=None])
df = pd.DataFrame([[25,26],[25,23]])
print(df.cumsum())
    0   1
0  25  26
1  50  49
In the same manner you can use cumprod() to get the cumulative product, cummax() to get the cumulative maximum and cummin() to get the cumulative minimum value from a dataframe object.
4. Index of Maximum and Minimum Values - You can get the index of the maximum and minimum values in columns using the idxmax() and idxmin() functions: <DF>.idxmax() or <DF>.idxmin()
df = pd.DataFrame([[52,26,54],[25,72,78],[25,2,82]])
print(df.idxmax())
print(df.idxmin())
0    0
1    1
2    2
dtype: int64
0    1
1    2
2    0
dtype: int64
Python Pandas – Sorting - There are two kinds of sorting available in Pandas:
1. By label
2. By actual value
By Label - Using the sort_index() method, by passing the axis argument and the order of sorting, a DataFrame can be sorted. By default, sorting is done on row labels in ascending order.
df = pd.DataFrame([[34,23],[76,43],[76,34],[78,99]], index=[1,4,6,2], columns=['col2','col1'])
print(df)
sortdf = df.sort_index()
print(sortdf)
   col2  col1
1    34    23
4    76    43
6    76    34
2    78    99
   col2  col1
1    34    23
2    78    99
4    76    43
6    76    34
Order of Sorting - By passing a Boolean value to the ascending parameter, the order of the sorting can be controlled. Let us consider the following example.
sortdf = df.sort_index(ascending=False)
print(sortdf)
   col2  col1
6    76    34
4    76    43
2    78    99
1    34    23
Sort the Columns - By passing the axis argument with a value 0 or 1, the sorting can be done on the column labels. By default, axis=0: sort by rows. Let us consider the following example.
sortdf = df.sort_index(axis=1)
print(sortdf)
   col1  col2
1    23    34
4    43    76
6    34    76
2    99    78
By Value - Like index sorting, sort_values() is the method for sorting by values. It accepts a 'by' argument which takes the column name(s) of the DataFrame with which the values are to be sorted.
df = pd.DataFrame({'name':['aman','raman','lucky','pawan'],
                   'city':['ajmer','ludhiana','jaipur','jalandhar'],
                   'state':['raj','pb','raj','pb']})
print(df)
sortdf = df.sort_values(by='state')
print(sortdf)
sortdf = df.sort_values(by=['state','city'])
print(sortdf)
    name       city state
0   aman      ajmer   raj
1  raman   ludhiana    pb
2  lucky     jaipur   raj
3  pawan  jalandhar    pb
    name       city state
1  raman   ludhiana    pb
3  pawan  jalandhar    pb
0   aman      ajmer   raj
2  lucky     jaipur   raj
    name       city state
3  pawan  jalandhar    pb
1  raman   ludhiana    pb
0   aman      ajmer   raj
2  lucky     jaipur   raj
Matching and Broadcasting Operations - You have read earlier that when you perform arithmetic operations on two Series type objects, the data is aligned on the basis of matching indexes and then the arithmetic is performed; for non-overlapping indexes, the arithmetic operations result in NaN (Not a Number). This is called data alignment in pandas objects. While performing arithmetic operations on dataframes, the same thing happens: whenever you add two dataframes, the data is aligned on the basis of matching indexes, the arithmetic is performed, and for non-overlapping indexes the result is NaN. This default behaviour of data alignment on the basis of matching indexes is called MATCHING.
While performing arithmetic operations, enlarging the smaller object operand by replicating its elements so as to match the shape of the larger object operand is called BROADCASTING.
<DF>.add(<DF>, axis='rows')
<DF>.div(<DF>, axis='rows')
<DF>.rdiv(<DF>, axis='rows')
<DF>.mul(<DF>, axis='rows')
<DF>.rsub(<DF>, axis='rows')
You can specify the matching axis for these operations (the default matching is on columns, i.e., when you do not give the axis argument).
Broadcasting using a scalar value
s = pd.Series(np.arange(5))
print(s * 10)
0     0
1    10
2    20
3    30
4    40
dtype: int32
df = pd.DataFrame({'a':[10,20], 'b':[5,15]})
print(df*10)
     a    b
0  100   50
1  200  150
So what is technically happening here is that the scalar value has been broadcast along the same dimensions of the Series and DataFrame above.
Broadcasting using a 1-D array - Say we have a 2-D dataframe; we can perform an operation along the x-axis by using a 1-D Series that is the same length as the row length:
df = pd.DataFrame({'a' : [10, 20], 'b' : [5, 15]})
print(df)
print(df.iloc[0])
print(df + df.iloc[0])
output
    a   b
0  10   5
1  20  15
a    10
b     5
Name: 0, dtype: int64
    a   b
0  20  10
1  30  20
The general rule is this: in order to broadcast, the size of the trailing axes for both arrays in an operation must either be the same, or one of them must be one.
So if I tried to add a 1-D array that didn't match in length, then unlike numpy, which would raise a ValueError, in Pandas you'll get a dataframe full of NaN values:
dt = pd.DataFrame([1])
print(df + dt)
Output:
     a   b   0
0  NaN NaN NaN
1  NaN NaN NaN
Now, one of the great things about pandas is that it will try to align using existing column names and row labels; this can get in the way of trying to perform a fancier broadcast like this:
print(df[['a']] + df.iloc[0])
    a   b
0  20 NaN
1  30 NaN
In the above we can see the problem when trying to broadcast using the first row: the alignment only aligns on the first column. To get the same form of broadcasting to occur as described above, we have to decompose to numpy arrays, which then become anonymous data:
print(df[['a']].values + df.iloc[0].values)
[[20 15]
 [30 25]]
Generally speaking, the thing to remember is that, aside from scalar values (which are simple), for n-D arrays the minor/trailing axes' lengths must match or one of them must be 1.
Handling Missing Data : Missing values are values that cannot contribute to any computation, or we can say that missing values are values that carry no computational significance. The Pandas library is designed to deal with huge amounts of data. In such volumes of data, there may be some values that are NA values, such as NULL, NaN or None. These values cannot participate in computation constructively. You can handle missing data in many ways; the most common ones are:
(i) Dropping missing data
(ii) Filling missing data (imputation)
You can use the isnull() and notnull() functions to detect missing values in a pandas object; they return True or False for each value in a pandas object according to whether it is a missing value or not. They can be used as:
<PandasObject>.isnull()
<PandasObject>.notnull()
<PandasObject> means it is applicable to both Series as well as DataFrame objects.
df = pd.DataFrame(np.arange(0,12).reshape(4,3), index=['a', 'c', 'e', 'f'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f'])
print(df['one'].isnull())
print("-------------------")
print(df['one'].notnull())
a    False
b     True
c    False
d     True
e    False
f    False
Name: one, dtype: bool
-------------------
a     True
b    False
c     True
d    False
e     True
f     True
Name: one, dtype: bool
Calculations with Missing Data
• When summing data, NA will be treated as zero.
• If the data are all NA, then the result will be zero.
df = pd.DataFrame(np.arange(0,12).reshape(4,3), index=['a', 'c', 'e', 'f'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df['one'].sum())
18.0
df = pd.DataFrame(index=[0,1,2,3,4,5], columns=['one','two'])
print(df['one'].sum())
0
Handling Missing Data - Dropping Missing Values : To drop missing values you can use dropna() in the following three ways:
(a) <PandaObject>.dropna(). This will drop all the rows that have NaN values in them, even a row with a single NaN value in it.
(b) <PandaObject>.dropna(how='all'). With the argument how='all', it will drop only those rows that have all NaN values, i.e., no value is non-null in those rows (see the sketch after the examples below).
(c) <PandaObject>.dropna(axis=1). With the argument axis=1, it will drop columns that have any NaN values in them. Using the argument how='all' along with axis=1 will drop columns with all NaN values.
df = pd.DataFrame(np.arange(0,12).reshape(4,3), index=['a', 'c', 'e', 'f'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna())
   one   two  three
a  0.0   1.0    2.0
c  3.0   4.0    5.0
e  6.0   7.0    8.0
f  9.0  10.0   11.0
df = pd.DataFrame(np.arange(0,12).reshape(4,3), index=['a', 'c', 'e', 'f'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna(axis=1))
Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]
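A short sketch of the how='all' form from point (b) above, using the same dataframe (only the all-NaN rows b, d, g and h are dropped):
df = pd.DataFrame(np.arange(0,12).reshape(4,3), index=['a', 'c', 'e', 'f'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna(how='all'))
#    one   two  three
# a  0.0   1.0    2.0
# c  3.0   4.0    5.0
# e  6.0   7.0    8.0
# f  9.0  10.0   11.0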
Handling Missing Data - Filling Missing Values : Though dropna() removes the null values, you also lose other non-null data with them. To avoid this, you may want to fill the missing data with some appropriate value of your choice. For this purpose you can use fillna() in the following ways:
(a) <PandaObject>.fillna(<n>). This will fill all NaN values with the given <n> value.
(b) Using a dictionary with fillna() to specify fill values for each column separately. You can create a dictionary that defines fill values for each of the columns, and then pass this dictionary as an argument to fillna(); Pandas will fill the specified value for each column defined in the dictionary. It will leave untouched those columns that are not in the dictionary. The syntax of fillna() is: <DF>.fillna(<dictionary having fill values for columns>)
df = pd.DataFrame(np.arange(0,12).reshape(4,3), index=['a', 'c', 'e', 'f'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f'])
print(df.fillna(-1))
print("--------------------")
print(df.fillna({'one':-1, 'two':-2, 'three':-3}))
   one   two  three
a  0.0   1.0    2.0
b -1.0  -1.0   -1.0
c  3.0   4.0    5.0
d -1.0  -1.0   -1.0
e  6.0   7.0    8.0
f  9.0  10.0   11.0
--------------------
   one   two  three
a  0.0   1.0    2.0
b -1.0  -2.0   -3.0
c  3.0   4.0    5.0
d -1.0  -2.0   -3.0
e  6.0   7.0    8.0
f  9.0  10.0   11.0
Missing data / operations with fill values
In Series and DataFrame, the arithmetic functions have the option of inputting a fill_value, namely a value to substitute when at most one of the values at a location is missing. For example, when adding two DataFrame objects, you may wish to treat NaN as 0 unless both DataFrames are missing that value, in which case the result will be NaN.
df1 = pd.DataFrame([[10,20,np.NaN], [np.NaN,30,40]], index=['A','B'], columns=['One', 'Two', 'Three'])
df2 = pd.DataFrame([[np.NaN,20,30], [50,40,30]], index=['A','B'], columns=['One', 'Two', 'Three'])
print(df1)
print(df2)
print(df1 + df2)
print(df1.add(df2, fill_value=0))
    One  Two  Three
A  10.0   20    NaN
B   NaN   30   40.0
    One  Two  Three
A   NaN   20     30
B  50.0   40     30
   One  Two  Three
A  NaN   40    NaN
B  NaN   70   70.0
    One  Two  Three
A  10.0   40   30.0
B  50.0   70   70.0
Flexible Comparisons
Series and DataFrame have the binary comparison methods eq, ne, lt, gt, le, and ge whose behavior is
analogous to the binary arithmetic operations described above:
print(df1.gt(df2))
     One    Two  Three
A  False  False  False
B  False  False   True
print(df2.ne(df1))
    One    Two  Three
A  True  False   True
B  True   True   True
NOTE : eq = equal to, ne = not equal to, lt = less than, le = less than or equal to,
gt = greater than, ge = greater than or equal to.
Comparisons of Pandas Objects – Normal comparison operators ( == ) do not produce an
accurate result when comparing two similar objects that contain NaN values, because NaN never
compares equal to NaN. To compare objects having NaN values, it is better to use equals( ),
which returns True even when two NaN values are compared for equality :
<expression 1 yielding a Pandas object>.equals( <expression 2 yielding a Pandas object> )
NOTE : The equals( ) method tests two objects for equality, with NaNs in corresponding locations
treated as equal.
NOTE : The Series or DataFrame indexes of the two objects need to be the same, and in the same
order, for equality to be True. Also, the two Pandas objects being compared should be of the same
length. Trying to compare two dataframes with different numbers of rows, or two series objects
with different lengths, will result in a ValueError.
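A minimal sketch of the difference (illustrative values, not from the text) :
import pandas as pd
import numpy as np

s1 = pd.Series([1, np.nan, 3])
s2 = pd.Series([1, np.nan, 3])
print((s1 == s2).all())   # False : NaN == NaN is always False
print(s1.equals(s2))      # True  : NaNs in matching positions treated as equal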
Boolean Reductions - Pandas offers Boolean reductions that summarize a Boolean result for
an axis. That is, with a Boolean reduction, you can get the overall result of a row or column as a
single True or False. Boolean reductions are a way to summarize all comparison results of a
dataframe's individual elements in the form of a single overall Boolean result per column or per
row. For this purpose Pandas offers the following Boolean reduction functions and attributes.
• empty. This attribute indicates whether a DataFrame is empty. It is used as : <DF>.empty
• any( ). This function returns True if any element is True over the requested axis. By default it
checks columns (the default axis is 0); if any of the values along the specified axis is True, it
returns True. It is used as per the following syntax :
<DataFrame comparison result object>.any( axis=None )
• all( ). Unlike any( ), all( ) returns True only if all the values on an axis are True according to
the given comparison. It is used as per the syntax :
<DataFrame comparison result object>.all( axis=None )
You can apply the reductions empty, any( ), all( ) and bool( ) to provide a way to summarize a
Boolean result.
df1 = pd.DataFrame([[10,20,np.NaN], [np.NaN,30,40]], index=['A','B'], columns=['One', 'Two', 'Three'])
df2 = pd.DataFrame([[np.NaN,20,30], [50,40,30]], index=['A','B'], columns=['One', 'Two', 'Three'])
print((df1 > 0).all())
One      False
Two       True
Three    False
dtype: bool
print((df1 > 0).any())
One      True
Two      True
Three    True
dtype: bool
You can reduce to a final boolean value :
print((df1 > 0).any().any())
True
You can test if a pandas object is empty, via the empty property :
print(df1.empty)
False
Combining DataFrames
combine_first( ) - The combine_first( ) method combines two dataframes by patching the
data : if a certain cell of the first dataframe has missing data and the corresponding cell (the
one with the same index and column id) in the other dataframe has some valid data, this method
will pick the valid data from the second dataframe and patch it into the first dataframe so that it
too has valid data in that cell.
The combine_first( ) is used as per the syntax :
<DF>.combine_first(<DF2>)
df1 = pd.DataFrame({'A': 10, 'B': 20, 'C':np.nan, 'D':np.nan}, index=[0])
df2 = pd.DataFrame({'B': 30, 'C': 40, 'D': 50}, index=[0])
df1 = df1.combine_first(df2)
print("\n------------ combine_first ----------------\n")
print(df1)
    A   B     C     D
0  10  20  40.0  50.0
concat( ) - The concat( ) can concatenate two dataframes along the rows or along the columns.
This method is useful if the two dataframes have similar structures. It is used as per the
following syntax :
pd.concat( [<df1>, <df2>] )
pd.concat( [<df1>, <df2>], ignore_index = True )
pd.concat( [<df1>, <df2>], axis = 1 )
If you skip the axis = 1 argument, it will join the two dataframes along the rows, i.e., the result
will be the union of rows from both the dataframes.
If you do not want the original row indexes to be retained and instead want new row indexes
generated from 0 to n - 1, give the argument ignore_index = True.
By default it concatenates along the rows ; to concatenate along the columns, give the
argument axis = 1.
Pandas provides various facilities for easily combining together Series, DataFrame,
and Panel objects.
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False)
• objs − This is a sequence or mapping of Series, DataFrame, or Panel objects.
• axis − {0, 1, ...}, default 0. This is the axis to concatenate along.
• join − {'inner', 'outer'}, default 'outer'. How to handle indexes on the other axis(es). Outer
for union and inner for intersection.
• ignore_index − boolean, default False. If True, do not use the index values on the
concatenation axis. The resulting axis will be labeled 0, ..., n - 1.
• join_axes − This is the list of Index objects. Specific indexes to use for the other (n-1)
axes instead of performing inner/outer set logic.
Concatenating Objects - The concat function does all of the heavy lifting of performing
concatenation operations along an axis. Let us create different objects and do concatenation.
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,87]}, index=[1,2,3])
two = pd.DataFrame({'Name': ['Billy', 'Brian', 'Bran'], 'sub_id':['sub2','sub4','sub3'], 'Score':[89,80,79]}, index=[1,2,3])
print(pd.concat([one,two]))
    Name  Score sub_id
1   Alex     98   sub1
2    Amy     90   sub2
3  Allen     87   sub4
1  Billy     89   sub2
2  Brian     80   sub4
3   Bran     79   sub3
Suppose we wanted to associate specific keys with each of the pieces of the
chopped up DataFrame. We can do this by using the keys argument −
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,87]}, index=[1,2,3])
two = pd.DataFrame({'Name': ['Billy', 'Brian', 'Bran'], 'sub_id':['sub2','sub4','sub3'], 'Score':[89,80,79]}, index=[1,2,3])
print(pd.concat([one,two], keys=['x','y']))
      Name  Score sub_id
x 1   Alex     98   sub1
  2    Amy     90   sub2
  3  Allen     87   sub4
y 1  Billy     89   sub2
  2  Brian     80   sub4
  3   Bran     79   sub3
The index of the resultant is duplicated; each index is repeated.
If the resultant object has to follow its own indexing, set ignore_index to True.
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,87]}, index=[1,2,3])
two = pd.DataFrame({'Name': ['Billy', 'Brian', 'Bran'], 'sub_id':['sub2','sub4','sub3'], 'Score':[89,80,79]}, index=[1,2,3])
print(pd.concat([one,two], keys=['x','y'], ignore_index=True))
    Name  Score sub_id
0   Alex     98   sub1
1    Amy     90   sub2
2  Allen     87   sub4
3  Billy     89   sub2
4  Brian     80   sub4
5   Bran     79   sub3
Observe, the index changes completely and the keys are also overridden.
If two objects need to be concatenated along axis=1, the new columns will be appended.
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,87]}, index=[1,2,3])
two = pd.DataFrame({'Name': ['Billy', 'Brian', 'Bran'], 'sub_id':['sub2','sub4','sub3'], 'Score':[89,80,79]}, index=[1,2,3])
print(pd.concat([one,two], keys=['x','y'], axis=1))
       x                   y
    Name Score sub_id   Name Score sub_id
1   Alex    98   sub1  Billy    89   sub2
2    Amy    90   sub2  Brian    80   sub4
3  Allen    87   sub4   Bran    79   sub3
Concatenating Using append - A useful shortcut to concat is the append instance method
on Series and DataFrame. These methods actually predate concat. They concatenate along
axis=0, namely the index. (Note : append( ) was deprecated in pandas 1.4 and removed in
pandas 2.0 ; in recent versions use pd.concat( ) instead.)
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,87]}, index=[1,2,3])
two = pd.DataFrame({'Name': ['Billy', 'Brian', 'Bran'], 'sub_id':['sub2','sub4','sub3'], 'Score':[89,80,79]}, index=[1,2,3])
print(one.append(two))
    Name  Score sub_id
1   Alex     98   sub1
2    Amy     90   sub2
3  Allen     87   sub4
1  Billy     89   sub2
2  Brian     80   sub4
3   Bran     79   sub3
The append function can take multiple objects as well −
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,87]}, index=[1,2,3])
two = pd.DataFrame({'Name': ['Billy', 'Brian', 'Bran'], 'sub_id':['sub2','sub4','sub3'], 'Score':[89,80,79]}, index=[1,2,3])
print(one.append([two,one]))
    Name  Score sub_id
1   Alex     98   sub1
2    Amy     90   sub2
3  Allen     87   sub4
1  Billy     89   sub2
2  Brian     80   sub4
3   Bran     79   sub3
1   Alex     98   sub1
2    Amy     90   sub2
3  Allen     87   sub4
merge( ) - The merge( ) function lets you specify the field on the basis of which you want to
combine the two dataframes. It is used as per the syntax :
pd.merge( <DF1>, <DF2> )
or
pd.merge( <DF1>, <DF2>, on = <field_name> )
If you skip the argument on = <field_name>, the merge will take place on the common fields of
the two dataframes, but you can explicitly specify the field on the basis of which you want to
merge the two dataframes.
Pandas has full-featured, high performance in-memory join operations idiomatically very similar
to relational databases like SQL.
Pandas provides a single function, merge, as the entry point for all standard database join
operations between DataFrame objects −
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False,
right_index=False, sort=True)
Here, we have used the following parameters −
• left − A DataFrame object.
• right − Another DataFrame object.
• on − Columns (names) to join on. Must be found in both the left and right DataFrame
objects.
• left_on − Columns from the left DataFrame to use as keys. Can either be column names or
arrays with length equal to the length of the DataFrame.
• right_on − Columns from the right DataFrame to use as keys. Can either be column names
or arrays with length equal to the length of the DataFrame.
• left_index − If True, use the index (row labels) from the left DataFrame as its join key(s).
In case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match
the number of join keys from the right DataFrame.
• right_index − Same usage as left_index for the right DataFrame.
• how − One of 'left', 'right', 'outer', 'inner'. Defaults to inner. Each method has been
described below.
• sort − Sort the result DataFrame by the join keys in lexicographical order. Defaults to True,
setting to False will improve the performance substantially in many cases.
Let us now create two different DataFrames and perform the merging operations on it.
left = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,87]}, index=[1,2,3])
right = pd.DataFrame({'Name': ['Billy', 'Brian', 'Bran'], 'sub_id':['sub2','sub4','sub3'], 'Score':[89,80,79]}, index=[1,2,3])
print(left)
print(right)
    Name  Score sub_id
1   Alex     98   sub1
2    Amy     90   sub2
3  Allen     87   sub4
    Name  Score sub_id
1  Billy     89   sub2
2  Brian     80   sub4
3   Bran     79   sub3
Merge Two DataFrames on a Key
print(pd.merge(left, right, on='sub_id'))
  Name_x  Score_x sub_id Name_y  Score_y
0    Amy       90   sub2  Billy       89
1  Allen       87   sub4  Brian       80
Merge Two DataFrames on Multiple Keys – Pass a list of column names to on ; each name
must exist in both dataframes, and rows are matched only where all the key values are equal.
For example :
print(pd.merge(left, right, on=['Name', 'sub_id']))
(With the data above this returns an empty dataframe, since no row has the same Name and
sub_id in both dataframes.)
Merge Using 'how' Argument - The how argument to merge specifies how to determine
which keys are to be included in the resulting table. If a key combination does not appear in
either the left or the right tables, the values in the joined table will be NA. Here is a summary of
the how options and their SQL equivalent names −
Merge Method   SQL Equivalent     Description
left           LEFT OUTER JOIN    Use keys from left object
right          RIGHT OUTER JOIN   Use keys from right object
outer          FULL OUTER JOIN    Use union of keys
inner          INNER JOIN         Use intersection of keys
Left Join
print(pd.merge(left, right, on='sub_id', how='left'))
  Name_x  Score_x sub_id Name_y  Score_y
0   Alex       98   sub1    NaN      NaN
1    Amy       90   sub2  Billy     89.0
2  Allen       87   sub4  Brian     80.0
Right Join
print(pd.merge(left, right, on='sub_id', how='right'))
  Name_x  Score_x sub_id Name_y  Score_y
0    Amy     90.0   sub2  Billy       89
1  Allen     87.0   sub4  Brian       80
2    NaN      NaN   sub3   Bran       79
Outer Join
print(pd.merge(left, right, on='sub_id', how='outer'))
  Name_x  Score_x sub_id Name_y  Score_y
0   Alex     98.0   sub1    NaN      NaN
1    Amy     90.0   sub2  Billy     89.0
2  Allen     87.0   sub4  Brian     80.0
3    NaN      NaN   sub3   Bran     79.0
Inner Join
print(pd.merge(left, right, on='sub_id', how='inner'))
  Name_x  Score_x sub_id Name_y  Score_y
0    Amy       90   sub2  Billy       89
1  Allen       87   sub4  Brian       80
Joining is performed on the index. The join operation honors the object on which it is called,
so a.join(b) is not equal to b.join(a), as the sketch below shows.
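A minimal illustration of this asymmetry (hypothetical frames, not from the text) :
import pandas as pd

left = pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])
right = pd.DataFrame({'B': [3, 4]}, index=['y', 'z'])
# join() matches rows on the index; by default it is a left join,
# so the calling object's index drives the result
print(left.join(right))    # index x, y : B is NaN for x
print(right.join(left))    # index y, z : A is NaN for z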
Statistics Functions & Description - Let us now understand the functions under Descriptive
Statistics in Python Pandas. The following table lists the important functions −
Function     Description
count()      Number of non-null observations
sum()        Sum of values
mean()       Mean of values
median()     Median of values
mode()       Mode of values
std()        Standard deviation of the values
min()        Minimum value
max()        Maximum value
abs()        Absolute value
prod()       Product of values
cumsum()     Cumulative sum
cumprod()    Cumulative product
var()        Variance of a given set of numbers
quantile()   Return values at the given quantile over the requested axis
df = pd.DataFrame({'Name':['abc', 'lmn', 'xyz'], 'Science':[67,78,56], 'IP':[97,98,99]}, index=['a', 'b', 'c'],
columns=['Name', 'Science', 'IP'])
print('sum ', df['IP'].sum())
print('count ', df['IP'].count())
print('mean ', df['IP'].mean())
print('median ', df['IP'].median())
print('mode ', df['IP'].mode())
print('std ', df['IP'].std())
print('min ', df['IP'].min())
print('max ', df['IP'].max())
print('abs ', df['IP'].abs())
print('prod ', df['IP'].prod())
print('cumsum ', df['IP'].cumsum())
print('cumprod ', df['IP'].cumprod())
print('var ', df['IP'].var())
print('quantile ', df['IP'].quantile())
sum  294
count  3
mean  98.0
median  98.0
mode  0    97
1    98
2    99
dtype: int64
std  1.0
min  97
max  99
abs  a    97
b    98
c    99
Name: IP, dtype: int64
prod  941094
cumsum  a     97
b    195
c    294
Name: IP, dtype: int64
cumprod  a        97
b      9506
c    941094
Name: IP, dtype: int64
var  1.0
quantile  98.0
min() and max() - The min( ) and max( ) functions find out the minimum or maximum value
respectively from a given set of data. The syntax is :
<dataframe>.min(axis = None, skipna = None, numeric_only = None)
<dataframe>.max(axis = None, skipna = None, numeric_only = None)
axis           {index (0), columns (1)} ; by default, the minimum or maximum is calculated
               along axis 0.
skipna         (True or False) Exclude NA/null values when computing the result.
numeric_only   (True or False) Include only float, int, boolean columns. If None, will attempt
               to use everything, then use only numeric data.
mode( ), mean( ), median( ) - Function mode() returns the mode value (i.e., the value that
appears most often) from a set of values. Syntax :
<dataframe>.mode(axis = 0, numeric_only = False)
Function mean() returns the computed mean (average) from a set of values. Function
median() returns the middle number from a set of numbers. Syntax :
<dataframe>.mean(axis = None, skipna = None, numeric_only = None)
<dataframe>.median(axis = None, skipna = None, numeric_only = None)
The function count() counts the non-NA entries for each row or column. The values None, NaN,
NaT etc., are considered as NA in pandas. Syntax :
<dataframe>.count(axis = 0, numeric_only = False)
The function sum() returns the sum of the values for the requested axis. Syntax :
<dataframe>.sum(axis = None, skipna = None, numeric_only = None, min_count = 0)
min_count - int, default 0 ; the required number of valid values to perform the operation. If
fewer than min_count non-NA values are present, the result will be NA. (With the default of 0,
the sum of an all-NA or empty series is 0 ; setting min_count = 1 makes it NaN instead.)
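A quick sketch of the min_count behaviour (illustrative values) :
import pandas as pd
import numpy as np

s = pd.Series([np.nan, np.nan])
print(s.sum())              # 0.0 : default min_count=0
print(s.sum(min_count=1))   # nan : fewer than 1 non-NA value present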
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,np.NaN]}, index=[1,2,3])
print('----count----')
print(one.count())
print('----skipna-False----')
print(one.sum(skipna=False))
print('----skipna-True----')
print(one.sum(skipna=True))
print('----numeric-only----')
print(one.count(numeric_only=True))
----count----
Name      3
Score     2
sub_id    3
dtype: int64
----skipna-False----
Name      AlexAmyAllen
Score              NaN
sub_id    sub1sub2sub4
dtype: object
----skipna-True----
Name      AlexAmyAllen
Score              188
sub_id    sub1sub2sub4
dtype: object
----numeric-only----
Score    2
dtype: int64
quantile( ) and var( ) - The quantile( ) function returns the values at the given quantiles
over the requested axis (axis 0 or 1). Quantiles are points in a distribution that relate to the rank
order of values in that distribution. The quantile of a value is the fraction of observations less
than or equal to the value. The quantile of the median is 0.5, by definition. The 0.25 quantile
(also known as the 25th percentile ; percentiles are just quantiles multiplied by 100) and the
0.75 quantile are known as quartiles, and the difference between them is the interquartile
range. Syntax :
<dataframe>.quantile(q = 0.5, axis = 0, numeric_only = True)
The var( ) function computes variance and returns the unbiased variance over the requested axis.
<dataframe>.var(axis = None, skipna = None, numeric_only = None)
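A small sketch of both functions on a single column (illustrative values) :
import pandas as pd

df = pd.DataFrame({'IP': [97, 98, 99]})
print(df['IP'].quantile())      # 98.0 : default q=0.5, i.e. the median
print(df['IP'].quantile(0.25))  # 97.5 : first quartile
print(df['IP'].var())           # 1.0  : unbiased variance (divides by n-1)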
Applying Functions on a Subset of a Dataframe :
Applying Functions on a Column of a DataFrame :
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,np.NaN]}, index=[1,2,3])
print(one['Score'].sum())
188.0
Applying Functions on Multiple Columns of a DataFrame :
print(one[['Name','Score']].count())
Name     3
Score    2
dtype: int64
Applying Functions on a Row of a DataFrame :
<dataframe>.loc[ <row index>, : ]
sal_df.loc['Qtr2', :].max()
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,np.NaN]}, index=['a','b','c'])
print(one.loc['a', :].count())
3
Applying Functions on a Range of Rows of a DataFrame :
<dataframe>.loc[ <start row> : <end row>, : ]
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,np.NaN]}, index=['a','b','c'])
print(one.loc['a':'b', :].count())
Name      2
Score     2
sub_id    2
dtype: int64
Applying Functions to a Subset of the DataFrame :
<dataframe>.loc[ <start row> : <end row>, <start column> : <end column> ]
one = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen'], 'sub_id':['sub1','sub2','sub4'], 'Score':[98,90,np.NaN]}, index=['a','b','c'])
print(one.loc['a':'b', 'Name':'Score'].count())
Name     2
Score    2
dtype: int64
Python lambda - We can create anonymous functions, known as lambda functions. Lambda
functions are different from normal Python functions ; they originate from lambda calculus and
allow you to write very short functions. This code shows the use of a lambda function :
f = lambda x : 2 * x
print(f(3))
6
A return statement is never used in a lambda function ; a lambda always returns the value of
its expression. A lambda function may also contain a conditional expression :
f = lambda x: x > 10
print(f(2))
print(f(12))
False
True
Transferring Data between .csv Files and DataFrames
The acronym CSV is short for Comma-Separated Values. The CSV format refers to tabular
data that has been saved as plaintext where data is separated by commas.
The CSV format is popular as it offers the following advantages :
• A simple, compact and ubiquitous format for data storage.
• A common format for data interchange.
• It can be opened in popular spreadsheet packages like MS-Excel, Calc etc.
• Nearly all spreadsheets and databases support import/export to the CSV format.
Loading data from CSV to Dataframe - Python's Pandas library offers two functions,
read_csv() and to_csv(), that help you bring data from a CSV file into a dataframe and write a
dataframe's data to a CSV file.
Reading From a CSV File to Dataframe - You can use the read_csv() function to read data from
a CSV file into your dataframe as per the following syntax :
<DF> = pandas.read_csv( <filepath> )
df = pd.read_csv("c:\\data\\sample.csv")
Reading a CSV File and Specifying Own Column Names - You may have a CSV file that does
not have a top row containing column headers. For such a situation, you can specify your own
column headings in read_csv( ) using the names argument as per the following syntax :
<DF> = pandas.read_csv( <filepath>, names = <sequence> )
df2 = pd.read_csv("c:\\data\\mydata.csv", names=["Rollno", "First_Name", "Last_Name"])
If you want the first row not to be used as a header, and at the same time you do not want to
specify column headings but rather go with the default column headings 0, 1, 2, 3, ..., then
simply give the argument header = None in read_csv( ), i.e., as :
df3 = pd.read_csv("c:\\data\\mydata.csv", header=None)
If you want to skip row 1 while fetching data from the CSV file, you can use :
df5 = pd.read_csv("c:\\data\\mydata.csv", names=["Rollno", "Name", "Marks"], skiprows=1)
Reading a Specified Number of Rows from a CSV file – Giving the argument nrows = <n> in
read_csv( ) will read the specified number of rows from the CSV file.
df6 = pd.read_csv("c:\\data\\mydata.csv", names=["Rollno", "Name", "Surname"], nrows=3)
Reading from CSV files having a Separator Different from Comma – You need to specify an
additional argument sep = <separator character>. If you skip this argument, the default
separator character (comma) is assumed.
dfNew = pd.read_csv("c:\\data\\match.csv", sep=';', names=["Country1", "Stat", "Country2"])
We can summarize read_csv( ) as follows :
<DF> = pandas.read_csv( <filepath>, sep = <separator character>, names = <column names
sequence>, header = None, skiprows = <n>, nrows = <n> )
Storing a Dataframe's Data to a CSV File - Sometimes we have data available in a dataframe
and we want to save that data in a CSV file. For this purpose, Python Pandas provides the
to_csv() function, which saves the data of a dataframe in a CSV file.
<DF>.to_csv( <filepath> )
or
<DF>.to_csv( <filepath>, sep = <separator_character> )
The separator character must be a one-character string only.
df7.to_csv("c:\\data\\new2.csv", sep="|")
Handling NaN Values with to_csv( ) - By default, missing/NaN values are stored as empty
strings in the CSV file. You can specify your own string to be written for missing/NaN values
by giving the argument na_rep = <string>.
df7.to_csv("c:\\data\\new3.csv", sep="|", na_rep="NULL")
NOTE : Note that the function read_csv( ) is a pandas function while to_csv( ) is a dataframe
object's function.
We can summarise to_csv( ) as follows, where the sep and na_rep arguments are optional :
<DF>.to_csv( <filepath>, sep = <separator character>, na_rep = <string> )
Transferring Data Between Dataframes and SQL Databases - An SQL database is a
relational database having data in tables called relations, and it uses a special type of query
language, Structured Query Language, to query and manipulate data and to communicate with
the database.
Brief Introduction to SQLite Database - SQLite is an embedded SQL database engine, which
implements a relational database management system that is self-contained, serverless and
requires zero configuration. In other words, SQLite does not have a separate server process
and it implements SQL databases in a very compact way. The SQLite database is in the public
domain and is freely available for any type of use.
Bringing Data from an SQL Database Table into a Dataframe – In order to read data from an
SQL database such as SQLite, Python comes equipped with the sqlite3 library.
(i) Import the sqlite3 library by giving the Python statement :
import sqlite3 as sq
(ii) Make a connection to the SQL database : conn = sq.connect("C:\\sqlite3\\new.db")
(iii) Read data from a table into a dataframe using read_sql( ) as per the following syntax :
df = pd.read_sql("SELECT * FROM Stud;", conn)
(iv) Print the values : print(df)
NOTE : You can also give the database name as ":memory:" while creating the connection. This
database will reside in RAM (i.e., a temporary database) rather than on the hard disk.
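Putting the steps together as one runnable sketch (it assumes the database file new.db and the
table Stud from the steps above exist) :
import sqlite3 as sq
import pandas as pd

conn = sq.connect("C:\\sqlite3\\new.db")       # assumed database file
df = pd.read_sql("SELECT * FROM Stud;", conn)  # assumed table name
print(df)
conn.close()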
Storing a Dataframe's Data as a Table in an SQL Database –
(i) Import the sqlite3 library by giving the Python statement :
import sqlite3 as sq
(ii) Make a connection to the SQL database : conn = sq.connect("C:\\sqlite3\\new.db")
(iii) Write the dataframe in the form of a table by using to_sql( ) as per the following syntax :
dft4.to_sql("metros", conn)
If you run to_sql( ) with a table name that already exists, you must specify the argument
if_exists = "append" or if_exists = "replace", otherwise Python will give an ERROR. If you set
the value as "append", then the new data will be appended to the existing table ; if you set the
value as "replace", then the new data will replace the old data in the given table. For example,
if we have another dataframe dtf5 with additional rows :
dtf5.to_sql("metros", conn, if_exists="append")
Advanced Operations on DataFrame :
PIVOTING - The pivoting technique rearranges the data from rows and columns, possibly
aggregating data from multiple sources, in a report form (with rows transferred to columns) so
that data can be viewed from a different perspective. Pivoting is actually a summary technique
that works on tabular data.
Syntax : <dataframe>.pivot(index=<columnname>, columns=<columnname>, values=<columnname>)
The result of the pivot() function has the index-rows as per the index argument, columns as per
the values of the columns argument, and the values created from the values argument (see
above). Cells in the pivoted table which do not have a matching entry in the original one are set
to NaN.
The pivot() function returns the result in the form of a newly created dataframe, i.e., you may
store the result in a dataframe.
NOTE : With pivot( ), if there are multiple entries for a columns value for the same value of the
index (row), it leads to an error. Hence, before you use pivot( ), you should ensure that the data
does not have rows with duplicate values for the specified columns.
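A minimal sketch of pivot( ) on a small long-format table (illustrative data in the spirit of the sale
report used below, not the book's dataset) :
import pandas as pd

df = pd.DataFrame({'NAME': ['ALTO', 'ALTO', 'ZEN', 'ZEN'],
                   'YEAR': [2016, 2017, 2016, 2017],
                   'SALE': [45, 43, 30, 35]})
# one row per NAME, one column per YEAR; each (NAME, YEAR) pair must be unique
wide = df.pivot(index='NAME', columns='YEAR', values='SALE')
print(wide)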
Using the pivot_table( ) Function - For data having multiple values for the same row and
column combination, you can use another pivoting function - the pivot_table() function.
The pivot_table( ) is also a pivoting function, which, like pivot( ), produces a pivoted table,
BUT it is different from the pivot( ) function in the following ways :
(i) It does not raise errors for multiple entries of a row, column combination.
(ii) It aggregates the multiple entries present for a row-column combination ; you need to
specify what type of aggregation you want (sum, mean etc.).
pandas.pivot_table(<dataframe>, values=None, index=None, columns=None, aggfunc='mean')
or
<dataframe>.pivot_table(values=None, index=None, columns=None, aggfunc='mean')
Where : the index argument contains the column name for rows.
the columns argument contains the column name for columns.
the values argument contains the column names for the data of the pivoted table.
the aggfunc argument contains the function as per which the data is to be aggregated ; if
skipped, it will by default compute the mean of the multiple entries for the same row-column
combination.
Being able to quickly summarize hundreds of rows and columns can save you a lot of time and
frustration. A simple tool you can use to achieve this is a pivot table, which helps you slice, filter,
and group data at the speed of inquiry and represent the information in a visually appealing way.
Introducing our data set: Maruti Sale Report - Some interesting questions we might like to
answer are :
• Which are the most liked and least liked cars according to regions in India?
• Is sale affected by region?
• Did the sale change significantly over the past five years?
Let's import our data and take a quick first look :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# reading the data
data = pd.read_csv('c:\\python\\pivotsale.csv', index_col=0)
# sort the df by ascending name and descending zone
data.sort_values(["NAME", "ZONE"], ascending=[True, False])
# display first 10 rows
print(data.head(10))
       NAME  YEAR   ZONE  SALE
S.NO.
1      ALTO  2016   EAST    45
2      ALTO  2017   EAST    43
3      ALTO  2018   EAST    76
4      ALTO  2019   EAST    23
5      ALTO  2016   WEST    56
6      ALTO  2017   WEST    34
7      ALTO  2018   WEST    65
8      ALTO  2019   WEST    87
9      ALTO  2016  NORTH    34
10     ALTO  2017  NORTH    67
print(pd.pivot_table(data, index='NAME', values="SALE"))
           SALE
NAME
800     56.3125
ALTO    57.0000
BALENO  60.6875
ESTEEM  57.3750
K10     61.7500
ZEN     54.8125
Creating a multi-index pivot table
print(pd.pivot_table(data, index=['NAME','YEAR'], values="SALE"))
This is one way to look at the data, but we can use the columns parameter to get a better
display :
• columns is the column, grouper, array, or list of the previous you'd like to group your data by.
Using it will spread the different values horizontally.
Using YEAR as the columns argument will display the different values for year and will make
for a much better display, like so :
print(pd.pivot_table(data, index='NAME', columns='YEAR', values="SALE"))
Visualizing the pivot table using plot() - If you want to look at a visual representation of the
previous pivot table we created, all you need to do is add plot() at the end of the pivot_table
function call.
pd.pivot_table(data, index='NAME', columns='YEAR', values="SALE").plot(kind='bar')
plt.ylabel("SALE")
plt.show()
The visual representation helps reveal that the differences are minor.
Manipulating the data using aggfunc
Up until now we've used the average to get insights about the data, but there are other important
values to consider. Time to experiment with the aggfunc parameter :
• aggfunc (optional) accepts a function or list of functions you'd like to use on your group
(default: numpy.mean). If a list of functions is passed, the resulting pivot table will have
hierarchical columns whose top level are the function names.
Let's add the median, minimum, maximum, and the standard deviation for each "NAME". This
can help us evaluate how accurate the average is, and whether it's really representative of the
real picture.
print(pd.pivot_table(data, index='NAME', values="SALE", aggfunc=[np.mean, np.median, min, max, np.std]))
Categorizing using string manipulation - Up until now we've grouped our data according to
the categories in the original table. However, we can search the strings in the categories to
create our own groups.
table = pd.pivot_table(data, index='NAME', columns='YEAR', values="SALE")
print(table[table.index.str.endswith('O')])
Adding total rows/columns
The last two parameters are both optional and mostly useful to improve display :
• margins is type boolean and allows you to add an all row / column, e.g. for subtotals / grand
totals (default False)
• margins_name is type string and accepts the name of the row / column that will contain the
totals when margins is True (default 'All')
Let's use these to add a total to our last table.
print(pd.pivot_table(data, index=['NAME', 'ZONE'], columns='YEAR', values="SALE", aggfunc='sum', fill_value=0, margins=True, margins_name='Total count'))
Let's summarize
If you're looking for a way to inspect your data from a different perspective then pivot_table is the
answer. It's easy to use, it's useful for both numeric and categorical values, and it can get you
results in one line of code.
Sorting - Sorting refers to arranging values in a particular order. The values can be sorted on
the basis of a specific column or columns, in ascending or descending order.
Pandas makes available the sort_values() function for this purpose, which can be used as per
the following syntax :
<dataframe>.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
by            str or list of str ; name or list of names to sort by. If axis is 0 or 'index' then by
              may contain index levels and/or column labels ; if axis is 1 or 'columns' then by
              may contain column levels and/or index labels.
axis          {0 or 'index', 1 or 'columns'}, default 0 ; axis to be sorted.
ascending     bool or list of bool, default True ; sort ascending vs. descending. Specify a list
              for multiple sort orders ; if this is a list of bools, it must match the length of the
              by argument.
inplace       bool, default False ; if True, perform the operation in-place (on the dataframe itself).
na_position   {'first', 'last'}, default 'last' ; 'first' puts NaNs at the beginning, 'last' puts NaNs
              at the end.
• You can use the inplace parameter if you want to store the sorted data in the dataframe itself.
• Use the na_position parameter to specify the position of NaN values, which is by default 'last'
but can be set to 'first'. A short sketch follows below.
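A minimal sort_values( ) sketch (illustrative values, not from the text) :
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Amy', 'Alex', 'Allen'],
                   'Score': [90, 98, np.nan]})
# sort by Score, highest first, keeping the NaN row at the top
print(df.sort_values(by='Score', ascending=False, na_position='first'))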
Aggregation - With large amounts of data, most often we need to aggregate data so as to
analyse it effectively. Pandas offers many aggregate functions, using which you can aggregate
data and get summary statistics of the data.
S.No.   Aggregation         Description
1.      count()             Total number of items
2.      sum()               Sum of all items
3.      mean(), median()    Mean and median
4.      min(), max()        Minimum and maximum
5.      std(), var()        Standard deviation and variance
6.      mad()               Mean absolute deviation
The mad( ) function is used to calculate the mean absolute deviation of the values for the
requested axis. The Mean Absolute Deviation (MAD) of a set of data is the average distance
between each data value and the mean. You can use mad( ) as per the following syntax :
<dataframe>.mad(axis = None, skipna = None)
axis     {index (0), columns (1)}, default 0
skipna   boolean, default True ; exclude NA/null values. If an entire row/column is NA, the
         result will be NA.
The std() function calculates the standard deviation of a given set of numbers ; it can be
applied to a whole dataframe, to a single column, or across rows. A small sketch of both
functions follows.
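A minimal sketch of std( ) and mad( ) on one column (illustrative values ; note that mad( ) was
removed in pandas 2.0, where (s - s.mean()).abs().mean() gives the same result) :
import pandas as pd

s = pd.Series([97, 98, 99])
print(s.std())   # 1.0      : unbiased standard deviation
print(s.mad())   # 0.666... : mean absolute deviation from the mean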
Creating a Histogram - A histogram is a plot that lets you discover, and show, the underlying
frequency distribution (shape) of a set of continuous data - for example, a histogram computed
from a dataset containing the ages of 20 people.
To create a histogram from a dataframe, you can use the hist( ) function of the dataframe, which
draws one histogram of the DataFrame's columns. This function calls the PyPlot library's hist( )
on each series in the DataFrame, resulting in one histogram per column. Syntax :
DataFrame.hist(column = None, by = None, grid = True, bins = 10)
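A runnable sketch with made-up ages (the 20 values are illustrative, not the book's dataset) :
import pandas as pd
import matplotlib.pyplot as plt

ages = pd.DataFrame({'Age': [21, 22, 25, 25, 30, 31, 33, 35, 36, 40,
                             41, 42, 45, 47, 50, 52, 55, 60, 62, 65]})
ages.hist(column='Age', bins=10, grid=True)  # one histogram for the Age column
plt.show()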
Function Application - By function application, it means that a function (a library function or
a user-defined function) may be applied on a dataframe in multiple ways :
(a) on the whole dataframe - pipe( )
(b) row-wise or column-wise - apply( )
(c) on individual elements, i.e., element-wise - applymap( )
Other than the above three, there are two more function-application mechanisms : aggregation
through groupby( ) and transform( ).
The piping of functions through pipe( ) basically means the chaining of functions in the order
they are executed. Syntax : <DataFrame>.pipe(func, *args)
func   function name to be applied on the dataframe with the provided args
args   iterable, optional ; positional arguments passed into func.
Pipe function - Suppose you want to apply a function to a dataframe or series, and then apply
another function to the result, and another after that. One way would be to perform the
operations in a "sandwich"-like fashion :
df = sub(div(add(df, 10), 2), 1)
In the long run, this notation becomes fairly messy and error prone. What you want to use here
is pipe(). Pipe can be thought of as function chaining. This is how you'd perform the same task
as before with pipe() :
df = df.pipe(add, 10).pipe(div, 2).pipe(sub, 1)
This is a cleaner way that helps keep track of the order in which the functions and their
corresponding arguments are applied.
Suppose, for a moment, that you want to apply the following three functions to a dataframe or
series : the first function adds a number to the data, the second function divides the data by a
given parameter, and the third function subtracts a given number from the data.
Here is the data set :
   Col1
A     1
B     2
C     3
def add(df, num):
    return df[:] + num
def div(df, num):
    return df[:] / num
def sub(df, num):
    return df[:] - num
dt = {'Col1':[1,2,3]}
df = pd.DataFrame(dt, index=['A','B','C'])
print(df)
df = df.pipe(add, 10).pipe(div, 2).pipe(sub, 1)
print(df)
output :
   Col1
A   4.5
B   5.0
C   5.5
Note : To apply pipe(), the first argument of the function must be the data set. For example,
add() above accepts two arguments, add(df, num). As df is the first parameter and takes in the
data set, we can use pipe() directly. What if this is not the case? There's a way around it : we
only need to tell pipe the name of the argument in the function that refers to the data set.
Suppose, now, that the functions are specified with the data second, e.g. add(num, df). As the
data is not the first argument, we need to pass it to pipe as a (function, argument-name) tuple :
data_set.pipe((add, "df"), 2)
def add(num, df):
    return df[:] + num
def div(num, df):
    return df[:] / num
def sub(num, df):
    return df[:] - num
dt = {'Col1':[1,2,3]}
df = pd.DataFrame(dt, index=['A','B','C'])
print(df)
df = df.pipe((add, "df"), 10).pipe((div, "df"), 2).pipe((sub, "df"), 1)
print(df)
output :
   Col1
A   4.5
B   5.0
C   5.5
The apply( ) and applymap( ) functions –
• apply() is a series function, so it applies the given function to one row or one column
of the dataframe (a single row/column of a dataframe is equivalent to a series). Arbitrary
functions can be applied along the axes of a DataFrame or Panel using the apply() method,
which, like the descriptive statistics methods, takes an optional axis argument. By default, the
operation is performed column-wise, taking each column as an array-like.
The syntax for using apply() in minimalist form is : <dataframe>.apply(<funcname>, axis=0)
<funcname>   the function to be applied on the series inside the dataframe, i.e., on rows and
             columns. It should be a function that works with series and similar objects.
axis         0 or 1, default 0 ; the axis along which the function is applied.
             If axis is 0 or 'index' : the function is applied on each column.
             If axis is 1 or 'columns' : the function is applied on each row.
df = pd.DataFrame(np.arange(0,15).reshape(5,3), columns=['col1','col2','col3'])
print(df.apply(np.mean))
print(df.apply(np.mean, axis=1))
col1    6.0
col2    7.0
col3    8.0
dtype: float64
By passing the axis parameter, operations can be performed row-wise :
0     1.0
1     4.0
2     7.0
3    10.0
4    13.0
dtype: float64
df = pd.DataFrame([[2,6,9],[23,56,32],[11,12,13]], columns=['col1','col2','col3'])
print(df.apply(lambda x: x.max() - x.min()))
col1    21
col2    50
col3    23
dtype: int64
• applymap( ) is an element function, so it applies the given function to each
individual element, separately - without taking into account other elements. Since not all
functions can be vectorized (accept NumPy arrays and return another array or value), the
method applymap() on DataFrame, and analogously map() on Series, accepts any Python
function taking a single value and returning a single value.
The syntax for using applymap( ) is : <dataframe>.applymap(<funcname>)
df = pd.DataFrame([[2,6,9],[23,56,32],[11,12,13]], columns=['col1','col2','col3'])
print(df.applymap(lambda x: x*100))
   col1  col2  col3
0   200   600   900
1  2300  5600  3200
2  1100  1200  1300
df = pd.DataFrame([[2,6,9],[23,56,32],[11,12,13]], columns=['col1','col2','col3'])
print(df['col1'].map(lambda x: x*100))
0     200
1    2300
2    1100
Name: col1, dtype: int64
NOTE : The apply( ) will apply the function on individual columns/rows, only if the passed
function name is a Series function. If you pass a single value function, then apply( ) will behave
like applymap( ).
Function groupby( ) - Within a dataframe, based on a field's values, you may want to group
the data. To create such groups, Pandas provides the groupby( ) function. The syntax is :
<dataframe>.groupby(by=None, axis=0)
by     label, or list of labels, to be used for grouping
axis   {0 or 'index', 1 or 'columns'}, default 0 ; split along rows or columns.
df1.groupby('tutor')
The result of groupby() is also an object, the DataFrameGroupBy object.
You can store the GroupBy object in a variable name and then use the following attributes and
functions to get information about the groups or to display the groups :
<GroupByObject>.groups                   lists the groups created
<GroupByObject>.get_group(<value>)       lists the group created for the passed value
<GroupByObject>.size()                   lists the size of the groups created
<GroupByObject>.count()                  lists the count of non-NA values for each column in
                                         the groups created
<GroupByObject>[<columnname>].head()     lists the specified column from the grouped object
                                         created
Grouping on Multiple Columns - df.groupby(['Tutor', 'Country'])
Aggregation via groupby( ) - Often in data science, you need to have summary statistics in
the same table. You can achieve this using the agg() method on the groupby object created
using the groupby( ) method. The agg( ) method aggregates the data of the dataframe using
one or more operations over the specified axis. The syntax for using agg( ) is :
<dataframe>.agg( func, axis = 0 )
func   function, str, list or dict
axis   {0 or 'index', 1 or 'columns'}, default 0 ; if 0 or 'index' : apply the function to each
       column. If 1 or 'columns' : apply the function to each row.
Any groupby operation involves one of the following operations :
1. Splitting the object
2. Applying a function
3. Combining the results
In many situations, we split the data into sets and we apply some functionality on each subset.
In the apply functionality, we can perform the following operations −
• Aggregation − computing a summary statistic
• Transformation − performing some group-specific operation
• Filtration − discarding the data with some condition
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
print(df)
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
4      741     3   Kings  2014
5      812     4   kings  2015
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015
11     690     2  Riders  2017
Split Data into Groups - A Pandas object can be split into groups in multiple ways, e.g.
obj.groupby('key') or obj.groupby(['key1','key2']) or obj.groupby(key, axis=1).
df = pd.DataFrame(ipl_data)
print(df.groupby('Team'))
<pandas.core.groupby.DataFrameGroupBy object at 0x7fa46a977e50>
View Groups
df = pd.DataFrame(ipl_data)
print(df.groupby('Team').groups)
{'Kings': Int64Index([4, 6, 7], dtype='int64'),
 'Devils': Int64Index([2, 3], dtype='int64'),
 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'),
 'Royals': Int64Index([9, 10], dtype='int64'),
 'kings': Int64Index([5], dtype='int64')}
Example - Group by with multiple columns −
df = pd.DataFrame(ipl_data)
print(df.groupby(['Team','Year']).groups)
{('Kings', 2014): Int64Index([4], dtype='int64'),
 ('Royals', 2014): Int64Index([9], dtype='int64'),
 ('Riders', 2014): Int64Index([0], dtype='int64'),
 ('Riders', 2015): Int64Index([1], dtype='int64'),
 ('Kings', 2016): Int64Index([6], dtype='int64'),
 ('Riders', 2016): Int64Index([8], dtype='int64'),
 ('Riders', 2017): Int64Index([11], dtype='int64'),
 ('Devils', 2014): Int64Index([2], dtype='int64'),
 ('Devils', 2015): Int64Index([3], dtype='int64'),
 ('kings', 2015): Int64Index([5], dtype='int64'),
 ('Royals', 2015): Int64Index([10], dtype='int64'),
 ('Kings', 2017): Int64Index([7], dtype='int64')}
Iterating through Groups - With the groupby object in hand, we can iterate through the
groups, similarly to iterating with itertools :
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
for name, group in grouped:
    print(name)
    print(group)
2014
   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014
2015
    Points  Rank    Team  Year
1      789     2  Riders  2015
3      673     3  Devils  2015
5      812     4   kings  2015
10     804     1  Royals  2015
2016
   Points  Rank    Team  Year
6     756     1   Kings  2016
8     694     2  Riders  2016
2017
    Points  Rank    Team  Year
7      788     1   Kings  2017
11     690     2  Riders  2017
NOTE : By default, the groupby object has the same label name as the group name.
Select a Group - Using the get_group() method, we can select a single group.
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
print(grouped.get_group(2014))
   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014
Aggregations - An aggregation function returns a single aggregated value for each group.
Once the groupby object is created, several aggregation operations can be performed on
the grouped data. An obvious one is aggregation via the aggregate or equivalent agg method –
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
print(grouped['Points'].agg(np.mean))
Year
2014    795.25
2015    769.50
2016    725.00
2017    739.00
Name: Points, dtype: float64
Another way to see the size of each group is by applying the size() function –
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Team')
print(grouped.agg(np.size))
        Points  Rank  Year
Team
Devils       2     2     2
Kings        3     3     3
Riders       4     4     4
Royals       2     2     2
kings        1     1     1
Applying Multiple Aggregation Functions at Once
With a grouped Series, you can also pass a list or dict of functions to do aggregation with,
and generate a DataFrame as output –
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Team')
print(grouped['Points'].agg([np.sum, np.mean, np.std]))
         sum        mean         std
Team
Devils  1536  768.000000  134.350288
Kings   2285  761.666667   24.006943
Riders  3049  762.250000   88.567771
Royals  1505  752.500000   72.831998
kings    812  812.000000         NaN
The transform( ) function - The groupby( ) function rearranges data into groups based
on some criteria and stores the rearranged data in a new groupby object. You can apply
aggregate functions on the groupby object using agg( ). The transform( ) function
transforms the aggregated data by repeating the summary result for each row of the group,
making the result have the same shape as the original data.
Transform is an operation used in conjunction with groupby. While aggregation must
return a reduced version of the data, transformation can return some transformed version
of the full data to recombine. For such a transformation, the output is the same shape as
the input. A common example is to center the data by subtracting the group-wise mean.
Problem Set - For this example, we will analyze some fictitious sales data. In order to
keep the dataset small, here is a sample of 12 sales transactions for our company.
(transform.xlsx).
First Approach - Merging
If you are familiar with pandas, your first inclination is going to be trying to group the data into a
new dataframe and combine it in a multi-step process. Here’s what that approach would
look like.
Import all the modules we need and read in our data:
import pandas as pd
df = pd.read_excel("sales_transactions.xlsx")
df.groupby('order')["ext price"].sum()
order
10001     576.12
10005    8185.49
10006    3724.49
Name: ext price, dtype: float64
The tricky part is figuring out how to combine this data back with the original dataframe. The first instinct
is to create a new dataframe with the totals by order and merge it back with the original. We could do
something like this:
order_total = df.groupby('order')["ext price"].sum().rename("Order_Total").reset_index()
df_1 = df.merge(order_total)
df_1["Percent_of_Order"] = df_1["ext price"] / df_1["Order_Total"]
    account           name  order       sku  quantity  unit price  ext price  Order_Total  Percent_of_Order
0    383080       Will LLC  10001  B1-20000         7       33.69     235.83       576.12            0.4093
1    383080       Will LLC  10001  S1-27722        11       21.12     232.32       576.12            0.4032
2    383080       Will LLC  10001  B1-86481         3       35.99     107.97       576.12            0.1874
3    412290  Jerde-Hilpert  10005  S1-06532        48       55.82    2679.36      8185.49            0.3273
4    412290  Jerde-Hilpert  10005  S1-82801        21       13.62     286.02      8185.49            0.0349
5    412290  Jerde-Hilpert  10005  S1-06532         9       92.55     832.95      8185.49            0.1018
6    412290  Jerde-Hilpert  10005  S1-47412        44       78.91    3472.04      8185.49            0.4242
7    412290  Jerde-Hilpert  10005  S1-27722        36       25.42     915.12      8185.49            0.1118
8    218895      Kulas Inc  10006  S1-27722        32       95.66    3061.12      3724.49            0.8219
9    218895      Kulas Inc  10006  B1-33087        23       22.55     518.65      3724.49            0.1393
10   218895      Kulas Inc  10006  B1-33364         3       72.30     216.90      3724.49            0.0582
11   218895      Kulas Inc  10006  B1-20000        -1       72.18     -72.18      3724.49           -0.0194
This certainly works but there are several steps needed to get the data combined in the manner we need.
Second Approach - Using Transform
Using the original data, let’s try using “transform” and “groupby” and see what we get:
df.groupby('order')["ext price"].transform('sum')
0      576.12
1      576.12
2      576.12
3     8185.49
4     8185.49
5     8185.49
6     8185.49
7     8185.49
8     3724.49
9     3724.49
10    3724.49
11    3724.49
Name: ext price, dtype: float64
You will notice how this returns a differently sized data set from our normal groupby functions.
Instead of only showing the totals for the 3 orders, we retain the same number of items as the
original data set. That is the unique feature of using transform.
The final step is pretty simple:
df["Order_Total"] = df.groupby('order')["ext price"].transform('sum')
print(df)
df["Percent_of_Order"] = df["ext price"] / df["Order_Total"]
print(df)
    account           name  order       sku  quantity  unit price  ext price  Order_Total  Percent_of_Order
0    383080       Will LLC  10001  B1-20000         7       33.69     235.83       576.12            0.4093
1    383080       Will LLC  10001  S1-27722        11       21.12     232.32       576.12            0.4032
2    383080       Will LLC  10001  B1-86481         3       35.99     107.97       576.12            0.1874
3    412290  Jerde-Hilpert  10005  S1-06532        48       55.82    2679.36      8185.49            0.3273
4    412290  Jerde-Hilpert  10005  S1-82801        21       13.62     286.02      8185.49            0.0349
5    412290  Jerde-Hilpert  10005  S1-06532         9       92.55     832.95      8185.49            0.1018
6    412290  Jerde-Hilpert  10005  S1-47412        44       78.91    3472.04      8185.49            0.4242
7    412290  Jerde-Hilpert  10005  S1-27722        36       25.42     915.12      8185.49            0.1118
8    218895      Kulas Inc  10006  S1-27722        32       95.66    3061.12      3724.49            0.8219
9    218895      Kulas Inc  10006  B1-33087        23       22.55     518.65      3724.49            0.1393
10   218895      Kulas Inc  10006  B1-33364         3       72.30     216.90      3724.49            0.0582
11   218895      Kulas Inc  10006  B1-20000        -1       72.18     -72.18      3724.49           -0.0194
As an added bonus, you could combine into one statement if you did not want to show the individual
order totals:
df["Percent_of_Order"] = df["ext price"] / df.groupby('order')["ext price"].transform('sum')
NOTE : The aggregation function agg() returns a reduced version of the data by producing one
summary result per group. The transform() function, on the other hand, returns a transformed
version of the summary data by repeating the group's result for each of its rows, making it the
same shape as the full data ; thus the result of transform can be combined with the dataframe
easily.
Reindexing and Altering Labels – When you create a dataframe object, it gets its row labels
(the indexes) and column labels automatically. But sometimes we are not satisfied with the row
and column labels of a dataframe. For this, Pandas offers a major functionality : you may change
the row indexes and column labels as and when you require.
Recall that index refers to the labels of axis 0, i.e., row labels, and columns refers to the labels of
axis 1, i.e., column labels. There are several similar methods provided by the Pandas library that
help you change, rearrange or rename indexes or column labels. Thus, you should read the
following lines carefully to know the difference between the working of these methods.
The methods provided by Pandas for reindexing and relabelling are :
(i) rename( ). A method that simply renames the index and/or column labels in a dataframe.
(ii) reindex( ). A method that can specify the new order of existing indexes and column labels,
and/or also create new indexes/column labels.
(iii) reindex_like( ). A method for creating indexes/column labels based on another dataframe
object.
(i) The rename( ) method - The rename() function renames the existing indexes/column labels
in a dataframe. The old and new index/column labels are to be provided in the form of a
dictionary where the keys are the old index/column labels and the values are the new names for
the same. Syntax :
<dataframe>.rename(mapper = None, axis = None, inplace = False)
<dataframe>.rename(index = None, columns = None, inplace = False)
mapper, index, columns   dict-like (dictionary-like)
axis                     int (0 or 1) or str ('index' or 'columns') ; the default is 0 or 'index'.
inplace                  boolean, default False (which returns a new dataframe with renamed
                         index/labels) ; if True, changes are made in the current dataframe and
                         a new dataframe is not returned.
ndf.rename( {'Qtr1':1, 'Qtr2':2, 'Qtr3':3, 'Qtr4':4}, axis=0 )
NOTE : You either use a mapper dictionary with the axis argument, or use the dictionary with
the index = or columns = keyword arguments.
<DF>.rename( columns = { <dictionary with old and new labels> } ) or
<DF>.rename( { <dictionary with old and new labels> }, axis = 1 )
The rename() method allows you to relabel an axis based on some mapping (a dict or Series)
or an arbitrary function.
df1 = pd.DataFrame([[50,60,70],[90,80,70]], columns=['Sub1','Sub2','Sub3'])
print(df1)
print("After renaming the rows and columns:")
print(df1.rename(columns={'Sub1':'Eng', 'Sub2':'Hin', 'Sub3':'Evs'}, index={0:11, 1:21, 2:51}))
   Sub1  Sub2  Sub3
0    50    60    70
1    90    80    70
After renaming the rows and columns:
    Eng  Hin  Evs
11   50   60   70
21   90   80   70
The rename() method provides an inplace named parameter, which by default is False and copies the
underlying data. Pass inplace=True to rename the data in place.
(ii) The reindex( ) method - The function reindex( ) is used to change the order of, or create,
indexes/labels. It is used as per the following syntaxes in minimalist form :
DataFrame.reindex(index = None, columns = None, method = None, fill_value = nan)
DataFrame.reindex(labels = None, axis = None, method = None, fill_value = nan)
labels           array-like, optional ; new labels/index to conform the axis specified by 'axis' to.
index, columns   array-like, optional ; new labels/index to conform to, specified using keywords.
                 Preferably an Index object, to avoid duplicating data.
axis             int (0 or 1) or str ('index' or 'columns'), optional ; axis to target. Default 0 or 'index'.
fill_value       the value to be filled in the newly added rows/columns.
NOTE : Like rename( ), in reindex( ) too you either use the labels sequence with the axis
argument, or use it with the index = or columns = keyword arguments.
(a) Reordering the existing indexes using reindex( ) - By default (axis = 0), reindex( ) reorders
the row indexes of the dataframe as per the given order.
(b) Reordering as well as adding/deleting indexes/labels - Existing row indexes/column labels
are reordered as per the given order, and non-existing row indexes/column labels create new
rows/columns in which, by default, NaN values are filled.
(c) Specifying fill values for new rows/columns - By using the argument fill_value, you can
specify the value to be filled in the newly added rows/columns, as sketched below. In the
absence of the fill_value argument, the new row/column is filled with NaN.
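A small sketch of fill_value (illustrative frame) :
import pandas as pd

df1 = pd.DataFrame([[50, 60, 70], [90, 80, 70]], columns=['Sub1', 'Sub2', 'Sub3'])
# new row 2 and new column 'Sub4' are filled with 0 instead of NaN
print(df1.reindex(index=[0, 1, 2], columns=['Sub1', 'Sub2', 'Sub4'], fill_value=0))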
df1 = pd.DataFrame([[50,60,70],[90,80,70]], columns=['Sub1','Sub2','Sub3'])
print(df1)
print("After reindexing the rows and columns:")
print(df1.reindex(columns=['Sub1', 'Sub2', 'Sub4'], index=[0, 1, 3]))
   Sub1  Sub2  Sub3
0    50    60    70
1    90    80    70
After reindexing the rows and columns:
   Sub1  Sub2  Sub4
0  50.0  60.0   NaN
1  90.0  80.0   NaN
3   NaN   NaN   NaN
(iii) The reindex_like( ) method - The reindex_like( ) function works on a dataframe and
reindexes its data as per the argument dataframe passed to it. This function does the following
things :
(a) If the current dataframe has some matching row-indexes/column-labels with the passed
dataframe, then it retains those indexes/labels and their data.
(b) If the current dataframe has some row-indexes/column-labels in it which are not in the
passed dataframe, it drops them.
(c) If the current dataframe does not have some row-indexes/column-labels which are in the
passed dataframe, then it adds them to the current dataframe with the value NaN.
(d) The reindex_like( ) thus ensures that the current dataframe object conforms to the same
indexes/labels on all axes.
The syntax for using reindex_like( ) is : <dataframe>.reindex_like(other)
other - name of the dataframe as per which the current <dataframe> is to be reindexed.
df1 = pd.DataFrame([[50,60,34],[90,80,44]], columns=['Sub1','Sub2','Sub3'])
df2 = pd.DataFrame([[78,76],[67,98]], columns=['Sub1','Sub2'])
print("DF1 After reindexing like DF2")
print(df1.reindex_like(df2))
DF1 After reindexing like DF2
   Sub1  Sub2
0    50    60
1    90    80
Note − Here, df1 is reindexed like df2 : the returned dataframe has df2's shape, while df1 itself
is not modified. The column names should match, or else NaN will be added for the entire
column label.
Plotting with PyPlot : Bar Graphs and Scatter Plots
Data visualization basically refers to the graphical or visual representation of information and
data using visual elements like charts, graphs, and maps etc. Data visualization is immensely useful in
decision making. Data visualization unveils patterns, trends, outliers, correlations etc. in the data, and
thereby helps decision makers understand the meaning of data to drive business decisions.
Matplotlib is a Python library that provides many interfaces and functionality for 2D graphics similar to MATLAB's, in various forms. In short, you can call matplotlib a high-quality plotting library of Python. It provides both a very quick way to visualize data from Python and publication-quality figures in many formats. The matplotlib library offers many different named collections of methods; PyPlot is one such interface.
PyPlot is a collection of methods within matplotlib which allows the user to construct 2D plots easily and interactively. PyPlot essentially reproduces the plotting functions and behaviour of MATLAB.
After downloading matplotlib, you need to install it by giving the following commands on the command prompt:
python -m pip install -U pip
python -m pip install -U matplotlib
Importing PyPlot - import matplotlib.pyplot as pl
You can create many different types of graphs and charts using PyPlot. Some commonly used chart types are:
• Line Chart. A line chart or line graph is a type of chart which displays information as a series of data points called 'markers' connected by straight line segments.
• Bar Chart. A bar chart or bar graph is a chart that presents categorical data with rectangular bars whose heights or lengths are proportional to the values they represent. The bars can be plotted vertically or horizontally.
• Scatter Plot. A scatter plot is similar to a line chart; the major difference is that while a line graph connects the data points with a line, a scatter chart simply plots the data points to show the trend in the data.
NOTE : Data points are called markers.
Line Chart using plot( ) function –
import matplotlib.pyplot as pl
a = [1, 2, 3, 4]
b = [2, 4, 6, 8]
pl.plot(a, b)
pl.show()
You can set the x-axis and y-axis labels using the functions xlabel( ) and ylabel( ) respectively, i.e. :
<matplotlib.pyplot or its alias>.xlabel( <str> )
<matplotlib.pyplot or its alias>.ylabel( <str> )
import matplotlib.pyplot as plt
a= [1, 2, 3, 4]; b = [3, 4, 9, 8];
plt.plot(a, b)
plt.xlabel('X axis - Values of A')
plt.ylabel('Y axis - Values of B')
plt.show()
Applying Various Settings in plot( ) Function - The plot( ) function allows you to specify multiple settings for your chart/graph such as : • color (line color / marker color) • marker type • marker size etc.
Changing Line Color - <matplotlib.pyplot>.plot(<data1> [, <data2>], <color code>)
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0., 10, 0.1)
a = np.cos(x)
b = np.sin(x)
plt.plot(x, a, 'b')
plt.plot(x, b, 'r')
plt.show()
To change the line style, you can add the following additional optional argument in the plot( ) function: linestyle (or ls) = 'solid' | 'dashed' | 'dashdot' | 'dotted'
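For instance, a minimal sketch (reusing x and a from the example above) that draws the cosine curve as a dashed blue line:
plt.plot(x, a, 'b', linestyle='dashed')   # ls='dashed' works the same way
plt.show()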
Changing Marker Type, Size and Color –
marker=<valid marker type>, markersize=<in points>, markeredgecolor=<valid color>
plt.plot(p, q, 'k', marker='d', markersize=5, markeredgecolor='red')
plt.plot(p, q, 'r+', marker='d', linestyle='solid')
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0., 10, 0.1)
a = np.cos(x)
b = np.sin(x)
plt.plot(x, a, 'bo', markersize=5)
plt.plot(x, b, 'r', marker='D', markersize=5)
plt.show()
When you do not specify markeredgecolor separately in plot( ), the marker takes the same color as the line. Also, if you do not specify linestyle separately along with a line-color-and-marker-style combination string (e.g., 'r+' above), Python will only plot the markers and not the line.
Creating Scatter Charts - Scatter charts can be created through two functions of the pyplot library:
(i) the plot( ) function
(ii) the scatter( ) function
In the plot( ) function, whenever you specify a marker type/style, whether with color or without color, and do not give the linestyle argument, plot( ) will create a scatter chart.
The full useful syntax of plot( ) can be summarised as :
plot( <data1> [, <data2>] [, <color code and marker type>] [, <linewidth>] [, <linestyle>] [, <marker>] [, <markersize>] [, <markeredgecolor>] )
• data1, data2 are sequences of values to be plotted on the x-axis and y-axis respectively.
• The rest of the arguments affect the look and format of the line/scatter chart as given below:
color code with markertype : color and marker symbol for the line chart
linewidth : width of the line in points (a float value)
linestyle : can be 'solid' | 'dashed' | 'dashdot' | 'dotted'
marker : a symbol denoting a valid marker style such as '.', 'o', 'x', '+', 'd', and others
markersize : size of marker in points (a float value)
markeredgecolor : color for markers; should be a valid color identified by PyPlot
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 30)
y = np.sin(x)
plt.plot(x, y, 'o', color='r')
plt.show()
Scatter Charts using scatter( ) Function - It is a more powerful method of creating scatter plots than the plot( ) function. In its simplest form, the scatter( ) function is used as –
matplotlib.pyplot.scatter( <array1>, <array2> )
or
<pyplot alias>.scatter( <array1>, <array2> )
Specifying marker type - You can specify the marker type using the marker argument of the scatter( ) function, e.g.,
pl.scatter(a1, a4, marker="x")
Specifying size of the markers - Using the argument s, you can specify the size of the markers.
Specifying color of the markers - Using the argument c, you can specify the color of the markers.
matplotlib.pyplot.scatter(x, y, s=None, c=None, marker=None)
x, y : the data positions.
s : the marker size in points**2; optional argument.
c : marker color, sequence, or sequence of colors; optional argument.
marker : MarkerStyle; optional argument.
The primary difference of scatter( ) from plot( ) is that it can be used to create scatter plots where the
properties of each individual point (size, face color, edge color, etc.) can be individually controlled or
mapped to data.
arr1 = np.linspace(-1, 1, 5)          # arr1 with 5 data points created
arr2 = np.exp(arr1)                   # arr2 also has 5 data points
colarr = ['r', 'b', 'm', 'g', 'k']    # colarr is a sequence of colors with same shape as arr1
sarr = [20, 60, 100, 45, 25]          # sarr is a sequence of sizes with same shape as arr1
plt.scatter(arr1, arr2, c=colarr, s=sarr)
plt.show()
Creating Bar Charts - A Bar Graph or a Bar Chart is a graphical display of data using bars of different heights. A bar chart can be drawn vertically or horizontally using rectangles or bars of different heights/widths. PyPlot offers the bar( ) function to create a bar chart.
a, b = [1, 2, 3, 4], [2, 4, 6, 8]
matplotlib.pyplot.bar(a, b)
Notice, the first sequence given in bar( ) forms the x-axis and the second sequence's values are plotted on the y-axis. If you want to specify the x-axis label and y-axis label, then you need to give the commands:
matplotlib.pyplot.xlabel(<label string>)
matplotlib.pyplot.ylabel(<label string>)
NOTE : The simple bar plot is best used when there is just one level of grouping to your variable.
Changing Widths of the Bars in a Bar Chart – By default, a bar chart draws bars with equal widths, having a default width of 0.8 units. That is, all bars have the same width as the default width. But you can always change the widths of the bars.
• You can specify a different width (other than the default width) for all the bars of a bar chart.
• You can also specify different widths for different bars of a bar chart.
(i) To specify a common width (other than the default width) for all bars, you can specify the width argument having a scalar float value in the bar( ) function, i.e., as :
<matplotlib.pyplot>.bar( <x-sequence>, <y-sequence>, width=<float value> )
(ii) To specify different widths for different bars of a bar chart, you can specify the width argument having a sequence (such as a list or tuple) containing widths for each of the bars, in the bar( ) function, i.e., as :
<matplotlib.pyplot>.bar( <x-sequence>, <y-sequence>, width=<width values sequence> )
Changing Colors of the Bars in a Bar Chart –
(i) To specify a common color (other than the default color) for all bars, you can specify the color argument having a valid color code/name in the bar( ) function, i.e., as :
<matplotlib.pyplot>.bar( <x-sequence>, <y-sequence>, color=<color code/name> )
(ii) To specify different colors for different bars of a bar chart, you can specify the color argument having a sequence (such as a list or tuple) containing colors for each of the bars, in the bar( ) function, i.e., as :
<matplotlib.pyplot>.bar( <x-sequence>, <y-sequence>, color=<color names/codes sequence> )
a, b = ['a','b','c','d'], [2, 6, 4, 8]
plt.bar(a, b, width=[.05, .1, .05, .1], color=['red', 'b', 'g', 'black'])
plt.plot(a, b)
plt.show()
Creating Multiple Bars Chart - As such, PyPlot does not provide a specific function for this, but you can always create one by exploiting the width and color arguments of bar( ).
import numpy as np
import matplotlib.pyplot as plt
Val = [[5., 25., 45., 20.], [4., 23., 49., 17.], [6., 22., 47., 19.]]
X = np.arange(4)
plt.bar(X + 0.00, Val[0], color='b', width=0.25)
plt.bar(X + 0.25, Val[1], color='g', width=0.25)
plt.bar(X + 0.50, Val[2], color='r', width=0.25)
plt.title('Bar Chart')
plt.show()
Creating a Horizontal Bar Chart - To create a horizontal bar chart, you need to use
barh() function (bar horizontal), in place of bar(). Also, you need to give x and y axis labels
carefully - the label that you gave to x axis in bar( ), will become y-axis label in barh( ) and
vice-versa.
Val = [[5., 25., 45., 20.], [4., 23., 49., 17.], [6., 22., 47., 19.]]
X = np.arange(1, 15, 4)
plt.barh(X + 0, Val[0], color='b')
plt.barh(X + 1, Val[1], color='g')
plt.barh(X + 2, Val[2], color='r')
plt.show()
Anatomy of a Chart - Any graph or chart that you create using matplotlib's PyPlot interface is created as per a specific structure of a plot, or shall we say a specific anatomy. As anatomy generally refers to the study of the bodily structure (or parts) of something, here we shall discuss the various parts of a plot that you can create using PyPlot.
PyPlot charts have hierarchical structures, or in simple words, they are actually like containers containing multiple items/things inside them.
[Figure : anatomy of a PyPlot chart]
The terms given below describe its parts.
• Figure. PyPlot by default plots every chart into an area called the Figure. A figure contains the other elements of the plot in it.
• Axes. The axes define the area (mostly rectangular in shape for simple plots) on which the actual plot (line or bar or graph etc.) will appear. Axes have properties like label, limits and tick marks on them. There are two axes in a plot: (i) X-axis, the horizontal axis, (ii) Y-axis, the vertical axis.
  • Axis label. It defines the name for an axis. It is individually defined for the X-axis and Y-axis each.
  • Limits. These define the range of values and the number of values marked on the X-axis and Y-axis.
  • Tick marks. The tick marks are individual points marked on the X-axis or Y-axis.
• Title. This is the text that appears on the top of the plot. It defines what the chart is about.
• Legends. These are the different colors or marks that identify different sets of data plotted on the plot. The legends are shown in a corner of the plot.
Adding a Title - To add a title to your plot, you need to call the function title( ) before you show your plot. The syntax of the title( ) function is : plt.title("A Bar chart")
You can use the title( ) function for all types of plots, i.e., for plot( ), for bar( ) and for pie( ) as well.
Setting Xlimits and Ylimits – You can use the xlim( ) and ylim( ) functions to set limits for the X-axis and Y-axis respectively.
<matplotlib.pyplot>.xlim(<xmin>, <xmax>) # set the X-axis limits as xmin to xmax
<matplotlib.pyplot>.ylim(<ymin>, <ymax>) # set the Y-axis limits as ymin to ymax
NOTE : Only the data values falling within the X-limits and Y-limits will get plotted. If no data value maps to the X-limits or Y-limits, nothing will show on the plot.
Val = [[5., 25., 45., 20.], [4., 23., 49., 17.], [6., 22., 47., 19.]]
X = np.arange(1, 15, 4)
plt.bar(X + 0, Val[0], color='b')
plt.bar(X + 1, Val[1], color='g')
plt.bar(X + 2, Val[2], color='r')
plt.title('Bar Chart')
plt.xlim(-2, 18)
plt.show()
You can use decreasing axes by flipping the normal order of the axis limits i.e., if you swap
the limits (min, max) as (max, min), then the plot gets flipped.
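For example, a minimal sketch (reusing the bar chart above) that flips the x-axis so values run from 18 down to -2:
plt.xlim(18, -2)   # (max, min) instead of (min, max) flips the x-axis
plt.show()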
Setting Ticks for Axes - By default, PyPlot will automatically decide which data points will have ticks on the axes, but you can also decide which data points will have tick marks on the X- and Y-axes. To set your own tick marks:
• for the X-axis, you can use the xticks( ) function as per the format:
xticks( <sequence containing tick data points> [, <optional sequence containing tick labels>] )
• for the Y-axis, you can use the yticks( ) function as per the format:
yticks( <sequence containing tick data points> [, <optional sequence containing tick labels>] )
val = [5, 25, 45, 20]
x = [1, 2, 3, 4]
plt.bar(x, val, color='b')
plt.title('Bar Chart')
plt.xlim(0, 5)
plt.xticks([0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5])
plt.show()
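The optional second sequence replaces the tick values with text labels; a minimal sketch (reusing the same x and val, with hypothetical quarter labels):
plt.bar(x, val, color='b')
plt.xticks([1, 2, 3, 4], ['Q1', 'Q2', 'Q3', 'Q4'])   # label the four bars
plt.show()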
Adding Legends - A legend is a color or mark linked to a specific data range plotted. To
plot a legend, you need to do two things –
(i) In the plotting functions like plot( ), bar( ) etc., give a specific label to the data range using the argument label.
(ii) Add the legend to the plot using legend( ) as per the format : <matplotlib.pyplot>.legend(loc=<position number or string>)
The loc argument can take values 1, 2, 3, 4 signifying the position strings 'upper right', 'upper left', 'lower left', 'lower right' respectively. If loc is not given, matplotlib places the legend at the 'best' position automatically.
import numpy as np
import matplotlib.pyplot as plt
Data = [[5., 25., 45., 20.], [8., 13., 29., 27.], [9., 29., 27., 39.]]
X = np.arange(4)
plt.plot(X, Data[0], color='b', label='range1')
plt.plot(X, Data[1], color='g', label='range2')
plt.plot(X, Data[2], color='r', label='range3')
plt.legend(loc='upper left')
plt.title("MultiRange Line chart")
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
Saving a Figure - If you want to save a plot created using pyplot functions for later use or for keeping records, you can use savefig( ) to save the plot. You can use pyplot's savefig( ) as per the format :
<matplotlib.pyplot>.savefig( <string with filename and path> )
• You can save figures in popular formats like .pdf, .png, .eps etc. Specify the format as the file extension.
• Also, while specifying the path, use double backslashes to suppress the special meaning of the single backslash character.
Consider the following examples :
plt.savefig("multibar.pdf")             # save the plot in the current directory
plt.savefig("c:\\data\\multibar.pdf")   # save the plot at the given path
plt.savefig("c:\\data\\multibar.png")   # save the plot at the given path in png format
Creating Histograms with Pyplot - A histogram is a summarization tool for discrete or
continuous data. A histogram provides a visual interpretation of numerical data by showing
the number of data points that fall within a specified range of values (called bins). It is
similar to a vertical bar graph. However, a histogram, unlike a vertical bar graph, shows no
gaps between the bars.
Histograms are a great way to show results of continuous data, such as weight, height, time taken, and so forth. But when the data is in categories (such as Country or Subject etc.), one should use a bar chart.
Histogram using hist( ) Function –
matplotlib.pyplot.hist( x, bins=None, cumulative=False, histtype='bar', align='mid', orientation='vertical' )
x : (n,) array or sequence of (n,) arrays to be plotted on the histogram.
bins : int, optional; if an integer is given, bins + 1 bin edges are calculated and returned. The default value is provided automatically internally.
cumulative : bool, optional; if True, a histogram is computed where each bin gives the counts in that bin plus all bins for smaller values. The last bin gives the total number of datapoints. Default is False.
histtype : {'bar', 'barstacked', 'step', 'stepfilled'}, optional; the type of histogram to draw. 'bar' is a traditional bar-type histogram; if multiple data are given, the bars are arranged side by side. 'barstacked' is a bar-type histogram where multiple data are stacked on top of each other. 'step' generates a lineplot that is by default unfilled. 'stepfilled' generates a lineplot that is by default filled. Default is 'bar'.
orientation : {'horizontal', 'vertical'}, optional; if 'horizontal', barh will be used for bar-type histograms.
a = np.array([22,87,5,43,56,73,55,54,71,20,51,45,79,51,27])
1. plt.hist(a, bins=[0,20,40,60,80,100])
2. plt.hist(a, bins=20)
   plt.xticks(np.arange(0,100,5))
3. plt.hist(a, bins=[0,20,40,60,80,100], cumulative=True)
4. plt.hist(a, bins=[0,20,40,60,80,100], histtype='step')
a = np.array([22,87,5,43,56,73,55,54,71,20,51,45,79,51,27])
b = np.array([27,92,10,53,60,79,60,60,79,20,59,51,80,59,33])
5. plt.hist([a,b], bins=[0,20,40,60,80,100])
6. plt.hist([a,b], bins=[0,20,40,60,80,100], histtype='barstacked')
7. plt.hist([a,b], bins=[0,20,40,60,80,100], histtype='barstacked', cumulative=True)
8. plt.hist([a,b], bins=[0,20,40,60,80,100], orientation='horizontal')
Creating Frequency Polygons - A frequency polygon is a type of frequency distribution graph. In a frequency polygon, the number of observations is marked with a single point at the midpoint of an interval. A straight line then connects each set of points. Frequency polygons make it easy to compare two or more distributions on the same set of axes.
Python's pyplot module of matplotlib provides no separate function for creating a frequency polygon. Therefore, to create a frequency polygon, what you can do is (see the sketch after the example below):
(i) Plot a histogram from the data.
(ii) Mark a single point at the midpoint of an interval/bin.
(iii) Draw straight lines to connect the adjacent points.
(iv) Connect the first data point to the midpoint of the previous interval on the x-axis.
pl.hist(com, bins=10, histtype='step')
Join the midpoints of each set of adjacent bins to create the frequency polygon.
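As a minimal sketch of these steps (the data array com here is hypothetical sample data; np.histogram( ) is used to obtain the bin counts and edges):
import numpy as np
import matplotlib.pyplot as plt
com = np.random.randint(0, 100, 50)          # hypothetical sample data
counts, edges = np.histogram(com, bins=10)   # bin counts and bin edges
mids = (edges[:-1] + edges[1:]) / 2          # midpoint of each bin
plt.hist(com, bins=10, histtype='step')      # step histogram as the base
plt.plot(mids, counts, 'r')                  # joining the midpoints gives the frequency polygon
plt.show()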
Creating Box Plots - The box plot has become the standard technique for presenting the 5-number summary, which consists of : (i) the minimum range value, (ii) the maximum range value, (iii) the upper quartile, (iv) the lower quartile, and (v) the median.
A box plot is used to show the range and middle half of ranked data. Ranked data is numerical data arranged in order. The middle half of the data is represented by the box. The highest and lowest scores are joined to the box by straight lines. The regions above the upper quartile and below the lower quartile each contain 25% of the data.
The box plot uses five important numbers of a data range : the extremes (the highest and the lowest numbers), the median, and the upper and lower quartiles, making up the five-number summary.
[Figure : diagram of the five-number summary]
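These five numbers can also be computed directly; a minimal sketch using NumPy (with the sample data ary used in the examples below):
import numpy as np
ary = [5, 20, 30, 45, 60, 80, 100, 140, 150, 200, 240]
print(np.min(ary),              # minimum
      np.percentile(ary, 25),   # lower quartile
      np.median(ary),           # median
      np.percentile(ary, 75),   # upper quartile
      np.max(ary))              # maximum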
matplotlib.pyplot.boxplot(x, notch=None, vert=None, meanline=None, showmeans=None, showbox=None)
x : array or a sequence of vectors; the input data.
notch : bool, optional (False); if True, will produce a notched box plot. Otherwise, a rectangular boxplot is produced.
vert : bool, optional (True); if True (default), makes the boxes vertical. If False, everything is drawn horizontally.
meanline : bool, optional (False); if True (and showmeans is True), will try to render the mean as a line spanning the full width of the box.
showmeans : bool, optional (False); show the arithmetic means.
showbox : bool, optional (True); show the central box.
ary = [5, 20, 30, 45, 60, 80, 100, 140, 150, 200, 240]
1. Draw the plain boxplot - pl.boxplot(ary)
2. Using the above sample data (ary), draw the boxplot with the mean shown - pl.boxplot(ary, showmeans=True)
3. Draw a notched boxplot for the same - pl.boxplot(ary, notch=True, showmeans=True)
4. Draw the boxplot with the above data without the central box - pl.boxplot(ary, showbox=False)
Using vert argument of boxplot( ), you can change the orientation of the boxplot.
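For example, a minimal sketch (reusing ary from above) that draws the same boxplot horizontally:
pl.boxplot(ary, vert=False)   # False draws the box and whiskers horizontally
pl.show()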
Customizing/Adding Details to the Plots - You have read about these in the previous chapter, so here we shall not cover them again. Instead, we quickly revise them here. The anatomy of all plot types is the same, so you can customise them using the same functions:
• Use <matplotlib.pyplot>.title( ) to add a title to your plot.
• Use <matplotlib.pyplot>.xticks( ) / yticks( ) for setting x-ticks and y-ticks.
• Use <matplotlib.pyplot>.xlim( ) / ylim( ) for setting the x-limit / y-limit.
• Use <matplotlib.pyplot>.xlabel( ) / ylabel( ) for setting the x-axis label / y-axis label.
• Use <matplotlib.pyplot>.legend( ) to add legends to your plot.