In [3]: !pip install pandas
Requirement already satisfied: pandas in c:\users\supaki\appdata\local\programs\python\python38\lib\site-packages (1.1.2)
Requirement already satisfied: pytz>=2017.2 in c:\users\supaki\appdata\local\programs\python\python38\lib\site-packages (from pandas) (2020.1)
Requirement already satisfied: numpy>=1.15.4 in c:\users\supaki\appdata\local\programs\python\python38\lib\site-packages (from pandas) (1.19.2)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\supaki\appdata\local\programs\python\python38\lib\site-packages (from pandas) (2.8.1)
Requirement already satisfied: six>=1.5 in c:\users\supaki\appdata\local\programs\python\python38\lib\site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
WARNING: You are using pip version 19.2.3, however version 20.2.3 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

In [4]: !pip uninstall -y pandas
Uninstalling pandas-1.1.2:
  Successfully uninstalled pandas-1.1.2

In [5]: !cd   # see the current working directory
The system cannot find the path specified.

In [6]: !cd C:\Users\supaki\OneDrive\DATA SCIENCE LEARNING\Udemy\Python Data Analysis & Visualization Bootcamp

In [7]: !pip freeze > AllPackages.txt

List all directories in your current working directory:

In [8]: !dir
 Volume in drive C has no label.
 Volume Serial Number is 9EE3-C949

 Directory of C:\Users\supaki\OneDrive\DATA SCIENCE LEARNING\Udemy\Python Data Analysis & Visualization Bootcamp

09/12/2020  09:09 AM    <DIR>          .
09/12/2020  09:09 AM    <DIR>          ..
09/12/2020  08:23 AM    <DIR>          .ipynb_checkpoints
09/12/2020  09:20 AM               914 AllPackages.txt
09/12/2020  09:09 AM            10,766 pip & python package index.ipynb
               2 File(s)         11,680 bytes
               3 Dir(s)  351,185,477,632 bytes free

Numpy Arrays

In [11]: import numpy as np

         # creating a 1D array
         mylist1 = [101, 102, 103]
         myArray = np.array(mylist1)
         print(myArray)
[101 102 103]

In [12]: # creating a 2D numpy array
         mylist2 = [201, 202, 203]
         myArray2D = np.array([mylist1, mylist2])
         print(myArray2D)
[[101 102 103]
 [201 202 203]]

Finding Dimensions of an Array - shape

In [13]: print("myArray dimensions")
         print(myArray.shape)      # number of rows and columns in this array
         print("myArray2D dimensions")
         print(myArray2D.shape)    # 2 rows and 3 columns
myArray dimensions
(3,)
myArray2D dimensions
(2, 3)

Finding the Datatype of an Array - dtype
The actual type of each element in that array.

In [14]: print("myArray dtype")
         print(myArray.dtype)
         print("myArray2D dtype")
         print(myArray2D.dtype)    # int32 - a 32-bit integer
myArray dtype
int32
myArray2D dtype
int32

Array Creation and Initialization Functions - zeros function
Creates and initialises a numpy array with zero-valued elements.

In [15]: zero_array = np.zeros(5)
         print(zero_array)
[0. 0. 0. 0. 0.]

In [16]: zero_array = np.zeros([5, 5])   # create an array with 5 rows and 5 columns
         print(zero_array)
[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]

Array Creation and Initialization Functions - ones function
Creates and initialises a numpy array with one-valued elements.

In [17]: print("1D")
         one_array = np.ones(5)
         print(one_array)
         print("2D")
         one_array2D = np.ones([5, 5])
         print(one_array2D)
1D
[1. 1. 1. 1. 1.]
2D
[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]
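As a side note (a small sketch of my own, not from the original notebook): zeros() and ones() also accept a shape tuple and an optional dtype, which is handy when you want integer arrays instead of the default float64. The variable names below are made up for illustration.

import numpy as np

int_zeros = np.zeros((2, 3), dtype=int)   # 2 rows, 3 columns of integer zeros
int_ones = np.ones((2, 3), dtype=int)     # same shape, filled with integer ones
print(int_zeros)
print(int_ones.dtype)                     # an integer dtype instead of float64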
Array Creation and Initialization Functions - empty function
empty() creates an array without assigning each value inside it, so the elements hold whatever junk values were already in memory.

In [18]: print("1D")
         empty_array = np.empty(5)
         print(empty_array)
         print("2D")
         empty_array2D = np.empty([5, 5])
         print(empty_array2D)
1D
[1. 1. 1. 1. 1.]
2D
[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]
(The "junk" here happens to be the ones left over from the previous cell, which is why the output looks identical to np.ones.)

Array Creation and Initialization Functions - eye function
Creates an identity matrix: all diagonal elements are 1, the rest are zero. It can only be 2D.

In [21]: identity_array = np.eye(3)
         print(identity_array)   # diagonal elements are 1
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

Array Creation and Initialization Functions - arange function
Parameters: start value, end value, and the difference between the elements (an arithmetic progression).

In [22]: AP_array = np.arange(0, 50, 2)
         print(AP_array)
[ 0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48]

Scalar Operations in Numpy Arrays
Simple mathematical operations that are performed on each array element.

In [1]: import numpy as np

Scalar Array Multiplication - using the * operator

In [2]: arr1 = np.array([[1, 2, 3], [5, 6, 7]])
        print(arr1)
[[1 2 3]
 [5 6 7]]

In [3]: arr2 = arr1 * arr1
        print(arr2)   # each element of arr2 is the square of the corresponding element of arr1
[[ 1  4  9]
 [25 36 49]]

Exponentiation - using the ** operator

In [4]: arr3 = arr1 ** 2
        print(arr3)
[[ 1  4  9]
 [25 36 49]]

In [5]: arr2 == arr3
Out[5]: array([[ True,  True,  True],
               [ True,  True,  True]])

In [6]: arr3 = arr1 ** 3
        print(arr3)
[[  1   8  27]
 [125 216 343]]

Scalar subtraction of array elements

In [7]: print(arr3)
[[  1   8  27]
 [125 216 343]]

In [8]: print(arr1)
[[1 2 3]
 [5 6 7]]

In [9]: arr4 = arr3 - arr1
        print(arr4)
[[  0   6  24]
 [120 210 336]]

Scalar division of array elements

In [10]: arr5 = 1 / arr1
         print(arr5)
[[1.         0.5        0.33333333]
 [0.2        0.16666667 0.14285714]]

Array Indexes

Introduction to Indexes

In [11]: myArray = np.arange(100, 160, 2)
         print(myArray)   # starts at 100 and ends at 158 (the stop value 160 is excluded)
[100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 130 132 134
 136 138 140 142 144 146 148 150 152 154 156 158]

Access Individual Array Elements

In [12]: print("First element:")
         print(myArray[0])
         print("Element at index 3:")
         print(myArray[3])
         print("", "")   # space between the two methods
         # another method: print the label and the element on the same line
         print("Element at index 6:", myArray[6])
         print("Element at index 10:", myArray[10])
First element:
100
Element at index 3:
106

Element at index 6: 112
Element at index 10: 120

Slicing of Array Indexes
arr[start:stop:step] - slicing is used to access a range of elements of an N-D array.

In [13]: myArray[0:5:1]   # to go up to index 4 you write 5; this gives the first 5 elements of the array
Out[13]: array([100, 102, 104, 106, 108])

Updating an Array using Slices

In [14]: myArray[0:5] = 0   # set the first five elements to 0
         print(myArray)
[  0   0   0   0   0 110 112 114 116 118 120 122 124 126 128 130 132 134
 136 138 140 142 144 146 148 150 152 154 156 158]

Slicing and memory allocation: view vs copy

In [15]: myArray2 = myArray[4:10]   # index values 4 to 9
         print(myArray2)
[  0 110 112 114 116 118]

In [16]: # update myArray2
         myArray2[:] = 1
         print(myArray2)
[1 1 1 1 1 1]

In [17]: print(myArray)   # the slice is a view, so the original array changed too
[  0   0   0   0   1   1   1   1   1   1 120 122 124 126 128 130 132 134
 136 138 140 142 144 146 148 150 152 154 156 158]

copy() function => allocates new numpy memory for the array

In [18]: myArray3 = myArray.copy()
         print(myArray3)
[  0   0   0   0   1   1   1   1   1   1 120 122 124 126 128 130 132 134
 136 138 140 142 144 146 148 150 152 154 156 158]

In [20]: myArray3[:] = 0
         print(myArray3)
         print(myArray)   # the elements of myArray did not change
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[  0   0   0   0   1   1   1   1   1   1 120 122 124 126 128 130 132 134
 136 138 140 142 144 146 148 150 152 154 156 158]
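To make the view-vs-copy behaviour explicit, here is a small sketch (my addition, not part of the course notebook) using np.shares_memory(); the variable names are illustrative only.

import numpy as np

base = np.arange(10)
view = base[2:5]             # a slice is a view into the same memory
separate = base[2:5].copy()  # copy() allocates new memory

print(np.shares_memory(base, view))      # True  - same underlying buffer
print(np.shares_memory(base, separate))  # False - independent buffer

view[:] = -1
print(base)   # the elements at index 2..4 changed through the view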
Array Indexes in Multi-Dimensional Arrays

In [21]: arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
         print(arr2d)
[[1 2 3]
 [4 5 6]
 [7 8 9]]

Accessing Rows

In [22]: # accessing different rows of the array
         print(arr2d[0])   # first row
         print(arr2d[1])   # second row of the array
[1 2 3]
[4 5 6]

Accessing Elements of an Array

In [23]: print(arr2d[0][0])   # intersection of row 1 and column 1 (row, column)
1

In [24]: # print the number 6 from arr2d
         print(arr2d[1][2])   # the value 6 is at row index 1 (2nd row) and column index 2 (3rd column)
6

In [25]: # Example 1
         ex_array = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15],
                              [16, 17, 18, 19, 20], [21, 22, 23, 24, 25], [26, 27, 28, 29, 30]])
         print(ex_array)
         print("")
         print("OR")
         print("")
         array1 = np.arange(1, 31).reshape(6, 5)
         print(array1)
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]
 [16 17 18 19 20]
 [21 22 23 24 25]
 [26 27 28 29 30]]

OR

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]
 [16 17 18 19 20]
 [21 22 23 24 25]
 [26 27 28 29 30]]

In [26]: # print 2, 8, 14, 20
         print(array1[[0, 1, 2, 3], [1, 2, 3, 4]])
[ 2  8 14 20]

In [27]: # Example 2
         array2 = np.ones([5, 5])
         print(array2)
[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]

In [28]: array2_int = np.ones([5, 5], dtype=int)   # change to integer
         print(array2_int)
[[1 1 1 1 1]
 [1 1 1 1 1]
 [1 1 1 1 1]
 [1 1 1 1 1]
 [1 1 1 1 1]]

In [29]: # matrix_1 = array2_int[1:3][1:3]
         # print(matrix_1)

Slicing in 2D Arrays

In [30]: print(arr2d)
[[1 2 3]
 [4 5 6]
 [7 8 9]]

In [31]: slice1 = arr2d[0:1, 0:2]   # first row, first & second column
         print(slice1)
[[1 2]]

In [32]: slice2 = arr2d[0:2, 0:3]   # first 2 rows and first 3 columns
         print(slice2)
[[1 2 3]
 [4 5 6]]

In [33]: slice3 = arr2d[:2, :3]   # rows up to the 2nd row and columns up to the 3rd column
         print(slice3)
[[1 2 3]
 [4 5 6]]

In [34]: slice4 = arr2d[1:, 2:]   # rows starting at row 2 and columns starting at column 3
         print(slice4)
[[6]
 [9]]

Using Loops to Change Values by Index
Update array values using loops.

In [35]: arr_len = arr2d.shape   # get the shape of the array
         print(arr_len)
         nrows = arr_len[0]
(3, 3)

In [36]: for i in range(nrows):   # nrows is 3
             arr2d[i][0] = i      # assign i to the first element of each row
             print(arr2d)
[[0 2 3]
 [4 5 6]
 [7 8 9]]
[[0 2 3]
 [1 5 6]
 [7 8 9]]
[[0 2 3]
 [1 5 6]
 [2 8 9]]

Accessing rows using a list of index values

In [37]: print(arr2d[[0, 1]])   # pass the row indexes you want
[[0 2 3]
 [1 5 6]]

Premium Array Operations

In [6]: import numpy as np

arange() Function

In [7]: A = np.arange(15)
        print(A)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]

In [8]: B = np.arange(5, 10)
        print(B)
[5 6 7 8 9]

In [9]: C = np.arange(5, 10, 2)   # start at 5, end before 10, increment by 2
        print(C)
[5 7 9]

sqrt() Function

In [10]: D = np.sqrt(A)   # gives the square root of each element in array A
         print(D)
[0.         1.         1.41421356 1.73205081 2.         2.23606798
 2.44948974 2.64575131 2.82842712 3.         3.16227766 3.31662479
 3.46410162 3.60555128 3.74165739]

exp() Function
Exponential function.

In [12]: E = np.exp(A)
         print(E)
[1.00000000e+00 2.71828183e+00 7.38905610e+00 2.00855369e+01
 5.45981500e+01 1.48413159e+02 4.03428793e+02 1.09663316e+03
 2.98095799e+03 8.10308393e+03 2.20264658e+04 5.98741417e+04
 1.62754791e+05 4.42413392e+05 1.20260428e+06]
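A quick illustrative comparison (my own sketch, not in the original notebook): sqrt() and exp() are ufuncs, so they operate on the whole array at once instead of requiring a Python loop. The names below are placeholders.

import math
import numpy as np

A = np.arange(15)
loop_sqrt = np.array([math.sqrt(v) for v in A])   # element-by-element Python loop
ufunc_sqrt = np.sqrt(A)                           # vectorised ufunc call
print(np.allclose(loop_sqrt, ufunc_sqrt))         # True - both give the same values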
random() Function
Creates an array of a given shape filled with random numbers. Whenever you run it again, a different set of random numbers is generated.

In [14]: F = np.random.randn(5)   # 5 is the size of the array
         print(F)                 # 1D array holding 5 random values
[0.12211897 0.83655809 0.85458522 0.06846784 1.79424511]

In [15]: F = np.random.randn(5, 5)   # 2D array
         print(F)
[[ 0.75830792  0.15049377  0.82932155  0.53882024  0.58156692]
 [ 0.32583286  0.79532875  1.48664644  0.44572963  1.06960642]
 [-0.23963189 -0.16518398 -0.83799173 -1.05410172 -0.81569001]
 [-0.20836428  0.96537819 -0.27722683  0.63144831 -0.19772012]
 [ 0.10362402 -0.92146184 -0.79719747 -0.13387981  1.54196333]]

add() Function

In [16]: print(A)
         print(D)
         G = np.add(A, D)   # corresponding elements of A and D are added and stored in the new array G
         print(G)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
[0.         1.         1.41421356 1.73205081 2.         2.23606798
 2.44948974 2.64575131 2.82842712 3.         3.16227766 3.31662479
 3.46410162 3.60555128 3.74165739]
[ 0.          2.          3.41421356  4.73205081  6.          7.23606798
  8.44948974  9.64575131 10.82842712 12.         13.16227766 14.31662479
 15.46410162 16.60555128 17.74165739]

maximum() Function

In [25]: H = np.array([1, 5, 7, 10])
         I = np.array([0, 6, 8, 9])
         J = np.maximum(H, I)   # element-wise maximum of the corresponding elements of H and I,
                                # e.g. the maximum of the first elements is 1, of the second elements is 6
         print(J)
[ 1  6  8 10]

Additional Numpy Documentation
numpy.org, docs.scipy.org

Saving and Loading Arrays to External Memory

In [26]: arr = np.arange(10)
         print(arr)
[0 1 2 3 4 5 6 7 8 9]

Why do we need to save arrays to the hard drive?

Saving a Single Array

In [27]: # np.save() - function to save an array; the extension of the saved file is .npy
         np.save('saved_array', arr)

Loading a Single Array

In [28]: # np.load() - load a saved array
         load_arr1 = np.load('saved_array.npy')
         print(load_arr1)
[0 1 2 3 4 5 6 7 8 9]

Saving Multiple Arrays

In [29]: arr2 = np.arange(25)
         arr3 = np.arange(5)
         # np.savez - saves multiple arrays into a single .npz (zip) archive
         # np.save  - saves a single array as .npy
         np.savez('saved_archieve.npz', x=arr2, y=arr3)

Loading Multiple Arrays

In [30]: load_npz = np.load('saved_archieve.npz')
         print(load_npz['x'])
         print(load_npz['y'])
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]
[0 1 2 3 4]

Saving Arrays to a Text File

In [32]: np.savetxt('myarray.txt', arr2, delimiter=',')   # file name, the array to save, and a delimiter

Loading Arrays from a Text File

In [33]: load_file = np.loadtxt('myarray.txt', delimiter=',')
         print(load_file)   # loaded as float values
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24.]

In [35]: load_file = np.loadtxt('myarray.txt', delimiter=',', dtype=int)
         print(load_file)   # loaded as int values
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]
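One optional refinement (a sketch of my own, not from the notebook): savetxt() accepts an fmt argument, so the values can be written as integers in the first place instead of being converted back with dtype=int on load. The file name 'myarray_int.txt' is just an example.

import numpy as np

arr2 = np.arange(25)
np.savetxt('myarray_int.txt', arr2, delimiter=',', fmt='%d')    # write the values as integers
print(np.loadtxt('myarray_int.txt', delimiter=',', dtype=int))  # round-trips without the float step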
What is a delimiter?
It is the specific character that separates one value from the next in a text file (.csv, .txt).

Install the Matplotlib Library
Used to plot various graphs in Python.

In [7]: !pip install matplotlib
Requirement already satisfied: matplotlib in c:\users\supaki\anaconda3\lib\site-packages (3.1.1)
Requirement already satisfied: cycler>=0.10 in c:\users\supaki\anaconda3\lib\site-packages (from matplotlib) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\supaki\anaconda3\lib\site-packages (from matplotlib) (1.1.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in c:\users\supaki\anaconda3\lib\site-packages (from matplotlib) (2.4.2)
Requirement already satisfied: python-dateutil>=2.1 in c:\users\supaki\anaconda3\lib\site-packages (from matplotlib) (2.8.0)
Requirement already satisfied: numpy>=1.11 in c:\users\supaki\anaconda3\lib\site-packages (from matplotlib) (1.16.5)
Requirement already satisfied: six in c:\users\supaki\anaconda3\lib\site-packages (from cycler>=0.10->matplotlib) (1.12.0)
Requirement already satisfied: setuptools in c:\users\supaki\anaconda3\lib\site-packages (from kiwisolver>=1.0.1->matplotlib) (41.4.0)

In [18]: import numpy as np
         import matplotlib.pyplot as plt

Explain np.meshgrid()
Builds coordinate matrices from two 1D coordinate arrays, so a function of x and y can be evaluated at every grid point.

In [10]: x = np.arange(3)
         y = np.arange(4, 8)
         print(x)
         print(y)
         # the arrays have different lengths: x has 3 elements, y has 4
[0 1 2]
[4 5 6 7]

In [12]: x2, y2 = np.meshgrid(x, y)
         print(x2)
[[0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]]

In [15]: print(y2)
[[4 4 4]
 [5 5 5]
 [6 6 6]
 [7 7 7]]

In [16]: z = 2 * x2 + 3 * y2
         print(z)
[[12 14 16]
 [15 17 19]
 [18 20 22]
 [21 23 25]]

Plot a Linear Function as a Heatmap

In [20]: plt.imshow(z)
Out[20]: <matplotlib.image.AxesImage at 0x2672519dac8>

Add a Title and a Colorbar

In [21]: plt.imshow(z)
         plt.title('Plot of 2x+3y')
         plt.colorbar()
Out[21]: <matplotlib.colorbar.Colorbar at 0x2672965a308>

Plot a Cosine Function as a Heatmap

In [22]: z2 = np.cos(x2) + np.cos(y2)
         print(z2)
[[ 0.34635638 -0.11334131 -1.06979046]
 [ 1.28366219  0.82396449 -0.13248465]
 [ 1.96017029  1.50047259  0.54402345]
 [ 1.75390225  1.29420456  0.33775542]]

In [24]: plt.imshow(z2)
         plt.title('Plot of cos(x2)+cos(y2)')
         plt.colorbar()
Out[24]: <matplotlib.colorbar.Colorbar at 0x26729023a88>

In [25]: # save the plot figure
         plt.imshow(z2)
         plt.title('Plot of cos(x2)+cos(y2)')
         plt.colorbar()
         plt.savefig('cos plot.png')
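As an optional extra (my own sketch, not part of the course material): using np.linspace for a denser grid makes this kind of heatmap much smoother. The function and the output file name below are arbitrary examples.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)   # 100 points instead of a handful
y = np.linspace(0, 2 * np.pi, 100)
x2, y2 = np.meshgrid(x, y)
z = np.sin(x2) * np.cos(y2)

plt.imshow(z, origin='lower', extent=[0, 2 * np.pi, 0, 2 * np.pi])
plt.title('Plot of sin(x)*cos(y)')
plt.colorbar()
plt.savefig('sin_cos_plot.png')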
In [7]: import numpy as np

Populate a list based on conditional values

In [8]: x = np.array([100, 400, -50, -40])            # each element is a candidate "a"
        y = np.array([10, 15, 20, 25])                # each element is a candidate "b"
        condition = np.array([True, True, False, False])
        # a list comprehension looping over several arrays at once
        z = [a if cond else b for a, cond, b in zip(x, condition, y)]
        print(z)
[100, 400, 20, 25]

Using np.where()
np.where(condition, value_if_true, value_if_false) tackles the disadvantage of the confusing list comprehension above.

In [10]: z2 = np.where(condition, x, y)
         print(z2)   # where the condition is True take the value from x, otherwise take the value from y
[100 400  20  25]

In [11]: z3 = np.where(x > 0, 1, x)
         print(z3)   # where x is greater than 0 take 1, otherwise keep x
[  1   1 -50 -40]

Standard functions in numpy: sum(), sum(0), mean(), std(), var()

In [12]: print(x)
         print(x.sum())   # prints the sum of all elements of the array
[100 400 -50 -40]
410

In [14]: x2 = np.array([[1, 2], [3, 4]])
         print(x2)
[[1 2]
 [3 4]]

In [15]: print(x2.sum(0))   # sum of each column of the array
[4 6]

In [17]: # mean, standard deviation, variance
         print(x.mean())
         print(x.std())
         print(x.var())
102.5
181.71062159378576
33018.75

Logical AND / OR operations - any(), all()

In [19]: condition2 = np.array([True, False, True])
         # OR
         print(condition2.any())
         # AND - True only if all conditions are True
         print(condition2.all())
True
False

sort() function

In [21]: unsorted = np.array([1, 2, 10, 6, 4, 3])
         print(unsorted)
         unsorted.sort()   # sorts the array in place
         print(unsorted)
[ 1  2 10  6  4  3]
[ 1  2  3  4  6 10]

unique()

In [22]: arr2 = np.array(['liquid', 'liquid', 'liquid', 'gas', 'gas', 'solid', 'solid'])
         print(arr2)
         # prints the set of unique values (one of each value that is present)
         print(np.unique(arr2))
['liquid' 'liquid' 'liquid' 'gas' 'gas' 'solid' 'solid']
['gas' 'liquid' 'solid']

in1d()

In [23]: # checks whether each element of a list is present in an array or not
         print(np.in1d(['solid', 'liquid', 'plasma'], arr2))
[ True  True False]

In [2]: import numpy as np

What is Pandas
Built on top of the numpy library, it provides the key data structures: Series and DataFrames.

What is a Series
A 1D array-like structure that can hold any type of data; one column of an Excel sheet is a good picture of a Series.

Install Pandas

In [3]: !pip install pandas
Requirement already satisfied: pandas in c:\users\supaki\anaconda3\lib\site-packages (0.25.1)
Requirement already satisfied: python-dateutil>=2.6.1 in c:\users\supaki\anaconda3\lib\site-packages (from pandas) (2.8.0)
Requirement already satisfied: numpy>=1.13.3 in c:\users\supaki\anaconda3\lib\site-packages (from pandas) (1.16.5)
Requirement already satisfied: pytz>=2017.2 in c:\users\supaki\anaconda3\lib\site-packages (from pandas) (2019.3)
Requirement already satisfied: six>=1.5 in c:\users\supaki\anaconda3\lib\site-packages (from python-dateutil>=2.6.1->pandas) (1.12.0)

In [4]: import pandas as pd
        from pandas import Series

Create a simple Series variable

In [5]: s1 = Series([5, 10, 15, 20])
        print(s1)
0     5
1    10
2    15
3    20
dtype: int64

In [6]: print(s1.values)   # print only the elements
[ 5 10 15 20]

In [9]: print(s1.index)    # print the index
RangeIndex(start=0, stop=4, step=1)

In [10]: print(s1.index.values)   # print the index values
[0 1 2 3]

Create a Series from a numpy array

In [11]: revenue_array = np.array([400, 300, 200, 100])
         print(revenue_array)
[400 300 200 100]

In [13]: revenue_series = Series(revenue_array)
         print(revenue_series)   # the elements of the numpy array are now present in the Series
0    400
1    300
2    200
3    100
dtype: int32

Create a Series with custom indexes

In [15]: revenue = Series(revenue_array, index=['uber', 'ola', 'lyft', 'gojek'])
         print(revenue)   # the indexes now have label values
uber     400
ola      300
lyft     200
gojek    100
dtype: int32

In [18]: # accessing elements in the Series
         revenue['lyft']
Out[18]: 200

In [19]: revenue['ola']
Out[19]: 300

Filter a Series based on conditions

In [20]: print(revenue[revenue > 250])
uber    400
ola     300
dtype: int32
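A small follow-on sketch (not from the original notebook): conditions on a Series can be combined with & and |, with each condition wrapped in parentheses.

import pandas as pd
from pandas import Series

revenue = Series([400, 300, 200, 100], index=['uber', 'ola', 'lyft', 'gojek'])
print(revenue[(revenue > 150) & (revenue < 350)])   # ola and lyft
print(revenue[(revenue < 150) | (revenue > 350)])   # uber and gojek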
Check whether an element is in a Series

In [21]: # check whether 'ola' is present in the Series or not
         print('ola' in revenue)
True

In [22]: print('swvl' in revenue)
False

to_dict() function

In [24]: # dictionaries are built-in data structures in Python
         revenue_dict = revenue.to_dict()
         print(revenue_dict)   # the Series converted into dictionary format
{'uber': 400, 'ola': 300, 'lyft': 200, 'gojek': 100}

Addition of two Series

In [28]: print(revenue)
         print('')
         print(revenue + revenue)
uber     400
ola      300
lyft     200
gojek    100
dtype: int32

uber     800
ola      600
lyft     400
gojek    200
dtype: int32

Assign names to a Series and its index

In [30]: # documenting a dataset such as a Series
         revenue.name = "Company-Revenue"      # name for the Series
         revenue.index.name = "Company Name"   # name for the index
         print(revenue)
Company Name
uber     400
ola      300
lyft     200
gojek    100
Name: Company-Revenue, dtype: int32

In [24]: import numpy as np
         import pandas as pd
         from pandas import Series, DataFrame

What is a DataFrame
A 2D data structure with three principal components: rows, columns and values. It is similar to a table in an Excel sheet, its axes (rows and columns) are labelled, and it is size mutable - the size of a DataFrame can be changed at any time.

Create a DataFrame from the clipboard
https://en.wikipedia.org/wiki/Table_(information)#:~:text=A%20table%20is%20an%20arrangement,signs%2C%20and%20many%20other%20

In [25]: age_df = pd.read_clipboard()
         print(age_df)
  First name    Last name  Age
0       Tinu     Elejogun   14
1  Blaszczyk  Kostrzewski   25
2       Lily    McGarrett   18
3  Olatunkbo     Chijiaku   22
4   Adrienne     Anthoula   22
5     Axelia   Athanasios   22
6  Jon-Kabat         Zinn   22
7    Thabang        Mosoa   15
8   Kgaogelo        Mosoa   11

Display column names

In [26]: age_df.columns
Out[26]: Index(['First name', 'Last name', 'Age'], dtype='object')

Access one or more columns

In [27]: # accessing one column
         age_df['First name']
Out[27]: 0         Tinu
         1    Blaszczyk
         2         Lily
         3    Olatunkbo
         4     Adrienne
         5       Axelia
         6    Jon-Kabat
         7      Thabang
         8     Kgaogelo
         Name: First name, dtype: object

In [28]: # accessing multiple columns
         age_df[['First name', 'Age']]
Out[28]:   First name  Age
         0       Tinu   14
         1  Blaszczyk   25
         2       Lily   18
         3  Olatunkbo   22
         4   Adrienne   22
         5     Axelia   22
         6  Jon-Kabat   22
         7    Thabang   15
         8   Kgaogelo   11

What is a NaN Value
A NaN value marks a value that is not present or missing. Pandas doesn't leave a cell blank; it fills it with NaN.

Create a column with NaN values

In [29]: age_df['rank'] = np.nan
         print(age_df)
  First name    Last name  Age  rank
0       Tinu     Elejogun   14   NaN
1  Blaszczyk  Kostrzewski   25   NaN
2       Lily    McGarrett   18   NaN
3  Olatunkbo     Chijiaku   22   NaN
4   Adrienne     Anthoula   22   NaN
5     Axelia   Athanasios   22   NaN
6  Jon-Kabat         Zinn   22   NaN
7    Thabang        Mosoa   15   NaN
8   Kgaogelo        Mosoa   11   NaN

Head and Tail Functions

In [30]: # head - the first 5 rows of the dataframe
         age_df.head()
Out[30]:   First name    Last name  Age  rank
         0       Tinu     Elejogun   14   NaN
         1  Blaszczyk  Kostrzewski   25   NaN
         2       Lily    McGarrett   18   NaN
         3  Olatunkbo     Chijiaku   22   NaN
         4   Adrienne     Anthoula   22   NaN

In [31]: # tail - the last 5 rows of the dataframe
         age_df.tail()
Out[31]:   First name   Last name  Age  rank
         4   Adrienne    Anthoula   22   NaN
         5     Axelia  Athanasios   22   NaN
         6  Jon-Kabat        Zinn   22   NaN
         7    Thabang       Mosoa   15   NaN
         8   Kgaogelo       Mosoa   11   NaN
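Because read_clipboard() depends on whatever happens to be copied, here is a reproducible alternative (a hedged sketch of my own, not from the notebook): the same kind of table can be read from an embedded string with read_csv and io.StringIO. Only a couple of rows are included for brevity.

import io
import pandas as pd

table_text = "First name,Last name,Age\nTinu,Elejogun,14\nLily,McGarrett,18"
age_sample = pd.read_csv(io.StringIO(table_text))
print(age_sample.head(2))   # head()/tail() also accept the number of rows to show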
Assign values to a DataFrame column using (1) a numpy array, (2) a Series

Numpy

In [32]: array_1 = np.arange(9)
         print(array_1)
[0 1 2 3 4 5 6 7 8]

In [33]: age_df['rank'] = array_1
         print(age_df)
  First name    Last name  Age  rank
0       Tinu     Elejogun   14     0
1  Blaszczyk  Kostrzewski   25     1
2       Lily    McGarrett   18     2
3  Olatunkbo     Chijiaku   22     3
4   Adrienne     Anthoula   22     4
5     Axelia   Athanasios   22     5
6  Jon-Kabat         Zinn   22     6
7    Thabang        Mosoa   15     7
8   Kgaogelo        Mosoa   11     8

Series

In [34]: newrank_series = Series([10, 11, 12], index=[3, 5, 6])
         age_df['rank'] = newrank_series   # only the rows whose index matches get a value; the rest become NaN
         print(age_df)
  First name    Last name  Age  rank
0       Tinu     Elejogun   14   NaN
1  Blaszczyk  Kostrzewski   25   NaN
2       Lily    McGarrett   18   NaN
3  Olatunkbo     Chijiaku   22  10.0
4   Adrienne     Anthoula   22   NaN
5     Axelia   Athanasios   22  11.0
6  Jon-Kabat         Zinn   22  12.0
7    Thabang        Mosoa   15   NaN
8   Kgaogelo        Mosoa   11   NaN

Delete a column

In [35]: # when you delete a column you can never get it back
         del(age_df['Last name'])
         print(age_df)
  First name  Age  rank
0       Tinu   14   NaN
1  Blaszczyk   25   NaN
2       Lily   18   NaN
3  Olatunkbo   22  10.0
4   Adrienne   22   NaN
5     Axelia   22  11.0
6  Jon-Kabat   22  12.0
7    Thabang   15   NaN
8   Kgaogelo   11   NaN

Convert a dictionary into a DataFrame

In [36]: sample_dict = {
             "company": ['Nestle', 'PG'],
             "profit": [1000, 500]
         }
         print(sample_dict)
{'company': ['Nestle', 'PG'], 'profit': [1000, 500]}

In [37]: # change the dict into a dataframe
         sample_df = DataFrame(sample_dict)
         print(sample_df)
  company  profit
0  Nestle    1000
1      PG     500

In [6]: import numpy as np
        import pandas as pd
        from pandas import Series, DataFrame

        s1 = Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
        print(s1)
a    10
b    20
c    30
d    40
dtype: int64

Why the index is important
Both Series and DataFrame structures use indexes to refer to their rows and columns.

Index Array

In [7]: # fetch the index
        index_obj = s1.index
        print(index_obj)
Index(['a', 'b', 'c', 'd'], dtype='object')

In [8]: index_obj[0]
Out[8]: 'a'

Negative Indexes

In [10]: index_obj[-2:]   # the last two elements
Out[10]: Index(['c', 'd'], dtype='object')

In [11]: index_obj[:-2]   # all elements except the last two
Out[11]: Index(['a', 'b'], dtype='object')

Range of Indexes

In [12]: index_obj[2:4]
Out[12]: Index(['c', 'd'], dtype='object')

Warning: you can never change a Series/DataFrame index once it has been assigned

In [13]: print(index_obj)
Index(['a', 'b', 'c', 'd'], dtype='object')

In [14]: index_obj[0] = "W"
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-a73b5743157a> in <module>
----> 1 index_obj[0] = "W"

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   4258
   4259     def __setitem__(self, key, value):
-> 4260         raise TypeError("Index does not support mutable operations")
   4261
   4262     def __getitem__(self, key):

TypeError: Index does not support mutable operations

Workaround if you want to change an index name

In [15]: print(s1.rename(index={'a': 'W'}))
         print('')
         print(s1)
W    10
b    20
c    30
d    40
dtype: int64

a    10
b    20
c    30
d    40
dtype: int64

In [16]: # make it permanent by reassigning the renamed Series
         s1 = s1.rename(index={'a': 'W'})
         print(s1)
W    10
b    20
c    30
d    40
dtype: int64

Reindexing in Pandas Series and DataFrames

Reindexing in Series - the reindex() method

In [17]: s2 = Series([1, 2, 3, 4], index=['m', 'n', 'o', 'p'])
         print(s2)
m    1
n    2
o    3
p    4
dtype: int64

In [18]: s2 = s2.reindex(['m', 'n', 'o', 'p', 'q', 'r'])
         print(s2)   # the new labels 'q' and 'r' have no values, so they are filled with NaN
m    1.0
n    2.0
o    3.0
p    4.0
q    NaN
r    NaN
dtype: float64
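One more reindexing pattern worth knowing (a sketch of mine, not from the notebook): a Series can be reindexed to another object's index to line the two up; labels missing from the source become NaN.

import pandas as pd
from pandas import Series

s_a = Series([1, 2, 3], index=['m', 'n', 'o'])
s_b = Series([10, 20, 30], index=['n', 'o', 'p'])
print(s_a.reindex(s_b.index))   # 'p' is not in s_a, so it becomes NaN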
Reindexing in Series - reindex() with fill_value

In [20]: s2 = s2.reindex(['m', 'n', 'o', 'p', 'q', 'r', 's'], fill_value=100)
         print(s2)   # fill_value only applies to labels that are newly created by this reindex call
m      1.0
n      2.0
o      3.0
p      4.0
q      NaN
r      NaN
s    100.0
dtype: float64

Reindexing in Series - forward fill

In [22]: cars = Series(['Audi', 'Mercedes', 'BMW'], index=[0, 4, 8])
         print(cars)
0        Audi
4    Mercedes
8         BMW
dtype: object

In [24]: new_index = range(12)   # the numbers 0 to 11
         # forward fill: each new index takes the last known value,
         # so indexes 0-3 get Audi, 4-7 get Mercedes and 8-11 get BMW
         cars = cars.reindex(new_index, method='ffill')
         print(cars)
0         Audi
1         Audi
2         Audi
3         Audi
4     Mercedes
5     Mercedes
6     Mercedes
7     Mercedes
8          BMW
9          BMW
10         BMW
11         BMW
dtype: object

Reindexing in DataFrames

In [25]: # create a numpy array of 25 random numbers and reshape it to 5 rows and 5 columns
         df1 = DataFrame(np.random.randn(25).reshape(5, 5),
                         index=['a', 'b', 'c', 'd', 'e'],
                         columns=['c1', 'c2', 'c3', 'c4', 'c5'])
         print(df1)
         c1        c2        c3        c4        c5
a  1.011185 -0.852577 -0.724630 -1.335125 -0.452088
b  0.616100  1.105497 -0.561416 -1.457150 -0.425128
c  0.927090 -0.347877 -0.110454 -0.117803 -0.751450
d -0.558997  0.842959 -1.072806 -1.484860  0.643806
e -2.440240  0.717430  0.237937 -1.123031  1.065598

In [26]: # reindexing both axes: the new row 'f' and the new column 'c6' are filled with NaN
         df1 = df1.reindex(index=['a', 'b', 'c', 'd', 'e', 'f'],
                           columns=['c1', 'c2', 'c3', 'c4', 'c5', 'c6'])
         print(df1)
         c1        c2        c3        c4        c5  c6
a  1.011185 -0.852577 -0.724630 -1.335125 -0.452088 NaN
b  0.616100  1.105497 -0.561416 -1.457150 -0.425128 NaN
c  0.927090 -0.347877 -0.110454 -0.117803 -0.751450 NaN
d -0.558997  0.842959 -1.072806 -1.484860  0.643806 NaN
e -2.440240  0.717430  0.237937 -1.123031  1.065598 NaN
f       NaN       NaN       NaN       NaN       NaN NaN

Dropping Entries in Pandas Series and DataFrames

Drop values from a Series

In [27]: best_cars = Series(['BMW', 'Audi', 'Mercedes'], index=['a', 'b', 'c'])
         print(best_cars)
a         BMW
b        Audi
c    Mercedes
dtype: object

In [28]: best_cars = best_cars.drop('a')
         print(best_cars)
b        Audi
c    Mercedes
dtype: object

Drop Rows from a DataFrame

In [30]: cars_df = DataFrame(np.random.randn(9).reshape(3, 3),
                             index=['BMW', 'Audi', 'Mercedes'],
                             columns=['test1', 'test2', 'test3'])
         print(cars_df)
             test1     test2     test3
BMW       1.011705 -1.040975 -1.129744
Audi     -1.888311 -0.721882  0.500407
Mercedes  0.629233  1.581129  1.668745

In [31]: # drop an index label: the entire BMW row is dropped
         cars_df = cars_df.drop('BMW')
         print(cars_df)
             test1     test2     test3
Audi     -1.888311 -0.721882  0.500407
Mercedes  0.629233  1.581129  1.668745
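A brief aside (my own sketch, not part of the original notebook): drop() also accepts a list of labels, so several rows can be removed in one call; the DataFrame here mirrors the cars_df example above.

import numpy as np
from pandas import DataFrame

df = DataFrame(np.arange(9).reshape(3, 3),
               index=['BMW', 'Audi', 'Mercedes'],
               columns=['test1', 'test2', 'test3'])
print(df.drop(['BMW', 'Audi']))   # drops both rows, leaving only Mercedes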
Drop Columns from a DataFrame

In [32]: # try to remove the column test1 without specifying the axis
         cars_df = cars_df.drop('test1')
         print(cars_df)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-32-c18237fa546b> in <module>
      1 # try to remove the column test1 without specifying the axis
----> 2 cars_df = cars_df.drop('test1')
      3 print(cars_df)

~\Anaconda3\lib\site-packages\pandas\core\frame.py in drop(self, labels, axis, index, columns, level, inplace, errors)
~\Anaconda3\lib\site-packages\pandas\core\generic.py in drop(self, labels, axis, index, columns, level, inplace, errors)
~\Anaconda3\lib\site-packages\pandas\core\generic.py in _drop_axis(self, labels, axis, level, errors)
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in drop(self, labels, errors)

KeyError: "['test1'] not found in axis"

drop() looks for the label along the row axis by default, so it raises a KeyError when the label is actually a column name.

In [33]: # remove the column test1: the axis value for columns is 1, for rows it is 0
         cars_df = cars_df.drop('test1', axis=1)
         print(cars_df)
             test2     test3
Audi     -0.721882  0.500407
Mercedes  1.581129  1.668745

In [1]: import numpy as np
        import pandas as pd
        from pandas import Series, DataFrame

In [2]: revenue_series = Series([100, 200, 300, np.nan], index=['Audi', 'Mercedes', 'BMW', 'VW'])
        print(revenue_series)
Audi        100.0
Mercedes    200.0
BMW         300.0
VW            NaN
dtype: float64

dropna() along the row

Check for null values using isnull()

In [3]: # check whether a null value is present in the Series; the outcome for VW should be True
        revenue_series.isnull()
Out[3]: Audi        False
        Mercedes    False
        BMW         False
        VW           True
        dtype: bool

Series - dropna()

In [5]: # drop the null value: call the Series and use the dropna() function
        revenue_series.dropna()
Out[5]: Audi        100.0
        Mercedes    200.0
        BMW         300.0
        dtype: float64

DataFrame - dropna()

In [6]: df1 = DataFrame(np.random.randn(12).reshape(4, 3))
        print(df1)
          0         1         2
0 -0.953114 -1.281652 -0.407395
1  1.996550 -0.466889 -1.120392
2 -0.600661  0.061281 -0.340421
3  0.941932  1.039773 -0.093596

In [7]: # loc is used to select individual elements of a dataframe and assign values
        df1.loc[1, 2] = np.nan   # 2nd row, 3rd column
        df1.loc[2, 1] = np.nan   # 3rd row, 2nd column
        df1.loc[3, ] = np.nan    # the whole 4th row
        print(df1)
          0         1         2
0 -0.953114 -1.281652 -0.407395
1  1.996550 -0.466889       NaN
2 -0.600661       NaN -0.340421
3       NaN       NaN       NaN

In [8]: df1.dropna()   # every row containing a NaN value is dropped
Out[8]:           0         1         2
        0 -0.953114 -1.281652 -0.407395

Disadvantage of dropping all NaN values in a DataFrame
Dropping every row or column that contains a NaN value is usually a bad idea, because the majority of the data can be deleted.

DataFrame dropna() with the how parameter

In [11]: print(df1)
          0         1         2
0 -0.953114 -1.281652 -0.407395
1  1.996550 -0.466889       NaN
2 -0.600661       NaN -0.340421
3       NaN       NaN       NaN

In [10]: # handle null data more carefully:
         # how='all' drops a row only if all of its elements are null
         df1.dropna(how='all')   # only the last row was dropped
Out[10]:           0         1         2
         0 -0.953114 -1.281652 -0.407395
         1  1.996550 -0.466889       NaN
         2 -0.600661       NaN -0.340421

DataFrame - dropna() along columns

In [13]: df1.dropna(axis=1)   # the axis for rows is 0, for columns it is 1
         # every column contains at least one NaN value, so all columns are dropped and only the empty row index remains
Out[13]:
0
1
2
3
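Before choosing between how=, axis= or thresh=, it often helps to count the missing values first. Here is a short sketch of my own (not from the notebook) using isnull().sum().

import numpy as np
from pandas import DataFrame

df = DataFrame([[1.0, np.nan], [np.nan, np.nan], [3.0, 4.0]], columns=['a', 'b'])
print(df.isnull().sum())        # missing values per column
print(df.isnull().sum(axis=1))  # missing values per row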
dropna() with the thresh parameter

In [15]: df2 = DataFrame([[1, 2, 3, np.nan],
                          [4, 5, 6, 7],
                          [8, 9, np.nan, np.nan],
                          [12, np.nan, np.nan, np.nan]])
         print(df2)
    0    1    2    3
0   1  2.0  3.0  NaN
1   4  5.0  6.0  7.0
2   8  9.0  NaN  NaN
3  12  NaN  NaN  NaN

thresh=3 tells pandas to keep a row only if it has at least 3 actual (non-null) values; if a row has fewer than 3 values, it is deleted.

In [16]: df2.dropna(thresh=3)
Out[16]:    0    1    2    3
         0  1  2.0  3.0  NaN
         1  4  5.0  6.0  7.0

In [17]: df2.dropna(thresh=2)   # at least 2 actual values are required, otherwise the row is deleted
Out[17]:    0    1    2    3
         0  1  2.0  3.0  NaN
         1  4  5.0  6.0  7.0
         2  8  9.0  NaN  NaN

fillna() function

In [19]: # fill the NaN values instead of dropping them
         df2.fillna(0)   # fill every NaN value with zero
Out[19]:     0    1    2    3
         0   1  2.0  3.0  0.0
         1   4  5.0  6.0  7.0
         2   8  9.0  0.0  0.0
         3  12  0.0  0.0  0.0

In [20]: # fill each column with a different value by passing a dictionary
         df2.fillna({0: 0, 1: 50, 2: 100, 3: 200})
Out[20]:     0     1      2      3
         0   1   2.0    3.0  200.0
         1   4   5.0    6.0    7.0
         2   8   9.0  100.0  200.0
         3  12  50.0  100.0  200.0

In [2]: import numpy as np
        import pandas as pd
        from pandas import Series, DataFrame

        s1 = Series([100, 200, 300], index=['a', 'b', 'c'])
        print(s1)
a    100
b    200
c    300
dtype: int64

Access a single element of a Series

In [4]: s1['a']   # pass the index label
Out[4]: 100

In [5]: s1['c']
Out[5]: 300

Access multiple elements of a Series

In [7]: s1[['a', 'c']]   # pass a list of labels
Out[7]: a    100
        c    300
        dtype: int64

Using Numerical Indexes

In [9]: s1[2]   # accessing the 3rd element by position
Out[9]: 300

Conditional Indexes

In [10]: # if you don't know the indexes beforehand, you can use conditions
         s1[s1 > 100]
Out[10]: b    200
         c    300
         dtype: int64

In [11]: s1[s1 == 200]
Out[11]: b    200
         dtype: int64

Accessing one or more columns of a DataFrame

In [12]: df1 = DataFrame(np.arange(9).reshape(3, 3),
                         index=['car', 'bike', 'motorcycle'],
                         columns=['c1', 'c2', 'c3'])
         print(df1)
            c1  c2  c3
car          0   1   2
bike         3   4   5
motorcycle   6   7   8

In [13]: # access one column at a time
         df1['c3']
Out[13]: car           2
         bike          5
         motorcycle    8
         Name: c3, dtype: int32

In [15]: # access multiple columns
         df1[['c1', 'c3']]
Out[15]:             c1  c3
         car          0   2
         bike         3   5
         motorcycle   6   8

Boolean Operations on a DataFrame

In [17]: print(df1)
            c1  c2  c3
car          0   1   2
bike         3   4   5
motorcycle   6   7   8

In [18]: # show which values are greater than 5 (i.e. satisfy a particular condition)
         df1 > 5   # values greater than 5 are shown as True
Out[18]:               c1     c2     c3
         car        False  False  False
         bike       False  False  False
         motorcycle  True   True   True

Using the DataFrame.loc[] attribute
The loc attribute can be used to access the elements of a DataFrame: a complete row, a complete column, or individual elements.

In [21]: # access a row
         df1.loc['car']   # for the index 'car' all columns are returned
Out[21]: c1    0
         c2    1
         c3    2
         Name: car, dtype: int32

In [23]: # access a column
         df1.loc[:, 'c1']   # [row, column]
Out[23]: car           0
         bike          3
         motorcycle    6
         Name: c1, dtype: int32

In [27]: # access an individual element of a dataframe
         df1.loc['bike', 'c1']   # pass the index label and the column label
Out[27]: 3
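Alongside label-based .loc there is position-based .iloc; the sketch below (my addition, not from the notebook) reuses the df1 layout from the cells above.

import numpy as np
from pandas import DataFrame

df1 = DataFrame(np.arange(9).reshape(3, 3),
                index=['car', 'bike', 'motorcycle'],
                columns=['c1', 'c2', 'c3'])
print(df1.iloc[0])      # first row by position, same as df1.loc['car']
print(df1.iloc[1, 2])   # row position 1, column position 2 -> 5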
Coordinate and Regulate Data in Pandas

In [28]: myseries = Series([100, 200, 300], index=['a', 'b', 'c'])
         print(myseries)
a    100
b    200
c    300
dtype: int64

Add 2 Series

In [31]: myseries_1 = Series([400, 500, 600, 700], index=['a', 'b', 'c', 'd'])
         print(myseries_1)
a    400
b    500
c    600
d    700
dtype: int64

In [30]: myseries + myseries_1
         # 'd' exists only in one Series; something unknown (NaN) + a value = something unknown (NaN)
Out[30]: a    500.0
         b    700.0
         c    900.0
         d      NaN
         dtype: float64

Add 2 DataFrames

In [33]: df1 = DataFrame(np.arange(4).reshape(2, 2), columns=['a', 'b'], index=['car', 'bike'])
         print(df1)
      a  b
car   0  1
bike  2  3

In [34]: df2 = DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'], index=['car', 'bike', 'cycle'])
         print(df2)
       a  b  c
car    0  1  2
bike   3  4  5
cycle  6  7  8

In [35]: df1 + df2
Out[35]:          a    b   c
         bike   5.0  7.0 NaN
         car    0.0  2.0 NaN
         cycle  NaN  NaN NaN

Use the add() function so that you don't end up with NaN values

In [37]: # labels that are not present in df1 are treated as 0 and then added to df2
         df1 = df1.add(df2, fill_value=0)
         print(df1)
         a    b    c
bike   5.0  7.0  5.0
car    0.0  2.0  2.0
cycle  6.0  7.0  8.0

Subtract a Series from a DataFrame

In [39]: # a DataFrame is a collection of Series, where each column (and each row) is a Series
         myseries_2 = df2.loc['car']
         print(myseries_2)
a    0
b    1
c    2
Name: car, dtype: int32

In [40]: # subtract myseries_2 from df2; the Series is aligned with the columns and subtracted from every row
         df2 - myseries_2
Out[40]:        a  b  c
         car    0  0  0
         bike   3  3  3
         cycle  6  6  6

Ranking and Sorting in Pandas Series

In [44]: ser1 = Series([500, 600, 550], index=['a', 'c', 'b'])
         print(ser1)
a    500
c    600
b    550
dtype: int64

Sorting by Index

In [45]: ser1.sort_index()
Out[45]: a    500
         b    550
         c    600
         dtype: int64

Sorting by Values

In [46]: ser1.sort_values()   # ascending order
Out[46]: a    500
         b    550
         c    600
         dtype: int64

Ranking of a Series

In [47]: ser2 = Series([10, 15, 14, 12, 6, 7])
         print(ser2)
0    10
1    15
2    14
3    12
4     6
5     7
dtype: int64

In [48]: ser2.rank()   # the smallest element gets rank 1
Out[48]: 0    3.0
         1    6.0
         2    5.0
         3    4.0
         4    1.0
         5    2.0
         dtype: float64

Sorting uses the ranking function

In [52]: print(ser2.rank())          # ranks before sorting
         print('')
         ser2 = ser2.sort_values()   # sort the Series and reassign it to ser2
         print('')
         print(ser2.rank())          # the ranks are unchanged, but the rows now appear in rank order
0    3.0
1    6.0
2    5.0
3    4.0
4    1.0
5    2.0
dtype: float64


4    1.0
5    2.0
0    3.0
3    4.0
2    5.0
1    6.0
dtype: float64

In [2]: import numpy as np
        import pandas as pd
        from pandas import Series, DataFrame
        import matplotlib.pyplot as plt

In [3]: array1 = np.array([[10, np.nan, 20], [30, 40, np.nan], [50, np.nan, 60]])
        print(array1)
[[10. nan 20.]
 [30. 40. nan]
 [50. nan 60.]]

In [4]: df1 = DataFrame(array1, columns=list('ABC'))   # list('ABC') gives the column labels A, B, C
        print(df1)
      A     B     C
0  10.0   NaN  20.0
1  30.0  40.0   NaN
2  50.0   NaN  60.0

sum() function

In [5]: # sum() treats NaN values as zero
        df1.sum()   # sum along each column
Out[5]: A    90.0
        B    40.0
        C    80.0
        dtype: float64

In [9]: # sum along the rows
        df1.sum(axis=1)
Out[9]: 0     30.0
        1     70.0
        2    110.0
        dtype: float64

In [7]: # minimum value in each column
        df1.min()
Out[7]: A    10.0
        B    40.0
        C    20.0
        dtype: float64

In [8]: # maximum value in each column
        df1.max()
Out[8]: A    50.0
        B    40.0
        C    60.0
        dtype: float64

In [10]: # index of the maximum value in each column
         df1.idxmax()
Out[10]: A    2
         B    1
         C    2
         dtype: int64

In [11]: # cumulative sum down each column
         df1.cumsum()
Out[11]:       A     B     C
         0  10.0   NaN  20.0
         1  40.0  40.0   NaN
         2  90.0   NaN  80.0

In [12]: df1
Out[12]:       A     B     C
         0  10.0   NaN  20.0
         1  30.0  40.0   NaN
         2  50.0   NaN  60.0

describe() function

In [13]: df1.describe()   # gives statistical values for each numerical column
Out[13]:           A     B          C
         count   3.0   1.0   2.000000
         mean   30.0  40.0  40.000000
         std    20.0   NaN  28.284271
         min    10.0  40.0  20.000000
         25%    20.0  40.0  30.000000
         50%    30.0  40.0  40.000000
         75%    40.0  40.0  50.000000
         max    50.0  40.0  60.000000

Graph of a DataFrame

In [15]: cars_df = DataFrame(np.random.randn(9).reshape(3, 3),
                             index=['BMW', 'Audi', 'Mercedes'],
                             columns=['test1', 'test2', 'test3'])
         print(cars_df)
             test1     test2     test3
BMW       0.975933 -1.037808 -0.391437
Audi     -0.993646  0.063885 -0.971377
Mercedes -4.254867  0.487722  0.056041

In [16]: # plot a graph of the cars' tests; the indexes run along the x-axis
         plt.plot(cars_df)
Out[16]: [<matplotlib.lines.Line2D at 0x273c45ec648>,
          <matplotlib.lines.Line2D at 0x273c8a62d48>,
          <matplotlib.lines.Line2D at 0x273c8a62f08>]

In [17]: plt.plot(cars_df)
         plt.legend(cars_df.columns, loc='lower right')
         plt.savefig('cars_df.png')
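pandas also wraps matplotlib directly: DataFrame.plot() draws one line per column and builds the legend for you. This is a hedged alternative sketch of my own, not the approach used in the notebook; the output file name is arbitrary.

import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame

cars_df = DataFrame(np.random.randn(9).reshape(3, 3),
                    index=['BMW', 'Audi', 'Mercedes'],
                    columns=['test1', 'test2', 'test3'])
cars_df.plot()                    # one line per test column, legend added automatically
plt.savefig('cars_df_plot.png')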
Unique elements of a Series

In [18]: ser1 = Series(list('aaabbbcccdabd'))
         print(ser1)
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     c
9     d
10    a
11    b
12    d
dtype: object

In [19]: ser1.unique()   # displays only the unique elements
Out[19]: array(['a', 'b', 'c', 'd'], dtype=object)

Frequency of elements in a Series
Shows how many times each unique element is repeated.

In [20]: ser1.value_counts()   # the value 'a' is repeated 4 times
Out[20]: a    4
         b    4
         c    3
         d    2
         dtype: int64
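As a final aside (my own sketch, not part of the course notebook): value_counts() can also report relative frequencies with normalize=True.

from pandas import Series

ser1 = Series(list('aaabbbcccdabd'))
print(ser1.value_counts(normalize=True))   # the fraction of the total for each unique value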