
Data Analysis

In [3]:
!pip install pandas
Requirement already satisfied: pandas in c:\users\supaki\appdata\local\programs\python\python38\lib\site-packages (1.1.2)
Requirement already satisfied: pytz>=2017.2 in c:\users\supaki\appdata\local\programs\python\python38\lib\site-packages (from pandas) (2020.1)
Requirement already satisfied: numpy>=1.15.4 in c:\users\supaki\appdata\local\programs\python\python38\lib\site-packages (from pandas) (1.19.2)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\supaki\appdata\local\programs\python\python38\lib\site-packages (from pandas) (2.8.1)
Requirement already satisfied: six>=1.5 in c:\users\supaki\appdata\local\programs\python\python38\lib\site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
WARNING: You are using pip version 19.2.3, however version 20.2.3 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
In [4]:
!pip uninstall -y pandas
Uninstalling pandas-1.1.2:
Successfully uninstalled pandas-1.1.2
In [5]:
!cd #See current working directory
The system cannot find the path specified.
(On Windows the comment text after !cd is passed to the shell as an argument, which is why this fails; the next cell runs !cd on its own.)
In [6]:
!cd
C:\Users\supaki\OneDrive\DATA SCIENCE LEARNING\Udemy\Python Data Analysis & Visualization Bootcamp
In [7]:
!pip freeze > AllPackages.txt
List all files and directories in your current working directory
In [8]:
!dir
 Volume in drive C has no label.
 Volume Serial Number is 9EE3-C949

 Directory of C:\Users\supaki\OneDrive\DATA SCIENCE LEARNING\Udemy\Python Data Analysis & Visualization Bootcamp

09/12/2020  09:09 AM    <DIR>          .
09/12/2020  09:09 AM    <DIR>          ..
09/12/2020  08:23 AM    <DIR>          .ipynb_checkpoints
09/12/2020  09:20 AM               914 AllPackages.txt
09/12/2020  09:09 AM            10,766 pip & python package index.ipynb
               2 File(s)         11,680 bytes
               3 Dir(s)  351,185,477,632 bytes free
Numpy Arrays
In [10]:
In [11]:
import numpy as np
#creating a 1D Array
mylist1=[101,102,103]
myArray=np.array(mylist1)
print(myArray)
[101 102 103]
In [12]:
#creating a 2D np Array
mylist2=[201,202,203]
myArray2D=np.array([mylist1,mylist2])
print(myArray2D)
[[101 102 103]
[201 202 203]]
Finding Dimensions of Array-Shape
In [13]:
print("myArray Dimensions")
print(myArray.shape) #Tells numbers of rows and columns in this array
print("myArray2D Dimensions")
print(myArray2D.shape) #has 2 rows and 3 columns
myArray Dimensions
(3,)
myArray2D Dimensions
(2, 3)
Finding Datatype of Array - dtype
dtype gives the data type of the elements stored in the array
In [14]:
print("myArray Dimensions")
print(myArray.dtype)
print("myArray2D Dimensions")
print(myArray2D.dtype)
#int32 - 32 bit integer
myArray Dimensions
int32
myArray2D Dimensions
int32
Array Creation and Initialization Functions - zeros function
Create and initialise a NumPy array whose elements are all zero
In [15]:
zero_array=np.zeros(5)
print(zero_array)
[0. 0. 0. 0. 0.]
In [16]:
zero_array=np.zeros([5,5]) #create an array with 5rows and 5 columns
print(zero_array)
[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
Array Creation and Initialization Functions - ones function
Create and initialise a NumPy array whose elements are all one
In [17]:
print("1D")
zero_array=np.ones(5)
print(zero_array)
print("2D")
zero_array2D=np.ones([5,5])
print(zero_array2D)
1D
[1. 1. 1. 1. 1.]
2D
[[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]]
Array Creation and Initialization Functions - empty function
np.empty creates an array without initialising its elements, so they contain whatever junk values were already in memory
In [18]:
print("1D")
zero_array=np.empty(5)
print(zero_array)
print("2D")
zero_array2D=np.empty([5,5])
print(zero_array2D)
1D
[1. 1. 1. 1. 1.]
2D
[[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]]
Array Creation and Initialization Functions - eye function
Creates an identity matrix: the diagonal elements are 1 and the rest are 0. It is always 2D
In [21]:
identity_array = np.eye(3)
print(identity_array) #Diagonal elements are 1
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
Array Creation and Initialization Functions - arange function
Parameters: start value, end value (exclusive), and the step between elements (an arithmetic progression)
In [22]:
AP_array=np.arange(0,50,2)
print(AP_array)
[ 0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46
 48]
In [ ]:
Scalar Operations in Numpy Array
Simple mathematical operation that is performed on each array element.
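For illustration, here is a minimal sketch (not part of the original notebook) of true scalar operations: the scalar is applied to every element of the array.
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr + 10)   # [11 12 13 14]
print(arr * 2)    # [2 4 6 8]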
In [1]:
import numpy as np
Scalar Array Multiplication - using * operator
In [2]:
arr1=np.array([[1,2,3],[5,6,7]])
print(arr1)
[[1 2 3]
[5 6 7]]
In [3]:
arr2=arr1*arr1
print(arr2)
#each element of arr1 is a square of itself
[[ 1 4 9]
[25 36 49]]
Exponential Multiplication - using ** operator
In [4]:
arr3=arr1**2
print(arr3)
[[ 1 4 9]
[25 36 49]]
In [5]:
arr2==arr3
Out[5]: array([[ True,  True,  True],
               [ True,  True,  True]])
In [6]:
arr3=arr1**3
print(arr3)
[[  1   8  27]
 [125 216 343]]
Scalar subtraction of Array numbers
In [7]:
print(arr3)
[[  1   8  27]
 [125 216 343]]
In [8]:
print(arr1)
[[1 2 3]
[5 6 7]]
In [9]:
arr4=arr3-arr1
print(arr4)
[[  0   6  24]
 [120 210 336]]
Scalar division of Array elements
In [10]:
arr5=1/arr1
print(arr5)
[[1.         0.5        0.33333333]
 [0.2        0.16666667 0.14285714]]
Array Indexes
Introduction to Indexes
In [11]:
myArray = np.arange(100,160,2)
print(myArray) #starts from 100 ends at 159
[100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 130 132 134
136 138 140 142 144 146 148 150 152 154 156 158]
Access Individual Array Elements
In [12]:
print("First Element:")
print(myArray[0])
print("Third Element:")
print(myArray[3])
print("","") #Space between the two methods
#Another method of accessing individual array elements on same line
print("Sixth Element:", myArray[6])
print("Tenth Element:", myArray[10])
First Element:
100
Third Element:
106
Sixth Element: 112
Tenth Element: 120
Slicing of Array Indexes arr[start:stop:step]
Slicing is used to access contents/elements of an N-D array
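As a quick sketch (assumed example, not from the original notebook), the step and negative indices work like this:
import numpy as np
sample = np.arange(10)   # [0 1 2 3 4 5 6 7 8 9]
print(sample[::2])       # every second element -> [0 2 4 6 8]
print(sample[::-1])      # reversed             -> [9 8 7 6 5 4 3 2 1 0]
print(sample[-3:])       # last three elements  -> [7 8 9]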
In [13]:
myArray[0:5:1] #To access up to index number 4 you write 5, because the stop index is exclusive
#Gives the first 5 elements of the array
Out[13]: array([100, 102, 104, 106, 108])
Updating Array using Slices
In [14]:
myArray[0:5]=0 #Set the elements at indexes 0 to 4 to 0
print(myArray)
[  0   0   0   0   0 110 112 114 116 118 120 122 124 126 128 130 132 134
 136 138 140 142 144 146 148 150 152 154 156 158]
Slicing memory allocation, view vs copy
In [15]:
myArray2=myArray[4:10] #index values of 4 to 9
print(myArray2)
[  0 110 112 114 116 118]
In [16]:
#Update myArray2
myArray2[:]=1
print(myArray2)
[1 1 1 1 1 1]
In [17]:
print(myArray)
[  0   0   0   0   1   1   1   1   1   1 120 122 124 126 128 130 132 134
 136 138 140 142 144 146 148 150 152 154 156 158]
copy() function=> creating new numpy memory for array
In [18]:
myArray3=myArray.copy()
print(myArray3)
[  0   0   0   0   1   1   1   1   1   1 120 122 124 126 128 130 132 134
 136 138 140 142 144 146 148 150 152 154 156 158]
In [19]:
In [20]:
myArray3[:]=0
print(myArray3)
print(myArray) #All elements of myArray did not change
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[  0   0   0   0   1   1   1   1   1   1 120 122 124 126 128 130 132 134
 136 138 140 142 144 146 148 150 152 154 156 158]
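As a quick check (a sketch, not in the original notebook), np.shares_memory can confirm whether a slice is a view or a copy:
import numpy as np
base = np.arange(100, 160, 2)
view = base[4:10]          # slicing gives a view: it shares memory with base
copy = base[4:10].copy()   # copy() allocates new memory
print(np.shares_memory(base, view))   # True
print(np.shares_memory(base, copy))   # False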
Array Indexes in Multi-Dimensional Array
In [21]:
arr2d=np.array([[1,2,3],[4,5,6],[7,8,9]])
print(arr2d)
[[1 2 3]
[4 5 6]
[7 8 9]]
Accessing Rows
In [22]:
#Accessing different rows in the array
print(arr2d[0]) #first row
print(arr2d[1]) #second row of array
[1 2 3]
[4 5 6]
Accessing Elements of Array
In [23]:
print(arr2d[0][0]) #intersection of row 1 and column 1 (row,column)
1
In [24]:
#print number 6 from arr2d
print(arr2d[1][2])
#value 6 is in the row index 1 (2nd row) and column index 2 (3rd column)
6
In [25]:
#Example 1
ex_array=np.array([[1,2,3,4,5],[6,7,8,9,10],[11,12,13,14,15],[16,17,18,19,20],[21,22,23,24,25],[26,27,28,29,30]])
print(ex_array)
print("")
print("OR")
print(" ")
array1=np.arange(1,31).reshape(6,5)
print(array1)
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]
 [16 17 18 19 20]
 [21 22 23 24 25]
 [26 27 28 29 30]]

OR

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]
 [16 17 18 19 20]
 [21 22 23 24 25]
 [26 27 28 29 30]]
In [26]:
#print 2,8,14,20
print(array1[[0,1,2,3],[1,2,3,4]])
[ 2  8 14 20]
In [27]:
#Example 2:
array2=np.ones([5,5])
print(array2)
[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]
In [28]:
array2_int=np.ones([5,5],dtype=int) #change to integer
print(array2_int)
[[1 1 1 1 1]
 [1 1 1 1 1]
 [1 1 1 1 1]
 [1 1 1 1 1]
 [1 1 1 1 1]]
In [29]:
#matrix_1=array2_int[1:3][1:3]
#print(matrix_1)
In [ ]:
Slicing in 2D Arrays
In [30]:
print(arr2d)
[[1 2 3]
[4 5 6]
[7 8 9]]
In [31]:
slice1=arr2d[0:1,0:2] #first row, first & second column
print(slice1)
[[1 2]]
In [32]:
slice2=arr2d[0:2,0:3] #first 2 rows and first 3 columns
print(slice2)
[[1 2 3]
[4 5 6]]
In [33]:
slice3=arr2d[:2,:3] #print rows till 2nd row and columns till 3rd column
print(slice3)
[[1 2 3]
[4 5 6]]
In [34]:
slice4=arr2d[1:,2:] #print row starting row 2 and colums starting column 3
print(slice4)
[[6]
[9]]
Using Loops to change index
Update array values using loops
In [35]:
arr_len=arr2d.shape #Get length of the array
print(arr_len)
nrows=arr_len[0]
(3, 3)
In [36]:
for i in range(nrows): #nrows is 3
    arr2d[i][0] = i #First element of each row, assign to value of i
    print(arr2d)
[[0 2 3]
 [4 5 6]
 [7 8 9]]
[[0 2 3]
 [1 5 6]
 [7 8 9]]
[[0 2 3]
 [1 5 6]
 [2 8 9]]
Accessing rows using list of index values
In [37]:
print(arr2d[[0,1]]) #mention row value you want
[[0 2 3]
[1 5 6]]
In [ ]:
Premium Array Operations
In [6]:
import numpy as np
arange() Function
In [7]:
A=np.arange(15)
print(A)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
In [8]:
B=np.arange(5,10)
print(B)
[5 6 7 8 9]
In [9]:
C=np.arange(5,10,2) #start from 5 end at 9 but increment by 2
print(C)
[5 7 9]
sqrt() Function
In [10]:
D=np.sqrt(A) #gives the square root of each element in array A
print(D)
[0.         1.         1.41421356 1.73205081 2.         2.23606798
 2.44948974 2.64575131 2.82842712 3.         3.16227766 3.31662479
 3.46410162 3.60555128 3.74165739]
exp() Function
Exponential Function
In [12]:
E=np.exp(A)
print(E)
[1.00000000e+00 2.71828183e+00 7.38905610e+00 2.00855369e+01
 5.45981500e+01 1.48413159e+02 4.03428793e+02 1.09663316e+03
 2.98095799e+03 8.10308393e+03 2.20264658e+04 5.98741417e+04
 1.62754791e+05 4.42413392e+05 1.20260428e+06]
random() Function
np.random.randn creates an array of the given shape filled with random numbers drawn from a standard normal distribution. Whenever you run it again, a different set of random numbers is generated.
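If you want the same "random" numbers on every run, a minimal sketch (not in the original notebook) is to seed the generator first:
import numpy as np
np.random.seed(42)          # fixing the seed makes the results repeatable
print(np.random.randn(3))
np.random.seed(42)
print(np.random.randn(3))   # same three values as above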
In [14]:
F=np.random.randn(5) #(5) is size of array
print(F) #1D array having 5 random values
[0.12211897 0.83655809 0.85458522 0.06846784 1.79424511]
In [15]:
F=np.random.randn(5,5) #2D Array
print(F)
[[ 0.75830792 0.15049377 0.82932155 0.53882024 0.58156692]
[ 0.32583286 0.79532875 1.48664644 0.44572963 1.06960642]
[-0.23963189 -0.16518398 -0.83799173 -1.05410172 -0.81569001]
[-0.20836428 0.96537819 -0.27722683 0.63144831 -0.19772012]
[ 0.10362402 -0.92146184 -0.79719747 -0.13387981 1.54196333]]
add() Function
In [16]:
print(A)
print(D)
G=np.add(A,D) #Corresponding elements of A and D are added and stored in the new array G
print(G)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14]
[0.         1.         1.41421356 1.73205081 2.         2.23606798
 2.44948974 2.64575131 2.82842712 3.         3.16227766 3.31662479
 3.46410162 3.60555128 3.74165739]
[ 0.          2.          3.41421356  4.73205081  6.          7.23606798
  8.44948974  9.64575131 10.82842712 12.         13.16227766 14.31662479
 15.46410162 16.60555128 17.74165739]
maximum() Function
In [25]:
H = np.array([1,5,7,10])
I = np.array([0,6,8,9])
J = np.maximum(H,I) #Maximum value of the corresponding elements in the 2 arrays (H & I)
#e.g. position 0: max(1,0)=1, position 1: max(5,6)=6
print(J)
[ 1  6  8 10]
Additional Numpy Documentation
numpy.org, docs.scipy.org
Saving and Loading Arrays to External Memory
In [26]:
arr=np.arange(10)
print(arr)
[0 1 2 3 4 5 6 7 8 9]
In [ ]:
#Why do we need to save arrays to hard drive?
Saving Single Array
In [27]:
#np.save() - Function to save an array.
np.save('saved_array',arr) #Extension for saved array is .npy
Loading Single Array
In [28]:
#np.load() - Loading a saved array
load_arr1=np.load('saved_array.npy')
print(load_arr1)
[0 1 2 3 4 5 6 7 8 9]
Saving Multiple Array
In [29]:
arr2=np.arange(25)
arr3=np.arange(5)
#np.savez - save function for multiple arrays and saves as a zip or .npz
#np.save - save function for a single array and stored as .npy
np.savez('saved_archieve.npz', x=arr2, y=arr3)
Loading Multiple Array
In [30]:
load_npz=np.load('saved_archieve.npz')
print(load_npz['x'])
print(load_npz['y'])
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24]
[0 1 2 3 4]
Saving Arrays to text file
In [32]:
np.savetxt('myarray.txt',arr2,delimiter=',') #text file name, the array to save, and a delimiter
Loading Arrays from text file
In [33]:
load_file = np.loadtxt('myarray.txt',delimiter=',')
print(load_file) #loaded as a float value
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24.]
In [35]:
load_file = np.loadtxt('myarray.txt',delimiter=',',dtype=int)
print(load_file) #loaded as an int value
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24]
What is a delimiter?
It is a specific character that separates one value from the next in a text file (.csv, .txt)
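As a small sketch (the file name myarray_tab.txt is hypothetical, not from the original notebook), any character can be used as the delimiter as long as the same one is used when reading back:
import numpy as np
demo = np.arange(25)                                       # demo array for illustration
np.savetxt('myarray_tab.txt', demo, delimiter='\t')        # tab as the delimiter this time
print(np.loadtxt('myarray_tab.txt', delimiter='\t')[:5])   # read back with the same delimiter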
In [ ]:
Install Matplotlib Library
Used to plot various graphs in python
In [7]:
!pip install matplotlib
Requirement already satisfied: matplotlib in c:\users\supaki\anaconda3\lib\site-packages (3.1.1)
Requirement already satisfied: cycler>=0.10 in c:\users\supaki\anaconda3\lib\site-packages (from matplotlib) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\supaki\anaconda3\lib\site-packages (from matplotlib) (1.1.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in c:\users\supaki\anaconda3\lib\site-packages (from matplotlib) (2.4.2)
Requirement already satisfied: python-dateutil>=2.1 in c:\users\supaki\anaconda3\lib\site-packages (from matplotlib) (2.8.0)
Requirement already satisfied: numpy>=1.11 in c:\users\supaki\anaconda3\lib\site-packages (from matplotlib) (1.16.5)
Requirement already satisfied: six in c:\users\supaki\anaconda3\lib\site-packages (from cycler>=0.10->matplotlib) (1.12.0)
Requirement already satisfied: setuptools in c:\users\supaki\anaconda3\lib\site-packages (from kiwisolver>=1.0.1->matplotlib) (41.4.0)
In [18]:
import numpy as np
import matplotlib.pyplot as plt
Explain np.meshgrid()
Used to create coordinate matrices (a grid) from two 1D coordinate vectors
In [10]:
x=np.arange(3)
y=np.arange(4,8)
print(x)
print(y)
#lengths of array is different. x is 3, y is 4
[0 1 2]
[4 5 6 7]
In [14]:
In [12]:
x2,y2=np.meshgrid(x,y)
print(x2)
[[0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]]
In [15]:
print(y2)
[[4 4 4]
 [5 5 5]
 [6 6 6]
 [7 7 7]]
In [16]:
z=2*x2+3*y2
print(z)
[[12 14 16]
 [15 17 19]
 [18 20 22]
 [21 23 25]]
Plot a linear function heatmap
In [20]:
plt.imshow(z)
Out[20]: <matplotlib.image.AxesImage at 0x2672519dac8>
Add title and colorbar
In [21]:
plt.imshow(z)
plt.title('Plot of 2x+3y')
plt.colorbar()
Out[21]: <matplotlib.colorbar.Colorbar at 0x2672965a308>
Plot a cos function heatmap
In [22]:
z2=np.cos(x2)+np.cos(y2)
print(z2)
[[ 0.34635638 -0.11334131 -1.06979046]
 [ 1.28366219  0.82396449 -0.13248465]
 [ 1.96017029  1.50047259  0.54402345]
 [ 1.75390225  1.29420456  0.33775542]]
In [24]:
plt.imshow(z2)
plt.title('Plot of cos(x2)+cos(y2)')
plt.colorbar()
Out[24]: <matplotlib.colorbar.Colorbar at 0x26729023a88>
In [25]:
# Save plot figure
plt.imshow(z2)
plt.title('Plot of cos(x2)+cos(y2)')
plt.colorbar()
plt.savefig('cos plot.png')
In [ ]:
In [7]:
import numpy as np
Populate a list based on conditional values
In [8]:
x=np.array([100,400,-50,-40]) #each element a
y=np.array([10,15,20,25]) #each element b
condition=np.array([True,True,False,False])
z= [a if cond else b for a,cond,b in zip(x,condition,y)]
#zip lets the for loop iterate over multiple arrays in parallel
print(z)
[100, 400, 20, 25]
Using np.where()
Tackles the drawback of the confusing code above: np.where(condition, value_if_true, value_if_false)
In [10]:
z2=np.where(condition,x,y)
print(z2) #if condition is true pull in the value from x, else pull in the value from y
[100 400  20  25]
In [11]:
z3=np.where(x>0,1,x)
print(z3)
#if x is greater than 0 take value of 1 or else take x
[  1   1 -50 -40]
Standard functions in numpy, sum, sum(0), mean(), std(), var()
In [12]:
print(x)
print(x.sum()) #prints sum of all elements of the array
[100 400 -50 -40]
410
In [14]:
x2=np.array([[1,2],[3,4]])
print(x2)
[[1 2]
[3 4]]
In [15]:
print(x2.sum(0)) #sum of each column in the array
[4 6]
In [17]:
#Mean, std, variance
print(x.mean())
print(x.std())
print(x.var())
102.5
181.71062159378576
33018.75
Logical AND, OR operation - any(), all()
In [19]:
condition2=np.array([True,False,True])
#OR
print(condition2.any())
#AND
print(condition2.all()) #if all conditions are true
True
False
sort() function
In [21]:
unsorted=np.array([1,2,10,6,4,3])
print(unsorted)
unsorted.sort() #sorting array
print(unsorted)
[ 1  2 10  6  4  3]
[ 1  2  3  4  6 10]
unique()
In [22]:
arr2=np.array(['liquid','liquid','liquid','gas','gas','solid','solid'])
print(arr2)
#prints set of unique values (one of each value that is present)
print(np.unique(arr2))
['liquid' 'liquid' 'liquid' 'gas' 'gas' 'solid' 'solid']
['gas' 'liquid' 'solid']
in1d()
In [23]:
#Checks whether each element of a list is present in an array or not
print(np.in1d(['solid','liquid','plasma'],arr2))
[ True  True False]
In [ ]:
In [2]:
import numpy as np
What is Pandas
Built on the NumPy library. Provides the key data structures: Series and DataFrame
What is Series
A 1D labelled array that can hold any type of data
One column of an Excel sheet is a good description of a Series
Install Pandas
In [3]:
!pip install pandas
Requirement already satisfied: pandas in c:\users\supaki\anaconda3\lib\site-packages (0.25.1)
Requirement already satisfied: python-dateutil>=2.6.1 in c:\users\supaki\anaconda3\lib\site-packages (from pandas) (2.8.0)
Requirement already satisfied: numpy>=1.13.3 in c:\users\supaki\anaconda3\lib\site-packages (from pandas) (1.16.5)
Requirement already satisfied: pytz>=2017.2 in c:\users\supaki\anaconda3\lib\site-packages (from pandas) (2019.3)
Requirement already satisfied: six>=1.5 in c:\users\supaki\anaconda3\lib\site-packages (from python-dateutil>=2.6.1->pandas) (1.12.0)
In [4]:
import pandas as pd
from pandas import Series
Create a simple series variable
In [5]:
s1=Series([5,10,15,20])
print(s1)
0     5
1    10
2    15
3    20
dtype: int64
In [6]:
print(s1.values) #printing only the elements
[ 5 10 15 20]
In [9]:
print(s1.index) #print index
RangeIndex(start=0, stop=4, step=1)
In [10]:
print(s1.index.values) #print index values
[0 1 2 3]
Create series from numpy
In [11]:
revenue_array=np.array([400,300,200,100])
print(revenue_array)
[400 300 200 100]
In [13]:
revenue_series=Series(revenue_array)
print(revenue_series) #elements of numpy array are now present in the series
0    400
1    300
2    200
3    100
dtype: int32
Create series with custom indexes
In [15]:
revenue=Series(revenue_array, index=['uber','ola','lyft','gojek'])
print(revenue)
#indexes have label values
uber     400
ola      300
lyft     200
gojek    100
dtype: int32
In [18]:
#Accessing elements in the series
revenue['lyft']
Out[18]: 200
In [19]:
revenue['ola']
Out[19]: 300
Filter a series based on conditions
In [20]:
print(revenue[revenue>250])
uber    400
ola     300
dtype: int32
Check whether an element is in a series
In [21]:
#Check whether ola is present in the series or not
print('ola' in revenue)
True
In [22]:
print('swvl' in revenue)
False
to_dict() function
In [24]:
#Dictionaries are data structures in python
revenue_dict=revenue.to_dict()
print(revenue_dict) #format of a dictionary. Using series to create a dictionary
{'uber': 400, 'ola': 300, 'lyft': 200, 'gojek': 100}
Addition of 2 Series
In [28]:
print(revenue)
print('')
print(revenue+revenue)
uber     400
ola      300
lyft     200
gojek    100
dtype: int32

uber     800
ola      600
lyft     400
gojek    200
dtype: int32
Assign names to a series and its index
In [30]:
#Documenting a dataset such as a series
revenue.name="Company-Revenue" #name for the series
revenue.index.name="Company Name"
print(revenue)
Company Name
uber     400
ola      300
lyft     200
gojek    100
Name: Company-Revenue, dtype: int32
In [24]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
What is a DataFrame
A 2D data structure with 3 principal components: rows, columns and values
Similar to a table in an Excel sheet
Labelled axes: rows and columns are labelled
Size mutable: the size of a DataFrame can be changed at any time
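A minimal sketch (with hypothetical data, not from the original notebook) showing labelled rows, labelled columns, and size mutability:
import pandas as pd
scores_df = pd.DataFrame({'math': [90, 80], 'physics': [85, 70]}, index=['alice', 'bob'])
print(scores_df)                     # labelled rows (index) and labelled columns
scores_df['chemistry'] = [75, 95]    # size mutable: a new column can be added at any time
print(scores_df)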
Create a DataFrame from clipboard
https://en.wikipedia.org/wiki/Table_(information)#:~:text=A%20table%20is%20an%20arrangement,signs%2C%20and%20many%20other%20
In [25]:
age_df=pd.read_clipboard()
print(age_df)
  First name    Last name  Age
0       Tinu     Elejogun   14
1  Blaszczyk  Kostrzewski   25
2       Lily    McGarrett   18
3  Olatunkbo     Chijiaku   22
4   Adrienne     Anthoula   22
5     Axelia   Athanasios   22
6  Jon-Kabat         Zinn   22
7    Thabang        Mosoa   15
8   Kgaogelo        Mosoa   11
Display column names
In [26]:
age_df.columns
Out[26]: Index(['First name', 'Last name', 'Age'], dtype='object')
Access one or more columns
In [27]:
#Accessing one column
age_df['First name']
Out[27]:
0         Tinu
1    Blaszczyk
2         Lily
3    Olatunkbo
4     Adrienne
5       Axelia
6    Jon-Kabat
7      Thabang
8     Kgaogelo
Name: First name, dtype: object
In [28]:
#Accessing multiple column names
age_df[['First name','Age']]
Out[28]:
  First name  Age
0       Tinu   14
1  Blaszczyk   25
2       Lily   18
3  Olatunkbo   22
4   Adrienne   22
5     Axelia   22
6  Jon-Kabat   22
7    Thabang   15
8   Kgaogelo   11
What is NAN Value
NaN values indicate that a value is not present (missing)
Pandas doesn't leave a cell blank; it fills it with a NaN value
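A small sketch (not from the original notebook) of how NaN behaves and how to detect it:
import numpy as np
import pandas as pd
print(np.nan == np.nan)   # False: NaN is not even equal to itself
print(pd.isna(np.nan))    # True: use pd.isna()/pd.isnull() to detect missing values
print(pd.isna(5))         # False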
Create a column with NAN Values
In [29]:
age_df['rank'] = np.nan
print(age_df)
  First name    Last name  Age  rank
0       Tinu     Elejogun   14   NaN
1  Blaszczyk  Kostrzewski   25   NaN
2       Lily    McGarrett   18   NaN
3  Olatunkbo     Chijiaku   22   NaN
4   Adrienne     Anthoula   22   NaN
5     Axelia   Athanasios   22   NaN
6  Jon-Kabat         Zinn   22   NaN
7    Thabang        Mosoa   15   NaN
8   Kgaogelo        Mosoa   11   NaN
Head and Tail Functions
In [30]:
#Head - first 5 rows of the dataframe
age_df.head()
Out[30]:
  First name    Last name  Age  rank
0       Tinu     Elejogun   14   NaN
1  Blaszczyk  Kostrzewski   25   NaN
2       Lily    McGarrett   18   NaN
3  Olatunkbo     Chijiaku   22   NaN
4   Adrienne     Anthoula   22   NaN
In [31]:
#Tail - last 5 rows of dataframe
age_df.tail()
Out[31]:
  First name   Last name  Age  rank
4   Adrienne    Anthoula   22   NaN
5     Axelia  Athanasios   22   NaN
6  Jon-Kabat        Zinn   22   NaN
7    Thabang       Mosoa   15   NaN
8   Kgaogelo       Mosoa   11   NaN
Assign values to a dataframe using 1. Numpy 2. Series
Numpy
In [32]:
array_1=np.arange(9)
print(array_1)
[0 1 2 3 4 5 6 7 8]
In [33]:
age_df['rank'] = array_1
print(age_df)
  First name    Last name  Age  rank
0       Tinu     Elejogun   14     0
1  Blaszczyk  Kostrzewski   25     1
2       Lily    McGarrett   18     2
3  Olatunkbo     Chijiaku   22     3
4   Adrienne     Anthoula   22     4
5     Axelia   Athanasios   22     5
6  Jon-Kabat         Zinn   22     6
7    Thabang        Mosoa   15     7
8   Kgaogelo        Mosoa   11     8
Series
In [34]:
newrank_series=Series([10,11,12],index=[3,5,6])
age_df['rank']=newrank_series
print(age_df)
  First name    Last name  Age  rank
0       Tinu     Elejogun   14   NaN
1  Blaszczyk  Kostrzewski   25   NaN
2       Lily    McGarrett   18   NaN
3  Olatunkbo     Chijiaku   22  10.0
4   Adrienne     Anthoula   22   NaN
5     Axelia   Athanasios   22  11.0
6  Jon-Kabat         Zinn   22  12.0
7    Thabang        Mosoa   15   NaN
8   Kgaogelo        Mosoa   11   NaN
Delete a column
In [35]:
#When you delete a column you can never get it back
del(age_df['Last name'])
print(age_df)
  First name  Age  rank
0       Tinu   14   NaN
1  Blaszczyk   25   NaN
2       Lily   18   NaN
3  Olatunkbo   22  10.0
4   Adrienne   22   NaN
5     Axelia   22  11.0
6  Jon-Kabat   22  12.0
7    Thabang   15   NaN
8   Kgaogelo   11   NaN
Convert a dictionary into a DataFrame
In [36]:
sample_dict={
"company":['Nestle', 'PG'],
"profit":[1000,500]
}
print(sample_dict)
{'company': ['Nestle', 'PG'], 'profit': [1000, 500]}
In [37]:
#Change dict to dataframe
sample_df=DataFrame(sample_dict)
print(sample_df)
  company  profit
0  Nestle    1000
1      PG     500
In [ ]:
In [5]:
In [6]:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
s1=Series([10,20,30,40],index=['a','b','c','d'])
print(s1)
a    10
b    20
c    30
d    40
dtype: int64
Why index is important
Both Series and DataFrame structures use indexes to refer to rows and columns.
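As a quick sketch (hypothetical series, not in the original notebook), the index allows label-based access, while iloc gives position-based access:
from pandas import Series
s_demo = Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s_demo.loc['b'])    # label-based access through the index -> 20
print(s_demo.iloc[1])     # position-based access                -> 20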
Index Array
In [7]:
#fetch index
index_obj=s1.index
print(index_obj)
Index(['a', 'b', 'c', 'd'], dtype='object')
In [8]:
index_obj[0]
Out[8]: 'a'
Negative Indexes
In [10]:
index_obj[-2:] #bring the last two elements
Out[10]: Index(['c', 'd'], dtype='object')
In [11]:
index_obj[:-2] #All elements except the last 2 elements
Out[11]: Index(['a', 'b'], dtype='object')
Range of Indexes
In [12]:
index_obj[2:4]
Out[12]: Index(['c', 'd'], dtype='object')
Warning: You can never change a series/dataframe index once assigned
In [13]:
print(index_obj)
Index(['a', 'b', 'c', 'd'], dtype='object')
In [14]:
index_obj[0] = "W"
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-a73b5743157a> in <module>
----> 1 index_obj[0] = "W"

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   4258
   4259     def __setitem__(self, key, value):
-> 4260         raise TypeError("Index does not support mutable operations")
   4261
   4262     def __getitem__(self, key):

TypeError: Index does not support mutable operations
Workaround if you want to change index name
In [15]:
print(s1.rename(index={'a':'W'}))
print('')
print(s1)
W    10
b    20
c    30
d    40
dtype: int64

a    10
b    20
c    30
d    40
dtype: int64
In [16]:
#Making it permanent by reassigning the renamed series
s1 = s1.rename(index={'a':'W'})
print(s1)
W    10
b    20
c    30
d    40
dtype: int64
Reindexing in Panda Series and DataFrames
Reindexing in Series - reindex() method
In [17]:
s2=Series([1,2,3,4], index=['m','n','o','p'])
print(s2)
m    1
n    2
o    3
p    4
dtype: int64
In [18]:
s2=s2.reindex(['m','n','o','p','q','r'])
print(s2)
m    1.0
n    2.0
o    3.0
p    4.0
q    NaN
r    NaN
dtype: float64
Reindexing in Series - reindex() method with fill_value
In [20]:
s2=s2.reindex(['m','n','o','p','q','r','s'], fill_value=100)
print(s2)
#fill_value only applies to indexes newly created by this reindex call (here 's'); 'q' and 'r' already existed as NaN
m      1.0
n      2.0
o      3.0
p      4.0
q      NaN
r      NaN
s    100.0
dtype: float64
Reindexing in Series - forwardfill
In [22]:
cars=Series(['Audi','Mercedes','BMW'], index=[0,4,8])
print(cars)
0        Audi
4    Mercedes
8         BMW
dtype: object
In [24]:
new_index=range(12) #creates list of numbers 0 to 11
cars=cars.reindex(new_index,method='ffill')
#forward fill - creates a new index of 0 to 11. Indices 0 to 3 get Audi as the value, 4 to 7 get Mercedes, 8 to 11 get BMW
print(cars)
0         Audi
1         Audi
2         Audi
3         Audi
4     Mercedes
5     Mercedes
6     Mercedes
7     Mercedes
8          BMW
9          BMW
10         BMW
11         BMW
dtype: object
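For contrast, a sketch (not in the original notebook) of the same reindex using backward fill instead:
from pandas import Series
cars_bfill = Series(['Audi', 'Mercedes', 'BMW'], index=[0, 4, 8])
cars_bfill = cars_bfill.reindex(range(12), method='bfill')
print(cars_bfill)
# index 0 keeps Audi; 1 to 3 are filled backward from index 4 (Mercedes);
# 5 to 7 from index 8 (BMW); 9 to 11 stay NaN because there is no later value to fill from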
Reindexing in DataFrame
In [25]:
df1=DataFrame(np.random.randn(25).reshape(5,5),index=['a','b','c','d','e'],columns=['c1','c2','c3','c4','c5'])
print(df1)
#creates a numpy array of 25 random numbers
#reshape to 5 rows and 5 columns
         c1        c2        c3        c4        c5
a  1.011185 -0.852577 -0.724630 -1.335125 -0.452088
b  0.616100  1.105497 -0.561416 -1.457150 -0.425128
c  0.927090 -0.347877 -0.110454 -0.117803 -0.751450
d -0.558997  0.842959 -1.072806 -1.484860  0.643806
e -2.440240  0.717430  0.237937 -1.123031  1.065598
In [26]:
#reindexing
df1=df1.reindex(index=['a','b','c','d','e','f'],columns=['c1','c2','c3','c4','c5','c6'])
print(df1)
         c1        c2        c3        c4        c5  c6
a  1.011185 -0.852577 -0.724630 -1.335125 -0.452088 NaN
b  0.616100  1.105497 -0.561416 -1.457150 -0.425128 NaN
c  0.927090 -0.347877 -0.110454 -0.117803 -0.751450 NaN
d -0.558997  0.842959 -1.072806 -1.484860  0.643806 NaN
e -2.440240  0.717430  0.237937 -1.123031  1.065598 NaN
f       NaN       NaN       NaN       NaN       NaN NaN
Dropping Entries in Pandas Series and DataFrames
Drop values from series
In [27]:
best_cars=Series(['BMW','Audi','Mercedes'], index=['a','b','c'])
print(best_cars)
a         BMW
b        Audi
c    Mercedes
dtype: object
In [28]:
best_cars=best_cars.drop('a')
print(best_cars)
b        Audi
c    Mercedes
dtype: object
Drop Rows from DataFrame
In [30]:
cars_df=DataFrame(np.random.randn(9).reshape(3,3),index=['BMW','Audi','Mercedes'],columns=['test1','test2','test3'])
print(cars_df)
             test1     test2     test3
BMW       1.011705 -1.040975 -1.129744
Audi     -1.888311 -0.721882  0.500407
Mercedes  0.629233  1.581129  1.668745
In [31]:
#Drop an index
cars_df=cars_df.drop('BMW') #Entire row for BMW is dropped
print(cars_df)
             test1     test2     test3
Audi     -1.888311 -0.721882  0.500407
Mercedes  0.629233  1.581129  1.668745
Drop Columns from DataFrame
In [32]:
#Remove column test1
cars_df=cars_df.drop('test1')
print(cars_df)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-32-c18237fa546b> in <module>
      1 #Remove column test1
----> 2 cars_df=cars_df.drop('test1')
      3 print(cars_df)

~\Anaconda3\lib\site-packages\pandas\core\frame.py in drop(self, labels, axis, index, columns, level, inplace, errors)
   4100             level=level,
   4101             inplace=inplace,
-> 4102             errors=errors,
   4103         )
   4104

~\Anaconda3\lib\site-packages\pandas\core\generic.py in drop(self, labels, axis, index, columns, level, inplace, errors)
   3912         for axis, labels in axes.items():
   3913             if labels is not None:
-> 3914                 obj = obj._drop_axis(labels, axis, level=level, errors=errors)
   3915
   3916         if inplace:

~\Anaconda3\lib\site-packages\pandas\core\generic.py in _drop_axis(self, labels, axis, level, errors)
   3944             new_axis = axis.drop(labels, level=level, errors=errors)
   3945         else:
-> 3946             new_axis = axis.drop(labels, errors=errors)
   3947         result = self.reindex(**{axis_name: new_axis})
   3948

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in drop(self, labels, errors)
   5338         if mask.any():
   5339             if errors != "ignore":
-> 5340                 raise KeyError("{} not found in axis".format(labels[mask]))
   5341             indexer = indexer[~mask]
   5342         return self.delete(indexer)

KeyError: "['test1'] not found in axis"
In [33]:
#Remove column test1
cars_df=cars_df.drop('test1', axis=1) #axis value for column is 1 and rows 0
print(cars_df)
             test2     test3
Audi     -0.721882  0.500407
Mercedes  1.581129  1.668745
In [ ]:
In [1]:
In [2]:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
revenue_series=Series([100,200,300,np.nan], index=['Audi','Mercedes','BMW','VW'])
print(revenue_series)
Audi        100.0
Mercedes    200.0
BMW         300.0
VW            NaN
dtype: float64
Dropna() along the row
Check for null using isnull
In [3]:
#Check whether null value is present in a series. Outcome should be True
revenue_series.isnull()
Out[3]:
Audi        False
Mercedes    False
BMW         False
VW           True
dtype: bool
Series-dropna()
In [5]:
#Dropping the null value, call series name and pass the dropna() function
revenue_series.dropna()
Out[5]:
Audi        100.0
Mercedes    200.0
BMW         300.0
dtype: float64
DataFrame - dropna()
In [6]:
df1=DataFrame(np.random.randn(12).reshape(4,3))
print(df1)
          0         1         2
0 -0.953114 -1.281652 -0.407395
1  1.996550 -0.466889 -1.120392
2 -0.600661  0.061281 -0.340421
3  0.941932  1.039773 -0.093596
In [7]:
#loc function is used to select individual elements of a dataframe and assign values
df1.loc[1,2]= np.nan #2nd row and 3rd column
df1.loc[2,1] = np.nan #3rd row and 2nd column
df1.loc[3,] = np.nan #whole 4th row
print(df1)
          0         1         2
0 -0.953114 -1.281652 -0.407395
1  1.996550 -0.466889       NaN
2 -0.600661       NaN -0.340421
3       NaN       NaN       NaN
In [8]:
df1.dropna() #All NAN values were dropped
Out[8]:
          0         1         2
0 -0.953114 -1.281652 -0.407395
Disadvantages of dropping all NAN values in dataframe
In [9]:
df1.dropna()
#All rows with NAN Values were dropped and this is disadvantageous
Out[9]:
          0         1         2
0 -0.953114 -1.281652 -0.407395
Dropping all rows or columns that contain NaN values is often a bad idea because the majority of the data may be deleted.
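A quick way (a sketch, not in the original notebook) to see how much data a blanket dropna() would throw away before committing to it:
print(df1.isna().sum())                       # how many NaN values each column holds
print(df1.shape, '->', df1.dropna().shape)    # row count before vs after dropping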
DataFrame dropna() with how parameter
In [11]:
print(df1)
          0         1         2
0 -0.953114 -1.281652 -0.407395
1  1.996550 -0.466889       NaN
2 -0.600661       NaN -0.340421
3       NaN       NaN       NaN
In [10]:
#Handling Null data efficiently
# how='all' parameter drops a row only if all elements of the row are null
df1.dropna(how='all') #only row 4 was dropped
Out[10]:
          0         1         2
0 -0.953114 -1.281652 -0.407395
1  1.996550 -0.466889       NaN
2 -0.600661       NaN -0.340421
DataFrame - dropna() along column
In [13]:
df1.dropna(axis=1) #axis for indexes is 0, for columns is 1
#all columns with NAN values are dropped, hence no data
Out[13]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
Dropna with thresh parameter
In [15]:
df2=DataFrame([[1,2,3,np.nan],
[4,5,6,7],
[8,9,np.nan,np.nan],
[12,np.nan,np.nan,np.nan]
])
print(df2)
    0     1    2    3
0   1   2.0  3.0  NaN
1   4   5.0  6.0  7.0
2   8   9.0  NaN  NaN
3  12   NaN  NaN  NaN
The thresh parameter: thresh=3 tells pandas to keep a row only if it has at least 3 non-null values; if the number of non-null values in a row is less than 3, the row is dropped.
In [16]:
df2.dropna(thresh=3)
Out[16]:
   0    1    2    3
0  1  2.0  3.0  NaN
1  4  5.0  6.0  7.0
In [17]:
df2.dropna(thresh=2) #We need 2 actual values to be present, else delete
Out[17]:
   0    1    2    3
0  1  2.0  3.0  NaN
1  4  5.0  6.0  7.0
2  8  9.0  NaN  NaN
fillna() function
In [19]:
#Fill the NAN values instead of dropping them
df2.fillna(0) #Fill NAN value with zero
Out[19]:
    0     1    2    3
0   1   2.0  3.0  0.0
1   4   5.0  6.0  7.0
2   8   9.0  0.0  0.0
3  12   0.0  0.0  0.0
In [20]:
#fill each column with a different value, pass dictionary to it with sufficient data
df2.fillna({0:0,1:50,2:100,3:200}) #column of nan values filled
Out[20]:
    0     1      2      3
0   1   2.0    3.0  200.0
1   4   5.0    6.0    7.0
2   8   9.0  100.0  200.0
3  12  50.0  100.0  200.0
In [ ]:
In [1]:
In [2]:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
s1=Series([100,200,300], index=['a','b','c'])
print(s1)
a    100
b    200
c    300
dtype: int64
Access single element of series
In [4]:
s1['a'] #pass index value
Out[4]: 100
In [5]:
s1['c']
Out[5]: 300
Access multiple elements of series
In [7]:
s1[['a','c']] #pass in a list of values
Out[7]:
a    100
c    300
dtype: int64
Using Numerical Indexes
In [9]:
s1[2] #accessing 3rd element
Out[9]: 300
Conditional Indexes
In [10]:
#if you don't know the indexes beforehand you can use conditions
s1[s1>100]
Out[10]:
b    200
c    300
dtype: int64
In [11]:
s1[s1==200]
Out[11]:
b    200
dtype: int64
Accessing one or more column data from a dataframe
In [12]:
df1=DataFrame(np.arange(9).reshape(3,3),index=['car','bike','motorcycle'],columns=['c1','c2','c3'])
print(df1)
            c1  c2  c3
car          0   1   2
bike         3   4   5
motorcycle   6   7   8
In [13]:
#Access 1 column at a time
df1['c3']
Out[13]:
car           2
bike          5
motorcycle    8
Name: c3, dtype: int32
In [15]:
#Access multiple columns
df1[ ['c1','c3'] ]
Out[15]:
            c1  c3
car          0   2
bike         3   5
motorcycle   6   8
Boolean Operations in a DataFrame
In [17]:
print(df1)
            c1  c2  c3
car          0   1   2
bike         3   4   5
motorcycle   6   7   8
In [18]:
#show what values are greater than 5 (satisfying a particular condition or not)
df1>5 #greater than 5 is printed as True
Out[18]:
               c1     c2     c3
car         False  False  False
bike        False  False  False
motorcycle   True   True   True
Using the DataFrame.loc[] attribute
The loc attribute can be used to access the elements of a DataFrame (a complete row, a complete column, or individual elements)
In [21]:
#Access rows
df1.loc['car'] #index car we have all columns present
Out[21]:
c1    0
c2    1
c3    2
Name: car, dtype: int32
In [23]:
#Access columns
df1.loc[:,'c1'] #[row,column]
Out[23]:
car           0
bike          3
motorcycle    6
Name: c1, dtype: int32
In [27]:
#Access individual elements of a dataframme
df1.loc['bike','c1'] #pass index and column variable
Out[27]: 3
Coordinate and Regulate Data in Pandas
In [28]:
myseries=Series([100,200,300],index=['a','b','c'])
print(myseries)
a    100
b    200
c    300
dtype: int64
Add 2 Series
In [31]:
myseries_1=Series([400,500,600,700],index=['a','b','c','d'])
print(myseries_1)
a    400
b    500
c    600
d    700
dtype: int64
In [30]:
myseries+myseries_1 #NaN: something unknown + a value = still unknown (NaN)
Out[30]:
a    200.0
b    400.0
c    600.0
d      NaN
dtype: float64
Add 2 DataFrames
In [33]:
df1=DataFrame(np.arange(4).reshape(2,2),columns=['a','b'],index=['car','bike'])
print(df1)
      a  b
car   0  1
bike  2  3
In [34]:
df2=DataFrame(np.arange(9).reshape(3,3),columns=['a','b','c'],index=['car','bike','cycle'])
print(df2)
       a  b  c
car    0  1  2
bike   3  4  5
cycle  6  7  8
In [35]:
df1+df2
Out[35]:
         a    b   c
bike   5.0  7.0 NaN
car    0.0  2.0 NaN
cycle  NaN  NaN NaN
Use add() function so that you don't end up with the NaN values
In [37]:
df1=df1.add(df2,fill_value=0)
print(df1)
#values not present in df1 are assigned value of 0 and added to df2
          a     b     c
bike    8.0  11.0  10.0
car     0.0   3.0   4.0
cycle  12.0  14.0  16.0
Subtract a series from a dataframe
In [39]:
#DataFrame is Collection of numerous series where each column is a series
myseries_2=df2.loc['car']
print(myseries_2)
a    0
b    1
c    2
Name: car, dtype: int32
In [40]:
#subtract df2 and myseries_2
df2-myseries_2
Out[40]:
       a  b  c
car    0  0  0
bike   3  3  3
cycle  6  6  6
Ranking and Sorting in Pandas Series
In [44]:
ser1=Series([500,600,550],index=['a','c','b'])
print(ser1)
a    500
c    600
b    550
dtype: int64
Sorting by Index
In [45]:
ser1.sort_index()
Out[45]:
a    500
b    550
c    600
dtype: int64
Sorting by Values
In [46]:
ser1.sort_values() #Ascending orders
Out[46]:
a    500
b    550
c    600
dtype: int64
Ranking of Series
In [47]:
ser2=Series([10,15,14,12,6,7])
print(ser2)
0    10
1    15
2    14
3    12
4     6
5     7
dtype: int64
In [48]:
ser2.rank() #least element gets value of 1
Out[48]:
0    3.0
1    6.0
2    5.0
3    4.0
4    1.0
5    2.0
dtype: float64
Sorting uses ranking function
In [52]:
print(ser2.rank()) #array before sorting
print('')
ser2=ser2.sort_values() #sorting array and assigning to ser2
print('')
print(ser2.rank())
0    3.0
1    6.0
2    5.0
3    4.0
4    1.0
5    2.0
dtype: float64

4    1.0
5    2.0
0    3.0
3    4.0
2    5.0
1    6.0
dtype: float64
In [ ]:
In [2]:
In [3]:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
import matplotlib.pyplot as plt
array1=np.array([ [10,np.nan,20],[30,40,np.nan],[50,np.nan,60] ])
print(array1)
[[10. nan 20.]
[30. 40. nan]
[50. nan 60.]]
In [4]:
df1=DataFrame(array1,columns=list('ABC')) #list('ABC') gives the column labels ['A','B','C']
print(df1)
      A     B     C
0  10.0   NaN  20.0
1  30.0  40.0   NaN
2  50.0   NaN  60.0
sum() function
In [5]:
#sum() ignores NaN values (treats them as 0 for the total)
df1.sum() #sum along the columns
Out[5]:
A    90.0
B    40.0
C    80.0
dtype: float64
In [9]:
#sum along the indexes (rows)
df1.sum(axis=1)
Out[9]:
0     30.0
1     70.0
2    110.0
dtype: float64
In [7]:
#Finding minimum value along each column
df1.min()
Out[7]:
A    10.0
B    40.0
C    20.0
dtype: float64
In [8]:
#Finding maximum value along each column
df1.max()
Out[8]:
A    50.0
B    40.0
C    60.0
dtype: float64
In [10]:
#Finding the index of the maximum value along each column
df1.idxmax()
Out[10]:
A    2
B    1
C    2
dtype: int64
In [11]:
#Cumulative sum along columns
df1.cumsum()
Out[11]:
      A     B     C
0  10.0   NaN  20.0
1  40.0  40.0   NaN
2  90.0   NaN  80.0
In [12]:
df1
Out[12]:
      A     B     C
0  10.0   NaN  20.0
1  30.0  40.0   NaN
2  50.0   NaN  60.0
describe() function
In [13]:
df1.describe() #gives statistical values for each numerical column
Out[13]:
          A     B          C
count   3.0   1.0   2.000000
mean   30.0  40.0  40.000000
std    20.0   NaN  28.284271
min    10.0  40.0  20.000000
25%    20.0  40.0  30.000000
50%    30.0  40.0  40.000000
75%    40.0  40.0  50.000000
max    50.0  40.0  60.000000
Graph of dataframe
In [15]:
cars_df=DataFrame(np.random.randn(9).reshape(3,3),index=['BMW','Audi','Mercedes'],columns=['test1','test2','test3'])
print(cars_df)
             test1     test2     test3
BMW       0.975933 -1.037808 -0.391437
Audi     -0.993646  0.063885 -0.971377
Mercedes -4.254867  0.487722  0.056041
In [16]:
#plot graph for tests of cars
plt.plot(cars_df) #indexes along x-axis
Out[16]: [<matplotlib.lines.Line2D at 0x273c45ec648>,
<matplotlib.lines.Line2D at 0x273c8a62d48>,
<matplotlib.lines.Line2D at 0x273c8a62f08>]
In [17]:
plt.plot(cars_df)
plt.legend(cars_df.columns,loc='lower right')
plt.savefig('cars_df.png')
Unique elements of a series
In [18]:
ser1=Series(list('aaabbbcccdabd'))
print(ser1)
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     c
9     d
10    a
11    b
12    d
dtype: object
In [19]:
ser1.unique() #displays only unique elements
Out[19]: array(['a', 'b', 'c', 'd'], dtype=object)
Frequency of elements in a series
Knowing how many times each unique element is repeated
In [20]:
ser1.value_counts() #value of a is repeated 4 times
Out[20]:
a    4
b    4
c    3
d    2
dtype: int64
In [ ]: