 Data Science Introduction
Data Science is a combination of multiple disciplines that uses statistics, data analysis, and
machine learning to analyze data and to extract knowledge and insights from it.
What is Data Science?
Data Science is about data gathering, analysis and decision-making.
Data Science is about finding patterns in data through analysis, and making future predictions.
By using Data Science, companies are able to make:
Better decisions (should we choose A or B)
Predictive analysis (what will happen next?)
Pattern discoveries (find pattern, or maybe hidden information in the data)
 Where is Data Science Needed?
 Data Science is used in many industries in the world today, e.g. banking,
consultancy, healthcare, and manufacturing.
 Examples of where Data Science is needed:
 For route planning: To discover the best routes to ship
 To foresee delays for flight/ship/train etc. (through predictive analysis)
 To create promotional offers
 To find the best suited time to deliver goods
 To forecast next year's revenue for a company
 To analyze the health benefits of training
 To predict who will win elections
 Where is Data Science Needed?
 Data Science can be applied in nearly every part of a business where
data is available. Examples are:
 Consumer goods
 Stock markets
 Industry
 Politics
 Logistics companies
 E-commerce
 How Does a Data Scientist Work?
 A Data Scientist requires expertise in several backgrounds:
 Machine Learning
 Statistics
 Programming (Python or R)
 Mathematics
 Databases
 A Data Scientist must find patterns within the data. Before the patterns can be found, the data must be organized in a standard format.
Here is how a Data Scientist works:
Ask the right questions - To understand the business problem.
Explore and collect data - From database, web logs, customer feedback,
etc.
Extract the data - Transform the data to a standardized format.
Clean the data - Remove erroneous values from the data.
Find and replace missing values - Check for missing values and replace
them with a suitable value (e.g. an average value).
Normalize data - Scale the values to a practical range (e.g. 140 cm is smaller than 1.8 m, but the number 140 is larger than 1.8 - so scaling is important).
Analyze data, find patterns and make future predictions.
Represent the result - Present the result with useful insights in a way the
"company" can understand.
 What is Data?
 Data is a collection of information.
 One purpose of Data Science is to structure data, making it interpretable
and easy to work with.
 Data can be categorized into two groups:
 Structured data
 Unstructured data
Data Types
Unstructured Data
Unstructured data is not organized. We must organize the data for analysis purposes.
Structured Data
Structured data is organized and easier to work with.
 How to Structure Data?
We can use an array or a database table to structure or present data.
Example of an array:
[80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
The following example shows how to create an array in Python:
#Example
Array = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
print(Array)
o/p: [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
 Data Science - Database Table:
What is Database Table?
A database table is a table with structured data.
The following table shows a database table with health data extracted from a sports watch:
This dataset contains information about a typical training session, such as duration, average pulse, calorie burnage etc.
 Data Science - Database Table:
Database Table Structure
A database table consists of column(s) and row(s):
A row is a horizontal representation of data.
A column is a vertical representation of data.
 Data Science - Database Table:
Variables
A variable is defined as something that can be measured or counted.
Examples can be characters, numbers or time.
In the example below, we can observe that each column represents a variable.
 Data Science - Database Table:
Variables
A variable is defined as something that can be measured or counted.
There are 6 columns, meaning that there are 6 variables (Duration, Average_Pulse,
Max_Pulse, Calorie_Burnage, Hours_Work, Hours_Sleep).
There are 11 rows, meaning that each variable has 10 observations.
But if there are 11 rows, how come there are only 10 observations? It is because the first row is the label, meaning that it is the name of the variable.
 Data Science & Python
Python
Python is a programming language widely used by Data Scientists.
Python has in-built mathematical libraries and functions, making it easier to calculate
mathematical problems and to perform data analysis.
Python Libraries
Python has libraries with large collections of mathematical functions and analytical tools.
In this course, we will use the following libraries:
 Pandas - This library is used for structured data operations, like importing CSV files, creating data frames, and data preparation
 Numpy - This is a mathematical library. It has a powerful N-dimensional array object, linear algebra, Fourier transform, etc.
 Matplotlib - This library is used for visualization of data.
 SciPy - This library has linear algebra modules
 Data Science - Python DataFrame
Create a DataFrame with Pandas
A data frame is a structured representation of data.
Let's define a data frame with 3 columns and 5 rows with fictional numbers:
Example
import pandas as pd
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)
print(df)
Example Explained
Import the Pandas library as pd
Define data with column and rows in a variable named d
Create a data frame using the function pd.DataFrame()
The data frame contains 3 columns and 5 rows
Print the data frame output with the print() function
 Data Science - Python DataFrame
We write pd. in front of DataFrame() to let Python know that we want to activate the DataFrame()
function from the Pandas library.
Be aware of the capital D and F in DataFrame!
Interpreting the Output: This is the output:

   col1  col2  col3
0     1     4     7
1     2     5     8
2     3     6    12
3     4     9     1
4     7    11    11
We see that "col1", "col2" and "col3" are the names of the
columns.
Do not be confused about the vertical numbers ranging from 0 to 4. They tell us the position of the rows.
In Python, the numbering of rows starts with zero.
Now, we can use Python to count the columns and rows.
We can use df.shape[1] to find the number of columns:
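For example, with the data frame df defined above (3 columns, 5 rows):

#Count the columns and rows of the data frame defined earlier
import pandas as pd

d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)

print(df.shape[1]) #number of columns
print(df.shape[0]) #number of rows

o/p:
3
5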
 Data Science Functions
The data set above consists of 6 variables,
each with 10 observations:
 Duration - How long did the training session last, in minutes?
 Average_Pulse - What was the average pulse of the training session? This is measured in beats per minute
 Max_Pulse - What was the max pulse of
the training session?
 Calorie_Burnage - How many calories were burned during the training session?
 Hours_Work - How many hours did we
work at our job before the training
session?
 Hours_Sleep - How much did we sleep
the night before the training session?
We use underscores (_) to separate words because Python does not allow spaces in variable names.
The Sports Watch Data Set
Data Science Functions
The max() function
The Python max() function is used to find the highest value in an array.
Average_pulse_max = max(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print (Average_pulse_max)
o/p: 125
The min() function
The Python min() function is used to find the lowest value in an array.
Average_pulse_min = min(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print (Average_pulse_min)
o/p: 80
The mean() function
The NumPy mean() function is used to find the average value of an array.
import numpy as np
Calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]
Average_calorie_burnage = np.mean(Calorie_burnage)
print(Average_calorie_burnage)
o/p: 285.0
Data Science - Data Preparation
Extract and Read Data With Pandas
• Before analyzing data, a Data Scientist must extract the data, and
make it clean and valuable.
• Before data can be analyzed, it must be imported/extracted.
In the example below, we show you how to import data using Pandas in Python.
We use the read_csv() function to import a CSV file with the health data:
import pandas as pd
health_data = pd.read_csv("data.csv", header=0, sep=",")
print(health_data)
Data Science - Data Preparation
Example Explained
• Import the Pandas library
• Name the data frame as health_data.
• header=0 means that the headers for the variable names are to be found in the first row
(note that 0 means the first row in Python)
• sep="," means that "," is used as the separator between the values. This is because we are
using the file type .csv (comma separated values)
• Tip: If you have a large CSV file, you can use the head() function to only show the top 5 rows:
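A minimal sketch, assuming the same data.csv file as above:

import pandas as pd

health_data = pd.read_csv("data.csv", header=0, sep=",")

#head() shows the first 5 rows by default; pass a number to show more or fewer
print(health_data.head())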
Data Science - Data Preparation
Data Cleaning
Look at the imported data. As you can see, the data are "dirty", with wrongly registered or unregistered values:
• There are some blank fields
• An average pulse of 9 000 is not possible
• 9 000 will be treated as non-numeric, because of the space separator
• One observation of max pulse is denoted as "AF", which does not make sense
So, we must clean the data in order to perform the analysis.
Data Science - Data Preparation
Data Cleaning
Remove Blank Rows
We see that the non-numeric values (9 000 and AF) are in the same rows with missing values.
Solution: We can remove the rows with missing observations to fix this problem.
When we load a data set using Pandas, all blank cells are automatically converted into
"NaN" values.
So, removing the NaN cells gives us a clean data set that can be analyzed.
We can use the dropna() function to remove the NaNs. axis=0 means that we want to
remove all rows that have a NaN value:
import pandas as pd
health_data = pd.read_csv("data.csv", header=0, sep=",")
health_data.dropna(axis=0,inplace=True)
print(health_data)
Data Science - Data Preparation
Data Categories
To analyze data, we also need to know the types of data we are dealing with.
Data can be split into three main categories:
o Numerical - Contains numerical values. Can be divided into two categories:
Discrete: Numbers are counted as "whole". Example: You cannot have trained 2.5 sessions, it is either
2 or 3
Continuous: Numbers can be of infinite precision. For example, you can sleep for 7 hours, 30 minutes
and 20 seconds, or 7.533 hours
o Categorical - Contains values that cannot be measured up against each other. Example: A color or
a type of training
o Ordinal - Contains categorical data that can be measured up against each other. Example: School
grades where A is better than B and so on
o By knowing the type of your data, you will be able to know what technique to use when analyzing
them.
Data Science - Data Preparation
Data Types
We can use the info() function to list the data types within our data set:
import pandas as pd
health_data = pd.read_csv("data.csv", header=0, sep=",")
print(health_data.info())
o/p:
We see that this data set has two different types of data:
Float64
Object
We cannot use objects to calculate and perform analysis here. We must convert the type object to float64 (float64 is a number with a decimal in Python).
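The conversion step itself is not shown above. Below is a minimal sketch of one way to do it with pd.to_numeric (the column names Average_Pulse and Max_Pulse are assumptions based on the dirty values described earlier): values that cannot be parsed, like "AF" or "9 000", become NaN and can then be dropped.

import pandas as pd

health_data = pd.read_csv("data.csv", header=0, sep=",")

#Convert the object columns to float64; unparseable values become NaN
for col in ["Average_Pulse", "Max_Pulse"]:
    health_data[col] = pd.to_numeric(health_data[col], errors="coerce")

#Drop the rows that now contain NaN values
health_data.dropna(axis=0, inplace=True)
print(health_data.info())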
Data Science - Data Preparation
Analyze the Data
When we have cleaned the data set, we can start analyzing the data.
We can use the describe() function in Python to summarize data:
import pandas as pd
health_data = pd.read_csv("data.csv", header=0, sep=",")
pd.set_option('display.max_columns',None)
print(health_data.describe())
Count - Counts the number of observations
Mean - The average value
Std - The standard deviation (explained in the statistics chapter)
Min - The lowest value
25%, 50% and 75% - Percentiles (explained in the statistics chapter)
Max - The highest value
DS Math
Data Science - Linear Functions
Mathematical functions are important to know as a data scientist, because we want to make
predictions and interpret them.
Linear Functions
In mathematics a function is used to relate one variable to another variable.
Suppose we consider the relationship between calorie burnage and average pulse. It is reasonable
to assume that, in general, the calorie burnage will change as the average pulse changes - we say
that the calorie burnage depends upon the average pulse.
Furthermore, it may be reasonable to assume that as the average pulse increases, so will the calorie
burnage. Calorie burnage and average pulse are the two variables being considered.
Because the calorie burnage depends upon the average pulse, we say that calorie burnage is the
dependent variable and the average pulse is the independent variable.
The relationship between a dependent and an independent variable can often be expressed
mathematically using a formula (function).
DS Math
Data Science - Linear Functions
Mathematical functions are important to know as a data scientist, because we want to make
predictions and interpret them.
A linear function has one independent variable (x) and one dependent variable (y), and has the following form:
y = f(x) = ax + b
This function is used to calculate a value for the dependent variable when we choose a value for the
independent variable.
Explanation:
o f(x) = the output (the dependent variable)
o x = the input (the independent variable)
o a = slope = the coefficient of the independent variable. It gives the rate of change of the dependent variable
o b = intercept = the value of the dependent variable when x = 0. It is also the point where the diagonal line crosses the vertical axis.
DS Math
Data Science - Linear Functions
Linear Function With One Explanatory Variable
A function with one explanatory variable means that we use one variable for prediction.
Let us say we want to predict calorie burnage using average pulse. We have the following formula:
f(x) = 2x + 80
Here, the numbers and variables mean:
o f(x) = The output. This number is where we get the predicted value of Calorie_Burnage
o x = The input, which is Average_Pulse
o 2 = Slope = Specifies how much Calorie_Burnage increases if Average_Pulse increases by one. It
tells us how "steep" the diagonal line is
o 80 = Intercept = A fixed value. It is the value of the dependent variable when x = 0
DS Math
Data Science - Linear Functions
Plotting a Linear Function
The term linearity means a "straight line". So, if you show a linear function graphically, the line
will always be a straight line. The line can slope upwards, downwards, and in some cases may be
horizontal or vertical. Here is a graphical representation of the mathematical function above:
Graph Explanations:
o The horizontal axis is generally called the x-axis. Here,
it represents Average_Pulse.
o The vertical axis is generally called the y-axis. Here, it
represents Calorie_Burnage.
o Calorie_Burnage is a function of Average_Pulse,
because Calorie_Burnage is assumed to be dependent
on Average_Pulse.
o In other words, we use Average_Pulse to predict
Calorie_Burnage.
o The blue (diagonal) line represents the structure of
the mathematical function that predicts calorie
burnage.
DS Math
Data Science - Plotting Linear Functions
The Sports Watch Data Set
Take a look at our health data set:
Plot the Existing Data in Python
Now, we can plot the values of Average_Pulse against Calorie_Burnage using the Matplotlib library.
The plot() function is used to draw a two-dimensional line plot of the points (x, y):
DS Math
Data Science - Plotting Linear Functions
The Sports Watch Data Set
Take a look at our health data set:
#Three lines to make our compiler able to draw:
import sys
import matplotlib
matplotlib.use('Agg')

import pandas as pd
import matplotlib.pyplot as plt

health_data = pd.read_csv("data.csv", header=0, sep=",")

health_data.plot(x ='Average_Pulse', y='Calorie_Burnage', kind='line')

plt.ylim(ymin=0)
plt.xlim(xmin=0)

plt.show()

#Two lines to make our compiler able to draw:
plt.savefig(sys.stdout.buffer)
sys.stdout.flush()

Example Explained
o Import the pyplot module of the matplotlib library
o Plot the data from Average_Pulse against Calorie_Burnage
o kind='line' tells us which type of plot we want. Here, we want to have a straight line
o plt.ylim() and plt.xlim() tell us what value we want the axes to start on. Here, we want the axes to begin from zero
o plt.show() shows us the output

The code above will produce the following result:
DS Math
Data Science - Plotting Linear Functions
The Sports Watch Data Set
Take a look at our health data set:
The Graph Output
As we can see, there is a relationship between Average_Pulse and Calorie_Burnage. Calorie_Burnage increases proportionally with Average_Pulse. It means that we can use Average_Pulse to predict Calorie_Burnage.
DS Math
Data Science - Plotting Linear Functions
Why is The Line Not Fully Drawn Down to The y-axis?
The reason is that we do not have observations where Average_Pulse or Calorie_Burnage are
equal to zero. 80 is the first observation of Average_Pulse and 240 is the first observation of
Calorie_Burnage.
Look at the line. What happens to calorie burnage if average pulse increases from 80 to 90?
DS Math
Data Science - Plotting Linear Functions
We can use the diagonal line to find the mathematical function to predict
calorie burnage.
As it turns out:
o If the average pulse is 80, the calorie burnage is 240
o If the average pulse is 90, the calorie burnage is 260
o If the average pulse is 100, the calorie burnage is 280
o There is a pattern. If average pulse increases by 10, the calorie burnage increases by 20.
DS Math
Data Science - Slope and Intercept
Slope and Intercept
Now we will explain how we found the slope and intercept of our function:
f(x) = 2x + 80
The image points to the slope, which indicates how steep the line is, and the intercept, which is the value of y when x = 0 (the point where the diagonal line crosses the vertical axis). The red line is the continuation of the blue line from the previous page.
DS Math
Data Science - Slope and Intercept
Find The Slope
The slope is defined as how much calorie burnage increases, if average pulse increases by one. It tells
us how "steep" the diagonal line is.
We can find the slope by using the proportional difference of two points from the graph.
If the average pulse is 80, the calorie burnage is 240
If the average pulse is 90, the calorie burnage is 260
We see that if average pulse increases by 10, the calorie burnage increases by 20.
Slope = 20/10 = 2
The slope is 2.
DS Math
Data Science - Slope and Intercept
Find The Slope
Mathematically, Slope is Defined as:
Slope = (f(x2) - f(x1)) / (x2 - x1)
f(x2) = Second observation of Calorie_Burnage = 260
f(x1) = First observation of Calorie_Burnage = 240
x2 = Second observation of Average_Pulse = 90
x1 = First observation of Average_Pulse = 80
Use Python to Find the Slope
Calculate the slope with the following
code:
def slope(x1, y1, x2, y2):
    s = (y2 - y1) / (x2 - x1)
    return s

print(slope(80, 240, 90, 260))
o/p: 2.0
Slope = (260-240) / (90 - 80) = 2
Be consistent in defining the observations in the correct order! If not, the prediction will not be correct!
DS Math
Data Science - Slope and Intercept
Find The Intercept
The intercept is used to fine-tune the function's ability to predict Calorie_Burnage.
The intercept is where the diagonal line crosses the y-axis, if it were fully drawn.
The intercept is the value of y, when x = 0.
Here, we see that if average pulse (x) is zero, then the calorie burnage (y) is 80.
So, the intercept is 80.
Sometimes, the intercept has a practical meaning. Sometimes not.
Does it make sense that average pulse is zero?
No, you would be dead and you certainly would not burn any calories.
However, we need to include the intercept to complete the mathematical function's ability to predict Calorie_Burnage correctly.
Other examples where the intercept of a mathematical function can have a practical meaning:
Predicting next year's revenue by using marketing expenditure (how much revenue will we have next year if marketing expenditure is zero?). It is reasonable to assume that a company will still have some revenue even if it does not spend money on marketing.
Fuel usage with speed (how much fuel do we use if speed is equal to 0 mph?). A car that uses gasoline will still use fuel while it is idle.
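Since the slope is already known, the intercept can be computed from any observed point using b = y - a*x. A minimal sketch with the first observation (x = 80, y = 240) and the slope a = 2:

def intercept(x, y, a):
    #Rearranged from y = a*x + b
    return y - a * x

print(intercept(80, 240, 2))

o/p: 80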
DS Math
Data Science - Slope and Intercept
Find the Slope and Intercept Using Python
The np.polyfit() function returns the slope and intercept.
If we proceed with the following code, we can both get the slope and intercept from the function.
import pandas as pd
import numpy as np
health_data = pd.read_csv("data.csv", header=0, sep=",")
x = health_data["Average_Pulse"]
y = health_data["Calorie_Burnage"]
slope_intercept = np.polyfit(x,y,1)
print(slope_intercept)
o/p: [ 2. 80.]
Example Explained:
Isolate the variables Average_Pulse (x) and
Calorie_Burnage (y) from health_data.
Call the np.polyfit() function.
The last parameter of the function
specifies the degree of the function, which
in this case is "1".
We have now calculated the slope (2) and the intercept (80). We can write the mathematical function used to predict Calorie_Burnage as follows:
f(x) = 2x + 80
DS Math
Data Science - Slope and Intercept
Task:
Now, we want to predict calorie burnage if average pulse is 135.
Remember that the intercept is a constant. A constant is a number that does not change.
We can now substitute the input x with 135:
f(135) = 2 * 135 + 80 = 350
If average pulse is 135, the calorie burnage is 350.
Define the Mathematical Function in Python
Here is the exact same mathematical function, but in Python. The function returns 2*x + 80, with x
as the input:
#Try to replace x with 140 and 150.
def my_function(x):
    return 2*x + 80

print(my_function(135))
o/p: 350
DS Math
Data Science - Slope and Intercept
Plot a New Graph in Python
Here, we plot the same graph as earlier, but with the axes formatted a little.
The max value of the y-axis is now 400, and of the x-axis 150:

import matplotlib.pyplot as plt

health_data.plot(x ='Average_Pulse', y='Calorie_Burnage', kind='line')
plt.ylim(ymin=0, ymax=400)
plt.xlim(xmin=0, xmax=150)
plt.show()

Example Explained
Import the pyplot module of the matplotlib library
Plot the data from Average_Pulse against Calorie_Burnage
kind='line' tells us which type of plot we want. Here, we want to have a straight line
plt.ylim() and plt.xlim() tell us what values we want the axes to start and stop on
plt.show() shows us the output
DS- Statistics
Introduction to Statistics
Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and
presentation of data. When we have created a model for prediction, we must assess the prediction's
reliability.
Statistics is a method of interpreting, analyzing and summarizing the data.
Statistics is categorized into two types:
Descriptive and inferential statistics
Based on representations of the data, such as pie charts, bar graphs, or tables, we analyse and interpret it.
DS- Statistics
Descriptive Statistics
import pandas as pd
full_health_data = pd.read_csv("data.csv", header=0, sep=",")
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)
print (full_health_data.describe())
DS- Statistics
Statistics Percentiles
25%, 50% and 75% - Percentiles
Percentiles are used in statistics to give you a number that describes the value that a given percent of
the values are lower than.
DS- Statistics
Statistics Percentiles
Let us try to explain it by some examples, using Average_Pulse.
The 25% percentile of Average_Pulse means that 25% of all of the training sessions have an average
pulse of 100 beats per minute or lower. If we flip the statement, it means that 75% of all of the
training sessions have an average pulse of 100 beats per minute or higher
The 75% percentile of Average_Pulse means that 75% of all the training sessions have an average pulse of 111 or lower. If we flip the statement, it means that 25% of all of the training sessions have an average pulse of 111 beats per minute or higher
Task: Find the 10% percentile for Max_Pulse
The following example shows how to do it in Python:
DS- Statistics
Statistics Percentiles
import pandas as pd
import numpy as np
full_health_data = pd.read_csv("data.csv", header=0, sep=",")
Max_Pulse= full_health_data["Max_Pulse"]
percentile10 = np.percentile(Max_Pulse, 10)
print(percentile10)
o/p: 120.00
Max_Pulse = full_health_data["Max_Pulse"] - Isolate the variable Max_Pulse from the full health
data set.
np.percentile() is used to define that we want the 10% percentile from Max_Pulse.
The 10% percentile of Max_Pulse is 120. This means that 10% of all the training sessions have a
Max_Pulse of 120 or lower.
DS- Statistics
Statistics Standard Deviation
Standard Deviation
Standard deviation is a number that describes how spread out the observations are.
A mathematical function will have difficulties in predicting precise values if the observations are "spread out". Standard deviation is a measure of uncertainty.
A low standard deviation means that most of the numbers are close to the mean (average) value.
A high standard deviation means that the values are spread out over a wider range.
DS- Statistics
Statistics Standard Deviation
Standard Deviation
import pandas as pd
import numpy as np
full_health_data = pd.read_csv("data.csv", header=0, sep=",")
std = np.std(full_health_data)
print(std)
DS- Statistics
Statistics Standard Deviation
Coefficient of Variation
The coefficient of variation is used to get an idea of how large the standard deviation is.
Mathematically, the coefficient of variation is defined as:
Coefficient of Variation = Standard Deviation / Mean
We can do this in Python if we proceed with the following code:
import numpy as np
cv = np.std(full_health_data) / np.mean(full_health_data)
print(cv)
o/p:
We see that the variables Duration, Calorie_Burnage and Hours_Work have a high standard deviation compared to Max_Pulse, Average_Pulse and Hours_Sleep.
DS- Statistics
Data Science - Statistics Variance
Variance
Variance is another number that indicates how spread out the values are.
In fact, if you take the square root of the variance, you get the standard deviation. Or the other way around, if
you multiply the standard deviation by itself, you get the variance!
We will first use the data set with 10 observations to give an example of how we can calculate the variance:
Tip: Variance is often represented by the symbol Sigma Square: σ^2
DS- Statistics
Data Science - Statistics Variance
Variance
Step 1 to Calculate the Variance: Find the Mean
We want to find the variance of Average_Pulse.
1. Find the mean:
(80+85+90+95+100+105+110+115+120+125) / 10 = 102.5
The mean is 102.5.

Step 2: For Each Value - Find the Difference From the Mean
2. Find the difference from the mean for each value:
80 - 102.5 = -22.5
85 - 102.5 = -17.5
90 - 102.5 = -12.5
95 - 102.5 = -7.5
100 - 102.5 = -2.5
105 - 102.5 = 2.5
110 - 102.5 = 7.5
115 - 102.5 = 12.5
120 - 102.5 = 17.5
125 - 102.5 = 22.5
DS- Statistics
Step 3: For Each Difference - Find the Square Value
3. Find the square value for each difference:
(-22.5)^2 = 506.25
(-17.5)^2 = 306.25
(-12.5)^2 = 156.25
(-7.5)^2 = 56.25
(-2.5)^2 = 6.25
2.5^2 = 6.25
7.5^2 = 56.25
12.5^2 = 156.25
17.5^2 = 306.25
22.5^2 = 506.25
Note: We must square the values to get the total spread.

Step 4: The Variance is the Average of These Squared Values
4. Sum the squared values and find the average:
(506.25 + 306.25 + 156.25 + 56.25 + 6.25 + 6.25 + 56.25 + 156.25 + 306.25 + 506.25) / 10 = 206.25
The variance is 206.25.
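As a check, NumPy reproduces the hand calculation above (np.var() computes the population variance, and np.std() is its square root):

import numpy as np

Average_Pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

print(np.var(Average_Pulse)) #the variance
print(np.std(Average_Pulse)) #the standard deviation = square root of the variance

o/p:
206.25
14.361406616345072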
DS- Statistics
Data Science - Statistics Variance
Variance
Use Python to Find the Variance of health_data
We can use the var() function from Numpy to find the variance (remember that we now use the first data set
with 10 observations):
Here we calculate the variance for each column for the
full data set:
import numpy as np
var = np.var(health_data)
import numpy as np
print(var)
var_full = np.var(full_health_data)
o/p:
print(var_full)
o/p:
DS - Statistics Correlation
Correlation
Correlation measures the relationship between two variables.
We mentioned that a function has a purpose to predict a value, by converting input (x) to output (f(x)). We can also say that a function uses the relationship between two variables for prediction.
Correlation Coefficient
The correlation coefficient measures the relationship between two variables.
The correlation coefficient can never be less than -1 or higher than 1.
o 1 = there is a perfect linear relationship between the variables (like Average_Pulse against Calorie_Burnage)
o 0 = there is no linear relationship between the variables
o -1 = there is a perfect negative linear relationship between the variables (e.g. fewer hours worked leads to higher calorie burnage during a training session)
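As a quick check, we can compute the coefficient for the small sports-watch lists used earlier with NumPy's corrcoef() function (it returns a 2x2 correlation matrix; the off-diagonal entry is the coefficient between the two variables):

import numpy as np

Average_Pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
Calorie_Burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]

#The entry [0, 1] is the correlation between the two lists
print(round(np.corrcoef(Average_Pulse, Calorie_Burnage)[0, 1], 2))

o/p: 1.0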
DS - Statistics Correlation
Example of a Perfect Linear Relationship (Correlation Coefficient = 1)
We will use a scatterplot to visualize the relationship between Average_Pulse and Calorie_Burnage (we have used the small data set of the sports watch with 10 observations).
This time we want scatter plots, so we change kind to "scatter":
import matplotlib.pyplot as plt
health_data.plot(x ='Average_Pulse', y='Calorie_Burnage', kind='scatter')
plt.show()
DS - Statistics Correlation
Example of a Perfect Negative Linear Relationship (Correlation Coefficient = -1)
We have plotted fictional data here. The x-axis
represents the amount of hours worked at our
job before a training session. The y-axis is
Calorie_Burnage.
If we work longer hours, we tend to have lower
calorie burnage because we are exhausted
before the training session.
The correlation coefficient here is -1.
DS - Statistics Correlation
Example of a Perfect Negative Linear Relationship (Correlation Coefficient = -1)
import pandas as pd
import matplotlib.pyplot as plt
negative_corr = {'Hours_Work_Before_Training': [10,9,8,7,6,5,4,3,2,1],
'Calorie_Burnage': [220,240,260,280,300,320,340,360,380,400]}
negative_corr = pd.DataFrame(data=negative_corr)
negative_corr.plot(x ='Hours_Work_Before_Training', y='Calorie_Burnage',
kind='scatter')
plt.show()
DS - Statistics Correlation
Example of No Linear Relationship (Correlation coefficient = 0)
Here, we have plotted Max_Pulse against Duration
from the full_health_data set.
As you can see, there is no linear relationship between the two variables. It means that a longer training session does not lead to a higher Max_Pulse.
The correlation coefficient here is 0.
import matplotlib.pyplot as plt
full_health_data.plot(x ='Duration', y='Max_Pulse', kind='scatter')
plt.show()
o/p:
D S - Statistics Correlation Matrix
Correlation Matrix
A matrix is an array of numbers arranged in rows and columns.
A correlation matrix is simply a table showing the correlation coefficients between variables.
Here, the variables are represented in the first row, and in the first column:
The table here has used data from the full health data set.
Observations:
We observe that Duration and Calorie_Burnage are closely
related, with a correlation coefficient of 0.89. This makes
sense as the longer we train, the more calories we burn
We observe that there is almost no linear relationship between Average_Pulse and Calorie_Burnage (correlation coefficient of 0.02)
Can we conclude that Average_Pulse does not affect
Calorie_Burnage? No. We will come back to answer this
question later!
D S - Statistics Correlation Matrix
Correlation Matrix
Correlation Matrix in Python
We can use the corr() function in Python to create a correlation matrix. We also use the round()
function to round the output to two decimals:
import pandas as pd
full_health_data = pd.read_csv("data.csv", header=0, sep=",")
Corr_Matrix = round(full_health_data.corr(),2)
print(Corr_Matrix)
o/p:
D S - Statistics Correlation Matrix
Correlation Matrix
Using a Heatmap
We can use a Heatmap to Visualize the Correlation Between Variables:
The closer the correlation coefficient is to 1, the
greener the squares get.
The closer the correlation coefficient is to -1, the
browner the squares get.
D S - Statistics Correlation Matrix
Correlation Matrix
Use Seaborn to Create a Heatmap
We can use the Seaborn library to create a correlation heat map (Seaborn is a
visualization library based on matplotlib):
import matplotlib.pyplot as plt
import seaborn as sns
correlation_full_health = full_health_data.corr()
axis_corr = sns.heatmap(
correlation_full_health,
vmin=-1, vmax=1, center=0,
cmap=sns.diverging_palette(50, 500, n=500),
square=True
)
plt.show()
D S - Statistics Correlation Matrix
Correlation Matrix
Use Seaborn to Create a Heatmap
We can use the Seaborn library to create a correlation heat map (Seaborn is a
visualization library based on matplotlib):
Example Explained:
o Import the library seaborn as sns.
o Use the full_health_data set.
o Use sns.heatmap() to tell Python that we want a heatmap to visualize the correlation matrix.
o Use the correlation matrix. Define the maximal and minimal values of the heatmap. Define that 0 is the center.
o Define the colors with sns.diverging_palette. n=500 means that we want 500 types of color in the same color palette.
o square = True means that we want to see squares.
D S - Statistics Correlation vs. Causality
Correlation Does Not Imply Causality
Correlation measures the numerical relationship between two variables.
A high correlation coefficient (close to 1) does not mean that we can conclude for sure that there is an actual relationship between two variables.
A classic example:
During the summer, the sale of ice cream at a beach increases. Simultaneously, drowning accidents also increase.
Does this mean that an increase in ice cream sales is a direct cause of increased drowning accidents?
The Beach Example in Python
Here, we constructed a fictional data set for you to try:
D S - Statistics Correlation vs. Causality
Correlation Does Not Imply Causality
The Beach Example in Python
Here, we constructed a fictional data set for you to try:
import pandas as pd
import matplotlib.pyplot as plt
Drowning = {"Drowning_Accident": [20,40,60,80,100,120,140,160,180,200],
"Ice_Cream_Sale": [20,40,60,80,100,120,140,160,180,200]}
Drowning = pd.DataFrame(data=Drowning)
Drowning.plot(x="Ice_Cream_Sale", y="Drowning_Accident", kind="scatter")
plt.show()
correlation_beach = Drowning.corr()
print(correlation_beach)
D S - Statistics Correlation vs. Causality
Correlation Does Not Imply Causality
The Beach Example in Python
o/p: The scatter plot shows a perfectly straight line, and corr() returns a correlation coefficient of 1.0 between Ice_Cream_Sale and Drowning_Accident.
D S - Statistics Correlation vs. Causality
Correlation Does Not Imply Causality
The Beach Example in Python
Correlation vs Causality - The Beach Example
In other words: can we use ice cream sale to predict drowning accidents?
The answer is - Probably not.
It is likely that these two variables are accidentally correlating with each other.
What causes drowning then?
Unskilled swimmers
Waves
Cramp
Seizure disorders
Lack of supervision
Alcohol (mis)use
etc.
D S - Statistics Correlation vs. Causality
Correlation Does Not Imply Causality
The Beach Example in Python
Let us reverse the argument:
Does a low correlation coefficient (close to zero) mean that change in x does not affect y?
Back to the question:
Can we conclude that Average_Pulse does not affect Calorie_Burnage because of a low correlation
coefficient?
The answer is no.
There is an important difference between correlation and causality:
Correlation is a number that measures how closely the data are related
Causality is the conclusion that x causes y.
Tip: Always critically reflect over the concept of causality when doing predictions!
 Statistics gives us methods of gaining knowledge from data.
What is Statistics Used for?
Statistics is used in all kinds of science and business
applications.
Statistics gives us more accurate knowledge which
helps us make better decisions.
Statistics can focus on making predictions about
what will happen in the future. It can also focus on
explaining how different things are connected.
Typical Steps of Statistical Methods
The typical steps are:
 Gathering data
 Describing and visualizing data
 Making conclusions
It is important to keep all three steps in mind for any question we want more knowledge about.
Knowing which types of data are available can tell you what kinds of questions you can
answer with statistical methods.
Knowing which questions you want to answer can help guide what sort of data you need. A
lot of data might be available, and knowing what to focus on is important.
How is Statistics Used?
Statistics can be used to explain things in a precise way. You can use it to understand and make
conclusions about the group that you want to know more about. This group is called
the population.
• A population could be many different kinds of groups. It could be:
• All of the people in a country
• All the businesses in an industry
• All the customers of a business
• All people that play football who are older than 45 and so on
- it just depends on what you want to know about.
Gathering data about the population will give you a sample. This is a part of the whole population. Statistical methods are then used on that sample.
The results of the statistical methods from the sample are used to make conclusions about the population.
Important Concepts in Statistics
o Predictions and Explanations
o Populations and Samples
o Parameters and Sample Statistics
o Sampling Methods
o Data Types
o Measurement Level
o Descriptive Statistics
o Random Variables
o Univariate and Multivariate Statistics
o Probability Calculation
o Probability Distributions
o Statistical Inference
o Parameter Estimation
o Hypothesis Testing
o Correlation
o Regression Analysis
o Causal Inference
Statistics and Programming:
Statistical analysis is typically done with computers. Small amounts of data can be analyzed reasonably well without computers.
Historically, all data analysis was performed manually. It was time-consuming and prone to errors.
Nowadays, programming and software are typically used for data analysis.
In this course, we will see examples of code to do statistics with the programming
languages Python and R.
Statistics - Describing Data
Describing data is typically the second step of statistical analysis after gathering data.
Descriptive Statistics
The information (data) from your sample or population can be visualized with graphs
or summarized by numbers. This will show key information in a simpler way than just
looking at raw data. It can help us understand how the data is distributed.
Graphs can visually show the data distribution.
Examples of graphs include:
o Histograms
o Pie charts
o Bar graphs
o Box plots
Statistics - Describing Data
Some graphs have a close connection to numerical summary statistics. Calculating those gives
us the basis of these graphs.
For example, a box plot visually shows the quartiles of a data distribution.
Quartiles are the data split into four equal-sized parts, or quarters. A quartile is one type of summary statistic.
Summary statistics take a large amount of information and sum it up in a few key values. Numbers calculated from the data also describe the shape of the distributions. These are individual 'statistics'.
Statistics - Making Conclusions
Using statistics to make conclusions about a population is called statistical
inference.
Statistics from the data in the sample are used to make conclusions about the whole population. This is a type of statistical inference.
Probability theory is used to calculate the certainty that those statistics also apply to the population.
When using a sample, there will always be some uncertainty about what the data looks like for the population.
Uncertainty is often expressed as confidence intervals.
Statistics - Making Conclusions
Confidence intervals are numerical ways of showing how likely it is that the true
value of this statistic is within a certain range for the population.
Hypothesis testing is another way of checking if a statement about a population is true. More precisely, it checks how likely it is that a hypothesis is true based on the sample data.
Some examples of statements or questions that can be checked with hypothesis testing:
Are people in the Netherlands taller than people in Denmark?
Do people prefer Pepsi or Coke?
Does a new medicine cure a disease?
Statistics - Making Conclusions
Causal Inference
Causal inference is used to investigate if something causes another thing.
For example: Does rain make plants grow?
If we think two things are related we can investigate to see if they correlate.
Statistics can be used to find out how strong this relation is.
Even if things are correlated, finding out if one thing is caused by another can be difficult. It can be done with good experimental design or other special statistical techniques.
Note: Good experimental design is often difficult to achieve because of ethical concerns or other practical
reasons.
Statistics - Prediction and Explanation
Some types of statistical methods are focused on predicting what will happen.
Other types of statistical methods are focused on explaining how things are connected.
Prediction
Some statistical methods are not focused on explaining how things are connected. Only the
accuracy of prediction is important.
Many statistical methods are successful at predicting without giving insight into how things are
connected.
Some types of machine learning let computers do the hard work, but the way they predict is difficult to understand. These approaches can also be vulnerable to mistakes if the circumstances change, since how they work is less clear.
Note: Predictions about future events are called forecasts. Not all predictions are about the future.
Some predictions can be about something else that is unknown, even if it is not in the future.
Statistics - Prediction and Explanation
Explanation
Different statistical methods are often used for explaining how things are connected. These
statistical methods may not make good predictions.
These statistical methods often explain only small parts of the whole situation. But if you only want to know how a few things are connected, the rest might not matter.
If these methods accurately explain how all the relevant things are connected, they will also be good at prediction. But managing to explain every detail is often challenging.
Sometimes we are specifically interested in figuring out if one thing causes another. This is called causal inference.
If we are looking at complicated situations, many things are connected. To figure out what causes
what, we need to untangle every way these things are connected.
Statistics - Population and Samples
Population: Everything in the group that we want to learn about.
Sample: A part of the population.
For good statistical analysis, the sample needs to be as "similar" as possible to the
population. If they are similar enough, we say that the sample is representative of the
population.
The sample is used to make conclusions about the whole population. If the sample is not
similar enough to the whole population, the conclusions could be useless.
Statistics - Parameters and Statistics
The terms 'parameter' and (sample) 'statistic' refer to key concepts that are closely related in
statistics.
They are also directly connected to the concepts of populations and samples.
Parameter: A number that describes something about the whole population.
Sample statistic: A number that describes something about the sample.
The parameters are the key things we want to learn about. The parameters are usually unknown.
Sample statistics give us estimates for parameters.
There will always be some uncertainty about how accurate estimates are. More certainty
gives us more useful knowledge.
For every parameter we want to learn about we can get a sample and calculate a sample
statistic, which gives us an estimate of the parameter.
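A minimal sketch of this idea, using made-up numbers: the mean of a random sample is a sample statistic that estimates the (usually unknown) population mean.

import numpy as np

rng = np.random.default_rng(seed=1)
population = rng.normal(loc=40, scale=10, size=100_000) #fictional population ages

sample = rng.choice(population, size=100) #a random sample of 100 observations

print(population.mean()) #the parameter (unknown in practice)
print(sample.mean()) #the sample statistic: an estimate of the parameter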
Statistics - Parameters and Statistics
Some Important Examples
Mean, median and mode are different types of averages (typical values in a population). For example:
The typical age of people in a country
The typical profits of a company
The typical range of an electric car
Variance and standard deviation are two types of values describing how spread out the values are.
A single class of students in a school would usually be about the same age. The age of the students will have a low variance and standard deviation.
A whole country will have people of all kinds of different ages. The variance and standard deviation of age in the whole country would then be bigger than in a single school grade.
Statistics - Study Types
A statistical study can be a part of the process of gathering data.
There are different types of studies. Some are better than others, but they might be harder to do.
Main Types of Statistical Studies
The main types of statistical studies are observational and experimental studies.
We are often interested in knowing if something is the cause of another thing.
Experimental studies are generally better than observational studies for investigating this, but usually require more effort.
An observational study is when we observe and gather data without changing anything.
Statistics - Study Types
Experimental Studies
In an experimental study, the circumstances around the sample are changed. Usually, we compare two groups from a population, and these two groups are treated differently.
One example can be a medical study to see if a new medicine is effective.
One group receives the medicine and the other does not. These are the different circumstances
around those samples.
We can compare the health of both groups afterwards and see if the results are different.
Experimental studies can allow us to investigate causal relationships. A well designed experimental
study can be useful since it can isolate the relationship we are interested in from other effects.
Then we can be more confident that we are measuring the true effect.
Statistics - Sample Types
A study needs participants and there are different ways of gathering them.
Some methods are better than others, but they might be more difficult.
Different Types of Sampling Methods:
Random Sampling
A random sample is where every member of the population has an equal chance to be chosen.
Random sampling is the best method. But it can be difficult, or impossible, to make sure that the sample is completely random.
Note: Every other sampling method is compared to how close it is to a random sample - the closer,
the better.
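A minimal sketch of drawing a random sample with Pandas (the data frame here is fictional):

import pandas as pd

df = pd.DataFrame({"age": [23, 35, 41, 19, 52, 47, 30, 28, 64, 38]})

#sample() picks rows at random; random_state makes the result reproducible
random_sample = df.sample(n=5, random_state=42)
print(random_sample)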
Convenience Sampling
A convenience sample is where the participants that are the easiest to reach are chosen.
Note: Convenience sampling is the easiest to do.
In many cases this sample will not be similar enough to the population, and the conclusions can
potentially be useless.
Statistics - Sample Types
Systematic Sampling
A systematic sample is where the participants are chosen by some regular system.
For example:
The first 30 people in a queue
Every third on a list
The first 10 and the last 10
Stratified Sampling
A stratified sample is where the population is split into smaller groups called 'strata'.
The 'strata' can, for example, be based on demographics, like:
Different age groups
Professions
Stratification of a sample is the first step. Another sampling method (like random sampling) is used for
the second step of choosing participants from all of the smaller groups (strata).
Statistics - Sample Types
Clustered Sampling
A clustered sample is where the population is split into smaller groups called 'clusters'.
The clusters are usually natural, like different cities in a country.
The clusters are chosen randomly for the sample.
All members of the clusters can participate in the sample, or members can be chosen
randomly from the clusters in a third step.
Statistics - Data Types
Data can be of different types, and different types require different statistical methods to analyze.
Different types of data
There are two main types of data: Qualitative (or 'categorical') and quantitative (or 'numerical').
These main types also have different sub-types depending on their measurement level.
Qualitative Data
Information about something that can be sorted into different categories that can't be described
directly by numbers.
Examples:
• Brands
• Nationality
• Professions
Statistics - Data Types
With categorical data we can calculate statistics like proportions. For example, the proportion of
Indian people in the world, or the percent of people who prefer one brand to another.
Quantitative Data
Information about something that is described by numbers.
Examples:
• Income
• Age
• Height
With numerical data we can calculate statistics like the average income in a country, or the range
of heights of players in a football team.
Statistics - Measurement Levels
Different data types have different measurement levels.
Measurement levels are important for what types of statistics can be calculated and how to best
present the data.
The main types of data are Qualitative (categories) and Quantitative (numerical). These are further
split into the following measurement levels.
These measurement levels are also called measurement 'scales'
Nominal Level
Categories (qualitative data) without any order.
Examples:
• Brand names
• Countries
• Colors
Statistics - Measurement Levels
Ordinal level
Categories that can be ordered (from low to high), but the precise "distance" between each is not
meaningful.
Examples:
• Letter grade scales from F to A
• Military ranks
• Level of satisfaction with a product
Consider letter grades from F to A: Is the grade A precisely twice as good as a B? And, is the grade B
also twice as good as C?
Exactly how much distance there is between grades is not clear and precise. If the grades are based on the number of points on a test, you can say that there is a precise "distance" on the point scale, but not between the grades themselves.
Statistics - Measurement Levels
Interval Level
Data that can be ordered and the distance between them is objectively meaningful. But there is no
natural 0-value where the scale originates.
Examples:
Years in a calendar
Temperature measured in Fahrenheit
Note: Interval scales are usually invented by people, like degrees of temperature.
0 degrees Celsius is 32 degrees Fahrenheit. There are consistent distances between degrees (for every 1 extra degree Celsius, there are 1.8 extra degrees Fahrenheit), but the two scales do not agree on where 0 degrees is.
Statistics - Descriptive Statistics
Descriptive statistics gives us insight into data without having to look at all of it in
detail.
Key Features to Describe about Data:
Getting a quick overview of how the data is distributed is an important step in statistical methods.
We calculate key numerical values about the data that tell us about the distribution of the data.
We also draw graphs showing visually how the data is distributed.
Key Features of Data:
• Where is the center of the data? (location)
• How much does the data vary? (scale)
• What is the shape of the data? (shape)
These can be described by summary statistics (numerical values).
Statistics - Descriptive Statistics
The Center of the Data
The center of the data is where most of the values are concentrated.
Different kinds of averages, like mean, median and mode, are measures of the
center.
Note: Measures of the center are also called location parameters, because they tell us something
about where data is 'located' on a number line.
The Variation of the Data
The variation of the data is how spread out the data are around the center.
Statistics like standard deviation, range and quartiles are measures of variation.
Note: Measures of variation are also called scale parameters.
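A minimal sketch of computing measures of center and variation with NumPy, reusing the Average_Pulse values from earlier:

import numpy as np

values = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

print(np.mean(values)) #center: the mean (102.5)
print(np.median(values)) #center: the median (102.5)
print(np.std(values)) #variation: the standard deviation
print(np.percentile(values, [25, 50, 75])) #variation: the quartiles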
Statistics - Descriptive Statistics
The Shape of the Data
The shape of the data can refer to how the data are bunched up on either side of the center.
Statistics like skew describe if the right or left side of the center is bigger. Skew is one type of shape parameter.
Frequency Tables
One typical way of presenting data is with frequency tables.
A frequency table counts and orders data into a table. Typically, the data will need to be sorted into intervals.
Frequency tables are often the basis for making graphs to visually present the data.
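A minimal sketch of building a frequency table with Pandas (the ages are fictional):

import pandas as pd

ages = pd.Series([12, 25, 31, 34, 46, 52, 57, 63, 65, 68, 74, 82])

#Sort the ages into 10-year intervals and count how many fall in each
intervals = pd.cut(ages, bins=[10, 20, 30, 40, 50, 60, 70, 80, 90])
print(intervals.value_counts().sort_index())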
Statistics - Descriptive Statistics
Visualizing Data
Different types of graphs are used for different kinds of data. For example:
• Pie charts for qualitative data
• Histograms for quantitative data
• Scatter plots for bivariate data
Graphs often have a close connection to numerical summary statistics.
For example, box plots show where the quartiles are.
A box plot also shows the minimum and maximum values, the range, the interquartile range, and the median.
Statistics - Frequency Tables
[Frequency table of the ages of Nobel Prize winners up to the year 2020]
We can see that there is only one winner from ages 10 to 19, and that the highest number of winners are in their 60s.
Statistics - Descriptive Statistics
Relative Frequency Tables
Relative frequency means the number of times a value appears in the data compared to the total amount. A percentage is a relative frequency.
Here are the relative frequencies of the ages of Nobel Prize winners. Now, all the frequencies are divided by the total (934) to give percentages.
Statistics - Descriptive Statistics
Cumulative Frequency Tables
Cumulative frequency counts up to a particular value.
Here are the cumulative frequencies of the ages of Nobel Prize winners. Now, we can see how many winners have been younger than a certain age.
Cumulative frequency tables can also be made with relative frequencies (percentages).
Statistics - Histograms
A histogram visually presents quantitative data.
A histogram is a widely used graph to show the
distribution of quantitative (numerical) data.
It shows the frequency of values in the data,
usually in intervals of values. Frequency is the
amount of times that value appeared in the data.
Each interval is represented with a bar, placed
next to the other intervals on a number line.
The height of the bar represents the frequency of
values in that interval.
Here is a histogram of the age of all 934 Nobel
Prize winners up to the year 2020:
This histogram uses age intervals from 10 to 19, 20
to 29, and so on.
Note: Histograms are similar to bar graphs, which
are used for qualitative data.
Statistics - Histograms
Bin Width
The intervals of values are often called 'bins'. And
the length of an interval is called 'bin width'.
We can choose any width. It is best with a bin width
that shows enough detail without being confusing.
Here is a histogram of the same Nobel Prize winner
data, but with bin widths of 5 instead of 10:
This histogram uses age intervals from 15 to 19, 20 to 24, 25 to 29, and so on.
Smaller intervals give a more detailed look at the distribution of the age values in the data.
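Histograms like these can be drawn with Matplotlib's hist() function. A minimal sketch, using hypothetical ages since the real data set is not included here:
Example
import numpy
import matplotlib.pyplot as plt

# Hypothetical ages, roughly shaped like the Nobel Prize data
ages = numpy.random.normal(60, 12, 934).astype(int)

plt.hist(ages, bins=range(10, 100, 10))   # bin width of 10
plt.show()

plt.hist(ages, bins=range(10, 100, 5))    # bin width of 5 shows more detail
plt.show()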
Statistics - Bar Graphs
A bar graph visually presents qualitative data.
Bar graphs are used to show the distribution of
qualitative (categorical) data.
It shows the frequency of values in the data. Frequency is the number of times a value appears in the data.
Each category is represented with a bar. The height
of the bar represents the frequency of values from
that category in the data.
Here is a bar graph of the number of people who
have won a Nobel Prize in each category up to the
year 2020:
Some of the categories have existed longer
than others. Multiple winners are also more
common in some categories. So there is a
different number of winners in each
category.
Note: Bar graphs are similar to histograms, which are used for quantitative data.
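A bar graph can be drawn with Matplotlib's bar() function. A minimal sketch with hypothetical category counts (the real numbers are not included here):
Example
import matplotlib.pyplot as plt

# Hypothetical number of winners per category, for illustration only
categories = ["Chemistry", "Economics", "Literature", "Medicine", "Peace", "Physics"]
winners = [186, 86, 117, 222, 107, 216]

plt.bar(categories, winners)   # one bar per category, height = frequency
plt.show()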
Statistics - Pie Charts
A pie chart visually presents qualitative data.
Pie graphs are used to show the distribution of
qualitative (categorical) data.
It shows the frequency or relative frequency of values
in the data.
Frequency is the number of times a value appears in the data. Relative frequency is the percentage of the total.
Each category is represented with a slice in the 'pie'
(circle). The size of each slice represents the frequency
of values from that category in the data.
Here is a pie chart of the number of people who have
won a Nobel Prize in each category up to the year 2020:
This pie chart shows relative frequency. So
each slice is sized by the percentage for
each category.
Some of the categories have existed
longer than others. Multiple winners are
also more common in some categories. So
there is a different number of winners in
each category.
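A pie chart can be drawn with Matplotlib's pie() function. A minimal sketch, reusing the hypothetical counts from the bar graph example:
Example
import matplotlib.pyplot as plt

# Hypothetical number of winners per category, for illustration only
categories = ["Chemistry", "Economics", "Literature", "Medicine", "Peace", "Physics"]
winners = [186, 86, 117, 222, 107, 216]

plt.pie(winners, labels=categories, autopct="%.1f%%")   # slices sized by relative frequency
plt.show()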
Statistics - Box Plots
A box plot is a graph used to show key features of quantitative data.
A box plot is a good way to show many important features of quantitative
(numerical) data.
It shows the median of the data. This is the middle value of the data and one type of average value.
It also shows the range and the quartiles of the data. This tells us something
about how spread out the data is.
Note: Box plots are also called 'box and whiskers plots'.
Here is a box plot of the age of all the Nobel Prize winners up to the year
2020:
Statistics - Box Plots
The median is the red line through
the middle of the 'box'. We can see
that this is just above the number
60 on the number line below. So
the middle value of age is 60 years.
The left side of the box is the
1st quartile. This is the value that
separates the first quarter, or 25%
of the data, from the rest. Here,
this is 51 years.
The right side of the box is the
3rd quartile. This is the value that
separates the first three quarters,
or 75% of the data, from the rest.
Here, this is 69 years.
Statistics - Box Plots
The distance between the sides of
the box is called the inter-quartile
range (IQR). This tells us where the
'middle half' of the values are.
Here, half of the winners were
between 51 and 69 years.
The ends of the lines from the box
at the left and the right are the
minimum and maximum values in
the data. The distance between
these is called the range.
The youngest winner was 17 years
old, and the oldest was 97 years
old. So the range of the age of
winners was 80 years.
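A box plot like the one described above can be drawn with Matplotlib's boxplot() function. A minimal sketch with hypothetical data:
Example
import numpy
import matplotlib.pyplot as plt

# Hypothetical ages, roughly shaped like the Nobel Prize data
ages = numpy.random.normal(60, 12, 934)

plt.boxplot(ages, vert=False)   # horizontal box plot: median, quartiles, min and max
plt.show()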
Statistics - Average
An average is a measure of where most of the values in the data are located.
The center of the data is where most of the values in the data are located. Averages are
measures of the location of the center.
There are different types of averages. The most commonly used are:
o Mean
o Median
o Mode
Note: In statistics, averages are often referred to as 'measures of central tendency'.
For example, using the values:
40, 21, 55, 21, 48, 13, 72
Mean
The mean is the sum of all values divided by the number of values:
(40 + 21 + 55 + 21 + 48 + 13 + 72)/7 = 38.57
Statistics - Average
Median
The median is the 'middle value' of the data.
The median is found by ordering all the values in the data and picking the middle value:
13, 21, 21, 40, 48, 55, 72
The median is less influenced by extreme values in the data than the mean.
Changing the last value to 356 does not change the median:
13, 21, 21, 40, 48, 55, 356
The median is still 40.
Changing the last value to 356 changes the mean a lot:
(13 + 21 + 21 + 40 + 48 + 55 + 72)/7 = 38.57
(13 + 21 + 21 + 40 + 48 + 55 + 356)/7 = 79.14
Note: Extreme values are values in the data that are much smaller or larger than the average values in
the data.
Statistics - Average
Mode
The mode is the value(s) that appears most often in the data:
40, 21, 55, 21, 48, 13, 72
Here, 21 appears two times, and the other values only once. The mode of this data is 21.
The mode is also used for categorical data, unlike the median and mean. Categorical data can't
be described directly with numbers, like names:
Alice, John, Bob, Maria, John, Julia, Carol
Here, John appears two times, and the other values only once. The mode of this data is John.
Note: There can be more than one mode if multiple values appear the same number of times in the
data.
Statistics - Mean
The mean is a type of average value, which describes where the center of the data is located.
Mean
The mean is usually referred to as 'the average'.
The mean is the sum of all the values in the data divided by the total number of values in the data.
The mean is calculated for numerical variables. A variable is something in the data that can vary,
like:
Age
Height
Income
Note: There are multiple types of mean values. The most common type of mean is the arithmetic mean.
In this tutorial 'mean' refers to the arithmetic mean.
Statistics - Mean
Calculating the Mean
You can calculate the mean for both the population and the sample.
The formulas are the same, but use different symbols to refer to the population mean (μ) and the sample mean (x̄).
Calculating the mean is done with this formula:
μ = (Σxᵢ) / n
That is: the sum of all values, divided by the number of values (n).
Statistics - Mean
Calculation with Programming
The mean can easily be calculated with many programming languages.
Using software and programming to calculate statistics is more common for bigger sets of data, as
calculating by hand becomes difficult.
Example
With Python use the NumPy library mean() method to find the mean of the values 4,11,7,14:
import numpy
values = [4,11,7,14]
x = numpy.mean(values)
print(x)
o/p: 9.0
Statistics - Mean
Calculation with Programming
Example
Use the R mean() function to find the mean of the values 4,7,11,14:
values <- c(4,7,11,14)
mean(values)
o/p:
[1] 9
Statistics - Median
The median is a type of average value, which describes where the center of the data is located.
The median is the middle value in a data set ordered from low to high.
Finding the Median
The median can only be calculated for numerical variables.
The formula for finding the position of the middle value is:
(n + 1)/2
Where n is the total number of observations.
If the total number of observations is an odd number, the formula gives a whole number and the value of this observation is the median.
13, 21, 21, 40, 48, 55, 72
Here, there are 7 total observations, so the median is the (7 + 1)/2 = 4th value:
The 4th value in the ordered list is 40, so that is the median.
Statistics - Median
If the total number of observations is an even number, the formula gives a decimal number between two observations.
13, 21, 21, 40, 42, 48, 55, 72
Here, there are 8 total observations, so the median lies between the 4th and 5th values: (8 + 1)/2 = 4.5.
The 4th and 5th values in the ordered list are 40 and 42, so the median is the mean of these two values: (40 + 42)/2 = 41.
Note: It is important that the numbers are ordered before you can find the median.
Statistics - Median
Finding the Median with Programming
The median can easily be found with many programming languages.
Using software and programming to calculate statistics is more common for bigger sets of data, as
finding it manually becomes difficult.
Example
With Python use the NumPy library median() method to find the median of the values 13, 21, 21,
40, 42, 48, 55, 72:
import numpy
values = [13,21,21,40,42,48,55,72]
x = numpy.median(values)
print(x)
o/p: 41.0
Statistics - Mode
The mode is a type of average value, which describes where most of the data is located.
Mode
The mode is the value(s) that are the most common in the data.
A dataset can have multiple values that are modes.
A distribution of values with only one mode is called unimodal.
A distribution of values with two modes is called bimodal. In general, a distribution with more than
one mode is called multimodal.
Mode can be found for both categorical and numerical data.
Statistics - Mode
Finding the Mode
Here is a numerical example:
4, 7, 3, 8, 11, 7, 10, 19, 6, 9, 12, 12
Both 7 and 12 appear two times each, and the other values only once. The modes of this data are 7 and 12.
Here is a categorical example with names:
Alice, John, Bob, Maria, John, Julia, Carol
John appears two times, and the other values only once. The mode of this data is John.
Statistics - Mode
Finding the Mode with Programming
The mode can easily be found with many programming languages.
Using software and programming to calculate statistics is more common for bigger sets of data, as
calculating manually becomes difficult.
Example
With Python use the statistics library multimode() method to find the modes of the values
4,7,3,8,11,7,10,19,6,9,12,12:
from statistics import multimode
values = [4,7,3,8,11,7,10,19,6,9,12,12]
x = multimode(values)
print(x)
o/p: [7, 12]
Statistics - Variation
Variation is a measure of how spread out the data is around the center of the data.
Measures of variation are statistics of how far away the values in the observations (data points) are
from each other.
There are different measures of variation. The most commonly used are:
o Range
o Quartiles and Percentiles
o Interquartile Range
o Standard Deviation
Measures of variation combined with an average (measure of center) gives a good picture of the
distribution of the data.
Note: These measures of variation can only be calculated for numerical data.
Statistics - Variation
Range
The range is the difference between the smallest and the largest value of the data.
Range is the simplest measure of variation.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the
range:
The youngest winner was 17
years and the oldest was 97
years. The range of ages for
Nobel Prize winners is then 80
years.
Statistics - Variation
Quartiles and Percentiles
Quartiles and percentiles are ways of separating equal numbers of values in the data into parts.
Quartiles are values that separate the data into four equal parts.
Percentiles are values that separate the data into 100 equal parts.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the
quartiles:
The quartiles (Q0,Q1,Q2,Q3,Q4) are the
values that separate each quarter.
Between Q0 and Q1 are the 25% lowest
values in the data. Between Q1 and Q2
are the next 25%. And so on.
o Q0 is the smallest value in the data.
o Q2 is the middle value (median).
o Q4 is the largest value in the data.
Statistics - Variation
Interquartile Range
Interquartile range is the difference between the first and third quartiles (Q1 and Q3).
The 'middle half' of the data is between the first and third quartile.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the
interquartile range (IQR):
Here, the middle half of the data is between 51 and 69 years. The interquartile range for Nobel Prize winners is then 18 years.
Statistics - Variation
Standard Deviation
It is the most used measure of variation.
Standard deviation (σ) measures how far a 'typical' observation is from the average of the data (μ).
Standard deviation is important for many statistical methods.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing standard
deviations:
Note: Values within one standard
deviation (σ) are considered to be typical.
Values outside three standard deviations
are considered to be outliers.
Statistics - Range
The range is a measure of variation, which describes how spread out the data is.
Range
The range is the difference between the smallest and the largest value of the data.
Range is the simplest measure of variation.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the
range:
The youngest winner was 17
years and the oldest was 97
years. The range of ages for
Nobel Prize winners is then 80
years.
Statistics - Range
The range is a measure of variation, which describes how spread out the data is.
Calculating the Range
The range can only be calculated for numerical data.
First, find the smallest and largest values of this example:
13, 21, 21, 40, 48, 55, 72
Calculate the difference by subtracting the smallest from the largest:
72 - 13 = 59
Statistics - Range
The range is a measure of variation, which describes how spread out the data is.
Calculating the Range with Programming
The range can easily be found with many programming languages.
Using software and programming to calculate statistics is more common for bigger sets of data, as
finding it manually becomes difficult.
Example
With Python use the NumPy library ptp() method to find the range of the values 13, 21, 21, 40, 48,
55, 72:
import numpy
values = [13,21,21,40,48,55,72]
x = numpy.ptp(values)
print(x)
o/p: 59
Statistics - Quartiles and Percentiles
Quartiles and percentiles are measures of variation, which describe how spread out the data is.
Quartiles and percentiles are both types of quantiles.
Quartiles
Quartiles are values that separate the data into four equal parts.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the
quartiles:
The quartiles (Q0,Q1,Q2,Q3,Q4) are the values that separate
each quarter.
Between Q0 and Q1 are the 25% lowest values in the data.
Between Q1 and Q2 are the next 25%. And so on.
Q0 is the smallest value in the data.
Q1 is the value separating the first quarter from the second
quarter of the data.
Q2 is the middle value (median), separating the bottom from
the top half.
Q3 is the value separating the third quarter from the fourth
quarter
Q4 is the largest value in the data.
Statistics - Quartiles and Percentiles
Quartiles and percentiles are measures of variation, which describe how spread out the data is.
Calculating Quartiles with Programming
Quartiles can easily be found with many programming languages.
Using software and programming to calculate statistics is more common for bigger sets of data, as
finding it manually becomes difficult.
Example
With Python use the NumPy library quantile() method to find the quartiles of the values 13, 21, 21,
40, 42, 48, 55, 72:
import numpy
values = [13,21,21,40,42,48,55,72]
x = numpy.quantile(values, [0,0.25,0.5,0.75,1])
print(x)
O/P : [13. 21. 41. 49.75 72. ]
Statistics - Interquartile Range
Interquartile range is a measure of variation, which describes how spread out the data is.
Interquartile Range is the difference between the first and third quartiles (Q1 and Q3).
The 'middle half' of the data is between the first and third quartile.
The first quartile is the value in the data that separates the bottom 25% of values from the top
75%.
The third quartile is the value in the data that separates the bottom 75% of the values from the top
25%
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the
interquartile range (IQR):
Statistics - Interquartile Range
Here, the middle half of the data is between 51 and 69 years. The interquartile range for Nobel Prize winners is then 18 years.
Statistics - Interquartile Range
Calculating the Interquartile Range with Programming
The interquartile range can easily be found with many programming languages.
Using software and programming to calculate statistics is more common for bigger sets of data, as
finding it manually becomes difficult.
Example
With Python use the SciPy library iqr() method to find the interquartile range of the values 13, 21,
21, 40, 42, 48, 55, 72:
from scipy import stats
values = [13,21,21,40,42,48,55,72]
x = stats.iqr(values)
print(x)
O/P : 28.75
Statistics - Standard Deviation
Standard deviation is the most commonly used measure of variation, which describes how spread
out the data is.
Standard deviation (σ) measures how far a 'typical' observation is from the average of the data (μ).
Standard deviation is important for many statistical methods.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing standard
deviations:
Each dotted line in the histogram shows a shift of
one extra standard deviation.
If the data is normally distributed:
o Roughly 68.3% of the data is within 1 standard
deviation of the average (from μ-1σ to μ+1σ)
o Roughly 95.5% of the data is within 2 standard
deviations of the average (from μ-2σ to μ+2σ)
o Roughly 99.7% of the data is within 3 standard
deviations of the average (from μ-3σ to μ+3σ)
Note: A normal distribution has a "bell" shape and
spreads out equally on both sides.
Statistics - Standard Deviation
Calculating the Standard Deviation
You can calculate the standard deviation for both the population and the sample.
The formulas are almost the same, but use different symbols to refer to the population standard deviation (σ) and the sample standard deviation (s).
Calculating the population standard deviation (σ) is done with this formula:
σ = √( Σ(xᵢ − μ)² / n )
Calculating the sample standard deviation (s) is done with this formula:
s = √( Σ(xᵢ − x̄)² / (n − 1) )
Statistics - Standard Deviation
Calculating the Standard Deviation with Programming
The standard deviation can easily be calculated with many programming languages.
Using software and programming to calculate statistics is more common for bigger sets of data, as
calculating by hand becomes difficult.
Population Standard Deviation
Example
With Python use the NumPy library std() method to find the standard deviation of the values
4,11,7,14:
import numpy
values = [4,11,7,14]
x = numpy.std(values)
print(x)
o/p:
3.8078865529319543
Statistics - Standard Deviation
Calculating the Standard Deviation with Programming
Sample Standard Deviation
Example
With Python use the NumPy library std() method with ddof=1 to find the sample standard deviation of the values 4,11,7,14:
import numpy
values = [4,11,7,14]
x = numpy.std(values, ddof=1)
print(x)
o/p: 4.396968652757639
Inferential Statistics
Statistics - Statistical Inference
Statistical Inference
Using data analysis and statistics to make conclusions about a population is called statistical
inference.
The main types of statistical inference are:
• Estimation
• Hypothesis testing
Estimation
Statistics from a sample are used to estimate population parameters.
The most likely value is called a point estimate.
There is always uncertainty when estimating.
The uncertainty is often expressed as confidence intervals defined by a likely lowest and highest
value for the parameter.
An example could be a confidence interval for the number of bicycles a Dutch person owns:
"The average number of bikes a Dutch person owns is between 3.5 and 6."
Statistics - Statistical Inference
Hypothesis Testing
Hypothesis testing is a method to check if a claim about a population is true. More precisely, it checks how likely it is that a hypothesis is true, based on the sample data.
There are different types of hypothesis testing.
The steps of the test depend on:
Type of data (categorical or numerical)
If you are looking at:
o A single group
o Comparing one group to another
o Comparing the same group before and after a change
Some examples of claims or questions that can be checked with hypothesis testing:
o 90% of Australians are left handed
o Is the average weight of dogs more than 40kg?
o Do doctors make more money than lawyers?
Statistics - Normal Distribution
The normal distribution is an important probability distribution used in statistics.
Many real world examples of data are normally distributed.
Normal Distribution
The normal distribution is described by the mean (μ) and the standard deviation (σ).
The normal distribution is often referred to as a 'bell curve' because of its shape:
• Most of the values are around the center (μ)
• The median and mean are equal
• It has only one mode
• It is symmetric, meaning it decreases the same amount on the left and the right of the center
The area under the curve of the normal distribution represents probabilities for the data.
The area under the whole curve is equal to 1, or 100%
Here is a graph of a normal distribution with probabilities between standard deviations (σ):
Statistics - Normal Distribution
• Roughly 68.3% of the data is
within 1 standard deviation
of the average (from μ-1σ to
μ+1σ)
• Roughly 95.5% of the data is
within 2 standard deviations
of the average (from μ-2σ to
μ+2σ)
• Roughly 99.7% of the data is
within 3 standard deviations
of the average (from μ-3σ to
μ+3σ)
Note: Probabilities of the normal distribution can only be calculated for intervals (between two values).
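These interval probabilities can be verified with SciPy's norm.cdf() function. A small sketch:
Example
import scipy.stats as stats

# Probability of a value falling within 1, 2 and 3 standard deviations of the mean
for k in (1, 2, 3):
    print(stats.norm.cdf(k) - stats.norm.cdf(-k))
O/P (approximately): 0.6827, 0.9545, 0.9973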
Statistics - Normal Distribution
Different Mean and Standard Deviations
The mean describes where the center of the normal distribution is.
Here is a graph showing three different normal distributions with the same standard deviation but
different means.
The standard deviation describes
how spread out the normal
distribution is.
Here is a graph showing three
different normal distributions with
the same mean but different
standard deviations.
Statistics - Normal Distribution
Different Mean and Standard Deviations
The purple curve has the biggest standard deviation and the black curve has the smallest standard deviation.
The area under each of the curves is still 1, or 100%.
Statistics - Normal Distribution
A Real Data Example of Normally Distributed Data
Real world data is often normally distributed.
Here is a histogram of the age of Nobel Prize winners when they won the prize:
The normal distribution drawn on top of
the histogram is based on the population
mean (μ) and standard deviation (σ) of the
real data.
We can see that the histogram is close to a normal distribution.
Examples of real world variables that can
be normally distributed:
• Test scores
• Height
• Birth weight
Statistics - Normal Distribution
Probability Distributions
Probability distributions are functions that calculate the probabilities of the outcomes of random variables.
Typical examples of random variables are coin tosses and dice rolls.
Here is a graph showing the results of a growing number of coin tosses and the expected values of the results (heads or tails).
The expected values of the coin toss is the probability distribution of the coin toss.
Notice how the result of random coin tosses
gets closer to the expected values (50%) as
the number of tosses increases.
Similarly, here is a graph showing the results
of a growing number of dice rolls and the
expected values of the results (from 1 to 6).
Statistics - Normal Distribution
Notice again how the result of random dice
rolls gets closer to the expected values (1/6,
or 16.666%) as the number of rolls increases.
When the random variable is a sum of dice
rolls the results and expected values take a
different shape.
The different shape comes from there being
more ways of getting a sum of near the
middle, than a small or large sum.
Statistics - Normal Distribution
As we keep increasing the number of dice for a sum, the shape of the results and expected values looks more and more like a normal distribution.
Many real world variables follow a similar pattern and naturally form normal distributions.
Normally distributed variables can be analyzed with well-known techniques.
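This convergence toward a bell shape can be simulated in a few lines of Python. A minimal sketch (the number of dice and rolls are arbitrary choices):
Example
import numpy
import matplotlib.pyplot as plt

# Roll 10 dice 100000 times and sum each roll
sums = numpy.random.randint(1, 7, size=(100000, 10)).sum(axis=1)

plt.hist(sums, bins=range(10, 61))   # the histogram approaches a normal distribution
plt.show()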
Statistics - Standard Normal Distribution
The standard normal distribution is a normal distribution where the mean is 0 and the standard
deviation is 1.
Normally distributed data can be transformed into a standard normal distribution.
Standardizing normally distributed data makes it easier to compare different sets of data.
The standard normal distribution is used for:
• Calculating confidence intervals
• Hypothesis tests
Here is a graph of the standard normal distribution with probability values (p-values) between the
standard deviations:
Statistics - Standard Normal Distribution
Standardizing makes it easier to calculate probabilities.
The functions for calculating probabilities are complex and difficult to calculate by hand.
Typically, probabilities are found by looking up tables of pre-calculated values, or by using software and programming.
The standard normal distribution is also called the 'Z-distribution' and the values are called 'Z-values' (or Z-scores).
Statistics - Standard Normal Distribution
Z-Values
Z-values express how many standard deviations from the mean a value is.
The formula for calculating a Z-value is:
Z=(x−μ)/σ
x is the value we are standardizing, μ is the mean, and σ is the standard deviation.
For example, if we know that:
The mean height of people in Germany is 170 cm (μ)
The standard deviation of the height of people in Germany is 10 cm (σ)
Bob is 200 cm tall (x)
Bob is 30 cm taller than the average person in Germany.
30 cm is 3 times 10 cm. So Bob's height is 3 standard deviations larger than the mean height in Germany.
Using the formula:
Z = (200 − 170)/10 = 30/10 = 3
Statistics - Standard Normal Distribution
Finding the P-value of a Z-Value
Using a Z-table or programming we can calculate how many people in Germany are shorter than Bob and how many are taller.
Example
With Python use the SciPy Stats library norm.cdf() function to find the probability of getting less than a Z-value of 3:
import scipy.stats as stats
print(stats.norm.cdf(3))
O/P: 0.9986501019683699
Statistics - Student's T Distribution
The student's t-distribution is similar to a normal distribution and used in statistical inference to
adjust for uncertainty. It is used for estimation and hypothesis testing of a population mean
(average).
The t-distribution is adjusted for the extra uncertainty of estimating the mean.
If the sample is small, the t-distribution is wider. If the sample is big, the t-distribution is narrower.
The bigger the sample size is, the closer the t-distribution gets to the standard normal
distribution.
Statistics - Student's T Distribution
Notice how some of the curves have bigger tails.
This is due to the uncertainty from a smaller
sample size.
The green curve has the smallest sample size.
For the t-distribution this is expressed as 'degrees
of freedom' (df), which is calculated by subtracting 1
from the sample size (n).
For example a sample size of 30 will make 29
degrees of freedom for the t-distribution.
The t-distribution is used to find critical t-values and p-values (probabilities) for estimation and hypothesis testing.
Note: Finding the critical t-values and p-values of the t-distribution is similar to finding the z-values and p-values of the standard normal distribution. But make sure to use the correct degrees of freedom.
Statistics - Student's T Distribution
Finding the P-Value of a T-Value
You can find the p-value of a t-value by using a t-table or with programming.
Example
With Python use the SciPy Stats library t.cdf() function to find the probability of getting less than a t-value of 2.1 with 29 degrees of freedom:
import scipy.stats as stats
print(stats.t.cdf(2.1, 29))
O/P: 0.9777290209818548
Finding the T-Value of a P-Value
You can find the t-value of a p-value by using a t-table or with programming.
Example
With Python use the SciPy Stats library t.ppf() function to find the t-value separating the top 25% from the bottom 75% with 29 degrees of freedom:
import scipy.stats as stats
print(stats.t.ppf(0.75, 29))
O/P: 0.6830438592467808
Statistics - Estimation
Point estimates are the most likely value for a population parameter.
Confidence intervals express the uncertainty of an estimated population parameter.
A point estimate is calculated from a sample.
The point estimate depends on the type of data:
• Categorical data: the number of occurrences divided by the sample size.
• Numerical data: the mean (the average) of the sample.
One example could be:
The point estimate for the average height of people in Denmark is 180 cm.
Estimates are always uncertain. This uncertainty can be expressed with a confidence interval.
Statistics - Estimation
Confidence Intervals
The confidence interval is defined by a lower bound and an upper bound.
This gives us a range of values that the true parameter is likely to be between.
For example that:
The average height of people in Denmark is between 170 cm and 190 cm.
Here, 170 cm is the lower bound, and 190 cm is the upper bound.
The lower and upper bounds of a confidence interval are based on the confidence level.
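A confidence interval can be calculated with SciPy. A minimal sketch, using a small, hypothetical sample of heights (the Denmark data is not included here):
Example
import numpy as np
import scipy.stats as stats

# Hypothetical sample of heights (cm), for illustration only
heights = [178, 182, 175, 185, 176, 181, 179, 184, 172, 188]

mean = np.mean(heights)      # the point estimate
sem = stats.sem(heights)     # standard error of the mean

# 95% confidence interval based on the t-distribution with n-1 degrees of freedom
lower, upper = stats.t.interval(0.95, df=len(heights) - 1, loc=mean, scale=sem)
print(lower, upper)
O/P (approximately): 176.5 183.5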
Statistics - Hypothesis Testing
Hypothesis testing is a formal way of checking if a hypothesis about a population is true or not.
A hypothesis is a claim about a population parameter.
A hypothesis test is a formal procedure to check if a hypothesis is true or not.
Examples of claims that can be checked:
The average height of people in Denmark is more than 170 cm.
The share of left handed people in Australia is not 10%.
The average income of dentists is less than the average income of lawyers.
Statistics - Hypothesis Testing
The Null and Alternative Hypothesis
Hypothesis testing is based on making two different claims about a population parameter.
The null hypothesis (H0) and the alternative hypothesis (H1) are the claims.
The two claims need to be mutually exclusive, meaning only one of them can be true.
The alternative hypothesis is typically what we are trying to prove.
For example, we want to check the following claim:
"The average height of people in Denmark is more than 170 cm."
In this case, the parameter is the average height of people in Denmark (μ).
The null and alternative hypothesis would be:
Null hypothesis: The average height of people in Denmark is 170 cm.
Alternative hypothesis: The average height of people in Denmark is more than 170 cm.
Statistics - Hypothesis Testing
The claims are often expressed with symbols like this:
H0: μ = 170 cm
H1: μ > 170 cm
If the data supports the alternative hypothesis, we reject the null hypothesis and accept the
alternative hypothesis.
If the data does not support the alternative hypothesis, we keep the null hypothesis.
Note: The alternative hypothesis is also referred to as \(H_{A}\)
Statistics - Hypothesis Testing
The Significance Level
The significance level (α) is the uncertainty we accept when rejecting the null hypothesis in
the hypothesis test.
The significance level is a percentage probability of accidentally making the wrong
conclusion.
Typical significance levels are:
•α=0.1 (10%)
•α=0.05 (5%)
•α=0.01 (1%)
A lower significance level means that the evidence in the data needs to be stronger to reject the null
hypothesis.
There is no "correct" significance level - it only states the uncertainty of the conclusion.
Note: A 5% significance level means that when we reject a null hypothesis:
We expect to reject a true null hypothesis 5 out of 100 times.
Statistics - Hypothesis Testing
The Critical Value and P-Value Approach
There are two main approaches used for hypothesis tests:
The critical value approach compares the test statistic with the critical value of the significance
level.
The p-value approach compares the p-value of the test statistic with the significance level.
The Critical Value Approach
The critical value approach checks if the test statistic is in the rejection region.
The rejection region is an area of probability in the tails of the distribution.
The size of the rejection region is decided by the significance level (α).
The value that separates the rejection region from the rest is called the critical value.
Statistics - Hypothesis Testing
Here is a graphical illustration:
If the test statistic is inside this rejection region, the null hypothesis is rejected.
For example, if the test statistic is 2.3 and the critical value is 2 for a significance level (α=0.05):
We reject the null hypothesis (H0) at the 0.05 significance level (α).
Statistics - Hypothesis Testing
The P-Value Approach
It checks if the p-value of the test statistic is smaller than the significance level (α).
The p-value of the test statistic is the area of probability in the tails of the distribution from the
value of the test statistic.
Here is a graphical illustration:
Statistics - Hypothesis Testing
If the p-value is smaller than the significance level, the null hypothesis is rejected.
The p-value directly tells us the lowest significance level where we can reject the null
hypothesis.
For example, if the p-value is 0.03:
We reject the null hypothesis (Ho) at a 0.05 significance level (α)
We keep the null hypothesis (Ho) at a 0.01 significance level (α)
Note: The two approaches are only different in how they present the conclusion.
Statistics - Hypothesis Testing
Steps for a Hypothesis Test
The following steps are used for a hypothesis test:
1. Check the conditions
2. Define the claims
3. Decide the significance level
4. Calculate the test statistic
5. Conclusion
One condition is that the sample is randomly selected from the population.
The other conditions depend on what type of parameter you are testing the hypothesis for.
Common parameters to test hypotheses are:
o Proportions (for qualitative data)
o Mean values (for numerical data)
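These steps can be carried out with SciPy. A minimal sketch of a one-sample t-test for a mean, using a hypothetical sample of heights and the claim "the average height is more than 170 cm" (assumes a recent SciPy version for the alternative parameter):
Example
import scipy.stats as stats

# Hypothetical sample of heights (cm), for illustration only
heights = [172, 175, 168, 181, 177, 173, 169, 176, 174, 178]

# H0: mean = 170, H1: mean > 170
t_stat, p_value = stats.ttest_1samp(heights, 170, alternative="greater")
print(t_stat, p_value)

# If p_value < 0.05, we reject H0 at the 0.05 significance level
O/P (approximately): 3.40 0.004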
We are missing one important variable that affects Calorie_Burnage, which is the Duration of the training session.
Duration in combination with Average_Pulse will together explain Calorie_Burnage more precisely.
Data Science - Linear Regression
The term regression is used when you try to find the relationship between variables.
In Machine Learning and in statistical modeling, that relationship is used to predict the
outcome of events.
In this module, we will cover the following questions:
Can we conclude that Average_Pulse and Duration are related to Calorie_Burnage?
Can we use Average_Pulse and Duration to predict Calorie_Burnage?
Data Science - Linear Regression
Least Square Method
Linear regression uses the least square method.
The concept is to draw a line through all the plotted data points. The line is positioned in a way that it minimizes the distance to all of the data points.
The distance is called "residuals" or "errors".
The red dashed lines represent the distance from the data points to the drawn mathematical function.
Data Science - Linear Regression
Least Square Method
Linear Regression Using One Explanatory Variable
In this example, we will try to predict Calorie_Burnage with Average_Pulse using Linear
Regression:
Example
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

full_health_data = pd.read_csv("data.csv", header=0, sep=",")

x = full_health_data["Average_Pulse"]
y = full_health_data["Calorie_Burnage"]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
  return slope * x + intercept

mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.ylim(ymin=0, ymax=2000)
plt.xlim(xmin=0, xmax=200)
plt.xlabel("Average_Pulse")
plt.ylabel("Calorie_Burnage")
plt.show()
Data Science - Linear Regression
Least Square Method
Linear Regression Using One Explanatory Variable
In this example, we will try to predict Calorie_Burnage with Average_Pulse using Linear
Regression:
Do you think that the line is able to predict Calorie_Burnage precisely?
We will show that the variable Average_Pulse alone is not enough to make a precise prediction of Calorie_Burnage.
Data Science - Linear Regression
Least Square Method
Example Explained:
• Import the modules you need: Pandas, Matplotlib and SciPy
• Isolate Average_Pulse as x. Isolate Calorie_Burnage as y
• Get important key values with: slope, intercept, r, p, std_err = stats.linregress(x, y)
• Create a function that uses the slope and intercept values to return a new value. This new value represents where on the y-axis the corresponding x value will be placed
• Run each value of the x array through the function. This will result in a new array with new values for the y-axis: mymodel = list(map(myfunc, x))
• Draw the original scatter plot: plt.scatter(x, y)
• Draw the line of linear regression: plt.plot(x, mymodel)
• Define maximum and minimum values of the axis
• Label the axis: "Average_Pulse" and "Calorie_Burnage"
Data Science - Regression Table
Regression Table
The output from linear regression can be summarized in a regression table.
The content of the table includes:
Information about the model
Coefficients of the linear regression function
Regression statistics
Statistics of the coefficients from the linear regression function
Other information that we will not cover in this module
Data Science - Regression Table
You can now begin your journey on analyzing advanced output!
Data Science - Regression Table
import pandas as pd
import statsmodels.formula.api as smf

full_health_data = pd.read_csv("data.csv", header=0, sep=",")

model = smf.ols('Calorie_Burnage ~ Average_Pulse', data = full_health_data)
results = model.fit()
print(results.summary())
Example Explained:
Import the library statsmodels.formula.api as smf. Statsmodels is a statistical library in Python.
Use the full_health_data set.
Create a model based on Ordinary Least Squares with smf.ols(). Notice that the dependent variable (Calorie_Burnage) is written first in the formula, before the ~.
By calling .fit(), you obtain the variable results. This holds a lot of information about the regression model.
Call summary() to get the table with the results of the linear regression.
Data Science - Regression Table - Info
The "Information Part" in Regression Table
Dep. Variable: is short for "Dependent Variable". Calorie_Burnage is here the dependent variable. The dependent variable is here assumed to be explained by Average_Pulse.
Model: OLS is short for Ordinary Least Squares. This is a type of model that uses the Least Square method.
Date: and Time: show the date and time the output was calculated in Python.
Data Science - Regression Table - Coefficients
Coef is short for coefficient. It is the output of the linear regression function. The linear regression function can be rewritten mathematically as:
Calorie_Burnage = 0.3296 * Average_Pulse + 346.8662
These numbers mean:
• If Average_Pulse increases by 1, Calorie_Burnage increases by 0.3296 (or 0.33 rounded)
• If Average_Pulse = 0, Calorie_Burnage is equal to 346.8662 (or 346.9 rounded)
Remember that the intercept is used to adjust the model's precision of predicting!
Do you think that this is a good model?
Data Science - Regression Table - Coefficients
Define the Linear Regression Function in Python
Define the linear regression function in Python to perform predictions.
What is Calorie_Burnage if Average_Pulse is: 120, 130, 150, 180?
Example
def Predict_Calorie_Burnage(Average_Pulse):
  return(0.3296*Average_Pulse + 346.8662)
print(Predict_Calorie_Burnage(120))
print(Predict_Calorie_Burnage(130))
print(Predict_Calorie_Burnage(150))
print(Predict_Calorie_Burnage(180))
O/P:
386.4182
389.7142
396.3062
406.1942
Data Science - Regression Table: P-Value
The "Statistics of the Coefficients Part" in Regression Table:
Now, we want to test if the coefficients from the linear regression function have a significant impact on the dependent variable (Calorie_Burnage).
This means that we want to prove that a relationship exists between Average_Pulse and Calorie_Burnage, using statistical tests.
Data Science - Regression Table: P-Value
The "Statistics of the Coefficients Part" in Regression Table:
The P-value
The P-value is a statistical number to conclude if there is a relationship between Average_Pulse and Calorie_Burnage.
We test if the true value of the coefficient is equal to zero (no relationship). The statistical test for this is called hypothesis testing.
A low P-value (< 0.05) means that the coefficient is likely not to equal zero.
A high P-value (> 0.05) means that we cannot conclude that the explanatory variable affects the dependent variable (here: if Average_Pulse affects Calorie_Burnage).
A high P-value is also called an insignificant P-value.
Data Science - Regression Table: P-Value
The "Statistics of the Coefficients Part" in Regression Table:
Hypothesis Testing
Hypothesis testing is a statistical procedure to test if your results are valid.
In our example, we are testing if the true coefficient of Average_Pulse and the
intercept is equal to zero.
Hypothesis test has two statements. The null hypothesis and the alternative
hypothesis.
The null hypothesis can be written in short as H0
The alternative hypothesis can be written in short as HA
Mathematically written:
H0: Average_Pulse = 0
HA: Average_Pulse ≠ 0
H0: Intercept = 0
HA: Intercept ≠ 0
The sign ≠ means "not equal to"
Data Science - Regression Table: P-Value
The "Statistics of the Coefficients Part" in Regression Table:
Hypothesis Testing and P-value
The null hypothesis can either be rejected or not.
If we reject the null hypothesis, we conclude that a relationship exists between Average_Pulse and Calorie_Burnage. The P-value is used for this conclusion.
A common threshold of the P-value is 0.05.
Note: A P-value of 0.05 means that 5% of the time, we will falsely reject the null hypothesis. It means that we accept that 5% of the time, we might falsely have concluded a relationship.
If the P-value is lower than 0.05, we can reject the null hypothesis and conclude that a relationship exists between the variables.
However, the P-value of Average_Pulse is 0.824. So, we cannot conclude a relationship between Average_Pulse and Calorie_Burnage.
A P-value of 0.824 means that, if the true coefficient of Average_Pulse were zero, we would expect to see a result at least this extreme 82.4% of the time.
The intercept is used to adjust the regression function's ability to predict more precisely. It is therefore uncommon to interpret the P-value of the intercept.
Data Science - Regression Table: R-Squared
R - Squared
R-Squared and Adjusted R-Squared describe how well the linear regression model fits the data points:
The value of R-Squared is always
between 0 to 1 (0% to 100%).
A high R-Squared value means that
many data points are close to the
linear regression function line.
A low R-Squared value means that
the linear regression function line
does not fit the data well.
Data Science - Regression Table: R-Squared
Visual Example of a Low R - Squared Value (0.00)
Our regression model shows an R-Squared value of zero, which means that the linear regression function line does not fit the data well.
This can be visualized when we plot the linear regression function through the data points of
Average_Pulse and Calorie_Burnage.
Data Science - Regression Table: R-Squared
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

full_health_data = pd.read_csv("data.csv", header=0, sep=",")

x = full_health_data["Duration"]
y = full_health_data["Calorie_Burnage"]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
  return slope * x + intercept

mymodel = list(map(myfunc, x))
print(mymodel)

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.ylim(ymin=0, ymax=2000)
plt.xlim(xmin=0, xmax=200)
plt.xlabel("Duration")
plt.ylabel("Calorie_Burnage")
plt.show()
Data Science - Regression Table: R-Squared
Visual Example of a High R - Squared Value (0.79)
However, if we plot Duration and Calorie_Burnage, the R-Squared increases. Here, we see that the
data points are close to the linear regression function line:
Summary - Predicting Calorie_Burnage with
Average_Pulse
How can we summarize the linear regression function with
Average_Pulse as explanatory variable?
Coefficient of 0.3296, which means that Average_Pulse has
a very small effect on Calorie_Burnage.
High P-value (0.824), which means that we cannot conclude
a relationship between Average_Pulse and
Calorie_Burnage.
R-Squared value of 0, which means that the linear
regression function line does not fit the data well.
Data Science - Linear Regression Case
Case: Use Duration + Average_Pulse to Predict Calorie_Burnage
Create a Linear Regression Table with Average_Pulse and Duration as Explanatory Variables:
import pandas as pd
import statsmodels.formula.api as smf

full_health_data = pd.read_csv("data.csv", header=0, sep=",")

model = smf.ols('Calorie_Burnage ~ Average_Pulse + Duration', data = full_health_data)
results = model.fit()
print(results.summary())
Data Science - Linear Regression Case
O/P: (a regression summary table showing the coefficients, P-values and R-squared discussed below)
Data Science - Linear Regression Case
Case: Use Duration + Average_Pulse to Predict Calorie_Burnage
Create a Linear Regression Table with Average_Pulse and Duration as Explanatory Variables:
Example Explained:
• Import the library statsmodels.formula.api as smf. Statsmodels is a statistical library in Python.
• Use the full_health_data set.
• Create a model based on Ordinary Least Squares with smf.ols(). Notice that the dependent variable (Calorie_Burnage) is written first in the formula, before the ~.
• By calling .fit(), you obtain the variable results. This holds a lot of information about the regression model.
• Call summary() to get the table with the results of the linear regression.
Data Science - Linear Regression Case
Case: Use Duration + Average_Pulse to Predict Calorie_Burnage
Create a Linear Regression Table with Average_Pulse and Duration as Explanatory Variables:
The linear regression function can be rewritten mathematically as:
Calorie_Burnage = Average_Pulse * 3.1695 + Duration * 5.8434 - 334.5194
Rounded to two decimals:
Calorie_Burnage = Average_Pulse * 3.17 + Duration * 5.84 - 334.52
Data Science - Linear Regression Case
Define the Linear Regression Function in Python
Define the linear regression function in Python to perform predictions.
What is Calorie_Burnage if:
• Average pulse is 110 and duration of the training session is 60 minutes?
• Average pulse is 140 and duration of the training session is 45 minutes?
• Average pulse is 175 and duration of the training session is 20 minutes?
def Predict_Calorie_Burnage(Average_Pulse, Duration):
  return(3.1695 * Average_Pulse + 5.8434 * Duration - 334.5194)
print(Predict_Calorie_Burnage(110,60))
print(Predict_Calorie_Burnage(140,45))
print(Predict_Calorie_Burnage(175,20))
O/P: 364.7296
372.1636
337.01110000000006
Data Science - Linear Regression Case
The Answers:
Average pulse is 110 and duration of the training session is 60 minutes = 365 Calories
Average pulse is 140 and duration of the training session is 45 minutes = 372 Calories
Average pulse is 175 and duration of the training session is 20 minutes = 337 Calories
Data Science - Linear Regression Case
Access the Coefficients
Look at the coefficients:
• Calorie_Burnage increases by 3.17 if Average_Pulse increases by one.
• Calorie_Burnage increases by 5.84 if Duration increases by one.
Access the P-Value
Look at the P-value for each coefficient.
• The P-value is 0.00 for Average_Pulse, Duration and the Intercept.
• The P-values are statistically significant for all of the variables, as they are less than 0.05.
So here we can conclude that Average_Pulse and Duration have a relationship with Calorie_Burnage.
Data Science - Linear Regression Case
Adjusted R-Squared
There is a problem with R-squared if we have more than one explanatory variable.
R-squared will almost always increase if we add more variables, and will never decrease.
This is because we are adding more data points around the linear regression function.
If we add random variables that do not affect Calorie_Burnage, we risk falsely concluding that the linear regression function is a good fit. Adjusted R-squared adjusts for this problem.
It is therefore better to look at the adjusted R-squared value if we have more than one
explanatory variable.
The Adjusted R-squared is 0.814.
The value of R-Squared is always between 0 to 1 (0% to 100%).
A high R-Squared value means that many data points are close to the linear regression function
line.
A low R-Squared value means that the linear regression function line does not fit the data well.
Conclusion: The model fits the data points well!
Congratulations! You have now finished the final module of the data science library.
Machine Learning - Intro
• Machine Learning is making the computer learn from studying data and statistics.
• Machine Learning is a step into the direction of artificial intelligence (AI).
• Machine Learning is a program that analyses data and learns to predict the outcome.
Data Science – Intro
Data Set
In the mind of a computer, a data set is any collection of data. It can be anything from an array to a
complete database.
Example of an array:
[99,86,87,88,111,86,103,87,94,78,77,85,86]
Data Types
To analyze data, it is important to know what type of data we are dealing with.
We can split the data types into three main categories:
• Numerical
• Categorical
• Ordinal
Data Science – Intro
Numerical data are numbers, and can be split into two numerical categories:
• Discrete Data
- numbers that are limited to integers. Example: The number of cars passing by.
• Continuous Data
- numbers that can take any value within a range. Example: The price of an item, or the size of an item.
Categorical data are values that cannot be measured up against each other. Example: a color value, or any
yes/no values.
Ordinal data are like categorical data, but can be measured up against each other. Example: school grades
where A is better than B and so on.
By knowing the data type of your data source, you will be able to know what technique to use when analyzing
them.
Machine Learning - Mean Median Mode
Mean, Median, and Mode
What can we learn from looking at a group of numbers?
In Machine Learning (and in mathematics) there are often three values that interest us:
Mean - The average value
Median - The mid point value
Mode - The most common value
Example: We have registered the speed of 13 cars:
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
Mean: (99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77
Median: order the values and pick the middle one:
77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111 → the middle value is 87
If there is an even number of values, the median is the mean of the two middle values:
77, 78, 85, 86, 86, 86, 87, 87, 94, 98, 99, 103 → (86+87)/2 = 86.5
Mode: the value that appears the most number of times:
99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86 → 86 appears three times, so the mode is 86
Machine Learning - Mean Median Mode
Mean, Median, and Mode
What can we learn from looking at a group of numbers?
In Machine Learning (and in mathematics) there are often three values that interest us:
Mean - The average value
Median - The mid point value
Mode - The most common value
import numpy
from scipy import stats

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = numpy.mean(speed)    # Mean
print(x)
x = numpy.median(speed)  # Median
print(x)
x = stats.mode(speed)    # Mode
print(x)
Machine Learning - Standard Deviation
What is Standard Deviation?
Standard deviation is a number that describes how spread out the values are.
A low standard deviation means that most of the numbers are close to the mean (average) value.
A high standard deviation means that the values are spread out over a wider range.
Example: This time we have registered the speed of 7 cars:
speed = [86,87,88,86,87,85,86]
The standard deviation is:
0.9
Meaning that most of the values are within the range of 0.9 from the mean value, which is 86.4.
Let us do the same with a selection of numbers with a wider range:
speed = [32,111,138,28,59,77,97]
The standard deviation is:
37.85
Machine Learning - Standard Deviation
Variance
Variance is another number that indicates how spread out the values are.
In fact, if you take the square root of the variance, you get the standard deviation!
Or the other way around, if you multiply the standard deviation by itself, you get the variance!
To calculate the variance you have to do as follows:
1. Find the mean:
(32+111+138+28+59+77+97) / 7 = 77.4
2. For each value: find the difference from the mean:
32 - 77.4 = -45.4
111 - 77.4 = 33.6
138 - 77.4 = 60.6
28 - 77.4 = -49.4
59 - 77.4 = -18.4
77 - 77.4 = -0.4
97 - 77.4 = 19.6
3. For each difference: find the square value:
(-45.4)² = 2061.16
(33.6)² = 1128.96
(60.6)² = 3672.36
(-49.4)² = 2440.36
(-18.4)² = 338.56
(-0.4)² = 0.16
(19.6)² = 384.16
4. The variance is the average of these squared differences:
(2061.16+1128.96+3672.36+2440.36+338.56+0.16+384.16) / 7 = 1432.2
Machine Learning - Standard Deviation
Variance
Luckily, NumPy has a method to calculate the variance:
Example
Use the NumPy var() method to find the variance:
import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.var(speed)
print(x)
o/p: 1432.2448979591834
Machine Learning - Standard Deviation
Standard Deviation
As we have learned, the formula to find the standard deviation is the square root of the
variance:
√1432.25 = 37.85
Or, as in the example from before, use the NumPy to calculate the standard deviation:
Example
Use the NumPy std() method to find the standard deviation:
import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.std(speed)
print(x)
O/P: 37.84501153334721
Machine Learning - Percentiles
What are Percentiles?
Percentiles are used in statistics to give you a number that describes the value that a given
percent of the values are lower than.
Example: Let's say we have an array of the ages of all the people that live on a street:
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
What is the 75th percentile? The answer is 43, meaning that 75% of the people are 43 or younger.
The NumPy module has a method for finding the specified percentile:
Example
What is the age that 90% of the people are younger than?
import numpy
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
x = numpy.percentile(ages, 90)
print(x)
O/P: 61.0
Machine Learning - Data Distribution
Data Distribution
In the real world, the data sets are much bigger, but it can be difficult to gather real world data,
at least at an early stage of a project.
How Can we Get Big Data Sets?
To create big data sets for testing, we use the Python module NumPy, which comes with a
number of methods to create random data sets, of any size.
Example
Create an array containing 250 random floats between 0 and 5:
import numpy
x = numpy.random.uniform(0.0, 5.0, 250)
print(x)
O/P: an array of 250 random floats between 0 and 5 (the exact values vary on each run)
Machine Learning - Data Distribution
Histogram Explained
We use the array from the example above to draw a histogram with 5 bars.
The first bar represents how many values in the array are between 0 and 1.
The second bar represents how many values are between 1 and 2. And so on.
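A minimal sketch of the code that draws this histogram, reusing the uniform array from the previous example:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 250)
plt.hist(x, 5)  # 5 bars
plt.show()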
Which gives us this result:
52 values are between 0 and 1
48 values are between 1 and 2
49 values are between 2 and 3
51 values are between 3 and 4
50 values are between 4 and 5
Note: The array values are random
numbers and will not show the
exact same result on your computer.
Machine Learning - Normal Data Distribution
Normal Data Distribution
In the previous chapter we learned how to create a completely random array, of a given size,
and between two given values.
Here we will learn how to create an array where the values are concentrated around a given
value. In probability theory this kind of data distribution is known as the normal data
distribution, or the Gaussian data distribution, after the mathematician Carl Friedrich Gauss who
came up with the formula of this data distribution.
Example
A typical normal data distribution:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 100000)
plt.hist(x, 100)
plt.show()
O/P: a bell-shaped histogram with the values concentrated around 5.0
Machine Learning - Scatter Plot
Scatter Plot
A scatter plot is a diagram where each value in the data set is represented by a dot.
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
• The x array represents the age of each car.
• The y array represents the speed of each car.
Example
Use the scatter() method to draw a scatter plot diagram:
import matplotlib.pyplot as plt
plt.scatter(x, y)
plt.show()
Machine Learning - Scatter Plot
Scatter Plot Explained
The x-axis represents ages, and the y-axis
represents speeds.
What we can read from the diagram is that the two
fastest cars were both 2 years old, and the slowest
car was 12 years old.
Note: It seems that the newer the car, the faster it
drives, but that could be a coincidence, after all we
only registered 13 cars.
Machine Learning - Scatter Plot
Random Data Distributions
In Machine Learning the data sets
can contain thousands, or even
millions, of values.
Let us create two arrays that are
both filled with 1000 random
numbers from a normal data
distribution.
The first array will have the mean
set to 5.0 with a standard deviation
of 1.0.
The second array will have the mean
set to 10.0 with a standard deviation
of 2.0:
Example
A scatter plot with 1000 dots:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 1000)
y = numpy.random.normal(10.0, 2.0, 1000)
plt.scatter(x, y)
plt.show()
Machine Learning - Linear Regression
Regression
The term regression is used when you try to
find the relationship between variables.
In Machine Learning, and in statistical
modeling, that relationship is used to
predict the outcome of future events.
Linear Regression
Linear regression uses the relationship
between the data points to draw a straight
line through all of them.
This line can be used to predict future
values.
In Machine Learning, predicting the future is
very important.
Machine Learning - Linear Regression
How Does it Work?
Python has methods for finding a relationship
between data points and drawing a line of linear
regression. We will show you how to use these
methods instead of going through the
mathematical formula.
In the example below, the x-axis represents age,
and the y-axis represents speed. We have
registered the age and speed of 13 cars as they
were passing a tollbooth. Let us see if the data
we collected could be used in a linear regression:
Example
Start by drawing a scatter plot:
import matplotlib.pyplot as plt
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()
Machine Learning - Linear Regression
Example
Import scipy and draw the line of Linear Regression:
import matplotlib.pyplot as plt
from scipy import stats
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
return slope * x + intercept
mymodel = list(map(myfunc, x))
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
Machine Learning - Linear Regression
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
Execute a method that returns some important key values of Linear Regression:
slope, intercept, r, p, std_err = stats.linregress(x, y)
Create a function that uses the slope and intercept values to return a new value. This new
value represents where on the y-axis the corresponding x value will be placed:
def myfunc(x):
return slope * x + intercept
Run each value of the x array through the function. This will result in a new array with new
values for the y-axis:
mymodel = list(map(myfunc, x))
Draw the original scatter plot:
plt.scatter(x, y)
Draw the line of linear regression:
plt.plot(x, mymodel)
Display the diagram:
plt.show()
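We can also use the fitted model to predict new values. For example, the predicted speed of a 10 year old car (continuing the example above):
speed = myfunc(10)
print(speed)  # about 85.59 with this data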
Machine Learning - Polynomial Regression
Polynomial Regression
If your data points clearly will not fit a linear regression (a straight line through all data points), it
might be ideal for polynomial regression.
Polynomial regression, like linear regression, uses the relationship between the variables x and y to
find the best way to draw a line through the data points.
How Does it Work?
Python has methods for finding a relationship between data points and drawing a line of polynomial
regression. We will show you how to use these methods instead of going through the mathematical
formula.
In the example below, we have registered 18 cars as they were passing a certain tollbooth.
We have registered the car's speed, and the time of day (hour) the passing occurred.
The x-axis represents the hours of the day and the y-axis represents the speed:
Machine Learning - Polynomial Regression
Polynomial Regression
Example
Start by drawing a scatter plot:
import matplotlib.pyplot as plt
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
plt.scatter(x, y)
plt.show()
Machine Learning - Polynomial Regression
Polynomial Regression
Example
Import numpy and matplotlib then draw the line of Polynomial Regression:
import numpy
import matplotlib.pyplot as plt
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
myline = numpy.linspace(1, 22, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
Machine Learning - Polynomial Regression
Polynomial Regression
Import the modules you need.
import numpy
import matplotlib.pyplot as plt
Create the arrays that represent the values of the x and y
axis:
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
NumPy has a method that lets us make a polynomial
model:
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
Then specify how the line will display: we start at position
1 and end at position 22:
myline = numpy.linspace(1, 22, 100)
Draw the original scatter plot:
plt.scatter(x, y)
Draw the line of polynomial
regression:
plt.plot(myline, mymodel(myline))
Display the diagram:
plt.show()
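How well does the line fit the data? A short sketch using sklearn's r2_score (the same metric used in the Train/Test chapter later); with this data and a degree-3 polynomial it comes out around 0.94:
import numpy
from sklearn.metrics import r2_score
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
print(r2_score(y, mymodel(x)))  # roughly 0.94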
Machine Learning - Multiple Regression
Multiple Regression
Multiple regression is like linear regression, but with more than one independent value, meaning
that we try to predict a value based on two or more variables.
Take a look at the data set below, it contains some information about cars.
We can predict the CO2 emission of a car based on the size of the engine, but with multiple
regression we can throw in more variables, like the weight of the car, to make the prediction more
accurate.
Machine Learning - Multiple Regression
Multiple Regression
How Does it Work?
In Python we have modules that will do the work for us. Start by importing the Pandas module.
import pandas
The Pandas module allows us to read csv files and return a DataFrame object.
The file is meant for testing purposes only, you can download it here: cars.csv
df = pandas.read_csv("cars.csv")
Then make a list of the independent values and call this variable X.
Put the dependent values in a variable called y.
X = df[['Weight', 'Volume']]
y = df['CO2']
Tip: It is common to name the list of independent values with an upper case X, and the list of dependent
values with a lower case y.
Machine Learning - Multiple Regression
Multiple Regression
We will use some methods from the sklearn module, so we will have to import that module as well:
from sklearn import linear_model
From the sklearn module we will use the LinearRegression() method to create a linear regression
object.
This object has a method called fit() that takes the independent and dependent values as
parameters and fills the regression object with data that describes the relationship:
regr = linear_model.LinearRegression()
regr.fit(X, y)
Now we have a regression object that is ready to predict CO2 values based on a car's weight and
volume:
#predict the CO2 emission of a car where the weight is 2300kg, and the volume is 1300cm3:
predictedCO2 = regr.predict([[2300, 1300]])
Machine Learning - Multiple Regression
Multiple Regression
Example
See the whole example in action:
import pandas
from sklearn import linear_model
df = pandas.read_csv("cars.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']
regr = linear_model.LinearRegression()
regr.fit(X, y)
#predict the CO2 emission of a car where the weight is 2300kg, and the volume is 1300cm3:
predictedCO2 = regr.predict([[2300, 1300]])
print(predictedCO2)
O/P: [107.2087328]
Machine Learning - Multiple Regression
Coefficient
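To see these coefficient values, print the coef_ attribute of the fitted regression object (continuing the example above):
print(regr.coef_)
O/P: [0.00755095 0.00780526]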
Result Explained
The result array represents the coefficient values of weight and volume.
Weight: 0.00755095
Volume: 0.00780526
These values tell us that if the weight increase by 1kg, the CO2 emission increases by 0.00755095g.
And if the engine size (Volume) increases by 1 cm3, the CO2 emission increases by 0.00780526 g.
I think that is a fair guess, but let us test it!
We have already predicted that if a car with a 1300cm3 engine weighs 2300kg, the CO2 emission will
be approximately 107g.
What if we increase the weight with 1000kg?
Machine Learning - Multiple Regression
Coefficient
Example
Copy the example from before, but change the weight from 2300 to 3300:
import pandas
from sklearn import linear_model
df = pandas.read_csv("cars.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']
regr = linear_model.LinearRegression()
regr.fit(X, y)
predictedCO2 = regr.predict([[3300, 1300]])
print(predictedCO2)
Result:
[114.75968007]
We have predicted that a car with a 1.3 liter engine, and a weight of 3300 kg, will release approximately 115 grams of CO2 for every kilometer it drives.
Which shows that the coefficient of 0.00755095 is correct:
107.2087328 + (1000 * 0.00755095) = 114.75968
Machine Learning - Train/Test
Evaluate Your Model
In Machine Learning we create models to predict the outcome of certain events, like in the previous
chapter where we predicted the CO2 emission of a car when we knew the weight and engine size.
To measure if the model is good enough, we can use a method called Train/Test.
What is Train/Test?
o Train/Test is a method to measure the accuracy of your model.
o It is called Train/Test because you split the data set into two sets: a training set and a testing
set.
o 80% for training, and 20% for testing.
o You train the model using the training set.
o You test the model using the testing set.
o Train the model means create the model.
o Test the model means test the accuracy of the model.
Machine Learning - Train/Test
Start With a Data Set
Start with a data set you want to test.
Our data set illustrates 100 customers in a shop, and their shopping habits.
Example
import numpy
import matplotlib.pyplot as plt
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
plt.scatter(x, y)
plt.show()
Result:
The x axis represents the number of minutes before
making a purchase.
The y axis represents the amount of money spent
on the purchase.
Machine Learning - Train/Test
Split Into Train/Test
The training set should be a random selection of 80% of the original data.
The testing set should be the remaining 20%.
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
Display the Training Set
Display the same scatter plot with the training set:
Example
plt.scatter(train_x, train_y)
plt.show()
Result:
The training set looks like the original data set:
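Display the Testing Set
To be thorough, display the same scatter plot with the testing set:
plt.scatter(test_x, test_y)
plt.show()
The testing set also looks like the original data set.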
Machine Learning - Train/Test
Fit the Data Set
What does the data set look like? In my opinion, the best fit would be a polynomial regression,
so let us draw a line of polynomial regression.
To draw a line through the data points, we use the plot() method of the matplotlib module:
Example
Draw a polynomial regression line through the
data points:
import numpy
import matplotlib.pyplot as plt
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))
myline = numpy.linspace(0, 6, 100)
plt.scatter(train_x, train_y)
plt.plot(myline, mymodel(myline))
plt.show()
Machine Learning - Train/Test
The result backs up my suggestion of the data set fitting a
polynomial regression, even though it would give us some
weird results if we try to predict values outside of the data
set. Example: the line indicates that a customer spending 6
minutes in the shop would make a purchase worth 200. That
is probably a sign of overfitting.
But what about the R-squared score? The R-squared score is
a good indicator of how well my data set is fitting the
model.
R2
Remember R2, also known as R-squared?
It measures the relationship between the x axis and the y axis, and the value ranges from 0 to
1, where 0 means no relationship, and 1 means totally related.
The sklearn module has a method called r2_score() that will help us find this relationship.
In this case we would like to measure the relationship between the minutes a customer stays
in the shop and how much money they spend.
Machine Learning - Train/Test
R2
Example
How well does my training data fit in a
polynomial regression?
import numpy
from sklearn.metrics import r2_score
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))
r2 = r2_score(train_y, mymodel(train_x))
print(r2)
O/P: 0.7988645544629795
Note: The result 0.799 shows that there is an OK
relationship.
Machine Learning - Train/Test
Bring in the Testing Set
Now we have made a model that is OK, at least when it comes to training data.
Now we want to test the model with the testing data as well, to see if it gives us the same result.
import numpy
from sklearn.metrics import r2_score
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))
r2 = r2_score(test_y, mymodel(test_x))
print(r2)
Note: The result 0.809 shows that the model fits
the testing set as well, and we are confident that
we can use the model to predict future values.
Machine Learning - Train/Test
Predict Values
Now that we have established that our
model is OK, we can start predicting
new values.
Example
How much money will a buying
customer spend, if she or he stays in
the shop for 5 minutes?
print(mymodel(5))
The example predicted the customer
to spend 22.88 dollars, which seems to
correspond to the diagram:
Machine Learning - Decision Tree
Decision Tree
In this chapter we will show you how to make a
"Decision Tree". A Decision Tree is a Flow Chart,
and can help you make decisions based on
previous experience.
In the example, a person will try to decide if
he/she should go to a comedy show or not.
Luckily our example person has registered
every time there was a comedy show in town,
and registered some information about the
comedian, and also registered if he/she went or
not.
Machine Learning - Decision Tree
Now, based on this data set, Python can create a decision tree that can be used to
decide if any new shows are worth attending.
Machine Learning - Decision Tree
How Does it Work?
First, import the modules you need, and read the dataset with pandas:
Example
Read and print the data set:
import pandas
from sklearn import tree
import pydotplus
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import matplotlib.image as pltimg
df = pandas.read_csv("shows.csv")
print(df)
o To make a decision tree, all data has to be
numerical.
o We have to convert the non numerical
columns 'Nationality' and 'Go' into
numerical values.
o Pandas has a map() method that takes a
dictionary with information on how to
convert the values.
o {'UK': 0, 'USA': 1, 'N': 2}
o Means convert the values 'UK' to 0, 'USA' to
1, and 'N' to 2.
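Applied to the DataFrame, the conversion looks like this (the same mapping appears in the full example later in this chapter):
d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)
d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)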
Machine Learning - Decision Tree
Then we have to separate the feature columns
from the target column.
The feature columns are the columns that we
try to predict from, and the target column is
the column with the values we try to predict.
Example
X is the feature columns, y is the target
column:
features = ['Age', 'Experience', 'Rank', 'Nationality']
X = df[features]
y = df['Go']
print(X)
print(y)
Now we can create the actual decision tree, fit it
with our details, and save a .png file on the
computer.
Example
Create a Decision Tree, save it as an image, and
show the image:
dtree = DecisionTreeClassifier()
dtree = dtree.fit(X, y)
data = tree.export_graphviz(dtree, out_file=None, feature_names=features)
graph = pydotplus.graph_from_dot_data(data)
graph.write_png('mydecisiontree.png')
img = pltimg.imread('mydecisiontree.png')
imgplot = plt.imshow(img)
plt.show()
Machine Learning - Decision Tree
Result Explained
The decision tree uses your earlier
decisions to calculate the odds of you
wanting to go see a comedian or not.
Let us read the different aspects of the
decision tree:
Rank
Rank <= 6.5 means that every comedian with a rank of
6.5 or lower will follow the True arrow (to the left), and
the rest will follow the False arrow (to the right).
gini = 0.497 refers to the quality of the split, and is
always a number between 0.0 and 0.5, where 0.0 would
mean all of the samples got the same result, and 0.5
would mean that the split is done exactly in the
middle.
samples = 13 means that there are 13 comedians left
at this point in the decision, which is all of them since
this is the first step.
value = [6, 7] means that of these 13 comedians, 6 will
get a "NO", and 7 will get a "GO".
Machine Learning - Decision Tree
Gini
There are many ways to split the
samples, we use the GINI method in this
tutorial.
The Gini method uses this formula:
Gini = 1 - (x/n)² - (y/n)²
Where x is the number of positive
answers("GO"), n is the number of
samples, and y is the number of
negative answers ("NO"), which gives us
this calculation:
1 - (7/13)² - (6/13)² = 0.497
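A quick sanity check of this arithmetic in Python:
x, y, n = 7, 6, 13  # "GO" answers, "NO" answers, samples
print(round(1 - (x/n)**2 - (y/n)**2, 3))  # 0.497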
The next step contains two boxes,
one box for the comedians with a
'Rank' of 6.5 or lower, and one
box with the rest.
Machine Learning - Decision Tree
True - 5 Comedians End Here:
gini = 0.0 means all of the samples got the same result.
samples = 5 means that there are 5 comedians left in this branch (5 comedians with a Rank of 6.5 or
lower).
value = [5, 0] means that 5 will get a "NO" and 0 will get a "GO".
False - 8 Comedians Continue:
Nationality
Nationality <= 0.5 means that the comedians with a nationality value of less than 0.5 will follow the
arrow to the left (which means everyone from the UK), and the rest will follow the arrow to the
right.
gini = 0.219 means that about 22% of the samples would go in one direction.
samples = 8 means that there are 8 comedians left in this branch (8 comedians with a Rank higher
than 6.5).
value = [1, 7] means that of these 8 comedians, 1 will get a "NO" and 7 will get a "GO".
Machine Learning - Decision Tree
True - 4 Comedians Continue:
Age
Age <= 35.5 means that comedians at the age of 35.5 or younger will
follow the arrow to the left, and the rest will follow the arrow to the
right.
gini = 0.375 means that about 37.5% of the samples would go in one
direction.
samples = 4 means that there are 4 comedians left in this branch (4
comedians from the UK).
value = [1, 3] means that of these 4 comedians, 1 will get a "NO" and 3
will get a "GO".
False - 4 Comedians End Here:
gini = 0.0 means all of the samples got the same result.
samples = 4 means that there are 4 comedians left in this branch (4
comedians not from the UK).
value = [0, 4] means that of these 4 comedians, 0 will get a "NO" and 4
will get a "GO".
Machine Learning - Decision Tree
True - 2 Comedians End Here:
gini = 0.0 means all of the samples got the same result.
samples = 2 means that there are 2 comedians left in this branch (2
comedians at the age 35.5 or younger).
value = [0, 2] means that of these 2 comedians, 0 will get a "NO" and 2 will
get a "GO".
False - 2 Comedians Continue:
Experience
Experience <= 9.5 means that comedians with 9.5 years of experience, or
less, will follow the arrow to the left, and the rest will follow the arrow to
the right.
gini = 0.5 means that 50% of the samples would go in one direction.
samples = 2 means that there are 2 comedians left in this branch (2
comedians older than 35.5).
value = [1, 1] means that of these 2 comedians, 1 will get a "NO" and 1 will
get a "GO".
Machine Learning - Decision Tree
True - 1 Comedian Ends Here:
gini = 0.0 means all of the samples got the same result.
samples = 1 means that there is 1 comedian left in this branch (1 comedian with 9.5 years of experience
or less).
value = [0, 1] means that 0 will get a "NO" and 1 will get a "GO".
False - 1 Comedian Ends Here:
gini = 0.0 means all of the samples got the same result.
samples = 1 means that there is 1 comedian left in this branch (1 comedian with more than 9.5 years of
experience).
value = [1, 0] means that 1 will get a "NO" and 0 will get a "GO".
Machine Learning - Decision Tree
Predict Values:
We can use the Decision Tree to predict new values.
Example: Should I go see a show starring a 40 years old American comedian, with 10 years of
experience, and a comedy ranking of 7?
import pandas
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
df = pandas.read_csv("shows.csv")
d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)
d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)
features = ['Age', 'Experience', 'Rank',
'Nationality']
X = df[features]
y = df['Go']
dtree = DecisionTreeClassifier()
dtree = dtree.fit(X, y)
print(dtree.predict([[40, 10, 7, 1]]))
print("[1] means 'GO'")
print("[0] means 'NO'")
Machine Learning - Decision Tree
What would the answer be if the comedy rank was 6?
import pandas
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
df = pandas.read_csv("shows.csv")
d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)
d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)
features = ['Age', 'Experience', 'Rank', 'Nationality']
X = df[features]
y = df['Go']
dtree = DecisionTreeClassifier()
dtree = dtree.fit(X, y)
print(dtree.predict([[40, 10, 6, 1]]))
print("[1] means 'GO'")
print("[0] means 'NO'")
Machine Learning - Decision Tree
Different Results
You will see that the Decision Tree gives you different results if you run it enough times, even if
you feed it with the same data.
That is because the Decision Tree does not give us a 100% certain answer. It is based on the
probability of an outcome, and the answer will vary.
Python List VS Array VS Tuple - List
• The following are the main
characteristics of a List:
• The list is an ordered collection of
data types.
• The list is mutable.
• Lists are dynamic and can contain
objects of different data types.
• List elements can be accessed by
index number.
# Python program to demonstrate
List
list = ["mango", "strawberry", "orange", "apple", "banana"]
print(list)
# we can specify the range of the
# index by specifying where to start
# and where to end
print(list[2:4])
# we can also change the item in the
# list by using its index number
list[1] = "grapes"
print(list[1])
Output :
['mango', 'strawberry', 'orange', 'apple',
'banana']
['orange', 'apple']
grapes
Python List VS Array VS Tuple - Array
• An array is an ordered collection
of the similar data types.
• An array is mutable.
• An array can be accessed by using
its index number.
# Python program to demonstrate
# creation of an array
# importing "array" for array creation
import array as arr
# creating an array with integer type
a = arr.array('i', [1, 2, 3])
# printing the original array
print("The new created array is : ", end=" ")
for i in range(0, 3):
    print(a[i], end=" ")
print()
# creating an array with float type
b = arr.array('d', [2.5, 3.2, 3.3])
# printing the original array
print("The new created array is : ", end=" ")
for i in range(0, 3):
    print(b[i], end=" ")
O/P :
The new created array is : 1 2 3
The new created array is : 2.5 3.2 3.3
Python List VS Array VS Tuple - Tuple
• Tuples are immutable and can
store any data type.
• A tuple is defined using ().
• It cannot be changed or replaced
since it is an immutable data type.
tuple = ("orange","apple","banana")
print(tuple)
# we can access the items in
# the tuple by its index number
print(tuple[2])
#we can specify the range of the
# index by specifying where to start
# and where to end
print(tuple[0:2])
Output :
('orange', 'apple', 'banana')
banana
('orange', 'apple')
Python Set
• Sets are used to store multiple items in a single variable.
• Set items are unordered, unchangeable, and do not allow duplicate values.
• Sets cannot have two items with the same value.
Days1 = {"Monday","Tuesday","Wednesday","Thursday","Sunday"}
Days2 = {"Friday","Saturday","Sunday"}
print(Days1|Days2) #printing the union of the sets
print(Days1.union(Days2)) #printing the union of the sets
print(Days1&Days2) #prints the intersection of the two sets
print(Days1.intersection(Days2)) #prints the intersection of the two sets
Days3 = Days1.intersection(Days2)
print(Days3)
Python Pandas Cheat Sheet
Pandas is simple and expressive, and arguably one of the most important libraries in Python: not only
does it make real-world data analysis significantly easier, it is also significantly fast.
Import Convention:
We need to import the
library before we get started.
import pandas as pd
Pandas Data Structure:
We have two types of data structures
in Pandas, Series and DataFrame.
Series
Series is a one-dimensional labeled array
that can hold any data type.
DataFrame
DataFrame is a two-dimensional,
potentially heterogeneous tabular data
structure.
Or we can say Series is the data
structure for a single column of
a DataFrame
Now let us see some examples of Series
and DataFrames for better
understanding.
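A minimal sketch of both structures (the data here is made up for illustration):
import pandas as pd
s = pd.Series([10, 20, 30], index=["a", "b", "c"])  # one-dimensional labeled array
df = pd.DataFrame({"Name": ["Asha", "Ben"], "Score": [91, 84]})  # two-dimensional table
print(s)
print(df)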
Python Pandas Cheat Sheet
Reader and Writer Functions:
The Pandas library offers a set of reader functions that can be applied to a wide range of file
formats and return a Pandas object. Here is a list of reader functions:
pd.read_csv("filename")
pd.read_table("filename")
pd.read_excel("filename")
pd.read_sql(query, connection_object)
pd.read_json(json_string)
Similarly, there is a list of write operations which are useful while writing data into a file:
df.to_csv("filename")
df.to_excel("filename")
df.to_sql(table_name, connection_object)
df.to_json("filename")
Python Pandas Cheat Sheet
Operations:
View DataFrame contents:
df.head(n) – look at the first n rows of the DataFrame.
df.tail(n) – look at the last n rows of the DataFrame.
df.shape – gives the number of rows and columns.
df.info() – information on the index, datatypes, and memory.
df.describe() – summary statistics for numerical columns.
Create Test/Fake Data:
The Pandas library allows us to create fake or test data in order to test our code segments. Check out
the examples given below.
pd.DataFrame(np.random.rand(4,3)) – 3 columns and 4 rows of random floats
pd.Series(new_series) – creates a Series from an iterable new_series
Python Pandas Cheat Sheet
Selecting:
Often we want to select and have a look at a chunk of data from our DataFrame. There are two
ways of achieving this: first, selecting by position, and second, selecting by label (a loc sketch
follows the iloc list below).
Selecting by position using iloc:
df.iloc[0] – Select first row of data frame
df.iloc[1] – Select second row of data frame
df.iloc[-1] – Select last row of data frame
df.iloc[:,0] – Select first column of data frame
df.iloc[:,1] – Select second column of data frame
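Selecting by label using loc (added for completeness; the row and column labels below are placeholders):
df.loc[row_label] – Select the row with that index label
df.loc[:, "column1"] – Select the column labeled "column1"
df.loc[row_label, "column1"] – Select a single value by row and column label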
Python Pandas Cheat Sheet
Sorting:
Another very simple yet useful feature offered by
Pandas is the sorting of DataFrame.
df.sort_index() -Sorts by labels along an axis
df.sort_values(column1) – Sorts values by column1 in
ascending order
df.sort_values(column2, ascending=False) – Sorts values by column2 in descending order
Python Pandas Cheat Sheet
Groupby:
Using groupby technique you can create a grouping of categories and then it can be
helpful while applying a function to the categories. This simple yet valuable technique is
used widely in data science.
df.groupby(column) – Returns a groupby object for values from one column
df.groupby([column1,column2]) – Returns a groupby object values from multiple columns
df.groupby(column1)[column2].mean() – Returns the mean of the values in column2,
grouped by the values in column1
df.groupby(column1)[column2].median() – Returns the median of the values in column2,
grouped by the values in column1
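A tiny sketch of groupby in action (the data is made up for illustration):
import pandas as pd
df = pd.DataFrame({"city": ["A", "A", "B"], "sales": [10, 20, 30]})
print(df.groupby("city")["sales"].mean())  # A: 15.0, B: 30.0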
Python Pandas Cheat Sheet
Functions:
There are some special methods available in Pandas which make our calculations easier. Let's
apply those methods to our Product_Review DataFrame:
Mean: df.mean() – mean of all columns
Median: df.median() – median of each column
Standard Deviation: df.std() – standard deviation of each column
Max: df.max() – highest value in each column
Min: df.min() – lowest value in each column
Count: df.count() – number of non-null values in each DataFrame column
Describe: df.describe() – summary statistics for numerical columns
Python Pandas Cheat Sheet
Plotting:
Data Visualization with Pandas is carried out in the following ways.
Histogram
Scatter Plot
Note: Call %matplotlib inline to set up plotting inside the Jupyter notebook.
Histogram: df.plot.hist()
Scatter Plot: df.plot.scatter(x='column1', y='column2')
Pandas Project
Creating a Pandas DataFrame:
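A minimal sketch of creating a DataFrame from a dictionary (the column names and values are placeholders):
import pandas as pd
data = {
    "product": ["laptop", "phone", "tablet"],
    "review_score": [4.5, 4.2, 3.9],
}
df = pd.DataFrame(data)
print(df)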
Data Science Project Ideas
Data Science continues to grow in popularity as a promising career path
for this era. It’s one of the most exciting and attractive options
available. Demand for Data Scientists is increasing in the market.
According to recent reports, demand will skyrocket in the future years,
increasing by many times. Data Science encompasses a wide range of
scientific methods, procedures, techniques, and information retrieval
systems to detect meaningful patterns in organized and unstructured
data. More opportunities emerge in the market as more industries
recognize the value of Data Science.
Gender and Age Detection Python Project
First introducing you with the terminologies used in this advanced python project of
gender and age detection –
What is Computer Vision?
Computer Vision is the field of study that enables computers to see and identify
digital images and videos as a human would. The challenges it faces largely follow
from the limited understanding of biological vision. Computer Vision involves
acquiring, processing, analyzing, and understanding digital images to extract high-dimensional data from the real world in order to generate symbolic or numerical
information which can then be used to make decisions. The process often includes
practices like object recognition, video tracking, motion estimation, and image
restoration.
What is OpenCV?
OpenCV is short for Open Source Computer Vision. Intuitively by the name, it is an open-source Computer Vision and Machine Learning library. This library is capable of processing
real-time images and video while also boasting analytical capabilities. It supports the Deep
Learning frameworks TensorFlow, Caffe, and PyTorch.
What is a CNN?
A Convolutional Neural Network is a deep neural network (DNN) widely used for the purposes
of image recognition and processing and NLP. Also known as a ConvNet, a CNN has input and
output layers, and multiple hidden layers, many of which are convolutional. In a way, CNNs are
regularized multilayer perceptrons.
Gender and Age Detection Python Project- Objective
To build a gender and age detector that can approximately guess the gender and age of the
person (face) in a picture using Deep Learning on the Adience dataset.
Gender and Age Detection – About the Project
In this Python Project, we will use Deep Learning to accurately identify the gender and age of
a person from a single image of a face. We will use the models trained by Tal Hassner and Gil
Levi. The predicted gender may be one of ‘Male’ and ‘Female’, and the predicted age may be
one of the following ranges- (0 – 2), (4 – 6), (8 – 12), (15 – 20), (25 – 32), (38 – 43), (48 – 53), (60
– 100) (8 nodes in the final softmax layer). It is very difficult to accurately guess an exact age
from a single image because of factors like makeup, lighting, obstructions, and facial
expressions. And so, we make this a classification problem instead of making it one of
regression.
The CNN Architecture
The convolutional neural network for this python project has 3 convolutional layers:
• Convolutional layer; 96 nodes, kernel size 7
• Convolutional layer; 256 nodes, kernel size 5
• Convolutional layer; 384 nodes, kernel size 3
It has 2 fully connected layers, each with 512 nodes, and a final output layer of softmax type.
To go about the python project, we’ll:
Detect faces
 Classify into Male/Female
 Classify into one of the 8 age ranges
 Put the results on the image and display it
The Dataset
For this python project, we’ll use the Adience dataset; the dataset is available in the public
domain and you can find it here. This dataset serves as a benchmark for face photos and is
inclusive of various real-world imaging conditions like noise, lighting, pose, and appearance.
The images have been collected from Flickr albums and distributed under the Creative
Commons (CC) license. It has a total of 26,580 photos of 2,284 subjects in eight age ranges (as
mentioned above) and is about 1GB in size. The models we will use have been trained on this
dataset.
Prerequisites
You’ll need to install OpenCV (cv2) to be able to run this project. You can do this with pippip install opencv-python
Other packages you’ll be needing are math and argparse, but those come as part of the
standard Python library.
Steps for practicing gender and age detection python project
1. Download this zip. Unzip it and put its contents in a directory you’ll call gad.
The contents of this zip are:
• opencv_face_detector.pbtxt
• opencv_face_detector_uint8.pb
• age_deploy.prototxt
• age_net.caffemodel
• gender_deploy.prototxt
• gender_net.caffemodel
• a few pictures to try the project on
For face detection, we have a .pb file- this is a protobuf file (protocol buffer); it holds the graph
definition and the trained weights of the model. We can use this to run the trained model. And
while a .pb file holds the protobuf in binary format, one with the .pbtxt extension holds it in
text format. These are TensorFlow files. For age and gender, the .prototxt files describe the
network configuration and the .caffemodel file defines the internal states of the parameters of
the layers.
2. We use the argparse library to create an argument parser so we can get the image argument
from the command prompt. We make it parse the argument holding the path to the image to
classify gender and age for.
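A minimal sketch of that parser (the '--image' flag name is an assumption):
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--image')  # path to the input image (assumed flag name)
args = parser.parse_args()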
3. For face, age, and gender, initialize protocol buffer and model.
4. Initialize the mean values for the model and the lists of age ranges and genders to classify from.
5. Now, use the readNet() method to load the networks. The first parameter holds trained weights
and the second carries network configuration.
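A hedged sketch of loading the three networks with OpenCV's cv2.dnn.readNet(), using the file names from the zip above (weights first, configuration second):
import cv2
faceNet = cv2.dnn.readNet("opencv_face_detector_uint8.pb", "opencv_face_detector.pbtxt")
ageNet = cv2.dnn.readNet("age_net.caffemodel", "age_deploy.prototxt")
genderNet = cv2.dnn.readNet("gender_net.caffemodel", "gender_deploy.prototxt")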
6. Let’s capture video stream in case you’d like to classify on a webcam’s stream. Set padding to 20.
7. Now until any key is pressed, we read the stream and store the content into the names
hasFrame and frame. If it isn’t a video, it must wait, and so we call up waitKey() from cv2, then
break.
8. Let’s make a call to the highlightFace() function with the faceNet and frame parameters, and
what this returns, we will store in the names resultImg and faceBoxes. And if we got 0 faceBoxes,
it means there was no face to detect.
• Here, net is faceNet - this model is the DNN Face Detector and holds only about 2.7MB on disk.
• Create a shallow copy of frame and get its height and width.
• Create a blob from the shallow copy.
• Set the input and make a forward pass to the network.
• faceBoxes is an empty list now. For each value in 0 to 127, define the confidence (between 0 and 1). Wherever we find the confidence greater than the confidence threshold, which is 0.7, we get the x1, y1, x2, and y2 coordinates and append a list of those to faceBoxes.
• Then, we put up rectangles on the image for each such list of coordinates and return two things: the shallow copy and the list of faceBoxes.
9. But if there are indeed faceBoxes, for each of those, we define the face and create a 4-dimensional blob from the image. In doing this, we scale it, resize it, and pass in the mean values.
10. We feed the input and give the network a forward pass to get the confidence of the two classes. Whichever is higher, that is the gender of the person in the picture.
11. Then, we do the same thing for age.
12. We'll add the gender and age texts to the resulting image and display it with imshow().
Python Project Examples for Gender and Age Detection
Let's try this gender and age classifier out on some of our own images now.
We'll get to the command prompt, run our script with the image option and specify an image to classify:
Python Project Example 1 - O/P: (classified output image)
Python Project Example 2 - O/P: (classified output image)
Python Project Example 3 - O/P: (classified output image)
Python Project Example 4 - O/P: (classified output image)
Python Project Example 5 - O/P: (classified output image)
Finished this Book …………
Congrats! You got your Data Scientist Job
Author - Rohit Dubey
Data Science Trainer