Statistics
Google Colaboratory
https://colab.research.google.com/drive/1b1tb8yA15D-ltAAK7TObj-_0NQB2JLSp#scrollTo=LI5QLn7xNMrX
What is Statistics ?
Statistics is the science of collecting, organizing, and analysing data.
→ Better decision making.
What is Data ?
Facts or pieces of information that can be measured.
Ex. IQ of students in the class.
Types of Statistics :
Descriptive Statistics
It consists of organizing and summarizing the data.
Ex. What is the average mark of students in the classroom?
Inferential Statistics :
Techniques wherein we use the data that we have measured to form conclusions.
Ex. Is the age of students in this classroom similar to the age of students in the math class in the college?
What is population and sample :
Things to be careful about when creating samples :
Random
Sample Size
Representative
Parameter vs Statistic :
A parameter is a characteristic of the population. It is generally unknown and estimated using a statistic.
A statistic is a characteristic of a sample. The goal of statistical inference is to use information obtained from the sample to make inferences about the population parameter.
Population :
The whole data.
Ex. Elections → any state → the entire voting population.
Denoted by capital N.
Sample :
A subset of the population.
Ex. To learn whom people voted for, you don't visit every single voter (too hectic); instead you choose a random subset of the population and make assumptions based on the sample result.
Denoted by small n.
Sampling Techniques :
1. Simple Random Sampling :
a. Every member of the population has an equal chance of being selected for the sample.
2. Stratified Sampling :
a. The population is split into non-overlapping groups (strata), and samples are drawn from each group.
b. Ex. Gender → male, female → survey
c. Ex. Age → (0-10), (10-20), (20-30) → non-overlapping groups
3. Systematic Sampling :
a. From the population N, select every nth individual.
b. Ex. In a mall survey, ask every 7th person you see to take the survey.
4. Convenience Sampling :
a. Only people who are conveniently available, or who have interest and expertise in the survey's domain, end up participating.
b. Ex. Data science survey → anyone who is interested in data science and has knowledge of it. (A minimal sketch of these techniques follows below.)
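A minimal sketch of these sampling techniques using pandas (the DataFrame, column names, and sample sizes here are hypothetical, chosen just for illustration):

```python
import pandas as pd

# Hypothetical survey frame: 1,000 people with a gender column.
df = pd.DataFrame({
    "person_id": range(1000),
    "gender": ["male", "female"] * 500,
})

# 1. Simple random sampling: every member has an equal chance.
simple = df.sample(n=100, random_state=42)

# 2. Stratified sampling: sample within each non-overlapping group (stratum).
stratified = df.groupby("gender", group_keys=False).apply(
    lambda g: g.sample(n=50, random_state=42)
)

# 3. Systematic sampling: take every 7th person from the ordered frame.
systematic = df.iloc[::7]
```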
Variable :
A variable is a property that can take on any value.
Ex. Height, width
1. Quantitative Variable
a. Measured numerically → can add, subtract, multiply, divide
b. Discrete variable → whole numbers → e.g. number of bank accounts
c. Continuous variable → e.g. height → 174.56
2. Qualitative/Categorical Variable
a. Derived from some characteristic rather than a numeric measurement.
b. Ex. Gender, blood group, T-shirt size
Variable Measurement Scales :
Nominal
Categorical / Qualitative Data
Ex. Colour, Gender
No order, no measurement
Ordinal
The order of the data matters; the exact values don't.
We focus on the ranks or order, not on the values themselves.
Ex. Ranks of students (derived from their marks).
Interval
Order matters and the differences between values are meaningful, but there is no natural 0.
Ex. temperature → 70-80, 80-90 → ordered ranges of values, but a value of 0 does not represent a complete absence of the quantity.
Ratio
Ratio variables are important in statistical analysis because they allow
for meaningful comparisons and calculations of ratios and percentages.
A clear definition of zero
equal intervals between values
meaningful ratios between values
continuous or discrete values
True Zero Point
A true zero point is a value on a scale where the absence of a property or
attribute is represented by a value of zero. This means that there is an absolute
minimum value that represents the complete absence of the thing being
measured.
For example, in the case of weight, a true zero point would be the complete
absence of weight, which is represented by a weight of zero. Similarly, in the
case of temperature measured in Kelvin, zero Kelvin (also known as absolute
zero) represents the complete absence of thermal energy.
In statistical analysis, the presence or absence of a true zero point is important
because it determines whether meaningful ratios can be calculated between
values. On a ratio scale, ratios between values are meaningful because they
represent the relative amounts of the thing being measured. In contrast, on an
interval scale, ratios between values are not meaningful because there is no true
zero point to use as a reference.
Meaning of Exit Poll :
A survey in which people who have just voted are asked who they voted for, in order to predict the result of the election.
Frequency Distribution :
how many times a particular item occurs → frequency
add the previous frequency to the current frequency; at the end you get the total number of items, n → cumulative frequency
Discrete values → bar chart
Continuous values → histogram
PDF → smoothing of the histogram → kernel density estimator
probability distribution function, probability density function. (A short sketch follows below.)
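A small sketch of a frequency distribution and cumulative frequency with pandas (the T-shirt-size data is hypothetical):

```python
import pandas as pd

# Hypothetical categorical data (T-shirt sizes).
sizes = pd.Series(["S", "M", "M", "L", "S", "M", "XL", "L", "M"])

freq = sizes.value_counts()   # frequency of each item
cum_freq = freq.cumsum()      # cumulative frequency; the last value equals n

print(freq)
print(cum_freq)               # final entry == len(sizes) == 9
```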
Univariate Analysis
⇒ Bar Plot
⇒ Pie Chart
Bar Vs Histogram :
1. Data type → bar: categorical data vs histogram: continuous data
2. Axes → bar → x-axis → categories → y-axis → frequency or count of the values
a. histogram → continuous → x-axis → ranges of values (depending on the bins) → same y-axis
3. Shape of bars: bar charts have evenly spaced bars of the same width; histogram bar widths can vary
4. Gaps between bars: bar charts have gaps between bars; histogram bars touch
Graph for Bivariate Analysis
Numerical vs Numerical
⇒ Scatter Plot
central tendency :
Refers to the measures used to determine the centre of the distribution of the data.
Mean → prone to outliers
Median → middle value in the dataset after it is arranged in ascending order
Mode → most frequent value in the dataset → useful for categorical data
Weighted Mean
Trimmed Mean
Outlier :
An outlier is a value that is completely different from the rest of the distribution.
It has an adverse impact on the entire distribution of the data.
There are different techniques to remove outliers.
Measure of central tendency :
Arithmetic mean for population and sample :
mean for population : μ = Σx / N
mean for sample : x̄ = Σx / n
Including an outlier pulls the mean noticeably toward it.
Median :
Find the number which is in the middle after sorting.
Sort → take the centre element (for an even count, average the two middle elements).
Works well with outliers: an outlier has little impact on the median.
Mode :
most frequent value
used for both categorical and discrete variables
Ex. a dataset where the flower-name column has 10% missing data → which measure of central tendency should be used to fill these null values? → the mode, since the values are categorical.
Which measure should be used to fill null values for people's ages or salaries? → the median, since it is robust to outliers. (A short sketch follows below.)
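A short sketch of filling null values with the appropriate measure of central tendency (the DataFrame below is hypothetical): mode for the categorical column, median for the numeric column with an outlier.

```python
import pandas as pd

df = pd.DataFrame({
    "flower_name": ["setosa", "setosa", None, "virginica", None],
    "age":         [25, 30, None, 45, 120],   # 120 is an outlier
})

# Categorical column -> fill with the mode (most frequent value).
df["flower_name"] = df["flower_name"].fillna(df["flower_name"].mode()[0])

# Numeric column with outliers -> fill with the median (robust), not the mean.
df["age"] = df["age"].fillna(df["age"].median())
print(df)
```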
Measure of Dispersion { spread → how well spread your data is.}:
Variance
Variance tells us how spread out a distribution is; we use it, for example, to identify how two distributions differ.
population variance :
σ² = Σ(xᵢ - μ)² / N, where N is the population size
sample variance :
s² = Σ(xᵢ - x̄)² / (n - 1), where n is the sample size
In the first plot → low variance; in the second plot → high variance.
High variance for blue because its data is widely spread out.
Low variance for red because its data is not widely spread out.
Standard Deviation :
the square root of the variance
one standard deviation to the left, one standard deviation to the right:
mean + 1·(standard deviation), mean - 1·(standard deviation)
variance → how spread out is the data?
standard deviation → what range of the data falls within 1 standard deviation of the mean?
The standard deviation is more commonly used than the variance, as it is in the
same units as the original data, while variance is in squared units. Standard
deviation is also easier to interpret, as it represents the average distance of data
points from the mean.
In summary, variance is a measure of how spread out a dataset is, while
standard deviation is the square root of variance and represents the typical
distance of data points from the mean.
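A quick numpy sketch of the population vs sample versions of variance and standard deviation (the data is made up; `ddof` controls the denominator):

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

pop_var  = np.var(data)           # ddof=0: divide by N (population variance)
samp_var = np.var(data, ddof=1)   # ddof=1: divide by n-1 (sample variance)

pop_std  = np.std(data)           # square root of the population variance
samp_std = np.std(data, ddof=1)

print(pop_var, samp_var)          # 4.0, ~4.571
print(pop_std, samp_std)          # 2.0, ~2.138
```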
Percentile {useful for finding outliers} :
A percentile is a value below which a certain percentage of observations lie.
Things to remember while calculating these measures:
the data must be sorted from low to high.
percentiles are not necessarily actual values in the data.
all other "-tiles" (quartiles, deciles, …) can easily be derived from percentiles.
you are basically finding the location of an observation.
Five Number Summary :
Minimum
First Quartile → Q1
Median → M
Third Quartile → Q3
Maximum
removing outliers :
IQR → Interquartile Range
[ lower fence , upper fence ] = [ Q1 - 1.5·IQR , Q3 + 1.5·IQR ]
IQR = Q3 - Q1
Q1 = 25th percentile, Q3 = 75th percentile
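A small sketch of removing outliers with the IQR fences (hypothetical data with one planted outlier):

```python
import numpy as np

data = np.array([3, 5, 7, 8, 9, 10, 11, 12, 45])   # 45 is a likely outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

cleaned = data[(data >= lower_fence) & (data <= upper_fence)]
print(lower_fence, upper_fence, cleaned)   # 45 is filtered out
```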
Benefits of a Box Plot :
easy way to see the distribution of data
tells you about the skewness of the data
can identify outliers
can compare 2 categories of data
Why is the sample variance divided by n - 1 ?
degrees of freedom
The sample variance is calculated by dividing the sum of the squared differences
from the sample mean by n-1, where n is the number of observations in the
sample.
The reason for dividing by n-1 instead of n is to correct for the bias in using the
sample mean to estimate the population mean. When we calculate the sample
variance, we use the sample mean as an estimate of the true population mean.
However, the sample mean is itself a random variable, and its variability means
that the sample variance calculated using the sample mean is likely to be slightly
lower than the true population variance.
Dividing by n-1 instead of n corrects for this bias, by increasing the denominator
of the variance calculation and making the estimate of the variance slightly
larger. This adjustment is known as Bessel's correction, named after Friedrich
Bessel, who first described it in 1825.
In summary, we divide by n-1 instead of n in the sample variance calculation to
correct for the bias introduced by using the sample mean to estimate the
population mean, and to obtain an unbiased estimate of the population variance.
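A simulation sketch of this bias (assumed setup: many samples of size 5 from a normal population with variance 4): dividing by n systematically underestimates the variance, while Bessel's correction fixes it on average.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0   # population variance (sd = 2)

# Many small samples (n = 5) from the same population.
samples = rng.normal(loc=0, scale=2, size=(100_000, 5))

biased   = np.var(samples, axis=1)           # divide by n
unbiased = np.var(samples, axis=1, ddof=1)   # divide by n-1 (Bessel's correction)

print(biased.mean())     # ~3.2: systematically below the true 4.0
print(unbiased.mean())   # ~4.0: unbiased on average
```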
Covariance :
Covariance measures the direction of the linear relationship between two variables. For a sample: cov(X, Y) = Σ(xᵢ - x̄)(yᵢ - ȳ) / (n - 1). Its magnitude depends on the units of X and Y, so it is hard to compare across datasets.
Correlation
Correlation (Pearson's r) is covariance scaled by the two standard deviations: r = cov(X, Y) / (σx · σy). It is unit-free and always lies between -1 and +1.
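A minimal numpy sketch of both quantities (the hours/score data is hypothetical):

```python
import numpy as np

# Hypothetical paired data: hours studied vs exam score.
hours = np.array([1, 2, 3, 4, 5, 6])
score = np.array([52, 55, 61, 70, 74, 80])

cov_matrix  = np.cov(hours, score)        # sample covariance (divides by n-1)
corr_matrix = np.corrcoef(hours, score)   # Pearson correlation, unit-free in [-1, 1]

print(cov_matrix[0, 1])    # covariance of hours with score
print(corr_matrix[0, 1])   # correlation, close to +1 here
```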
Random Variable :
A random variable assigns a value to each outcome in the sample space.
Types of random variable:
1. Discrete RV → coin, dice
2. Continuous RV → height, weight, CGPA → holds a range of values
Probability Distribution
What are probability distributions ?
A probability distribution is a list of all possible outcomes of a random variable along with their corresponding probability values.
Coin Toss = {H, T} = {1/2, 1/2}
2-dice probability distribution → table → row (die 1), column (die 2) → sum
→ P(sum = 2) = 1/36
Not every sum has the same probability: P(sum = 7) = 6/36.
In a probability distribution, we write down all of the outcomes along with their corresponding probability values.
In many scenarios the number of outcomes can be much larger, and a table would be tedious to write down. Worse still, the number of outcomes could be infinite.
Ex. heights of people.
What if we use a mathematical function to model the relationship between outcome and probability?
Such a function captures the relationship between outcome and probability: it returns the probability for a given outcome.
A probability distribution function is a mathematical function that describes the probability of obtaining different values of a random variable in a particular probability distribution.
Y = F(X) → find the function F that takes an outcome and returns its probability.
Using this mathematical function we can also draw a graph.
Probability distribution function → the whole set of possible values of a random variable along with their probabilities.
Types of probability distributions
The famous probability distributions: phenomena in nature show a lot of similarity, which is why a handful of famous distributions cover so many cases.
Why are probability distributions important?
They give an idea about the shape/distribution of the data.
And if our data follows a famous probability distribution, then we automatically know a lot about the data.
Note on parameters:
Parameters of a probability distribution are numerical values that determine the shape, scale, and location of the distribution of the data.
Different probability distributions have different sets of parameters that determine their shapes and characteristics, and understanding these parameters is essential in statistical analysis and inference.
Types of Probability Distribution Functions :
Probability Mass Function (PMF)
The probability distribution function of a discrete random variable is known as a probability mass function.
Ex. rolling a die, tossing a coin
Probability Density Function (PDF)
The probability distribution function of a continuous random variable is known as a PDF.
There is one more function that applies on top of both of these: the cumulative distribution function (CDF).
PDF → CDF
PMF → CDF
Probability Mass Function
Y = F(X) → rolling a die → { 1 : 1/6, 2 : 1/6, …, 6 : 1/6 } → { otherwise : 0 }
2 dice → sum → probability
It gives the probability of each outcome of the random variable separately, as seen in the graph.
Examples of distributions with a PMF: Bernoulli distribution, binomial distribution.
Cumulative Distribution Function (CDF) of a PMF
The CDF describes the probability that a random variable X with a given probability distribution will be found at a value less than or equal to x:
F(x) = P(X ≤ x)
It accumulates the probability at a particular point together with all of the outcomes less than that point. (A small sketch for the sum of two dice follows below.)
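A small sketch building the PMF and CDF for the sum of two dice by enumeration (pure Python, exact fractions):

```python
from collections import Counter
from fractions import Fraction

# PMF of the sum of two fair dice.
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
pmf = {s: Fraction(c, 36) for s, c in sorted(counts.items())}
print(pmf[2], pmf[7])   # 1/36 and 1/6 (i.e. 6/36)

# CDF: F(x) = P(X <= x), the running total of the PMF.
cdf, total = {}, Fraction(0)
for s, p in pmf.items():
    total += p
    cdf[s] = total
print(cdf[7])           # P(sum <= 7) = 21/36 = 7/12
```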
Probability Density Function (PDF)
A PDF describes the probability distribution of a continuous random variable.
It is the mathematical function from which probabilities for a continuous random variable are derived.
X-axis → the same values that we had in the PMF case.
Y-axis → not the probability that we had in the PMF, but the probability density.
1. Why probability density and not probability ?
a. Since the x-axis holds infinitely many values, the probability of any single exact value occurring is close to zero (effectively zero).
b. Ex. CGPA → 0 to 10 with infinitely many values → what is the probability of a CGPA of exactly 7.694? → close to 0, even among 100 children in a class.
c. Since we are dealing with an infinite range of values we cannot assign a positive probability to each individual CGPA; eventually each probability would become zero.
d. The area under the curve gives the total probability over every possible outcome, which here lies between 0 and 10.
e. Probability density → lets us ask: what is the probability that a value lies between two values?
f. We calculate probabilities with the help of the area under the graph, i.e. with the help of the probability density.
2. What does the area of the graph represent ?
a. The total area of the graph is 1 because the full range of values is covered. Ex. CGPA → values between 0 and 10 → P(0 ≤ x ≤ 10) = 1.
3. How do we calculate a probability then ?
a. With the help of the probability density → integrate the density between two data values to get the probability of falling between them.
4. Examples of PDFs:
a. Normal distribution → parameters: mean, sigma squared
b. Log-normal distribution → parameters: mean, sigma
c. Poisson distribution → parameter: lambda (strictly, the Poisson distribution is discrete, so it has a PMF)
5. How is the graph calculated ?
Density estimation.
2 techniques of density estimation:
Parametric density estimation → assumes the data follows a specific probability distribution
Non-parametric density estimation → does not assume any specific probability distribution
Commonly used techniques:
Kernel density estimation (KDE)
Histogram estimation
Gaussian mixture models (GMM)
Google Colaboratory
https://colab.research.google.com/drive/1x97XwbK6TT4csmwQRq3zqsbfYNFKjf5-#scrollTo=fljBTmcZj256
Parametric Density Estimation
Plot a histogram → make an assumption about the data's distribution → find the parameters of the assumed distribution.
Available data → mean, standard deviation → use them to estimate the population mean and standard deviation → put every value of X into the PDF formula → it returns a probability density value.
The whole game is to estimate the parameters as close as possible to the population mean and standard deviation.
Non Parametric Density Estimation
Kernel Density Estimation
Kernel → a probability distribution → typically the Gaussian (normal) distribution.
Take every point → treat it as a centre → build a normal distribution around that particular point, as seen in the second plot.
Do the same thing for all of the data points.
At the end → you will have as many Gaussian kernels as data points.
Take one point on the x-axis → draw a vertical line → check how many Gaussian curves the line intersects and read off their corresponding y-values → add all of the y-values.
Do the same for all data points to trace out the estimated density.
To generate the Gaussian for each point we need its parameters → mean and standard deviation:
mean → the data point itself
standard deviation → set via the kernel bandwidth → a hyperparameter. Low bandwidth (small standard deviation): the spread of each kernel decreases and the curve gets spiky. High bandwidth (large standard deviation): the curve becomes smoother.
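A sketch of KDE with scipy, showing the bandwidth hyperparameter (the height data is simulated; `bw_method` scales the kernel width):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
data = rng.normal(loc=170, scale=8, size=200)   # hypothetical heights

xs = np.linspace(140, 200, 500)

# bw_method scales the kernel bandwidth (the standard deviation of the
# Gaussian placed on each data point): small -> spiky, large -> smooth.
kde_spiky  = gaussian_kde(data, bw_method=0.1)
kde_smooth = gaussian_kde(data, bw_method=1.0)

print(kde_spiky(xs[:3]))    # density estimates at the first few x values
print(kde_smooth(xs[:3]))
```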
Cumulative Distribution Function
PDF → calculate the area under the graph → get the CDF.
CDF → calculate the slope of the CDF → get the PDF.
PDF → perform integration → CDF.
CDF → differentiate → PDF.
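A numerical sketch of this PDF ↔ CDF relationship for the standard normal (integration done with a simple Riemann sum, differentiation with np.gradient):

```python
import numpy as np
from scipy.stats import norm

xs = np.linspace(-5, 5, 2001)
pdf = norm.pdf(xs)                 # density of the standard normal

# Integrating the PDF recovers the CDF (Riemann sum as a rough sketch).
cdf_numeric = np.cumsum(pdf) * (xs[1] - xs[0])
print(cdf_numeric[-1])             # ~1.0: total area under the PDF

# Differentiating the CDF recovers the PDF.
pdf_numeric = np.gradient(norm.cdf(xs), xs)
print(np.max(np.abs(pdf_numeric - pdf)))   # small numerical error
```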
What is the difference between PDF and CDF?
First decide the rule: if petal_width lies within a particular range then the flower is considered setosa, and otherwise another type; the PDF is what you use to define that range.
CDF → useful to check whether the range we defined is right or not. It gives a quantitative answer: what percentage of values fall into the right categories?
Q: You have the iris dataset and 4 KDE plots showing the distribution of 4 variables; each graph contains 3 distributions, one per category. Which 2 would you choose to remove?
Keep the distributions that are clearly differentiable.
If all 3 distributions overlap each other, it is very hard to identify which category a value lies in, even for a machine learning model.
Check the graph above: you are able to differentiate between the 3 categories, whether setosa, virginica, or versicolor.
Remove the graphs that are not easily differentiable.
It is very hard to differentiate the categories when they overlap each other too much, so it is fine to remove that feature.
Ex. Titanic dataset → age vs survived.
By using this type of analysis, you learn which features are important for the model and which are not.
How is the Cumulative Distribution Function useful?
The cumulative distribution function (CDF) of a random variable gives the probability
that the variable takes a value less than or equal to a particular value. In other
words, the CDF gives us information about the probability distribution of the random
variable.
Specifically, the CDF provides the following information:
1. Probability of a value less than or equal to a specific value: The CDF gives the
probability that the random variable takes a value less than or equal to a specific
value. This can help us understand the likelihood of different outcomes.
2. Probability of a value within a range: By subtracting the CDF value for a lower
value from the CDF value for a higher value, we can find the probability that the
random variable takes a value within a specific range.
3. Probability of a specific value: Although the CDF does not directly provide the
probability of a specific value, we can use the CDF to find the probability that the
random variable takes a value very close to a specific value.
In summary, the CDF provides important information about the probability distribution
of a random variable, which can help us make predictions and draw conclusions
about the behavior of the variable.
2D Density Plot
Shows how two numerical columns relate to each other at the same time; that is the kind of knowledge derived from this graph.
Dark areas → high density.
Light areas → low density.
Which combinations have the highest probability or density? → you get the paired probability density of the two columns.
Contour plot → the 3rd dimension is colour.
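A sketch of a 2D density plot with seaborn (the height/weight columns are simulated, with a built-in linear relationship):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
height = rng.normal(170, 8, 500)                     # hypothetical column 1
weight = 0.9 * height - 90 + rng.normal(0, 5, 500)   # related column 2

# 2D KDE: darker regions = higher joint density of the two columns.
sns.kdeplot(x=height, y=weight, fill=True, cmap="viridis")
plt.xlabel("height")
plt.ylabel("weight")
plt.show()
```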
Normal Distribution
By using the mean : shifting the distribution is possible.
By using the standard deviation : spreading it is possible.
Standard Normal Distribution
The probability values for the standard normal distribution are already calculated and stored in the Z-table.
Properties of Normal Distribution
The measures of central tendency are equal → mean = median = mode for a proper normal distribution.
Empirical rule (68-95-99.7).
Area under the curve → 1.
What is Skewness?
In a positively skewed distribution the tail is on the right side: outliers on the right pull the mean toward them, so the mean lies to the right of the median.
Measure of Skewness :
1. 0 → proper normal distribution → no skew
2. -0.5 to 0.5 → fairly symmetrical
3. -1 to -0.5 or 0.5 to 1 → moderately skewed
4. beyond -1 or 1 → highly skewed
Data that is not normally distributed is not necessarily skewed: check the skewness and the shape of the distribution.
A distribution that is not normal can still be symmetrical.
CDF of Normal Distribution
Uses of the normal distribution in data science
Outlier detection
Assumptions on data for ML algorithms: linear regression, GMM
Hypothesis testing
Central limit theorem
What is Kurtosis?
The 4 statistical moments :
1. Mean
2. Variance (usually reported via the standard deviation)
3. Skewness
4. Kurtosis
There are more moments too, but these are the main 4.
Kurtosis tells us the heaviness of a distribution's tails: how fat the tails are, not how peaked the curve is. Fatter tails → higher chance of having outliers. (A short sketch follows below.)
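A short scipy sketch of both moments on simulated data (normal vs right-skewed):

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(3)
normal_data  = rng.normal(size=10_000)
right_skewed = rng.lognormal(size=10_000)

print(skew(normal_data))      # ~0 -> roughly symmetric
print(skew(right_skewed))     # clearly positive -> tail on the right

# Fisher's definition (scipy's default): a normal distribution has kurtosis ~0.
print(kurtosis(normal_data))
print(kurtosis(right_skewed))   # large -> heavy tails, more chance of outliers
```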
QQ Plot
Process:
1. Take theoretical data → normal → sort → calculate quantiles → Y quantiles.
2. Take the original data → sort → calculate quantiles → X quantiles.
3. Take the first point from X and Y and compare; do the same thing for all points. (A sketch using scipy follows below.)
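A sketch of the QQ plot with scipy, which automates exactly these steps (the data is simulated normal, so the points should hug the line):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
data = rng.normal(loc=50, scale=5, size=300)

# probplot sorts the data, computes theoretical normal quantiles,
# and plots them against the sample quantiles.
stats.probplot(data, dist="norm", plot=plt)
plt.show()   # points near the straight line -> data looks normal
```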
Does a QQ plot only detect the normal distribution?
No. You can compare against any type of distribution that you want.
What is the quantiles?
In statistics, a quantile is a specific value or cut-off point in a distribution that divides
the data into smaller groups or subsets with equal proportions. Quantiles are used to
describe the spread of a dataset, and they can be used to identify the location of
individual observations within a distribution.
The most commonly used quantiles are the quartiles, which divide the data into four
equal parts. The first quartile (Q1) represents the 25th percentile, meaning that 25%
of the data is less than or equal to this value, while 75% is greater than or equal to
this value. The second quartile (Q2) is the median, representing the 50th percentile,
and the third quartile (Q3) is the 75th percentile.
Other common quantiles include the deciles (which divide the data into ten equal
parts) and the percentiles (which divide the data into 100 equal parts).
Quantiles are useful for identifying outliers, detecting skewness in the data, and
comparing datasets. For example, the difference between the 25th and 75th
percentiles (also known as the interquartile range) can be used to measure the
spread of a dataset, while comparing the distribution of two datasets using their
quartiles can provide insights into their differences.
Non Gaussian Distribution
Continuous Non Gaussian Distribution
Discrete Non Gaussian Distribution
Uniform Distribution
It has two types:
1. Discrete Uniform Distribution
2. Continuous Uniform Distribution
Skewness → 0 → symmetrical, like the normal distribution
Log Normal Distribution
Right skewed.
Take the random variable → calculate the log → the resulting distribution is normally distributed.
How to check if a random variable is log-normally distributed?
X data → take the log of the data → it should be normally distributed.
Verify with a QQ plot → if the points fall along the line, it is log-normal; otherwise it is not. (A sketch follows below.)
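A sketch of this check on simulated log-normal data:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
x = rng.lognormal(mean=0, sigma=0.5, size=500)   # right-skewed data

log_x = np.log(x)   # if X is log-normal, log(X) should be normal

stats.probplot(log_x, dist="norm", plot=plt)
plt.show()          # points along the line -> log-normal assumption holds
```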
Transformation:
In statistics, a transformation refers to a mathematical function applied to a dataset in
order to alter its distribution or make it more suitable for a particular analysis or
modeling technique. Transformations are commonly used in data analysis to improve
the assumptions of statistical models or to improve the accuracy of statistical
inference.
The most common types of transformations are linear transformations, which involve
multiplying or adding a constant value to the data. For example, a common linear
transformation is to convert a temperature scale from Fahrenheit to Celsius by
subtracting 32 and multiplying by 5/9.
Nonlinear transformations are also commonly used in statistics, particularly in cases
where the data is not normally distributed or exhibits skewness or outliers. Some
common nonlinear transformations include:
Logarithmic transformations: used to reduce the effect of outliers or to compress
data that spans several orders of magnitude.
Square root transformations: used to reduce the effect of positive skewness or to
linearize relationships between variables.
Box-Cox transformations: a family of transformations that can be used to adjust
the skewness and kurtosis of a dataset by selecting a parameter that maximizes
the likelihood of the transformed data.
Transformations can be applied to the entire dataset or to specific variables or
subsets of the data. When selecting a transformation, it is important to consider the
underlying assumptions of the statistical model or analysis, as well as the
interpretability and practical implications of the transformed data.
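A quick sketch comparing these transformations on simulated positively skewed data (skewness should drop after each transform):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.lognormal(mean=2, sigma=0.8, size=1000)   # positively skewed, positive-valued

log_t  = np.log(x)                # log transform: compresses large values
sqrt_t = np.sqrt(x)               # square root: milder skew reduction
boxcox_t, lam = stats.boxcox(x)   # Box-Cox: picks lambda by maximum likelihood

print(stats.skew(x), stats.skew(log_t), stats.skew(sqrt_t), stats.skew(boxcox_t))
```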
Bernoulli Distribution
A discrete distribution.
Two outcomes → success, failure → coin toss, spam email, rolling a die and getting a five.
It models a random experiment with a binary outcome.
PMF: P(X = x) = p^x · (1 - p)^(1 - x), x ∈ {0, 1}
Coin toss → probability of success, i.e. head (x = 1):
(0.5)^1 · (1 - 0.5)^(1-1) = 0.5
Coin toss → probability of failure, i.e. tail (x = 0):
(0.5)^0 · (1 - 0.5)^(1-0) = 0.5
Binomial Distribution
Describes the number of successes in a fixed number of independent Bernoulli trials.
Perform a Bernoulli trial n times → binomial.
n : number of trials
p : probability of success
x : desired number of successes in the random experiment
PMF: P(X = x) = n! / (x!(n - x)!) · p^x · (1 - p)^(n - x)
Ex. getting likes from exactly 2 people out of 3:
P(X = 2) = 3! / (2! · 1!) · (0.5)^2 · (1 - 0.5)^(3-2) = 3 · 0.25 · 0.5 = 0.375
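These PMFs can be checked with scipy (a sketch; the like-counting example matches the calculation above):

```python
from scipy.stats import bernoulli, binom

# Bernoulli: one coin toss with p = 0.5.
print(bernoulli.pmf(1, p=0.5))   # P(head) = 0.5
print(bernoulli.pmf(0, p=0.5))   # P(tail) = 0.5

# Binomial: likes from exactly 2 people out of 3, each with p = 0.5.
print(binom.pmf(2, n=3, p=0.5))  # 3!/(2!*1!) * 0.5^2 * 0.5^1 = 0.375

# Full distribution of heads in 10 tosses with p = 0.8 (mass shifts right).
print([round(binom.pmf(k, 10, 0.8), 3) for k in range(11)])
```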
If we choose p = 0.8 (a high value), the distribution moves toward the right: since the probability of getting a head is 0.8, most random experiments yield around 7 to 10 heads among 10 trials.
That is why it moves toward the right side; for a lower probability of success it moves toward the left side, as seen in the graph above.
Day 3 : Intermediate to Advanced
Distributions :
We use different types of distributions to get an idea about a dataset.
How data is spread out or arranged is known as its distribution.
How can we see the data in a visual way?
Continuous data → what type of graph is useful for understanding the data?
There are multiple ways to visualize the data using different graphs.
What is Distribution ?
In statistics, a distribution refers to the way in which a set of data is spread out or
arranged. It describes how the data is distributed or arranged across its range of
values.
There are several types of distributions, including:
1. Normal distribution: This is a symmetrical, bell-shaped distribution in which
most of the data falls near the mean or average value, with the rest of the
data distributed evenly on either side.
2. Skewed distribution: This is a distribution in which the data is not
symmetrical and is "skewed" to one side or the other. A positively skewed
distribution has a long tail on the right, while a negatively skewed distribution
has a long tail on the left.
3. Bimodal distribution: This is a distribution in which there are two peaks or
modes in the data, indicating that there are two different groups or
populations represented.
4. Uniform distribution: This is a distribution in which the data is evenly
distributed across the range of values, with no peaks or modes.
The uniform distribution is a probability distribution that describes a situation
where every possible outcome is equally likely. In other words, if a random
variable is uniformly distributed, each value within a given range has the
same probability of occurring.
The probability density function (pdf) of a continuous uniform distribution is given by:
f(x) = 1/(b-a) if a ≤ x ≤ b, and 0 otherwise
where 'a' is the minimum value of the range and 'b' is the maximum value of the range.
The cumulative distribution function (cdf) of a continuous uniform distribution is:
F(x) = 0 if x < a; (x-a)/(b-a) if a ≤ x ≤ b; 1 if x > b
The uniform distribution is often used in statistics and probability theory as a
simple and tractable model for situations where every outcome in a range is
equally likely, such as rolling a fair die or selecting a random number from a
given interval.
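A sketch of these uniform formulas with scipy (note scipy parameterizes the range as loc = a and scale = b - a):

```python
from scipy.stats import uniform

a, b = 2, 10
u = uniform(loc=a, scale=b - a)

print(u.pdf(5))    # 1/(b-a) = 0.125 inside [a, b]
print(u.pdf(12))   # 0 outside the range
print(u.cdf(6))    # (6-2)/(10-2) = 0.5
```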
Distributions can be described using various measures, including measures of
central tendency (such as the mean, median, and mode) and measures of
variability (such as the range, variance, and standard deviation). Understanding
the distribution of a dataset is important in statistical analysis as it can provide
insights into the nature of the data, help identify any outliers or anomalies, and
inform decisions about appropriate statistical tests or models to use.
1. Gaussian Or Normal Distribution :
The mean of the data lies at the centre of the distribution.
One side is exactly symmetrical to the other side.
A symmetrical, bell-shaped distribution: most of the data falls near the mean or average value, and the rest is distributed evenly on both sides.
Known as the normal or Gaussian distribution.
Empirical Rule :
The 68-95-99.7% rule.
Ex. dataset → 100 data points:
Within the 1st standard deviation lies around 68% of the entire distribution.
Within the 2nd standard deviation lies around 95% of the entire distribution.
Within the 3rd standard deviation lies around 99.7% of the entire distribution.
If you have a normal or Gaussian distribution then the three standard-deviation percentages above definitely hold.
Ex. height → normally distributed → domain experts (doctors) know how much data falls within the 1st, 2nd, and 3rd standard deviations. The same holds for weight, the IRIS dataset, etc.
Whenever we have Gaussian-distributed data it will follow 68%, 95% and 99.7%. (A quick check with scipy follows below.)
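A quick check of the 68-95-99.7 rule with scipy:

```python
from scipy.stats import norm

# Fraction of a normal distribution within k standard deviations of the mean.
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(k, round(p, 4))   # ~0.6827, 0.9545, 0.9973
```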
Ex. mean = 4, sd = 1.
In this case → the value 4.5 lies +0.5 SD toward the right side; but for 4.75 it is hard to eyeball, and that is why we use the Z-score.
Z-Score : it tells us how many standard deviations a value is away from the mean.
A z-score measures the distance between a data point and the mean using standard deviations. Z-scores can be positive or negative. The sign tells you whether the observation is above or below the mean.
z = (x - μ) / σ, with population mean μ and standard deviation σ → for 4.75: z = (4.75 - 4) / 1 = +0.75 SD → a positive value, so it lies on the right side.
The z-score is used to express how many standard deviations to the right or left a value lies.
After applying the z-score, values are converted into the range { -3, -2, -1, 0, 1, 2, 3 }.
This is called the standard normal distribution.
Standard Normal Distribution :
The most important property of the SND: mean = 0 and standard deviation = 1.
A random variable belongs to the standard normal distribution if its mean is zero and its standard deviation is one.
Why do we convert using the z-score?
Dataset → Age (years), Salary (₹), Weight (kg) → calculated in different units:

Age (Year) | Salary (rupee) | Weight (Kg)
34         | 56k            | 70
56         | 110k           | 87

Here, notice that the units of all the columns are different.
Main target → bring everything to a common form → SND(0, 1).
Take the entire data → apply the z-score → SND(0, 1) → standardization.
Whenever we talk about standardization, it is always the z-score.
Normalization :
Shift the entire range of values into the range (0, 1).
MinMaxScaler → (0, 1) → if you want to shift the range to (-1, 1), you can.
CNN → Image → Pixel → range (0, 255) → MinMaxScaler → (0, 1). (A sketch of both scalers follows below.)
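A sketch of both preprocessing steps with scikit-learn, using the hypothetical age/salary/weight table above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Columns on very different scales: age, salary, weight.
X = np.array([[34.0,  56_000.0, 70.0],
              [56.0, 110_000.0, 87.0]])

# Standardization (z-score): each column -> mean 0, standard deviation 1.
X_std = StandardScaler().fit_transform(X)

# Normalization: each column rescaled into the range (0, 1).
X_norm = MinMaxScaler().fit_transform(X)

print(X_std.mean(axis=0), X_std.std(axis=0))    # ~[0 0 0], [1 1 1]
print(X_norm.min(axis=0), X_norm.max(axis=0))   # [0 0 0], [1 1 1]
```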
Practical Example of Z-Score ?
Match → Ind vs SA
Series average (2021, 2020) → 250, 260
Standard deviation of scores (2021, 2020) → 10, 12
Team final score (2021, 2020) → 240, 245
Compare the two scores: in which year was the team's final score better relative to its series?
We use the z-score for this:
z(2021) = (240 - 250) / 10 = -1.0 and z(2020) = (245 - 260) / 12 = -1.25.
Since -1.0 > -1.25, the 2021 final score was better relative to that year's series.
The z-score also helps to find areas under the curve of the distribution (via the Z-table).
The Z-table runs from left to right, so if you look up 0.25 in the right-hand (positive) table it contains the value 0.5987: the whole left half (0.5) plus the additional area for 0.25 toward the right.
But if you look up -1, it is sufficient to check the left-hand (negative) table, which gives 0.1587, because values run from negative to positive; following this one simple rule makes things easy to understand.
You may come across a different Z-table format, so be careful while reading off the value for a z-score.
Ex. What percentage of people have an IQ between 90 and 120? → find the z-scores → use the Z-table → read off the percentage. (A sketch follows below.)
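A sketch that replaces the Z-table lookup with scipy (`norm.cdf` returns the area to the left of z, which is what the table stores); for the IQ question the conventional mean 100 and sd 15 are assumed:

```python
from scipy.stats import norm

# Area to the left of a z value (what the Z-table stores).
print(norm.cdf(0.25))   # ~0.5987
print(norm.cdf(-1.0))   # ~0.1587

# P(90 <= IQ <= 120), assuming IQ ~ Normal(mean=100, sd=15).
z_low  = (90  - 100) / 15
z_high = (120 - 100) / 15
print(norm.cdf(z_high) - norm.cdf(z_low))   # ~0.656, i.e. about 65.6%
```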
Where do we use Standardization and where Normalization ?
Normalization and standardization are both techniques used to preprocess data
before training a machine learning model. However, they are used in different
situations depending on the nature and distribution of the data.
Normalization is used when the data is on different scales and ranges. It rescales the data to have values between 0 and 1, or between -1 and 1, so that each feature contributes equally to distance computations.
Standardization, on the other hand, is used when the data is normally distributed
or when the algorithm used for training the model assumes a normal distribution
of the data. It transforms the data so that it has a mean of 0 and a standard
deviation of 1. This allows the algorithm to assume that the data is centered
around 0 and that the scale of the features is similar.
In summary, we use normalization when the data is on different scales and
ranges, and standardization when the data is normally distributed or when the
algorithm used for training the model assumes a normal distribution of the data.
How can you find an outlier using the z-score ?
Using the z-score, find how many standard deviations a value is away from the mean.
The empirical rule discussed earlier tells us that within 1 standard deviation 68% of the data falls, within 2 standard deviations 95%, and within 3 standard deviations 99.7%.
Whatever lies beyond the range of the 3rd standard deviation is considered an outlier.
The z-score, also known as the standard score, is a statistical measure that
indicates how many standard deviations a data point is from the mean of a
distribution. It is calculated by subtracting the mean from the data point and then
dividing by the standard deviation.
The use of z-scores is to standardize data so that it can be easily compared
across different distributions or datasets. By using z-scores, we can convert data
from different units or scales into a standard unit of measurement, which makes
it easier to analyse and interpret.
Z-scores are commonly used in statistics and data analysis to identify outliers or
unusual observations in a dataset. An observation that has a z-score greater
than 3 or less than -3 is considered to be an outlier, meaning it is significantly
different from the other observations in the dataset.
In addition to identifying outliers, z-scores can also be used to calculate
confidence intervals and p-values in statistical hypothesis testing, as well as in
machine learning algorithms such as clustering and anomaly detection. Overall,
the use of z-scores provides a standardized way to compare and analyse data,
making it a useful tool in a wide range of applications.
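A small sketch of z-score-based outlier flagging (made-up data: mostly values near 11-12 plus one extreme point):

```python
import numpy as np
from scipy.stats import zscore

data = np.array([10, 12, 11, 13, 12, 11, 10, 13, 12, 11] * 2 + [95])

z = zscore(data)                 # (x - mean) / std for every point
outliers = data[np.abs(z) > 3]   # beyond 3 standard deviations is suspect
print(outliers)                  # [95]
```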
Probability :
Probability is a measure of the likelihood of an event.
In statistics, probability is a measure of the likelihood of an event occurring. It is
a number between 0 and 1, with 0 indicating that an event is impossible and 1
indicating that an event is certain. Probability is used to describe the uncertainty
or randomness of a particular event or outcome, and is used in a wide range of
statistical applications, from estimating the likelihood of a particular event
occurring to modelling complex systems and making predictions about future
outcomes.
P(event) = number of ways the event can occur / total number of possible outcomes
Addition Rule of Probability : OR : +
Mutually Exclusive Events :
Two events that cannot occur at the same time.
Mutually exclusive events are events that cannot occur at the same time. In
other words, the occurrence of one event precludes the occurrence of the
other event. For example, when flipping a coin, it can either land heads up or
tails up, but not both at the same time. Therefore, these two events are
mutually exclusive. When calculating the probability of mutually exclusive
events occurring, we use the addition rule of probability, which states that the
probability of either of two mutually exclusive events occurring is equal to the
sum of their individual probabilities.
Non-Mutually Exclusive Events :
Both events can occur at the same time.
Ex. deck of cards → you are picking a card from a deck. What is the probability of choosing a card that is a queen or a heart?
It is non-mutually exclusive because both can occur at the same time: there are 4 queens in a deck of cards, there are 13 heart cards, and there is 1 queen of hearts.
P(queen or heart) = 4/52 + 13/52 - 1/52 = 16/52 = 4/13
This is the addition rule for non-mutually exclusive events.
The formula for the addition rule of non-mutually exclusive events is:
P(A or B) = P(A) + P(B) - P(A and B)
where P(A) is the probability of event A occurring, P(B) is the probability of
event B occurring, and P(A and B) is the probability of both events A and B
occurring together. This formula takes into account the fact that if both
events A and B can occur at the same time, then the probability of their
occurrence together needs to be subtracted from the sum of their individual
probabilities in order to avoid double counting.
Multiplication Rule :
Independent Events
Rolling a die → {1, 2, 3, 4, 5, 6}
If I get 1 on the first try, then on the next try I may get any number between 1 and 6.
Every number has the same chance and the same probability of occurring.
The outcome does not depend on any other outcome, so it is called an independent event.
What is the probability of rolling a 5 and then a 4 with a die?
1/6 * 1/6 = 1/36
Dependent (Non-Independent) Events
Bag → 3 red marbles, 2 blue marbles
What is the probability of each colour as marbles are taken out one by one?
On the first try you picked out one red marble → now 4 marbles remain in total → 2 red, 2 blue.
The first event impacts the next event, so we say the second event is dependent on the first.
This is called a dependent event.
Naive Bayes → conditional probability.
What is the probability of drawing a queen and then an ace from a deck of cards?
4/52 * 4/51 = 16/2652 = 4/663
Permutation and combination are two fundamental concepts in combinatorics,
which is the branch of mathematics that deals with counting and arranging
objects.
Permutation :
School trip → chocolate factory → { 6 different types of chocolate } → each student has to note down 3 chocolates in order → how many permutations are possible? → 6 * 5 * 4 = 120
Permutation refers to the arrangement of objects in a specific order, where the
order of the objects matters. In other words, permutations are arrangements of
objects where the order matters. For example, if we have three distinct objects A,
B, and C, there are six possible permutations: ABC, ACB, BAC, BCA, CAB, and
CBA.
The formula for calculating the number of permutations of n objects taken r at a
time is:
nPr = n! / (n-r)!
where n is the total number of objects and r is the number of objects chosen.
Combination :
Combination, on the other hand, refers to the selection of objects from a set without regard to the order in which they are selected. In other words, combinations are arrangements of objects where the order does not matter. For example, if we have three distinct objects A, B, and C, there are only three possible combinations of two objects: AB, AC, and BC (since AB and BA count as the same combination).
The formula for calculating the number of combinations of n objects taken r at a
time is:
nCr = n! / (r! * (n-r)!)
where n is the total number of objects and r is the number of objects chosen.
In summary, the main difference between permutation and combination is that
permutation is concerned with the arrangement of objects in a specific order, while
combination is concerned with the selection of objects without regard to their order.
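Both counts are available in Python's standard library; a quick sketch matching the examples above:

```python
import math

# Permutations: 6 chocolates, note down 3 in order -> 6*5*4 = 120 ways.
print(math.perm(6, 3))   # 120

# Combinations: choose 3 of 6 when order does not matter.
print(math.comb(6, 3))   # 20 = 120 / 3!

# Choosing 2 objects out of {A, B, C}: 6 ordered pairs collapse to 3 pairs.
print(math.perm(3, 2), math.comb(3, 2))   # 6, 3
```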
P-Value :
Ex. the mouse pad of a laptop → touched more frequently in the middle area → as seen in the image, the middle area has a high density of touches and both sides have less.
Consider a side region of the mouse pad having a p-value of 0.01 for being touched: out of 100 touches, I touched that particular area once.
A p-value is this kind of probability attached to the outcome of a specific experiment.
P-value is a statistical measure of the evidence against a null hypothesis. It
is used to determine whether the results of a statistical test are significant
and can be used to reject the null hypothesis. The null hypothesis is a
statement that there is no significant difference between two groups or
variables being compared.
A p-value is calculated by comparing the observed data with the expected
data under the null hypothesis. If the p-value is small (usually less than
0.05), it means that the observed data is unlikely to have occurred by chance
alone, and we can reject the null hypothesis. If the p-value is large, it means
that the observed data is likely to have occurred by chance alone, and we
cannot reject the null hypothesis.
For example, if we are testing the hypothesis that a new drug is effective in
treating a particular disease, we might conduct a randomized controlled trial
and compare the outcomes of patients who received the drug with those who
received a placebo. If the p-value is less than 0.05, it means that the
difference in outcomes between the two groups is statistically significant, and
we can conclude that the drug is effective.
P-values are commonly used in hypothesis testing and statistical inference,
and are an important tool in many scientific fields. However, they can be
controversial and are sometimes criticized for being misinterpreted or
misused. As with any statistical measure, it is important to use p-values
appropriately and to interpret the results in the context of the study design
and the underlying scientific question.
Hypothesis Testing :
Coin → test whether this coin is a fair coin or not by performing 100 tosses.
When do you think a coin is a fair coin? → P(H) = 0.5, P(T) = 0.5
If I get close to 50 heads then I can say the coin looks fair.
Null hypothesis → usually given in the problem statement → the coin is fair.
Alternate hypothesis → the coin is unfair.
Run the experiment.
Reject or retain the null hypothesis.
mean value = 50, standard deviation = 10 (the values used in this example; note that for 100 fair tosses the exact binomial standard deviation is sqrt(100 · 0.5 · 0.5) = 5)
Our experimental result should be near the mean.
How can we decide how far it can be away from the mean?
For that we use the significance value → defined by the domain expert.
In statistics, the symbol for alpha is α. Alpha is commonly used as the
significance level in hypothesis testing, and represents the probability of
rejecting the null hypothesis when it is actually true. The value of alpha is
usually set to 0.05 or 0.01, corresponding to a 5% or 1% chance of rejecting
the null hypothesis when it is actually true.
With the p-value you are talking about the probability of the result of a specific experiment.
Significance : what should fall within your confidence interval. (A sketch using an exact binomial test follows below.)
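A sketch of the coin-fairness test using scipy's exact binomial test (the head counts 61 and 53 are made-up outcomes for illustration):

```python
from scipy.stats import binomtest

# 100 tosses, 61 heads. H0: the coin is fair (p = 0.5).
result = binomtest(k=61, n=100, p=0.5)
print(result.pvalue)   # ~0.035 < 0.05 -> reject H0: the coin looks unfair

# 53 heads would be well within chance variation:
print(binomtest(k=53, n=100, p=0.5).pvalue)   # ~0.6 -> retain H0
```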
What are Type 1 and Type 2 Errors ?
Null hypothesis → H0 → the coin is fair
Alternate hypothesis → H1 → the coin is not fair
Outcome 1 : we reject the null hypothesis, and in reality it is false. → correct decision
Outcome 2 : we reject the null hypothesis, but in reality it is true. → wrong → Type 1 error
Outcome 3 : we retain/accept the null hypothesis, but in reality it is false. → wrong → Type 2 error
Outcome 4 : we retain/accept the null hypothesis, and in reality it is true. → correct decision
Type 1 and Type 2 errors are two types of errors that can occur in hypothesis
testing.
Type 1 error, also known as a false positive, occurs when the null hypothesis
is rejected even though it is true. This means that we conclude that there is a
statistically significant difference between two groups when in fact there is
not. The probability of making a Type 1 error is denoted by the symbol alpha
(α) and is usually set at 0.05 or 0.01.
Type 2 error, also known as a false negative, occurs when the null
hypothesis is accepted even though it is false. This means that we fail to
detect a statistically significant difference between two groups when in fact
there is one. The probability of making a Type 2 error is denoted by the
symbol beta (β) and is dependent on the sample size, the effect size, and
the level of significance.
In hypothesis testing, the goal is to minimize both Type 1 and Type 2 errors.
However, there is often a trade-off between the two, as reducing the
probability of one type of error increases the probability of the other.
Statisticians use various methods to balance the risks of Type 1 and Type 2
errors, such as increasing the sample size or adjusting the significance level.
Ultimately, the choice of which type of error to prioritize depends on the
goals and context of the study, as well as the consequences of each type of
error.
One-tailed and two-tailed tests
Ex. Colleges in Karnataka have an 85% placement rate. A new college was recently opened, and it was found that a sample of 150 students had a placement rate of 88% with a standard deviation of 4%. Does this college have a different placement rate than the other colleges?
A hypothesis test can be either a one-tailed test or a two-tailed test, depending on the directionality of the alternative hypothesis.
In a one-tailed test, the alternative hypothesis is directional and predicts that the
population parameter is either greater than or less than the null hypothesis
value. The one-tailed test is used when there is a clear directional prediction
about the outcome of the experiment. For example, if we are testing the
hypothesis that a new drug is more effective than a placebo, we might use a
one-tailed test with the alternative hypothesis that the mean improvement in the
drug group is greater than the mean improvement in the placebo group.
In a two-tailed test, the alternative hypothesis is non-directional and predicts that
the population parameter is simply different from the null hypothesis value,
without specifying the direction of the difference. The two-tailed test is used
when there is no clear directional prediction about the outcome of the
experiment. For example, if we are testing the hypothesis that a new drug has an
effect on blood pressure, we might use a two-tailed test with the alternative
hypothesis that the mean change in blood pressure in the drug group is different
from zero.
One-Tailed Test:
In this example, the null hypothesis is that the mean of Group A is equal to the
mean of Group B. The alternative hypothesis is that the mean of Group A is
greater than the mean of Group B. The shaded area represents the rejection
region for the null hypothesis at the 0.05 level of significance. If the test statistic
falls in this region, we reject the null hypothesis in favour of the alternative
hypothesis and conclude that Group A has a higher mean than Group B.
Two-Tailed Test:
In this example, the null hypothesis is that the mean of Group A is equal to the
mean of Group B. The alternative hypothesis is that the mean of Group A is
different from the mean of Group B. The shaded areas represent the rejection
regions for the null hypothesis at the 0.05 level of significance. If the test statistic
falls in either of these regions, we reject the null hypothesis in favour of the
alternative hypothesis and conclude that Group A has a different mean than
Group B.
It is important to choose the appropriate type of test based on the research
question and the nature of the data being analysed. One-tailed tests are more
powerful than two-tailed tests when there is a clear directional prediction, but
they are also more prone to Type I errors. Two-tailed tests are more conservative
and are appropriate when there is no clear directional prediction, but they are
also less powerful than one-tailed tests.
Confidence Interval
A confidence interval is a range of values that is likely to contain the true value of
a population parameter with a certain level of confidence or probability. It is
calculated from a sample of data and is used to estimate the likely range of
values for a population parameter, such as the mean or the proportion.
The confidence interval is typically expressed as a range of values with an
associated level of confidence, such as "we are 95% confident that the true
value of the population parameter lies between x and y." The level of confidence
is usually set at 95% or 99%, although other levels may be used depending on
the nature of the data and the research question.
The confidence interval is calculated using statistical methods and takes into
account the sample size, the variability of the data, and the level of confidence
desired. A wider confidence interval indicates greater uncertainty about the true
value of the population parameter, while a narrower confidence interval indicates
greater precision in the estimate.
Confidence intervals are commonly used in statistical inference to make
predictions and draw conclusions about a population based on a sample of data.
They are also used in hypothesis testing to determine whether a hypothesis
about a population parameter is supported by the data or not.
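A sketch of a 95% t-based confidence interval for a population mean (the sample itself is simulated):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=100, scale=15, size=40)   # hypothetical sample

mean = sample.mean()
sem = stats.sem(sample)   # standard error of the mean: s / sqrt(n)

# 95% confidence interval for the population mean.
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```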
Point Estimate :
The value of a statistic that estimates the value of a parameter.
Z-Test :
Used when the population standard deviation is known
and n ≥ 30.
Confidence interval = point estimate ± margin of error
A Z-test is a statistical test used to determine whether two population means are
different when the variances are known and the sample size is large. It is based
on the standard normal distribution and uses the Z-score, which is a measure of
how many standard deviations a data point is from the mean of a distribution.
To perform a Z-test, we first calculate the difference between the means of the
two populations and divide it by the standard error of the difference. The
standard error of the difference is calculated by taking the square root of the sum
of the variances of the two populations divided by their respective sample sizes.
If the Z-score is greater than the critical value for the desired level of significance
(usually 0.05), we reject the null hypothesis and conclude that the two population
means are significantly different. Otherwise, we fail to reject the null hypothesis
and conclude that there is not enough evidence to support the claim that the two
population means are different.
To find the Z-score, we use the formula:
Z = (x - μ) / (σ / sqrt(n))
where x is the sample mean, μ is the population mean, σ is the population
standard deviation, and n is the sample size.
We can use the Z-test to compare the means of two populations, such as the
effectiveness of two different treatments or the performance of two different
groups of students. However, it is important to ensure that the assumptions of
the test are met, such as the normality of the populations and the equality of their
variances. If these assumptions are not met, alternative tests such as the t-test
may be more appropriate.
T-Test
When the population standard deviation is not given to us, we use the t-test.
Same structure : point estimate ± margin of error.
The formula for z changes here:
to calculate the t-value → degrees of freedom → n - 1.
You should use the t-table in order to get the critical value for the t-test.
A t-test is a statistical test used to determine whether two population means are
different when the variances are not known and the sample size is small. It is based
on the t-distribution, which is similar to the standard normal distribution but has fatter
tails.
The t-test is used when the population standard deviation is unknown and cannot be
estimated from the sample. Instead, we use the sample standard deviation to
estimate the population standard deviation. The t-test is also used when the sample
size is small (less than 30), which violates the assumption of the central limit
theorem.
The formula for the t-test is similar to the formula for the z-test, but instead of using
the standard normal distribution, we use the t-distribution. The formula for the t-test
is:
t = (x - μ) / (s / sqrt(n))
where x is the sample mean, μ is the population mean, s is the sample standard
deviation, and n is the sample size. The degrees of freedom for the t-distribution are
n - 1.
Here's an example to illustrate the difference between the t-test and the z-test:
Suppose we want to test whether the mean height of a sample of 20 men is different from the mean height of the population of all men. We know the population standard deviation is 3 inches. We take a random sample of 20 men and find that the sample mean height is 68 inches and the sample standard deviation is 2 inches.
If we use the z-test, the test statistic is:
z = (68 - μ) / (3 / sqrt(20)) = (68 - μ) / 0.67
If we use the t-test, the test statistic is:
t = (68 - μ) / (2 / sqrt(20)) = (68 - μ) / 0.45
The critical value for a two-tailed test at the 0.05 level of significance with 19 degrees of freedom is 2.093. If the calculated value of the test statistic is greater than the critical value, we reject the null hypothesis and conclude that the two means are significantly different. If it is less than the critical value, we fail to reject the null hypothesis and conclude that there is not enough evidence to support the claim that the two means are different.
In this example, taking the population mean under the null hypothesis to be 66 inches, the z statistic is (68 - 66) / 0.67 ≈ 2.99, which is greater than the critical value of 1.96, and the t statistic is (68 - 66) / 0.45 ≈ 4.47, which is greater than the critical value of 2.093. Either way we reject the null hypothesis and conclude that the mean height of the sample is significantly different from the mean height of the population.
One Sample Z-Test :
Population standard deviation is given.
Sample size is greater than or equal to 30.
Define the null hypothesis → H0 → μ = 100.
Define the alternative hypothesis → H1 → μ ≠ 100.
Set the alpha value → 0.05.
State the decision rule.
Calculate the Z test statistic: z = (x̄ - μ) / (σ / √n).
For a single observation √n = 1, which is why we did not include the √n term in the earlier formula.
Also, since we are working with sample data there is some sampling error; as the sample size keeps increasing, the sample mean keeps moving toward the population mean.
State the decision:
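A sketch of the full one-sample z-test computation (the sample mean 104, n = 50, and sd = 15 are hypothetical numbers, not from the original worked example):

```python
import math
from scipy.stats import norm

# H0: mu = 100 vs H1: mu != 100, known population sd, alpha = 0.05.
mu0, sigma, n, x_bar, alpha = 100, 15, 50, 104, 0.05

z = (x_bar - mu0) / (sigma / math.sqrt(n))
p_value = 2 * norm.sf(abs(z))   # two-tailed p-value

print(round(z, 3), round(p_value, 4))
print("reject H0" if p_value < alpha else "retain H0")
```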
One Sample T-Test :
In this question we do not have the population standard deviation, so we use the t-test.
Decision: reject the null hypothesis, since p-value < significance value. (A sketch follows below.)
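A sketch of a one-sample t-test with scipy (the sample values are made up; scipy estimates the standard deviation from the sample itself):

```python
import numpy as np
from scipy import stats

# Population sd unknown -> one-sample t-test against H0: mu = 100.
sample = np.array([102, 98, 107, 111, 95, 104, 109, 99, 106, 103])

t_stat, p_value = stats.ttest_1samp(sample, popmean=100)
print(round(t_stat, 3), round(p_value, 4))
# Reject H0 if p_value < 0.05; otherwise retain it.
```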
Chi-Square
Covariance
Pearson Correlation Coefficient
Spearman Rank Correlation
Practical Implementation
F-Test (ANOVA)
Chi-Square :
It tests claims about population proportions.
It is a non-parametric test that is performed on categorical (nominal or ordinal) data.
Non-parametric tests usually arise with population proportions: given some kind of proportion data → non-parametric test.
A non-parametric test is a statistical test that does not assume a specific
distribution for the population being sampled. Instead, it makes fewer
assumptions about the data and can be used when the data are not normally
distributed or when the sample size is small. Non-parametric tests are often
used for categorical or ordinal data, or for data that do not meet the
assumptions of parametric tests. They are generally less powerful than
parametric tests, but are more robust to outliers and other violations of the
assumptions. Examples of non-parametric tests include the Wilcoxon rank-sum test and the Kruskal-Wallis test.
A Chi-Square test is a statistical test used to determine whether there is a
significant association between two categorical variables. It is used when the
data is categorical and the variables are independent of each other. The test is
used to determine whether there is a difference between the observed
frequencies and the expected frequencies in one or more categories.
For example, suppose we want to test whether there is a significant association
between gender and political affiliation. We would collect data on the number of
men and women in each political party and use a Chi-Square test to determine
whether there is a significant difference between the observed frequencies and
the expected frequencies.
The Chi-Square test works by comparing the observed frequencies to the
expected frequencies. The expected frequencies are calculated based on the
assumption that there is no significant association between the two variables
being tested. If the observed frequencies are significantly different from the
expected frequencies, we can reject the null hypothesis and conclude that there
is a significant association between the two variables.
The use of the Chi-Square test is not limited to gender and political affiliation. It
can be used in many other situations where there are two categorical variables
to be compared. For example, it can be used to test whether there is a significant
association between smoking status and lung cancer, or between education level
and income.
In summary, the Chi-Square test is a statistical test used to determine whether
there is a significant association between two categorical variables. It is a non-parametric test that can be used in many different situations to test hypotheses
about the relationship between two categorical variables.
Set the value of alpha : 0.05
Degrees of freedom : n - 1 = 2
It is a two-tailed test (as set up in this example).
Check → chi-square table
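A sketch of a chi-square test of association with scipy (the gender × party contingency table is hypothetical):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: gender; columns: political party (made-up observed counts).
observed = np.array([[30, 20, 10],
                     [20, 30, 10]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(round(chi2, 3), round(p_value, 4), dof)
# p_value < 0.05 -> significant association between the two variables.
```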
Central Limit Theorem
Confidence Interval
Hypothesis Testing
Measuring skill and chance in games