
ExamPrep

Z Score = (X - or
These normal distributions can have any
mean or any positive standard deviation.
The z-score formula lets us work with the
standard normal distribution.
̅)/𝒔
(xi – 𝒙
Z Statistic
𝒁=
(𝒙 − 𝝁)
𝝈
√𝒏
where:
x̄ = sample mean
μ = population mean
σ = population standard deviation
n = number of sample observations
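As a quick sketch, the Z statistic above can be computed directly; the sample numbers below are made up for illustration:

```python
import math

def z_statistic(x_bar, mu, sigma, n):
    """Z statistic for a sample mean: Z = (x_bar - mu) / (sigma / sqrt(n))."""
    return (x_bar - mu) / (sigma / math.sqrt(n))

# Hypothetical numbers: a sample of n = 36 with mean 52, drawn from a
# population with mu = 50 and sigma = 6.
z = z_statistic(x_bar=52, mu=50, sigma=6, n=36)
print(z)  # 2.0
```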
Lower Tail Test & Upper Tail Test
Region of Acceptance
The range of values that leads the researcher to
accept the null hypothesis is called the region of
acceptance.
This is decided on the basis of the sign in the alternate hypothesis: you conduct a lower-tail test when the sign in the alternate hypothesis is <, and an upper-tail test when the sign is >.

Level of Significance
The probability with which one may reject a NULL hypothesis when it is true is called the level of significance (α).
Confidence Interval
Confidence with which a NULL hypothesis is
accepted or rejected.
A/B testing, at its most basic, is a way to compare
two versions of something to figure out which
performs better. An A/B test tells you whether there
is a statistical difference in the performance of the
two options.
E.g. Trebo (tax included vs. tax excluded)
Chi-Square

The chi-square goodness-of-fit test is a non-parametric test that is used to find out how significantly the observed values of a given phenomenon differ from the expected values.
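A minimal sketch of the goodness-of-fit statistic, using made-up die-roll counts (in practice `scipy.stats.chisquare` computes this and also returns the p-value):

```python
def chi_square_stat(observed, expected):
    """Chi-square statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical observed counts for 60 rolls of a die; a fair die is
# expected to show each face 10 times.
observed = [8, 9, 12, 11, 10, 10]
expected = [10, 10, 10, 10, 10, 10]

stat = chi_square_stat(observed, expected)
print(stat)  # 1.0 -- compared against a chi-square table with 5 degrees of freedom
```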





PROBABILITY

Probability = (No. of Desired Outcomes) / (Total Number of Possible Outcomes)

Independence

Two events are independent if and only if:

P(A|B) = P(A)
P(A and B) = P(A) · P(B)
Probability gives a measure of how likely
it is for something to happen.
The sum of the probabilities of all outcomes must equal 1.
Probability of an impossible outcome = 0
Probability of a certain outcome = 1
Joint probability is the probability of two events
occurring simultaneously. P(A & B) or P(A,B)
Marginal probability is the probability of an
event irrespective of the outcome of another
variable.
E.g. the probability of X=A for all outcomes of Y
Independent events: the probability of one event is not affected by the fact that the other event has occurred.
Intuition
These types of probability form the basis of much of predictive modelling, with problems such as classification and regression. For example:

- The probability of a row of data is the joint probability across each input variable.
- The probability of a specific value of one input variable is the marginal probability across the values of the other input variables.
- The predictive model itself is an estimate of the conditional probability of an output given an input example.
P(X = A) = Σᵢ P(X = A and Y = yᵢ), summed over all outcomes yᵢ of Y
Conditional probability is the probability of one
event occurring in the presence of a second
event.
JOINT PROBABILITY
Joint probability is a statistical measure that
calculates the likelihood of two events
occurring together and at the same point in
time.
P(A and B) = (number of outcomes satisfying A and B) / (total number of elementary outcomes)
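The joint, marginal, and conditional definitions can be sketched on a small made-up table of outcome counts (the counts below are illustrative only):

```python
# Hypothetical joint distribution of X in {A, B} and Y in {0, 1},
# given as counts of elementary outcomes (10 outcomes in total).
counts = {
    ("A", 0): 2, ("A", 1): 3,
    ("B", 0): 4, ("B", 1): 1,
}
total = sum(counts.values())

# Joint probability: P(X=A and Y=1) = outcomes satisfying both / all outcomes
p_joint = counts[("A", 1)] / total

# Marginal probability: P(X=A) = sum over all y of P(X=A and Y=y)
p_marginal = sum(v for (x, y), v in counts.items() if x == "A") / total

# Conditional probability: P(Y=1 | X=A) = P(X=A and Y=1) / P(X=A)
p_cond = p_joint / p_marginal

print(p_joint, p_marginal, p_cond)  # 0.3 0.5 0.6
```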
The probability of one event in the presence of all (or a subset of) outcomes of the other random variable is called the marginal probability or the marginal distribution.
It is called the marginal probability because, if all outcomes and probabilities for the two variables were laid out together in a table (X as columns, Y as rows), then the marginal probability of one variable (X) would be the sum of probabilities for the other variable (Y rows) on the margin of the table.

General Rules
Addition Rule
P (A or B) = P(A) + P(B) – P (A and B)
If A and B are mutually exclusive P (A and B) = 0
P (A or B) = P(A) + P(B)
Multiplication Rule
P(A and B) = P(A|B) · P(B)
If A and B are independent, then P(A|B) = P(A) and P(A and B) = P(A) · P(B)
The conditional probability of A given that B has occurred:

P(A|B) = P(A and B) / P(B)
P(B|A) = P(A and B) / P(A)
Law of total probability:

P(A) = P(A|B₁)P(B₁) + P(A|B₂)P(B₂) + … + P(A|Bₖ)P(Bₖ)

Since P(A|Bᵢ)P(Bᵢ) = P(A and Bᵢ), hence

P(A) = P(A and B₁) + P(A and B₂) + … + P(A and Bₖ)
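A numerical sketch of the two identities above, using a hypothetical three-part partition B1, B2, B3:

```python
# Hypothetical partition B1, B2, B3 with P(Bi) summing to 1,
# and conditional probabilities P(A|Bi).
p_B = [0.5, 0.3, 0.2]
p_A_given_B = [0.1, 0.4, 0.5]

# Multiplication rule gives the joint terms: P(A and Bi) = P(A|Bi) * P(Bi)
joints = [pa * pb for pa, pb in zip(p_A_given_B, p_B)]

# Law of total probability: P(A) = sum of P(A|Bi) * P(Bi)
p_A = sum(joints)
print(round(p_A, 2))  # 0.27
```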
Normal Distribution
Sampling Distribution
A sampling distribution is a distribution of all of the
possible values of a sample statistic for a given
sample size selected from a population.
Sample Mean Sampling Distribution
Standardized Z-score: of an observation is the
number of standard deviations it falls above or
below the mean. Raw scores above the mean have
positive standard scores, while those below the
mean have negative standard scores.
Distribution of sample means (appears normal).
E.g. GPA: the distribution of mean GPAs for multiple samples of size 50 each.
If a population is normal with mean μ and standard deviation σ, the sampling distribution of X̄ is also normally distributed with:

μ_X̄ = μ
σ_X̄ = σ / √n
Z-Table - Standard Normal Table
Entries in table give the area under the curve
between the Mean and ‘Z’ standard deviations
above/below the mean.
In sampling without replacement, two sample values aren't independent.

Z = (observation − mean) / (standard deviation)

Z score of the mean = 0
Z score > 2 → unusual observation
Defined for distributions of any shape
Z-value for the sampling distribution of X̄ (sample mean):

Z = (X̄ − μ_X̄) / σ_X̄ = (X̄ − μ) / (σ / √n)

where:
μ = population mean
σ = population standard deviation
n = sample size

Mean & Standard Deviation

μ = ΣXᵢ / N
σ = √( Σ(Xᵢ − μ)² / N )

Standard Error of the Mean

σ_X̄ = σ / √n

Percentiles

Z scores can be used to calculate percentiles when the distribution is normal. Graphically, the percentile is the area below the probability distribution curve to the left of that observation.

In sampling with replacement, two sample values are independent, i.e. what we get on the first one doesn't affect what we get on the second. Mathematically, this means that the covariance between the two is zero.
Different samples of the same size from the same population will yield different sample means. The standard error of the mean measures this variability of the mean from sample to sample (with replacement, or from an infinite population). The standard error of the mean decreases as the sample size increases.
Central Limit Theorem


- Applies even if the population is not normal.
- Sample means from the population will be approximately normal as long as the sample size is large enough.

As the sample size gets large enough, the sampling distribution of the sample mean becomes almost normal regardless of the shape of the population.
Sample Size:

- For most distributions, n > 30 will give a sampling distribution that is nearly normal.
- For fairly symmetric distributions, n > 15 is enough.
- For a normal population distribution, the sampling distribution of the mean is always normally distributed.
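The theorem can be illustrated with a small simulation: sample means drawn from an exponential population (clearly not normal) cluster around the population mean with spread σ/√n. The sample sizes and seed below are arbitrary choices.

```python
import random
import statistics

random.seed(0)

# Population: exponential with mean 1 (clearly not normal).
def sample_mean(n):
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

# Draw many sample means, each from a sample of n = 40.
means = [sample_mean(40) for _ in range(2000)]

# CLT: the sample means are approximately normal, centred on the
# population mean (1.0), with spread sigma / sqrt(n) = 1 / sqrt(40) ~ 0.158.
print(round(statistics.mean(means), 2))
print(round(statistics.stdev(means), 2))
```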
Population Proportions
Sample proportion (p / π) provides an estimate of the proportion of the population having some characteristic.

p = X / n = (number of items in the sample with the characteristic of interest) / (sample size)

Sampling Distribution of Sample Proportion

μ_p = π
σ_p = √( π(1 − π) / n )

p is approximately distributed as a normal distribution when n is large, where π = population proportion.
Z value for proportions

Z = (p − π) / σ_p = (p − π) / √( π(1 − π) / n )
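A sketch of the proportion Z value; the sample counts and claimed proportion below are made up for illustration:

```python
import math

def proportion_z(p_hat, pi, n):
    """Z = (p - pi) / sqrt(pi * (1 - pi) / n)."""
    return (p_hat - pi) / math.sqrt(pi * (1 - pi) / n)

# Hypothetical numbers: 120 of 400 sampled items show the characteristic
# (p = 0.30), tested against a claimed population proportion pi = 0.25.
z = proportion_z(120 / 400, 0.25, 400)
print(round(z, 2))  # 2.31
```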
Simple Linear Regression
Doubt Clearing – Supriya (05/03)
LIMITATIONS of Association

Correlation only captures linear association. For non-linear relationships, use:

- Rank correlation (captures ranks instead of values)
- Mutual information (a statistical solution, represented by a Venn diagram)

Association does not guarantee causation; causation guarantees association.

T-Statistic

x̄ = sample mean = 280
μ0 = population mean = 300
s = sample standard deviation = 50
n = sample size = 15

t = (280 − 300) / (50/√15) = −20 / 12.909945 = −1.549

Z-score: for large data sets. T-score: for small data sets.
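The worked t-statistic above can be verified directly:

```python
import math

# The worked example from the notes: x_bar = 280, mu0 = 300, s = 50, n = 15.
x_bar, mu0, s, n = 280, 300, 50, 15

t = (x_bar - mu0) / (s / math.sqrt(n))
print(round(t, 3))  # -1.549
```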
As a rule of thumb, a correlation coefficient below 0.3 is not considered a good basis for regression (if variables with better coefficients are available).
Regression is a statistical method used to determine
the relationship between variables in the form of an
equation.
Dependent variable: the variable that you predict is
called the ‘dependent’ or the ‘response’ variable. It is
usually denoted by ‘Y’.
Independent variable: the variable that is used to
predict this dependent variable is called the
‘independent’ or ‘explanatory’ variable. It is usually
denoted by ‘X’.
Helps investigate a cause-effect relationship between any two or more variables (though association alone does not guarantee causation).
Regression analysis is “finding the best-fitting
straight line for a set of data”.
This regression line represents the linear equation
that has the least amount of error (distance) between
the line and the actual data, or the line that is the
least far away from the data and is therefore most
representative. What the line is then showing you is
the relationship between two variables of interest.
Once we have this line we can make predictions
about future outcomes if we only have data for one of
the variables.
If we take this even further and assess the data over time, we can make predictions about what will happen in the future; this is the point of a time-series regression analysis. In other words, when we want to look at relationships over time in order to identify trends, we use a time-series regression analysis.
The residual plot for the salary-experience pair is distributed randomly across the horizontal axis. On the other hand, the gold-silver price pair shows a pattern that can be closely identified as an inverted-U distribution. Such a distribution suggests a nonlinear relationship between the variables.
p-value – Statistical Significance

The p-value approach to hypothesis testing uses the calculated probability to determine whether there is evidence to reject the null hypothesis.
A smaller p-value means that there is stronger evidence in favor of the alternative hypothesis.
The significance level is stated in advance to determine how small the p-value must be in order to reject the null hypothesis.
For example, a p-value of 0.0254 is 2.54%. This means there is a 2.54% chance your results could be random.

Regression Model

R-Square is a metric that is used to evaluate the simple linear regression model developed; it is also referred to as model fit.
The R-Square value ranges from 0 to 1.
An R-Square value of 1 indicates that the regression model completely explains the variation in the data, while a value near zero indicates a weak predictive relationship between the dependent and independent variables.
The higher the R-Square value, the better the model is considered to be.
R-Squared will always either increase or remain the same when you add more variables: the model already has the predictive power of the previous variables, so a new variable, no matter how insignificant it might be, cannot decrease the value of R-Squared.

The following are important considerations to account for before doing a regression analysis:

Linearity: There should be an overall linear pattern between the dependent variable and all the independent variables.
Interpretation of Regression Model
1. R-Squared value: the closer to 1, the stronger the predictive relation.
2. Regression coefficient p-value: should be less than alpha (the Type-1 error) for the confidence level chosen.
3. Residual plot: a random pattern implies the best regression analysis.
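A minimal sketch of fitting the best-fit line and computing R-squared by hand; the experience/salary numbers below are made up for illustration (in practice `scipy.stats.linregress` also returns the coefficient p-value):

```python
# Made-up experience (years) vs. salary (thousands) data for illustration.
x = [1, 2, 3, 4, 5, 6]
y = [30, 35, 41, 44, 50, 56]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept of the best-fitting line y = a + b*x
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
a = mean_y - b * mean_x

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(round(b, 2), round(a, 2), round(r_squared, 3))
```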
The salary-experience pair gives a better regression analysis than the gold-silver price pair. Since the p-value of both cause variables is less than 0.05, we can assume that both cause variables are significant; however, one analysis is considerably weaker than the other due to their respective R-squared values.
No autocorrelation: The error term associated
with a particular value of each independent
variable is assumed to not depend on the residual
of another value. In other words, a residual term is
not correlated with other residual terms.
Normality: The error term is assumed to be a
normally distributed random variable for all the
values of the observations.
Constant variance: The residuals of data points
are assumed to have equal variance for all the
values of the observations.
No multicollinearity: The independent variables in
the regression model should not be highly
correlated with one another.
BASICS
Normal distribution, also known as the Gaussian
distribution, is a probability distribution that is
symmetric about the mean.
The standard normal distribution is a special case of the normal distribution. It is the distribution that occurs when a normal random variable has a mean of zero and a standard deviation of one.
REGRESSION RESULTS INTERPRETATION

R-Square is a metric that is used to evaluate the simple linear regression model developed.
P-Value estimates the significance of any independent variable in explaining the dependent variable.

Multicollinearity

We need to assess multicollinearity between independent variables. If multicollinearity is high, significance tests on regression coefficients can be misleading; but if multicollinearity is low, the same tests can be informative.
Every normal random variable X can be transformed into a z-score via the following equation:

z = (X − μ) / σ
A z-score (aka, a standard score) indicates how
many standard deviations an element is from the
mean.
Z Score = 0 → element equal to the mean
Z Score < 0 → element below the mean
Z Score > 0 → element above the mean
Z Score = 1 → element 1 SD above the mean

68% of elements have a Z score between −1 and 1
95% of elements have a Z score between −2 and 2
99.7% of elements have a Z score between −3 and 3
The probability of committing a Type I error is called
α, the level of significance.
Based on your chosen confidence level, you can check whether the p-value is within the expected bounds. At a 95% confidence level, the p-value should be less than 0.05 in order for you to successfully reject the null hypothesis and establish the significance of that variable.
Residual Plots
Among residual plots, a random pattern implies the best regression analysis.
HYPOTHESIS TESTING
Population Standard Deviation Known
If the population standard
deviation, sigma, is known, then
the population mean has a normal
distribution, and you will be using
the z-score formula for sample means. The test
statistic is the standard formula you've seen before.
The critical value is obtained from the normal table, or
the bottom line from the t-table.
Population Standard Deviation Unknown
If the population standard
deviation, sigma, is unknown, then
the population mean has a
student's t distribution, and you
will be using the t-score formula for sample means.
The test statistic is very similar to that for the z-score,
except that sigma has been replaced by s and z has
been replaced by t.
The critical value is obtained from the t-table. The
degrees of freedom for this test is n-1.
If you're performing a t-test where you found the
statistics on the calculator (as opposed to being given
them in the problem), then use the VARS key to pull up
the statistics in the calculation of the test statistic. This
will save you data entry and avoid round off errors.
HYPOTHESIS PROCESS
1. State the null hypothesis H0 and alternative
hypothesis.
2. Decide on the significance level, α.
3. Compute the value of the test statistic.
4. Critical value approach: Determine the
critical value.
5. Critical value approach: If the value of the test statistic falls in the rejection region, reject H0; otherwise, do not reject H0.
4. P-value approach: Determine the p-value.
5. P-value approach: If p≤α, reject H0;
otherwise, do not reject H0.
6. Interpret the result of the hypothesis test.
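The critical-value steps above can be sketched with the t-test numbers used earlier in these notes (x̄ = 280, μ0 = 300, s = 50, n = 15); 2.145 is the standard two-tailed t-table value for α = 0.05 with 14 degrees of freedom:

```python
import math

# Steps 1-2: H0: mu = 300 vs. Ha: mu != 300, alpha = 0.05.
# Population sigma is unknown, so this is a t-test with the sample s.
x_bar, mu0, s, n = 280, 300, 50, 15

# Step 3: compute the test statistic.
t = (x_bar - mu0) / (s / math.sqrt(n))

# Step 4: critical value from a t-table, two-tailed, df = n - 1 = 14.
t_critical = 2.145

# Step 5: reject H0 only if the statistic falls in the rejection region.
decision = "reject H0" if abs(t) > t_critical else "do not reject H0"
print(round(t, 3), decision)  # -1.549 do not reject H0
```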
Correlation and Chi-square Test
for Independence
http://www.realstatistics.com/correlation/dichotomous-variables-chisquare-independence-testing/
Hypothesis Testing
https://psychology.illinoisstate.edu/jccutti/psych240/chpt8.html