
Business & Economics Statistics Lecture Notes

ECMT1010: Business and Economics Statistics A Notes
The University of Sydney
Summary of Notes Covered
1. Collecting Data
2. Describing Data
3. Confidence Intervals
4. Hypothesis Tests
5. Approximating with a Distribution
6. Inference for Means and Proportions
7. Inference for Regression
8. Probability
Part 1 -> Collecting Data
Data: A set of measurements extracted and stored in a dataset, based on individuals, groups or countries.
• Cases: Any individual item that can be analysed or observed
• Variables: Any characteristics or traits specific to a certain case
Categorical: Variables classified into groups, such as gender, medals, hobbies etc.
Quantitative: Variables that take a numerical amount, such as age, height, weight etc.
Explanatory: Variables that help us explain the cause in a scenario. They are generally listed first.
Response: Variables whose value will be impacted by the value of the explanatory variable.
• Essentially highlights the effect the explanatory variable has
Sample population cycle: The Big Picture
Diagram: Population → (Sampling) → Sample → (Statistical Inference) → Population.
Inference: Statistical inference is where data collected from a sample group gives a generalisation of the population as a whole.
Sample: Collecting data from a group of individuals who are a subset (group) of a whole population.
Sample Bias: Sample bias occurs where a chosen sample may be different to the overall population.
• Any generalisation from this sample will therefore be misleading and inaccurate
When sampling, we should be careful to minimise any forms of bad sampling. Some of these forms are:
• Sampling units that are obviously related to the variable you are studying: to sample accurately, we SHOULDN'T survey a group that is closely tied to the variable of interest. An example of this is a personal trainer poll on a fitness website.
• Volunteer bias, where you let the sample be whoever would like to participate: allowing people to volunteer for the sample can be bad, as volunteers tend to hold stronger, more personal opinions. An example of this is emailing customers about flight experiences.
• Context: sometimes the context of a scenario can give an indication of what the answer should be, which defeats the purpose of the survey. An example of this is conducting a pregnancy survey while providing additional information about the negatives of having children, which would obviously influence people against pregnancy.
• Wording: sometimes the way a question is worded can influence the outcome. For example, if the government proposed spending on medicine as opposed to tax cuts, rather than tax cuts by themselves, the majority would rather medicine.
• Lazy responses: sometimes surveys are rushed or neglected simply because the respondents don't care much about the survey.
Confounding Variable: A third variable that is associated with both the explanatory (cause) variable and the response (effect) variable.
Part 2 -> Describing Data
One Categorical Variable: Here we consider the proportion of cases that fall into a certain outcome (e.g. comparing people who agree against those who disagree or don't know).
NOTE: Proportions are also known as relative frequencies. We are able to calculate proportions as follows:
π‘ƒπ‘Ÿπ‘œπ‘π‘œπ‘Ÿπ‘‘π‘–π‘œπ‘› 𝑖𝑛 π‘Ž πΆπ‘Žπ‘‘π‘’π‘”π‘œπ‘Ÿπ‘¦ =
π‘π‘’π‘šπ‘π‘’π‘Ÿ 𝑖𝑛 π‘‘β„Žπ‘Žπ‘‘ π‘π‘Žπ‘‘π‘’π‘”π‘œπ‘Ÿπ‘¦
π‘‡π‘œπ‘‘π‘Žπ‘™ π‘›π‘’π‘šπ‘π‘’π‘Ÿ
NOTE: When writing the notation for a proportion:
• p denotes the proportion of a population
• p̂ denotes the proportion of a sample
Two Categorical variables
Sometimes we might ask questions like “does the opinion differ between male and female?” These
types of questions ask about a relationship between two categorical variables. NOTE that the
categorical variables are the opinion and also gender.
For this, we set up a two-way table, which has one of the categorical variables along the rows and the other categorical variable along the columns. It is often useful to add a total row and column. NOTE: the total for the two-way table should be the same as the total for a frequency table if you are looking at the same scenario.
Two Way Table: A two-way table shows the relationship between two categorical variables by
plotting the categories of one variable along the rows and the other along the columns.
NOTE: In the example considered here, the two categorical variables are gender and type of award.
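As an illustration, here is a minimal sketch of building a two-way table in Python with pandas (the gender/award data is invented for the example):

```python
# A minimal sketch of a two-way table with pandas; the gender/award data
# below is invented for illustration.
import pandas as pd

data = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "award":  ["Nobel", "Olympic", "Olympic", "Nobel",
               "Academy", "Olympic", "Nobel", "Academy"],
})

# crosstab puts one categorical variable along the rows and the other along
# the columns; margins=True adds the total row/column mentioned above.
print(pd.crosstab(data["gender"], data["award"], margins=True))
```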
When observing a histogram, we have to ask ourselves whether the data is symmetric or skewed. There are 4 cases that can occur:
• Symmetric: the data is considered symmetric if we could fold the graph in the middle and both sides would be similar to each other
• Right skewed: the data is considered right skewed if most of the data is on the left, with an extended tail running out to the right
• Left skewed: the data is considered left skewed if most of the data is on the right, with an extended tail running out to the left
• Bell-shaped: a bell-shaped histogram looks like one hill that slopes up and then down
Mean: The mean of a given quantitative variable is the average of the numerical data. This can be
denoted in the following ways:
• µ denotes the average of a given population
• x̄ denotes the average of a sample within that population

Mean = Sum of all data values / Number of data values = (x1 + x2 + x3 + ... + xn) / n = Σx / n
Median: The median is the middle entry of an ordered data list if the list has an odd number of values, or the average of the two middle values if the ordered list has an even number of values. This means that the median splits the data in half.
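A quick sketch of the mean and median in Python (the data values are arbitrary):

```python
# A quick sketch of the mean and median; the data values are arbitrary.
import statistics

values = [3, 7, 8, 5, 12, 14, 21, 13, 18]

mean = sum(values) / len(values)       # Σx / n
median = statistics.median(values)     # middle entry of the ordered list

print(f"mean = {mean:.2f}, median = {median}")
```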
Resistance
Sometimes in statistics we get outliers, which raises the question of whether a statistic is resistant (robust). Outliers have different effects on the mean and the median.
Resistance/Robust: A statistic is resistant when it is relatively unaffected by extreme or large values. The median is resistant but the mean is not.
• In other words, the median is resistant because no matter how extreme a value is, the median is relatively unaffected, whereas the mean would be severely impacted.
Standard Deviation: Standard deviation is the quantitative measure of the spread of data in a dataset. As the standard deviation increases, the spread of the data also increases. For a sample it is calculated by the following formula:

s = √( Σ(x − x̄)² / (n − 1) )
Similar to proportion, standard deviation also has its own notation.
• 's' denotes the standard deviation for a sample group, measuring how spread out the data is from the sample mean x̄
• σ denotes the standard deviation for a population, measuring how far the data is from the population mean µ
Note: Standard deviation allows us to determine how many deviations a certain value is from the
mean (Eg: 1 deviation, 2 deviation etc.)
IMPORTANT: When looking at a bell-shaped symmetric curve, approximately 95% of the values should fall within 2 standard deviations of the mean. Mathematically, this means the values should fall between x̄ − 2s and x̄ + 2s.
When analysing data, we are particularly interested in the centre and spread of a distribution. To do
this we look at how many standard deviations a value is from the mean. This is called the z-score
Z-Score: Measures how many standard deviations a given data value is from the mean.
Mathematically, this is calculated using the following formula:
𝑍 π‘ π‘π‘œπ‘Ÿπ‘’ =
π‘₯ − π‘₯π‘π‘Žπ‘Ÿ
𝑠
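A small sketch of the sample standard deviation and a z-score in Python (arbitrary data):

```python
# A sketch of the sample standard deviation and a z-score; data is arbitrary.
import statistics

values = [3, 7, 8, 5, 12, 14, 21, 13, 18]
xbar = statistics.mean(values)
s = statistics.stdev(values)     # sample stdev: √(Σ(x − x̄)² / (n − 1))

x = 21
z = (x - xbar) / s               # how many stdevs x sits from the mean
print(f"mean = {xbar:.2f}, s = {s:.2f}, z = {z:.2f}")
```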
Percentiles: When looking at a range of data, we can describe the data using percentiles. EXACTLY LIKE HSC MARKS, the percentile of a given mark highlights that the mark beat a certain % of all other marks. Some examples are:
• 92nd percentile in mathematics means that the student's mark beat 92% of all other mathematics marks
• 21.5th percentile in visual arts means that the student's mark beat 21.5% of all other visual arts marks
Using the above information, we now look at the idea of the 5-number summary. The 5-number summary uses the minimum and maximum (first and last values), the median, and the quartiles Q1 (midway between the minimum and the median) and Q3 (midway between the median and the maximum).
5 Number Summary: We divide the data into the minimum, Q1, median, Q3 and maximum.
Using the 5 Number summary, we can also find the range and interquartile range of a given data
spread.
• Range: Maximum − Minimum
• Interquartile Range (IQR): Q3 − Q1
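A sketch of the 5-number summary, range and IQR using numpy percentiles (illustrative data; note that software packages differ slightly in how they interpolate quartiles):

```python
# A sketch of the 5-number summary, range and IQR via numpy percentiles.
import numpy as np

values = np.array([3, 7, 8, 5, 12, 14, 21, 13, 18])

minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
print("five-number summary:", minimum, q1, median, q3, maximum)
print("range =", maximum - minimum)
print("IQR   =", q3 - q1)
```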
Correlation: Correlation measures the strength and direction of a linear association between 2
quantitative variables.
The notation for correlation is expressed as:
• r for the correlation between two quantitative variables in a sample
• ρ (rho) for the correlation between two quantitative variables in a population
We can visualise the correlation between two variables on their associated scatterplots. Based on these, the properties of correlation are:
• Correlation will always be between −1 and 1
• The sign of the correlation indicates the direction of association
• Correlations closer to −1 or 1 have a stronger linear association
• Correlation = 0 means no linear association
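A sketch of computing the sample correlation r with numpy (the data is invented and deliberately close to linear):

```python
# A sketch of the sample correlation r with numpy; invented data.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

r = np.corrcoef(x, y)[0, 1]   # off-diagonal of the 2x2 correlation matrix
print(f"r = {r:.3f}")          # near +1: strong positive linear association
```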
Correlation Cautions
When looking at correlations, there are a number of factors we need to be careful to avoid:
• A strong correlation between two quantitative variables does NOT mean that there is causation involved (cause and effect)
• A correlation near 0 doesn't mean that two quantitative variables are not associated, because correlation only tests linear association
• Outliers can heavily influence correlation. Be sure to plot your data carefully
As we already know the equation of a straight line, we can adapt that equation in order to find the regression line for any given explanatory and response variable. Mathematically, the regression line is:

ŷ = b0 + b1 x

where:
• b1 = the slope of the regression line
• b0 = the constant or vertical intercept
As shown in the above formula, the response variable is on the left, representing "y", while the explanatory variable is on the right, representing "x". This means that the response variable is always a function of the explanatory variable.
Using the above general regression formula, we can estimate the value of the response variable given a value of the explanatory variable. However, it's important to note that we are predicting the response variable's value; the true value could be above or below this point.
This means we get the following definitions:
• 'y' is the observed response value, the actual value for a particular data point
• ŷ (y-hat) is the predicted response value, the value we estimate from the regression line formula
Residual -> Residuals are the difference between the observed response values and the predicted response values. On a scatterplot, the residual is the vertical distance of the observed response 'y' away from the regression line value ŷ.

Residual = Observed − Predicted
Residual = y − ŷ
REMEMBER: Our objective is to find the regression line of best fit. To do this, we must calculate a line that is as close to all the scatterplot values as possible, using the least squares line.
Least Squares Line (LSL): The regression line which minimises the sum of all the squared residuals from the scatterplot.
We achieve the LSL by minimising: Σ(y − ŷ)²
NOTE -> In regression modelling, it is HIGHLY IMPORTANT that the explanatory and response
variables are properly distinguished otherwise different values will be calculated
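A sketch of fitting a least squares line with numpy (illustrative data); np.polyfit minimises the sum of squared residuals described above:

```python
# A sketch of fitting the least squares line ŷ = b0 + b1·x with numpy;
# np.polyfit minimises Σ(y − ŷ)². The data is invented.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

b1, b0 = np.polyfit(x, y, deg=1)     # slope, then intercept
y_hat = b0 + b1 * x                  # predicted responses
residuals = y - y_hat                # observed − predicted

print(f"y-hat = {b0:.2f} + {b1:.2f}x")
print("sum of squared residuals:", np.sum(residuals**2))
```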
Part 3 -> Confidence Intervals
Point Estimate: A single value or statistic used as an estimate of a population parameter.
• NOTE that this number isn't exactly the population parameter, but is a close approximation of the true population parameter.
Since the population parameter is fixed and won't change, we can study the variability of sample statistics in order to get a closer and closer estimate of the true value.
This means that we get different samples OF THE SAME SIZE in a population and calculate the
statistic in each sample. From this we then compare the samples and look for any variation or
variability.
• Low variation: the sample statistic for each sample is very close to the others, showing accuracy
• High variation: the sample statistics for each sample are far apart from each other, raising questions about accuracy
CASE STUDY EXAMPLE: Using the case study example 1 from above about US presidential voting
polls, we get the following sample statistics results:
• Sample 1: 48% voted for Obama
• Sample 2: 47% voted for Obama
• Sample 3: 50% voted for Obama
As we can see, there is only about a 3 percentage point spread in the sample statistics gathered. In context, this is low variation, so the samples are reliable for estimating the true value of the population parameter.
Sampling Distribution: Distribution or variation in sample statistics for each sample in a given
population. These will have the following characteristics:
• Each sample statistic should be centred or plotted around the population parameter. E.g. the sample means should be centred around the population mean (which is the parameter)
• If our sample size is large enough, then we should get a symmetric, bell-shaped curve
• The standard error allows us to see how much the samples vary
NOTE: All these concepts apply heavily to random samples. Any non-random samples will give
heavily inaccurate results
Standard Error: A type of standard deviation, standard error looks at how spread the samples are
in a given population. We can calculate this with a formula similar to standard deviation.
It is important to understand the difference in concepts here. The standard error looks at the standard deviation (variation) of sample statistics across samples from a population, whereas the standard deviation of a sample looks at the spread of the individual values within one particular sample.
NOTE: As the size of a sample gets larger, the standard error or variation will decrease as the
sample size starts to become a more accurate representation of the population.
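A simulation sketch of this idea: drawing many random samples from an artificial population and watching the standard error shrink as the sample size grows (population and sizes are invented for the demonstration):

```python
# A simulation sketch: the standard error (the stdev of many sample means)
# shrinks as the sample size grows. Population and sizes are invented.
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(loc=50, scale=10, size=100_000)

for n in [10, 100, 1000]:
    # draw 1000 random samples of size n and record each sample mean
    means = [rng.choice(population, size=n, replace=False).mean()
             for _ in range(1000)]
    print(f"n = {n:4d}: SE = {np.std(means):.3f}")
```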
Interval Estimate: A range of values in which the parameter is situated (the parameter lies within the range of values).
Margin of Error: The precision of the sample statistic as a point estimate for the parameter.
• The give-or-take value around the statistic
Confidence Interval: An interval constructed so that a stated proportion of all samples' intervals (ranges of values) contain the parameter.
• E.g. 85% of the sample intervals given contain the parameter proportion 'p'
As we already know, for a bell shaped and symmetric curve, 95% of the data should fall within 2
standard deviations. Therefore, we can calculate the confidence interval such that 95% of the
sample intervals should run through the parameter. The formula for calculating the confidence
interval is:
π‘ͺπ’π’π’‡π’Šπ’…π’†π’π’„π’† 𝑰𝒏𝒕𝒆𝒓𝒗𝒂𝒍 = π‘Ίπ’•π’‚π’•π’Šπ’”π’•π’Šπ’„ ± 𝟐 × π‘Ίπ’•π’‚π’π’…π’‚π’“π’… 𝑬𝒓𝒓𝒐𝒓
π‘ͺπ’π’π’‡π’Šπ’…π’†π’π’„π’† 𝑰𝒏𝒕𝒆𝒓𝒗𝒂𝒍 = π‘Ίπ’•π’‚π’•π’Šπ’”π’•π’Šπ’„ ± π‘΄π’‚π’“π’ˆπ’Šπ’ 𝑬𝒓𝒓𝒐𝒓
NOTE: The difference between margin of error and standard error:
• The margin of error is the value that is added to or subtracted from the statistic to form the confidence interval
• The standard error is the standard deviation of many sample statistics put together
Diagram: a dot plot containing all of the sample statistics. Using the 95% rule, when observing a symmetric, bell-shaped curve, approximately 95% of the statistics should fall within ±2 standard errors of the population parameter (the centre in this case). This is similar to the 95% standard deviation rule for the values of a given sample.
Bootstrapping: Using a single sample to estimate the standard error and construct a 95% confidence interval for the population parameter.
• This is done because it is very difficult to collect multiple samples in practice
Therefore, bootstrapping involves replicating or reproducing the data so as to create an artificial population. NOTE: This concept was covered in the tutorial questions, where we used STATKEY to produce 7000 samples and place them onto a dot plot.
Graph: a dot plot of 7000 bootstrap samples generated using bootstrapping.
NOTE that literally rebuilding the population isn't possible in practice. Instead, we sample with replacement from the original sample: we randomly choose an item from the sample and then place the item back into the original sample before drawing again, thus creating bootstrap samples.
Centre: The difference between the centre of a sampling distribution and a bootstrap distribution:
• Sampling distribution: the centre is the population parameter
• Bootstrap distribution: the centre is the original sample statistic before resampling occurred
Some important things to note for bootstrap samples are that:
• The bootstrap sample has to be the same size as the original sample
• The bootstrap sample draws from the same values as the original sample, although the frequencies of those values don't have to match
• Bootstrapping only works when the original sample is random
Bootstrap Definitions
• Bootstrap sample: a sample created by the process of resampling with replacement
• Bootstrap statistic: the statistic or variable calculated on this copied or bootstrapped sample
• Bootstrap distribution: the distribution of many bootstrap statistics
We can use the concept of bootstrapping in order to estimate the standard error of the sample
statistic. To do this, we would use the bootstrap distribution in order to calculate the bootstrap
standard deviation.
By doing this, we will be able to obtain a good approximation of a sample statistics standard error. A
mathematical way of seeing this is given below:
SE of a statistic = Standard deviation of the bootstrap statistics
With the above bootstrapping, we can construct a 95% confidence interval
• NOTE -> This is very similar to both the 95% rule for a sample and the 95% confidence interval for a sampling distribution

Confidence Interval = Statistic ± 2 × Standard Error
Part 4 -> Hypothesis Tests
The focus of this section is statistical data and whether the sample statistic is convincing enough to support a true inference about the population.
Statistical Test -> Uses data from a sample to assess how accurate or convincing a claim about the population is.
• We answer how convincing the data is by using a null hypothesis and an alternative hypothesis.
Null Hypothesis (H0): A claim or statement that there is no difference or effect
Alternative Hypothesis (Ha): The claim for which we seek evidence, namely that there is an effect and therefore the alternative hypothesis is true.
• The aim is to provide enough evidence so that we can rule out the null hypothesis.
CASE STUDY QUESTION -> In this question, we are looking at whether leniency is greater when a
student smiles. The experimenters have no prior beliefs about the effect of smiling on leniency
and are testing to see if facial expression has any effect.
ANSWER: We use the parameters µsmile for the average leniency with a smile and µneutral for the average leniency with a neutral expression. In this case we are testing whether there is any effect on leniency if a student was smiling. Therefore the hypotheses are:
• Null Hypothesis: µsmile = µneutral
• Alternative Hypothesis: µsmile ≠ µneutral
Statistical Significance: When results as extreme as the sample statistic have little possibility of occurring by chance
• Simply put, the extreme results are unlikely to be random or coincidental
The importance of statistical significance is that if we have a sample that is statistically significant, then we have enough evidence to support the alternative hypothesis (Ha) and reject the null hypothesis (H0).
Another useful measure is using P values. P values are used to measure the strength or significance
of the sample statistic in order to support the alternative hypothesis.
P-value -> Assuming the null hypothesis is true, the probability of obtaining a sample statistic as extreme as (or more extreme than) the one observed
The idea with p-values is that as the p-value gets smaller and smaller, approaching 0, the statistical evidence gets stronger. As a result, the alternative hypothesis is favoured even more.
• IDEA -> smaller p-values are stronger because they show that there is much less chance that such an extreme value occurs randomly or by coincidence
Method: Use randomisation methods to estimate p-values (assuming the null hypothesis is true)
Procedure
• We use the process of randomisation samples, where we simulate or generate samples that are consistent with the null hypothesis
• We then calculate the sample statistic for each generated sample and plot these on a distribution
• If the observed statistic falls in a section of the distribution that is unlikely, such as the tails or extreme outliers, then we have evidence to reject the null hypothesis and instead favour the alternative hypothesis
Randomisation Distribution: Assuming the null hypothesis is true, we generate many samples and analyse each statistic to see whether it falls in a likely section of the distribution curve. Statistics that fall in an unlikely area provide strong evidence against the null.
Graph: a randomisation distribution for the number of dogs that run to their owners. When using the distribution, we look at how often an event occurs both at the observed value and beyond it; this proportion becomes our p-value.
Using the graph as an example, we have an original sample proportion statistic of 16/25 = 0.64. Generating many randomisation samples and plotting them, we found the proportion of samples with a count greater than or equal to 16 to be 0.1145. This becomes our corresponding p-value.
As a result, this p-value shows us that a person who randomly guesses will get 16/25 or more only 11.45% of the time.
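A simulation sketch of this example: under the null of random guessing (p = 0.5), we generate many randomisation samples of 25 guesses and count how often 16 or more are correct. The estimate should land near the 0.1145 quoted above:

```python
# A simulation sketch of the dog example: under the null of random guessing
# (p = 0.5), how often do 25 guesses contain 16 or more correct answers?
import numpy as np

rng = np.random.default_rng(2)
n, observed = 25, 16

# each randomisation sample is 25 fair "coin-flip" guesses
counts = rng.binomial(n=n, p=0.5, size=100_000)
p_value = np.mean(counts >= observed)
print(f"estimated p-value = {p_value:.4f}")   # should land near 0.1145
```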
Methods to Estimate P value
When using a randomisation distribution, we can use the following methods:
• One-tail alternative: find the proportion of randomisation samples that are at least as extreme as the observed value in the indicated direction (to the left or right of the curve). This is the method we have been using so far.
  o NOTE: One-tail alternatives are used for Ha: x1 > x2 or x1 < x2
• Two-tail alternative: using the smaller tail, find the proportion of samples at least as extreme as the observed value and double this to account for the other tail.
  o NOTE: Two-tail alternatives are used for Ha: x1 ≠ x2
Left diagram: applying the two-tail method, finding the proportion in the smaller tail and then doubling it. Right diagram: applying the one-tail method, finding the proportion in the direction given (in this case, to the left).
For very small p-values, there is a very small chance that the sample would occur by random guessing. Therefore we have evidence to favour the alternative hypothesis Ha instead of the null hypothesis H0.
As the p-value → 0, we favour Ha.
Decision          | Outcome
Reject H0         | Found evidence to support the alternative hypothesis Ha
Don't reject H0   | No significant evidence for either hypothesis; we must consider either hypothesis possible
Significance Level -> The cut-off point for a p-value, where:
• Below this value, we have statistical evidence to reject the null hypothesis and favour the alternative hypothesis
  o p < significance level
• Greater than or equal to this value, we cannot reject the null hypothesis
  o p ≥ significance level
• Statisticians generally use common significance levels (cut-off points) of α = 0.01, α = 0.05 and α = 0.10
We can also use a graph to visually determine the cut-off point or significance level. Hence for any statistic values more extreme than this point, we would reject the null hypothesis H0.
Type I and Type II errors
There is a chance of error in the decisions of hypothesis testing. These errors are broken down into Type I errors and Type II errors.
Type I Error -> Occurs where the null hypothesis H0 is true but in fact we have rejected the null hypothesis in favour of the alternative hypothesis.
Type II Error -> Occurs where the null hypothesis H0 is false but in fact we do not reject the null hypothesis.
However, setting the significance level too high means we could be rejecting many true null hypotheses. When choosing a significance level, we should choose one that keeps the probability of a Type I error acceptably low.
The following criteria must hold when creating randomisation samples:
1. Be consistent with the null hypothesis
2. Use the original sample's data
3. Mirror the way the original data was collected
Randomisation Distribution Centre
• Where the null hypothesis is true, the distribution will be centred around the parameter value stated in the null hypothesis
  o E.g. if the parameter is a proportion and the null hypothesis H0: p = 0.9 is true, then the centre of the distribution will be p = 0.9
To conduct hypothesis testing, we can use any of the following randomisation tests:
• Differences in proportions
• Test for correlation
• Test for mean
1.) Randomisation Distribution for differences in proportions
When creating randomisation distributions to find the p-value for a difference in proportions, we take the difference in proportions and set our null hypothesis to be zero difference between the two. This means:
H0: p1 = p2, or p1 − p2 = 0
Using this null hypothesis, we generate our randomisation distribution centred at this zero difference, and based on the alternative hypothesis we determine which side of the graph to examine.
We then calculate the observed difference in proportions and locate it on the graph. The proportion of randomisation statistics at that value and beyond, in the given direction, is our p-value.
2.) Randomisation Distribution for Correlation
When testing a correlation using randomisation methods, we again centre the distribution around the null hypothesis value. Then we use our alternative hypothesis to find the proportion of values that lie to the left or to the right of the sample correlation.
3.) Randomisation Distribution for Mean
When looking at questions about means, we always centre our randomisation distribution around the null hypothesis value.
However, the difference with mean questions is that our alternative hypothesis will typically argue that the given mean is not true, i.e. ≠
• Use the two-tailed method when calculating the p-value
The graph below shows how our mean distribution method is centred around the null hypothesis.
When we calculate and plot our sample mean, we use the two-tailed method to calculate the p
value for the lower tail and then multiply by 2 to get both sides of the graph.
Bootstrap: Construct a confidence interval from a bootstrap distribution and check whether the null hypothesis value falls inside this confidence interval. A null hypothesis value will generate either a small p-value or a relatively larger p-value, which can be used to determine whether or not to reject the null hypothesis.
NOTE -> When looking at confidence intervals, the significance level is the percentage of data not
within that confidence interval. For example:
• A 95% confidence interval corresponds to a 5% significance level
• A 99% confidence interval corresponds to a 1% significance level
Ho is outside Confidence Interval
If the null hypothesis value is outside the confidence interval, then we should reject the null hypothesis, as the p-value will be smaller than the significance level. This is because the p-value calculated for the null value would cover a smaller area of the bootstrap distribution than the 5% significance level lying outside the 95% confidence interval.
If Ho outside confidence interval, then reject Ho
Ho is inside the Confidence Interval
If the null hypothesis value is found inside the confidence interval (say, inside a 95% confidence interval), then we can't reject the null hypothesis, because our p-value will be greater than the significance level.
If Ho inside confidence interval, then do not reject Ho
If the null value is inside the 95% confidence interval, then taking the values to the left/right and doubling will give a p-value greater than the 5% significance level (the other 5% of the graph). Therefore, we cannot reject the null hypothesis.
Part 5 -> Approximating with a Distribution
Density Curve -> Theoretical curve that describes the distribution of values. The characteristics of a
density curve are:
• The total area underneath the curve = 1
• Because of this, the area underneath the curve over an interval equals the proportion of values in that interval
Graph: a black curve drawn over the histogram represents the density curve.
A density curve can take on any shape; however, we will focus on normal density curves, which are symmetric and bell-shaped.
Graph: the area under a density curve over a given interval, visualised as a shaded region; this area equals the proportion in the interval.
Normal Density Curve: A bell-shaped, symmetric density curve with parameters mean (µ) and standard deviation (σ).
Density curves that are normally distributed are denoted as follows:
X ~ N(mean, stdev)
Where:
• 'N' specifies that the given distribution is a normal distribution
• X denotes the variable that follows the normal distribution we are observing
• the mean 'µ' is the centre value around which the symmetric, bell-shaped curve is centred
• the standard deviation shows the spread of the curve
Characteristics of Normal Density Curve
1. Bell shaped and symmetrical
2. Centred around the mean
3. 95% of the data falls within 2 standard deviations
Calculating Percentiles and Normal Areas/Probabilities/Proportions
We can visualise the area under the curve for a given interval; however, the integral involved in calculating the area is very complicated. Therefore we use StatKey. When using this method, we need to provide the following information:
1. The mean and standard deviation
2. Endpoints of the interval (which values to calculate between)
3. Direction in which to calculate (are we calculating values to the left or values to the right)
Standard Normal
Standard Normal -> Standard Normal is a distribution that follows the normal density curve BUT is
centred at 0 and has a standard deviation of 1. The significance of standard normal graphs is that
they are able to show how many standard deviations a statistic is from the population parameter. In
other words, it is able to show us the z-score for a given statistic.
The characteristics of a standard normal distribution are a mean = 0 and a standard deviation = 1. When referring to standard normal distributions, we use the following notation:
Z ~ N(0, 1)
Where:
• Z specifies that we are looking at a standard normal curve
• N specifies that it is a normal distribution curve
• µ = 0
• standard deviation = 1
With bell-shaped symmetric curves, it is possible to convert a normal distribution to the standard normal and vice versa. To convert from X ~ N(mean, stdev) to Z ~ N(0, 1), we use the following formula, which resembles the z-score formula and gives us a z-score for that value:

Z = (X − µ) / σ
Where:
• X is the value
• µ is the mean
• σ is the standard deviation
Question: How do we find percentiles using a standard normal distribution?
Answer: We reverse the process: find the endpoint on the standard normal curve that gives us the percentile, then convert this using X = µ + Zσ to find the corresponding value at which that percentile occurs.
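A sketch of these normal-curve calculations using scipy in place of StatKey (the mean and standard deviation are illustrative):

```python
# A sketch of normal areas and percentiles with scipy, in place of StatKey.
# The mean and standard deviation (100, 15) are illustrative.
from scipy import stats

mu, sigma = 100, 15
X = stats.norm(mu, sigma)

print(X.cdf(130))                # area to the LEFT of 130 (proportion below)
print(1 - X.cdf(130))            # area to the right of 130

# percentile -> value: find the endpoint Z on N(0,1), then use X = µ + Zσ
z = stats.norm(0, 1).ppf(0.90)   # 90th percentile of the standard normal
print(mu + z * sigma)            # same answer as X.ppf(0.90)
```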
Central Limit Theorem (CLT): Central Limit Theorem says that for a sufficiently large sample size, the
distribution of the sample statistics can be approximately found using a normal distribution
CLT Characteristics
• For a skewed distribution, the sample size 'n' needs to be larger
• For a quantitative variable: n ≥ 30
• For a categorical variable: at least 10 cases in each category (np ≥ 10 and n(1−p) ≥ 10)
Therefore, using normal distributions, we can use StatKey to calculate a confidence interval via the central limit theorem. NOTE that this confidence interval should be approximately similar to the confidence interval calculated from a bootstrap distribution.
When using a normal distribution curve of the form N(0, 1), we can use the following formula to calculate the endpoints of the confidence interval:
Confidence Interval = Statistic ± z* × SE
Where:
• z* is the z-score chosen so that the area between −z* and +z* gives us the desired confidence level. For example, z* = 1.645 gives a 90% confidence interval.
Summary of how to calculate a P% confidence interval
Step 1 -> Confirm that the sampling distribution can be approximated with a normal distribution (check whether the sample size is big enough)
Step 2 -> Find z* for the P% confidence interval (this is given on the formula sheet)
Step 3 -> Use Statistic ± z* × SE to calculate the confidence interval
Normal Distributions and P-values
In some cases, we can use a normal distribution to calculate the p-value for hypothesis testing. If our randomisation distribution has the shape of a normal distribution curve, we can assume the null hypothesis is true and use our statistic to calculate a p-value (the area underneath the curve beyond the observed statistic).
Test Statistic -> The number of standard errors (a z-score) a sample statistic is from the null hypothesis value
Following this, the p-value is calculated by taking the proportion of the distribution to the left or right of the z-score.
Summary -> Calculating the p-value for H0 in a standardised normal distribution:
Step 1 -> Find the standardised test statistic "z"
Step 2 -> Calculate the p-value by taking the proportion of the distribution to the left or right, depending on the alternative hypothesis
Part 6 -> Inference for Means and Proportions
For a given sample proportion from a random sample, we can use the sample size and the proportion to calculate the standard error (SE), assuming the sample proportion is representative of the population proportion. Mathematically:

SE = √( p(1−p) / n )
Sample Size -> Central Limit Theorem
When using random samples, a normal distribution may not always apply. For a large sample size, a normal distribution gives us the distribution of the sample proportion and its standard error, so long as the following conditions are satisfied:
np ≥ 10
n(1−p) ≥ 10
When the above Central Limit Theorem criteria are satisfied, our sample size is sufficiently large and the distribution of the sample proportion is approximately:
p̂ ~ N( p, √(p(1−p)/n) )
Since the population proportion has not been given, we use the fact that under the CLT a large sample means the sample statistic is very close to the population parameter. Therefore, for a sample that satisfies the conditions:
np̂ ≥ 10
n(1−p̂) ≥ 10
we can calculate the confidence interval as follows:
Confidence Interval = p̂ ± z* × √( p̂(1−p̂) / n )
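A sketch of a CLT-based 95% confidence interval for a single proportion, with invented counts:

```python
# A sketch of a CLT-based 95% confidence interval for one proportion,
# using invented counts (240 "yes" answers out of n = 500).
from math import sqrt
from scipy import stats

count, n = 240, 500
p_hat = count / n

assert n * p_hat >= 10 and n * (1 - p_hat) >= 10   # CLT conditions

z_star = stats.norm.ppf(0.975)                     # ≈ 1.96 for 95%
se = sqrt(p_hat * (1 - p_hat) / n)
print(f"CI: {p_hat - z_star*se:.3f} to {p_hat + z_star*se:.3f}")
```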
Sample Size for Confidence Interval
When planning a confidence interval, we want to know how large our sample size needs to be. If we know the desired margin of error (ME) for our proportion/CI, we can use it to calculate the sample size.
Special -> The sample size 'n' can be chosen using the ME. From the confidence interval formula, the ME is:
ME = z* × √( p̂(1−p̂) / n )
Rearranging this means we can use the ME to calculate the sample size:
n = ( z* / ME )² × p̂(1−p̂)
Z-score: Measures how far a sample proportion is from the null hypothesis value. For hypothesis testing this is:
z = (p̂ − p0) / SE
Where:
• p̂ is the sample statistic that we are given
• p0 is the proportion assumed true under the null hypothesis
Using the Central Limit Theorem, we calculate the SE by assuming the null hypothesis is true:
SE = √( p0(1−p0) / n )
Therefore, when the Central Limit Theorem criteria are valid, the z-score statistic can be calculated using the above formulas.
When observing means, we are using quantitative variables. To calculate the SE of the sample mean for a normal distribution we use:
SE = σ / √n
However, given that in some cases we may not know the population standard deviation, we can estimate it from the sample standard deviation. Therefore, our formula for the SE can also be:
SE = s / √n
t-distribution: Used when the SE is estimated from sample statistics
Since we are using sample statistics, we do not use a normal distribution curve but instead a 't' distribution curve. It is important to know that 't' curves look similar to normal distribution curves but have fatter tails due to the added level of uncertainty.
t-distribution -> Distinguished by degrees of freedom, which we denote as df:
• as the degrees of freedom increase, the distribution more closely resembles a normal distribution curve
IMPORTANT: When we are using a sample mean and sample standard deviation, we associate the curve with a 't' distribution that has n−1 df.
• Hence for a sample mean and sample stdev, we use t(n−1)
Characteristics: When using sample means with sample standard deviations:
• Centre: the mean is the same as the population mean µ
• Spread: the spread of the data is given by the SE formula, where we use 's' for the sample standard deviation
• Shape: standardised sample means follow a 't' distribution with n−1 degrees of freedom, written t(n−1)
• These results apply when the sample size is large (n ≥ 30) or the underlying data is approximately normal
A 't' distribution curve is one whose tails are a tiny bit fatter than a normal distribution curve's.
Confidence intervals can still be calculated in a t distribution. When using sample means and 't' distributions, we are effectively using a bell-shaped distribution with n−1 degrees of freedom. Therefore, we calculate a confidence interval for a single mean using:
Confidence Interval = x̄ ± t* × s/√n
Where:
• t* is the endpoint of a 't' distribution with n−1 degrees of freedom
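A sketch of a t-based 95% confidence interval for a single mean, with an illustrative sample:

```python
# A sketch of a t-based 95% confidence interval for a single mean, with an
# illustrative sample; t* comes from a t distribution with n-1 df.
from math import sqrt
from statistics import mean, stdev
from scipy import stats

sample = [12, 15, 9, 14, 20, 11, 16, 13, 18, 10]
n = len(sample)
xbar, s = mean(sample), stdev(sample)

t_star = stats.t.ppf(0.975, df=n - 1)   # endpoint for a 95% interval
me = t_star * s / sqrt(n)               # margin of error
print(f"CI: {xbar - me:.2f} to {xbar + me:.2f}")
```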
Sample Size for Confidence Interval
Similar to proportions, we can determine how large our sample should be using the margin of error (ME).
Using the confidence interval formula, our ME will be:
ME = t* × s/√n
In cases where we don't yet have a sample standard deviation, we can estimate 's' using any of the 4 following methods:
• Use previous sample data
• Use a small pilot sample to estimate 's'
• Find the range and divide it by 4
• Make any reasonable guess
Therefore, when we want to find the sample size for a given margin of error, we can use the following formula (using z* since n is not yet known):
n = ( z* × s / ME )²
Hence, when we have a hypothesised population mean and would like to use hypothesis testing, we calculate the standardised test statistic using the formula:
t = (x̄ − µ0) / (s/√n)
where:
• µ0 is the population mean given in the hypothesis test
Using the Central Limit Theorem, we find the t-statistic and use it to test the null hypothesis. NOTE that the 't' statistic plays the same role as a z-score: it is used to find the p-value that we need to either reject the null or not reject the null.
In order to estimate the difference in proportions p1 − p2, we use the sample statistic p̂1 − p̂2. From this estimate, we can estimate the standard error for the difference in sample proportions using:
SE = √( p̂1(1−p̂1)/n1 + p̂2(1−p̂2)/n2 )
Central Limit Theorem for Difference in Proportions
Similar to a single proportion, we need to check certain criteria to determine whether the sample sizes are large enough to use the Central Limit Theorem. The criteria are essentially the same but are applied to each sample proportion:
n1p̂1 ≥ 10, n1(1−p̂1) ≥ 10, n2p̂2 ≥ 10, n2(1−p̂2) ≥ 10
When these criteria are sufficiently met, the sample sizes are large enough to apply the Central Limit Theorem, and the distribution for a difference in proportions is:
p̂1 − p̂2 ~ N( p1 − p2, √( p1(1−p1)/n1 + p2(1−p2)/n2 ) )
Using the above mean and standard error, we can calculate the confidence interval for a difference in proportions:
Confidence Interval = (p̂1 − p̂2) ± z* × √( p̂1(1−p̂1)/n1 + p̂2(1−p̂2)/n2 )
When using hypothesis testing for a difference in proportions, we use the following notation in
order to define the null hypothesis:
H0: p1 = p2
Therefore, we substitute this into the z-score formula to get the following formula for the accompanying z-score:
z = ( (p̂1 − p̂2) − 0 ) / SE
NOTE: In this formula, the population proportions are equal under the null, so their difference is 0, which is why that term cancels out of the formula.
When calculating the standard error, we assume that the null hypothesis is true. HOWEVER, this gives us equal values for both population proportions, which creates a problem for calculating the SE. As a result, we use a pooled proportion to solve this and therefore correctly calculate the SE.
Pooled Proportions -> We combine all the samples into one big sample and then calculate the proportion within this combined sample. The pooled proportion allows us to calculate the SE properly. Mathematically, the pooled proportion and the z-score involving the modified standard error are:
p̂ = (count1 + count2) / (n1 + n2)
z = (p̂1 − p̂2) / √( p̂(1−p̂)(1/n1 + 1/n2) )
NOTE: In this z-score formula, the SE part uses the pooled p̂ instead of proportion 1 and proportion 2.
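A sketch of the pooled two-proportion z-test, with invented counts:

```python
# A sketch of the pooled two-proportion z-test under H0: p1 = p2, with
# invented counts. The pooled p-hat combines both samples into one.
from math import sqrt
from scipy import stats

count1, n1 = 60, 200
count2, n2 = 45, 220
p1_hat, p2_hat = count1 / n1, count2 / n2

p_pool = (count1 + count2) / (n1 + n2)               # pooled proportion
se = sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))     # SE assuming the null
z = (p1_hat - p2_hat) / se

p_value = 2 * stats.norm.sf(abs(z))                  # two-tailed p-value
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```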
Where the Central Limit Theorem criteria are met and our sample sizes are sufficient, the distribution is centred around the difference in means, with SE found using:
SE = √( σ1²/n1 + σ2²/n2 )
Therefore, under a normal distribution we get:
x̄1 − x̄2 ~ N( µ1 − µ2, √( σ1²/n1 + σ2²/n2 ) )
Since we estimate with sample standard deviations, we use a 't' distribution, centred around the difference in population means. Our formula for the SE in the 't' distribution is:
SE = √( s1²/n1 + s2²/n2 )
For the degrees of freedom of the 't' distribution, we use the sample size that gives the lower degrees of freedom when finding the standardised test statistic.
If our samples are large enough (each greater than or equal to 30), we can calculate the confidence interval for the difference in means using:
Confidence Interval = (x̄1 − x̄2) ± t* × √( s1²/n1 + s2²/n2 )
NOTE -> the 't' endpoint t* is found using the sample size with the smaller degrees of freedom
This is used for hypothesis testing with a null hypothesis of no difference in the means. The necessary steps to carry out a hypothesis test for a difference in means are:
1. Check the sample sizes obey the Central Limit Theorem
2. Take the lower degrees of freedom
3. Calculate the SE
4. Use any method to calculate the p-value
5. Conclude based on the p-value and significance level
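A sketch of these steps for a difference in means, with two invented samples (the conservative df is the smaller sample size minus 1):

```python
# A sketch of a difference-in-means test following the steps above, with
# invented samples; the conservative df is the smaller sample size minus 1.
from math import sqrt
from statistics import mean, stdev
from scipy import stats

g1 = [23, 25, 28, 22, 26, 27, 24, 29, 25, 26]
g2 = [21, 24, 20, 23, 22, 25, 21, 23, 24, 22]

se = sqrt(stdev(g1)**2 / len(g1) + stdev(g2)**2 / len(g2))
t = (mean(g1) - mean(g2)) / se              # H0: no difference in means
df = min(len(g1), len(g2)) - 1              # the lower degrees of freedom
p_value = 2 * stats.t.sf(abs(t), df=df)     # two-tailed p-value
print(f"t = {t:.2f}, p-value = {p_value:.4f}")
```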
Matched pairs: we take the difference within each pair of observations, then carry out a single-mean test on those differences to find confidence intervals or p-values.
After finding the differences from the matched pairs experiment, we can use:
t = (x̄d − µ0) / (sd/√n)
where x̄d and sd are the mean and standard deviation of the differences. Our hypothesis testing then becomes -> H0: µd = value
Similarly, the confidence interval formula for a matched pairs experiment is:
Confidence Interval = x̄d ± t* × sd/√n
Part 7 -> Inference for Regression
Regression: An equation that estimates the relationship between two quantitative variables.
The formula for the least squares line for a sample was given by:
ŷ = b0 + b1x
Although this was for a sample, we can extend it to the regression model for a population:
y = β0 + β1x + ε
Where:
• β0 is the intercept for the population
• β1 is the slope for the population regression line
• ε is the random error for each data point, since points sit either above or below the line
Similar to means and proportions, we estimate the value of both the intercept and the slope, because we don't know the population parameters β0 and β1 exactly.
Inference for Intercept and Slope -> Confidence Intervals and Hypothesis Testing
Since we don't know the true values of the population intercept and slope, we can use both a confidence interval and a hypothesis test to assess them.
The formulas for a confidence interval and a hypothesis test for a regression coefficient are:
Confidence Interval = estimate ± t* × SE(estimate)
t = (estimate − null value) / SE(estimate)
NOTE -> Since we are estimating two values (slope and intercept), we use a 't' curve with degrees of freedom df = n−2
NOTE -> The SE can be found from a randomisation or bootstrap distribution (or from software output)
Hypothesis Testing: Generally we use the following hypotheses:
H0: β1 = 0 (no linear relationship)
Ha: β1 ≠ 0 (there is a linear relationship)
Hypothesis Testing for Correlation -> Tests the linear association between two variables.
When testing the association between two variables without the use of regression, the appropriate 't' statistic is:
t = r√(n−2) / √(1−r²)
which we compare against a t distribution with n−2 degrees of freedom to determine the p-value.
Relationship between Correlation and Regression
As we know, a correlation value will always fall between positive and negative 1. In fact, when we take the square of it, we get the coefficient of determination, and therefore a relationship between correlation and regression.
Definition: R² shows the proportion of the variability in the response values that is explained by the model (the predicted values).
Since R² is a proportion, the formula is given by:
R² = SSModel / SSTotal = r²
Conditions Criteria: The conditions that must be checked for regression inference to apply are:
1. The error (epsilon) values are randomly scattered above and below the line
2. The variability stays constant across the range of the data
3. The data follows a linear (not curved) pattern
π’š = π’šπ’ƒπ’‚π’“ + π’“π’†π’”π’Šπ’…π’–π’‚π’
𝑫𝒂𝒕𝒂 = 𝑴𝒐𝒅𝒆𝒍 + 𝑬𝒓𝒓𝒐𝒓
When observing the total variability, we are able to break down the formula into different
sections in order to analyse each variability or each error. This means splitting the total variability
of the equation ‘y’ into the following sections:
1. Variability explained by the model
2. Error (unexplained) variability
This variability partitioning is expressed in the following formula:
SSTotal = SSModel + SSError   (SSModel is also written SSR)
Each term is calculated by summing squared deviations:
SSTotal = Σ(y − ȳ)²
SSModel = Σ(ŷ − ȳ)²
SSError = Σ(y − ŷ)²
When looking at the variability, we can use the following in order to test whether or not the
model is effective. We need to consider the mean square for the model and also the mean square
error for the variability of the data.
Mathematically, the formulas for the Mean Square Model and Mean Square Error are:
Mean Square Model = MSModel = SSModel / 1
Mean Square Error = MSE = SSError / (n − 2)
For a hypothesis test to check whether the model is effective or not, we define:
H0: Model is ineffective (slope = 0)
Ha: Model is effective (slope ≠ 0)
using the F-statistic:
F = MSModel / MSE
From the response variable equation 'y', we assume that the random errors (epsilon) occur by random chance, with their own standard deviation. This means we can calculate the average amount by which the data deviates from the regression line. There are two cases:
1. If this stdev is small, our least squares line is quite accurate, as the residuals are very small
2. If this stdev is large, our least squares line is not so accurate, as the residuals are very large
Mathematically, the standard error of the residual errors within the model is:
sε = √( SSError / (n−2) )
Similarly, we can calculate the Standard Error of the slope using:
SE(b1) = sε / √( Σ(x − x̄)² )
NOTE: An ANOVA table will give us these values; however, we need to be able to locate and interpret them.
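A sketch of reading these values from software output; statsmodels' OLS summary plays the role of the ANOVA table, reporting the intercept and slope, their SEs, t statistics and p-values, plus R² and the F-statistic (the data is invented):

```python
# A sketch of regression inference output with statsmodels; invented data.
import numpy as np
import statsmodels.api as sm

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])

X = sm.add_constant(x)        # adds the intercept column (b0)
model = sm.OLS(y, X).fit()
print(model.summary())        # coefficients, SEs, t, p-values, R², F
```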
REGRESSION CAUTIONS
When looking at regression modelling, we must be careful of the following cautions:
1. Don't use the regression equation to predict from 'x' values outside the scope of the data (extrapolation)
2. Always plot your data! The regression equation should only be used where there is some sort of linear association between the quantitative variables
3. Be careful of outliers, as they can heavily influence the regression line
4. Only randomised experiments allow a causal conclusion that a change in 'x' produces a change in 'y'
Part 8 -> Probability
Probability: The probability of an event occurring is the long-run proportion of times it occurs. Therefore probabilities always satisfy: 0 ≤ P(event) ≤ 1
Throughout this section, we will be looking at cases where outcomes are equally likely; that is, every outcome has the same possibility of being chosen (e.g. a fair die, a coin toss etc.).
Therefore the probability of an event made of equally likely outcomes is calculated by:
P(event) = Number of outcomes in the event / Total number of possible outcomes
Sometimes we can use a Venn diagram to determine the probability of an event occurring. Using the
diagram below:
• Blue dots represent all the outcomes that can occur
• All the blue dots inside A's red circle represent P(event A) occurring
• All the dots outside the circle represent P(not event A), i.e. A not occurring
Using probability, we can find the probability of any given combination of events. However, these can sometimes be hard to distinguish. The rules below summarise the different combinations.
Rule 1 -> Additive Rule (A or B)
The additive rule gives the probability that event A occurs or event B occurs (or both). We subtract the probability that both occur, because otherwise the overlap would be counted twice.
Mathematically, the formula for the additive rule is:
P(A or B) = P(A) + P(B) − P(A and B)
EXCEPTION: Where we are looking at disjoint events, where the 2 events don't have a common outcome, we can simplify the additive rule. Mathematically, the formula for disjoint events is:
P(A or B) = P(A) + P(B)
Diagram: the Venn diagram shows that where the additive rule is used, we must subtract the overlap: the yellow circle plus the purple circle minus the middle overlap.
Rule 2 -> Complement Rule (Not A)
The complement rule looks at the probability that an event doesn’t occur. Mathematically, the
formula is:
P(Not A) = 1 − P(A)
Diagram: the shaded region shows the probability that event A doesn't occur.
Rule 3 -> Conditional probability
Conditional probability looks at finding the probability of A given that we know B occurred.
Therefore this can be expressed as:
• Probability of A if we know B occurred
• Probability of A given B
Mathematically, the formula is expressed as:
P(A if B) = P(A and B) / P(B)
Rule 4 -> Multiplication Rule (And rule)
The multiplication rule looks at the probability of 2 events both occurring. This means we are looking at the probability of A occurring, and then the probability of B occurring given that A has already occurred. Mathematically, the formula is given by:
P(A and B) = P(A) × P(B if A)
Special Case -> Independent Events
When using the multiplication rule, we sometimes have independent events. Independent events are events where the occurrence of A does not influence the probability of B. Mathematically, this simplifies the conditional rule and multiplication rule to:
Conditional Rule -> P(A if B) = P(A)
Multiplication Rule -> P(A and B) = P(A) × P(B)
Difference between disjoint and independent?
Disjoint: A disjoint event is where there is no common outcome or overlap. Therefore in 1 trial,
only 1 of the outcomes can occur
Independent: Independent events can have a common outcome or overlap; independence means that one event occurring doesn't influence the probability of the other occurring.
Diagram: summary of the probability rules for any 2 events.
Law of Total Probability: For disjoint events B1, ..., Bn that together cover all outcomes, the probability that an event A occurs is the sum across all of them:
P(A) = P(A and B1) + P(A and B2) + ... + P(A and Bn)
When looking at probabilities of one event and probabilities of conditional events, we can easily
organise these into tree diagrams. In the tree diagram:
• The first set of branches gives us the probability of each initial event occurring
• The next set of branches gives the conditional probabilities (the probability of an event occurring given the first branch)
Bayes Rule
For conditional probability, instead of using a tree diagram we can use Bayes rule, which is a quicker method. For any 2 events:
P(A if B) = P(B if A) × P(A) / [ P(B if A) × P(A) + P(B if not A) × P(not A) ]
Extending this, for 3 or more disjoint events A1, ..., An:
P(Ai if B) = P(B if Ai) × P(Ai) / [ P(B if A1) × P(A1) + ... + P(B if An) × P(An) ]
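A sketch of Bayes rule on a hypothetical diagnostic test (all numbers invented: 1% prevalence, 95% detection rate, 10% false-positive rate):

```python
# A sketch of Bayes rule on a hypothetical diagnostic test. All numbers are
# invented: 1% prevalence, 95% detection rate, 10% false-positive rate.
p_d = 0.01                     # P(disease)
p_pos_if_d = 0.95              # P(positive if disease)
p_pos_if_not_d = 0.10          # P(positive if no disease)

# law of total probability: P(positive)
p_pos = p_pos_if_d * p_d + p_pos_if_not_d * (1 - p_d)

# Bayes rule: P(disease if positive)
print(p_pos_if_d * p_d / p_pos)    # ≈ 0.088, despite the accurate test
```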
Random Variables: As we know from HSC Mathematics, a random variable is a value that can change for each scenario or random sample/trial. We can classify random variables further as either:
• Discrete variables: a set (countable) number of values
• Continuous variables: infinitely many possible values within an interval
Discrete Variable: A random variable that has a definite or set number of values. Generally, these
variables will have {} to signify that only the values within the brackets can count. Some examples
are:
• Die roll {1, 2, 3, 4, 5, 6}
• Number of females in a class
• Sum of two dice {2, 3, ..., 12}
Continuous Variables: Continuous variables are those that can take on any value within an interval. Unlike discrete variables, continuous variables don't have set numbers, as they can take on any value so long as it is within the defined interval. Some examples are:
• Weight
• Height
Probability Function for Discrete Variables
Denote: The probability of a certain discrete value occurring is denoted P(event)
Sum: For discrete variables, the sum of all probabilities must always equal 1. Mathematically:
Σ p(x) = 1, summing over all possible values x
Mean of a Random Variable
For a certain random variable, if we know the probability, we can calculate the mean of the random
variable. This process is done in the following manner:
• Multiply each discrete value by its corresponding probability
• Add up all of these products
Denote: To denote the mean of a random variable with a probability function, we use µ
Mathematically, the formula for calculating the mean of a probability function is:
µ = Σ x · p(x)
Standard Deviation
Using the random variable and its probability functions, we can also calculate the standard
deviation. This process is done in the following manner:
• For each value, multiply the squared difference between the value and the mean by the probability, i.e. (x − µ)² · p(x)
• Take the sum of all these products to get the variance
• Take the square root to get the standard deviation
Mathematically: the standard deviation for a probability function is calculated by:
variance = σ² = Σ (x − µ)² · p(x)
stdev = √σ²
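A sketch of the mean and standard deviation of a discrete random variable, using an invented probability function:

```python
# A sketch of the mean and stdev of a discrete random variable, using an
# invented probability function over the values 0-3.
from math import sqrt

x_values = [0, 1, 2, 3]
probs = [0.1, 0.4, 0.3, 0.2]          # Σ p(x) = 1

mu = sum(x * p for x, p in zip(x_values, probs))             # µ = Σ x·p(x)
var = sum((x - mu)**2 * p for x, p in zip(x_values, probs))  # Σ(x-µ)²·p(x)
print(f"mean = {mu:.2f}, stdev = {sqrt(var):.3f}")
```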
Binomial probability looks at the idea of success and failure: what is the probability that an event does or doesn't occur.
Binomial Random Variable: A binomial random variable counts the number of times an event is successful. Its characteristics are:
• 'n' is the number of trials
• 'p' is the probability of success on each individual trial (the same for every trial)
• Each trial of a binomial random variable is independent of the others
GENERAL RULE -> The general rule of binomial probability is the idea of success or failure, where an event either occurs or it doesn't occur.
The mathematical formula for binomial probability is:
P(X = k) = C(n, k) × p^k × (1 − p)^(n − k)
The notation for the formula is:
• k = number of trials that are successful
• n = number of trials that occur
• p = probability that the event occurs on each trial
• C(n, k) = n! / (k!(n−k)!), the number of ways to choose which k trials succeed
Mean of Binomial Random Variable
For a certain binomial random variable, we are able to calculate the mean. Mathematically, the
formula for calculating the mean of a binomial random variable is:
µ = np
The notation for this formula is:
• µ = mean of the binomial random variable
• n = number of trials that occur
• p = probability of success on each trial
Standard Deviation of a Binomial Random Variable
For a binomial random variable, we can also calculate the standard deviation. Mathematically, the formula is:
σ = √( np(1 − p) )
NOTE -> The notation is the same as for the mean
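A sketch of these binomial calculations with scipy, reusing n = 25 and p = 0.5 from the earlier dog example:

```python
# A sketch of the binomial formulas with scipy, reusing n = 25, p = 0.5
# from the earlier dog example.
from math import sqrt
from scipy import stats

n, p = 25, 0.5
X = stats.binom(n, p)

print(X.pmf(16))                          # P(X = 16) = C(n,k) p^k (1-p)^(n-k)
print(X.sf(15))                           # P(X >= 16) ≈ 0.1148, as before
print("mean  =", n * p)                   # µ = np
print("stdev =", sqrt(n * p * (1 - p)))   # σ = √(np(1-p))
```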