Math 54 Worksheet #1

Stat N21 Final Review
Faculty: Professor Philip Stark
Lecturers: Alen Gong and Larry Wang
Location: 6:00 pm - 160 Dwinelle
Community through Academics and Leadership
Probability
Axioms:
1) For any event A, $P(A) \ge 0$
2) If S is the whole sample space, then $P(S) = 1$
3) If two events A and B are disjoint, then $P(A \text{ or } B) = P(A) + P(B)$
Theories:
1) Equally Likely Outcomes: Symmetry and indistinguishable outcomes.
2) Frequency: Experimental data.
3) Subjective: Credence and “experience”.
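If $P(A \text{ and } B) = P(A) \times P(B)$, then events A and B are independent.
Complements: $P(A) + P(\text{not } A) = 1$, so $P(A) = 1 - P(\text{not } A)$
Inclusion-Exclusion Formula: $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$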
Set Theory:
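[Figure: Venn diagram showing events A, B, and C inside the sample space S]
In this diagram, B implies A, and B and C are mutually exclusive. Note that we cannot say
anything about the independence of A and C, but we do know that B and C are dependent: if B
occurs, then C cannot occur (assuming both have nonzero probability).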
Combinations and Permutations:
$_nC_r = \dfrac{n!}{r!\,(n - r)!}$: How many ways we can select r items out of n items.
$_nP_r = \dfrac{n!}{(n - r)!}$: How many ways we can order r items out of n items.
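The two counts can be checked numerically. The sketch below is illustrative only: it assumes the standard factorial formulas nCr = n!/(r!(n-r)!) and nPr = n!/(n-r)!, uses made-up values n = 5 and r = 2, and compares the results with `math.comb` and `math.perm` (available in Python 3.8 and later).

```python
import math

def n_choose_r(n, r):
    """Number of ways to select r items out of n (order ignored)."""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

def n_permute_r(n, r):
    """Number of ways to order r items out of n."""
    return math.factorial(n) // math.factorial(n - r)

n, r = 5, 2  # illustrative values, not taken from the worksheet
print(n_choose_r(n, r), math.comb(n, r))   # both give 10
print(n_permute_r(n, r), math.perm(n, r))  # both give 20
```

Each selection of r items can be ordered in r! ways, which is why nPr is always r! times nCr.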
Expected Value:
For a random variable x, we can calculate the ‘expected value’ of x. This will be equal to the
long run average of x, if applicable.
$E(x) = x_1 p_1 + \dots + x_N p_N$
$E(x + y) = E(x) + E(y)$
IF INDEPENDENT: $E(xy) = E(x)\,E(y)$
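A small Python sketch of these three rules, using a fair six-sided die as a hypothetical example (the die and the helper name `expected_value` are assumptions for illustration, not part of the review sheet):

```python
def expected_value(values, probs):
    """E(x) = x1*p1 + ... + xN*pN for a discrete random variable."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in zip(values, probs))

# Hypothetical example: one roll of a fair six-sided die.
faces = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6
e_one = expected_value(faces, probs)

# Two independent dice: enumerate all 36 equally likely pairs.
pairs = [(a, b) for a in faces for b in faces]
e_sum = sum(a + b for a, b in pairs) / 36   # should match E(x) + E(y)
e_prod = sum(a * b for a, b in pairs) / 36  # should match E(x) * E(y)

print(e_one, e_sum, e_prod)
```

The sum rule holds whether or not the variables are independent; the product rule is only guaranteed under independence, which is why the dice are enumerated as independent pairs here.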
Problems:
1. A fair coin is flipped twice.
a. If you are told that at least one of the flips came up heads, what is the probability
that both are heads?
b. If you are told that the first coin came up heads, what is the probability that both
are heads?
2. A lottery has a $6,000,000 grand prize with probability of winning 1 in 3,000,000. It also
has a $10 consolation prize with probability of winning 1 in 1000. What is the fair price
of your $5 lottery ticket?
3. In an urn are 5 blue, 3 red, and 2 yellow balls. If you draw 3 balls, what is the probability
that fewer than 2 will be red if:
a. You draw with replacement?
b. You draw without replacement?
4. There are 20 people who work in an office together. Four of these people are selected to
go to the same conference together. How many such selections are possible?
5. Which is larger, 2008C32 or 2008P32?
6. How many distinct ways are there to arrange the letters B, E, R, K, E, L, E, Y?
Logic and Truth Tables:
Truth tables are a convenient way to represent combinations of statements that may be either true
or false.
The ways we know to combine them are NOT, OR, AND, XOR, IMPLIES, and IFF.
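As an illustration, the connectives can be written as small Python functions and tabulated over every combination of p and q. This is a sketch: the dictionary name `CONNECTIVES` and the helper `truth_table` are made up for the example.

```python
from itertools import product

# Each two-input connective maps a pair of truth values to one truth value.
CONNECTIVES = {
    "p AND q": lambda p, q: p and q,
    "p OR q": lambda p, q: p or q,
    "p XOR q": lambda p, q: p != q,
    "p IMPLIES q": lambda p, q: (not p) or q,
    "p IFF q": lambda p, q: p == q,
}

def truth_table(fn):
    """Return rows (p, q, result) for all four combinations of p and q."""
    return [(p, q, fn(p, q)) for p, q in product([True, False], repeat=2)]

for name, fn in CONNECTIVES.items():
    print(name)
    for p, q, result in truth_table(fn):
        print(f"  p={p!s:5} q={q!s:5} -> {result}")

# NOT takes a single input, so it is tabulated separately.
print("NOT p:", [(p, not p) for p in (True, False)])
```

Note that IMPLIES is false only in the row where p is true and q is false.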
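Conditional Probability: $P(A \mid B) = \dfrac{P(A \text{ and } B)}{P(B)}$, provided $P(B) > 0$
Bayes Rule: $P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \text{not } A)\,P(\text{not } A)}$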
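A minimal sketch of Bayes' rule in Python, written as P(A | B) = P(B | A)P(A) / [P(B | A)P(A) + P(B | not A)P(not A)]. The function name and the numbers (a 1% prior, a 95% true-positive rate, a 5% false-positive rate) are hypothetical and chosen only for illustration.

```python
def bayes_posterior(prior, p_b_given_a, p_b_given_not_a):
    """Return P(A | B) from P(A), P(B | A), and P(B | not A)."""
    numerator = p_b_given_a * prior
    evidence = numerator + p_b_given_not_a * (1 - prior)  # this is P(B)
    return numerator / evidence

# Hypothetical numbers: A = "has the condition", B = "test is positive".
posterior = bayes_posterior(prior=0.01, p_b_given_a=0.95, p_b_given_not_a=0.05)
print(round(posterior, 3))  # about 0.161
```

Even with an accurate test, a small prior keeps the posterior well below the test's accuracy, because most positives come from the much larger "not A" group.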
Problems:
1. Fill out the truth table for: p IMPLIES (p OR q)
2. A doctor has a 90% chance of correctly diagnosing a disease if you have it, a 20%
chance of diagnosing it if you don’t. Everyone who has a nosebleed out of his or her
left nostril has a 20% chance of actually having the disease. You develop a nosebleed
out of your left nostril.
a. What is the chance that you have the disease, given that the doctor says you do?
b. If 100 people have nosebleeds, what is the expected number of people that are
diagnosed with the disease and actually have it?
Looking at Data
A test statistic is analogous to a random variable in probability.
The test statistic can be quantitative or qualitative. Don’t be fooled, not all numbers are
quantitative data, and non-numbers can sometimes be treated as quantitative.
Center and Spread:
We know 3 ways of measuring the center of a set of data:
Mean: The sum of all the measurements, divided by the number of measurements
Mode: The most frequent value
Median: The middle value of the list
Percentile: Median is the 50th percentile of the data. In general, we say that X is the Yth
percentile of a set of data if Y% of the data is less than X.
The 25th and 75th percentiles are called the 1st and 3rd quartiles, respectively.
The IQR is the distance between the 1st and 3rd quartiles, and is one way of measuring how
spread out a set of data is.
The standard deviation is another measure of spread, and is one we use more often.
$SD_{box} = \sqrt{\dfrac{(x_1 - \mu)^2 + \dots + (x_N - \mu)^2}{N}}$, where the x’s are all the labels in the box, and $\mu$ is the
average of the box.

Shortcut: If your box has only two types of labels, a and b, $SD_{box} = |a - b| \sqrt{p(1 - p)}$, where p
is the probability of drawing a.
Standard error is an estimate of the standard deviation we use when the real standard deviation
is impossible to calculate.
In general, we can find the SD of a list or a box, and we use the SE for the sample mean or sum,
or when estimating the population from a sample.
$SE(x) = \sqrt{E[(x - E(x))^2]}$
In words, SE is the square root of the EXPECTATION of $(x - E(x))^2$.
If independent: $SE(x + y) = \sqrt{SE^2(x) + SE^2(y)}$
$SE(\text{sum}) = \sqrt{n} \times SD_{box}$, where n is the number of draws
$SE(\text{mean}) = SD_{box} / \sqrt{n}$
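The SD of a box and the two SE formulas can be sketched in Python as follows. The box (three tickets labeled 0 and one labeled 1) and the number of draws are hypothetical, and the draws are assumed to be made with replacement.

```python
import math

def sd_box(tickets):
    """SD of a box: square root of the average squared deviation from the box average."""
    mu = sum(tickets) / len(tickets)
    return math.sqrt(sum((x - mu) ** 2 for x in tickets) / len(tickets))

def sd_two_labels(a, b, p):
    """Shortcut for a box with only labels a and b; p is the chance of drawing a."""
    return abs(a - b) * math.sqrt(p * (1 - p))

box = [0, 0, 0, 1]  # hypothetical box
n = 16              # hypothetical number of draws, with replacement

sd = sd_box(box)
se_sum = math.sqrt(n) * sd    # SE of the sum of n draws
se_mean = sd / math.sqrt(n)   # SE of the mean of n draws

print(sd, sd_two_labels(1, 0, 0.25), se_sum, se_mean)
```

The general formula and the two-label shortcut agree on this box, and the SE of the sum grows with the square root of n while the SE of the mean shrinks by the same factor.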
Note that median and IQR are very resistant to change, while mean and SD can be affected
greatly by just one value.
Data can be represented graphically by either a boxplot or a histogram. A scatterplot can be
used when there are two variables.
Problems:
Answer the following questions based on the histogram:
[Figure: histogram titled “Quiz Scores”; horizontal axis from 14 to 20, vertical axis from 0 to 0.3]
1. What are the mean, median, and mode of this data? Is it skewed, and if so, in which
direction?
2. Find the 15th, 50th, and 95th percentiles of the data.
3. Is it possible to find the standard deviation of this data? If not, what other information
do you need?
4. If you took a sample of size 5 with replacement from this data, what would the
expected value of the mean be? The expected value of the sum?
Predicting Data
Markov’s and Chebychev’s inequalities:
These inequalities can tell us useful information about a set of data even if we do not know all
the values.
Markov’s inequality: If the random variable X is nonnegative, then $P(X \ge a) \le \dfrac{E(X)}{a}$ for any $a > 0$.
Gives an upper bound for the proportion of values at or above a certain value; if there were more,
the mean would be higher.
Chebychev’s inequality: $P(|X - E(X)| \ge k \cdot SE(X)) \le \dfrac{1}{k^2}$ for any $k > 0$.
Gives an upper bound for the proportion of values at least a certain distance away from the mean;
if there were more, the standard error would be larger.
Chebychev’s inequality usually gives tighter bounds, but we can only use it if we have the SE of
the data.
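Both bounds are one-line calculations. The sketch below assumes the usual statements P(X ≥ a) ≤ E(X)/a for nonnegative X and P(|X − E(X)| ≥ k·SE) ≤ 1/k²; the function names and the numbers (E(X) = 20, SE(X) = 4) are hypothetical.

```python
def markov_bound(expected, a):
    """Upper bound on P(X >= a) for a nonnegative X and a > 0."""
    return min(1.0, expected / a)

def chebychev_bound(se, distance):
    """Upper bound on P(|X - E(X)| >= distance), where distance = k * SE."""
    k = distance / se
    return min(1.0, 1 / k ** 2)

# Hypothetical nonnegative random variable with E(X) = 20 and SE(X) = 4.
print(markov_bound(20, 40))    # 0.5
print(chebychev_bound(4, 20))  # 0.04, since 40 is 5 SEs above the expected value
```

Chebychev's inequality bounds both tails together, so its value is also a valid (if loose) bound for the upper tail alone. Both functions cap the result at 1 because a probability cannot exceed 1.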
Law of Large Numbers:
As we increase the number of trials, the percent error of the measured value from the expected
value tends to decrease (but the absolute error tends to increase).
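A quick simulation can make the contrast between percent error and absolute error concrete. This sketch tosses a simulated fair coin; the seed, the toss counts, and the helper name `coin_errors` are arbitrary choices for illustration, and any single run is only suggestive because the errors are random.

```python
import random

def coin_errors(n_tosses, rng):
    """Return (absolute error in the count of heads, error in percentage points)."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    abs_error = abs(heads - n_tosses / 2)
    pct_error = 100 * abs_error / n_tosses
    return abs_error, pct_error

rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (100, 10_000, 1_000_000):
    abs_err, pct_err = coin_errors(n, rng)
    print(f"n={n:>9,}  |heads - n/2| = {abs_err:>7.1f}  error = {pct_err:.3f}%")
```

On a typical run the absolute error is likely to be larger for larger n while the percentage error is likely to be smaller, matching SE(sum) growing like the square root of n and SE(mean) shrinking like one over the square root of n.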
Regression:
Regression is used to predict a value of Y when given a value of X. The regression equation is
$Y = mX + b$, where the slope is $m = r \cdot SD_Y / SD_X$ and b is determined by plugging in the point of
averages, r being the correlation coefficient.
$r = \dfrac{X_1 Y_1 + X_2 Y_2 + \dots + X_N Y_N}{N}$, with X and Y in standard units.
For regression to work, the data must have a linear relationship, be homoscedastic, have no
outliers, and should be for interpolation only (mostly).
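The recipe above (r as the average product in standard units, slope r·SDy/SDx, intercept from the point of averages) can be sketched in Python. The five data points are made up for illustration, and the SD divides by N, as in the SD of a list.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def sd(xs):
    """SD of a list (divides by N, not N - 1)."""
    mu = mean(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))

def correlation(xs, ys):
    """r = average of the products of X and Y in standard units."""
    mx, my, sx, sy = mean(xs), mean(ys), sd(xs), sd(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / len(xs)

def regression_line(xs, ys):
    """Slope m = r * SDy / SDx; intercept b chosen so the line passes through the point of averages."""
    m = correlation(xs, ys) * sd(ys) / sd(xs)
    b = mean(ys) - m * mean(xs)
    return m, b

xs = [1, 2, 3, 4, 5]  # made-up data
ys = [2, 4, 5, 4, 5]
m, b = regression_line(xs, ys)
print(round(correlation(xs, ys), 3), round(m, 3), round(b, 3))
```

Plugging the average of X into the fitted line returns the average of Y, which is what "plugging in the point of averages" guarantees.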
Scatter Plots and Residuals:
A residual plot shows the difference between the data points and the regression line.
Regression done incorrectly: The regression line was computed incorrectly (e.g. the residuals
follow a pattern, are all positive, etc.).
Regression does not apply: The scatter plot is not football shaped (e.g. heteroscedastic,
non-linear, or with outliers).
Problems:
1. When flipping 100 coins, the number of heads has expected value 50 and SE 5.
a. Find an upper bound for the probability of getting more than 75 heads using
Markov’s inequality.
b. Find an upper bound for the probability of getting more than 75 heads using
Chebychev’s inequality.
Distributions:
Binomial(n,p)
$P(X = x) = {}_{n}C_{x}\, p^{x} (1 - p)^{n - x}$
“Find the probability of x successes in n trials, with replacement.”
$E(x) = np$
$SE(x) = \sqrt{np(1 - p)}$
Geometric(p)
$P(X = x) = p (1 - p)^{x - 1}$
“Find the probability that the first success happens at the x-th trial, with replacement”
$E(x) = \dfrac{1}{p}$
$SE(x) = \dfrac{\sqrt{1 - p}}{p}$
Negative Binomial(p,r)
$P(X = x) = {}_{x-1}C_{r-1}\, p^{r} (1 - p)^{x - r}$
“Find the probability that it takes x trials to get r successes, with replacement”
$E(x) = \dfrac{r}{p}$
$SE(x) = \dfrac{\sqrt{r(1 - p)}}{p}$
Hypergeometric(N,G,n)
$P(X = x) = \dfrac{{}_{G}C_{x} \times {}_{N-G}C_{n-x}}{{}_{N}C_{n}}$
“Find the probability that you get x “good things” in n draws when there are G total good things,
N total things, without replacement.”
$E(x) = np$ (note p = G/N)
$SE(x) = \sqrt{\dfrac{N - n}{N - 1}} \times \sqrt{np(1 - p)}$
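The four probability formulas translate directly into Python with `math.comb`. This is a sketch under the parameterizations used on this page (x counts trials for the geometric and negative binomial); the function names and the example parameters are hypothetical.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(x successes in n independent trials)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def geometric_pmf(x, p):
    """P(first success happens on trial x)."""
    return p * (1 - p) ** (x - 1)

def negative_binomial_pmf(x, p, r):
    """P(it takes exactly x trials to get r successes)."""
    return comb(x - 1, r - 1) * p**r * (1 - p) ** (x - r)

def hypergeometric_pmf(x, N, G, n):
    """P(x good things in n draws without replacement; G good out of N total)."""
    return comb(G, x) * comb(N - G, n - x) / comb(N, n)

# Hypothetical parameters, chosen so the answers are easy to check by hand.
print(binomial_pmf(2, 4, 0.5))           # 0.375
print(geometric_pmf(3, 0.5))             # 0.125
print(negative_binomial_pmf(3, 0.5, 2))  # 0.25
print(hypergeometric_pmf(1, 5, 2, 2))    # 0.6
```

A useful sanity check for any of these is that the probabilities over all possible x add up to 1 and that the weighted average of x matches the E(x) formula.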
Problems:
1. Identify the distribution of the following random variables as binomial, geometric,
hypergeometric, negative binomial, or none of the above, and give the parameters, if
possible.
a. The number of questions a student will get right by randomly guessing on a 50-question multiple choice test with 5 choices per answer.
b. The number of rounds it takes until a player rolls either a 7 or an 11 on a pair of
dice.
c. A variable with a negative binomial distribution and parameters p=.5, r=1
Normal Curve
The normal distribution, or bell curve, is a symmetric continuous probability distribution with
parameters µ and σ.
Central Limit Theorem:
The distribution of the sample mean and sample sum of a box of numbers approaches the normal
distribution as the sample size increases, regardless of the numbers inside the box (for a simple
random sample). If the numbers inside the box are skewed, this takes longer, but it still happens.
Scaling:
The standard normal distribution is a normal distribution with mean at 0 and standard error of 1.
For any random variable x with normal distribution having expected value µ and SE σ, we can
convert x to standard units, or a z-score, with the formula $z = \dfrac{x - \mu}{\sigma}$.
Estimating:
A z-score can be calculated from any distribution where we know the mean and SE. We can then
use the normal curve to find an approximation for the probability that the random variable is less
than this value. If it is a discrete distribution, remember the ½ offset.
Because of the central limit theorem, whenever we have a simple random sample, we can use the
normal distribution as an approximation, using the sample mean and SE.
When sampling without replacement, multiply the SE with replacement by $f = \sqrt{\dfrac{N - n}{N - 1}}$,
where N is the population size and n is the sample size.
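The estimation steps (convert to standard units, apply the ½ offset for a discrete variable, read the area off the normal curve) can be sketched with Python's `statistics.NormalDist`. The binomial example with n = 36 and p = 0.5, and the population sizes passed to the correction factor, are hypothetical.

```python
import math
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Convert x to standard units."""
    return (x - mu) / sigma

def normal_approx_at_most(k, expected, se):
    """Approximate P(X <= k) for an integer-valued X, using the 1/2 offset."""
    return NormalDist().cdf(z_score(k + 0.5, expected, se))

def finite_population_correction(N, n):
    """Factor f that multiplies the with-replacement SE when sampling without replacement."""
    return math.sqrt((N - n) / (N - 1))

# Hypothetical example: X ~ Binomial(n=36, p=0.5), so E(X) = 18 and SE(X) = 3.
approx = normal_approx_at_most(20, 18, 3)
exact = sum(math.comb(36, x) for x in range(21)) / 2**36
print(round(approx, 4), round(exact, 4))

print(finite_population_correction(N=1000, n=100))
```

Comparing the approximation with the exact binomial sum shows how close the normal curve gets once the ½ offset is included; the correction factor is always at most 1, so sampling without replacement never increases the SE.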
Remember $SD_{box} = |a - b| \sqrt{p(1 - p)}$ if there are only tickets labeled a and b.
Bootstrap Method: Use the sample proportion φ as an approximation for the population
proportion p: $SE \approx f \times \sqrt{\varphi (1 - \varphi)} / \sqrt{n}$
Conservative method: assume the probability of success is 0.5 to get the largest SD.
Use $s = \sqrt{n / (n - 1)} \times s^*$ as the estimate of $SD_{box}$ when sampling a continuous variable, where s* is the
standard deviation of your sample.
Confidence Interval: A range of values “you think” the population parameter is in. The interval
is said to “cover” the parameter if the parameter falls in the interval.
P%-Confidence Interval: The method you use in creating the interval has a P% chance of
including the population parameter. The higher P is, the larger the interval.
Interval: Statistic ± k × SE (k is measured in standard units, like a z-score)
Conservative method: an interval that has confidence level P or higher. (The actual confidence
level you get from this method is probably higher than the P you are using.)
1. Use the conservative estimate of the standard error
2. Use Chebychev’s inequality to calculate k: $P = 1 - \dfrac{1}{k^2}$
Approximate method: an interval that is your “best guess” for a confidence level P.
1. Use the bootstrap method to estimate standard error.
2. Use the normal curve to calculate k.
3. Works best when the normal approximation for the probability histogram of the sample
average or sample percent applies.
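The conservative and approximate methods differ only in how the SE and k are chosen. The sketch below compares them for a hypothetical sample of n = 400 draws with replacement and a sample proportion of 0.30; the helper names are made up for the example, and the finite population correction is omitted (f = 1) because the draws are assumed to be with replacement.

```python
import math
from statistics import NormalDist

def chebychev_k(confidence):
    """Solve P = 1 - 1/k^2 for k."""
    return 1 / math.sqrt(1 - confidence)

def normal_k(confidence):
    """k such that the area under the normal curve within +/- k equals `confidence`."""
    return NormalDist().inv_cdf((1 + confidence) / 2)

def interval(statistic, k, se):
    """Statistic +/- k * SE."""
    return statistic - k * se, statistic + k * se

n, phi, conf = 400, 0.30, 0.95  # hypothetical sample size, sample proportion, confidence level

se_conservative = math.sqrt(0.5 * 0.5) / math.sqrt(n)  # assume p = 0.5 for the largest SD
se_bootstrap = math.sqrt(phi * (1 - phi)) / math.sqrt(n)  # plug in the sample proportion

print(interval(phi, chebychev_k(conf), se_conservative))  # conservative method
print(interval(phi, normal_k(conf), se_bootstrap))        # approximate method
```

The conservative interval comes out wider on both counts: it uses the largest possible SE and a k from Chebychev's inequality (about 4.47 at 95%) rather than from the normal curve (about 1.96).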
Experiment Design
When designing an experiment, look out for factors that may cause bias.
In order to minimize the impact of bias on our results, we set up controls.
Sampling
Random: The method of choosing something in which each possibility is equally likely to be
chosen.
Systematic: A fixed method of choosing the next subject having chosen the one before. NOT
random.
Stratified: Partitioning a population into disjoint groups and then sampling (not necessarily
equally) from each one.
Cluster: Partitioning a population into disjoint groups, then choosing one, and then taking the
data of EVERYTHING in that group.
Multi-stage: Sampling in more than one stage, for example randomly choosing groups and then
randomly sampling within each chosen group.
Hypothesis Testing
1. We are always gathering evidence to see if it is likely that a theory is not true. We determine
beforehand either a rejection region for the statistic or a significance level for the P-value.
2. Null hypothesis
a. Conceptually: This is the theory that we are gathering evidence against, or that we
assume to be true.
b. Practically: This is the theory that gives a predicted/expected number, and we’re
testing if our value (i.e. the statistic) is far enough away from this expected number to
reject that theory. The null hypothesis usually says that the values being tested are the
same, that nothing out of the ordinary happened, or that the parameter we are
changing has no effect on the one we are measuring.
3. Alternate hypothesis
a. This is the theory that we take in lieu of the null hypothesis, if the test statistic or P-value is in the region for us to reject the null hypothesis. The alternate hypothesis
should answer the question being asked.
b. This can be very general (the die is not fair) or very specific (the die gives too many
6’s; the long run fraction of getting a 6 is actually 1/5)
c. Determines if the test is one-sided or two-sided.
4. Confidence interval: An X% confidence interval is an interval, centered at the sample statistic,
built by a method that has an X% chance of covering the population parameter. Easiest to find
with the normal curve.
5. Significance level: Probability of rejecting a true null (also the cutoff for the P-value)
6. Threshold: The cutoff for the test statistic
7. Rejection region: These are the actual outcomes that, when they happen, the null hypothesis
is rejected. If your outcome is in the rejection region, you reject the null hypothesis
8. P-Value: The probability that we get our result or worse (farther away from the null)
assuming the null hypothesis is true. We reject the null when this is smaller than the desired
significance level.
9. Type I error: Rejecting a true null hypothesis. The probability of getting one is equal to the
significance level.
10. Type II error: Failing to reject a false null hypothesis. The probability of getting one is the
complement of the power.
11. Power of a test: Probability that the test rejects the null hypothesis assuming that the alternate
hypothesis is true (so the null is false).
[Figure: diagram with regions labeled Power, Type II Error, and Type I Error; the thick lines
represent the threshold values.]
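To tie the definitions together, here is a sketch of a one-sided test on a binomial count, computing the threshold, the chance of a Type I error, the power, the chance of a Type II error, and a P-value. All numbers (20 trials, null p = 0.3, alternate p = 0.6, 5% level, 11 observed successes) are hypothetical, and the helper names are made up for the example.

```python
from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def upper_tail(c, n, p):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(x, n, p) for x in range(c, n + 1))

def threshold(n, p_null, level):
    """Smallest c with P(X >= c | null) <= level, for a one-sided (upper) test."""
    for c in range(n + 2):  # c = n + 1 gives an empty rejection region with chance 0
        if upper_tail(c, n, p_null) <= level:
            return c

n, p_null, p_alt, level = 20, 0.3, 0.6, 0.05  # hypothetical test setup

c = threshold(n, p_null, level)      # reject the null when X >= c
type_i = upper_tail(c, n, p_null)    # chance of rejecting a true null
power = upper_tail(c, n, p_alt)      # chance of rejecting when the alternate is true
type_ii = 1 - power                  # chance of failing to reject a false null

observed = 11                        # hypothetical observed count
p_value = upper_tail(observed, n, p_null)

print(c, round(type_i, 4), round(power, 4), round(type_ii, 4), round(p_value, 4))
```

Because the count is discrete, the actual chance of a Type I error is usually a little below the stated significance level rather than exactly equal to it.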
Additional Problems:
1. Suppose there is a large city with different districts. Below is a chart that shows how much money
was spent on health care in a year, per capita, in each district, as well as how many people died in the
district, per 100 people.

District   Money spent on health care per person per year   Number of deaths per 100 people
A          $250                                              4.5
B          $400                                              8
C          $450                                              7
D          $300                                              4
E          $150                                              2
F          $350                                              5

A reporter looking at this data does a news article and says that “Clearly, spending money on
health care is hazardous to your health.” What do you think of this conclusion?
2. A student in a physics class scored a 40 on the midterm, which had an average of 50 and an SD of
15. On the final, the student scored a 50; the final had an average of 55 and an SD of 5. After
calculating the grade breakdowns for the curve of the class, this student fell right on the borderline
between a C- and a C. In the course syllabus, the professor said that if you show improvement in the
course due to your hard work, and you end up borderline, you would be bumped up to the higher
grade. The correlation coefficient between the midterm and the final is 0.6.
a. Would you bump up this student? Explain.
b. Suppose the student earned a 53. Would you bump up this student now? Explain.
3. Fill out the truth table for: p IMPLIES (p OR q)
4. Suppose ten fish in your fish tank have a disease. There are 30 fish in your fish tank (it’s a big tank).
a. You draw five fish from the tank, and determine that three have the disease. Can you calculate the
probability of this occurring using the binomial distribution?
b. What is the chance of drawing 40 fish with replacement and having 10 with the disease?
c. You wish to determine if the disease has spread. There are only two possibilities: either the
disease has spread or it has not. Moreover, based on the odd nature of the disease, if it has spread,
then it must have spread to exactly ten more fish, making the total twenty fish with the disease.
You draw ten fish, with replacement, and count the number of fish that have the disease. What
does that number need to be for you to conclude that the disease has spread, using a 5%
significance level?
d. Suppose in the situation in c) you find that 8 of the 10 fish you draw actually have the disease.
What is the P-Value for this situation? Based on this, what do you conclude?
e. What is the power of this test?
f. Suppose you do this test many different times, on different tanks that all began with 30 fish with
ten diseased fish. If the disease spreads 60% of the time, in what fraction of the times when you
predict that the disease has spread has it actually spread?
5. Suppose you have a tank with 300 fish, and you wish to get an idea of how many fish are diseased.
You draw a simple random sample of 5 fish and find that 3 are sick.
a. Give your best guess for 95% confidence interval of the fraction of fish that are diseased.
b. What if you drew 50 fish and 30 were sick?
6. Suppose you have a tank of 300 fish, and you want to know the average weight of the fish. You draw
25 fish, and find that the average weight is 100 grams with a standard deviation of 5 grams. Give an
approximate 95% confidence interval for the average weight of the fish in your tank.
7. The following probability distribution is given.
X:     1, 2, 3, 4
P(X):  0.5, 0.04, 0.02 (one of the four probabilities is left blank)
a. Fill in the blank.
b. What is E(x)? SE(x)?
c. Suppose you wish to do a test against the hypothesis that the probability distribution actually has a
higher expected value. If you wish to use a significance level of 5%, for which values of X would
you reject the null hypothesis?
8. You have a box with tickets labeled –1, 2, 2, 3, 4.
a. Suppose you draw 3 tickets, with replacement, and add up the numbers. What is E(x) and SE(x)?
b. Suppose you draw 3 tickets, with replacement, and average the numbers. What is E(x) and SE(x)?
c. Let X be the number of even numbers you pull from the box, after drawing 3 tickets with
replacement. What is E(X)? SE(X)?
9. You wish to do a study on family size in a given city. In this city, there are no homeless people, and
everyone has a phone. For each scheme below, determine whether it is a probability method, the
type(s) of probability sampling, and if the probability of selection is the same for each member of the
population. If possible, determine any biases, and how they might affect your results.
a. You divide the city into blocks. In each block, you select five households at random. You
repeatedly go to these households until you find someone who is home, and determine their family
size.
b. You divide the city into blocks. You randomly choose five of the blocks. In those blocks, you
interview every household on those blocks, determining household size in each house.
c. You call houses in the city at random until 1000 people have answered their phones.
You conduct the survey with those 1000 people.
d. You divide the city into blocks. You randomly choose five blocks. In those five blocks, you
randomly choose five households. You repeatedly go to these households until you find someone
who is home, and determine their family size.
e. You go to every house that has a prime number as the last digit of its address.
10. You are conducting a test to see whether a coin is fair (P(heads) = 0.5). You decide to flip the coin 100
times.
a. For what values should you reject the null hypothesis to achieve a desired significance level of
5%?
b. What is the power of the test against the alternate hypothesis that there is really a 75% chance of
getting heads?
c. What is the power of the test against the alternate hypothesis that there is really a 60% chance of
getting heads?