Stat113-1 - Highly Derivative

8-23-05
JJ Deely-prof, Tilman Achberger (asst)
Office Hours: math building, room 546: t-th: 1-2pm or by appt, arranged in class, not by email
1. workbook- 3 items in your workbook, look at weekly schedule
Oct- 6th: Test #1, 7-8pm
November 14th- Monday: 7pm-8pm, test #2
1. Homework-familiarize with concepts
2. Data-how to collect it and think about it
3. Tests-comprehension
All tests/quizzes are open note, book, workbook
Final grade=
10% HW
10% rec. quiz
10% workbook and attendance
20% test 1
20% test 2
30% final
Vista:
Grades and for e-mail
Quiz on Friday on “admin details”
V and VI
8-25-05
What is stat and why am I here?
So I can be educated
Main methods to obtain Data
Anecdote-small story with a point, people talk as if the truth
Census- never reaches 100% of the population: people avoid it, are homeless, etc.
Sampling-too expensive to take a census
-obtaining data from the population
-measure in units
-how many is sample
-from where? Sampling frame
-quantity of interest:: variable
OJ:
Population= all PU students
Unit= a student
Sample=m=500
Sampling frame= mechanism to generate sample (e.g. phone numbers from book)
Variable=do you drink OJ
Observational Studies
Experiments
8-26-05 (Rec)
Gopal Panicker
Office Hours: Math G132, M: 3-4pm, Tr: 2-3pm, Thurs: 7-9pm, Phys 117
Can go to any TA’s hours and there are 11 TAs
8-30-05
Assign 1 is on web. Turn in Friday
Take data set from web. Draw random sample.
250 heating bills
take sample: 20
SRS-simple random sample
SampBC- if not prompted, go to tools-macro-security-medium
Alt+f8, run macro
A2:A251
20
1
c2 (column)
without replacement (not back in hat)
=average(c2:c21)
do 10 times
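The same steps outside Excel, as a rough sketch only. The bill amounts below are made-up stand-ins (the real 250 values come from the course data set), and the names `bills` and `srs_mean` are my own, not from the SampBC macro.

```python
import random
import statistics

# Hypothetical stand-in for the 250 heating bills in A2:A251.
rng = random.Random(113)
bills = [round(rng.uniform(50, 300), 2) for _ in range(250)]

def srs_mean(population, n, rng):
    """Draw a simple random sample of size n WITHOUT replacement; return its mean."""
    sample = rng.sample(population, n)  # a unit cannot be drawn twice (not back in hat)
    return statistics.mean(sample)

# "do 10 times": ten samples of 20, one average per sample
sample_means = [srs_mean(bills, 20, rng) for _ in range(10)]
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i}: mean = {m:.2f}")
```

The ten averages differ from each other even though the population never changes; that spread is the sampling variability.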
9-1-05
p-hat (p with a hat over it)=53%=.53
n=208
p(hat)-x, p(hat)+x
.53-x, .53+x
using formula from workbook on pg 3=.07
using above, .46, .60
pg 37, cross out the formula, it’s an approximation
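Checking the .07 and the (.46, .60) interval. I am assuming the workbook pg 3 formula is the quick margin of error 1/sqrt(n); the notes don't say which formula it is. The second line uses the 2*sqrt(p-hat(1-p-hat)/n) version from later in the course for comparison.

```python
import math

p_hat = 0.53   # sample proportion from the notes
n = 208        # total number asked

# Assumed quick formula: margin of error = 1 / sqrt(n)
quick_margin = 1 / math.sqrt(n)

# Standard-error version: 2 * sqrt(p_hat * (1 - p_hat) / n)
se_margin = 2 * math.sqrt(p_hat * (1 - p_hat) / n)

low, high = p_hat - quick_margin, p_hat + quick_margin
print(round(quick_margin, 2), round(se_margin, 2))
print(round(low, 2), round(high, 2))
```

Both versions round to .07 here because p-hat is close to .5.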
everyone wants a good sample
2 sources of errors:
sampling and non-sampling
difference:
sampling errors:
variability of the population
method used to obtain (voluntary/convenience)
sampling frame (telephone)
non-sampling errors (not a problem with the way the sample was drawn)
non-response
deceptive replies (truth hurts or data incorrectly tabulated)
question not understood
9-2-05 (Rec)
p is the population proportion, p-hat is the sample proportion
confidence intervals:
n=number asked, not number in favor or against, total number asked
9-6-05
main purpose of stat. study:
does changing one variable (the explanatory variable-controlled variable) cause changes
in another variable (the response variable-measured variable)
drinking milk -> happiness
differences between observational studies and experiments
Experiments (under our control): make plans about what, how, who to measure
Observational (no control involving measure)
Treatments- placebo
Why observational studies: may be unethical or impossible to conduct a proper
experiment (booze in preggers)
-certain explanatory variables can't be controlled
Randomized Comparative Experiments:
-random, not haphazard
-eliminates placebo
-(look up more info)
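Statistically Significant
-could the difference have happened just by chance?
-what's the chance that two treatments differ by a certain amount given that there is
no real difference between them?
-if the chance of observing the difference in the data is very small, the difference is
significant; if it could easily have happened by chance, it's not significant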
9-8-05
one page of good data, and then summarize it for extra credit
Good experiments:
1. randomized block (why have blocks? Randomizes more, OJ in men/women)
6-18, start with ABCDE, then BCDEA, then CDEAB—latin square
Sudoku
2. matched pair-p. 99- best kind of matching pair is experiment done on same unit,
however not always possible, hence the name “matched pair”
9-13-05
Measurements: provide information about how, what, why, who
Measurements not always numbers (who, etc)
Categorical Measurements (not #s)
Quantitative Measurements
Valid, reliable but possibly biased measurements
1. valid-does it measure what it is supposed to?
2. Reliable-can someone else obtain the same measurement, approximately
3. Biased- a systematic prejudice in some given direction of the measurement
involved
Do #’s make sense?
How #s produced
Exactly what is being produced
Many examples in chapter 9
9-15-05
Graphs for categorical variables
-pie chart and bar graph
Pareto Bar Graph (bar graph with the data sorted)
(these were categorical)
Quantitative:
Line graphs
Histograms summarize data as well
Excel: bin=interval
Format-options-gap width-0
Column C: My Bin, 200-1200
Tools-histogram, C1:C12. Output: D1
Font size down to 8
9-20-05
Line Graphs for Quantitative variables
Relationships of one variable to another
Often “time” is on the x-axis, but not only
Scatterplot:
Delete lines
Scale to 300
Manatee deaths-y axis, powerboat reg on x axis
New material:
One Number Summaries:
Two Types:
Measure of central tendency
Measures of variability or spread
Measures of central tendency-look at corn data
Another measure of central tendency:
Median, sort, middle number
Measures of variability
Range: largest-smallest
=MAX(B2:D88)
=MIN(B2:D88)
Quartiles and IQR (interquartile range)
Lower quartile: number which 25% are below (156)
Upper quartile: number which 25% are above (173)
IQR-17 (50% of counties are within 17 bushels of one another)
Standard Deviation (stdev)
Big formula: pg 227 in book
(n-1) because n is average and statisticians are weird
distance from mean to all of the other points
9-22-05
find max and min
find the median
find upper and lower quartile (number ¼ through and ¾ way through)
min-lower q-median-upper q-max
draw in a thermometer type vertical line
format axis and make the numbers look good
alt-f8 for boxplot
have to include empty spaces from column c so B2:C33
two-dots: outliers:
less than Q1 − (1.5 × IQR) or greater than Q3 + (1.5 × IQR)
where Q1 and Q3 are the first and third quartiles, respectively, and IQR is
the interquartile range (equal to Q3 − Q1).
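Quick check of the outlier rule using the corn quartiles from 9-20 (lower quartile 156, upper quartile 173). The function names are my own; this is just the 1.5 x IQR rule written out.

```python
def outlier_fences(q1, q3):
    """Return (low, high); anything outside these is flagged as an outlier."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_outlier(x, q1, q3):
    low, high = outlier_fences(q1, q3)
    return x < low or x > high

low, high = outlier_fences(156, 173)   # IQR = 17
print(low, high)                       # 130.5 198.5
print(is_outlier(120, 156, 173))       # True, below the low fence
```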
9-27-05
The Bell Curve:
Mu=center of the bell curve
Mu-3sigma (min)
Mu (Middle)
Mu+3sigma (max)
Standard Normal:
Every normal distribution can be reduced to the “standard normal”
1. mu, the mean=0
2. sigma, the stdev=1
Standard Score:
Z=((x-mean)/stdev)
A sample proportion is nearly Normal for large samples – give an example – go to
Binge drinking data
1. population mean=p
2. population stdev=next slide. St dev=sqrt(p(1-p)/m)
9-29-05
q31-q74, then take one hour and do 1-30
Quiz tomorrow:
6-10 on sample test
Bell Curve problem:
A sample proportion is nearly normal for large samples
M=total number asked
Multiply stdev by 3, add to p for upper, subtract from p for lower, p is middle
Proportion of p hat below .17
Z=(.17-.2)/.004=-7.5
New Topic:
We now want to think about relationships, how strong are they and in what direction.
Look at the data.
How strong, positive or negative
Draw scatter plot
What kind of association do they have?
Correlation is the stat method of defining the strength and direction of a linear
relationship
Square all the distances from the points to the line, then add; make this number as small
as possible: least-squares fit
Correlation is given by the formula (never have to use) because excel can do it for us
Correlation is between -1 and +1
If r=+1, all points are on a line and the line goes up
If r= -1, all points are on a line and line goes down
R=0, no linear relationship
Add trend line-display R-squared
Correlation = sqrt(R-squared)…must have the same sign as the slope
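What Excel's trend line is doing, sketched by hand. The five (x, y) pairs are made up for illustration only; nothing here is from the class data. It shows the least-squares slope, the correlation r, and getting r back from R-squared by taking the square root and attaching the slope's sign.

```python
import math

# Hypothetical data, invented for this sketch.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

slope = sxy / sxx                      # least-squares slope
intercept = mean_y - slope * mean_x    # least-squares intercept
r = sxy / math.sqrt(sxx * syy)         # correlation, always between -1 and +1
r_squared = r ** 2                     # what the trend line displays

# Recover r from R-squared: square root, with the same sign as the slope
r_from_r2 = math.copysign(math.sqrt(r_squared), slope)
print(round(slope, 3), round(r, 4), round(r_squared, 4))
```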
10-4-05
test: CL50
75. 150
76. none
77. A, A=35
35. 95%-one of our rules
CI on P p(hat) + or – 2 sqrt ((phat(1-phat))/n)
300/400 =.75=phat
m=400
.706, .794
B
36.
mean=p=.45
stdev: sqrt((p(1-p))/400)=.025
37. .375 .45
btwn .4 and .45
.525
below .45 is .5 (50%)
below .4=
z=(.4-.45)/.025=-2
back of book: .023
.5-.023=.477 (C)
43.
175
250
325 (3 sigma rule)
below 225
z=(225-250)/25=-1
.16
44.
above 300
z=(300-250)/25=2
.9772 or .98
1-.98= .02 (D)
45.
.1 (10% of students)
Z= -1.28
-1.28=(x-250)/25, so x= $218 (A)
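Questions 43-45 redone without the back-of-book table, as a sketch. It uses Python's `statistics.NormalDist` (Python 3.8+) with the mean 250 and stdev 25 from the notes; the variable names are mine.

```python
from statistics import NormalDist

dist = NormalDist(mu=250, sigma=25)      # mean and stdev from questions 43-45

z_225 = (225 - 250) / 25                 # standard score for 225
below_225 = dist.cdf(225)                # area below 225
above_300 = 1 - dist.cdf(300)            # area above 300
tenth_percentile = dist.inv_cdf(0.10)    # value with 10% below it

print(z_225)                             # -1.0
print(round(below_225, 2))               # 0.16
print(round(above_300, 2))               # 0.02
print(round(tenth_percentile))           # 218
```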
10-6-05
margin of error: do a 95% confidence
4. E
5. E
10. .45 and .4
.5-?
z=(.4-.45)/.025=-2
.475
C
18. 60% of 300=180,
write out the tables
D
21.D, with 100, 25=LQ, 50=median, 75=UQ
24.C
STDEV=sqrt(p(1-p)/n)
43. KNOW THIS:
do 3 sigma
above x is .1
z=(x-63)/2
area below z must be .9
1.28=(x-63)/2
x=65.56, about 66
Sigma=ST DEV
61. E, all of above
62. 31/131= B
workbooks through pg 16
must report discrepancies by next Tuesday
10-13-05
The correlation is the statistical method of defining the strength and direction of a linear
relationship between two quantitative variables (pg 270)
What line do we use to compute R
The least squares (best fit)
Least Square estimates for the regression equation
Correlation :
R(sqr)=formula pg 290
=variation in predicted Y divided by the total variation in Y
=% variation in Y explained
b(hat)=slope of regression line
correlation must have same sign as slope
r=0: no linear relationship; there may still definitely be a relationship, just not a linear one
r(sq)=percent of the variability of the data up and down the line, accounted for by the
regression line. If all points are on the line, r(sq) would be 1, 100%
10-18-05
www.stat.purdue.edu/~jdely/stat113/assign/Lecture16secret.xls
What is an Index: a shorthand summary to compare one number to another or a bunch of
numbers to other numbers
An index is a convenient and simple method to compare quantitative measurements
Price index=(current price/base price) x 100
Assignment A6:
2 extra credit: Sentence about each of those terms, explaining what each variable is
106 from 100, increase of 6%
CPI-Consumer price index
Based on Fixed market basket:
Food index doesn’t take into account any change in living, all about cost
Cpi=avg price now/avg price then x 100
Dollars now/dollars then = CPI now/CPI then
CPI now=198.8
Homewrk question about beef and gasoline
95 price = (cpi then/cpi now) x price now
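The index and CPI formulas as a small sketch. Only CPI now=198.8 and the 106-from-100 example come from the notes; the "then" CPI and the price are placeholder numbers I made up so the code runs, not real figures.

```python
def price_index(current_price, base_price):
    """Price index = (current price / base price) x 100."""
    return current_price / base_price * 100

def convert_dollars(price_now, cpi_now, cpi_then):
    """Price then = (CPI then / CPI now) x price now."""
    return price_now * cpi_then / cpi_now

cpi_now = 198.8     # from the notes
cpi_then = 150.0    # hypothetical placeholder, NOT the real 1995 CPI
price_now = 3.00    # hypothetical price in today's dollars

print(round(price_index(106, 100)))   # 106, i.e. an increase of 6%
print(round(convert_dollars(price_now, cpi_now, cpi_then), 2))
```

Swap in the real CPI values from the book's table before using this on the beef and gasoline homework question.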
10-20-05
causation: changing one variable causes changes in the other (more than just a strong connection)
retrospective study: looks back at events that have already occurred
prospective study: sets things up and then checks on them later
Uses of CPI:
Economic policy, compare prices over time, adjust other economic data
Dollars now/dollars then=cpi now/cpi then
Q16 on p324:
www.stat.purdue.edu\~jdeely\stat113\MinWageNewDataFall05.xls
use this data and say I’d like some extra credit.
=A2*c2/a2
minimum wage has gone up, but in terms of 1960 dollars, it’s gone down
10-25-05
test 2: 11-14, Monday, 7pm
Canceled Lectures:
11-17, 11-22
No Rec on 11-18 (no homework)
Tools for Statistical Inference
Dictionary: random- going, made, occurring etc without definite aim, purpose, or reason:
at random
Stat meaning: various ideas, random sampling, random measurement, we can make
quantitative inferences about events
Language of Probability:
Rules:
1. the collection of all possible measurements called S has probability 1
2. the probability of any collection of measurements is btwn 0 and 1
3. If A and B are collections of measurements from S and they have no measurement in
common, the probability of A or B is the sum of the two probabilities; we write
P(A or B)= P(A)+P(B)
P 360: 20 for A7
18 red and 18 black and 2 green
5 red cards out leaves more blacks, so a black coming out is more likely
10-27-05
3 rules:
P(s)=1
Prob btwn 0 and 1
If A and B have no values in common, then P (A or B)=P(A)+P(B)
A7-Q2
0=girl
1=boy
00000
00001
00010
00011
00100
00101
00110
00111
01000
01001
01010
01011
01100
01101
01110
01111
10000
10001
10010
10011
10111
11111
in theory all are equally likely, so 1/total number
probability is number/total number (32)
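Listing all 32 outcomes by machine instead of by hand (the handwritten list above skips some). A sketch only; the "exactly three boys" event is my own example, not necessarily what A7-Q2 asks.

```python
from itertools import product

# Every sequence of 5 children, 0 = girl, 1 = boy
outcomes = ["".join(bits) for bits in product("01", repeat=5)]
print(len(outcomes))                 # 32

# In theory all are equally likely, so each has probability 1/32
prob_each = 1 / len(outcomes)

# Example event (chosen for illustration): exactly three boys
three_boys = [o for o in outcomes if o.count("1") == 3]
prob_three = len(three_boys) / len(outcomes)
print(len(three_boys), prob_three)   # 10 0.3125
```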
Do this for Extra Credit:
38 outcomes, 1/38, 18 red, 18 black, 2 green
Add this: probability of Black or larger than 30
3 values in common: 31, 33, 35
P(A or B)= (# of values in A or B)/# total
Black (18) + bigger than 30 but not black (3: 32, 34, 36)
21/38
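Checking the 21/38 with sets. The list of black numbers is the standard American wheel colouring, which is my assumption; the notes don't list the colours.

```python
# Assumed standard American roulette colouring (not given in the notes)
BLACK = {2, 4, 6, 8, 10, 11, 13, 15, 17, 20, 22, 24, 26, 28, 29, 31, 33, 35}
POCKETS = 38                           # 1-36 plus 0 and 00

bigger_than_30 = set(range(31, 37))    # 31..36
in_common = BLACK & bigger_than_30     # counted in both events
either = BLACK | bigger_than_30        # "A or B": each pocket counted once

print(sorted(in_common))               # [31, 33, 35]
print(len(either), "/", POCKETS)       # 21 / 38
```

The union counts the three shared pockets once, which is why it is 18 + 3 and not 18 + 6.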
Extra credit homework:
Red or smaller than 6, 18+3=21/38
Probability tree
Ant yes (.01)
Test + .997
Test - .003 (false negative)
Ant no (.99)
Test + .015 (false pos)
Test - .985
P(ant is there given that test is positive)
Ant yes (.01, since 1% has it) and test + (.997): .997 * .01
Ant no (.99) and test + (.015): .99 * .015
# from the branch we’re on/ # from both branches
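The tree calculation written out, using only the numbers from the notes (1% have it, .997 true positive, .015 false positive). Variable names are mine.

```python
p_ant = 0.01                # prob. the antibody is present
p_pos_given_ant = 0.997     # test + when antibody is present
p_pos_given_no_ant = 0.015  # test + when antibody is absent (false positive)

# Multiply along each branch that ends in "test +"
pos_and_ant = p_ant * p_pos_given_ant
pos_and_no_ant = (1 - p_ant) * p_pos_given_no_ant

# Branch we're on / both branches
p_ant_given_pos = pos_and_ant / (pos_and_ant + pos_and_no_ant)
print(round(p_ant_given_pos, 3))   # 0.402
```

So even with a positive test, the chance the antibody is really there is only about 40%, because so few people have it to begin with.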
Extra Credit assignment
Numbers 1-6: prob .167
Tools-data analysis-random number generation
2
900
discrete
give values and prob.
6721 (random)
output:
ctrl+shift+ arrow up
ctrl+d (down?)
=IF(OR(C2=6, C2=7, C2=8), 10, -10)
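A rough Python version of the simulation. I am assuming C2 holds the sum of the two simulated dice and that the game pays +10 on a 6, 7 or 8 and -10 otherwise; the notes don't spell this out. The seed 6721 is borrowed from the notes, but Python's generator is not Excel's, so the simulated numbers will not match the spreadsheet.

```python
import random
from fractions import Fraction

def payoff(total):
    # Mirrors the spreadsheet rule: win 10 on a 6, 7 or 8, otherwise lose 10
    return 10 if total in (6, 7, 8) else -10

rng = random.Random(6721)
rolls = [(rng.randint(1, 6), rng.randint(1, 6)) for _ in range(900)]
average_payoff = sum(payoff(a + b) for a, b in rolls) / len(rolls)

# Theoretical answer: count the winning pairs out of 36 equally likely pairs
wins = sum(1 for a in range(1, 7) for b in range(1, 7) if a + b in (6, 7, 8))
p_win = Fraction(wins, 36)
expected_payoff = 10 * p_win - 10 * (1 - p_win)

print(wins, p_win, expected_payoff)   # 16 4/9 -10/9
print(average_payoff)
```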
11-1-05
Independent Measurements
If the prob. of event A given event B equals the prob. of event A, i.e. P(A given B)=P(A), then
A and B are statistically independent
Statistically independent measurements give us a new rule:
Multiplication Rule:
P(A and B)=P(A)P(B)
If A and B are independent, then the prob. of A and B is P(A and B)=P(A)P(B)
A8: 19.7
A=4, b=3, c=2, d/f=1
Get ex cred if calculate the theoretical answer:
Prob 0=.8
Expectation:
Expect: regard as likely to happen
11-3-05
on simulation, turn in:
After simulation, I got ___
19.10 on p 387
prob .7, use Bernoulli
use this on Q 17&19 together
prob won’t show up is .25, prob of show is .75
what’s the chance of getting 8 or 9
17: simulation .304
19: 180/1000 no bus, .18
Expectation:
1. roulette—workbook p. 23 (2)---txt p. 400 ex 3
value: 1, prob: 18/38
value: -1, prob: 20/38
Expt: 18/38-20/38= -2/38= about -5 cents
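The expectation as exact fractions, a small sketch of "value times probability, added up" for a $1 bet on red.

```python
from fractions import Fraction

# (value, probability) for a $1 bet on red
bet = [(1, Fraction(18, 38)), (-1, Fraction(20, 38))]

total_prob = sum(p for _, p in bet)          # must be 1
expected = sum(value * p for value, p in bet)

print(total_prob)                            # 1
print(expected)                              # -1/19
print(round(float(expected), 2))             # -0.05, about -5 cents per dollar
```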
11-8-05
Workbooks on Friday
Sample Test Questions
A9 on the Web today, comes from Sample test 2 questions
11-10-05
WTHR 200
11-15-05
2 by 2 tables
is one categorical variable independent of another?
12-1-05
a10: due 12-8, hand in lectures
Final:
Dec-15, Lambert Field House
10:20-12:20, have ID
Workbooks graded Dec-9, p25 not req
2-way table
Is one categorical var independent of another? Is there an association? Are they related?
Chi-square statistics
chi-sqr= sum over all cells of (observed count-expected count)-sqr / expected count
if chi-sqr value is larger than value in table (475), we say there's an association, there is a
relationship, they're not ind
expected count is the number we would've expected to see in a cell if there was no relationship
multiply row total by column total and divide that by grand total=expected value of any
cell
DF-degrees of freedom
Df=(r-1)(c-1) (use this to get the number from the table to see if chi-sqr is bigger)
2x2 table= (2-1)(2-1)=1
3x3 table= (3-1)(3-1)=4, so row 4 of .05
chi-sqr stat=all chi-sqr numbers added up
.05 column=error 5% or less of the time
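The whole chi-square recipe on a tiny table, as a sketch. The observed counts are made up for illustration. The .05 critical values (3.84 for df=1, 9.49 for df=4) are the usual chi-square table values; check them against the table in the book.

```python
# Hypothetical 2x2 table of observed counts, invented for this sketch
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for a cell = row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Add up (observed - expected)^2 / expected over every cell
chi_sq = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(observed, expected)
             for o, e in zip(obs_row, exp_row))

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (r-1)(c-1)
CRITICAL_05 = {1: 3.84, 4: 9.49}                    # .05 column of the table
associated = chi_sq > CRITICAL_05[df]

print(expected)            # [[25.0, 25.0], [25.0, 25.0]]
print(chi_sq, df)          # 4.0 1
print(associated)          # True: larger than 3.84, so there is an association
```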
Simpson’s Paradox: how can the women be behind the men in both instances, yet
ahead overall? Lurking variable: weighted average
12-6-05
if we do Page 25 in our workbook, we get extra credit
A10-p482:12
Prob would be equal if there were no relationship
Once get probs, get chi-sqr statistics
37 is chi sqr value
get number frm book, it’s larger so there IS a relationship
Simpson’s Paradox:
Relationships between two categorical variables may have hidden lurking variables
(need women who are not employed are happier than those who are)
(must add to the filled in cell)
breaking it down into age group, there’s a reversal
18-35:
400-405
800-750
.33-.35
35-50
1600-45
2200-50
.421-.474
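The reversal worked out from the numbers above, as a sketch. The notes don't label the rows or columns, so "col A" / "col B" and "first row / second row" are my placeholder names; each proportion is first row / (first row + second row).

```python
# (first-row count, second-row count) for each column, per age group.
# Column and row labels are placeholders; the notes don't name them.
tables = {
    "18-35": {"col A": (400, 800), "col B": (405, 750)},
    "35-50": {"col A": (1600, 2200), "col B": (45, 50)},
}

def proportion(first, second):
    return first / (first + second)

by_age = {age: {col: proportion(*counts) for col, counts in cols.items()}
          for age, cols in tables.items()}

# Combine the age groups by adding the counts, then recompute
combined = {}
for col in ("col A", "col B"):
    first = sum(tables[age][col][0] for age in tables)
    second = sum(tables[age][col][1] for age in tables)
    combined[col] = proportion(first, second)

for age, props in by_age.items():
    print(age, {col: round(p, 3) for col, p in props.items()})
print("combined", {col: round(p, 2) for col, p in combined.items()})
```

Col B is higher inside each age group (.35 vs .33, .474 vs .421) but lower once the groups are combined (.36 vs .40), because almost all of col B's counts sit in the younger group. Age is the lurking variable.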
confidence intervals
for an unknown population proportion p, using p-hat
now for an unknown mu
answer for pg 25:
x-bar - 2(s/sqrt(m)), x-bar + 2(s/sqrt(m)) for 95% confidence
2: the distribution of the sample mean approaches a bell curve as the sample size
increases
sample mean=x-bar, s=st dev, m=number of people
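The 95% interval for mu as a sketch. The ten data values are made up for illustration; with a real sample, only the `data` list changes.

```python
import math
import statistics

# Hypothetical sample, invented for this sketch
data = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]

m = len(data)                    # number of people
x_bar = statistics.mean(data)    # sample mean
s = statistics.stdev(data)       # sample st dev (divides by m - 1)

margin = 2 * s / math.sqrt(m)    # 2(s/sqrt(m)) for 95% confidence
interval = (x_bar - margin, x_bar + margin)

print(x_bar)                     # 13.5
print(tuple(round(v, 2) for v in interval))
```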
a value not in the example?
P 492. Ex 4
They’re wrong!
12-8-05
sample exam